Data Science Virtual Machine
A Data Science Virtual Machine is an image based on netCubed Ubuntu Linux that offers a JupyterHub, RVirtual Machine and terminal interface from the browser (over HTTPS). All applications are joined to Azure AD, and users can get started with their BDTI account.
To access a Data Science Virtual Machine, you need the following details:
- Private IP address
Accessing Data Science Virtual Machine via Web interface
- From your Amazon WorkSpace, or from your local desktop after connecting via a bastion or VPN, open the private IP address of the Data Science Virtual Machine in your browser.
- Select an interface to connect to.
- Enter your BDTI credentials and click “Login”.
- When you log in successfully for the first time, your home directory is created automatically based on your BDTI login.
Using Jupyter Notebooks to connect to EMR Vanilla
When using Jupyter notebooks, multiple kernels are readily available.
The Python 3.7 kernel executes Python code on the Data Science Virtual Machine itself. The PySpark (Python), Spark (Scala) and SparkR (R) kernels only work if you have a running EMR Vanilla cluster in your DSL and have set up the Spark configuration file in your home directory. The Data Science Virtual Machine is configured with sparkmagic, which allows you to connect to external EMR Vanilla clusters through Livy, a REST interface for Spark.
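Because the Spark kernels depend on a reachable Livy endpoint, it can help to check that endpoint from a terminal before opening a notebook. The following is a minimal sketch, not part of the official setup. It assumes Livy listens on its default port 8998 over plain HTTP without authentication, which may not match your cluster. The function names `livy_sessions_url` and `check_livy` and the IP address in the example are hypothetical.

```shell
# Build the Livy sessions URL for a cluster IP.
# Assumption: Livy runs on its default port 8998 unless a port is given.
livy_sessions_url() {
    printf 'http://%s:%s/sessions\n' "$1" "${2:-8998}"
}

# Ask Livy for its list of sessions. A JSON reply means the endpoint is
# reachable; curl exits with a non-zero status on HTTP errors or timeouts.
check_livy() {
    curl --silent --show-error --fail --max-time 10 "$(livy_sessions_url "$@")"
}

# Example with a hypothetical cluster IP:
# check_livy 10.0.0.12
```

If the call times out, the cluster may not be running or may not be reachable from the Data Science Virtual Machine.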
You need to configure your local Spark configuration at /home/[USERNAME]/.sparkmagic/config.json.
- Open a terminal in JupyterHub.
- Enter the following command to edit the config file:
nano ~/.sparkmagic/config.json
- Copy a sparkmagic example configuration into the terminal.
In the Python, Scala and R sections of the configuration, the value “localhost” is given. Replace it with the IP address of your EMR Vanilla cluster. Once done, press CTRL+X (exit), type Y (overwrite) and press Enter (continue).
- Open a PySpark notebook. You can now start executing PySpark applications.
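As an alternative to editing the file by hand in nano, the “localhost” replacement described in the steps above can be scripted. This is a sketch under assumptions: the function name `set_livy_host` and the IP address in the example are hypothetical, and the replacement is a plain text substitution, so it only suits files where “localhost” appears solely in the Livy addresses.

```shell
# Replace every "localhost" in a sparkmagic config file with the cluster IP.
# The original file is kept next to it with a .bak suffix.
set_livy_host() {
    config_file="$1"
    cluster_ip="$2"
    cp "$config_file" "$config_file.bak" &&
        sed "s/localhost/$cluster_ip/g" "$config_file.bak" > "$config_file"
}

# Example with a hypothetical cluster IP:
# set_livy_host "$HOME/.sparkmagic/config.json" 10.0.0.12
# grep 'url' "$HOME/.sparkmagic/config.json"   # inspect the result
```

Writing from the backup copy rather than editing in place keeps the original available if the substitution needs to be undone.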
More information can be found here.
Accessing Data Science Virtual Machine via SSH
The Data Science Virtual Machine is configured as a regular VM and is reachable via SSH on port 22. The SSH daemon (sshd) is configured to work with your BDTI credentials.
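A connection from a terminal could look like the following sketch. The helper names `ssh_target` and `connect_dsvm`, the username and the IP address are hypothetical placeholders. Substitute your own BDTI login and the private IP address of your Data Science Virtual Machine.

```shell
# Build the user@host target from a BDTI username and the VM's private IP.
ssh_target() {
    printf '%s@%s\n' "$1" "$2"
}

# Open an interactive SSH session on port 22; you authenticate with your
# BDTI credentials.
connect_dsvm() {
    ssh -p 22 "$(ssh_target "$1" "$2")"
}

# Example with hypothetical values:
# connect_dsvm jdoe 10.0.0.10
```

As with the web interface, this only works from a location that can reach the private IP address, such as your Amazon WorkSpace or a desktop connected via a bastion or VPN.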