
Data Science Virtual Machine#

A Data Science Virtual Machine is an image based on netCubed Ubuntu Linux that offers JupyterHub, RStudio and a terminal interface from the browser over HTTPS. All applications are joined to Azure AD, so users can get started with their BDTI account.

In order to access a Data Science Virtual Machine you will need the following details:

  • Private IP address

Accessing Data Science Virtual Machine via Web interface#

  • In your Amazon WorkSpace or on your local desktop after connecting via a bastion or VPN, go to the private IP address of the Data Science Virtual Machine.
  • Select an interface to connect to.

  • Use your BDTI credentials to log in and click “Login”.

  • When you successfully log in for the first time, your home directory will be automatically created based on your BDTI login.

Using Jupyter Notebooks to connect to EMR Vanilla#

When using Jupyter notebooks, multiple kernels are readily available.

The Python 3.7 kernel executes Python code on the Data Science Virtual Machine itself. The PySpark (Python), Spark (Scala) and SparkR (R) kernels only work if you have a running EMR Vanilla cluster in your DSL and have configured the sparkmagic configuration file in your home directory. The Data Science Virtual Machine is configured with sparkmagic, which allows you to connect to external EMR Vanilla clusters through Livy, a REST interface for Spark.
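Because the Spark kernels depend on Livy being reachable, it can help to check the endpoint before opening a notebook. The following is a minimal sketch, not part of the BDTI tooling: it assumes Livy listens on its default port 8998 over plain HTTP without authentication, and the function names are hypothetical. `GET /sessions` is Livy's standard endpoint for listing sessions.

```python
import json
from urllib.request import urlopen

# Livy's default port; an assumption about how the EMR Vanilla cluster is set up.
LIVY_PORT = 8998


def livy_sessions_url(host, port=LIVY_PORT):
    """Build the URL of Livy's session listing endpoint for the given host."""
    return f"http://{host}:{port}/sessions"


def list_livy_sessions(host, port=LIVY_PORT, timeout=5):
    """Call GET /sessions and return the decoded JSON response.

    Raises an exception (for example URLError) if Livy cannot be reached.
    """
    with urlopen(livy_sessions_url(host, port), timeout=timeout) as response:
        return json.load(response)
```

Calling `list_livy_sessions("<cluster IP>")` from the Python 3.7 kernel or a terminal on the machine should return a JSON object if the cluster is running; a connection error suggests the cluster is stopped or the IP address is wrong.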

You need to edit your local sparkmagic configuration file at /home/[USERNAME]/.sparkmagic/config.json.

  • Open a terminal in JupyterHub.

  • Enter the following command to edit the config file:

    nano ~/.sparkmagic/config.json

The Python, Scala and R sections of the configuration each contain the value “localhost”. Replace it with the IP address of your EMR Vanilla cluster. Once done, press CTRL+X (exit), type Y (overwrite) and hit Enter (continue).

  • Open a new PySpark notebook.

  • You can now start executing PySpark applications.

More information can be found here.
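The manual edit of config.json described above can also be scripted. The sketch below is one possible approach, under stated assumptions: the file uses sparkmagic's usual section names (`kernel_python_credentials`, `kernel_scala_credentials`, `kernel_r_credentials`), each with a `url` entry such as `http://localhost:8998`. The function names are hypothetical, and the layout of your own file should be checked before running it.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit

# Section names used in sparkmagic's example configuration (assumed to match your file).
CREDENTIAL_KEYS = (
    "kernel_python_credentials",
    "kernel_scala_credentials",
    "kernel_r_credentials",
)


def point_config_at_cluster(config, cluster_ip):
    """Replace "localhost" with cluster_ip in each kernel URL, keeping scheme and port."""
    for key in CREDENTIAL_KEYS:
        section = config.get(key)
        if not section or "url" not in section:
            continue  # section missing or without a URL: leave it alone
        parts = urlsplit(section["url"])
        if parts.hostname != "localhost":
            continue  # already points somewhere else: do not overwrite
        netloc = cluster_ip if parts.port is None else f"{cluster_ip}:{parts.port}"
        section["url"] = urlunsplit(parts._replace(netloc=netloc))
    return config


def update_config_file(cluster_ip, path=Path.home() / ".sparkmagic" / "config.json"):
    """Read config.json, update the kernel URLs and write the file back."""
    config = json.loads(path.read_text())
    point_config_at_cluster(config, cluster_ip)
    path.write_text(json.dumps(config, indent=2))
```

Running `update_config_file("<cluster IP>")` once has the same effect as the nano edit. URLs that no longer contain “localhost” are left unchanged, so switching to a different cluster later still requires a manual edit or an adapted script.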

Accessing Data Science Virtual Machine via SSH#

The Data Science Virtual Machine is configured as a regular VM and is reachable via SSH on port 22. The SSH daemon (sshd) is configured to work with your BDTI credentials.
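As an illustration, the connection can be started from a script using the standard OpenSSH client. This is a minimal sketch with hypothetical function names; it assumes the `ssh` client is installed on the machine you connect from and that password login with your BDTI credentials is what the server expects. The `-l` option passes the login name separately, which avoids ambiguity if the username itself contains an “@”.

```python
import subprocess


def build_ssh_command(username, private_ip, port=22):
    """Return the argument list for an interactive OpenSSH session."""
    return ["ssh", "-p", str(port), "-l", username, private_ip]


def connect(username, private_ip, port=22):
    """Start the SSH session and return the client's exit code."""
    return subprocess.call(build_ssh_command(username, private_ip, port))
```

For example, `connect("<BDTI username>", "<private IP>")` opens a session in which you are prompted for your password.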