Amazon Data Science Virtual Machine#
- User needs: Data query, Data processing, Data integration
- User profiles: Data Analysts, Data Scientists, Data Engineers
- User assumed knowledge: How to work with a command-line interface (unless you only want to access JupyterHub and RStudio)
Data Science Virtual Machine is an image based on netCubed Ubuntu Linux that offers JupyterHub, RStudio and a terminal interface from the browser (HTTPS). All applications are joined to Azure AD, so users can sign in with their EC Data Platform account.
To access Data Science Studio you will need the following:
- Private IP address
Accessing Data Science Studio via Web interface#
- In your Amazon WorkSpace, or on your local desktop after connecting via a bastion host or VPN, go to the private IP address of the Data Science Studio.
- Select an interface to connect to.
- Enter your EC Data Platform credentials and click “Login”.
- When you log in successfully for the first time, your home directory is created automatically based on your EC Data Platform login.
Using Jupyter Notebooks to connect to EMR Vanilla#
When using Jupyter notebooks, multiple kernels are readily available:
The Python 3.7 kernel executes Python code on the Data Science Studio itself. The PySpark (Python), Spark (Scala) and SparkR (R) kernels only work if you have a running EMR Vanilla cluster in your DSL and have set up the Spark configuration file in your home directory. Data Science Studio is configured with sparkmagic, which lets you connect to external EMR Vanilla clusters through Livy, a REST interface for Spark. You need to configure your local Spark connection settings at `/home/[USERNAME]/.sparkmagic/config.json`.
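A trimmed sketch of what that `config.json` looks like (field names follow the sparkmagic example configuration; the full example in the sparkmagic repository contains additional session settings, and the `localhost:8998` endpoints are placeholders for your cluster's Livy address, 8998 being Livy's default port):

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  }
}
```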
- Open a terminal in JupyterHub.
- Edit the config file, for example with nano: `nano /home/[USERNAME]/.sparkmagic/config.json`
- Paste a sparkmagic example configuration into the file.
In the Python, Scala and R sections of the configuration you will see the value “localhost”; replace it with the IP address of your EMR Vanilla cluster. Once done, press CTRL+X (exit), type Y (overwrite) and hit Enter (continue).
- Open a PySpark notebook.
- You can now start executing PySpark applications.
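As an illustration, a first cell in a PySpark notebook might look like the following (the sample rows are hypothetical; the `spark` session is provided by the sparkmagic kernel on the remote cluster, so this only runs against a live EMR Vanilla cluster, not standalone):

```python
# Runs on the remote EMR cluster via Livy, not on the Data Science Studio itself.
# `spark` (a SparkSession) is injected by the PySpark sparkmagic kernel.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 28)],  # hypothetical sample rows
    ["name", "age"],
)
df.filter(df.age > 30).show()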
More information can be found in the sparkmagic documentation (https://github.com/jupyter-incubator/sparkmagic).
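The “localhost” replacement described above can also be scripted instead of edited by hand. A minimal sketch in Python (the config path and the cluster IP are assumptions; adjust both to your environment):

```python
import json
from pathlib import Path

def point_sparkmagic_at_cluster(config_path: str, cluster_ip: str) -> dict:
    """Rewrite every Livy URL in a sparkmagic config to target the given cluster IP."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    for section in config.values():
        # The Python, Scala and R credential sections each carry a Livy "url" field.
        if isinstance(section, dict) and "url" in section:
            section["url"] = section["url"].replace("localhost", cluster_ip)
    path.write_text(json.dumps(config, indent=2))
    return config

# Example (uses a temporary file so the sketch is self-contained):
if __name__ == "__main__":
    import tempfile, os
    sample = {"kernel_python_credentials": {"url": "http://localhost:8998", "auth": "None"}}
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(sample, f)
    updated = point_sparkmagic_at_cluster(f.name, "10.0.0.42")
    print(updated["kernel_python_credentials"]["url"])  # http://10.0.0.42:8998
    os.unlink(f.name)
```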
Accessing Data Science Studio via SSH#
Data Science Studio is configured as a regular VM and is reachable via SSH on port 22. The sshd daemon is configured to work with your EC Data Platform credentials.
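For convenience, you can add an entry to your local SSH client configuration (`~/.ssh/config`). This is a sketch: the host alias `dss`, the IP address and the username are placeholders for your own values.

```
Host dss
    HostName 10.0.0.10   # placeholder: the Data Science Studio private IP
    User your-ec-data-platform-username
    Port 22
```

With this in place, `ssh dss` connects and prompts for your EC Data Platform password.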