
Azure Data Science Virtual Machine#

  • User needs: Data integration, Data query, Data processing
  • User profiles: Data Scientists, Data Engineers
  • User assumed knowledge: Command line interface basics

The Data Science Virtual Machine (DSVM) is a server on the Azure cloud platform, built specifically for doing data science. It has many popular data science tools preinstalled and preconfigured to jumpstart building intelligent applications for advanced analytics.

The server available for the EC Data Platform runs Ubuntu 18.04 LTS as its operating system.

The list of preinstalled and preconfigured tools is too lengthy to reproduce here. The complete list is available at the following page (check the features for the “Linux DSVM”, and NOT the “Ubuntu 18.04 DSVM”): https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/tools-included

Note: when using the Data Science VM, identity is managed on a per-machine basis, not on a per-user basis (as it is in HDInsight). This means that access to folders or files should be granted to the machine, not to the user working on the machine. The folders the VM has access to should ideally be agreed on at deployment time; however, the access can still easily be modified afterwards. We chose to provide the VM with read/write/execute access to the User-read-write folder that was used before.
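
As a quick illustration of this per-machine identity, below is a minimal sketch (assuming the azure-identity Python package, which is also used in the Jupyter example later on) that requests a storage token using the machine's identity rather than any user credentials:

from azure.identity import ManagedIdentityCredential

# The credential comes from the VM itself, not from the logged-in user
credential = ManagedIdentityCredential()

# Requesting a token for Azure Storage only succeeds if this machine's
# identity has been granted access
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)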

We will dive deeper into three parts: connecting to the VM via SSH, Jupyter Notebooks, and RStudio.

In order to connect to the Data Science virtual machine, you need to be inside the virtual network of the EC Data Platform. User-friendly ways to get inside this network are:

  • via Amazon WorkSpaces

  • via Azure Windows Virtual Desktop.

Other options are:

  • using the AWS Client VPN (OpenVPN) to connect your local computer to the network

  • SSH tunnelling via the Bastion Host.

Please see section Accessing a Data Science Lab environment to see all these options.

Connecting to the Data Science VM via SSH#

The example below assumes we are already connected to our Amazon WorkSpaces or Azure Windows Virtual Desktop environment, and all the steps should be done inside this environment.

Step 1: Open PuTTY (if not yet installed, it can be downloaded from: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html)

alt-text

Step 2: Provide the host name and the port you want to connect to

In PuTTY, fill in the host IP address that you received at deployment. For our example, this is 172.16.3.6.

Additionally, fill in port number 22.

Click on “Open”.

alt-text

A security alert will pop up if it is the first time you connect to the DSVM. Press Yes.

alt-text

Step 3: Authenticate with your Azure Active Directory credentials. Provide your username and password and press Enter.

alt-text

You have now successfully connected to the Data Science virtual machine via SSH.
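
Note: if you prefer a command-line client over PuTTY, the same connection can be made with any OpenSSH client (assuming one is available in your environment), e.g. ssh your-username@172.16.3.6, using the same IP address and credentials.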

Further steps on how to work with the Data Science virtual machine via SSH are out of the scope of this documentation.

Jupyter Notebooks#

Jupyter Notebooks are running on the server by default. There is no need to change any configuration in order to work with Jupyter Notebooks.

To be able to access the Jupyter Notebook server, you need to be inside the Data Platform network (via Amazon WorkSpaces, Azure Windows Virtual Desktop or the AWS Client VPN). The following steps assume you are connected to the Data Platform network.

The following example will show you how to connect to the Jupyter notebooks and how to get data from the Azure Data Lake Gen2. We assume here that there is a csv file called “digital-agenda-scoreboard-key-indicators.csv” in the “User-read-write” folder, in our “HDInsight” container. If you do not have a file in this folder, please refer to Azure Data Lake user documentation to see how you can upload data to the Data Lake.

Step 1: Opening the Jupyter notebooks

To open the Jupyter Notebooks, navigate to the provided private IP address of the server and the port 8000. Additionally, make sure to add https at the beginning of the url. For our example, this results in: https://172.16.3.6:8000

The first time you visit this page, you will receive an indication that the connection is not private. This is expected behavior, so for Chrome: click on “Advanced” and “Proceed to 172.16.3.6 (unsafe)”.

alt-text

You will be redirected to the Jupyter page.

alt-text

Step 2: Log in with your username and password. Afterwards, you will be redirected to the Jupyter Notebooks homepage.

alt-text

If you want a different view, you can change the url by adding /lab at the end, after your username.

For our example, this results in: https://172.16.3.6:8000/user/ncattoir/lab

alt-text

Step 3: Create a new notebook

In order to create a new notebook, click on New and choose the programming language of your choice (not HDInsight). For this example, we will be using Python 3.5.

alt-text

Step 4: preconfigure the notebook and connect to the Data Lake

In order to make the connection with the Azure Data Lake, we need to start with a couple of configurations. Paste the code snippets below into cells in the Jupyter Notebook.

  • Import the necessary Azure modules:

from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import DataLakeDirectoryClient
from azure.storage.filedatalake import DataLakeFileClient
from azure.identity import ManagedIdentityCredential
  • Get the identity of the VM, in order to authenticate with Azure:
credential = ManagedIdentityCredential()
  • Get a connection with the directory in our Azure Data Lake:
directory = DataLakeDirectoryClient(
    account_url="https://uatprojectuatdatalake.dfs.core.windows.net/",
    credential=credential,
    file_system_name='HDInsight',
    directory_name='/User-read-write')
  • We can check the access to the directory (and verify that our connection to the directory has succeeded):
directory.get_access_control()
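
As an optional extra check, you can also list the contents of the folder. This is a minimal sketch assuming the same account URL and container as above; FileSystemClient comes from the same azure.storage.filedatalake module:

from azure.storage.filedatalake import FileSystemClient

file_system = FileSystemClient(
    account_url="https://uatprojectuatdatalake.dfs.core.windows.net/",
    file_system_name='HDInsight',
    credential=credential)

# Print every path under User-read-write that the VM identity can see
for path in file_system.get_paths(path='User-read-write'):
    print(path.name)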

Step 5: read data from the Data Lake

  • Create a connection to a file:

file = directory.get_file_client('digital-agenda-scoreboard-key-indicators.csv')
  • Download the file locally:
with open("./digital-agenda-scoreboard-key-indicators.csv", "wb") as my_file:
        download = file.download_file()
        download.readinto(my_file)
  • Create a data frame from the local file:
import pandas as pd
df = pd.read_csv('./digital-agenda-scoreboard-key-indicators.csv')
  • Display the data frame:
df

Step 6: writing data to the Data Lake

  • Create a csv file from the data frame:

df.to_csv('out.csv')
  • Create a connection to a file:
file = directory.get_file_client('digital-agenda-scoreboard-key-indicators-from-jupyter.csv')
  • Open the file locally:
local_file = open("./out.csv",'r')
file_contents = local_file.read()
  • Upload the contents to the file:
file.upload_data(file_contents, overwrite=True)
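
Alternatively, the intermediate local file can be skipped: pandas can render the csv to an in-memory string, which upload_data accepts directly. A minimal sketch (the target file name here is just an example):

# Render the data frame to a csv string in memory instead of on disk
csv_text = df.to_csv(index=False)
file = directory.get_file_client('digital-agenda-scoreboard-key-indicators-in-memory.csv')
file.upload_data(csv_text, overwrite=True)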

RStudio#

RStudio is running on the server by default. There is no need to change any configuration in order to work with RStudio.

To be able to access the RStudio server, you need to be inside the Data Platform network (via Amazon WorkSpaces, Azure Windows Virtual Desktop or the AWS Client VPN). The following steps assume you are connected to the Data Platform network.

The following example will show you how to connect to RStudio and how to get data from the Azure Data Lake Gen2. We assume here that there is a csv file called “digital-agenda-scoreboard-key-indicators.csv” in the “User-read-write” folder, in our “HDInsight” container. If you do not have a file in this folder, please refer to Azure Data Lake user documentation to see how you can upload data to the Data Lake.

Step 1: Opening the RStudio environment

To open RStudio, navigate to the provided private IP address of the server and the port 8787. Additionally, make sure to add http (not https!) at the beginning of the url.

For our example, this results in: http://172.16.3.6:8787

The first time you visit this page, Chrome may warn that the connection is not secure, since the page is served over plain http. This is expected behavior.

alt-text

You will be redirected to the RStudio page.

alt-text

Step 2: Log in with your username and password. Afterwards, you will be redirected to the RStudio workspace homepage.

alt-text

Step 3: preconfigure the workspace and connect to the Data Lake

In order to make the connection with the Azure Data Lake, we need to start with a couple of configurations. Paste the code snippets below into the workspace in RStudio.

  • Import the necessary Azure libraries:

library(AzureAuth)

Reply “yes” if asked whether the package can store authentication credentials in a caching directory.

library(AzureStor)
  • Create a token from the VM identity:
token <- AzureAuth::get_managed_token("https://storage.azure.com", use_cache = FALSE)
  • Create a connection with the file system/container:
endpoint <- adls_endpoint("https://uatprojectuatdatalake.dfs.core.windows.net/", token = token)
fs <- adls_filesystem(endpoint, name = "HDInsight")
  • List the files to check if our connection works:
list_adls_files(fs, dir = "/", info = c("name"), recursive = FALSE)

Step 4: read data from the Data Lake

  • Download the file locally:
download_adls_file(fs, src = "User-read-write/digital-agenda-scoreboard-key-indicators.csv", dest = basename("digital-agenda-scoreboard-key-indicators.csv"), blocksize = 2^24, overwrite = TRUE, use_azcopy = FALSE)
  • Read the file to memory:
con <- rawConnection(raw(0), "r+")
download_adls_file(fs, src = "User-read-write/digital-agenda-scoreboard-key-indicators.csv", dest = con, blocksize = 2^24, overwrite = TRUE, use_azcopy = FALSE)
data <- readLines(con)
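
Note: readLines returns the file as a character vector. If you prefer to parse the csv directly into a data frame, the AzureStor documentation shows the same pattern with read.csv(con) instead of readLines(con).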

Step 5: write data to the Data Lake

  • Upload local file to the data lake:
upload_adls_file(fs, src = "digital-agenda-scoreboard-key-indicators.csv", dest = "User-read-write/test.csv", blocksize = 2^24, lease = NULL, use_azcopy = FALSE)
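
To verify the upload, you can rerun the listing command from before, now pointing at the folder: list_adls_files(fs, dir = "User-read-write") should show the new test.csv file.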

Troubleshooting#

I can connect to Jupyter, but cannot connect to RStudio, or vice versa.#

Please try the connection again using incognito mode in Chrome. If you can connect this way, it is a strong indication that your local cookies have bad credentials saved. Please reset/delete your cookies to be able to connect with Chrome normally again.