Azure Data Science Virtual Machine#
- User needs: Data integration, Data query, Data processing
- User profiles: Data Scientists, Data Engineers
- User assumed knowledge: Command-line interface knowledge
The Data Science Virtual Machine (DSVM) is a server on the Azure cloud platform, built specifically for doing data science. It has many popular data science tools preinstalled and preconfigured to jumpstart building intelligent applications for advanced analytics.
The server that is available for the EC Data Platform has Ubuntu 18.04 LTS as its operating system.
The list of preinstalled and preconfigured tools is too lengthy to include here. The complete list is available on the following page (check the features for the “Linux DSVM”, and NOT the “Ubuntu 18.04 DSVM”): https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/tools-included
Note: when using the Data Science VM, identity is managed on a per-machine basis, not on a per-user basis (unlike in HDInsight). This means that access to folders or files should be granted to the machine, not to the user using the machine. The folders the VM has access to should ideally be agreed upon at deployment time; however, access can still easily be modified afterwards. We chose to provide the VM with read/write/execute access to the User-read-write folder that was used before.
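To make this concrete, the snippet below is a minimal sketch of what machine-based identity means in practice: any process running on the VM can request a storage token through the VM's managed identity, without supplying per-user credentials. The scope URL is the standard Azure Storage resource; the same ManagedIdentityCredential class is used in the Jupyter example later in this section.
# A minimal sketch: tokens are issued to the machine identity,
# not to the logged-in user, so no per-user secrets are involved.
from azure.identity import ManagedIdentityCredential
credential = ManagedIdentityCredential()
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)  # epoch timestamp at which the token expires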
We will dive deeper into three parts:
- How to connect to the Data Science VM via SSH
- How to launch Jupyter Notebooks and how to connect to the Data Lake from a Jupyter notebook
- How to launch RStudio and how to connect to the Data Lake from an RStudio workspace
In order to connect to the Data Science virtual machine, you need to be inside the virtual network of the EC Data Platform. User-friendly ways to get inside this network are:
- via Amazon WorkSpaces
- via Azure Windows Virtual Desktop
Other options are:
- using the AWS Client VPN (OpenVPN) to connect your local computer to the network
- SSH tunneling via the Bastion Host
Please see the section Accessing a Data Science Lab environment for details on all these options.
Connecting to the Data Science VM via SSH#
The example below assumes we are already connected to our Amazon WorkSpaces or Azure Windows Virtual Desktop environment, and all the steps should be done inside this environment.
Step 1: Open PuTTY (if not yet installed, it can be downloaded from: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html)
Step 2: Provide the host name and the port you want to connect to
In PuTTY, fill in the Host IP address that you received at deployment. For our example, this is 172.16.3.6.
Additionally, fill in the port number 22.
Click on “Open”.
A security alert will pop up if it is the first time you connect to the DSVM. Press Yes.
Step 3: authenticate with your Azure Active Directory credentials. Provide your username and password and press enter.
You have now successfully connected to the Data Science virtual machine via SSH.
Further steps on how to work with the Data Science virtual machine via SSH are outside the scope of this documentation.
Jupyter Notebooks#
Jupyter Notebooks are running on the server by default. There is no need to change any configuration in order to work with Jupyter Notebooks.
To be able to access the Jupyter Notebook server, you need to be inside the Data Platform network (via Amazon WorkSpaces, Azure Windows Virtual Desktop or via the AWS Client VPN). The following steps assume you are connected to the Data Platform network.
The following example will show you how to connect to the Jupyter notebooks and how to get data from the Azure Data Lake Gen2. We assume here that there is a csv file called “digital-agenda-scoreboard-key-indicators.csv” in the “User-read-write” folder, in our “HDInsight” container. If you do not have a file in this folder, please refer to Azure Data Lake user documentation to see how you can upload data to the Data Lake.
Step 1: Opening the Jupyter notebooks
To open the Jupyter Notebooks, navigate to the provided private IP address of the server on port 8000. Additionally, make sure to add https at the beginning of the URL.
For our example, this results in: https://172.16.3.6:8000
The first time you visit this page, you will receive an indication that the connection is not private. This is expected behavior; in Chrome, click on “Advanced” and “Proceed to 172.16.3.6 (unsafe)”.
You will be redirected to the Jupyter page.
Step 2: Log in with your username and password. After, you will be redirected to the Jupyter Notebooks homepage.
If you want a different view, you can change the URL by adding /lab at the end, after your username.
For our example, this results in: https://172.16.3.6:8000/user/ncattoir/lab
Step 3: Create a new notebook
In order to create a new notebook, click on New and choose the programming language that you like (not HDInsight). For this example, we will be using Python 3.5.
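As a quick sanity check in the new notebook (a minimal sketch, not part of the original walkthrough), you can confirm which Python version the kernel is running:
import sys
print(sys.version)  # should report a 3.5.x version for the Python 3.5 kernel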
Step 4: preconfigure the notebook and connect to the Data Lake
In order to make the connection with the Azure Data Lake, we need to start with a couple of configuration steps. Paste the code snippets below into cells in the Jupyter Notebook.
- Import the necessary Azure modules:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import DataLakeDirectoryClient
from azure.storage.filedatalake import DataLakeFileClient
from azure.identity import ManagedIdentityCredential
- Get the identity of the VM, in order to authenticate with Azure:
credential = ManagedIdentityCredential()
- Get a connection with the directory in our Azure Data Lake:
directory = DataLakeDirectoryClient(account_url="https://uatprojectuatdatalake.dfs.core.windows.net/", credential=credential, file_system_name='HDInsight', directory_name='/User-read-write')
- We can check the access to the directory (and verify that our connection to the directory has succeeded):
directory.get_access_control()
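As an alternative check, the sketch below lists the contents of the folder through the service client; it reuses the account URL, container and folder names from above, and shows where the DataLakeServiceClient import comes in:
# List the paths under the User-read-write folder to verify access
service_client = DataLakeServiceClient(account_url="https://uatprojectuatdatalake.dfs.core.windows.net/", credential=credential)
file_system = service_client.get_file_system_client("HDInsight")
for path in file_system.get_paths(path="User-read-write"):
    print(path.name)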
Step 5: read data from the Data Lake
- Create a connection to a file:
file = directory.get_file_client('digital-agenda-scoreboard-key-indicators.csv')
- Download the file locally:
with open("./digital-agenda-scoreboard-key-indicators.csv", "wb") as my_file:
download = file.download_file()
download.readinto(my_file)
- Create a data frame from the local file:
import pandas as pd
df = pd.read_csv('./digital-agenda-scoreboard-key-indicators.csv')
- Print the contents of the data frame:
df
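For larger files, printing the whole data frame is impractical; the standard pandas helpers below give a quicker overview:
df.head()   # show only the first five rows
df.shape    # (number of rows, number of columns)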
Step 6: writing data to the Data Lake
- Create a csv file from the data frame:
df.to_csv('out.csv')
- Create a connection to a file:
file = directory.get_file_client('digital-agenda-scoreboard-key-indicators-from-jupyter.csv')
- Open the file locally:
with open("./out.csv", "r") as local_file:
    file_contents = local_file.read()
- Upload the contents to the file:
file.upload_data(file_contents, overwrite=True)
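Optionally, as a sanity check (a sketch, not part of the original walkthrough), you can read the file's properties back from the Data Lake to confirm the upload succeeded:
properties = file.get_file_properties()
print(properties.size)  # size in bytes of the uploaded file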
RStudio#
RStudio is running on the server by default. There is no need to change any configuration in order to work with RStudio.
To be able to access the RStudio server, you need to be inside the Data Platform network (via Amazon WorkSpaces, Azure Windows Virtual Desktop or via the AWS Client VPN). The following steps assume you are connected to the Data Platform network.
The following example will show you how to connect to RStudio and how to get data from the Azure Data Lake Gen2. We assume here that there is a csv file called “digital-agenda-scoreboard-key-indicators.csv” in the “User-read-write” folder, in our “HDInsight” container. If you do not have a file in this folder, please refer to the Azure Data Lake user documentation to see how you can upload data to the Data Lake.
Step 1: Opening the RStudio environment
To open the RStudio workspace, navigate to the provided private IP address of the server on port 8787. Additionally, make sure to add http (not https!) at the beginning of the URL.
For our example, this results in: http://172.16.3.6:8787
The first time you visit this page, you may receive an indication that the connection is not private. This is expected behavior; in Chrome, click on “Advanced” and “Proceed to 172.16.3.6 (unsafe)”.
You will be redirected to the RStudio page.
Step 2: Log in with your username and password. After, you will be redirected to the RStudio Workspace homepage.
Step 3: preconfigure the workspace and connect to the Data Lake
In order to make the connection with the Azure Data Lake, we need to start with a couple of configuration steps. Paste the code snippets below into the workspace in RStudio.
- Import the necessary Azure libraries:
library(AzureAuth)
Reply “yes” when asked whether the authentication credentials can be stored in a local directory.
library(AzureStor)
- Create a token from the VM identity:
token <- AzureAuth::get_managed_token("https://storage.azure.com", use_cache = FALSE)
- Create a connection with the file system/container:
endpoint <- adls_endpoint("https://uatprojectuatdatalake.dfs.core.windows.net/", token = token)
fs <- adls_filesystem(endpoint, name = "HDInsight")
- List the files to check if our connection works:
list_adls_files(fs, dir = "/", info = c("name"), recursive = FALSE)
Step 4: read data from the Data Lake
- Download the file locally:
download_adls_file(fs, src = "User-read-write/digital-agenda-scoreboard-key-indicators.csv", dest = "digital-agenda-scoreboard-key-indicators.csv", blocksize = 2^24, overwrite = TRUE, use_azcopy = FALSE)
- Read the file into memory:
con <- rawConnection(raw(0), "r+")
download_adls_file(fs, src = "User-read-write/digital-agenda-scoreboard-key-indicators.csv", dest = con, blocksize = 2^24, overwrite = TRUE, use_azcopy = FALSE)
seek(con, 0)  # rewind the connection before reading it back
data <- readLines(con)
Step 5: write data to the Data Lake
- Upload a local file to the Data Lake:
upload_adls_file(fs, src = "digital-agenda-scoreboard-key-indicators.csv", dest = "User-read-write/test.csv", blocksize = 2^24, lease = NULL, use_azcopy = FALSE)
Troubleshooting#
I can connect to Jupyter, but cannot connect to RStudio, or vice versa.#
Please try the connection again using incognito mode in Chrome. If you can connect this way, it is a strong indication that your local cookies have bad credentials saved. Please reset/delete your cookies to be able to connect with Chrome normally again.