
Azure Data Lake#

  • User needs: Data storage
  • User profiles: Business User (Storage explorer only), Data Analyst (Storage explorer only), Data Scientist, Data Engineer
  • User assumed knowledge: working with a command-line interface

Azure Data Lake Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. When researching documentation, make sure to include “Gen2”, as Gen1 is based on a different underlying storage system.

The main difference from a “general” Azure Blob storage account is the addition of a hierarchical namespace. This allows the collection of objects/files within an account to be organized into a hierarchy of directories and nested subdirectories, in the same way that the file system on your computer is organized. This improves performance, management, and security.

Data Lake Storage permissions#

Azure Data Lake Gen2 allows the management of POSIX permissions (read, write and execute) on all objects and directories in the Data Lake. Because of their different implementations, access to the Data Lake is managed differently for the HDInsight cluster and the Data Science Virtual Machine.

The HDInsight cluster will authenticate with the Data Lake based on the user that is currently logged in to the cluster.

The Data Science VM will authenticate with the Data Lake based on an identity that has been assigned to the machine.

The access permissions for users to the folder structure can be designed up front, but can also be adjusted later on.

Connecting via the Storage Explorer GUI#

User profiles: Business User, Data Analyst, Data Scientist, Data Engineer

Azure Storage Explorer provides a powerful, accessible experience for users to view, search and interact with data within your data lake (or in another kind of storage account).

As indicated above, Azure Storage Explorer is currently not installable on your local computer. It should be pre-installed in your Amazon WorkSpace environment. If it is not, you can download Azure Storage Explorer from inside your Amazon WorkSpace (Windows, macOS, Linux): https://azure.microsoft.com/en-us/features/storage-explorer/

If you do not want to use Amazon WorkSpaces, or if you want to transfer files from your local computer, use either the CLI (with the AzCopy tool) or the REST API.

Connecting the Storage Explorer to the Data Lake#

Setting up your Storage Explorer#

The following steps explain how you can connect to your Azure Data Lake Gen2, once you have the Azure Storage Explorer installed.

Step 1: In Storage Explorer, start by clicking on “Manage Accounts” and then on “Add an account…”.

alt-text

Step 2: Click on “Add an Azure Account” and log in via the pop-up window that appears.

alt-text

Once you have logged in through the pop-up, an account should appear in the Account Management panel.

alt-text

Step 3: For a normal storage account, this would be enough. However, we are dealing with an Azure Data Lake, which means that we do not have access to the resource directly; access is granted at the directory level of the data lake containers. Because of this, we need to add the resource through Azure AD. In the Account Management panel, click again on “Add an account…” and then select “Add a resource via Azure Active Directory (Azure AD)”. Select your account and your tenant (this should be prefilled automatically).

alt-text

Step 4: In the next step, make sure to select Blob Container (ADLS Gen2) as the resource type. Fill in the container path URL (see below). Finally, choose a name for your endpoint. Click Next, and Connect.

Please use the Container URL connection parameter that was provided to you at deployment. For our example, this is: https://storage_account_name.dfs.core.windows.net/datalake_container_name.

alt-text

Step 5: After the last step, you should be redirected to the Explorer, and you should see that a Blob Container has been added to your Local & Attached Storage Accounts (NOT under Data Lake Storage Gen1).

alt-text

Step 6: When you double-click the blob container, you should see the directory tree on the right side of the explorer.

alt-text

After this step is done, you are fully set up to execute data operations on the Data Lake.

Performing data operations in the Storage Explorer#

Once the Data Lake is connected to your Storage Explorer, you can start performing data operations on the Data Lake. Please consult your directory access structure to see which groups have access to which directories in the data lake. For the example below, we assume that our user has read/write/execute access to the /User-read-write directory.

In the examples, we will use the dataset “digital-agenda-scoreboard-key-indicators.csv” from https://digital-agenda-data.eu/datasets/digital_agenda_scoreboard_key_indicators.

Uploading a file: To upload a file to a folder on which you have Write permissions, just click the “Upload” button and select the file you wish to add to the folder.

alt-text

alt-text

Downloading a file: To download a file on which you have Read permissions, just select the file, click the “Download” button, and choose the folder in which you want to store the downloaded file.

alt-text

Deleting a file: To delete a file on which you have Write permissions, just select the file and click the “Delete” button.

alt-text

Connecting via the CLI#

User profiles: Data Scientist, Data Engineer

AzCopy is a command-line utility from Microsoft that you can use to copy blobs or files to or from a storage account, and thus also to the Data Lake.

In order to use this command-line utility, you first need to download and unzip the AzCopy tool. Navigate to https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10 and download the latest version. Next, unzip the file to a folder you can easily access. Additionally, you can choose to add AzCopy to your environment path, so you can use the azcopy command from any folder on your computer.

Connecting AzCopy to the Data Lake#

Step 1: After unzipping, navigate to the innermost folder, where you can see the azcopy.exe file:

alt-text

Step 2: In the address bar, type cmd and press Enter. A command-line interface will open.

alt-text

alt-text

Step 3: In the command line window, write the following command to authenticate with Azure: azcopy login

alt-text

Step 4: Navigate to https://microsoft.com/devicelogin and provide the code indicated in the above image.

After a successful login, your command-line interface should indicate that the login succeeded.

alt-text

Adding AzCopy to your environment variables#

To make sure that you do not always have to go to the folder where you downloaded AzCopy, you can add it to your Windows environment variables, so you can use the command from any folder on your PC.

Step 1: Enter “env” in the start menu of Windows and open “Edit the system environment variables”.

alt-text

Step 2: Click on “Environment Variables”.

alt-text

Step 3: Click on “Path” and then on “Edit…”

alt-text

Step 4: Add the folder where you downloaded AzCopy, including “azcopy.exe”, after “…WindowsApps;”. Click on “OK”. For me, this resulted in: C:\Users\ncattoir\Downloads\azcopy_windows_386_10.3.4\azcopy_windows_386_10.3.4\azcopy.exe

alt-text

Performing data operations with AzCopy#

In the following examples, we will be using the connection parameters below. Please note that your connection parameters will be different, so make sure to use the endpoints that were provided to you, and not the ones described in this manual:

  • Storage-account-name: uatprojectuatdatalake
  • Container-name: HDInsight

Uploading a file: To upload a file to a folder on which you have Write permissions, use the following syntax:

azcopy copy "<local-file-path>" "https://<storage-account-name>.dfs.core.windows.net/<container-name>/<blob-path>"

Example: if you want to upload the file at “C:\Users\ncattoir\Downloads\myTextFile.txt” on your local PC and put it in the User-read-write folder of the “HDInsight” container in the “uatprojectuatdatalake” Data Lake, you would use the following command:

azcopy copy "C:\Users\ncattoir\Downloads\myTextFile.txt" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/blob1001.txt"

You can also upload files by using a wildcard symbol (*) anywhere in the file path or file name. For example: "C:\myDirectory\*.txt" or "C:\my*\*.txt".

Downloading a file: To download a file from a folder on which you have Read permissions, use the following syntax:

azcopy copy "https://<storage-account-name>.dfs.core.windows.net/<container-name>/<blob-path>" "<local-file-path>"

Example: if you want to download a file that is in the User-read-write folder of the “HDInsight” container in the “uatprojectuatdatalake” Data Lake, and store it at “C:\Users\ncattoir\Downloads\myTextFile.txt” on your local PC, you would use the following command:

azcopy copy "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/blob1000.txt" "C:\Users\ncattoir\Downloads\myTextFile.txt"
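
If you prefer to script these transfers instead of typing the commands by hand, the same AzCopy calls can be wrapped in a small Python script. This is only a sketch using the example endpoints from this manual; it assumes azcopy is on your PATH and that you have already authenticated with azcopy login (see the sections above).

```python
# Sketch: scripting the AzCopy upload/download commands shown above with Python.
# Assumes azcopy is on your PATH and `azcopy login` has already been run.
# The account, container, and paths are the example values from this manual.
import subprocess

CONTAINER_URL = "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight"

# Upload a local file into the User-read-write folder of the container
subprocess.run([
    "azcopy", "copy",
    r"C:\Users\ncattoir\Downloads\myTextFile.txt",
    f"{CONTAINER_URL}/User-read-write/blob1001.txt",
], check=True)  # check=True raises an error if azcopy returns a non-zero exit code

# Download a blob from the container back to the local machine
subprocess.run([
    "azcopy", "copy",
    f"{CONTAINER_URL}/User-read-write/blob1000.txt",
    r"C:\Users\ncattoir\Downloads\myTextFile.txt",
], check=True)
```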

Other data operations: other data operations are also available from the AzCopy command-line tool (e.g. remove, list, sync,…). Please consult the following webpage for an overview of the commands that can be used with AzCopy: https://docs.microsoft.com/en-us/azure/storage/common/storage-ref-azcopy?toc=/azure/storage/blobs/toc.json

Connecting via REST API#

A final option for executing data operations on Azure Data Lake Gen2 is the REST API. This method is quite a bit more advanced than the previous tools and is more oriented towards scripting or embedding data operations in applications.

Obtaining your access token#

In order to perform REST operations on the Data Lake, an access token is required. This token is obtained by navigating to a specific URL, logging in to your account, and copying the token from the URL you are redirected to.

Step 1: Navigate to the following url:

https://login.microsoftonline.com/34aec727-a2f5-40e3-be3a-15695c423c9a/oauth2/v2.0/authorize?client_id=357329e9-8e72-46ef-9b42-f34638d92b36&response_type=token&redirect_uri=https://localhost:44321&response_mode=fragment&scope=https%3A%2F%2Fstorage.azure.com%2Fuser_impersonation&state=12345 (Microsoft Edge does not work, Chrome does work)

Step 2: Navigating to this link provides you the option to log into your account. Use your provided email address and password.

alt-text

Step 3: After logging in, you will be redirected to a page that does not exist, whose URL starts with https://localhost:44321/#access_token=...

alt-text

Copy the token between “#access_token=” and “&token_type=”. This is the access token.

Example: We get redirected to: https://localhost:44321/#access_token=**eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IllNRUxIVDBndmIwbXhvU0RvWWZvbWpxZmpZVSIsImtpZCI6IllNRUxIVDBndmIwbXhvU0RvWWZvbWpxZmpZVSJ9.WFpdpB_[…]BEq-sqTTSluvBQp0J2MkBoPnL1dK_WeW-ODkeOPR4OjvJCq9VsJejI1kvVJx6MQcTu0O1ulgO02YzYcqkFRVlAl2ObkB5h82nHumXcHmzvw**&token_type=Bearer&expires_in=3599&scope=https%3a%2f%2fstorage.azure.com%2fuser_impersonation&state=12345&session_state=3c36ec06-93bd-41a0-9640-5c02d11658cc

We copy the code after #access_token= up to &token_type=; this is our access token (in the example, part of the code has been removed; in practice this token will be a lot longer).
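
If you are scripting this flow, the token can also be pulled out of the redirect URL programmatically. The snippet below is a minimal sketch; the redirect URL shown is a shortened placeholder, not a real token.

```python
# Sketch: extracting the access token from the redirect URL copied out of the
# browser's address bar. The URL below is a shortened placeholder.
from urllib.parse import urlparse, parse_qs

redirect_url = ("https://localhost:44321/#access_token=eyJ0eXAiOiJKV1QiLCJhbGciOi..."
                "&token_type=Bearer&expires_in=3599&state=12345")

fragment = urlparse(redirect_url).fragment   # everything after the '#'
params = parse_qs(fragment)                  # parse it like a query string
access_token = params["access_token"][0]     # value between access_token= and &token_type=

print(access_token[:25] + "...")             # avoid printing or storing full tokens carelessly
```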

Performing data operations with the REST API#

In order to perform API calls to the Azure storage service, we need to use an API development tool such as Postman, or issue the requests from our own scripts or application code.

For this example, we will use the Postman tool. More information on this tool can be found on: https://www.postman.com/. Additionally, the Postman tool can also be installed as a Chrome plug-in (https://chrome.google.com/webstore/detail/postman/fhbjgbiflinjbdggehcddcbncdddomop?hl=en).

Setting up Postman: In Postman, start by creating a New Request.

alt-text

Provide a name for your request and create a collection to store your request in, if you do not have a collection yet.

alt-text

You are now set up to add requests to your Postman collection. In the following examples, we use the following connection parameters:

  • Storage-account-name: uatprojectuatdatalake
  • Container-name: HDInsight

Uploading a file:
Uploading a file requires three distinct API requests:

  • PUT
  • PATCH – append
  • PATCH – flush

Step 1: PUT

By executing the PUT request, we “reserve” the namespace in the Data Lake, so that we can transfer bytes in a later stage. Add the following configuration to your Postman request:

  • Method: PUT
  • Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/<folder-name>/<filename>?resource=file
  • Here, you need to provide a filename under which to store your uploaded file in the Data Lake. Additionally, you can provide a folder name in the Data Lake to store the file in.
  • Example: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt?resource=file
  • In the tab “Headers”, make sure the following key-value pairs are present:
      • Authorization: Bearer <access-token>
      • x-ms-version: 2018-11-09
      • x-ms-date: <current date>
      • x-ms-blob-type: BlockBlob
      • Content-Length: 0
  • Make sure to fill in the current date and add your access token (preceded by the word “Bearer” and a space).

The Postman configuration should look something like this:

alt-text

Next, click on Send in the upper-right corner. Check the “Status” code below to see if your request was successful.

alt-text

A successful response should indicate: “201 Created”

Step 2: PATCH - append

By executing the PATCH request with the “append” action, we indicate that a file will be streamed to the Data Lake. Add the following configuration to your Postman request:

  • Method: PATCH
  • Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/<folder-name>/<filename>?action=append&position=0
  • Here, you need to provide the filename and optional folder name that you used in your previous PUT request.
  • Example: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt?action=append&position=0
  • In the tab “Headers”, make sure the following key-value pairs are present:
      • Authorization: Bearer <access-token>
      • x-ms-version: 2018-11-09
      • x-ms-date: <current date>
      • x-ms-blob-type: BlockBlob
      • Content-Type: <content-type> (filled in automatically by Postman)
      • Content-Length: <content-length> (filled in automatically by Postman)
  • Make sure to fill in the current date and add your access token (preceded by the word “Bearer” and a space).

Your request with “Headers” should look like this:

alt-text

  • Additionally, go to the “Body” tab and select “form-data”. Put your cursor in the Key field (see screenshot for more clarity) and change the dropdown from “Text” to “File”.

alt-text

Type “file” as key and click on “Select files” to select a file.

alt-text

After this is complete, click on “Send” in the upper right corner to send the request. Check the “Status” code below to see if your request was successful.

alt-text

A successful response should indicate: “202 Accepted”

Step 3: PATCH - flush

By executing the PATCH request with the “flush” action, we are indicating the end of the file and flushing the bytes to the file.

Add the following configuration to your Postman request:

  • Method: PATCH
  • Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/<folder-name>/<filename>?action=flush&position=<content-length>
  • Here, you need to provide the filename and optional folder name that you used in your previous PUT request.
  • The content length of the file can be gathered in application code, or by sending a PUT request with the file in the body (as in the previous request) to the request URL https://postman-echo.com/put.
  • Example: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt?action=flush&position=671604
  • In the tab “Headers”, make sure the following key-value pairs are present:
      • Authorization: Bearer <access-token>
      • x-ms-version: 2018-11-09
      • x-ms-date: <current date>
  • Make sure to fill in the current date and add your access token (preceded by the word “Bearer” and a space).

The Postman configuration should look something like this:

alt-text

After this is complete, click on “Send” in the upper right corner to send the request. Check the “Status” code below to see if your request was successful.

alt-text

A successful response should indicate: “200 OK”

Your file has now been uploaded!
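
The same three-step upload can also be scripted instead of clicked through in Postman. The sketch below uses Python's requests library against the example endpoint from this manual; the account name, container, file path and token are placeholders you would replace with your own values.

```python
# Sketch: the PUT / PATCH-append / PATCH-flush upload sequence described above,
# executed with Python's requests library instead of Postman.
# Account, container, paths and token are example/placeholder values from this manual.
import requests
from email.utils import formatdate

ACCOUNT = "uatprojectuatdatalake"
CONTAINER = "HDInsight"
BLOB_PATH = "User-read-write/tweets.txt"
ACCESS_TOKEN = "<access-token>"                        # token from "Obtaining your access token"
LOCAL_FILE = r"C:\Users\ncattoir\Downloads\tweets.txt"

url = f"https://{ACCOUNT}.dfs.core.windows.net/{CONTAINER}/{BLOB_PATH}"
headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "x-ms-version": "2018-11-09",
    "x-ms-date": formatdate(usegmt=True),              # current date in RFC 1123 / GMT format
}

# Step 1: PUT -- reserve the file name in the Data Lake (expect 201 Created)
requests.put(f"{url}?resource=file",
             headers={**headers, "x-ms-blob-type": "BlockBlob", "Content-Length": "0"},
             ).raise_for_status()

# Step 2: PATCH append -- stream the file bytes starting at position 0 (expect 202 Accepted)
with open(LOCAL_FILE, "rb") as f:
    data = f.read()
requests.patch(f"{url}?action=append&position=0",
               headers=headers, data=data).raise_for_status()

# Step 3: PATCH flush -- commit the appended bytes to the file (expect 200 OK)
requests.patch(f"{url}?action=flush&position={len(data)}",
               headers=headers).raise_for_status()

print("Upload complete:", url)
```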

Downloading a file: To download a file from your Data Lake, use the following Postman configuration:

  • Method: GET
  • Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/<folder-name>/<filename>
  • Here, you need to provide the filename and optional folder name of the file that you want to download.
  • Example: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt
  • In the tab “Headers”, make sure the following key-value pairs are present:
      • Authorization: Bearer <access-token>
      • x-ms-version: 2018-11-09
      • x-ms-date: <current date>
  • Make sure to fill in the current date and add your access token (preceded by the word “Bearer” and a space).

Your Postman configuration should look like this:

alt-text

Click on “Send and download” in the upper-right corner and choose the folder where you want to store the file.

alt-text

A successful response should indicate: “200 OK”

Your file has now been downloaded!
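
The download can be scripted in the same way. Below is a minimal sketch with Python's requests library, again using the example endpoint and a placeholder token from this manual.

```python
# Sketch: downloading the example file with a single GET request (expect 200 OK).
import requests
from email.utils import formatdate

url = "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt"
headers = {
    "Authorization": "Bearer <access-token>",   # token from "Obtaining your access token"
    "x-ms-version": "2018-11-09",
    "x-ms-date": formatdate(usegmt=True),
}

resp = requests.get(url, headers=headers)
resp.raise_for_status()

# Write the returned bytes to a local file
with open(r"C:\Users\ncattoir\Downloads\tweets.txt", "wb") as f:
    f.write(resp.content)
```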

Deleting a file: Deleting a file is very similar to downloading a file. Instead of the GET request method, we use the DELETE method. Use the following Postman configuration:

  • Method: DELETE
  • Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/<folder-name>/<filename>
  • Here, you need to provide the filename and optional folder name of the file that you want to delete.
  • Example: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt
  • In the tab “Headers”, make sure the following key-value pairs are present:
      • Authorization: Bearer <access-token>
      • x-ms-version: 2018-11-09
      • x-ms-date: <current date>
  • Make sure to fill in the current date and add your access token (preceded by the word “Bearer” and a space).

alt-text

Click on “Send” in the upper-right corner.

alt-text

A successful response should indicate: “200 OK”

Your file has now been deleted!
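
For completeness, the equivalent scripted delete, again as a minimal sketch against the example endpoint with a placeholder token.

```python
# Sketch: deleting the example file with a single DELETE request (expect 200 OK).
import requests
from email.utils import formatdate

url = "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/tweets.txt"
headers = {
    "Authorization": "Bearer <access-token>",   # token from "Obtaining your access token"
    "x-ms-version": "2018-11-09",
    "x-ms-date": formatdate(usegmt=True),
}

requests.delete(url, headers=headers).raise_for_status()
```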