Azure Data Lake#
- User needs: Data storage
- User profiles: Business User (Storage explorer only), Data Analyst (Storage explorer only), Data Scientist, Data Engineer
- User assumed knowledge: command-line interface knowledge
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. When researching documentation, it is important to include "Gen2" in your search terms, as Gen1 is based on a different underlying storage system.
The main difference from a "general" Azure Blob storage account is the addition of a hierarchical namespace. This allows the collection of objects/files within an account to be organized into a hierarchy of directories and nested subdirectories, in the same way that the file system on your computer is organized. This improves performance, management, and security.
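For instance, with a hierarchical namespace a file can be addressed via a real directory path on the DFS endpoint of the account, and operations such as renaming or securing a directory apply to everything underneath it. The account, container, and path below are placeholders, not values from your deployment:
https://<storage-account-name>.dfs.core.windows.net/<container-name>/raw/2021/01/sales.csv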
Data Lake Storage permissions#
Azure Data Lake Gen2 allows POSIX-style permissions (read, write, and execute) to be managed on all objects and directories in the Data Lake. Because of their different implementations, access to the Data Lake is handled differently for the HDInsight cluster and for the Data Science Virtual Machine.
The HDInsight cluster will authenticate with the Data Lake based on the user that is currently logged in to the cluster.
The Data Science VM will authenticate with the Data Lake based on an identity that has been assigned to the machine.
The access permissions of users on the folder structure can be designed up front, but they can also be adjusted later on.
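Although the folder permissions are normally set up for you at deployment, they can also be inspected or adjusted afterwards, for example with the Azure CLI. The command below is only a sketch: it assumes the Azure CLI is installed and that you are logged in with an account that is allowed to manage ACLs, and it reuses the example account, container, and directory names that appear later in this manual:
az storage fs access set --acl "user::rwx,group::r-x,other::---" --path "User-read-write" --file-system "HDInsight" --account-name "uatprojectuatdatalake" --auth-mode login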
Connecting via the Storage Explorer GUI#
User profiles: Business User, Data Analyst, Data Scientist, Data Engineer
Azure Storage Explorer provides a powerful, accessible experience for users to view, search, and interact with data within your data lake (or in another kind of storage account).
As indicated above, Azure Storage Explorer is currently not installable on your local computer. It should be pre-installed in your Amazon WorkSpace environment. If not, you can download Azure Storage Explorer (Windows, macOS, Linux) from inside your Amazon WorkSpace at: https://azure.microsoft.com/en-us/features/storage-explorer/
If you do not want to use Amazon WorkSpaces, or if you want to transfer files from your local computer, use either the CLI (with the AzCopy tool) or the REST API.
Connecting the Storage Explorer to the Data Lake#
Setting up your Storage Explorer#
The following steps explain how you can connect to your Azure Data Lake Gen2, once you have the Azure Storage Explorer installed.
Step 1: In Storage Explorer, start by clicking on "Manage Accounts" and then on "Add an account…".
Step 2: Click on "Add an Azure Account" and log in via the pop-up window that appears.
Once you are logged in through the pop-up, the account should appear under Account Management.
Step 3: For a normal storage account, this would be enough. However, we are dealing with an Azure Data Lake, which means that we do not have access to the resource directly; access is granted to us at the directory level of the data lake containers. Because of this, we need to add the resource through Azure AD. In Account Management, click again on "Add an account…" and then select "Add a resource via Azure Active Directory (Azure AD)". Select your account and your Tenant (these should be prefilled automatically).
Step 4: In the next step, make sure to select Blob Container (ADLS Gen2) as the resource type. Fill in the container path URL (see below). Finally, choose a name for your endpoint. Click Next, and Connect.
Please use the Container URL connection parameter that was provided to you at deployment. For our example, this is: https://storage_account_name.dfs.core.windows.net/datalake_container_name.
Step 5: After the last step, you should be redirected to the Explorer, and you should see that a Blob Container has been added under your Local & Attached Storage Accounts (NOT under Data Lake Storage Gen1).
Step 6: When you double-click on the blob container, the directory tree appears on the right side of the explorer.
After this step is done, you are fully set up to execute data operations on the Data Lake.
Performing data operations in the Storage Explorer#
Once the Data Lake is connected to your Storage Explorer, you can start performing data operations on the Data Lake. Please consult your directory access structure to see which groups have access to which directories in the data lake. For the example below, we assume that our user has read/write/execute access to the /User-read-write directory.
Throughout the example, we will use the sample dataset "digital-agenda-scoreboard-key-indicators.csv" from https://digital-agenda-data.eu/datasets/digital_agenda_scoreboard_key_indicators.
Uploading a file: To upload a file to a folder on which you have Write permissions, just click the "Upload" button and select the file you wish to add to the folder.
Downloading a file: To download a file on which you have Read permissions, just select the file, click the "Download" button, and choose the folder to which you want to download the file.
Deleting a file: To delete a file on which you have Write permissions, just select the file and click the “Delete” button.
Connecting via the CLI#
User profiles: Data Scientist, Data Engineer
AzCopy is a command-line utility from Microsoft that you can use to copy blobs or files to or from a storage account, and thus also to and from the Data Lake.
In order to use this command-line utility, you first need to download and unzip the AzCopy tool. Navigate to https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10 and download the latest version. Next, unzip the file to a folder you can easily access. You can also choose to add AzCopy to your environment path, so you can use the azcopy command from any folder on your computer.
Connecting AzCopy to the Data Lake#
Step 1: After unzipping, navigate to the innermost folder, where you can see the azcopy.exe file.
Step 2: In the address bar of that folder, type cmd and press Enter. A command-line window will open.
Step 3: In the command line window, write the following command to authenticate with Azure:
azcopy login
Step 4: Navigate to https://microsoft.com/devicelogin and provide the code that is shown in the output of the azcopy login command.
After a successful login, your command-line interface should indicate that the login succeeded.
Adding AzCopy to your environment variables#
To make sure that you do not always have to go to the folder where you downloaded AzCopy, you can add its location to your Windows Path environment variable, so you can use the command from any folder on your PC.
Step 1: Enter “env” in the start menu of Windows and open “Edit the system environment variables”.
Step 2: Click on “Environment Variables”.
Step 3: Click on “Path” and then on “Edit…”
Step 4: Add the folder that contains azcopy.exe as a new entry, after the existing "…WindowsApps" entry, and click on "OK". For me, this resulted in: C:\Users\ncattoir\Downloads\azcopy_windows_386_10.3.4\azcopy_windows_386_10.3.4
Performing data operations with AzCopy#
In the following examples, we will be using the connection parameters below. Please note that your connection parameters will be different, so make sure to use the endpoints that were provided to you, and not the ones described in this manual:
- Storage-account-name: uatprojectuatdatalake
- Container-name: HDInsight
Uploading a file: To upload a file to a folder on which you have Write permissions, you need to use the following syntax:
azcopy copy "<local-file-path>" "https://<storage-account-name>.dfs.core.windows.net/<container-name>/<blob-path>"
- Example: if you want to upload the file at "C:\Users\ncattoir\Downloads\myTextFile.txt" on your local PC and put it in the User-read-write folder of the "HDInsight" container in the "uatprojectuatdatalake" Data Lake, you would use the following command:
azcopy copy "C:\Users\ncattoir\Downloads\myTextFile.txt“ "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/blob1001.txt"
You can also upload files by using a wildcard symbol (*) anywhere in the file path or file name. For example: 'C:\myDirectory\*.txt', or 'C:\my*\*.txt'.
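Similarly, an entire local folder can be uploaded in a single command with the --recursive flag. The sketch below reuses the same example endpoint as above; the local folder name is only a placeholder:
azcopy copy "C:\Users\ncattoir\Downloads\myFolder" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write" --recursive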
Downloading a file: To download a file from a folder on which you have Read permissions, you need to use the following syntax:
azcopy copy "https://<storage-account-name>.dfs.core.windows.net/<container-name>/<blob-path>" "<local-file-path>"
- Example: if you want to download a file that is in the User-read-write folder of the "HDInsight" container in the "uatprojectuatdatalake" Data Lake and store it at "C:\Users\ncattoir\Downloads\myTextFile.txt" on your local PC, you would use the following command:
azcopy copy "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/blob1000.txt" "C:\Users\ncattoir\Downloads\myTextFile.txt"
Other data operations: other data operations are also available from the AzCopy command-line tool (e.g. remove, list, sync,…). Please consult the following webpage for an overview of the commands that can be used with AzCopy: https://docs.microsoft.com/en-us/azure/storage/common/storage-ref-azcopy?toc=/azure/storage/blobs/toc.json
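For example, listing the contents of a directory or removing a file uses the same URL pattern as the copy examples above (a sketch based on the same example endpoint; adjust the paths to your own environment):
azcopy list "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write"
azcopy remove "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/blob1001.txt"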
Connecting via REST API#
A last option for executing data operations on the Azure Data Lake Gen2 is using the REST API. This method is quite a bit more advanced than the previous tools and is more oriented towards scripting or embedding data operations in applications.
Obtaining your access token#
In order to perform REST operations on the Data Lake, an access token is required. This token is obtained by navigating to a specific URL, logging in to your account, and copying the token from the URL you are redirected to.
Step 1: Navigate to the following url:
https://login.microsoftonline.com/34aec727-a2f5-40e3-be3a-15695c423c9a/oauth2/v2.0/authorize?client_id=357329e9-8e72-46ef-9b42-f34638d92b36&response_type=token&redirect_uri=https://localhost:44321&response_mode=fragment&scope=https%3A%2F%2Fstorage.azure.com%2Fuser_impersonation&state=12345 (Microsoft Edge does not work, Chrome does work)
Step 2: Navigating to this link provides you the option to log into your account. Use your provided email address and password.
Step 3: After logging in, you will be redirected to a page that does not exist, whose URL starts with https://localhost:44321/#access_token=...
Copy the token between "#access_token=" and "&token_type=". This is the access token.
Example:
We get redirected to: https://localhost:44321/#access_token=**eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IllNRUxIVDBndmIwbXhvU0RvWWZvbWpxZmpZVSIsImtpZCI6IllNRUxIVDBndmIwbXhvU0RvWWZvbWpxZmpZVSJ9.WFpdpB_[…]BEq-sqTTSluvBQp0J2MkBoPnL1dK_WeW-ODkeOPR4OjvJCq9VsJejI1kvVJx6MQcTu0O1ulgO02YzYcqkFRVlAl2ObkB5h82nHumXcHmzvw**&token_type=Bearer&expires_in=3599&scope=https%3a%2f%2fstorage.azure.com%2fuser_impersonation&state=12345&session_state=3c36ec06-93bd-41a0-9640-5c02d11658cc
We copy the code after #access_token= up to &token_type=; this is our access token (in the example, part of the code has been removed; in practice, the token will be a lot longer).
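In all of the REST calls described below, this access token must be sent along with every request in the Authorization header, in the form shown here (where <access-token> is the value you just copied):
Authorization: Bearer <access-token>
In Postman, this header can be added on the Headers tab of each request.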
Performing data operations with the REST API#
In order to perform API calls to the Azure storage service, we need to use an API development tool like Postman, or execute the calls from our own scripts or application code.
For this example, we will use the Postman tool. More information on this tool can be found on: https://www.postman.com/. Additionally, the Postman tool can also be installed as a Chrome plug-in (https://chrome.google.com/webstore/detail/postman/fhbjgbiflinjbdggehcddcbncdddomop?hl=en).
Setting up Postman: In Postman, start by creating a New Request.
Provide a name for your request and, if you do not have a collection yet, create one ("Create Collection") to store your request in.
You are now set up to add requests to your Postman collection. In the following examples, we use the following connection parameters:
- Storage-account-name: uatprojectuatdatalake
- Container-name: HDInsight
Uploading a file:
Uploading a file requires three distinct API requests:
- PUT
- PATCH – append
- PATCH – flush
Step 1: PUT
By executing the PUT request, we "reserve" the file's path in the Data Lake, so that we can transfer the bytes in a later stage.
Add the following configuration to your Postman request:
- Method: PUT
- Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/
Next, click on Send in the upper-right corner. Check the “Status” code below to see if your request was successful.
A successful response should indicate: “201 Created”
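The same create request can also be scripted outside Postman, for example with curl. The sketch below follows the public Data Lake Storage Gen2 REST API, which expects the full target file path plus a resource=file query parameter on the PUT; the file path User-read-write/myTextFile.txt is only an example, <access-token> is the token obtained earlier, and the x-ms-version header is required when authenticating with a bearer token:
curl -X PUT -H "Authorization: Bearer <access-token>" -H "x-ms-version: 2019-12-12" -H "Content-Length: 0" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/myTextFile.txt?resource=file"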
Step 2: PATCH - append
By executing the PATCH request with the "append" action, we instruct the service that a file will be streamed to the Data Lake.
Add the following configuration to your Postman request:
- Method: PATCH
- Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/
- Additionally, go to the "Body" tab and select "form-data". Put your cursor in the Key field and change the type from "Text" to "File" in the dropdown.
Type "file" as the key and click on "Select Files" to select a file.
After this is complete, click on “Send” in the upper right corner to send the request. Check the “Status” code below to see if your request was successful.
A successful response should indicate: “202 Accepted”
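A command-line sketch of the same append step (again based on the public REST API, which expects an action=append query parameter and the write offset in position; the local file and target path are the same examples as above):
curl -X PATCH -H "Authorization: Bearer <access-token>" -H "x-ms-version: 2019-12-12" -H "Content-Type: application/octet-stream" --data-binary "@C:\Users\ncattoir\Downloads\myTextFile.txt" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/myTextFile.txt?action=append&position=0"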
Step 3: PATCH - flush
By executing the PATCH request with the “flush” action, we are indicating the end of the file and flushing the bytes to the file.
Add the following configuration to your Postman request:
- Method: PATCH
- Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/
After this is complete, click on “Send” in the upper right corner to send the request. Check the “Status” code below to see if your request was successful.
A successful response should indicate: “200 OK”
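A command-line sketch of the flush step (the REST API expects action=flush and, in position, the total number of bytes that were appended; <file-size-in-bytes> is a placeholder for that number):
curl -X PATCH -H "Authorization: Bearer <access-token>" -H "x-ms-version: 2019-12-12" -H "Content-Length: 0" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/myTextFile.txt?action=flush&position=<file-size-in-bytes>"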
Your file has now been uploaded!
Downloading a file:
To download a file from your Data Lake, use the following Postman configuration:
- Method: GET
- Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/
Click on “Send and download” in the upper-right corner and choose the folder where you want to store the file.
A successful response should indicate: “200 OK”
Your file has now been downloaded!
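If you prefer to script the download instead of using Postman, a minimal curl sketch looks like this (same placeholder token and example file path as in the upload steps; -o specifies the local file to write to):
curl -X GET -H "Authorization: Bearer <access-token>" -H "x-ms-version: 2019-12-12" -o "myTextFile.txt" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/myTextFile.txt"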
Deleting a file:
Deleting a file is very similar to downloading a file. Instead of using the GET request method, we use the DELETE method.
Use the following Postman configuration:
- Method: DELETE
- Request URL: https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/
Click on “Send” in the upper-right corner.
A successful response should indicate: “200 OK”
Your file has now been deleted!
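The scripted equivalent of the delete operation, under the same assumptions as the previous curl sketches:
curl -X DELETE -H "Authorization: Bearer <access-token>" -H "x-ms-version: 2019-12-12" "https://uatprojectuatdatalake.dfs.core.windows.net/HDInsight/User-read-write/myTextFile.txt"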