Azure Data Lake#
The Data Lake building block provides a data lake through Azure Data Lake Storage Gen2, Microsoft's enterprise solution for big data analytics, built on Azure Blob storage. Azure Blob storage is Microsoft's object storage solution for the cloud, optimized for storing massive amounts of unstructured data. Data Lake Storage Gen2 adds a hierarchical namespace to Blob storage, which organizes objects and files into a hierarchy of directories. This makes data access more efficient and improves performance, management, and security.
At the highest level, Azure Data Lake Storage is organized in containers. Each container can hold an unlimited number of files, which can be organized in directories. Access to each container and directory can be restricted based on your use case. During the technical discussion, we will decide which groups have access to which directories.
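To make the container and directory hierarchy concrete, the following minimal sketch splits a Data Lake file URL into its levels. It assumes the `https://<account>.dfs.core.windows.net/<container>/<path>` endpoint form; the account, container, and path names are hypothetical, and `LakePath` and `parse_lake_url` are illustrative helpers, not part of any Azure library.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass(frozen=True)
class LakePath:
    account: str        # storage account name (first label of the host name)
    container: str      # top-level container
    directories: tuple  # nested directories, outermost first
    filename: str       # final path component


def parse_lake_url(url: str) -> LakePath:
    """Split a Data Lake file URL into account, container, directories, file."""
    parsed = urlparse(url)
    account = parsed.netloc.split(".")[0]
    # parts looks like ('/', 'container', 'dir', ..., 'file')
    parts = PurePosixPath(parsed.path).parts
    if len(parts) < 3:
        raise ValueError("expected at least a container and a file name")
    return LakePath(account, parts[1], tuple(parts[2:-1]), parts[-1])


# Hypothetical account, container, and path names.
example = parse_lake_url(
    "https://examplelake.dfs.core.windows.net/raw/sales/2024/orders.csv"
)
print(example.container, example.directories, example.filename)
```

Access restrictions would then typically be granted at the container level (`raw`) or at a directory level (`raw/sales`), depending on what is agreed in the technical discussion.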
The Data Lake is linked to an Azure Storage account, which provides a unique namespace for your data. The account is a general-purpose v2 storage account on the standard performance tier, which allows cost-effective data storage. It uses local redundancy, which copies the data synchronously three times within the primary region.
Data Lake Storage Gen2 also offers features for moving data between access tiers, reacting to events, and storage analytics.
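As an illustration of moving data between access tiers, the sketch below builds a lifecycle management rule that moves files under a prefix to the Cool tier after a number of days without modification. The layout follows the Azure Storage lifecycle management policy document as we understand it and should be checked against the current Azure documentation before use. The rule name, the prefix, and the 30-day threshold are hypothetical, and `tier_to_cool_rule` is an illustrative helper.

```python
import json


def tier_to_cool_rule(name: str, prefix: str, days: int) -> dict:
    """Build one lifecycle rule that moves block blobs under `prefix` to Cool."""
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            # Only block blobs whose path starts with the given prefix.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": days}
                }
            },
        },
    }


# Hypothetical rule: move everything under raw/sales to Cool after 30 days.
policy = {"rules": [tier_to_cool_rule("cool-after-30-days", "raw/sales", 30)]}
print(json.dumps(policy, indent=2))
```

The printed JSON is only a candidate policy document; applying it to a storage account is a separate step that is not shown here.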
Interfaces#
- Azure Storage REST API
- AzCopy: a command-line utility for copying blobs or files to or from the Data Lake.
- Azure Storage Client Library (available in different programming languages)
- Azure Storage Explorer: an intuitive graphical tool for macOS, Windows, and Linux that can be used to create and manage directories, files, and permissions in the Data Lake building block. More information and a download are available on Microsoft's Azure Storage Explorer page.
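As a small illustration of the AzCopy interface listed above, the following sketch assembles an `azcopy copy` command that uploads a local directory to a directory in the Data Lake. It only builds and prints the command; it does not run it. The account, container, and directory names are hypothetical, `azcopy_upload_command` is an illustrative helper, and authentication (for example a prior `azcopy login` or a SAS token on the URL) is assumed to be handled separately.

```python
import shlex


def azcopy_upload_command(
    local_dir: str, account: str, container: str, target_dir: str
) -> list:
    """Return the argument list for a recursive AzCopy upload."""
    destination = (
        f"https://{account}.dfs.core.windows.net/{container}/{target_dir}"
    )
    # --recursive copies the directory including all of its subdirectories.
    return ["azcopy", "copy", local_dir, destination, "--recursive"]


# Hypothetical names for the local folder, account, container, and target.
cmd = azcopy_upload_command("./exports", "examplelake", "raw", "sales/2024")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where AzCopy is installed.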
Configurations#
Different access tiers are available for the Data Lake, depending on the usage pattern of the data.
Access tiers#
Hot storage#
The Hot access tier is optimized for data that is frequently accessed (read from and written to), for example data that is staged for processing.
Cool storage#
The Cool access tier is optimized for storing large amounts of data that is infrequently accessed. Cool storage has a lower storage cost than hot storage but a higher access cost. To keep the price optimal, data in cool storage should ideally remain there for at least 30 days.
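The trade-off between storage cost and access cost can be sketched with a simple monthly cost comparison. The per-GB prices below are hypothetical placeholders chosen only to show the arithmetic; they are not actual Azure rates, and the model ignores write, transaction, and early-deletion charges.

```python
from decimal import Decimal

# Hypothetical placeholder prices per GB per month; NOT actual Azure rates.
HOT = {"storage": Decimal("0.020"), "read": Decimal("0.000")}
COOL = {"storage": Decimal("0.010"), "read": Decimal("0.010")}


def monthly_cost(tier: dict, stored_gb: int, read_gb: int) -> Decimal:
    """Storage cost plus read (access) cost for one month."""
    return tier["storage"] * stored_gb + tier["read"] * read_gb


def cheaper_tier(stored_gb: int, read_gb: int) -> str:
    """Return 'cool' if the Cool tier is cheaper under the prices above."""
    hot = monthly_cost(HOT, stored_gb, read_gb)
    cool = monthly_cost(COOL, stored_gb, read_gb)
    return "cool" if cool < hot else "hot"


# 1000 GB stored: rarely read data favours Cool, heavily read data favours Hot.
print(cheaper_tier(stored_gb=1000, read_gb=100))
print(cheaper_tier(stored_gb=1000, read_gb=2000))
```

With real prices substituted, the same comparison gives a rough indication of the read volume at which the Hot tier becomes the cheaper choice.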
Redundancy#
Locally Redundant Storage (LRS)#
When this option is chosen, the data is stored multiple times (at least 3) in the same data center. The durability provided by Locally Redundant Storage is at least 99.999999999% (11 nines) over a given year.
Zone Redundant Storage (ZRS)#
When this option is chosen, the data is stored multiple times (at least 3) across three availability zones in the same region. The durability provided by Zone Redundant Storage is at least 99.9999999999% (12 nines) over a given year.
Geo-Redundant Storage (GRS)#
When this option is chosen, the data is stored multiple times (at least 3) in the same data center and additionally copied to a secondary region. The durability provided by Geo-Redundant Storage is at least 99.99999999999999% (16 nines) over a given year.
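To put the durability figures in perspective, the sketch below converts the number of nines into an expected number of objects lost per year. It assumes, as a simplification, that the durability figure can be read as an independent per-object annual survival probability; `expected_annual_loss` is an illustrative helper, and exact fractions are used because the 16-nines value is at the limit of floating-point precision.

```python
from fractions import Fraction

# Number of nines of annual durability for each redundancy option.
DURABILITY_NINES = {"LRS": 11, "ZRS": 12, "GRS": 16}


def expected_annual_loss(object_count: int, redundancy: str) -> Fraction:
    """Expected number of objects lost per year for a redundancy option."""
    # n nines of durability correspond to a loss probability of 10**-n.
    loss_probability = Fraction(1, 10 ** DURABILITY_NINES[redundancy])
    return object_count * loss_probability


# Example: one billion stored objects.
for option in DURABILITY_NINES:
    loss = expected_annual_loss(10**9, option)
    print(f"{option}: {float(loss):.1e} objects per year")
```

Under this simplified reading, one billion objects on Locally Redundant Storage correspond to an expected loss of one hundredth of an object per year, and each additional nine reduces that figure by a factor of ten.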