Amazon EMR#

Amazon EMR is a service that provides a compute cluster with big data applications. The main supported applications are Apache Spark and Apache Hive.

Applications#

Apache Spark#

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. More information can be found in the official Apache Spark documentation.
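
As a brief illustration, below is a minimal PySpark sketch of a word count with the DataFrame API; the input path is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Start (or reuse) a Spark session on the cluster.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# Hypothetical input path; replace with a real file on HDFS or S3.
lines = spark.read.text("hdfs:///data/input.txt")

# Classic word count expressed with the DataFrame API.
words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
counts = words.groupBy("word").count()
counts.show()
```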


Apache Hive#

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. Hive contains the tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis. More information can be found in the official Apache Hive documentation.
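
On EMR, Hive tables can also be queried with HiveQL from a Hive-enabled Spark session. The following is a minimal sketch; the sales table and its columns are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Hive-enabled Spark session: HiveQL statements run against the Hive metastore.
spark = (SparkSession.builder
         .appName("hiveql-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table; a typical ETL-style aggregation in HiveQL.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales
    GROUP BY customer_id
""").show()
```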


EMRFS#

Amazon EMR has a feature called EMRFS, which enables clusters not only to read from the cluster's own local distributed storage (HDFS), but also to read and compute on data that resides in Amazon S3, one of the datalake options from EC Data Platform. Separating computation and storage in this way enables compute clusters to work with large datasets stored in a datalake instead of being limited to the datasets stored in the cluster's local distributed filesystem.
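
Below is a minimal PySpark sketch of this pattern, reading S3-resident data through EMRFS; the bucket name and prefixes are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-sketch").getOrCreate()

# With EMRFS, S3 paths can be used wherever HDFS paths are accepted.
# The bucket and prefixes below are hypothetical placeholders.
df = spark.read.parquet("s3://example-datalake-bucket/events/2023/")

# Compute on the S3-resident data as if it were local to the cluster.
df.groupBy("event_type").count().show()

# Results can be written back to the datalake as well.
df.limit(100).write.mode("overwrite").parquet("s3://example-datalake-bucket/samples/")
```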

Two components of this building block must be chosen: firstly the configuration type, and secondly the hardware configuration (compute machine type + EBS).

Configuration Type#

The user must choose the configuration type of the cluster.

EMR Kerberos#

EMR Kerberos is EMR with Kerberos support. Kerberos is a network authentication protocol created by the Massachusetts Institute of Technology (MIT). Kerberos uses secret-key cryptography to provide strong authentication so that passwords or other credentials aren't sent over the network in an unencrypted format.

  • Highly secured Hadoop cluster with Kerberos
  • Central cluster management with Apache Hue (single user interface, no direct terminal access)
  • Access to Apache Hue is authenticated and authorized with Azure AD
  • Use Spark SQL, HiveQL, PySpark, and Spark interpreters and notebooks
  • Schedule job execution with Apache Oozie
  • Administer workloads via a user-friendly interface

EMR Vanilla#

EMR Vanilla is an experimental environment to prototype Apache Spark and Hive applications. Users can interact with Apache Spark via JupyterHub & SparkMagic and with Apache Hive via JDBC.

  • Experiment with Spark and Hive on an Amazon EMR cluster
  • Connect remotely to Spark via Livy (see the sketch after this list)
  • Connect remotely to Hive via JDBC (see the Interfaces section below)
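
Below is a minimal sketch of the Livy route, submitting a PySpark statement over Livy's REST API with the requests library; the master host name is a hypothetical placeholder, and Livy's default port 8998 is assumed.

```python
import json
import time

import requests

# Hypothetical Livy endpoint on the EMR master node (default port 8998).
LIVY = "http://emr-master.example.com:8998"
HEADERS = {"Content-Type": "application/json"}

# 1. Create an interactive PySpark session.
resp = requests.post(f"{LIVY}/sessions", headers=HEADERS,
                     data=json.dumps({"kind": "pyspark"}))
session_url = f"{LIVY}/sessions/{resp.json()['id']}"

# 2. Wait until the session is ready.
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(5)

# 3. Submit a statement and poll for its result.
stmt = requests.post(f"{session_url}/statements", headers=HEADERS,
                     data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}))
stmt_url = f"{session_url}/statements/{stmt.json()['id']}"
while True:
    result = requests.get(stmt_url, headers=HEADERS).json()
    if result["state"] == "available":
        break
    time.sleep(2)
print(result["output"])

# 4. Clean up the session.
requests.delete(session_url, headers=HEADERS)
```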

Interfaces#

EMR Kerberos: Apache Hue offers a user interface with Spark SQL, HiveQL, PySpark, and Spark interpreters and notebooks.

EMR Vanilla: APIs and interfaces such as Livy and JDBC are available.
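
As an illustration of the JDBC route, below is a minimal sketch using the JayDeBeApi Python package (one option among several); the host, credentials, and driver-jar path are hypothetical placeholders, and the Hive JDBC driver jar must be obtained separately.

```python
import jaydebeapi

# Hypothetical connection details; HiveServer2 listens on port 10000 by default.
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://emr-master.example.com:10000/default",
    ["hive_user", "hive_password"],       # credentials (placeholders)
    "/path/to/hive-jdbc-standalone.jar",  # Hive JDBC driver jar (placeholder)
)

cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())

cursor.close()
conn.close()
```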

Hardware Configuration#

A cluster configuration requires selecting the group of machines to cluster. This group consists of a Master node and several Core nodes; optionally, Task nodes can be added as well. Below is a brief explanation of each type of cluster node, followed by the available machine types and their specifications.

Cluster configuration: Master, Core and Task Nodes#

Master Nodes#

The master node manages the cluster and typically runs master components of distributed applications. For example, the master node runs the YARN ResourceManager service to manage resources for applications, as well as the HDFS NameNode service. It also tracks the status of jobs submitted to the cluster and monitors the health of the instance groups.

To monitor the progress of a cluster and interact directly with applications, you can connect to the master node over SSH.

Core Nodes#

Core nodes are managed by the master node. Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require. For example, a core node runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.

Task Nodes#

Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Task nodes don't run the Data Node daemon, nor do they store data in HDFS.
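
To make the three node roles concrete, the following is a hedged boto3 sketch that requests a cluster with one Master node, two Core nodes, and two optional Task nodes; the name, region, release label, and instance counts are illustrative placeholders rather than a prescribed configuration.

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # region is a placeholder

response = emr.run_job_flow(
    Name="example-cluster",        # placeholder name
    ReleaseLabel="emr-5.30.0",     # illustrative EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # One Master node: runs YARN ResourceManager and HDFS NameNode.
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes: run HDFS DataNodes and execute tasks.
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Optional Task nodes: extra compute, no HDFS storage.
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```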


Clustered Compute Machines#

The currently available standard compute machines for clustering are AWS EC2 instances of the m5 type (2.5 GHz Intel Xeon® Platinum 8175 processors with the Intel Advanced Vector Extensions (AVX-512) instruction set). The following instance sizes are supported:

| Instance Size | vCPU | Memory (GiB) | Instance Storage (GiB) | Network Bandwidth (Gbps) | EBS Bandwidth (Mbps) |
|---------------|------|--------------|------------------------|--------------------------|----------------------|
| m5.xlarge     | 4    | 16           | EBS-Only               | Up to 10                 | Up to 3,500          |
| m5.2xlarge    | 8    | 32           | EBS-Only               | Up to 10                 | Up to 3,500          |
| m5.4xlarge    | 16   | 64           | EBS-Only               | Up to 10                 | 3,500                |
| m5.12xlarge   | 48   | 192          | EBS-Only               | 10                       | 7,000                |
| m5.24xlarge   | 96   | 384          | EBS-Only               | 25                       | 14,000               |

Elastic Block Storage (EBS)#

Amazon Elastic Block Store (EBS) is a block storage service designed for use with Amazon Elastic Compute Cloud (EC2). The local storage of a selected compute machine (Standard Compute Machine or GPU Compute Machine) will be served by EBS. The following EBS volume types are supported:

| Volume Type                   | Volume Size  | Max Throughput/Volume |
|-------------------------------|--------------|-----------------------|
| EBS General Purpose SSD (gp2) | 1 GB - 16 TB | 250 MB/s              |
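
As an illustration, a gp2 volume could be attached to an instance group of the boto3 request sketched earlier via an EbsConfiguration block; the 100 GiB size and volume count below are arbitrary examples.

```python
# EBS configuration for an instance group, as used in the boto3
# run_job_flow request sketched above. Size and count are arbitrary examples.
core_group = {
    "InstanceRole": "CORE",
    "InstanceType": "m5.xlarge",
    "InstanceCount": 2,
    "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {
                    "VolumeType": "gp2",  # General Purpose SSD
                    "SizeInGB": 100,
                },
                "VolumesPerInstance": 1,
            }
        ],
    },
}
```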