Spark/Hive Cluster
Amazon EMR
- User needs: Data integration, Data query, Data processing
- User profiles: Data Scientists, Data Engineers
- User assumed knowledge: Basic knowledge of the concepts of Hadoop, Spark and Hive
Amazon EMR is an industry-leading cloud-native big data platform for processing vast amounts of data quickly and cost-effectively at scale. Using open-source tools such as Apache Spark and Apache Hive, coupled with the dynamic scalability of Amazon EC2 and the scalable storage of Amazon S3, EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis at a fraction of the cost of traditional on-premises clusters.
In BDTI, Amazon EMR comes in two setups: Amazon EMR Vanilla and Amazon EMR Kerberos. Amazon EMR Vanilla allows you to connect from remote nodes (such as Data Science Studio or KNIME Analytics Platform on your Amazon WorkSpace) and execute Hive or Spark jobs. Amazon EMR Kerberos is a highly secured EMR cluster that is accessible only through Apache Hue, a management application for Hive and Spark clusters. Both setups can be tightly integrated with Amazon S3: Amazon EMR can access files in S3 directly, as if they were stored in local HDFS. Users of Amazon EMR are advised to familiarize themselves with the concepts of Hadoop, Spark and Hive.
Accessing Amazon EMR Vanilla
To connect to EMR Vanilla, you need the following detail:
- EMR IP address
Amazon EMR Vanilla is an EMR cluster that is configured with Spark and Hive. Both applications run on top of Hadoop and YARN. You can access Spark from an external node by connecting to Livy, a REST interface for Spark. Data Science Studio comes with preinstalled configurations that allow easy access via Jupyter notebooks with Livy. KNIME Analytics Platform also has dedicated Big Data connectors that can access Spark via the Livy interface. Hive is accessible via Spark applications or via the JDBC interface. Spark applications can execute Spark SQL, which allows them to query tables in Hive. If you run Spark applications via Jupyter notebooks, you can access Hive tables in the same way.
- More information about accessing Hive via Spark
- Livy endpoint for remote access: http://[IP address EMR]:8998 (this endpoint is unauthenticated)
- More information about Livy
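As an illustration of the remote access described above, the following minimal sketch submits a PySpark snippet through Livy's REST API using only the Python standard library. It assumes the unauthenticated endpoint listed above and that the Livy session exposes a `spark` variable with Hive support enabled; the helper names (`build_request`, `call_livy`, `wait_for_state`, `run_pyspark`) are hypothetical and not part of BDTI or Livy.

```python
import json
import time
import urllib.request

# Placeholder taken from the text above; replace it with your cluster's IP address.
LIVY_URL = "http://[IP address EMR]:8998"


def build_request(base_url, path, payload=None, method="GET"):
    """Build an HTTP request for the Livy REST API (no authentication is sent)."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def call_livy(base_url, path, payload=None, method="GET"):
    """Send the request and decode the JSON response."""
    request = build_request(base_url, path, payload, method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_state(base_url, path, wanted, failed,
                   poll_seconds=2.0, timeout_seconds=600.0):
    """Poll a session or statement until it reaches one of the wanted states."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        info = call_livy(base_url, path)
        if info["state"] in wanted:
            return info
        if info["state"] in failed:
            raise RuntimeError(
                "Livy reported state {!r} for {}".format(info["state"], path)
            )
        time.sleep(poll_seconds)
    raise TimeoutError("Timed out waiting for {}".format(path))


def run_pyspark(base_url, code):
    """Run a PySpark snippet in a new Livy session and return the statement output."""
    session = call_livy(base_url, "/sessions", {"kind": "pyspark"}, method="POST")
    session_path = "/sessions/{}".format(session["id"])
    try:
        # A new session must become idle before it accepts statements.
        wait_for_state(base_url, session_path, {"idle"},
                       {"error", "dead", "killed", "shutting_down"})
        statement = call_livy(base_url, session_path + "/statements",
                              {"code": code}, method="POST")
        statement_path = "{}/statements/{}".format(session_path, statement["id"])
        result = wait_for_state(base_url, statement_path, {"available"},
                                {"error", "cancelling", "cancelled"})
        return result["output"]
    finally:
        # Always delete the session so it does not hold cluster resources.
        call_livy(base_url, session_path, method="DELETE")
```

With a reachable cluster, a call such as `run_pyspark(LIVY_URL, 'spark.sql("SHOW TABLES").show()')` would run a Spark SQL statement against Hive and return Livy's output object. Its `status` field should be checked, since a statement can complete and still report an error.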
Accessing Amazon EMR Kerberos
Amazon EMR Kerberos is a highly secured EMR cluster with a Kerberos configuration. Users can access this cluster only via Apache Hue, using their BDTI credentials.
- Apache Hue endpoint: https://[IP address EMR]:8888
- When visiting the Apache Hue endpoint, you will see the login screen of Apache Hue.
- You need to use your BDTI username and password.
- After you log in successfully, the Apache Hue user interface is shown.
- You can start executing Hive queries in the default interpreter, or switch to the PySpark interpreter to execute Spark applications.
- In the PySpark interpreter, you can try executing a simple test application.
- You can upload files directly to HDFS.
- Files uploaded to HDFS can be accessed directly by Apache Hive or Spark (hdfs:///user/[username]).
- Files uploaded to Amazon S3 can be accessed directly by Apache Hive or Spark, provided the EMR cluster has permission to access the specified Amazon S3 bucket. Amazon S3 data is accessible in Spark or Hive under the following path: s3a://s3-bucket-name/path-to-data
# Read data in HDFS
df = spark.read.csv("hdfs:///your-folder/data.csv")
df.show()
# Read data in S3
df = spark.read.csv("s3a://your-bucket/data.csv")
df.show()
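The path conventions above can be wrapped in small helpers so that HDFS and S3 locations are built consistently. This is only a sketch: the function names and the example username, bucket and file names are hypothetical, and `read_csv_with_header` assumes it is given an existing `spark` session such as the one in the PySpark interpreter.

```python
def hdfs_user_path(username, relative_path):
    """Build an HDFS URI below the user's folder, hdfs:///user/[username]."""
    return "hdfs:///user/{}/{}".format(username, relative_path.lstrip("/"))


def s3a_path(bucket, key):
    """Build an S3 URI of the form s3a://s3-bucket-name/path-to-data."""
    return "s3a://{}/{}".format(bucket, key.lstrip("/"))


def read_csv_with_header(spark, path):
    """Read a CSV file, treating the first row as column names."""
    # header and inferSchema are standard options of spark.read.csv.
    return spark.read.csv(path, header=True, inferSchema=True)
```

For example, `read_csv_with_header(spark, hdfs_user_path("jdoe", "data.csv"))` would read `hdfs:///user/jdoe/data.csv`, and `s3a_path("your-bucket", "data.csv")` yields the S3 path used in the snippet above.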