
Scenario 4 - Big data analytics

Problem statement

As a data scientist, I need to perform data analytics operations on a large amount of data.

Goals

  • Load data from MinIO
  • Read the data
  • Perform data warehousing operations/queries on the data
  • Show results locally

Tools & Capabilities

To meet the use case goals, the following tools from the Portal will be leveraged:

| Tool | Description | Key capability |
| --- | --- | --- |
| Jupyter notebook + Spark | The Jupyter Notebook is a web application for creating and sharing documents that contain code, visualizations, and text. It can be used for data science, statistical modeling, machine learning, and much more. Used here for Spark. | Trigger Spark execution; perform advanced analytics |
| MinIO | MinIO offers high-performance, S3-compatible object storage. Native to Kubernetes, MinIO is the only object storage suite available on every public cloud, every Kubernetes distribution, the private cloud, and the edge. MinIO is software-defined and is 100% open source under GNU AGPL v3. | Load and store the data |

Use case guide

This document guides the user through Scenario 4 - Big data analytics by presenting a high-level overview of the main steps. As discussed in the use case description, the goal is to provide a set of tools for performing data analytics on a large amount of data.

  • Step 1: Initialize the resources. Launch three instances (Jupyter, Spark, and MinIO) from the Service Catalog section of the Portal, and verify in the My Services section that their status is ACTIVE, which confirms that the deployed instances are ready to be used.
  • Step 2: Load the data on MinIO. Download and extract the dataset, then open the MinIO instance from My Services, log in with the provided credentials, create a new bucket named "data", and upload the uncompressed dataset files to the bucket for storage and later access (a minimal upload sketch using the MinIO Python client follows this list).
  • Step 3: Configure MinIO. In the MinIO console, create a service account and download the JSON file containing the private endpoint URL, access key, and secret key; these values are used for configuration in the later steps.
  • Step 4: Get Spark information. Access the Spark instance from the My Services section, log in using the specified credentials, and copy the displayed URL for use in the subsequent steps.
  • Step 5: Read the data on JupyterLab and trigger Spark execution. Access the Jupyter for Spark instance from My Services, log in using the provided credentials, and create a new Jupyter for Spark notebook. Execute commands to extract the Jupyter hostname (HOSTNAME), then use the provided reference notebook to perform data operations such as extracting the data schema, running queries, and analyzing the papers (a minimal PySpark sketch follows this list). Additionally, open the Spark instance, log in, and view the logs of the currently running Spark application to monitor its execution.
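
The upload described in Steps 2 and 3 can also be scripted from a notebook instead of being done through the MinIO console. Below is a minimal sketch using the MinIO Python client; the endpoint, access key, secret key, and file names are placeholders standing in for the values from the service-account JSON and for your own dataset files.

```python
from minio import Minio

# Placeholder values: take these from the service-account JSON
# downloaded in Step 3 (private endpoint URL, access key, secret key).
client = Minio(
    "minio.example.local:9000",   # assumed private endpoint, without http(s)://
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,                 # set to True if the endpoint uses TLS
)

# Create the "data" bucket if it does not already exist.
if not client.bucket_exists("data"):
    client.make_bucket("data")

# Upload one uncompressed dataset file; repeat (or loop) for the others.
client.fput_object("data", "dataset.json", "/path/to/dataset.json")
```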
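For Step 5, the reference notebook essentially points a SparkSession at the Spark master URL copied in Step 4 and at the MinIO bucket through the S3A connector. The sketch below illustrates the idea under a few assumptions: the dataset is stored as JSON in the "data" bucket, the cluster images already include the S3A/hadoop-aws jars, and the master URL, endpoint, credentials, object name, and the "papers" view name are placeholders.

```python
import socket

from pyspark.sql import SparkSession

# Placeholder values: the Spark master URL comes from Step 4, the MinIO
# endpoint and keys from the service-account JSON downloaded in Step 3.
spark = (
    SparkSession.builder
    .appName("big-data-analytics")
    .master("spark://spark-master.example.local:7077")       # URL copied in Step 4
    .config("spark.driver.host", socket.gethostname())       # the Jupyter HOSTNAME from Step 5
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.local:9000")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the dataset from the "data" bucket and inspect its schema.
df = spark.read.json("s3a://data/dataset.json")
df.printSchema()

# Example warehousing-style query: register a view and run SQL against it.
df.createOrReplaceTempView("papers")
spark.sql("SELECT COUNT(*) AS n FROM papers").show()
```

While the read and query cells run, the application should appear in the Spark UI opened in Step 4, where its logs can be followed as described in Step 5.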