
Scenario 1 - AI & ML Learning

Problem statement

As a data scientist, I need to train ML models on a large amount of data that is stored on my Linux VM.

Goals

  • Download data from within the Linux machine.
  • Launch the Linux VM.
  • Create a Spark session through JupyterLab.
  • Trigger the Spark execution to process the data.

Tools & Capabilities

To meet the use case goals, the following tools from the portal will be leveraged:

| Tool | Description | Key capability |
|------|-------------|----------------|
| Linux VM | A Linux virtual machine is a virtual machine (VM) that runs a distribution of Linux as the guest operating system (guest OS). | Virtual machine |
| JupyterLab + Spark | The Jupyter Notebook is a web application for creating and sharing documents that contain code, visualizations, and text. It can be used for data science, statistical modeling, machine learning, and much more. Here it serves as the interface to Spark. | Trigger Spark execution; perform advanced analytics |

Use case guide

This document guides the user through Scenario 1 - AI & Machine Learning as a step-by-step tutorial. Each subsection covers one step of the approach:

  1. Download data from within the Linux machine. Connect to the deployed Linux virtual machine, navigate to the target directory in the terminal, download the dataset with a tool such as wget or curl, extract the files if needed, and verify the download before using the dataset in the Linux environment (a minimal Python alternative is sketched after this list).
  2. Launch the Linux VM. Open the Service Catalog section of the Portal, choose the Linux VM instance you wish to launch, and access it from the My Services section of the Portal.
  3. Create a Spark session through JupyterLab. Create a Jupyter Notebook and configure its connection to a deployed Spark instance (see the session sketch after this list).
  4. Trigger the Spark execution to process the data. In JupyterLab, open the notebook created in Step 3, run the Spark code that loads, transforms, and processes the data through the Spark session, and monitor the execution progress and results to confirm the data was processed correctly (see the processing sketch after this list).
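
Downloading the data (Step 1). The sketch below is a minimal Python alternative to running wget or curl in the VM terminal; the dataset URL and target directory are hypothetical placeholders and should be replaced with your own.

```python
import tarfile
import urllib.request
from pathlib import Path

# Hypothetical dataset URL and target directory; replace with your own.
DATA_URL = "https://example.com/datasets/training-data.tar.gz"
DATA_DIR = Path.home() / "datasets"

def download_and_extract(url: str = DATA_URL, dest: Path = DATA_DIR) -> Path:
    """Download an archive into `dest`, extract it, and return the archive path."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]

    # Fetch the file to disk (equivalent to `wget <url>` in the VM terminal).
    urllib.request.urlretrieve(url, archive)

    # Extract tar archives and report the size as a basic verification step.
    if tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tar:
            tar.extractall(dest)
    print(f"Downloaded {archive} ({archive.stat().st_size / 1e6:.1f} MB)")
    return archive

if __name__ == "__main__":
    download_and_extract()
```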
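
Creating the Spark session (Step 3). A minimal sketch of the notebook cell that opens a SparkSession; the master URL and resource settings are assumptions about the deployed Spark instance and should match the endpoint and sizing shown in the portal.

```python
from pyspark.sql import SparkSession

# Assumption: a standalone Spark cluster reachable at spark-master:7077.
# Use the endpoint of the Spark instance deployed from the portal.
spark = (
    SparkSession.builder
    .appName("ai-ml-scenario-1")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

print(spark.version)  # quick check that the session is up
```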
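
Triggering the Spark execution (Step 4). A minimal processing sketch, assuming the `spark` session from Step 3, a CSV file at the download location, and hypothetical column names (`label`, `feature_1`); calling an action such as `show()` or `write` is what actually triggers the distributed execution.

```python
from pyspark.sql import functions as F

# Hypothetical input path and column names; adjust to the downloaded dataset.
df = spark.read.csv(
    "file:///home/user/datasets/training-data.csv",
    header=True,
    inferSchema=True,
)

# Example transformation: drop incomplete rows and aggregate per label.
summary = (
    df.dropna()
      .groupBy("label")
      .agg(F.count("*").alias("rows"), F.avg("feature_1").alias("avg_feature_1"))
)

summary.show()  # action that triggers the distributed execution

# Persist the processed output for later model training (hypothetical path).
summary.write.mode("overwrite").parquet("file:///home/user/datasets/processed/")
```

Execution progress can typically be monitored from the Spark web UI exposed by the deployed Spark instance, or directly from the notebook output.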