Scenario 3 - Windows technologies

Problem statement

As a data scientist, I need to train ML models on a large amount of data that is stored on my Windows VM.

Goals

Download data from within the Windows VM
Create a Spark session through PowerShell
Trigger the Spark execution to process the data

Tools & Capabilities

To meet the use case goals, the following tools from the portal are used:

Tool	Description	Key capability
Windows VM	A Windows virtual machine is a virtual machine (VM) that runs Windows as the guest operating system (guest OS).	Virtual Machine
Apache Spark	Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.	Data processing

Use case guide

This guide walks the user through Scenario 3 - Windows technologies as a step-by-step tutorial. Each subsection covers one step of the approach, namely:

Download data from within the Windows VM. Access the Windows VM through the My Services section of the portal, then use a programmatic method such as PowerShell or a command-line tool to initiate the download of the dataset directly from a specified URL or source location.
Create a Spark session through PowerShell. From within the Windows VM, use PowerShell to connect to the deployed Spark instance, supplying the Spark configuration parameters and commands needed to initialize a Spark session for data processing and analysis.
Trigger the Spark execution to process the data. Execute the necessary Spark commands and scripts within PowerShell to initiate the data processing tasks, leveraging the Spark session previously created, and monitor the progress and results of the data processing operations.
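
The steps above could be sketched in Python, launched from a PowerShell prompt on the VM. This is only a minimal sketch under stated assumptions, not the portal's documented procedure: it assumes Python and a Spark distribution that provides `spark-submit` are available on the VM. The function names `download_dataset` and `build_spark_submit_command`, the default application name, and every URL or path passed to them are hypothetical. The second function only assembles the command line; it does not contact the Spark instance.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dataset(url, destination):
    """Stream the file at `url` to `destination` and return the local path.

    `url` and `destination` are placeholders chosen by the caller.
    """
    destination = Path(destination)
    # Create the target folder if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks so a large dataset is never held fully in memory.
        shutil.copyfileobj(response, out)
    return destination


def build_spark_submit_command(master_url, app_script, data_path,
                               app_name="scenario3-job"):
    """Return the argument list for a spark-submit call.

    Nothing is executed here. `master_url` is the address of the deployed
    Spark instance, `app_script` is the processing script, and `data_path`
    is handed to that script as its first argument (all hypothetical values).
    """
    return [
        "spark-submit",
        "--master", str(master_url),
        "--name", app_name,
        str(app_script),
        str(data_path),
    ]
```

With placeholder values, `download_dataset("https://example.org/data.csv", "C:/data/data.csv")` would fetch the file, and `build_spark_submit_command("spark://<host>:7077", "job.py", "C:/data/data.csv")` would return the arguments to run. The list can be joined with spaces and run in PowerShell, or passed to `subprocess.run`. On Windows the Spark launcher script is `spark-submit.cmd`, so the first element may need adjusting.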