Scenario 3 - Windows technologies
Problem statement
As a data scientist, I need to train ML models on a large amount of data that is stored on my Windows VM.
Goals
- Download data from within the Windows VM
- Create a Spark session through PowerShell
- Trigger the Spark execution to process the data
Tools & Capabilities
In order to meet the use case goals, the following tools from the portal will be leveraged:
Tool | Description | Key capability |
---|---|---|
Windows VM | A Windows virtual machine is a virtual machine (VM) that runs Windows as the guest operating system (guest OS). | Virtual Machine |
Apache Spark | Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. | Data processing |
Use case guide
This guide walks the user through Scenario 3 - Windows technologies as a step-by-step tutorial. Each subsection covers one step of the approach:
- Download data from within the Windows VM. Access the Windows VM through the My Services section of the portal, then use a programmatic method such as PowerShell or a command-line tool to initiate the download of the dataset directly from a specified URL or source location.
- Create a Spark session through PowerShell. From within the Windows VM, use PowerShell to connect to the deployed Spark instance, providing the Spark configuration parameters and commands needed to initialize a Spark session for data processing and analysis.
- Trigger the Spark execution to process the data. Execute the necessary Spark commands and scripts within PowerShell to initiate the data processing tasks, leveraging the Spark session previously created, and monitor the progress and results of the data processing operations.
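
The three steps above can be sketched as a small Python script launched from a PowerShell prompt on the VM. This is a minimal illustration rather than the portal's prescribed method. It assumes Python and PySpark are installed on the VM and that the dataset is a CSV file with a header row. The names `DATA_URL`, `LOCAL_PATH`, `SPARK_MASTER` and the function names are hypothetical placeholders, not values taken from the portal.

```python
import shutil
import urllib.request
from pathlib import Path

# Hypothetical placeholders: replace with the real dataset location and Spark master.
DATA_URL = "https://example.com/dataset.csv"
LOCAL_PATH = Path("C:/data/dataset.csv")
# "local[*]" runs Spark inside the VM; the deployed instance would use its own master URL.
SPARK_MASTER = "local[*]"


def download_dataset(url: str, destination: Path) -> Path:
    """Step 1: stream the file at `url` to `destination` and return the local path."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


def create_spark_session(master: str, app_name: str = "windows-vm-use-case"):
    """Step 2: create (or reuse) a Spark session connected to `master`."""
    # Imported here so the download step also works where PySpark is not installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


def process(spark, path: Path) -> int:
    """Step 3: read the CSV and trigger execution; returns the number of rows."""
    df = spark.read.csv(str(path), header=True, inferSchema=True)
    # Transformations are lazy; count() is an action, so it starts the Spark job.
    return df.count()


def run_pipeline(url: str = DATA_URL,
                 destination: Path = LOCAL_PATH,
                 master: str = SPARK_MASTER) -> int:
    """Run the three steps in order and always release the Spark session."""
    local_file = download_dataset(url, destination)
    spark = create_spark_session(master)
    try:
        return process(spark, local_file)
    finally:
        spark.stop()
```

Calling `run_pipeline()` executes the download, session creation and processing in sequence. When `SPARK_MASTER` points at a remote cluster instead of `local[*]`, the downloaded file would need to sit on storage that the Spark executors can read, since a path on the VM's local disk is only visible to the VM itself.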