
Use case Guide - Scenario 4 - Big data analytics#

This document is meant to guide the user through Scenario 4 - Big data analytics. As discussed in the use case description, the goal is to provide a tool that performs data analytics on a large amount of data. This guide is a step-by-step tutorial towards that objective. More in detail, each sub-section covers a step of the approach, namely:

  1. Step 1: Initialize the resources.
  2. Step 2: Load the data on MinIO.
  3. Step 3: Configure MinIO.
  4. Step 4: Get Spark information.
  5. Step 5: Read the data on JupyterLab and trigger Spark execution.

Use case code#

Code
Use case code - Scenario 4 - Big data analytics

Step 1: Initialize the resources#

As a first step, the user should initialize the required resources. In particular, three instances should be launched:

  • MinIO
  • Jupyter for Spark
  • Spark

Initialize the MinIO instance#

  1. Go to the Service Catalog section of the Portal.
  2. Click on the Launch button on the MinIO badge.
  3. Assign a name to the instance and select a group from the ones available in the list.
  4. Select the Default configuration.
  5. Set your MinIO Admin Username and MinIO User Username. Copy the auto-generated password. This will be needed to access the instance at a later stage. (NB: Instance credentials are automatically saved and accessible in the My Data section of the portal).
  6. In the User Policies field, keep only consoleAdmin and remove the rest.
  7. In the Storage Resource field, set the value to 500Gi. This allocates the MinIO instance enough storage space to store the use case data.
  8. Launch the instance by clicking on the launch button.

Initialize the Jupyter for Spark instance#

  1. Go to the Service Catalog section of the Portal.
  2. Click on the Launch button on the Jupyter for Spark badge.
  3. Assign a name to the instance and select a group from the ones available in the list.
  4. Select the Default configuration.
  5. Copy the auto-generated password. This will be needed to access the instance at a later stage. (NB: Instance credentials are automatically saved and accessible in the My Data section of the portal).
  6. Select the NFS PVC name corresponding to the DSL group selected at point 3.
  7. Launch the instance by clicking on the launch button.

Initialize the Spark instance#

  1. Go to the Service Catalog section of the Portal.
  2. Click on the Launch button on the Spark badge.
  3. Assign a name to the instance and select a group from the ones available in the list.
  4. Select the Default configuration.
  5. Set your Spark Username.
  6. Copy the auto-generated password. This will be needed to access the instance at a later stage. (NB: Instance credentials are automatically saved and accessible in the My Data section of the portal).
  7. The capacity allocated to the Spark instance needs to be sufficient to cope with the high workload. Please make sure that adequate capacity is assigned in the following configuration fields:
    • Master CPU Resource Request
    • Master Memory Resource Request
    • Master CPU Resource Limits
    • Master Memory Resource Limits
    • Worker Replicas
    • Worker CPU Resource Request
    • Worker Memory Resource Request
    • Worker CPU Resource Limits
    • Worker Memory Resource Limits

As an example, the following configuration can be suitable for the proposed use case:

(Screenshot: example resource configuration for the Spark instance)

  8. Launch the instance by clicking on the launch button.

After launching the instances, go to the My Services section of the portal to verify that all the deployed services are up and running (all three instances should be in status ACTIVE).

(Screenshot: My Services section with the three instances in status ACTIVE)

Step 2: Load the data on MinIO#

  1. Download the dataset from here. Make sure to extract all the files.
  2. Go to the My Services section of the Portal and open the MinIO instance.
  3. Log in to your MinIO instance with the access credentials defined in the configuration.
  4. In the MinIO console, go into the Buckets section and click on Create Bucket +.

(Screenshot: MinIO console, Buckets section with the Create Bucket + button)

  5. Assign a Bucket Name (e.g. data) and click on Create Bucket.
  6. After the bucket has been created, access it from the bucket list. Then, click on the Upload button and Upload File.

(Screenshot: bucket view with the Upload button)

  7. Select the dataset downloaded and extracted at point 1 (alternatively, a scripted upload is sketched below).
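If you prefer to load the data from a script instead of the web console, the following is a minimal sketch using the MinIO Python SDK. The endpoint, credentials, bucket name and local folder are placeholders and must be replaced with your own values (the service account created in Step 3, or the console credentials, can be used as access and secret key).

```python
from pathlib import Path

from minio import Minio  # pip install minio

# Placeholder endpoint and credentials: use your instance's private endpoint
# (without the http(s):// prefix) and your own keys.
client = Minio(
    "minio.example.internal:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,  # set to True if the endpoint is served over HTTPS
)

bucket = "data"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Upload every extracted file, keeping the relative path as the object name.
dataset_dir = Path("dataset")  # placeholder: folder with the extracted files
for path in dataset_dir.rglob("*"):
    if path.is_file():
        client.fput_object(bucket, path.relative_to(dataset_dir).as_posix(), str(path))
```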

Step 3: Configure MinIO#

  1. In the MinIO console, click on Identity > Service Accounts > Create service account:

(Screenshot: MinIO console, Identity > Service Accounts > Create Service Account)

  2. Then, click on Create and then on Download for import:

(Screenshot: service account creation with the Download for import option)

  3. This will trigger the download of a JSON file, which can be read with any text editor and will be used to configure the notebook in Step 5. This file contains the following information:
    • url: This is the private endpoint of the MinIO instance.
    • accessKey
    • secretKey
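As a minimal sketch, the downloaded file (here assumed to be saved as credentials.json) can be loaded in Python so that its values are ready for the notebook in Step 5:

```python
import json

# "credentials.json" is a placeholder name for the file downloaded from the MinIO console.
with open("credentials.json") as f:
    creds = json.load(f)

minio_url = creds["url"]          # private endpoint of the MinIO instance
access_key = creds["accessKey"]
secret_key = creds["secretKey"]
```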

Step 4: Get Spark information#

  1. Go to the My Services section of the Portal and open the Spark instance.
  2. Log in to your Spark instance with the access credentials defined in the configuration.
  3. When accessing the Spark instance, copy the URL shown on the UI:

(Screenshot: Spark instance UI showing the URL to copy)

This information will be needed in the next step.
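Note that the copied value is the Spark master URL; for a standalone Spark deployment it typically has the form spark://<master-host>:7077 (the exact host and port depend on your instance). It is the value to pass as the master when building the Spark session in the notebook (see the sketch in Step 5).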

Step 5: Read the data on JupyterLab and trigger Spark execution#

  1. Go to the My Services section of the Portal and open the Jupyter for Spark instance.
  2. Log in to your Jupyter for Spark instance with the access credentials defined in the configuration.
  3. In the Jupyter navbar, click on File and then on New Launcher.
  4. Click on Python 3 (ipykernel) in the Notebook section to create a new Jupyter for Spark notebook.

(Screenshot: JupyterLab launcher with the Python 3 (ipykernel) notebook option)

  5. In the newly created notebook, execute the command "%env" and copy the value of the Jupyter hostname (HOSTNAME):

(Screenshot: %env output showing the HOSTNAME value)

This information will be needed to configure the notebook.
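For reference, below is a minimal sketch of how the pieces collected so far typically fit together in the notebook: the Spark master URL from Step 4, the HOSTNAME copied above, and the MinIO endpoint and keys from Step 3. All values in angle brackets are placeholders, the dataset path and format are assumptions, and the snippet presumes that the Jupyter for Spark image ships the Hadoop S3A connector; the reference notebook linked in the next point remains the authoritative source.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("big-data-analytics")
    .master("spark://<spark-master-host>:7077")          # URL copied in Step 4
    .config("spark.driver.host", "<jupyter-hostname>")   # HOSTNAME value from %env
    # S3A connector settings pointing at the MinIO instance (values from the Step 3 JSON)
    .config("spark.hadoop.fs.s3a.endpoint", "<minio-url>")
    .config("spark.hadoop.fs.s3a.access.key", "<accessKey>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secretKey>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Read the documents uploaded in Step 2 (bucket name, path and format are assumptions;
# the reference notebook defines the actual ones).
df = spark.read.json("s3a://data/*.json")
df.printSchema()  # extract the data schema from the parsed documents
```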

  6. Create the notebook by using this notebook as a reference. Make sure to configure what is needed according to the comments present in the notebook.
  7. This script will run the following operations on the data:
    • Extract the data schema from all the parsed documents of the dataset;
    • Run queries on the data (illustrated in the sketch below):
      • How many papers for each source?
      • Which author has written the most papers?
      • Which are the abstracts for the reported papers?
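Purely as an illustration of these queries, here is a hedged sketch in the PySpark DataFrame API, reusing the df from the configuration sketch above. The column names (source, authors, title, abstract) are assumptions and must be adapted to the schema extracted in the previous operation and to the reference notebook.

```python
from pyspark.sql import functions as F

# How many papers for each source? (assumes a "source" column)
df.groupBy("source").count().orderBy(F.desc("count")).show()

# Which author has written the most papers? (assumes an array column "authors")
(df.select(F.explode("authors").alias("author"))
   .groupBy("author").count()
   .orderBy(F.desc("count"))
   .limit(1)
   .show(truncate=False))

# Which are the abstracts for the reported papers? (assumes "title" and "abstract" columns)
df.select("title", "abstract").show(5, truncate=False)
```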
  8. Go to the My Services section of the Portal and open the Spark instance.
  9. Log in to your Spark instance with the access credentials defined in the configuration.
  10. Click on the currently Running Application:

(Screenshot: Spark UI with the currently running application)

  11. In the Logs column, click on stderr to view the logs produced by each of the executors.

(Screenshot: executor logs accessible via the stderr link)