This page provides resources to help you further with BDTI. Links to webinars, Self-Paced Labs and other helpful resources for your data challenge.
Webinars have been organised to explain BDTI in more detail. Each webinar covers the goal of the BDTI platform, how to apply, pilot testimonials and hands-on demos. The following table includes links to previous recordings.
|CEF BDTI Webinar - Empowering Public Administrations - February 23, 2021||https://app.sli.do/event/9aoxewbh/live/questions||Download|
|CEF BDTI Webinar - Discover how BDTI can support Public Administrations - May 27, 2021||https://www.youtube.com/watch?v=WmUcB5k-yy8||Download|
|CEF BDTI - Technical Webinar - June 23, 2021||https://www.youtube.com/watch?v=uv6rWkF86pY||Download|
|DEP BDTI Webinar - Empowering Public Administrations: March 2, 2022||https://www.youtube.com/watch?v=Vyig44_fv88||Download|
|DEP BDTI Webinar - Data: From Hype to Action, service architecture and hands-on use case - May 10, 2022||https://www.youtube.com/watch?v=un8mU17_zlk||Download|
Self-Paced Labs are short tutorials developed using Jupyter Notebooks for an interactive learning experience. Users can read the text explanation and directly run the relevant Python code snippet. These short tutorials introduce BDTI users to popular Python Frameworks for Data Science use cases, and use open source data from data.europa.eu and Kaggle (an open source machine learning community) to demonstrate these Python Frameworks.
An overview of the available self-paced labs is provided in the table above.
The above decision tree recommends which available self-paced labs you should consult based on your use case:
As a first step, we recommend to go though the lab Onboarding: Data Science Studio if you are running Jupyter Notebooks on your Data Science Studio. This lab explains how you can, e.g., programmatically connect to other resources in your infrastructure. We recommend to go through the lab Onboarding: Local for any other Jupyter Notebook environment (e.g. on you local computer). This lab explains how you can, e.g., upload your data from your local computer to the S3 bucket of your infrastructure.
If you would like to explore other datasets with similar use case as yours, we recommend to go through the lab Available Datasets. This lab explains how you can consult open source data of data.europa.eu, eurostat, kaggle, etc, and how to download their datasets programmatically.
We recommend to go through Python: Data Exploration. This lab explains how you can use Python to import your data in Jupyter Notebooks, preprocess your data and how to preliminary analyze your data.
Big Data use cases usually are characterised by the "three Vs of Big Data": Data that has high volume (e.g., data above 30 GBs), data that has high velocity (e.g., streaming requirements), data that has high variety (a high variety of different data sources). If your use case involves Big Data, we recommend to go through the lab Apache Spark: Data Exploration or Apache Hive: Data Exploration, depending on the framework you would like to use.
Relational datasets are often represented using tabular format (using rows and columns) and static schemas (static naming to the columns of the tabular format). Relationships between columns can be described using diagrams with the help of primary keys and foreign keys. A sophisticated software system used to maintain relational databases is called a relational database management system (RDBMS). If your dataset is not relational of nature, it is called a non-relational dataset, which can be categorized in semi-structured or non-structured. If you are dealing with relational datasets, we recommend to go through the lab Databases: MySQL.
Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, and computer vision, etc. where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. If your use case involves machine learning, we recommend to go through the lab Machine Learning: Regression, Machine Learning: Classification, Machine Learning: Clustering or Machine Learning: Time Series Analysis, depending on the machine learning classification of your use case.
The repository is openly available here: https://git.fpfis.eu/public-datateam/dataplatform-self-paced-labs.
Relevant GitHub repositories#
Several GitHub repositories can be consulted for an introduction to solving data challenges:
- Python Data Science Handbook contains the entire Python Data Science Handbook (Jake VanderPlas), in the form of free Jupyter notebooks.
- Awesome Machine Learning is a curated list of awesome machine learning frameworks, libraries and software (by language).
- Data Science IPython Notebooks is huge collection of data science Python notebooks that covers a variety of topcis such as deep learning, big data and statistical algorithms.
- Awesome Data Science is a shortcut path to start studying Data Science.
BDTI Data Catalogue#
Please refer to the BDTI Data Catalogue for more information on publicly available datasets.