"Great works are performed not by strength but by perseverance."

- Samuel Johnson

Anthony Calek

Developing a Data Platform for F1 Motorsport Historical Data Analysis and Reporting

An Azure Databricks-Focused Data Engineering Solution

Another dispatch from the stay-at-home dad turned aspiring data engineer. F1 Motorsport is awash in datasets. Like most sports, statistics provide the context that teams, drivers, and fans rely on to make sense of what we are seeing and experiencing. Understanding why some teams and drivers consistently outperform others can only be achieved by transforming data points generated at race events into actionable insights. While everyone wants to know who won the race, some of us are keen to understand why.

Insights are drawn from analysis, which requires information. Information stems from data, and data necessitates a collection process. When done correctly, these processes appear seamless to the end-user, and that's the point. Most people driving across a bridge, structural engineers aside, don't think about how the concrete and steel beneath them support the cars and trains crossing it. Similarly, data engineers ponder questions like, "How much latency is acceptable when collecting car telemetry data?" or "What type of data warehousing solutions are teams deploying for real-time analysis?" while watching this year's Monza race results. To most, these might be mundane questions, but to those like myself who appreciate the staggering amount of data generated at just one race, the technologies involved are fascinating.

Enough of the intro; here is the solution architecture and the platform I developed:

Project Overview and Links

The F1 Motorsport Data Platform was developed as part of a comprehensive course on building complete data engineering solutions using Azure services. The platform leverages Azure Databricks as the primary service for data processing, enabling efficient extraction, transformation, and delivery of data for advanced analytics and machine learning. Azure Data Factory plays a crucial role in managing data ingestion and orchestrating complex data pipelines, ensuring seamless and reliable data flow.

The storage architecture incorporates both Lakehouse and Delta Lake paradigms, offering scalable, high-performance data storage with ACID transaction support, enabling real-time analytics and ensuring data consistency. Data integrity and governance are meticulously managed through Unity Catalog, providing a unified governance solution that simplifies data discovery and policy management.

For end-user analysis and reporting, the platform integrates Databricks' native analytics capabilities with the robust reporting features of Power BI, offering a comprehensive solution for data visualization and business intelligence. Additionally, the platform supports a dedicated storage environment optimized for machine learning, facilitating the development, training, and deployment of machine learning models.

Overall, this platform exemplifies a robust, scalable, and secure data engineering solution, tailored to meet the complex needs of modern data-driven projects.

A quick note to employers viewing this project: the following project solution documentation was created by me using Adobe InDesign, Microsoft Power BI, and Azure Databricks. These documents highlight my ability to communicate both technical and analytical solutions to data engineering challenges effectively. They demonstrate my proficiency with industry-standard tools and my capability to present complex data insights in a clear and visually appealing manner.

Project Requirements

The Ergast site provided F1 Motorsport race results, team, driver, and circuit records from the sport's inception in 1950 to the present, and continues to offer API, HTTP, and CSV database access until its planned closure at the end of 2024. For this project, a more tailored view of these records was created, focusing on driver and team performance to showcase the most dominant competitors throughout the sport's history. Additionally, datasets comparing driver qualifying versus final race positions and circuit performance were developed to offer a comprehensive view of F1 Motorsport's key components.

The primary focus of this project was to implement a data engineering solution within the Microsoft Azure Databricks service and the broader Azure platform, showcasing its diverse data engineering capabilities. Although the machine learning aspects were intentionally left unrealized, the platform is fully set up to support ML modeling.

Integrating Delta Lake architecture with the data governance capabilities of Unity Catalog in the Azure Databricks environment offers enhanced control and security over data assets. Delta Lake ensures data reliability and performance with ACID transactions and scalable metadata handling, while Unity Catalog provides a unified governance solution for managing data access policies, lineage, and auditing across the Azure platform. This project was developed to enhance my understanding of the Databricks environment and the real-world application of regulatory compliance, which proved to be an invaluable aspect of the instruction.

Raw Datasets (Ergast.com)

HTTP connectors, managed by Data Factory, were initially used to ingest the datasets. A Data Lake was developed to store the ingested data before transformations were conducted in Databricks. Users of Databricks will recognize the notebook environment's high customizability, seamless integration with the Azure ecosystem, and scalable, cost-effective compute options designed for both teams and individual developers. Specifically, Databricks offers two types of compute: "All-Purpose Compute" for interactive workloads such as ad-hoc analysis and development, and "Job Compute" for automated, scheduled tasks and production workloads. This flexibility ensures that various computational needs are efficiently met within the platform.

Connecting or mounting storage within Databricks is straightforward, secure, and bolstered by Azure ecosystem integration. Ingestion and transformation processes were explored using PySpark and SparkSQL within the Azure Databricks environment. These tools facilitated efficient data manipulation and querying, leveraging both Python and SQL-based syntaxes for comprehensive data engineering tasks. Additionally, the notebook environment in Azure Databricks allows users to manipulate data using pure SQL when necessary, providing flexibility and ease of use for analysts who prefer SQL for data querying and transformation.
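
To give a flavour of what those notebooks look like, below is a minimal sketch of the ingestion pattern: read a raw CSV with an explicit schema, stamp it with an ingestion date, and persist it to the processed layer as Delta. The storage paths, file name, and columns are illustrative assumptions rather than the exact ones used in the project.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.functions import current_timestamp

# Illustrative paths; the storage account, container, and file names differ in the actual project.
raw_path = "abfss://raw@formula1dl.dfs.core.windows.net/circuits.csv"
processed_path = "abfss://processed@formula1dl.dfs.core.windows.net/circuits"

# Declare the schema explicitly instead of relying on inference, so unexpected
# values surface as errors rather than silent type changes.
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
])

# 'spark' is pre-defined in a Databricks notebook session.
circuits_df = (spark.read
    .option("header", True)
    .schema(circuits_schema)
    .csv(raw_path)
    .withColumn("ingestion_date", current_timestamp()))

# Persist the processed layer as Delta for ACID guarantees and time travel.
circuits_df.write.format("delta").mode("overwrite").save(processed_path)
```

The same pattern repeats for each Ergast file, with only the schema and paths changing, which keeps the ingestion notebooks short and easy to reason about.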

Azure Databricks - Solution Architecture

The F1 Data Platform was developed as part of the “Azure Databricks and Spark for Data Engineers” course, created by the esteemed Ramesh Retnasamy as one of a series of three courses focused on data engineering on the Azure platform. I highly recommend his courses. The solution architecture evolved through several stages, emphasizing various aspects, techniques, strategies, and technologies to arrive at the final solution.

As detailed in the above infographic, the solution builds upon the Lakehouse architecture, adding the open-source Delta Lake layer for both the ingestion and presentation layers, and incorporating Unity Catalog. This integration unifies governance and metadata management of Delta Lake tables, providing fine-grained access control, audit logging, and data lineage tracking. Although this level of control was not strictly necessary for the project, it was included to enhance the learners' skill sets.

A critical component of this architecture is the metastore, which serves as a centralized repository for all metadata related to the data stored in the platform. The metastore supports Unity Catalog by maintaining accurate and comprehensive metadata, enabling robust governance and management. It ensures ACID transactions and schema enforcement in Delta Lake, thereby enhancing data integrity and consistency. Furthermore, the metastore facilitates efficient data discovery and query performance, as it provides a catalog of datasets and schema information that query engines can quickly access. This centralized metadata management is essential for achieving the seamless integration and high level of control demonstrated in the F1 Data Platform.
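
To make the catalog and metastore discussion concrete, here is a small sketch of how a presentation table can be registered and governed under Unity Catalog's three-level namespace. The catalog, schema, table, and group names are placeholders, and results_df stands in for an existing DataFrame; this is not the exact setup from the course.

```python
# Unity Catalog addresses every table as catalog.schema.table; the metastore
# tracks the metadata while the data lives in the managed storage location.
spark.sql("CREATE CATALOG IF NOT EXISTS f1_demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS f1_demo.presentation")

# Register a managed Delta table from an existing DataFrame (placeholder name).
(results_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("f1_demo.presentation.race_results"))

# Fine-grained access control: allow an analyst group to read, and nothing more.
spark.sql("GRANT SELECT ON TABLE f1_demo.presentation.race_results TO `data_analysts`")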

Transformed Datasets (Delta Lake)

The resulting presentation-layer datasets enable a compelling analysis of dominant teams and drivers. Initial analysis results are developed directly within Databricks and presented through a notebook dashboard and URL. The ability to create high-quality visuals within Databricks, without needing a more dynamic service like Power BI, underscores its strength as an integrated engineering solution. I also developed two additional datasets: one for F1 Racecourse Speed Analysis (2012-2020) and another comparing driver qualifying positions with final race positions for the 2020 season. The scope of the data analysis was limited primarily to keep my compute costs down.
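
For a sense of how a dataset like the dominant-drivers view can be derived in a notebook, the sketch below aggregates race results per driver and keeps the top ten by total points. The path and column names are assumptions, and the ranking logic is simplified relative to the course's version.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Presentation-layer race results (placeholder path).
race_results_df = spark.read.format("delta").load(
    "abfss://presentation@formula1dl.dfs.core.windows.net/race_results")

# Career totals per driver: points scored and races won.
driver_standings_df = (race_results_df
    .groupBy("driver_name")
    .agg(F.sum("points").alias("total_points"),
         F.count(F.when(F.col("position") == 1, True)).alias("wins")))

# Rank by total points and keep the ten most dominant drivers.
rank_window = Window.orderBy(F.desc("total_points"))
dominant_drivers_df = (driver_standings_df
    .withColumn("driver_rank", F.rank().over(rank_window))
    .filter("driver_rank <= 10"))

# display() renders an interactive table or chart inside the Databricks notebook.
display(dominant_drivers_df)
```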

Reporting and Analysis Datasets

The final deliverable, a Power BI report complete with navigation and four informative report pages, completes the platform's end-to-end objectives. Power BI can connect directly to Databricks tables or schemas from the Databricks UI. For the purposes of this project documentation, snapshot versions of the Power BI report pages are included below. They include:

  1. Dominant Teams (Top 5 teams, 1950 - 2020)

  2. Dominant Drivers (Top 10 drivers, 1950 - 2020)

  3. 2020 Driver Qualifying Positions vs Final Race Positions

  4. Average Circuit Lap Times (2012 - 2020)

5 Most Dominant Teams (1950 - 2020)

10 Most Dominant Drivers (1950 - 2020)

2020 Driver Qualifying Position vs Final Position

Average Circuit Lap Times (Circuit Gilles Villeneuve - Selected)

Anthony Calek

Designing a Data Platform for Covid-19 Analysis and Prediction in the EU and UK

This project focuses on developing a data platform for analyzing, reporting on, and predicting the impact of the Covid-19 pandemic within the EU and UK, utilizing case numbers, mortality rates, hospital and ICU admissions, and testing statistics.

An Azure Data Factory-Focused Data Engineering Solution

What should a stay-at-home dad do when aiming to break into the well-established field of data engineering? The answer: a Udemy course, and not just any course, but a substantial one. Developed by the renowned Ramesh Retnasamy as part of his Data Engineering Series, “Azure Data Factory For Data Engineers” offers a comprehensive exploration of Azure Data Factory. It masterfully teaches students through the lens of a real-world project, which involves creating a data platform to report on and analyse the EU and UK's responses to the Covid-19 pandemic.

If you're even remotely curious about the Azure Platform and its data engineering capabilities, I highly recommend exploring the course through the provided link.

As part of the project, I've included documents detailing the project scope, datasets, employed technologies, and the developed solution architecture. Additionally, I will share a series of slides showcasing the Power BI report created for the course, which I have updated to align with my style guidelines.

If you are an employer or potential customer interested in viewing the actual report, please feel free to contact me directly, and I will be happy to make it available to you.

The documents displayed here (the blue and white ones) were personally created by me to better highlight my interest in the project and my approach to its presentation. While I'm not a graphic designer, I actively use my Adobe subscription, which includes InDesign, to enhance my work.

Project Overview

Back to the project. It might seem a risky assumption that anyone beyond a select few with a special interest in data would dedicate more than a few hours to the Covid-19 pandemic. For someone like myself, who has that special interest in data, a contemporary pandemic presents the ideal scenario to apply the skills necessary for roles in data analysis, leading towards a position in data engineering.

The goal of the data platform is to support three key scenarios: Reporting, Analysis, and Data Science. It was built with a focus on Azure Data Factory (ADF), leveraging its capabilities to integrate and orchestrate the necessary data to meet these needs. In practice, only the Reporting and Analysis components were fully realised. The Data Science aspect, particularly its potential for developing machine learning models, exceeded the course's scope. At the time of writing, I have no plans to explore the Data Science potential, as my focus has been on building my engineering knowledge.

Solution Requirements

I did my best to create a production-like summary of the essential operational components needed for the project. Since I have never needed to develop such documentation before, it likely offers a more high-level perspective. I replicated the data engineering solution architecture and the Continuous Integration/Continuous Deployment (CI/CD) architecture, aligning with the course's specifics. I opted against using Microsoft's icons primarily because they didn’t align well with the document's overall style.

The developed solution architecture is of production quality, albeit with a few modifications. It's improbable that all proposed transformation and analysis technologies—HDInsight, Databricks, ADF Data Flows, and Synapse Analytics—would be used concurrently due to overlapping functionalities, added complexity, and significant costs in both time and money. However, this doesn't rule out the possibility of their combined use in certain scenarios.

From a storage standpoint, the solution is entirely feasible, though it may appear somewhat constrained given the vast array of connection types available in ADF and the potential diversity of data sources.

This project's data sources comprised the European Centre for Disease Prevention and Control’s Covid-19 datasets and EuroStat’s Population by Age dataset. The population data was ingested from Azure Blob Storage, while the Covid-19 datasets—including Cases and Deaths, Country Response, Hospitalisations, and Testing data—were ingested via HTTP.

Solution Architecture

The raw ingested data was loaded into Azure Data Lake Storage Gen2, with Azure Data Factory (ADF) orchestrating the ingestion and integration process. ADF also served as the platform for orchestrating the transformation process. Utilizing ADF Data Flows—a transformation technology featuring a visual interface designed for simple to medium-complexity transformations—the "Cases and Deaths" dataset was enhanced by adding country lookup information to link datasets together, as well as incorporating UK-specific data. Additionally, ADF Data Flows segmented the Hospital Admissions data into daily and weekly groupings, utilising a Dim Date lookup table to facilitate this process.
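
The enrichment itself is configured visually in the ADF Data Flow rather than written as code; purely for illustration, the same lookup logic looks roughly like this when expressed in PySpark, with placeholder paths and column names.

```python
from pyspark.sql import functions as F

# Placeholder paths; in the project these transformations live in ADF Data Flows,
# not in a notebook.
cases_deaths_df = spark.read.csv(
    "abfss://raw@covidreportingdl.dfs.core.windows.net/ecdc/cases_deaths.csv",
    header=True, inferSchema=True)
country_lookup_df = spark.read.csv(
    "abfss://lookup@covidreportingdl.dfs.core.windows.net/dim_country.csv",
    header=True, inferSchema=True)

# Keep only European records (the EU/UK scope of the project), then enrich each
# record with the country codes and population figures from the lookup file.
enriched_df = (cases_deaths_df
    .filter(F.col("continent") == "Europe")
    .join(country_lookup_df, on="country", how="left"))
```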

HDInsight was employed for transforming the “Testing” dataset through the use of a Hadoop cluster and Hive script. This technology was selected to demonstrate ADF's orchestration capabilities rather than to explore HDInsight's functionality in depth. Similarly, Databricks was utilized to transform the Population data ingested from the EuroStat site. Like HDInsight, Databricks could serve as a comprehensive solution for this project, underscoring its versatility in data transformation tasks.
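
A rough sketch of the Databricks side of that population transformation, assuming the public EuroStat export format in which the first column packs several dimensions into a single comma-separated field; the path and field names are assumptions rather than the exact ones from the course.

```python
from pyspark.sql import functions as F

# Raw EuroStat export, tab-separated (placeholder path).
raw_population_df = spark.read.csv(
    "abfss://raw@covidreportingdl.dfs.core.windows.net/population/population_by_age.tsv",
    sep="\t", header=True)

# The first column packs the age-group indicator and country code together,
# e.g. "PC_Y0_14,BE"; split it into separate fields and drop the packed column.
packed_col = raw_population_df.columns[0]
split_col = F.split(raw_population_df[packed_col], ",")

population_df = (raw_population_df
    .withColumn("age_group", split_col.getItem(0))
    .withColumn("country_code", split_col.getItem(1))
    .drop(packed_col))
```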

After completing the transformation process, the presentation datasets were loaded into an Azure SQL Database, making them accessible for further analysis with Synapse Analytics and supporting the Power BI reporting components of the project. Additionally, a Gen2 Data Lake was established, and the presentation datasets were loaded into it to fulfil the Data Science requirements of the project.
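
The load into Azure SQL is orchestrated through ADF, but to give a sense of what such a load involves from a notebook, here is a minimal Spark JDBC write of one presentation dataset; the server, database, table, secret scope, and DataFrame names are all placeholders.

```python
# Placeholder connection details; credentials come from a Databricks secret scope
# (for example one backed by Azure Key Vault) rather than plain text.
jdbc_url = (
    "jdbc:sqlserver://covid-reporting-srv.database.windows.net:1433;"
    "database=covid_reporting;encrypt=true"
)

(cases_and_deaths_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.cases_and_deaths")
    .option("user", dbutils.secrets.get("covid-scope", "sql-username"))
    .option("password", dbutils.secrets.get("covid-scope", "sql-password"))
    .mode("overwrite")
    .save())
```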

ADF's triggers and monitoring capabilities were deployed to automate and ensure the health of the pipelines, as well as to investigate issues as they arise.

CI/CD Architecture

Finally, a detailed exploration into an advanced Continuous Integration/Continuous Deployment (CI/CD) solution was conducted. This initiative aimed to emulate scenarios that necessitate a robust and production-ready environment for development teams. Following Microsoft's latest CI/CD practices, and leveraging both Azure Data Factory and Azure DevOps, this solution encompasses Git integration, branch management, approval workflows, a build pipeline, and spans three key environments—Development, Testing, and Production—among other functionalities.

Project Conclusions and Links

The resulting Power BI report comprises three pages:

  1. Trends: This page delves into trends concerning hospital and ICU admissions, as well as cases and deaths.

  2. UK, France & Germany Trends: Specifically focuses on the trends in cases and deaths within these three countries.

  3. Testing: Concentrates on testing data, including country-specific testing information and comparisons of testing rates versus new confirmed cases.

Trends - Page (1)

UK, France & Germany Trends - Page (2)

Testing - Page (3)
