Developing a Data Platform for F1 Motorsport Historical Data Analysis and Reporting

An Azure Databricks-Focused Data Engineering Solution.

Another dispatch from the stay-at-home dad turned aspiring data engineer. F1 Motorsport is awash in datasets. Like most sports, statistics provide the context that teams, drivers, and fans rely on to make sense of what we are seeing and experiencing. Understanding why some teams and drivers consistently outperform others can only be achieved by transforming data points generated at race events into actionable insights. While everyone wants to know who won the race, some of us are keen to understand why.

Insights are drawn from analysis, which requires information. Information stems from data, and data necessitates a collection process. When done correctly, these processes appear seamless to the end-user, and that's the point. Most people driving across a bridge (structural engineers aside) don't think about how the concrete and steel structure supports the cars and trains atop it. Data engineers, by contrast, ponder questions like, "How much latency is acceptable when collecting car telemetry data?" or "What type of data warehousing solutions are teams deploying for real-time analysis?" while watching this year's Monza race. To most, these might be mundane questions, but to those like myself who appreciate the staggering amount of data generated at just one race, the technologies involved are fascinating.

Enough intro; here is the Solution Architecture and platform I developed:

Project Overview and Links

The F1 Motorsport Data Platform detailed in this report was developed as part of a comprehensive course on building complete data engineering solutions using Azure services. The platform leverages Azure Databricks as the primary service for data processing, enabling efficient extraction, transformation, and delivery of data for advanced analytics and machine learning. Azure Data Factory manages data ingestion and orchestrates the data pipelines, ensuring seamless and reliable data flow.

The storage architecture incorporates both Lakehouse and Delta Lake paradigms, offering scalable, high-performance data storage with ACID transaction support, enabling real-time analytics and ensuring data consistency. Data integrity and governance are meticulously managed through Unity Catalog, providing a unified governance solution that simplifies data discovery and policy management.
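To make the Delta Lake piece concrete, here is a minimal PySpark sketch of how a transformed dataset might land in the presentation layer as a Delta table. The storage path, schema, and table names are illustrative assumptions, not the project's actual values.

```python
# Minimal sketch: persisting a transformed dataset as a managed Delta table.
# The storage path, schema, and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created in a Databricks notebook

race_results_df = spark.read.parquet(
    "abfss://processed@formula1dl.dfs.core.windows.net/race_results"  # hypothetical path
)

# Delta writes are transactional (ACID): concurrent readers see either the
# previous or the new version of the table, never a half-written state.
(race_results_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("race_year")
    .saveAsTable("f1_presentation.race_results"))  # hypothetical schema.table
```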

For end-user analysis and reporting, the platform integrates Databricks' native analytics capabilities with the robust reporting features of Power BI, offering a comprehensive solution for data visualization and business intelligence. Additionally, the platform supports a dedicated storage environment optimized for machine learning, facilitating the development, training, and deployment of machine learning models.

Overall, this platform exemplifies a robust, scalable, and secure data engineering solution, tailored to meet the complex needs of modern data-driven projects.

A quick note to employers viewing this project: I created the following project solution documentation myself, using Adobe InDesign, Microsoft Power BI, and Azure Databricks. These documents highlight my ability to communicate both technical and analytical solutions to data engineering challenges effectively. They demonstrate my proficiency with industry-standard tools and my capability to present complex data insights in a clear and visually appealing manner.

Project Requirements

The now-shuttered Ergast site provided F1 Motorsport race results, team, driver, and circuit records from the sport's inception in 1950 onward, offering API, HTTP, and CSV database access until its closure at the end of 2024. For this project, a more tailored view of these records was created, focusing on driver and team performance to showcase the most dominant competitors throughout the sport's history. Additionally, datasets comparing driver qualifying versus final race positions and circuit performance were developed to offer a comprehensive view of F1 Motorsport's key components.

The primary focus of this project was to implement a data engineering solution within the Microsoft Azure Databricks service and the broader Azure platform, showcasing its diverse data engineering capabilities. Although the machine learning aspects were intentionally left unimplemented, the platform's capacity to support ML modeling is complete and functional.

Integrating Delta Lake architecture with the data governance capabilities of Unity Catalog in the Azure Databricks environment offers enhanced control and security over data assets. Delta Lake ensures data reliability and performance with ACID transactions and scalable metadata handling, while Unity Catalog provides a unified governance solution for managing data access policies, lineage, and auditing across the Azure platform. This project was developed to enhance my understanding of the Databricks environment and the real-world application of regulatory compliance, which proved to be an invaluable aspect of the instruction.
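As a rough illustration of the kind of access-policy management Unity Catalog enables, the sketch below grants an analyst group read access to one presentation table from a notebook. The catalog, schema, table, and group names are all hypothetical.

```python
# Sketch: Unity Catalog access control, run from a Databricks notebook.
# Catalog, schema, table, and group names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The group needs USE rights on the catalog and schema before it can
# read any object inside them.
spark.sql("GRANT USE CATALOG ON CATALOG f1_catalog TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA f1_catalog.presentation TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE f1_catalog.presentation.race_results TO `data-analysts`")
```

Because grants like these live in the metastore rather than in any single workspace, the same policy applies wherever the table is queried, and access is auditable.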

Raw Datasets (Ergast.com)

To begin, HTTP connectors managed by Data Factory were used to ingest the datasets. A Data Lake was provisioned to store the ingested data before transformations were conducted in Databricks. Users of Databricks will recognize the notebook environment's high customizability, seamless integration with the Azure ecosystem, and scalable, cost-effective compute options for both teams and individual developers. Specifically, Databricks offers two types of compute: "All-Purpose Compute" for interactive workloads such as ad-hoc analysis and development, and "Job Compute" for automated, scheduled tasks and production workloads. This flexibility ensures that various computational needs are efficiently met within the platform.
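To give a flavor of what the ingestion notebooks look like, here is a minimal sketch that reads one raw Ergast file with an explicit schema and lands it in the processed layer. The file path, column subset, and container names are assumptions rather than the project's exact code.

```python
# Sketch: ingesting one raw Ergast CSV with an explicit schema.
# Paths and columns are illustrative; the real Ergast tables are wider.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
])

circuits_df = (spark.read
    .option("header", True)
    .schema(circuits_schema)  # an explicit schema avoids a costly inference pass
    .csv("abfss://raw@formula1dl.dfs.core.windows.net/circuits.csv"))  # hypothetical path

# Land the data in the processed layer as Delta for downstream transformation.
circuits_df.write.format("delta").mode("overwrite").save(
    "abfss://processed@formula1dl.dfs.core.windows.net/circuits")
```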

Connecting or mounting storage within Databricks is straightforward, secure, and bolstered by Azure ecosystem integration. Ingestion and transformation processes were explored using PySpark and SparkSQL within the Azure Databricks environment. These tools facilitated efficient data manipulation and querying, leveraging both Python- and SQL-based syntaxes for comprehensive data engineering tasks. Additionally, the notebook environment in Azure Databricks allows users to manipulate data in pure SQL when necessary, providing flexibility and ease of use for analysts who prefer SQL for querying and transformation.
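As a sketch of the mounting approach, assuming a service principal whose credentials are stored in a Databricks secret scope (the scope, key, storage account, and container names below are placeholders):

```python
# Sketch: mounting an ADLS Gen2 container with a service principal.
# Secret scope, key names, storage account, and container are placeholders.
# `dbutils` is provided implicitly by the Databricks notebook runtime.

client_id = dbutils.secrets.get(scope="f1-scope", key="client-id")
tenant_id = dbutils.secrets.get(scope="f1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="f1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Make the raw container available under a workspace-local mount point.
dbutils.fs.mount(
    source="abfss://raw@formula1dl.dfs.core.windows.net/",
    mount_point="/mnt/formula1dl/raw",
    extra_configs=configs,
)
```

Keeping the service principal's secrets in a secret scope (backed by Azure Key Vault) means no credentials ever appear in notebook code or version control.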

Azure Databricks - Solution Architecture

I developed the F1 Data Platform as part of the “Azure Databricks and Spark for Data Engineers” course created by the esteemed Ramesh Retnasamy, one of a series of three courses focused on data engineering on the Azure platform. I highly recommend his courses. The solution architecture evolved through several stages, emphasizing various aspects, techniques, strategies, and technologies to achieve the final solution.

As detailed in the above infographic, the solution builds upon the Lakehouse architecture, adding the open-source Delta Lake storage layer to both the ingestion and presentation layers and incorporating Unity Catalog. This integration unifies governance and metadata management of Delta Lake tables, providing fine-grained access control, audit logging, and data lineage tracking. Although this level of control was not strictly necessary for the project, it was included to broaden the learner's skill set.

A critical component of this architecture is the metastore, which serves as a centralized repository for all metadata related to the data stored in the platform. The metastore supports Unity Catalog by maintaining accurate and comprehensive metadata, enabling robust governance and management. It ensures ACID transactions and schema enforcement in Delta Lake, thereby enhancing data integrity and consistency. Furthermore, the metastore facilitates efficient data discovery and query performance, as it provides a catalog of datasets and schema information that query engines can quickly access. This centralized metadata management is essential for achieving the seamless integration and high level of control demonstrated in the F1 Data Platform.
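To illustrate the schema enforcement mentioned above: if an incoming DataFrame's columns disagree with the schema registered in the metastore, Delta rejects the write rather than silently corrupting the table. A minimal sketch, with hypothetical table and column names:

```python
# Sketch: Delta Lake schema enforcement. Names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

# A row whose columns do not match the registered race_results schema.
bad_rows = spark.createDataFrame(
    [(1, "Monza", "not-a-lap-time")],
    ["race_id", "circuit_name", "fastest_lap_text"],  # wrong column name/type
)

try:
    bad_rows.write.format("delta").mode("append").saveAsTable(
        "f1_processed.race_results")
except AnalysisException as err:
    # Delta compares the incoming schema against the metastore's and fails fast.
    print(f"Write rejected by schema enforcement: {err}")
```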

Transformed Datasets (Delta Lake)

The resulting presentation-layer datasets enable a compelling analysis of dominant teams and drivers. Initial analysis results were developed directly within Databricks and presented through a notebook dashboard and shareable URL. The ability to create high-quality visuals within Databricks, without needing a more dynamic service like Power BI, underscores its strength as an integrated engineering solution. I also developed two additional datasets: one for F1 Racecourse Speed Analysis (2012-2020) and another comparing driver qualifying positions with final race positions for the 2020 season. The scope of the analysis was deliberately limited to keep my compute costs down.
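As a rough sketch of the kind of aggregation behind the dominant-drivers dataset (the table and column names here are assumptions on my part):

```python
# Sketch: deriving a dominant-drivers ranking from presentation-layer results.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

race_results = spark.table("f1_presentation.race_results")

driver_standings = (race_results
    .groupBy("driver_name")
    .agg(
        F.sum("points").alias("total_points"),
        F.count(F.when(F.col("position") == 1, True)).alias("wins"),
        F.countDistinct("race_id").alias("races"),
    )
    .withColumn("avg_points", F.round(F.col("total_points") / F.col("races"), 2)))

# Ranking by average points keeps short but dominant careers visible.
top_drivers = (driver_standings
    .withColumn("driver_rank", F.rank().over(Window.orderBy(F.desc("avg_points"))))
    .filter("driver_rank <= 10"))

top_drivers.show(10)  # in a Databricks notebook, display(top_drivers) adds charting
```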

Reporting and Analysis Datasets

The final deliverable, a Power BI report complete with navigation and four informative report pages, completes the platform's end-to-end objectives. Power BI can connect directly to Databricks tables or schemas from the Databricks UI. For the purposes of this project documentation, snapshot versions of the Power BI report pages are included below. They include:

  1. Dominant Teams (Top 5 teams, 1950 - 2020)

  2. Dominant Drivers (Top 10 drivers, 1950 - 2020)

  3. 2020 Driver Qualifying Positions vs Final Race Positions

  4. Average Circuit Lap Times (2012 - 2020)

5 Most Dominant Teams (1950 - 2020)

10 Most Dominant Drivers (1950 - 2020)

2020 Driver Qualifying Position vs Final Position

Average Circuit Lap Times (Circuit Gilles Villeneuve - Selected)
