Designing a Data Platform for Covid-19 Analysis and Prediction in the EU and UK

An Azure Data Factory-Focused Data Engineering Solution

What should a stay-at-home dad do when aiming to break into the well-established field of data engineering? The answer: a Udemy course, and not just any course, but a substantial one. Developed by the renowned Ramesh Retnasamy as part of his Data Engineering Series, ‘Azure Data Factory For Data Engineers’ offers a comprehensive exploration of Azure Data Factory. It teaches students through the lens of a real-world project: creating a data platform to report on and analyse the EU and UK's responses to the Covid-19 pandemic.

If you're even remotely curious about the Azure Platform and its data engineering capabilities, I highly recommend exploring the course through the provided link.

As part of the project, I've included documents detailing the project scope, datasets, employed technologies, and the developed solution architecture. Additionally, I will share a series of slides showcasing the Power BI report created for the course, which I have updated to align with my style guidelines.

If you are an employer or potential customer interested in viewing the actual report, please feel free to contact me directly, and I will be happy to make it available to you.

The documents displayed here (the blue and white ones) were personally created by me to better highlight my interest in the project and my approach to its presentation. While I'm not a graphic designer, I actively use my Adobe subscription, which includes InDesign, to enhance my work.

Project Overview

Back to the project. It might seem a risky assumption that anyone beyond a select few with a special interest in data would spend more than a few hours on the Covid-19 pandemic. For someone like me, who does have that interest, a contemporary pandemic is an ideal scenario for applying the skills needed in data analysis roles, on the way to a position in data engineering.

The goal of the data platform is to support three key scenarios: Reporting, Analysis, and Data Science. It was built with a focus on Azure Data Factory (ADF), using its capabilities to integrate and orchestrate the data needed to meet these needs. In practice, only the Reporting and Analysis components were fully realised. The Data Science aspect, particularly its potential for developing Machine Learning models, was beyond the course's scope. At the time of writing, I have no plans to explore the Data Science potential, as I have focused on building my engineering knowledge.

Solution Requirements

I did my best to create a production-like summary of the essential operational components needed for the project. As I had never needed to develop such documentation before, the result is likely more high-level than a real production document would be. I replicated the data engineering solution architecture and the Continuous Integration/Continuous Deployment (CI/CD) architecture, following the course's specifics. I opted against using Microsoft's icons, primarily because they didn't fit the document's overall style.

The developed solution architecture is of production quality, albeit with a few modifications. It's improbable that all proposed transformation and analysis technologies—HDInsight, Databricks, ADF Data Flows, and Synapse Analytics—would be used concurrently due to overlapping functionalities, added complexity, and significant costs in both time and money. However, this doesn't rule out the possibility of their combined use in certain scenarios.

From a storage standpoint, the solution is entirely feasible, though it may appear somewhat constrained given the vast array of connection types available in ADF and the potential diversity of data sources.

This project's data sources comprised the European Centre for Disease Prevention and Control's Covid-19 datasets and Eurostat's Population by Age dataset. The population data was ingested from Azure Blob Storage, while the Covid-19 datasets (Cases and Deaths, Country Response, Hospitalisations, and Testing) were ingested via HTTP.
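To make the split between the two ingestion routes concrete, here is a minimal Python sketch of an ingestion plan. It is only an illustration, not how the course or ADF implements it: the `IngestionTask` class, the `build_ingestion_plan` helper, and the `raw/...` folder layout are all hypothetical names chosen for this example. Only the dataset names and their source types come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionTask:
    """One copy job: where a dataset comes from and where it should land."""
    dataset: str
    source_type: str  # "blob" or "http"
    sink_path: str    # hypothetical folder in the raw zone of the data lake


# Dataset names and source types as described in the text above.
DATASETS = [
    ("Population by Age", "blob"),
    ("Cases and Deaths", "http"),
    ("Country Response", "http"),
    ("Hospitalisations", "http"),
    ("Testing", "http"),
]


def build_ingestion_plan(datasets, raw_root="raw"):
    """Turn (name, source_type) pairs into ingestion tasks.

    The folder naming scheme is an assumption made for this sketch.
    """
    plan = []
    for name, source_type in datasets:
        folder = name.lower().replace(" ", "_")
        plan.append(IngestionTask(name, source_type, f"{raw_root}/{folder}"))
    return plan


if __name__ == "__main__":
    for task in build_ingestion_plan(DATASETS):
        print(f"{task.source_type:>4}  {task.dataset} -> {task.sink_path}")
```

Keeping the list of datasets as data rather than hard-coding one copy step per file means a new source can be added by appending a single entry.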

Solution Architecture

The raw ingested data was loaded into Azure Data Lake Storage Gen2, with Azure Data Factory (ADF) orchestrating the ingestion and integration process. ADF also served as the platform for orchestrating the transformation process. Using ADF Data Flows (a transformation technology with a visual interface, designed for simple to medium-complexity transformations), the "Cases and Deaths" dataset was enriched with country lookup information, so that the datasets could be linked together, and with UK-specific data. ADF Data Flows also split the Hospital Admissions data into daily and weekly groupings, using a Dim Date lookup table to do so.
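The two Data Flow ideas described above, a lookup to attach country information and a date dimension to roll daily records up into weeks, can be sketched in plain Python. This is a rough illustration under my own assumptions rather than the actual Data Flow logic: the column names (`country`, `date`, `admissions`, `country_code`), the ISO week key format, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_dim_date(start, days):
    """Build a tiny date dimension: each day mapped to an ISO year-week key."""
    dim = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        iso_year, iso_week, _ = day.isocalendar()
        dim[day] = f"{iso_year}-W{iso_week:02d}"
    return dim


def enrich_with_country(rows, country_lookup):
    """Attach a country code to each row, like a left outer lookup.

    Rows without a match are kept, with country_code set to None.
    """
    enriched = []
    for row in rows:
        match = country_lookup.get(row["country"], {})
        enriched.append({**row, "country_code": match.get("country_code")})
    return enriched


def weekly_totals(rows, dim_date):
    """Sum daily admissions per (country, week) using the date dimension."""
    totals = defaultdict(int)
    for row in rows:
        week = dim_date[row["date"]]
        totals[(row["country"], week)] += row["admissions"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample rows, purely for demonstration.
    lookup = {"France": {"country_code": "FRA"}}
    daily = [
        {"country": "France", "date": date(2021, 1, 4), "admissions": 10},
        {"country": "France", "date": date(2021, 1, 5), "admissions": 5},
        {"country": "France", "date": date(2021, 1, 11), "admissions": 7},
    ]
    dim = build_dim_date(date(2021, 1, 4), 14)
    print(enrich_with_country(daily, lookup))
    print(weekly_totals(daily, dim))
```

Routing the week assignment through a date dimension, rather than computing it inline, keeps the definition of a "week" in one place that every dataset can share.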

HDInsight was used to transform the "Testing" dataset, using a Hadoop cluster and a Hive script. This technology was selected to demonstrate ADF's orchestration capabilities rather than to explore HDInsight's functionality in depth. Similarly, Databricks was used to transform the Population data ingested from the Eurostat site. Like HDInsight, Databricks could serve as a comprehensive solution for this project, underscoring its versatility in data transformation tasks.

After completing the transformation process, the presentation datasets were loaded into an Azure SQL Database, making them accessible for further analysis with Synapse Analytics and supporting the Power BI reporting components of the project. Additionally, a Gen2 Data Lake was established, and the presentation datasets were loaded into it to fulfil the Data Science requirements of the project.

ADF's triggers and monitoring capabilities were used to automate the pipelines, keep track of their health, and investigate issues as they arose.

CI/CD Architecture

Finally, I explored an advanced Continuous Integration/Continuous Deployment (CI/CD) solution in detail. The aim was to emulate scenarios that call for a robust, production-ready environment for development teams. Following Microsoft's latest CI/CD practices, and using both Azure Data Factory and Azure DevOps, the solution includes Git integration, branch management, approval workflows, and a build pipeline, among other functionality, and spans three key environments: Development, Testing, and Production.
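The promotion flow across the three environments can be illustrated with a short Python sketch. It is a simplified model under my own assumptions, not Azure DevOps configuration: the function names, the parameter names (`factoryName`, `sqlServer`), and the example values are hypothetical. It shows two ideas only: each environment overrides a few deployment parameters, and a build cannot move to the next environment without an approver.

```python
ENVIRONMENTS = ["Development", "Testing", "Production"]


def next_environment(current):
    """Return the environment that follows `current`, or None at the end."""
    index = ENVIRONMENTS.index(current)
    if index + 1 < len(ENVIRONMENTS):
        return ENVIRONMENTS[index + 1]
    return None


def resolve_parameters(base, overrides, environment):
    """Merge shared parameters with the overrides for one environment."""
    return {**base, **overrides.get(environment, {})}


def promote(build, approved_by):
    """Move a build to the next environment, but only with an approval."""
    target = next_environment(build["environment"])
    if target is None:
        raise ValueError("Build is already in the final environment")
    if not approved_by:
        raise PermissionError(f"Promotion to {target} requires an approver")
    return {**build, "environment": target, "approved_by": approved_by}


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    base = {"factoryName": "adf-covid", "sqlServer": "sql-covid"}
    overrides = {
        "Testing": {"factoryName": "adf-covid-test"},
        "Production": {"factoryName": "adf-covid-prod"},
    }
    build = {"id": 1, "environment": "Development"}
    build = promote(build, approved_by="reviewer")
    print(build["environment"])
    print(resolve_parameters(base, overrides, build["environment"]))
```

Treating the approval as a hard precondition in `promote` mirrors the intent of an approval workflow: nothing reaches Testing or Production by accident.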

Project Conclusions and Links

The resulting Power BI report comprises three pages:

  1. Trends: Covers trends in hospital and ICU admissions, as well as cases and deaths.

  2. UK, France & Germany Trends: Focuses on trends in cases and deaths within these three countries.

  3. Testing: Covers testing data, including country-specific testing information and comparisons of testing rates against new confirmed cases.

Trends - Page (1)

UK, France & Germany Trends - Page (2)

Testing - Page (3)

