Video Link:
https://github.com/sourav03561/covid-19-data-analysis/assets/46227372/dc31344c-cc7e-43f9-bb86-4daee9b46846
This project focuses on analyzing COVID-19 data from multiple regions: USA, Europe, and Asia. The data is collected, processed, and visualized using a combination of REST APIs, relational data sources, and data processing tools.
Team Members:
- Sourav Sarkar
- Pranshu Gautam
Data Sources (extraction sketched below):
- USA: REST API (JSON)
- Europe: ECDC API (JSON)
- Asia: MySQL (CSV sourced from Kaggle)
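As a rough illustration of how the two JSON sources might be pulled, here is a minimal fetch sketch; the endpoint URLs are placeholders, not the project's actual APIs:

```python
import json

import requests

# Placeholder endpoints -- substitute the real USA and ECDC API URLs.
SOURCES = {
    "usa.json": "https://example.com/covid-19/usa",
    "europe.json": "https://example.com/covid-19/europe",
}

for filename, url in SOURCES.items():
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    with open(filename, "w") as f:
        json.dump(response.json(), f)
```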
Project Structure:
- `dags/`: Directory for all DAGs (Directed Acyclic Graphs).
  - `Collect_Data/`: Subdirectory for data extraction, processing, and indexing scripts.
    - `covid_19_usa.py`: Extracts data from the USA REST API and saves it as `usa.json`.
    - `covid_19_europe.py`: Extracts data from the Europe (ECDC) REST API and saves it as `europe.json`.
    - `data_processing.py`: Extracts data from the MySQL database and saves it as `asia.json`, then combines `usa.json`, `europe.json`, and `asia.json` into a single `world.parquet` file using Apache Spark.
    - `elastic_index.py`: Indexes the `world.parquet` file in Elasticsearch.
  - `my_first_dag.py`: Airflow DAG that automates the entire data collection, processing, and indexing workflow.
Workflow:
- USA Data Extraction: Uses `covid_19_usa.py` to fetch data from the REST API.
- Europe Data Extraction: Uses `covid_19_europe.py` to fetch data from the ECDC API.
- Asia Data Extraction: Uses `data_processing.py` to fetch data from the MySQL database.
- Data Aggregation: Combines data from all regions into `world.parquet` (see the sketch after this list).
- Indexing in Elasticsearch: Uses `elastic_index.py` to index the data in Elasticsearch (sketched below).
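A minimal sketch of the Asia extraction and aggregation steps in `data_processing.py`, assuming a local MySQL database named `covid` with a `covid_asia` table (the connection string, database, and table names are assumptions, not taken from the repo):

```python
import pandas as pd
from pyspark.sql import SparkSession
from sqlalchemy import create_engine

# Assumed MySQL connection details -- adjust to the actual database.
engine = create_engine("mysql+pymysql://user:password@localhost/covid")
asia = pd.read_sql("SELECT * FROM covid_asia", engine)
asia.to_json("asia.json", orient="records", lines=True)  # JSON Lines, which Spark reads natively

spark = SparkSession.builder.appName("covid_aggregation").getOrCreate()

# The API downloads may be JSON arrays rather than JSON Lines, hence multiLine.
usa = spark.read.option("multiLine", True).json("usa.json")
europe = spark.read.option("multiLine", True).json("europe.json")
asia_df = spark.read.json("asia.json")

# unionByName aligns columns by name; allowMissingColumns fills absent ones with null.
world = (
    usa.unionByName(europe, allowMissingColumns=True)
       .unionByName(asia_df, allowMissingColumns=True)
)
world.write.mode("overwrite").parquet("world.parquet")
```

Using `unionByName` keeps the merge robust when the three regional feeds do not share an identical schema.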
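Likewise, a sketch of the indexing step in `elastic_index.py`, assuming a local Elasticsearch cluster and an index named `covid_world` (both assumptions):

```python
from elasticsearch import Elasticsearch, helpers
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("covid_indexing").getOrCreate()
world = spark.read.parquet("world.parquet")

es = Elasticsearch("http://localhost:9200")

# Stream every row into Elasticsearch in bulk; each Spark Row becomes one document.
actions = (
    {"_index": "covid_world", "_source": row.asDict()}
    for row in world.toLocalIterator()
)
helpers.bulk(es, actions)
```

The elasticsearch-hadoop Spark connector is another common choice for this step; the bulk helper above simply avoids the extra dependency.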
Visualizations and KPIs are created using Kibana.
The workflow is automated using `my_first_dag.py`.
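A minimal sketch of what `my_first_dag.py` might look like, chaining the four scripts with `BashOperator` tasks (the schedule, script paths, and task names are assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

SCRIPTS = "/opt/airflow/dags/Collect_Data"  # assumed location of the scripts

with DAG(
    dag_id="covid_19_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_usa = BashOperator(
        task_id="extract_usa",
        bash_command=f"python {SCRIPTS}/covid_19_usa.py",
    )
    extract_europe = BashOperator(
        task_id="extract_europe",
        bash_command=f"python {SCRIPTS}/covid_19_europe.py",
    )
    aggregate = BashOperator(
        task_id="process_and_aggregate",
        bash_command=f"python {SCRIPTS}/data_processing.py",
    )
    index = BashOperator(
        task_id="index_in_elasticsearch",
        bash_command=f"python {SCRIPTS}/elastic_index.py",
    )

    # USA and Europe extractions run in parallel; aggregation waits for both.
    [extract_usa, extract_europe] >> aggregate >> index
```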
This project demonstrates a comprehensive approach to collecting, processing, and visualizing COVID-19 data from multiple regions using modern data engineering tools and techniques.