SeaBebop/TekkenSubreddit-ETL-Pipeline
Tekken Subreddit ETL Pipeline

Hello! This project demonstrates the process of extracting data from the Tekken subreddit, transforming it, and visualizing the results. The data is collected using the PRAW library, stored in AWS S3, processed using AWS Glue, and visualized using Looker Studio.

Project Workflow

  • Data Extraction:
    • Used PRAW (Python Reddit API Wrapper) to extract data from the Tekken subreddit. Extracted fields include Title, Score, Upvote_Ratio, Number_of_Comments, Url, Author, Flair, and Created_UTC.
    • Saved the extracted data as a CSV file.
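
The extraction step could look roughly like the sketch below. It is not the repository's actual script: the function names, the output path, the placeholder credentials, and the choice of the `new` listing with a limit of 500 are all assumptions made for illustration. Only the column names come from this README.

```python
import csv
from typing import Iterable

# Column names as listed in this README.
FIELDS = ["Title", "Score", "Upvote_Ratio", "Number_of_Comments",
          "Url", "Author", "Flair", "Created_UTC"]


def submission_to_row(post) -> dict:
    """Map one PRAW Submission to the README's column names."""
    return {
        "Title": post.title,
        "Score": post.score,
        "Upvote_Ratio": post.upvote_ratio,
        "Number_of_Comments": post.num_comments,
        "Url": post.url,
        # author is None when the account has been deleted
        "Author": str(post.author) if post.author is not None else "",
        "Flair": post.link_flair_text or "",
        "Created_UTC": post.created_utc,
    }


def write_posts_csv(posts: Iterable, path: str) -> int:
    """Write the posts to a CSV file and return the number of rows written."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for post in posts:
            writer.writerow(submission_to_row(post))
            count += 1
    return count


def fetch_posts(limit: int = 500):
    """Return an iterator over r/Tekken submissions (hypothetical helper)."""
    import praw  # imported lazily so the helpers above work without PRAW

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="tekken-etl-sketch/0.1",  # assumed user agent string
    )
    # Assumption: the README does not say which listing was used.
    return reddit.subreddit("Tekken").new(limit=limit)
```

With valid Reddit API credentials filled in, `write_posts_csv(fetch_posts(), "tekken_posts.csv")` would produce the CSV that the next step uploads.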

  • Data Storage:

    • Uploaded the CSV file to an S3 bucket on AWS for storage.
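
The upload can be done from the AWS console or from code. A minimal sketch using boto3 follows, assuming AWS credentials are already configured; the bucket name, the `raw/` prefix, and the `upload_csv` helper are hypothetical and not taken from the project.

```python
from pathlib import Path


def upload_csv(local_path: str, bucket: str, prefix: str = "raw/", s3_client=None) -> str:
    """Upload a local CSV under the given prefix and return its S3 URI."""
    if s3_client is None:
        import boto3  # resolved lazily; uses the default AWS credential chain
        s3_client = boto3.client("s3")

    name = Path(local_path).name
    prefix = prefix.strip("/")
    key = f"{prefix}/{name}" if prefix else name

    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

Keeping the raw file under its own prefix makes it easy to point a crawler at just that folder.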
  • Data Cataloging:

    • Created a Glue Crawler to automatically catalog the CSV file stored in S3.
    • The Glue Crawler generated a table in the AWS Glue database based on the CSV data.
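
The README does not say whether the crawler was created in the console or programmatically. For reference, an equivalent setup through boto3 might look like the sketch below; the crawler name, IAM role ARN, database name, and S3 path are all placeholders, and the helper function is hypothetical.

```python
def create_and_start_crawler(name: str, role_arn: str, database: str,
                             s3_path: str, glue_client=None) -> None:
    """Create a Glue crawler over an S3 path and start it once."""
    if glue_client is None:
        import boto3  # resolved lazily; uses the default AWS credential chain
        glue_client = boto3.client("glue")

    # The crawler infers a schema from the CSV and registers a table
    # in the given Glue Data Catalog database.
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=name)
```

A call such as `create_and_start_crawler("tekken-crawler", "arn:aws:iam::123456789012:role/GlueRole", "tekken_db", "s3://my-bucket/raw/")` uses made-up values throughout.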
  • Data Transformation (ETL):

    • Used AWS Glue's visual editor to perform the following ETL operations:
      • Remove Nulls: Identified and removed columns containing only null values.
      • Timestamp Transformation: Converted the UNIX timestamp (Created_UTC) to a human-readable date format.
      • SQL Query: Executed a SQL query to sort the data by the Year column.
    • The transformed data was then written back to an S3 bucket as a CSV file.
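
The job itself was built in Glue's visual editor, so there is no hand-written script to show. The plain-Python sketch below only approximates the same three operations locally, for readers who want to see what they do to the rows. It assumes the Year column is derived from Created_UTC, which the README implies but does not state; the `Created_Date` column name and the date format are also assumptions.

```python
from datetime import datetime, timezone


def drop_all_null_columns(rows):
    """Remove columns whose value is empty or None in every row."""
    if not rows:
        return rows
    keep = [col for col in rows[0]
            if any(r.get(col) not in (None, "") for r in rows)]
    return [{col: r.get(col) for col in keep} for r in rows]


def add_readable_date(rows, ts_col="Created_UTC"):
    """Add Created_Date (UTC) and Year, both derived from the UNIX timestamp."""
    out = []
    for r in rows:
        dt = datetime.fromtimestamp(float(r[ts_col]), tz=timezone.utc)
        out.append({**r,
                    "Created_Date": dt.strftime("%Y-%m-%d %H:%M:%S"),
                    "Year": dt.year})
    return out


def transform(rows):
    """Apply the three steps in order: drop nulls, convert dates, sort by Year."""
    rows = add_readable_date(drop_all_null_columns(rows))
    return sorted(rows, key=lambda r: r["Year"])
```

The rows here are dictionaries such as those produced by `csv.DictReader`, so the output can be written back out with `csv.DictWriter`.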
  • Data Visualization:

    • Uploaded the final processed CSV file to Looker Studio.
    • Created three charts to visualize the data:
      • Trend of Score: Analyzed the trend of scores over time.
      • Metadata: Provided an overview of key metadata from the posts.
      • Most Scored by Flair: Visualized which flairs received the highest scores.
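
Looker Studio performs the chart aggregations itself, but the "Most Scored by Flair" chart can be sanity-checked against the CSV with a few lines of Python. The sketch below assumes the chart sums Score per Flair; the README does not say whether it uses a sum, an average, or a maximum. The `(no flair)` label is also an assumption.

```python
from collections import defaultdict


def total_score_by_flair(rows):
    """Return (flair, total_score) pairs, highest total first."""
    totals = defaultdict(int)
    for r in rows:
        flair = r.get("Flair") or "(no flair)"  # group unflaired posts together
        totals[flair] += int(r["Score"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```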

The Looker Studio dashboard can be accessed here: https://lookerstudio.google.com/reporting/c58f420a-e8e4-4bc8-b876-e6b487e9210d/page/qrBAE
