Used in support of this post: http://toddwschneider.com/posts/chicago-taxi-data/
Code to download, process, and analyze Chicago's publicly available taxi data.
Something of a companion to the nyc-taxi-data repo. The repos share some similar code and structure, but do not explicitly depend on each other.
1. Install PostgreSQL and PostGIS
Both are available via Homebrew on Mac OS X.
Note: the raw data is a single uncompressed ~40 GB .csv file, so it will take a while to download.
./download_raw_data.sh
./initialize_database.sh
./import_trip_data.sh
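The three scripts above are meant to be run in order, and the import presumably needs both the downloaded file and the initialized database. A minimal sketch of a wrapper that stops at the first failing step; `run_steps` is a hypothetical helper and not part of this repo:

```shell
# Hypothetical helper (not in the repo): run each command in order and
# stop as soon as one exits with a non-zero status.
run_steps() {
  for step in "$@"; do
    echo "Running $step"
    "$step" || { echo "Failed: $step" >&2; return 1; }
  done
}

# Usage, from the repo root:
# run_steps ./download_raw_data.sh ./initialize_database.sh ./import_trip_data.sh
```

Stopping early keeps a failed or partial download from being imported into the database.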
New data is available monthly. Once you've run the full setup, you can later download and process only the latest data by running
./update_trips_data.sh
This avoids re-downloading the entire 40 GB dataset every time a new month of data is released.
The prepare_analysis.sql and analysis.R scripts do the analysis in Postgres and R, respectively.
Differences from the NYC taxi data:
- Chicago includes an anonymous medallion id; New York does not
- Chicago does not include precise location coordinates, only census tracts and community areas (and even then, only sometimes)
- Chicago does not include precise timestamps; it instead rounds pickups and drop-offs to 15-minute intervals
- Chicago does not include any data from ridesharing companies like Uber and Lyft
- Chicago contains just over 100 million rows, making it significantly smaller than NYC's 1.3 billion rows
- Chicago requires significantly less preprocessing and has fewer unexplained data abnormalities than the NYC data
The last two points in particular suggest that the Chicago dataset is easier to work with than the NYC dataset.
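To illustrate the 15-minute timestamp rounding mentioned in the list above, here is a minimal sketch. It assumes rounding to the nearest interval (the list only says timestamps are rounded to 15-minute intervals), and `round_to_15min` is a hypothetical helper, not something from this repo:

```shell
# Hypothetical helper: round a Unix timestamp (in seconds) to the nearest
# 15-minute (900-second) boundary. Assumes a non-negative integer input;
# exact half-way values (450 seconds past a boundary) round up.
round_to_15min() {
  ts=$1
  echo $(( (ts + 450) / 900 * 900 ))
}
```

Under this assumption, pickups at 14:02 and 14:07 would both be recorded as 14:00, so durations computed from the pickup and drop-off timestamps are only accurate to the interval.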
Other data sources used:
- Chicago daily weather data from the NCDC
- Chicago community area and census tract shapefiles from the City of Chicago
- NYC yellow taxi monthly data from the NYC Taxi & Limousine Commission
- Cubs home schedules from Baseball Reference
Questions or issues? Email [email protected], or open a GitHub issue.