This project focuses on analyzing IoT data from environmental and traffic sensors to predict traffic patterns using machine learning. The system combines real-time data streaming with historical data analysis to provide insights into traffic behavior based on environmental conditions.
IoT-Data-Analysis/
├── data/ # Data files
│   ├── environ_MS83200MS_nowind_3m-10min.json
│   ├── traffic_raw_siemens_light-veh.json
│   └── traffic_raw_siemens_heavy-veh.json
├── logs/ # Log files
├── models/ # Trained models
├── outputs/ # Generated outputs
│   └── streaming/ # Real-time analysis results
├── src/ # Source code
│   ├── data_acquisition.py # Data loading and preprocessing
│   ├── data_processing.py # Data processing utilities
│   ├── data_analysis.py # Data analysis and visualization
│   ├── model_training.py # Model training and evaluation
│   └── mqtt_streaming.py # Real-time data streaming
└── requirements.txt # Project dependencies
- Source: MS83200MS sensor
- Format: JSON
- Variables:
- Temperature (°C): Range 1.54°C to 23.20°C
- Humidity (%): Range 32.01% to 97.32%
- Radiation (W/m²): Range 6.04 to 240.73
- Pressure (hPa): Range 999.27 to 1036.47
- Sunshine (minutes): Range 0.00 to 599.55
- Precipitation (mm): Range 0.00 to 0.23
- Source: Siemens sensors
- Format: JSON
- Categories:
- Light vehicles:
- Mean: 46.06 vehicles
- Range: 0 to 144 vehicles
- Zero values: 2303 instances
- Heavy vehicles:
- Mean: 3.60 vehicles
- Range: 0 to 12 vehicles
- Zero values: 2873 instances
- Metrics:
- Vehicle count per 10-minute interval
- Timestamp-based measurements
- 90 days of data (September to November 2018)
src/data_acquisition.py
- Environmental Data:
- Source: MS83200MS sensor
- Format: JSON files
- Variables: temperature, humidity, radiation, pressure, sunshine, precipitation
- Frequency: 10-minute intervals
- Traffic Data:
- Source: Siemens sensors
- Format: JSON files
- Categories: light vehicles, heavy vehicles
- Frequency: 10-minute intervals
- Real-time Data:
- MQTT streaming for live sensor data
- API endpoints for real-time updates
- WebSocket connections for continuous data flow
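As a rough illustration of how one live reading might be handled once it arrives over MQTT, the sketch below decodes and validates a single JSON payload. The payload layout, the field names (`timestamp`, `temperature`, `humidity`) and the function name `parse_sensor_message` are assumptions made for this example, not the project's actual message schema. A function like this would typically be called from the MQTT client's message callback.

```python
import json
from datetime import datetime
from typing import Optional

# Hypothetical set of required fields; the real sensor payload may differ.
REQUIRED_FIELDS = ("timestamp", "temperature", "humidity")


def parse_sensor_message(payload: bytes) -> Optional[dict]:
    """Decode one JSON payload; return None if it is malformed or incomplete."""
    try:
        record = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(record, dict):
        return None
    if any(field not in record for field in REQUIRED_FIELDS):
        return None
    try:
        return {
            "timestamp": datetime.fromisoformat(record["timestamp"]),
            "temperature": float(record["temperature"]),
            "humidity": float(record["humidity"]),
        }
    except (TypeError, ValueError):
        # Wrong types or unparseable values are treated as invalid readings.
        return None


good = parse_sensor_message(
    b'{"timestamp": "2018-09-01T00:10:00", "temperature": 14.2, "humidity": 81.5}'
)
bad = parse_sensor_message(b"not json")
```

Returning `None` for bad input keeps the streaming loop running when a sensor sends a corrupt message; the caller can log and skip it.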
src/data_processing.py
- Cleaning:
- Missing value handling (drop or forward fill)
- Outlier detection and removal
- Data type conversion
- Timestamp standardization
- Feature Engineering:
- Time-based features (hour, day_of_week, is_weekend)
- Environmental interactions (temp_humidity, radiation_pressure)
- Rolling statistics (1-hour windows)
- Lag features for temporal patterns
- Data Integration:
- Combining environmental and traffic data
- Time alignment and resampling
- Feature normalization and scaling
- Target variable creation
- Storage Formats:
- Processed data saved as CSV/JSON
- Model outputs in PNG/PDF formats
- Logs in text format
- Models in pickle format
- Directory Structure:
IoT-Data-Analysis/
├── data/ # Raw and processed data
│   ├── raw/ # Original sensor data
│   └── processed/ # Cleaned and transformed data
├── models/ # Trained models and metadata
├── outputs/ # Analysis results
│   ├── streaming/ # Real-time analysis
│   ├── training/ # Model training outputs
│   └── advanced_analysis/ # Detailed analysis results
└── logs/ # System and processing logs
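The cleaning and feature-engineering steps listed in this section can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the code in `src/data_processing.py`: the column names (`timestamp`, `temperature`, `humidity`), the helper names `clean_readings` and `add_features`, and the choice of a simple product for the `temp_humidity` interaction are all hypothetical. A 1-hour rolling window corresponds to 6 samples at the 10-minute sampling interval.

```python
import pandas as pd


def clean_readings(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize timestamps, sort by time, and forward-fill missing values."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out = out.sort_values("timestamp").set_index("timestamp")
    return out.ffill()


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add time-based, interaction, rolling, and lag features."""
    out = df.copy()
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Assumed definition of the interaction term: a plain product.
    out["temp_humidity"] = out["temperature"] * out["humidity"]
    # 6 samples of 10 minutes each span one hour.
    out["temperature_roll_1h"] = out["temperature"].rolling(6, min_periods=1).mean()
    out["temperature_lag_1"] = out["temperature"].shift(1)
    return out


raw = pd.DataFrame(
    {
        "timestamp": ["2018-09-01 00:00:00", "2018-09-01 00:10:00", "2018-09-01 00:20:00"],
        "temperature": [10.0, None, 13.0],
        "humidity": [80.0, 82.0, 84.0],
    }
)
features = add_features(clean_readings(raw))
```

Forward fill suits slowly changing sensor values; dropping rows instead would leave gaps in the 10-minute grid that the rolling and lag features rely on.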
src/model_training.py
- Feature Selection:
- Environmental variables
- Time-based features
- Interaction terms
- Rolling statistics
- Target Variable:
- Binary classification (high/low traffic)
- Dynamic threshold based on historical data
- Class balancing techniques
- Data Splitting:
- Train-test split (80-20)
- Time-based validation
- Cross-validation folds
- Model Types:
- Random Forest Classifier
- Neural Network (PyTorch)
- Logistic Regression
- Training Process:
- Hyperparameter tuning (GridSearchCV)
- Cross-validation
- Early stopping
- Model checkpointing
- Evaluation Metrics:
- Accuracy, Precision, Recall, F1
- ROC curves and AUC
- Confusion matrices
- Feature importance
- Model Saving:
- Serialized model files
- Model metadata
- Feature importance plots
- Performance metrics
- Real-time Prediction:
- MQTT integration
- API endpoints
- Batch prediction capabilities
- Model versioning
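To make the target-variable and data-splitting steps above concrete, here is a small sketch. It assumes the "dynamic threshold" is a quantile of the historical counts (the median by default) and that the 80-20 split is chronological, so the test set contains only the most recent observations. Both choices, and the helper names `make_target` and `chronological_split`, are illustrative assumptions rather than the behavior of `src/model_training.py`.

```python
import pandas as pd


def make_target(counts: pd.Series, quantile: float = 0.5) -> pd.Series:
    """Label an interval 1 (high traffic) when its count exceeds the quantile."""
    threshold = counts.quantile(quantile)
    return (counts > threshold).astype(int)


def chronological_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Split without shuffling so the model never trains on future rows."""
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]


counts = pd.Series([5, 20, 60, 90, 40, 10, 70, 30, 80, 50])
target = make_target(counts)
frame = pd.DataFrame({"count": counts, "high_traffic": target})
train, test = chronological_split(frame)
```

With a median threshold the two classes come out roughly balanced, which reduces the need for separate class-balancing techniques.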
src/data_analysis.py
- Statistical summaries
- Correlation analysis
- Time series patterns
- Distribution analysis
The system provides comprehensive visualization capabilities for both real-time and historical data analysis.
outputs/streaming/traffic_patterns.png
- Hourly Patterns: Shows peak traffic hours and daily variations
- Daily Patterns: Reveals weekday vs weekend traffic differences
- Weekly Patterns: Displays traffic distribution across days of the week
- Distribution Analysis:
- Light vehicles show higher variability (std: 38.84)
- Heavy vehicles have more consistent patterns (std: 3.20)
- Box Plots: Visualizes the statistical distribution of vehicle counts
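The hourly and weekday-versus-weekend patterns described above come down to grouping counts by parts of the timestamp. The sketch below shows one way to compute them with pandas; the function name `traffic_profiles` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd


def traffic_profiles(counts: pd.Series):
    """Return mean counts per hour of day and per day type (weekday/weekend)."""
    hourly = counts.groupby(counts.index.hour).mean()
    day_type = np.where(counts.index.dayofweek >= 5, "weekend", "weekday")
    by_day_type = counts.groupby(day_type).mean()
    return hourly, by_day_type


# 2018-09-01 was a Saturday, 2018-09-03 a Monday.
index = pd.to_datetime(
    ["2018-09-01 08:00", "2018-09-01 17:00", "2018-09-03 08:00", "2018-09-03 17:00"]
)
counts = pd.Series([10, 20, 50, 80], index=index)
hourly, by_day_type = traffic_profiles(counts)
peak_hour = hourly.idxmax()
```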
outputs/streaming/environmental_time_series.png
- Daily Trends: Shows daily variations in environmental conditions
- Variable Ranges:
- Temperature: 1.54°C to 23.20°C (avg std: 3.14)
- Humidity: 32.01% to 97.32% (avg std: 13.28)
- Pressure: 999.27 to 1036.47 hPa (avg std: 2.09)
- Radiation: 6.04 to 240.73 W/m² (avg std: 170.58)
- Sunshine: 0.00 to 599.55 minutes (avg std: 189.95)
- Precipitation: 0.00 to 0.23 mm (avg std: 0.02)
outputs/streaming/environmental_correlations.png
- Strong Correlations:
- Humidity vs Temperature: -0.63
- Radiation vs Sunshine: 0.74
- Humidity vs Radiation: -0.58
- Weak Correlations:
- Pressure vs Temperature: -0.14
- Precipitation vs Temperature: -0.04
- Pressure vs Radiation: 0.03
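Correlation tables like the one above can be produced with `DataFrame.corr()`, which computes pairwise Pearson coefficients. The following sketch uses tiny synthetic columns, not the project's data, and a hypothetical helper `strongest_pairs` with an assumed cut-off of 0.5 for what counts as "strong".

```python
import pandas as pd


def strongest_pairs(corr: pd.DataFrame, threshold: float = 0.5):
    """List each variable pair once if its |r| reaches the threshold."""
    pairs = []
    columns = list(corr.columns)
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r = float(corr.loc[first, second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 2)))
    return pairs


# Synthetic values chosen so the relationships are easy to check by hand.
env = pd.DataFrame(
    {
        "temperature": [1.0, 2.0, 3.0, 4.0],
        "humidity": [8.0, 6.0, 4.0, 2.0],   # falls linearly as temperature rises
        "pressure": [1.0, 0.0, 0.0, 1.0],   # unrelated to the other two
    }
)
corr = env.corr()
pairs = strongest_pairs(corr)
```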
outputs/streaming/traffic_vs_environment.png
- Temperature Impact:
- Light vehicles: 0.254 correlation
- Heavy vehicles: 0.255 correlation
- Humidity Impact:
- Light vehicles: -0.365 correlation
- Heavy vehicles: -0.365 correlation
- Precipitation Impact:
- Light vehicles: -0.021 correlation
- Heavy vehicles: -0.024 correlation
- Pressure Impact:
- Light vehicles: -0.083 correlation
- Heavy vehicles: -0.081 correlation
The system processes and visualizes IoT data in real time, covering both traffic and environmental patterns.
- Traffic Analysis Visualizations:
- Combined analysis of light and heavy vehicles
- Daily traffic patterns
- Traffic distribution by vehicle type
- Peak hours identification
- Environmental Analysis Visualizations:
- Temperature trends
- Humidity patterns
- Radiation levels
- Pressure variations
- Sunshine duration
- Precipitation data
- Environmental Patterns:
- Daily environmental patterns
- Seasonal variations
- Weather trends
- Combined Analysis:
- Impact of weather on traffic
- Environmental influence on vehicle flow
- Combined pattern analysis
outputs/advanced_analysis/advanced_time_series_[variable].png
Where [variable] can be:
- heavy_vehicles
- light_vehicles
- humidity
- precipitation
- pressure
- radiation
- sunshine
- temperature
Example:
outputs/advanced_analysis/advanced_time_series_heavy_vehicles.png
- Available Time Series Plots:
- Heavy vehicles: advanced_time_series_heavy_vehicles.png
- Light vehicles: advanced_time_series_light_vehicles.png
- Humidity: advanced_time_series_humidity.png
- Precipitation: advanced_time_series_precipitation.png
- Pressure: advanced_time_series_pressure.png
- Radiation: advanced_time_series_radiation.png
- Sunshine: advanced_time_series_sunshine.png
- Temperature: advanced_time_series_temperature.png
outputs/advanced_analysis/cross_correlations.png
- Lag Analysis:
- Correlation patterns across different time lags
- Lead-lag relationships between variables
- Maximum correlation identification
- Key Features:
- 24-hour lag window
- Multiple variable comparison
- Statistical significance indicators
- Lagged effect visualization
- Time Series Patterns:
- Strong daily and weekly seasonality in traffic data
- Environmental variables show clear diurnal patterns
- Trend components reveal long-term changes
- Stationarity analysis shows data characteristics
- Cross-Correlation:
- Reveals delayed effects of weather on traffic
- Identifies optimal prediction windows
- Shows complex inter-variable relationships
- Highlights lagged dependencies
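A lag analysis of this kind can be sketched by shifting one series against the other and computing the correlation at each lag. The helper `lagged_correlations` below is a hypothetical illustration, not the project's implementation. At a 10-minute sampling interval, a 24-hour lag window corresponds to `max_lag=144`; the example uses a much shorter synthetic series.

```python
import pandas as pd


def lagged_correlations(driver: pd.Series, response: pd.Series, max_lag: int) -> pd.Series:
    """Correlate driver at time t with response at time t + lag, for each lag."""
    return pd.Series(
        {lag: driver.corr(response.shift(-lag)) for lag in range(max_lag + 1)}
    )


# Synthetic example: the response simply repeats the driver two steps later.
driver = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0, 3.0, 8.0, 1.0, 4.0])
response = driver.shift(2)

correlations = lagged_correlations(driver, response, max_lag=4)
best_lag = correlations.idxmax()
```

The lag with the highest correlation suggests how far ahead the driver variable is informative, which is one way to choose a prediction window.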
src/data_visualization.py
- Time series plots
- Correlation heatmaps
- Distribution plots
- Model performance visualizations
- Data Flow: Raw Data → ETL Pipeline → ML Pipeline → Analysis → Visualization
- Real-time Processing: Sensor Data → MQTT Stream → Real-time Processing → Live Predictions
- Batch Processing: Historical Data → ETL → Model Training → Analysis → Reports
- ETL Pipeline:

    # Load and process data
    from data_acquisition import load_environmental_data, load_traffic_data
    from data_processing import clean_data, resample_data, combine_iot_data

    # Extract
    env_data = load_environmental_data()
    traffic_data = load_traffic_data()

    # Transform
    cleaned_data = clean_data(env_data)
    resampled_data = resample_data(traffic_data)
    combined_data = combine_iot_data(cleaned_data, resampled_data)

- ML Pipeline:

    # Train and evaluate model
    from model_training import IoTModel

    # Initialize model
    model = IoTModel()

    # Prepare data
    X, y = model.prepare_data(combined_data)

    # Train model
    model.train_model(X, y, model_type="random_forest")

    # Make predictions
    predictions = model.predict(new_data)

- Analysis Pipeline:

    # Analyze and visualize
    from data_analysis import analyze_traffic_data, analyze_environmental_data
    from data_visualization import visualize_all

    # Run analysis
    traffic_stats = analyze_traffic_data(traffic_data)
    env_stats = analyze_environmental_data(env_data)

    # Generate visualizations
    visualize_all(traffic_stats, env_stats)

┌────────────────────────────────────────────────────────────────────────────┐
│                                Data Sources                                │
└──────────────────────────────┬─────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                           Data Acquisition Layer                           │
├──────────────────────────────┬─────────────────────────────────────────────┤
│       File-based Data        │               Real-time Data                │
│  ┌─────────┐    ┌─────────┐  │  ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │  JSON   │    │   CSV   │  │  │  MQTT   │    │   API   │    │  WebS   │  │
│  └────┬────┘    └────┬────┘  │  └────┬────┘    └────┬────┘    └────┬────┘  │
└───────┼──────────────┼───────┴───────┼──────────────┼──────────────┼───────┘
        │              │               │              │              │
        ▼              ▼               ▼              ▼              ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                           Data Processing Layer                            │
├──────────────────────────────┬─────────────────────────────────────────────┤
│        Data Cleaning         │             Feature Engineering             │
│  ┌─────────┐    ┌─────────┐  │  ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │ Missing │    │ Outlier │  │  │  Time   │    │   Env   │    │ Rolling │  │
│  │ Values  │    │ Detect  │  │  │ Features│    │  Intr   │    │  Stats  │  │
│  └────┬────┘    └────┬────┘  │  └────┬────┘    └────┬────┘    └────┬────┘  │
└───────┼──────────────┼───────┴───────┼──────────────┼──────────────┼───────┘
        │              │               │              │              │
        ▼              ▼               ▼              ▼              ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                            Model Training Layer                            │
├──────────────────────────────┬─────────────────────────────────────────────┤
│       Data Preparation       │               Model Training                │
│  ┌─────────┐    ┌─────────┐  │  ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │ Feature │    │ Target  │  │  │ Random  │    │ Neural  │    │ LogReg  │  │
│  │ Select  │    │ Creation│  │  │ Forest  │    │ Network │    │  Model  │  │
│  └────┬────┘    └────┬────┘  │  └────┬────┘    └────┬────┘    └────┬────┘  │
└───────┼──────────────┼───────┴───────┼──────────────┼──────────────┼───────┘
        │              │               │              │              │
        ▼              ▼               ▼              ▼              ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                               Analysis Layer                               │
├──────────────────────────────┬─────────────────────────────────────────────┤
│     Statistical Analysis     │                Visualization                │
│  ┌─────────┐    ┌─────────┐  │  ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │  Basic  │    │Advanced │  │  │  Time   │    │  Dist   │    │  Model  │  │
│  │  Stats  │    │ Analysis│  │  │ Series  │    │  Plots  │    │  Perf   │  │
│  └────┬────┘    └────┬────┘  │  └────┬────┘    └────┬────┘    └────┬────┘  │
└───────┼──────────────┼───────┴───────┼──────────────┼──────────────┼───────┘
        │              │               │              │              │
        ▼              ▼               ▼              ▼              ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                                Output Layer                                │
├──────────────────────────────┬─────────────────────────────────────────────┤
│           Reports            │              Real-time Output               │
│  ┌─────────┐    ┌─────────┐  │  ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │   PDF   │    │   CSV   │  │  │  Live   │    │ Alerts  │    │   API   │  │
│  │ Reports │    │  Data   │  │  │   Viz   │    │         │    │ Endpts  │  │
│  └─────────┘    └─────────┘  │  └─────────┘    └─────────┘    └─────────┘  │
└──────────────────────────────┴─────────────────────────────────────────────┘
- Data Sources:
- Environmental sensor data (MS83200MS)
- Traffic sensor data (Siemens)
- Real-time MQTT streams
- API endpoints
- WebSocket connections
- Data Processing:
- Data cleaning and validation
- Feature engineering
- Time-based features
- Environmental interactions
- Rolling statistics
- Model Training:
- Feature selection
- Target variable creation
- Multiple model types
- Hyperparameter tuning
- Cross-validation
- Analysis:
- Statistical analysis
- Advanced time series analysis
- Pattern detection
- Anomaly identification
- Output:
- Reports and visualizations
- Real-time predictions
- API endpoints
- Alert system
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Data     │     │    Data     │     │    Model    │     │  Analysis   │
│   Sources   │────▶│ Processing  │────▶│  Training   │────▶│    Layer    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Real-time  │     │   Feature   │     │    Model    │     │   Output    │
│   Streams   │     │    Store    │     │  Registry   │     │ Generation  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sensor    │     │    MQTT     │     │  Real-time  │     │    Live     │
│    Data     │────▶│   Broker    │────▶│ Processing  │────▶│ Predictions │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Data     │     │   Message   │     │   Feature   │     │  Dashboard  │
│ Validation  │     │    Queue    │     │ Extraction  │     │   Updates   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Historical  │     │    Data     │     │    Model    │     │  Analysis   │
│    Data     │────▶│ Processing  │────▶│  Training   │────▶│ Generation  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Data     │     │   Feature   │     │    Model    │     │   Report    │
│   Loading   │     │ Engineering │     │ Evaluation  │     │ Generation  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
- Time-based feature engineering
- Environmental interaction features
- Rolling window statistics
- Data cleaning and normalization
- Random Forest Classifier
- Hyperparameter tuning with GridSearchCV
- Cross-validation
- Feature importance analysis
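For readers unfamiliar with the tuning step, the following sketch shows what a Random Forest grid search with scikit-learn's `GridSearchCV` can look like. The synthetic data, the parameter grid, the F1 scoring, and the use of `TimeSeriesSplit` for the folds are assumptions chosen for illustration; the project's actual grid and fold strategy may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data: only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)

# Hypothetical, deliberately small grid so the example runs quickly.
param_grid = {"n_estimators": [10, 25], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # folds respect temporal order
    scoring="f1",
)
search.fit(X, y)

best_params = search.best_params_
importances = search.best_estimator_.feature_importances_
```

`TimeSeriesSplit` always validates on rows that come after the training rows, which avoids leaking future information when the samples are ordered in time.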
The model training process generates several outputs in the outputs/training/ directory:
outputs/training/roc_curve.png
- Receiver Operating Characteristic (ROC) curves for both training and test sets
- Area Under Curve (AUC) scores for model evaluation
- Overfitting detection through AUC comparison
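The idea behind comparing training and test AUC can be shown without any plotting library. In the sketch below, AUC is computed directly from its definition as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The function name `auc_score`, the sample scores, and the 0.1 gap used to flag overfitting are illustrative assumptions.

```python
def auc_score(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


train_auc = auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
test_auc = auc_score([0, 1, 0, 1], [0.2, 0.3, 0.6, 0.9])

# Hypothetical rule of thumb: a large train/test gap hints at overfitting.
overfitting_suspected = (train_auc - test_auc) > 0.1
```

This pairwise form is quadratic in the number of samples, so it is meant for explanation; library implementations are preferable for real datasets.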
outputs/training/classifier_performance.png
- Performance metrics comparison between training and test sets:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion matrices for both datasets
- Classification reports with detailed metrics
- Visual comparison of model performance
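The four metrics in this plot all derive from the confusion matrix. As a small worked example, using made-up labels and a hypothetical helper name `classification_metrics`, they can be computed as follows.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = high traffic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 1, 1, 0],
)
```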
outputs/training/feature_importance.png
- Relative importance of each feature in the model
- Sorted feature importance scores
- Visual representation of feature contributions
outputs/training/training_metrics.png
- Accuracy, precision, recall, and F1 scores
- Confusion matrices for both training and test sets
- Performance comparison between datasets
outputs/training/training_history.png
- Loss curves for training and validation
- Accuracy progression over epochs
- Early stopping indicators
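Early stopping is usually driven by a patience rule on the validation loss. The sketch below shows one common form of that rule; the function name `early_stopping_epoch`, the default patience of 3, and the loss values are assumptions for illustration and may not match the training loop in this project.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch that is `patience` epochs past the best
    validation loss without improving on it. stop_epoch is None if the rule
    never triggers.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch


stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.62, 0.61, 0.63, 0.5])
never_stopped = early_stopping_epoch([3.0, 2.0, 1.0])
```

In practice the checkpoint saved at `best_epoch` is the one restored after stopping, which is where model checkpointing comes in.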
models/
├── model.pkl # Trained model
└── model_metadata.json # Model configuration and metrics
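A minimal sketch of writing and reading this pair of files with the standard library is shown below. The helper names `save_model` and `load_model` and the metadata keys are hypothetical, and a plain dictionary stands in for a trained model so the example has no third-party dependencies.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_model(model, metadata: dict, directory) -> None:
    """Write model.pkl and model_metadata.json into the given directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "model.pkl", "wb") as fh:
        pickle.dump(model, fh)
    with open(directory / "model_metadata.json", "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2)


def load_model(directory):
    """Read back the model and its metadata."""
    directory = Path(directory)
    with open(directory / "model.pkl", "rb") as fh:
        model = pickle.load(fh)
    with open(directory / "model_metadata.json", encoding="utf-8") as fh:
        metadata = json.load(fh)
    return model, metadata


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a trained estimator; any picklable object works the same way.
    save_model(
        {"threshold": 45.0},
        {"model_type": "random_forest", "features": ["hour", "temperature"]},
        Path(tmp) / "models",
    )
    restored_model, restored_metadata = load_model(Path(tmp) / "models")
```

Pickle files can execute arbitrary code when loaded, so only unpickle model files from a trusted source.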
- View model performance:
python src/model_training.py
- Generated outputs can be found in:
outputs/training/
- Model files are saved in:
models/
- Install dependencies:
pip install -r requirements.txt
- Run data analysis:
python src/data_analysis.py
- Generate visualizations:
python src/data_visualization.py
- Train model:
python src/model_training.py
- Data Analysis:
- Implement anomaly detection
- Add more sophisticated time series analysis
- Include weather forecast integration
- Model Enhancement:
- Experiment with deep learning models
- Add real-time prediction capabilities
- Implement ensemble methods
- Visualization:
- Create interactive dashboards
- Add real-time monitoring capabilities
- Implement geospatial visualization
- Clone the repository:
git clone https://github.com/yourusername/IoT-Data-Analysis.git
cd IoT-Data-Analysis
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
The system generates several types of outputs:
- Model Files
- Trained model (.pkl)
- Model metadata (.json)
- Feature importance plots
- ROC curves
- Analysis Results
- Traffic pattern visualizations
- Environmental correlation plots
- Statistical summaries
- Logs
- Training progress
- Model performance metrics
- Error tracking
- Python 3.8+
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- paho-mqtt
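- pathlib (Python standard library, no installation needed)
- logging (Python standard library, no installation needed)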
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Siemens for traffic data
- MS83200MS sensor for environmental data
- Open-source community for libraries and tools