Data quality is critical for accurate analysis and decision-making. The datavalr package provides an essential toolkit for validating datasets, diagnosing common data issues, and performing quality control checks. Whether you are working with real-world messy data or aiming to ensure data integrity in predictive modeling, datavalr offers robust and intuitive functions for exploratory data analysis and cleaning.
The goal of datavalr is to diagnose primarily data entry errors, flagging them for collaborators to fix before further statistical analysis. The package can denote outliers, misclassification of data frame columns, showcase missingness in data frames, and diagnose other data entry errors.
- Diagnose Empty Values: Quickly identify columns with empty strings, assessing their prevalence and offering a data frame with NA’s replacing the empty strings.
- Summarize Columns: Generate insights about column data types and missingness
- Track Outliers: Detect and visualize possible mice outliers by flagging irregular weight trends over time.
- Validate data consistency: Ensure chronological order and logical consistency of time series mice data.
- Audit Betting Metrics: Evaluate and flag anomalies in betting metrics like spreads and over/under lines.
Designed with ease of use in mind, datavalr streamlines data validation and enhances your workflow.
You can install the development version of datavalr from GitHub with:
# install.packages("pak")
pak::pak("oscar-mercado/datavalr")
Otherwise install the devtools
library and run the following code:
devtools::install_github("oscar-mercado/datavalr")
Here are some examples to demonstrate the core functionality of datavalr:
library(datavalr)
df <- footballbetting23
summarize_columns(df)
#> Column Class Missing_Values
#> schedule_date schedule_date character 0
#> schedule_season schedule_season integer 0
#> schedule_week schedule_week character 0
#> schedule_playoff schedule_playoff logical 0
#> team_home team_home character 0
#> score_home score_home integer 0
#> score_away score_away integer 0
#> team_away team_away character 0
#> team_favorite_id team_favorite_id character 0
#> spread_favorite spread_favorite numeric 0
#> over_under_line over_under_line numeric 0
#> stadium stadium character 0
#> stadium_neutral stadium_neutral logical 0
#> weather_temperature weather_temperature integer 180
#> weather_wind_mph weather_wind_mph integer 180
#> weather_humidity weather_humidity integer 277
#> weather_detail weather_detail character 0
#> over_under over_under character 0
The first example above loads the footballbetting23
dataset in the
library and uses the summarize_columns()
function. This generates a
comprehensive data table summarizing the columns of our dataframe
including data types and missing values for each column. This helps
collaborators determine data entry errors if they want columns to have
different classes or further analyze the missingness in the data.
Through further exploration of the football data frame, I observe
something unique: the weather_detail
column contains empty strings as
possible values but are not counted as missing
#> [1] "" "indoor"
In this case, we can use the diagnose_empty_values()
function to
determine the count of empty string values and replace them with NA’s if
collaborators desire:
diagnosis <- diagnose_empty_values(df)
head(diagnosis$Modified_Data)
#> schedule_date schedule_season schedule_week schedule_playoff
#> 1 9/7/2023 2023 1 FALSE
#> 2 9/10/2023 2023 1 FALSE
#> 3 9/10/2023 2023 1 FALSE
#> 4 9/10/2023 2023 1 FALSE
#> 5 9/10/2023 2023 1 FALSE
#> 6 9/10/2023 2023 1 FALSE
#> team_home score_home score_away team_away team_favorite_id
#> 1 Kansas City Chiefs 20 21 Detroit Lions KC
#> 2 Atlanta Falcons 24 10 Carolina Panthers ATL
#> 3 Baltimore Ravens 25 9 Houston Texans BAL
#> 4 Chicago Bears 20 38 Green Bay Packers CHI
#> 5 Cleveland Browns 24 3 Cincinnati Bengals CIN
#> 6 Denver Broncos 16 17 Las Vegas Raiders DEN
#> spread_favorite over_under_line stadium
#> 1 -4.5 53.0 GEHA Field at Arrowhead Stadium
#> 2 -3.5 39.5 Mercedes-Benz Stadium
#> 3 -9.5 43.5 M&T Bank Stadium
#> 4 -1.0 41.5 Soldier Field
#> 5 -2.0 47.5 FirstEnergy Stadium
#> 6 -3.5 43.0 Empower Field at Mile High
#> stadium_neutral weather_temperature weather_wind_mph weather_humidity
#> 1 FALSE NA NA NA
#> 2 FALSE 72 0 NA
#> 3 FALSE NA NA NA
#> 4 FALSE NA NA NA
#> 5 FALSE NA NA NA
#> 6 FALSE NA NA NA
#> weather_detail over_under
#> 1 <NA> under
#> 2 indoor under
#> 3 <NA> under
#> 4 <NA> over
#> 5 <NA> under
#> 6 <NA> under
diagnosis$Empty_Values_Summary
#> weather_detail
#> 188
For more advanced examples and explanations, please refer to the package vignette:
vignette("datavalr")
Contributions are welcome! Feel free to fork the repository, submit issues, or propose enhancements. Contact me at [email protected] for any inquiries.
This package is licensed under GPL (>= 3). See the LICENSE file for more details.