This repository was archived by the owner on Feb 22, 2023. It is now read-only.

[FYI] Filtering Benchmark #138

@dhirschfeld

If I convert a pa.Table to a pandas DataFrame, I pay the conversion cost up front, but operations such as filtering then seem to be about 2x faster than on a fletcher DataFrame (423 ms vs. 897 ms in the sessions below):

In [37]: %time df = tbl.to_pandas()
Wall time: 602 ms

In [38]: df.dtypes
Out[38]: 
date_inserted_utc    datetime64[ns]
date_created_utc     datetime64[ns]
issue_date_utc       datetime64[ns]
data_provider                 int64
weather_station               int64
weather_variable              int64
value_date_utc       datetime64[ns]
value                       float64
dtype: object

In [39]: %time wv1 = df[df['weather_variable'] == 1]
Wall time: 423 ms

In [34]: %time df = fr.pandas_from_arrow(tbl)
Wall time: 2 ms

In [35]: df.dtypes
Out[35]: 
date_inserted_utc    fletcher_chunked[timestamp[us]]
date_created_utc     fletcher_chunked[timestamp[us]]
issue_date_utc       fletcher_chunked[timestamp[us]]
data_provider                fletcher_chunked[int64]
weather_station              fletcher_chunked[int64]
weather_variable             fletcher_chunked[int64]
value_date_utc       fletcher_chunked[timestamp[us]]
value                       fletcher_chunked[double]
dtype: object

In [36]: %time wv1 = df[df['weather_variable'] == 1]
Wall time: 897 ms

Just posting here in case benchmarks on real data are of interest.
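
For anyone who wants to try a similar comparison without the original data, here is a minimal sketch on synthetic data. It is not the benchmark above: the row count, value ranges, and the helper names `make_frame` and `best_of` are assumptions made up for illustration, and it times a plain `pyarrow.compute` filter as a rough stand-in for an Arrow-backed column rather than fletcher itself. It uses `timeit.repeat` instead of a single `%time` call so that one-off noise matters less.

```python
import timeit

import numpy as np
import pandas as pd

try:
    import pyarrow as pa
    import pyarrow.compute as pc
except ImportError:  # pyarrow is optional for this sketch
    pa = pc = None


def make_frame(n_rows=1_000_000, seed=0):
    """Build a synthetic frame; column names mirror the issue, values are made up."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "weather_station": rng.integers(0, 500, n_rows),
        "weather_variable": rng.integers(0, 10, n_rows),
        "value": rng.random(n_rows),
    })


def best_of(func, repeat=5):
    """Return the fastest of `repeat` single runs, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


df = make_frame()

# Same filter expression as in the issue, on NumPy-backed columns.
t_pandas = best_of(lambda: df[df["weather_variable"] == 1])
print(f"pandas (NumPy-backed) filter: {t_pandas * 1e3:.1f} ms")

if pa is not None:
    # Equivalent filter done directly on the Arrow table.
    tbl = pa.Table.from_pandas(df, preserve_index=False)
    t_arrow = best_of(lambda: tbl.filter(pc.equal(tbl["weather_variable"], 1)))
    print(f"pyarrow.compute filter:       {t_arrow * 1e3:.1f} ms")
```

The timings will depend on the machine, the library versions, and how many rows match the filter, so they should not be expected to match the numbers above.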
