read_csv: date_parser called once with arrays and then many times with strings (from each single row) as arguments #9376

Closed
@cmeeren

Pandas version info:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: EN

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.0
statsmodels: 0.5.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

Consider the following script, which contains the data to be loaded, a date-parsing function, and a call to pd.read_csv():

from __future__ import print_function
import StringIO
import pandas as pd
import datetime as dt

text = '''
YYYY DOY HR MN  1  2   3      4      5       6     7
2013   1  0  0 71   1 100   5571      0 99999.9 999.9
2013   1  0  1 99 999 999 999999 999999 99999.9 999.9
2013   1  0  2 71   5 100   5654     19  -348.2   9.1
2013   1  0  3 71   1 100   5647      0  -350.6   9.5
2013   1  0  4 71   1 100   5693      0  -351.2   9.4
'''

def parse_date(year, doy, hour, minute):
    print(year, type(year))
    try:  # array arguments
        year = year.astype('datetime64[Y]')
        doy = doy.astype('datetime64[D]')
        hour = hour.astype('datetime64[h]')
        minute = minute.astype('datetime64[m]')
        return year + doy + hour + minute  # return None also gives the same final DataFrame
    except Exception:  # string arguments
        year = int(year)
        doy = int(doy)
        hour = int(hour)
        minute = int(minute)
        return dt.datetime(year, 1, 1, hour, minute) + dt.timedelta(doy - 1)


data = pd.read_csv(StringIO.StringIO(text), delim_whitespace=True,
                   date_parser=parse_date, parse_dates=[[0, 1, 2, 3]], index_col=0)

The print statement in parse_date outputs:

['2013' '2013' '2013' '2013' '2013'] <type 'numpy.ndarray'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>

With parse_date as written above, the DataFrame is loaded properly, even when the first return statement in parse_date is changed to return None:

>>> data
                      1    2    3       4       5        6      7
YYYY_DOY_HR_MN                                                   
2013-01-01 00:00:00  71    1  100    5571       0  99999.9  999.9
2013-01-01 00:01:00  99  999  999  999999  999999  99999.9  999.9
2013-01-01 00:02:00  71    5  100    5654      19   -348.2    9.1
2013-01-01 00:03:00  71    1  100    5647       0   -350.6    9.5
2013-01-01 00:04:00  71    1  100    5693       0   -351.2    9.4

It strikes me as odd that the date parser is called with both whole columns and individual rows. I could not find any documentation regarding this behaviour. It immediately raises two questions in my mind:

  • Why is the date parser called first with whole columns as arguments, and then again with strings from every single row? Is this a bug? It seems pointless to require the user to handle both kinds of arguments, especially when the return value from the array call is apparently not even used.
  • Wouldn't it be much better performance-wise to call the date parser only once with the arrays (whole columns) as arguments, enabling the user to process the whole index in one go where possible?
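
To make the second question concrete, here is a minimal sketch of a parser that processes whole columns in one call. It assumes the arguments arrive as NumPy arrays of strings, as the printed output above suggests. The name parse_date_vectorized is hypothetical, and the sketch is exercised directly with NumPy rather than through pd.read_csv(), so it says nothing about how pandas itself decides which call style to use.

```python
import numpy as np


def parse_date_vectorized(year, doy, hour, minute):
    """Combine year, day-of-year, hour and minute columns in one call.

    Hypothetical helper: assumes each argument is an array of strings
    (or integers) of equal length.
    """
    year = np.asarray(year).astype(int)
    doy = np.asarray(doy).astype(int)
    hour = np.asarray(hour).astype(int)
    minute = np.asarray(minute).astype(int)

    # datetime64[Y] counts years since 1970, so shift before converting.
    start_of_year = (year - 1970).astype('datetime64[Y]')

    # Day-of-year is 1-based, so day 1 adds no offset.
    return (start_of_year.astype('datetime64[m]')
            + (doy - 1).astype('timedelta64[D]')
            + hour.astype('timedelta64[h]')
            + minute.astype('timedelta64[m]'))


columns = (np.array(['2013', '2013']), np.array(['1', '32']),
           np.array(['0', '5']), np.array(['0', '2']))
print(parse_date_vectorized(*columns))
```

Unlike the array branch of parse_date above, this treats the day-of-year, hour and minute columns as offsets from the start of the year rather than as absolute dates.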
