read_csv: date_parser called once with arrays and then many times with strings (from each single row) as arguments

Pandas version info:

```
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: EN

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.0
statsmodels: 0.5.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
```

Consider the following script, which contains the data to be loaded, a date parsing function, and a call to `pd.read_csv()`:

``` python
from __future__ import print_function
import StringIO
import pandas as pd
import datetime as dt

text = '''
YYYY DOY HR MN  1  2   3      4      5       6     7
2013   1  0  0 71   1 100   5571      0 99999.9 999.9
2013   1  0  1 99 999 999 999999 999999 99999.9 999.9
2013   1  0  2 71   5 100   5654     19  -348.2   9.1
2013   1  0  3 71   1 100   5647      0  -350.6   9.5
2013   1  0  4 71   1 100   5693      0  -351.2   9.4
'''

def parse_date(year, doy, hour, minute):
    print(year, type(year))
    try:  # array arguments
        year = year.astype('datetime64[Y]')
        doy = doy.astype('datetime64[D]')
        hour = hour.astype('datetime64[h]')
        minute = minute.astype('datetime64[m]')
        return year + doy + hour + minute  # return None also gives the same final DataFrame
    except:  # string arguments
        year = int(year)
        doy = int(doy)
        hour = int(hour)
        minute = int(minute)
        return dt.datetime(year, 1, 1, hour, minute) + dt.timedelta(doy - 1)


data = pd.read_csv(StringIO.StringIO(text), delim_whitespace=True,
                   date_parser=parse_date, parse_dates=[[0, 1, 2, 3]], index_col=0)
```

The `print` call in `parse_date` outputs:

```
['2013' '2013' '2013' '2013' '2013'] <type 'numpy.ndarray'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
```

With `parse_date` as written above, the DataFrame is loaded properly, even when the first `return` statement in `parse_date` is changed to `return None`:

```
>>> data
                      1    2    3       4       5        6      7
YYYY_DOY_HR_MN                                                   
2013-01-01 00:00:00  71    1  100    5571       0  99999.9  999.9
2013-01-01 00:01:00  99  999  999  999999  999999  99999.9  999.9
2013-01-01 00:02:00  71    5  100    5654      19   -348.2    9.1
2013-01-01 00:03:00  71    1  100    5647       0   -350.6    9.5
2013-01-01 00:04:00  71    1  100    5693       0   -351.2    9.4

```

It strikes me as odd that the date parser is called with both whole columns and individual rows, and I could not find any documentation of this behaviour. It raises two questions:
- Why is the date parser called first with whole columns as arguments and then again with strings from every single row? Is this a bug? It seems pointless to require the user to handle both kinds of arguments, especially when the return value from the array call is apparently not even used.
- Wouldn't it be much better performance-wise to call the date parser only once with the arrays (whole columns) as arguments, so that the user can process the whole index in one go where possible?
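
To illustrate the second question, here is a minimal sketch of what an array-only parser could look like if it were called just once with whole columns. The name `parse_date_arrays` is hypothetical, and the sketch assumes each argument is an array of numeric strings such as `'2013'`. It uses only NumPy and is run here outside of `read_csv`; I have not verified whether pandas would accept its return value without falling back to the row-wise calls.

```python
import numpy as np


def parse_date_arrays(year, doy, hour, minute):
    """Hypothetical array-only parser: build minute-resolution timestamps
    from year, day-of-year, hour and minute columns in one vectorized pass."""
    # Convert the string columns to integers (assumes numeric strings).
    year = np.asarray(year).astype(np.int64)
    doy = np.asarray(doy).astype(np.int64)
    hour = np.asarray(hour).astype(np.int64)
    minute = np.asarray(minute).astype(np.int64)

    # datetime64[Y] counts years since 1970, so subtract the epoch year.
    start_of_year = (year - 1970).astype('datetime64[Y]')

    # Day-of-year is 1-based, hence the "- 1" offset.
    return (start_of_year.astype('datetime64[m]')
            + (doy - 1).astype('timedelta64[D]')
            + hour.astype('timedelta64[h]')
            + minute.astype('timedelta64[m]'))


# Columns as they appear in the sample data above (all strings).
years = np.array(['2013'] * 5)
doys = np.array(['1'] * 5)
hours = np.array(['0'] * 5)
minutes = np.array(['0', '1', '2', '3', '4'])

print(parse_date_arrays(years, doys, hours, minutes))
```

With a parser like this there would be no need for a separate string branch, which is the point of the question above.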

