Description
Pandas version info:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: EN
pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.0
statsmodels: 0.5.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
Consider the following script, containing data to be loaded, a date parsing function, and a call to pd.read_csv():
from __future__ import print_function
import StringIO
import pandas as pd
import datetime as dt
text = '''
YYYY DOY HR MN 1 2 3 4 5 6 7
2013 1 0 0 71 1 100 5571 0 99999.9 999.9
2013 1 0 1 99 999 999 999999 999999 99999.9 999.9
2013 1 0 2 71 5 100 5654 19 -348.2 9.1
2013 1 0 3 71 1 100 5647 0 -350.6 9.5
2013 1 0 4 71 1 100 5693 0 -351.2 9.4
'''

def parse_date(year, doy, hour, minute):
    print(year, type(year))
    try:  # array arguments
        year = year.astype('datetime64[Y]')
        doy = doy.astype('datetime64[D]')
        hour = hour.astype('datetime64[h]')
        minute = minute.astype('datetime64[m]')
        return year + doy + hour + minute  # return None also gives the same final DataFrame
    except:  # string arguments
        year = int(year)
        doy = int(doy)
        hour = int(hour)
        minute = int(minute)
        return dt.datetime(year, 1, 1, hour, minute) + dt.timedelta(doy - 1)

data = pd.read_csv(StringIO.StringIO(text), delim_whitespace=True,
                   date_parser=parse_date, parse_dates=[[0, 1, 2, 3]], index_col=0)
The print statement in parse_date outputs:
['2013' '2013' '2013' '2013' '2013'] <type 'numpy.ndarray'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
2013 <type 'str'>
With parse_date as written above, the DataFrame is loaded properly, even when the first return statement in parse_date is changed to return None:
>>> data
                      1    2    3       4       5        6      7
YYYY_DOY_HR_MN
2013-01-01 00:00:00  71    1  100    5571       0  99999.9  999.9
2013-01-01 00:01:00  99  999  999  999999  999999  99999.9  999.9
2013-01-01 00:02:00  71    5  100    5654      19   -348.2    9.1
2013-01-01 00:03:00  71    1  100    5647       0   -350.6    9.5
2013-01-01 00:04:00  71    1  100    5693       0   -351.2    9.4
It strikes me as odd that the date parser is called with both whole columns and individual rows. I could not find any documentation regarding this behaviour. It immediately raises two questions in my mind:
- Why is the date parser called first with whole columns as arguments, and then again with the strings from every single row? Is this a bug? It seems pointless to require the user to handle both kinds of arguments when the return value from the array call is apparently not even used.
- Wouldn't it be much better performance-wise to call the date parser only once, with the arrays (whole columns) as arguments, so that the user can process the whole index in one go where possible?
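
To illustrate the second point, here is a rough sketch of what a parser might look like if it only had to handle whole columns. It assumes each argument arrives as an array of strings, one per column, as the first line of the print output suggests. The name parse_date_columns is made up for this example, and I have not checked whether read_csv would use its return value directly instead of falling back to per-row calls.

```python
import numpy as np


def parse_date_columns(year, doy, hour, minute):
    """Convert whole columns (year, day of year, hour, minute) in one pass.

    Hypothetical helper: assumes each argument is a sequence of strings
    or integers of equal length. Returns a datetime64[m] array.
    """
    # Convert the string columns to integers first.
    year = np.asarray(year).astype(np.int64)
    doy = np.asarray(doy).astype(np.int64)
    hour = np.asarray(hour).astype(np.int64)
    minute = np.asarray(minute).astype(np.int64)

    # datetime64[Y] counts years since 1970, so subtract the epoch year,
    # then switch to minute resolution before adding the offsets.
    start_of_year = (year - 1970).astype('datetime64[Y]').astype('datetime64[m]')

    # Day of year is 1-based, so day 1 adds zero days.
    return (start_of_year
            + (doy - 1).astype('timedelta64[D]')
            + hour.astype('timedelta64[h]')
            + minute.astype('timedelta64[m]'))
```

Called with the four columns from the example data, this would produce all five timestamps in a single call, with no per-row Python loop.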