ENH: Raise ParserWarning when length of names does not match length of data #38587

Merged: 27 commits, Jun 16, 2021
Changes from 14 commits
Commits
d98c6fd
ENH: Raise ParserWarning when length of names does not match length o…
phofl Dec 19, 2020
26b07b2
Fix bugs from strg+z
phofl Dec 19, 2020
7dd3f1b
Refactor code
phofl Dec 19, 2020
70d5c1c
Refactor if else
phofl Dec 19, 2020
76abd33
Add okwarning
phofl Dec 19, 2020
31929f4
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Dec 23, 2020
3813435
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Jan 3, 2021
5b688f7
Allow trailing commas
phofl Jan 3, 2021
56cdd18
Fix dtype bug
phofl Jan 4, 2021
ac15a30
Fix npdev bug
phofl Jan 4, 2021
4b08ab6
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Jan 4, 2021
387b5fa
Add missing init file
phofl Jan 4, 2021
53cac93
Remove empty file
phofl Jan 4, 2021
5d142fe
Add warning
phofl Jan 4, 2021
764e002
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Jan 17, 2021
8bd631a
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Feb 19, 2021
b21b795
Merge master
phofl Feb 19, 2021
5c19c9f
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Mar 2, 2021
928ad4f
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Apr 20, 2021
eb77157
Fix typing
phofl Apr 20, 2021
9dce995
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl May 14, 2021
16faf35
Change test
phofl May 14, 2021
4b3f63a
Remove warning
phofl May 14, 2021
ca2f026
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl May 23, 2021
fa6fed0
Adress comments
phofl May 23, 2021
afb023f
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Jun 3, 2021
95770d1
Merge branch 'master' of https://github.com/pandas-dev/pandas into 21768
phofl Jun 12, 2021
1 change: 1 addition & 0 deletions doc/source/user_guide/io.rst
@@ -753,6 +753,7 @@ the end of each data line, confusing the parser. To explicitly disable the
index column inference and discard the last column, pass ``index_col=False``:

.. ipython:: python
   :okwarning:
Member

If this is going to warn, should the docs here then have to be updated to reflect this change?

(but is this actually going to warn? Below I read "One set of trailing commas is allowed.", which is the case here?)

Member Author

Good point. This raised a warning earlier, before one set of trailing commas was allowed.


   data = "a,b,c\n4,apple,bat,\n8,orange,cow,"
   print(data)
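
For readers following the thread, here is a minimal sketch of the docs example under discussion: three header names, with every data row ending in a trailing comma, read with ``index_col=False``. It assumes a pandas release that includes this change (1.3.0 or later, going by the whatsnew entry). The `caught` list is only an illustration device for inspecting whether anything was warned; per the discussion above, one set of trailing commas should not warn, but that is worth confirming on the installed version.

```python
import warnings
from io import StringIO

import pandas as pd

# Same sample as in the docs: three header names, but each data row ends
# with a trailing comma, so the parser sees four fields per row.
data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

# Record any warnings instead of letting them print, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # index_col=False disables index inference and discards the last column.
    df = pd.read_csv(StringIO(data), index_col=False)

print(df)
print([w.category.__name__ for w in caught])
```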
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -49,6 +49,7 @@ Other enhancements
- Improve error message when ``usecols`` and ``names`` do not match for :func:`read_csv` and ``engine="c"`` (:issue:`29042`)
- Improved consistency of error message when passing an invalid ``win_type`` argument in :class:`Window` (:issue:`15969`)
- :func:`pandas.read_sql_query` now accepts a ``dtype`` argument to cast the columnar data from the SQL database based on user input (:issue:`10285`)
- :func:`read_csv` now raises ``ParserWarning`` if length of header or given names does not match length of data when ``usecols`` is not specified (:issue:`21768`)
- Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`)
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`)
26 changes: 26 additions & 0 deletions pandas/io/parsers.py
@@ -1844,6 +1844,28 @@ def _do_date_conversions(self, names, data):

        return names, data

    def _check_data_length(self, columns: List[str], data: List[np.ndarray]):
        """Checks if length of data is equal to length of column names. One set of
        trailing commas is allowed.

        Parameters
        ----------
        columns: list of column names
        data: list of array-likes containing the data column-wise

        """
        if not self.index_col and len(columns) != len(data) and columns:
Contributor

Why do we need to check that the data is actually null? In other words, when would len(columns) > len(data) happen?

Member Author

I think len(columns) > len(data) is caught at another place.
We get here when len(columns) < len(data). With one set of trailing commas, we have len(columns) + 1 == len(data). To see whether we really have trailing commas, we have to check whether the last array is empty. If it is not empty, we do not have trailing commas but data which will be dropped.
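
The rule described in this comment can be sketched outside pandas. The following is a simplified, standard-library-only illustration, not the code from this PR: `LengthMismatchWarning`, `is_blank`, and `check_data_length` are hypothetical names, columns are plain lists of strings, `None` and `""` stand in for "missing", and the `index_col` condition from the real check is left out.

```python
import warnings
from typing import List, Optional, Sequence


class LengthMismatchWarning(UserWarning):
    """Hypothetical stand-in for pandas.errors.ParserWarning."""


def is_blank(value: Optional[str]) -> bool:
    # Treat None and "" as "no data", loosely mirroring isna() / == "".
    return value is None or value == ""


def check_data_length(
    columns: Sequence[str], data: List[List[Optional[str]]]
) -> None:
    """Warn when the column count of `data` differs from `columns`, unless
    the only surplus column is entirely blank (one set of trailing commas)."""
    if not columns or len(columns) == len(data):
        return
    one_extra = len(columns) == len(data) - 1
    if one_extra and all(is_blank(v) for v in data[-1]):
        return  # trailing commas only: nothing real is dropped
    warnings.warn(
        "Length of header or names does not match length of data.",
        LengthMismatchWarning,
        stacklevel=2,
    )


names = ["a", "b", "c"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Column-wise data with an all-blank fourth column (trailing commas).
    check_data_length(
        names, [["4", "8"], ["apple", "orange"], ["bat", "cow"], ["", ""]]
    )
    trailing_only = len(caught)
    # Fourth column holds a real value that would be dropped.
    check_data_length(names, [["4"], ["apple"], ["bat"], ["bam"]])
    with_real_data = len(caught) - trailing_only

print(trailing_only, with_real_data)
```

The early return for the all-blank surplus column is what lets a single set of trailing commas pass silently while still flagging rows that carry real extra fields.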

Contributor

have trailing commas but data which will be dropped.

OK. Ideally we should put these kinds of checks in the same place where the other check happens, if possible.

Member Author

Bad wording on my part. By "caught" I meant that if we have more columns than len(data), the extra columns are inserted as all NaN.
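
A small sketch of the opposite case mentioned here, where more names are supplied than the data has fields. The sample data and the names `a`, `b`, `c` are made up for illustration and are not taken from the PR.

```python
from io import StringIO

import pandas as pd

# Two fields per row in the data, but three names supplied.
df = pd.read_csv(StringIO("1,2\n3,4"), header=None, names=["a", "b", "c"])

print(df)
# The surplus name "c" has no data behind it, so the column is all NaN.
print(df["c"].isna().all())
```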

            if len(columns) == len(data) - 1 and np.all(
                (is_object_dtype(data[-1]) and data[-1] == "") | isna(data[-1])
            ):
                return
            warnings.warn(
                "Length of header or names does not match length of data. This leads "
                "to a loss of data with index_col=False.",
                ParserWarning,
                stacklevel=6,
            )


class CParserWrapper(ParserBase):
    def __init__(self, src: FilePathOrBuffer, **kwds):
@@ -2128,6 +2150,8 @@ def read(self, nrows=None):

            # columns as list
            alldata = [x[1] for x in data]
            if self.usecols is None:
                self._check_data_length(names, alldata)

            data = {k: v for k, (i, v) in zip(names, data)}

@@ -2516,6 +2540,8 @@ def _exclude_implicit_index(self, alldata):
        if self._col_indices is not None and len(names) != len(self._col_indices):
            names = [names[i] for i in sorted(self._col_indices)]

        self._check_data_length(names, alldata)

        return {name: alldata[i + offset] for i, name in enumerate(names)}, names

    # legacy
Empty file.
5 changes: 3 additions & 2 deletions pandas/tests/io/parser/common/test_common_basic.py
@@ -11,7 +11,7 @@
import pytest

from pandas._libs.tslib import Timestamp
from pandas.errors import EmptyDataError, ParserError
from pandas.errors import EmptyDataError, ParserError, ParserWarning

from pandas import DataFrame, Index, Series, compat
import pandas._testing as tm
@@ -660,7 +660,8 @@ def test_no_header_two_extra_columns(all_parsers):
    ref = DataFrame([["foo", "bar", "baz"]], columns=column_names)
    stream = StringIO("foo,bar,baz,bam,blah")
    parser = all_parsers
    df = parser.read_csv(stream, header=None, names=column_names, index_col=False)
    with tm.assert_produces_warning(ParserWarning):
        df = parser.read_csv(stream, header=None, names=column_names, index_col=False)
    tm.assert_frame_equal(df, ref)
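
Outside the test suite, the same scenario can be reproduced with plain `pd.read_csv`. This is a sketch under assumptions: the values of `column_names` are not shown in this hunk, so the three names below are placeholders, and the warning is only expected on pandas versions that include this change.

```python
import warnings
from io import StringIO

import pandas as pd
from pandas.errors import ParserWarning

# Placeholder names; the hunk above does not show the real column_names.
column_names = ["one", "two", "three"]
stream = StringIO("foo,bar,baz,bam,blah")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Five fields, three names: the last two fields are dropped.
    df = pd.read_csv(stream, header=None, names=column_names, index_col=False)

print(df)
# Reports whether a ParserWarning was recorded for the dropped fields.
print(any(issubclass(w.category, ParserWarning) for w in caught))
```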

