[SVCS-531] Separate csv and tsv function and remove use of sniff #285

Conversation

AddisonSchiller
Contributor

refs: https://openscience.atlassian.net/browse/SVCS-531

Purpose

csv.Sniffer().sniff() could cause random characters or spaces to be detected
as the delimiter. Separating these functions and using a hard-coded
dialect fixes this display problem.
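
For context, a minimal stdlib illustration of the failure mode (the sample data below is made up):

import csv

sample = 'name age city\nalice 30 nyc\nbob 25 boston\n'

# The sniffer guesses the delimiter from the data and settles on a space.
print(repr(csv.Sniffer().sniff(sample).delimiter))               # ' '

# Hard-coded dialects pin the delimiter instead.
print(repr(csv.excel.delimiter), repr(csv.excel_tab.delimiter))  # ',' '\t'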

Summary of changes

Broke the combined csv/tsv function into two new functions, with the bulk of the logic in a shared helper.
Instead of using csv.Sniffer().sniff(), the csv and tsv renderers now set the dialect directly to csv.excel or csv.excel_tab. This stops things like spaces, letters, and numbers from being detected as the delimiter.
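
A rough sketch of the shape of that change (the helper name and the exact return format here are assumptions, not the PR's actual code):

import csv


def csv_stdlib(fp):
    # Comma-delimited: always use the excel dialect instead of sniffing.
    return _render_stdlib(fp, csv.excel)


def tsv_stdlib(fp):
    # Tab-delimited: always use the excel_tab dialect instead of sniffing.
    return _render_stdlib(fp, csv.excel_tab)


def _render_stdlib(fp, dialect):
    # Shared helper: read with a fixed dialect and return headers plus rows.
    fp.seek(0)
    reader = csv.DictReader(fp, dialect=dialect)
    return reader.fieldnames, [row for row in reader]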

QA/Testing notes

There is a zip file of csv/tsv files on the JIRA ticket that reproduce the error. Trying them on staging/prod will show what the error looks like. These display errors should no longer appear with these changes.

Contributor

@cslzchen cslzchen left a comment

As discussed, looks good. I will test it locally. Double-check the tests.

@coveralls

coveralls commented Nov 16, 2017

Coverage Status

Coverage increased (+0.2%) to 68.227% when pulling 5704d33 on AddisonSchiller:feature/tsv-svcs-531 into 8bb2dd4 on CenterForOpenScience:develop.

@AddisonSchiller
Contributor Author

@cslzchen Added a new test file to look over.

@coveralls

coveralls commented Nov 16, 2017

Coverage Status

Coverage increased (+0.2%) to 68.227% when pulling 9554e49 on AddisonSchiller:feature/tsv-svcs-531 into 8bb2dd4 on CenterForOpenScience:develop.

Contributor

@cslzchen cslzchen left a comment

As discussed, there are some error-handling issues (not from your code, though) that are worth taking a look at. 🔥 🔥

@coveralls

coveralls commented Nov 22, 2017

Coverage Status

Coverage increased (+0.2%) to 68.245% when pulling 50fd471 on AddisonSchiller:feature/tsv-svcs-531 into 8bb2dd4 on CenterForOpenScience:develop.

@coveralls

coveralls commented Nov 22, 2017

Coverage Status

Coverage increased (+0.3%) to 68.336% when pulling 20405f6 on AddisonSchiller:feature/tsv-svcs-531 into 8bb2dd4 on CenterForOpenScience:develop.

@coveralls

coveralls commented Nov 22, 2017

Coverage Status

Coverage increased (+0.3%) to 68.336% when pulling 93235e1 on AddisonSchiller:feature/tsv-svcs-531 into 8bb2dd4 on CenterForOpenScience:develop.

Contributor

@cslzchen cslzchen left a comment

Looks good and works as expected. 🎆 🎆 Move to PCR.

@coveralls

coveralls commented Nov 22, 2017

Coverage Status

Coverage increased (+0.3%) to 68.336% when pulling b1083c5 on AddisonSchiller:feature/tsv-svcs-531 into 8bb2dd4 on CenterForOpenScience:develop.

Member

@felliott felliott left a comment

@TomBaxter, sending this one to you. Needs some fixups:

  • some .seek()s need to switch back to .read()s
  • what are we losing by removing sniffing from the csv detector?
  • remove quoting for tsv
  • minor error handling de-duplication

    :return: tuple of table headers and data
    """
-   data = fp.read(2048)
+   data = fp.seek(2048)
Member

seek just returns the offset the pointer was advanced to. This should probably be read.
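
For reference, a quick illustration of the difference (the file name is hypothetical):

with open('example.csv') as fp:
    position = fp.seek(2048)  # returns the new stream position (an int), not file content
    fp.seek(0)
    data = fp.read(2048)      # returns up to 2048 characters of the file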

Member

Complete. PR308


def tsv_stdlib(fp):
    data = fp.seek(2048)
Member

seek => read

Member

Complete. PR308

        dialect = csv.Sniffer().sniff(data)
    except csv.Error:
        dialect = csv.excel
    else:
        _set_dialect_quote_attrs(dialect, data)
Member

I don't think I've ever seen a tsv with quoting in it. Has anyone else? Maybe we leave quoting alone until it's reported as an issue.
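
For what it's worth, a small stdlib check of what leaving quoting alone would mean: csv.excel_tab already uses QUOTE_MINIMAL, so standard quoted fields in a tsv parse fine without any extra quote handling.

import csv
import io

row = next(csv.reader(io.StringIO('a\t"b c"\td\n'), dialect=csv.excel_tab))
print(row)  # ['a', 'b c', 'd']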

Member

Complete. PR308

    # on certain exceptions
    except Exception as e:
        raise TabularRendererError('Cannot render file as csv/tsv. '
            'The file may be empty or corrupt',
Member

Nitpick: indentation is weird here.

            'The file may be empty or corrupt',
            code=HTTPStatus.BAD_REQUEST,
            extension='csv') from e

Member

Since this is identical to the error raised in the next stanza, could we just throw the error instead?

        rows = [row for row in reader]
    except csv.Error as e:
        if any("field larger than field limit" in errorMsg for errorMsg in e.args):
            raise TabularRendererError(
                'This file contains a field too large to render. '
                'Please download and view it locally.',
-               code=400,
+               code=HTTPStatus.BAD_REQUEST,
                extension='csv',
Member

Since both the csv and tsv parser call this, can we make sure the correct extension is being passed?

Member

Complete. PR308

fp.seek(0)
# set the dialect instead of sniffing for it.
# sniffing can cause things like spaces or characters to be the delimiter
dialect = csv.excel
Member

Hmmm. I'm not sure how I feel about this. I like that it solves the issue of really long lines, but it bugs me a bit that we're throwing out support for alternative delimiters. If we use the csv.excel dialect, do we still support tab- and pipe-delimited text? If not, can we document that in a comment, so we'll know what to fix if we encounter it?
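
To make the trade-off concrete, a small stdlib check (the sample line is made up): with a fixed csv.excel dialect, pipe-delimited text comes back as a single column, whereas the sniffer would have detected the delimiter. Tab-delimited files are still covered by the separate tsv function; pipe-delimited files would no longer be.

import csv
import io

piped = 'a|b|c\nd|e|f\n'

# Fixed dialect: the whole line is one field.
print(next(csv.reader(io.StringIO(piped), dialect=csv.excel)))  # ['a|b|c']

# Sniffing: '|' is detected as the delimiter.
print(next(csv.reader(io.StringIO(piped), dialect=csv.Sniffer().sniff(piped))))  # ['a', 'b', 'c']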

Member

Complete. PR308

Member

@TomBaxter TomBaxter left a comment

This PR has been merged in. Complete. PR308

        rows = [row for row in reader]
    except csv.Error as e:
        if any("field larger than field limit" in errorMsg for errorMsg in e.args):
            raise TabularRendererError(
                'This file contains a field too large to render. '
                'Please download and view it locally.',
-               code=400,
+               code=HTTPStatus.BAD_REQUEST,
                extension='csv',
Member

Complete. PR308

    :return: tuple of table headers and data
    """
-   data = fp.read(2048)
+   data = fp.seek(2048)
Member

Complete. PR308

fp.seek(0)
# set the dialect instead of sniffing for it.
# sniffing can cause things like spaces or characters to be the delimiter
dialect = csv.excel
Member

Complete. PR308


def tsv_stdlib(fp):
    data = fp.seek(2048)
Member

Complete. PR308

        dialect = csv.Sniffer().sniff(data)
    except csv.Error:
        dialect = csv.excel
    else:
        _set_dialect_quote_attrs(dialect, data)
Member

Complete. PR308

@cslzchen
Contributor

cslzchen commented Jan 5, 2018

PR closed and replaced by #308.

@cslzchen cslzchen closed this Jan 5, 2018