[SVCS-531] Separate csv and tsv function and remove use of sniff #285
Conversation
As discussed, looks good. I will test it locally.
Double check tests.
@cslzchen Added new test file to look over
As discussed, there are some error handling issues (not from your code, though), but they are worth taking a look at. 🔥 🔥
Csv.sniff could cause random characters or spaces to be used as the delimiter. Separating these functions and using a hard coded dialect fixes this display problem.
Force-pushed from 94d40a2 to 93235e1
Looks good and works as expected. 🎆 🎆 Move to PCR.
Force-pushed from d9dc5f3 to b1083c5
@TomBaxter, sending this one to you. Needs some fixups:
- some .seek()s need to switch back to .read()s
- what are we losing by removing sniffing from the csv detector?
- remove quoting for tsv
- minor error handling de-duplication
    :return: tuple of table headers and data
    """
-   data = fp.read(2048)
+   data = fp.seek(2048)
seek just returns the offset the pointer was advanced to. This should probably be read.
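The reviewer's point can be seen in a minimal sketch, using io.StringIO as a stand-in for the file pointer: seek() only reports a stream position, while read() returns the data the parser actually needs.

```python
import io

fp = io.StringIO("col1,col2\n1,2\n")

# seek() returns the new stream position (an int), not file contents
offset = fp.seek(2048)

# read() returns the actual text needed for parsing
fp.seek(0)
data = fp.read(2048)
```

Assigning the result of fp.seek(2048) to data, as the diff does, hands the parser an integer instead of the file's contents.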
Complete. PR308
def tsv_stdlib(fp):
    data = fp.seek(2048)
seek => read
Complete. PR308
    dialect = csv.Sniffer().sniff(data)
except csv.Error:
    dialect = csv.excel
else:
    _set_dialect_quote_attrs(dialect, data)
I don't think I've ever seen a tsv with quoting in it. Has anyone else? Maybe we leave quoting alone until it's reported as an issue.
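For reference, the stdlib's built-in tab dialect handles a plain tsv without any custom quote handling; a minimal sketch:

```python
import csv
import io

# csv.excel_tab is the stdlib's tab-delimited dialect; quoting is left
# at its default rather than adjusted by a helper
fp = io.StringIO('name\tvalue\nfoo\t42\n')
reader = csv.reader(fp, dialect=csv.excel_tab)
rows = list(reader)
```

This supports the reviewer's suggestion: unless quoted tsv files show up in practice, _set_dialect_quote_attrs may not be needed on the tsv path.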
Complete. PR308
# on certain exceptions
except Exception as e:
    raise TabularRendererError('Cannot render file as csv/tsv. '
                               'The file may be empty or corrupt',
Nitpick: indentation is weird here.
                               'The file may be empty or corrupt',
                               code=HTTPStatus.BAD_REQUEST,
                               extension='csv') from e
Since this is identical to the error raised in the next stanza, could we just throw the error instead?
    rows = [row for row in reader]
except csv.Error as e:
    if any("field larger than field limit" in errorMsg for errorMsg in e.args):
        raise TabularRendererError(
            'This file contains a field too large to render. '
            'Please download and view it locally.',
-           code=400,
+           code=HTTPStatus.BAD_REQUEST,
            extension='csv',
Since both the csv and tsv parser call this, can we make sure the correct extension is being passed?
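One way to address this is to thread the caller's extension through the shared helper instead of hard-coding 'csv'. This is a hypothetical sketch: the function names and the local TabularRendererError stand-in are illustrative, not the renderer's actual API.

```python
from http import HTTPStatus

# Stand-in for the renderer's real TabularRendererError (illustrative only)
class TabularRendererError(Exception):
    def __init__(self, message, code, extension):
        super().__init__(message)
        self.code = code
        self.extension = extension

def _read_rows(reader, extension):
    try:
        return [row for row in reader]
    except Exception as e:
        raise TabularRendererError(
            'This file contains a field too large to render. '
            'Please download and view it locally.',
            code=HTTPStatus.BAD_REQUEST,
            extension=extension,  # 'csv' or 'tsv', passed by the caller
        ) from e

def csv_rows(reader):
    return _read_rows(reader, extension='csv')

def tsv_rows(reader):
    return _read_rows(reader, extension='tsv')
```

With the extension parameterized, an error raised while parsing a tsv reports extension='tsv' rather than always claiming 'csv'.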
Complete. PR308
fp.seek(0)
# set the dialect instead of sniffing for it.
# sniffing can cause things like spaces or characters to be the delimiter
dialect = csv.excel
Hmmm. I'm not sure how I feel about this. I like that it solves the issue of really long lines, but it bugs me a bit that we're throwing out support for alternative delimiters. If we use the csv.excel dialect, do we still support tab- and pipe-delimited text? If not, can we document that in a comment, so we'll know what to fix if we encounter it?
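The trade-off the reviewer raises can be demonstrated with the stdlib directly: the hard-coded csv.excel dialect only splits on commas, so a tab-delimited line comes back as a single field.

```python
import csv
import io

# With csv.excel, only commas are treated as delimiters; a tab-separated
# line is returned as one unsplit field
tab_line = io.StringIO('a\tb\tc\n')
rows = list(csv.reader(tab_line, dialect=csv.excel))

# A comma-separated line splits as expected
comma_line = io.StringIO('a,b,c\n')
rows2 = list(csv.reader(comma_line, dialect=csv.excel))
```

So routing tsv files through a function that uses csv.excel_tab keeps tab support, but pipe-delimited text would no longer be detected without sniffing.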
Complete. PR308
Complete. This PR has been merged in PR308.
PR closed and replaced by #308.
refs: https://openscience.atlassian.net/browse/SVCS-531
Purpose
csv.Sniffer().sniff could cause random characters or spaces to be used as the delimiter. Separating these functions and using a hard-coded dialect fixes this display problem.
Summary of changes
Broke up the csv/tsv function into two new ones, with the bulk of the logic in a helper function.
Instead of using csv.Sniffer().sniff, just use csv.excel or csv.excel_tab to set the delimiter. This stops things like spaces, characters, numbers, etc. from being detected as the delimiter.
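The shape of that change can be sketched as follows; the helper name _parse is illustrative, not necessarily the one used in the PR.

```python
import csv
import io

# Shared helper: parse with an explicit, hard-coded dialect
def _parse(fp, dialect):
    fp.seek(0)
    reader = csv.reader(fp, dialect=dialect)
    return [row for row in reader]

def csv_stdlib(fp):
    # comma-delimited dialect instead of csv.Sniffer().sniff()
    return _parse(fp, csv.excel)

def tsv_stdlib(fp):
    # tab-delimited dialect
    return _parse(fp, csv.excel_tab)
```

Each entry point now pins its delimiter up front, so the sniffer can never pick a space, digit, or other stray character as the delimiter.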
QA/Testing notes
There is a zip file of csv/tsv files on the JIRA ticket that demonstrate the error. Trying them on staging/prod will show you what the error looks like. These display errors should not be present with these changes.