[SVCS-531] Separate csv and tsv function and remove use of sniff #285
Conversation
As discussed, looks good. I will test it locally.
Double check tests.
@cslzchen Added new test file to look over
As discussed, there are some error handling issues (not from your code, though), but they are worth taking a look at. 🔥 🔥
Csv.sniff could cause random characters or spaces to be used as the delimiter. Separating these functions and using a hard coded dialect fixes this display problem.
Force-pushed from 94d40a2 to 93235e1
Looks good and works as expected. 🎆 🎆 Move to PCR.
Force-pushed from d9dc5f3 to b1083c5
@TomBaxter, sending this one to you. Needs some fixups:
- some .seek()s need to switch back to .read()s
- what are we losing by removing sniffing from the csv detector?
- remove quoting for tsv
- minor error handling de-duplication
    :return: tuple of table headers and data
    """
-   data = fp.read(2048)
+   data = fp.seek(2048)
seek just returns the offset the pointer was advanced to. This should probably be read.
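The reviewer's point can be seen in a minimal sketch, using io.StringIO as a stand-in for the file pointer: seek() only reports a stream position, while read() returns the data the parser actually needs.

```python
import io

fp = io.StringIO("col1,col2\n1,2\n")

# seek() returns the new stream position (an int), not file contents
offset = fp.seek(2048)

# read() returns the actual text needed for parsing
fp.seek(0)
data = fp.read(2048)
```

Assigning the result of fp.seek(2048) to data, as the diff does, hands the parser an integer instead of the file's contents.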
Complete. PR308
def tsv_stdlib(fp):
    data = fp.seek(2048)
seek => read
Complete. PR308
    dialect = csv.Sniffer().sniff(data)
except csv.Error:
    dialect = csv.excel
else:
    _set_dialect_quote_attrs(dialect, data)
I don't think I've ever seen a tsv with quoting in it. Has anyone else? Maybe we leave quoting alone until it's reported as an issue.
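For reference, the stdlib's built-in tab dialect handles a plain tsv without any custom quote handling; a minimal sketch:

```python
import csv
import io

# csv.excel_tab is the stdlib's tab-delimited dialect; quoting is left
# at its default rather than adjusted by a helper
fp = io.StringIO('name\tvalue\nfoo\t42\n')
reader = csv.reader(fp, dialect=csv.excel_tab)
rows = list(reader)
```

This supports the reviewer's suggestion: unless quoted tsv files show up in practice, _set_dialect_quote_attrs may not be needed on the tsv path.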
Complete. PR308
# on certain exceptions
except Exception as e:
    raise TabularRendererError('Cannot render file as csv/tsv. '
                               'The file may be empty or corrupt',
Nitpick: indentation is weird here.
                               'The file may be empty or corrupt',
                               code=HTTPStatus.BAD_REQUEST,
                               extension='csv') from e
Since this is identical to the error raised in the next stanza, could we just throw the error instead?
    rows = [row for row in reader]
except csv.Error as e:
    if any("field larger than field limit" in errorMsg for errorMsg in e.args):
        raise TabularRendererError(
            'This file contains a field too large to render. '
            'Please download and view it locally.',
-           code=400,
+           code=HTTPStatus.BAD_REQUEST,
            extension='csv',
Since both the csv and tsv parser call this, can we make sure the correct extension is being passed?
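One way to address this is to thread the caller's extension through the shared helper instead of hard-coding 'csv'. This is a hypothetical sketch: the function names and the local TabularRendererError stand-in are illustrative, not the renderer's actual API.

```python
from http import HTTPStatus

# Stand-in for the renderer's real TabularRendererError (illustrative only)
class TabularRendererError(Exception):
    def __init__(self, message, code, extension):
        super().__init__(message)
        self.code = code
        self.extension = extension

def _read_rows(reader, extension):
    try:
        return [row for row in reader]
    except Exception as e:
        raise TabularRendererError(
            'This file contains a field too large to render. '
            'Please download and view it locally.',
            code=HTTPStatus.BAD_REQUEST,
            extension=extension,  # 'csv' or 'tsv', passed by the caller
        ) from e

def csv_rows(reader):
    return _read_rows(reader, extension='csv')

def tsv_rows(reader):
    return _read_rows(reader, extension='tsv')
```

With the extension parameterized, an error raised while parsing a tsv reports extension='tsv' rather than always claiming 'csv'.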
Complete. PR308
fp.seek(0)
# set the dialect instead of sniffing for it.
# sniffing can cause things like spaces or characters to be the delimiter
dialect = csv.excel
Hmmm. I'm not sure how I feel about this. I like that it solves the issue of really long lines, but it bugs me a bit that we're throwing out support for alternative delimiters. If we use the csv.excel dialect, do we still support tab- and pipe-delimited text? If not, can we document that in a comment, so we'll know what to fix if we encounter it?
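The trade-off the reviewer raises can be demonstrated with the stdlib directly: the hard-coded csv.excel dialect only splits on commas, so a tab-delimited line comes back as a single field.

```python
import csv
import io

# With csv.excel, only commas are treated as delimiters; a tab-separated
# line is returned as one unsplit field
tab_line = io.StringIO('a\tb\tc\n')
rows = list(csv.reader(tab_line, dialect=csv.excel))

# A comma-separated line splits as expected
comma_line = io.StringIO('a,b,c\n')
rows2 = list(csv.reader(comma_line, dialect=csv.excel))
```

So routing tsv files through a function that uses csv.excel_tab keeps tab support, but pipe-delimited text would no longer be detected without sniffing.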
Complete. PR308
Complete. This PR has been merged in PR308.
PR closed and replaced by #308.
refs: https://openscience.atlassian.net/browse/SVCS-531
Purpose
csv.Sniffer().sniff could cause random characters or spaces to be used as the delimiter. Separating these functions and using a hard-coded dialect fixes this display problem.
Summary of changes
Broke up the csv/tsv function into two new ones, with the bulk of the logic in a helper function.
Instead of using csv.Sniffer().sniff, just use csv.excel or csv.excel_tab to set the delimiter. This stops things like spaces, characters, numbers, etc. from being detected as the delimiter.
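The shape of that change can be sketched as follows; the helper name _parse is illustrative, not necessarily the one used in the PR.

```python
import csv
import io

# Shared helper: parse with an explicit, hard-coded dialect
def _parse(fp, dialect):
    fp.seek(0)
    reader = csv.reader(fp, dialect=dialect)
    return [row for row in reader]

def csv_stdlib(fp):
    # comma-delimited dialect instead of csv.Sniffer().sniff()
    return _parse(fp, csv.excel)

def tsv_stdlib(fp):
    # tab-delimited dialect
    return _parse(fp, csv.excel_tab)
```

Each entry point now pins its delimiter up front, so the sniffer can never pick a space, digit, or other stray character as the delimiter.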
QA/Testing notes
There is a zip file of csv/tsv files on the JIRA ticket that demonstrate the error. Trying them on staging/prod will show you what the error looks like. These display errors should not be present with these changes.