Skip to content

Fix: Allow \r in unquoted fields when row separator doesn't contain \r #346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jsxs0
Copy link
Contributor

@jsxs0 jsxs0 commented Jun 5, 2025

Fixes #60

This has been bugging me for a while - the CSV parser was rejecting \r characters in unquoted fields even when the row separator was something completely different like \n or a custom separator.

For example, this would fail unnecessarily:

CSV.parse("field1,field\rwith\rcr,field3\n", row_sep: "\n")

The problem was in prepare_unquoted where we were hardcoding "\r\n" instead of checking what the actual row separator was.

What changed:

  • Now we only exclude characters that are actually part of the row separator
  • If your row separator is \n, then \r is allowed in unquoted fields
  • If your row separator is \r\n, then both \r and \n are still properly excluded
  • Quoted fields work exactly the same as before

Testing:

  • Updated the tests that were expecting the old behavior
  • Added comprehensive tests for different row separator scenarios
  • All existing tests still pass

This makes the parser more flexible while keeping it safe for the cases where \r should actually be restricted.

no_unquoted_values = "\r\n".encode(@encoding)
# Only exclude characters that are actually part of the row separator
# instead of hardcoding "\r\n"
row_separator_chars = @row_separator.chars.map { |c| Regexp.escape(c) }.join
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need Regexp.escape here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need chars.map.join?
Why can't we use Regexp.escape(@row_separator)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Using Regexp.escape(@row_separator)? now. Thanks!

@jsxs0 jsxs0 requested a review from kou June 5, 2025 07:09
Comment on lines 678 to 680
# Only exclude characters that are actually part of the row separator
# instead of hardcoding "\r\n"
no_unquoted_values = Regexp.escape(@row_separator).encode(@encoding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix indent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

@jsxs0 jsxs0 requested a review from kou June 5, 2025 07:35
@@ -139,27 +139,24 @@ def test_non_regex_edge_cases
end

def test_malformed_csv_cr_first_line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update test name?

Comment on lines 142 to 143
# With the fix for accepting \r without quote when row separator doesn't include \r,
# this should now parse successfully when row_sep is "\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this comment with suitable test name.

Suggested change
# With the fix for accepting \r without quote when row separator doesn't include \r,
# this should now parse successfully when row_sep is "\n"

# With the fix for accepting \r without quote when row separator doesn't include \r,
# this should now parse successfully when row_sep is "\n"
result = CSV.parse_line("1,2\r,3", row_sep: "\n")
assert_equal(["1", "2\r", "3"], result)
end

def test_malformed_csv_cr_middle_line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Comment on lines 149 to 150
# With the fix for accepting \r without quote when row separator doesn't include \r,
# this should now parse successfully (default row_sep is "\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

# this should now parse successfully (default row_sep is "\n")
result = CSV.parse("\n" + "\r")
# This should parse as an empty first row and a second row with just "\r"
assert_equal([[], ["\r"]], result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case is for invalid cases.
Can we move this to other test case?


def test_reject_cr_when_row_separator_includes_cr
# When row separator includes \r (like \r\n), \r should still be rejected in unquoted fields
data = "field1,field2,field3\r\nrow2,data,here\r\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use invalid data something like field1\r,... for this case?

assert_equal(expected, CSV.parse(data, row_sep: "\r\n"))
end

def test_reject_cr_when_row_separator_is_cr_only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_reject_cr_when_row_separator_is_cr_only
def test_unquoted_cr_with_cr_row_separator


def test_reject_cr_when_row_separator_is_cr_only
# When row separator is just \r, \r should be rejected in unquoted fields
data = "field1,field2,field3\rrow2,data,here\r"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

assert_equal(expected, CSV.parse(data, row_sep: "|", liberal_parsing: true))
end

def test_quoted_fields_with_cr_and_custom_row_separator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_quoted_fields_with_cr_and_custom_row_separator
def test_quoted_cr_with_custom_row_separator

end

def test_liberal_parsing_with_custom_row_separator
# Test liberal parsing mode with custom row separator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this comment? I feel that test name is described well. So I feel that this is redundant.

@jsxs0 jsxs0 requested a review from kou June 5, 2025 08:23
@jsxs0
Copy link
Contributor Author

jsxs0 commented Jun 5, 2025

@kou I've made all the changes asked.

Comment on lines 8 to 15
def test_unquoted_cr_with_lf_row_separator
data = "field1,field\rwith\rcr,field3\nrow2,data,here\n"
expected = [
["field1", "field\rwith\rcr", "field3"],
["row2", "data", "here"]
]
assert_equal(expected, CSV.parse(data, row_sep: "\n"))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same (concept) test as the changed test in test_general.rb?
If so, we don't need this (or the test in test_general.rb).

assert_equal(expected, CSV.parse(data, row_sep: "|", liberal_parsing: true))
end

def test_quoted_cr_with_custom_row_separator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have test_quoted_cr... in TestUnquortedCR.

BTW, do we need to create test_unquoted_cr.rb? Can we move tests in this file to test_general.rb or something?

@jsxs0
Copy link
Contributor Author

jsxs0 commented Jun 5, 2025

@kou Thank you. I have:

  1. Removed the duplicate test
  2. Moved the unique tests to test_general.rb
  3. Fixed the misnamed test
  4. Deleted the separate test file

Comment on lines 159 to 171
def test_unquoted_cr_with_crlf_row_separator
data = "field1\r,field2,field3\r\nrow2,data,here\r\n"
assert_raise(CSV::MalformedCSVError) do
CSV.parse(data, row_sep: "\r\n")
end
assert_equal("Unquoted fields do not allow new line <\"\\r\"> in line 1.",
error.message)
end

def test_malformed_csv_cr_middle_line
csv = <<-CSV
line,1,abc
line,2,"def\nghi"
def test_unquoted_cr_rejected_when_included_in_row_separator
data = "field1,field\r2,field3\r\nrow2,data,here\r\n"
assert_raise(CSV::MalformedCSVError) do
CSV.parse(data, row_sep: "\r\n")
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they the same concept tests?
Can we remove test_unquoted_cr_rejected_when_included_in_row_separator?

Comment on lines 173 to 180
def test_liberal_parsing_with_unquoted_cr_and_custom_row_separator
data = "field1,field\rwith\rcr,field3|row2,data,here|"
expected = [
["field1", "field\rwith\rcr", "field3"],
["row2", "data", "here"]
]
assert_equal(expected, CSV.parse(data, row_sep: "|", liberal_parsing: true))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to test/csv/parse/test_liberal_parsing and remove the liberal_parsing_with_ part from the test name?

assert_equal(expected, CSV.parse(data, row_sep: "|"))
end

def test_unquoted_cr_rejected_when_included_in_row_separator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the same naming rule for parse error case.

Suggested change
def test_unquoted_cr_rejected_when_included_in_row_separator
def test_unquoted_cr_with_crlf_row_separator


def test_unquoted_cr_rejected_when_included_in_row_separator
data = "field1,field\r2,field3\r\nrow2,data,here\r\n"
assert_raise(CSV::MalformedCSVError) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also check the error message like other tests?

Suggested change
assert_raise(CSV::MalformedCSVError) do
message = "..."
assert_raise(CSV::MalformedCSVError.new(message, 1)) do

Comment on lines 678 to 679
# Only exclude characters that are actually part of the row separator
# instead of hardcoding "\r\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we can remove this comment.
I feel that it's useful for commit message (the PR description in this repository) because it describes why we do this change but it may not be useful for readers of new code. (Nobody will not try using "\r\n" here.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept \r without quote if row separator doesn't include \r
2 participants