Bug: Some Unicode symbols encoded as \uHexHexHexHex in JSON strings are not decoded properly #13

@lcn2

Is there an existing issue for this?

  • I have searched for existing issues and did not find anything like this

Describe the bug

Certain strings do not appear to be decoded properly by the jstrdecode(1) tool.

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.

Some background

Consider this JSON document containing just a JSON string:

"œßåé"

When you enter that JSON document into https://jsonlint.com/ it shows that the JSON is valid.

Moreover, the jparse(1) tool also shows that the JSON document is valid:

$ jparse -J 3 -s '"œßåé"'
JSON tree[3]:	lvl: 0	type: JTYPE_STRING	len{p,c:q}: 8	value:	"\xc5\x93\xc3\x9f\xc3\xa5\xc3\xa9"
in parse_json(): JSON debug[1]: valid JSON

It is GOOD that the JSON parser is happy with this JSON document.

FYI: Here is what hexdump(1) shows for the string (we include the enclosing double quotes and trailing newline for consistency with the JSON file):

$ echo '"œßåé"' | hexdump -C
00000000  22 c5 93 c3 9f c3 a5 c3  a9 22 0a                 |"........".|
0000000b

Here is how jstrencode(1) processes the string:

$ jstrencode '"œßåé"'
\"œßåé\"

This is CORRECT: the double quotes are backslashed (i.e., \") because within a JSON string, all double quotes MUST be backslashed.

Now if we use -Q, the enclosing double quotes are ignored:

$ jstrencode -Q '"œßåé"'
œßåé

If we consult https://codebeautify.org/json-encode-online and put "œßåé" on the input side, we see that the output side shows the same string.

Evidence of a decoding bug

Now, some JSON tools, such as jsp (see https://github.com/kjozsa/jsp), encode "œßåé" as:

$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"

When we put the "\u0153\u00df\u00e5\u00e9" string into the input side of https://codebeautify.org/json-encode-online, the output side shows "œßåé".

However if we give the "\u0153\u00df\u00e5\u00e9" string to jstrdecode(1), we get something odd:

$ jstrdecode "\u0153\u00df\u00e5\u00e9"
S���

Using hexdump(1) we see the expected decoded output:

$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009

However this is what jstrdecode(1) produces:

$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006

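This suggests that the JSON string decoding may be incorrect: the bytes 01 53 df e5 e9 are the raw code point values (0x0153, 0xdf, 0xe5, 0xe9) rather than their UTF-8 encodings.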
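
To make the discrepancy concrete, here is a minimal Python sketch that contrasts the two byte sequences. It is not the jstrdecode(1) implementation. The function names are hypothetical, and naive_bytes() is only an assumption about what might be going wrong, chosen because it reproduces the bytes seen in the hexdump above. The sketch handles only \uXXXX escapes for non-surrogate code points below 0x10000.

```python
import re

# Matches one JSON \uXXXX escape (four hex digits).
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def code_points(s):
    """Return the code points named by the \\uXXXX escapes in s."""
    return [int(h, 16) for h in _U_ESCAPE.findall(s)]


def utf8_encode(cp):
    """Encode one non-surrogate code point below 0x10000 as UTF-8 by hand."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def naive_bytes(cp):
    """Hypothetical buggy behavior: emit the code point value as raw bytes."""
    return bytes([cp >> 8, cp & 0xFF]) if cp > 0xFF else bytes([cp])


escaped = r"\u0153\u00df\u00e5\u00e9"
cps = code_points(escaped)
expected = b"".join(utf8_encode(cp) for cp in cps)
observed = b"".join(naive_bytes(cp) for cp in cps)
print(expected.hex(" "))  # c5 93 c3 9f c3 a5 c3 a9
print(observed.hex(" "))  # 01 53 df e5 e9
```

The first line of output matches the expected hexdump (minus the trailing newline), and the second matches what jstrdecode(1) currently produces.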

What you expect

We expect that the output of

jstrdecode "\u0153\u00df\u00e5\u00e9"

to be the same as:

echo 'œßåé'

Environment

- OS:SO
- Device:eciveD
- Compiler:relipmoC

bug_report.sh output

n/a

Anything else?

Consider the result of using -v 3:

$ jstrdecode -v 3 "\u0153\u00df\u00e5\u00e9"
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: processing arg: 0: <\u0153\u00df\u00e5\u00e9>
debug[3]: arg length: 24
debug[3]: decode length: 5
S���

The debug messages look OK, but the output is not:

$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006

which is not the same as the expected decoded output:

$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009

So it appears that strings with various \u<hex><hex><hex><hex> might not be decoded correctly.

UPDATE 0C

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.

UPDATE 0D - todo list

It appears that the decoding issue itself is resolved. However, a few things have to be done before this can be marked fixed.

  • We should make sure that an error or warning is issued when invalid UTF-8 codes are given.
  • We need to make sure that the unicode boolean is updated properly.
  • We need to fix the test cases in jstr_test.sh that had to be temporarily disabled.
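
As a rough illustration of the first item, the Python sketch below rejects one kind of invalid input: unpaired UTF-16 surrogates in \uXXXX escapes. Treating that as the meaning of "invalid" is an assumption made for this example, and decode_u_escapes() is a hypothetical helper, not a function from json_utf8.c. It only handles strings made up entirely of \uXXXX escapes.

```python
import re

# Matches one JSON \uXXXX escape (four hex digits).
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(s):
    """Decode a run of \\uXXXX escapes to a str, rejecting unpaired surrogates."""
    units = [int(h, 16) for h in _U_ESCAPE.findall(s)]
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
                i += 2
                continue
            raise ValueError("unpaired high surrogate \\u%04x" % u)
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate \\u%04x" % u)
        out.append(chr(u))
        i += 1
    return "".join(out)


# Valid input decodes and encodes to the expected UTF-8 bytes.
print(decode_u_escapes(r"\u0153\u00df\u00e5\u00e9").encode("utf-8").hex(" "))

# A lone high surrogate is reported instead of being passed through.
try:
    decode_u_escapes(r"\ud83d")
except ValueError as err:
    print("error:", err)
```

Whether such input should produce an error or only a warning is left open, as in the todo item above.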

Final todo:

  • Clean up the files json_utf8.c and json_utf8.h to include only things we need, making sure to keep credit.

I believe that with those done, as long as things look good, we can mark this as complete.

Side todo:

  • Sync this all to mkiocccentry (this can be done before this of course).
  • If all is good, update the IOCCC website repo to use jstrdecode instead of jsp.

Metadata

Assignees

No one assigned

    Labels

    Low priority (Low priority at this time), bug (Something isn't working)
