Description
Is there an existing issue for this?
- I have searched for existing issues and did not find anything like this
Describe the bug
Certain strings do not appear to be decoded properly by the jstrdecode(1) tool.
SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.
Some background
Consider this JSON document containing just a JSON string:
"œßåé"
When you enter that JSON document into https://jsonlint.com/ it shows that the JSON is valid.
Moreover, the jparse(1) tool also shows that the JSON document is valid:
$ jparse -J 3 -s '"œßåé"'
JSON tree[3]: lvl: 0 type: JTYPE_STRING len{p,c:q}: 8 value: "\xc5\x93\xc3\x9f\xc3\xa5\xc3\xa9"
in parse_json(): JSON debug[1]: valid JSON
It is GOOD that the JSON parser is happy with this JSON document.
FYI: Here is what hexdump(1) shows about the string (we include the enclosing double quotes and trailing newline for consistency with the JSON file):
$ echo '"œßåé"' | hexdump -C
00000000 22 c5 93 c3 9f c3 a5 c3 a9 22 0a |"........".|
0000000b
Here is how jstrencode(1) processes the string:
$ jstrencode '"œßåé"'
\"œßåé\"
This is CORRECT: the double quotes are backslashed (i.e., \") because within a JSON string, all double quotes MUST be backslashed.
Now if we use -Q, the enclosing double quotes are ignored:
$ jstrencode -Q '"œßåé"'
œßåé
If we consult https://codebeautify.org/json-encode-online, and we put "œßåé" on the input side, we see that the output side shows the same string.
Evidence of a decoding bug
Now, some JSON tools, such as jsp (see https://github.com/kjozsa/jsp), encode "œßåé" as:
$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"
When we put the "\u0153\u00df\u00e5\u00e9" string into the input side of https://codebeautify.org/json-encode-online, the output side shows "œßåé".
However if we give the "\u0153\u00df\u00e5\u00e9" string to jstrdecode(1), we get something odd:
$ jstrdecode "\u0153\u00df\u00e5\u00e9"
S���
Using hexdump(1) we see the expected decoded output:
$ echo 'œßåé' | hexdump -C
00000000 c5 93 c3 9f c3 a5 c3 a9 0a |.........|
00000009
However this is what jstrdecode(1) produces:
$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000 01 53 df e5 e9 0a |.S....|
00000006
This suggests that the JSON string decoding may be incorrect.
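The two hexdumps above can be compared with a small Python sketch. It is only an illustration, not the jstrdecode(1) code: the helper names `utf8_bytes` and `raw_bytes` are hypothetical, and `raw_bytes` is a guess at what the observed output corresponds to, namely each code point value written out as raw bytes instead of being UTF-8 encoded.

```python
# Code points taken from the escaped string "\u0153\u00df\u00e5\u00e9".
code_points = [0x0153, 0x00DF, 0x00E5, 0x00E9]


def utf8_bytes(cps):
    """Encode each code point as UTF-8 (the expected decoding)."""
    return "".join(chr(cp) for cp in cps).encode("utf-8")


def raw_bytes(cps):
    """Write each code point value as raw big-endian bytes, with no UTF-8
    encoding step. This is an assumption about the faulty behaviour."""
    out = bytearray()
    for cp in cps:
        if cp > 0xFF:
            out.append(cp >> 8)   # high byte, only when it is non-zero
        out.append(cp & 0xFF)     # low byte
    return bytes(out)


print(utf8_bytes(code_points).hex(" "))  # c5 93 c3 9f c3 a5 c3 a9
print(raw_bytes(code_points).hex(" "))   # 01 53 df e5 e9
```

The second line matches the bytes that jstrdecode(1) printed, which is consistent with the code points being emitted without a UTF-8 encoding step.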
What you expect
We expect the output of
jstrdecode "\u0153\u00df\u00e5\u00e9"
to be the same as:
echo 'œßåé'
Environment
- OS:SO
- Device:eciveD
- Compiler:relipmoC
bug_report.sh output
n/a
Anything else?
Consider the result of using -v 3:
$ jstrdecode -v 3 "\u0153\u00df\u00e5\u00e9"
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: processing arg: 0: <\u0153\u00df\u00e5\u00e9>
debug[3]: arg length: 24
debug[3]: decode length: 5
S���
The debug messages look OK, but the output is not:
$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000 01 53 df e5 e9 0a |.S....|
00000006
which is not the same as the expected decoded output:
$ echo 'œßåé' | hexdump -C
00000000 c5 93 c3 9f c3 a5 c3 a9 0a |.........|
00000009
So it appears that strings with various \u<hex><hex><hex><hex> escapes might not be decoded correctly.
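For reference, here is a minimal Python sketch of how \u<hex><hex><hex><hex> escapes are usually turned into UTF-8 bytes, including UTF-16 surrogate pairs. It is not the jstrdecode(1) implementation: the names `encode_utf8` and `decode_u_escapes` are hypothetical, and the sketch deliberately handles only \u escapes, ignoring other JSON escapes such as \n or \\.

```python
import re

# Matches one \uXXXX escape (four hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def encode_utf8(cp):
    """Encode a single Unicode code point as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_u_escapes(s):
    """Replace \\uXXXX escapes in s with UTF-8 bytes; copy other text as-is."""
    out = bytearray()
    i = 0
    while i < len(s):
        m = _ESCAPE.match(s, i)
        if m is None:
            out += s[i].encode("utf-8")   # ordinary character
            i += 1
            continue
        cp = int(m.group(1), 16)
        i = m.end()
        if 0xD800 <= cp <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            low = _ESCAPE.match(s, i)
            if low is None or not 0xDC00 <= int(low.group(1), 16) <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((cp - 0xD800) << 10) + (int(low.group(1), 16) - 0xDC00)
            i = low.end()
        elif 0xDC00 <= cp <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out += encode_utf8(cp)
    return bytes(out)


print(decode_u_escapes(r"\u0153\u00df\u00e5\u00e9").hex(" "))  # c5 93 c3 9f c3 a5 c3 a9
```

With this approach the string from this report decodes to the same bytes as the expected hexdump(1) output above.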
UPDATE 0C
SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.
UPDATE 0D - todo list
It appears that the decoding issue itself is resolved. However a few things have to be done before this can be marked fixed.
- We should make sure that an error or warning is issued when invalid UTF-8 codes are given.
- We need to make sure that the unicode boolean is updated properly.
- We need to fix the test cases in jstr_test.sh that had to be temporarily disabled.
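As a rough illustration of the first item, the following Python sketch reports whether a byte sequence is valid UTF-8. It relies on Python's strict UTF-8 decoder and is only meant to show the kind of check involved, not how the C code should do it. The function name `utf8_problem` and the message wording are hypothetical.

```python
def utf8_problem(data):
    """Return None if data is valid UTF-8, otherwise a short warning message."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return f"invalid UTF-8 at byte offset {err.start}: {err.reason}"
    return None


expected = bytes.fromhex("c593c39fc3a5c3a9")  # hexdump of 'œßåé' above
observed = bytes.fromhex("0153dfe5e9")        # hexdump of the faulty output above

for label, data in (("expected", expected), ("observed", observed)):
    problem = utf8_problem(data)
    print(label, "OK" if problem is None else problem)
```

Under this check the expected bytes pass, while the bytes from the faulty output are rejected, since 0xdf is not followed by a continuation byte.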
Final todo:
- Clean up the files json_utf8.c and json_utf8.h to include only things we need, making sure to keep credit.
I believe that with those done, as long as things look good, we can mark this as complete.
Side todo:
- Sync this all to mkiocccentry (this can be done before this of course).
- If all is good, update the IOCCC website repo to use jstrdecode instead of jsp.