Bug: Some Unicode symbols encoded as \uHexHexHexHex in JSON strings are not decoded properly

### Is there an existing issue for this?

- [X] I have searched for existing issues and did not find anything like this

### Describe the bug

Certain strings do not seem to be decoded properly by the `jstrdecode(1)` tool.

**SUGGESTION**: See [GH-issuecomment-2350854096](https://github.com/xexyl/jparse/issues/13#issuecomment-2350854096) for some updated commentary on this issue.

# Some background

Consider this JSON document containing just a JSON string:

```
"œßåé"
```

When you enter that JSON document into `https://jsonlint.com/`, it shows that the JSON is valid.

Moreover, the `jparse(1)` tool also shows that the JSON document is valid:

```sh
$ jparse -J 3 -s '"œßåé"'
JSON tree[3]:	lvl: 0	type: JTYPE_STRING	len{p,c:q}: 8	value:	"\xc5\x93\xc3\x9f\xc3\xa5\xc3\xa9"
in parse_json(): JSON debug[1]: valid JSON
```

It is **GOOD** that the JSON parser is happy with this JSON document.

FYI: Here is what `hexdump(1)` shows about the string (we include the enclosing double quotes and trailing newline for consistency with the JSON file):

```sh
$ echo '"œßåé"' | hexdump -C
00000000  22 c5 93 c3 9f c3 a5 c3  a9 22 0a                 |"........".|
0000000b
```
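As a cross-check on the hexdump above, here is a minimal Python sketch (not part of the `jparse` tools) that prints the UTF-8 bytes for each character of the sample string. The variable names are only illustrative.

```python
# Show the UTF-8 bytes behind each character of the sample string.
text = "œßåé"

for ch in text:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {encoded.hex(' ')}")

# The whole string, which should match the bytes between the
# double quotes (0x22) in the hexdump output.
utf8_hex = text.encode("utf-8").hex(" ")
print(utf8_hex)
```

Each of the four characters takes two bytes in UTF-8, which accounts for the length of 8 that `jparse(1)` reports.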

Here is how `jstrencode(1)` processes the string:

```sh
$ jstrencode '"œßåé"'
\"œßåé\"
```

This is **CORRECT**: the double quotes are backslashed (i.e., **\"**) because within a JSON string, all double quotes **MUST** be backslashed.

Now if we use `-Q`, the enclosing double quotes are ignored:

```sh
$ jstrencode -Q '"œßåé"'
œßåé
```

If we consult `https://codebeautify.org/json-encode-online` and put `"œßåé"` on the input side, we see that the output side shows the same string.

# Evidence of a decoding bug

Now, some JSON tools, such as `jsp` (see https://github.com/kjozsa/jsp), encode `"œßåé"` as:

```sh
$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"
```
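The same two spellings of the string can be reproduced with Python's standard `json` module. This is only an illustration of the escaped and unescaped forms being equivalent JSON; it says nothing about how `jsp` is implemented.

```python
import json

text = "œßåé"

# ensure_ascii=True (the default) escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(text)
print(escaped)

# ensure_ascii=False keeps the characters as they are.
literal = json.dumps(text, ensure_ascii=False)
print(literal)

# Both forms are valid JSON and decode to the same string.
assert json.loads(escaped) == json.loads(literal) == text
```

A conforming decoder should therefore treat `"\u0153\u00df\u00e5\u00e9"` and `"œßåé"` as the same JSON string.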

When we put the `"\u0153\u00df\u00e5\u00e9"` string into the input side of [https://codebeautify.org/json-encode-online](https://codebeautify.org/json-encode-online), the output side shows `"œßåé"`.

However, if we give the `"\u0153\u00df\u00e5\u00e9"` string to `jstrdecode(1)`, we get something odd:

```sh
$ jstrdecode "\u0153\u00df\u00e5\u00e9"
S���
```

Using `hexdump(1)`, we see the expected decoded output:

```sh
$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009
```

However this is what `jstrdecode(1)` produces:

```sh
$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006
```

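This suggests that the JSON string decoding may be incorrect: the output bytes `01 53 df e5 e9` are the raw code point values (U+0153, U+00DF, U+00E5, U+00E9) rather than their UTF-8 encodings.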
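To illustrate the difference between the two hexdumps, the Python sketch below compares a correct decode (escapes to code points, then code points to UTF-8) with a hypothetical faulty approach that writes each code point's numeric value as raw bytes. The helper `raw_code_point_bytes` is an assumption made up for this illustration; it happens to reproduce the observed bytes, but it is not taken from the `jstrdecode(1)` source.

```python
import json

escaped = '"\\u0153\\u00df\\u00e5\\u00e9"'

# Correct: decode the escapes to code points, then encode those as UTF-8.
decoded = json.loads(escaped)
expected = decoded.encode("utf-8")


def raw_code_point_bytes(s):
    """Hypothetical faulty approach: emit each code point's numeric value
    as big-endian bytes, skipping the UTF-8 encoding step entirely."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        # Minimum number of bytes needed to hold the value (at least 1).
        nbytes = (cp.bit_length() + 7) // 8 or 1
        out += cp.to_bytes(nbytes, "big")
    return bytes(out)


faulty = raw_code_point_bytes(decoded)
print(expected.hex(" "))
print(faulty.hex(" "))
```

The first line printed corresponds to the expected output and the second to what `jstrdecode(1)` produced, in both cases without the trailing newline (`0a`).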


### What you expect

We expect that the output of

```sh
jstrdecode "\u0153\u00df\u00e5\u00e9"
```

to be the same as:

```sh
echo 'œßåé'
```


### Environment

```markdown
- OS:SO
- Device:eciveD
- Compiler:relipmoC
```


### bug_report.sh output

n/a

### Anything else?

Consider the result of using `-v 3`:

```sh
$ jstrdecode -v 3 "\u0153\u00df\u00e5\u00e9"
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: processing arg: 0: <\u0153\u00df\u00e5\u00e9>
debug[3]: arg length: 24
debug[3]: decode length: 5
S���
```

The debug messages look OK, but the output is not:

```sh
$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006
```

which is not the same as the expected decoded output:

```sh
$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009
```

So it appears that strings containing various `\u<hex><hex><hex><hex>` escapes might not be decoded correctly.
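For reference, converting a decoded `\u<hex><hex><hex><hex>` code point into UTF-8 bytes follows the standard UTF-8 bit layout. The following is a minimal Python sketch of that step; the function name `utf8_encode` is hypothetical and this is not the code used in `json_utf8.c`.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out of range, or a surrogate, which has no UTF-8 encoding.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The four code points from the failing example.
result = b"".join(utf8_encode(cp) for cp in (0x0153, 0x00DF, 0x00E5, 0x00E9))
print(result.hex(" "))
```

Applied to the four code points in the example, this yields the same eight bytes as the expected `hexdump(1)` output, minus the trailing newline.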

## UPDATE 0C

**SUGGESTION**: See [GH-issuecomment-2350854096](https://github.com/xexyl/jparse/issues/13#issuecomment-2350854096) for some updated commentary on this issue.


## UPDATE 0D - todo list

It appears that the decoding issue itself is resolved. However a few things have to be done before this can be marked fixed.

- [x] We should make sure that an error or warning is issued when invalid UTF-8 codes are given.
- [x] We need to make sure that the unicode boolean is updated properly.
- [x] We need to fix the test cases in jstr_test.sh that had to be temporarily disabled.
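As one example of the kind of invalid input the first item might cover, the Python sketch below flags unpaired surrogate escapes, which cannot be converted to UTF-8. The function `check_unicode_escapes` is hypothetical, and for brevity it assumes the input contains no escaped backslashes (`\\`) ahead of a `u`; it does not reflect the checks `jstrdecode(1)` actually performs.

```python
import re

# Matches a \uXXXX escape; simplification: does not account for "\\u".
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def check_unicode_escapes(s):
    """Return a list of problems found in the \\uXXXX escapes of s."""
    problems = []
    units = [(m.start(), int(m.group(1), 16)) for m in _ESCAPE.finditer(s)]
    i = 0
    while i < len(units):
        pos, unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed at once by a low surrogate.
            if (i + 1 < len(units)
                    and units[i + 1][0] == pos + 6
                    and 0xDC00 <= units[i + 1][1] <= 0xDFFF):
                i += 2
                continue
            problems.append(f"lone high surrogate \\u{unit:04x} at offset {pos}")
        elif 0xDC00 <= unit <= 0xDFFF:
            problems.append(f"lone low surrogate \\u{unit:04x} at offset {pos}")
        i += 1
    return problems


print(check_unicode_escapes(r"\u0153\u00df\u00e5\u00e9"))
print(check_unicode_escapes(r"\ud83d\ude00"))
print(check_unicode_escapes(r"\ud83d"))
```

A tool could report each returned problem as a warning or treat any non-empty result as an error, depending on how strict it needs to be.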

### Final todo:

- [x] Clean up the files json_utf8.c and json_utf8.h to include only things we need, making sure to keep credit.

I believe that with those done, as long as things look good, we can mark this as complete. 

### Side todo:

- [x] Sync all of this to mkiocccentry (this can be done before the items above, of course).
- [x] If all is good, update the IOCCC website repo to use jstrdecode instead of jsp.


