If a Unicode character is split by container runtime, we should merge it when recombining #39653

h0cheung · 2025-04-25T09:37:33Z

Component(s)

receiver/filelog, pkg/stanza

Is your feature request related to a problem? Please describe.

The default log driver of container runtimes, such as Docker, Containerd etc., may split logs by bytes instead of runes.
So, a Unicode may be split into two logs.
For example, the original log by application:

<content with 16KB -1 bytes>方...

The Unicode of "方": \xE6\x96\xB9

The output if running in Containerd:

2025-01-01T00:00:00.000000000+00:00 stdout P <content with 16KB -1 bytes>\xE6
2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9...

note: "\x96" means a byte with value 0x96.

The output if running in Docker:

{"log":"<content with 16KB -1 bytes>\ufffd","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
{"log":"\ufffd\ufffd...","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}

The collected message:

<content with 16KB -1 bytes>���...

Describe the solution you'd like

For Docker, it seems that we can do nothing.
But for Containerd (maybe also CRI-O), we should try to merge the bytes to get the original Unicode.

Firstly, when decoding, all invalid bytes will be replaced with �. This can be prevented by setting encoding to nop. Maybe this should be default behavior.
Then, the regex parser, which is used by container parser, will run into error as it only support string. We should add support for []byte including invalid UTF-8 bytes.
Finally, after recombining by container parser, we should replace invalid bytes if still exist to ensure that the final output is valid UTF-8 string, and to prevent other operators running into error.

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

github-actions · 2025-04-25T09:37:57Z

Pinging code owners:

receiver/filelog: @djaglowski @andrzej-stencel
pkg/stanza: @djaglowski

See Adding Labels via Comments if you do not have permissions to add labels yourself.

andrzej-stencel · 2025-05-05T08:51:09Z

For Docker, it seems that we can do nothing.

Why cannot this be fixed when using Docker as container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?

h0cheung · 2025-05-06T02:34:48Z

For Docker, it seems that we can do nothing.

Why cannot this be fixed when using Docker as container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?

All bytes of the Unicode character has been replaced by \ufffd in the output. So we can't get the original bytes.

andrzej-stencel · 2025-05-13T20:08:24Z

For Docker, it seems that we can do nothing.

Why cannot this be fixed when using Docker as container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?

All bytes of the Unicode character has been replaced by \ufffd in the output. So we can't get the original bytes.

Oh now I see, you provided the example output, sorry I missed that. So Docker already replaces the split non-Unicode bytes and writes \ufffd to the files, while CRI-O and containerd write the original bytes.

…cing UTF-8 bytes with \uFFFD (open-telemetry#39661) #### Description `pkg/stanza` decodes input bytes using `unicode.UTF8`, which replaces any input bytes that are not part of a well-formed UTF-8 code sequence with `utf8.RuneError`. This replacement is not what we expect. The `Decoder` in `golang.org/x/text/encoding` is used to convert bytes to UTF-8. So, if the user specifies that the input encoding is compatible with UTF-8, we don't need to use `encoding.UTF8` and should use `encoding.Nop` to avoid `utf8.RuneError`. This PR introduces `utf8-raw` encoding. It behaves the same way as `encoding.Nop` but is differentiated from `nop` encoding which we treat in a special way. #### Link to tracking issue Fixes open-telemetry#39653 #### Testing - update test to ensure encoding not to replace invalid ut8 bytes - add a test to ensure recombine combine splited utf8 characters correctly --------- Co-authored-by: Curtis Robert <[email protected]> Co-authored-by: Andrzej Stencel <[email protected]>

h0cheung added enhancement New feature or request needs triage New item requiring triage labels Apr 25, 2025

github-actions bot added receiver/filelog pkg/stanza labels Apr 25, 2025

djaglowski removed the needs triage New item requiring triage label Apr 25, 2025

haoqixu mentioned this issue Apr 25, 2025

[pkg/stanza][textutils]: introduce utf8-raw encoding to avoid replacing UTF-8 bytes with \uFFFD #39661

Merged

github-actions bot mentioned this issue Apr 29, 2025

Weekly Report: 2025-04-22 - 2025-04-29 #39708

Closed

andrzej-stencel changed the title ~~If a Unicode character is splited by container runtime, we should merge it when recombing.~~ If a Unicode character is split by container runtime, we should merge it when recombining May 5, 2025

djaglowski closed this as completed in #39661 May 20, 2025

djaglowski closed this as completed in c37a07f May 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

If a Unicode character is split by container runtime, we should merge it when recombining #39653

If a Unicode character is split by container runtime, we should merge it when recombining #39653

h0cheung commented Apr 25, 2025 •

edited

Loading

github-actions bot commented Apr 25, 2025

Uh oh!

andrzej-stencel commented May 5, 2025

Uh oh!

h0cheung commented May 6, 2025 •

edited

Loading

Uh oh!

andrzej-stencel commented May 13, 2025

Uh oh!

If a Unicode character is split by container runtime, we should merge it when recombining #39653

If a Unicode character is split by container runtime, we should merge it when recombining #39653

Comments

h0cheung commented Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Component(s)

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

github-actions bot commented Apr 25, 2025

Uh oh!

andrzej-stencel commented May 5, 2025

Uh oh!

h0cheung commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrzej-stencel commented May 13, 2025

Uh oh!

h0cheung commented Apr 25, 2025 •

edited

Loading

h0cheung commented May 6, 2025 •

edited

Loading