Skip to content

If a Unicode character is split by container runtime, we should merge it when recombining #39653

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
h0cheung opened this issue Apr 25, 2025 · 4 comments · Fixed by #39661
Closed

Comments

@h0cheung
Copy link
Contributor

h0cheung commented Apr 25, 2025

Component(s)

receiver/filelog, pkg/stanza

Is your feature request related to a problem? Please describe.

The default log driver of container runtimes, such as Docker, Containerd etc., may split logs by bytes instead of runes.
So, a Unicode may be split into two logs.
For example, the original log by application:

<content with 16KB -1 bytes>方...

The Unicode of "方": \xE6\x96\xB9

The output if running in Containerd:

2025-01-01T00:00:00.000000000+00:00 stdout P <content with 16KB -1 bytes>\xE6
2025-01-01T00:00:00.000000000+00:00 stdout F \x96\xB9...
  • note: "\x96" means a byte with value 0x96.

The output if running in Docker:

{"log":"<content with 16KB -1 bytes>\ufffd","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}
{"log":"\ufffd\ufffd...","stream":"stdout","time":"2025-01-01T00:00:00.000000000Z"}

The collected message:

<content with 16KB -1 bytes>���...

Describe the solution you'd like

For Docker, it seems that we can do nothing.
But for Containerd (maybe also CRI-O), we should try to merge the bytes to get the original Unicode.

  • Firstly, when decoding, all invalid bytes will be replaced with �. This can be prevented by setting encoding to nop. Maybe this should be default behavior.
  • Then, the regex parser, which is used by container parser, will run into error as it only support string. We should add support for []byte including invalid UTF-8 bytes.
  • Finally, after recombining by container parser, we should replace invalid bytes if still exist to ensure that the final output is valid UTF-8 string, and to prevent other operators running into error.

Describe alternatives you've considered

No response

Additional context

No response

@h0cheung h0cheung added enhancement New feature or request needs triage New item requiring triage labels Apr 25, 2025
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski djaglowski removed the needs triage New item requiring triage label Apr 25, 2025
@andrzej-stencel andrzej-stencel changed the title If a Unicode character is splited by container runtime, we should merge it when recombing. If a Unicode character is split by container runtime, we should merge it when recombining May 5, 2025
@andrzej-stencel
Copy link
Member

For Docker, it seems that we can do nothing.

Why cannot this be fixed when using Docker as container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?

@h0cheung
Copy link
Contributor Author

h0cheung commented May 6, 2025

For Docker, it seems that we can do nothing.

Why cannot this be fixed when using Docker as container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?

All bytes of the Unicode character has been replaced by \ufffd in the output. So we can't get the original bytes.

@andrzej-stencel
Copy link
Member

For Docker, it seems that we can do nothing.

Why cannot this be fixed when using Docker as container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?

All bytes of the Unicode character has been replaced by \ufffd in the output. So we can't get the original bytes.

Oh now I see, you provided the example output, sorry I missed that. So Docker already replaces the split non-Unicode bytes and writes \ufffd to the files, while CRI-O and containerd write the original bytes.

dragonlord93 pushed a commit to dragonlord93/opentelemetry-collector-contrib that referenced this issue May 23, 2025
…cing UTF-8 bytes with \uFFFD (open-telemetry#39661)

#### Description
`pkg/stanza` decodes input bytes using `unicode.UTF8`, which replaces
any input bytes that are not part of a well-formed UTF-8 code sequence
with `utf8.RuneError`. This replacement is not what we expect.

The `Decoder` in `golang.org/x/text/encoding` is used to convert bytes
to UTF-8. So, if the user specifies that the input encoding is
compatible with UTF-8, we don't need to use `encoding.UTF8` and should
use `encoding.Nop` to avoid `utf8.RuneError`.

This PR introduces `utf8-raw` encoding. It behaves the same way as
`encoding.Nop` but is differentiated from `nop` encoding which we treat
in a special way.

#### Link to tracking issue
Fixes open-telemetry#39653

#### Testing
- update test to ensure encoding not to replace invalid ut8 bytes
- add a test to ensure recombine combine splited utf8 characters
correctly

---------

Co-authored-by: Curtis Robert <[email protected]>
Co-authored-by: Andrzej Stencel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants