If a Unicode character is split by container runtime, we should merge it when recombining #39653
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Why can't this be fixed when using Docker as the container runtime? We can recombine lines split by Docker just as well as logs split by CRI-O or containerd, right?
All bytes of the Unicode character have been replaced by \ufffd in the output, so we can't recover the original bytes.
Oh, now I see. You provided the example output; sorry, I missed that. So Docker already replaces the split non-Unicode bytes and writes
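To make the point about lost bytes concrete, here is a minimal Go sketch. It assumes a JSON-based log writer behaves like Go's `encoding/json`, which coerces strings to valid UTF-8 by replacing invalid bytes with the replacement rune. The helper `roundTrip` is hypothetical and is not Docker's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip encodes a raw fragment as a JSON string and decodes it again,
// roughly how a JSON-based log file might be written and later read.
// Hypothetical helper for illustration only.
func roundTrip(fragment []byte) string {
	encoded, err := json.Marshal(string(fragment))
	if err != nil {
		panic(err)
	}
	var decoded string
	if err := json.Unmarshal(encoded, &decoded); err != nil {
		panic(err)
	}
	return decoded
}

func main() {
	first := roundTrip([]byte("\xE6"))      // first byte of "方"
	second := roundTrip([]byte("\x96\xB9")) // remaining two bytes of "方"
	merged := first + second

	// Each invalid byte was already replaced before it reached the reader,
	// so joining the fragments cannot restore the original character.
	fmt.Printf("%+q\n", merged)
	fmt.Println(merged == "方")
}
```

Once the replacement has happened on the writer side, nothing in the collector can undo it, which is why only the containerd and CRI-O case is recoverable.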
…cing UTF-8 bytes with \uFFFD (open-telemetry#39661)

#### Description

`pkg/stanza` decodes input bytes using `unicode.UTF8`, which replaces any input bytes that are not part of a well-formed UTF-8 code sequence with `utf8.RuneError`. This replacement is not what we expect. The `Decoder` in `golang.org/x/text/encoding` is used to convert bytes to UTF-8. So, if the user specifies that the input encoding is compatible with UTF-8, we don't need to use `encoding.UTF8` and should use `encoding.Nop` to avoid `utf8.RuneError`.

This PR introduces the `utf8-raw` encoding. It behaves the same way as `encoding.Nop` but is differentiated from the `nop` encoding, which we treat in a special way.

#### Link to tracking issue

Fixes open-telemetry#39653

#### Testing

- Update the test to ensure the encoding does not replace invalid UTF-8 bytes
- Add a test to ensure recombine correctly combines split UTF-8 characters

---------

Co-authored-by: Curtis Robert
Co-authored-by: Andrzej Stencel
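The difference between a replacing decoder and a pass-through decoder can be sketched with the standard library alone. This is an illustration under stated assumptions, not the `pkg/stanza` implementation: `decodeReplacing`, `decodeRaw`, and `recombine` are hypothetical names, and the rune conversion stands in for a decoder that substitutes U+FFFD for each invalid byte.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeReplacing mimics a decoder that substitutes U+FFFD for every byte
// that is not part of a well-formed UTF-8 sequence (stdlib stand-in).
func decodeReplacing(b []byte) string {
	return string([]rune(string(b)))
}

// decodeRaw passes bytes through untouched, as a no-op decoder would.
func decodeRaw(b []byte) string {
	return string(b)
}

// recombine decodes each fragment with the given decoder and joins the
// results with no separator.
func recombine(fragments [][]byte, decode func([]byte) string) string {
	var sb strings.Builder
	for _, f := range fragments {
		sb.WriteString(decode(f))
	}
	return sb.String()
}

func main() {
	// "方" (\xE6\x96\xB9) split after its first byte.
	fragments := [][]byte{[]byte("\xE6"), []byte("\x96\xB9")}

	replaced := recombine(fragments, decodeReplacing)
	raw := recombine(fragments, decodeRaw)

	fmt.Printf("replacing: %+q valid=%v\n", replaced, utf8.ValidString(replaced))
	fmt.Printf("raw:       %+q valid=%v\n", raw, utf8.ValidString(raw))
}
```

With the pass-through decoder the fragments are only interpreted as text after they have been joined, so the split character is whole again by the time anything inspects it.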
Component(s)
receiver/filelog, pkg/stanza
Is your feature request related to a problem? Please describe.
The default log drivers of container runtimes, such as Docker and containerd, may split logs at byte boundaries instead of rune boundaries.
As a result, a multi-byte Unicode character may be split across two log entries.
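A minimal Go sketch of how a size-based split can cut a character in half, using the character from this issue's example. The function `splitByBytes` and the chunk size of 3 are hypothetical and chosen only to force the split; real runtimes split at much larger sizes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitByBytes cuts msg into chunks of at most size bytes, ignoring rune
// boundaries. size must be positive. Hypothetical stand-in for a runtime's
// size-based line splitting.
func splitByBytes(msg []byte, size int) [][]byte {
	var chunks [][]byte
	for len(msg) > size {
		chunks = append(chunks, msg[:size])
		msg = msg[size:]
	}
	return append(chunks, msg)
}

func main() {
	msg := []byte("ab方") // 'a', 'b', then the three bytes \xE6\x96\xB9
	for _, c := range splitByBytes(msg, 3) {
		// Neither chunk is valid UTF-8 on its own.
		fmt.Printf("%q valid=%v\n", c, utf8.Valid(c))
	}
}
```

Here the first chunk ends with the lead byte of the character and the second chunk starts with its two continuation bytes, so each chunk is invalid UTF-8 in isolation even though their concatenation is valid.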
For example, the original log by application:
The UTF-8 encoding of "方": \xE6\x96\xB9
The output if running in Containerd:
The output if running in Docker:
The collected message:
Describe the solution you'd like
For Docker, it seems that we can do nothing.
But for Containerd (maybe also CRI-O), we should try to merge the bytes to recover the original Unicode character.
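The merge could look roughly like the following Go sketch. It assumes the CRI log line layout `<timestamp> <stream> <P|F> <message>`, where `P` marks a partial line and `F` the final part. The function `mergeCRI` and the timestamps are made up for illustration, and the sketch ignores interleaving of stdout and stderr.

```go
package main

import (
	"bytes"
	"fmt"
)

// mergeCRI joins the message parts of CRI-format log lines. Lines tagged P
// are partial and are concatenated, with no separator, up to and including
// the next line tagged F. Bytes are never decoded before joining.
func mergeCRI(lines [][]byte) [][]byte {
	var out [][]byte
	var pending []byte
	for _, line := range lines {
		fields := bytes.SplitN(line, []byte(" "), 4)
		if len(fields) < 4 {
			continue // not a CRI-format line; skipped in this sketch
		}
		pending = append(pending, fields[3]...)
		if string(fields[2]) == "F" {
			out = append(out, pending)
			pending = nil
		}
	}
	return out
}

func main() {
	lines := [][]byte{
		[]byte("2025-01-01T00:00:00.000000000Z stdout P ab\xE6"),
		[]byte("2025-01-01T00:00:00.000000001Z stdout F \x96\xB9"),
	}
	for _, m := range mergeCRI(lines) {
		fmt.Printf("%s\n", m)
	}
}
```

The important design choice is that the fragments stay as raw bytes until after the join; decoding each fragment separately would replace the split bytes before they could be reunited.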
`nop`. Maybe this should be default behavior.
`[]byte` including invalid UTF-8 bytes.
Describe alternatives you've considered
No response
Additional context
No response