Configurable delay to consider a line complete for the file source #18343

@s-at-ik

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

A third-party program (over which I have little to no control) outputs its logs as multi-line XML. If that wasn't bad enough, each log entry always starts with a newline but does not end with one.

The log will look something like this:


<?xml version="1.0" encoding="UTF-8"?><a><lot><of></of></lot><tags/></a>

<?xml version="1.0"?>
<a>
<lot>
  <of></of>
</lot>
<tags/>
</a>

<?xml version="1.0" encoding="UTF-8"?><a><lot><of>
</of></lot><tags/></a>

<?xml version="1.0"?>
<a><lot><of></of></lot><tags/></a>

Attempted Solutions

I've written a set of transforms to read these logs, which works fine when sourcing a static file with a newline at the end:

sources:
  source_logs_xml:
    type: file
    include:
      - /var/log/log.xml
    multiline:
      # Single quotes keep the regex backslash literal in YAML.
      start_pattern: '^<\?xml'
      mode: halt_with
      condition_pattern: '</a>$'
      timeout_ms: 1000
transforms:
  drop_empty_xml_logs:
    type: filter
    inputs:
      - source_logs_xml
    condition: "!is_nullish(.message)"
  transform_xml_logs:
    type: remap
    inputs:
      - drop_empty_xml_logs
    source: |-
      . |= object!(parse_xml!(.message))
      del(.message)

However, when reading the actual log file, the last line of each entry has no newline character, so the input received by the transform looks like this (ignoring empty logs):

<?xml version="1.0" encoding="UTF-8"?><a><lot><of></of></lot><tags/></a>
<?xml version="1.0"?>
<a>
<lot>
  <of></of>
</lot>
<tags/>
</a>
<?xml version="1.0" encoding="UTF-8"?><a><lot><of>
</of></lot><tags/></a>
<?xml version="1.0"?>
<a><lot><of></of></lot><tags/></a>

As you'd expect, not much ends up as valid XML.
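To illustrate the failure mode, here is a minimal Python sketch. It assumes the reader only hands a line downstream once it has seen the terminating newline; that is an assumption about the behaviour described above, not Vector's actual code, and `NewlineReader` is a made-up name.

```python
class NewlineReader:
    """Toy reader that only emits newline-terminated lines.

    This models an assumption about the file source, not Vector's real code.
    """

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk):
        """Append newly written data and return the lines completed so far."""
        self.buffer += chunk
        # Everything before the last "\n" is complete; the remainder stays buffered.
        *complete, self.buffer = self.buffer.split("\n")
        return complete


reader = NewlineReader()

# The program writes a leading newline but no trailing one.
first = reader.feed('\n<?xml version="1.0"?>\n<a>\n<tags/>\n</a>')
# first == ['', '<?xml version="1.0"?>', '<a>', '<tags/>'], and '</a>' is still buffered

# Only the next entry's leading newline completes the previous '</a>' line.
second = reader.feed('\n<?xml version="1.0"?>\n<a/>')
# second == ['</a>', '<?xml version="1.0"?>']
```

In this model the closing `</a>` of an entry only becomes visible when the next entry is written, which may be long after `multiline.timeout_ms` has expired.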

Proposal

As suggested by @jszwedko in #18341, a configurable timeout after which Vector considers the current line complete would alleviate this problem. In my use case, it could be set to a value slightly lower than multiline.timeout_ms to ensure the last line gets properly included.

Alternatively, multiline could include the current line buffer when its timeout expires, but I find this solution less elegant and less logical.
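A rough Python sketch of the first proposal, to make the intended semantics concrete. `IdleFlushReader`, `idle_timeout_s`, and `poll` are hypothetical names for illustration only; no such option exists in Vector today, and time is passed in explicitly to keep the example deterministic.

```python
class IdleFlushReader:
    """Toy reader: emits newline-terminated lines, plus the trailing partial
    line once no new data has arrived for `idle_timeout_s` seconds.

    `idle_timeout_s` is a hypothetical setting, not an existing Vector option.
    """

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.buffer = ""
        self.last_data_at = None

    def feed(self, chunk, now):
        """Record data written at time `now` and return the completed lines."""
        self.buffer += chunk
        self.last_data_at = now
        *complete, self.buffer = self.buffer.split("\n")
        return complete

    def poll(self, now):
        """Called periodically; flush the partial line if the file went quiet."""
        if self.buffer and now - self.last_data_at >= self.idle_timeout_s:
            line, self.buffer = self.buffer, ""
            return [line]
        return []


# 0.8 s is slightly below the multiline timeout of 1000 ms used above.
reader = IdleFlushReader(idle_timeout_s=0.8)
lines = reader.feed('\n<?xml version="1.0"?>\n<a>\n</a>', now=0.0)
early = reader.poll(now=0.5)  # still inside the idle window: nothing is flushed
late = reader.poll(now=0.9)   # idle window elapsed: the buffered '</a>' is emitted
```

Because the partial line is flushed before the multiline timeout fires, the closing tag would reach the aggregator in time. The next entry's leading newline then produces only an empty line, which the filter transform already drops.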

References

#18341

Version

vector 0.32.1 (x86_64-unknown-linux-gnu 9965884 2023-08-21 14:52:38.330227446)


Labels

  • source: file (anything `file` source related)
  • type: feature (a value-adding code addition that introduces new functionality)
