feat(aws_s3 sink): Add Apache Parquet encoder support #24372

rorylshanks · 2025-12-12T13:07:06Z

Summary

This PR adds Apache Parquet encoding support to the AWS S3 sink, enabling Vector to write columnar Parquet files optimized for analytics workloads.

Parquet is a columnar storage format that provides efficient compression and encoding, making it well suited to long-term storage and to analytical queries with tools like AWS Athena, Apache Spark, and Presto. This implementation lets users write Parquet files with a configurable schema, compression codec, and row group size.

Key features:

Complete Parquet encoder implementation with Apache Arrow integration
YAML schema configuration support (field names → data types)
Support for all common data types (strings, integers, floats, timestamps, booleans, etc.)
Configurable compression algorithms (snappy, gzip, zstd, lz4, brotli)
Row group size control for query parallelization
Nullable field support
Comprehensive test suite (9 unit tests)
Full documentation for schema configuration and Parquet options
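
To illustrate the row group feature listed above: a Parquet file is divided into row groups, and query engines can read separate row groups in parallel. The following is a minimal sketch of that chunking idea only, not the encoder from this PR. The function name `split_into_row_groups` and the parameter `row_group_size` are hypothetical and do not come from the PR.

```python
def split_into_row_groups(rows, row_group_size):
    """Split a batch of rows into consecutive chunks of at most row_group_size rows.

    Hypothetical helper for illustration; the real encoder delegates this to
    its Parquet writer rather than slicing lists.
    """
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    return [
        rows[start:start + row_group_size]
        for start in range(0, len(rows), row_group_size)
    ]


if __name__ == "__main__":
    batch = list(range(10))
    for index, group in enumerate(split_into_row_groups(batch, 4)):
        print(index, group)
```

Smaller row groups give readers more units to parallelize over, at the cost of more per-group metadata and less effective compression, so the right size depends on the workload.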

Vector configuration

sources:
  events:
    type: kafka
    bootstrap_servers: "kafka:9092"
    topics:
      - events

transforms:
  prepare:
    inputs:
      - events
    type: remap
    source: |
      parsed = parse_json(.message) ?? {}
      .uuid = parsed.uuid
      .properties = parsed.properties

sinks:
  s3_events:
    type: aws_s3
    inputs:
      - prepare
    bucket: my-bucket
    region: us-east-1
    compression: none  # Parquet handles compression internally

    batch:
      max_events: 50000
      timeout_secs: 60

    encoding:
      codec: parquet
      schema:
        timestamp: timestamp_microsecond
        uuid: utf8
        properties: utf8

      parquet:
        compression: zstd
        allow_nullable_fields: true
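
The schema in the configuration above maps field names to column types. A rough sketch of what projecting one event onto that schema could look like is given here. It rests on assumptions that are not confirmed by the PR: that a missing field becomes null when `allow_nullable_fields` is true and is rejected otherwise, and that non-string values in a `utf8` column are serialized as JSON. The names `project` and `coerce` are hypothetical.

```python
import json
from datetime import datetime, timezone

# Schema copied from the encoding.schema block above: field name -> type name.
SCHEMA = {
    "timestamp": "timestamp_microsecond",
    "uuid": "utf8",
    "properties": "utf8",
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def coerce(value, type_name):
    """Convert one value to a plain-Python stand-in for the named column type."""
    if type_name == "utf8":
        # Assumption: non-string values are stored as their JSON text.
        return value if isinstance(value, str) else json.dumps(value)
    if type_name == "timestamp_microsecond":
        # Integer microseconds since the Unix epoch (expects an aware datetime).
        delta = value - EPOCH
        return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    raise ValueError(f"unsupported type in this sketch: {type_name}")


def project(event, schema, allow_nullable_fields):
    """Build one row holding exactly the schema's columns, in schema order."""
    row = {}
    for name, type_name in schema.items():
        value = event.get(name)
        if value is None:
            # Assumption: missing fields are only tolerated when nullable.
            if not allow_nullable_fields:
                raise ValueError(f"missing required field: {name}")
            row[name] = None
        else:
            row[name] = coerce(value, type_name)
    return row


if __name__ == "__main__":
    event = {"uuid": "abc", "timestamp": datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)}
    print(project(event, SCHEMA, allow_nullable_fields=True))
```

Fields that are present on the event but absent from the schema are simply not written in this sketch.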

How did you test this PR?

I tested it against production Kafka data, and it produced correctly formatted Parquet files in S3.

Change Type

Bug fix
New feature (Parquet encoder for AWS S3 sink)
Non-functional (chore, refactoring, docs)
Performance

Is this a breaking change?

Yes
No

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the no-changelog label to this PR.

References

Closes #1374

github-actions · 2025-12-12T13:07:31Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

rorylshanks · 2025-12-12T13:09:52Z

I have read the CLA Document and I hereby sign the CLA

rorylshanks requested review from a team as code owners December 12, 2025 13:07

github-actions bot added the domain: sinks, domain: codecs, and domain: external docs labels Dec 12, 2025

rorylshanks changed the title ~~Added parquet encoding to Vector AWS S3 Output~~ feat(aws_s3 sink): Add Apache Parquet encoder support Dec 12, 2025

drichards-87 self-assigned this Dec 12, 2025

drichards-87 approved these changes Dec 12, 2025

drichards-87 removed their assignment Dec 12, 2025

rorylshanks added 5 commits December 13, 2025 23:56

Added parquet encoding to Vector AWS S3 Output

c120c86

Added schema def

1b630aa

Added parquet to default features

5a7f6bf

Added changelog item for parquet

52d4c8e

Pre-allocate buffer for parquet output based on heuristic 2kb per event - can be tuned in config

672999d

rorylshanks force-pushed the add-parquet branch from 7c16cdd to 672999d December 13, 2025 22:56

Added ability to write bloom filters to parquet files. Configurable but defaulted to off

b54228f
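
The buffer pre-allocation commit in the list above sizes the output buffer from a per-event estimate. As a back-of-the-envelope sketch of that heuristic (not the PR's code): the constant and function names below are hypothetical, and "2kb" is assumed to mean 2048 bytes.

```python
# Assumption: "2kb per event" is read as 2 * 1024 bytes; the PR may use a different value.
BYTES_PER_EVENT_ESTIMATE = 2 * 1024


def initial_buffer_capacity(max_events, bytes_per_event=BYTES_PER_EVENT_ESTIMATE):
    """Estimate how many bytes to reserve up front for one encoded batch."""
    if max_events < 0 or bytes_per_event < 0:
        raise ValueError("max_events and bytes_per_event must be non-negative")
    return max_events * bytes_per_event


if __name__ == "__main__":
    # batch.max_events from the example configuration in the PR description.
    capacity = initial_buffer_capacity(50000)
    print(capacity, "bytes, about", round(capacity / (1024 * 1024), 1), "MiB")
```

With `max_events: 50000` this estimate comes to 102,400,000 bytes, roughly 98 MiB, which suggests why a tunable setting is useful for batches of small events.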
