@rorylshanks

Summary

This PR adds Apache Parquet encoding support to the AWS S3 sink, enabling Vector to write columnar Parquet files optimized for analytics workloads.

Parquet is a columnar storage format with efficient compression and encoding, which makes it well suited to long-term storage and to fast queries from tools such as AWS Athena, Apache Spark, and Presto. With this change, users can write Parquet files with a configurable schema, compression codec, and row group size.
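
To make the row-group idea concrete, here is a minimal Python sketch of how a batch of events could be split into row groups of a fixed size. It is not the PR's Rust implementation; the function name `split_into_row_groups` and the `rows_per_group` parameter are hypothetical. Each row group is a unit that a query engine can read independently, which is what allows scans to run in parallel.

```python
from typing import Iterator, List, Sequence


def split_into_row_groups(rows: Sequence[dict], rows_per_group: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `rows`, each holding at most `rows_per_group` rows.

    Hypothetical helper for illustration only; it is not part of Vector.
    """
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    for start in range(0, len(rows), rows_per_group):
        yield list(rows[start:start + rows_per_group])


batch = [{"uuid": f"id-{i}"} for i in range(10)]
groups = list(split_into_row_groups(batch, rows_per_group=4))
# 10 rows at 4 rows per group give groups of 4, 4, and 2 rows.
group_sizes = [len(g) for g in groups]
```

Smaller row groups give readers more units to parallelize over, while larger ones generally compress better, so the right size depends on the query workload.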

Key features:

  • Complete Parquet encoder implementation with Apache Arrow integration
  • YAML schema configuration support (field names → data types)
  • Support for all common data types (strings, integers, floats, timestamps, booleans, etc.)
  • Configurable compression algorithms (snappy, gzip, zstd, lz4, brotli)
  • Row group size control for query parallelization
  • Nullable field support
  • Comprehensive test suite (9 unit tests)
  • Full documentation for schema configuration and Parquet options
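
The schema and nullable-field features above can be pictured with a small Python sketch. It assumes a schema that maps field names to type names, as in the example config, and pivots row-shaped events into per-field columns. The function `events_to_columns`, the `ACCEPTED` table, the treatment of timestamps as integer microseconds, and the choice to drop fields absent from the schema are all assumptions made for illustration, not the encoder's actual behavior.

```python
from typing import Any, Dict, List

# Python type accepted for each schema type name. The type names come from the
# example config; the mapping itself is an assumption for illustration.
ACCEPTED = {
    "utf8": str,
    "timestamp_microsecond": int,  # assumed: microseconds since the Unix epoch
}


def events_to_columns(
    schema: Dict[str, str],
    events: List[Dict[str, Any]],
    allow_nullable_fields: bool,
) -> Dict[str, List[Any]]:
    """Pivot row-shaped events into one list per schema field."""
    columns: Dict[str, List[Any]] = {name: [] for name in schema}
    for event in events:
        for name, type_name in schema.items():
            value = event.get(name)
            if value is None:
                # A missing field becomes a null only when nulls are allowed.
                if not allow_nullable_fields:
                    raise ValueError(f"field {name!r} is missing and nulls are not allowed")
            elif type(value) is not ACCEPTED[type_name]:
                raise TypeError(f"field {name!r} is not a valid {type_name}")
            columns[name].append(value)
    return columns


schema = {"timestamp": "timestamp_microsecond", "uuid": "utf8"}
events = [
    {"timestamp": 1765544820000000, "uuid": "a"},
    {"timestamp": 1765544821000000, "extra": 1},  # no uuid; "extra" is not in the schema
]
columns = events_to_columns(schema, events, allow_nullable_fields=True)
```

In this sketch the second event contributes a null to the `uuid` column, and the same input raises an error when `allow_nullable_fields` is false.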

Vector configuration

sources:
  events:
    type: kafka
    bootstrap_servers: "kafka:9092"
    topics:
      - events

transforms:
  prepare:
    inputs:
      - events
    type: remap
    source: |
      parsed = parse_json(.message) ?? {}
      .uuid = parsed.uuid
      .properties = parsed.properties
      

sinks:
  s3_events:
    type: aws_s3
    inputs:
      - prepare
    bucket: my-bucket
    region: us-east-1
    compression: none  # Parquet handles compression internally

    batch:
      max_events: 50000
      timeout_secs: 60

    encoding:
      codec: parquet
      schema:
        timestamp: timestamp_microsecond
        uuid: utf8
        properties: utf8

      parquet:
        compression: zstd
        allow_nullable_fields: true
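
As a rough illustration of what the `prepare` transform does to each event, the following Python sketch mimics the VRL fallback `parse_json(.message) ?? {}`. The function name `prepare` and the dictionary-based event shape are stand-ins chosen for this example; only the field names come from the configuration above.

```python
import json
from typing import Any, Dict


def prepare(event: Dict[str, Any]) -> Dict[str, Any]:
    """Mimic the remap step: parse `message` as JSON, fall back to {} on error."""
    try:
        parsed = json.loads(event.get("message", ""))
    except (TypeError, ValueError):
        parsed = {}
    # A JSON value that is not an object has no `uuid` or `properties`,
    # so treat it the same as an empty object.
    if not isinstance(parsed, dict):
        parsed = {}
    out = dict(event)
    out["uuid"] = parsed.get("uuid")              # None when the key is absent
    out["properties"] = parsed.get("properties")  # None when the key is absent
    return out


good = prepare({"message": '{"uuid": "abc", "properties": "k=v"}'})
bad = prepare({"message": "not json"})
```

A message that fails to parse leaves `uuid` and `properties` null rather than dropping the event, which is presumably why the example enables `allow_nullable_fields: true`.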

How did you test this PR?

I tested it against production Kafka data, and it produced correctly formatted Parquet files in S3.

Change Type

  • Bug fix
  • New feature (Parquet encoder for AWS S3 sink)
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Closes #1374

@rorylshanks rorylshanks requested review from a team as code owners December 12, 2025 13:07
@github-actions github-actions bot added the domain: sinks, domain: codecs, and domain: external docs labels Dec 12, 2025
@rorylshanks rorylshanks changed the title from "Added parquet encoding to Vector AWS S3 Output" to "feat(aws_s3 sink): Add Apache Parquet encoder support" Dec 12, 2025
@github-actions

github-actions bot commented Dec 12, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@rorylshanks
Author

I have read the CLA Document and I hereby sign the CLA

@drichards-87 drichards-87 self-assigned this Dec 12, 2025
@drichards-87 drichards-87 removed their assignment Dec 12, 2025

Labels

  • domain: codecs (anything related to Vector's codecs, encoding/decoding)
  • domain: external docs (anything related to Vector's external, public documentation)
  • domain: sinks (anything related to Vector's sinks)

Development

Successfully merging this pull request may close these issues.

Support parquet columnar format in the aws_s3 sink

2 participants