
[otel-arrow-rust] Adaptive array builders optimize dictionary upgrade #536


Open
albertlockett opened this issue Jun 4, 2025 · 1 comment
Assignees
Labels
enhancement New feature or request performance rust Pull requests that update Rust code

Comments

@albertlockett
Member

In #473 we added basic functionality for the adaptive array builders, but currently, when upgrading the dictionary to a wider key type, we copy all the values.

Ideally we'd just be able to get access to the underlying key builder of the dictionary builder, finish/cast it to the new index type, and then create a new dictionary builder with the same values/internal state + the new key builder. As far as I know there's not currently a way to do this in arrow-rs.
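
To make the idea concrete, here is a minimal sketch using only the standard library. `ToyDictBuilder`, `DictOverflow`, and `upgrade` are hypothetical names made up for illustration; none of them are arrow-rs APIs, and the real builder's internals may look quite different. The point is only that an upgrade could widen the keys and move the values and lookup state as-is, instead of re-appending every value.

```rust
use std::collections::HashMap;

/// Error returned when a new dictionary index does not fit in the key type.
/// (Hypothetical type for this sketch.)
#[derive(Debug)]
pub struct DictOverflow;

/// Toy stand-in for a dictionary builder. Not the arrow-rs builder.
pub struct ToyDictBuilder<K> {
    values: Vec<String>,            // distinct values, in first-seen order
    lookup: HashMap<String, usize>, // value -> position in `values`
    keys: Vec<K>,                   // one dictionary index per appended row
}

impl<K: TryFrom<usize>> ToyDictBuilder<K> {
    pub fn new() -> Self {
        Self { values: Vec::new(), lookup: HashMap::new(), keys: Vec::new() }
    }

    /// Appends one row. Fails without modifying the builder if the value is
    /// new and its index would overflow the key type `K`.
    pub fn append(&mut self, value: &str) -> Result<(), DictOverflow> {
        let idx = match self.lookup.get(value) {
            Some(&idx) => idx,
            None => self.values.len(),
        };
        let key = K::try_from(idx).map_err(|_| DictOverflow)?;
        if idx == self.values.len() {
            self.values.push(value.to_owned());
            self.lookup.insert(value.to_owned(), idx);
        }
        self.keys.push(key);
        Ok(())
    }

    /// Upgrades to a wider key type: keys are widened one by one, while the
    /// values and the lookup table are moved, not copied or rehashed.
    pub fn upgrade<K2: From<K>>(self) -> ToyDictBuilder<K2> {
        ToyDictBuilder {
            values: self.values,
            lookup: self.lookup,
            keys: self.keys.into_iter().map(K2::from).collect(),
        }
    }

    pub fn keys(&self) -> &[K] {
        &self.keys
    }

    pub fn values(&self) -> &[String] {
        &self.values
    }
}
```

With this shape, a caller that sees `DictOverflow` from a `ToyDictBuilder<u8>` could call `upgrade::<u16>()` and retry the append; the cost is proportional to the number of keys rather than to the total size of the values.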

TODO

  • confirm with arrow-rs community if this is possible
  • open any necessary issues in arrow-rs for this capability
  • implement changes in arrow-rs
  • use optimized implementation of dictionary builder conversion in dictionary upgrade.
@albertlockett
Member Author

Related PR on arrow-rs: apache/arrow-rs#7611

github-merge-queue bot pushed a commit that referenced this issue Jun 5, 2025
)

Part of: #533

Very rough implementation of adaptive array builders. This is my Rust
version of the builders we've implemented in Go here:
https://github.com/open-telemetry/otel-arrow/blob/main/go/pkg/otel/common/schema/builder/record.go

The idea behind these is that when we're encoding OTAP records, we often
want to dynamically create columns in a record batch that are either
omitted from the record batch (if all the values are null), dictionary
encoded with the smallest possible index type, or emitted as the native
array if the dictionary index would overflow. (Some of this was alluded
to in yesterday's SIG meeting.)

The intended usage is something like this:
```rs
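use std::sync::Arc;

use arrow::datatypes::{Field, Schema};
use arrow::record_batch::RecordBatch;
// ArrayOptions and DictionaryOptions are assumed to live in the same module
// as StringArrayBuilder; adjust the path if they are defined elsewhere.
use otel_arrow_rust::encode::record::array::{
    ArrayOptions, DictionaryOptions, StringArrayBuilder,
};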

let mut str_builder = StringArrayBuilder::new(ArrayOptions {
    nullable: true,
    dictionary_options: Some(DictionaryOptions {
        min_cardinality: u8::MAX.into(),
        max_cardinality: u16::MAX,
    }),
});

// maybe append some values
str_builder.append_value(&"a".to_string());

let result = str_builder.finish();

let mut fields = Vec::new();
let mut columns = Vec::new();

// finish() yields None when no column should be emitted
if let Some(result) = result {
    fields.push(Field::new("str", result.data_type, true));
    columns.push(result.array);
}

let record_batch = RecordBatch::try_new(
    Arc::new(Schema::new(fields)),
    columns
)
.expect("should work");
```
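
As a rough illustration of the three outcomes described above (omit the column, dictionary-encode with the smallest index, or fall back to the native array), the selection rule could be sketched as follows. `ColumnEncoding` and `choose_encoding` are hypothetical names, and the thresholds are assumptions: a `u8` key is taken to address 256 distinct values, and `max_cardinality` is taken to be an upper bound on distinct values for dictionary encoding. The actual builders may decide differently.

```rust
/// Possible outcomes for one column. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq, Eq)]
pub enum ColumnEncoding {
    /// Every value was null, so the column is left out of the record batch.
    Omitted,
    /// Dictionary encoded with u8 keys.
    DictU8,
    /// Dictionary encoded with u16 keys.
    DictU16,
    /// Plain (non-dictionary) array.
    Native,
}

/// Picks an encoding from simple column statistics.
/// Assumption: `max_cardinality` caps the number of distinct values that may
/// be dictionary encoded; above it the column falls back to the native array.
pub fn choose_encoding(
    non_null_rows: usize,
    distinct_values: usize,
    max_cardinality: u16,
) -> ColumnEncoding {
    // A u8 key can address indices 0..=255, i.e. 256 distinct values.
    const U8_CAPACITY: usize = u8::MAX as usize + 1;

    if non_null_rows == 0 {
        ColumnEncoding::Omitted
    } else if distinct_values > usize::from(max_cardinality) {
        ColumnEncoding::Native
    } else if distinct_values <= U8_CAPACITY {
        ColumnEncoding::DictU8
    } else {
        ColumnEncoding::DictU16
    }
}
```

In the real builders this decision is made incrementally as values are appended rather than from precomputed statistics, which is why the `Dict<u8>` to `Dict<u16>` upgrade path matters.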

Follow-up work includes:
- null support #534
- additional datatype support #535
- optimize the conversion between Dict<u8> -> Dict<u16> #536

---------

Co-authored-by: Laurent Quérel <[email protected]>