Fix inconsistent schema projection in ListingTable when file order varies by tracking schema source #16305

Open
wants to merge 21 commits into
base: main
from

Conversation

kosiew
Contributor

@kosiew kosiew commented Jun 6, 2025

Which issue does this PR close?

Rationale for this change

Understanding the origin of a schema (whether it was inferred or explicitly specified) is important for debugging, reproducibility, and behavioral consistency in systems like DataFusion that operate on dynamic data sources. Previously, this information was not available in the ListingTable or its configuration, making it hard to reason about schema behavior.

What changes are included in this PR?

  • Schema-source metadata

    • Introduced a new SchemaSource enum to track the origin of a schema (None, Inferred, Specified).
    • Extended ListingTableConfig and ListingTable to carry and expose this metadata.
    • Ensured that schema inference logic respects an explicitly set schema and does not overwrite it.
    • Added public accessors (schema_source()) to inspect schema origin in both config and table.
  • Imports cleanup

    • Reorganized imports in table.rs for clarity and consistency.
  • Test refactoring & additions

    • Refactored single-file scan/statistics tests:
      • Kept and cleaned up read_single_file.
      • Unified Parquet-stats coverage into a single test_table_stats_behaviors.
    • Consolidated file-listing tests into a parameterized test_list_files_configurations.
    • Parameterized insert-into append tests via test_insert_into_parameterized.
    • Added comprehensive unit tests for all SchemaSource cases:
      • test_schema_source_tracking_comprehensive
      • infer_preserves_provided_schema
    • Removed dozens of redundant individual tests in favor of DRY loops and shared helpers (create_test_schema, generate_test_files…).
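
The schema-source tracking described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: only the `SchemaSource` variant names and the `schema_source()` accessor come from the description. `TableConfig`, `with_schema`, `infer_schema`, `file_schema`, and the `SchemaRef` alias are hypothetical stand-ins (the real types are `ListingTableConfig` and Arrow's schema type).

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow schema: just a list of column names.
type SchemaRef = Arc<Vec<String>>;

/// Origin of a table's schema (variant names as listed in the PR description).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SchemaSource {
    /// No schema has been set yet.
    #[default]
    None,
    /// The schema was inferred from the files.
    Inferred,
    /// The schema was provided explicitly by the caller.
    Specified,
}

/// Hypothetical, simplified config that carries a schema and its origin.
#[derive(Debug, Clone, Default)]
pub struct TableConfig {
    file_schema: Option<SchemaRef>,
    schema_source: SchemaSource,
}

impl TableConfig {
    /// Set an explicit schema and record that it was specified.
    pub fn with_schema(mut self, schema: SchemaRef) -> Self {
        self.file_schema = Some(schema);
        self.schema_source = SchemaSource::Specified;
        self
    }

    /// Run inference only when no schema is present, so an explicitly
    /// provided schema is never overwritten.
    pub fn infer_schema(mut self, infer: impl FnOnce() -> SchemaRef) -> Self {
        if self.file_schema.is_none() {
            self.file_schema = Some(infer());
            self.schema_source = SchemaSource::Inferred;
        }
        self
    }

    /// Report where the current schema came from.
    pub fn schema_source(&self) -> SchemaSource {
        self.schema_source
    }

    /// Borrow the current schema, if any.
    pub fn file_schema(&self) -> Option<&SchemaRef> {
        self.file_schema.as_ref()
    }
}
```

In this sketch, calling `with_schema` followed by `infer_schema` leaves the source as `Specified`, which mirrors the "inference does not overwrite an explicit schema" behavior that `infer_preserves_provided_schema` is described as covering.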

Deleted tests → New tests mapping

| Deleted test(s) | Replacement test |
| --- | --- |
| read_single_file (old) | read_single_file (refactored) |
| do_not_load_table_stats_by_default, load_table_stats_when_no_stats | test_table_stats_behaviors |
| test_assert_list_files_for_scan_grouping, test_assert_list_files_for_multi_path, test_assert_list_files_for_exact_paths | test_list_files_configurations |
| test_insert_into_append_new_json_files, test_insert_into_append_new_csv_files, test_insert_into_append_2_new_parquet_files_defaults, test_insert_into_append_1_new_parquet_files_defaults | test_insert_into_parameterized |

Are these changes tested?

  • Yes. This PR includes a full suite of unit tests covering:
    • Single-file + stats: read_single_file, test_table_stats_behaviors
    • Schema-source tracking: test_schema_source_tracking_comprehensive, infer_preserves_provided_schema
    • File-listing grouping: test_list_files_configurations
    • Insert-into append behavior: test_insert_into_parameterized

Shared helpers and parameterized loops ensure that every previously tested scenario is still exercised, with improved maintainability and coverage.
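
The parameterized-loop pattern mentioned here could look something like the sketch below. It is a generic illustration of a table-driven test, not the PR's test code: `AppendCase`, `files_after_append`, `append_cases`, `failing_cases`, and the case values are all hypothetical, and the function under test is a toy stand-in rather than real insert-into logic.

```rust
/// Hypothetical description of one scenario in a table-driven test.
struct AppendCase {
    name: &'static str,
    existing_files: usize,
    batches_to_insert: usize,
    expected_files: usize,
}

/// Toy stand-in for the behavior under test: every inserted batch
/// is assumed to produce exactly one new file.
fn files_after_append(existing_files: usize, batches_to_insert: usize) -> usize {
    existing_files + batches_to_insert
}

/// Shared helper that builds the case table once for all tests.
fn append_cases() -> Vec<AppendCase> {
    vec![
        AppendCase { name: "json", existing_files: 0, batches_to_insert: 1, expected_files: 1 },
        AppendCase { name: "csv", existing_files: 2, batches_to_insert: 1, expected_files: 3 },
        AppendCase { name: "parquet", existing_files: 1, batches_to_insert: 2, expected_files: 3 },
    ]
}

/// Run every case and collect the names of those that fail, so one
/// failing scenario does not hide the others.
fn failing_cases(cases: &[AppendCase]) -> Vec<&'static str> {
    cases
        .iter()
        .filter(|c| files_after_append(c.existing_files, c.batches_to_insert) != c.expected_files)
        .map(|c| c.name)
        .collect()
}

#[test]
fn test_append_parameterized() {
    let failures = failing_cases(&append_cases());
    assert!(failures.is_empty(), "failing cases: {failures:?}");
}
```

Naming each case keeps failures attributable to a specific scenario, which is the main thing lost when several standalone tests are folded into one loop.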

Are there any user-facing changes?

Yes:

  • ListingTable now exposes a schema_source() method, enabling downstream consumers to programmatically check the origin of the schema.
  • This may help users understand or debug unexpected schema-related behavior when working with listing tables.

There are no breaking changes to the public API, but the enhancement provides improved introspection capabilities.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 6, 2025
@kosiew kosiew marked this pull request as draft June 6, 2025 14:15
@kosiew kosiew marked this pull request as ready for review June 7, 2025 08:14
@kosiew kosiew force-pushed the list-table-config-file-schema-16270 branch from 47fcdae to a40a448 Compare June 7, 2025 16:05
Labels
core Core DataFusion crate
Development

Successfully merging this pull request may close these issues.

Inconsistent schema coercion in ListingTableConfig
1 participant