@arnej27959

@thomasht86 please review
@bjorncs FYI

This commit adds the ability to import XGBoost models saved in Universal
Binary JSON (.ubj) format, in addition to the existing JSON format support.

Key changes:
- Add ubjson library dependency for parsing UBJ binary format
- Create XGBoostUbjParser to handle UBJ model files
- Extract common tree-to-expression logic into AbstractXGBoostParser base class
- Convert flat UBJ array representation to hierarchical tree structure
- Extract and apply base_score logit transformation from model metadata
- Add test case comparing JSON and UBJ model imports
- Add utility tools for UBJ-to-JSON conversion and debugging
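The flat-to-hierarchical conversion mentioned in the list above can be sketched roughly as follows. This is an illustration in Python, not the code in this PR: the array names (`left_children`, `right_children`, `split_indices`, `split_conditions`) and the convention that a child index of -1 marks a leaf whose value sits in `split_conditions` are assumptions about the flat layout, and `build_tree` is a hypothetical helper.

```python
def build_tree(left_children, right_children, split_indices, split_conditions, node=0):
    """Recursively turn flat per-node arrays into a nested dict.

    Assumption: a node is a leaf when its left child index is -1, and the
    leaf value is then stored in split_conditions at that node's index.
    """
    if left_children[node] == -1:
        return {"leaf": split_conditions[node]}
    return {
        "feature": split_indices[node],        # feature index tested at this node
        "threshold": split_conditions[node],   # split threshold for that feature
        "yes": build_tree(left_children, right_children,
                          split_indices, split_conditions, left_children[node]),
        "no": build_tree(left_children, right_children,
                         split_indices, split_conditions, right_children[node]),
    }


# A three-node tree: the root splits on feature 0 at 0.5, both children are leaves.
tree = build_tree(
    left_children=[1, -1, -1],
    right_children=[2, -1, -1],
    split_indices=[0, 0, 0],
    split_conditions=[0.5, -1.0, 2.0],
)
```

Once the tree is nested like this, the same tree-to-expression logic can presumably serve both input formats, which is the motivation for the shared base class.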

Enable base score extraction with logistic transformation.

Add the ubjson library (com.dev-smart:ubjson) to the allowed dependencies
lists across all Maven enforcer configurations. This is required for the
XGBoost UBJ format import feature added in the previous commit.

Add a probe method to validate UBJ file structure before parsing,
and precompute the base_score logit transformation instead of
generating it as a runtime expression string.

Separate feature indices from feature name formatting to enable
flexible feature naming in ranking expressions. This allows models to
use meaningful feature names (e.g., "mean_radius") instead of generic
indexed names, improving readability of generated ranking expressions.

When loading an XGBoost UBJ model, the importer automatically checks for
and loads feature names from an optional companion text file. For example,
when reading "model.ubj", it looks for "model-features.txt" and uses those
names if present.

Key features:
- Automatically loads model-features.txt alongside model.ubj
- One feature name per line, supports # comments and blank lines
- Feature names from file override any names in the UBJ file
- Graceful fallback to xgboost_input_X format if file missing or invalid
- No-arg toRankingExpression() automatically uses loaded names when valid

This enables easy customization of feature names without modifying
model files, improving readability of generated ranking expressions.

The importer now extracts and tracks the model's objective function type
(e.g., reg:squarederror, binary:logistic) to correctly handle base_score:
- Apply logit transformation only for logistic objectives
- Use base_score directly for regression objectives
- Use objective-specific defaults (0.5 for logistic, 0.0 for regression)

In addition, feature name validation is relaxed to require "at least N"
names instead of "exactly N".
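The objective-dependent base_score handling described above could look roughly like the following Python sketch. The function name `base_score_margin` is hypothetical, and detecting a logistic objective by the `:logistic` suffix is an assumption made for illustration; the importer may classify objectives differently.

```python
import math


def base_score_margin(objective, base_score=None):
    """Return the constant added to the sum of tree outputs.

    Assumption: any objective name ending in ':logistic' is treated as logistic.
    """
    is_logistic = objective.endswith(":logistic")
    if base_score is None:
        # Objective-specific defaults: 0.5 for logistic, 0.0 for regression.
        base_score = 0.5 if is_logistic else 0.0
    if is_logistic:
        # Logit transform: map a probability to the margin (log-odds) scale.
        return math.log(base_score / (1.0 - base_score))
    # Regression objectives use base_score directly.
    return base_score
```

Computing this value once at import time, rather than emitting the logit as part of the expression string, keeps the generated ranking expression to a plain numeric constant.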