[issue-2528] [SDK] Add Structured Output Compliance evaluation metric #2554
Conversation
@vincentkoc Can I get a review on this, since there has been no update on the first PR raised by the other contributor?
Apologies for the delay in reviewing this PR. Could you please resolve the current conflicts? I'll make sure someone reviews it as soon as possible once the conflicts are addressed. Thank you for your patience!
@andrescrz Sorry for the delay on my side. I will resolve the conflicts soon.
@andrescrz I have resolved the conflicts. You can review now and let me know if any changes are needed in the PR; I'd be happy to address them.
Hi @andrescrz, sorry to message again, but could I get a review on my PR?
Pull Request Overview
This PR introduces a new "Structured Output Compliance" evaluation metric that validates whether LLM outputs conform to an expected JSON schema or to valid JSON format in general. The metric uses an LLM-as-a-judge approach and is integrated into both the Python SDK and the frontend UI for online evaluations (a minimal usage sketch follows the change list below).
Key changes:
- Implementation of the StructuredOutputCompliance metric in the Python SDK with template, parser, and metric components
- Frontend integration adding the metric to LLM judge options and UI templates
- Documentation for the new metric with usage examples
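For orientation, here is a minimal usage sketch, assuming the class name, `track` parameter, and `score()` signature shown in the integration test later in this thread; the `reason` attribute is inferred from the PR description and is not confirmed by the diff:

```python
# Minimal sketch, assuming StructuredOutputCompliance is exported from
# opik.evaluation.metrics (per the __init__.py change in the file list below)
# and that score() accepts `output` plus an optional `schema`.
from opik.evaluation.metrics import StructuredOutputCompliance

metric = StructuredOutputCompliance(track=False)

# Without a schema, the judge only checks that the output is valid JSON.
result = metric.score(output='{"name": "John", "age": 30}')

# The PR description says the metric returns a boolean result plus a "reason";
# the exact attribute names here are an assumption.
print(result.value, result.reason)
```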
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
sdks/python/src/opik/evaluation/metrics/llm_judges/structure_output_compliance/template.py | Defines the prompt template and query generation for structured output validation |
sdks/python/src/opik/evaluation/metrics/llm_judges/structure_output_compliance/parser.py | Parses LLM output and validates the response format for the compliance metric |
sdks/python/src/opik/evaluation/metrics/llm_judges/structure_output_compliance/metric.py | Main metric implementation with sync/async scoring methods |
sdks/python/src/opik/evaluation/metrics/init.py | Exports the new StructuredOutputCompliance metric |
sdks/python/examples/metrics.py | Adds usage example for the new metric |
apps/opik-frontend/src/types/llm.ts | Adds structure_compliance to LLM_JUDGE enum |
apps/opik-frontend/src/constants/llm.ts | Defines frontend template configuration for structured output compliance |
apps/opik-documentation/documentation/fern/docs/evaluation/metrics/structure_output_compliance.mdx | Documentation for the new metric with examples and usage |
Hi @Vikaspal8923! Thank you for your contribution! Please fix the linter errors. You need to install pre-commit as described in the Contribution guide and run it locally:

```bash
cd sdks/python
pre-commit run --all-files
```
@Vikaspal8923 The metric implementation looks very promising. However, you should add unit and integration tests to ensure reliability. I've left comments highlighting the areas where tests are needed. Please refer to how other metrics are covered in Opik.
@yaricom sure
Hi @Vikaspal8923! You can register and get your own OpenAI API key at https://platform.openai.com
@yaricom, I have run all the integration tests and fixed the test failures. Can you take a look and let me know if there are any further changes?
@yaricom Are there any changes I need to address, or is it ready now?
@Vikaspal8923 Please fix this error: https://github.com/comet-ml/opik/actions/runs/17410830550/job/49427360921?pr=2554
@yaricom, I've updated the PR title and description to follow the required format. Please recheck it and let me know if any issues persist.
@Vikaspal8923 Please add an integration test that checks a JSON schema, as you mentioned in the example usage:

```python
@model_parametrizer
def test__structured_output_compliance__with_json_schema(model):
    """Test structured output compliance with schema validation."""
    structured_output_metric = metrics.StructuredOutputCompliance(
        model=model, track=False
    )
    schema = '{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}'
    result = structured_output_metric.score(
        output='{"name": "John", "age": 30}', schema=schema
    )
    assert_helpers.assert_score_result(result)
    assert result.value > 0.5
```
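A hedged companion sketch, not part of the review itself: it mirrors the test above for the negative case, on the assumption that a non-compliant output scores below the same 0.5 threshold.

```python
# Hypothetical negative case; assumes the judge assigns a low score when a
# required field is missing from the output. Uses the same `metrics` and
# `assert_helpers` imports as the test above.
@model_parametrizer
def test__structured_output_compliance__with_json_schema__non_compliant(model):
    """An output missing a required field should be judged non-compliant."""
    structured_output_metric = metrics.StructuredOutputCompliance(
        model=model, track=False
    )
    schema = '{"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"]}'
    result = structured_output_metric.score(
        output='{"name": "John"}',  # "age" is missing
        schema=schema,
    )
    assert_helpers.assert_score_result(result)
    assert result.value < 0.5
```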
@yaricom Added 👍.
```python
LOGGER = logging.getLogger(__name__)


class StructuredOutputComplianceResponseFormat(pydantic.BaseModel):
```
Please move this class (`StructuredOutputComplianceResponseFormat`) into the `schema.py` module.
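A minimal sketch of what that move might look like; the field names are a guess based on the PR details ("a boolean result plus a 'reason'") and are not confirmed by the diff:

```python
# Hypothetical schema.py after the suggested move; field names assumed from
# the PR description, not from the actual implementation.
import pydantic


class StructuredOutputComplianceResponseFormat(pydantic.BaseModel):
    score: bool
    reason: str
```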
@yaricom Done 👍.
@Vikaspal8923 Thank you for your contribution!
Details
Implemented the "Structured Output Compliance" evaluation metric, which validates model outputs as JSON/JSON-LD and returns a boolean result plus a "reason."
This extends the LLM-as-a-judge evaluation in both the frontend (Online Evaluation tab) and the Python SDK.
Change checklist
Issues
Closes #2558
Resolves #2528
/claim #2528
Testing
Documentation
Added /docs/evaluation/metrics/structure_output_compliance.mdx with new metric details.
Demo
Video:
https://github.com/user-attachments/assets/47ffd3e9-6642-4678-9e72-87765c747bac