Capability eval suite #28

@jemc

Description

In our ad hoc testing with KurtOpenAI and KurtVertexAI, we have seen problems like:

  • Vertex AI sometimes failing to generate valid structured data, even when the relevant API parameters are supposed to force it to
  • Vertex AI sometimes returning 500 errors

We want to formalize this kind of testing for any LLM provider, so we can share empirically validated findings about how different providers compare on the features that matter most to Kurt users. A rough sketch of what one such capability check might look like is below.

I envision:

  • a script that can be used to generate an output report for a given LLM provider / model (a rough sketch follows this list)
  • a way of storing those results in a repository (either this one, or maybe a separate one dedicated to this work)
  • a nicely readable summary (Markdown? HTML?) of current capabilities across LLM providers / models
  • a blog post showing our findings across the "big three" LLMs (GPT, Gemini, Claude)

Metadata

Labels: enhancement (New feature or request)
