Commit ce59cc4

Add safety evaluators tutorial (#46218)
1 parent 9b61aa6 commit ce59cc4

File tree: 10 files changed, +421 and -36 lines

docs/ai/conceptual/evaluation-libraries.md (21 additions, 21 deletions)

````diff
@@ -31,34 +31,34 @@ You can also customize to add your own evaluations by implementing the <xref:Mic
 
 Quality evaluators measure response quality. They use an LLM to perform the evaluation.
 
-| Metric | Description | Evaluator type |
-|----------------|--------------------------------------------------------|----------------|
-| `Relevance` | Evaluates how relevant a response is to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> |
-| `Completeness` | Evaluates how comprehensive and accurate a response is | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> |
-| `Retrieval` | Evaluates performance in retrieving information for additional context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> |
-| `Fluency` | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
-| `Coherence` | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
-| `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
-| `Groundedness` | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> |
-| `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> |
+| Evaluator type | Metric | Description |
+|----------------------------------------------------------------------|-------------|-------------|
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> | `Relevance` | Evaluates how relevant a response is to a query |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> | `Completeness` | Evaluates how comprehensive and accurate a response is |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> | `Retrieval` | Evaluates performance in retrieving information for additional context |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> | `Fluency` | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability|
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> | `Coherence` | Evaluates the logical and orderly presentation of ideas |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> | `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> | `Groundedness` | Evaluates how well a generated response aligns with the given context |
+| <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator>| `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is |
 
 † This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
 
 ### Safety evaluators
 
 Safety evaluators check for presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine tuned to perform evaluations.
 
-| Metric | Description | Evaluator type |
-|--------------------|-----------------------------------------------------------------------|------------------------------|
-| `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Safety.GroundednessProEvaluator> |
-| `Protected Material` | Evaluates response for the presence of protected material | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ProtectedMaterialEvaluator> |
-| `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | <xref:Microsoft.Extensions.AI.Evaluation.Safety.UngroundedAttributesEvaluator> |
-| `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair | <xref:Microsoft.Extensions.AI.Evaluation.Safety.HateAndUnfairnessEvaluator> |
-| `Self Harm` | Evaluates a response for the presence of content that indicates self harm | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SelfHarmEvaluator> |
-| `Violence` | Evaluates a response for the presence of violent content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ViolenceEvaluator> |
-| `Sexual` | Evaluates a response for the presence of sexual content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SexualEvaluator> |
-| `Code Vulnerability` | Evaluates a response for the presence of vulnerable code | <xref:Microsoft.Extensions.AI.Evaluation.Safety.CodeVulnerabilityEvaluator> |
-| `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | <xref:Microsoft.Extensions.AI.Evaluation.Safety.IndirectAttackEvaluator> |
+| Evaluator type | Metric | Description |
+|---------------------------------------------------------------------------|--------------------|-------------|
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.GroundednessProEvaluator> | `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.ProtectedMaterialEvaluator> | `Protected Material` | Evaluates response for the presence of protected material |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.UngroundedAttributesEvaluator> | `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.HateAndUnfairnessEvaluator>| `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.SelfHarmEvaluator>| `Self Harm` | Evaluates a response for the presence of content that indicates self harm |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.ViolenceEvaluator>| `Violence` | Evaluates a response for the presence of violent content |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.SexualEvaluator>| `Sexual` | Evaluates a response for the presence of sexual content |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.CodeVulnerabilityEvaluator> | `Code Vulnerability` | Evaluates a response for the presence of vulnerable code |
+| <xref:Microsoft.Extensions.AI.Evaluation.Safety.IndirectAttackEvaluator> | `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering |
 
 † In addition, the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentHarmEvaluator> provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
````
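All of the evaluators tabulated in the diff above are invoked through the library's common <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface. As a rough sketch of how a quality evaluator is called, hedged because the exact member names (for example, `CoherenceEvaluator.CoherenceMetricName` and the `ChatConfiguration` constructor shape) are recalled from the library's documented surface and should be checked against the current API reference:

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Assumes `chatClient` is an IChatClient you've already configured
// (for example, an Azure OpenAI client); it isn't created here.
// This code would run inside an async method (such as an MSTest test method).
IEvaluator evaluator = new CoherenceEvaluator();

var messages = new List<ChatMessage>
{
    new(ChatRole.User, "What is the capital of France?")
};
ChatResponse response = await chatClient.GetResponseAsync(messages);

// Quality evaluators use an LLM to perform the evaluation, so they
// need a ChatConfiguration that wraps the grading chat client.
EvaluationResult result = await evaluator.EvaluateAsync(
    messages,
    response,
    new ChatConfiguration(chatClient));

// Retrieve the named metric from the result and inspect its score.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");
```

Safety evaluators follow the same `EvaluateAsync` pattern but are configured against the Azure AI Foundry Evaluation service rather than a general-purpose LLM.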

docs/ai/quickstarts/evaluate-ai-response.md (5 additions, 5 deletions)

````diff
@@ -1,14 +1,14 @@
 ---
-title: Quickstart - Evaluate a model's response
+title: Quickstart - Evaluate the quality of a model's response
 description: Learn how to create an MSTest app to evaluate the AI chat response of a language model.
 ms.date: 03/18/2025
 ms.topic: quickstart
 ms.custom: devx-track-dotnet, devx-track-dotnet-ai
 ---
 
-# Evaluate a model's response
+# Evaluate the quality of a model's response
 
-In this quickstart, you create an MSTest app to evaluate the chat response of an OpenAI model. The test app uses the [Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) libraries.
+In this quickstart, you create an MSTest app to evaluate the quality of a chat response from an OpenAI model. The test app uses the [Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) libraries.
 
 > [!NOTE]
 > This quickstart demonstrates the simplest usage of the evaluation API. Notably, it doesn't demonstrate use of the [response caching](../conceptual/evaluation-libraries.md#cached-responses) and [reporting](../conceptual/evaluation-libraries.md#reporting) functionality, which are important if you're authoring unit tests that run as part of an "offline" evaluation pipeline. The scenario shown in this quickstart is suitable in use cases such as "online" evaluation of AI responses within production code and logging scores to telemetry, where caching and reporting aren't relevant. For a tutorial that demonstrates the caching and reporting functionality, see [Tutorial: Evaluate a model's response with response caching and reporting](../tutorials/evaluate-with-reporting.md)
@@ -49,9 +49,9 @@ Complete the following steps to create an MSTest project that connects to the `g
 
 ```bash
 dotnet user-secrets init
-dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-azure-openai-endpoint>
+dotnet user-secrets set AZURE_OPENAI_ENDPOINT <your-Azure-OpenAI-endpoint>
 dotnet user-secrets set AZURE_OPENAI_GPT_NAME gpt-4o
-dotnet user-secrets set AZURE_TENANT_ID <your-tenant-id>
+dotnet user-secrets set AZURE_TENANT_ID <your-tenant-ID>
 ```
 
 (Depending on your environment, the tenant ID might not be needed. In that case, remove it from the code that instantiates the <xref:Azure.Identity.DefaultAzureCredential>.)
````
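Values stored with `dotnet user-secrets set`, as in the hunk above, are typically read back in the test code through the standard .NET configuration APIs. A minimal sketch, assuming the `Microsoft.Extensions.Configuration.UserSecrets` package is referenced and using a hypothetical `MyTests` type from the test assembly to locate the secrets store:

```csharp
using Microsoft.Extensions.Configuration;

// AddUserSecrets<T> finds the secrets store via the UserSecretsId
// associated with the assembly containing T. `MyTests` is a
// hypothetical type in the test project, used only as an anchor.
IConfigurationRoot config = new ConfigurationBuilder()
    .AddUserSecrets<MyTests>()
    .Build();

// Read the values set with `dotnet user-secrets set`.
string? endpoint = config["AZURE_OPENAI_ENDPOINT"];
string? deploymentName = config["AZURE_OPENAI_GPT_NAME"];
string? tenantId = config["AZURE_TENANT_ID"];
```

If `AZURE_TENANT_ID` wasn't set (as the note above allows), `tenantId` is simply `null` and can be omitted from the credential setup.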

docs/ai/toc.yml (4 additions, 2 deletions)

````diff
@@ -81,9 +81,11 @@ items:
     items:
     - name: The Microsoft.Extensions.AI.Evaluation libraries
      href: conceptual/evaluation-libraries.md
-    - name: "Quickstart: Evaluate a model's response"
+    - name: "Quickstart: Evaluate the quality of a response"
      href: quickstarts/evaluate-ai-response.md
-    - name: "Tutorial: Evaluate a response with response caching and reporting"
+    - name: "Tutorial: Evaluate the safety of a response"
+      href: tutorials/evaluate-safety.md
+    - name: "Tutorial: Evaluate a response with caching and reporting"
      href: tutorials/evaluate-with-reporting.md
   - name: Resources
     items:
````

0 commit comments
