feat: Add generic BaseAdapter framework for third-party evaluator integration (DeepEval + Autoevals) by stone-coding · Pull Request #528 · aws/bedrock-agentcore-sdk-python

stone-coding · 2026-06-16T22:46:47Z

Major changes in this PR:

Adds a generic third-party evaluator adapter framework

Introduces BaseAdapter under evaluation/custom_code_based_evaluators/third_party/.
Standardizes the flow: extract fields from AgentCore spans → validate fields → execute third-party metric/scorer → return EvaluatorOutput.
Supports field_mapper as an escape hatch for custom or unsupported span formats.
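
To make the flow above concrete, here is a minimal sketch of the extract → validate → execute → return sequence. It is not the PR's implementation: `AdapterError`, `extract_fields`, `validate_fields`, `evaluate`, the span layout, and the `{"score": ...}` result shape are all hypothetical stand-ins. Only the error codes and the `errorCode`/`errorMessage` keys come from this PR.

```python
from typing import Any, Callable, Dict, List

Fields = Dict[str, Any]
Spans = List[Dict[str, Any]]


class AdapterError(Exception):
    """Hypothetical exception that carries a machine-readable error code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def extract_fields(spans: Spans) -> Fields:
    """Stand-in extractor: copies known fields from the first span's attributes."""
    if not spans:
        raise AdapterError("FIELD_EXTRACTION_ERROR", "no spans to extract from")
    attrs = spans[0].get("attributes", {})
    return {k: attrs[k] for k in ("input", "actual_output") if k in attrs}


def validate_fields(fields: Fields, required: List[str]) -> None:
    """Raise if any required field is absent or empty."""
    missing = [name for name in required if not fields.get(name)]
    if missing:
        raise AdapterError("MISSING_REQUIRED_FIELD", "missing: " + ", ".join(missing))


def evaluate(
    spans: Spans,
    field_mapper: Callable[[Spans], Fields],
    scorer: Callable[[Fields], float],
) -> Dict[str, Any]:
    """Run the four steps in order and turn known failures into error dicts."""
    try:
        fields = field_mapper(spans)                         # 1. extract
        validate_fields(fields, ["input", "actual_output"])  # 2. validate
        score = scorer(fields)                               # 3. execute
    except AdapterError as err:
        return {"errorCode": err.code, "errorMessage": str(err)}
    return {"score": score}                                  # 4. return
```

Passing a custom callable as `field_mapper` plays the role of the escape hatch: a caller with an unsupported span format supplies its own extraction and the rest of the flow is unchanged.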

Adds DeepEval support

Adds DeepEvalAdapter for running DeepEval BaseMetric implementations such as AnswerRelevancy, Faithfulness, Bias, Toxicity, and GEval.
Converts extracted fields into DeepEval LLMTestCase.
Handles MissingTestCaseParamsError and returns actionable MISSING_REQUIRED_FIELD errors.

Adds Autoevals support

Adds AutoevalsAdapter for Autoevals scorers such as Factuality and ClosedQA.
Maps AgentCore fields into Autoevals eval(input, output, expected) format.
Supports configurable pass/fail threshold.
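
A rough sketch of the field mapping and threshold step, under stated assumptions: `FakeScore` and `FakeExactMatch` are stand-ins written for this example (not Autoevals classes), and `run_scorer`, the `PASS`/`FAIL` labels, and the default threshold of 0.5 are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FakeScore:
    """Stand-in for a scorer result that exposes a numeric score."""

    score: float


class FakeExactMatch:
    """Hypothetical scorer with an eval(input, output, expected) method."""

    def eval(self, input: str, output: str, expected: str) -> FakeScore:
        return FakeScore(score=1.0 if output.strip() == expected.strip() else 0.0)


def run_scorer(scorer: Any, fields: Dict[str, Any], threshold: float = 0.5) -> Dict[str, Any]:
    """Map AgentCore-style field names onto the scorer's arguments, then label the score."""
    result = scorer.eval(
        input=fields["input"],
        output=fields["actual_output"],
        expected=fields["expected_output"],
    )
    label = "PASS" if result.score >= threshold else "FAIL"
    return {"score": result.score, "label": label}
```

Keeping the threshold as a parameter lets the same scorer be strict in one evaluator and lenient in another without touching the scorer itself.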

Adds span parsing support

Adds parser layer for extracting input, actual_output, retrieval_context, context, and expected_output from AgentCore ADOT spans and evaluationReferenceInputs.
Supports Strands, LangChain OTel, and OpenInference parser entry points, with shared Phase 1 extraction logic.
Returns clear FIELD_EXTRACTION_ERROR when supported fields cannot be extracted.

Updates evaluator input model

Adds reference_inputs to EvaluatorInput so expected_output can flow from evaluation reference inputs into third-party evaluators.

Adds tests

Adds unit tests covering DeepEval, Autoevals, span parsing, field mapping, missing required fields, and error handling.
All 42 tests pass.

jariy17 · 2026-06-18T18:54:11Z

@@ -0,0 +1,5 @@
+"""DeepEval integration for AgentCore Evaluation."""
+
+from bedrock_agentcore.evaluation.integrations.deepeval.handler import DeepEvalHandler


Since this is using eval's custom code evaluator, please put this under custom_code_based_evaluators.

jariy17 · 2026-06-18T18:55:16Z

+import threading
+from typing import Any, Callable, Dict, Optional
+
+from deepeval.metrics import BaseMetric


Please add an integ test for this. Look into tests_integ for examples.

jariy17 · 2026-06-18T19:23:18Z

+import threading
+from typing import Any, Callable, Dict, Optional
+
+from deepeval.metrics import BaseMetric


Also, let's add this in our pyproject as an optional dependency, so customers know which deepeval version we support.
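
On the Python side, an optional dependency is usually paired with a guarded import so that a missing package produces an actionable message. The sketch below is one possible shape; `require_optional` is a hypothetical helper, and the distribution and extra names passed to it are placeholders, not names confirmed by this PR.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, dist: str, extra: str) -> ModuleType:
    """Import an optional module or fail with an install hint.

    `dist` and `extra` are whatever the project declares in its packaging
    metadata; the values used by callers here are assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install it with: pip install '{dist}[{extra}]'"
        ) from exc
```

An adapter module could call `require_optional("deepeval", "<dist>", "<extra>")` at import time, so the failure surfaces before any evaluation runs.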

jariy17 · 2026-06-18T19:24:36Z

+
+
+@dataclass
+class ParsedEvaluationEvent:


Please use EvaluatorInput from our code_based_evaluator. No need to duplicate lambda logic.

jariy17 · 2026-06-25T21:38:39Z

+            Error: {"errorCode": str, "errorMessage": str}
+        """
+        try:
+            if isinstance(event, EvaluatorInput):


ParsedEvaluationEvent and EvaluatorInput look like they're doing the same job: both just turn the raw Lambda event into a structured input. __call__ even copies one into the other field-for-field. Is there a reason we need a second type instead of reusing EvaluatorInput?

Proposal: make it a requirement that customers place these adapters within the @code_based_evaluators decorator. That way the adapter stops owning input/output validation and the decorator does it instead. Keeps the adapter focused on just running the eval.

jariy17 · 2026-06-25T22:27:10Z

+}
+
+
+def _get_required_params(metric: BaseMetric) -> List[str]:


metric.measure() already calls check_llm_test_case_params with the metric's own _required_params and raises MissingTestCaseParamsError.

So we can drop the registry: build the LLMTestCase with whatever fields we have, call measure(), and catch that error.

By doing this, we let customers use GEval too — its required fields aren't fixed on the class, they're whatever the customer passes to evaluation_params at construction, so a static registry can never cover it. Letting the metric validate itself handles that case for free.
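
A sketch of the "let the metric validate itself" approach, using stand-ins so it runs without DeepEval installed. `MissingParamsError` plays the role of MissingTestCaseParamsError and `FakeMetric` imitates a metric whose required fields are chosen at construction; `run_metric` and the plain-dict test case are hypothetical simplifications.

```python
from typing import Any, Dict, List


class MissingParamsError(Exception):
    """Stand-in for DeepEval's MissingTestCaseParamsError."""


class FakeMetric:
    """Hypothetical metric whose required fields are set per instance,
    loosely mirroring how GEval takes evaluation_params at construction."""

    def __init__(self, required: List[str]) -> None:
        self.required = required

    def measure(self, test_case: Dict[str, Any]) -> float:
        # The metric checks its own requirements; no external registry is consulted.
        missing = [name for name in self.required if test_case.get(name) is None]
        if missing:
            raise MissingParamsError(", ".join(missing) + " cannot be None")
        return 1.0


def run_metric(metric: FakeMetric, fields: Dict[str, Any]) -> Dict[str, Any]:
    """Build the test case from whatever fields exist and let measure() validate."""
    test_case = {k: v for k, v in fields.items() if v is not None}
    try:
        score = metric.measure(test_case)
    except MissingParamsError as err:
        return {"errorCode": "MISSING_REQUIRED_FIELD", "errorMessage": str(err)}
    return {"score": score}
```

Because the adapter never inspects `required` itself, a metric configured with an unusual field combination is handled the same way as a built-in one.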

jariy17 · 2026-06-25T22:37:11Z

+) -> Dict[str, Any]:
+    """Extract evaluation fields from AgentCore session spans.
+
+    Parses _eval_log_records from span attributes, filters by target_trace_id,


Can you tell me which OTel agent semantic convention you are following here? I haven't seen any agent SDK emit _eval_log_records.

jariy17 · 2026-06-25T22:54:26Z

+        self.validate_fields(fields)
+        return fields
+
+    def validate_fields(self, fields: Dict[str, Any]) -> None:


Can you add @abstractmethod here please? The no-op default means a subclass that forgets to override it silently skips validation, and bad fields fail deeper in execute instead. Both adapters override it anyway, so abstract just makes each one declare its required fields on purpose.
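
A small illustration of what the abstract marker buys, using hypothetical class names (`BaseAdapterSketch`, `ForgetfulAdapter`, `StrictAdapter`) rather than the PR's real classes: a subclass that omits validate_fields fails at construction instead of silently skipping validation.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseAdapterSketch(ABC):
    """Hypothetical base class; only validate_fields is modelled here."""

    @abstractmethod
    def validate_fields(self, fields: Dict[str, Any]) -> None:
        """Raise if a field the evaluator needs is absent."""


class ForgetfulAdapter(BaseAdapterSketch):
    """Omits validate_fields, so Python refuses to instantiate it."""


class StrictAdapter(BaseAdapterSketch):
    def validate_fields(self, fields: Dict[str, Any]) -> None:
        if not fields.get("input"):
            raise ValueError("input is required")


def can_instantiate(cls: type) -> bool:
    """Return True if the class can be constructed with no arguments."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

With the no-op default, `ForgetfulAdapter` would construct fine and the missing check would only show up later as a confusing failure inside execute.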

jariy17 · 2026-06-25T23:06:22Z

+
+        thread = threading.Thread(target=target, daemon=True)
+        thread.start()
+        thread.join(timeout=self.timeout)


When the thread "times out" here, it doesn't actually end; join just returns to the caller while the worker keeps running. So if Lambda reuses the same container, a background thread from a previous invocation can still be executing during the next one. I've heard this is a real failure case, so let's drop the thread machinery and let the AWS Lambda timeout handle it for us instead.
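
The behaviour described above can be reproduced with the standard library alone. In this sketch, `start_and_join` is a hypothetical helper and the sleep stands in for a slow metric call; the point is that `join(timeout=...)` returns while the worker is still alive.

```python
import threading
import time
from typing import Tuple


def start_and_join(work_seconds: float, timeout: float) -> Tuple[threading.Thread, threading.Event]:
    """Start a worker, wait at most `timeout` seconds, and hand back the thread.

    join() with a timeout only stops waiting; it never stops the worker.
    """
    finished = threading.Event()

    def worker() -> None:
        time.sleep(work_seconds)  # stands in for a slow metric call
        finished.set()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout=timeout)
    return thread, finished
```

If the caller returns a timeout error at this point, the worker still runs to completion in the background, which is the leak into a later invocation that the comment warns about.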

jariy17 · 2026-06-25T23:09:39Z

+
+    def __init__(
+        self,
+        field_mapper: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None,


Can we make extract_fields_from_spans the default value of field_mapper in the constructor? Then we have one extraction path instead of the if-field_mapper-else branch.
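
One way the suggestion could look, as a sketch only: the `extract_fields_from_spans` below is a stand-in with an assumed event layout, and `AdapterSketch` is a hypothetical class, not the PR's adapter.

```python
from typing import Any, Callable, Dict

FieldMapper = Callable[[Dict[str, Any]], Dict[str, Any]]


def extract_fields_from_spans(event: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the real extractor; the event layout here is an assumption."""
    attrs = event.get("attributes", {})
    return {"input": attrs.get("input"), "actual_output": attrs.get("actual_output")}


class AdapterSketch:
    """Hypothetical adapter with a single extraction path."""

    def __init__(self, field_mapper: FieldMapper = extract_fields_from_spans) -> None:
        # The default extractor is just another mapper, so no None check is needed.
        self.field_mapper = field_mapper

    def extract(self, event: Dict[str, Any]) -> Dict[str, Any]:
        return self.field_mapper(event)
```

A caller with a custom span format passes its own callable, and `extract` never needs to know which case it is in.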


stone-coding requested a review from a team June 16, 2026 22:46

stone-coding requested a deployment to manual-approval June 16, 2026 22:47 — with GitHub Actions Waiting

jariy17 mentioned this pull request Jun 18, 2026

Expose evaluationReferenceInputs (ground truth) on EvaluatorInput for code-based evaluators #539

Closed

stone-coding requested a deployment to manual-approval June 24, 2026 23:26 — with GitHub Actions Waiting

stone-coding changed the title ~~feat: Add DeepEvalHandler for third-party evaluator integration~~ feat: Add generic BaseAdapter framework for third-party evaluator integration (DeepEval + Autoevals) Jun 25, 2026

jariy17 requested changes Jun 25, 2026


stone-coding requested a deployment to manual-approval June 30, 2026 18:54 — with GitHub Actions Waiting

haomiao037 added 13 commits June 30, 2026 11:59

Add DeepEvalHandler integration with unit tests

77d16a9

Introduces a new integrations/deepeval/ module that adapts AgentCore Lambda evaluation events into DeepEval LLMTestCase objects, runs any BaseMetric, and returns structured score/label/explanation responses.

Fix span extraction to use real AgentCore _eval_log_records structure

402ea78

Set context field from tool messages for HallucinationMetric support

e9ef47d

Use metric.success for label instead of manual threshold comparison

f97827e

Add model override and timeout enforcement to DeepEvalHandler

9d256ae

Add model override, timeout enforcement, use metric.success, fix SingleTurnParams deprecation

c142d50

Fix _get_required_params to handle GEval unmappable typing params

3ccc98c

Add .deepeval/ to gitignore

a884f91

Move model override to init to avoid per-call mutation

6ac198c

Refactor to BaseAdapter framework with DeepEval/Autoevals adapters and EvaluatorInput support

8d415e5

Major refactor: move to custom_code_based_evaluators, add span parser layer, simplify per TJ/Irene feedback

8627ab0

Fix review items: add reference_inputs to model, tighten error detection, add validate_fields to DeepEvalAdapter

9a9b6a7

Adapt to upstream ReferenceInput model: remove duplicate field, use expected_response_text property

0499d4b

stone-coding force-pushed the deepeval-handler branch from 2138661 to 0499d4b Compare June 30, 2026 19:01

stone-coding requested a deployment to manual-approval June 30, 2026 19:01 — with GitHub Actions Waiting

stone-coding wants to merge 13 commits into aws:main from stone-coding:deepeval-handler


3 participants
