[Known Limitation] Benchmark multilingual semantic retrieval before changing the default embedding model #46

@orenlab

Labels: enhancement, memory, semantic-search, known-limitation
Priority: Post-2.1.0a1
Area: Engineering Memory / semantic retrieval
Current default: BAAI/bge-small-en-v1.5
Primary candidate: intfloat/multilingual-e5-small
Secondary candidate: Qwen3-Embedding-0.6B

Summary

CodeClone Engineering Memory currently uses BAAI/bge-small-en-v1.5 as the default local embedding model for FastEmbed-backed semantic retrieval.

The current model is small, fast, deterministic enough for local use, and works well for English technical text. However, it is English-oriented. Russian and mixed Russian/English queries may retrieve weaker results than equivalent English queries, even when the relevant Engineering Memory records and trajectories are present.

This limitation does not mean that semantic retrieval is broken. Current evidence shows that:

  • semantic search can retrieve the relevant Phase 34 records and trajectories;
  • plain FTS-only retrieval may produce noisy results for conceptual or phase-oriented queries;
  • trajectory retrieval often contains richer implementation history than the record lane;
  • retrieval quality depends on lane routing and search orchestration, not only on the embedding model.

The model should therefore not be replaced before 2.1.0a1. After the alpha release, CodeClone should run a controlled benchmark comparing the current BGE baseline with multilingual and larger embedding candidates.

Current Configuration

[tool.codeclone.memory.semantic]
enabled = true
backend = "lancedb"
index_path = ".codeclone/memory/semantic_index.lance"

embedding_provider = "fastembed"
embedding_model = "BAAI/bge-small-en-v1.5"
embedding_cache_dir = ".codeclone/memory/fastembed"
allow_model_download = true

dimension = 384
max_results = 20
index_audit = true

Relevant optional dependencies already exist:

[project.optional-dependencies]
semantic-fastembed = [
    "fastembed>=0.8.0,<0.9",
]

semantic-local = [
    "lancedb>=0.33.0",
    "fastembed>=0.8.0,<0.9",
]

No new Python dependency is expected for intfloat/multilingual-e5-small if it is loaded through FastEmbed.

Known Limitation

The current embedding model may underperform on:

  • Russian-language project-history queries;
  • mixed Russian/English technical queries;
  • phase names expressed in Russian while source records are written in English;
  • conceptual queries where the relevant evidence is distributed across memory and trajectory lanes;
  • queries that require semantic matching rather than lexical overlap.

Example:

фаза 34 (Russian for "Phase 34")

The relevant Phase 34 material exists, but retrieval quality varies by mode:

  • FTS-only may rank noisy module/path matches;
  • semantic search performs better;
  • trajectory_search often provides the richest implementation history;
  • a multilingual embedding model may improve Russian and mixed-language recall.

Candidate Models

Baseline

BAAI/bge-small-en-v1.5

Characteristics:

  • English-oriented;
  • 384-dimensional output;
  • 512-token context;
  • small and fast;
  • current production baseline.

Primary candidate

intfloat/multilingual-e5-small

Characteristics:

  • multilingual, including Russian;
  • 384-dimensional output;
  • 512-token context;
  • uses query: and passage: prefixes;
  • drop-in at the index-schema level, because the vector dimension (384) matches the baseline;
  • full semantic reindex still required because vector spaces are incompatible.
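
The query: and passage: prefix convention listed above can be sketched as a small formatting helper. This is a minimal sketch, not CodeClone code: the constant and function names are hypothetical, and it assumes callers always pass raw, unprefixed text.

```python
from typing import Literal

# Hypothetical names; the prefix strings follow the E5 convention named above.
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "


def format_e5_text(text: str, role: Literal["query", "passage"]) -> str:
    """Prepend the E5 role prefix to raw (unprefixed) text."""
    if role == "query":
        return E5_QUERY_PREFIX + text
    if role == "passage":
        return E5_PASSAGE_PREFIX + text
    raise ValueError(f"unknown role: {role!r}")


# Search queries and indexed records get different prefixes.
print(format_e5_text("фаза 34", "query"))
print(format_e5_text("Phase 34 context governance", "passage"))
```

Indexing and querying would both need to go through the same helper, since a mismatch between the two sides is hard to spot from retrieval results alone.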

Proposed configuration:

[tool.codeclone.memory.semantic]
enabled = true
backend = "lancedb"
index_path = ".codeclone/memory/semantic_index.lance"

embedding_provider = "fastembed"
embedding_model = "intfloat/multilingual-e5-small"
embedding_cache_dir = ".codeclone/memory/fastembed"
allow_model_download = true

dimension = 384
max_results = 20
index_audit = true

Secondary candidate

Qwen3-Embedding-0.6B

Characteristics:

  • substantially larger and more expensive;
  • multilingual;
  • potentially higher retrieval quality;
  • higher model-load latency and memory use;
  • likely requires provider-specific integration and benchmark validation;
  • must not be assumed superior without project-specific evidence.

FastEmbed Integration Requirement

intfloat/multilingual-e5-small is not guaranteed to be present in the built-in FastEmbed registry for the pinned FastEmbed version. Changing only embedding_model in pyproject.toml is therefore insufficient.

CodeClone must add explicit, idempotent custom-model registration before constructing TextEmbedding.

Expected registration contract:

from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType


def ensure_multilingual_e5_small_registered() -> None:
    """Register multilingual-e5-small with FastEmbed unless it is already known."""
    model_name = "intfloat/multilingual-e5-small"
    # Compare lower-cased names so this check agrees with the
    # case-insensitive model-name match used by the provider.
    supported = {
        str(item.get("model")).lower()
        for item in TextEmbedding.list_supported_models()
        if isinstance(item, dict)
    }
    if model_name in supported:
        # Already known to FastEmbed; nothing to register.
        return

    TextEmbedding.add_custom_model(
        model=model_name,
        pooling=PoolingType.MEAN,
        normalization=True,
        sources=ModelSource(hf=model_name),
        dim=384,
        model_file="onnx/model.onnx",
    )

The provider should call the registration helper before model construction:

def _get_model(self) -> _TextEmbeddingModel:
    if self._model is not None:
        return self._model

    if self.model_name.lower() == "intfloat/multilingual-e5-small":
        ensure_multilingual_e5_small_registered()

    model = self._text_embedding(
        model_name=self.model_name,
        cache_dir=str(self.cache_dir),
        local_files_only=not self.allow_model_download,
    )
    self._model = cast(_TextEmbeddingModel, model)
    return self._model

Implementation requirements:

  • detect whether the model is already supported natively;
  • register it only when required;
  • keep registration process-wide idempotent;
  • use PoolingType.MEAN;
  • enable normalization;
  • set dim=384;
  • use ModelSource(hf="intfloat/multilingual-e5-small");
  • use model_file="onnx/model.onnx";
  • preserve query: and passage: formatting;
  • fail with a clear MemorySemanticUnavailableError when registration or model loading fails;
  • add a regression test proving that repeated provider construction does not register the model twice;
  • add a test proving that the configured dimension matches the returned vector length;
  • add a test proving that a cached local model works with allow_model_download = false;
  • do not silently fall back to another embedding model.
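
Two of the requirements above (a clear failure instead of a silent fallback, and a dimension check) can be illustrated without FastEmbed installed. This is a sketch under stated assumptions: MemorySemanticUnavailableError is redefined here as a stand-in for CodeClone's real class, and load_or_fail and check_dimension are hypothetical helper names.

```python
from typing import Callable, Sequence


class MemorySemanticUnavailableError(RuntimeError):
    """Stand-in for CodeClone's error type; the real class may differ."""


def load_or_fail(model_name: str, loader: Callable[[str], object]) -> object:
    """Load exactly the configured model; never substitute another one."""
    try:
        return loader(model_name)
    except Exception as exc:  # registration, download, or model-load failure
        raise MemorySemanticUnavailableError(
            f"embedding model {model_name!r} could not be loaded: {exc}"
        ) from exc


def check_dimension(
    vector: Sequence[float], configured_dim: int, model_name: str
) -> None:
    """Reject vectors whose length disagrees with the configured dimension."""
    if len(vector) != configured_dim:
        raise MemorySemanticUnavailableError(
            f"{model_name!r} returned {len(vector)} dimensions, "
            f"but the configuration declares {configured_dim}"
        )
```

In a test, the loader argument can be replaced by a function that raises, which exercises the error path without network access or a model cache.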

The provider should also make the model token-window contract explicit:

_KNOWN_MODEL_MAX_TOKENS = {
    "baai/bge-small-en-v1.5": 512,
    "intfloat/multilingual-e5-small": 512,
}
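
Because the keys in this mapping are lower-case while the configured name (for example BAAI/bge-small-en-v1.5) is not, a lookup would need to normalize the name first. A minimal sketch, where max_tokens_for is a hypothetical helper and raising on unknown models is an assumed policy:

```python
# Mirrors the mapping above; keys are lower-cased model names.
_KNOWN_MODEL_MAX_TOKENS = {
    "baai/bge-small-en-v1.5": 512,
    "intfloat/multilingual-e5-small": 512,
}


def max_tokens_for(model_name: str) -> int:
    """Return the token window, failing loudly for unknown models."""
    try:
        return _KNOWN_MODEL_MAX_TOKENS[model_name.lower()]
    except KeyError:
        raise ValueError(
            f"no known token window for model {model_name!r}"
        ) from None
```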

The benchmark runner must record whether the model was loaded from:

  • the built-in FastEmbed registry;
  • CodeClone custom registration;
  • an existing local cache;
  • a fresh download.

Benchmark Goal

Determine whether CodeClone should:

  1. keep BAAI/bge-small-en-v1.5 as the default;
  2. switch to intfloat/multilingual-e5-small;
  3. offer multilingual E5 as an opt-in profile;
  4. adopt Qwen3-Embedding-0.6B for a higher-quality local profile;
  5. improve retrieval routing without changing the model.

The benchmark must separate three concerns:

  • embedding model quality;
  • retrieval-pipeline quality;
  • runtime and operational cost.

Benchmark Corpus

Use one frozen corpus snapshot for every model and pipeline configuration.

The snapshot must include:

  • Engineering Memory SQLite database;
  • trajectory records;
  • audit summaries;
  • semantic projections;
  • statuses and supersession state;
  • project identity;
  • source commit SHA;
  • projection version;
  • chunking implementation;
  • query/document prefixes;
  • candidate limits;
  • LanceDB schema.

Each run should write a manifest:

{
  "benchmark_version": "memory-retrieval-v1",
  "repo_commit": "<sha>",
  "project_id": "<project-id>",
  "memory_db_sha256": "<sha256>",
  "audit_db_sha256": "<sha256>",
  "projection_version": "<version>",
  "model": "intfloat/multilingual-e5-small",
  "provider": "fastembed",
  "dimension": 384,
  "query_prefix": "query: ",
  "document_prefix": "passage: ",
  "max_results": 20
}

A complete semantic reindex is required for every embedding model.
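
One possible way to produce the manifest hashes is sketched below. The function names are hypothetical, and the sketch assumes the memory and audit databases are single files that can be hashed byte-for-byte; fields such as repo_commit and model are passed in by the caller.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large databases do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(memory_db: Path, audit_db: Path, **fields: object) -> str:
    """Serialize a run manifest; extra fields come from the benchmark runner."""
    manifest = {
        "benchmark_version": "memory-retrieval-v1",
        "memory_db_sha256": sha256_of(memory_db),
        "audit_db_sha256": sha256_of(audit_db),
        **fields,
    }
    # sort_keys keeps the serialized manifest stable across runs.
    return json.dumps(manifest, indent=2, sort_keys=True, ensure_ascii=False)
```

Two runs against the same snapshot should then yield identical hash fields, which gives the runner a cheap check that the corpus really was frozen.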

Benchmark Dataset

Start with 50–80 manually reviewed queries.

Recommended distribution:

Query group                    Target count
English                        15
Russian                        15
Mixed Russian/English          10
Exact phase/task names         8
Path/symbol queries            8
Incident/recovery/history      8
Negative/stale/cross-project   6

Example queries

Phase and project history

Phase 34
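фаза 34 (English: "phase 34")
что было сделано в фазе 34 (English: "what was done in phase 34")
Phase 34 context governance
34.4 review receipt
что сломалось при реализации patch trail (English: "what broke while implementing patch trail")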
immutable blast artifact

Decisions and rationale

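почему blast artifact должен быть immutable (English: "why must the blast artifact be immutable")
why is full blast returned when artifact persistence fails
зачем review receipt вынесли из finish response (English: "why was the review receipt moved out of the finish response")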

Paths and symbols

_session_audit_artifact_mixin.py
finish_propose_memory
why does workspace_intent gate import MCP internals

Trajectories and incidents

какой intent нарушил cohesion при patch trail (English: "which intent broke cohesion during patch trail")
which controlled change completed slice 34.4
show the failed first attempt at patch trail

Negative and stale cases

Phase 35 shipped
mem-22e50218
current next task according to stale handoff

Ground-Truth Format

Each benchmark case should declare required positives, acceptable alternatives, forbidden noise, stale records, and expected lanes.

Example:

{
  "id": "phase34-ru-001",
  "query": "что было сделано в фазе 34",
  "language": "ru",
  "intent_type": "project_history",
  "expected_lanes": ["memory", "trajectory"],
  "expected_memory_ids": [
    "mem-5a6f7e46",
    "mem-e228e628"
  ],
  "expected_trajectory_ids": [
    "intent-30b56d21-004",
    "intent-4292f012-001"
  ],
  "acceptable_alternatives": [
    "mem-a8e1304e",
    "mem-af8fb2a0"
  ],
  "forbidden_ids": [
    "unrelated-module-role-record"
  ],
  "stale_ids": [
    "stale-handoff-record"
  ],
  "rationale": "Should retrieve Phase 34 implementation history rather than modules containing _context in their paths."
}

Ground truth must not require exact full-list equality. It should support:

  • required positives;
  • acceptable positives;
  • forbidden results;
  • stale/superseded negatives;
  • lane expectations.
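
A case in this format could be judged without full-list equality roughly as follows. This is a sketch: CaseVerdict and judge_case are hypothetical names, memory and trajectory IDs are assumed to arrive in one merged ranked list, and the pass rule (all required IDs present, no forbidden or stale IDs in the top k) is an assumption rather than a decided policy.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence


@dataclass(frozen=True)
class CaseVerdict:
    required_hits: int
    required_total: int
    acceptable_hits: int
    forbidden_hits: int
    stale_hits: int

    @property
    def passed(self) -> bool:
        # Every required ID is present and no forbidden or stale ID leaked in.
        return (
            self.required_hits == self.required_total
            and self.forbidden_hits == 0
            and self.stale_hits == 0
        )


def judge_case(
    case: Mapping[str, Any], ranked_ids: Sequence[str], k: int = 10
) -> CaseVerdict:
    """Compare the top-k result IDs against one ground-truth case."""
    top = set(ranked_ids[:k])
    required = set(case.get("expected_memory_ids", ())) | set(
        case.get("expected_trajectory_ids", ())
    )
    return CaseVerdict(
        required_hits=len(top & required),
        required_total=len(required),
        acceptable_hits=len(top & set(case.get("acceptable_alternatives", ()))),
        forbidden_hits=len(top & set(case.get("forbidden_ids", ()))),
        stale_hits=len(top & set(case.get("stale_ids", ()))),
    )
```

Extra results that are neither required, acceptable, forbidden, nor stale do not fail the case; they only show up in noise-oriented metrics.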

Retrieval Pipelines to Compare

For each embedding model, compare:

A. FTS/BM25 only
B. semantic only
C. FTS + semantic RRF
D. deterministic candidates + semantic
E. FTS + deterministic candidates + semantic
F. E + trajectory lane
G. optional reranker on top of F

Deterministic candidate generation should include:

  • exact record ID;
  • exact path;
  • exact symbol;
  • parent module;
  • typed links;
  • import neighbours;
  • blast-radius relationships;
  • project scope.

Semantic ranking should operate on:

  • intent;
  • statement;
  • architecture decision;
  • change rationale;
  • risk note;
  • audit summary;
  • trajectory summary.
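
For pipeline C, reciprocal rank fusion (RRF) scores each ID as the sum of 1 / (k + rank) over the input rankings. The sketch below is illustrative only: rrf_fuse is a hypothetical name, k = 60 is the constant commonly used in the RRF literature rather than a CodeClone setting, and each input ranking is assumed to contain unique IDs.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]], k: int = 60, limit: int = 20
) -> list[str]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, then by ID so ties break deterministically.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:limit]


fts = ["mem-a", "mem-b", "mem-c"]
semantic = ["mem-b", "mem-a", "mem-d"]
print(rrf_fuse([fts, semantic]))
```

RRF uses only ranks, so FTS/BM25 scores and vector distances never need to be put on a common scale.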

Quality Metrics

Primary metrics:

  • Recall@5;
  • Recall@10;
  • MRR;
  • nDCG@10.

CodeClone-specific metrics:

  • lane recall;
  • wrong-lane miss rate;
  • empty-result rate;
  • noise@5;
  • stale retrieval rate;
  • superseded retrieval rate;
  • cross-project leakage;
  • required-ID hit rate;
  • acceptable-alternative hit rate.

Report metrics separately for:

  • English;
  • Russian;
  • mixed Russian/English;
  • exact identifiers;
  • conceptual/history queries;
  • stale/negative cases.

The most important multilingual comparison metrics are:

RU Recall@10
RU MRR
Mixed Recall@10
Mixed MRR
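
The primary metrics can be computed per query and then averaged per language group. A minimal sketch with hypothetical function names; the graded gains passed to ndcg_at_k (for example 2.0 for required IDs and 1.0 for acceptable alternatives) are an assumption, and ranked lists are assumed to contain unique IDs.

```python
import math
from typing import Collection, Mapping, Sequence


def recall_at_k(ranked: Sequence[str], relevant: Collection[str], k: int) -> float:
    """Fraction of relevant IDs that appear in the top k."""
    wanted = set(relevant)
    if not wanted:
        return 0.0
    return len(set(ranked[:k]) & wanted) / len(wanted)


def reciprocal_rank(ranked: Sequence[str], relevant: Collection[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """Normalized discounted cumulative gain with graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values: Sequence[float]) -> float:
    """Average per-query scores; MRR is the mean of reciprocal ranks."""
    return sum(values) / len(values) if values else 0.0
```

RU MRR, for instance, would be mean() over reciprocal_rank() for the Russian cases only.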

Runtime and Operational Metrics

Use CodeClone Observer where possible.

Measure separately:

Cold start

  • model load duration;
  • model load RSS delta;
  • first-query latency;
  • cache initialization cost.

Warm retrieval

  • latency p50/p95/p99;
  • query-embedding latency;
  • candidate-generation latency;
  • vector-search latency;
  • fusion/rerank latency;
  • response payload size;
  • returned context units.
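
The p50/p95/p99 figures can be derived from raw per-query timings with the standard library. This sketch assumes latencies are collected in milliseconds; latency_percentiles is a hypothetical helper, and the "inclusive" quantile method is one reasonable choice among several.

```python
import statistics
from typing import Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> dict[str, float]:
    """Return p50/p95/p99 from raw latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; index i - 1 holds the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

With only a few dozen warm queries the p99 value is dominated by one or two samples, so it is worth reporting the sample count next to the percentiles.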

Full rebuild

  • total rebuild duration;
  • records per second;
  • chunks per second;
  • peak RSS;
  • embedding cache size;
  • LanceDB index size;
  • DB queries and writes;
  • failed/skipped projection counts.

Incremental rebuild

  • changed-record count;
  • incremental duration;
  • DB queries and writes;
  • peak RSS;
  • resulting index consistency.

Determinism Checks

Run every model/pipeline combination at least three times on the same platform and corpus snapshot.

Verify:

  • same candidate set;
  • same top-k IDs;
  • stable ordering;
  • deterministic tie-breaking by stable ID;
  • same projection count;
  • same index manifest;
  • same status and project filtering.

Floating-point scores do not need to match bit-for-bit across platforms, but the ranking contract must remain stable on the same provider and platform.
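
The tie-breaking and repeat-run checks above might look like the following sketch. Both function names are hypothetical, and results are assumed to be (id, score) pairs where a higher score is better.

```python
from typing import Callable, Sequence


def stable_top_k(scored: Sequence[tuple[str, float]], k: int) -> list[str]:
    """Order by descending score, breaking ties by stable ID."""
    ordered = sorted(scored, key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:k]]


def is_deterministic(run: Callable[[], Sequence[str]], repeats: int = 3) -> bool:
    """Execute the same retrieval several times and compare the top-k IDs."""
    first = list(run())
    return all(list(run()) == first for _ in range(repeats - 1))


hits = [("mem-b", 0.5), ("mem-a", 0.5), ("mem-c", 0.9)]
print(stable_top_k(hits, 2))  # mem-c first, then mem-a wins the tie on ID
```

Comparing ID lists rather than raw scores keeps the check robust to small floating-point differences that do not change the ordering.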

Success Criteria for multilingual-e5-small

Do not change the default model based on one successful query.

A reasonable promotion threshold:

  • Russian Recall@10 improves by at least 15 percentage points over BGE;
  • mixed-language MRR improves materially;
  • English nDCG@10 decreases by no more than 3–5%;
  • warm p95 latency is no worse than 2× the current baseline;
  • peak RSS remains suitable for local developer machines;
  • full rebuild remains operationally acceptable;
  • stale and cross-project leakage do not increase;
  • improvements are not caused only by a larger or noisier candidate set.
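
The numeric parts of this threshold can be encoded as an explicit gate so the decision is reproducible. This is a sketch with hypothetical names. It assumes the nDCG limit is a relative decrease and defaults to the upper bound of the 3–5% range; 0.03 can be passed for the stricter reading. The qualitative criteria (material MRR gain, acceptable RSS and rebuild time, candidate-set noise) still need human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelReport:
    ru_recall_at_10: float  # fraction in [0, 1]
    en_ndcg_at_10: float  # fraction in [0, 1]
    warm_p95_ms: float
    stale_rate: float
    cross_project_leak_rate: float


def promotion_blockers(
    baseline: ModelReport,
    candidate: ModelReport,
    *,
    min_ru_recall_gain: float = 0.15,  # 15 percentage points
    max_en_ndcg_drop: float = 0.05,  # relative decrease (assumed reading)
    max_p95_ratio: float = 2.0,
) -> list[str]:
    """Return the reasons the candidate may not replace the baseline."""
    blockers: list[str] = []
    if candidate.ru_recall_at_10 - baseline.ru_recall_at_10 < min_ru_recall_gain:
        blockers.append("Russian Recall@10 gain is below the threshold")
    if candidate.en_ndcg_at_10 < baseline.en_ndcg_at_10 * (1.0 - max_en_ndcg_drop):
        blockers.append("English nDCG@10 regressed too far")
    if candidate.warm_p95_ms > baseline.warm_p95_ms * max_p95_ratio:
        blockers.append("warm p95 latency exceeds the allowed ratio")
    if candidate.stale_rate > baseline.stale_rate:
        blockers.append("stale retrieval rate increased")
    if candidate.cross_project_leak_rate > baseline.cross_project_leak_rate:
        blockers.append("cross-project leakage increased")
    return blockers
```

An empty list means only that the numeric gates passed, not that the model should be promoted.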

Success Criteria for Qwen3-Embedding-0.6B

Qwen should be considered only if it provides measurable additional value beyond multilingual E5.

Required evidence:

  • statistically meaningful Recall/MRR/nDCG improvement;
  • improved difficult conceptual queries;
  • acceptable model-load latency;
  • acceptable peak RSS;
  • acceptable rebuild duration;
  • clear benefit relative to the extra operational cost.

If it does not clearly outperform multilingual E5, it should not become the default.
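
One way to make "statistically meaningful" concrete on a 50–80 query set is a paired bootstrap over per-query score differences. The issue does not prescribe this method; it is an assumed technique, and the function name, resample count, and confidence level below are illustrative.

```python
import random
from typing import Sequence


def paired_bootstrap_ci(
    baseline: Sequence[float],
    candidate: Sequence[float],
    *,
    resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Confidence interval for the mean per-query gain of candidate over baseline."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    n = len(diffs)
    # Resample queries with replacement and record the mean gain each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(resamples))
    low = means[int((alpha / 2) * resamples)]
    high = means[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return low, high
```

If the interval for, say, per-query reciprocal rank lies entirely above zero, the gain is unlikely to be an artifact of which queries happened to be in the set.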

Non-Goals

This work must not:

  • change edit_allowed;
  • change do_not_touch;
  • authorize edits;
  • override findings or gates;
  • override Patch Trail;
  • alter baseline/report/cache/fingerprint contracts;
  • make semantic ranking authoritative;
  • hide stale or superseded state;
  • silently change retrieval semantics.

Embeddings remain an advisory ranking signal.

Delivery Plan

After 2.1.0a1

  1. Add FastEmbed support for intfloat/multilingual-e5-small.
  2. Add explicit model metadata and token-window handling.
  3. Create benchmarks/memory_retrieval/v1/cases.jsonl.
  4. Add a benchmark runner with JSON output.
  5. Run BGE vs multilingual E5 on the frozen corpus.
  6. Produce a comparative Markdown and JSON report.
  7. Decide whether E5 becomes:
    • the new default;
    • an opt-in multilingual profile;
    • or a rejected candidate.
  8. Only then integrate and evaluate Qwen3-Embedding-0.6B.

Acceptance Criteria

  • A versioned benchmark dataset exists.
  • A reproducible benchmark runner exists.
  • intfloat/multilingual-e5-small is explicitly and idempotently registered with FastEmbed when it is not built in.
  • Registration uses mean pooling, normalization, dimension 384, and the expected ONNX model path.
  • Repeated provider construction does not duplicate registration.
  • Model-load diagnostics distinguish built-in, custom-registered, cached, downloaded, and unavailable states.
  • BGE and multilingual E5 are compared on the same corpus snapshot.
  • Full reindex is enforced between models.
  • Quality and runtime metrics are reported separately.
  • Russian, English, and mixed-language results are visible.
  • Lane-routing effects are measured separately from model effects.
  • The final decision is evidence-based and documented.
  • No default embedding model is changed before 2.1.0a1.
  • No authority, gate, baseline, report, cache, fingerprint, or edit-permission contract changes.
