# [Known Limitation] Benchmark multilingual semantic retrieval before changing the default embedding model

**Labels:** `enhancement`, `memory`, `semantic-search`, `known-limitation`  
**Priority:** Post-`2.1.0a1`  
**Area:** Engineering Memory / semantic retrieval  
**Current default:** `BAAI/bge-small-en-v1.5`  
**Primary candidate:** `intfloat/multilingual-e5-small`  
**Secondary candidate:** `Qwen3-Embedding-0.6B`

## Summary

CodeClone Engineering Memory currently uses `BAAI/bge-small-en-v1.5` as the default local embedding model for FastEmbed-backed semantic retrieval.

The current model is small, fast, deterministic enough for local use, and works well for English technical text. However, it is English-oriented. Russian and mixed Russian/English queries may retrieve weaker results than equivalent English queries, even when the relevant Engineering Memory records and trajectories are present.

This limitation does **not** mean that semantic retrieval is broken. Current evidence shows that:

- semantic search can retrieve the relevant Phase 34 records and trajectories;
- plain FTS-only retrieval may produce noisy results for conceptual or phase-oriented queries;
- trajectory retrieval often contains richer implementation history than the record lane;
- retrieval quality depends on lane routing and search orchestration, not only on the embedding model.

The model should therefore not be replaced before `2.1.0a1`. After the alpha release, CodeClone should run a controlled benchmark comparing the current BGE baseline with multilingual and larger embedding candidates.

## Current Configuration

```toml
[tool.codeclone.memory.semantic]
enabled = true
backend = "lancedb"
index_path = ".codeclone/memory/semantic_index.lance"

embedding_provider = "fastembed"
embedding_model = "BAAI/bge-small-en-v1.5"
embedding_cache_dir = ".codeclone/memory/fastembed"
allow_model_download = true

dimension = 384
max_results = 20
index_audit = true
```

Relevant optional dependencies already exist:

```toml
[project.optional-dependencies]
semantic-fastembed = [
    "fastembed>=0.8.0,<0.9",
]

semantic-local = [
    "lancedb>=0.33.0",
    "fastembed>=0.8.0,<0.9",
]
```

No new Python dependency is expected for `intfloat/multilingual-e5-small` if it is loaded through FastEmbed.

## Known Limitation

The current embedding model may underperform on:

- Russian-language project-history queries;
- mixed Russian/English technical queries;
- phase names expressed in Russian while source records are written in English;
- conceptual queries where the relevant evidence is distributed across memory and trajectory lanes;
- queries that require semantic matching rather than lexical overlap.

Example:

```text
фаза 34
```

The relevant Phase 34 material exists, but retrieval quality varies by mode:

- FTS-only may rank noisy module/path matches;
- semantic search performs better;
- `trajectory_search` often provides the richest implementation history;
- a multilingual embedding model may improve Russian and mixed-language recall.

## Candidate Models

### Baseline

```text
BAAI/bge-small-en-v1.5
```

Characteristics:

- English-oriented;
- 384-dimensional output;
- 512-token context;
- small and fast;
- current production baseline.

### Primary candidate

```text
intfloat/multilingual-e5-small
```

Characteristics:

- multilingual, including Russian;
- 384-dimensional output;
- 512-token context;
- uses `query:` and `passage:` prefixes;
- close to drop-in at the index-schema level, because the vector dimension stays at 384;
- full semantic reindex still required because vector spaces are incompatible.
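
The `query:` and `passage:` prefixes listed above can be applied by a small formatting helper. This is a minimal sketch only: the constant and function names are hypothetical and are not part of CodeClone or FastEmbed. Only the two prefix strings come from this issue.

```python
# Hypothetical helpers; only the prefix strings are taken from this issue.
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "


def format_e5_query(text: str) -> str:
    """Return the search query in the form E5 expects for queries."""
    return E5_QUERY_PREFIX + text.strip()


def format_e5_passage(text: str) -> str:
    """Return an indexed document chunk in the form E5 expects for passages."""
    return E5_PASSAGE_PREFIX + text.strip()


if __name__ == "__main__":
    print(format_e5_query("фаза 34"))
    print(format_e5_passage("Phase 34 context governance"))
```

Queries and indexed documents must use different prefixes, so the helper is split into two functions rather than taking a flag that is easy to get wrong at a call site.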

Proposed configuration:

```toml
[tool.codeclone.memory.semantic]
enabled = true
backend = "lancedb"
index_path = ".codeclone/memory/semantic_index.lance"

embedding_provider = "fastembed"
embedding_model = "intfloat/multilingual-e5-small"
embedding_cache_dir = ".codeclone/memory/fastembed"
allow_model_download = true

dimension = 384
max_results = 20
index_audit = true
```

### Secondary candidate

```text
Qwen3-Embedding-0.6B
```

Characteristics:

- substantially larger and more expensive;
- multilingual;
- potentially higher retrieval quality;
- higher model-load latency and memory use;
- likely requires provider-specific integration and benchmark validation;
- must not be assumed superior without project-specific evidence.

## FastEmbed Integration Requirement

`intfloat/multilingual-e5-small` is not guaranteed to be present in the built-in FastEmbed registry for the pinned FastEmbed version. Changing only `embedding_model` in `pyproject.toml` is therefore insufficient.

CodeClone must add explicit, idempotent custom-model registration before constructing `TextEmbedding`.

Expected registration contract:

```python
from fastembed import TextEmbedding
from fastembed.common.model_description import ModelSource, PoolingType


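def ensure_multilingual_e5_small_registered() -> None:
    """Register multilingual-e5-small with FastEmbed unless it is already known."""
    model_name = "intfloat/multilingual-e5-small"
    # Compare case-insensitively so this check agrees with the provider's
    # lower-cased model-name comparison.
    supported = {
        str(item.get("model")).lower()
        for item in TextEmbedding.list_supported_models()
        if isinstance(item, dict)
    }
    if model_name.lower() in supported:
        return  # built in, or already registered in this process

    TextEmbedding.add_custom_model(
        model=model_name,
        pooling=PoolingType.MEAN,
        normalization=True,
        sources=ModelSource(hf=model_name),
        dim=384,
        model_file="onnx/model.onnx",
    )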
```

The provider should call the registration helper before model construction:

```python
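def _get_model(self) -> _TextEmbeddingModel:
    # Reuse the already constructed model; loading is expensive.
    if self._model is not None:
        return self._model

    # Register the custom model first so FastEmbed can resolve its name.
    if self.model_name.lower() == "intfloat/multilingual-e5-small":
        ensure_multilingual_e5_small_registered()

    model = self._text_embedding(
        model_name=self.model_name,
        cache_dir=str(self.cache_dir),
        # With downloads disabled, only files already in the cache may be used.
        local_files_only=not self.allow_model_download,
    )
    self._model = cast(_TextEmbeddingModel, model)
    return self._model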
```

Implementation requirements:

- detect whether the model is already supported natively;
- register it only when required;
- keep registration process-wide idempotent;
- use `PoolingType.MEAN`;
- enable normalization;
- set `dim=384`;
- use `ModelSource(hf="intfloat/multilingual-e5-small")`;
- use `model_file="onnx/model.onnx"`;
- preserve `query:` and `passage:` formatting;
- fail with a clear `MemorySemanticUnavailableError` when registration or model loading fails;
- add a regression test proving that repeated provider construction does not register the model twice;
- add a test proving that the configured dimension matches the returned vector length;
- add a test proving that a cached local model works with `allow_model_download = false`;
- do not silently fall back to another embedding model.

The provider should also make the model token-window contract explicit:

```python
_KNOWN_MODEL_MAX_TOKENS = {
    "baai/bge-small-en-v1.5": 512,
    "intfloat/multilingual-e5-small": 512,
}
```
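
One way to make that contract enforceable is a lookup that refuses unknown models instead of guessing a window. This is a sketch under assumptions: `max_tokens_for` is a hypothetical helper name, and raising `ValueError` stands in for whatever error type the provider actually uses.

```python
_KNOWN_MODEL_MAX_TOKENS = {
    "baai/bge-small-en-v1.5": 512,
    "intfloat/multilingual-e5-small": 512,
}


def max_tokens_for(model_name: str) -> int:
    """Return the declared token window, failing loudly for undeclared models."""
    key = model_name.strip().lower()  # keys in the table are lower-cased
    if key not in _KNOWN_MODEL_MAX_TOKENS:
        raise ValueError(
            f"no token window declared for embedding model {model_name!r}"
        )
    return _KNOWN_MODEL_MAX_TOKENS[key]


if __name__ == "__main__":
    print(max_tokens_for("BAAI/bge-small-en-v1.5"))
```

Failing on an undeclared model keeps chunking from silently assuming a 512-token window for a candidate that has not been reviewed.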

The benchmark runner must record whether the model was loaded from:

- the built-in FastEmbed registry;
- CodeClone custom registration;
- an existing local cache;
- a fresh download.
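
The four states above fall on two axes: how the model name was resolved (built-in or custom registration) and where the files came from (cache or download). The sketch below records both. All names are hypothetical, and treating "the cache directory held files before loading" as "loaded from cache" is a simplifying assumption that a real runner would need to refine per model.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelLoadRecord:
    model: str
    registration: str  # "builtin" or "custom"
    files: str  # "cached" or "downloaded"


def cache_has_files(cache_dir: Path) -> bool:
    """True when the cache directory already holds at least one file."""
    return cache_dir.is_dir() and any(p.is_file() for p in cache_dir.rglob("*"))


def describe_model_load(
    model: str, *, natively_supported: bool, cache_populated_before_load: bool
) -> ModelLoadRecord:
    """Combine both observations into one record for the benchmark output."""
    return ModelLoadRecord(
        model=model,
        registration="builtin" if natively_supported else "custom",
        files="cached" if cache_populated_before_load else "downloaded",
    )
```

`cache_has_files` has to be called before the model is constructed, because a download populates the cache as a side effect.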

## Benchmark Goal

Determine whether CodeClone should:

1. keep `BAAI/bge-small-en-v1.5` as the default;
2. switch to `intfloat/multilingual-e5-small`;
3. offer multilingual E5 as an opt-in profile;
4. adopt `Qwen3-Embedding-0.6B` for a higher-quality local profile;
5. improve retrieval routing without changing the model.

The benchmark must separate three concerns:

- embedding model quality;
- retrieval-pipeline quality;
- runtime and operational cost.

## Benchmark Corpus

Use one frozen corpus snapshot for every model and pipeline configuration.

The snapshot must include:

- Engineering Memory SQLite database;
- trajectory records;
- audit summaries;
- semantic projections;
- statuses and supersession state;
- project identity;
- source commit SHA;
- projection version;
- chunking implementation;
- query/document prefixes;
- candidate limits;
- LanceDB schema.

Each run should write a manifest:

```json
{
  "benchmark_version": "memory-retrieval-v1",
  "repo_commit": "<sha>",
  "project_id": "<project-id>",
  "memory_db_sha256": "<sha256>",
  "audit_db_sha256": "<sha256>",
  "projection_version": "<version>",
  "model": "intfloat/multilingual-e5-small",
  "provider": "fastembed",
  "dimension": 384,
  "query_prefix": "query: ",
  "document_prefix": "passage: ",
  "max_results": 20
}
```

A complete semantic reindex is required for every embedding model.
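
A manifest like the one above could be produced as follows. This is a sketch, not the runner's actual implementation: `sha256_of_file` and `write_manifest` are hypothetical names, and the caller is assumed to supply the remaining keys (`repo_commit`, `model`, `dimension`, and so on) as keyword arguments.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large databases are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(
    manifest_path: Path, memory_db: Path, audit_db: Path, **fields: object
) -> dict:
    """Write the run manifest; `fields` carries the remaining manifest keys."""
    manifest = {
        "benchmark_version": "memory-retrieval-v1",
        "memory_db_sha256": sha256_of_file(memory_db),
        "audit_db_sha256": sha256_of_file(audit_db),
        **fields,
    }
    # Sorted keys keep manifests from different runs easy to diff.
    manifest_path.write_text(
        json.dumps(manifest, indent=2, sort_keys=True, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return manifest
```

Two runs that report different `memory_db_sha256` values were not executed against the same frozen snapshot and should not be compared.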

## Benchmark Dataset

Start with 50–80 manually reviewed queries. The target counts below sum to 70.

Recommended distribution:

| Query group | Target count |
|---|---:|
| English | 15 |
| Russian | 15 |
| Mixed Russian/English | 10 |
| Exact phase/task names | 8 |
| Path/symbol queries | 8 |
| Incident/recovery/history | 8 |
| Negative/stale/cross-project | 6 |

### Example queries

#### Phase and project history

```text
Phase 34
фаза 34
что было сделано в фазе 34
Phase 34 context governance
34.4 review receipt
что сломалось при реализации patch trail
immutable blast artifact
```

#### Decisions and rationale

```text
почему blast artifact должен быть immutable
why is full blast returned when artifact persistence fails
зачем review receipt вынесли из finish response
```

#### Paths and symbols

```text
_session_audit_artifact_mixin.py
finish_propose_memory
why does workspace_intent gate import MCP internals
```

#### Trajectories and incidents

```text
какой intent нарушил cohesion при patch trail
which controlled change completed slice 34.4
show the failed first attempt at patch trail
```

#### Negative and stale cases

```text
Phase 35 shipped
mem-22e50218
current next task according to stale handoff
```

## Ground-Truth Format

Each benchmark case should declare required positives, acceptable alternatives, forbidden noise, stale records, and expected lanes.

Example:

```json
{
  "id": "phase34-ru-001",
  "query": "что было сделано в фазе 34",
  "language": "ru",
  "intent_type": "project_history",
  "expected_lanes": ["memory", "trajectory"],
  "expected_memory_ids": [
    "mem-5a6f7e46",
    "mem-e228e628"
  ],
  "expected_trajectory_ids": [
    "intent-30b56d21-004",
    "intent-4292f012-001"
  ],
  "acceptable_alternatives": [
    "mem-a8e1304e",
    "mem-af8fb2a0"
  ],
  "forbidden_ids": [
    "unrelated-module-role-record"
  ],
  "stale_ids": [
    "stale-handoff-record"
  ],
  "rationale": "Should retrieve Phase 34 implementation history rather than modules containing _context in their paths."
}
```

Ground truth must not require exact full-list equality. It should support:

- required positives;
- acceptable positives;
- forbidden results;
- stale/superseded negatives;
- lane expectations.
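
Scoring against this format can be done with set intersections rather than list equality. The sketch below assumes the field names from the example case above; `CaseScore`, `score_case`, and the `passed` rule are hypothetical and only illustrate one possible pass condition. Lane expectations are not scored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScore:
    required_hits: int
    required_total: int
    acceptable_hits: int
    forbidden_hits: int
    stale_hits: int

    @property
    def passed(self) -> bool:
        # Assumed rule: all required positives present, no forbidden or stale results.
        return (
            self.required_hits == self.required_total
            and self.forbidden_hits == 0
            and self.stale_hits == 0
        )


def score_case(case: dict, returned_ids: list[str], k: int = 10) -> CaseScore:
    """Score the top-k returned IDs of one query against its ground-truth case."""
    top = set(returned_ids[:k])
    required = set(case.get("expected_memory_ids", [])) | set(
        case.get("expected_trajectory_ids", [])
    )
    return CaseScore(
        required_hits=len(top & required),
        required_total=len(required),
        acceptable_hits=len(top & set(case.get("acceptable_alternatives", []))),
        forbidden_hits=len(top & set(case.get("forbidden_ids", []))),
        stale_hits=len(top & set(case.get("stale_ids", []))),
    )
```

Because acceptable alternatives are counted separately, a pipeline is not penalized for returning them, and it is not credited with a required hit either.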

## Retrieval Pipelines to Compare

For each embedding model, compare:

```text
A. FTS/BM25 only
B. semantic only
C. FTS + semantic RRF
D. deterministic candidates + semantic
E. FTS + deterministic candidates + semantic
F. E + trajectory lane
G. optional reranker on top of F
```
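
For pipeline C, reciprocal rank fusion (RRF) combines ranked lists by summing `1 / (k + rank)` per document. The sketch below is a generic RRF implementation, not CodeClone's fusion code. The function name is hypothetical, `k = 60` is the commonly used constant rather than a value taken from this issue, and ties are broken by ID.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, limit: int = 20
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        seen: set[str] = set()
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in seen:  # count each ID once per list
                continue
            seen.add(doc_id)
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, then by ID so equal scores order deterministically.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:limit]


if __name__ == "__main__":
    fts = ["a", "b", "c"]
    semantic = ["b", "c", "d"]
    print([doc_id for doc_id, _ in reciprocal_rank_fusion([fts, semantic])])
```

RRF uses ranks only, so BM25 scores and cosine similarities never need to be put on a common scale.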

Deterministic candidate generation should include:

- exact record ID;
- exact path;
- exact symbol;
- parent module;
- typed links;
- import neighbours;
- blast-radius relationships;
- project scope.

Semantic ranking should operate on:

- intent;
- statement;
- architecture decision;
- change rationale;
- risk note;
- audit summary;
- trajectory summary.

## Quality Metrics

Primary metrics:

- Recall@5;
- Recall@10;
- MRR;
- nDCG@10.
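
The primary metrics above have standard definitions, sketched below for a single query. The function names are hypothetical, ranked lists are assumed to contain no duplicate IDs, and the `gains` mapping (for example, 1.0 for a required positive) is an assumption about how relevance is graded.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 when none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """Discounted cumulative gain of the top k, normalized by the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values: list[float]) -> float:
    """MRR is the mean of reciprocal_rank over all queries in a group."""
    return sum(values) / len(values) if values else 0.0
```

Computing each metric per query and averaging within a query group gives the per-language breakdown directly.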

CodeClone-specific metrics:

- lane recall;
- wrong-lane miss rate;
- empty-result rate;
- noise@5;
- stale retrieval rate;
- superseded retrieval rate;
- cross-project leakage;
- required-ID hit rate;
- acceptable-alternative hit rate.

Report metrics separately for:

- English;
- Russian;
- mixed Russian/English;
- exact identifiers;
- conceptual/history queries;
- stale/negative cases.

The most important multilingual comparison metrics are:

```text
RU Recall@10
RU MRR
Mixed Recall@10
Mixed MRR
```

## Runtime and Operational Metrics

Use CodeClone Observer where possible.

Measure separately:

### Cold start

- model load duration;
- model load RSS delta;
- first-query latency;
- cache initialization cost.

### Warm retrieval

- latency p50/p95/p99;
- query-embedding latency;
- candidate-generation latency;
- vector-search latency;
- fusion/rerank latency;
- response payload size;
- returned context units.
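
Latency percentiles can be computed from raw per-query samples. The sketch below uses the nearest-rank definition, which is one of several common percentile conventions and is an assumption here. The function names are hypothetical and do not describe CodeClone Observer.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize_latencies(samples_ms: list[float]) -> dict[str, float]:
    """Report the three warm-retrieval percentiles named above."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


if __name__ == "__main__":
    print(summarize_latencies([float(i) for i in range(1, 101)]))
```

With a 50–80 query dataset, p99 is effectively the slowest or second-slowest sample, so it is only meaningful when each query is repeated several times.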

### Full rebuild

- total rebuild duration;
- records per second;
- chunks per second;
- peak RSS;
- embedding cache size;
- LanceDB index size;
- DB queries and writes;
- failed/skipped projection counts.

### Incremental rebuild

- changed-record count;
- incremental duration;
- DB queries and writes;
- peak RSS;
- resulting index consistency.

## Determinism Checks

Run every model/pipeline combination at least three times on the same platform and corpus snapshot.

Verify:

- same candidate set;
- same top-k IDs;
- stable ordering;
- deterministic tie-breaking by stable ID;
- same projection count;
- same index manifest;
- same status and project filtering.

Floating-point scores do not need to match bit-for-bit across platforms, but the ranking contract must remain stable on the same provider and platform.
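
The ranking contract can be checked by ordering results with an explicit ID tie-break and then comparing the top-k IDs of repeated runs. This is a sketch with hypothetical helper names. It compares IDs only, in line with the note that scores need not match bit-for-bit.

```python
from typing import Optional


def stable_top_k(scored: list[tuple[str, float]], k: int) -> list[str]:
    """Order by descending score, breaking ties by ID, and keep the top k IDs."""
    ordered = sorted(scored, key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:k]]


def first_divergence(reference: list[str], candidate: list[str]) -> Optional[int]:
    """Return the first position where two runs differ, or None if identical."""
    for index, (left, right) in enumerate(zip(reference, candidate)):
        if left != right:
            return index
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


def runs_are_deterministic(runs: list[list[str]]) -> bool:
    """True when every run returned the same IDs in the same order as the first run."""
    return all(first_divergence(runs[0], run) is None for run in runs[1:])
```

Reporting the first diverging position, rather than a bare pass or fail, shows whether instability sits at the top of the list or only in the tail.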

## Success Criteria for `multilingual-e5-small`

Do not change the default model based on one successful query.

A reasonable promotion threshold:

- Russian Recall@10 improves by at least 15 percentage points over BGE;
- mixed-language MRR improves materially;
- English nDCG@10 decreases by no more than 5%, and ideally by no more than 3%;
- warm p95 latency is no worse than 2× the current baseline;
- peak RSS remains suitable for local developer machines;
- full rebuild remains operationally acceptable;
- stale and cross-project leakage do not increase;
- improvements are not caused only by a larger or noisier candidate set.
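
The numeric parts of this threshold can be encoded as an explicit check. The sketch below is illustrative: `ModelReport` and `promotion_blockers` are hypothetical names, and the nDCG limit is interpreted as a relative 5% drop, which is an assumption. The criteria without numbers (material MRR gain, RSS suitability, rebuild acceptability, candidate-set noise) still need human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelReport:
    ru_recall_at_10: float  # fraction in [0, 1]
    en_ndcg_at_10: float
    warm_p95_ms: float
    stale_rate: float
    cross_project_leakage: float


def promotion_blockers(baseline: ModelReport, candidate: ModelReport) -> list[str]:
    """Return the numeric criteria the candidate fails; an empty list means none."""
    blockers: list[str] = []
    if candidate.ru_recall_at_10 - baseline.ru_recall_at_10 < 0.15:
        blockers.append("Russian Recall@10 gain below 15 percentage points")
    if baseline.en_ndcg_at_10 > 0:
        drop = (baseline.en_ndcg_at_10 - candidate.en_ndcg_at_10) / baseline.en_ndcg_at_10
        if drop > 0.05:  # assumed to be a relative drop
            blockers.append("English nDCG@10 dropped by more than 5%")
    if candidate.warm_p95_ms > 2 * baseline.warm_p95_ms:
        blockers.append("warm p95 latency worse than 2x baseline")
    if candidate.stale_rate > baseline.stale_rate:
        blockers.append("stale retrieval rate increased")
    if candidate.cross_project_leakage > baseline.cross_project_leakage:
        blockers.append("cross-project leakage increased")
    return blockers
```

Returning the list of failed criteria, rather than a boolean, lets the comparative report state why a candidate was not promoted.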

## Success Criteria for `Qwen3-Embedding-0.6B`

Qwen should be considered only if it provides measurable additional value beyond multilingual E5.

Required evidence:

- statistically meaningful Recall/MRR/nDCG improvement;
- improved difficult conceptual queries;
- acceptable model-load latency;
- acceptable peak RSS;
- acceptable rebuild duration;
- clear benefit relative to the extra operational cost.

If it does not clearly outperform multilingual E5, it should not become the default.

## Non-Goals

This work must not:

- change `edit_allowed`;
- change `do_not_touch`;
- authorize edits;
- override findings or gates;
- override Patch Trail;
- alter baseline/report/cache/fingerprint contracts;
- make semantic ranking authoritative;
- hide stale or superseded state;
- silently change retrieval semantics.

Embeddings remain an advisory ranking signal.

## Delivery Plan

### After `2.1.0a1`

1. Add FastEmbed support for `intfloat/multilingual-e5-small`.
2. Add explicit model metadata and token-window handling.
3. Create `benchmarks/memory_retrieval/v1/cases.jsonl`.
4. Add a benchmark runner with JSON output.
5. Run BGE vs multilingual E5 on the frozen corpus.
6. Produce a comparative Markdown and JSON report.
7. Decide whether E5 becomes:
   - the new default;
   - an opt-in multilingual profile;
   - or a rejected candidate.
8. Only then integrate and evaluate `Qwen3-Embedding-0.6B`.

## Acceptance Criteria

- A versioned benchmark dataset exists.
- A reproducible benchmark runner exists.
- `intfloat/multilingual-e5-small` is explicitly and idempotently registered with FastEmbed when it is not built in.
- Registration uses mean pooling, normalization, dimension 384, and the expected ONNX model path.
- Repeated provider construction does not duplicate registration.
- Model-load diagnostics distinguish built-in, custom-registered, cached, downloaded, and unavailable states.
- BGE and multilingual E5 are compared on the same corpus snapshot.
- Full reindex is enforced between models.
- Quality and runtime metrics are reported separately.
- Russian, English, and mixed-language results are visible.
- Lane-routing effects are measured separately from model effects.
- The final decision is evidence-based and documented.
- No default embedding model is changed before `2.1.0a1`.
- No authority, gate, baseline, report, cache, fingerprint, or edit-permission contract changes.

[Known Limitation] Benchmark multilingual semantic retrieval before changing the default embedding model #46
