6 changes: 6 additions & 0 deletions WHATS_NEW.md
@@ -2,6 +2,12 @@

## What's new (2026-06-26)

### Template-Free Element Proposal (Pixels to Elements)

Get a clean numbered element list straight from the screen when there's no accessibility tree. Full reference: [`docs/source/Eng/doc/new_features/v220_features_doc.rst`](docs/source/Eng/doc/new_features/v220_features_doc.rst).

- **`propose_elements` / `tag_kinds`** (`AC_propose_elements`, `AC_tag_kinds`): Set-of-Marks, `observation` and the grounding helpers all assume you already have element boxes, but a game, a custom-drawn app or a remote desktop has no accessibility tree to supply them. `propose_elements` builds that top-of-funnel list from pixels in three steps: detect widget boxes (closed-edge blobs via Canny + morphological close + `connected_boxes`) and text boxes (`text_regions.find_text_regions`); fuse them, where the `element_parse` `ocr > icon` priority *is* the "drop widget-that-is-really-text" cross-check; and return them in reading order, each tagged `text` or `widget`. `tag_kinds` is the pure labeller. `cv2` is imported lazily, so the module stays importable and the labeller stays fully testable. Seventh and final feature of the ROUND-15 perception lane. No `PySide6`.
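
A minimal sketch of the `ocr > icon` cross-check described above, assuming IoU-based overlap. The helper names `iou` and `fuse_with_priority` are hypothetical and are not the `element_parse` API; the real fusion may differ in detail.

```python
from typing import Dict, List

Box = Dict[str, int]  # keys: x, y, width, height (the keys tag_kinds reads)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two x/y/width/height boxes."""
    left = max(a["x"], b["x"])
    top = max(a["y"], b["y"])
    right = min(a["x"] + a["width"], b["x"] + b["width"])
    bottom = min(a["y"] + a["height"], b["y"] + b["height"])
    inter = max(0, right - left) * max(0, bottom - top)
    union = a["width"] * a["height"] + b["width"] * b["height"] - inter
    return inter / union if union else 0.0


def fuse_with_priority(text_boxes: List[Box], widget_boxes: List[Box],
                       iou_threshold: float = 0.5) -> List[dict]:
    """Hypothetical fusion: text always survives; a widget box is dropped
    when it overlaps any text box at or above the threshold."""
    fused = [dict(box, source="ocr") for box in text_boxes]
    for widget in widget_boxes:
        if all(iou(widget, text) < iou_threshold for text in text_boxes):
            fused.append(dict(widget, source="icon"))
    return fused


text = [{"x": 10, "y": 10, "width": 100, "height": 20}]
widgets = [
    {"x": 8, "y": 8, "width": 104, "height": 24},    # edge blob around the same text
    {"x": 200, "y": 10, "width": 24, "height": 24},  # a separate control
]
fused = fuse_with_priority(text, widgets)
```

Here the first widget box wraps the text box almost exactly, so it is treated as "really text" and dropped, while the separate control is kept.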

### Classify a Widget from Its Pixel Shape

Tell a checkbox from a radio button from a text field — from pixels, no model. Full reference: [`docs/source/Eng/doc/new_features/v219_features_doc.rst`](docs/source/Eng/doc/new_features/v219_features_doc.rst).
51 changes: 51 additions & 0 deletions docs/source/Eng/doc/new_features/v220_features_doc.rst
@@ -0,0 +1,51 @@
Template-Free Element Proposal (Pixels to Elements)
===================================================

Set-of-Marks, ``observation`` and the grounding helpers all assume you already
have a list of element boxes — but on a screen the framework doesn't model
(a game, a custom-drawn app, a remote desktop) there is no accessibility tree to
provide one. ``element_proposal`` builds that top-of-funnel list from pixels:
detect candidate *widget* boxes (closed-edge blobs) and *text* boxes
(:func:`text_regions.find_text_regions`), fuse them — dropping widget boxes that
are really just text — and return them in reading order, each tagged ``text`` or
``widget``.
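
"Reading order" in the paragraph above can be pictured as top-to-bottom rows, each read left to right. The sketch below is one plausible way to do that and is an assumption, not the behaviour of ``element_parse.reading_order``; the function name and the ``row_tolerance`` parameter are hypothetical.

```python
from typing import Dict, List


def reading_order_sketch(boxes: List[Dict[str, int]],
                         row_tolerance: int = 10) -> List[Dict[str, int]]:
    """Group boxes into rows by top edge, sort each row by x, then number them."""
    rows: List[List[Dict[str, int]]] = []
    for box in sorted(boxes, key=lambda b: b["y"]):
        # A box joins the current row if its top edge is close to the row's first box.
        if rows and abs(box["y"] - rows[-1][0]["y"]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    ordered: List[Dict[str, int]] = []
    for row in rows:
        for box in sorted(row, key=lambda b: b["x"]):
            ordered.append(dict(box, index=len(ordered)))
    return ordered


boxes = [
    {"x": 200, "y": 12, "width": 24, "height": 24},
    {"x": 10, "y": 10, "width": 100, "height": 20},
    {"x": 10, "y": 60, "width": 80, "height": 30},
]
ordered = reading_order_sketch(boxes)
```

The two boxes near the top land in the same row despite slightly different ``y`` values, so the left one is numbered first.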

* :func:`propose_elements` — the full pixel-to-elements pipeline.
* :func:`tag_kinds` — pure: label fused boxes ``text`` / ``widget`` by source and
  keep their reading-order ``index``.

The fusion / cross-check / ordering reuse :mod:`element_parse` — the ``ocr`` >
``icon`` source priority *is* the "drop widget-that-is-really-text" check — and
the text detection reuses :mod:`text_regions`. ``cv2`` is imported lazily so the
module stays importable; :func:`tag_kinds` is pure and fully testable. Imports no
``PySide6``.
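
Because :func:`tag_kinds` is pure, its behaviour can be shown without a screen or ``cv2``. The sketch below re-implements the labelling logic from this change as a standalone function (``tag_kinds_sketch`` is an illustrative name); the input list is made-up sample data.

```python
from typing import Any, Dict, List, Sequence

# Same source-to-kind mapping as the module in this change.
KIND_BY_SOURCE = {"ocr": "text", "icon": "widget", "a11y": "element"}


def tag_kinds_sketch(elements: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn fused {x, y, width, height, source, index} boxes into {box, kind, index}."""
    result: List[Dict[str, Any]] = []
    for element in elements:
        box = [int(element["x"]), int(element["y"]),
               int(element["width"]), int(element["height"])]
        # Unknown or missing sources fall back to "widget".
        kind = KIND_BY_SOURCE.get(element.get("source"), "widget")
        result.append({"box": box, "kind": kind, "index": element.get("index")})
    return result


fused = [
    {"x": 10, "y": 10, "width": 100, "height": 20, "source": "ocr", "index": 0},
    {"x": 200, "y": 10, "width": 24, "height": 24, "source": "icon", "index": 1},
    {"x": 10, "y": 60, "width": 80, "height": 30, "index": 2},  # no source given
]
tagged = tag_kinds_sketch(fused)
```

Note the fallback: a box with no ``source`` is labelled ``widget`` rather than rejected.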

Headless API
------------

.. code-block:: python

    from je_auto_control import propose_elements, mark_elements

    # No accessibility tree? Propose elements straight from the screen:
    elements = propose_elements(min_area=120)
    # [{'box': [x, y, w, h], 'kind': 'widget', 'index': 0}, ...]

    # Feed them to Set-of-Marks like any other element list:
    marks = mark_elements(elements)

``propose_elements`` returns ``[{box, kind, index}]`` in reading order, where
``box`` is ``[x, y, w, h]`` and ``kind`` is ``text`` or ``widget``. It is the
missing top-of-funnel for the agent stack on un-modelled UIs: pixels in, a clean
numbered element list out, ready for marking, observation or grounding.
``min_area`` (default ``80``) sets the smallest box kept and is passed to both
the text and the widget detector, so raise it to ignore controls smaller than
you care about. ``iou_threshold`` (default ``0.5``) controls how aggressively
overlapping text and widget boxes are merged.
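
As a small usage sketch, the returned list can be filtered by ``kind`` and turned into click targets. The ``sample`` list below is made-up data in the documented ``{box, kind, index}`` shape, and ``center`` is a hypothetical helper, not part of the package.

```python
from typing import Dict, List, Tuple

# Made-up sample in the shape propose_elements is documented to return.
sample: List[dict] = [
    {"box": [10, 10, 100, 20], "kind": "text", "index": 0},
    {"box": [200, 10, 24, 24], "kind": "widget", "index": 1},
]


def center(box: List[int]) -> Tuple[int, int]:
    """Centre point of an [x, y, w, h] box, in the same pixel coordinates."""
    x, y, w, h = box
    return (x + w // 2, y + h // 2)


# Map each widget's reading-order index to the point an agent might click.
widget_centers: Dict[int, Tuple[int, int]] = {
    element["index"]: center(element["box"])
    for element in sample
    if element["kind"] == "widget"
}
```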

Executor commands
-----------------

``AC_propose_elements`` (``region`` ``[x, y, w, h]`` / ``min_area`` /
``iou_threshold`` → ``{elements}``) runs the full pipeline on the screen, and
``AC_tag_kinds`` (``elements`` JSON list → ``{elements}``, pure) labels a
pre-fused list. Both are also exposed as read-only MCP tools
(``ac_propose_elements``, ``ac_tag_kinds``) and as Script Builder commands under
**Image**.
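
The ``AC_tag_kinds`` contract (JSON list in, ``{elements}`` out) can be sketched as a thin adapter around a labeller. Everything here is illustrative: ``coerce_list`` is a hypothetical stand-in for the executor's own list coercion, whose real behaviour is not shown in this change, and ``label`` is a reduced labeller covering only ``ocr`` and ``icon``.

```python
import json
from typing import Any, Callable, Dict, List


def coerce_list(value: Any) -> List[Any]:
    """Hypothetical coercion: accept a Python list or a JSON-encoded list."""
    if isinstance(value, str):
        value = json.loads(value)
    return list(value)


def label(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduced labeller: ocr -> text, anything else -> widget."""
    kinds = {"ocr": "text", "icon": "widget"}
    return [{"box": [e["x"], e["y"], e["width"], e["height"]],
             "kind": kinds.get(e.get("source"), "widget"),
             "index": e.get("index")} for e in items]


def tag_kinds_command(elements: Any,
                      labeller: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]] = label
                      ) -> Dict[str, Any]:
    """Wrap the labeller's output in the {elements} envelope; empty input gives an empty list."""
    items = coerce_list(elements) if elements else []
    return {"elements": labeller(items)}


payload = '[{"x": 1, "y": 2, "width": 3, "height": 4, "source": "ocr", "index": 0}]'
result = tag_kinds_command(payload)
```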
42 changes: 42 additions & 0 deletions docs/source/Zh/doc/new_features/v220_features_doc.rst
@@ -0,0 +1,42 @@
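Template-Free Element Proposal (Pixels to Elements)
===================================================

Set-of-Marks, ``observation`` and the grounding helpers all assume you already
have a list of element boxes, but on a screen the framework cannot model
(a game, a custom-drawn app, a remote desktop) there is no accessibility tree to
provide one. ``element_proposal`` builds that top-of-funnel list from pixels:
detect candidate *widget* boxes (closed-edge blobs) and *text* boxes
(:func:`text_regions.find_text_regions`), fuse the two (dropping widget boxes
that are really just text), and return them in reading order, each tagged
``text`` or ``widget``.

* :func:`propose_elements`: the full pixel-to-elements pipeline.
* :func:`tag_kinds`: pure function; labels fused boxes ``text`` / ``widget`` by
  source and keeps their reading-order ``index``.

Fusion, cross-checking and ordering reuse :mod:`element_parse` (the ``ocr`` >
``icon`` source priority *is* the "drop widget-that-is-really-text" check), and
text detection reuses :mod:`text_regions`. ``cv2`` is imported lazily, so the
module stays importable; :func:`tag_kinds` is pure and fully testable. Imports
no ``PySide6``.

Headless API
------------

.. code-block:: python

    from je_auto_control import propose_elements, mark_elements

    # No accessibility tree? Propose elements straight from the screen:
    elements = propose_elements(min_area=120)
    # [{'box': [x, y, w, h], 'kind': 'widget', 'index': 0}, ...]

    # Feed them to Set-of-Marks like any other element list:
    marks = mark_elements(elements)

``propose_elements`` returns ``[{box, kind, index}]`` in reading order, where
``kind`` is ``text`` or ``widget``. It is the missing top-of-funnel for the
agent stack on un-modelled UIs: pixels in, a clean numbered element list out,
ready for marking, observation or grounding. Use ``min_area`` to set the
smallest control you care about and ``iou_threshold`` to set how aggressively
overlapping text and widget boxes are merged.

Executor commands
-----------------

``AC_propose_elements`` (``region`` ``[x, y, w, h]`` / ``min_area`` /
``iou_threshold`` → ``{elements}``) runs the full pipeline on the screen, and
``AC_tag_kinds`` (``elements`` JSON list → ``{elements}``, pure) labels a
pre-fused list. Both are provided as matching read-only ``ac_*`` MCP tools and
as Script Builder commands under the **Image** category.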
3 changes: 3 additions & 0 deletions je_auto_control/__init__.py
@@ -149,6 +149,8 @@
from je_auto_control.utils.icon_classify import (
    box_features, classify_icon, classify_widget,
)
# Propose a clean element list from raw pixels (template-free)
from je_auto_control.utils.element_proposal import propose_elements, tag_kinds
# Rich clipboard formats — RTF + CSV/TSV codecs and Windows get / set
from je_auto_control.utils.clipboard_rich_formats import (
    build_rtf, csv_to_rows, get_clipboard_csv, get_clipboard_rtf, rows_to_csv,
@@ -1779,6 +1781,7 @@ def start_autocontrol_gui(*args, **kwargs):
    "normalize_theme", "match_theme",
    "localize_changes", "rank_changes",
    "classify_widget", "box_features", "classify_icon",
    "propose_elements", "tag_kinds",
    "build_rtf", "rtf_to_text", "rows_to_csv", "csv_to_rows",
    "set_clipboard_rtf", "get_clipboard_rtf",
    "set_clipboard_csv", "get_clipboard_csv",
19 changes: 19 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -4648,6 +4648,25 @@ def _add_work_queue_specs(specs: List[CommandSpec]) -> None:
        ),
        description="Classify the widget in an image box from its pixels.",
    ))
    specs.append(CommandSpec(
        "AC_propose_elements", "Image", "Propose Elements (template-free)",
        fields=(
            FieldSpec("region", FieldType.STRING, optional=True,
                      placeholder="[x, y, w, h]"),
            FieldSpec("min_area", FieldType.INT, optional=True, default=80),
            FieldSpec("iou_threshold", FieldType.FLOAT, optional=True,
                      default=0.5),
        ),
        description="Propose text/widget element boxes from raw screen pixels.",
    ))
    specs.append(CommandSpec(
        "AC_tag_kinds", "Image", "Tag Element Kinds",
        fields=(
            FieldSpec("elements", FieldType.STRING,
                      placeholder="JSON list of fused boxes"),
        ),
        description="Label fused element boxes text/widget by source.",
    ))
    specs.append(CommandSpec(
        "AC_normalize_ext", "Shell", "Normalize Extension",
        fields=(
6 changes: 6 additions & 0 deletions je_auto_control/utils/element_proposal/__init__.py
@@ -0,0 +1,6 @@
"""Propose a clean element list from raw pixels, with no template or model."""
from je_auto_control.utils.element_proposal.element_proposal import (
    propose_elements, tag_kinds,
)

__all__ = ["propose_elements", "tag_kinds"]
80 changes: 80 additions & 0 deletions je_auto_control/utils/element_proposal/element_proposal.py
@@ -0,0 +1,80 @@
"""Propose a clean element list from raw pixels, with no template or model.

Set-of-Marks, ``observation`` and the grounding helpers all assume you already
have a list of element boxes — but on a screen the framework doesn't model
(a game, a custom-drawn app, a remote desktop) there is no accessibility tree to
provide one. ``element_proposal`` builds that top-of-funnel list from pixels:
detect candidate *widget* boxes (closed-edge blobs) and *text* boxes
(:func:`text_regions.find_text_regions`), fuse them — dropping widget boxes that
are really just text — and return them in reading order, each tagged ``text`` or
``widget``.

* :func:`propose_elements` — the full pixel-to-elements pipeline.
* :func:`tag_kinds` — pure: label fused boxes ``text`` / ``widget`` by source and
keep their reading-order ``index``.

The fusion / cross-check / ordering reuse :mod:`element_parse` (the ``ocr`` >
``icon`` priority *is* the "drop widget-that-is-really-text" check) and
:mod:`text_regions`; ``cv2`` is imported lazily so the module stays importable.
:func:`tag_kinds` is pure and fully testable. Imports no ``PySide6``.
"""
from typing import Any, Dict, List, Optional, Sequence

# Map each fused element's ``source`` tag to its output ``kind``.
_KIND_BY_SOURCE = {"ocr": "text", "icon": "widget", "a11y": "element"}


def tag_kinds(elements: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Label fused boxes ``text`` / ``widget`` by source (pure).

    Each input box carries a ``source`` (``ocr`` / ``icon``) and an ``index``
    from :func:`element_parse.reading_order`. Returns ``[{box, kind, index}]``.
    """
    result: List[Dict[str, Any]] = []
    for element in elements:
        box = [int(element["x"]), int(element["y"]),
               int(element["width"]), int(element["height"])]
        kind = _KIND_BY_SOURCE.get(element.get("source"), "widget")
        result.append({"box": box, "kind": kind, "index": element.get("index")})
    return result


def _reasonable(box: Dict[str, Any], frame_w: int, frame_h: int) -> bool:
    """Keep plausibly-widget blobs: not the whole frame, not a thin rule."""
    width, height = int(box["width"]), int(box["height"])
    if width >= 0.95 * frame_w and height >= 0.95 * frame_h:
        return False
    aspect = width / height if height else 0.0
    return 0.05 <= aspect <= 15.0


def _widget_boxes(gray: Any, min_area: int) -> List[Dict[str, Any]]:
    """Detect candidate widget boxes as closed-edge blobs (cv2)."""
    import cv2
    from je_auto_control.utils.cv2_utils.blobs import connected_boxes
    edges = cv2.Canny(gray, 50, 150)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    height, width = gray.shape[:2]
    return [box for box in connected_boxes(closed, min_area=int(min_area))
            if _reasonable(box, width, height)]


def propose_elements(source: Optional[Any] = None, *,
                     region: Optional[Sequence[int]] = None, min_area: int = 80,
                     iou_threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Propose ``text`` / ``widget`` element boxes from pixels, in reading order.

    Detects widget blobs and text regions on ``source`` (a fresh screen grab of
    ``region`` by default), fuses them (overlapping text wins over widget), and
    orders them. Returns ``[{box, kind, index}]``.
    """
    from je_auto_control.utils.element_parse import fuse_elements, reading_order
    from je_auto_control.utils.text_regions import find_text_regions
    from je_auto_control.utils.visual_match.visual_match import _haystack_gray
    gray = _haystack_gray(source, region)
    text = find_text_regions(gray, min_area=int(min_area))
    widgets = _widget_boxes(gray, int(min_area))
    fused = fuse_elements(ocr_boxes=text, icon_boxes=widgets,
                          iou_threshold=float(iou_threshold))
    return tag_kinds(reading_order(fused))
19 changes: 19 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -2946,6 +2946,23 @@ def _classify_icon(source: Any, box: Any) -> Dict[str, Any]:
    return classify_icon(str(source), _coerce_list(box))


def _propose_elements(region: Any = None, min_area: Any = 80,
                      iou_threshold: Any = 0.5) -> Dict[str, Any]:
    """Adapter: propose text/widget element boxes from pixels (device)."""
    from je_auto_control.utils.element_proposal import propose_elements
    elements = propose_elements(region=_coerce_region(region),
                                min_area=int(min_area),
                                iou_threshold=float(iou_threshold))
    return {"elements": elements}


def _tag_kinds(elements: Any) -> Dict[str, Any]:
    """Adapter: label fused boxes text/widget by source (pure)."""
    from je_auto_control.utils.element_proposal import tag_kinds
    items = _coerce_list(elements) if elements else []
    return {"elements": tag_kinds(items)}


def _normalize_ext(target: str) -> Dict[str, Any]:
    """Adapter: the lowercased extension of a path / bare ext (pure)."""
    from je_auto_control.utils.file_assoc import normalize_ext
@@ -6989,6 +7006,8 @@ def __init__(self):
            "AC_localize_changes": _localize_changes,
            "AC_classify_widget": _classify_widget,
            "AC_classify_icon": _classify_icon,
            "AC_propose_elements": _propose_elements,
            "AC_tag_kinds": _tag_kinds,
            "AC_normalize_ext": _normalize_ext,
            "AC_file_association": _file_association,
            "AC_get_control_text": _get_control_text,
25 changes: 25 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_factories.py
@@ -4159,6 +4159,31 @@ def img_histogram_tools() -> List[MCPTool]:
            handler=h.classify_icon,
            annotations=READ_ONLY,
        ),
        MCPTool(
            name="ac_propose_elements",
            description=("Propose text/widget element boxes from raw screen "
                         "pixels (template-free): detect widget blobs + text "
                         "regions, fuse, order. 'region' [x,y,w,h] clips. "
                         "Returns {elements:[{box, kind, index}]}."),
            input_schema=schema({"region": {"type": "array",
                                            "items": {"type": "integer"}},
                                 "min_area": {"type": "integer"},
                                 "iou_threshold": {"type": "number"}}),
            handler=h.propose_elements,
            annotations=READ_ONLY,
        ),
        MCPTool(
            name="ac_tag_kinds",
            description=("Label fused element boxes 'text'/'widget' by source "
                         "(pure). 'elements' is a list of {x,y,width,height,"
                         "source,index}. Returns {elements:[{box, kind, "
                         "index}]}."),
            input_schema=schema({"elements": {"type": "array",
                                              "items": {"type": "object"}}},
                                required=["elements"]),
            handler=h.tag_kinds,
            annotations=READ_ONLY,
        ),
    ]

12 changes: 12 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -797,6 +797,18 @@ def classify_icon(source, box):
    return _classify_icon(source, box)


def propose_elements(region=None, min_area=80, iou_threshold=0.5):
    from je_auto_control.utils.executor.action_executor import (
        _propose_elements,
    )
    return _propose_elements(region, min_area, iou_threshold)


def tag_kinds(elements):
    from je_auto_control.utils.executor.action_executor import _tag_kinds
    return _tag_kinds(elements)


def normalize_ext(target):
    from je_auto_control.utils.executor.action_executor import _normalize_ext
    return _normalize_ext(target)