Merged
8 changes: 8 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-19) — Set-of-Marks Overlay](#whats-new-2026-06-19--set-of-marks-overlay)
- [What's new (2026-06-19) — Checkpoint & Resume](#whats-new-2026-06-19--checkpoint--resume)
- [What's new (2026-06-19) — i18n / l10n Testing](#whats-new-2026-06-19--i18n--l10n-testing)
- [What's new (2026-06-19) — Data Quality](#whats-new-2026-06-19--data-quality)
@@ -74,6 +75,13 @@

---

## What's new (2026-06-19) — Set-of-Marks Overlay

The standard VLM-grounding format, full stack. Full reference: [`docs/source/Eng/doc/new_features/v22_features_doc.rst`](docs/source/Eng/doc/new_features/v22_features_doc.rst).

- **Number elements** — `mark_elements` / `render_marks` / `resolve_mark` (pure + Pillow): assign `1..N` to interactable elements (with centre/role/text), draw numbered red boxes on a screenshot, and map a chosen number back to its element — so a VLM picks a *number* instead of guessing pixels (directly strengthens the existing VLM locator).
- **Mark-then-click loop** — `mark_screen(render_path=...)` / `mark_click(n)` (`AC_mark_screen` / `AC_mark_click`, `ac_*`): number the live a11y tree (+ optional overlay screenshot), feed marks+image to a model, then click mark `n`.

## What's new (2026-06-19) — Checkpoint & Resume

Durable execution for long flows + a `py.typed` marker, full stack. Full reference: [`docs/source/Eng/doc/new_features/v21_features_doc.rst`](docs/source/Eng/doc/new_features/v21_features_doc.rst).
8 changes: 8 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## Table of Contents

- [What's new (2026-06-19) — Set-of-Marks Overlay](#本次更新-2026-06-19--set-of-marks-叠图)
- [What's new (2026-06-19) — Checkpoint & Resume](#本次更新-2026-06-19--检查点与续跑)
- [What's new (2026-06-19) — i18n / l10n Testing](#本次更新-2026-06-19--i18n--l10n-测试)
- [What's new (2026-06-19) — Data Quality](#本次更新-2026-06-19--数据质量)
@@ -73,6 +74,13 @@

---

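## What's new (2026-06-19) — Set-of-Marks Overlay

The standard format for VLM grounding, wired through all five layers. Full reference: [`docs/source/Zh/doc/new_features/v22_features_doc.rst`](../docs/source/Zh/doc/new_features/v22_features_doc.rst).

- **Number elements** — `mark_elements` / `render_marks` / `resolve_mark` (pure functions + Pillow): assign `1..N` to interactable elements (with centre/role/text), draw numbered red boxes on a screenshot, and map the chosen number back to its element, so the VLM picks a *number* instead of guessing pixels (directly strengthens the existing VLM locator).
- **Mark-then-click loop** — `mark_screen(render_path=...)` / `mark_click(n)` (`AC_mark_screen` / `AC_mark_click`, `ac_*`): number the live a11y tree (+ optional overlay screenshot), feed the marks + image to a model, then click mark `n`.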

## What's new (2026-06-19) — Checkpoint & Resume

Durable execution for long flows + a `py.typed` marker, wired through all five layers. Full reference: [`docs/source/Zh/doc/new_features/v21_features_doc.rst`](../docs/source/Zh/doc/new_features/v21_features_doc.rst).
8 changes: 8 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## Table of Contents

- [What's new (2026-06-19) — Set-of-Marks Overlay](#本次更新-2026-06-19--set-of-marks-疊圖)
- [What's new (2026-06-19) — Checkpoint & Resume](#本次更新-2026-06-19--檢查點與續跑)
- [What's new (2026-06-19) — i18n / l10n Testing](#本次更新-2026-06-19--i18n--l10n-測試)
- [What's new (2026-06-19) — Data Quality](#本次更新-2026-06-19--資料品質)
@@ -73,6 +74,13 @@

---

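## What's new (2026-06-19) — Set-of-Marks Overlay

The standard format for VLM grounding, wired through all five layers. Full reference: [`docs/source/Zh/doc/new_features/v22_features_doc.rst`](../docs/source/Zh/doc/new_features/v22_features_doc.rst).

- **Number elements** — `mark_elements` / `render_marks` / `resolve_mark` (pure functions + Pillow): assign `1..N` to interactable elements (with centre/role/text), draw numbered red boxes on a screenshot, and map the chosen number back to its element, so the VLM picks a *number* instead of guessing pixels (directly strengthens the existing VLM locator).
- **Mark-then-click loop** — `mark_screen(render_path=...)` / `mark_click(n)` (`AC_mark_screen` / `AC_mark_click`, `ac_*`): number the live a11y tree (+ optional overlay screenshot), feed the marks + image to a model, then click mark `n`.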

## What's new (2026-06-19) — Checkpoint & Resume

Durable execution for long flows + a `py.typed` marker, wired through all five layers. Full reference: [`docs/source/Zh/doc/new_features/v21_features_doc.rst`](../docs/source/Zh/doc/new_features/v21_features_doc.rst).
51 changes: 51 additions & 0 deletions docs/source/Eng/doc/new_features/v22_features_doc.rst
@@ -0,0 +1,51 @@
==================================================
New Features (2026-06-19) — Set-of-Marks Overlay
==================================================

Modern GUI agents ground far more reliably when shown a screenshot with
**numbered boxes** over the interactable elements plus an ``id -> bbox``
legend ("Set-of-Marks" prompting): the model picks a *number* instead of
guessing pixel coordinates. This feature turns AutoControl's existing element
sources into that two-stage "mark then pick a number" loop and resolves the
chosen number back to a click. It uses only the standard library plus Pillow
(already a dependency) and is wired through the full stack.

.. contents::
   :local:
   :depth: 2


Numbering and the legend
========================

::

    from je_auto_control import mark_elements, render_marks, resolve_mark

    marks = mark_elements(elements)  # [{id, bbox, center, role, text}, ...]
    legend = [(m["id"], m["text"]) for m in marks]
    annotated_png = render_marks(screenshot_png_bytes, marks)
    chosen = resolve_mark(marks, 3)  # the element the model picked

``mark_elements`` assigns ``1..N`` to every element with valid bounds and
records its centre; ``render_marks`` draws numbered red boxes on a PNG;
``resolve_mark`` maps a number back to its mark. These are pure and
unit-testable with synthetic elements.
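
To make the numbering step concrete, here is a minimal, self-contained sketch of the idea. It is not the library's implementation: the ``sketch_*`` names, the ``bounds`` key, and the ``(left, top, width, height)`` layout are assumptions chosen for illustration only.

```python
from typing import Any, Dict, List, Optional


def sketch_mark_elements(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Assign 1..N to elements with usable bounds (hypothetical stand-in)."""
    marks: List[Dict[str, Any]] = []
    for element in elements:
        bounds = element.get("bounds")
        # Assumed layout: (left, top, width, height); anything else is skipped.
        if not bounds or len(bounds) != 4:
            continue
        left, top, width, height = bounds
        if width <= 0 or height <= 0:
            continue
        marks.append({
            "id": len(marks) + 1,  # ids stay contiguous even when elements are skipped
            "bbox": (left, top, width, height),
            "center": (left + width // 2, top + height // 2),
            "role": element.get("role", ""),
            "text": element.get("text", ""),
        })
    return marks


def sketch_resolve_mark(marks: List[Dict[str, Any]],
                        mark_id: int) -> Optional[Dict[str, Any]]:
    """Map a chosen number back to its mark, or None if it is unknown."""
    for mark in marks:
        if mark["id"] == mark_id:
            return mark
    return None


# Synthetic elements: no screen or accessibility API is needed.
elements = [
    {"bounds": (10, 20, 100, 40), "role": "button", "text": "OK"},
    {"bounds": None, "role": "group", "text": ""},                    # no bounds
    {"bounds": (10, 80, 0, 40), "role": "button", "text": "Hidden"},  # zero width
    {"bounds": (200, 20, 60, 30), "role": "link", "text": "Help"},
]
marks = sketch_mark_elements(elements)
```

Because the sketch takes plain dictionaries, it can be exercised in a unit test without a display, which is the property the paragraph above attributes to the real helpers.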


Live "mark then click" loop
===========================

::

    from je_auto_control import mark_screen, mark_click

    result = mark_screen(render_path="marked.png")  # numbers the live a11y tree
    # ... feed result["marks"] + marked.png to a VLM, get back a number ...
    mark_click(3)  # click mark #3

``mark_screen`` numbers the live accessibility elements (and optionally
saves a numbered-box overlay screenshot), caching the marks; ``mark_click``
resolves a number from that cache and clicks the element's centre. Exposed
as ``AC_mark_screen`` / ``AC_mark_click`` (and ``ac_mark_screen`` /
``ac_mark_click``).
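
The step elided in the example above ("feed the marks to a VLM, get back a number") could be sketched as follows. This is a hedged illustration, not part of AutoControl: ``build_legend`` and ``parse_mark_reply`` are hypothetical helper names, and the legend format is an assumption. The model call itself is left out.

```python
import re
from typing import Any, Dict, List, Optional


def build_legend(marks: List[Dict[str, Any]]) -> str:
    """Render one 'id: role "text"' line per mark for the prompt (assumed format)."""
    return "\n".join(f'{m["id"]}: {m["role"]} "{m["text"]}"' for m in marks)


def parse_mark_reply(reply: str, marks: List[Dict[str, Any]]) -> Optional[int]:
    """Return the first integer in the reply that names an existing mark."""
    valid_ids = {m["id"] for m in marks}
    for token in re.findall(r"\d+", reply):
        if int(token) in valid_ids:
            return int(token)
    return None  # nothing usable: re-prompt instead of clicking blindly


marks = [
    {"id": 1, "role": "button", "text": "OK"},
    {"id": 2, "role": "link", "text": "Help"},
]
legend = build_legend(marks)
chosen = parse_mark_reply("I would click 2.", marks)
```

The validated number is what would then be handed to ``mark_click``; rejecting numbers that are not in the current legend avoids clicking a stale or hallucinated mark.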
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -44,6 +44,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v19_features_doc
doc/new_features/v20_features_doc
doc/new_features/v21_features_doc
doc/new_features/v22_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
47 changes: 47 additions & 0 deletions docs/source/Zh/doc/new_features/v22_features_doc.rst
@@ -0,0 +1,47 @@
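==================================================
New Features (2026-06-19) — Set-of-Marks Overlay
==================================================

Modern GUI agents ground far more reliably when shown a screenshot with
**numbered boxes** drawn on it plus an ``id -> bbox`` legend ("Set-of-Marks"
prompting): the model picks a *number* instead of guessing pixel
coordinates. This feature turns AutoControl's existing element sources into
that two-stage "mark first, then pick a number" flow and resolves the chosen
number back to a click. Pure standard library + Pillow (already a
dependency); wired through all five layers.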

.. contents::
   :local:
   :depth: 2


Numbering and the legend
========================

::

    from je_auto_control import mark_elements, render_marks, resolve_mark

    marks = mark_elements(elements)  # [{id, bbox, center, role, text}, ...]
    legend = [(m["id"], m["text"]) for m in marks]
    annotated_png = render_marks(screenshot_png_bytes, marks)
    chosen = resolve_mark(marks, 3)  # the element the model picked

``mark_elements`` assigns ``1..N`` to every element that has valid bounds and
records its centre point; ``render_marks`` draws numbered red boxes on a PNG;
``resolve_mark`` maps a number back to its mark. These are all pure functions
and can be unit-tested with synthetic elements.


Live "mark then click" loop
===========================

::

    from je_auto_control import mark_screen, mark_click

    result = mark_screen(render_path="marked.png")  # numbers the live a11y tree
    # ... feed result["marks"] + marked.png to a VLM, get back a number ...
    mark_click(3)  # click mark #3

``mark_screen`` numbers the live accessibility elements (and can optionally
save a numbered-box overlay screenshot) and caches the marks; ``mark_click``
resolves a number from that cache and clicks the element's centre. Exposed as
``AC_mark_screen`` / ``AC_mark_click`` (and ``ac_mark_screen`` /
``ac_mark_click``).
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -44,6 +44,7 @@ Complete usage guide for all AutoControl features.
doc/new_features/v19_features_doc
doc/new_features/v20_features_doc
doc/new_features/v21_features_doc
doc/new_features/v22_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
6 changes: 6 additions & 0 deletions je_auto_control/__init__.py
@@ -165,6 +165,10 @@
from je_auto_control.utils.checkpoint import (
    Checkpoint, CheckpointStore, run_resumable,
)
# Set-of-Marks overlay (number elements for VLM grounding)
from je_auto_control.utils.set_of_marks import (
    mark_click, mark_elements, mark_screen, render_marks, resolve_mark,
)
# Background popup/interrupt watchdog (unattended automation)
from je_auto_control.utils.watchdog import (
    PopupWatchdog, WatchdogRule, default_popup_watchdog,
@@ -588,6 +592,8 @@ def start_autocontrol_gui(*args, **kwargs):
    "check_catalog", "check_overflow", "pseudo_localize",
    "pseudo_localize_catalog",
    "Checkpoint", "CheckpointStore", "run_resumable",
    "mark_click", "mark_elements", "mark_screen", "render_marks",
    "resolve_mark",
    # MCP server
    "AuditLogger", "HttpMCPServer", "MCPContent", "MCPPrompt",
    "MCPPromptArgument", "MCPResource", "MCPServer", "MCPTool",
18 changes: 18 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
Original file line number Diff line number Diff line change
@@ -577,7 +577,7 @@
        FieldSpec("automation_id", FieldType.STRING, optional=True),
    )
    specs.append(CommandSpec(
        "AC_control_get_value", "Native UI", "Get Control Value",
        # Check failure on line 580 in je_auto_control/gui/script_builder/command_schema.py
        # (SonarQubeCloud / SonarCloud Code Analysis):
        # Define a constant instead of duplicating this literal "Native UI" 13 times.
        # See more on https://sonarcloud.io/project/issues?id=Integration-Automation_AutoControlGUI&issues=AZ7fFnm4mUk2Yti149VL&open=AZ7fFnm4mUk2Yti149VL&pullRequest=230
        fields=fields,
        description="Read a native control's value via the accessibility API.",
    ))
@@ -661,6 +661,24 @@
    _add_data_quality_specs(specs)
    _add_i18n_specs(specs)
    _add_checkpoint_specs(specs)
    _add_set_of_marks_specs(specs)


def _add_set_of_marks_specs(specs: List[CommandSpec]) -> None:
    specs.append(CommandSpec(
        "AC_mark_screen", "Native UI", "Set-of-Marks: Number Elements",
        fields=(
            FieldSpec("app_name", FieldType.STRING, optional=True),
            FieldSpec("render_path", FieldType.FILE_PATH, optional=True),
        ),
        description="Number live UI elements (id->bbox legend) for VLM "
                    "grounding; optional numbered-box overlay screenshot.",
    ))
    specs.append(CommandSpec(
        "AC_mark_click", "Native UI", "Set-of-Marks: Click Number",
        fields=(FieldSpec("mark_id", FieldType.INT),),
        description="Click the element behind a numbered mark.",
    ))
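
The SonarCloud annotation earlier in this file's diff flags the repeated ``"Native UI"`` literal, and the two specs added here repeat it again. One possible way to address that is sketched below. The constant name ``CATEGORY_NATIVE_UI`` is hypothetical, and ``CommandSpec`` is a simplified stand-in with assumed field names, not the project's real class.

```python
from dataclasses import dataclass
from typing import List, Tuple

CATEGORY_NATIVE_UI = "Native UI"  # hypothetical constant; single source for the literal


@dataclass(frozen=True)
class CommandSpec:
    """Simplified stand-in for the real CommandSpec (field names are assumed)."""
    command: str
    category: str
    label: str
    fields: Tuple = ()
    description: str = ""


def add_set_of_marks_specs(specs: List[CommandSpec]) -> None:
    # Both specs reference the constant instead of repeating the string.
    specs.append(CommandSpec(
        "AC_mark_screen", CATEGORY_NATIVE_UI, "Set-of-Marks: Number Elements",
    ))
    specs.append(CommandSpec(
        "AC_mark_click", CATEGORY_NATIVE_UI, "Set-of-Marks: Click Number",
    ))


specs: List[CommandSpec] = []
add_set_of_marks_specs(specs)
```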


def _add_checkpoint_specs(specs: List[CommandSpec]) -> None:
15 changes: 15 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -2743,6 +2743,19 @@ def _checkpoint_clear(run_id: str, db: str) -> Dict[str, Any]:
    return {"cleared": CheckpointStore(db).clear(run_id)}


def _mark_screen(app_name: Optional[str] = None,
                 render_path: Optional[str] = None) -> Dict[str, Any]:
    """Adapter: number live UI elements (Set-of-Marks) for VLM grounding."""
    from je_auto_control.utils.set_of_marks import mark_screen
    return mark_screen(app_name=app_name, render_path=render_path)


def _mark_click(mark_id: int) -> Dict[str, Any]:
    """Adapter: click the element behind a numbered mark."""
    from je_auto_control.utils.set_of_marks import mark_click
    return {"clicked": mark_click(int(mark_id))}
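
To show why the adapter coerces ``mark_id`` with ``int(...)``, here is a small stand-alone sketch of name-to-callable dispatch. It assumes actions arrive as ``[command_name, kwargs]`` pairs, which may not match the executor's exact format; ``fake_mark_click``, ``COMMANDS`` and ``run_actions`` are illustrative names, and nothing here touches the real executor or the screen.

```python
from typing import Any, Callable, Dict, List


def fake_mark_click(mark_id: int) -> Dict[str, Any]:
    """Stand-in adapter: pretends only mark 3 exists and coerces like _mark_click."""
    return {"clicked": int(mark_id) == 3}


# Command table in the style of the executor's name -> callable mapping.
COMMANDS: Dict[str, Callable[..., Any]] = {"AC_mark_click": fake_mark_click}


def run_actions(actions: List[list]) -> List[Any]:
    """Look up each command by name and call it with its keyword arguments."""
    results = []
    for name, kwargs in actions:
        results.append(COMMANDS[name](**kwargs))
    return results


# A GUI form or JSON file may deliver the id as a string; int() absorbs that.
results = run_actions([
    ["AC_mark_click", {"mark_id": "3"}],
    ["AC_mark_click", {"mark_id": 4}],
])
```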


class Executor:
    """
    Executor
@@ -2953,6 +2966,8 @@ def __init__(self):
            "AC_run_resumable": _run_resumable,
            "AC_checkpoint_status": _checkpoint_status,
            "AC_checkpoint_clear": _checkpoint_clear,
            "AC_mark_screen": _mark_screen,
            "AC_mark_click": _mark_click,
            "AC_a11y_record_start": _a11y_record_start,
            "AC_a11y_record_stop": _a11y_record_stop,
            "AC_a11y_record_events": _a11y_record_events,
28 changes: 27 additions & 1 deletion je_auto_control/utils/mcp_server/tools/_factories.py
@@ -2303,6 +2303,32 @@ def checkpoint_tools() -> List[MCPTool]:
    ]


def set_of_marks_tools() -> List[MCPTool]:
    return [
        MCPTool(
            name="ac_mark_screen",
            description=("Set-of-Marks: number the live UI elements (a11y "
                         "tree) and return an id->bbox/center/role/text "
                         "legend for VLM grounding — the model picks a number "
                         "instead of pixels. Optionally render a numbered-box "
                         "overlay screenshot to 'render_path'."),
            input_schema=schema({"app_name": {"type": "string"},
                                 "render_path": {"type": "string"}}),
            handler=h.mark_screen,
            annotations=SIDE_EFFECT_ONLY,
        ),
        MCPTool(
            name="ac_mark_click",
            description=("Click the element behind a numbered mark from the "
                         "last ac_mark_screen. Returns {clicked}."),
            input_schema=schema({"mark_id": {"type": "integer"}},
                                required=["mark_id"]),
            handler=h.mark_click,
            annotations=SIDE_EFFECT_ONLY,
        ),
    ]
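
For orientation, a client-side call to the new ``ac_mark_click`` tool might look like the sketch below. It assumes the server speaks standard MCP JSON-RPC (``tools/call`` with ``name`` and ``arguments``); ``build_tool_call`` is a hypothetical helper, and the transport (stdio or HTTP) is deliberately not shown.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, name: str,
                    arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build an MCP 'tools/call' JSON-RPC request for the given tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# mark_id is the only required argument in the tool's input schema.
request = build_tool_call(1, "ac_mark_click", {"mark_id": 3})
wire = json.dumps(request)  # the string a client would send to the server
```

Since ``ac_mark_click`` resolves against the marks cached by the last ``ac_mark_screen``, a client would normally issue the ``ac_mark_screen`` call first in the same session.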


def unattended_tools() -> List[MCPTool]:
    return [
        MCPTool(
@@ -3357,7 +3383,7 @@ def media_assert_tools() -> List[MCPTool]:
     skill_library_tools, guardrail_tools, a2a_tools, office_tools,
     agent_memory_tools, determinism_tools, observer_tools,
     sbom_tools, sharding_tools, data_quality_tools, i18n_tools,
-    checkpoint_tools,
+    checkpoint_tools, set_of_marks_tools,
     screen_record_tools,
     process_and_shell_tools, remote_desktop_tools, gamepad_tools,
     usb_passthrough_tools, assertion_tools, data_source_tools,
10 changes: 10 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1133,6 +1133,16 @@ def checkpoint_clear(run_id, db):
    return {"cleared": CheckpointStore(db).clear(run_id)}


def mark_screen(app_name=None, render_path=None):
    from je_auto_control.utils.set_of_marks import mark_screen as _ms
    return _ms(app_name=app_name, render_path=render_path)


def mark_click(mark_id):
    from je_auto_control.utils.set_of_marks import mark_click as _mc
    return {"clicked": _mc(int(mark_id))}


def vlm_locate(description: str,
               screen_region: Optional[List[int]] = None,
               model: Optional[str] = None) -> Optional[List[int]]:
10 changes: 10 additions & 0 deletions je_auto_control/utils/set_of_marks/__init__.py
@@ -0,0 +1,10 @@
"""Set-of-Marks overlay — number on-screen elements for VLM grounding."""
from je_auto_control.utils.set_of_marks.set_of_marks import (
    last_marks, mark_click, mark_elements, mark_screen, render_marks,
    resolve_mark,
)

__all__ = [
    "last_marks", "mark_click", "mark_elements", "mark_screen",
    "render_marks", "resolve_mark",
]