7 changes: 7 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-20) — Coordinate-Space Mapping (Model Grid ⇄ Physical Pixels)](#whats-new-2026-06-20--coordinate-space-mapping-model-grid--physical-pixels)
- [What's new (2026-06-20) — Voice-Command Router](#whats-new-2026-06-20--voice-command-router)
- [What's new (2026-06-20) — Locale-Aware Number, Currency & Date Parsing](#whats-new-2026-06-20--locale-aware-number-currency--date-parsing)
- [What's new (2026-06-20) — Perceptual-Hash Image Dedupe](#whats-new-2026-06-20--perceptual-hash-image-dedupe)
@@ -97,6 +98,12 @@

---

## What's new (2026-06-20) — Coordinate-Space Mapping (Model Grid ⇄ Physical Pixels)

Translate computer-use model clicks to real pixels. Full reference: [`docs/source/Eng/doc/new_features/v45_features_doc.rst`](docs/source/Eng/doc/new_features/v45_features_doc.rst).

- **`CoordinateSpace` / `xga_space` / `normalized_space` / `downscale_png`** (`AC_to_physical` / `AC_to_model`, `ac_*`): computer-use/VLA models click in a fixed grid rather than in physical pixels (Anthropic recommends downscaling to XGA; Gemini returns a 1000×1000 grid). `CoordinateSpace` maps both ways, rounding and clamping the result; `xga_space` preserves the aspect ratio and never upscales; `downscale_png` resizes a screenshot to the model's input size using Pillow, which is already a core dependency. The mapping is pure arithmetic, so it can be unit-tested without a model or GPU.

## What's new (2026-06-20) — Voice-Command Router

Trigger flows hands-free from recognized speech. Full reference: [`docs/source/Eng/doc/new_features/v44_features_doc.rst`](docs/source/Eng/doc/new_features/v44_features_doc.rst).
7 changes: 7 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-20) — 坐标空间映射(模型网格 ⇄ 物理像素)](#本次更新-2026-06-20--坐标空间映射模型网格--物理像素)
- [本次更新 (2026-06-20) — 语音指令路由器](#本次更新-2026-06-20--语音指令路由器)
- [本次更新 (2026-06-20) — 区域设置感知的数字、货币与日期解析](#本次更新-2026-06-20--区域设置感知的数字货币与日期解析)
- [本次更新 (2026-06-20) — 感知哈希图像去重](#本次更新-2026-06-20--感知哈希图像去重)
@@ -96,6 +97,12 @@

---

## 本次更新 (2026-06-20) — 坐标空间映射(模型网格 ⇄ 物理像素)

将电脑操作模型的点击转成物理像素。完整参考:[`docs/source/Zh/doc/new_features/v45_features_doc.rst`](../docs/source/Zh/doc/new_features/v45_features_doc.rst)。

- **`CoordinateSpace` / `xga_space` / `normalized_space` / `downscale_png`**(`AC_to_physical` / `AC_to_model`、`ac_*`):电脑操作/VLA 模型以固定网格点击(Anthropic 缩小到 XGA;Gemini 返回 1000×1000 网格),而非物理像素。本功能双向映射(四舍五入 + 夹限),`xga_space` 保持长宽比且不放大,`downscale_png` 将截图缩到模型输入尺寸(Pillow,已是核心)。纯算术映射 —— 无需模型/GPU 即可单元测试。

## 本次更新 (2026-06-20) — 语音指令路由器

以已识别语音免手动触发流程。完整参考:[`docs/source/Zh/doc/new_features/v44_features_doc.rst`](../docs/source/Zh/doc/new_features/v44_features_doc.rst)。
7 changes: 7 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-20) — 座標空間對映(模型網格 ⇄ 實體像素)](#本次更新-2026-06-20--座標空間對映模型網格--實體像素)
- [本次更新 (2026-06-20) — 語音指令路由器](#本次更新-2026-06-20--語音指令路由器)
- [本次更新 (2026-06-20) — 區域設定感知的數字、貨幣與日期解析](#本次更新-2026-06-20--區域設定感知的數字貨幣與日期解析)
- [本次更新 (2026-06-20) — 感知雜湊影像去重](#本次更新-2026-06-20--感知雜湊影像去重)
@@ -96,6 +97,12 @@

---

## 本次更新 (2026-06-20) — 座標空間對映(模型網格 ⇄ 實體像素)

將電腦操作模型的點擊轉成真實像素。完整參考:[`docs/source/Zh/doc/new_features/v45_features_doc.rst`](../docs/source/Zh/doc/new_features/v45_features_doc.rst)。

- **`CoordinateSpace` / `xga_space` / `normalized_space` / `downscale_png`**(`AC_to_physical` / `AC_to_model`、`ac_*`):電腦操作/VLA 模型以固定網格點擊(Anthropic 縮小到 XGA;Gemini 回傳 1000×1000 網格),而非實體像素。本功能雙向對映(四捨五入 + 夾限),`xga_space` 保持長寬比且不放大,`downscale_png` 將截圖縮到模型輸入尺寸(Pillow,已是核心)。純算術對映 —— 無需模型/GPU 即可單元測試。

## 本次更新 (2026-06-20) — 語音指令路由器

以已辨識語音免手動觸發流程。完整參考:[`docs/source/Zh/doc/new_features/v44_features_doc.rst`](../docs/source/Zh/doc/new_features/v44_features_doc.rst)。
45 changes: 45 additions & 0 deletions docs/source/Eng/doc/new_features/v45_features_doc.rst
@@ -0,0 +1,45 @@
Coordinate-Space Mapping (Model Grid ⇄ Physical Pixels)
=======================================================

Computer-use / VLA models do not click in physical pixels. Anthropic recommends
downscaling the screenshot to XGA (~1024×768) and mapping clicks back; Gemini's
computer-use model returns a normalized **1000×1000** grid; others assume the
display size you declared. ``CoordinateSpace`` captures the physical resolution
and the model's grid and converts both ways, so an agent loop can feed the model
a right-sized screenshot and translate its clicks back to real coordinates.

The mapping itself is pure arithmetic with no dependencies; :func:`downscale_png`
uses Pillow (already a core dependency). The module does not import ``PySide6``.

Headless API
------------

.. code-block:: python

    from je_auto_control import (
        CoordinateSpace, xga_space, normalized_space, downscale_png)

    space = normalized_space(1920, 1080, grid=1000)  # Gemini-style 1000x1000
    space.to_physical(500, 500)   # -> (960, 540)  model click -> real pixel
    space.to_model(960, 540)      # -> (500, 500)  real pixel -> model grid

    xga = xga_space(2560, 1440)   # Anthropic-style downscale, aspect-preserved
    small_png = downscale_png(screenshot_png, xga)  # send this to the model

``xga_space`` preserves the aspect ratio and never upscales; ``normalized_space``
builds a square grid. Both ``to_physical`` and ``to_model`` round to the nearest
integer and clamp the result to the valid pixel/grid range (``0`` to ``size - 1``).
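
The rounding and clamping described above can be sketched without the package installed. The snippet below is a minimal standalone re-implementation of the same arithmetic; the free functions ``clamp``, ``to_physical`` and ``to_model`` and the ``Size`` alias are hypothetical names used for illustration only, not the library's API.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height); illustrative alias, not part of the package


def clamp(value: int, size: int) -> int:
    """Keep a coordinate inside the valid index range [0, size - 1]."""
    return max(0, min(int(value), size - 1))


def to_physical(x: float, y: float, physical: Size, model: Size) -> Tuple[int, int]:
    """Scale a model-grid point up to physical pixels, then round and clamp."""
    px = round(x * physical[0] / model[0])
    py = round(y * physical[1] / model[1])
    return clamp(px, physical[0]), clamp(py, physical[1])


def to_model(x: int, y: int, physical: Size, model: Size) -> Tuple[int, int]:
    """Scale a physical pixel down to the model grid, then round and clamp."""
    mx = round(x * model[0] / physical[0])
    my = round(y * model[1] / physical[1])
    return clamp(mx, model[0]), clamp(my, model[1])


screen, grid = (1920, 1080), (1000, 1000)

centre = to_physical(500, 500, screen, grid)        # middle of the grid
far_corner = to_physical(1000, 1000, screen, grid)  # one past the last grid index
off_grid = to_physical(-5, -5, screen, grid)        # negative input is clamped to 0

# Round trip: pixel -> grid -> pixel may drift by a pixel because both steps round.
grid_point = to_model(961, 540, screen, grid)
round_trip = to_physical(*grid_point, screen, grid)
```

Here ``centre`` is ``(960, 540)``, ``far_corner`` is clamped to ``(1919, 1079)``, and the round trip turns pixel ``961`` into ``962``. A 1000-cell grid is coarser than a 1920-pixel row, so the inverse mapping is only approximate. Note also that Python's built-in ``round`` resolves exact halves to the nearest even integer.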

Executor commands
-----------------

================================ ===================================================
Command                          Effect
================================ ===================================================
``AC_to_physical``               Map a model-grid ``(x, y)`` to physical pixels.
``AC_to_model``                  Map physical pixels to a model grid (inverse).
================================ ===================================================

Both take ``x, y, physical_w, physical_h, model_w, model_h`` and return
``{x, y}``. The same operations are exposed as MCP tools (``ac_to_physical`` /
``ac_to_model``) and as Script Builder commands under **Agent**.
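
As a rough illustration of the flat six-argument shape, the sketch below mirrors what ``AC_to_physical`` computes. ``to_physical_command`` is a hypothetical stand-in written for this example so it runs without the package; it is not the executor adapter itself.

```python
from typing import Dict


def to_physical_command(x: float, y: float, physical_w: int, physical_h: int,
                        model_w: int, model_h: int) -> Dict[str, int]:
    """Hypothetical stand-in for the command: six flat arguments in, {x, y} out."""
    px = round(x * physical_w / model_w)
    py = round(y * physical_h / model_h)
    return {
        "x": max(0, min(px, physical_w - 1)),  # clamp to the last valid column
        "y": max(0, min(py, physical_h - 1)),  # clamp to the last valid row
    }


# A click at the centre of a 1024x576 model screenshot of a 2560x1440 display.
arguments = {
    "x": 512, "y": 288,
    "physical_w": 2560, "physical_h": 1440,
    "model_w": 1024, "model_h": 576,
}
result = to_physical_command(**arguments)
```

With these arguments ``result`` is ``{"x": 1280, "y": 720}``. Passing all four dimensions on every call keeps the command stateless, at the cost of repeating them for each click.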
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -67,6 +67,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v42_features_doc
doc/new_features/v43_features_doc
doc/new_features/v44_features_doc
doc/new_features/v45_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
42 changes: 42 additions & 0 deletions docs/source/Zh/doc/new_features/v45_features_doc.rst
@@ -0,0 +1,42 @@
座標空間對映(模型網格 ⇄ 實體像素)
====================================

電腦操作 / VLA 模型並不是以實體像素點擊。Anthropic 建議將螢幕截圖縮小到 XGA
(~1024×768)再把點擊映射回去;Gemini 的電腦操作模型回傳正規化的 **1000×1000** 網格;
其他模型則假設你宣告的顯示尺寸。``CoordinateSpace`` 捕捉實體解析度與模型網格並雙向轉
換,因此 agent loop 可餵給模型一張尺寸正確的截圖,並把它的點擊轉回真實座標。

對映為純算術(無相依);:func:`downscale_png` 使用 Pillow(已是核心相依)。不匯入
``PySide6``。

無頭 API
--------

.. code-block:: python

    from je_auto_control import (
        CoordinateSpace, xga_space, normalized_space, downscale_png)

    space = normalized_space(1920, 1080, grid=1000)  # Gemini 式 1000x1000
    space.to_physical(500, 500)   # -> (960, 540)  模型點擊 -> 真實像素
    space.to_model(960, 540)      # -> (500, 500)  真實像素 -> 模型網格

    xga = xga_space(2560, 1440)   # Anthropic 式縮小,保持長寬比
    small_png = downscale_png(screenshot_png, xga)  # 把這張送給模型

``xga_space`` 會保持長寬比且永不放大;``normalized_space`` 建立方形網格。
``to_physical`` / ``to_model`` 皆會四捨五入並夾限到有效的像素/網格範圍內。

執行器指令
----------

================================ ===================================================
指令                             效果
================================ ===================================================
``AC_to_physical``               將模型網格 ``(x, y)`` 對映到實體像素。
``AC_to_model``                  將實體像素對映到模型網格(反向)。
================================ ===================================================

兩者皆接受 ``x, y, physical_w, physical_h, model_w, model_h`` 並回傳 ``{x, y}``。相同操
作亦提供為 MCP 工具(``ac_to_physical`` / ``ac_to_model``),以及 Script Builder 中
**Agent** 分類下的指令。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -67,6 +67,7 @@ AutoControl 所有功能的完整使用指南。
doc/new_features/v42_features_doc
doc/new_features/v43_features_doc
doc/new_features/v44_features_doc
doc/new_features/v45_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
5 changes: 5 additions & 0 deletions je_auto_control/__init__.py
@@ -251,6 +251,10 @@
from je_auto_control.utils.voice import (
    VoiceCommand, VoiceRouter, default_voice_router,
)
# Coordinate-space mapping (model grid <-> physical pixels)
from je_auto_control.utils.coordinate_space import (
    CoordinateSpace, downscale_png, normalized_space, xga_space,
)
# Background popup/interrupt watchdog (unattended automation)
from je_auto_control.utils.watchdog import (
    PopupWatchdog, WatchdogRule, default_popup_watchdog,
@@ -705,6 +709,7 @@ def start_autocontrol_gui(*args, **kwargs):
"format_currency", "format_date", "format_decimal", "parse_decimal",
"parse_number",
"VoiceCommand", "VoiceRouter", "default_voice_router",
"CoordinateSpace", "downscale_png", "normalized_space", "xga_space",
# MCP server
"AuditLogger", "HttpMCPServer", "MCPContent", "MCPPrompt",
"MCPPromptArgument", "MCPResource", "MCPServer", "MCPTool",
22 changes: 22 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -1019,6 +1019,28 @@ def _add_misc_specs(specs: List[CommandSpec]) -> None:
        fields=(),
        description="Remove all registered voice commands.",
    ))
    specs.append(CommandSpec(
        "AC_to_physical", "Agent", "Coords: Model -> Physical",
        fields=(
            FieldSpec("x", FieldType.FLOAT), FieldSpec("y", FieldType.FLOAT),
            FieldSpec("physical_w", FieldType.INT),
            FieldSpec("physical_h", FieldType.INT),
            FieldSpec("model_w", FieldType.INT),
            FieldSpec("model_h", FieldType.INT),
        ),
        description="Map a model-grid coordinate to physical pixels.",
    ))
    specs.append(CommandSpec(
        "AC_to_model", "Agent", "Coords: Physical -> Model",
        fields=(
            FieldSpec("x", FieldType.INT), FieldSpec("y", FieldType.INT),
            FieldSpec("physical_w", FieldType.INT),
            FieldSpec("physical_h", FieldType.INT),
            FieldSpec("model_w", FieldType.INT),
            FieldSpec("model_h", FieldType.INT),
        ),
        description="Map a physical-pixel coordinate to a model grid.",
    ))
    specs.append(CommandSpec(
        "AC_generate_sop", "Report", "Generate SOP Document",
        fields=(
8 changes: 8 additions & 0 deletions je_auto_control/utils/coordinate_space/__init__.py
@@ -0,0 +1,8 @@
"""Coordinate-space mapping between model grids and physical pixels."""
from je_auto_control.utils.coordinate_space.coordinate_space import (
    CoordinateSpace, downscale_png, normalized_space, xga_space,
)

__all__ = [
    "CoordinateSpace", "downscale_png", "normalized_space", "xga_space",
]
76 changes: 76 additions & 0 deletions je_auto_control/utils/coordinate_space/coordinate_space.py
@@ -0,0 +1,76 @@
"""Map coordinates between a model's grid and physical screen pixels.

Computer-use / VLA models do not click in physical pixels: Anthropic recommends
downscaling the screenshot to XGA (~1024x768) and mapping clicks back; Gemini
computer-use returns a normalized **1000x1000** grid; others assume the declared
display size. A :class:`CoordinateSpace` captures the physical resolution and the
model's grid and converts both ways, so an agent loop can send the model a
right-sized screenshot and translate its clicks back to real coordinates.

Pure arithmetic for the mapping (no dependency); :func:`downscale_png` uses
Pillow, which is already a core dependency. Imports no ``PySide6``.
"""
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CoordinateSpace:
    """A mapping between physical pixels and a model coordinate grid."""

    physical_w: int
    physical_h: int
    model_w: int
    model_h: int

    def to_physical(self, x: float, y: float) -> Tuple[int, int]:
        """Map a model-space ``(x, y)`` to physical pixels (clamped, rounded)."""
        px = round(x * self.physical_w / self.model_w)
        py = round(y * self.physical_h / self.model_h)
        return (_clamp(px, self.physical_w), _clamp(py, self.physical_h))

    def to_model(self, x: int, y: int) -> Tuple[int, int]:
        """Map physical pixels ``(x, y)`` to model space (clamped, rounded)."""
        mx = round(x * self.model_w / self.physical_w)
        my = round(y * self.model_h / self.physical_h)
        return (_clamp(mx, self.model_w), _clamp(my, self.model_h))

    @property
    def model_size(self) -> Tuple[int, int]:
        """The model grid as ``(width, height)``."""
        return (self.model_w, self.model_h)


def _clamp(value: int, size: int) -> int:
    return max(0, min(int(value), size - 1))


def xga_space(physical_w: int, physical_h: int, *, max_w: int = 1024,
              max_h: int = 768) -> CoordinateSpace:
    """Build a space that fits the screen within ``max_w`` x ``max_h``.

    The aspect ratio is preserved: the more restrictive dimension sets the
    scale, and the scale is capped at 1.0 so the screen is never upscaled.
    This matches the Anthropic "downscale to XGA" recommendation.
    """
    scale = min(max_w / physical_w, max_h / physical_h, 1.0)
    model_w = max(1, round(physical_w * scale))
    model_h = max(1, round(physical_h * scale))
    return CoordinateSpace(physical_w, physical_h, model_w, model_h)


def normalized_space(physical_w: int, physical_h: int, *,
                     grid: int = 1000) -> CoordinateSpace:
    """Build a square normalized grid (default 1000x1000, Gemini-style)."""
    return CoordinateSpace(physical_w, physical_h, grid, grid)


def downscale_png(png: bytes, space: CoordinateSpace) -> bytes:
    """Resize a PNG screenshot to ``space``'s model size (for model input)."""
    import io

    from PIL import Image
    with Image.open(io.BytesIO(png)) as image:
        resized = image.convert("RGB").resize(space.model_size)
    buffer = io.BytesIO()
    resized.save(buffer, format="PNG")
    return buffer.getvalue()
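
To make the fit-within arithmetic in `xga_space` concrete, here is a small standalone sketch of the same scale computation. `fit_within` is a hypothetical helper written for this example so it runs without the package; it returns only the model size rather than a `CoordinateSpace`.

```python
from typing import Tuple


def fit_within(physical_w: int, physical_h: int,
               max_w: int = 1024, max_h: int = 768) -> Tuple[int, int]:
    """Return the model (width, height) that fits inside max_w x max_h."""
    # The smallest ratio is the binding constraint; 1.0 caps it so we never upscale.
    scale = min(max_w / physical_w, max_h / physical_h, 1.0)
    return max(1, round(physical_w * scale)), max(1, round(physical_h * scale))


widescreen = fit_within(2560, 1440)    # width is the binding constraint
portrait = fit_within(1080, 1920)      # height is the binding constraint
already_small = fit_within(800, 600)   # fits already, so the size is unchanged
```

For a 2560x1440 display the scale is 0.4, giving a 1024x576 model grid: a 16:9 screen does not fill the 4:3 XGA box vertically. The portrait case scales to 432x768, and the 800x600 screen is returned as is.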
20 changes: 20 additions & 0 deletions je_auto_control/utils/executor/action_executor.py
@@ -3195,6 +3195,24 @@ def _voice_clear() -> Dict[str, Any]:
return {"cleared": True}


def _to_physical(x: float, y: float, physical_w: int, physical_h: int,
                 model_w: int, model_h: int) -> Dict[str, Any]:
    """Adapter: map a model-grid coordinate to physical pixels."""
    from je_auto_control.utils.coordinate_space import CoordinateSpace
    px, py = CoordinateSpace(physical_w, physical_h, model_w,
                             model_h).to_physical(x, y)
    return {"x": px, "y": py}


def _to_model(x: int, y: int, physical_w: int, physical_h: int,
              model_w: int, model_h: int) -> Dict[str, Any]:
    """Adapter: map a physical-pixel coordinate to a model grid."""
    from je_auto_control.utils.coordinate_space import CoordinateSpace
    mx, my = CoordinateSpace(physical_w, physical_h, model_w,
                             model_h).to_model(x, y)
    return {"x": mx, "y": my}


class Executor:
"""
Executor
@@ -3465,6 +3483,8 @@ def __init__(self):
"AC_voice_dispatch": _voice_dispatch,
"AC_voice_list": _voice_list,
"AC_voice_clear": _voice_clear,
"AC_to_physical": _to_physical,
"AC_to_model": _to_model,
"AC_a11y_record_start": _a11y_record_start,
"AC_a11y_record_stop": _a11y_record_stop,
"AC_a11y_record_events": _a11y_record_events,
29 changes: 28 additions & 1 deletion je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3049,6 +3049,33 @@ def voice_tools() -> List[MCPTool]:
]


def coordinate_space_tools() -> List[MCPTool]:
    _DIMS = {"x": {"type": "number"}, "y": {"type": "number"},
             "physical_w": {"type": "integer"},
             "physical_h": {"type": "integer"},
             "model_w": {"type": "integer"}, "model_h": {"type": "integer"}}
    _REQ = ["x", "y", "physical_w", "physical_h", "model_w", "model_h"]
    return [
        MCPTool(
            name="ac_to_physical",
            description=("Map a model-grid coordinate (e.g. a 1000x1000 or XGA "
                         "click from a computer-use model) to physical screen "
                         "pixels. Returns {x, y}."),
            input_schema=schema(dict(_DIMS), list(_REQ)),
            handler=h.to_physical,
            annotations=READ_ONLY,
        ),
        MCPTool(
            name="ac_to_model",
            description=("Map a physical-pixel coordinate to a model grid "
                         "(inverse of ac_to_physical). Returns {x, y}."),
            input_schema=schema(dict(_DIMS), list(_REQ)),
            handler=h.to_model,
            annotations=READ_ONLY,
        ),
    ]


def unattended_tools() -> List[MCPTool]:
return [
MCPTool(
@@ -4110,7 +4137,7 @@ def media_assert_tools() -> List[MCPTool]:
credential_lease_tools, egress_tools, approval_testing_tools,
trajectory_eval_tools, compliance_tools, agent_trace_tools,
video_report_tools, fuzzy_tools, artifact_store_tools, image_dedup_tools,
locale_tools, voice_tools,
locale_tools, voice_tools, coordinate_space_tools,
screen_record_tools,
process_and_shell_tools, remote_desktop_tools, gamepad_tools,
usb_passthrough_tools, assertion_tools, data_source_tools,
14 changes: 14 additions & 0 deletions je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1471,6 +1471,20 @@ def voice_clear():
return {"cleared": True}


def to_physical(x, y, physical_w, physical_h, model_w, model_h):
    from je_auto_control.utils.coordinate_space import CoordinateSpace
    px, py = CoordinateSpace(physical_w, physical_h, model_w,
                             model_h).to_physical(x, y)
    return {"x": px, "y": py}


def to_model(x, y, physical_w, physical_h, model_w, model_h):
    from je_auto_control.utils.coordinate_space import CoordinateSpace
    mx, my = CoordinateSpace(physical_w, physical_h, model_w,
                             model_h).to_model(x, y)
    return {"x": mx, "y": my}


def vlm_locate(description: str,
screen_region: Optional[List[int]] = None,
model: Optional[str] = None) -> Optional[List[int]]: