Integration-Automation · JE-Chen · Jun 20, 2026 · Jun 20, 2026
diff --git a/README.md b/README.md
@@ -13,6 +13,7 @@
 
 ## Table of Contents
 
+- [What's new (2026-06-20) — Coordinate-Space Mapping (Model Grid ⇄ Physical Pixels)](#whats-new-2026-06-20--coordinate-space-mapping-model-grid--physical-pixels)
 - [What's new (2026-06-20) — Voice-Command Router](#whats-new-2026-06-20--voice-command-router)
 - [What's new (2026-06-20) — Locale-Aware Number, Currency & Date Parsing](#whats-new-2026-06-20--locale-aware-number-currency--date-parsing)
 - [What's new (2026-06-20) — Perceptual-Hash Image Dedupe](#whats-new-2026-06-20--perceptual-hash-image-dedupe)
@@ -97,6 +98,12 @@
 
 ---
 
+## What's new (2026-06-20) — Coordinate-Space Mapping (Model Grid ⇄ Physical Pixels)
+
+Translate computer-use model clicks to real pixels. Full reference: [`docs/source/Eng/doc/new_features/v45_features_doc.rst`](docs/source/Eng/doc/new_features/v45_features_doc.rst).
+
+- **`CoordinateSpace` / `xga_space` / `normalized_space` / `downscale_png`** (`AC_to_physical` / `AC_to_model`, `ac_*`): computer-use/VLA models click in a fixed grid rather than physical pixels (Anthropic recommends downscaling screenshots to XGA; Gemini returns a 1000×1000 grid). `CoordinateSpace` maps both ways, rounding and clamping the result; `xga_space` preserves aspect ratio without upscaling; and `downscale_png` resizes a screenshot to the model's input size using Pillow, already a core dependency. The mapping is pure arithmetic, so it is unit-tested without a model or GPU.
+
 ## What's new (2026-06-20) — Voice-Command Router
 
 Trigger flows hands-free from recognized speech. Full reference: [`docs/source/Eng/doc/new_features/v44_features_doc.rst`](docs/source/Eng/doc/new_features/v44_features_doc.rst).

diff --git a/README/README_zh-CN.md b/README/README_zh-CN.md
@@ -12,6 +12,7 @@
 
 ## 目录
 
+- [本次更新 (2026-06-20) — 坐标空间映射(模型网格 ⇄ 物理像素)](#本次更新-2026-06-20--坐标空间映射模型网格--物理像素)
 - [本次更新 (2026-06-20) — 语音指令路由器](#本次更新-2026-06-20--语音指令路由器)
 - [本次更新 (2026-06-20) — 区域设置感知的数字、货币与日期解析](#本次更新-2026-06-20--区域设置感知的数字货币与日期解析)
 - [本次更新 (2026-06-20) — 感知哈希图像去重](#本次更新-2026-06-20--感知哈希图像去重)
@@ -96,6 +97,12 @@
 
 ---
 
+## 本次更新 (2026-06-20) — 坐标空间映射(模型网格 ⇄ 物理像素)
+
+将电脑操作模型的点击转成物理像素。完整参考:[`docs/source/Zh/doc/new_features/v45_features_doc.rst`](../docs/source/Zh/doc/new_features/v45_features_doc.rst)。
+
+- **`CoordinateSpace` / `xga_space` / `normalized_space` / `downscale_png`**(`AC_to_physical` / `AC_to_model`、`ac_*`):电脑操作/VLA 模型以固定网格点击(Anthropic 缩小到 XGA;Gemini 返回 1000×1000 网格),而非物理像素。本功能双向映射(四舍五入 + 夹限),`xga_space` 保持长宽比且不放大,`downscale_png` 将截图缩到模型输入尺寸(Pillow,已是核心)。纯算术映射 —— 无需模型/GPU 即可单元测试。
+
 ## 本次更新 (2026-06-20) — 语音指令路由器
 
 以已识别语音免手动触发流程。完整参考:[`docs/source/Zh/doc/new_features/v44_features_doc.rst`](../docs/source/Zh/doc/new_features/v44_features_doc.rst)。

diff --git a/README/README_zh-TW.md b/README/README_zh-TW.md
@@ -12,6 +12,7 @@
 
 ## 目錄
 
+- [本次更新 (2026-06-20) — 座標空間對映(模型網格 ⇄ 實體像素)](#本次更新-2026-06-20--座標空間對映模型網格--實體像素)
 - [本次更新 (2026-06-20) — 語音指令路由器](#本次更新-2026-06-20--語音指令路由器)
 - [本次更新 (2026-06-20) — 區域設定感知的數字、貨幣與日期解析](#本次更新-2026-06-20--區域設定感知的數字貨幣與日期解析)
 - [本次更新 (2026-06-20) — 感知雜湊影像去重](#本次更新-2026-06-20--感知雜湊影像去重)
@@ -96,6 +97,12 @@
 
 ---
 
+## 本次更新 (2026-06-20) — 座標空間對映(模型網格 ⇄ 實體像素)
+
+將電腦操作模型的點擊轉成真實像素。完整參考:[`docs/source/Zh/doc/new_features/v45_features_doc.rst`](../docs/source/Zh/doc/new_features/v45_features_doc.rst)。
+
+- **`CoordinateSpace` / `xga_space` / `normalized_space` / `downscale_png`**(`AC_to_physical` / `AC_to_model`、`ac_*`):電腦操作/VLA 模型以固定網格點擊(Anthropic 縮小到 XGA;Gemini 回傳 1000×1000 網格),而非實體像素。本功能雙向對映(四捨五入 + 夾限),`xga_space` 保持長寬比且不放大,`downscale_png` 將截圖縮到模型輸入尺寸(Pillow,已是核心)。純算術對映 —— 無需模型/GPU 即可單元測試。
+
 ## 本次更新 (2026-06-20) — 語音指令路由器
 
 以已辨識語音免手動觸發流程。完整參考:[`docs/source/Zh/doc/new_features/v44_features_doc.rst`](../docs/source/Zh/doc/new_features/v44_features_doc.rst)。

diff --git a/docs/source/Eng/doc/new_features/v45_features_doc.rst b/docs/source/Eng/doc/new_features/v45_features_doc.rst
@@ -0,0 +1,45 @@
+Coordinate-Space Mapping (Model Grid ⇄ Physical Pixels)
+=======================================================
+
+Computer-use / VLA models do not click in physical pixels. Anthropic recommends
+downscaling the screenshot to XGA (~1024×768) and mapping clicks back; Gemini's
+computer-use model returns a normalized **1000×1000** grid; others assume the
+display size you declared. ``CoordinateSpace`` captures the physical resolution
+and the model's grid and converts both ways, so an agent loop can feed the model
+a right-sized screenshot and translate its clicks back to real coordinates.
+
+The mapping is pure arithmetic with no dependencies; :func:`downscale_png` uses
+Pillow (already a core dependency). The module does not import ``PySide6``.
+
+Headless API
+------------
+
+.. code-block:: python
+
+    from je_auto_control import (
+        CoordinateSpace, xga_space, normalized_space, downscale_png)
+
+    space = normalized_space(1920, 1080, grid=1000)   # Gemini-style 1000x1000
+    space.to_physical(500, 500)        # -> (960, 540)   model click -> real pixel
+    space.to_model(960, 540)           # -> (500, 500)   real pixel -> model grid
+
+    xga = xga_space(2560, 1440)        # Anthropic-style downscale, aspect-preserved
+    small_png = downscale_png(screenshot_png, xga)   # send this to the model
+
+``xga_space`` preserves aspect ratio and never upscales; ``normalized_space``
+builds a square grid. ``to_physical`` and ``to_model`` both round to the nearest
+integer and clamp the result to ``0 .. size - 1`` on each axis.
+
+Executor commands
+-----------------
+
+================================ ===================================================
+Command                          Effect
+================================ ===================================================
+``AC_to_physical``               Map a model-grid ``(x, y)`` to physical pixels.
+``AC_to_model``                  Map physical pixels to a model grid (inverse).
+================================ ===================================================
+
+Both take ``x, y, physical_w, physical_h, model_w, model_h`` and return
+``{x, y}``. The same operations are exposed as MCP tools (``ac_to_physical`` /
+``ac_to_model``) and as Script Builder commands under **Agent**.
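
To make the ``xga_space`` arithmetic documented above concrete, here is a standalone sketch that repeats the scale computation from this PR without importing ``je_auto_control``. The helper name ``fit_within`` is hypothetical; only the formula is taken from the diff.

```python
from typing import Tuple


def fit_within(physical_w: int, physical_h: int,
               max_w: int = 1024, max_h: int = 768) -> Tuple[int, int]:
    """Hypothetical helper: the model grid xga_space would pick for a screen."""
    # The smaller ratio wins so both axes fit; 1.0 caps it so nothing is upscaled.
    scale = min(max_w / physical_w, max_h / physical_h, 1.0)
    return (max(1, round(physical_w * scale)),
            max(1, round(physical_h * scale)))


qhd = fit_within(2560, 1440)        # 16:9 screen, width is the limiting axis
small = fit_within(800, 600)        # already fits, so it is left unchanged
portrait = fit_within(1080, 1920)   # height is the limiting axis

# Map a model click at the centre of the downscaled QHD image back to pixels.
click = (round(512 * 2560 / qhd[0]), round(288 * 1440 / qhd[1]))

print(qhd, small, portrait, click)
# (1024, 576) (800, 600) (432, 768) (1280, 720)
```

A 2560×1440 screen therefore reaches the model as 1024×576 rather than a stretched 1024×768, which is why the same space object has to be used for both the downscale and the click translation.
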
diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst
@@ -67,6 +67,7 @@ Comprehensive guides for all AutoControl features.
    doc/new_features/v42_features_doc
    doc/new_features/v43_features_doc
    doc/new_features/v44_features_doc
+   doc/new_features/v45_features_doc
    doc/ocr_backends/ocr_backends_doc
    doc/observability/observability_doc
    doc/operations_layer/operations_layer_doc

diff --git a/docs/source/Zh/doc/new_features/v45_features_doc.rst b/docs/source/Zh/doc/new_features/v45_features_doc.rst
@@ -0,0 +1,42 @@
+座標空間對映(模型網格 ⇄ 實體像素)
+====================================
+
+電腦操作 / VLA 模型並不是以實體像素點擊。Anthropic 建議將螢幕截圖縮小到 XGA
+(~1024×768)再把點擊映射回去;Gemini 的電腦操作模型回傳正規化的 **1000×1000** 網格;
+其他模型則假設你宣告的顯示尺寸。``CoordinateSpace`` 捕捉實體解析度與模型網格並雙向轉
+換,因此 agent loop 可餵給模型一張尺寸正確的截圖,並把它的點擊轉回真實座標。
+
+對映為純算術(無相依);:func:`downscale_png` 使用 Pillow(已是核心相依)。不匯入
+``PySide6``。
+
+無頭 API
+--------
+
+.. code-block:: python
+
+    from je_auto_control import (
+        CoordinateSpace, xga_space, normalized_space, downscale_png)
+
+    space = normalized_space(1920, 1080, grid=1000)   # Gemini 式 1000x1000
+    space.to_physical(500, 500)        # -> (960, 540)   模型點擊 -> 真實像素
+    space.to_model(960, 540)           # -> (500, 500)   真實像素 -> 模型網格
+
+    xga = xga_space(2560, 1440)        # Anthropic 式縮小,保持長寬比
+    small_png = downscale_png(screenshot_png, xga)   # 把這張送給模型
+
+``xga_space`` 會保持長寬比且永不放大;``normalized_space`` 建立方形網格。
+``to_physical`` / ``to_model`` 皆會四捨五入並夾限到有效的像素/網格範圍內。
+
+執行器指令
+----------
+
+================================ ===================================================
+指令                             效果
+================================ ===================================================
+``AC_to_physical``               將模型網格 ``(x, y)`` 對映到實體像素。
+``AC_to_model``                  將實體像素對映到模型網格(反向)。
+================================ ===================================================
+
+兩者皆接受 ``x, y, physical_w, physical_h, model_w, model_h`` 並回傳 ``{x, y}``。相同操
+作亦提供為 MCP 工具(``ac_to_physical`` / ``ac_to_model``),以及 Script Builder 中
+**Agent** 分類下的指令。
diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst
@@ -67,6 +67,7 @@ AutoControl 所有功能的完整使用指南。
    doc/new_features/v42_features_doc
    doc/new_features/v43_features_doc
    doc/new_features/v44_features_doc
+   doc/new_features/v45_features_doc
    doc/ocr_backends/ocr_backends_doc
    doc/observability/observability_doc
    doc/operations_layer/operations_layer_doc

diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py
@@ -251,6 +251,10 @@
 from je_auto_control.utils.voice import (
     VoiceCommand, VoiceRouter, default_voice_router,
 )
+# Coordinate-space mapping (model grid <-> physical pixels)
+from je_auto_control.utils.coordinate_space import (
+    CoordinateSpace, downscale_png, normalized_space, xga_space,
+)
 # Background popup/interrupt watchdog (unattended automation)
 from je_auto_control.utils.watchdog import (
     PopupWatchdog, WatchdogRule, default_popup_watchdog,
@@ -705,6 +709,7 @@ def start_autocontrol_gui(*args, **kwargs):
     "format_currency", "format_date", "format_decimal", "parse_decimal",
     "parse_number",
     "VoiceCommand", "VoiceRouter", "default_voice_router",
+    "CoordinateSpace", "downscale_png", "normalized_space", "xga_space",
     # MCP server
     "AuditLogger", "HttpMCPServer", "MCPContent", "MCPPrompt",
     "MCPPromptArgument", "MCPResource", "MCPServer", "MCPTool",

diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py
@@ -1019,6 +1019,28 @@ def _add_misc_specs(specs: List[CommandSpec]) -> None:
         fields=(),
         description="Remove all registered voice commands.",
     ))
+    specs.append(CommandSpec(
+        "AC_to_physical", "Agent", "Coords: Model -> Physical",
+        fields=(
+            FieldSpec("x", FieldType.FLOAT), FieldSpec("y", FieldType.FLOAT),
+            FieldSpec("physical_w", FieldType.INT),
+            FieldSpec("physical_h", FieldType.INT),
+            FieldSpec("model_w", FieldType.INT),
+            FieldSpec("model_h", FieldType.INT),
+        ),
+        description="Map a model-grid coordinate to physical pixels.",
+    ))
+    specs.append(CommandSpec(
+        "AC_to_model", "Agent", "Coords: Physical -> Model",
+        fields=(
+            FieldSpec("x", FieldType.INT), FieldSpec("y", FieldType.INT),
+            FieldSpec("physical_w", FieldType.INT),
+            FieldSpec("physical_h", FieldType.INT),
+            FieldSpec("model_w", FieldType.INT),
+            FieldSpec("model_h", FieldType.INT),
+        ),
+        description="Map a physical-pixel coordinate to a model grid.",
+    ))
     specs.append(CommandSpec(
         "AC_generate_sop", "Report", "Generate SOP Document",
         fields=(

diff --git a/je_auto_control/utils/coordinate_space/__init__.py b/je_auto_control/utils/coordinate_space/__init__.py
@@ -0,0 +1,8 @@
+"""Coordinate-space mapping between model grids and physical pixels."""
+from je_auto_control.utils.coordinate_space.coordinate_space import (
+    CoordinateSpace, downscale_png, normalized_space, xga_space,
+)
+
+__all__ = [
+    "CoordinateSpace", "downscale_png", "normalized_space", "xga_space",
+]
diff --git a/je_auto_control/utils/coordinate_space/coordinate_space.py b/je_auto_control/utils/coordinate_space/coordinate_space.py
@@ -0,0 +1,76 @@
+"""Map coordinates between a model's grid and physical screen pixels.
+
+Computer-use / VLA models do not click in physical pixels: Anthropic recommends
+downscaling the screenshot to XGA (~1024x768) and mapping clicks back; Gemini
+computer-use returns a normalized **1000x1000** grid; others assume the declared
+display size. A :class:`CoordinateSpace` captures the physical resolution and the
+model's grid and converts both ways, so an agent loop can send the model a
+right-sized screenshot and translate its clicks back to real coordinates.
+
+The mapping is pure arithmetic with no dependencies; :func:`downscale_png`
+uses Pillow, already a core dependency. The module never imports ``PySide6``.
+"""
+from dataclasses import dataclass
+from typing import Tuple
+
+
+@dataclass(frozen=True)
+class CoordinateSpace:
+    """A mapping between physical pixels and a model coordinate grid."""
+
+    physical_w: int
+    physical_h: int
+    model_w: int
+    model_h: int
+
+    def to_physical(self, x: float, y: float) -> Tuple[int, int]:
+        """Map a model-space ``(x, y)`` to physical pixels (clamped, rounded)."""
+        px = round(x * self.physical_w / self.model_w)
+        py = round(y * self.physical_h / self.model_h)
+        return (_clamp(px, self.physical_w), _clamp(py, self.physical_h))
+
+    def to_model(self, x: int, y: int) -> Tuple[int, int]:
+        """Map physical pixels ``(x, y)`` to model space (clamped, rounded)."""
+        mx = round(x * self.model_w / self.physical_w)
+        my = round(y * self.model_h / self.physical_h)
+        return (_clamp(mx, self.model_w), _clamp(my, self.model_h))
+
+    @property
+    def model_size(self) -> Tuple[int, int]:
+        """The model grid as ``(width, height)``."""
+        return (self.model_w, self.model_h)
+
+
+def _clamp(value: int, size: int) -> int:
+    return max(0, min(int(value), size - 1))
+
+
+def xga_space(physical_w: int, physical_h: int, *, max_w: int = 1024,
+              max_h: int = 768) -> CoordinateSpace:
+    """Build a space that fits the screen within ``max_w`` x ``max_h``.
+
+    Aspect ratio is preserved (the smaller scale factor wins, capped at 1.0 so
+    nothing is upscaled), per the Anthropic "downscale to XGA" recommendation.
+    """
+    scale = min(max_w / physical_w, max_h / physical_h, 1.0)
+    model_w = max(1, round(physical_w * scale))
+    model_h = max(1, round(physical_h * scale))
+    return CoordinateSpace(physical_w, physical_h, model_w, model_h)
+
+
+def normalized_space(physical_w: int, physical_h: int, *,
+                     grid: int = 1000) -> CoordinateSpace:
+    """Build a square normalized grid (default 1000x1000, Gemini-style)."""
+    return CoordinateSpace(physical_w, physical_h, grid, grid)
+
+
+def downscale_png(png: bytes, space: CoordinateSpace) -> bytes:
+    """Resize a PNG screenshot to ``space``'s model size (for model input)."""
+    import io
+
+    from PIL import Image
+    with Image.open(io.BytesIO(png)) as image:
+        resized = image.convert("RGB").resize(space.model_size)
+        buffer = io.BytesIO()
+        resized.save(buffer, format="PNG")
+        return buffer.getvalue()
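
The edge behaviour of the mapping in this file (clamping, rounding, and lossy round trips) is easiest to see with a few concrete calls. The sketch below repeats the arithmetic from the diff in a standalone class so it runs without the package; ``SketchSpace`` and ``clamp`` are hypothetical stand-ins for ``CoordinateSpace`` and ``_clamp``.

```python
from dataclasses import dataclass
from typing import Tuple


def clamp(value: int, size: int) -> int:
    """Keep an index inside 0 .. size - 1."""
    return max(0, min(int(value), size - 1))


@dataclass(frozen=True)
class SketchSpace:
    """Hypothetical stand-in that repeats the CoordinateSpace arithmetic."""

    physical_w: int
    physical_h: int
    model_w: int
    model_h: int

    def to_physical(self, x: float, y: float) -> Tuple[int, int]:
        px = round(x * self.physical_w / self.model_w)
        py = round(y * self.physical_h / self.model_h)
        return (clamp(px, self.physical_w), clamp(py, self.physical_h))

    def to_model(self, x: int, y: int) -> Tuple[int, int]:
        mx = round(x * self.model_w / self.physical_w)
        my = round(y * self.model_h / self.physical_h)
        return (clamp(mx, self.model_w), clamp(my, self.model_h))


space = SketchSpace(1920, 1080, 1000, 1000)
centre = space.to_physical(500, 500)        # interior point, no clamping
far_corner = space.to_physical(1000, 1000)  # would be (1920, 1080), clamped
off_grid = space.to_physical(-5, -5)        # negative input clamps to origin

# A physical -> model -> physical round trip can move by a pixel, because the
# 1000-wide grid is coarser than the 1920-wide screen.
there = space.to_model(1919, 1079)
back = space.to_physical(*there)

# Python's round() sends exact halves to the nearest even integer.
halves = SketchSpace(100, 100, 100, 100).to_physical(2.5, 3.5)

print(centre, far_corner, off_grid, there, back, halves)
# (960, 540) (1919, 1079) (0, 0) (999, 999) (1918, 1079) (2, 4)
```

The last two results are worth keeping in mind when writing tests: the mapping is not an exact inverse pair, and half-way values follow Python's round-half-to-even rule.
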
diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py
@@ -3195,6 +3195,24 @@ def _voice_clear() -> Dict[str, Any]:
     return {"cleared": True}
 
 
+def _to_physical(x: float, y: float, physical_w: int, physical_h: int,
+                 model_w: int, model_h: int) -> Dict[str, Any]:
+    """Adapter: map a model-grid coordinate to physical pixels."""
+    from je_auto_control.utils.coordinate_space import CoordinateSpace
+    px, py = CoordinateSpace(physical_w, physical_h, model_w,
+                             model_h).to_physical(x, y)
+    return {"x": px, "y": py}
+
+
+def _to_model(x: int, y: int, physical_w: int, physical_h: int,
+              model_w: int, model_h: int) -> Dict[str, Any]:
+    """Adapter: map a physical-pixel coordinate to a model grid."""
+    from je_auto_control.utils.coordinate_space import CoordinateSpace
+    mx, my = CoordinateSpace(physical_w, physical_h, model_w,
+                             model_h).to_model(x, y)
+    return {"x": mx, "y": my}
+
+
 class Executor:
     """
     Executor
@@ -3465,6 +3483,8 @@ def __init__(self):
             "AC_voice_dispatch": _voice_dispatch,
             "AC_voice_list": _voice_list,
             "AC_voice_clear": _voice_clear,
+            "AC_to_physical": _to_physical,
+            "AC_to_model": _to_model,
             "AC_a11y_record_start": _a11y_record_start,
             "AC_a11y_record_stop": _a11y_record_stop,
             "AC_a11y_record_events": _a11y_record_events,

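The two registrations above make the mapping reachable by command name. As a rough illustration of that pattern, the sketch below uses a hypothetical one-entry dispatch table and a hypothetical ``run`` helper; the real ``Executor`` registers many more commands and its calling convention is not shown in this diff.

```python
from typing import Any, Callable, Dict, List, Tuple


def to_physical(x: float, y: float, physical_w: int, physical_h: int,
                model_w: int, model_h: int) -> Dict[str, int]:
    """Stand-in adapter: same arithmetic as the diff, returned as {x, y}."""
    px = max(0, min(round(x * physical_w / model_w), physical_w - 1))
    py = max(0, min(round(y * physical_h / model_h), physical_h - 1))
    return {"x": px, "y": py}


# Hypothetical dispatch table keyed by command name.
COMMANDS: Dict[str, Callable[..., Any]] = {"AC_to_physical": to_physical}


def run(actions: List[Tuple[str, Dict[str, Any]]]) -> List[Any]:
    """Look each command up by name and call it with keyword arguments."""
    return [COMMANDS[name](**kwargs) for name, kwargs in actions]


results = run([
    ("AC_to_physical", {"x": 500, "y": 500,
                        "physical_w": 1920, "physical_h": 1080,
                        "model_w": 1000, "model_h": 1000}),
])
print(results)  # [{'x': 960, 'y': 540}]
```

Returning a plain ``{x, y}`` dict keeps the result JSON-serialisable, which suits both scripted flows and the MCP tools added later in this PR.
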
diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3049,6 +3049,33 @@ def voice_tools() -> List[MCPTool]:
     ]
 
 
+def coordinate_space_tools() -> List[MCPTool]:
+    _DIMS = {"x": {"type": "number"}, "y": {"type": "number"},
+             "physical_w": {"type": "integer"},
+             "physical_h": {"type": "integer"},
+             "model_w": {"type": "integer"}, "model_h": {"type": "integer"}}
+    _REQ = ["x", "y", "physical_w", "physical_h", "model_w", "model_h"]
+    return [
+        MCPTool(
+            name="ac_to_physical",
+            description=("Map a model-grid coordinate (e.g. a 1000x1000 or XGA "
+                         "click from a computer-use model) to physical screen "
+                         "pixels. Returns {x, y}."),
+            input_schema=schema(dict(_DIMS), list(_REQ)),
+            handler=h.to_physical,
+            annotations=READ_ONLY,
+        ),
+        MCPTool(
+            name="ac_to_model",
+            description=("Map a physical-pixel coordinate to a model grid "
+                         "(inverse of ac_to_physical). Returns {x, y}."),
+            input_schema=schema(dict(_DIMS), list(_REQ)),
+            handler=h.to_model,
+            annotations=READ_ONLY,
+        ),
+    ]
+
+
 def unattended_tools() -> List[MCPTool]:
     return [
         MCPTool(
@@ -4110,7 +4137,7 @@ def media_assert_tools() -> List[MCPTool]:
     credential_lease_tools, egress_tools, approval_testing_tools,
     trajectory_eval_tools, compliance_tools, agent_trace_tools,
     video_report_tools, fuzzy_tools, artifact_store_tools, image_dedup_tools,
-    locale_tools, voice_tools,
+    locale_tools, voice_tools, coordinate_space_tools,
     screen_record_tools,
     process_and_shell_tools, remote_desktop_tools, gamepad_tools,
     usb_passthrough_tools, assertion_tools, data_source_tools,
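
Both tools share one property map and one required list. The sketch below shows what such an input schema might look like and how a caller could check a request against it. It assumes ``schema()`` produces an ordinary JSON Schema object, which this diff does not show; ``object_schema`` and ``missing_arguments`` are hypothetical names.

```python
from typing import Any, Dict, List

# Same property and required lists as coordinate_space_tools() in the diff.
DIMS: Dict[str, Dict[str, str]] = {
    "x": {"type": "number"}, "y": {"type": "number"},
    "physical_w": {"type": "integer"}, "physical_h": {"type": "integer"},
    "model_w": {"type": "integer"}, "model_h": {"type": "integer"},
}
REQUIRED: List[str] = ["x", "y", "physical_w", "physical_h",
                       "model_w", "model_h"]


def object_schema(properties: Dict[str, Any],
                  required: List[str]) -> Dict[str, Any]:
    """Hypothetical equivalent of schema(): a JSON Schema object description."""
    return {"type": "object", "properties": properties, "required": required}


def missing_arguments(arguments: Dict[str, Any],
                      tool_schema: Dict[str, Any]) -> List[str]:
    """Names the schema requires that are absent from the call."""
    return [name for name in tool_schema["required"] if name not in arguments]


tool_schema = object_schema(dict(DIMS), list(REQUIRED))
complete = {"x": 512.0, "y": 288.0, "physical_w": 2560, "physical_h": 1440,
            "model_w": 1024, "model_h": 576}
partial = {"x": 512.0, "y": 288.0}

print(missing_arguments(complete, tool_schema))  # []
print(missing_arguments(partial, tool_schema))
# ['physical_w', 'physical_h', 'model_w', 'model_h']
```

Because the tools are stateless, every call has to carry all four dimensions; a client that talks to one model per session may want to cache them on its side.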

diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -1471,6 +1471,20 @@ def voice_clear():
     return {"cleared": True}
 
 
+def to_physical(x, y, physical_w, physical_h, model_w, model_h):
+    from je_auto_control.utils.coordinate_space import CoordinateSpace
+    px, py = CoordinateSpace(physical_w, physical_h, model_w,
+                             model_h).to_physical(x, y)
+    return {"x": px, "y": py}
+
+
+def to_model(x, y, physical_w, physical_h, model_w, model_h):
+    from je_auto_control.utils.coordinate_space import CoordinateSpace
+    mx, my = CoordinateSpace(physical_w, physical_h, model_w,
+                             model_h).to_model(x, y)
+    return {"x": mx, "y": my}
+
+
 def vlm_locate(description: str,
                screen_region: Optional[List[int]] = None,
                model: Optional[str] = None) -> Optional[List[int]]:
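
The ``to_physical`` and ``to_model`` handlers added in this hunk pass their dimensions straight into ``CoordinateSpace``, whose mapping divides by the model or physical size, so a zero dimension would raise ``ZeroDivisionError``. If callers are untrusted, one option is a guard at the handler boundary. The sketch below is a suggestion rather than part of this PR; ``checked_to_physical`` and its error message are hypothetical.

```python
from typing import Dict


def checked_to_physical(x: float, y: float, physical_w: int, physical_h: int,
                        model_w: int, model_h: int) -> Dict[str, int]:
    """Hypothetical guarded variant of the to_physical handler."""
    dims = {"physical_w": physical_w, "physical_h": physical_h,
            "model_w": model_w, "model_h": model_h}
    # Reject zero or negative sizes before any division happens.
    bad = [name for name, value in dims.items() if int(value) <= 0]
    if bad:
        raise ValueError(f"dimensions must be positive: {', '.join(bad)}")
    px = max(0, min(round(x * physical_w / model_w), physical_w - 1))
    py = max(0, min(round(y * physical_h / model_h), physical_h - 1))
    return {"x": px, "y": py}


ok = checked_to_physical(500, 500, 1920, 1080, 1000, 1000)
try:
    checked_to_physical(500, 500, 1920, 1080, 0, 1000)
    error = ""
except ValueError as exc:
    error = str(exc)

print(ok)     # {'x': 960, 'y': 540}
print(error)  # dimensions must be positive: model_w
```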