Merged
9 changes: 9 additions & 0 deletions README.md
@@ -13,6 +13,7 @@

## Table of Contents

- [What's new (2026-06-19) — Data Quality](#whats-new-2026-06-19--data-quality)
- [What's new (2026-06-19) — SBOM & Suite Sharding](#whats-new-2026-06-19--sbom--suite-sharding)
- [What's new (2026-06-19) — Reactive Observer](#whats-new-2026-06-19--reactive-observer)
- [What's new (2026-06-19) — WCAG 2.2 Audit](#whats-new-2026-06-19--wcag-22-audit)
@@ -71,6 +72,14 @@

---

## What's new (2026-06-19) — Data Quality

Three pure-stdlib data-quality helpers that form the gate between ingestion (`load_rows` / OCR) and downstream data entry, wired through the full stack (facade, `AC_*` executor commands, MCP tools, Script Builder). Full reference: [`docs/source/Eng/doc/new_features/v19_features_doc.rst`](docs/source/Eng/doc/new_features/v19_features_doc.rst).

- **Row schema validation** — `validate_rows(rows, schema)` (`AC_validate_rows`, `ac_validate_rows`): declarative per-field rules (type/required/regex/min/max/min_len/max_len/allowed/unique); returns `{ok, valid, invalid, errors}` so bad scraped/OCR data is caught before it corrupts an ERP/form.
- **Field extraction** — `extract_fields(text, fields, patterns)` (`AC_extract_fields`, `ac_extract_fields`): named regex presets (email/url/ipv4/phone/date_iso/amount/hashtag) + custom patterns over free text / OCR blobs.
- **Row masking** — `mask_rows(rows, rules)` (`AC_mask_rows`, `ac_mask_rows`): mask columns before export — `redact` / `hash` (SHA-256) / `partial` (keep last 4); complements the screenshot-only redaction.

## What's new (2026-06-19) — SBOM & Suite Sharding

Two pure-stdlib ops tools (security + scale research angles), full stack. Full reference: [`docs/source/Eng/doc/new_features/v18_features_doc.rst`](docs/source/Eng/doc/new_features/v18_features_doc.rst).
9 changes: 9 additions & 0 deletions README/README_zh-CN.md
@@ -12,6 +12,7 @@

## 目录

- [本次更新 (2026-06-19) — 数据质量](#本次更新-2026-06-19--数据质量)
- [本次更新 (2026-06-19) — SBOM 与测试分片](#本次更新-2026-06-19--sbom-与测试分片)
- [本次更新 (2026-06-19) — 反应式观察器](#本次更新-2026-06-19--反应式观察器)
- [本次更新 (2026-06-19) — WCAG 2.2 审计](#本次更新-2026-06-19--wcag-22-审计)
@@ -70,6 +71,14 @@

---

## 本次更新 (2026-06-19) — 数据质量

三项纯标准库的数据质量辅助工具(介于 `load_rows`/OCR 与下游输入之间的闸),走完整五层。完整参考:[`docs/source/Zh/doc/new_features/v19_features_doc.rst`](../docs/source/Zh/doc/new_features/v19_features_doc.rst)。

- **数据行 schema 验证** — `validate_rows(rows, schema)`(`AC_validate_rows`、`ac_validate_rows`):声明式逐字段规则(type/required/regex/min/max/min_len/max_len/allowed/unique);返回 `{ok, valid, invalid, errors}`,在坏掉的抓取/OCR 数据污染 ERP/表单前拦下。
- **字段提取** — `extract_fields(text, fields, patterns)`(`AC_extract_fields`、`ac_extract_fields`):具名 regex 预设(email/url/ipv4/phone/date_iso/amount/hashtag)+自定义 patterns,作用于自由文本 / OCR 文本块。
- **数据行掩码** — `mask_rows(rows, rules)`(`AC_mask_rows`、`ac_mask_rows`):导出前掩码字段——`redact` / `hash`(SHA-256)/ `partial`(保留末 4 字);补足仅针对截图的脱敏。

## 本次更新 (2026-06-19) — SBOM 与测试分片

来自安全与规模研究角度的两项纯标准库运维工具,走完整五层。完整参考:[`docs/source/Zh/doc/new_features/v18_features_doc.rst`](../docs/source/Zh/doc/new_features/v18_features_doc.rst)。
9 changes: 9 additions & 0 deletions README/README_zh-TW.md
@@ -12,6 +12,7 @@

## 目錄

- [本次更新 (2026-06-19) — 資料品質](#本次更新-2026-06-19--資料品質)
- [本次更新 (2026-06-19) — SBOM 與測試分片](#本次更新-2026-06-19--sbom-與測試分片)
- [本次更新 (2026-06-19) — 反應式觀察器](#本次更新-2026-06-19--反應式觀察器)
- [本次更新 (2026-06-19) — WCAG 2.2 稽核](#本次更新-2026-06-19--wcag-22-稽核)
@@ -70,6 +71,14 @@

---

## 本次更新 (2026-06-19) — 資料品質

三項純標準庫的資料品質輔助工具(介於 `load_rows`/OCR 與下游輸入之間的閘),走完整五層。完整參考:[`docs/source/Zh/doc/new_features/v19_features_doc.rst`](../docs/source/Zh/doc/new_features/v19_features_doc.rst)。

- **資料列 schema 驗證** — `validate_rows(rows, schema)`(`AC_validate_rows`、`ac_validate_rows`):宣告式逐欄規則(type/required/regex/min/max/min_len/max_len/allowed/unique);回傳 `{ok, valid, invalid, errors}`,在壞掉的抓取/OCR 資料汙染 ERP/表單前攔下。
- **欄位擷取** — `extract_fields(text, fields, patterns)`(`AC_extract_fields`、`ac_extract_fields`):具名 regex 預設(email/url/ipv4/phone/date_iso/amount/hashtag)+自訂 patterns,作用於自由文字 / OCR 文字塊。
- **資料列遮罩** — `mask_rows(rows, rules)`(`AC_mask_rows`、`ac_mask_rows`):匯出前遮罩欄位——`redact` / `hash`(SHA-256)/ `partial`(保留末 4 字);補足僅針對截圖的遮罩。

## 本次更新 (2026-06-19) — SBOM 與測試分片

來自安全與規模研究角度的兩項純標準庫維運工具,走完整五層。完整參考:[`docs/source/Zh/doc/new_features/v18_features_doc.rst`](../docs/source/Zh/doc/new_features/v18_features_doc.rst)。
69 changes: 69 additions & 0 deletions docs/source/Eng/doc/new_features/v19_features_doc.rst
@@ -0,0 +1,69 @@
==================================================
New Features (2026-06-19) — Data Quality
==================================================

Three pure-standard-library data-quality helpers from the data/validation
research angle — the quality gate between ingestion (``load_rows`` / OCR)
and downstream entry. Wired through the full stack (facade, ``AC_*``
executor commands, MCP tools, Script Builder).

.. contents::
   :local:
   :depth: 2


Row schema validation
=====================

Validate scraped or loaded rows against a declarative schema before they
reach an ERP or form, so that bad data is caught here instead of corrupting
downstream systems::

    from je_auto_control import validate_rows

    report = validate_rows(rows, {
        "name":  {"type": "str", "required": True},
        "age":   {"type": "int", "min": 0, "max": 130},
        "email": {"regex": r".+@.+\..+"},
        "id":    {"unique": True},
        "tier":  {"allowed": ["gold", "silver"]},
    })
    report["ok"]        # False if any row failed
    report["valid"]     # rows that passed
    report["errors"]    # [{"row": 1, "field": "age", "error": "above max 130"}]

Rules: ``type`` / ``required`` / ``regex`` / ``min`` / ``max`` /
``min_len`` / ``max_len`` / ``allowed`` / ``unique``. Exposed as
``AC_validate_rows`` / ``ac_validate_rows``.
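
To make the rule semantics concrete, here is a minimal sketch of how such a per-field check could be written with the standard library only. It is not the ``je_auto_control`` implementation: ``sketch_validate_rows``, the ``_SKETCH_TYPES`` mapping, the error wording (other than ``above max 130``, which appears in the example above), and the 0-based ``row`` index are all assumptions made for illustration, and only a subset of the rules is covered.

```python
from typing import Any, Dict, List

# Hypothetical mapping from schema type names to Python types (an assumption).
_SKETCH_TYPES = {"str": str, "int": int, "float": float, "bool": bool}


def sketch_validate_rows(rows: List[Dict[str, Any]],
                         schema: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Check each row against per-field rules and collect the failures."""
    valid, invalid, errors = [], [], []
    seen = {field: set() for field in schema}  # values seen so far, per field
    for index, row in enumerate(rows):
        failures = []
        for field, rules in schema.items():
            value = row.get(field)
            if value is None:
                # Missing values only fail when the field is required.
                if rules.get("required"):
                    failures.append((field, "missing required field"))
                continue
            expected = _SKETCH_TYPES.get(rules.get("type"))
            if expected is not None and not isinstance(value, expected):
                failures.append((field, f"expected type {rules['type']}"))
                continue
            if isinstance(value, (int, float)):
                if "min" in rules and value < rules["min"]:
                    failures.append((field, f"below min {rules['min']}"))
                if "max" in rules and value > rules["max"]:
                    failures.append((field, f"above max {rules['max']}"))
            if "allowed" in rules and value not in rules["allowed"]:
                failures.append((field, "value not in allowed set"))
            if rules.get("unique"):
                # Assumes hashable values; later duplicates are flagged.
                if value in seen[field]:
                    failures.append((field, "duplicate value"))
                seen[field].add(value)
        (invalid if failures else valid).append(row)
        errors.extend({"row": index, "field": name, "error": message}
                      for name, message in failures)
    return {"ok": not errors, "valid": valid, "invalid": invalid,
            "errors": errors}


sample_rows = [
    {"name": "Ada", "age": 36, "id": 1},
    {"name": "Bob", "age": 150, "id": 1},   # age too high, duplicate id
    {"age": 20, "id": 2},                   # required name missing
]
report = sketch_validate_rows(sample_rows, {
    "name": {"type": "str", "required": True},
    "age": {"type": "int", "min": 0, "max": 130},
    "id": {"unique": True},
})
```

Collecting every failure per row, rather than stopping at the first, lets a caller report all problems in one pass; whether the real helper does the same is not stated here.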


Field extraction
================

Pull structured values out of free text / OCR blobs with named regex
presets (plus your own ``patterns``)::

    from je_auto_control import extract_fields

    out = extract_fields("Mail ada@x.io on 2026-06-19",
                         fields=["email", "date_iso"])
    # {"email": ["ada@x.io"], "date_iso": ["2026-06-19"]}

Presets: ``email`` / ``url`` / ``ipv4`` / ``phone`` / ``date_iso`` /
``amount`` / ``hashtag``. Exposed as ``AC_extract_fields`` /
``ac_extract_fields``.
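
The preset-plus-custom-pattern idea can be sketched with ``re`` alone. The regular expressions below are illustrative guesses, not the presets shipped with the library, and ``sketch_extract_fields`` and ``SKETCH_PRESETS`` are hypothetical names. The sketch also assumes that custom ``patterns`` override presets of the same name and that unknown field names are silently skipped.

```python
import re
from typing import Dict, Iterable, List, Optional

# Illustrative patterns only; the library's real presets may differ.
SKETCH_PRESETS: Dict[str, str] = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "date_iso": r"\b\d{4}-\d{2}-\d{2}\b",
    "hashtag": r"#\w+",
}


def sketch_extract_fields(text: str,
                          fields: Optional[Iterable[str]] = None,
                          patterns: Optional[Dict[str, str]] = None
                          ) -> Dict[str, List[str]]:
    """Return every match of each requested named pattern found in text."""
    lookup = dict(SKETCH_PRESETS)
    lookup.update(patterns or {})          # custom patterns win on a clash
    names = list(fields) if fields is not None else list(lookup)
    result: Dict[str, List[str]] = {}
    for name in names:
        if name not in lookup:
            continue                       # unknown names are skipped
        # group(0) keeps the whole match even if a pattern has capture groups.
        result[name] = [m.group(0) for m in re.finditer(lookup[name], text)]
    return result


found = sketch_extract_fields("Mail ada@x.io on 2026-06-19",
                              fields=["email", "date_iso"])
custom = sketch_extract_fields("ORD-42 shipped, ORD-7 pending",
                               fields=["order"],
                               patterns={"order": r"ORD-\d+"})
```

Using ``finditer`` with ``group(0)`` instead of ``findall`` avoids the case where a custom pattern containing capture groups would return only the groups.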


Row masking
===========

Mask sensitive columns before exporting rows / reports (the existing
redaction is screenshot-only)::

    from je_auto_control import mask_rows

    safe = mask_rows(rows, {"ssn": "partial", "token": "redact",
                            "name": "hash"})
    # ssn -> "*****6789", token -> "***", name -> sha256 hex

Modes: ``redact`` (``***``), ``hash`` (SHA-256 hex), ``partial`` (keep the
last 4 chars). Exposed as ``AC_mask_rows`` / ``ac_mask_rows``.
1 change: 1 addition & 0 deletions docs/source/Eng/eng_index.rst
@@ -41,6 +41,7 @@ Comprehensive guides for all AutoControl features.
doc/new_features/v16_features_doc
doc/new_features/v17_features_doc
doc/new_features/v18_features_doc
doc/new_features/v19_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
67 changes: 67 additions & 0 deletions docs/source/Zh/doc/new_features/v19_features_doc.rst
@@ -0,0 +1,67 @@
========================================
新功能 (2026-06-19) — 資料品質
========================================

來自資料/驗證研究角度的三項純標準庫資料品質輔助工具——介於資料匯入
(``load_rows`` / OCR)與下游輸入之間的品質閘。走完整五層(facade、
``AC_*`` 執行器指令、MCP 工具、Script Builder)。

.. contents::
   :local:
   :depth: 2


資料列 schema 驗證
==================

在抓取/載入的資料列進入 ERP 或表單前,依宣告式 schema 驗證——在此攔下的
壞資料就不會汙染下游::

    from je_auto_control import validate_rows

    report = validate_rows(rows, {
        "name":  {"type": "str", "required": True},
        "age":   {"type": "int", "min": 0, "max": 130},
        "email": {"regex": r".+@.+\..+"},
        "id":    {"unique": True},
        "tier":  {"allowed": ["gold", "silver"]},
    })
    report["ok"]        # 任何列失敗則為 False
    report["valid"]     # 通過的資料列
    report["errors"]    # [{"row": 1, "field": "age", "error": "above max 130"}]

規則:``type`` / ``required`` / ``regex`` / ``min`` / ``max`` /
``min_len`` / ``max_len`` / ``allowed`` / ``unique``。對應
``AC_validate_rows`` / ``ac_validate_rows``。


欄位擷取
========

用具名的 regex 預設(也可加自訂 ``patterns``)從自由文字 / OCR 文字塊中
擷取結構化值::

    from je_auto_control import extract_fields

    out = extract_fields("Mail ada@x.io on 2026-06-19",
                         fields=["email", "date_iso"])
    # {"email": ["ada@x.io"], "date_iso": ["2026-06-19"]}

預設:``email`` / ``url`` / ``ipv4`` / ``phone`` / ``date_iso`` /
``amount`` / ``hashtag``。對應 ``AC_extract_fields`` /
``ac_extract_fields``。


資料列遮罩
==========

在匯出資料列 / 報告前遮罩敏感欄位(既有的遮罩僅針對截圖)::

    from je_auto_control import mask_rows

    safe = mask_rows(rows, {"ssn": "partial", "token": "redact",
                            "name": "hash"})
    # ssn -> "*****6789"、token -> "***"、name -> sha256 hex

模式:``redact``(``***``)、``hash``(SHA-256 hex)、``partial``(保留末 4
字)。對應 ``AC_mask_rows`` / ``ac_mask_rows``。
1 change: 1 addition & 0 deletions docs/source/Zh/zh_index.rst
@@ -41,6 +41,7 @@ AutoControl 所有功能的完整使用指南。
doc/new_features/v16_features_doc
doc/new_features/v17_features_doc
doc/new_features/v18_features_doc
doc/new_features/v19_features_doc
doc/ocr_backends/ocr_backends_doc
doc/observability/observability_doc
doc/operations_layer/operations_layer_doc
5 changes: 5 additions & 0 deletions je_auto_control/__init__.py
@@ -153,6 +153,10 @@
from je_auto_control.utils.sbom import build_sbom, write_sbom
# Duration-aware suite sharding + shard-result merge
from je_auto_control.utils.test_shard import merge_results, shard_flows
# Data-quality: row schema validation, field extraction, masking
from je_auto_control.utils.data_quality import (
    extract_fields, mask_rows, validate_rows,
)
# Background popup/interrupt watchdog (unattended automation)
from je_auto_control.utils.watchdog import (
    PopupWatchdog, WatchdogRule, default_popup_watchdog,
@@ -572,6 +576,7 @@ def start_autocontrol_gui(*args, **kwargs):
    "image_predicate", "pixel_predicate", "text_predicate",
    "build_sbom", "write_sbom",
    "merge_results", "shard_flows",
    "extract_fields", "mask_rows", "validate_rows",
    # MCP server
    "AuditLogger", "HttpMCPServer", "MCPContent", "MCPPrompt",
    "MCPPromptArgument", "MCPResource", "MCPServer", "MCPTool",
19 changes: 19 additions & 0 deletions je_auto_control/gui/script_builder/command_schema.py
@@ -658,6 +658,7 @@ def _add_misc_specs(specs: List[CommandSpec]) -> None:
    _add_agent_specs(specs)
    _add_office_specs(specs)
    _add_memory_specs(specs)
    _add_data_quality_specs(specs)
    specs.append(CommandSpec(
        "AC_wcag_audit", "Accessibility", "WCAG 2.2 Conformance Audit",
        fields=(
@@ -738,6 +739,24 @@ def _add_observer_specs(specs: List[CommandSpec]) -> None:
        description="Stop the background observer thread."))


def _add_data_quality_specs(specs: List[CommandSpec]) -> None:
    specs.append(CommandSpec(
        "AC_validate_rows", "Data", "Validate Rows (schema)",
        description="Validate 'rows' against a 'schema' (both via JSON view).",
    ))
    specs.append(CommandSpec(
        "AC_extract_fields", "Data", "Extract Fields (regex)",
        fields=(FieldSpec("text", FieldType.STRING),),
        description="Pull email/url/phone/amount/... from text; 'fields' / "
                    "'patterns' via JSON view.",
    ))
    specs.append(CommandSpec(
        "AC_mask_rows", "Data", "Mask Rows",
        description="Mask columns in 'rows' per 'rules' (redact/hash/partial),"
                    " via JSON view.",
    ))


def _add_memory_specs(specs: List[CommandSpec]) -> None:
    db = FieldSpec("db", FieldType.FILE_PATH)
    specs.append(CommandSpec(
6 changes: 6 additions & 0 deletions je_auto_control/utils/data_quality/__init__.py
@@ -0,0 +1,6 @@
"""Data-quality helpers: row schema validation, field extraction, masking."""
from je_auto_control.utils.data_quality.data_quality import (
    extract_fields, mask_rows, validate_rows,
)

__all__ = ["extract_fields", "mask_rows", "validate_rows"]