@mawelborn mawelborn commented Oct 14, 2025

This PR expands the results and etloutput APIs to make it easier to use OCR information in Auto Review and Custom Output. This is staged for a new v7.2.2 release of the toolkit.

Enhancements

The DocumentExtraction class has gained tokens, tables, and cells attributes that are set by PredictionList.assign_ocr(etl_outputs) using the predictions' spans.

Shorthand properties for singular token, table, and cell access are also available, mirroring the existing span property.

These attributes and properties default to falsey NULL_TOKEN, NULL_TABLE, and NULL_CELL constants rather than None. This reduces the number of conditional checks needed to make code type-safe, and simplifies the use of .where() and .groupby().
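
To illustrate why falsey null constants cut down on conditionals, here is a minimal, self-contained sketch of the pattern. The `Cell` and `Extraction` classes below are hypothetical stand-ins, not the toolkit's real classes, and the choice of "first cell, else the sentinel" for the singular property is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for the toolkit's Cell; only `text` is modeled."""

    text: str = ""
    is_null: bool = False

    def __bool__(self) -> bool:
        # The null sentinel is falsey; every real cell is truthy.
        return not self.is_null


NULL_CELL = Cell(is_null=True)


@dataclass
class Extraction:
    """Hypothetical stand-in for DocumentExtraction."""

    text: str
    cells: list[Cell] = field(default_factory=list)

    @property
    def cell(self) -> Cell:
        # Assumption: fall back to the sentinel instead of None
        # when no cell has been assigned.
        return self.cells[0] if self.cells else NULL_CELL


extractions = [
    Extraction("Total", cells=[Cell("Total: 42.00")]),
    Extraction("Footer"),  # no OCR cell assigned
]

# Attribute access is safe on every extraction, with no `is not None` checks.
cell_texts = [e.cell.text for e in extractions]

# Truthiness alone filters out extractions without a cell.
with_cells = [e for e in extractions if e.cell]
```

Because the sentinel has the same type as a real value, a type checker sees `Cell` everywhere rather than `Cell | None`, and a truthiness test can serve directly as a filter predicate.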

@dataclass
class DocumentExtraction(Extraction):
    tokens: list[Token] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)
    cells: list[Cell] = field(default_factory=list)

    @property
    def token(self) -> Token: ...
    @property
    def table(self) -> Table: ...
    @property
    def cell(self) -> Cell: ...
    @property
    def table_cells(self) -> Iterator[tuple[Table, Cell]]: ...


class PredictionList(List[PredictionType]):
    def assign_ocr(
        self, etl_outputs: Mapping[Document, EtlOutput], tokens: bool = True, tables: bool = True
    ) -> "Self":
        """
        Assign OCR tokens, tables, and/or cells using `etl_outputs`.
        Use `tokens` or `tables` to skip lookup and assignment of those attributes.
        """
        ...

Using this API, OCR is assigned once early in code execution, and then accessed directly from predictions thereafter. Tokens and Tables OCR can be selectively assigned to a filtered PredictionList when performance is a concern. OCR can be reassigned if needed after splitting predictions or manipulating spans.

Token, Table, and Cell lookups are fast. They jump to a specific page of the document, then bisect the tokens or table cells on that page using span information.
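
The "jump to a page, then bisect" idea can be sketched roughly as follows. This is not the toolkit's implementation: the `Token` fields, the `TokenIndex` class, and the use of character offsets as the sort key are all assumptions chosen to keep the example self-contained.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token covering characters [start, end) on one page."""

    text: str
    page: int
    start: int
    end: int


class TokenIndex:
    """Hypothetical per-page index of tokens sorted by start offset."""

    def __init__(self, tokens: list[Token]) -> None:
        self._pages: dict[int, list[Token]] = {}
        for token in sorted(tokens, key=lambda t: (t.page, t.start)):
            self._pages.setdefault(token.page, []).append(token)
        # Precompute the sorted start offsets once so each lookup is O(log n).
        self._starts = {
            page: [t.start for t in page_tokens]
            for page, page_tokens in self._pages.items()
        }

    def token_at(self, page: int, offset: int) -> Optional[Token]:
        # Step 1: jump straight to the page.
        page_tokens = self._pages.get(page, [])
        starts = self._starts.get(page, [])
        # Step 2: bisect to the last token that starts at or before `offset`.
        index = bisect_right(starts, offset) - 1
        if index >= 0 and offset < page_tokens[index].end:
            return page_tokens[index]
        return None


index = TokenIndex(
    [
        Token("Invoice", page=0, start=0, end=7),
        Token("Total", page=0, start=8, end=13),
        Token("42.00", page=1, start=14, end=19),
    ]
)
found = index.token_at(page=0, offset=9)
missing = index.token_at(page=0, offset=7)  # falls in the gap between tokens
```

Restricting the search to one page keeps each sorted list short, and the bisect avoids scanning every token on that page.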

AR/CO Code Comparison

Prior to this PR, predictions and OCR were entirely separate. They were awkward to combine in code, needed many with suppress(...): blocks, and duplicated a lot of boilerplate in each function.

Particularly common and egregious were the one-liners to sort predictions by OCR bounding box order. This is now greatly simplified. (Code examples assume predictions.assign_ocr(etl_outputs) has already been called.)

Before:

predictions.orderby(lambda extraction: etl_outputs[extraction.document].token_for(extraction.span).box)

After:

predictions.orderby(attrgetter("token.box"))
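
The shorter form relies on `operator.attrgetter` accepting dotted names, so `attrgetter("token.box")` behaves like `lambda e: e.token.box`. The sketch below demonstrates that equivalence with plain `sorted()`. The `Box`, `Token`, and `Extraction` classes are hypothetical stand-ins, and the box ordering (page, then top, then left) is an assumption for the example only.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True, order=True)
class Box:
    """Hypothetical bounding box that sorts by page, then top, then left."""

    page: int
    top: int
    left: int


@dataclass(frozen=True)
class Token:
    text: str
    box: Box


@dataclass(frozen=True)
class Extraction:
    text: str
    token: Token


extractions = [
    Extraction("b", Token("b", Box(page=0, top=10, left=50))),
    Extraction("c", Token("c", Box(page=1, top=0, left=0))),
    Extraction("a", Token("a", Box(page=0, top=10, left=5))),
]

# A dotted attrgetter walks the attribute chain exactly like the lambda does.
by_lambda = sorted(extractions, key=lambda e: e.token.box)
by_getter = sorted(extractions, key=attrgetter("token.box"))

ordered_texts = [e.text for e in by_getter]
```

Note that the "After" snippets need `from operator import attrgetter` at the top of the module.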

Expanding predictions to their cell's full text is common for dense tabular data. Previously it required many statements to access ETL Output, tokens, tables, and cells. Now it's straightforward.

Before:

def expand_to_cell(
    etl_outputs: Mapping[Document, EtlOutput],
    predictions: PredictionList[Prediction],
) -> None:
    """
    Expand each prediction's text to the cell that contains it.
    """
    for extraction in predictions.document_extractions:
        etl_output = etl_outputs[extraction.document]
        with suppress(TokenNotFoundError, TableCellNotFoundError):
            token = etl_output.token_for(extraction.span)
            table, cell = etl_output.table_cell_for(token)
            extraction.text = cell.text
            extraction.spans = cell.spans

After:

def expand_to_cell(predictions: PredictionList[Prediction]) -> None:
    """
    Expand each prediction's text to the cell that contains it.
    """
    for extraction in predictions.document_extractions.where(attrgetter("cell")):
        extraction.text = extraction.cell.text
        extraction.spans = extraction.cell.spans

Splitting a prediction along cell boundaries was previously very difficult (and required more functions than I'd like to include here). Now it's simple enough for a single function.

After:

def split_along_cell_boundaries(predictions: PredictionList[Prediction]) -> None:
    """
    Split each prediction along cell boundaries.
    """
    for extraction in predictions.document_extractions.where(lambda e: len(e.cells) > 1):
        for table, cell in extraction.table_cells:
            predictions.append(
                replace(
                    extraction.copy(),
                    tables=[table],
                    cells=[cell],
                    spans=cell.spans,
                    text=cell.text,
                )
            )
        extraction.reject()

Previously, some useful operations like "group by table, then by row" could not be expressed using the PredictionList.groupby() API and had to be written out as separate statements. Only grouping by line was straightforward. The updated API makes both possible and easy to do.

Before:

def group_predictions_by_line(
    etl_outputs: Mapping[Document, EtlOutput],
    predictions: PredictionList[Prediction],
) -> None:
    """
    Group predictions that appear on the same line.
    """
    extractions_by_page = predictions.document_extractions.orderby(
        lambda extraction: etl_outputs[extraction.document].token_for(extraction.span).box,
    ).groupby(
        lambda extraction: (extraction.document, extraction.page),
    )

    for (document, page), extractions in extractions_by_page.items():
        etl_output = etl_outputs[document]
        for extraction in extractions:
            token = etl_output.token_for(extraction.span)

            if token.box.top > current_bottom:
                ...

After:

def group_predictions_by_line(predictions: PredictionList[Prediction]) -> None:
    """
    Group predictions that appear on the same line.
    """
    extractions_by_page = predictions.document_extractions.orderby(
        attrgetter("token.box"),
    ).groupby(
        attrgetter("document", "page"),
    )

    for page_extractions in extractions_by_page.values():
        for extraction in page_extractions:
            if extraction.token.box.top > current_bottom:
                ...

def group_predictions_by_table_row(predictions: PredictionList[Prediction]) -> None:
    """
    Group predictions that appear on the same table row.
    """
    extractions_by_table_row = predictions.document_extractions.where(
        attrgetter("table"),
    ).groupby(
        attrgetter("table", "cell.range.row"),
    )

    for row_extractions in extractions_by_table_row.values():
        row_extractions.apply(lambda extraction: extraction.groups.add(group))
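
When `attrgetter` is given several names, it returns a tuple of the looked-up values, which is what makes `attrgetter("table", "cell.range.row")` usable as a compound grouping key. The sketch below shows this with a plain dictionary. All classes are hypothetical stand-ins, and a string is used in place of a table object purely so the key is hashable in this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class Range:
    row: int
    column: int


@dataclass(frozen=True)
class Cell:
    text: str
    range: Range


@dataclass(frozen=True)
class Extraction:
    text: str
    table: str  # stand-in for a table object; grouping keys must be hashable
    cell: Cell


# With multiple names, attrgetter returns a tuple: (table, row).
row_key = attrgetter("table", "cell.range.row")

extractions = [
    Extraction("Item", "t1", Cell("Item", Range(row=0, column=0))),
    Extraction("Qty", "t1", Cell("Qty", Range(row=0, column=1))),
    Extraction("Widget", "t1", Cell("Widget", Range(row=1, column=0))),
    Extraction("Total", "t2", Cell("Total", Range(row=0, column=0))),
]

# Group extraction texts by (table, row), preserving input order within a row.
rows: dict[tuple[str, int], list[str]] = defaultdict(list)
for extraction in extractions:
    rows[row_key(extraction)].append(extraction.text)
```

Rows with the same index in different tables stay separate because the table is part of the key.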

@mawelborn mawelborn force-pushed the mawelborn/assign-etl-output branch from cd77477 to 68c407d Compare October 20, 2025 21:22
@mawelborn mawelborn merged commit 45d6d28 into main Oct 20, 2025
10 checks passed
@mawelborn mawelborn deleted the mawelborn/assign-etl-output branch October 20, 2025 21:25