Assign OCR to Predictions #221

mawelborn · 2025-10-14T23:17:20Z

This PR expands the results and etloutput APIs to make it easier to use OCR information in Auto Review and Custom Output. This is staged for a new v7.2.2 release of the toolkit.

Enhancements

The DocumentExtraction class has gained tokens, tables, and cells attributes that are set by PredictionList.assign_ocr(etl_outputs) using the predictions' spans.

Shorthand convenience properties for singular token, table, and cell access like that of span are also available.

These attributes and properties default to falsey NULL_TOKEN, NULL_TABLE, and NULL_CELL constants rather than None. This reduces the number of conditional checks needed to make code type-safe, and simplifies the use of .where() and .groupby().
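
As a rough illustration of why falsy sentinels cut down on conditional checks, here is a minimal, self-contained sketch of the pattern. The `Token` and `Extraction` classes below are simplified stand-ins written for this example, not the toolkit's actual classes, and the truthiness rule (identity with the sentinel) is an assumption.

```python
from dataclasses import dataclass, field
from operator import attrgetter


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the toolkit's Token; only `text` is modeled."""

    text: str = ""

    def __bool__(self) -> bool:
        # Only the sentinel itself is falsy; every other token is truthy.
        return self is not NULL_TOKEN


NULL_TOKEN = Token()


@dataclass
class Extraction:
    """Hypothetical stand-in for DocumentExtraction."""

    text: str
    tokens: list[Token] = field(default_factory=list)

    @property
    def token(self) -> Token:
        # First assigned token, or the falsy sentinel when none were assigned.
        return self.tokens[0] if self.tokens else NULL_TOKEN


extractions = [
    Extraction("Total", [Token("Total")]),
    Extraction("Missing"),
]

# Filtering works on truthiness, and attribute access never hits None.
with_ocr = list(filter(attrgetter("token"), extractions))
token_texts = [extraction.token.text for extraction in extractions]
```

Because `extraction.token` always returns a `Token`, callers can chain attribute access or pass `attrgetter("token")` as a filter without an `is None` guard.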

@dataclass
class DocumentExtraction(Extraction):
    tokens: list[Token] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)
    cells: list[Cell] = field(default_factory=list)

    @property
    def token(self) -> Token: ...
    @property
    def table(self) -> Table: ...
    @property
    def cell(self) -> Cell: ...
    @property
    def table_cells(self) -> Iterator[tuple[Table, Cell]]: ...


class PredictionList(List[PredictionType]):
    def assign_ocr(
        self, etl_outputs: Mapping[Document, EtlOutput], tokens: bool = True, tables: bool = True
    ) -> "Self":
        """
        Assign OCR tokens, tables, and/or cells using `etl_outputs`.
        Use `tokens` or `tables` to skip lookup and assignment of those attributes.
        """
        ...

Using this API, OCR is assigned once early in code execution, and then accessed directly from predictions thereafter. When performance is a concern, token and table OCR can be assigned selectively (using the tokens and tables arguments) to a filtered PredictionList. OCR can be reassigned if needed after splitting predictions or manipulating spans.

Token, Table, and Cell lookups are fast. They jump to a specific page of the document, then bisect the tokens or table cells on that page using span information.
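
The page-then-bisect idea can be sketched as follows. This is not the toolkit's implementation; `Span`, `Token`, `PageIndex`, and `token_for` here are hypothetical names for this example, and the sketch assumes tokens on a page do not overlap and can be ordered by their start offset.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    page: int
    start: int
    end: int  # exclusive


@dataclass(frozen=True)
class Token:
    text: str
    span: Span


NULL_TOKEN = Token("", Span(-1, 0, 0))


class PageIndex:
    """Hypothetical per-page index of tokens sorted by start offset."""

    def __init__(self, tokens: list[Token]) -> None:
        self._tokens: dict[int, list[Token]] = {}
        for token in sorted(tokens, key=lambda t: t.span.start):
            self._tokens.setdefault(token.span.page, []).append(token)
        # Parallel lists of start offsets, one per page, for bisecting.
        self._starts = {
            page: [token.span.start for token in page_tokens]
            for page, page_tokens in self._tokens.items()
        }

    def token_for(self, span: Span) -> Token:
        # Jump straight to the span's page, then bisect within that page only.
        page_tokens = self._tokens.get(span.page, [])
        starts = self._starts.get(span.page, [])
        index = bisect_right(starts, span.start) - 1
        if index >= 0 and page_tokens[index].span.end > span.start:
            return page_tokens[index]
        return NULL_TOKEN


index = PageIndex(
    [
        Token("Invoice", Span(0, 0, 7)),
        Token("Total", Span(0, 8, 13)),
        Token("42.00", Span(1, 14, 19)),
    ]
)
found = index.token_for(Span(0, 9, 12))
missing = index.token_for(Span(0, 7, 8))
```

Each lookup costs one dictionary access plus a binary search over a single page's tokens, so the cost does not grow with the number of pages in the document.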

AR/CO Code Comparison

Prior to this PR, predictions and OCR were entirely separate. They were awkward to combine in code, needed many with suppress(...): blocks, and duplicated a lot of boilerplate in each function.

Particularly common and egregious were the one-liners to sort predictions by OCR bounding box order. This is now greatly simplified. (Code examples assume predictions.assign_ocr(etl_outputs) has already been called.)

Before:

predictions.orderby(lambda extraction: etl_outputs[extraction.document].token_for(extraction.span).box)

After:

predictions.orderby(attrgetter("token.box"))
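
For readers less familiar with `operator.attrgetter`, the short sketch below shows the two standard-library behaviors the "After" examples rely on: dotted names walk nested attributes, and multiple names return a tuple. The `SimpleNamespace` objects are placeholders for this example, not toolkit types.

```python
from operator import attrgetter
from types import SimpleNamespace

# Placeholder object shaped like an extraction with an assigned token.
extraction = SimpleNamespace(
    document="invoice.pdf",
    page=0,
    token=SimpleNamespace(box=(10, 20, 110, 40)),
)

# A dotted name is resolved one attribute at a time: extraction.token.box
get_box = attrgetter("token.box")

# Several names produce a tuple: (extraction.document, extraction.page)
get_page_key = attrgetter("document", "page")

box = get_box(extraction)
page_key = get_page_key(extraction)
```

Here `get_box` behaves like `lambda e: e.token.box` and `get_page_key` like `lambda e: (e.document, e.page)`, which is why they can be passed wherever a key function is expected.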

Expanding predictions to their cell's full text is common for dense tabular data. Previously it required many statements to access ETL Output, tokens, tables, and cells. Now it's straightforward.

Before:

def expand_to_cell(
    etl_outputs: Mapping[Document, EtlOutput],
    predictions: PredictionList[Prediction],
) -> None:
    """
    Expand each prediction's text to the cell that contains it.
    """
    for extraction in predictions.document_extractions:
        etl_output = etl_outputs[extraction.document]
        with suppress(TokenNotFoundError, TableCellNotFoundError):
            token = etl_output.token_for(extraction.span)
            table, cell = etl_output.table_cell_for(token)
            extraction.text = cell.text
            extraction.spans = cell.spans

After:

def expand_to_cell(predictions: PredictionList[Prediction]) -> None:
    """
    Expand each prediction's text to the cell that contains it.
    """
    for extraction in predictions.document_extractions.where(attrgetter("cell")):
        extraction.text = extraction.cell.text
        extraction.spans = extraction.cell.spans

Splitting a prediction along cell boundaries was previously very difficult (and required more functions than I'd like to include here). Now it's simple enough for a single function.

After:

def split_along_cell_boundaries(predictions: PredictionList[Prediction]) -> None:
    """
    Split each prediction along cell boundaries.
    """
    for extraction in predictions.document_extractions.where(lambda e: len(e.cells) > 1):
        for table, cell in extraction.table_cells:
            predictions.append(
                replace(
                    extraction.copy(),
                    tables=[table],
                    cells=[cell],
                    spans=cell.spans,
                    text=cell.text,
                )
            )
        extraction.reject()

Previously, some useful operations such as "group by table, then by row" could not be expressed with the PredictionList.groupby() API and had to be written out as separate statements. Only grouping by line was straightforward. The updated API makes both easy to do.

Before:

def group_predictions_by_line(
    etl_outputs: Mapping[Document, EtlOutput],
    predictions: PredictionList[Prediction],
) -> None:
    """
    Group predictions that appear on the same line.
    """
    extractions_by_page = predictions.document_extractions.orderby(
        lambda extraction: etl_outputs[extraction.document].token_for(extraction.span).box,
    ).groupby(
        lambda extraction: (extraction.document, extraction.page),
    )

    for (document, page), extractions in extractions_by_page.items():
        etl_output = etl_outputs[document]
        for extraction in extractions:
            token = etl_output.token_for(extraction.span)

            if token.box.top > current_bottom:
                ...

After:

def group_predictions_by_line(predictions: PredictionList[Prediction]) -> None:
    """
    Group predictions that appear on the same line.
    """
    extractions_by_page = predictions.document_extractions.orderby(
        attrgetter("token.box"),
    ).groupby(
        attrgetter("document", "page"),
    )

    for page_extractions in extractions_by_page.values():
        for extraction in page_extractions:
            if extraction.token.box.top > current_bottom:
                ...

def group_predictions_by_table_row(predictions: PredictionList[Prediction]) -> None:
    """
    Group predictions that appear on the same table row.
    """
    extractions_by_table_row = predictions.document_extractions.where(
        attrgetter("table"),
    ).groupby(
        attrgetter("table", "cell.range.row"),
    )

    for row_extractions in extractions_by_table_row.values():
        row_extractions.apply(lambda extraction: extraction.groups.add(group))
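
To make the compound "table, then row" key concrete, here is a plain-Python sketch of the same grouping without the toolkit. The `make_extraction` helper and the string table identifiers are hypothetical, and a plain `defaultdict` stands in for `PredictionList.groupby()`.

```python
from collections import defaultdict
from operator import attrgetter
from types import SimpleNamespace


def make_extraction(text, table, row):
    """Build a placeholder extraction; `table` is None when no table matched."""
    cell = SimpleNamespace(range=SimpleNamespace(row=row))
    return SimpleNamespace(text=text, table=table, cell=cell)


extractions = [
    make_extraction("Widget", "table-1", 0),
    make_extraction("9.99", "table-1", 0),
    make_extraction("Gadget", "table-1", 1),
    make_extraction("Footer note", None, 0),
]

# One key function yields a (table, row) tuple for each extraction.
row_key = attrgetter("table", "cell.range.row")

rows = defaultdict(list)
# Skip extractions without a table, mirroring `.where(attrgetter("table"))`.
for extraction in filter(attrgetter("table"), extractions):
    rows[row_key(extraction)].append(extraction.text)
```

Grouping on a tuple key only works if each element is hashable, which is presumably why a later commit in this PR adds a custom `__hash__` to speed up grouping by table and cell.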


mawelborn added 16 commits October 9, 2025 21:49

Move Box, Span, and utils to etloutput to avoid circular imports

2ecd797

Ensure null spans raise TokenNotFoundError instead of ValueError

d9a5363

Parse table spans

f586553

Add NULL_CELL, NULL_RANGE, NULL_TABLE, and NULL_TOKEN

c207996

Add DocumentExtraction properties for OCR tokens, tables, and cells

8fd5845

Add set-like overlap syntax for Box and Span

f44b8f1

Rewrite table cell lookup to support multiple cells using a naive span overlap algorithm

95301f5

Return NULL_TOKEN rather than raising an error for failed token lookups

85552ea

Replace custom ResultError with idiomatic ValueError

b9712ef

Remove unused custom ETL Output and Result error classes

a436ebf

Clean up TYPE_CHECKING imports

6ebe77e

Clean up some comments and formatting

dc32773

Add PredictionList.assign_ocr() method

6e1ad81

Rewrite EtlOutput.table_cells_for() using a bisection algorithm

7fdf57c

Optimize table cell lookup by bisecting a single page instead of the whole document

9ba9fea

Bump version and update changelog

fe419ff

mawelborn self-assigned this Oct 14, 2025

mawelborn requested review from Scott771, andrew8bit, annaliu-indico and nickesparza October 15, 2025 14:23

mawelborn added 3 commits October 20, 2025 10:10

Speed up .groupby("table") and .groupby("cell") with custom __hash__

477367e

Add prediction .copy() methods that only copy mutable state

fb797b6

Update changelog

68c407d

Scott771 approved these changes Oct 20, 2025


mawelborn force-pushed the mawelborn/assign-etl-output branch from cd77477 to 68c407d Compare October 20, 2025 21:22

mawelborn merged commit 45d6d28 into main Oct 20, 2025
10 checks passed

mawelborn deleted the mawelborn/assign-etl-output branch October 20, 2025 21:25
