Word cell sometimes contains multiple words #186

@jaleigh

Words on the same line are being combined into a single word box by pdf_sanitator<PAGE_CELLS>::create_word_cells.

The gap between the end of one word and the start of the next is < space_width_factor_for_merge, so the two are merged and the word box contains multiple words.

Should a space character always trigger a word break, regardless of the distance between the character boxes? That is, rather than removing all space characters (pdf_sanitator<PAGE_CELLS>::create_word_cells, https://github.com/docling-project/docling-parse/blob/main/src/v2/pdf_sanitators/cells.h line 136), should they not be used as markers that prevent merging?

    // remove all spaces
    auto itr = word_cells.begin();
    while(itr!=word_cells.end())
      {
        if(utils::string::is_space(itr->text))
          {
            itr = word_cells.erase(itr);
          }
        else
          {
            itr++;
          }
      }
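
A minimal sketch of the alternative, to make the proposal concrete. It is not the docling-parse implementation: `Cell`, `is_space`, `merge_into_words` and `max_gap` are hypothetical, simplified stand-ins, and the cells are assumed to be on one line and already sorted left to right. The space cells are still dropped from the output, but each one forces the next character to start a new word, however small the gap is.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, simplified character cell (not the docling-parse type).
struct Cell {
  double x0, x1;     // left and right edge of the box on the line
  std::string text;  // text carried by the cell
};

// Hypothetical helper: true if the text is non-empty and all whitespace.
bool is_space(const std::string& s)
{
  return !s.empty() &&
         std::all_of(s.begin(), s.end(),
                     [](unsigned char ch) { return std::isspace(ch) != 0; });
}

// Merge neighbouring cells into words when the gap is below max_gap,
// but never merge across a space cell.
std::vector<Cell> merge_into_words(const std::vector<Cell>& chars, double max_gap)
{
  std::vector<Cell> words;
  bool barrier = true;  // true: the next non-space cell must start a new word

  for (const Cell& c : chars) {
    if (is_space(c.text)) {
      barrier = true;   // drop the space itself, but remember it as a break
      continue;
    }
    if (!barrier && c.x0 - words.back().x1 < max_gap) {
      words.back().x1 = c.x1;       // extend the current word box
      words.back().text += c.text;
    } else {
      words.push_back(c);           // start a new word box
    }
    barrier = false;
  }
  return words;
}
```

With cells "a", "b", " ", "c" and a narrow space, a gap-only merge would join all three letters, while this version returns "ab" and "c".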

I'm currently having to post-process the word boxes: I check whether each one overlaps a ' ' box and, if it does, break the word box up.
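
A rough sketch of that kind of post-processing, for illustration only. `Box` and `split_at_spaces` are hypothetical names, only the horizontal extent is considered (the word and the space boxes are assumed to be on the same line), and only the geometry is split: dividing the word's text as well would need the per-character positions.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical horizontal extent of a box on a single line.
struct Box {
  double x0, x1;
};

// Split a word box into the parts that are not covered by any space box.
std::vector<Box> split_at_spaces(const Box& word, std::vector<Box> spaces)
{
  // Process the space boxes from left to right.
  std::sort(spaces.begin(), spaces.end(),
            [](const Box& a, const Box& b) { return a.x0 < b.x0; });

  std::vector<Box> parts;
  double start = word.x0;  // left edge of the part currently being built

  for (const Box& s : spaces) {
    if (s.x1 <= start || s.x0 >= word.x1) {
      continue;            // no overlap with what is left of the word
    }
    if (s.x0 > start) {
      parts.push_back({start, s.x0});  // the piece before this space
    }
    start = std::max(start, s.x1);     // resume after the space
  }
  if (start < word.x1) {
    parts.push_back({start, word.x1}); // the trailing piece
  }
  return parts;
}
```

A word box that overlaps no space box comes back unchanged as a single part.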
