
openkb remove leaves orphan hash and reformats unrelated wiki #58

@SeungwookHan

A single openkb remove <doc> run surfaces two independent bugs at once. Reporting them together because
they share a single repro, but they have different root causes and need separate fixes.

Follow-up to #41 — both are regressions in the implementation shipped by PR #51.

Repro

  1. KB containing at least one document ingested before PR #51 ("feat(cli): add openkb remove to safely
    delete a document", closes #41), so its hashes.json entry has only {name, type} and no doc_name key,
    plus a handful of LLM-generated concept pages with pre-existing dangling wikilinks.
  2. openkb remove <that-doc> (e.g. openkb remove ollama).
  3. Observed: removal "succeeds" but git status / hashes.json show the symptoms below.

Bug 1 — hash entry is not removed for docs ingested before PR #51

cat .openkb/hashes.json still contains the removed doc's entry after openkb remove reports success.
Re-running openkb add <same-file> is then incorrectly treated as a duplicate via the SHA dedup.

Root cause

Commit c504e26 (within this same PR) fixed add_single_file so newly-ingested docs persist doc_name
into the registry. However, entries that already existed in hashes.json before that commit were not
backfilled; they still carry only {name, type}.

HashRegistry.remove_by_doc_name (openkb/state.py:44-51) matches with meta.get("doc_name") == doc_name. For un-backfilled legacy entries the comparison evaluates to None == "<slug>" → always
False. The method silently returns False; nothing in the call chain checks the return value.

Meanwhile cli.py:670 (doc_name = meta.get("doc_name") or Path(name).stem) does fall back to the
filename stem to drive every other step, so summary/source/concept/index removal succeeds and the failure
is invisible at the surface.
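The mismatch can be reproduced in isolation. The sketch below is not the openkb code: the registry contents, the hash keys, and the type values are made up, and remove_by_doc_name here only mimics the matching rule described above.

```python
from pathlib import Path

# Hypothetical registry contents; hash keys and "type" values are invented.
hashes = {
    "aaa111": {"name": "ollama.md", "type": "md"},                    # legacy entry, no doc_name
    "bbb222": {"name": "vllm.md", "type": "md", "doc_name": "vllm"},  # entry written after c504e26
}


def remove_by_doc_name(registry: dict, doc_name: str) -> bool:
    """Mimic the described rule: match on the doc_name key only."""
    for file_hash, meta in list(registry.items()):
        if meta.get("doc_name") == doc_name:  # None == "ollama" for legacy entries
            del registry[file_hash]
            return True
    return False


# The CLI-side fallback resolves the slug from the filename stem...
meta = hashes["aaa111"]
doc_name = meta.get("doc_name") or Path(meta["name"]).stem

# ...but the registry lookup never sees that fallback, so nothing is removed.
removed = remove_by_doc_name(hashes, doc_name)
print(doc_name, removed, "aaa111" in hashes)
```

The slug resolves correctly on the CLI side, yet the legacy entry stays in the registry because the two code paths disagree on how a doc is identified.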

Suggested fix

Either of the following — both are robust against un-backfilled legacy data:

  • Add a fallback in remove_by_doc_name that also matches when Path(meta["name"]).stem == doc_name, OR
  • Introduce remove_by_hash(file_hash) and call it from cli.py:842, since the CLI already has the
    matched hash in hand. This is the preferred option: it eliminates the slug round-trip and works
    regardless of whether doc_name is present.

A one-shot migration that backfills doc_name on the next openkb invocation would also clean this up,
but the call-site fix above is sufficient and avoids touching user data on read paths.
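Both options could look roughly like the sketch below. HashRegistrySketch is a hypothetical stand-in, not the real HashRegistry in openkb/state.py; its in-memory dict and constructor are assumptions, and persistence to hashes.json is omitted.

```python
from pathlib import Path


class HashRegistrySketch:
    """Hypothetical stand-in for HashRegistry; keeps entries in memory only."""

    def __init__(self, entries: dict):
        self._entries = dict(entries)

    def remove_by_hash(self, file_hash: str) -> bool:
        """Option 2: delete by the hash the CLI already matched."""
        return self._entries.pop(file_hash, None) is not None

    def remove_by_doc_name(self, doc_name: str) -> bool:
        """Option 1: match doc_name, falling back to the filename stem."""
        for file_hash, meta in list(self._entries.items()):
            stem = Path(meta.get("name", "")).stem
            if meta.get("doc_name") == doc_name or stem == doc_name:
                del self._entries[file_hash]
                return True
        return False


# Made-up legacy entry without doc_name.
registry = HashRegistrySketch({"aaa111": {"name": "ollama.md", "type": "md"}})
if not registry.remove_by_hash("aaa111"):
    # Checking the return value surfaces the failure instead of hiding it.
    raise RuntimeError("hash entry not found; registry left unchanged")
```

Whichever option is chosen, having the caller check the boolean result would have made the original failure visible.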


Bug 2 — unrelated wiki pages get reformatted

Removing a single doc produces a sprawling diff. In my repro, removing one ollama.md produced a
39-file / 1254-line diff; 27 of those were concept pages that didn't list ollama as a source.

Example (from a concept page unrelated to ollama):

- **Knowledge access**: agents need curated context such as [[LLM Wiki]]
+ **Knowledge access**: agents need curated context such as LLM Wiki

Root cause

cli.py:815 calls fix_broken_links(wiki_dir) over the entire wiki on every remove.
openkb/lint.py:fix_broken_links strips every dangling wikilink in the KB, not only the ones created by
this removal. Pre-existing ghost links (LLM-generated, hand-edited, links to not-yet-added concepts, etc.)
get swept up too.

Impact

  • Removal commits are unreadable — actual deletion effects are buried under unrelated reformat noise.
  • Users lose [[wikilinks]] they may want to keep (e.g. links to a concept they plan to add later).
  • Violates least-surprise: the command name says "remove one doc," but the diff shows wiki-wide
    refactoring.

Suggested fix (preferred)

Limit ghost-link stripping to files actually touched by this removal: concept_result["modified"]
plus index.md. This preserves the original PR #49 intent (clean up dangling links the removal just
created) without sweeping the rest of the KB.
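A scoped pass might look like the following sketch. strip_ghost_links_in is a hypothetical helper, not the existing fix_broken_links; it assumes pages are resolved by filename stem and that links use the [[Target]] or [[Target|alias]] form, neither of which is confirmed here.

```python
import re
from pathlib import Path
from typing import Iterable

# Assumed wikilink syntax: [[Target]] or [[Target|alias]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def strip_ghost_links_in(files: Iterable[Path], wiki_dir: Path) -> list[Path]:
    """Unlink dangling wikilinks, but only inside the given files."""
    # Assumption: a link target resolves if some page has that filename stem.
    existing = {p.stem for p in wiki_dir.rglob("*.md")}

    def unlink(match: re.Match) -> str:
        target, alias = match.group(1).strip(), match.group(2)
        if target in existing:
            return match.group(0)  # link still resolves, keep it
        return alias or target     # dangling, keep only the visible text

    changed = []
    for path in files:
        text = path.read_text(encoding="utf-8")
        new_text = WIKILINK.sub(unlink, text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            changed.append(path)
    return changed
```

The caller would pass only the paths in concept_result["modified"] plus index.md, so every other page in the wiki is never opened for writing.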

Alternatives

  • Snapshot the global ghost set before/after the removal and strip only the newly-introduced ghosts.
  • Make the global pass opt-in via a flag (e.g. --lint), default off.
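The snapshot alternative reduces to a set difference. The sketch below works on an in-memory mapping of page name to text rather than the real wiki directory; ghost_targets and new_ghosts are hypothetical names, and the [[Target|alias]] syntax is an assumption.

```python
import re

# Assumed wikilink syntax: [[Target]] or [[Target|alias]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")


def ghost_targets(pages: dict[str, str]) -> set[str]:
    """Return link targets that have no matching page in `pages`."""
    ghosts = set()
    for text in pages.values():
        for match in WIKILINK.finditer(text):
            target = match.group(1).strip()
            if target not in pages:
                ghosts.add(target)
    return ghosts


def new_ghosts(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Ghosts that exist after the removal but did not exist before it."""
    return ghost_targets(after) - ghost_targets(before)


# Made-up wiki: "LLM Wiki" was already dangling before the removal.
before = {"agents": "needs [[LLM Wiki]] and [[ollama]]", "ollama": "local models"}
after = {"agents": "needs [[LLM Wiki]] and [[ollama]]"}
print(new_ghosts(before, after))
```

Only the targets in the returned set would be stripped, so links that were already dangling before the removal are left alone.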
