@janhoy janhoy commented Dec 23, 2025

This is a backport of #3784 to branch_9x, targeting Solr 9.11. We normally don't remove features in a minor version, but it was decided to make an exception because Tika 1.x is EOL; see https://issues.apache.org/jira/browse/SOLR-18037 for details.

Since we still have XLSXResponseWriter in 9.x, using Apache POI, we cannot get rid of as many dependencies as in 10.0.

Copilot AI left a comment

Pull request overview

This PR backports the removal of the deprecated "local" Tika extraction backend from Solr 10.0 to Solr 9.11 (branch_9x). The change addresses security concerns by eliminating the vulnerable Tika 1.x library, requiring users to use an external Tika Server instead.

Key changes include:

  • Upgrading Tika from 1.28.5 to 3.2.3 and Apache POI from 5.2.2 to 5.5.1
  • Removing the LocalTikaExtractionBackend and related code
  • Removing numerous Tika 1.x parser dependencies (PDFBox, NetCDF, SIS, etc.)
  • Making tikaserver.url a required configuration parameter
  • Adding clear error messages for removed configuration options
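As a rough illustration of the last two bullets, the validation could look something like the sketch below. This is not the code from the PR: the class `ExtractionConfigCheck`, the method `validate`, the plain `Map` of settings, and the error wording are all hypothetical. Only the parameter name `tikaserver.url` and the backend names `local` and `tikaserver` come from this conversation; `extraction.backend` is an assumed name for the backend parameter.

```java
import java.util.Map;

public class ExtractionConfigCheck {
  // Assumed name for the backend parameter; not confirmed by this conversation.
  static final String BACKEND_PARAM = "extraction.backend";
  // Parameter name taken from the PR overview.
  static final String URL_PARAM = "tikaserver.url";

  /** Returns the Tika Server URL, or throws with a message explaining how to fix the config. */
  static String validate(Map<String, String> config) {
    // The backend parameter is optional; default to the only remaining backend.
    String backend = config.getOrDefault(BACKEND_PARAM, "tikaserver");
    if ("local".equals(backend)) {
      throw new IllegalArgumentException(
          "The 'local' extraction backend has been removed; run an external Tika Server and set '"
              + URL_PARAM + "'.");
    }
    if (!"tikaserver".equals(backend)) {
      throw new IllegalArgumentException("Unknown extraction backend: " + backend);
    }
    String url = config.get(URL_PARAM);
    if (url == null || url.isBlank()) {
      throw new IllegalArgumentException("Missing required parameter '" + URL_PARAM + "'.");
    }
    return url;
  }

  public static void main(String[] args) {
    System.out.println(validate(Map.of(URL_PARAM, "http://localhost:9998")));
  }
}
```

Failing at configuration time with a message that names the removed option is meant to show an upgrading user what to change, rather than leaving them with an unexplained extraction failure later.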

Reviewed changes

Copilot reviewed 198 out of 206 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| versions.props, versions.lock | Updated Tika to 3.2.3, POI to 5.5.1, removed ~60 Tika 1.x parser dependencies |
| ExtractingRequestHandler.java | Removed local backend support, added validation requiring tikaserver.url, provides clear error messages for deprecated configs |
| TikaServerExtractionBackend.java | Minor error message improvement, TODO comment about removing Tika dependency |
| ParseContextConfig.java, LocalTikaExtractionBackend.java | Completely removed (local backend implementation files) |
| indexing-with-tika.adoc | Updated documentation to reflect tikaserver-only approach, Docker setup instructions |
| test files | Removed local backend tests, updated integration tests to use Docker-based Tika Server |
| sample configs | Updated techproducts config to point to localhost:9998 Tika Server |
| license files | Removed ~100 license files for removed Tika 1.x parser dependencies |
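For anyone who wants to check that a Tika Server is reachable at the address used in the sample config, a minimal Java sketch follows. It is an assumption-laden illustration, not the way Solr's backend is implemented: the class `TikaServerClientSketch` and the method `buildRequest` are hypothetical, and it simply sends a document with `PUT` to Tika Server's `/tika` endpoint and asks for plain text back.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerClientSketch {
  /** Builds a PUT request asking the /tika endpoint to return the document as plain text. */
  static HttpRequest buildRequest(String baseUrl, byte[] document) {
    return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
        .header("Accept", "text/plain")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
        .build();
  }

  public static void main(String[] args) throws Exception {
    // args[0] is the path of a local file to extract; the base URL matches the sample config.
    byte[] document = Files.readAllBytes(Path.of(args[0]));
    HttpRequest request = buildRequest("http://localhost:9998", document);
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```

Running it with a file path as the only argument prints the HTTP status and the extracted text, which is a quick way to confirm the server is up before pointing Solr at it.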

Comment on lines 255 to 256
Specifies the backend to use for extraction. As of Solr 10, only `tikaserver` is supported.
This parameter is optional since `tikaserver` is the only available backend.

Copilot AI Dec 23, 2025

The documentation incorrectly states "As of Solr 10" when this change is being backported to Solr 9.11 (branch_9x). This should be updated to "As of Solr 9.11" to accurately reflect when this feature was removed.

@epugh epugh left a comment

LGTM. Thank you for taking on what feels like an important yet somewhat thankless task, @janhoy.

I appreciate the documentation edits you made.

epugh commented Dec 23, 2025

@janhoy I approved it, but obviously there are test fixes needed... The thrust of the PR looks great. Let me know if you need additional manual testing.

janhoy commented Dec 25, 2025

> @janhoy I approved it, but obviously there are test fixes needed... The thrust of the PR looks great. Let me know if you need additional manual testing.

I got XLSXResponseWriter working. But of course there is also the langid module, which uses the Tika 1.x LanguageIdentifier. Tika 3 has a completely new LanguageDetector API with an OptimaizeLangDetector implementation quite different from the old Tika detector. I'm taking a shot at switching our TikaLanguageIdentifierUpdateProcessor to the new one and, if most existing tests pass, calling it a good temporary solution for 9.x.

EDIT: It turned out that the old LanguageIdentifier class is still present in Tika 3, so this ended up as a minor change with no user-facing changes.

@janhoy janhoy requested a review from gerlowskija December 25, 2025 02:53

janhoy commented Dec 25, 2025

Appreciate any review, but especially ref-guide changes (so it is clear to users this is a breaking change).
