SOLR-18037 Remove "local" tika extraction backend from branch_9x #3980
base: branch_9x
Conversation
Co-authored-by: Eric Pugh <epugh@opensourceconnections.com>
Updates versions.props and versions.lock to use Tika 3.2.3 instead of 1.28.5, matching the version update in the main branch. This is part of the backport of SOLR-17961.
Pull request overview
This PR backports the removal of the deprecated "local" Tika extraction backend from Solr 10.0 to Solr 9.11 (branch_9x). The change addresses security concerns by eliminating the vulnerable Tika 1.x library, requiring users to use an external Tika Server instead.
Key changes include:
- Upgrading Tika from 1.28.5 to 3.2.3 and Apache POI from 5.2.2 to 5.5.1
- Removing the LocalTikaExtractionBackend and related code
- Removing numerous Tika 1.x parser dependencies (PDFBox, NetCDF, SIS, etc.)
- Making tikaserver.url a required configuration parameter
- Adding clear error messages for removed configuration options
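The last two items can be illustrated with a minimal sketch. This is not the code in ExtractingRequestHandler.java; the class name `TikaConfigCheck`, the option names `extraction.backend`, `tika.config` and `parseContext.config`, and the message wording are all assumptions made for illustration. Only `tikaserver.url` and the `tikaserver` backend name come from the PR text.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the actual Solr implementation. */
final class TikaConfigCheck {
  // Assumed names of options that only applied to the removed local backend.
  private static final List<String> REMOVED_OPTIONS =
      List.of("tika.config", "parseContext.config");

  private TikaConfigCheck() {}

  /** Returns the configured Tika Server URL, or throws with an explanatory message. */
  static String validate(Map<String, String> config) {
    // Reject options that belonged to the removed local backend.
    for (String removed : REMOVED_OPTIONS) {
      if (config.containsKey(removed)) {
        throw new IllegalArgumentException(
            "'" + removed + "' is no longer supported: the local Tika backend was removed."
                + " Configure an external Tika Server via 'tikaserver.url' instead.");
      }
    }
    // The backend option (assumed name) stays optional, but only one value is valid.
    String backend = config.getOrDefault("extraction.backend", "tikaserver");
    if (!"tikaserver".equals(backend)) {
      throw new IllegalArgumentException(
          "Unsupported extraction backend '" + backend + "': only 'tikaserver' is available.");
    }
    // The server URL is now mandatory.
    String url = config.get("tikaserver.url");
    if (url == null || url.isBlank()) {
      throw new IllegalArgumentException(
          "'tikaserver.url' is required, for example http://localhost:9998");
    }
    return url;
  }
}
```

Failing fast at configuration time, rather than on the first extraction request, is one way to make the breaking change visible to users as soon as they upgrade.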
Reviewed changes
Copilot reviewed 198 out of 206 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| versions.props, versions.lock | Updated Tika to 3.2.3, POI to 5.5.1, removed ~60 Tika 1.x parser dependencies |
| ExtractingRequestHandler.java | Removed local backend support, added validation requiring tikaserver.url, provides clear error messages for deprecated configs |
| TikaServerExtractionBackend.java | Minor error message improvement, TODO comment about removing Tika dependency |
| ParseContextConfig.java, LocalTikaExtractionBackend.java | Completely removed (local backend implementation files) |
| indexing-with-tika.adoc | Updated documentation to reflect tikaserver-only approach, Docker setup instructions |
| test files | Removed local backend tests, updated integration tests to use Docker-based Tika Server |
| sample configs | Updated techproducts config to point to localhost:9998 Tika Server |
| license files | Removed ~100 license files for removed Tika 1.x parser dependencies |
| Specifies the backend to use for extraction. As of Solr 10, only `tikaserver` is supported. | ||
| This parameter is optional since `tikaserver` is the only available backend. |
Copilot
AI
Dec 23, 2025
The documentation incorrectly states "As of Solr 10" when this change is being backported to Solr 9.11 (branch_9x). This should be updated to "As of Solr 9.11" to accurately reflect when this feature was removed.
...ules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc
epugh
left a comment
LGTM. Thank you for taking on what feels like an important yet somewhat thankless task, @janhoy.
I appreciate the documentation edits you made.
@janhoy I approved it, but obviously there are test fixes needed. The thrust of the PR looks great. Let me know if you need additional manual testing.
I got XLSXResponseWriter working. But of course there is also EDIT: It turned out that the old LanguageIdentifier class is still present in Tika 3, so it ended up as a minor change with no user-facing changes.
I appreciate any review, but especially of the ref-guide changes (so it is clear to users that this is a breaking change).
This is a backport of #3784 to branch_9x, targeting Solr 9.11. We normally don't remove features in a minor version, but it was decided to make an exception because Tika 1.x is EOL; see https://issues.apache.org/jira/browse/SOLR-18037 for details.
Since we still have `XLSXResponseWriter` in 9.x, using Apache POI, we cannot get rid of as many dependencies as in 10.0.