Readability4J

Readability4J is a Kotlin port of Mozilla's Readability.js, which is used for Firefox's reader view: https://github.com/mozilla/readability.

It tries to detect the relevant content of a website and removes all clutter from it such as advertisements, navigation bars, social media buttons, etc.

The extracted text can then be used for indexing web pages, for giving users a pleasant reading experience, and for similar purposes.

As it's compatible with Mozilla's Readability.js, it produces almost exactly the same output as you would see in Firefox's Reader View. The few differences arise because Jsoup does not behave exactly the same in some cases, mostly in ways you can't see anyway.

Setup

Step 1. Add the JitPack repository to your root settings.gradle at the end of repositories:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url 'https://jitpack.io' }
		}
	}

Step 2. Add the dependency

	dependencies {
		implementation 'com.github.NotDroidUser:Readability4J:2.0.0-jitpack-beta'
	}

Usage

From Java:

String url = "some-page.com";
String html = "Some Bloated Article html source";

Readability4J readability4J = new Readability4J(url, html); // url is just needed to resolve relative urls
Article article = readability4J.parse();

// returns extracted content in a <div> element
String extractedContentHtml = article.getContent();
// to get content wrapped in <html> tags and encoding set to UTF-8, see chapter 'Output encoding'
String extractedContentHtmlWithUtf8Encoding = article.getContentWithUtf8Encoding();
String extractedContentPlainText = article.getTextContent();
String title = article.getTitle();
String byline = article.getByline();
String excerpt = article.getExcerpt();

From Kotlin:

val url = "somepage.com"
val html = "Some Bloated Article html source"

val readability4J = Readability4J(url, html) // url is just needed to resolve relative urls
val article = readability4J.parse()

// returns extracted content in a <div> element
val extractedContentHtml = article.content
// to get content wrapped in <html> tags and encoding set to UTF-8, see chapter 'Output encoding'
val extractedContentHtmlWithUtf8Encoding = article.contentWithUtf8Encoding
val extractedContentPlainText = article.textContent
val title = article.title
val byline = article.byline
val excerpt = article.excerpt
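
The url argument matters because extracted content can contain relative links and image paths. As a rough illustration of what resolving against the page URL means, here is a sketch that uses only java.net.URI. It is not Readability4J's own code; the class RelativeUrlSketch, the method resolve and the example.com URLs are made up for this example.

```java
import java.net.URI;

// Hypothetical illustration, not part of Readability4J.
class RelativeUrlSketch {

    // Resolves a link found in the page against the URL the page was loaded from.
    // Absolute links are returned unchanged.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "https://example.com/articles/post.html";

        // Relative to the page's directory
        System.out.println(resolve(pageUrl, "pic.jpg"));
        // Parent-directory reference
        System.out.println(resolve(pageUrl, "../images/a.png"));
        // Already absolute, so left as-is
        System.out.println(resolve(pageUrl, "https://cdn.example.org/x.png"));
    }
}
```

Note that the page URL needs a scheme (such as https://) for this kind of resolution to yield absolute links.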

Why can't I use Readability4JExtended now?

Because the Readability.js code changed a lot between the last ported commit and now (2018 to 2025), I first updated only the Readability4J code base, to make the updating process less stressful. You can still get similar behavior with classes like these:

On Java:

String url = "some-specific-page.com";
String html = "Some Bloated Article html source that needs extra steps";

Readability4J readability4J = new Readability4J(url, html);
ArticleGrabber extended = new ArticleGrabber(readability4J.getOptions(), new BaseRegexUtilExtended());
readability4J.setArticleGrabber(extended);

On Kotlin:

val url = "some-specific-page.com"
val html = "Some Bloated Article html source that needs extra steps"

val readability4J = Readability4J(url, html)
readability4J.articleGrabber = ArticleGrabber(readability4J.options,BaseRegexUtilExtended())

Some features of the original Readability4JExtended, such as data-src handling, are now implemented in the base Readability4J (the srcset regex, for example).

Output encoding

As users noted (see Issues #1 and #2), by default no encoding is applied to Readability4J's output, resulting in incorrect display of non-ASCII characters.

The reason is that, like Readability.js, Readability4J returns its output in a <div> element, and the only way to set the encoding within the HTML itself is a <meta charset=""> tag inside <head>.
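
To show what "incorrect display" can look like, here is a small sketch that uses only the Java standard library. It assumes the content was written as UTF-8 and that the consumer, lacking a charset declaration, decodes it as ISO-8859-1; what a real browser or WebView falls back to may differ. The class MojibakeSketch and the method decodeAs are hypothetical names, not part of Readability4J.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Readability4J.
class MojibakeSketch {

    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens when a consumer guesses the wrong encoding.
    static String decodeAs(String text, Charset written, Charset assumed) {
        byte[] bytes = text.getBytes(written);
        return new String(bytes, assumed);
    }

    public static void main(String[] args) {
        String original = "Caf\u00e9"; // "Cafe" with an acute accent on the e

        String garbled = decodeAs(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        String intact = decodeAs(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // The accented letter takes two bytes in UTF-8, so the wrong guess
        // turns it into two unrelated characters.
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 5
        System.out.println(intact.equals(original));  // true
    }
}
```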

So I added these convenience methods to the Article class:

On Java:

String contentHtmlWithUtf8Encoding = article.getContentWithUtf8Encoding();
// or (tries to apply the site's charset if set, otherwise falls back to UTF-8)
String contentWithDocumentsCharsetOrUtf8 = article.getContentWithDocumentsCharsetOrUtf8();
// or
String contentHtmlWithCustomEncoding = article.getContentWithEncoding("ISO-8859-1");

On Kotlin:

var contentHtmlWithUtf8Encoding = article.contentWithUtf8Encoding
// or (tries to apply the site's charset if set, otherwise falls back to UTF-8)
var contentWithDocumentsCharsetOrUtf8 = article.contentWithDocumentsCharsetOrUtf8
// or
var contentHtmlWithCustomEncoding = article.getContentWithEncoding("ISO-8859-1")

These methods wrap the content in:

<html>
 <head>
  <meta charset="$encoding" /> 
 </head>
 <body>
 <!-- content -->
 </body>
</html>
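
The wrapper is simple enough to sketch by hand. The following is a minimal illustration of the structure above, not the library's actual implementation; the class CharsetWrapperSketch and the method wrapWithCharset are hypothetical names.

```java
// Hypothetical helper, not part of Readability4J: it only mirrors the
// wrapper structure shown above.
class CharsetWrapperSketch {

    // Wraps an extracted <div> fragment in a minimal HTML document that
    // declares the given encoding in a <meta charset> tag.
    static String wrapWithCharset(String contentHtml, String encoding) {
        return "<html>\n"
             + " <head>\n"
             + "  <meta charset=\"" + encoding + "\" />\n"
             + " </head>\n"
             + " <body>\n"
             + contentHtml + "\n"
             + " </body>\n"
             + "</html>";
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String fragment = "<div><p>Gr\u00fc\u00dfe</p></div>";
        System.out.println(wrapWithCharset(fragment, "UTF-8"));
    }
}
```

The declared charset only helps if the bytes you eventually write or serve are actually encoded with that same charset.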

Compatibility with Mozilla's Readability.js

As mentioned before, this is almost an exact copy of Mozilla's Readability.js. But since code packed into a single file can be almost unreadable, I extracted some parts of its 2000+ lines into new classes:

| Readability.js function | Readability4J location |
| --- | --- |
| _unwrapNoscriptImages(), _removeScripts() and _prepDocument() | Preprocessor.unwrapNoscriptImages(), Preprocessor.removeScripts() and Preprocessor.prepDocument() |
| _grabArticle() | ArticleGrabber.grabArticle() |
| _postProcessContent() | Postprocessor.postProcessContent() |
| _getJSONLD(), _getArticleMetadata() | MetadataParser.getJSONLD(), MetadataParser.getArticleMetadata() |

I added some log functions to Util.kt so that nodes are logged the same way as in JavaScript, which makes comparison in test cases easier. I also rolled Jackson back to the latest version compatible with Android API 19-25.

Overview of which Mozilla Readability.js commit each Readability4J version matches:

| Version | Commit | Date |
| --- | --- | --- |
| 1.0 | 8da91b9 | 12/5/17 |
| 1.0.1 | 834672e | 02/27/18 |
| 2.0.0-beta | [v0.6.0](https://github.com/mozilla/readability/commit/04fd32f72b448c12b02ba6c40928b67e510bac49), almost all tests work | 13/10/25 |
| 2.1.0-rc | [d7949dc4](https://github.com/mozilla/readability/commit/d7949dc4) works, with only 4 failing tests (minor differences) | 12/1/26 |

Testing

I added readability.js as a submodule so that the tests stay up to date with upstream's latest test cases. I also don't take their expected results as given. Instead, I call readability.js inside HtmlUnit after applying some regex-based changes to the script: syntactic ones (for Rhino compatibility) and non-syntactic ones (so it can run as a function rather than a class).

Extensibility

I tried to keep the library as extensible as possible. All of the classes mentioned above can be overridden, and your own instance can be assigned to the corresponding Readability4J property (as with articleGrabber in the Readability4JExtended example).

Logging

Readability4J uses slf4j as its logging facade.

So you can use any logger that supports slf4j, like Logback and log4j, to configure and get Readability4J's log output.

License

Copyright 2017 dankito 2025 NotDroidUser

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
