rgiovann/extrator-codigo-java

JavaFilesExtractor

JavaFilesExtractor is a Java tool that extracts all .java files from a GitHub repository and concatenates them into a single text file, adding metadata and removing /* */ comment blocks. It is suited to code analysis, auditing, and documentation generation for Java projects, especially when the code will be processed by AI models.

Goal

The main goal of JavaFilesExtractor is to extract Java source code from GitHub repositories and normalize it, reducing noise in the input given to Large Language Models (LLMs). By removing /* */ comment blocks and labeling each file with clear metadata (file name, package, and declaration), the project makes the code easier to use for tasks such as automated analysis, documentation generation, or AI model training, and gives the model a cleaner, more standardized input.

Features

  • Automatic Download: Downloads the ZIP of the default branch of a GitHub repository.
  • File Extraction: Unzips the archive and identifies all .java files.
  • Concatenation with Metadata: Generates a text file containing:
    • File name (// FILE: FileName.java).
    • Declared package (// PACKAGE: package.name or (default)).
    • Main declaration (// DECLARATION: class Name or similar).
    • Source code without /* */ comment blocks.
    • End marker (// END_OF_FILE).
  • Comment Cleanup: Removes /* */ comment blocks while preserving // line comments.
  • Resource Management: Deletes temporary files (ZIP and directories) after use.
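The comment cleanup step can be sketched with a single regular expression. The method name removeComments appears later in this README, but the body below is only an assumption about how such a method might look, not the project's actual code; the class name BlockCommentStripper is hypothetical.

```java
import java.util.regex.Pattern;

final class BlockCommentStripper {
    // Matches /* ... */ across line breaks (DOTALL). The reluctant quantifier
    // stops at the first closing */, so separate comments are not merged.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    // Removes block comments (including Javadoc) and leaves // comments alone.
    static String removeComments(String source) {
        return BLOCK_COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String src = "/** Javadoc */\nclass A { // keep me\n  int x; /* drop me */\n}\n";
        System.out.print(removeComments(src));
    }
}
```

A plain regex like this does not understand Java syntax: a /* sequence inside a string literal or inside a // line comment would also be treated as the start of a block comment. For typical source files that is rarely a problem, but a tokenizer-based approach would be needed to rule it out.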

Prerequisites

  • Java 11 or higher: The project uses the java.net.http.HttpClient API, which was added in Java 11, and other modern features.
  • Dependencies:
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.16.0</version>
</dependency>

How to Use

  1. Clone the repository:

    git clone https://github.com/your-username/java-files-extractor.git
    cd java-files-extractor
  2. Import, compile, and run: Import the project (it uses Maven) into your preferred IDE, then compile and run it. Make sure the IDE has resolved the Jackson Databind dependency.

  3. Enter the Repository: When prompted, type the repository in the format user/repository (e.g., rgiovann/pratt_parser).

  4. Output: A text file (e.g., user_repository_timestamp.txt) will be generated in the current directory, containing all .java files concatenated with metadata.
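As an illustration of steps 3 and 4, the sketch below turns a user/repository string into an output file name. The timestamp pattern yyyyMMdd_HHmmss is inferred from the example name rgiovann_pratt_parser_20250509_123456.txt; the class and method names are hypothetical and the project's real validation may differ.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class OutputFileName {
    // Assumed pattern, inferred from the example file name in this README.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");

    // Builds user_repository_timestamp.txt from input such as "rgiovann/pratt_parser".
    static String forRepository(String repo, LocalDateTime when) {
        String[] parts = repo.trim().split("/");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Expected user/repository, got: " + repo);
        }
        return parts[0] + "_" + parts[1] + "_" + STAMP.format(when) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(forRepository("rgiovann/pratt_parser", LocalDateTime.now()));
    }
}
```

Rejecting anything that is not exactly two non-empty segments keeps malformed input from reaching the download step.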

Output Example

For a file App.java in the repository rgiovann/pratt_parser:

package parser.prat_parser;

public class App {
    public static void main(String[] args) {
        String input = "4+3*5 + 4/6 - (3+4)*7";
        PrattParser parser = new PrattParser(LexerFactory.createDefaultLexer(input));
        Expression parsed = parser.parse();
        System.out.println("INPUT : " + input);
        System.out.println("OUTPUT : " + parsed.toString());
    }
}

The output file (rgiovann_pratt_parser_20250509_123456.txt) will contain:

// FILE: App.java
// PACKAGE: parser.prat_parser
// DECLARATION: class App

package parser.prat_parser;

public class App {
    public static void main(String[] args) {
        String input = "4+3*5 + 4/6 - (3+4)*7";
        PrattParser parser = new PrattParser(LexerFactory.createDefaultLexer(input));
        Expression parsed = parser.parse();
        System.out.println("INPUT : " + input);
        System.out.println("OUTPUT : " + parsed.toString());
    }
}
// END_OF_FILE
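
A minimal sketch of how a block like the one above could be assembled. The method names extractPackageName and extractDeclaration are mentioned in this README, but their bodies here, the regular expressions, the class name MetadataBlock, and the "(unknown)" fallback are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetadataBlock {
    // "package a.b.c;" at the start of a line.
    private static final Pattern PACKAGE =
            Pattern.compile("^\\s*package\\s+([\\w.]+)\\s*;", Pattern.MULTILINE);
    // First type keyword followed by an identifier.
    private static final Pattern DECLARATION =
            Pattern.compile("\\b(class|interface|enum|record)\\s+(\\w+)");

    static String extractPackageName(String source) {
        Matcher m = PACKAGE.matcher(source);
        return m.find() ? m.group(1) : "(default)";
    }

    static String extractDeclaration(String source) {
        Matcher m = DECLARATION.matcher(source);
        // "(unknown)" is a hypothetical fallback, not taken from the project.
        return m.find() ? m.group(1) + " " + m.group(2) : "(unknown)";
    }

    // Wraps one source file in the header and end marker shown in the example.
    static String format(String fileName, String source) {
        String body = source.endsWith("\n") ? source : source + "\n";
        return "// FILE: " + fileName + "\n"
             + "// PACKAGE: " + extractPackageName(source) + "\n"
             + "// DECLARATION: " + extractDeclaration(source) + "\n\n"
             + body
             + "// END_OF_FILE\n";
    }
}
```

Because the declaration pattern takes the first match, a word such as "class" inside a leading // comment could be picked up before the real declaration; running it on source that has already been cleaned reduces that risk but does not remove it.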

Code Structure

The project follows good object-oriented programming practices, with an emphasis on the SOLID, KISS, and DRY principles:

  • Single Responsibility Principle (SRP): Each method has a single responsibility (e.g., extractPackageName extracts the package, removeComments removes comment blocks).
  • Open/Closed Principle (OCP): The logic is extensible, allowing new features to be added (e.g., new tags) without modifying the core.
  • Keep It Simple, Stupid (KISS): Simple solutions, such as regex for parsing packages and comments, avoid unnecessary complexity.
  • Don't Repeat Yourself (DRY): Reusable functions (e.g., extractDeclaration) avoid code duplication.

The code uses Java's modern API (HttpClient, Files, Path) for efficiency and robustness, with proper error and resource handling (e.g., automatic closing with try-with-resources).
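
To make the try-with-resources point concrete, here is a small sketch of two operations the tool needs: finding .java files in an unzipped tree and deleting a temporary directory afterwards. The class and method names are hypothetical and this is not the project's code; it only shows the pattern with Files and Path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TempTree {
    // Collects every regular file ending in .java under root, in sorted order.
    static List<Path> findJavaFiles(Path root) throws IOException {
        // Files.walk holds open directory handles, so the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".java"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Deletes root and everything beneath it.
    static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order visits children before their parent directories.
            List<Path> ordered = paths.sorted(Comparator.reverseOrder())
                                      .collect(Collectors.toList());
            for (Path p : ordered) {
                Files.delete(p);
            }
        }
    }
}
```

Collecting the paths into a list before deleting keeps the checked IOException from Files.delete out of a lambda, where it would have to be wrapped.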

Suitability for Small and Medium Projects

JavaFilesExtractor is best suited to small and medium Java projects, up to ~4,000 lines of code or ~20,000 tokens. A token is the unit of text an AI model processes; in Java code, one token corresponds to roughly 4-5 characters (e.g., public class may be ~2-3 tokens). Most modern AI models support context windows of:

  • Grok 3 (xAI): Up to ~128,000 tokens.
  • GPT-4 (OpenAI): Up to ~32,000 or 128,000 tokens, depending on the variant.
  • Claude (Anthropic): Up to ~200,000 tokens.

For accurate responses, we recommend providing files of 10,000 to 20,000 tokens, which leaves room in the context window for your questions and the model's answers. Projects in this range, such as rgiovann/ds-catalog, are processed well without overloading the model.

Example: rgiovann/ds-catalog Repository

The public repository rgiovann/ds-catalog is a practical example. The file generated by JavaFilesExtractor for this project contains:

  • 85,257 characters (excluding spaces, tabs, and line breaks).
  • ~21,300 tokens (average of ~4 characters per token).
  • ~2,500 total lines, with ~2,125 non-empty lines.

With ~21,300 tokens, the file is at the upper limit for medium projects and can be processed by AIs such as Grok 3 or GPT-4. For specific tasks (e.g., analyzing only the ProductService class), we recommend extracting smaller excerpts (e.g., ~5,000-10,000 tokens) for greater accuracy.
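
The estimate above is simple arithmetic and can be reproduced with a few lines of Java. This sketch assumes the README's average of ~4 non-whitespace characters per token; real tokenizers differ by model, so the result is only a rough guide. The class and method names are hypothetical.

```java
final class TokenEstimate {
    // Average taken from this README; actual tokenizers vary by model.
    static final int CHARS_PER_TOKEN = 4;

    // Counts characters, excluding spaces, tabs, and line breaks.
    static long countNonWhitespace(String text) {
        return text.chars().filter(c -> !Character.isWhitespace(c)).count();
    }

    // Integer division: a rough estimate, rounded down.
    static long estimateTokens(long nonWhitespaceChars) {
        return nonWhitespaceChars / CHARS_PER_TOKEN;
    }

    public static void main(String[] args) {
        // Figure reported for rgiovann/ds-catalog.
        System.out.println(estimateTokens(85_257)); // prints 21314
    }
}
```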

Note for Large Projects

For very large projects, with more than ~4,000 lines or ~20,000 tokens, the generated file may exceed the context limits of some AI models, or response quality may drop because the model's focus is diluted. In these cases, we suggest manually splitting the generated file into smaller files.
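
Splitting can also be scripted. The sketch below is not part of the tool: it is a hypothetical helper that cuts the generated text only at // END_OF_FILE markers, so no source file is divided, and groups whole file blocks under a character budget chosen by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class OutputSplitter {
    private static final String END_MARKER = "// END_OF_FILE";

    // Groups whole file blocks into chunks of roughly maxChars characters.
    // A single block larger than maxChars is kept whole in its own chunk.
    static List<String> split(String generated, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder chunk = new StringBuilder();
        StringBuilder block = new StringBuilder();
        for (String line : generated.lines().collect(Collectors.toList())) {
            block.append(line).append('\n');
            if (line.trim().equals(END_MARKER)) {
                // Flush the current chunk if this block would exceed the budget.
                if (chunk.length() > 0 && chunk.length() + block.length() > maxChars) {
                    chunks.add(chunk.toString());
                    chunk.setLength(0);
                }
                chunk.append(block);
                block.setLength(0);
            }
        }
        chunk.append(block); // trailing lines without an end marker, if any
        if (chunk.length() > 0) {
            chunks.add(chunk.toString());
        }
        return chunks;
    }
}
```

The budget here counts every character, including whitespace, so it is not directly comparable to the non-whitespace character counts used for token estimates in this README; choose it conservatively.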

A future version of the extractor is planned to offer users the option of generating normalized files organized by package.

Documentation

The file passo_a_passo_java_extrator.pdf describes the extractor's operation in detail.
