Skip to content

feat: improve API-ref extraction quality and coverage in the RAG#83

Open
omkarjoshi0304 wants to merge 3 commits intoopenstack-lightspeed:mainfrom
omkarjoshi0304:OSPRH-19255-api-ref
Open

feat: improve API-ref extraction quality and coverage in the RAG#83
omkarjoshi0304 wants to merge 3 commits intoopenstack-lightspeed:mainfrom
omkarjoshi0304:OSPRH-19255-api-ref

Conversation

@omkarjoshi0304
Copy link
Contributor

Description:-

This PR improves the quality of API reference data ingested into the RAG pipeline and ensures consistent coverage across OpenStack 2025.2 projects.

Use pandoc + html2text for cleaner API extraction (removes nav, headers, and boilerplate)

Apply heuristic filtering to index.txt files to keep meaningful single-page APIs and drop navigation-only indices

Auto-detect api-ref and api-guide sources to avoid missing project docs

Filter out placeholder/junk files using logo metadata detection

@omkarjoshi0304 omkarjoshi0304 requested a review from a team as a code owner January 14, 2026 16:15
@omkarjoshi0304 omkarjoshi0304 marked this pull request as draft January 14, 2026 16:40
@omkarjoshi0304 omkarjoshi0304 marked this pull request as ready for review January 14, 2026 18:50
Copy link
Contributor

@Akrog Akrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a rule the source of truth is the git repository, so we should try to include as much information as possible in the commit message.

In this case I see there is more information on the PR description than the commit message itself.

if [ "$OS_API_DOCS" = "true" ] && [ -d "./api-ref/source" ]; then
if ! grep -q "text-api-ref" tox.ini; then
echo "$tox_text_api_ref_target" >> tox.ini
# Build API-Ref
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why remote the "if enabled" part of the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad i accidentally removed it

text_file="./$api_dir/build/text/${rel_path%.html}.txt"
mkdir -p "$(dirname "$text_file")"

if command -v pandoc &> /dev/null; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it's a good idea to allow building things with 2 different tools, as they would create different outputs, so we may start having different DB contents and not realize. This would make it harder to find out WHY things don't behave like they did beefore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the html2text fallback and standardized on Pandoc only.
The reasoning is that Pandoc produces significantly better structural output (especially for tables and code blocks) compared to html2text


if command -v pandoc &> /dev/null; then
# shellcheck disable=SC2015
pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" 2>/dev/null && ((converted_count++)) || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it OK for conversion to fail?
This could hide problems with the process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the || true safety net. The script will now fail explicitly if a file cannot be converted, preventing us from silently indexing corrupted or missing data.


if [ "$api_file_count" -gt 0 ]; then
echo "API-Ref: Found $api_file_count content files for $project"
# Only remove index.txt if other content files exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this comment very helpful. I can see in the code at a glance that's what is happening, but it's not explained WHY we are deleting it. For example because when there are other files index.txt is a navigation page pointing to other pages, so we don't want it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comment

api_file_count=$(find "./$api_dir/build/text" -name "*.txt" -type f -size +1k 2>/dev/null | wc -l)

if [ "$api_file_count" -gt 0 ]; then
echo "API-Ref: Found $api_file_count content files for $project"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this content calculation message wrong if we remove some files on L273?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the final file count calculation to the very end of the function so the log message accurately reflects the number of files

@omkarjoshi0304
Copy link
Contributor Author

As a rule the source of truth is the git repository, so we should try to include as much information as possible in the commit message.

In this case I see there is more information on the PR description than the commit message itself.

Thanks for addressing i will edit the commit message

@omkarjoshi0304 omkarjoshi0304 force-pushed the OSPRH-19255-api-ref branch 2 times, most recently from 42f3a1d to ce60cfb Compare January 17, 2026 02:15
Copy link
Contributor

@lpiwowar lpiwowar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two comments that crossed my mind this morning.

# neutron-lib is a library project that only generates API-Ref documentation.
# Its regular doc build produces no usable output, but its API-Ref is needed by Neutron.
if [ "$project" != "neutron-lib" ]; then
if ! grep -q "text-docs" tox.ini; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Is this really needed? Aren't we adding the $tox_text_docs_target on line 195?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes no need of this line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your response I understand that we want to remove this line, but I see it still present in the code 🙈 . I'm keeping this open:). Let me know if I misinterpreted.

@omkarjoshi0304 omkarjoshi0304 force-pushed the OSPRH-19255-api-ref branch 2 times, most recently from 951ea86 to 212b8ed Compare January 21, 2026 03:41
# Replace .txt with .html
remaining_path = remaining_path.replace(".txt", ".html")

# Build API-Ref URL: /api-ref/{service}/{remaining_path}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The path in the comment doesn't look too useful to me. In my opinion the actual path doesn't add value since we can see it very clearly on the next line.
It is even misleading, as it has {service} but then we use {service_name}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed it

find ./api-ref/build/text -mindepth 1 -depth -type d -empty -delete 2>/dev/null || true
if ! tox -etext-api-ref; then
echo "WARNING: API-Ref build failed for $project"
api_ref_failed="true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1: Why don't we fail the process if we fail to build the api-ref?
Wouldn't this hide errors?
I think we should fail, but maybe I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it to exit 1

[ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs
# Copy regular docs if they were built (not for neutron-lib)
if [ -d "doc/build/text" ]; then
cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why multiple double quotes? `"${project_output_dir}/${_output_version}_docs"

# Copy API-Ref documentation if it was built successfully
if [ "$OS_API_DOCS" = "true" ] && [ "$api_ref_failed" != "true" ] && \
[ -n "$api_dir" ] && [ -d "$api_dir/build/text" ]; then
cp -r "$api_dir/build/text" "$project_output_dir"/"$_output_version"_api-ref
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Same here about multiple double quotes

umago
umago previously approved these changes Feb 6, 2026
Copy link
Contributor

@umago umago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume many part of bash scripting code has been AI generated ? I don't necessarily understand all of the commands there but the logic seems OK.

Assuming this has been tested and works I'm OK with it as we already have other parts of the code that has been assisted as well/.

@omkarjoshi0304
Copy link
Contributor Author

I assume many part of bash scripting code has been AI generated ? I don't necessarily understand all of the commands there but the logic seems OK.

Assuming this has been tested and works I'm OK with it as we already have other parts of the code that has been assisted as well/.

I assume many part of bash scripting code has been AI generated ? I don't necessarily understand all of the commands there but the logic seems OK.

Assuming this has been tested and works I'm OK with it as we already have other parts of the code that has been assisted as well/.

index.txt
I have attached the one of the api ref doc file of cinder which has been parsed , quality seems good to me for other project as well . I took help of cursor to generate the script:) . just an brief of script

  • check for api directories (api-ref/source and api-guide/source)
  • configures tox
  • builds html with tox
  • coverts html to plain text using pandoc
  • deletes the junk files

@lpiwowar
Copy link
Contributor

lpiwowar commented Feb 9, 2026

Code review

Found 1 issue:

  1. converted_count is initialized to 0 but never incremented inside the while loop, so the log message at line 254 will always report "Converted 0 HTML files to text" regardless of how many files were actually converted. A ((converted_count++)) or equivalent is missing from the loop body.

converted_count=0
while IFS= read -r -d '' html_file; do
rel_path="${html_file#./"$api_dir"/build/html/}"
text_file="./$api_dir/build/text/${rel_path%.html}.txt"
mkdir -p "$(dirname "$text_file")"
# Convert HTML to plain text using pandoc (consistent output)
pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" || {
echo "ERROR: Failed to convert $html_file"
return 1
}
done < <(find "./$api_dir/build/html" -name "*.html" -type f -print0)
echo "Converted $converted_count HTML files to text for $project"

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@omkarjoshi0304
Copy link
Contributor Author

Code review

Found 1 issue:

  1. converted_count is initialized to 0 but never incremented inside the while loop, so the log message at line 254 will always report "Converted 0 HTML files to text" regardless of how many files were actually converted. A ((converted_count++)) or equivalent is missing from the loop body.

converted_count=0
while IFS= read -r -d '' html_file; do
rel_path="${html_file#./"$api_dir"/build/html/}"
text_file="./$api_dir/build/text/${rel_path%.html}.txt"
mkdir -p "$(dirname "$text_file")"
# Convert HTML to plain text using pandoc (consistent output)
pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" || {
echo "ERROR: Failed to convert $html_file"
return 1
}
done < <(find "./$api_dir/build/html" -name "*.html" -type f -print0)
echo "Converted $converted_count HTML files to text for $project"

🤖 Generated with Claude Code

  • If this code review was useful, please react with 👍. Otherwise, react with 👎.

Thanks for address counter incremented

Copy link
Contributor

@lpiwowar lpiwowar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, these are just small things, but I think it should be fixed.

# Only copy if text docs were built (skipped for neutron-lib)
[ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs
# Copy regular docs if they were built (not for neutron-lib)
if [ -d "doc/build/text" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): I'm going to put this as a blocking because there is nothing from urging us to merge this PR immediately. As far as I can tell, this change is not related to this PR and does not modify the meaning of the code. Let's remove this change or move it to a separate commit. I know it is a nit, but I think it is a good practice to keep the PRs focused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed:)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is still modified 🙈 as part of this PR. The change should be either:

  • part of a separate commit
  • or not part of this PR at all.

@omkarjoshi0304
Copy link
Contributor Author

Apologies, these are just small things, but I think it should be fixed.

No worries I will fix it:)

@omkarjoshi0304 omkarjoshi0304 force-pushed the OSPRH-19255-api-ref branch 5 times, most recently from e890d85 to 347b7d0 Compare February 11, 2026 12:55
class OpenStackDocsMetadataProcessor(MetadataProcessor):
"""Metadata processor for OpenStack documentation."""

API_REF_SERVICE_MAPPING = {
Copy link
Contributor

@lpiwowar lpiwowar Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): The commit message should be polished. Also, there is an unwanted commit titled: Merge remote-tracking branch 'refs/remotes/upstream/main' into OSPRH-19255-api-ref

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

echo "Using ${num_running_subproc}/${NUM_WORKERS} workers. Waiting ..."
wait -n || log_and_die "Subprocess generating text documentation failed!"
echo "Using $(( --num_running_subproc ))/${NUM_WORKERS} workers."
echo "Using $(( --num_running_subproc ))/${NUM_WORKERS} workers."
Copy link
Contributor

@lpiwowar lpiwowar Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): Thank you for fixing the line and the inconsistent tabs/spaces usage:). The thing is that this should be a separate commit ideally (because it is not related to the core of the PR).

# Only copy if text docs were built (skipped for neutron-lib)
[ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs
# Copy regular docs if they were built (not for neutron-lib)
if [ -d "doc/build/text" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is still modified 🙈 as part of this PR. The change should be either:

  • part of a separate commit
  • or not part of this PR at all.

@omkarjoshi0304 omkarjoshi0304 force-pushed the OSPRH-19255-api-ref branch 2 times, most recently from 69254f1 to 347b7d0 Compare February 16, 2026 13:11
@omkarjoshi0304 omkarjoshi0304 force-pushed the OSPRH-19255-api-ref branch 3 times, most recently from d4b1b21 to 16e2f3f Compare February 17, 2026 13:04
- Standardize on Pandoc for consistent RAG output; remove html2text fallback.
- Add strict error handling for conversion failures and missing tools.
- Fix "Url not reachable" errors by recursively pruning empty navigation index files.
- Auto-detect api-ref and api-guide sources to avoid missing project docs.

PR Comment fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants