feat: improve API-ref extraction quality and coverage in the RAG#83
feat: improve API-ref extraction quality and coverage in the RAG#83omkarjoshi0304 wants to merge 3 commits intoopenstack-lightspeed:mainfrom
Conversation
ef662f1 to
9516e0e
Compare
9516e0e to
55cdbad
Compare
Akrog
left a comment
There was a problem hiding this comment.
As a rule the source of truth is the git repository, so we should try to include as much information as possible in the commit message.
In this case I see there is more information on the PR description than the commit message itself.
| if [ "$OS_API_DOCS" = "true" ] && [ -d "./api-ref/source" ]; then | ||
| if ! grep -q "text-api-ref" tox.ini; then | ||
| echo "$tox_text_api_ref_target" >> tox.ini | ||
| # Build API-Ref |
There was a problem hiding this comment.
nit: Why remote the "if enabled" part of the comment?
There was a problem hiding this comment.
My bad i accidentally removed it
| text_file="./$api_dir/build/text/${rel_path%.html}.txt" | ||
| mkdir -p "$(dirname "$text_file")" | ||
|
|
||
| if command -v pandoc &> /dev/null; then |
There was a problem hiding this comment.
Not sure it's a good idea to allow building things with 2 different tools, as they would create different outputs, so we may start having different DB contents and not realize. This would make it harder to find out WHY things don't behave like they did beefore.
There was a problem hiding this comment.
I have removed the html2text fallback and standardized on Pandoc only.
The reasoning is that Pandoc produces significantly better structural output (especially for tables and code blocks) compared to html2text
|
|
||
| if command -v pandoc &> /dev/null; then | ||
| # shellcheck disable=SC2015 | ||
| pandoc -f html -t plain --wrap=preserve "$html_file" -o "$text_file" 2>/dev/null && ((converted_count++)) || true |
There was a problem hiding this comment.
Why is it OK for conversion to fail?
This could hide problems with the process.
There was a problem hiding this comment.
I removed the || true safety net. The script will now fail explicitly if a file cannot be converted, preventing us from silently indexing corrupted or missing data.
|
|
||
| if [ "$api_file_count" -gt 0 ]; then | ||
| echo "API-Ref: Found $api_file_count content files for $project" | ||
| # Only remove index.txt if other content files exist |
There was a problem hiding this comment.
I don't find this comment very helpful. I can see in the code at a glance that's what is happening, but it's not explained WHY we are deleting it. For example because when there are other files index.txt is a navigation page pointing to other pages, so we don't want it.
There was a problem hiding this comment.
I have updated the comment
| api_file_count=$(find "./$api_dir/build/text" -name "*.txt" -type f -size +1k 2>/dev/null | wc -l) | ||
|
|
||
| if [ "$api_file_count" -gt 0 ]; then | ||
| echo "API-Ref: Found $api_file_count content files for $project" |
There was a problem hiding this comment.
Wouldn't this content calculation message wrong if we remove some files on L273?
There was a problem hiding this comment.
moved the final file count calculation to the very end of the function so the log message accurately reflects the number of files
Thanks for addressing i will edit the commit message |
42f3a1d to
ce60cfb
Compare
lpiwowar
left a comment
There was a problem hiding this comment.
Just two comments that crossed my mind this morning.
| # neutron-lib is a library project that only generates API-Ref documentation. | ||
| # Its regular doc build produces no usable output, but its API-Ref is needed by Neutron. | ||
| if [ "$project" != "neutron-lib" ]; then | ||
| if ! grep -q "text-docs" tox.ini; then |
There was a problem hiding this comment.
question: Is this really needed? Aren't we adding the $tox_text_docs_target on line 195?
There was a problem hiding this comment.
yes no need of this line
There was a problem hiding this comment.
From your response I understand that we want to remove this line, but I see it still present in the code 🙈 . I'm keeping this open:). Let me know if I misinterpreted.
951ea86 to
212b8ed
Compare
| # Replace .txt with .html | ||
| remaining_path = remaining_path.replace(".txt", ".html") | ||
|
|
||
| # Build API-Ref URL: /api-ref/{service}/{remaining_path} |
There was a problem hiding this comment.
nit: The path in the comment doesn't look too useful to me. In my opinion the actual path doesn't add value since we can see it very clearly on the next line.
It is even misleading, as it has {service} but then we use {service_name}.
There was a problem hiding this comment.
I have removed it
| find ./api-ref/build/text -mindepth 1 -depth -type d -empty -delete 2>/dev/null || true | ||
| if ! tox -etext-api-ref; then | ||
| echo "WARNING: API-Ref build failed for $project" | ||
| api_ref_failed="true" |
There was a problem hiding this comment.
-1: Why don't we fail the process if we fail to build the api-ref?
Wouldn't this hide errors?
I think we should fail, but maybe I'm missing something.
There was a problem hiding this comment.
updated it to exit 1
| [ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs | ||
| # Copy regular docs if they were built (not for neutron-lib) | ||
| if [ -d "doc/build/text" ]; then | ||
| cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs |
There was a problem hiding this comment.
nit: Why multiple double quotes? `"${project_output_dir}/${_output_version}_docs"
| # Copy API-Ref documentation if it was built successfully | ||
| if [ "$OS_API_DOCS" = "true" ] && [ "$api_ref_failed" != "true" ] && \ | ||
| [ -n "$api_dir" ] && [ -d "$api_dir/build/text" ]; then | ||
| cp -r "$api_dir/build/text" "$project_output_dir"/"$_output_version"_api-ref |
There was a problem hiding this comment.
nit: Same here about multiple double quotes
b8fedc6 to
f5984b2
Compare
umago
left a comment
There was a problem hiding this comment.
I assume many part of bash scripting code has been AI generated ? I don't necessarily understand all of the commands there but the logic seems OK.
Assuming this has been tested and works I'm OK with it as we already have other parts of the code that has been assisted as well/.
index.txt
|
Code reviewFound 1 issue:
rag-content/scripts/get_openstack_plaintext_docs.sh Lines 240 to 255 in f5984b2 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
ece8977 to
eb09cb5
Compare
Thanks for address counter incremented |
lpiwowar
left a comment
There was a problem hiding this comment.
Apologies, these are just small things, but I think it should be fixed.
| # Only copy if text docs were built (skipped for neutron-lib) | ||
| [ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs | ||
| # Copy regular docs if they were built (not for neutron-lib) | ||
| if [ -d "doc/build/text" ]; then |
There was a problem hiding this comment.
issue (blocking): I'm going to put this as a blocking because there is nothing from urging us to merge this PR immediately. As far as I can tell, this change is not related to this PR and does not modify the meaning of the code. Let's remove this change or move it to a separate commit. I know it is a nit, but I think it is a good practice to keep the PRs focused.
There was a problem hiding this comment.
The line is still modified 🙈 as part of this PR. The change should be either:
- part of a separate commit
- or not part of this PR at all.
No worries I will fix it:) |
e890d85 to
347b7d0
Compare
| class OpenStackDocsMetadataProcessor(MetadataProcessor): | ||
| """Metadata processor for OpenStack documentation.""" | ||
|
|
||
| API_REF_SERVICE_MAPPING = { |
| echo "Using ${num_running_subproc}/${NUM_WORKERS} workers. Waiting ..." | ||
| wait -n || log_and_die "Subprocess generating text documentation failed!" | ||
| echo "Using $(( --num_running_subproc ))/${NUM_WORKERS} workers." | ||
| echo "Using $(( --num_running_subproc ))/${NUM_WORKERS} workers." |
There was a problem hiding this comment.
issue (blocking): Thank you for fixing the line and the inconsistent tabs/spaces usage:). The thing is that this should be a separate commit ideally (because it is not related to the core of the PR).
| # Only copy if text docs were built (skipped for neutron-lib) | ||
| [ -d "doc/build/text" ] && cp -r doc/build/text "$project_output_dir"/"$_output_version"_docs | ||
| # Copy regular docs if they were built (not for neutron-lib) | ||
| if [ -d "doc/build/text" ]; then |
There was a problem hiding this comment.
The line is still modified 🙈 as part of this PR. The change should be either:
- part of a separate commit
- or not part of this PR at all.
69254f1 to
347b7d0
Compare
d4b1b21 to
16e2f3f
Compare
- Standardize on Pandoc for consistent RAG output; remove html2text fallback. - Add strict error handling for conversion failures and missing tools. - Fix "Url not reachable" errors by recursively pruning empty navigation index files. - Auto-detect api-ref and api-guide sources to avoid missing project docs. PR Comment fix
78d3bf4 to
755363b
Compare

Description:-
This PR improves the quality of API reference data ingested into the RAG pipeline and ensures consistent coverage across OpenStack 2025.2 projects.
Use pandoc + html2text for cleaner API extraction (removes nav, headers, and boilerplate)
Apply heuristic filtering to index.txt files to keep meaningful single-page APIs and drop navigation-only indices
Auto-detect api-ref and api-guide sources to avoid missing project docs
Filter out placeholder/junk files using logo metadata detection