optimize rst #2505

ohmayr · 2025-12-09T12:47:52Z

This PR optimizes gapic/utils/rst.py to use native Python regex for the majority of docstring conversions (Markdown to RST), resolving a major performance bottleneck caused by shelling out to pandoc. pypandoc is retained only as a fallback for complex tables.

Performance Impact: Reduces generation time per API from ~52s to ~10s (for google-cloud-discoveryengine).

Explanation of Diffs
This PR introduces a large number of docstring changes. These are expected and acceptable for the following reasons:

Text Reflow (Cosmetic): We switched from Pandoc's text wrapping algorithm to Python's textwrap. This causes line breaks to shift, but the content remains semantically identical.
Link Fixes (Corrective): Reference-style links (e.g., [Message][pkg.Message]) were previously ignored by Pandoc and rendered as plain text. This change correctly converts them into valid RST hyperlinks.
List Markers (Cosmetic): The script now preserves the original * list bullets used in the source protos, whereas Pandoc normalized them to -.
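
To illustrate why the reflow diffs are cosmetic, here is a minimal sketch of re-wrapping a paragraph with Python's textwrap. The `reflow` helper, its parameters, and the sample text are hypothetical and are not taken from this PR; the sketch only shows that re-wrapping moves line breaks without changing the words.

```python
import textwrap


def reflow(text: str, width: int = 72, indent: int = 0) -> str:
    """Hypothetical helper: re-wrap one paragraph to `width` columns."""
    # Collapse existing line breaks so wrapping starts from a single line.
    paragraph = " ".join(text.split())
    # Continuation lines are indented; the first line is left flush.
    # Long tokens and hyphenated words are kept intact so identifiers survive.
    return textwrap.fill(
        paragraph,
        width=width,
        subsequent_indent=" " * indent,
        break_long_words=False,
        break_on_hyphens=False,
    )


original = (
    "The maximum number of results to return.\n"
    "If unspecified, the service picks a default value."
)
wrapped = reflow(original, width=40, indent=4)
print(wrapped)
```

The line breaks in `wrapped` differ from those in `original`, but splitting both on whitespace yields the same sequence of words.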

gemini-code-assist · 2025-12-09T12:48:11Z

Summary of Changes

Hello @ohmayr, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the performance of Markdown to reStructuredText (RST) conversion within the gapic.utils.rst module. The primary goal is to minimize the performance overhead associated with calling the external pypandoc library by introducing intelligent caching, a lightweight Python-based converter for simple cases, and an optimized batch processing mechanism for more complex conversions. These changes are expected to lead to faster code generation by streamlining the docstring conversion process.

Highlights

Performance Optimization: Introduced a multi-layered optimization strategy for RST conversion, significantly reducing reliance on the external pypandoc library by implementing a caching mechanism and a fast, pure-Python Markdown-to-RST converter for common patterns.
Batch Conversion Capability: Added a new batch_convert_docstrings function that efficiently processes multiple docstrings. It first attempts the fast Python conversion and then groups any remaining complex conversions into a single pypandoc call, improving overall throughput.
Refactored rst Function: The core rst conversion function has been refactored to prioritize speed. It now includes a 'Super Fast Path' for plain text, checks a cache for previously converted strings, attempts the pure-Python conversion, and only as a last resort, falls back to pypandoc.
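
The layered lookup described in the highlights could be sketched roughly as follows. This is not the PR's implementation: `to_rst`, `convert_simple`, `_CACHE`, and the injected `fallback` callable are hypothetical names, and the pure-Python pass here handles only inline code so the example stays short. The markup check mirrors the character class from the original quick check in gapic/utils/rst.py.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical cache of already-converted strings.
_CACHE: Dict[str, str] = {}

# Characters that suggest Markdown formatting is present.
_MARKUP = re.compile(r"[|*`_\[\]]")


def convert_simple(text: str) -> Optional[str]:
    """Hypothetical pure-Python pass; returns None when the text is too complex."""
    if "|" in text:  # possible table: leave it to the fallback
        return None
    # Markdown inline code uses single backticks; RST uses double backticks.
    return re.sub(r"(?<!`)`([^`\n]+)`(?!`)", r"``\1``", text)


def to_rst(text: str, fallback: Callable[[str], str]) -> str:
    # 1. Plain text: nothing to convert.
    if not _MARKUP.search(text):
        return text
    # 2. Cache hit: reuse the earlier result.
    if text in _CACHE:
        return _CACHE[text]
    # 3. Try the cheap pure-Python conversion.
    result = convert_simple(text)
    # 4. Last resort: the expensive external converter.
    if result is None:
        result = fallback(text)
    _CACHE[text] = result
    return result
```

In a real setup the `fallback` argument would presumably wrap the pypandoc call; passing it in as a callable keeps the sketch runnable without pandoc installed.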

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces significant performance optimizations for Markdown to reStructuredText conversion by adding a caching layer, a fast-path pure-Python converter for simple cases, and batch processing for pypandoc calls. This is a great improvement. My review focuses on improving the robustness and correctness of these new additions. Specifically, I've pointed out a bug in heading conversion, suggested adding error handling for pypandoc calls to prevent crashes, and recommended restoring the detailed docstring for the rst function for better maintainability.

gemini-code-assist · 2025-12-09T12:50:04Z

gapic/utils/rst.py

+    converted = re.sub(r"^# (.*)$", r"\1\n" + "=" * 10, converted, flags=re.MULTILINE)
+    converted = re.sub(r"^## (.*)$", r"\1\n" + "-" * 10, converted, flags=re.MULTILINE)


The regular expression substitution for headings uses a fixed-length underline ('=' * 10 and '-' * 10). In reStructuredText, the underline for a heading must be at least as long as the heading text itself. A fixed length may be too short for longer headings, resulting in invalid RST.

You can fix this by using a lambda function in re.sub to dynamically set the underline length based on the matched heading text's length.

Suggested change

-    converted = re.sub(r"^# (.*)$", r"\1\n" + "=" * 10, converted, flags=re.MULTILINE)
-    converted = re.sub(r"^## (.*)$", r"\1\n" + "-" * 10, converted, flags=re.MULTILINE)
+    converted = re.sub(r"^# (.*)$", lambda m: f"{m.group(1)}\n{'=' * len(m.group(1))}", converted, flags=re.MULTILINE)
+    converted = re.sub(r"^## (.*)$", lambda m: f"{m.group(1)}\n{'-' * len(m.group(1))}", converted, flags=re.MULTILINE)
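
A standalone sketch of the length-matched underline, for anyone who wants to try the behavior outside the generator. The `convert_headings` wrapper and the sample input are illustrative only; the two regular expressions are the ones from the diff.

```python
import re


def convert_headings(text: str) -> str:
    """Convert '# Title' and '## Title' lines to RST headings."""

    def underline(char: str):
        # Build a replacement function whose underline matches the title length.
        def repl(match: re.Match) -> str:
            title = match.group(1)
            return f"{title}\n{char * len(title)}"

        return repl

    text = re.sub(r"^# (.*)$", underline("="), text, flags=re.MULTILINE)
    text = re.sub(r"^## (.*)$", underline("-"), text, flags=re.MULTILINE)
    return text


print(convert_headings("# A fairly long heading title\n\n## Short"))
```

With a fixed underline of 10 characters, the first heading in this sample would be underlined with fewer characters than its title, which is the case the review comment flags as invalid.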

gemini-code-assist · 2025-12-09T12:50:04Z

gapic/utils/rst.py

+            # The absolute last resort: Shell out to Pandoc
+            raw_rst = pypandoc.convert_text(
                text,
                "rst",
                format=source_format,
-                extra_args=["--columns=%d" % (width - indent)],
-            )
-            .strip()
-            .replace("\n", f"\n{' ' * indent}")
-        )
+                extra_args=["--columns=1000"]
+            ).strip()


The call to pypandoc.convert_text is not wrapped in a try...except block. If pypandoc fails for any reason (e.g., it's not installed correctly, or it encounters an input it can't handle), it will raise an exception and crash the entire generation process. It would be more robust to handle this exception and fall back gracefully, for instance by returning the original unformatted text.

    # The absolute last resort: Shell out to Pandoc
    try:
        raw_rst = pypandoc.convert_text(
            text,
            "rst",
            format=source_format,
            extra_args=["--columns=1000"],
        ).strip()
    except Exception:
        # If pandoc fails, it's better to have unformatted text
        # than to crash.
        raw_rst = text
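
The same fallback idea can be tried in isolation with a small wrapper. This is a sketch under assumptions: `convert_or_passthrough` is a hypothetical name, and the converter is passed in as a callable (standing in for the pypandoc call) so the example runs without pandoc. It also emits a warning, which the suggestion above does not do.

```python
import warnings
from typing import Callable


def convert_or_passthrough(text: str, convert: Callable[[str], str]) -> str:
    """Run `convert`; on any failure, warn and return the text unchanged."""
    try:
        return convert(text).strip()
    except Exception as exc:  # broad on purpose: the converter is external
        warnings.warn(f"conversion failed, keeping original text: {exc}")
        return text


def broken_converter(text: str) -> str:
    # Stand-in for a converter whose external binary is missing.
    raise OSError("pandoc not found")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = convert_or_passthrough("*hi*", broken_converter)

print(result, len(caught))
```

Returning the original text keeps generation going, at the cost of leaving unconverted Markdown in that docstring; the warning makes that trade-off visible.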

gemini-code-assist · 2025-12-09T12:50:04Z

gapic/utils/rst.py

+    except Exception:
+        return


Catching a broad Exception and silently returning can hide important issues with pypandoc or the environment. It's better to report the exception to aid in debugging potential failures in the batch conversion process. Using the warnings module is a good way to surface this without requiring logging configuration.

Suggested change

-    except Exception:
-        return
+    except Exception as e:
+        import warnings
+        warnings.warn(f"pypandoc batch conversion failed: {e}")
+        return

gemini-code-assist · 2025-12-09T12:50:05Z

gapic/utils/rst.py

+    if len(results) == len(pandoc_batch):
+        for original, converted in zip(pandoc_batch, results):
+            _RAW_RST_CACHE[original] = converted.strip()


If the number of converted results from pypandoc does not match the number of items in the batch, the function returns silently without processing the items. This could indicate a problem with the separator or with pypandoc's output. This condition should be reported to help with debugging.

    if len(results) != len(pandoc_batch):
        import warnings
        warnings.warn(
            "pypandoc batch conversion returned unexpected number of results. "
            f"Expected {len(pandoc_batch)}, got {len(results)}."
        )
        return
    for original, converted in zip(pandoc_batch, results):
        _RAW_RST_CACHE[original] = converted.strip()
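
For context, a batch conversion with both warnings in place might look roughly like the sketch below. It is not the PR's `batch_convert_docstrings`: the function name, the `SEPARATOR` token, and the injected `convert` callable are assumptions made so the example is self-contained and runs without pandoc.

```python
import warnings
from typing import Callable, Dict, List

# Hypothetical separator: a token unlikely to appear in a docstring.
SEPARATOR = "\n\nGAPICBATCHSEPARATOR\n\n"


def batch_convert(
    texts: List[str],
    convert: Callable[[str], str],
    cache: Dict[str, str],
) -> None:
    """Convert all uncached texts with one `convert` call and fill `cache`."""
    # De-duplicate while preserving order, and skip anything already cached.
    pending = [t for t in dict.fromkeys(texts) if t not in cache]
    if not pending:
        return
    try:
        combined = convert(SEPARATOR.join(pending))
    except Exception as exc:
        warnings.warn(f"batch conversion failed: {exc}")
        return
    # Split on the bare token, since a converter may alter surrounding whitespace.
    results = combined.split(SEPARATOR.strip())
    if len(results) != len(pending):
        warnings.warn(
            "batch conversion returned unexpected number of results. "
            f"Expected {len(pending)}, got {len(results)}."
        )
        return
    for original, converted in zip(pending, results):
        cache[original] = converted.strip()
```

The count check matters because a single lost or mangled separator would otherwise pair every later docstring with the wrong conversion.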

gemini-code-assist · 2025-12-09T12:50:05Z

gapic/utils/rst.py

-    # (This makes code generation significantly faster; calling out to pandoc
-    # is by far the most expensive thing we do.)
-    if not re.search(r"[|*`_[\]]", text):
+    """Convert the given text to ReStructured Text."""


The detailed docstring for the rst function, which explained the purpose of the width, indent, nl, and source_format parameters, has been removed. While the function body has changed, the signature and parameters remain the same. Restoring the detailed docstring would improve code clarity and maintainability for future developers.

    """Convert the given text to ReStructured Text.

    Args:
        text (str): The text to convert.
        width (int): The number of columns.
        indent (int): The number of columns to indent each line of text
            (except the first).
        nl (bool): Whether to append a trailing newline.
            Defaults to appending a newline if the result is more than
            one line long.
        source_format (str): The source format. This is ``commonmark`` by
            default, which is what is used by convention in protocol buffers.

    Returns:
        str: The same text, in RST format.
    """

optimize rst

dd3e090

product-auto-label bot added the "size: m" label (pull request size is medium) on Dec 9, 2025

gemini-code-assist bot reviewed Dec 9, 2025


ohmayr added 3 commits December 9, 2025 14:47

wip

f3db4bf

added detailed logs

42ff3eb

skip the heavy math for API.build pass 1

367a967
