-
Notifications
You must be signed in to change notification settings - Fork 74
optimize rst #2505
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
optimize rst #2505
Conversation
Summary of ChangesHello @ohmayr, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on enhancing the performance of Markdown to reStructuredText (RST) conversion within the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces significant performance optimizations for Markdown to reStructuredText conversion by adding a caching layer, a fast-path pure-Python converter for simple cases, and batch processing for pypandoc calls. This is a great improvement. My review focuses on improving the robustness and correctness of these new additions. Specifically, I've pointed out a bug in heading conversion, suggested adding error handling for pypandoc calls to prevent crashes, and recommended restoring the detailed docstring for the rst function for better maintainability.
| converted = re.sub(r"^# (.*)$", r"\1\n" + "=" * 10, converted, flags=re.MULTILINE) | ||
| converted = re.sub(r"^## (.*)$", r"\1\n" + "-" * 10, converted, flags=re.MULTILINE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regular expression substitution for headings uses a fixed-length underline ('=' * 10 and '-' * 10). In reStructuredText, the underline for a heading must be at least as long as the heading text itself. A fixed length may be too short for longer headings, resulting in invalid RST.
You can fix this by using a lambda function in re.sub to dynamically set the underline length based on the matched heading text's length.
| converted = re.sub(r"^# (.*)$", r"\1\n" + "=" * 10, converted, flags=re.MULTILINE) | |
| converted = re.sub(r"^## (.*)$", r"\1\n" + "-" * 10, converted, flags=re.MULTILINE) | |
| converted = re.sub(r"^# (.*)$", lambda m: f"{m.group(1)}\n{'=' * len(m.group(1))}", converted, flags=re.MULTILINE) | |
| converted = re.sub(r"^## (.*)$", lambda m: f"{m.group(1)}\n{'-' * len(m.group(1))}", converted, flags=re.MULTILINE) |
gapic/utils/rst.py
Outdated
| # The absolute last resort: Shell out to Pandoc | ||
| raw_rst = pypandoc.convert_text( | ||
| text, | ||
| "rst", | ||
| format=source_format, | ||
| extra_args=["--columns=%d" % (width - indent)], | ||
| ) | ||
| .strip() | ||
| .replace("\n", f"\n{' ' * indent}") | ||
| ) | ||
| extra_args=["--columns=1000"] | ||
| ).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to pypandoc.convert_text is not wrapped in a try...except block. If pypandoc fails for any reason (e.g., it's not installed correctly, or it encounters an input it can't handle), it will raise an exception and crash the entire generation process. It would be more robust to handle this exception and fall back gracefully, for instance by returning the original unformatted text.
# The absolute last resort: Shell out to Pandoc
try:
raw_rst = pypandoc.convert_text(
text,
"rst",
format=source_format,
extra_args=["--columns=1000"],
).strip()
except Exception:
# If pandoc fails, it's better to have unformatted text
# than to crash.
raw_rst = text
gapic/utils/rst.py
Outdated
| except Exception: | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching a bare Exception and silently returning can hide important issues with pypandoc or the environment. It's better to log the exception to aid in debugging potential failures in the batch conversion process. Using the warnings module is a good way to surface this without requiring logging configuration.
| except Exception: | |
| return | |
| except Exception as e: | |
| import warnings | |
| warnings.warn(f"pypandoc batch conversion failed: {e}") | |
| return |
gapic/utils/rst.py
Outdated
| if len(results) == len(pandoc_batch): | ||
| for original, converted in zip(pandoc_batch, results): | ||
| _RAW_RST_CACHE[original] = converted.strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the number of converted results from pypandoc does not match the number of items in the batch, the function returns silently without processing the items. This could indicate a problem with the separator or with pypandoc's output. This condition should be reported to help with debugging.
if len(results) != len(pandoc_batch):
import warnings
warnings.warn(
"pypandoc batch conversion returned unexpected number of results. "
f"Expected {len(pandoc_batch)}, got {len(results)}."
)
return
for original, converted in zip(pandoc_batch, results):
_RAW_RST_CACHE[original] = converted.strip()
gapic/utils/rst.py
Outdated
| # (This makes code generation significantly faster; calling out to pandoc | ||
| # is by far the most expensive thing we do.) | ||
| if not re.search(r"[|*`_[\]]", text): | ||
| """Convert the given text to ReStructured Text.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The detailed docstring for the rst function, which explained the purpose of the width, indent, nl, and source_format parameters, has been removed. While the function body has changed, the signature and parameters remain the same. Restoring the detailed docstring would improve code clarity and maintainability for future developers.
"""Convert the given text to ReStructured Text.
Args:
text (str): The text to convert.
width (int): The number of columns.
indent (int): The number of columns to indent each line of text
(except the first).
nl (bool): Whether to append a trailing newline.
Defaults to appending a newline if the result is more than
one line long.
source_format (str): The source format. This is ``commonmark`` by
default, which is what is used by convention in protocol buffers.
Returns:
str: The same text, in RST format.
"""
This PR optimizes
gapic/utils/rst.pyto use native Python regex for the majority of docstring conversions (Markdown to RST), resolving a major performance bottleneck caused by shelling out to pandoc. pypandoc is retained only as a fallback for complex tables.Performance Impact: Reduces generation time per API from ~52s to ~10s (for google-cloud-discoveryengine).
Explanation of Diffs
This PR introduces a large number of docstring changes. These are expected and acceptable for the following reasons:
Text Reflow (Cosmetic): We switched from Pandoc's text wrapping algorithm to Python's textwrap. This causes line breaks to shift, but the content remains semantically identical.
Link Fixes (Corrective): Reference-style links (e.g., [Message][pkg.Message]) were previously ignored by Pandoc and rendered as plain text. This change correctly converts them into valid RST hyperlinks.
List Markers (Cosmetic): The script now preserves the original * list bullets used in the source protos, whereas Pandoc normalized them to -.