Merged
83 changes: 54 additions & 29 deletions src/librustdoc/html/highlight.rs
@@ -789,6 +789,9 @@ impl<'a> Iterator for TokenIter<'a> {
}
}

/// Keywords that, when followed by a `!`, should never be treated as a macro invocation.
const NON_MACRO_KEYWORDS: &[&str] = &["if", "while", "match", "break", "return", "impl"];

/// This iterator is based on the same idea as `Peekable`, except that it allows peeking more than
/// just the next item by using `peek_next`. The `peek` method always returns the next item after
/// the current one, whereas `peek_next` returns the next item after the last one peeked.
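The `peek`/`peek_next` contract described in this doc comment can be illustrated with a stand-alone sketch. This is a simplified stand-in, not rustdoc's actual `PeekIter` implementation:

```rust
// Minimal "multi-peek" iterator: `peek` always returns the item right after
// the current one, while repeated `peek_next` calls walk further ahead
// without consuming anything. Illustrative only.
struct PeekIter<I: Iterator> {
    iter: I,
    buf: Vec<I::Item>, // items pulled from `iter` but not yet consumed
    peek_pos: usize,   // how far `peek_next` has walked ahead
}

impl<I: Iterator> PeekIter<I> {
    fn new(iter: I) -> Self {
        Self { iter, buf: Vec::new(), peek_pos: 0 }
    }

    /// Next item after the current one (resets the `peek_next` cursor).
    fn peek(&mut self) -> Option<&I::Item> {
        self.peek_pos = 1;
        if self.buf.is_empty() {
            let item = self.iter.next()?;
            self.buf.push(item);
        }
        self.buf.first()
    }

    /// Next item after the last one peeked.
    fn peek_next(&mut self) -> Option<&I::Item> {
        if self.buf.len() <= self.peek_pos {
            let item = self.iter.next()?;
            self.buf.push(item);
        }
        let item = self.buf.get(self.peek_pos);
        self.peek_pos += 1;
        item
    }
}

impl<I: Iterator> Iterator for PeekIter<I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        self.peek_pos = self.peek_pos.saturating_sub(1);
        if self.buf.is_empty() { self.iter.next() } else { Some(self.buf.remove(0)) }
    }
}

fn main() {
    let mut it = PeekIter::new([1, 2, 3, 4].into_iter());
    assert_eq!(it.peek(), Some(&1));      // next item
    assert_eq!(it.peek_next(), Some(&2)); // one past the last peek
    assert_eq!(it.peek_next(), Some(&3)); // keeps walking forward
    assert_eq!(it.next(), Some(1));       // consuming is unaffected
}
```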
@@ -1010,6 +1013,19 @@ impl<'src> Classifier<'src> {
}
}

fn new_macro_span(
&mut self,
text: &'src str,
sink: &mut dyn FnMut(Span, Highlight<'src>),
before: u32,
file_span: Span,
) {
self.in_macro = true;
let span = new_span(before, text, file_span);
sink(DUMMY_SP, Highlight::EnterSpan { class: Class::Macro(span) });
sink(span, Highlight::Token { text, class: None });
}

/// Single step of highlighting. This will classify `token`, but maybe also a couple of
/// following ones as well.
///
@@ -1216,16 +1232,46 @@ impl<'src> Classifier<'src> {
LiteralKind::Float { .. } | LiteralKind::Int { .. } => Class::Number,
},
TokenKind::GuardedStrPrefix => return no_highlight(sink),
TokenKind::Ident | TokenKind::RawIdent
if let Some((TokenKind::Bang, _)) = self.peek_non_trivia() =>
{
self.in_macro = true;
let span = new_span(before, text, file_span);
sink(DUMMY_SP, Highlight::EnterSpan { class: Class::Macro(span) });
sink(span, Highlight::Token { text, class: None });
TokenKind::RawIdent if let Some((TokenKind::Bang, _)) = self.peek_non_trivia() => {
self.new_macro_span(text, sink, before, file_span);
return;
}
TokenKind::Ident => self.classify_ident(before, text),
// Macro non-terminals (meta vars) take precedence.
TokenKind::Ident if self.in_macro_nonterminal => {
self.in_macro_nonterminal = false;
Class::MacroNonTerminal
}
TokenKind::Ident => {
let file_span = self.file_span;
let span = || new_span(before, text, file_span);

match text {
"ref" | "mut" => Class::RefKeyWord,
"false" | "true" => Class::Bool,
"self" | "Self" => Class::Self_(span()),
"Option" | "Result" => Class::PreludeTy(span()),
"Some" | "None" | "Ok" | "Err" => Class::PreludeVal(span()),
_ if self.is_weak_keyword(text) || is_keyword(Symbol::intern(text)) => {
// If it's not a keyword that can be followed by a value (like `if` or
// `return`) and the next non-whitespace token is a `!`, then we treat it
// as a macro invocation.
if !NON_MACRO_KEYWORDS.contains(&text)
&& matches!(self.peek_non_trivia(), Some((TokenKind::Bang, _)))
@fmease (Member) commented on lines 1255 to 1259, Nov 9, 2025:
Actually, why don't we exclude all keywords except for `self`, `super`, `crate` (`.is_path_segment_keyword()`) [1]? I.e., turning the list around, since you can't invoke arbitrary keywords as macros (unless `r#`'ed ofc, which we already handle).

For example, the list keeps growing, there's not only impl !Trait for () from above (so impl) but also #[cfg(false)] impl const !Trait for () {} (so const).

Footnotes

[1] Since `#[cfg(false)] self!(…)` and so on is valid even though you can't define a macro called `self` / `r#self`.
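The "turn the list around" idea can be sketched in isolation. Here `is_path_segment_keyword` is a local stand-in for rustc's `Symbol::is_path_segment_keyword` and `keyword_may_be_macro` is a hypothetical helper, not code from this PR:

```rust
// Hedged sketch of the inverted check: instead of listing keywords that must
// NOT be treated as macros, reject every keyword except the path segment
// keywords (`self`, `super`, `crate`), which can legally appear before `!` as
// a macro invocation under `#[cfg(false)]`.
fn is_path_segment_keyword(ident: &str) -> bool {
    // Local stand-in for `Symbol::is_path_segment_keyword`.
    matches!(ident, "self" | "super" | "crate")
}

/// Should `ident` followed by `!` be highlighted as a macro invocation?
/// `is_keyword` says whether `ident` is a Rust keyword.
fn keyword_may_be_macro(ident: &str, is_keyword: bool) -> bool {
    !is_keyword || is_path_segment_keyword(ident)
}

fn main() {
    assert!(keyword_may_be_macro("vec", false));   // ordinary ident: `vec!`
    assert!(keyword_may_be_macro("self", true));   // `self!(…)` can parse
    assert!(!keyword_may_be_macro("if", true));    // `if !cond` is not a macro
    assert!(!keyword_may_be_macro("const", true)); // `impl const !Trait for …`
}
```

As the thread notes, this still mishandles edition-sensitive keywords like `try`, which is why the PR keeps the allow-list approach.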

The PR author (Member) replied:
It doesn't work because try is not a path segment keyword, and yet you can name your macro try. So we're blocked on the constant. :')

@fmease (Member) replied, Nov 9, 2025:
Ah yeah, true, well until we start factoring in the requested/ambient edition (see #148221), then that will no longer pose a problem.

Still, we could invert this check by rejecting all keywords except self, super, crate (.is_path_segment_keyword()) as mentioned and edition-sensitive keywords like async, try, gen (HACK: .is_reserved(Rust2015) (since we've never unreserved a keyword so far)).

However, I guess it's fine to keep your current approach given a fix of #148221 would be the ultimate & proper fix.

The PR author (Member) replied:
Well, I had in the back of my mind to use the actual rustc parser for this to get an AST for quite some time (and completely forgot until @yotamofek just asked me today why we didn't do it 🤣 ).

@fmease (Member) replied, Nov 9, 2025:
cc @yotamofek

Well, it's an AST, not a CST (concrete syntax tree), so you can't really faithfully / losslessly reconstruct the source code unless you meticulously call span_to_snippet basically everywhere and I'm talking everywhere and even then it's basically impossible.

Of course, we don't necessarily need to reconstruct the source à la rustfmt and could just use it to splice the source string in a few selected places but that wouldn't allow us to highlight comments as they aren't represented in the AST obviously and some keywords most likely (again, we can do some span_to_snippet trickery but we will get this wrong similar to rustfmt which just swallows comments here & there (we would only fail to highlight things but still)).

I mean it's worth a try, maybe I'm missing some third approach that's miles better.

Okay, so we could follow a hybrid solution by lexing the source, going through the token stream like we do now to highlight comments only, then parse the token stream & use it to splice+highlight the source. Might be perf heavy. Still won't catch everything but alright, trade-offs are everywhere and it might be better than the current approach.

@yotamofek (Contributor) replied, Nov 9, 2025:
I might be saying silly things due to lack of knowledge,
but can't we just use the lexer, without invoking the parser, to generate a stream of Tokens? Then we "just" have to map a TokenKind to a CSS class?

I do wonder why rustfmt doesn't do that, though. Is it because they need AST-level information? But isn't rustfmt context-unaware?

I'll try to read up on it, maybe we can talk about it at tomorrow's meeting. Seems like it would simplify rustdoc quite a bit and be a much more robust solution, assuming it's actually feasible. But again - no idea what I'm talking about here 😁

Another contributor replied:
Yeah, was gonna say the lexer is probably much slower since it also allows for recovery, suggestions, and can't assume the code is actually syntactically valid (which we can, I think?)

@fmease (Member) replied, Nov 9, 2025:
Just for the sake of completeness, I'll mention it: while I'm pretty sure all the ASTs created at the start of the rustdoc process were dropped already, the HIR should still be around. We could in theory visit it and splice the source according to all the spans we find in the HIR.

Now, the biggest drawback of that will probably be syntactic sugar like for loops and async bodies (the latter have been turned into state machines at this point) which we might not be able to highlight easily or at all but I could be wrong.

@fmease (Member) replied, Nov 9, 2025:
> I might be saying silly things due to lack of knowledge,
> but can't we just use the lexer, without invoking the parser, to generate a stream of Tokens? Then we "just" have to map a TokenKind to a CSS class?

That's essentially what we're doing right now. We're currently only lexing the source using rustc_lexer and iterating through its Tokens (cc https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/enum.TokenKind.html).

rustc_parse's lexer, which you've mentioned, only transforms the token stream provided by rustc_lexer into a different representation that's "slightly easier" to parse. For all intents and purposes, however, they're the same thing in a different color; rustdoc's approach wouldn't change on a macro scale by switching over to it.
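The "map a TokenKind to a CSS class" idea boils down to something like the following toy sketch. The enum and the class names here are simplified stand-ins for `rustc_lexer::TokenKind` and rustdoc's `Class`, not the real types:

```rust
// Toy version of token-kind-to-CSS-class mapping: the lexer's token kinds are
// classified one by one, with identifier text inspected for keywords. Real
// rustdoc carries much more state (macros, meta vars, spans).
#[derive(Debug)]
enum TokenKind {
    Ident,
    Literal,
    LineComment,
    Bang,
    Whitespace,
}

/// CSS class to emit for a token, or `None` for no highlighting.
fn css_class(kind: &TokenKind, text: &str) -> Option<&'static str> {
    match kind {
        // A tiny keyword check standing in for `is_keyword(Symbol::intern(text))`.
        TokenKind::Ident if matches!(text, "fn" | "if" | "match") => Some("kw"),
        TokenKind::Ident => None,
        TokenKind::Literal => Some("number"),
        TokenKind::LineComment => Some("comment"),
        TokenKind::Bang | TokenKind::Whitespace => None,
    }
}

fn main() {
    assert_eq!(css_class(&TokenKind::Ident, "fn"), Some("kw"));
    assert_eq!(css_class(&TokenKind::Literal, "2"), Some("number"));
    assert_eq!(css_class(&TokenKind::Ident, "foo"), None);
}
```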

The contributor replied:
Oh right, got it. Thanks for the explanation!

{
self.new_macro_span(text, sink, before, file_span);
return;
}
Class::KeyWord
}
// If it's not a keyword and the next non-whitespace token is a `!`, then
// we treat it as a macro invocation.
_ if matches!(self.peek_non_trivia(), Some((TokenKind::Bang, _))) => {
self.new_macro_span(text, sink, before, file_span);
return;
}
_ => Class::Ident(span()),
}
}
TokenKind::RawIdent | TokenKind::UnknownPrefix | TokenKind::InvalidIdent => {
Class::Ident(new_span(before, text, file_span))
}
@@ -1246,27 +1292,6 @@ impl<'src> Classifier<'src> {
}
}

fn classify_ident(&mut self, before: u32, text: &'src str) -> Class {
// Macro non-terminals (meta vars) take precedence.
if self.in_macro_nonterminal {
self.in_macro_nonterminal = false;
return Class::MacroNonTerminal;
}

let file_span = self.file_span;
let span = || new_span(before, text, file_span);

match text {
"ref" | "mut" => Class::RefKeyWord,
"false" | "true" => Class::Bool,
"self" | "Self" => Class::Self_(span()),
"Option" | "Result" => Class::PreludeTy(span()),
"Some" | "None" | "Ok" | "Err" => Class::PreludeVal(span()),
_ if self.is_weak_keyword(text) || is_keyword(Symbol::intern(text)) => Class::KeyWord,
_ => Class::Ident(span()),
}
}

fn is_weak_keyword(&mut self, text: &str) -> bool {
// NOTE: `yeet` (`do yeet $expr`), `catch` (`do catch $block`), `default` (specialization),
// `contract_{ensures,requires}`, `builtin` (builtin_syntax) & `reuse` (fn_delegation) are
@@ -0,0 +1,13 @@
// This code crashed because an `if` followed by a `!` was considered a macro,
// creating an invalid class stack.
// Regression test for <https://github.com/rust-lang/rust/issues/148617>.

//@ compile-flags: -Zunstable-options --generate-macro-expansion

enum Enum {
Variant,
}

pub fn repro() {
if !matches!(Enum::Variant, Enum::Variant) {}
}
30 changes: 30 additions & 0 deletions tests/rustdoc/source-code-pages/keyword-macros.rs
@@ -0,0 +1,30 @@
// This test ensures that keywords which can be followed by values (and therefore by `!`)
// are not treated as macros.
// This is a regression test for <https://github.com/rust-lang/rust/issues/148617>.

#![crate_name = "foo"]
#![feature(negative_impls)]

//@ has 'src/foo/keyword-macros.rs.html'

//@ has - '//*[@class="rust"]//*[@class="number"]' '2'
//@ has - '//*[@class="rust"]//*[@class="number"]' '0'
//@ has - '//*[@class="rust"]//*[@class="number"]' '1'
const ARR: [u8; 2] = [!0,! 1];

trait X {}

//@ has - '//*[@class="rust"]//*[@class="kw"]' 'impl'
impl !X for i32 {}

fn a() {
A contributor commented:
Maybe add at least one test case where the ! is not separated by whitespace from the keyword?
e.g. if! true{}
To make sure it works, and doesn't regress in the future.

Also, do we have tests for !s following punctuation? e.g. something like const ARR: [u8; 2] = [!0,! 0];?

The PR author (Member) replied:
The same issue cannot happen with something other than idents but added it just in case.

//@ has - '//*[@class="rust"]//*[@class="kw"]' 'if'
if! true{}
//@ has - '//*[@class="rust"]//*[@class="kw"]' 'match'
match !true { _ => {} }
//@ has - '//*[@class="rust"]//*[@class="kw"]' 'while'
let _ = while !true {
//@ has - '//*[@class="rust"]//*[@class="kw"]' 'break'
break !true;
};
}