parser: add (?:...) non-capturing and (?<name>...) named capture group syntax #4
robertream wants to merge 14 commits into
…p syntax

Extends the POSIX ERE parser with two non-standard but widely-used group syntaxes:
- `(?:...)` (non-capturing group): parsed and matched but produces no StartCapture/EndCapture transitions, so N stays smaller
- `(?<name>...)` (named capture group): behaves identically to an unnamed group at the engine level; name is available via ERE::group_names()

New parser variants: NonCapturingSubexpression, NamedSubexpression.
New error: InvalidGroupName (empty or unclosed name).
New method: ERE::group_names(), a depth-first walk returning Vec<Option<String>> in group-number order, mirroring the simplified tree traversal.

attr-macro: add named struct field binding with #[group(N)]

Extends __compile_regex_attr to handle Fields::Named in addition to the existing Fields::Unnamed (tuple struct) path:
- Fields without #[group(N)]: matched to capture groups by field name via ERE::group_names(); field order is independent of regex group order
- Fields with #[group(N)]: bound to explicit capture group index N
- #[group(N)] attributes are stripped from the emitted struct so the compiler does not see an unknown attribute
- Out-of-bounds #[group(N)] index emits a clear compile error instead of panicking

Also adds doc comment for Regex::exec describing the [Option<&str>; N] return shape and when None is returned.

tests: add coverage for named groups, non-capturing groups, and #[group(N)]

- compile_regex_group_extensions: verifies mixed (?<name>...)/(?:...) pattern compiles as Regex<3> across all four engines (non-capturing group excluded from N, named groups count same as unnamed)
- non_capturing_group, named_capture_group_tuple_struct, named_field_struct: attr-macro smoke tests for new syntax with tuple and named structs
- compile_fail tests via trybuild: unknown_field_name and group_index_out_of_bounds confirm macro emits clear compile errors

docs: document named groups, non-capturing groups, and #[group(N)] (with doc tests)

- README: add 'Named Groups and Non-Capturing Groups' section with live code examples for (?:...) and (?<name>...), plus a rust,ignore example for #[group(N)] (feature-gated, so not doc-tested by default)
- ROADMAP: mark (?:...) non-capturing groups as [x] shipped; add [x] entry for (?<name>...) named capture groups
- ere-macros: extend #[regex(...)] doc with two live doc-test examples: named struct with field-order independence, and optional named group mapping to Option<&'a str>
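The group-numbering behavior described in this commit can be illustrated with a minimal sketch. The `Node` enum below is a hypothetical, simplified stand-in for the parser's tree (the real variant names and shapes are not shown in this thread), and the sketch assumes that index 0 of the result stands for the whole match.

```rust
// Hypothetical, simplified parse tree; the real parser's types differ.
#[allow(dead_code)]
enum Node {
    Literal(char),
    Concat(Vec<Node>),
    Capture(Box<Node>),       // (...)
    NonCapturing(Box<Node>),  // (?:...)
    Named(String, Box<Node>), // (?<name>...)
}

/// Depth-first, left-to-right walk that returns one entry per capture group,
/// in group-number order. Assumption: index 0 is the whole match (unnamed).
fn group_names(root: &Node) -> Vec<Option<String>> {
    fn walk(node: &Node, out: &mut Vec<Option<String>>) {
        match node {
            Node::Literal(_) => {}
            Node::Concat(children) => {
                for child in children {
                    walk(child, out);
                }
            }
            Node::Capture(inner) => {
                out.push(None); // unnamed group still gets a number
                walk(inner, out);
            }
            // Non-capturing groups contribute no entry, so N stays smaller.
            Node::NonCapturing(inner) => walk(inner, out),
            Node::Named(name, inner) => {
                out.push(Some(name.clone()));
                walk(inner, out);
            }
        }
    }
    let mut out = vec![None];
    walk(root, &mut out);
    out
}

fn main() {
    // Models (?<year>a)(?:b)(c): group 0 plus two capturing groups.
    let tree = Node::Concat(vec![
        Node::Named("year".to_string(), Box::new(Node::Literal('a'))),
        Node::NonCapturing(Box::new(Node::Literal('b'))),
        Node::Capture(Box::new(Node::Literal('c'))),
    ]);
    let names = group_names(&tree);
    assert_eq!(names.len(), 3);
    println!("{:?}", names);
}
```

In this model the length of the returned vector plays the role of `N` in `Regex<N>`, which is why the non-capturing group does not increase it.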
jtmoon79
left a comment
I'm curious, was this AI created?
Either way, LGTM!
|
@jtmoon79 Of course, who even has the time to write artisanal handwritten code anymore? That's sooo 2025! :) |
|
On second thought, this is emitting a number of warnings during the build: |
Validate all capture groups (not just named ones) have corresponding struct fields, and point error spans to the regex attribute literal instead of the field list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove vestigial used_named_groups (replaced by used_groups), and cfg-gate group_names/collect_group_names behind unstable-attr-regex since they are only called from that feature's code path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove orphaned nfa_static serialization pipeline, unused U8NFA construction helpers (superseded by build()), duplicate make_label, unused Run::start_state field, Atom::serialize_check, and SimplifiedTreeNode::from_ere_no_group0. Suppress warnings for intentionally kept symmetric API (with_offset methods), in-progress visualization scaffolding, and QuantifierType::min/max.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
I used Codex on the initial PR; I haven't figured out a workflow with Codex that is as competent and thorough as Claude yet. There were build warnings with the original PR. I included fixes for them here, but Claude deleted some dead code. I'm not sure if that's what you would want, but it's git, so you can always have Claude go back and patch based on history. :) |
|
Thanks for the update @robertream! Some feedback: The changes from 21fbb45 now don't allow unnamed capture groups to be undeclared in the struct. For example, I want to match on this unusual datestring with these contrived potential formats
Previously this worked:
#[derive(Debug, PartialEq)]
#[regex(r"^(?<year>[21][0-9]{3})(-|-=)(?<month>0[1-9]|1[0-2])(-|-=)(?<day>[0123][0-9])")]
struct ERERegex3<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    month: &'a str,
    day: &'a str,
}
but now I get an error
I could declare the unnamed groups by group index
#[derive(Debug, PartialEq)]
#[regex(r"^(?<year>[21][0-9]{3})(-|-=)(?<month>0[1-9]|1[0-2])(-|-=)(?<day>[0123][0-9])")]
struct ERERegex3<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    #[group(2)]
    _a: &'a str, // for the first separator
    month: &'a str,
    #[group(4)]
    _b: &'a str, // for the second separator
    day: &'a str,
}
But I really liked the prior behavior where I could just ignore declaring the unnamed groups. I am dealing with some very long regexes, and having to declare the uninteresting capture groups is tedious. What would be ideal is giving the user the option of
|
Unnamed capture groups (e.g. `(-|-=)`) no longer require corresponding struct fields when using `#[regex]` on a named struct. Only named capture groups must be bound. This restores the prior behavior where users could ignore uninteresting groups in long regexes without adding tedious `#[group(N)]` fields. Named groups and tuple struct validation remain unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8d50693 fixed this. |
|
Just curious @robertream , since you are so far into this with your AI code assistant, what about this change?
I'm just thinking, if it's a little more churning by your trained AI, could it be an easy feature-addition? |
|
On second thought @robertream , do you know why so much code in |
|
it was causing warnings, for dead code |
…ation
Adds `bind = Strict | Named | None` optional parameter to `#[regex(...)]`:
- Strict: all capture groups (named and unnamed) must have struct fields
- Named: only named groups must be bound, unnamed are silently skipped (default)
- None: no groups are required, only declared fields are populated
Default is `Named`, preserving backwards compatibility. Example:
#[regex(r"^(?<year>\d{4})(-|/)(?<month>\d{2})$", bind = None)]
struct YearOnly<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
}
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
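The three `bind` modes described in this commit can be modeled as a small validation pass. This is only a sketch under stated assumptions: `Bind`, `missing_bindings`, and its parameters are hypothetical names, the real macro works on token streams rather than slices, and group 0 (the whole match) is assumed never to be required.

```rust
// Hypothetical model of the `bind` policy; not the macro's actual code.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Bind {
    Strict,
    Named,
    None,
}

/// `groups[i]` is the name of capture group `i` (`None` for unnamed groups).
/// `bound_names` are fields bound by name; `bound_indices` come from #[group(N)].
/// Returns one error message per capture group the policy requires but no field binds.
fn missing_bindings(
    bind: Bind,
    groups: &[Option<&str>],
    bound_names: &[&str],
    bound_indices: &[usize],
) -> Vec<String> {
    let mut errors = Vec::new();
    // Assumption: group 0 (whole match) is always optional, so start at 1.
    for (i, &name) in groups.iter().enumerate().skip(1) {
        let bound = bound_indices.contains(&i)
            || name.map_or(false, |n| bound_names.contains(&n));
        if bound {
            continue;
        }
        match (bind, name) {
            (Bind::None, _) => {}        // nothing is required
            (Bind::Named, None) => {}    // unnamed groups are silently skipped
            (_, Some(n)) => errors.push(format!(
                "Named capture group '{n}' has no corresponding field in the struct."
            )),
            (Bind::Strict, None) => errors.push(format!(
                "Capture group {i} has no corresponding field in the struct."
            )),
        }
    }
    errors
}

fn main() {
    // Groups of ^(?<year>\d{4})(-|/)(?<month>\d{2})$ with fields `matched` (#[group(0)]) and `year`.
    let groups = [None, Some("year"), None, Some("month")];
    for bind in [Bind::Strict, Bind::Named, Bind::None] {
        let errors = missing_bindings(bind, &groups, &["year"], &[0]);
        println!("{bind:?}: {} error(s)", errors.len());
    }
}
```

With the `YearOnly` struct from the example above, only `Bind::None` accepts the struct as written, because `month` is a named group with no field.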
|
376c681 amazing! |
|
I suggest modifying the docstring for |
|
Okay, one more request @robertream . Of course, I know you are working for free, so no worries if "no". Is it possible to also allow selecting the underlying engine, i.e. an |
Shows bind = Named (default, skips unnamed groups with a full timestamp regex), bind = Strict (all groups must have fields), and bind = None (only declared fields populated). Includes note on compile error behavior when required groups are missing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
I have found that Claude does so well with Rust that this didn't take me more than a handful of iterations with Claude. I recommend you at least get the $100 subscription; it is at least a 5x productivity power boost. It's made programming so much more fun again. :) |
Adds `engine = Auto | OnePassU8 | DfaU8 | FlatLockstepNfaU8 | FlatLockstepNfa | FixedOffset` optional parameter to `#[regex(...)]`. Default is Auto (existing pick_engine behavior). Engines that cannot handle the regex produce compile errors pointing at the regex literal.
Also refactors the attribute parser to a loop supporting multiple optional params in any order with trailing comma support, and consolidates unit tests: doctests now serve as primary tests for bind and engine, with unit tests only for edge cases not in docs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
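The "loop supporting multiple optional params in any order with trailing comma support" can be sketched on plain strings. This is an illustration only: the actual macro parses a token stream, and `Params`, `parse_params`, and the error wording here are hypothetical. The accepted values are the ones listed in the commit messages.

```rust
// Hypothetical string-based model of the optional-parameter loop.
#[derive(Debug, Default, PartialEq)]
struct Params {
    bind: Option<String>,
    engine: Option<String>,
}

const BIND_VALUES: [&str; 3] = ["Strict", "Named", "None"];
const ENGINE_VALUES: [&str; 6] = [
    "Auto",
    "OnePassU8",
    "DfaU8",
    "FlatLockstepNfaU8",
    "FlatLockstepNfa",
    "FixedOffset",
];

/// Parses the part of the attribute after the regex literal, e.g. `bind = None, engine = DfaU8,`.
fn parse_params(input: &str) -> Result<Params, String> {
    let mut params = Params::default();
    let input = input.trim();
    // A single trailing comma is allowed.
    let input = input.strip_suffix(',').unwrap_or(input);
    if input.trim().is_empty() {
        return Ok(params);
    }
    for part in input.split(',') {
        let (key, value) = part
            .split_once('=')
            .ok_or_else(|| format!("expected `key = value`, found `{}`", part.trim()))?;
        let (key, value) = (key.trim(), value.trim());
        // Pick the destination slot and the allowed values by key, in any order.
        let (slot, allowed) = match key {
            "bind" => (&mut params.bind, &BIND_VALUES[..]),
            "engine" => (&mut params.engine, &ENGINE_VALUES[..]),
            other => return Err(format!("unknown parameter `{other}`")),
        };
        if !allowed.contains(&value) {
            return Err(format!("invalid value `{value}` for `{key}`"));
        }
        if slot.replace(value.to_string()).is_some() {
            return Err(format!("duplicate parameter `{key}`"));
        }
    }
    Ok(params)
}

fn main() {
    let params = parse_params("engine = DfaU8, bind = None,").expect("valid params");
    println!("{params:?}");
}
```

Leaving a parameter out keeps its slot as `None`, which a caller could map to the documented defaults (`Named` for `bind`, `Auto` for `engine`).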
|
(many more errors) |
The #[cfg(feature = "unstable-attr-regex")] gate was accidentally dropped when the Engine enum was inserted above the function, causing build failures without the feature flag enabled.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
I should have had it running the CI build locally before commit/push. Is there a reason why the CI build isn't running on PR? |
IIUC, it's because GitHub changed the default behavior for this code
name: Cargo Test
on: [push, pull_request]
jobs:
  test:
    name: Cargo Test
    ...
to only run for PRs from approved users. Project owners have to change their project policy to allow unapproved PRs to run the workflows. It was due to a security flaw where attackers would clone+PR a project, the project workflows would run, and the malicious workflows in the PR would dump secrets to the logs. |
|
In case @robertream or @2kai2kai2 is curious, here are my benchmark results.
Background
First some explanation and code snippets: I benchmarked crates
PATTERN = r"^[\[\(](?<year>[0-9]{4})[/-](?<month>0[1-9]|1[0-2])[/-](?<day>[0-9]{2}) (?<hour>[0-9]{2}):(?<minute>[0-9]{2}):(?<second>[0-9]{2})(?<fractional>(\.|,)[0-9]{6})(\]|\))";
The following four haystacks are matched against; the last two fail to match
const HAYSTACK1: &[u8] =
    b"[2001/01/01 11:21:12.111222] ../source3/smbd/oplock.c:1340(init_oplocks)";
const HAYSTACK2: &[u8] =
    b"(2003-03-04 23:34:44,333444) ../source3/smbd/oplock.c:1340(init_oplocks)";
const HAYSTACK3: &[u8] =
    b"2005-05-06 05:06:56.555666 ../source3/smbd/oplock.c:1340(init_oplocks)";
const HAYSTACK4: &[u8] =
    b"[2007/07/08 17:18:58,777888 ../source3/smbd/oplock.c:1340(init_oplocks)";
(those haystack strings are from smbd log messages)
Each benchmark function:
Results
Benchmark names are
|
|
@robertream it looks like the
#[regex(
    r"^(?<year>[12][0-9]{3})",
    bind = None
)]
struct ERERegex__try6<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    month: &'a str,
}
The error is
Wasn't
I also tried declaring
#[regex(
    r"^(?<year>[12][0-9]{3})",
    bind = None
)]
struct ERERegex__try6<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    month: Option<&'a str>,
}
which failed with the same error.
... after reviewing the code diff ...
Oh! Whoops, this was not supported.
@robertream (feature request incoming!) could that arrangement ☝ be allowed? i.e. the declared struct has fields that may never be filled, when declared as
Another acceptable arrangement would be an additional attribute like
#[regex(
    r"^(?<year>[12][0-9]{3})"
)]
struct ERERegex__try6<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    #[bind_optional]
    month: Option<&'a str>,
} |
…ive ASCII matching
Transforms the parse tree so ASCII letters match both cases. Handles NormalChar, bracket expression Singles, and Ranges (including mixed-type ranges like [0-F] where only the alphabetic portion is folded). POSIX character classes like [:lower:] and [:upper:] are intentionally not affected, as documented in the docstring.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
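The mixed-type range case mentioned in this commit (`[0-F]`, where only the alphabetic portion is folded) can be sketched as follows. `fold_range` and `matches` are hypothetical helpers operating on byte ranges, not the crate's parse-tree transform, and POSIX character classes are not modeled.

```rust
/// Returns the byte ranges that `[lo-hi]` should match under ASCII
/// case-insensitivity: the original range, plus the opposite-case copy
/// of whatever part of it overlaps A-Z or a-z.
fn fold_range(lo: u8, hi: u8) -> Vec<(u8, u8)> {
    let mut out = vec![(lo, hi)];
    // Overlap with uppercase letters: add the lowercase counterpart.
    let (upper_lo, upper_hi) = (lo.max(b'A'), hi.min(b'Z'));
    if upper_lo <= upper_hi {
        out.push((upper_lo.to_ascii_lowercase(), upper_hi.to_ascii_lowercase()));
    }
    // Overlap with lowercase letters: add the uppercase counterpart.
    let (lower_lo, lower_hi) = (lo.max(b'a'), hi.min(b'z'));
    if lower_lo <= lower_hi {
        out.push((lower_lo.to_ascii_uppercase(), lower_hi.to_ascii_uppercase()));
    }
    out
}

/// True if `byte` falls in any of the inclusive ranges.
fn matches(ranges: &[(u8, u8)], byte: u8) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= byte && byte <= hi)
}

fn main() {
    // [0-F] covers digits, some punctuation, and A-F; only A-F gains a-f.
    let ranges = fold_range(b'0', b'F');
    assert!(matches(&ranges, b'7'));
    assert!(matches(&ranges, b'c'));
    assert!(!matches(&ranges, b'g'));
    println!("{:?}", ranges);
}
```

Digits and punctuation inside the range are left alone because they have no case counterpart, which is the "only the alphabetic portion is folded" behavior.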
…t `bind` on tuple structs
With `bind=None`, named struct fields without a matching capture group are assigned `None` (field must be `Option<T>`). This enables structs that declare fields for future use or shared across multiple regexes.
Also: `bind` on tuple structs now produces a compile error instead of being silently ignored. And `#[group(N)]` now emits explicit errors for malformed values instead of silently falling through.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s, and bind validation
- ascii_case_insensitive with bracket expressions and mixed-type ranges
- unbound Option fields with bind=None
- compile-fail: unbound field rejected in default bind=Named mode
- restore unbound_named_field.stderr to original error
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks @robertream 😄 This is okay
#[regex(
    r"^(?<year>[12][0-9]{3})",
    bind = None
)]
struct ERERegex__try6<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    month: Option<&'a str>,
}
But when I remove the
#[regex(
    r"^(?<year>[12][0-9]{3})",
    bind = None
)]
struct ERERegex__try6<'a> {
    #[group(0)]
    matched: &'a str,
    year: &'a str,
    month: &'a str,
}
error
Perfect! 🚀 |
|
@robertream may I get a modification to the prior change to allow even more flexibility? Currently, this code results in an error
#[regex(
    r"^(?<year>[12][0-9]{3})",
    bind = None
)]
struct ERERegex__try6<'a> {
    #[group(0)]
    matched: &'a str,
    year: Option<&'a str>
}
The error is
It is due to field
When |
…one mode
Uses .into() on non-optional captures so the field can be either &str (identity) or Option<&str> (via From<T> for Option<T>), without needing type detection in the macro. Also shortens the internal .expect() message.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
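The `.into()` trick in this commit relies on two blanket impls in the standard library: `From<T> for T` (identity) and `From<T> for Option<T>`. A minimal sketch of why the same generated expression fits both field types follows; the struct and function names are hypothetical, and this is not the macro's actual expansion.

```rust
// Two hypothetical target structs: the field type differs, the initializer does not.
struct Required<'a> {
    year: &'a str,
}

struct Optional<'a> {
    year: Option<&'a str>,
}

// `capture.into()` resolves to the identity conversion (`From<T> for T`).
fn bind_required(capture: &str) -> Required<'_> {
    Required { year: capture.into() }
}

// The same expression resolves to `From<T> for Option<T>`, yielding `Some(capture)`.
fn bind_optional(capture: &str) -> Optional<'_> {
    Optional { year: capture.into() }
}

fn main() {
    let capture = "2024";
    assert_eq!(bind_required(capture).year, "2024");
    assert_eq!(bind_optional(capture).year, Some("2024"));
    println!("both field types accept the same `.into()` expression");
}
```

Because the field's declared type drives inference, generated code in this style would not need to inspect whether a field is written as `&str` or `Option<&str>`.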
|
a199e6a works! 😄 |
|
|
I am considering using this functionality in a project I'm working on. It would be helpful if you could merge this PR soon. :) |

Extends the POSIX ERE parser with two non-standard but widely-used group syntaxes:
- `(?:...)` (non-capturing group): parsed and matched but produces no StartCapture/EndCapture transitions, so N stays smaller
- `(?<name>...)` (named capture group): behaves identically to an unnamed group at the engine level; name is available via `ERE::group_names()`

New parser variants: `NonCapturingSubexpression`, `NamedSubexpression`.
New error: `InvalidGroupName` (empty or unclosed name).
New method: `ERE::group_names()`, a depth-first walk returning `Vec<Option<String>>` in group-number order, mirroring the simplified tree traversal.

attr-macro: add named struct field binding with `#[group(N)]`

Extends `__compile_regex_attr` to handle `Fields::Named` in addition to the existing `Fields::Unnamed` (tuple struct) path:
- Fields without `#[group(N)]`: matched to capture groups by field name via `ERE::group_names()`; field order is independent of regex group order
- Fields with `#[group(N)]`: bound to explicit capture group index `N`
- `#[group(N)]` attributes are stripped from the emitted struct so the compiler does not see an unknown attribute
- Out-of-bounds `#[group(N)]` index emits a clear compile error instead of panicking

Also adds doc comment for `Regex::exec` describing the `[Option<&str>; N]` return shape and when `None` is returned.

Capture group validation

- "Named capture group 'x' has no corresponding field in the struct."
- "Capture group N has no corresponding field in the struct. Add a field like '#[group(N)] captured: &'a str'."

tests: add coverage for named groups, non-capturing groups, and `#[group(N)]`

- `compile_regex_group_extensions`: verifies mixed `(?<name>...)`/`(?:...)` pattern compiles as `Regex<3>` across all four engines (non-capturing group excluded from `N`, named groups count same as unnamed)
- `non_capturing_group`, `named_capture_group_tuple_struct`, `named_field_struct`: attr-macro smoke tests for new syntax with tuple and named structs
- compile_fail tests via trybuild: `unknown_field_name`, `group_index_out_of_bounds`, `missing_named_capture_field`, `missing_unnamed_capture_group_binding` confirm macro emits clear compile errors

docs: document named groups, non-capturing groups, and `#[group(N)]` (with doc tests)

- README: add 'Named Groups and Non-Capturing Groups' section with live code examples for `(?:...)` and `(?<name>...)`, plus a `rust,ignore` example for `#[group(N)]` (feature-gated, so not doc-tested by default)
- ROADMAP: mark `(?:...)` non-capturing groups as `[x]` shipped; add `[x]` entry for `(?<name>...)` named capture groups
- ere-macros: extend `#[regex(...)]` docs with two live doc-test examples: named struct with field-order independence, and optional named group mapping to `Option<&'a str>`

Cleanup: remove pre-existing dead code

Removed orphaned nfa_static serialization pipeline, unused U8NFA construction helpers (superseded by `build()`), duplicate `make_label`, unused `Run::start_state` field, `Atom::serialize_check`, and `SimplifiedTreeNode::from_ere_no_group0`. Suppressed warnings for intentionally kept symmetric API (`with_offset` methods), in-progress visualization scaffolding, and `QuantifierType::min/max`.

Validation

- `cargo build`: zero warnings
- `cargo test`: all pass
- `cargo test -F unstable-attr-regex`: all pass (including 4 compile-fail tests)

Closes #2