diff --git a/doc/Index.md b/doc/Index.md index 22bf0fdc..ff9b052c 100644 --- a/doc/Index.md +++ b/doc/Index.md @@ -3,7 +3,6 @@ [![Gem Version](https://badge.fury.io/rb/lrama.svg)](https://badge.fury.io/rb/lrama) [![build](https://github.com/ruby/lrama/actions/workflows/test.yaml/badge.svg)](https://github.com/ruby/lrama/actions/workflows/test.yaml) - ## Overview Lrama is LALR (1) parser generator written by Ruby. The first goal of this project is providing error tolerant parser for CRuby with minimal changes on CRuby parse.y file. @@ -47,6 +46,29 @@ Enter the formula: => 9 ``` +## Documentation (Draft) + +Chapters are split into individual files under `doc/` to make the structure easy to extend. + +1. [Concepts](chapters/01-concepts.md) +2. [Examples](chapters/02-examples.md) +3. [Grammar Files](chapters/03-grammar-files.md) +4. [Parser Interface](chapters/04-parser-interface.md) +5. [Parser Algorithm](chapters/05-parser-algorithm.md) +6. [Error Recovery](chapters/06-error-recovery.md) +7. [Handling Context Dependencies](chapters/07-context-dependencies.md) +8. [Debugging](chapters/08-debugging.md) +9. [Invoking Lrama](chapters/09-invoking-lrama.md) +10. [Parsers in Other Languages](chapters/10-other-languages.md) +11. [History](chapters/11-history.md) +12. [Version Compatibility](chapters/12-version-compatibility.md) +13. [FAQ](chapters/13-faq.md) + +## Development + +1. [Compressed State Table](development/compressed_state_table/main.md) +2. [Profiling](development/profiling.md) + ## Supported Ruby version Lrama is executed with BASERUBY when building ruby from source code. Therefore Lrama needs to support BASERUBY, currently 3.1, or later version. diff --git a/doc/chapters/01-concepts.md b/doc/chapters/01-concepts.md new file mode 100644 index 00000000..37c57eff --- /dev/null +++ b/doc/chapters/01-concepts.md @@ -0,0 +1,47 @@ +# Concepts + +This section introduces the ideas behind Lrama and how it differs from GNU Bison. 
+Lrama is a Ruby implementation of an LALR(1) parser generator, built to be a +drop-in replacement for Bison in the CRuby build toolchain while keeping +compatibility with Bison-style grammars. + +## Lrama at a glance + +- **LALR(1) parser generator**: Lrama produces C parsers from grammar files. +- **Bison-style grammar files**: Most Bison directives are accepted, but there + are compatibility constraints (see below). +- **Error tolerant parsing**: Lrama can generate parsers that attempt recovery + using a subset of the algorithm described in *Repairing Syntax Errors in LR + Parsers*. +- **Ruby-focused**: Lrama is written in Ruby and is used in the CRuby build + process. + +## Compatibility assumptions + +Lrama is not a full Bison reimplementation. It intentionally assumes the +following Bison configuration when reading a grammar file: + +- `b4_locations_if` is always true (location tracking is enabled). +- `b4_pure_if` is always true (pure parser). +- `b4_pull_if` is always false (no pull parser interface). +- `b4_lac_if` is always false (no LAC). + +These assumptions simplify the code generation path and reflect how CRuby uses +a Bison-compatible parser. + +## Inputs and outputs + +A typical Lrama run takes a `.y` grammar file and produces: + +- A parser implementation in C (default `y.tab.c`, or the file passed by `-o`). +- A header file (`y.tab.h` by default) when `-d` or `-H` is provided. +- Optional reports (`--report` / `--report-file`). +- Optional syntax diagram output (`--diagram`). + +## Workflow stages + +1. Write a grammar file (`.y`) using Bison-compatible syntax. +2. Run Lrama to generate the parser C code. +3. Compile the generated C code with the rest of your project. + +For worked examples, see the [Examples](02-examples.md) chapter. 
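The pure-parser and location assumptions listed under "Compatibility assumptions" imply that a lexer returns each token's value and location through pointers rather than through globals. The following is a minimal sketch of that calling style in C, not code taken from Lrama or its samples. Every name in it (`demo_lexer`, `demo_lex`, `DEMO_INTEGER`, `DEMO_STYPE`, `DEMO_LTYPE`) is hypothetical; a real project would use the `YYSTYPE`, `YYLTYPE`, and token definitions from the generated header instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the YYSTYPE / YYLTYPE types that a generated
 * header would normally provide. */
typedef int DEMO_STYPE;
typedef struct {
    int first_line, first_column;
    int last_line, last_column;
} DEMO_LTYPE;

/* Hypothetical token codes; 0 marks end of input, as in Bison-style parsers. */
enum { DEMO_EOF = 0, DEMO_INTEGER = 258 };

/* Lexer state passed explicitly (the role %lex-param plays in a grammar). */
struct demo_lexer {
    const char *cursor;
    int line;
    int column;
};

/* Pure-style lexer: the token's value and location are written through the
 * pointers, so no global yylval / yylloc is needed. */
static int demo_lex(DEMO_STYPE *lvalp, DEMO_LTYPE *llocp, struct demo_lexer *lx)
{
    assert(lvalp != NULL && llocp != NULL && lx != NULL);

    /* Skip blanks, keeping the column counter in sync. */
    while (*lx->cursor == ' ' || *lx->cursor == '\t') {
        lx->cursor++;
        lx->column++;
    }

    /* The token starts here; single-character tokens also end here. */
    llocp->first_line = llocp->last_line = lx->line;
    llocp->first_column = llocp->last_column = lx->column;

    if (*lx->cursor == '\0')
        return DEMO_EOF;

    if (isdigit((unsigned char)*lx->cursor)) {
        int value = 0;
        while (isdigit((unsigned char)*lx->cursor)) {
            value = value * 10 + (*lx->cursor - '0');
            lx->cursor++;
            lx->column++;
        }
        *lvalp = value;
        llocp->last_column = lx->column - 1;
        return DEMO_INTEGER;
    }

    /* Any other character is returned as its own token code, e.g. '+'. */
    {
        int c = (unsigned char)*lx->cursor;
        lx->cursor++;
        lx->column++;
        if (c == '\n') {
            lx->line++;
            lx->column = 1;
        }
        return c;
    }
}
```

In a Bison-style pure parser, a function of this shape is called once per token. The extra `lx` argument shows how per-parse state can reach the lexer without globals.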
diff --git a/doc/chapters/02-examples.md b/doc/chapters/02-examples.md new file mode 100644 index 00000000..a28dc4a4 --- /dev/null +++ b/doc/chapters/02-examples.md @@ -0,0 +1,41 @@ +# Examples + +This chapter mirrors the structure of the Bison manual examples, but focuses on +what is present in the Lrama repository today. + +## Calculator example (sample/calc.y) + +The [`sample/calc.y`](../../sample/calc.y) grammar is the canonical example +for running Lrama. + +```shell +$ lrama -d sample/calc.y -o calc.c +$ gcc -Wall calc.c -o calc +$ ./calc +``` + +The grammar demonstrates: + +- Declaring tokens and precedence. +- Attaching semantic actions in C. +- Generating a header file with `-d`. + +## Minimal parser example (sample/parse.y) + +[`sample/parse.y`](../../sample/parse.y) is a smaller grammar intended to be +used by the build instructions and smoke tests. + +```shell +$ lrama -d sample/parse.y +``` + +## Additional grammars + +The `sample/` directory includes additional grammars that cover different +syntax styles: + +- [`sample/json.y`](../../sample/json.y) +- [`sample/sql.y`](../../sample/sql.y) + +These are good starting points when verifying compatibility or experimenting +with new directives. diff --git a/doc/chapters/03-grammar-files.md b/doc/chapters/03-grammar-files.md new file mode 100644 index 00000000..78a6121d --- /dev/null +++ b/doc/chapters/03-grammar-files.md @@ -0,0 +1,113 @@ +# Grammar Files + +Lrama reads Bison-style grammar files. Each grammar file has four sections in +order: + +1. **Prologue**: C code copied verbatim into the generated parser. +2. **Declarations**: Bison-style directives such as `%token` and `%start`. +3. **Grammar rules**: The productions and semantic actions. +4. **Epilogue**: C code appended to the end of the generated parser. + +A minimal grammar looks like this: + +```yacc +%token INTEGER +%% +input: INTEGER '\n'; +%% +``` + +## Symbols + +- **Terminals** are tokens returned by the lexer. 
+- **Nonterminals** are syntactic groupings defined by rules. + +Lrama accepts the common `%token`, `%type`, `%left`, `%right`, and +`%precedence` declarations in the declarations section. + +## Rules and actions + +Grammar rules use the standard Bison syntax. Semantic actions are C code blocks +that run when a rule is reduced. + +```yacc +expr: + expr '+' expr { $$ = $1 + $3; } + | INTEGER { $$ = $1; } + ; +``` + +## Parameterized rules + +Lrama extends Bison-style rules with parameterization. A nonterminal definition +may accept other symbols as parameters, allowing you to reuse rule templates. +Parameterized rules are defined with `%rule` and invoked like a nonterminal. + +```yacc +%rule option(X) + : /* empty */ + | X + ; + +program: + option(statement) + ; +``` + +When Lrama expands a parameterized rule, it creates a concrete nonterminal +whose name encodes the parameters. The example above expands to a rule named +`option_statement`. + +### Parameterized rules in the standard library + +Lrama ships a standard library of reusable parameterized rules in +[`lib/lrama/grammar/stdlib.y`](../../lib/lrama/grammar/stdlib.y). Common +patterns include: + +- `option(X)`: optional symbol. +- `list(X)`: zero or more repetitions. +- `nonempty_list(X)`: one or more repetitions. +- `separated_list(separator, X)`: separated list with optional empty case. +- `separated_nonempty_list(separator, X)`: separated list with at least one + element. +- `delimited(opening, X, closing)`: wrap a symbol with delimiters. + +You can reference these directly by including the standard library in your +grammar or copy them into your own grammar file. + +### Semantic values and locations + +Parameterized rules support the same semantic action syntax as ordinary rules. +If you add actions to a parameterized rule, the generated nonterminal keeps the +action and location references intact. 
When you call a parameterized rule, the +resulting nonterminal can be used like any other symbol in subsequent rules. + +## Inlining + +The `%inline` directive replaces all references to a symbol with its +definition. It is useful for eliminating extra nonterminals, removing +shift/reduce conflicts, or keeping small helper rules from polluting the symbol +list. + +```yacc +%rule %inline opt_newline + : /* empty */ + | '\n' + ; + +lines: + lines opt_newline line + | line + ; +``` + +An inline rule does not create a standalone nonterminal in the output. Instead, +its productions are substituted wherever the inline symbol is referenced. This +is why `%inline` is often paired with parameterized rules (for example, +`%rule %inline ioption(X)` in the standard library) to build reusable templates +without growing the symbol table. + +## Error recovery + +Use `error` tokens in rules and enable recovery with `-e` when generating the +parser. For guidance, see the [Error Recovery](06-error-recovery.md) chapter. diff --git a/doc/chapters/04-parser-interface.md b/doc/chapters/04-parser-interface.md new file mode 100644 index 00000000..f6da6046 --- /dev/null +++ b/doc/chapters/04-parser-interface.md @@ -0,0 +1,35 @@ +# Parser Interface + +Lrama generates a C parser that follows the same API style as Bison’s default +C interface. The entry point is `yyparse`, which calls `yylex` to obtain tokens +from the lexer and uses `yyerror` for error reporting. + +## Required functions + +- `int yylex(YYSTYPE *lvalp, YYLTYPE *llocp)` returns the next token and stores its semantic value and location through the two pointers. +- `int yyparse(void)` drives the parser. +- `void yyerror(YYLTYPE *locp, const char *message)` reports syntax errors. + +The signatures may vary if you configure `%parse-param` or `%lex-param` +arguments in your grammar. + +## Location tracking + +Location tracking is always enabled in Lrama’s compatibility model. Use `@n` +for the location of a right-hand side symbol and `@$` for the location of the +left-hand side. 
Define a location type via `%define api.location.type` or by +customizing the generated code. + +## Header generation + +Use `-d` or `-H` to emit a header file containing token definitions and shared +structures: + +```shell +$ lrama -d sample/parse.y +``` + +## Pure parser assumptions + +Lrama assumes a pure parser (`b4_pure_if` is always true). This means semantic +value and location information are passed explicitly rather than using globals. diff --git a/doc/chapters/05-parser-algorithm.md b/doc/chapters/05-parser-algorithm.md new file mode 100644 index 00000000..b757705c --- /dev/null +++ b/doc/chapters/05-parser-algorithm.md @@ -0,0 +1,32 @@ +# Parser Algorithm + +Lrama produces LALR(1) parsers. The generated parser uses the standard LR +algorithm with shift/reduce and reduce/reduce conflict resolution. + +## Conflicts and precedence + +Use `%left`, `%right`, and `%precedence` declarations to resolve +shift/reduce conflicts. Lrama reports conflicts in the `--report` output and +with `-v` (alias for `--report=state`). + +## Reports and diagnostics + +Lrama can emit detailed state and conflict reports during parser generation. +Common report options include: + +- `--report=state`: state machine summary (also `-v`). +- `--report=counterexamples`: generate conflict counterexamples. +- `--report=all`: include all reports. + +You can write the report to a file with `--report-file`. + +```shell +$ lrama -v --report-file=parser.report sample/parse.y +``` + +## Error tolerant parsing + +When `-e` is supplied, Lrama enables its error recovery extensions. This uses a +subset of the algorithm described in *Repairing Syntax Errors in LR Parsers*. +Refer to [Error Recovery](06-error-recovery.md) for guidance on structuring +rules. 
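The "standard LR algorithm" named at the start of the Parser Algorithm chapter can be pictured as a loop over a state stack that either shifts the lookahead token or reduces by a rule. The sketch below hand-codes that loop in C for a hypothetical two-rule grammar (`E -> E '+' NUM` and `E -> NUM`). It is an illustration only, not the code Lrama emits, which is table-driven and far more general. All names and state numbers are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for the toy grammar
 *   rule 1:  E -> E '+' NUM
 *   rule 2:  E -> NUM
 */
enum demo_token { T_END, T_NUM, T_PLUS };

struct demo_input {
    enum demo_token kind;
    int value;              /* meaningful only for T_NUM */
};

enum { DEMO_STACK_MAX = 64 };

/* Parse and evaluate a T_END-terminated token array.
 * Returns 0 and stores the sum in *result on success, 1 on a syntax error
 * or stack overflow. States (hand-built for this grammar):
 *   0: start              1: after NUM         (reduce by rule 2)
 *   2: after E            3: after E '+'
 *   4: after E '+' NUM    (reduce by rule 1)
 */
static int demo_parse(const struct demo_input *in, int *result)
{
    int states[DEMO_STACK_MAX];
    int values[DEMO_STACK_MAX];
    int top = 0;            /* index of the current state on the stack */
    size_t pos = 0;         /* index of the lookahead token */

    assert(in != NULL && result != NULL);
    states[0] = 0;
    values[0] = 0;

    for (;;) {
        enum demo_token la = in[pos].kind;
        int shift_to = -1;  /* target state of a shift, or -1 for none */

        switch (states[top]) {
        case 0:
            if (la == T_NUM) shift_to = 1;
            break;
        case 1:
            /* Default reduction by rule 2: pop NUM, then push
             * goto(0, E) = 2. The semantic value is unchanged. */
            states[top] = 2;
            continue;
        case 2:
            if (la == T_PLUS) {
                shift_to = 3;
            } else if (la == T_END) {
                *result = values[top];   /* accept */
                return 0;
            }
            break;
        case 3:
            if (la == T_NUM) shift_to = 4;
            break;
        case 4: {
            /* Default reduction by rule 1: pop E '+' NUM, then push
             * goto(0, E) = 2 with the computed sum. */
            int sum = values[top - 2] + values[top];
            top -= 3;
            top += 1;
            states[top] = 2;
            values[top] = sum;
            continue;
        }
        default:
            break;
        }

        if (shift_to < 0 || top + 1 >= DEMO_STACK_MAX)
            return 1;       /* syntax error or stack overflow */

        /* Shift: push the new state and consume the lookahead. */
        top += 1;
        states[top] = shift_to;
        values[top] = in[pos].value;
        pos += 1;
    }
}
```

Because the grammar is left-recursive, each `E '+' NUM` is reduced as soon as it is complete, so the stack stays shallow however long the input is.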
diff --git a/doc/chapters/06-error-recovery.md b/doc/chapters/06-error-recovery.md new file mode 100644 index 00000000..ef42e134 --- /dev/null +++ b/doc/chapters/06-error-recovery.md @@ -0,0 +1,29 @@ +# Error Recovery + +Lrama supports error tolerant parsing inspired by the algorithm described in +*Repairing Syntax Errors in LR Parsers*. + +## Enabling recovery + +Pass `-e` when generating the parser to enable recovery support. + +```shell +$ lrama -e sample/parse.y +``` + +## Writing recovery rules + +Use the special `error` token in grammar rules to specify recovery points. A +common pattern is to skip to a statement terminator or newline. + +```yacc +statement: + expr ';' + | error ';' { /* discard the rest of the statement */ } + ; +``` + +## Handling recovery in actions + +Make sure semantic actions can cope with partially parsed input. Keep actions +small and defensively check inputs for null values when necessary. diff --git a/doc/chapters/07-context-dependencies.md b/doc/chapters/07-context-dependencies.md new file mode 100644 index 00000000..e5b30b2f --- /dev/null +++ b/doc/chapters/07-context-dependencies.md @@ -0,0 +1,23 @@ +# Handling Context Dependencies + +Some grammars are difficult to express with pure context-free rules. +In these cases, the typical approach is to make the lexer or semantic actions +context aware. + +## Token-level context + +Have the lexer emit different tokens depending on parser state. For example, +you can track whether you are inside a type declaration and return a distinct +token for identifiers in that context. + +## Semantic predicates + +Lrama does not provide Bison's GLR semantic predicates (`%?{ ... }`). +Instead, use regular semantic actions and explicit tokens to keep +state. + +## Parameterized rules + +Parameterized rules can help express repeated patterns without introducing +ambiguity. Use them to factor context-specific constructs while keeping the +grammar readable. 
See the [Grammar Files](03-grammar-files.md) chapter. diff --git a/doc/chapters/08-debugging.md b/doc/chapters/08-debugging.md new file mode 100644 index 00000000..41c381f8 --- /dev/null +++ b/doc/chapters/08-debugging.md @@ -0,0 +1,32 @@ +# Debugging + +Lrama offers both generation-time and runtime diagnostics. + +## Generator traces + +Use `--trace` to print internal generation traces. Useful values are: + +- `automaton`: print state transitions. +- `rules`: print grammar rules. +- `actions`: print rules with semantic actions. +- `time`: report generation time. +- `all`: enable all traces. + +```shell +$ lrama --trace=automaton,rules sample/parse.y +``` + +## Reports + +`--report` produces structured reports about states, conflicts, and unused +rules/terminals. See [Parser Algorithm](05-parser-algorithm.md) for details. + +## Syntax diagrams + +Use `--diagram` to emit an HTML diagram of the grammar rules. + +```shell +$ lrama --diagram=diagram.html sample/calc.y +``` + +The repository includes a sample output in [`sample/diagram.html`](../../sample/diagram.html). diff --git a/doc/chapters/09-invoking-lrama.md b/doc/chapters/09-invoking-lrama.md new file mode 100644 index 00000000..c44c378c --- /dev/null +++ b/doc/chapters/09-invoking-lrama.md @@ -0,0 +1,22 @@ +# Invoking Lrama + +Lrama is a command-line tool that reads a grammar file and emits parser code. + +```shell +$ lrama [options] FILE +``` + +## Common options + +- `-o, --output=FILE`: write parser output to FILE. +- `-H, --header=FILE`: also produce a header file named FILE. +- `-d`: emit `y.tab.h` next to the output file. +- `-v, --verbose`: same as `--report=state`. +- `-r, --report=REPORTS`: emit reports (`states`, `rules`, `counterexamples`, + etc.). +- `--report-file=FILE`: write report output to FILE. +- `--diagram[=FILE]`: generate HTML grammar diagrams. +- `--trace=TRACES`: print generation traces. +- `-e`: enable error recovery. + +Run `lrama --help` to see the full list of options. 
diff --git a/doc/chapters/10-other-languages.md b/doc/chapters/10-other-languages.md new file mode 100644 index 00000000..e20b5e6d --- /dev/null +++ b/doc/chapters/10-other-languages.md @@ -0,0 +1,9 @@ +# Parsers in Other Languages + +Lrama focuses on generating C parsers that integrate with CRuby. It does not +provide the multi-language backends that Bison offers (such as C++, Java, or +D). If you need those targets, consider using Bison directly. + +That said, the generated C parser can be embedded in other language runtimes as +long as the host can call into C and provide the required lexer and error +handlers. diff --git a/doc/chapters/11-history.md b/doc/chapters/11-history.md new file mode 100644 index 00000000..e5feed89 --- /dev/null +++ b/doc/chapters/11-history.md @@ -0,0 +1,10 @@ +# History + +Lrama was created to provide a Bison-compatible parser generator that can be +run with the Ruby toolchain used in CRuby builds. It maintains compatibility +with Bison-style grammars while adding features needed by Ruby's parser, +such as error tolerance and grammar parameterization. + +The project is maintained alongside the Ruby language development process and +shares many of the same constraints (BASERUBY compatibility, default gems +only). diff --git a/doc/chapters/12-version-compatibility.md b/doc/chapters/12-version-compatibility.md new file mode 100644 index 00000000..599e9e70 --- /dev/null +++ b/doc/chapters/12-version-compatibility.md @@ -0,0 +1,8 @@ +# Version Compatibility + +Lrama must run on BASERUBY when building CRuby. The supported version is +currently Ruby 3.1 or later. + +When updating Lrama, ensure that the grammar output remains compatible with the +Ruby version you are targeting. The repository maintains branches for older +Ruby versions; see `README.md` for the list. 
diff --git a/doc/chapters/13-faq.md b/doc/chapters/13-faq.md new file mode 100644 index 00000000..e331e70b --- /dev/null +++ b/doc/chapters/13-faq.md @@ -0,0 +1,21 @@ +# FAQ + +## Does Lrama implement all of Bison? + +No. Lrama supports Bison-style grammars but assumes specific settings (pure +parser, locations enabled, etc.). See [Concepts](01-concepts.md) for the exact +compatibility assumptions. + +## Where is the documentation hosted? + +The public documentation is published at https://ruby.github.io/lrama/. +This `doc/` directory is the source for that documentation. + +## Can I use Lrama without Ruby? + +Lrama is a Ruby tool and requires Ruby to run. The generated parser output is +in C, so you can compile and use it without Ruby once the code is generated. + +## How do I profile Lrama? + +See [Profiling](../development/profiling.md) for the profiling workflow.