Writing your own lexical scanner / tokenizer from scratch is the easiest aspect of any compiler, learn how to write your own from scratch to avoid relying on tools to do it for you.
Tip
Review an explanation of the process as well as the technical breakdown.
Run the lexical scanner yourself to see the results:
$ python ./src/main.pyvar age = 18;
func is18() {
if age >= 18 {
return 0;
}
return 1;
}
// Comments are awesome!
A lexical scanner is the first process of compilers or modern interpreters, with its primary job being to turn high level source code into a stream (or array) of tokens.
Tokens are non-literal representations of character(s) that are usually created within a struct (C/C++) with a type that holds data to be referenced in the future.
We can essentially "scan" (or analyze) a source file of all its high level code and search for specific keywords, if found, we map them to designated tokens, if unknown, we can map them to an unknown token (typically something like TOKEN_UNKNOWN).
If you do not understand, here is a very simple visual representation of the lexical scanning process:
var myVariable = 1;
[TOKEN_VAR, TOKEN_IDENTIFIER, TOKEN_EQUALS, TOKEN_INTEGER, TOKEN_SEMICOLON]
If tokens in itself were literal representations of code, we would have to create a token for every possible combination, hence why we use TOKEN_INTEGER and TOKEN_IDENTIFIER for something like numbers or variable names.
Each token does in fact hold information though, otherwise we would not be able to do anything with these tokens. Each token should at least hold the line number, column number, lexeme (or value) and optionally (though recommended), the position of the character in the file it's currently in (i.e. line 3, column 4, char. 50).
This scanner is a hand-written lookahead scanner, with a time complexity taking up
The core idea is simple:
- Store the current character in a variable (for example,
curr = peek(0)). - Branch based on what
curris (identifier start, digit, symbol, whitespace, etc.). - Use lookahead (
peek(1),peek(2), ...) when needed for multi-character tokens. - Consume exactly the characters that belong to the token, then emit it.
- Repeat until the end of input, then emit
TOKEN_EOF.
- If
curris a letter or_, begin scanning an identifier. - Keep consuming while the next character is a letter, digit, or
_. - Build the full lexeme.
- Check that lexeme in the keyword map.
- If present, emit a keyword token; otherwise emit
TOKEN_IDENTIFIER.
This keeps keyword detection reliable and prevents false splits like ifCount becoming if + Count.
For operators, branch on the current symbol and then inspect the next character:
- If
curris>, check whetherpeek(1)is=to choose between>=and>. - If
curris=, check whetherpeek(1)is=to choose between==and=.
In general, choose the longest valid token first.
