A fast CLI tool that counts lines of code and estimates tokens in your project.
tokloc scans directories and gives you a breakdown of files by type, lines of code, and estimated token counts. It respects .gitignore by default, so you don't get node_modules or build artifacts cluttering your results.
- Counts lines (total, non-empty, empty)
- Estimates token counts using type-specific density ratios
- Classifies files by type (code, docs, data, html, image, other)
- Respects `.gitignore` rules
- Parallel processing using all CPU cores (configurable with `-j`)
- Binary file detection and early skipping
- Multiple path support (directories and files)
- Include/exclude pattern filtering
- Accurate token counting with optional tokenizer (supports HuggingFace vocab files)
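The line counting and binary skipping listed above can be illustrated with a short Python sketch. This is not tokloc's actual implementation: the NUL-byte heuristic, the `SAMPLE_SIZE` constant, the rule that whitespace-only lines count as empty, and the function names are all assumptions made for illustration.

```python
SAMPLE_SIZE = 8192  # hypothetical: number of leading bytes inspected per file


def is_binary_sample(sample: bytes) -> bool:
    """Assumed heuristic: a NUL byte in the leading sample marks a binary file."""
    return b"\x00" in sample


def count_lines(lines):
    """Count total, non-empty, and empty lines from any iterable of lines."""
    total = 0
    empty = 0
    for line in lines:
        total += 1
        # Assumption: a line containing only whitespace counts as empty.
        if not line.strip():
            empty += 1
    return {"total": total, "non_empty": total - empty, "empty": empty}


def scan_file(path):
    """Return line counts for a text file, or None if it looks binary."""
    with open(path, "rb") as f:
        if is_binary_sample(f.read(SAMPLE_SIZE)):
            return None  # skipped early, before reading the rest of the file
        f.seek(0)
        # Iterating the file object streams it line by line.
        return count_lines(f)
```

Checking only a small leading sample is what makes early skipping cheap: a large binary is rejected after one short read.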
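
Clone the repository and build a release binary:

    git clone https://github.com/nathanielcole/tokloc.git
    cd tokloc
    mkdir build && cd build
    cmake .. -DCMAKE_BUILD_TYPE=Release
    cmake --build .

Build with tests enabled (debug configuration):

    mkdir build && cd build
    cmake .. -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug
    cmake --build .

Install:

    cmake --install .

Scan a project, or pass several directories and files at once:

    ./build/tokloc /path/to/your/project
    ./build/tokloc dir1 dir2
    ./build/tokloc file.txt src/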
    ./build/tokloc src/ tests/ README.md

| Flag | Description |
|---|---|
| `-v`, `--verbose` | Show scanning progress for directories |
| `-a`, `--all` | Include ignored files (`.gitignore` rules) |
| `-i`, `--include` | Include only files matching pattern (e.g., `*.cpp`, `*.py`) |
| `-x`, `--exclude` | Exclude files matching pattern (e.g., `*test*`, `*/tmp/*`) |
| `-j`, `--jobs` | Number of parallel jobs (default: auto, `-j1` = single threaded) |
| `--tokenizer-path` | Load tokenizer vocabulary from local file |
| `--tokenizer-url` | Load tokenizer from remote JSON URL (in-memory) |
| `--tokenizer-url-max-mb` | Max download size for `--tokenizer-url` (default 100) |

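One plausible reading of how the `-i` and `-x` patterns in the table above combine is sketched below in Python. The exact semantics are assumptions, not confirmed behavior: the sketch assumes glob patterns are matched against the whole relative path, that comma-separated and repeated flags are equivalent, and that an exclude match always wins over an include match. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def split_patterns(values):
    """Flatten repeated flags and comma-separated lists into one pattern list."""
    return [p.strip() for v in values for p in v.split(",") if p.strip()]


def is_selected(path, include=(), exclude=()):
    """Decide whether a path survives the include/exclude filters."""
    include_patterns = split_patterns(include)
    exclude_patterns = split_patterns(exclude)
    # With no include patterns, every file is a candidate.
    if include_patterns and not any(fnmatchcase(path, p) for p in include_patterns):
        return False
    # Assumption: exclusion takes priority over inclusion.
    return not any(fnmatchcase(path, p) for p in exclude_patterns)
```

Under these assumptions, `is_selected("src/test_main.cpp", include=["*.cpp"], exclude=["*test*"])` is false, which matches the "include cpp files but exclude test files" intent.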
Use a local tokenizer vocab file:

    ./build/tokloc . --tokenizer-path /path/to/vocab.txt

Use a remote tokenizer JSON URL (downloaded into memory, not written to disk):

    ./build/tokloc . --tokenizer-url https://example.com/tokenizer.json

Override the default max URL download size (100 MB):

    ./build/tokloc . --tokenizer-url https://example.com/tokenizer.json --tokenizer-url-max-mb 250

Notes:

- `--tokenizer-path` and `--tokenizer-url` are mutually exclusive.
- URL tokenizer content must be valid JSON; non-JSON payloads are rejected.

Include only specific file types:

    ./build/tokloc . -i "*.cpp"
    ./build/tokloc . -i "*.cpp,*.h"
    ./build/tokloc src/ -i "*.cpp" -i "*.h"

Exclude specific file types or directories:

    ./build/tokloc . -x "*test*"                            # Exclude test files
    ./build/tokloc . -x "*/tmp/*"                           # Exclude tmp directories
    ./build/tokloc . -i "*.cpp" -x "*test*"                 # Include cpp files but exclude test files
    ./build/tokloc . -x "*/node_modules/*" -x "*/build/*"   # Exclude multiple patterns

Show scanning progress:

    ./build/tokloc . --verbose

Example output:

    Type         Files     Lines     Empty    Tokens
    -------------------------------------------------------
    code             3      2923       633     28373
    docs             2       198        73      1663
    other            1         3         0         7
    -------------------------------------------------------
    Total            6      3124       706     30043
    Elapsed: 0.00s | files/s: 2515.63 | lines/s: 1605813.46 | tok/s: 12596202.06

| Type | Extensions |
|---|---|
| code | .c, .cpp, .h, .hpp, .rs, .go, .py, .js, .ts |
| docs | .md, .txt |
| data | .json, .yaml, .yml, .xml |
| html | .html, .htm |
| image | .png, .jpg, .jpeg, .gif, .webp, .svg |
Tokens are estimated using type-specific density ratios (characters per token):
| Type | Chars/Token |
|---|---|
| Code | 4.2 |
| Docs | 3.6 |
| Data | 3.3 |
| HTML | 3.7 |
Images return 0 tokens.
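The two tables above are enough for a rough Python sketch of the estimate. It is an illustration, not tokloc's code. The source does not state a ratio for the `other` type or how fractional results are rounded, so the `other_ratio` default of 4.0, the round-to-nearest step, and the case-insensitive extension match are all assumptions.

```python
import os

# Extension table copied from the classification table above.
EXTENSION_TYPES = {
    "code": {".c", ".cpp", ".h", ".hpp", ".rs", ".go", ".py", ".js", ".ts"},
    "docs": {".md", ".txt"},
    "data": {".json", ".yaml", ".yml", ".xml"},
    "html": {".html", ".htm"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"},
}

# Characters per token, copied from the density table above.
CHARS_PER_TOKEN = {"code": 4.2, "docs": 3.6, "data": 3.3, "html": 3.7}


def classify(path):
    """Map a path to a file type by extension; unknown extensions are 'other'."""
    ext = os.path.splitext(path)[1].lower()  # assumption: case-insensitive match
    for file_type, extensions in EXTENSION_TYPES.items():
        if ext in extensions:
            return file_type
    return "other"


def estimate_tokens(num_chars, file_type, other_ratio=4.0):
    """Estimate tokens as characters divided by the type's density ratio."""
    if file_type == "image":
        return 0  # images return 0 tokens
    # other_ratio is a hypothetical fallback; the source gives no ratio for 'other'.
    ratio = CHARS_PER_TOKEN.get(file_type, other_ratio)
    return round(num_chars / ratio)  # rounding mode is an assumption
```

For example, a 4,200-character `.cpp` file would be estimated at 4200 / 4.2 = 1000 tokens.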
- Uses parallel processing to utilize all CPU cores (configurable with `-j`)
- Binary files are detected and skipped early to save memory
- Streaming file processing avoids loading entire files into memory
- Large files (>10MB) are automatically chunked and processed in parallel
- All chunks are processed regardless of thread count (no truncation)
- Overlap handling preserves token boundaries at chunk edges
- Fast JSON parsing with yyjson for tokenizer vocab files
- Trie-based prefix matching for efficient tokenization
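To give a feel for the trie-based prefix matching mentioned in the list above, here is a minimal Python sketch of greedy longest-prefix tokenization. The source only says that a trie is used, so the greedy strategy, the rule that an unmatched character counts as one token, and the toy vocabulary in the usage note are all assumptions. Real tokenizers (for example BPE-based ones) may segment text differently.

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None  # set only on nodes that end a vocabulary entry


def build_trie(vocab):
    """Build a character trie from a {token_string: token_id} mapping."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def count_tokens(text, root):
    """Count tokens by repeatedly consuming the longest vocabulary prefix."""
    count = 0
    i = 0
    while i < len(text):
        node = root
        longest = 0
        j = i
        # Walk the trie as far as the text allows, remembering the last full match.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                longest = j - i
        # Assumption: a character with no vocabulary match counts as one token.
        i += longest if longest else 1
        count += 1
    return count
```

With a hypothetical vocabulary `{"to": 0, "tok": 1, "loc": 2}`, `count_tokens("tokloc", build_trie(vocab))` consumes `tok` and then `loc`. Each position is resolved in one walk down the trie, with no need to test every vocabulary entry.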

Run the tests from the build directory:

    cd build
    ./tests   # recommended
    # or
    ctest -V

Contributions are welcome! Open an issue or submit a pull request.