johnsoncodehk/monogram

Monogram

Write a language's grammar once, as an executable definition. Monogram runs it as a real parser, proves it against the language's official conformance suite, then derives the syntax highlighters (TextMate, tree-sitter, Monarch) from that same proven grammar. Highlighting correctness flows down from a parser-verified model instead of up from hand-tuned regex.

mono + grammar: one grammar definition, many derived artifacts.

Status: an active research project; four languages on one shared, language-agnostic engine, each proven as a parser before its highlighter is trusted:

  • TypeScript (typescript.ts): mature, with 100% valid-code coverage and 97.8% bidirectional vs tsc.
  • JavaScript (javascript.ts): the standalone ECMAScript base TypeScript builds on (subset → superset); parses real-world JS, with less conformance-corpus depth than TS so far.
  • HTML (html.ts): the engine reaching past token streams into markup; ~95 lines, validated against parse5.
  • Vue (vue.ts): a dialect of html.ts, with SFC blocks that embed Monogram's own TS/JS/CSS, plus directives and {{ }} interpolation.

Per-grammar comparison vs the official parser as the neutral oracle (node test/coverage-table.ts --write).

Parser: Monogram's parser vs the official parser (test/src-coverage.ts). agree is the closeness number: Monogram and the official parser return the same verdict on each corpus file (both accept / both reject; structural parse-tree equality for HTML via parse5). covered is the share of the official parser's branches the corpus actually exercises, a blind-spot gauge; Monogram's behaviour on the uncovered remainder is untested, so read agree as "on the covered portion." For the non-HTML grammars agree is accept/reject, not tree-equality; their parse-structure correctness is exercised instead by the Highlighter axis below, whose token roles are read off the parse tree. (Each adapter's detailed output also prints a coverage-weighted branch-alignment %, which is more lenient than agree.)
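As an illustration of the agree number only, here is a minimal sketch of how an accept/reject agreement rate could be computed from per-file verdicts. The `FileResult` and `Agreement` shapes and the `measureAgreement` helper are hypothetical names for this sketch; they are not taken from test/src-coverage.ts, which may compute the figure differently.

```typescript
// Hypothetical shapes: one verdict pair per corpus file.
type Verdict = 'accept' | 'reject';

interface FileResult {
  file: string;
  monogram: Verdict; // verdict of the grammar under test
  official: Verdict; // verdict of the reference parser (the oracle)
}

interface Agreement {
  agree: number;          // share of files with the same verdict, 0..1
  falseAccepts: string[]; // accepted by the grammar, rejected by the oracle
  falseRejects: string[]; // rejected by the grammar, accepted by the oracle
}

function measureAgreement(results: FileResult[]): Agreement {
  const falseAccepts: string[] = [];
  const falseRejects: string[] = [];
  for (const r of results) {
    if (r.monogram === r.official) continue; // both accept or both reject
    (r.monogram === 'accept' ? falseAccepts : falseRejects).push(r.file);
  }
  const disagreements = falseAccepts.length + falseRejects.length;
  // Assumption of this sketch: an empty corpus counts as full agreement.
  const agree = results.length === 0 ? 1 : (results.length - disagreements) / results.length;
  return { agree, falseAccepts, falseRejects };
}

const sample: FileResult[] = [
  { file: 'a.ts', monogram: 'accept', official: 'accept' },
  { file: 'b.ts', monogram: 'reject', official: 'reject' },
  { file: 'c.ts', monogram: 'accept', official: 'reject' },
  { file: 'd.ts', monogram: 'reject', official: 'accept' },
];
const report = measureAgreement(sample);
console.log(report.agree); // 0.5
```

Keeping the two disagreement directions separate is what makes the measure bidirectional: a grammar that accepts everything would score well on false rejects and badly on false accepts.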

Highlighter: Monogram's derived TextMate grammar vs the official one, both graded against the parser's token roles (test/scope-gap.ts); the vscode#203212 comparison.

| Grammar | Parser: agree | Parser: covered | Highlighter: Monogram | Highlighter: official |
| --- | --- | --- | --- | --- |
| TypeScript | 97.1% | 76.4% | 99.2% | 99.3% |
| JavaScript | 96.3% | 65.5% | 99.0% | 83.6% |
| JSX | 97.1% | 52.5% | 94.3% | 94.3% |
| TSX | 96.7% | 65.7% | 95.6% | 95.4% |
| HTML | 95.3% | 49.3% | 100.0% | 98.8% |
| YAML | 100.0% | 73.9% | 99.5% | 99.5% |
| Vue | - | - | 98.8% | 98.0% |

Which "official" grammar each row compares against: HTML's is the unmaintained textmate/html.tmbundle, the #203212 case Monogram targets. YAML's is the maintained RedCMD/YAML-Syntax-Highlighter that VS Code switched to (microsoft/vscode#232244), so YAML's gap is Monogram vs a maintained grammar, not a dead bundle. JS/TS use Microsoft's maintained TypeScript-TmLanguage.

Quick start

Requires Node 24+ (runs .ts directly: no build step, no tsx).

npm install
node src/cli.ts typescript.ts        # regenerate every artifact from the grammar

import { createParser } from './src/gen-parser.ts';
import grammar from './typescript.ts';

const { parse } = createParser(grammar);
const cst = parse('const x = f(a, b)');        // → a concrete syntax tree

The idea

A TextMate grammar is a pile of regexes guessing at a language's structure. It's written by hand, independently of any parser, and perpetually wrong at the edges: VS Code's official TypeScript grammar carries 100+ open issues for exactly this reason. Everyone trying to fix it competes on the same losing axis: who can hand-write better regexes.

Take typeof x < y. A regex highlighter has to guess whether < opens a generic argument list or is a less-than comparison, and it guesses wrong somewhere, forever. A parser doesn't guess; the grammar already decides. Monogram inverts the dependency:

  1. Write the grammar, then prove it. The grammar is executable: Monogram runs it as a recursive-descent + Pratt (operator-precedence) parser over the TypeScript conformance suite and measures it bidirectionally. It must accept every input tsc accepts and reject every input tsc rejects.

  2. Derive the highlighters from that proven grammar, never hand-write them. The TextMate, tree-sitter, and Monarch outputs are all generated from the one parser-validated definition, so their correctness is underwritten by the conformance run, not by regex tuning.

That single source reaches across grammars, too: an embedded snippet runs another Monogram grammar. A <script> body is highlighted by Monogram's own JavaScript, so <script>const x = 1 < 2</script> colours < as a JS operator, the same ambiguity resolved inside the embed. Where VS Code's embeds fray (two independently written grammars meeting with nothing checking the seam), Monogram owns both sides, so self-verifying that seam becomes possible (a design goal beyond today's standard contentName injection).

Comparison

The same question, every language at once: take the bugs reported against each hand-written official grammar and ask whether the derived grammar solves them. Which does only the official solve, which does only Monogram solve, and which do both still get wrong (the shared frontier neither reaches today)?

Each hand-written official grammar vs Monogram's derived one, on the bugs filed against it: TypeScript 26/26 (official 8/26) · TSX 11/11 (official 5/11) · HTML 20/20 (official 13/20) · Vue 23/23 (official 18/23) · YAML 8/8 (official 8/8). Per-issue detail below, auto-generated by npm run bench:issues.

TypeScript

| issue | Monogram | official |
| --- | --- | --- |
| #1050: typeof y < string is a relational operator not generic (cascade victim intact) | ✓ | · |
| #978: typeof x < string then function (cascade victim intact) | ✓ | · |
| #859: as cast inside < > comparison | ✓ | · |
| #1020: new Map<number, number>; (no parens) | ✓ | · |
| #855: new Map</* comment */string, IArgs>() | ✓ | · |
| #853: throw /foo/ is regex | ✓ | · |
| #804: /[a-b]/g char class recognized | ✓ | · |
| #869: x in obj ? x : fallback ternary works | ✓ | · |
| #770: function call parens are punctuation | ✓ | · |
| #1021: regex with the v (unicode-sets) flag is recognized | ✓ | · |
| #1025: for-of without surrounding space keeps of a loop keyword | ✓ | · |
| #815: a class method named new is a method name, not the operator | ✓ | · |
| #992: casting to a type named type does not break highlighting | ✓ | · |
| #994: JSDoc @template [Output=Value] default; Monogram colors the param name, official misses it | ✓ | · |
| #891: from as an ordinary variable is not a keyword | ✓ | · |
| #814: a instanceof B & c keeps the operand a value, not a type | ✓ | · |
| #950: default import named type; the binding is a variable, not the type keyword | ✓ | · |
| #1058: import defer should scope defer as a keyword | ✓ | · |

… and 8 more both grammars already handle (✓ / ✓):

| issue | Monogram | official |
| --- | --- | --- |
| #1063: /\cJ/ control char escape | ✓ | ✓ |
| #736: obj.example() method gets entity.name.function | ✓ | ✓ |
| #788: optional chaining ?. is the optional accessor | ✓ | ✓ |
| #881: override modifier on a method is storage.modifier | ✓ | ✓ |
| #1066: triple-slash reference directive is a comment | ✓ | ✓ |
| #1027: nested generic >> closes two type-arg lists, not a shift | ✓ | ✓ |
| #956: as const satisfies Foo colors the satisfies keyword and the type | ✓ | ✓ |
| #907: typeof x extends string ? 1 : 2 conditional-type ternary | ✓ | ✓ |

TSX

| issue | Monogram | official |
| --- | --- | --- |
| #967: generic arrow with a default type in .tsx | ✓ | · |
| #979: const modifier on a type parameter in .tsx | ✓ | · |
| #1042/#990: default generic arrow function in .tsx | ✓ | · |
| #627: member-expression JSX tag name | ✓ | · |
| #1033: generic arrow with a default + destructured param in .tsx | ✓ | · |
| #825: < and tag name on separate lines | ✓ | · |

… and 5 more both grammars already handle (✓ / ✓):

| issue | Monogram | official |
| --- | --- | --- |
| #794: non-null ! then / (division) in a JSX-attribute object | ✓ | ✓ |
| #585: // line comment inside a JSX open tag | ✓ | ✓ |
| #754: JSX element right after a /**/ block comment | ✓ | ✓ |
| #667: arrow function + ternary inside a JSX attribute | ✓ | ✓ |
| #624: JSX element in an array after a template-literal attribute | ✓ | ✓ |

HTML

| issue | Monogram | official |
| --- | --- | --- |
| tmbundle#118: trailing / in an unquoted URL value | ✓ | · |
| tmbundle#108: nested <svg> is a valid tag, not flagged invalid | ✓ | · |
| tmbundle#113: // in an onclick= JS string read as a comment | ✓ | · |
| tmbundle#104: mixed-case onChange= event handler still reads as JS | ✓ | · |
| tmbundle#88: inline style= value embeds CSS | ✓ | · |
| tmbundle#65: < of </script> is HTML punctuation, not source.js | ✓ | · |
| tmbundle#74: < of </style> is HTML punctuation, not source.css | ✓ | · |

… and 13 more both grammars already handle (✓ / ✓):

| issue | Monogram | official |
| --- | --- | --- |
| tmbundle#124: slash in unquoted value foo/ | ✓ | ✓ |
| vscode#140360: / inside an unquoted value (path) | ✓ | ✓ |
| tmbundle#84: tag name a prefix of a sibling (<i>/<input>) | ✓ | ✓ |
| tmbundle#117: SVG camelCase tag name | ✓ | ✓ |
| tmbundle#122: < inside a quoted attr value | ✓ | ✓ |
| vscode#130284: > inside a quoted attr value does not close the tag early | ✓ | ✓ |
| tmbundle#97: whitespace (incl. a line feed) before > in a raw-text end tag | ✓ | ✓ |
| tmbundle#81: character entity &amp; in text | ✓ | ✓ |
| tmbundle#102: <style> element CSS is tokenized, not a flat blob | ✓ | ✓ |
| tmbundle#50: onclick= event-handler value is colored as JS | ✓ | ✓ |
| tmbundle#85: //</script> on its own line still closes the script | ✓ | ✓ |
| tmbundle#51: self-closing / is tag punctuation | ✓ | ✓ |
| tmbundle#82: a />-style <script src=… /> does NOT self-close; its body is the script content | ✓ | ✓ |

Vue

| issue | Monogram | official |
| --- | --- | --- |
| #6007/#2096/#520: as type assertion in directive value | ✓ | · |
| #2060-inline: const a = 1;</script> (content on the close line) embeds + clean close | ✓ | · |
| #2060-inline-adjacent: an unterminated union before a same-line </script>, then a second <script setup> block | ✓ | · |
| #5660: as const cast in a v-for value | ✓ | · |
| #4716/#5571: as cast followed by another attribute | ✓ | · |

… and 18 more both grammars already handle (✓ / ✓):

| issue | Monogram | official |
| --- | --- | --- |
| #3400: instanceof in {{ }} | ✓ | ✓ |
| #5370: typeof x !== in v-if | ✓ | ✓ |
| #5118: ?. / ?? in {{ }} | ✓ | ✓ |
| #1675: arrow => in {{ }} | ✓ | ✓ |
| #6039/#4741: < operator in {{ }} (not a tag!) | ✓ | ✓ |
| #5722: negated ternary + quotes in {{ }} | ✓ | ✓ |
| #5538/#2060: trailing export type before </script> | ✓ | ✓ |
| #3999: a force-wrapped multi-line <script lang="ts"> start tag keeps the body as the ts family (no .ts→.js flip) | ✓ | ✓ |
| #4769: tag name starting with template | ✓ | ✓ |
| #5701: {{ inside a <script> string | ✓ | ✓ |
| #6070: capitalized component then a <style> block | ✓ | ✓ |
| #4291: <script lang="tsx"> body embeds the DECLARED source.tsx (not a source.js fallback) | ✓ | ✓ |
| #4291-jsx: <script lang="jsx"> body embeds the DECLARED source.js.jsx | ✓ | ✓ |
| generic="T": generic="T extends U"> type-param list embeds as TS | ✓ | ✓ |
| #4410: dynamic directive argument :[attr] | ✓ | ✓ |
| #3727: .prop modifier shorthand | ✓ | ✓ |
| #2666: dynamic slot name from a template literal | ✓ | ✓ |
| #2560/#1290: type as a v-for loop variable | ✓ | ✓ |

YAML

No asymmetries: both grammars handle all 8 filed bugs below (✓ / ✓).

| issue | Monogram | official |
| --- | --- | --- |
| vscode#170032: document markers --- / ... are document structure, not stray punctuation | ✓ | ✓ |
| atom/language-yaml#114: a # in a block-scalar body is content, not a comment | ✓ | ✓ |
| tmbundle#38: a block scalar with leading/internal EMPTY lines stays one string region | ✓ | ✓ |
| tmbundle#18: JSON-ish punctuation ({/}) and a tab indicator inside a block scalar stay content | ✓ | ✓ |
| johnsoncodehk/monogram#12: an anchor &a in explicit-key (?) position is still an anchor | ✓ | ✓ |
| johnsoncodehk/monogram#12: a bare ? opening an explicit multi-line sequence key is the map-key indicator | ✓ | ✓ |
| atom/language-yaml#119: an escape inside a double-quoted KEY is highlighted | ✓ | ✓ |
| tmbundle#39: a plain scalar resolving to null is lexically a string that resolves to a constant | ✓ | ✓ |

A sampled ledger of real tracker issues, not an exhaustive audit. Run npm run bench:issues to regenerate (needs the official grammars: VS Code's installed TS/JS/HTML, and the Vue fixtures; see test/vue-bench.ts). Sources: test/issue-cases.ts, test/html-issue-cases.ts, test/vue-issue-cases.ts.

The ceiling, and the bar for claiming it

Deriving from a proven parser wins the disambiguation that is TextMate-expressible but infeasible to hand-write (regex-vs-division, generic-vs-comparison, whitespace-fragile multiline generics): the only-Monogram column. The both-miss cases are ones neither grammar gets today; they are not, by default, ones TextMate can't express.

"TextMate can't express X" is not a guess or an assertion; it is a claim to be proven from the model. TextMate is a line-oriented matcher whose only cross-line memory is a finite stack of scope contexts, so a proof exhibits an X whose correct highlighting provably needs memory that model lacks: unbounded lookback to a token that is not an enclosing context. A failed attempt to derive a pattern is not such a proof. A cleverer pattern may exist, and most "impossible for TextMate" folklore is exactly this error: the multiline / nested-generic cases turn out TM-expressible once a parser supplies the pattern, which is why the derived grammar gets them right. Where a construct provably exceeds the model, Monogram's tree-sitter target, a real parser over the whole tree, resolves it.

What you get

From one grammar definition (a small TypeScript combinator API), five outputs are fully functional:

  • A lexer: tokenizes source straight from the grammar's token definitions; usable on its own (createLexer(grammar).tokenize).
  • A CST parser: recursive descent + Pratt precedence on top of the lexer, producing a CST (concrete syntax tree). Every token is a node, including punctuation and keywords (roughly 2× an AST's nodes, by design), which is exactly what the highlighter and lossless source reconstruction need.
  • A TextMate grammar: a .tmLanguage.json for VS Code / Sublime syntax highlighting, derived from the same rules, including derived JSDoc-body and regex-internal sub-grammars. (TextMate scopes are the dot-separated labels, such as entity.name.function and keyword.control, that a theme maps to colors.)
  • A VS Code language configuration: language-configuration.json (comments, bracket pairs, auto-close/surround, folding) derived from the same tokens.
  • CST node types: a TypeScript discriminated union (keyed by rule) for typed tree consumers.

And, from the same grammar, generators for the rest of the ecosystem, at varying maturity:

  • tree-sitter: grammar.js + a structural queries/highlights.scm + an external scanner for context-sensitive lexing. tree-sitter's GLR absorbs the grammar and compiles to wasm; the derived query scores 95.9% token-family accuracy against a neutral tsc oracle, above the official tree-sitter's 92.7%, and is CI-gated by npm run gate:treesitter.
  • Monarch: a Monaco (web) tokenizer (functional, bounded by JS-regex limits).
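To make "Pratt precedence producing a CST" concrete, here is a toy operator-precedence parser in which every token, operators included, becomes a tree node. It is a sketch under stated assumptions: the `Tok` and `Cst` shapes, the `POWER` table, and the `tokenize`, `parse` and `show` helpers are illustrative names, not Monogram's generated parser or its node types.

```typescript
// A toy Pratt (operator-precedence) parser; illustrative only.
type Tok = { kind: 'num' | 'ident' | 'op'; text: string };
type Cst =
  | { type: 'leaf'; token: Tok }
  | { type: 'binary'; children: [Cst, Cst, Cst] }; // left, operator leaf, right

// Binding powers: higher binds tighter; all operators are left-associative.
const POWER: Record<string, number> = { '+': 1, '-': 1, '*': 2, '/': 2 };

// Simplification: characters that match no token (e.g. spaces) are skipped.
function tokenize(src: string): Tok[] {
  const toks: Tok[] = [];
  const re = /\d+|[A-Za-z_$][\w$]*|[-+*\/]/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    const text = m[0];
    const kind = /^\d/.test(text) ? 'num' : /^[-+*\/]$/.test(text) ? 'op' : 'ident';
    toks.push({ kind, text });
  }
  return toks;
}

function parse(src: string): Cst {
  const toks = tokenize(src);
  let pos = 0;

  function parseExpr(minPower: number): Cst {
    const first = toks[pos++];
    if (!first || first.kind === 'op') throw new Error('expected an operand');
    let left: Cst = { type: 'leaf', token: first };
    for (;;) {
      const op = toks[pos];
      if (!op || op.kind !== 'op') break;
      const power = POWER[op.text];
      if (power === undefined || power < minPower) break;
      pos++;
      // power + 1 makes equal-precedence operators group to the left
      const right = parseExpr(power + 1);
      left = { type: 'binary', children: [left, { type: 'leaf', token: op }, right] };
    }
    return left;
  }

  const tree = parseExpr(0);
  if (pos !== toks.length) throw new Error('unexpected trailing token');
  return tree;
}

// Because operators are nodes too, the token text can be rebuilt from the tree.
function show(n: Cst): string {
  return n.type === 'leaf' ? n.token.text : `(${n.children.map(show).join(' ')})`;
}

console.log(show(parse('a + 2 * b - c'))); // ((a + (2 * b)) - c)
```

The operator leaf inside each binary node is the difference from an AST: a highlighter needs a node to attach a scope to, even for punctuation.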

The grammar is the source of truth

A grammar is a TypeScript module: tokens, operator precedence, and rules built from small combinators. A self-contained mini-example:

import {
  token, rule, defineGrammar, left, op, sep,
  seq, oneOf, range, plus, star, optPattern,
} from './src/api.ts';

const digit = range('0', '9');
const Ident = token(seq(
  oneOf(range('a', 'z'), range('A', 'Z'), '_', '$'),
  star(oneOf(range('a', 'z'), range('A', 'Z'), digit, '_', '$')),
), { identifier: true });
const Number = token(seq(plus(digit), optPattern(seq('.', plus(digit)))));

const Expr = rule($ => [
  Ident,
  Number,
  [$, op, $],                    // binary operators (precedence declared below)
  [$, '(', sep(Expr, ','), ')'], // call:    foo(a, b)
  [$, '.', Ident],               // member:  obj.name
]);

export default defineGrammar({
  name: 'mini',
  tokens: { Ident, Number },
  prec: [ left('+', '-'), left('*', '/') ],
  rules: { Expr },
  entry: Expr,
});

Token patterns are combinators, not regular expressions: seq / oneOf / range / noneOf / plus / star / altPattern / optPattern / … assemble a structured pattern IR (regex is a derived backend, not the source of truth). A bare RegExp is not a valid token pattern: token(/…/) is a TS2345 type error. Coming from regex:

| RegExp | Combinator |
| --- | --- |
| /[ \t]+/ | plus(oneOf(' ', '\t')) |
| /[A-Z_][A-Z0-9_]*/ | seq(oneOf(range('A', 'Z'), '_'), star(oneOf(range('A', 'Z'), range('0', '9'), '_'))) |
| /"[^"]*"/ | seq('"', star(noneOf('"')), '"') |
| /\d+(\.\d+)?/ | seq(plus(digit), optPattern(seq('.', plus(digit)))) |

Note digit above is just range('0', '9'): patterns are plain values you name and reuse, not magic strings.
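The phrase "regex is a derived backend" can be pictured with a small stand-alone sketch: combinators build a plain data structure, and a separate function lowers that structure to a RegExp source string. Everything below is hypothetical. The combinator names are borrowed from the table above, but the `Pattern` type, the `toSource` function and their behaviour are assumptions for illustration, not the contents of src/api.ts.

```typescript
// Hypothetical mini pattern IR; Monogram's real combinators may differ.
type Pattern =
  | { kind: 'lit'; text: string }
  | { kind: 'range'; from: string; to: string }
  | { kind: 'seq'; items: Pattern[] }
  | { kind: 'oneOf'; items: Pattern[] }
  | { kind: 'repeat'; item: Pattern; op: '*' | '+' | '?' };

type Arg = Pattern | string; // a bare string is shorthand for a literal
const lit = (p: Arg): Pattern => (typeof p === 'string' ? { kind: 'lit', text: p } : p);

const range = (from: string, to: string): Pattern => ({ kind: 'range', from, to });
const seq = (...items: Arg[]): Pattern => ({ kind: 'seq', items: items.map(lit) });
const oneOf = (...items: Arg[]): Pattern => ({ kind: 'oneOf', items: items.map(lit) });
const star = (item: Arg): Pattern => ({ kind: 'repeat', item: lit(item), op: '*' });
const plus = (item: Arg): Pattern => ({ kind: 'repeat', item: lit(item), op: '+' });
const optPattern = (item: Arg): Pattern => ({ kind: 'repeat', item: lit(item), op: '?' });

// Escape characters that are special in a regular expression.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\\/-]/g, '\\$&');

// One possible backend: lower the IR to a RegExp source string.
function toSource(p: Pattern): string {
  switch (p.kind) {
    case 'lit': return escapeRegExp(p.text);
    case 'range': return `[${escapeRegExp(p.from)}-${escapeRegExp(p.to)}]`;
    case 'seq': return p.items.map(toSource).join('');
    case 'oneOf': return `(?:${p.items.map(toSource).join('|')})`;
    case 'repeat': return `(?:${toSource(p.item)})${p.op}`;
  }
}

const digit = range('0', '9');
const decimal = seq(plus(digit), optPattern(seq('.', plus(digit))));
const source = toSource(decimal);
const decimalRe = new RegExp(`^${source}$`);

console.log(source);                 // (?:[0-9])+(?:\.(?:[0-9])+)?
console.log(decimalRe.test('3.14')); // true
console.log(decimalRe.test('3.'));   // false
```

Because the pattern is data rather than a regex string, the same value could in principle feed other backends (a hand-rolled matcher, a tree-sitter rule) without re-parsing regex syntax.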

The parser uses these rules to build a CST. The highlighter reads the same rule shapes and infers most scopes structurally, with no per-rule annotation:

  • foo(x) → foo is entity.name.function (from the $ '(' … call form)
  • obj.name → name is entity.other.property (from the $ '.' Ident form)
  • 'class' Ident → Ident is entity.name.type (from declaration structure)
  • Expr '<' Type '>' '(' → a generic call, not a comparison (from rule structure)

Flat, irreducible facts (which keywords are control flow, which punctuation is an operator) are declared once in a small scopes map (≈50 lines for TypeScript) rather than inferred. Structure is derived; vocabulary is declared.
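A rough way to picture structural inference is a walk over a tiny expression tree in which the shape of the enclosing node, not a per-rule annotation, decides each identifier's scope. This sketch covers only the call and member forms listed above. The `ExprNode` shape, the `inferScopes` function and the fallback scope are assumptions made for illustration, and Monogram's actual inference works from rule shapes in the grammar rather than from a tree like this one.

```typescript
// Hypothetical, simplified expression tree.
type ExprNode =
  | { type: 'ident'; text: string }
  | { type: 'member'; object: ExprNode; property: string } // $ '.' Ident
  | { type: 'call'; callee: ExprNode; args: ExprNode[] };  // $ '(' args ')'

interface ScopedToken { text: string; scope: string }

// Assumed fallback scope for identifiers with no structural role.
const FALLBACK = 'variable.other.readwrite';

function inferScopes(node: ExprNode, role: string = FALLBACK): ScopedToken[] {
  switch (node.type) {
    case 'ident':
      return [{ text: node.text, scope: role }];
    case 'member': {
      // The trailing identifier is a property, unless an enclosing call
      // form has already marked this position as the thing being called.
      const propertyScope = role === 'entity.name.function' ? role : 'entity.other.property';
      return inferScopes(node.object).concat([{ text: node.property, scope: propertyScope }]);
    }
    case 'call': {
      // The call form tells the callee's last identifier it is a function name.
      let out = inferScopes(node.callee, 'entity.name.function');
      for (const arg of node.args) out = out.concat(inferScopes(arg));
      return out;
    }
  }
}

const ident = (text: string): ExprNode => ({ type: 'ident', text });

// obj.name(x).size
const tree: ExprNode = {
  type: 'member',
  object: {
    type: 'call',
    callee: { type: 'member', object: ident('obj'), property: 'name' },
    args: [ident('x')],
  },
  property: 'size',
};

for (const t of inferScopes(tree)) console.log(t.text, t.scope);
// obj variable.other.readwrite
// name entity.name.function
// x variable.other.readwrite
// size entity.other.property
```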

A language-agnostic engine

Nothing in the engine knows about TypeScript. Everything language-specific lives in the grammar (keywords, which token is the identifier, template-literal delimiters, the regex-vs-division lexer ambiguity), all declared per token:

import { token, seq, altPattern, noneOf, anyChar, oneOf, plus, star, notFollowedBy } from './src/api.ts';

const escaped = seq('\\', anyChar());

const Template = token(seq(
  '`',
  star(altPattern(noneOf('`', '\\', '$'), escaped, seq('$', notFollowedBy('{')))),
  '`',
), {
  template: { open: '`', interpOpen: '${', interpClose: '}' },
});
const Regex = token(seq(
  '/',
  plus(altPattern(
    noneOf('/', '\\', '[', '\n'),
    escaped,
    seq('[', star(altPattern(noneOf(']', '\\', '\n'), escaped)), ']'),
  )),
  '/',
  star(oneOf('g', 'i', 'm', 's', 'u', 'y', 'd', 'v')),
), {
  regex: true,
  regexContext: {
    divisionAfterTypes: ['Ident', 'Number', 'String', 'Template'],
    divisionAfterTexts: [')', ']', 'this', 'true', /* … */],
    regexAfterTexts:    ['return', 'typeof', 'instanceof', /* … */],
  },
});

test/agnostic.ts proves it directly: the same engine parses a toy grammar whose identifier token is Word, with no templates or regex. The deeper proof is html.ts: markup shares nothing with TypeScript's token stream, yet the same engine handles it (and Vue layers SFC blocks + {{ }} interpolation on top).
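To show what a declarative table like regexContext buys, here is a minimal sketch of a lexer-side decision: given the previous significant token, does a `/` start a regex literal or act as division? The field names mirror the regexContext option shown above and the lists contain only the entries visible there (the elided entries stay elided). The `PrevToken` shape, the `slashStartsRegex` function and the order in which the tables are consulted are assumptions, not the engine's documented behaviour.

```typescript
interface PrevToken { type: string; text: string }

interface RegexContext {
  divisionAfterTypes: string[];
  divisionAfterTexts: string[];
  regexAfterTexts: string[];
}

// Only the entries shown in the grammar excerpt; the real lists are longer.
const regexContext: RegexContext = {
  divisionAfterTypes: ['Ident', 'Number', 'String', 'Template'],
  divisionAfterTexts: [')', ']', 'this', 'true'],
  regexAfterTexts: ['return', 'typeof', 'instanceof'],
};

function slashStartsRegex(prev: PrevToken | null, ctx: RegexContext): boolean {
  if (prev === null) return true; // start of input: nothing to divide
  // Assumed priority: keyword texts win over the token type, because a word
  // like `return` would otherwise look like an ordinary identifier.
  if (ctx.regexAfterTexts.indexOf(prev.text) !== -1) return true;
  if (ctx.divisionAfterTexts.indexOf(prev.text) !== -1) return false;
  if (ctx.divisionAfterTypes.indexOf(prev.type) !== -1) return false;
  return true; // after an operator, '(' or ',' an operand is expected
}

console.log(slashStartsRegex({ type: 'Ident', text: 'x' }, regexContext));      // false: x / 2
console.log(slashStartsRegex({ type: 'Ident', text: 'return' }, regexContext)); // true:  return /re/
console.log(slashStartsRegex({ type: 'Punct', text: ')' }, regexContext));      // false: (a) / 2
console.log(slashStartsRegex({ type: 'Punct', text: '=' }, regexContext));      // true:  x = /re/
```

Keeping the decision in data is what lets a language without regex literals simply omit the table, leaving the engine unchanged.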

Adding a language

A new language is one grammar file on the unchanged engine:

  1. Write the grammar with the combinator API (src/api.ts): tokens, operator precedence, rules. Everything language-specific lives here.
  2. Prove it as a parser against the language's own official test suite, measured bidirectionally (accept what the reference accepts, reject what it rejects).
  3. Drop in the official TextMate grammar as the baseline, so highlighter coverage is measured against what you're replacing, not asserted.

The lexer, CST types, and all three highlighters fall out of step 1; a dialect (.tsx/.jsx via jsx.ts, or Vue on html.ts) reuses a base grammar's rules by name in a few lines. The conformance/highlighter harnesses are currently TypeScript-specific (they call tsc and read VS Code's grammar); point them at your own reference compiler.

Known differences from the official highlighter

A handful of token patterns are scoped differently from VS Code's official TypeScript grammar. All are intentional, and in some Monogram is arguably more correct (these are deliberate divergences, distinct from the bug-class fixes the ledger measures):

| Token | Monogram | Official | Why we keep ours |
| --- | --- | --- | --- |
| console in console.log | support.variable | variable.other.object | We highlight built-in globals (console, window, …) distinctly: a deliberate, common choice. |
| transform (a function parameter) | variable.parameter | entity.name.function | It is a parameter. Official's heuristic mis-reads name: (…) => T as a function definition; we're more correct. |
| error (the method in console.error(…)) | entity.name.function | variable.other.readwrite | We scope a called method as a function name, arguably more informative. |

Built-in class names in type position (e.g. Error in extends Error) correctly emit entity.name.type, matching official; in value position (new Error()) they remain support.class, also matching official.

Matching the official grammar exactly would, in cases like transform, make the output worse. The metric counts these as differences, not defects.

Architecture

typescript.ts                one grammar (TypeScript combinator API)
        │
        ├─ src/gen-lexer.ts  ───────▶ lexer → tokens        (standalone: createLexer)
        │        ▲ composed by
        ├─ src/gen-parser.ts ───────▶ CST parser   (recursive descent + Pratt + packrat memoization;
        │                             run against the conformance suite = the grammar's proof)
        │
        ├─ src/gen-tm.ts ───────────▶ typescript.tmLanguage.json            (TextMate highlighter)
        ├─ src/gen-vscode-config.ts ▶ typescript.language-configuration.json (editor behavior)
        ├─ src/gen-treesitter.ts ───▶ tree-sitter/  (grammar.js + highlights.scm + scanner.c)
        ├─ src/gen-monarch.ts ──────▶ typescript.monarch.json
        └─ src/gen-ast-types.ts ────▶ typescript.cst-types.ts

shared  src/grammar-utils.ts          structural helpers used across stages
        src/api.ts, types.ts          the grammar's combinator + type surface

Every target is produced by the same structural scope-inference, retargeted per format. Lexer, parser, and generators are generic runtimes; all language specifics live in the grammar.

Prior art

| Tool | Parser | Highlighting | Single source |
| --- | --- | --- | --- |
| TextMate grammars | - | manual regex | - |
| tree-sitter | yes | queries (written separately) | - |
| ANTLR | yes | - | - |
| Langium | yes | Monarch (separate config) | - |
| ungrammar | AST types | - | - |
| Monogram | CST, conformance-proven | derived from the parser grammar | yes |

Every tool here has a real parser; none derives the highlighter from the parser's own grammar as a single source, which is the one thing Monogram is for.
