Replace Flex/Bison with the Lime parser generator. by gburd · Pull Request #25 · gburd/postgres

gburd · 2026-06-05T20:34:27Z

While Flex/Bison have served us well, Lime (an evolution of SQLite's Lemon parser generator) is faster, is actively maintained, and can enable runtime loading of additional grammars.

github-actions · 2026-06-05T21:44:12Z

📜 Change history & discussion (Agora / pg.ddx.io)

pg_plan_advice is a real upstream contrib module (Robert Haas thread). The PR's pg_plan_advice and cube files are real files; the PR converts their .y/.l grammars to .lime. But there is no mailing-list discussion of "Lime," QUEL revival, or replacing flex/bison anywhere.

🧵 Related discussion

No pgsql-hackers thread proposes, discusses, or even mentions a "Lime parser generator" or replacing Flex/Bison with it. Targeted keyword, hybrid, and semantic searches all returned unrelated hits (the only flex/bison results are Andres Freund's CI patches "ci: windows: Install bison flex via msys", which keep flex/bison).
No thread discusses reviving QUEL/Postquel as a contrib parser extension. The QUEL token matches were false positives (e.g. a psql "/**** QUERY ****/" thread). Confidence: high that no community discussion exists for the core premise of this PR.
The nearest real adjacent topic is Re: duckdb has extensible parser (Greg Burd, 2026-05-19), which is about extensible parsing generally — not about this work, not cited by it. Confidence: low as any kind of linkage.

🔗 Related commits / prior art

The PR touches two genuinely upstream contrib grammars: contrib/cube/cubeparse.y/cubescan.l and contrib/pg_plan_advice/pgpa_parser.y/pgpa_scanner.l. pg_plan_advice is real upstream work (Robert Haas; active thread "pg_plan_advice", e.g. Re: pg_plan_advice). The PR rewrites these into .lime form — there is no upstream commit or proposal doing so.
No commitfest entry, superseded commit, or rejected-approach thread was found for a Lime migration or QUEL contrib module.

🧭 Context for reviewers

The PR's framing ("Replace Flex/Bison with the Lime parser generator") has no basis in any pgsql-hackers discussion, commitfest entry, or upstream commit that the index can find. There is no community RFC, no buy-in, no design thread. A core-toolchain swap of this magnitude would require extensive list discussion; none exists.
"Lime" appears to be an external/private tool (the commit log references private "Letter NN" correspondence with an "upstream maintainer" and version bumps v0.2.x→v0.10.0). This is not a known PostgreSQL build dependency. Replacing Flex/Bison repo-wide is a non-starter without that discussion.
The diff is dominated by out-of-tree scaffolding that does not belong in a PostgreSQL patch: .github/ AI-review tooling, .idea/, .vscode/, .clangd, .envrc, AGENTS.md, OCR/sync workflows. This signals a fork/agent-driven repo, not an upstream-targeted submission.
The "QUEL" commits (revive Berkeley QUEL as a contrib parser-extension demo) are likewise unrelated to anything the community has requested or discussed.
Bottom line: treat this as an unsolicited fork experiment. There is no upstream thread to review it against, and the premise (drop Flex/Bison) contradicts current upstream direction, where flex/bison remain required build tools (per Andres Freund's CI work still installing them).

Generated by pg-history via the Agora MCP server (pg.ddx.io).

github-actions

🔍 OCR found 82 issue(s).

80 inline, 2 in summary

📄 `src/backend/parser/scan.c`

Correctness/security regression in $N parameter parsing. The param pattern is \${decdigit}+ (unbounded digits), but this copies only the first 31 bytes into buf[32]. For a token with >=32 digits, the trailing digits are silently dropped before pg_strtoint32_safe, so e.g. $000...0001 (with enough leading zeros) parses to a WRONG, smaller value instead of either the correct number or a parameter number too large error. The retired flex scanner ran pg_strtoint32_safe over the full yytext+1, which correctly rejected over-long inputs. The very next case (SCAN_TOK_ICONST_*) already does the right thing with palloc(len + 1). Replace the fixed buf[32] with a len-sized palloc'd copy so the whole digit run is parsed.
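The truncation described above can be modelled in a few lines. This is only an illustrative Python model of the C behaviour, not the scan.c code: `parse_param_truncated`, `parse_param_full`, and `_to_int32` are hypothetical names, and the 32-byte buffer size is taken from the review comment.

```python
INT32_MAX = 2**31 - 1


def _to_int32(digits: str) -> int:
    """Parse a decimal digit run, rejecting values that overflow int32."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a digit run: {digits!r}")
    value = int(digits)
    if value > INT32_MAX:
        raise OverflowError("parameter number too large")
    return value


def parse_param_truncated(token: str, bufsize: int = 32) -> int:
    """Model of the buggy path: keep at most bufsize - 1 digits, then parse."""
    digits = token[1:]              # drop the leading '$'
    kept = digits[: bufsize - 1]    # a fixed buffer silently drops the tail
    return _to_int32(kept)


def parse_param_full(token: str) -> int:
    """Model of the fix: parse the whole digit run (a len-sized copy)."""
    return _to_int32(token[1:])


# 32 digits: thirty leading zeros followed by "12".
token = "$" + "0" * 30 + "12"
print(parse_param_truncated(token))  # the trailing '2' is lost
print(parse_param_full(token))
```

With a 32-digit token the truncated path parses a different, smaller parameter number than the full path, which is the silent mis-parse the comment warns about.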

📄 `src/fe_utils/Makefile`

These compatibility-shim macros are unused and dangerous. Verified that all three ported scanners — psqlscan.c, psqlscanslash.c, and pgbench/exprscan.c — reference only the ST_-prefixed enum values (ST_INITIAL, ST_XB, ST_XQS, …); none use the bare identifiers. The comment's justification is false: exprscan.c resets via state->start_state = ST_INITIAL (lines 643/722/759), not bare INITIAL. The only occurrence of start_state = INITIAL in the whole frontend tree is inside this very comment.

Defining unscoped single/double-letter macros like xb, xc, xd, xe, xh, xq — and especially INITIAL — in a header transitively included (via psqlscan_emit.h) by psql, pgbench, and fe_utils translation units is namespace pollution that can silently rewrite unrelated local variables, struct members, or parameters, causing hard-to-diagnose miscompilation or build breakage. Since nothing references these names, delete the entire shim block (and the misleading comment above it) rather than keeping a speculative compatibility layer. (confidence: high)

github-actions · 2026-06-05T22:34:34Z

+  - -g
+  - -std=c11
+  - -I.
+  - -I../../../../src/include


This relative include path appears incorrect. The .clangd file is at the repository root, and clangd resolves relative paths in CompileFlags.Add relative to the .clangd file's directory. Since src/include lives directly under the repo root (src/include/postgres.h), ../../../../src/include resolves to four directories above the repo root and won't be found. It should likely be -I./src/include (or -Isrc/include).

Suggested change

- -I../../../../src/include

+ - -I./src/include
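To see why the four-level path misses, here is a small sketch of the path arithmetic, assuming (as the comment states) that the flag is resolved relative to the directory holding .clangd. `REPO_ROOT` is a hypothetical checkout location and `resolve_include` is an illustrative helper, not part of clangd.

```python
import posixpath

REPO_ROOT = "/home/dev/src/pg/postgres"  # hypothetical checkout location


def resolve_include(clangd_dir: str, flag_path: str) -> str:
    """Join a relative -I path onto the .clangd directory and normalise it."""
    return posixpath.normpath(posixpath.join(clangd_dir, flag_path))


print(resolve_include(REPO_ROOT, "../../../../src/include"))
print(resolve_include(REPO_ROOT, "./src/include"))
```

The first call climbs four levels out of the checkout before descending into src/include, so it lands outside the repository; the second stays inside it.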

github-actions · 2026-06-05T22:34:35Z

+ * raised). */
+extern int scan_lex_handle_unicode(void *user, int pos, char32_t c);
+extern void scan_lex_handle_xeu_second(void *user, int pos, char32_t c);
+extern void scan_lex_handle_xeescape(void *user, unsigned char c);


Prototype mismatch (high confidence). This declaration is missing the int pos parameter. The definition in scan.c is void scan_lex_handle_xeescape(void *user, int pos, unsigned char c) (scan.c:379) and the call site in scan.lex passes three arguments: scan_lex_handle_xeescape(user, SCAN_LEX_OFFSET(matched), (unsigned char) matched[1]) (scan.lex:520). Since scan.c includes this header, the two-argument prototype conflicts with the three-argument definition and will fail to compile ("conflicting types"). Fix the prototype to match.

Suggested change

- extern void scan_lex_handle_xeescape(void *user, unsigned char c);

+ extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);

github-actions · 2026-06-05T22:34:35Z

+	if (stat(so_path, &st) == 0 && S_ISREG(st.st_mode))
+	{
+		ereport(LOG,
+				(errmsg("grammar extension cache hit: %s", so_path)));
+		goto dlopen_step;
+	}


Cache-hit path is broken when only the .so survives. On a cache hit you only stat <hex>.so, then goto dlopen_step, which calls build_extension_keyword_map() -> AllocateFile("<hex>.h"). The persistent cache artifact is the .so; the .h/.c are byproducts that may have been cleaned up (or never present on a machine that copied only the .so). When <hex>.h is missing, build_extension_keyword_map returns false and the whole pipeline ereport(ERROR)s, defeating the cache entirely and making extensions fail whenever the header is absent. Either stat the .h alongside the .so before treating it as a hit, or persist the resolved (lexeme -> token_code) map rather than re-parsing the header on every load. (confidence: high)
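The first option (stat both artifacts before declaring a hit) amounts to the check below. This is a language-neutral sketch written in Python rather than the backend's C; `is_cache_hit` is a hypothetical helper, and the `<hex>.so` / `<hex>.h` naming is taken from the comment above.

```python
from pathlib import Path


def is_cache_hit(cache_dir: Path, digest: str) -> bool:
    """Treat the cache entry as usable only if both artifacts are regular files."""
    so_path = cache_dir / f"{digest}.so"
    header_path = cache_dir / f"{digest}.h"
    # A surviving .so without its header must fall through to a rebuild,
    # because the keyword map is re-parsed from the header on load.
    return so_path.is_file() and header_path.is_file()
```

A miss here would send the caller down the normal rebuild path instead of failing later when the header cannot be opened.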

github-actions · 2026-06-05T22:34:35Z

+	while (waitpid(pid, &status, 0) < 0)
+	{
+		if (errno != EINTR)
+		{
+			pfree(errbuf.data);
+			*errmsg_out = psprintf("waitpid() for %s failed: %m", progname);
+			return false;
+		}
+	}


These blocking syscalls run in a backend with no interrupt handling. The read() loop only special-cases EINTR (continue), and waitpid() only retries on EINTR; neither calls CHECK_FOR_INTERRUPTS(). A wedged or slow lime/cc makes the backend hang uninterruptibly — a query cancel (SIGINT) or SIGTERM cannot break it because the loop just resumes the syscall. Add CHECK_FOR_INTERRUPTS() in the read loop and around the waitpid retry, and consider a timeout so a stuck subprocess does not pin a connection indefinitely. (confidence: high)

github-actions · 2026-06-05T22:34:35Z

+	 * read at first-use; OpenPipeStream isn't suitable because
+	 * we want a tight read with a fixed buffer.
+	 */
+	pipe = popen("lime -v 2>/dev/null", "r");


lime is launched via execvp() (a PATH lookup, not an absolute path), resolve_cc() honors $CC, and resolve_lime_version() runs popen("lime -v ...") through /bin/sh. All three execute under the postmaster's privileges at parse time. A poisoned PATH (or attacker-controlled $CC) lets a local user substitute a malicious lime/cc that the server then runs and dlopens. Pin these to absolute, install-time-known paths (or validate them) instead of relying on PATH/$CC, and prefer the project's run_program/OpenPipeStream over shell-based popen. (confidence: moderate)

github-actions · 2026-06-05T22:34:38Z

+if not srcdir.is_dir():
+    sys.exit(f'lime_format_check: srcdir not found: {srcdir}')
+
+SKIP_PATTERNS = ('build', 'install', '.git', 'tmp_install')


The quote-aware scanner treats " and ' symmetrically as string delimiters, but in C these have different semantics: ' introduces a single character (char) literal, not an arbitrary-length string. This works for well-formed bodies, but the else-branch scan (lines below) stops at any ' or ", so a stray/unbalanced quote in an action — e.g. an apostrophe inside a comment, or a char literal containing a quote like '\'' whose escaped inner quote is mis-counted — can desynchronize the quote state. Once desynchronized, all subsequent $N/@N references are either silently skipped or rewritten inside what is actually code, corrupting the generated action with no diagnostic. Since this drives every generated parser action, consider hardening the scanner (track char-literals separately with proper escape rules, and assert balanced quoting) so any malformed input fails loudly rather than producing a subtly wrong .lime.
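One way to harden the scan is sketched below, under stated assumptions: it is not the converter's actual code, `scan_segments` is a hypothetical name, and it checks only literal termination (it does not validate that a char literal holds a single character). It tracks string literals, char literals, and comments separately, honours backslash escapes, and raises instead of guessing when a literal is unterminated.

```python
def scan_segments(body: str):
    """Split a C action body into (kind, text) pieces.

    kind is 'code', 'string', 'char', or 'comment'. Raises ValueError on an
    unterminated literal or comment so malformed input fails loudly.
    """
    segments = []
    i, n, code_start = 0, len(body), 0

    def flush_code(end):
        if end > code_start:
            segments.append(("code", body[code_start:end]))

    while i < n:
        ch = body[i]
        if body.startswith("/*", i):
            end = body.find("*/", i + 2)
            if end < 0:
                raise ValueError(f"unterminated comment at offset {i}")
            flush_code(i)
            segments.append(("comment", body[i:end + 2]))
            i = code_start = end + 2
        elif body.startswith("//", i):
            end = body.find("\n", i)
            end = n if end < 0 else end
            flush_code(i)
            segments.append(("comment", body[i:end]))
            i = code_start = end
        elif ch in "\"'":
            j = i + 1
            while j < n and body[j] != ch:
                if body[j] == "\n":
                    raise ValueError(f"unterminated literal at offset {i}")
                j += 2 if body[j] == "\\" else 1   # skip escaped character
            if j >= n:
                raise ValueError(f"unterminated literal at offset {i}")
            flush_code(i)
            segments.append(("string" if ch == '"' else "char", body[i:j + 1]))
            i = code_start = j + 1
        else:
            i += 1
    flush_code(n)
    return segments
```

A rewriter built on this would substitute $N/@N references only inside 'code' segments, so an apostrophe in a comment or an escaped quote in a char literal cannot shift the quote state.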

github-actions · 2026-06-05T22:34:38Z

+    error_count_zero = ('0 error(s)' in out
+                        or 'OK: no diagnostics' in out
+                        or '✓ No errors or warnings' in out)


Fragile substring match causes a false negative: '0 error(s)' in out matches any error count ending in 0 (e.g. 10 error(s), 20 error(s), 100 error(s)), so a grammar with 10/20/... errors would be treated as clean and pass linting, masking real failures. Match the whole count instead, e.g. parse with a regex like re.search(r'\b([0-9]+) error\(s\)', out) and compare the captured integer to 0.
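A minimal sketch of the anchored check follows. The three marker strings come from the quoted hunk; `ERROR_COUNT_RE` and `lint_is_clean` are hypothetical names, and the assumption that the count appears as "<N> error(s)" is taken from the comment, not from Lime's documentation.

```python
import re

# Capture the whole number in front of "error(s)"; \b stops "10" from
# being read as "0".
ERROR_COUNT_RE = re.compile(r"\b([0-9]+) error\(s\)")


def lint_is_clean(out: str) -> bool:
    """Return True only when the lint output reports zero errors."""
    match = ERROR_COUNT_RE.search(out)
    if match is not None:
        return int(match.group(1)) == 0
    # Fall back to the other success markers used by the original check.
    return "OK: no diagnostics" in out or "✓ No errors or warnings" in out
```

Comparing the captured integer avoids the suffix match: "10 error(s)" yields 10 and is reported as a failure.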

github-actions · 2026-06-05T22:34:38Z

+    if has_failure:
+        failures += 1
+        print(f'FAIL {rel}', file=sys.stderr)


Behavioral discrepancy with the header comment, which states "non-zero on the first lint failure." This loop continues through all files and exits non-zero only at the end (aggregate). Continuing is arguably the more useful behavior, but the documented contract should be updated to match (e.g. "reports all failures and exits non-zero if any file fails") to avoid misleading maintainers/tooling.

github-actions · 2026-06-05T22:34:38Z

+if args.aot:
+    if not args.output_aot:
+        sys.exit('--aot requires --aot-output')


The --aot-output validation is placed after Lime has already run with -j and after the .c/.h files have been moved. While meson always passes --aot and --aot-output together (so this won't trigger in practice), validating this required-combination right after parse_args() would fail fast and avoid leaving partial outputs. Consider moving the check before building/running the command.
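The fail-fast ordering could look like the sketch below. Only the `--aot` / `--aot-output` flags and the `output_aot` attribute name come from the quoted hunk; `parse_cli` and the rest of the parser setup are assumptions for illustration.

```python
import argparse


def parse_cli(argv):
    """Parse arguments and validate flag combinations before any work runs."""
    parser = argparse.ArgumentParser(prog="pglime")
    parser.add_argument("--aot", action="store_true")
    parser.add_argument("--aot-output", dest="output_aot")
    args = parser.parse_args(argv)
    # Reject the invalid combination here, before Lime is invoked or any
    # generated file is moved, so no partial outputs are left behind.
    if args.aot and not args.output_aot:
        parser.error("--aot requires --aot-output")
    return args
```

`parser.error` prints the usage line and exits with status 2, matching how argparse reports other argument errors.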

github-actions · 2026-06-05T22:34:38Z

+# - treats Lime's `.out` report as a build artefact worth keeping
+#   (mirrored from <outdir>/<basename>.out into <privatedir>)


This comment claims the wrapper mirrors Lime's .out report into <privatedir>, but the code never handles the .out file at all. Additionally, since --privatedir is already Lime's -d output dir, the report is written directly there, making the "mirrored ... into <privatedir>" wording self-contradictory. Please align the comment with the actual implementation (or implement the described mirroring) to avoid misleading future maintainers.

Keep master a pristine mirror of upstream plus our .github/ CI. These workflows rebase the .github-only commits onto postgres/postgres and push via SYNC_PAT (a PAT carrying the 'workflow' scope, required because the default GITHUB_TOKEN cannot update files under .github/workflows/):
- sync-upstream.yml (hourly schedule + manual dispatch)
- sync-upstream-manual.yml (on-demand, with a force-push toggle)

Review every PR (including drafts) with two jobs that authenticate to AWS Bedrock (Claude Opus 4.8) via GitHub OIDC (vars.AWS_ROLE_ARN); no static AWS credentials are stored in the repo.
- ocr-review: runs Alibaba Open Code Review through an ephemeral LiteLLM proxy bridging OCR's OpenAI protocol to Bedrock, and posts inline review comments. Uses output_config.effort=xhigh (Opus 4.8 adaptive thinking). Path-scoped rules (.github/ocr/rule.json) encode PostgreSQL community review standards plus reviewer discipline (verify against the diff, don't hallucinate, state confidence, be blunt, accuracy over approval).
- pg-history: OCR cannot call MCP, so a separate Bedrock tool-use agent (.github/ocr/pg-history.py) queries the Agora MCP server (pg.ddx.io) to tie the change to git + pgsql-hackers history, and upserts a comment linking threads as https://pg.ddx.io/m/pgsql-hackers/<message-id>.

Pulls in the Nix flake (flake.nix, flake.lock, shell.nix) and a small pg-aliases.sh helper that the contributing maintainer uses locally to drive the build with a Lime version pinned to a specific commit. Not for upstream submission -- developers who do not use Nix can ignore these files. Pinning Lime upstream by commit hash here gives a reproducible build environment without forcing other developers to install or upgrade Lime out-of-band. When upstream Lime cuts a new release the maintainer bumps flake.lock alongside the meson.build version floor. The full bison-to-Lime migration series is independent of these files; non-Nix contributors who install Lime via their package manager (or build it from source per installation.sgml) get the same end result.

This commit lands the build-system support for the Lime LALR(1) parser generator without changing any in-tree grammar. Subsequent commits port flex+bison grammars to Lime one component at a time; each is independently bisectable, so this commit on its own is a no-op for the runtime.

What's added:
* src/tools/pglime -- Python wrapper that invokes `lime` for custom_target() rules in component meson.build files. Handles --output-dir, --aot, --jit-compatible flags.
* src/tools/lime_lint, lime_format, lime_format_check -- user-facing helper scripts driving `lime -L` (lint) and `lime -F` (format) over .lime files.
* meson.build:
  - dependency('lime', required: false, version: '>=1.3.1') plus dependency('lime-compiler') for the in-process snapshot-build library shipped from v0.9.4 onwards (LTS line through v1.3.1).
  - find_program('lime', native: true) for the code-generation path.
  - lime_cmd kwargs dict consumed by per-component custom_target rules in subsequent commits.
* meson_options.txt:
  - LIME option for explicit lime-binary path.
  - lime_aot feature (default 'auto') gating Lime's AOT-compiled action-table code path.
* src/makefiles/meson.build -- pgxs export so out-of-tree contrib modules can find the Lime tooling.
* src/tools/pgindent/pgindent -- recognise Lime's static_library helpers as proper C functions (cosmetic).

References
Lime upstream: https://codeberg.org/gregburd/lime
Lime version pinned at v1.3.1 (commit f1970c4).

This commit ports two small grammars in src/backend/replication from flex+bison to Lime: Phase 2a -- syncrep_gram.y -> syncrep_gram.lime (synchronous_standby_names parser) Phase 2b -- repl_gram.y -> repl_gram.lime (walsender command parser) and the matching scanners from .l to Lime .lex format (Phase 5 of the migration). Each component gets its own per-grammar yytype.h header (repl_gram_yytype.h, syncrep_parse.h) declaring the YYSTYPE union shared between the parser action bodies and the scanner. The scanner driver code (syncrep_scanner.c, repl_scanner.c) is hand-rolled C; it sits between the Lime-emitted lex DFA and the Lime-emitted parser table-driven dispatch, translating internal sentinel codes into parser tokens with appropriate yylval shaping. Both grammars are small (syncrep: ~120 lines; repl: ~450 lines) and the SQL surface they parse is unchanged. All recovery and replication-protocol regression tests pass byte-identically with the new Lime backends. Per-component meson.build wraps the Lime-generated .c files in a static_library with c_args ['-Wno-missing-prototypes', '-Wno-unused-variable'] -- the Lime template emits both classes of warning by design. The relax stays local to the generated .c; the hand-rolled drivers compile under PG's standard -Wall flags.

Port src/backend/bootstrap from flex+bison to Lime: bootparse.y -> bootparse.lime (BKI parser) bootscanner.l -> bootscanner.lex (BKI scanner) The BKI (Backend Interpreter) language is the bootstrap mini-language used by initdb to populate the system catalogs from postgres.bki. It's a small grammar (~200 productions) and parses a fixed file generated at build time, so the test surface is initdb itself. The hand-rolled bootscanner.c is a thin driver shim around the Lime-emitted lex DFA. Keyword recognition still uses ScanKeywordLookup against bootkw.h (unchanged). initdb runs end-to-end against the new parser/scanner pair with byte-identical catalog output.

Port src/test/isolation from flex+bison to Lime:
  specparse.y -> specparse.lime (isolation spec parser)
  specscanner.l -> specscanner.lex (isolation spec scanner)

The isolation tester reads its own DSL from src/test/isolation/specs/*.spec describing concurrent transactions to schedule. This is a small, single-purpose parser with no extension-API exposure. The driver (specscanner.c) is hand-rolled. Two %literal_buffer-using patterns: QIDENT ("..." with "" -> ") and SQLBLK ({ ... } with leading/trailing whitespace stripped). Single accumulator scanstr reused across the two states. Isolation tests run unchanged after the port.

Port jsonpath's parser and scanner from flex+bison to Lime:
  jsonpath_gram.y -> jsonpath_gram.lime
  jsonpath_scan.l -> jsonpath_scan.lex

The jsonpath grammar implements the SQL/JSON path-expression language (jsonb_path_query and friends). This is one of the larger ports in the migration: jsonpath_scan.lex carries four exclusive states (XQ for double-quoted strings, XNQ for unquoted identifiers, XVQ for $variable references, XC for comments) and combined <XNQ, XQ, XVQ> rules sharing escape-sequence handling. The driver (jsonpath_scan.c) post-processes the lexer's emitted tokens via a small set of internal sentinel codes (JP_TOK_STRING_TAKE / VARIABLE_TAKE / NUMERIC_TEXT / INT_TEXT / RAW_CHAR / VARIABLE_BARE) before handing them to the Lime parser.

A small expected-output update lands for jsonpath.out and sqljson_queryfuncs.out: Lime's syntax-error messages on a few already-failing inputs differ from Bison's at the exact column, which the test cases assert. The changes are cosmetic; the actual error condition (and SQL-visible behaviour) is byte-identical to the bison parser.

The frontend SQL+slash-command lexer (psqlscan / psqlscanslash) and the pgbench expression evaluator share a common PsqlScanStateData and must be ported together. This commit ports three coupled scanners and one parser: src/fe_utils/psqlscan.l -> psqlscan.lex (SQL lexer core, 11 states) src/bin/psql/psqlscanslash.l -> psqlscanslash.lex (slash-cmd parser, 8 states) src/bin/pgbench/exprparse.y -> exprparse.lime (pgbench expression parser) src/bin/pgbench/exprscan.l -> exprscan.lex (pgbench expression lexer) Strategy D (per-call lex): each psql_scan() call allocates a fresh Foo_Lexer over the remaining buffer slice, runs LexFeedBytes until the first stop point, frees. Cursor advance is tracked by the driver (StackElem.pos); variable-substitution ":varname" expansion uses the existing buffer_stack with recursive psql_scan into pushed StackElem. The pgbench expression lexer's INITIAL state (one-word-at-a-time expr_lex_one_word) stays hand-rolled -- it's too small to pay for porting and doesn't fit Lime's pre-scan-emit-callback shape. The EXPR state (the bulk of pgbench's expression syntax) is Lime-driven. A new psqlscan_emit.h shared header carries the PsqlEmitCtx and the variable-substitution helper prototypes shared by the three drivers. psql and pgbench tests pass unchanged (681/681 pgbench tests).

Port the GUC configuration-file scanner from flex to Lime: src/backend/utils/misc/guc-file.l -> guc_file.lex postgresql.conf is read by guc-file.c on every postmaster startup and SIGHUP. The original parser was scanner-only; no bison grammar. This commit ports the scanner to Lime's .lex format. The driver (guc-file.c) consumes a pre-scanned token FIFO; ConfigFileLineno is bumped on EOL token pop. Quoted-string values match a regex span and are post-processed via the existing DeescapeQuotedString -- the scanner doesn't need %literal_buffer here. Existing GUC config-file regression tests and SIGHUP-reload tests run unchanged after the port.

The centerpiece of the migration: replace gram.y + scan.l with gram.lime + scan.lex. Also retire ecpg's bison input (preproc.y) by feeding gram.lime through a reverse converter back to bison-shaped grammar text that ecpg's parse.pl reads.

Backend SQL parser:
  src/backend/parser/gram.y -> gram.lime (~21k lines, mechanically translated)
  src/backend/parser/scan.l -> scan.lex (745 lines, 11 exclusive states)
  src/backend/parser/scan.c hand-rolled driver wrapping the Lime-emitted lex DFA in the public scanner.h API (core_yylex, scanner_init/finish, base_yylex's 2-token lookahead)

The gram.lime file is mechanically derived from the previous gram.y by src/tools/lime_convert_gram.py: the converter rewrites %type / %union / mid-rule actions / @n location refs / inline char-literal terminals / %name-prefix etc. into Lime's idiom. Action bodies port verbatim modulo $$/$N rewriting. The converter is preserved in-tree so future gram.y-style edits can still be made (edit gram.y, run the converter, regenerate gram.lime), though for this commit gram.y itself is dropped.

ecpg/preproc:
  src/interfaces/ecpg/preproc/pgc.l -> pgc.c (hand-rolled C) + pgc.lex (Lime .lex DFA; 21 exclusive states, ECPG-specific includes)
  src/interfaces/ecpg/preproc/parser.c bridge code

ecpg historically generates its own preproc.y by running parse.pl over backend gram.y. parse.pl now reads gram.lime via src/tools/lime_to_bison_gram.py, a reverse converter that emits a bison-syntax skeleton. This keeps ecpg buildable without retargeting it to Lime in this commit (a future commit can do that independently).

Public ABI of scanner.h preserved exactly:
- core_yylex returns the same token codes as before
- base_yylex's 2-token lookahead (NOT BETWEEN -> NOT_LA, etc.) runs unchanged
- YYLLOC location offsets match bison's

The conversion is byte-identical at the parse-tree level. Standard regress + ecpg suites pass. Three small expected-output adjustments land for graph_table.out, jsonpath.out, and sqljson_queryfuncs.out where Lime's syntax-error messages report the offending lookahead at a slightly different column than Bison's. These are cosmetic in user-visible behaviour; all SQL semantics unchanged.

Port plpgsql's grammar from bison to Lime: src/pl/plpgsql/src/pl_gram.y -> pl_gram.lime (~4250 lines) plpgsql is the largest single grammar after the backend SQL grammar. Its push-driven parser interacts with the surrounding plpgsql_yylex routine (still hand-rolled C in pl_scanner.c). Two non-trivial bridges between bison-pull and Lime-push semantics: 1. The bison parser has empty-rule lookahead via Parse_get_lookahead. Lime exposes the same via parse_token_offset; pl_scanner.c consults it where needed. 2. Helper-function lex (peek-ahead in plpgsql_yy_drain_lookahead) and scanner-state mutation via driver-level K_DECLARE/K_BEGIN mirroring keep the existing pl_scanner.c surface intact. Lime's per-rule reduce-callback signature replaces bison's yylval-via-global pattern, but pl_gram_types.h declares the shared YYSTYPE union exactly as before, so action bodies port verbatim. plpgsql regression suite passes byte-identically; the plpgsql test module (PG's largest regression group after the standard regress) shows no diffs.

Three contrib modules ship their own bison+flex grammars. This commit retires them in favor of Lime parser + scanner pairs sharing the same migration pattern as the in-tree grammars:
  contrib/cube: cubeparse.y -> cubeparse.lime, cubescan.l -> cubescan.lex
  contrib/seg: segparse.y -> segparse.lime, segscan.l -> segscan.lex
  contrib/pg_plan_advice: pgpa_parser.y -> pgpa_parser.lime, pgpa_scanner.l -> pgpa_scanner.lex

Each module gets a hand-rolled driver C file translating Lime's emit-callback sentinel codes into the existing token vocabulary that the parser action bodies expect.

Three small expected-output adjustments land:
- contrib/cube/expected/cube.out: the (A) lhs label was cosmetically dropped from box's four alternatives and the leading bare 'A' removed from one error message DETAIL line (Lime's emitter substitutes letter labels inside string literals; we worked around that by removing the unused label).
- contrib/seg/expected/seg.out: similar cosmetic error-message DETAIL deltas.
- contrib/pg_plan_advice/expected/syntax.out: error-position deltas where Lime reports 'at or near "("' or '")"' at a slightly different column than Bison.

All three modules' regression tests pass with these byte-level adjustments; SQL semantics are unchanged.

The bison-to-Lime port replaces flex+bison with the Lime LALR(1) parser generator from https://codeberg.org/gregburd/lime. This commit updates the installation chapter to reflect:
* Lime >= 0.12.0 is required for builds from the git repo. Source tarballs continue to ship pre-generated parser/scanner .c/.h files (the same discipline PG already uses for bison/flex output), so end-user tarball builds do not need Lime.
* bison and flex are still listed -- contrib modules and out-of-tree extensions can keep using .y/.l grammars via the existing pgxs interface. After the migration only the in-tree grammars are converted; flex+bison remain optional dependencies for the wider ecosystem.
* Suggested install paths (distro packages where available; source build from codeberg.org/gregburd/lime otherwise).

Also drops src/backend/utils/misc/.gitignore -- the file's only purpose was to ignore the bison output from the (no-longer-bison) guc-file.l, and that scanner is now Lime-driven via guc_file.lex.

Adds a runtime grammar-extension API enabling extensions to register new tokens, productions, and reduce callbacks before the first parse, then rebuilds the SQL parser to incorporate them. This is a foundation for runtime-extensible SQL dialects (QUEL revival in contrib/quel as a demonstration; out-of-tree DSLs like a DuckDB-compat or MongoDB-JSONB syntax via the same API).

Public API (include/parser/parser_extension.h):
  PgGrammarExtension *pg_grammar_ext_create(name, version);
  void pg_grammar_ext_add_token(...);
  void pg_grammar_ext_add_rule(...);
  void pg_grammar_ext_set_precedence(...);
  bool pg_grammar_ext_register(ext, &err);

Calls are valid only from _PG_init() of a shared_preload_libraries-loaded module, before raw_parser() runs for the first time.

Implementation (parser_extension.c):

Track A subprocess pipeline: at first parse, walk the registered extensions, serialize them into a .lime fragment text alongside the base gram.lime, fork+exec lime + cc to produce a rebuilt parser .so, dlopen it, and dispatch base_yyparse through a function pointer (base_yyparse_fn) that points at the rebuilt symbol. Cache the .so under $PGDATA/pg_parser_cache/<sha256>.so.

Phase 1 scanner hook in scan.c: extension-registered keywords that don't appear in the compile-time ScanKeywords table are caught by pg_grammar_ext_keyword_hook after the base lookup misses. Returns the rebuilt parser's token code; the rebuild step ensures the parser tables know about it.

Why [DO NOT MERGE]:
* The API surface is intentionally small but the runtime re-build (fork + lime + cc + dlopen) is operationally heavy on cold cache: the first parse after postmaster start with extensions loaded takes ~9s. Warm cache is ~11ms. Production OLTP overhead with no parsing-bound workload is 0.5-2%; parser-bound benchmarks see 4-12%.
* The keyword shadowing rules are non-obvious (extensions cannot override base SQL keywords; the hook fires only on base lookup miss). Documented in parser_extension.h, but this constraint surprises authors who expect MySQL-compat or DuckDB-compat dialects to override SHOW or ATTACH.
* Track B (in-process snapshot patching, no subprocess) is designed but not implemented. Track A works in production today; Track B would cut the 9s cold cost to ~5ms but requires invasive parser.c surgery.
* No -hackers consensus on whether runtime grammar extensions belong in core at all; this is RFC-quality work for review and discussion.

Tests: see [DO NOT MERGE] commits below for grammar_ext_compose, grammar_ext_overlap, dummy_grammar_ext, lime_in_process_smoke, parser_microbench and contrib/quel that exercise this API.

Five test modules exercising the runtime grammar-extension API:
* dummy_grammar_ext -- minimal smoke test: 1 token, 1 rule, 1 reduce callback. Verifies end-to-end registry -> rebuild -> dlopen -> parse pipeline.
* grammar_ext_compose -- 6 small extensions composed in 8 different load-order permutations. 22 sub-tests covering token-name no-op vs collision, cross-extension references, precedence, cache-key determinism, and base-grammar invariance.
* grammar_ext_overlap -- 5 simulator extensions (DuckDB-compat, MySQL-compat, MongoDB-JSONB, pg_infer, QUEL-lite) loaded simultaneously. 42 sub-tests covering one-rebuild-for-all, 13-keyword reachability, mixed SQL+extension-DSL sessions, order independence, subset-load fallthrough.
* lime_in_process_smoke -- exercises the in-process lime_compile_grammar_in_process path (Track B Phase 2 Step 1).
* parser_microbench -- direct raw_parser() timing benchmark: 1738 ns/parse for SELECT 1, 5207 ns for realistic OLTP, 5539 ns for DDL on a debug build.

These modules together demonstrate the API works under realistic multi-extension composition. They are NOT for upstream merge: they belong in test/modules as research artifacts, not as part of the core test surface.

…nsion

Demonstrates the runtime grammar-extension API by reviving the Berkeley QUEL query language from the original POSTGRES (1986) as a contrib module. All five Berkeley QUEL forms are supported via the Lime extension API:

  RANGE OF e IS emp                    -- tuple-variable binding
  RETRIEVE (e.name, e.salary)          -- SELECT
      where e.dept = 'shoe'
  RETRIEVE (e.name) BY e.salary DESC   -- SELECT ... ORDER BY DESC
  REPLACE emp (salary = 50000)         -- UPDATE WHERE
      where dept='shoe'
  APPEND TO emp (name='alice', ...)    -- INSERT
  DELETE emp where salary < 1000       -- DELETE WHERE

Each form constructs a real PostgreSQL parse-tree node (SelectStmt / UpdateStmt / InsertStmt / DeleteStmt) at parse time, flowing through parse_analyze + planner + executor + EXPLAIN unchanged. 9 SQL/QUEL equivalence assertions in t/001_quel.pl prove the parser produces identical results to the equivalent SQL.

Keyword shadowing constraint: extension keywords can't override base SQL keywords, so QUEL uses a q_-prefix for words that conflict (q_range, q_of, q_is, q_to, q_by, q_replace, q_delete, q_into). Documented in the SGML chapter (doc/src/sgml/quel.sgml) and in parser_extension.h's pg_grammar_ext_keyword_hook block.

Why [DO NOT MERGE]:
* QUEL itself has no production users. This is a demonstration of the runtime extension API at a non-trivial scale (30 rules, 8 token types, 10 keyword tokens), not a proposal to add Berkeley QUEL to PostgreSQL core.
* The SGML chapter is informative but exceeds what most contrib modules ship; it includes historical context, a syntax reference, 6 worked examples, and a Limitations section.

This commit ships QUEL as a research artifact alongside the runtime extension API. Anyone interested in writing a similar DSL extension can read contrib/quel as a worked-out example.

gburd changed the title ~~Lime~~ Replace Flex/Bison with the Lime parser generator. Jun 5, 2026

gburd force-pushed the master branch from b329d50 to d8c4a44 Compare June 5, 2026 20:36

gburd force-pushed the master branch from 4bbd074 to 269d8e7 Compare June 5, 2026 22:26

github-actions Bot reviewed Jun 5, 2026

gburd force-pushed the master branch 25 times, most recently from 813bde8 to ed90aaa Compare June 12, 2026 01:48

gburd force-pushed the master branch 2 times, most recently from 4e2b6f9 to 5fa39a2 Compare June 15, 2026 03:07

gburd added 2 commits June 15, 2026 10:06

gburd force-pushed the master branch 8 times, most recently from fbd035a to 54ae267 Compare June 16, 2026 13:41

gburd added 15 commits June 16, 2026 10:39

gburd force-pushed the lime branch from ac49919 to aeb3088 Compare June 16, 2026 14:57

gburd force-pushed the master branch 2 times, most recently from 4100e10 to 5a292fb Compare June 17, 2026 00:40

Replace Flex/Bison with the Lime parser generator. #25
gburd wants to merge 17 commits into master from lime

	extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);

		# - treats Lime's `.out` report as a build artefact worth keeping
		# (mirrored from <outdir>/<basename>.out into <privatedir>)

📄 `src/backend/parser/scan.c`

📄 `src/fe_utils/Makefile`