Replace Flex/Bison with the Lime parser generator. #25

Draft
gburd wants to merge 17 commits into master from lime

@gburd gburd commented Jun 5, 2026

Owner

While Flex/Bison have served us well, Lime (an evolution of SQLite's Lemon parser generator) is faster and maintained, and it can enable runtime loading of additional grammars.

@gburd gburd changed the title from "Lime" to "Replace Flex/Bison with the Lime parser generator." Jun 5, 2026

github-actions Bot commented Jun 5, 2026

📜 Change history & discussion (Agora / pg.ddx.io)

pg_plan_advice is a real upstream contrib module (Robert Haas thread). The PR's pg_plan_advice and cube files are real files; the PR converts their .y/.l grammars to .lime. But there is no mailing-list discussion of "Lime," QUEL revival, or replacing flex/bison anywhere.

🧵 Related discussion

  • No pgsql-hackers thread proposes, discusses, or even mentions a "Lime parser generator" or replacing Flex/Bison with it. Targeted keyword, hybrid, and semantic searches all returned unrelated hits (the only flex/bison results are Andres Freund's CI patches "ci: windows: Install bison flex via msys", which keep flex/bison).
  • No thread discusses reviving QUEL/Postquel as a contrib parser extension. The QUEL token matches were false positives (e.g. a psql "/**** QUERY ****/" thread). Confidence: high that no community discussion exists for the core premise of this PR.
  • The nearest real adjacent topic is Re: duckdb has extensible parser (Greg Burd, 2026-05-19), which is about extensible parsing generally — not about this work, not cited by it. Confidence: low as any kind of linkage.

🔗 Related commits / prior art

  • The PR touches two genuinely upstream contrib grammars: contrib/cube/cubeparse.y/cubescan.l and contrib/pg_plan_advice/pgpa_parser.y/pgpa_scanner.l. pg_plan_advice is real upstream work (Robert Haas; active thread "pg_plan_advice", e.g. Re: pg_plan_advice). The PR rewrites these into .lime form — there is no upstream commit or proposal doing so.
  • No commitfest entry, superseded commit, or rejected-approach thread was found for a Lime migration or QUEL contrib module.

🧭 Context for reviewers

  • The PR's framing ("Replace Flex/Bison with the Lime parser generator") has no basis in any pgsql-hackers discussion, commitfest entry, or upstream commit that the index can find. There is no community RFC, no buy-in, no design thread. A core-toolchain swap of this magnitude would require extensive list discussion; none exists.
  • "Lime" appears to be an external/private tool (the commit log references private "Letter NN" correspondence with an "upstream maintainer" and version bumps v0.2.x→v0.10.0). This is not a known PostgreSQL build dependency. Replacing Flex/Bison repo-wide is a non-starter without that discussion.
  • The diff is dominated by out-of-tree scaffolding that does not belong in a PostgreSQL patch: .github/ AI-review tooling, .idea/, .vscode/, .clangd, .envrc, AGENTS.md, OCR/sync workflows. This signals a fork/agent-driven repo, not an upstream-targeted submission.
  • The "QUEL" commits (revive Berkeley QUEL as a contrib parser-extension demo) are likewise unrelated to anything the community has requested or discussed.
  • Bottom line: treat this as an unsolicited fork experiment. There is no upstream thread to review it against, and the premise (drop Flex/Bison) contradicts current upstream direction, where flex/bison remain required build tools (per Andres Freund's CI work still installing them).

Generated by pg-history via the Agora MCP server (pg.ddx.io).

@github-actions github-actions Bot left a comment

🔍 OCR found 82 issue(s).

  • 80 inline, 2 in summary

📄 src/backend/parser/scan.c

Correctness/security regression in $N parameter parsing. The param pattern is \${decdigit}+ (unbounded digits), but this copies only the first 31 bytes into buf[32]. For a token with >=32 digits, the trailing digits are silently dropped before pg_strtoint32_safe, so e.g. $000...0001 (with enough leading zeros) parses to a WRONG, smaller value instead of either the correct number or a parameter number too large error. The retired flex scanner ran pg_strtoint32_safe over the full yytext+1, which correctly rejected over-long inputs. The very next case (SCAN_TOK_ICONST_*) already does the right thing with palloc(len + 1). Replace the fixed buf[32] with a len-sized palloc'd copy so the whole digit run is parsed.
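The truncation described above can be reproduced with a small Python model. This is only a sketch of the behaviour, not the PostgreSQL C code: the function names are hypothetical, and the 31-byte copy mirrors the buf[32] described in the comment.

```python
INT32_MAX = 2**31 - 1


def _to_int32(digits: str) -> int:
    """Parse an ASCII digit run, rejecting values that do not fit in int32."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a digit run: {digits!r}")
    value = int(digits)
    if value > INT32_MAX:
        raise OverflowError("parameter number too large")
    return value


def parse_param_truncated(token: str, bufsize: int = 32) -> int:
    """Model of the buggy path: keep at most bufsize-1 digit bytes, then parse."""
    return _to_int32(token[1:][: bufsize - 1])


def parse_param_full(token: str) -> int:
    """Model of the suggested fix: parse the whole digit run after '$'."""
    return _to_int32(token[1:])


# 31 leading zeros followed by a 7: the last digit falls outside the buffer.
token = "$" + "0" * 31 + "7"
print(parse_param_truncated(token), parse_param_full(token))
```

With this input the truncated variant sees only the zeros, while the full-length variant sees the trailing digit as well.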


📄 src/fe_utils/Makefile

These compatibility-shim macros are unused and dangerous. Verified that all three ported scanners — psqlscan.c, psqlscanslash.c, and pgbench/exprscan.c — reference only the ST_-prefixed enum values (ST_INITIAL, ST_XB, ST_XQS, …); none use the bare identifiers. The comment's justification is false: exprscan.c resets via state->start_state = ST_INITIAL (lines 643/722/759), not bare INITIAL. The only occurrence of start_state = INITIAL in the whole frontend tree is inside this very comment.

Defining unscoped single/double-letter macros like xb, xc, xd, xe, xh, xq — and especially INITIAL — in a header transitively included (via psqlscan_emit.h) by psql, pgbench, and fe_utils translation units is namespace pollution that can silently rewrite unrelated local variables, struct members, or parameters, causing hard-to-diagnose miscompilation or build breakage. Since nothing references these names, delete the entire shim block (and the misleading comment above it) rather than keeping a speculative compatibility layer. (confidence: high)

Comment thread .clangd Outdated
- -g
- -std=c11
- -I.
- -I../../../../src/include

This relative include path appears incorrect. The .clangd file is at the repository root, and clangd resolves relative paths in CompileFlags.Add relative to the .clangd file's directory. Since src/include lives directly under the repo root (src/include/postgres.h), ../../../../src/include resolves to four directories above the repo root and won't be found. It should likely be -I./src/include (or -Isrc/include).

Suggested change
- -I../../../../src/include
+ - -I./src/include

* raised). */
extern int scan_lex_handle_unicode(void *user, int pos, char32_t c);
extern void scan_lex_handle_xeu_second(void *user, int pos, char32_t c);
extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);

Prototype mismatch (high confidence). This declaration is missing the int pos parameter. The definition in scan.c is void scan_lex_handle_xeescape(void *user, int pos, unsigned char c) (scan.c:379) and the call site in scan.lex passes three arguments: scan_lex_handle_xeescape(user, SCAN_LEX_OFFSET(matched), (unsigned char) matched[1]) (scan.lex:520). Since scan.c includes this header, the two-argument prototype conflicts with the three-argument definition and will fail to compile ("conflicting types"). Fix the prototype to match.

Suggested change
extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);
extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);

Comment on lines +1251 to +1256
if (stat(so_path, &st) == 0 && S_ISREG(st.st_mode))
{
    ereport(LOG,
            (errmsg("grammar extension cache hit: %s", so_path)));
    goto dlopen_step;
}

Cache-hit path is broken when only the .so survives. On a cache hit you only stat <hex>.so, then goto dlopen_step, which calls build_extension_keyword_map() -> AllocateFile("<hex>.h"). The persistent cache artifact is the .so; the .h/.c are byproducts that may have been cleaned up (or never present on a machine that copied only the .so). When <hex>.h is missing, build_extension_keyword_map returns false and the whole pipeline ereport(ERROR)s, defeating the cache entirely and making extensions fail whenever the header is absent. Either stat the .h alongside the .so before treating it as a hit, or persist the resolved (lexeme -> token_code) map rather than re-parsing the header on every load. (confidence: high)
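The first option (require both artifacts before declaring a hit) can be sketched as a Python model. The function name and directory layout are assumptions for illustration; the real check would live in the C code around stat().

```python
from pathlib import Path


def is_cache_hit(cache_dir: Path, digest: str) -> bool:
    """Treat the entry as a hit only if every file the load path reads exists.

    Assumes (hypothetically) that artifacts are named <digest>.so and
    <digest>.h inside cache_dir, as the review comment describes.
    """
    so_path = cache_dir / f"{digest}.so"
    header_path = cache_dir / f"{digest}.h"
    return so_path.is_file() and header_path.is_file()
```

With this shape, a cache directory holding only the .so falls through to the rebuild path instead of failing later when the header is opened.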

Comment on lines +1088 to +1096
while (waitpid(pid, &status, 0) < 0)
{
    if (errno != EINTR)
    {
        pfree(errbuf.data);
        *errmsg_out = psprintf("waitpid() for %s failed: %m", progname);
        return false;
    }
}

These blocking syscalls run in a backend with no interrupt handling. The read() loop only special-cases EINTR (continue), and waitpid() only retries on EINTR; neither calls CHECK_FOR_INTERRUPTS(). A wedged or slow lime/cc makes the backend hang uninterruptibly — a query cancel (SIGINT) or SIGTERM cannot break it because the loop just resumes the syscall. Add CHECK_FOR_INTERRUPTS() in the read loop and around the waitpid retry, and consider a timeout so a stuck subprocess does not pin a connection indefinitely. (confidence: high)
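The suggested shape (poll the child, check for cancellation on every iteration, enforce a deadline) can be sketched in Python. This is an analogy rather than backend code: check_cancel stands in for CHECK_FOR_INTERRUPTS(), and the function name, poll interval and timeout are assumptions.

```python
import subprocess
import sys
import time


def wait_with_interrupts(proc, timeout_s, check_cancel, poll_s=0.05):
    """Wait for proc without blocking indefinitely.

    check_cancel() is called on every iteration and may raise to abort;
    the child is killed and reaped on cancellation or timeout.
    """
    deadline = time.monotonic() + timeout_s
    try:
        while True:
            check_cancel()
            try:
                return proc.wait(timeout=poll_s)
            except subprocess.TimeoutExpired:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"subprocess exceeded {timeout_s}s")
    except BaseException:
        proc.kill()
        proc.wait()
        raise


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    print(wait_with_interrupts(child, 5.0, lambda: None))
```

The key design point is that no single wait can outlast poll_s, so a cancel request or an expired deadline is noticed promptly.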

* read at first-use; OpenPipeStream isn't suitable because
* we want a tight read with a fixed buffer.
*/
pipe = popen("lime -v 2>/dev/null", "r");

lime is launched via execvp() (a PATH lookup with no absolute path), resolve_cc() honors $CC, and resolve_lime_version() runs popen("lime -v ...") through /bin/sh. All three execute under the postmaster's privileges at parse time. A poisoned PATH (or attacker-controlled $CC) lets a local user substitute a malicious lime/cc that the server then runs and dlopens. Pin these to absolute, install-time-known paths (or validate them) instead of relying on PATH/$CC, and prefer the project's run_program/OpenPipeStream over shell-based popen. (confidence: moderate)
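As an illustration of pinning, the following Python sketch refuses any tool path that is not absolute and runs the tool with an argument vector rather than through a shell. The helper names are hypothetical, and the real fix would be in the C code.

```python
import os
import subprocess
import sys


def require_pinned_tool(path: str) -> str:
    """Accept only an absolute path to an existing executable file."""
    if not os.path.isabs(path):
        raise ValueError(f"tool path must be absolute, got {path!r}")
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        raise ValueError(f"tool not found or not executable: {path!r}")
    return path


def run_pinned(path: str, args: list) -> str:
    """Run the pinned tool directly: no shell and no PATH lookup."""
    result = subprocess.run(
        [require_pinned_tool(path), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_pinned(sys.executable, ["-c", "print('ok')"]))
```

A bare name such as "lime" is rejected before anything is executed, which is the property the comment asks for.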

if not srcdir.is_dir():
    sys.exit(f'lime_format_check: srcdir not found: {srcdir}')

SKIP_PATTERNS = ('build', 'install', '.git', 'tmp_install')

The quote-aware scanner treats " and ' symmetrically as string delimiters, but in C these have different semantics: ' introduces a single character (char) literal, not an arbitrary-length string. This works for well-formed bodies, but the else-branch scan (lines below) stops at any ' or ", so a stray/unbalanced quote in an action — e.g. an apostrophe inside a comment, or a char literal containing a quote like '\'' whose escaped inner quote is mis-counted — can desynchronize the quote state. Once desynchronized, all subsequent $N/@N references are either silently skipped or rewritten inside what is actually code, corrupting the generated action with no diagnostic. Since this drives every generated parser action, consider hardening the scanner (track char-literals separately with proper escape rules, and assert balanced quoting) so any malformed input fails loudly rather than producing a subtly wrong .lime.
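A hardened scanner along these lines could look like the sketch below. It is not the converter's actual code: rewrite_refs and its callback are hypothetical names, and it handles only comments, string literals and char literals with backslash escapes.

```python
import re

REF_RE = re.compile(r"[$@]\d+")


def rewrite_refs(body: str, rewrite) -> str:
    """Apply rewrite() to $N/@N references that occur in code only.

    Comments and string/char literals are copied verbatim; an unterminated
    comment or literal raises instead of desynchronizing the scan.
    """
    out = []
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if body.startswith("/*", i):
            end = body.find("*/", i + 2)
            if end < 0:
                raise ValueError(f"unterminated comment at offset {i}")
            out.append(body[i:end + 2])
            i = end + 2
        elif body.startswith("//", i):
            end = body.find("\n", i)
            end = n if end < 0 else end
            out.append(body[i:end])
            i = end
        elif ch in "\"'":
            j = i + 1
            while j < n and body[j] != ch:
                if body[j] == "\n":
                    raise ValueError(f"newline in literal at offset {i}")
                j += 2 if body[j] == "\\" else 1  # skip escaped character
            if j >= n:
                raise ValueError(f"unterminated literal at offset {i}")
            out.append(body[i:j + 1])
            i = j + 1
        else:
            m = REF_RE.match(body, i)
            if m:
                out.append(rewrite(m.group(0)))
                i = m.end()
            else:
                out.append(ch)
                i += 1
    return "".join(out)
```

Because the apostrophe in a comment is consumed as part of the comment and '\'' is consumed as one literal, neither can flip the quote state for the code that follows.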

Comment thread src/tools/lime_lint
Comment on lines +65 to +67
error_count_zero = ('0 error(s)' in out
                    or 'OK: no diagnostics' in out
                    or '✓ No errors or warnings' in out)

Fragile substring match causes a false negative: '0 error(s)' in out matches any error count ending in 0 (e.g. 10 error(s), 20 error(s), 100 error(s)), so a grammar with 10/20/... errors would be treated as clean and pass linting — masking real failures. Match the count anchored to the start of the number instead, e.g. parse with a regex like re.search(r'\b([0-9]+) error\(s\)', out) and compare the captured integer to 0.
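A minimal sketch of the anchored check follows. The three output phrases are the ones quoted in the hunk above; the function name is_clean and the fallback order are assumptions.

```python
import re

ERROR_COUNT_RE = re.compile(r"\b([0-9]+) error\(s\)")


def is_clean(out: str) -> bool:
    """Return True only when the reported error count is exactly zero."""
    match = ERROR_COUNT_RE.search(out)
    if match is not None:
        # Compare the whole captured number, so "10 error(s)" is not clean.
        return int(match.group(1)) == 0
    # No count present: fall back to the explicit success phrases.
    return "OK: no diagnostics" in out or "✓ No errors or warnings" in out


print("0 error(s)" in "10 error(s)")   # the substring test wrongly matches
print(is_clean("10 error(s)"))         # the anchored test does not
```

The \b keeps the capture from starting in the middle of a number, which is exactly where the substring match goes wrong.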

Comment thread src/tools/lime_lint
Comment on lines +69 to +71
if has_failure:
    failures += 1
    print(f'FAIL {rel}', file=sys.stderr)

Behavioral discrepancy with the header comment, which states "non-zero on the first lint failure." This loop continues through all files and exits non-zero only at the end (aggregate). Continuing is arguably the more useful behavior, but the documented contract should be updated to match (e.g. "reports all failures and exits non-zero if any file fails") to avoid misleading maintainers/tooling.

Comment thread src/tools/pglime
Comment on lines +92 to +94
if args.aot:
    if not args.output_aot:
        sys.exit('--aot requires --aot-output')

The --aot-output validation is placed after Lime has already run with -j and after the .c/.h files have been moved. While meson always passes --aot and --aot-output together (so this won't trigger in practice), validating this required-combination right after parse_args() would fail fast and avoid leaving partial outputs. Consider moving the check before building/running the command.
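Moving the check forward might look like this sketch. Only --aot, --aot-output and args.output_aot come from the hunk above; the parser setup and the function name are assumptions about how pglime is structured.

```python
import argparse


def parse_cli(argv):
    """Parse arguments and validate flag combinations before doing any work."""
    parser = argparse.ArgumentParser(prog="pglime")
    parser.add_argument("--aot", action="store_true")
    parser.add_argument("--aot-output", dest="output_aot")
    args = parser.parse_args(argv)
    # Fail fast: nothing has been generated or moved at this point.
    if args.aot and not args.output_aot:
        parser.error("--aot requires --aot-output")
    return args
```

parser.error() prints the usage line and exits with status 2, so a bad invocation stops before Lime runs and leaves no partial outputs behind.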

Comment thread src/tools/pglime
Comment on lines +12 to +13
# - treats Lime's `.out` report as a build artefact worth keeping
# (mirrored from <outdir>/<basename>.out into <privatedir>)

This comment claims the wrapper mirrors Lime's .out report into privatedir, but the code never handles the .out file at all. Additionally, since --privatedir is already Lime's -d output dir, the report is written directly there, which makes the "mirrored ... into <privatedir>" wording self-contradictory. Please align the comment with the actual implementation (or implement the described mirroring) to avoid misleading future maintainers.

@gburd gburd force-pushed the master branch 25 times, most recently from 813bde8 to ed90aaa Compare June 12, 2026 01:48
@gburd gburd force-pushed the master branch 2 times, most recently from 4e2b6f9 to 5fa39a2 Compare June 15, 2026 03:07
gburd added 2 commits June 15, 2026 10:06
Keep master a pristine mirror of upstream plus our .github/ CI. These
workflows rebase the .github-only commits onto postgres/postgres and push
via SYNC_PAT (a PAT carrying the 'workflow' scope — required because the
default GITHUB_TOKEN cannot update files under .github/workflows/):
  - sync-upstream.yml         (hourly schedule + manual dispatch)
  - sync-upstream-manual.yml  (on-demand, with a force-push toggle)
Review every PR (including drafts) with two jobs that authenticate to AWS
Bedrock (Claude Opus 4.8) via GitHub OIDC (vars.AWS_ROLE_ARN); no static
AWS credentials are stored in the repo.

- ocr-review: runs Alibaba Open Code Review through an ephemeral LiteLLM
  proxy bridging OCR's OpenAI protocol to Bedrock, and posts inline review
  comments. Uses output_config.effort=xhigh (Opus 4.8 adaptive thinking).
  Path-scoped rules (.github/ocr/rule.json) encode PostgreSQL community
  review standards plus reviewer discipline (verify against the diff, don't
  hallucinate, state confidence, be blunt, accuracy over approval).
- pg-history: OCR cannot call MCP, so a separate Bedrock tool-use agent
  (.github/ocr/pg-history.py) queries the Agora MCP server (pg.ddx.io) to
  tie the change to git + pgsql-hackers history, and upserts a comment
  linking threads as https://pg.ddx.io/m/pgsql-hackers/<message-id>.
@gburd gburd force-pushed the master branch 8 times, most recently from fbd035a to 54ae267 Compare June 16, 2026 13:41
gburd added 15 commits June 16, 2026 10:39
Pulls in the Nix flake (flake.nix, flake.lock, shell.nix) and a
small pg-aliases.sh helper that the contributing maintainer uses
locally to drive the build with a Lime version pinned to a
specific commit.  Not for upstream submission -- developers who
do not use Nix can ignore these files.

Pinning Lime upstream by commit hash here gives a reproducible
build environment without forcing other developers to install or
upgrade Lime out-of-band.  When upstream Lime cuts a new
release the maintainer bumps flake.lock alongside the
meson.build version floor.

The full bison-to-Lime migration series is independent of these
files; non-Nix contributors who install Lime via their package
manager (or build it from source per installation.sgml) get the
same end result.
This commit lands the build-system support for the Lime LALR(1)
parser generator without changing any in-tree grammar.  Subsequent
commits port flex+bison grammars to Lime one component at a time;
each is independently bisectable, so this commit on its own is a
no-op for the runtime.

What's added:

  * src/tools/pglime   -- Python wrapper that invokes `lime` for
    custom_target() rules in component meson.build files.  Handles
    --output-dir, --aot, --jit-compatible flags.

  * src/tools/lime_lint, lime_format, lime_format_check  --
    user-facing helper scripts driving `lime -L` (lint) and
    `lime -F` (format) over .lime files.

  * meson.build:
      - dependency('lime', required: false, version: '>=1.3.1')
        plus dependency('lime-compiler') for the in-process
        snapshot-build library shipped from v0.9.4 onwards (LTS line through v1.3.1).
      - find_program('lime', native: true) for the code-generation
        path.
      - lime_cmd kwargs dict consumed by per-component custom_target
        rules in subsequent commits.

  * meson_options.txt:
      - LIME option for explicit lime-binary path.
      - lime_aot feature (default 'auto') gating Lime's
        AOT-compiled action-table code path.

  * src/makefiles/meson.build  -- pgxs export so out-of-tree
    contrib modules can find the Lime tooling.

  * src/tools/pgindent/pgindent -- recognise Lime's static_library
    helpers as proper C functions (cosmetic).

References Lime upstream: https://codeberg.org/gregburd/lime
Lime version pinned at v1.3.1 (commit f1970c4).
This commit ports two small grammars in src/backend/replication
from flex+bison to Lime:

  Phase 2a -- syncrep_gram.y -> syncrep_gram.lime
              (synchronous_standby_names parser)
  Phase 2b -- repl_gram.y -> repl_gram.lime
              (walsender command parser)

and the matching scanners from .l to Lime .lex format
(Phase 5 of the migration).

Each component gets its own per-grammar yytype.h header
(repl_gram_yytype.h, syncrep_parse.h) declaring the YYSTYPE
union shared between the parser action bodies and the scanner.
The scanner driver code (syncrep_scanner.c, repl_scanner.c) is
hand-rolled C; it sits between the Lime-emitted lex DFA and
the Lime-emitted parser table-driven dispatch, translating
internal sentinel codes into parser tokens with appropriate
yylval shaping.

Both grammars are small (syncrep: ~120 lines; repl: ~450 lines)
and the SQL surface they parse is unchanged.  All recovery and
replication-protocol regression tests pass byte-identically with
the new Lime backends.

Per-component meson.build wraps the Lime-generated .c files in a
static_library with c_args ['-Wno-missing-prototypes',
'-Wno-unused-variable'] -- the Lime template emits both classes
of warning by design.  The relax stays local to the generated .c;
the hand-rolled drivers compile under PG's standard -Wall flags.
Port src/backend/bootstrap from flex+bison to Lime:

  bootparse.y -> bootparse.lime          (BKI parser)
  bootscanner.l -> bootscanner.lex       (BKI scanner)

The BKI (Backend Interface) language is the bootstrap mini-language
used by initdb to populate the system catalogs from
postgres.bki.  It's a small grammar (~200 productions) and
parses a fixed file generated at build time, so the test
surface is initdb itself.

The hand-rolled bootscanner.c is a thin driver shim around the
Lime-emitted lex DFA.  Keyword recognition still uses
ScanKeywordLookup against bootkw.h (unchanged).

initdb runs end-to-end against the new parser/scanner pair with
byte-identical catalog output.
Port src/test/isolation from flex+bison to Lime:

  specparse.y -> specparse.lime         (isolation spec parser)
  specscanner.l -> specscanner.lex      (isolation spec scanner)

The isolation tester reads its own DSL from
src/test/isolation/specs/*.spec describing concurrent
transactions to schedule.  This is a small, single-purpose
parser with no extension-API exposure.

The driver (specscanner.c) is hand-rolled.  Two %literal_buffer-
using patterns: QIDENT ("..." with "" -> ") and SQLBLK
({ ... } with leading/trailing whitespace stripped).  Single
accumulator scanstr reused across the two states.
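The two post-processing steps named above can be sketched in Python. This is an illustration of the string transformations only, with hypothetical helper names; the real driver is C and accumulates into scanstr.

```python
def unquote_qident(lexeme: str) -> str:
    """Strip the surrounding double quotes and collapse "" into "."""
    if len(lexeme) < 2 or lexeme[0] != '"' or lexeme[-1] != '"':
        raise ValueError(f"not a quoted identifier: {lexeme!r}")
    return lexeme[1:-1].replace('""', '"')


def strip_sqlblk(lexeme: str) -> str:
    """Drop the braces and any leading/trailing whitespace inside them."""
    if len(lexeme) < 2 or lexeme[0] != "{" or lexeme[-1] != "}":
        raise ValueError(f"not a brace-delimited block: {lexeme!r}")
    return lexeme[1:-1].strip()


print(unquote_qident('"a""b"'))
print(strip_sqlblk("{ SELECT 1; }"))
```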

Isolation tests run unchanged after the port.
Port jsonpath's parser and scanner from flex+bison to Lime:

  jsonpath_gram.y -> jsonpath_gram.lime
  jsonpath_scan.l -> jsonpath_scan.lex

The jsonpath grammar implements the SQL/JSON path-expression
language (jsonb_path_query and friends).  This is one of the
larger ports in the migration: jsonpath_scan.lex carries four
exclusive states (XQ for double-quoted strings, XNQ for
unquoted identifiers, XVQ for $variable references, XC for
comments) and combined <XNQ, XQ, XVQ> rules sharing escape-
sequence handling.

The driver (jsonpath_scan.c) post-processes the lexer's emitted
tokens via a small set of internal sentinel codes
(JP_TOK_STRING_TAKE / VARIABLE_TAKE / NUMERIC_TEXT / INT_TEXT /
RAW_CHAR / VARIABLE_BARE) before handing them to the Lime parser.

A small expected-output update lands for jsonpath.out and
sqljson_queryfuncs.out: Lime's syntax-error messages on a few
already-failing inputs differ from Bison's at the exact column,
which the test cases assert.  The changes are cosmetic; the
actual error condition (and SQL-visible behaviour) is byte-
identical to the bison parser.
The frontend SQL+slash-command lexer (psqlscan / psqlscanslash)
and the pgbench expression evaluator share a common
PsqlScanStateData and must be ported together.

This commit ports three coupled scanners and one parser:

  src/fe_utils/psqlscan.l           -> psqlscan.lex   (SQL lexer
                                                       core,
                                                       11 states)
  src/bin/psql/psqlscanslash.l      -> psqlscanslash.lex
                                                      (slash-cmd
                                                       parser,
                                                       8 states)
  src/bin/pgbench/exprparse.y       -> exprparse.lime
                                                      (pgbench
                                                       expression
                                                       parser)
  src/bin/pgbench/exprscan.l        -> exprscan.lex   (pgbench
                                                       expression
                                                       lexer)

Strategy D (per-call lex):  each psql_scan() call allocates a
fresh Foo_Lexer over the remaining buffer slice, runs
LexFeedBytes until the first stop point, frees.  Cursor advance
is tracked by the driver (StackElem.pos); variable-substitution
":varname" expansion uses the existing buffer_stack with
recursive psql_scan into pushed StackElem.

The pgbench expression lexer's INITIAL state (one-word-at-a-time
expr_lex_one_word) stays hand-rolled -- it's too small to pay for
porting and doesn't fit Lime's pre-scan-emit-callback shape.
The EXPR state (the bulk of pgbench's expression syntax) is
Lime-driven.

A new psqlscan_emit.h shared header carries the PsqlEmitCtx
and the variable-substitution helper prototypes shared by the
three drivers.

psql and pgbench tests pass unchanged (681/681 pgbench tests).
Port the GUC configuration-file scanner from flex to Lime:

  src/backend/utils/misc/guc-file.l  -> guc_file.lex

postgresql.conf is read by guc-file.c on every postmaster
startup and SIGHUP.  The original parser was scanner-only; no
bison grammar.  This commit ports the scanner to Lime's .lex
format.

The driver (guc-file.c) consumes a pre-scanned token FIFO;
ConfigFileLineno is bumped on EOL token pop.  Quoted-string
values match a regex span and are post-processed via the
existing DeescapeQuotedString -- the scanner doesn't need
%literal_buffer here.

Existing GUC config-file regression tests and SIGHUP-reload
tests run unchanged after the port.
The centerpiece of the migration: replace gram.y + scan.l with
gram.lime + scan.lex.  Also retire ecpg's bison input
(preproc.y) by feeding gram.lime through a reverse converter
back to bison-shaped grammar text that ecpg's parse.pl reads.

Backend SQL parser:

  src/backend/parser/gram.y    -> gram.lime  (~21k lines,
                                              mechanically
                                              translated)
  src/backend/parser/scan.l    -> scan.lex   (745 lines,
                                              11 exclusive
                                              states)
  src/backend/parser/scan.c    hand-rolled driver wrapping the
                                Lime-emitted lex DFA in the
                                public scanner.h API
                                (core_yylex, scanner_init/finish,
                                base_yylex's 2-token lookahead)

The gram.lime file is mechanically derived from the previous
gram.y by src/tools/lime_convert_gram.py: the converter rewrites
%type / %union / mid-rule actions / @n location refs / inline
char-literal terminals / %name-prefix etc. into Lime's idiom.
Action bodies port verbatim modulo $$/$N rewriting.  The
converter is preserved in-tree so future gram.y-style edits can
still be made (edit gram.y, run the converter, regenerate
gram.lime), though for this commit gram.y itself is dropped.

ecpg/preproc:

  src/interfaces/ecpg/preproc/pgc.l -> pgc.c (hand-rolled C) +
                                       pgc.lex (Lime .lex DFA;
                                       21 exclusive states,
                                       ECPG-specific includes)
  src/interfaces/ecpg/preproc/parser.c bridge code

ecpg historically generates its own preproc.y by running
parse.pl over backend gram.y.  parse.pl now reads gram.lime via
src/tools/lime_to_bison_gram.py, a reverse converter that
emits a bison-syntax skeleton.  This keeps ecpg buildable
without retargeting it to Lime in this commit (a future commit
can do that independently).

Public ABI of scanner.h preserved exactly:
  - core_yylex returns the same token codes as before
  - base_yylex's 2-token lookahead (NOT BETWEEN -> NOT_LA, etc.)
    runs unchanged
  - YYLLOC location offsets match bison's

The conversion is byte-identical at the parse-tree level.
Standard regress + ecpg suites pass.  Three small expected-
output adjustments land for graph_table.out, jsonpath.out, and
sqljson_queryfuncs.out where Lime's syntax-error messages
report the offending lookahead at a slightly different column
than Bison's.  These are cosmetic in user-visible behaviour;
all SQL semantics unchanged.
Port plpgsql's grammar from bison to Lime:

  src/pl/plpgsql/src/pl_gram.y  -> pl_gram.lime  (~4250 lines)

plpgsql is the largest single grammar after the backend SQL
grammar.  Its push-driven parser interacts with the surrounding
plpgsql_yylex routine (still hand-rolled C in pl_scanner.c).

Two non-trivial bridges between bison-pull and Lime-push
semantics:

  1. The bison parser has empty-rule lookahead via Parse_get_lookahead.
     Lime exposes the same via parse_token_offset; pl_scanner.c
     consults it where needed.

  2. Helper-function lex (peek-ahead in plpgsql_yy_drain_lookahead)
     and scanner-state mutation via driver-level K_DECLARE/K_BEGIN
     mirroring keep the existing pl_scanner.c surface intact.

Lime's per-rule reduce-callback signature replaces bison's
yylval-via-global pattern, but pl_gram_types.h declares the
shared YYSTYPE union exactly as before, so action bodies port
verbatim.

plpgsql regression suite passes byte-identically; the plpgsql
test module (PG's largest regression group after the standard
regress) shows no diffs.
Three contrib modules ship their own bison+flex grammars.
This commit retires them in favor of Lime parser + scanner pairs
sharing the same migration pattern as the in-tree grammars:

  contrib/cube:
    cubeparse.y -> cubeparse.lime
    cubescan.l  -> cubescan.lex

  contrib/seg:
    segparse.y -> segparse.lime
    segscan.l  -> segscan.lex

  contrib/pg_plan_advice:
    pgpa_parser.y -> pgpa_parser.lime
    pgpa_scanner.l -> pgpa_scanner.lex

Each module gets a hand-rolled driver C file translating Lime's
emit-callback sentinel codes into the existing token vocabulary
that the parser action bodies expect.

Three small expected-output adjustments land:

  - contrib/cube/expected/cube.out: the (A) lhs label was
    cosmetically dropped from box's four alternatives and the
    leading bare 'A' removed from one error message DETAIL line
    (Lime's emitter substitutes letter labels inside string
    literals; we worked around that by removing the unused label).
  - contrib/seg/expected/seg.out: similar cosmetic error-message
    DETAIL deltas.
  - contrib/pg_plan_advice/expected/syntax.out: error-position
    deltas where Lime reports 'at or near "("' or '")"' at
    a slightly different column than Bison.

All three modules' regression tests pass with these byte-level
adjustments; SQL semantics are unchanged.
The bison-to-Lime port replaces flex+bison with the Lime LALR(1)
parser generator from https://codeberg.org/gregburd/lime.  This
commit updates the installation chapter to reflect:

  * Lime >= 0.12.0 is required for builds from the git repo.
    Source tarballs continue to ship pre-generated parser/scanner
    .c/.h files (the same discipline PG already uses for bison/flex
    output), so end-user tarball builds do not need Lime.

  * bison and flex are still listed -- contrib modules and out-of-
    tree extensions can keep using .y/.l grammars via the existing
    pgxs interface.  After the migration only the in-tree
    grammars are converted; flex+bison remain optional dependencies
    for the wider ecosystem.

  * Suggested install paths (distro packages where available;
    source build from codeberg.org/gregburd/lime otherwise).

Also drops src/backend/utils/misc/.gitignore -- the file's only
purpose was to ignore the flex output generated from guc-file.l,
and that scanner is now Lime-driven via guc_file.lex.
Adds a runtime grammar-extension API enabling extensions to
register new tokens, productions, and reduce callbacks before
the first parse, then rebuilds the SQL parser to incorporate
them.  This is a foundation for runtime-extensible SQL dialects
(QUEL revival in contrib/quel as a demonstration; out-of-tree
DSLs like a DuckDB-compat or MongoDB-JSONB syntax via the same
API).

Public API (include/parser/parser_extension.h):

  PgGrammarExtension *pg_grammar_ext_create(name, version);
  void pg_grammar_ext_add_token(...);
  void pg_grammar_ext_add_rule(...);
  void pg_grammar_ext_set_precedence(...);
  bool pg_grammar_ext_register(ext, &err);

Calls are valid only from _PG_init() of a shared_preload_libraries-
loaded module, before raw_parser() runs for the first time.

Implementation (parser_extension.c):

  Track A subprocess pipeline: at first parse, walk the registered
  extensions, serialize them into a .lime fragment text alongside
  the base gram.lime, fork+exec lime + cc to produce a rebuilt
  parser .so, dlopen it, and dispatch base_yyparse through a
  function pointer (base_yyparse_fn) that points at the rebuilt
  symbol.  Cache the .so under $PGDATA/pg_parser_cache/<sha256>.so.
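
  A sketch of how a load-order-independent cache key could be
  derived, under stated assumptions: FNV-1a stands in for SHA-256 to
  keep the example short, fragments are sorted before hashing, and
  all demo_* names are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a 64-bit step: a short stand-in for SHA-256 (assumption). */
uint64_t
demo_fnv1a(uint64_t h, const char *s)
{
    for (; *s != '\0'; s++)
    {
        h ^= (unsigned char) *s;
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* qsort comparator over an array of C strings. */
static int
demo_cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/*
 * Hash the base grammar text followed by every extension fragment.
 * The fragment array is sorted in place first, so the key does not
 * depend on the order in which extensions were loaded.
 */
uint64_t
demo_cache_key(const char *base, const char **frags, size_t nfrags)
{
    uint64_t    h = UINT64_C(0xcbf29ce484222325);

    assert(base != NULL && frags != NULL);
    qsort(frags, nfrags, sizeof(frags[0]), demo_cmp_str);
    h = demo_fnv1a(h, base);
    for (size_t i = 0; i < nfrags; i++)
    {
        h = demo_fnv1a(h, "\n");    /* separator between fragments */
        h = demo_fnv1a(h, frags[i]);
    }
    return h;
}

/* Format "<datadir>/pg_parser_cache/<key>.so"; returns snprintf's result. */
int
demo_cache_path(char *buf, size_t buflen, const char *datadir, uint64_t key)
{
    return snprintf(buf, buflen, "%s/pg_parser_cache/%016llx.so",
                    datadir, (unsigned long long) key);
}
```

  A real key would presumably also cover the Lime and compiler
  versions, since either can change the generated .so; that is a
  guess, not something this message states.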

  Phase 1 scanner hook in scan.c: extension-registered keywords
  that don't appear in the compile-time ScanKeywords table are
  caught by pg_grammar_ext_keyword_hook after the base lookup
  misses.  Returns the rebuilt parser's token code; the rebuild
  step ensures the parser tables know about it.
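
  The lookup order described above can be sketched as follows.  The
  tables, token codes, and demo_* names are made up for illustration;
  only the rule "consult the base table first, fall back to the
  extension hook on a miss" comes from this message.  Input is
  assumed to be lowercased already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TOKEN_NONE (-1)

typedef struct DemoKeyword
{
    const char *name;
    int         token;
} DemoKeyword;

/* Stand-in for the compile-time keyword table. */
static const DemoKeyword demo_base_keywords[] = {
    {"delete", 1}, {"select", 2}, {"show", 3},
};

/* Stand-in for keywords contributed by extensions. */
static const DemoKeyword demo_ext_keywords[] = {
    {"retrieve", 100},
    {"show", 101},              /* shadowed: the base table wins */
};

/* Linear search; returns DEMO_TOKEN_NONE when the word is absent. */
static int
demo_find(const DemoKeyword *tab, size_t n, const char *word)
{
    assert(tab != NULL && word != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(tab[i].name, word) == 0)
            return tab[i].token;
    }
    return DEMO_TOKEN_NONE;
}

/* Base lookup first; the extension table is consulted only on a miss. */
int
demo_keyword_lookup(const char *word)
{
    size_t      nbase = sizeof(demo_base_keywords) / sizeof(demo_base_keywords[0]);
    size_t      next = sizeof(demo_ext_keywords) / sizeof(demo_ext_keywords[0]);
    int         tok = demo_find(demo_base_keywords, nbase, word);

    if (tok != DEMO_TOKEN_NONE)
        return tok;
    return demo_find(demo_ext_keywords, next, word);
}
```

  Because the fallback runs only after a base miss, the extension's
  "show" entry is unreachable, which is the shadowing behaviour this
  commit documents.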

Why [DO NOT MERGE]:

  * The API surface is intentionally small, but the runtime
    rebuild (fork + lime + cc + dlopen) is operationally heavy
    on a cold cache: the first parse after postmaster start
    with extensions loaded takes ~9s.  With a warm cache it
    takes ~11ms.  Overhead on production OLTP workloads that
    are not parser-bound is 0.5-2%; parser-bound benchmarks
    see 4-12%.

  * The keyword shadowing rules are non-obvious (extensions
    cannot override base SQL keywords; the hook fires only on
    base lookup miss).  Documented in parser_extension.h, but
    this constraint surprises authors who expect MySQL-compat
    or DuckDB-compat dialects to override SHOW or ATTACH.

  * Track B (in-process snapshot patching, no subprocess) is
    designed but not implemented.  Track A works in production
    today; Track B would cut the 9s cold cost to ~5ms but
    requires invasive parser.c surgery.

  * No -hackers consensus on whether runtime grammar extensions
    belong in core at all; this is RFC-quality work for review
    and discussion.

Tests: see the [DO NOT MERGE] commits below, which add
grammar_ext_compose, grammar_ext_overlap, dummy_grammar_ext,
lime_in_process_smoke, parser_microbench, and contrib/quel to
exercise this API.
Five test modules exercising the runtime grammar-extension API:

  * dummy_grammar_ext           -- minimal smoke test: 1 token,
                                    1 rule, 1 reduce callback.
                                    Verifies end-to-end registry
                                    -> rebuild -> dlopen -> parse
                                    pipeline.

  * grammar_ext_compose         -- 6 small extensions composed in
                                    8 different load-order
                                    permutations.  22 sub-tests
                                    covering token-name no-op
                                    vs collision, cross-extension
                                    references, precedence,
                                    cache-key determinism, and
                                    base-grammar invariance.

  * grammar_ext_overlap         -- 5 simulator extensions
                                    (DuckDB-compat, MySQL-compat,
                                    MongoDB-JSONB, pg_infer,
                                    QUEL-lite) loaded
                                    simultaneously.  42 sub-tests
                                    covering one-rebuild-for-all,
                                    13-keyword reachability,
                                    mixed SQL+extension-DSL
                                    sessions, order independence,
                                    subset-load fallthrough.

  * lime_in_process_smoke       -- exercises the in-process
                                    lime_compile_grammar_in_process
                                    path (Track B Phase 2 Step 1).

  * parser_microbench           -- direct raw_parser() timing
                                    benchmark: 1738 ns/parse for
                                    SELECT 1, 5207 ns for realistic
                                    OLTP, 5539 ns for DDL on a
                                    debug build.

These modules together demonstrate the API works under realistic
multi-extension composition.  They are NOT for upstream merge:
they belong in test/modules as research artifacts, not as part
of the core test surface.
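
A rough sketch of the kind of timing loop a ns/parse microbenchmark
might use.  This is not the parser_microbench source: demo_fake_parse
is a trivial stand-in for raw_parser() so the example builds without
PostgreSQL, and all demo_* names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*demo_parse_fn) (const char *query);

/* Keeps the compiler from optimizing the stand-in workload away. */
static volatile size_t demo_sink;

/* Stand-in for raw_parser(): just walks the query string. */
void
demo_fake_parse(const char *query)
{
    size_t      n = 0;

    while (query[n] != '\0')
        n++;
    demo_sink += n;
}

/* Average wall-clock nanoseconds per call of fn(query) over iters calls. */
double
demo_ns_per_call(demo_parse_fn fn, const char *query, uint64_t iters)
{
    struct timespec t0;
    struct timespec t1;
    double      elapsed_ns;

    assert(fn != NULL && query != NULL && iters > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        fn(query);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    elapsed_ns = (double) (t1.tv_sec - t0.tv_sec) * 1e9
        + (double) (t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (double) iters;
}
```

Reading the clock once around the whole loop, rather than once per
call, keeps timer overhead out of the per-parse figure.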
…nsion

Demonstrates the runtime grammar-extension API by reviving the
Berkeley QUEL query language from the original POSTGRES (1986)
as a contrib module.  All five Berkeley QUEL forms are
supported via the Lime extension API:

  RANGE OF e IS emp                    -- tuple-variable binding
  RETRIEVE (e.name, e.salary)          -- SELECT
    where e.dept = 'shoe'
  RETRIEVE (e.name) BY e.salary DESC   -- SELECT ... ORDER BY DESC
  REPLACE emp (salary = 50000)         -- UPDATE WHERE
    where dept='shoe'
  APPEND TO emp (name='alice', ...)    -- INSERT
  DELETE emp where salary < 1000       -- DELETE WHERE

Each form constructs a real PostgreSQL parse-tree node
(SelectStmt / UpdateStmt / InsertStmt / DeleteStmt) at parse
time, which then flows through parse_analyze + planner +
executor + EXPLAIN unchanged.  Nine SQL/QUEL equivalence
assertions in t/001_quel.pl check that each QUEL statement
produces the same results as the equivalent SQL.
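
The verb-to-node mapping listed above, written out as a small lookup
table.  This is an illustrative sketch with hypothetical demo_* names,
not code from contrib/quel.  RANGE is left out because this message
does not say which node, if any, it produces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct DemoQuelForm
{
    const char *quel_verb;      /* leading QUEL keyword */
    const char *sql_verb;       /* equivalent SQL statement */
    const char *node;           /* parse-tree node named in this message */
} DemoQuelForm;

static const DemoQuelForm demo_quel_forms[] = {
    {"RETRIEVE", "SELECT", "SelectStmt"},
    {"REPLACE", "UPDATE", "UpdateStmt"},
    {"APPEND", "INSERT", "InsertStmt"},
    {"DELETE", "DELETE", "DeleteStmt"},
};

/* Return the node name for a QUEL verb, or NULL if it is not listed. */
const char *
demo_quel_node(const char *verb)
{
    size_t      n = sizeof(demo_quel_forms) / sizeof(demo_quel_forms[0]);

    assert(verb != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(demo_quel_forms[i].quel_verb, verb) == 0)
            return demo_quel_forms[i].node;
    }
    return NULL;
}
```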

Keyword shadowing constraint: extension keywords can't
override base SQL keywords, so QUEL uses a q_-prefix for
words that conflict (q_range, q_of, q_is, q_to, q_by,
q_replace, q_delete, q_into).  Documented in the SGML
chapter (doc/src/sgml/quel.sgml) and in
parser_extension.h's pg_grammar_ext_keyword_hook block.

Why [DO NOT MERGE]:

  * QUEL itself has no production users.  This is a
    demonstration of the runtime extension API at a non-trivial
    scale (30 rules, 8 token types, 10 keyword tokens), not a
    proposal to add Berkeley QUEL to PostgreSQL core.

  * The SGML chapter is informative but exceeds what most contrib
    modules ship; it includes historical context, a syntax
    reference, 6 worked examples, and a Limitations section.

This commit ships QUEL as a research artifact alongside the
runtime extension API.  Anyone interested in writing a similar
DSL extension can read contrib/quel as a worked-out example.
@gburd gburd force-pushed the master branch 2 times, most recently from 4100e10 to 5a292fb Compare June 17, 2026 00:40