📜 Change history & discussion (Agora / pg.ddx.io)
🧵 Related discussion
🔗 Related commits / prior art
🧭 Context for reviewers
Generated by pg-history via the Agora MCP server (pg.ddx.io).
🔍 OCR found 82 issue(s).
- 80 inline, 2 in summary
📄 src/backend/parser/scan.c
Correctness/security regression in $N parameter parsing. The param pattern is \${decdigit}+ (unbounded digits), but this copies only the first 31 bytes into buf[32]. For a token with >=32 digits, the trailing digits are silently dropped before pg_strtoint32_safe, so e.g. $000...0001 (with enough leading zeros) parses to a WRONG, smaller value instead of either the correct number or a parameter number too large error. The retired flex scanner ran pg_strtoint32_safe over the full yytext+1, which correctly rejected over-long inputs. The very next case (SCAN_TOK_ICONST_*) already does the right thing with palloc(len + 1). Replace the fixed buf[32] with a len-sized palloc'd copy so the whole digit run is parsed.
📄 src/fe_utils/Makefile
These compatibility-shim macros are unused and dangerous. Verified that all three ported scanners — psqlscan.c, psqlscanslash.c, and pgbench/exprscan.c — reference only the ST_-prefixed enum values (ST_INITIAL, ST_XB, ST_XQS, …); none use the bare identifiers. The comment's justification is false: exprscan.c resets via state->start_state = ST_INITIAL (lines 643/722/759), not bare INITIAL. The only occurrence of start_state = INITIAL in the whole frontend tree is inside this very comment.
Defining unscoped single/double-letter macros like xb, xc, xd, xe, xh, xq — and especially INITIAL — in a header transitively included (via psqlscan_emit.h) by psql, pgbench, and fe_utils translation units is namespace pollution that can silently rewrite unrelated local variables, struct members, or parameters, causing hard-to-diagnose miscompilation or build breakage. Since nothing references these names, delete the entire shim block (and the misleading comment above it) rather than keeping a speculative compatibility layer. (confidence: high)
- -g
- -std=c11
- -I.
- -I../../../../src/include
This relative include path appears incorrect. The .clangd file is at the repository root, and clangd resolves relative paths in CompileFlags.Add relative to the .clangd file's directory. Since src/include lives directly under the repo root (src/include/postgres.h), ../../../../src/include resolves to four directories above the repo root and won't be found. It should likely be -I./src/include (or -Isrc/include).
- - -I../../../../src/include
+ - -I./src/include
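The path arithmetic behind this comment can be checked with a short sketch. It assumes, as the comment states, that relative paths are resolved against the directory containing .clangd; the checkout location and the helper name `resolve_flag_path` are hypothetical and used only for illustration.

```python
import posixpath


def resolve_flag_path(config_dir: str, relative: str) -> str:
    """Join a relative include path onto the config directory and normalize it."""
    return posixpath.normpath(posixpath.join(config_dir, relative))


# Hypothetical checkout location; .clangd is assumed to sit in this directory.
repo_root = '/work/a/b/c/postgres'

# Four '..' components climb out of the repository entirely.
print(resolve_flag_path(repo_root, '../../../../src/include'))
# -> /work/src/include

# The suggested form stays inside the repository.
print(resolve_flag_path(repo_root, './src/include'))
# -> /work/a/b/c/postgres/src/include
```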
 * raised). */
extern int scan_lex_handle_unicode(void *user, int pos, char32_t c);
extern void scan_lex_handle_xeu_second(void *user, int pos, char32_t c);
extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);
Prototype mismatch (high confidence). This declaration is missing the int pos parameter. The definition in scan.c is void scan_lex_handle_xeescape(void *user, int pos, unsigned char c) (scan.c:379) and the call site in scan.lex passes three arguments: scan_lex_handle_xeescape(user, SCAN_LEX_OFFSET(matched), (unsigned char) matched[1]) (scan.lex:520). Since scan.c includes this header, the two-argument prototype conflicts with the three-argument definition and will fail to compile ("conflicting types"). Fix the prototype to match.
extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);
extern void scan_lex_handle_xeescape(void *user, int pos, unsigned char c);
if (stat(so_path, &st) == 0 && S_ISREG(st.st_mode))
{
    ereport(LOG,
            (errmsg("grammar extension cache hit: %s", so_path)));
    goto dlopen_step;
}
Cache-hit path is broken when only the .so survives. On a cache hit you only stat <hex>.so, then goto dlopen_step, which calls build_extension_keyword_map() -> AllocateFile(".h"). The persistent cache artifact is the .so; the .h/.c are byproducts that may have been cleaned up (or never present on a machine that copied only the .so). When <hex>.h is missing, build_extension_keyword_map returns false and the whole pipeline ereport(ERROR)s, defeating the cache entirely and making extensions fail whenever the header is absent. Either stat the .h alongside the .so before treating it as a hit, or persist the resolved (lexeme -> token_code) map rather than re-parsing the header on every load. (confidence: high)
while (waitpid(pid, &status, 0) < 0)
{
    if (errno != EINTR)
    {
        pfree(errbuf.data);
        *errmsg_out = psprintf("waitpid() for %s failed: %m", progname);
        return false;
    }
}
These blocking syscalls run in a backend with no interrupt handling. The read() loop only special-cases EINTR (continue), and waitpid() only retries on EINTR; neither calls CHECK_FOR_INTERRUPTS(). A wedged or slow lime/cc makes the backend hang uninterruptibly — a query cancel (SIGINT) or SIGTERM cannot break it because the loop just resumes the syscall. Add CHECK_FOR_INTERRUPTS() in the read loop and around the waitpid retry, and consider a timeout so a stuck subprocess does not pin a connection indefinitely. (confidence: high)
 * read at first-use; OpenPipeStream isn't suitable because
 * we want a tight read with a fixed buffer.
 */
pipe = popen("lime -v 2>/dev/null", "r");
lime is launched via execvp(), i.e. a PATH lookup with no absolute path, and resolve_cc() honors $CC, and resolve_lime_version() runs popen("lime -v ...") through /bin/sh. All three execute under the postmaster's privileges at parse time. A poisoned PATH (or attacker-controlled $CC) lets a local user substitute a malicious lime/cc that the server then runs and dlopens. Pin these to absolute, install-time-known paths (or validate them) instead of relying on PATH/$CC, and prefer the project's run_program/OpenPipeStream over shell-based popen. (confidence: moderate)
if not srcdir.is_dir():
    sys.exit(f'lime_format_check: srcdir not found: {srcdir}')


SKIP_PATTERNS = ('build', 'install', '.git', 'tmp_install')
The quote-aware scanner treats " and ' symmetrically as string delimiters, but in C these have different semantics: ' introduces a single character (char) literal, not an arbitrary-length string. This works for well-formed bodies, but the else-branch scan (lines below) stops at any ' or ", so a stray/unbalanced quote in an action — e.g. an apostrophe inside a comment, or a char literal containing a quote like '\'' whose escaped inner quote is mis-counted — can desynchronize the quote state. Once desynchronized, all subsequent $N/@N references are either silently skipped or rewritten inside what is actually code, corrupting the generated action with no diagnostic. Since this drives every generated parser action, consider hardening the scanner (track char-literals separately with proper escape rules, and assert balanced quoting) so any malformed input fails loudly rather than producing a subtly wrong .lime.
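A minimal sketch of the hardening suggested above, not the converter's actual code: it walks a C action body, copies comments and string/char literals through untouched (honouring backslash escapes, so '\'' is one literal), rewrites `$N`/`@N` only in code, and raises on an unterminated literal or comment instead of desynchronizing. The function name `rewrite_refs` and the `rename` callback are hypothetical.

```python
import re

_REF = re.compile(r'[$@]([0-9]+)')


def rewrite_refs(body: str, rename) -> str:
    """Rewrite $N/@N references in C code, skipping comments and literals."""
    out = []
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if body.startswith('/*', i):
            # Block comment: copy verbatim, apostrophes inside are harmless.
            end = body.find('*/', i + 2)
            if end < 0:
                raise ValueError(f'unterminated comment at offset {i}')
            out.append(body[i:end + 2])
            i = end + 2
        elif body.startswith('//', i):
            # Line comment: copy up to (not including) the newline.
            end = body.find('\n', i)
            end = n if end < 0 else end
            out.append(body[i:end])
            i = end
        elif ch in '"\'':
            # String or char literal: a backslash always consumes the next char.
            j = i + 1
            while j < n and body[j] != ch:
                if body[j] == '\n':
                    raise ValueError(f'newline inside literal at offset {i}')
                j += 2 if body[j] == '\\' else 1
            if j >= n:
                raise ValueError(f'unterminated literal at offset {i}')
            out.append(body[i:j + 1])
            i = j + 1
        else:
            m = _REF.match(body, i)
            if m:
                # ch is the sigil ('$' or '@'); the group is the index N.
                out.append(rename(ch, int(m.group(1))))
                i = m.end()
            else:
                out.append(ch)
                i += 1
    return ''.join(out)
```

Failing with `ValueError` is a deliberate choice here: a malformed action aborts the conversion rather than producing a subtly wrong .lime file.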
error_count_zero = ('0 error(s)' in out
                    or 'OK: no diagnostics' in out
                    or '✓ No errors or warnings' in out)
Fragile substring match causes a false negative: '0 error(s)' in out matches any error count ending in 0 (e.g. 10 error(s), 20 error(s), 100 error(s)), so a grammar with 10/20/... errors would be treated as clean and pass linting — masking real failures. Match the count anchored to the start of the number instead, e.g. parse with a regex like re.search(r'\b([0-9]+) error\(s\)', out) and compare the captured integer to 0.
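One way to anchor the count, sketched under the assumption that the lint output reports errors as `<N> error(s)` and otherwise uses the two success phrases from the snippet above. The helper name `lint_output_is_clean` is hypothetical.

```python
import re

_ERROR_COUNT = re.compile(r'\b([0-9]+) error\(s\)')


def lint_output_is_clean(out: str) -> bool:
    """Return True only if the lint output reports zero errors."""
    counts = [int(m.group(1)) for m in _ERROR_COUNT.finditer(out)]
    if counts:
        # Compare the parsed integers, so '10 error(s)' is never read as zero.
        return all(count == 0 for count in counts)
    # No numeric summary at all: fall back to the explicit success phrases.
    return 'OK: no diagnostics' in out or '✓ No errors or warnings' in out
```

With this shape, `'10 error(s)'` and `'100 error(s)'` are reported as failures, while `'0 error(s)'` still passes.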
if has_failure:
    failures += 1
    print(f'FAIL {rel}', file=sys.stderr)
Behavioral discrepancy with the header comment, which states "non-zero on the first lint failure." This loop continues through all files and exits non-zero only at the end (aggregate). Continuing is arguably the more useful behavior, but the documented contract should be updated to match (e.g. "reports all failures and exits non-zero if any file fails") to avoid misleading maintainers/tooling.
if args.aot:
    if not args.output_aot:
        sys.exit('--aot requires --aot-output')
The --aot-output validation is placed after Lime has already run with -j and after the .c/.h files have been moved. While meson always passes --aot and --aot-output together (so this won't trigger in practice), validating this required-combination right after parse_args() would fail fast and avoid leaving partial outputs. Consider moving the check before building/running the command.
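A fail-fast version might look like the sketch below. It assumes the wrapper uses argparse; only the `--aot` and `--aot-output` flags and the `output_aot` attribute come from the snippet above, and the function name `parse_cli` and the `prog` value are placeholders.

```python
import argparse


def parse_cli(argv):
    """Parse arguments and reject invalid flag combinations before any work runs."""
    parser = argparse.ArgumentParser(prog='pglime')
    parser.add_argument('--aot', action='store_true')
    parser.add_argument('--aot-output', dest='output_aot')
    args = parser.parse_args(argv)

    # Validate immediately after parsing, before Lime is invoked or files move.
    if args.aot and not args.output_aot:
        parser.error('--aot requires --aot-output')  # prints usage, exits with status 2
    return args
```

Because the check runs before any command is built, a bad invocation leaves no partial .c/.h outputs behind.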
# - treats Lime's `.out` report as a build artefact worth keeping
#   (mirrored from <outdir>/<basename>.out into <privatedir>)
This comment claims the wrapper mirrors Lime's .out report into privatedir, but the code never handles the .out file at all. Additionally, since --privatedir is already Lime's -d output dir, the report is written directly there, which makes the "mirrored ... into <privatedir>" wording self-contradictory. Please align the comment with the actual implementation (or implement the described mirroring) to avoid misleading future maintainers.
Keep master a pristine mirror of upstream plus our .github/ CI.
These workflows rebase the .github-only commits onto
postgres/postgres and push via SYNC_PAT (a PAT carrying the
'workflow' scope, required because the default GITHUB_TOKEN cannot
update files under .github/workflows/):
- sync-upstream.yml (hourly schedule + manual dispatch)
- sync-upstream-manual.yml (on-demand, with a force-push toggle)
Review every PR (including drafts) with two jobs that authenticate
to AWS Bedrock (Claude Opus 4.8) via GitHub OIDC
(vars.AWS_ROLE_ARN); no static AWS credentials are stored in the
repo.
- ocr-review: runs Alibaba Open Code Review through an ephemeral
  LiteLLM proxy bridging OCR's OpenAI protocol to Bedrock, and posts
  inline review comments. Uses output_config.effort=xhigh (Opus 4.8
  adaptive thinking). Path-scoped rules (.github/ocr/rule.json)
  encode PostgreSQL community review standards plus reviewer
  discipline (verify against the diff, don't hallucinate, state
  confidence, be blunt, accuracy over approval).
- pg-history: OCR cannot call MCP, so a separate Bedrock tool-use
  agent (.github/ocr/pg-history.py) queries the Agora MCP server
  (pg.ddx.io) to tie the change to git + pgsql-hackers history, and
  upserts a comment linking threads as
  https://pg.ddx.io/m/pgsql-hackers/<message-id>.
Pulls in the Nix flake (flake.nix, flake.lock, shell.nix) and a
small pg-aliases.sh helper that the contributing maintainer uses
locally to drive the build with a Lime version pinned to a specific
commit. Not for upstream submission -- developers who do not use
Nix can ignore these files.
Pinning Lime upstream by commit hash here gives a reproducible
build environment without forcing other developers to install or
upgrade Lime out-of-band. When upstream Lime cuts a new release the
maintainer bumps flake.lock alongside the meson.build version floor.
The full bison-to-Lime migration series is independent of these
files; non-Nix contributors who install Lime via their package
manager (or build it from source per installation.sgml) get the
same end result.
This commit lands the build-system support for the Lime LALR(1)
parser generator without changing any in-tree grammar. Subsequent
commits port flex+bison grammars to Lime one component at a time;
each is independently bisectable, so this commit on its own is a
no-op for the runtime.
What's added:
* src/tools/pglime -- Python wrapper that invokes `lime` for
custom_target() rules in component meson.build files. Handles
--output-dir, --aot, --jit-compatible flags.
* src/tools/lime_lint, lime_format, lime_format_check --
user-facing helper scripts driving `lime -L` (lint) and
`lime -F` (format) over .lime files.
* meson.build:
- dependency('lime', required: false, version: '>=1.3.1')
plus dependency('lime-compiler') for the in-process
snapshot-build library shipped from v0.9.4 onwards (LTS line through v1.3.1).
- find_program('lime', native: true) for the code-generation
path.
- lime_cmd kwargs dict consumed by per-component custom_target
rules in subsequent commits.
* meson_options.txt:
- LIME option for explicit lime-binary path.
- lime_aot feature (default 'auto') gating Lime's
AOT-compiled action-table code path.
* src/makefiles/meson.build -- pgxs export so out-of-tree
contrib modules can find the Lime tooling.
* src/tools/pgindent/pgindent -- recognise Lime's static_library
helpers as proper C functions (cosmetic).
References Lime upstream: https://codeberg.org/gregburd/lime
Lime version pinned at v1.3.1 (commit f1970c4).
This commit ports two small grammars in src/backend/replication
from flex+bison to Lime:
Phase 2a -- syncrep_gram.y -> syncrep_gram.lime
(synchronous_standby_names parser)
Phase 2b -- repl_gram.y -> repl_gram.lime
(walsender command parser)
and the matching scanners from .l to Lime .lex format
(Phase 5 of the migration).
Each component gets its own per-grammar yytype.h header
(repl_gram_yytype.h, syncrep_parse.h) declaring the YYSTYPE
union shared between the parser action bodies and the scanner.
The scanner driver code (syncrep_scanner.c, repl_scanner.c) is
hand-rolled C; it sits between the Lime-emitted lex DFA and
the Lime-emitted parser table-driven dispatch, translating
internal sentinel codes into parser tokens with appropriate
yylval shaping.
Both grammars are small (syncrep: ~120 lines; repl: ~450 lines)
and the SQL surface they parse is unchanged. All recovery and
replication-protocol regression tests pass byte-identically with
the new Lime backends.
Per-component meson.build wraps the Lime-generated .c files in a
static_library with c_args ['-Wno-missing-prototypes',
'-Wno-unused-variable'] -- the Lime template emits both classes
of warning by design. The relax stays local to the generated .c;
the hand-rolled drivers compile under PG's standard -Wall flags.
Port src/backend/bootstrap from flex+bison to Lime:
bootparse.y -> bootparse.lime (BKI parser)
bootscanner.l -> bootscanner.lex (BKI scanner)
The BKI (Backend Interface) language is the bootstrap mini-language
used by initdb to populate the system catalogs from postgres.bki.
It's a small grammar (~200 productions) and parses a fixed file
generated at build time, so the test surface is initdb itself.
The hand-rolled bootscanner.c is a thin driver shim around the
Lime-emitted lex DFA. Keyword recognition still uses
ScanKeywordLookup against bootkw.h (unchanged).
initdb runs end-to-end against the new parser/scanner pair with
byte-identical catalog output.
Port src/test/isolation from flex+bison to Lime:
specparse.y -> specparse.lime (isolation spec parser)
specscanner.l -> specscanner.lex (isolation spec scanner)
The isolation tester reads its own DSL from
src/test/isolation/specs/*.spec describing concurrent
transactions to schedule. This is a small, single-purpose
parser with no extension-API exposure.
The driver (specscanner.c) is hand-rolled. Two %literal_buffer-
using patterns: QIDENT ("..." with "" -> ") and SQLBLK
({ ... } with leading/trailing whitespace stripped). Single
accumulator scanstr reused across the two states.
Isolation tests run unchanged after the port.
Port jsonpath's parser and scanner from flex+bison to Lime:
jsonpath_gram.y -> jsonpath_gram.lime
jsonpath_scan.l -> jsonpath_scan.lex
The jsonpath grammar implements the SQL/JSON path-expression
language (jsonb_path_query and friends). This is one of the
larger ports in the migration: jsonpath_scan.lex carries four
exclusive states (XQ for double-quoted strings, XNQ for unquoted
identifiers, XVQ for $variable references, XC for comments) and
combined <XNQ, XQ, XVQ> rules sharing escape-sequence handling.
The driver (jsonpath_scan.c) post-processes the lexer's emitted
tokens via a small set of internal sentinel codes
(JP_TOK_STRING_TAKE / VARIABLE_TAKE / NUMERIC_TEXT / INT_TEXT /
RAW_CHAR / VARIABLE_BARE) before handing them to the Lime parser.
A small expected-output update lands for jsonpath.out and
sqljson_queryfuncs.out: Lime's syntax-error messages on a few
already-failing inputs differ from Bison's at the exact column,
which the test cases assert. The changes are cosmetic; the actual
error condition (and SQL-visible behaviour) is byte-identical to
the bison parser.
The frontend SQL+slash-command lexer (psqlscan / psqlscanslash)
and the pgbench expression evaluator share a common
PsqlScanStateData and must be ported together.
This commit ports three coupled scanners and one parser:
src/fe_utils/psqlscan.l -> psqlscan.lex (SQL lexer core, 11 states)
src/bin/psql/psqlscanslash.l -> psqlscanslash.lex (slash-cmd parser, 8 states)
src/bin/pgbench/exprparse.y -> exprparse.lime (pgbench expression parser)
src/bin/pgbench/exprscan.l -> exprscan.lex (pgbench expression lexer)
Strategy D (per-call lex): each psql_scan() call allocates a
fresh Foo_Lexer over the remaining buffer slice, runs
LexFeedBytes until the first stop point, frees. Cursor advance
is tracked by the driver (StackElem.pos); variable-substitution
":varname" expansion uses the existing buffer_stack with
recursive psql_scan into pushed StackElem.
The pgbench expression lexer's INITIAL state (one-word-at-a-time
expr_lex_one_word) stays hand-rolled -- it's too small to pay for
porting and doesn't fit Lime's pre-scan-emit-callback shape.
The EXPR state (the bulk of pgbench's expression syntax) is
Lime-driven.
A new psqlscan_emit.h shared header carries the PsqlEmitCtx
and the variable-substitution helper prototypes shared by the
three drivers.
psql and pgbench tests pass unchanged (681/681 pgbench tests).
Port the GUC configuration-file scanner from flex to Lime:
src/backend/utils/misc/guc-file.l -> guc_file.lex
postgresql.conf is read by guc-file.c on every postmaster startup
and SIGHUP. The original parser was scanner-only; no bison grammar.
This commit ports the scanner to Lime's .lex format.
The driver (guc-file.c) consumes a pre-scanned token FIFO;
ConfigFileLineno is bumped on EOL token pop. Quoted-string values
match a regex span and are post-processed via the existing
DeescapeQuotedString -- the scanner doesn't need %literal_buffer
here.
Existing GUC config-file regression tests and SIGHUP-reload tests
run unchanged after the port.
The centerpiece of the migration: replace gram.y + scan.l with
gram.lime + scan.lex. Also retire ecpg's bison input
(preproc.y) by feeding gram.lime through a reverse converter
back to bison-shaped grammar text that ecpg's parse.pl reads.
Backend SQL parser:
src/backend/parser/gram.y -> gram.lime (~21k lines, mechanically translated)
src/backend/parser/scan.l -> scan.lex (745 lines, 11 exclusive states)
src/backend/parser/scan.c: hand-rolled driver wrapping the
Lime-emitted lex DFA in the public scanner.h API (core_yylex,
scanner_init/finish, base_yylex's 2-token lookahead)
The gram.lime file is mechanically derived from the previous
gram.y by src/tools/lime_convert_gram.py: the converter rewrites
%type / %union / mid-rule actions / @n location refs / inline
char-literal terminals / %name-prefix etc. into Lime's idiom.
Action bodies port verbatim modulo $$/$N rewriting. The
converter is preserved in-tree so future gram.y-style edits can
still be made (edit gram.y, run the converter, regenerate
gram.lime), though for this commit gram.y itself is dropped.
ecpg/preproc:
src/interfaces/ecpg/preproc/pgc.l -> pgc.c (hand-rolled C) + pgc.lex
(Lime .lex DFA; 21 exclusive states, ECPG-specific includes)
src/interfaces/ecpg/preproc/parser.c: bridge code
ecpg historically generates its own preproc.y by running
parse.pl over backend gram.y. parse.pl now reads gram.lime via
src/tools/lime_to_bison_gram.py, a reverse converter that
emits a bison-syntax skeleton. This keeps ecpg buildable
without retargeting it to Lime in this commit (a future commit
can do that independently).
Public ABI of scanner.h preserved exactly:
- core_yylex returns the same token codes as before
- base_yylex's 2-token lookahead (NOT BETWEEN -> NOT_LA, etc.)
runs unchanged
- YYLLOC location offsets match bison's
The conversion is byte-identical at the parse-tree level.
Standard regress + ecpg suites pass. Three small expected-
output adjustments land for graph_table.out, jsonpath.out, and
sqljson_queryfuncs.out where Lime's syntax-error messages
report the offending lookahead at a slightly different column
than Bison's. These are cosmetic in user-visible behaviour;
all SQL semantics unchanged.
Port plpgsql's grammar from bison to Lime:
src/pl/plpgsql/src/pl_gram.y -> pl_gram.lime (~4250 lines)
plpgsql is the largest single grammar after the backend SQL
grammar. Its push-driven parser interacts with the surrounding
plpgsql_yylex routine (still hand-rolled C in pl_scanner.c).
Two non-trivial bridges between bison-pull and Lime-push
semantics:
1. The bison parser has empty-rule lookahead via Parse_get_lookahead.
Lime exposes the same via parse_token_offset; pl_scanner.c
consults it where needed.
2. Helper-function lex (peek-ahead in plpgsql_yy_drain_lookahead)
and scanner-state mutation via driver-level K_DECLARE/K_BEGIN
mirroring keep the existing pl_scanner.c surface intact.
Lime's per-rule reduce-callback signature replaces bison's
yylval-via-global pattern, but pl_gram_types.h declares the
shared YYSTYPE union exactly as before, so action bodies port
verbatim.
plpgsql regression suite passes byte-identically; the plpgsql
test module (PG's largest regression group after the standard
regress) shows no diffs.
Three contrib modules ship their own bison+flex grammars.
This commit retires them in favor of Lime parser + scanner pairs
sharing the same migration pattern as the in-tree grammars:
contrib/cube:
cubeparse.y -> cubeparse.lime
cubescan.l -> cubescan.lex
contrib/seg:
segparse.y -> segparse.lime
segscan.l -> segscan.lex
contrib/pg_plan_advice:
pgpa_parser.y -> pgpa_parser.lime
pgpa_scanner.l -> pgpa_scanner.lex
Each module gets a hand-rolled driver C file translating Lime's
emit-callback sentinel codes into the existing token vocabulary
that the parser action bodies expect.
Three small expected-output adjustments land:
- contrib/cube/expected/cube.out: the (A) lhs label was
cosmetically dropped from box's four alternatives and the
leading bare 'A' removed from one error message DETAIL line
(Lime's emitter substitutes letter labels inside string
literals; we worked around that by removing the unused label).
- contrib/seg/expected/seg.out: similar cosmetic error-message
DETAIL deltas.
- contrib/pg_plan_advice/expected/syntax.out: error-position
deltas where Lime reports 'at or near "("' or '")"' at
a slightly different column than Bison.
All three modules' regression tests pass with these byte-level
adjustments; SQL semantics are unchanged.
The bison-to-Lime port replaces flex+bison with the Lime LALR(1)
parser generator from https://codeberg.org/gregburd/lime. This
commit updates the installation chapter to reflect:
* Lime >= 0.12.0 is required for builds from the git repo. Source
  tarballs continue to ship pre-generated parser/scanner .c/.h
  files (the same discipline PG already uses for bison/flex
  output), so end-user tarball builds do not need Lime.
* bison and flex are still listed -- contrib modules and
  out-of-tree extensions can keep using .y/.l grammars via the
  existing pgxs interface. After the migration only the in-tree
  grammars are converted; flex+bison remain optional dependencies
  for the wider ecosystem.
* Suggested install paths (distro packages where available; source
  build from codeberg.org/gregburd/lime otherwise).
Also drops src/backend/utils/misc/.gitignore -- the file's only
purpose was to ignore the flex output from the (no-longer-flex)
guc-file.l, and that scanner is now Lime-driven via guc_file.lex.
Adds a runtime grammar-extension API enabling extensions to
register new tokens, productions, and reduce callbacks before
the first parse, then rebuilds the SQL parser to incorporate
them. This is a foundation for runtime-extensible SQL dialects
(QUEL revival in contrib/quel as a demonstration; out-of-tree
DSLs like a DuckDB-compat or MongoDB-JSONB syntax via the same
API).
Public API (include/parser/parser_extension.h):
PgGrammarExtension *pg_grammar_ext_create(name, version);
void pg_grammar_ext_add_token(...);
void pg_grammar_ext_add_rule(...);
void pg_grammar_ext_set_precedence(...);
bool pg_grammar_ext_register(ext, &err);
Calls are valid only from _PG_init() of a shared_preload_libraries-
loaded module, before raw_parser() runs for the first time.
Implementation (parser_extension.c):
Track A subprocess pipeline: at first parse, walk the registered
extensions, serialize them into a .lime fragment text alongside
the base gram.lime, fork+exec lime + cc to produce a rebuilt
parser .so, dlopen it, and dispatch base_yyparse through a
function pointer (base_yyparse_fn) that points at the rebuilt
symbol. Cache the .so under $PGDATA/pg_parser_cache/<sha256>.so.
Phase 1 scanner hook in scan.c: extension-registered keywords
that don't appear in the compile-time ScanKeywords table are
caught by pg_grammar_ext_keyword_hook after the base lookup
misses. Returns the rebuilt parser's token code; the rebuild
step ensures the parser tables know about it.
Why [DO NOT MERGE]:
* The API surface is intentionally small but the runtime
re-build (fork + lime + cc + dlopen) is operationally
heavy on cold cache: the first parse after postmaster
start with extensions loaded takes ~9s. Warm cache is
~11ms. Production OLTP overhead with no parsing-bound
workload is 0.5-2%; parser-bound benchmarks see 4-12%.
* The keyword shadowing rules are non-obvious (extensions
cannot override base SQL keywords; the hook fires only on
base lookup miss). Documented in parser_extension.h, but
this constraint surprises authors who expect MySQL-compat
or DuckDB-compat dialects to override SHOW or ATTACH.
* Track B (in-process snapshot patching, no subprocess) is
designed but not implemented. Track A works in production
today; Track B would cut the 9s cold cost to ~5ms but
requires invasive parser.c surgery.
* No -hackers consensus on whether runtime grammar extensions
belong in core at all; this is RFC-quality work for review
and discussion.
Tests: see [DO NOT MERGE] commits below for grammar_ext_compose,
grammar_ext_overlap, dummy_grammar_ext, lime_in_process_smoke,
parser_microbench and contrib/quel that exercise this API.
Five test modules exercising the runtime grammar-extension API:
* dummy_grammar_ext -- minimal smoke test: 1 token, 1 rule,
  1 reduce callback. Verifies end-to-end registry -> rebuild ->
  dlopen -> parse pipeline.
* grammar_ext_compose -- 6 small extensions composed in 8 different
  load-order permutations. 22 sub-tests covering token-name no-op
  vs collision, cross-extension references, precedence, cache-key
  determinism, and base-grammar invariance.
* grammar_ext_overlap -- 5 simulator extensions (DuckDB-compat,
  MySQL-compat, MongoDB-JSONB, pg_infer, QUEL-lite) loaded
  simultaneously. 42 sub-tests covering one-rebuild-for-all,
  13-keyword reachability, mixed SQL+extension-DSL sessions, order
  independence, subset-load fallthrough.
* lime_in_process_smoke -- exercises the in-process
  lime_compile_grammar_in_process path (Track B Phase 2 Step 1).
* parser_microbench -- direct raw_parser() timing benchmark:
  1738 ns/parse for SELECT 1, 5207 ns for realistic OLTP, 5539 ns
  for DDL on a debug build.
These modules together demonstrate the API works under realistic
multi-extension composition. They are NOT for upstream merge:
they belong in test/modules as research artifacts, not as part
of the core test surface.
Demonstrates the runtime grammar-extension API by reviving the
Berkeley QUEL query language from the original POSTGRES (1986)
as a contrib module. All five Berkeley QUEL forms are
supported via the Lime extension API:
RANGE OF e IS emp                    -- tuple-variable binding
RETRIEVE (e.name, e.salary)          -- SELECT
    where e.dept = 'shoe'
RETRIEVE (e.name) BY e.salary DESC   -- SELECT ... ORDER BY DESC
REPLACE emp (salary = 50000)         -- UPDATE WHERE
    where dept='shoe'
APPEND TO emp (name='alice', ...)    -- INSERT
DELETE emp where salary < 1000       -- DELETE WHERE
Each form constructs a real PostgreSQL parse-tree node
(SelectStmt / UpdateStmt / InsertStmt / DeleteStmt) at parse
time, flowing through parse_analyze + planner + executor +
EXPLAIN unchanged. 9 SQL/QUEL equivalence assertions in
t/001_quel.pl prove the parser produces identical results
to the equivalent SQL.
Keyword shadowing constraint: extension keywords can't
override base SQL keywords, so QUEL uses a q_-prefix for
words that conflict (q_range, q_of, q_is, q_to, q_by,
q_replace, q_delete, q_into). Documented in the SGML
chapter (doc/src/sgml/quel.sgml) and in
parser_extension.h's pg_grammar_ext_keyword_hook block.
Why [DO NOT MERGE]:
* QUEL itself has no production users. This is a
demonstration of the runtime extension API at a non-trivial
scale (30 rules, 8 token types, 10 keyword tokens), not a
proposal to add Berkeley QUEL to PostgreSQL core.
* The SGML chapter is informative but exceeds what most contrib
modules ship; it includes historical context, a syntax
reference, 6 worked examples, and a Limitations section.
This commit ships QUEL as a research artifact alongside the
runtime extension API. Anyone interested in writing a similar
DSL extension can read contrib/quel as a worked-out example.
While Flex/Bison have served us well, Lime (an evolution of SQLite's lemon parser generator) is faster than Flex/Bison, is maintained, and can enable runtime loading of additional grammars.