
native tgrep indexer OOM-kills the host on large monorepos (no upper bound / memory cap) #3976

Description

@reillysiemens

Describe the bug

When the copilot_cli_tgrep experiment is assigned, the built-in grep/search tool uses the native Rust tgrep trigram indexer instead of ripgrep. At session startup the CLI spawns a persistent daemon:

tgrep serve . --index-path ~/.cache/copilot/tgrep-index/<sha(cwd)> --exclude .git

over the entire working-directory tree. The startup orchestrator gates this on a lower file-count threshold only — there is no upper bound and no memory cap. The trigram build holds the whole index plus intermediate structures in RAM (the on-disk index is 1.9 GB; live anon RSS reaches ~45 GB, ≈24×). On a large monorepo this exhausts host memory and the Linux kernel OOM-killer reaps the process.

Observed in an internal monorepo (370,925 tracked text files) in WSL2 (46 GiB RAM):

  • tgrep was OOM-killed twice at anon-rss ~46–47 GB (total-vm ~60 GB).
  • A third build was caught live: its resident memory rose without interruption (0.3 GB → 4.7 GB over 20 min, every sample ≥ the previous) while it pegged ~95% of one CPU core — the same trajectory as the two kills.

Because the indexer auto-starts during session startup (the CLI runs tgrep count-files, sees the repo is over threshold, and spawns serve), merely opening or resuming a session in the repo root triggers it, independent of any prompt or tool call. When the build runs out of memory it can exhaust swap and take unrelated processes down with it, effectively wedging the WSL VM.

This is not the same as the existing Node.js "JavaScript heap out of memory" reports (e.g. #841, #1457, #1386, #2132). This is a native child process killed by the kernel OOM-killer, with no JS heap message.

Note: the feature was not opted into. USE_TGREP is unset; it is enabled via a server-side experiment assignment (copilot_cli_tgrep) found in ~/.cache/copilot/exp-cache.json.

Affected version

GitHub Copilot CLI 1.0.66-2

Steps to reproduce the behavior

  1. Be assigned the copilot_cli_tgrep experiment (or set USE_TGREP=true).
  2. cd into a git repository with ≥ 50,000 tracked text files (≥ 10,000 on Windows). Example: 370,925 files (verified: tgrep count-files . reported "370925 text files (12504 binary skipped, 0 errors)").
  3. Start or resume a session there: copilot --resume --add-dir <repo-root>.
  4. The CLI logs Starting tgrep serve (index: …, cwd: <repo-root>) and spawns the daemon. From then on its resident memory only ever increases — every sampled RSS is ≥ the previous, with no plateau or drop — while it keeps ~one CPU core busy: an active, unbounded in-memory build.
  5. On a host without ~50+ GB free, the kernel OOM-killer kills tgrep; the CLI keeps polling tgrep status (~1/sec) and can respawn into a fresh cold rebuild.
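The "RSS only ever increases" observation in step 4 can be checked from a second shell with a small sampler. This is a minimal sketch, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line (total resident memory: anonymous, file-backed and shared); the function names, sample count, and interval are illustrative and not part of the CLI or tgrep.

```python
import time


def read_rss_kib(pid):
    """Return VmRSS in KiB from /proc/<pid>/status, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # Line looks like "VmRSS:   4928512 kB"
                    return int(line.split()[1])
    except OSError:
        # Process is gone, or this is not a Linux procfs.
        return None
    return None


def is_monotonic_growth(samples):
    """True if no sample is lower than its predecessor and there is net growth."""
    return (
        len(samples) >= 2
        and all(b >= a for a, b in zip(samples, samples[1:]))
        and samples[-1] > samples[0]
    )


def sample_rss(pid, count=10, interval_s=60.0):
    """Collect up to `count` RSS samples, stopping early if the process exits."""
    samples = []
    for _ in range(count):
        rss = read_rss_kib(pid)
        if rss is None:
            break
        samples.append(rss)
        time.sleep(interval_s)
    return samples
```

Running `is_monotonic_growth(sample_rss(pid))` against the tgrep pid should return True for the behavior described above; a build that plateaus or frees memory would return False.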

Observed daemon command line:

tgrep serve . --index-path /home/<user>/.cache/copilot/tgrep-index/ba63a73da095b6da --exclude .git

Kernel OOM-killer evidence (two separate kills):

Jun 29 21:40:52 kernel: Out of memory: Killed process 2817274 (tgrep) total-vm:59968024kB, anon-rss:46125720kB, file-rss:1336kB, shmem-rss:0kB, UID:1000 pgtables:113820kB
Jun 30 00:46:09 kernel: Out of memory: Killed process 3026540 (tgrep) total-vm:60820144kB, anon-rss:47212960kB, file-rss:1436kB, shmem-rss:0kB, UID:1000 pgtables:115496kB
Jun 30 00:46:09 kernel: oom-kill:constraint=CONSTRAINT_NONE,...,task=tgrep,pid=3026540,uid=1000
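For anyone comparing their own kernel logs, the sizes in these lines can be pulled out with a short parser. This is a sketch that only handles the "Killed process" line format quoted above; the function name and returned keys are illustrative. The kernel's kB here are units of 1024 bytes, so the values are converted to GiB.

```python
import re

# Matches the "Out of memory: Killed process ..." line format quoted above.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 ** 2


def parse_oom_line(line):
    """Return pid, command name, and sizes in GiB, or None if the line does not match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "comm": m["comm"],
        "total_vm_gib": int(m["total_vm"]) / KIB_PER_GIB,
        "anon_rss_gib": int(m["anon_rss"]) / KIB_PER_GIB,
    }
```

Applied to the two kill lines above, the anonymous RSS comes out at roughly 44 GiB and 45 GiB, on a host with 46 GiB of RAM.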

Live third occurrence (caught mid-build, idle session, no active prompt):

tgrep pid=3114713 cpu=94.8% rss=4.70GB elapsed=20:19   # still climbing; on-disk index already 1.9 GB

Expected behavior

The indexer must never be able to exhaust host memory or OOM unrelated processes. Specifically:

  • Upper bound / ceiling. Add a maximum file-count (and/or total-bytes) above which the tool falls back to ripgrep instead of indexing. Today the only gate is the lower threshold (fileCount < lY → skip); there is no "too large to index safely" gate, so the biggest repos — exactly where the feature is meant to help — are where it fails hardest.
  • Memory budget for the build. Cap/stream the index build (chunked, bounded working set) so peak RSS is predictable and far below host RAM. tgrep currently exposes no --max-memory flag.
  • Graceful degradation + detection. If the indexer is killed, detect it (cf. #277, "copilot-cli doesn't sense that the child process that it has started is killed, doesn't tell this to the user") and fall back to ripgrep rather than silently respawning into another OOM loop.
  • Safer rollout. Auto-enabling an unbounded native indexer via experiment on large repos should not be able to take down a developer's machine.

A 50k-file repo and a 370k-file repo currently take the same unbounded code path; only the latter OOMs.
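To make the requested gate concrete, here is a rough sketch of a startup decision with both a floor and a ceiling. It is written in Python for brevity and is not the CLI's actual code. Only the lower thresholds (10,000 / 50,000) and the "skipped_below_threshold" outcome come from the bundle; the other outcome names, the 250,000-file ceiling, the per-file memory estimate (extrapolated from the single ~45 GB / 370,925-file data point in this report), and the headroom factor are all assumptions for illustration.

```python
def decide_search_backend(
    file_count,
    mem_available_bytes,
    *,
    is_windows=False,
    upper_file_limit=250_000,         # hypothetical ceiling, not an existing setting
    est_peak_bytes_per_file=125_000,  # assumption: ~45 GB / 370,925 files, rounded up
    headroom=0.5,                     # assumption: use at most half of available memory
):
    """Return an outcome string for the indexer startup decision."""
    lower = 10_000 if is_windows else 50_000  # thresholds reported in this issue
    if file_count < lower:
        return "skipped_below_threshold"
    if file_count > upper_file_limit:
        # Too large to index safely: stay on ripgrep.
        return "skipped_above_ceiling"
    est_peak = file_count * est_peak_bytes_per_file
    if est_peak > mem_available_bytes * headroom:
        # Estimated build peak would not fit in the memory budget.
        return "skipped_insufficient_memory"
    return "start_indexer"
```

With these assumed numbers, the 370,925-file repo on the 46 GiB host would return "skipped_above_ceiling" instead of spawning tgrep serve. A real fix would presumably calibrate the estimate from more than one repository, or bound the build's working set directly.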

Additional context

Root cause (from the bundled app.js, pkg 1.0.66-2):

  • Feature flag: TGREP = "copilot_cli_tgrep", described as "When true, enables tgrep indexed search for large repositories" (sdk/index.d.ts:7310). Default availability off; here enabled via experiment assignment.
  • Threshold: lY = process.platform === "win32" ? 1e4 : 5e4 (10,000 / 50,000 files).
  • Gate: after tgrep count-files, else if (u < lY) return { outcome: "skipped_below_threshold" } — then it immediately sets the root and spawns tgrep serve. No upper-bound check exists between the threshold test and the spawn.
  • Off-switch (verified): process.env.USE_TGREP === "false" → "tgrep disabled via USE_TGREP=false" → ripgrep fallback. (USE_BUILTIN_RIPGREP=false also forces fallback.)
  • On-disk index for 1JS: 1.9 GB; peak live anon RSS at kill: ~45 GB (~24×).

Workarounds:

  • export USE_TGREP=false (forces ripgrep; applies to new sessions).
  • Launch from a subdirectory under the 50k-file threshold.
  • Cap the WSL VM ([wsl2] memory=… in .wslconfig) to contain the blast radius.
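For the last workaround, a .wslconfig along these lines limits how much memory the WSL2 VM can claim from Windows. The values are placeholders, not a recommendation; the file lives at %UserProfile%\.wslconfig and takes effect after the VM is restarted with wsl --shutdown.

```ini
# %UserProfile%\.wslconfig
[wsl2]
# Placeholder values: pick limits that leave the Windows host usable.
memory=16GB
swap=4GB
```

This does not stop tgrep from being OOM-killed inside the VM; it only keeps the failure from starving the Windows host.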

Environment:

  • OS: Linux (WSL2), kernel 6.6.87.2-microsoft-standard-WSL2
  • CPU architecture: x86_64 (32 logical CPUs)
  • RAM: 46 GiB total, 12 GiB swap
  • Terminal: tmux (TERM=tmux-256color)
  • Shell: zsh (/usr/bin/zsh)
  • Repo under test: large monorepo, 370,925 tracked text files

Possibly related: #277 (CLI doesn't detect that a child process it started was killed) — appears as a secondary symptom of this failure (the kernel kills tgrep; the CLI keeps polling status).
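One way the detection half could work, sketched in Python rather than the CLI's own Node.js: on POSIX, a child killed by a signal is reported with a negative return code, and the OOM-killer delivers SIGKILL. The function name, outcome strings, and the "give up after one kill" policy are assumptions for illustration; a SIGKILL alone does not prove an OOM kill, so a real implementation might also check the kernel log.

```python
import signal


def next_action(returncode, prior_sigkills, max_sigkills=1):
    """Decide what to do with the indexer daemon based on its exit status.

    `returncode` follows subprocess.Popen.poll(): None while the child is
    running, 0 on clean exit, and -N if the child was killed by signal N.
    """
    if returncode is None:
        return "running"
    if returncode == 0:
        return "stopped"
    if returncode == -signal.SIGKILL:
        # Possibly the OOM-killer. Stop respawning once the limit is reached.
        if prior_sigkills + 1 >= max_sigkills:
            return "fallback_to_ripgrep"
        return "respawn"
    # Any other failure: do not loop, use the non-indexed path.
    return "fallback_to_ripgrep"
```

With the default `max_sigkills=1`, the first SIGKILL ends indexing for the session instead of starting another cold rebuild.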
