feat(extraction): make MAX_FILE_SIZE configurable via CODEGRAPH_MAX_FILE_SIZE (#1016) #1030

Open
maxmilian wants to merge 1 commit into colbymchenry:main from maxmilian:fix/1016-configurable-max-file-size

@maxmilian

Closes #1016.

What

The 1 MB file-size skip threshold in src/extraction/index.ts was a hardcoded const, so repos with legitimately large hand-written sources (1–5 MB) had no way to get them indexed.

This adds a resolveMaxFileSize() helper that honours a CODEGRAPH_MAX_FILE_SIZE environment variable (in bytes), mirroring the existing resolve*() env idioms in src/mcp/daemon.ts:

export function resolveMaxFileSize(): number {
  const raw = process.env.CODEGRAPH_MAX_FILE_SIZE;
  if (raw === undefined || raw === '') return DEFAULT_MAX_FILE_SIZE;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_MAX_FILE_SIZE;
  return Math.floor(parsed);
}

Unset / non-numeric / non-positive values fall back to the 1 MB default, so existing behaviour is unchanged unless the var is explicitly set. Both existing size-check sites (single-file and bulk paths) are unchanged — they read the same module MAX_FILE_SIZE.
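
To make the fallback rules concrete, here is a minimal sketch of how a few raw values would resolve. It uses a hypothetical pure variant, `resolveFrom(raw)`, that takes the raw string as an argument instead of reading `process.env`, and it assumes the 1 MB default is `1024 * 1024` bytes (the constant's definition is not shown in this PR).

```typescript
// Assumption: the 1 MB default is 1024 * 1024 bytes; the PR does not show the constant.
const DEFAULT_MAX_FILE_SIZE = 1024 * 1024;

// Hypothetical pure variant of resolveMaxFileSize(): same logic as the helper above,
// but the raw value is passed in rather than read from process.env.
function resolveFrom(raw: string | undefined): number {
  if (raw === undefined || raw === '') return DEFAULT_MAX_FILE_SIZE;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_MAX_FILE_SIZE;
  return Math.floor(parsed);
}

// Each pair is [raw env value, resolved threshold in bytes].
const cases: Array<[string | undefined, number]> = [
  [undefined, DEFAULT_MAX_FILE_SIZE], // unset
  ['', DEFAULT_MAX_FILE_SIZE],        // empty string
  ['5242880', 5242880],               // 5 MiB override
  ['2048.9', 2048],                   // fractional values are floored
  ['1e6', 1000000],                   // Number() accepts exponent notation
  ['1MB', DEFAULT_MAX_FILE_SIZE],     // unit suffixes are not parsed (NaN)
  ['0', DEFAULT_MAX_FILE_SIZE],       // zero is rejected
  ['-1', DEFAULT_MAX_FILE_SIZE],      // negatives are rejected
  ['Infinity', DEFAULT_MAX_FILE_SIZE] // non-finite values are rejected
];

for (const [raw, expected] of cases) {
  const actual = resolveFrom(raw);
  console.log(JSON.stringify(raw), '->', actual, actual === expected ? 'ok' : 'MISMATCH');
}
```

One edge worth knowing when setting the variable: with the logic as shown, a positive value below 1 (for example `0.5`) passes the `parsed <= 0` check and is then floored to `0` rather than falling back to the default.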

Why this is the whole change

The issue also suggested a complementary WASM memory-safety measure (recycling the worker after large parses). That guard already exists: recycleWorker() is already invoked after large-file parses, so making the threshold configurable is safe on its own and needs no lifecycle changes.

Tests

  • New __tests__/resolve-max-file-size.test.ts — default fallback, valid override, fractional flooring, and invalid/zero/negative/non-numeric fallback (9 cases, all pass).
  • __tests__/extraction.test.ts — 377 pass, no regression.
  • tsc --noEmit — clean.

README updated with the override next to the existing 1 MB documentation.

…ILE_SIZE (colbymchenry#1016)

The 1 MB skip threshold was a hardcoded constant, so repos with legitimately
large hand-written sources (1-5 MB) had no way to index them. Add a
resolveMaxFileSize() helper that honours the CODEGRAPH_MAX_FILE_SIZE env var
(bytes), mirroring the existing resolve*() env idioms in mcp/daemon.ts, and
falling back to the 1 MB default for unset/non-numeric/non-positive values.
The existing recycleWorker() memory-safety after large parses already guards
the WASM heap, so no extra lifecycle work is needed.

Closes colbymchenry#1016
@maxmilian maxmilian marked this pull request as ready for review June 28, 2026 04:32

Development

Successfully merging this pull request may close these issues.

Make MAX_FILE_SIZE configurable via environment variable