Make MAX_FILE_SIZE configurable via environment variable #1016

@shadow55

First off — thank you for building CodeGraph! It's a fantastic tool that makes code comprehension so much more accessible. I've been using it on a few projects and ran into a scenario where the 1 MB file size threshold felt a bit too restrictive, so I wanted to share a suggestion for a small, low-risk improvement.

The Scenario

I work with projects that include some legitimate source files between 1–5 MB. These aren't minified bundles or vendored blobs; they contain real, meaningful function definitions that would be valuable in the graph.

Currently, these files are skipped during indexing with only a size_exceeded warning and produce no nodes or edges, so the function graph for the project is incomplete. I completely understand the original design rationale: the 1 MB threshold protects against WASM heap blowup from generated or minified files, and that's a wise default.

The Suggestion

What if the 1 MB default stayed exactly where it is (it's a great safe default!), but users could override it via an environment variable when they know their project has legitimate large source files?

const MAX_FILE_SIZE = (() => {
  const envVal = process.env.CODEGRAPH_MAX_FILE_SIZE;
  if (envVal !== undefined) {
    // Number() rejects trailing junk such as "5MB"; parseInt would read that as 5 bytes
    const parsed = Number(envVal);
    if (Number.isInteger(parsed) && parsed > 0) {
      return parsed;
    }
  }
  return 1024 * 1024;  // 1 MB — unchanged default
})();

This keeps the existing behavior 100% intact for everyone who doesn't set the variable, while giving users with large-but-legitimate source files an escape hatch.
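To make the accepted values concrete, here is a minimal sketch of the same validation written as a pure function. The name `resolveMaxFileSize` is hypothetical and not part of CodeGraph. It uses `Number()` rather than `parseInt`, which is an assumption on my part, so that a value like "5MB" falls back to the default instead of being read as 5 bytes.

```typescript
// Hypothetical helper for illustration; not an existing CodeGraph function.
const DEFAULT_MAX_FILE_SIZE = 1024 * 1024; // 1 MB, the current default

function resolveMaxFileSize(envVal: string | undefined): number {
  // Unset variable: keep the existing behavior
  if (envVal === undefined) {
    return DEFAULT_MAX_FILE_SIZE;
  }
  // Number() yields NaN for "5MB" and 0 for "" or whitespace,
  // so both are rejected by the check below
  const parsed = Number(envVal);
  if (Number.isInteger(parsed) && parsed > 0) {
    return parsed;
  }
  // Anything else (negative, fractional, non-numeric) falls back to 1 MB
  return DEFAULT_MAX_FILE_SIZE;
}
```

In the real module this would be called once with `process.env.CODEGRAPH_MAX_FILE_SIZE`. The value is a byte count, so a 5 MB limit would be set as `5242880`.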

Complementary Measure: Memory Safety for Large Files

If a user raises the threshold, files above 1 MB will be parsed, and these files consume significantly more WASM linear memory. Under the WebAssembly spec, linear memory can grow but never shrink, so without additional safeguards, parsing many large files in sequence could accumulate heap memory until the process runs out of memory.

A complementary change would make this safe: for files exceeding 1 MB, aggressively reclaim WASM memory around each parse.

const LARGE_FILE_THRESHOLD = 1024 * 1024;  // 1 MB — same as original MAX_FILE_SIZE

Bulk indexing path (worker thread)

Recycle the worker before parsing each large file, ensuring a fresh WASM heap:

const isLargeFile = stats.size > LARGE_FILE_THRESHOLD;
if (isLargeFile) {
  recycleWorker();  // Destroy worker → fresh WASM heap
}
result = await requestParse(filePath, content);

Single-file extraction path (in-process)

Reset the parser after extracting each large file:

const result = extractFromSource(relativePath, content, language, frameworkNames);
if (stats.size > LARGE_FILE_THRESHOLD) {
  resetParser(language);  // Delete cached parser instance → reclaim WASM heap
}

Both recycleWorker() and resetParser() already exist in the codebase — no new functions needed, just calling them at the right time.
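The ordering of the two paths can be sketched in isolation. The hooks below are hypothetical stand-ins that only record which call happened; the real recycleWorker(), resetParser(), requestParse() and extractFromSource() live in CodeGraph and their full signatures are not shown in this issue.

```typescript
const LARGE_FILE_THRESHOLD = 1024 * 1024; // 1 MB

// Stand-in hooks; the real functions take arguments not modeled here
interface Hooks {
  recycleWorker: () => void;
  resetParser: () => void;
  parse: () => void;
}

// Bulk path: recycle first so a large file starts on a fresh WASM heap
function runBulkPath(sizeBytes: number, hooks: Hooks): void {
  if (sizeBytes > LARGE_FILE_THRESHOLD) {
    hooks.recycleWorker();
  }
  hooks.parse();
}

// Single-file path: reset afterwards so the grown heap is released
function runSingleFilePath(sizeBytes: number, hooks: Hooks): void {
  hooks.parse();
  if (sizeBytes > LARGE_FILE_THRESHOLD) {
    hooks.resetParser();
  }
}

// Records the order in which the hooks fire for a given file size
function trace(
  run: (sizeBytes: number, hooks: Hooks) => void,
  sizeBytes: number
): string[] {
  const calls: string[] = [];
  run(sizeBytes, {
    recycleWorker: () => { calls.push("recycleWorker"); },
    resetParser: () => { calls.push("resetParser"); },
    parse: () => { calls.push("parse"); },
  });
  return calls;
}
```

Because the comparison is a strict `>`, a file of exactly 1 MB takes the normal path in both cases, so nothing changes for files the current limit already accepts.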

Required import

import { ..., resetParser } from './grammars';

Memory Safety Summary

| Concern | Mitigation |
| --- | --- |
| WASM heap growth from large files | recycleWorker() before / resetParser() after each >1 MB parse |
| Cascading OOM after WASM corruption | Worker crash → respawn (existing mechanism) |
| Parse timeout on large files | Already scales: PARSE_TIMEOUT_MS + content.length / 100KB × 10s |
| Worker recycling cost | Only for >1 MB files; normal files unaffected |
| User overrides too aggressively | Tunable via env var; 1 MB default remains the safe baseline |
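As a worked example of the timeout scaling listed above, here is one possible reading of the formula. The base value of 10 seconds, the interpretation of "100KB" as 100 × 1024 bytes, and the absence of rounding are all assumptions; the actual PARSE_TIMEOUT_MS value and implementation are not shown in this issue.

```typescript
// All constants here are assumptions for illustration only
const ASSUMED_PARSE_TIMEOUT_MS = 10000; // assumed base timeout
const UNIT_BYTES = 100 * 1024;          // "100KB" read as 100 KiB
const PER_UNIT_MS = 10000;              // 10 s added per 100 KB of content

function scaledTimeoutMs(
  contentLength: number,
  baseMs: number = ASSUMED_PARSE_TIMEOUT_MS
): number {
  // Multiply before dividing so whole-unit sizes stay exact
  return baseMs + (contentLength * PER_UNIT_MS) / UNIT_BYTES;
}
```

Under these assumptions a 5 MB file would be allowed 522 seconds (10 s base plus 512 s of scaling), which suggests the existing timeout already leaves generous headroom for files in the 1–5 MB range.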

Changes Summary

| File | Change |
| --- | --- |
| src/extraction/index.ts | Make MAX_FILE_SIZE configurable via CODEGRAPH_MAX_FILE_SIZE env var (default unchanged: 1 MB) |
| src/extraction/index.ts | Add LARGE_FILE_THRESHOLD = 1MB for memory-reclamation gating |
| src/extraction/index.ts | Import resetParser from ./grammars |
| src/extraction/index.ts (bulk path) | Call recycleWorker() before parsing files > LARGE_FILE_THRESHOLD |
| src/extraction/index.ts (single-file path) | Call resetParser(language) after extracting files > LARGE_FILE_THRESHOLD |
