Make MAX_FILE_SIZE configurable via environment variable #1016

First off — thank you for building CodeGraph! It's a fantastic tool that makes code comprehension so much more accessible. I've been using it on a few projects and ran into a scenario where the 1 MB file size threshold felt a bit too restrictive, so I wanted to share a suggestion for a small, low-risk improvement.

## The Scenario

I work with projects that include some legitimate source files between 1–5 MB. These aren't minified bundles or vendored blobs; they contain real, meaningful function definitions that would be valuable in the graph.

Currently, these files are skipped during indexing: they produce only a `size_exceeded` warning and no nodes or edges, so the function graph for the project is incomplete. I completely understand the original design rationale: the 1 MB threshold protects against WASM heap blowup from generated or minified files, and that's a wise default.

## The Suggestion

What if the 1 MB default stayed exactly where it is (it's a great safe default!), but users could **override it via an environment variable** when they know their project has legitimate large source files?

```typescript
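const DEFAULT_MAX_FILE_SIZE = 1024 * 1024;  // 1 MB, unchanged default

// CODEGRAPH_MAX_FILE_SIZE is a byte count; anything else falls back to the default.
const MAX_FILE_SIZE = (() => {
  const envVal = process.env.CODEGRAPH_MAX_FILE_SIZE;
  if (envVal !== undefined) {
    // Number() rejects trailing text such as "5MB", which parseInt would read as 5
    const parsed = Number(envVal.trim());
    if (Number.isSafeInteger(parsed) && parsed > 0) {
      return parsed;
    }
  }
  return DEFAULT_MAX_FILE_SIZE;
})();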
```

This keeps the existing behavior 100% intact for everyone who doesn't set the variable, while giving users with large-but-legitimate source files an escape hatch.
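
To make the accepted values concrete, here is a minimal sketch that moves the parsing into a pure function so it can be unit-tested. `resolveMaxFileSize` and `DEFAULT_MAX_FILE_SIZE` are hypothetical names for illustration, not existing CodeGraph identifiers. `Number()` is used because it rejects trailing text such as `5MB`, which `parseInt` would read as 5 bytes.

```typescript
const DEFAULT_MAX_FILE_SIZE = 1024 * 1024; // 1 MB, the existing default

// Hypothetical helper: turns the raw env string into a byte limit.
function resolveMaxFileSize(raw: string | undefined): number {
  if (raw === undefined) return DEFAULT_MAX_FILE_SIZE;
  const parsed = Number(raw.trim());
  // Accept only positive whole numbers of bytes; everything else falls back.
  return Number.isSafeInteger(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_FILE_SIZE;
}

// In the real module this would presumably be called once at load time:
//   const MAX_FILE_SIZE = resolveMaxFileSize(process.env.CODEGRAPH_MAX_FILE_SIZE);

console.log(resolveMaxFileSize(undefined)); // 1048576 (variable not set)
console.log(resolveMaxFileSize("5242880")); // 5242880 (5 MB override)
console.log(resolveMaxFileSize("5MB"));     // 1048576 (unit suffix is rejected)
console.log(resolveMaxFileSize("-1"));      // 1048576 (non-positive is rejected)
```

Falling back to the default on bad input keeps a typo in the variable from silently changing which files get indexed.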

## Complementary Measure: Memory Safety for Large Files

If a user raises the threshold, files above 1 MB will be parsed, and these files consume significantly more WASM linear memory. Per the WebAssembly spec, linear memory can grow but never shrink, so without additional safeguards, parsing many large files in sequence could accumulate heap memory until the process runs out of memory.
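
The grow-only behaviour can be observed directly with the standard `WebAssembly.Memory` API. This is a standalone illustration, not CodeGraph code, and the page counts are arbitrary.

```typescript
// One WebAssembly page is 64 KiB.
const WASM_PAGE_SIZE = 64 * 1024;

const heap = new WebAssembly.Memory({ initial: 1, maximum: 4 });
console.log(heap.buffer.byteLength / WASM_PAGE_SIZE); // 1

// grow() adds pages and returns the previous size in pages.
const previousPages = heap.grow(2);
console.log(previousPages);                            // 1
console.log(heap.buffer.byteLength / WASM_PAGE_SIZE); // 3

// The API has no shrink(): the pages are only released once the Memory
// (and any instance using it) is dropped and garbage-collected.
```

This is why the reclamation has to happen by discarding the worker or parser instance rather than by freeing memory inside it.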

A complementary change would make this safe: **for files exceeding 1 MB, aggressively reclaim WASM memory around each parse**.

```typescript
const LARGE_FILE_THRESHOLD = 1024 * 1024;  // 1 MB — same as original MAX_FILE_SIZE
```
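
How the two limits would interact can be sketched as a three-way decision. `classifyBySize` and the `FileHandling` type are hypothetical names, and the `>` comparison against the maximum is an assumption about the existing check.

```typescript
const LARGE_FILE_THRESHOLD = 1024 * 1024; // 1 MB

type FileHandling = "skip" | "parse" | "parse-and-reclaim";

// Hypothetical helper: decides what happens to a file of the given size.
function classifyBySize(sizeBytes: number, maxFileSize: number): FileHandling {
  if (sizeBytes > maxFileSize) return "skip"; // existing size_exceeded path
  if (sizeBytes > LARGE_FILE_THRESHOLD) return "parse-and-reclaim";
  return "parse";
}

const ONE_MB = 1024 * 1024;
console.log(classifyBySize(2 * ONE_MB, ONE_MB));     // "skip" (default limit)
console.log(classifyBySize(2 * ONE_MB, 5 * ONE_MB)); // "parse-and-reclaim"
console.log(classifyBySize(20_000, 5 * ONE_MB));     // "parse"
```

With the default 1 MB limit the `"parse-and-reclaim"` branch is unreachable, so nothing changes for users who do not set the variable.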

### Bulk indexing path (worker thread)

Recycle the worker **before** parsing each large file, ensuring a fresh WASM heap:

```typescript
const isLargeFile = stats.size > LARGE_FILE_THRESHOLD;
if (isLargeFile) {
  recycleWorker();  // Destroy worker → fresh WASM heap
}
result = await requestParse(filePath, content);
```

### Single-file extraction path (in-process)

Reset the parser **after** extracting each large file:

```typescript
const result = extractFromSource(relativePath, content, language, frameworkNames);
if (stats.size > LARGE_FILE_THRESHOLD) {
  resetParser(language);  // Delete cached parser instance → reclaim WASM heap
}
```

Both `recycleWorker()` and `resetParser()` already exist in the codebase — no new functions needed, just calling them at the right time.
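
The call order in the bulk path can be checked with a small self-contained sketch. The `WorkerOps` interface, `SourceFileInfo`, and `indexAll` are stand-ins invented for this example; only the names `recycleWorker` and `LARGE_FILE_THRESHOLD` come from the proposal, and the real functions are replaced by stubs that record calls.

```typescript
const LARGE_FILE_THRESHOLD = 1024 * 1024; // 1 MB

interface SourceFileInfo {
  path: string;
  size: number; // bytes
}

// Stand-ins for the real worker API; only the call order matters here.
interface WorkerOps {
  recycleWorker(): void;
  parse(path: string): void;
}

function indexAll(files: SourceFileInfo[], ops: WorkerOps): void {
  for (const file of files) {
    if (file.size > LARGE_FILE_THRESHOLD) {
      ops.recycleWorker(); // fresh WASM heap before the large parse
    }
    ops.parse(file.path);
  }
}

const callLog: string[] = [];
indexAll(
  [
    { path: "small.ts", size: 20_000 },
    { path: "big.ts", size: 3 * 1024 * 1024 },
    { path: "tiny.ts", size: 500 },
  ],
  {
    recycleWorker: () => { callLog.push("recycle"); },
    parse: (path) => { callLog.push(`parse ${path}`); },
  },
);
console.log(callLog);
// [ 'parse small.ts', 'recycle', 'parse big.ts', 'parse tiny.ts' ]
```

Files at or below the threshold never trigger a recycle, so the cost is paid only for large files.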

### Required import

```typescript
import { ..., resetParser } from './grammars';
```

## Memory Safety Summary

| Concern                             | Mitigation                                                        |
| ----------------------------------- | ----------------------------------------------------------------- |
| WASM heap growth from large files   | `recycleWorker()` before / `resetParser()` after each >1 MB parse |
| Cascading OOM after WASM corruption | Worker crash → respawn (existing mechanism)                       |
| Parse timeout on large files        | Already scales: `PARSE_TIMEOUT_MS + content.length / 100KB × 10s` |
| Worker recycling cost               | Only for >1 MB files; normal files unaffected                     |
| User overrides too aggressively     | Tunable via env var; 1 MB default remains the safe baseline       |
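
As a rough worked example of the timeout row, the sketch below reads the formula literally as linear scaling. The base value is a placeholder (the real `PARSE_TIMEOUT_MS` is not stated here), and treating "100KB" as 100 × 1024 bytes is an assumption; `scaledParseTimeoutMs` is a hypothetical name.

```typescript
const TIMEOUT_STEP_BYTES = 100 * 1024; // "100KB" read as 100 KiB (assumption)
const TIMEOUT_STEP_MS = 10_000;        // "10s"

// Literal reading of: PARSE_TIMEOUT_MS + content.length / 100KB × 10s
function scaledParseTimeoutMs(baseTimeoutMs: number, contentLength: number): number {
  return baseTimeoutMs + (contentLength * TIMEOUT_STEP_MS) / TIMEOUT_STEP_BYTES;
}

const placeholderBaseMs = 30_000; // placeholder, not the real PARSE_TIMEOUT_MS
console.log(scaledParseTimeoutMs(placeholderBaseMs, 1024 * 1024));     // 132400
console.log(scaledParseTimeoutMs(placeholderBaseMs, 5 * 1024 * 1024)); // 542000
```

Under these assumptions a 5 MB file gets roughly 512 seconds on top of the base timeout, so the existing scaling already leaves headroom for the larger files.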

## Changes Summary

| File                                         | Change                                                                                            |
| -------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `src/extraction/index.ts`                    | Make `MAX_FILE_SIZE` configurable via `CODEGRAPH_MAX_FILE_SIZE` env var (default unchanged: 1 MB) |
| `src/extraction/index.ts`                    | Add `LARGE_FILE_THRESHOLD` (1 MB) for memory-reclamation gating                                   |
| `src/extraction/index.ts`                    | Import `resetParser` from `./grammars`                                                            |
| `src/extraction/index.ts` (bulk path)        | Call `recycleWorker()` before parsing files > `LARGE_FILE_THRESHOLD`                              |
| `src/extraction/index.ts` (single-file path) | Call `resetParser(language)` after extracting files > `LARGE_FILE_THRESHOLD`                      |
