research: add autoconfig POC with QNN NPU catalog sweep by DingmaomaoBJTU · Pull Request #891 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-15T02:30:48Z

What this PR adds

research/autoconfig/ — an automated config search POC that sweeps opset versions (17–21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from QNN NPU catalog sweep (8 models, Snapdragon X Elite)

npu-001: opset21 gives +24–31% on DINOv2 family — NOT a general ViT property

Rigorously validated with fresh quantized.onnx builds, 3×500-iter sessions:

| Model | opset17 | opset21 | Gain |
|---|---|---|---|
| facebook/dinov2-small | 7.18 ms | 4.98 ms | +30.6% ✅ |
| facebook/dinov2-base | 34.56 ms | 26.23 ms | +24.1% ✅ |
| facebook/dino-vitb16 | 19.92 ms | 20.07 ms | -0.7% (NEUTRAL) ← critical control |
| microsoft/rad-dino | 274.98 ms | 275.36 ms | -0.1% (CPU-bound) |

Key discriminant: dino-vitb16 is the same ViT-B size as dinov2-base, but gets zero benefit from opset21. The speedup is specific to the DINOv2 architecture — mechanism TBD (DINOv2-specific op patterns in opset21 ONNX export, not the original kMaxSupportedOpset bypass mechanism which doesn't apply to ORT 1.24.x).
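As a sanity check on the table, the gain column can be recomputed from the two latency columns. This is a minimal sketch; `gain_pct` is a hypothetical helper, not a function from the PR, and it assumes gain is defined as the relative latency reduction against opset17. Recomputing from the rounded table values may differ from the reported figure in the last digit.

```python
def gain_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Relative latency reduction in percent; positive means the candidate is faster."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# (opset17 ms, opset21 ms) copied from the table above
rows = {
    "facebook/dinov2-small": (7.18, 4.98),
    "facebook/dinov2-base": (34.56, 26.23),
    "facebook/dino-vitb16": (19.92, 20.07),
    "microsoft/rad-dino": (274.98, 275.36),
}

for model, (opset17_ms, opset21_ms) in rows.items():
    print(f"{model}: {gain_pct(opset17_ms, opset21_ms):+.1f}%")
```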

npu-006: conv fusions cause catastrophic regression on Conv-dominant models only

| Model | No fusions | With fusions | Latency change with fusions |
|---|---|---|---|
| microsoft/resnet-18 | ~1–4 ms | ~132–135 ms | +4900% 🔥 |
| facebook/dinov2-base | 34.56 ms | 25.92 ms | -25% (FASTER) |
| facebook/dino-vitb16 | 19.92 ms | 20.12 ms | +1% (neutral) |

Hazard is proportional to Conv op density. Attention-dominant models are safe or slightly benefit.
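A rough way to quantify "Conv op density" is the share of Conv nodes among all graph nodes. The sketch below works on a plain list of op-type strings so it stays self-contained. The function names are hypothetical, and the 20% threshold is the figure quoted in this PR's later commit notes, not something derived here.

```python
from collections import Counter

CONV_PCT_THRESHOLD = 20.0  # percent; value taken from the PR's commit notes


def conv_pct(op_types: list[str]) -> float:
    """Percentage of graph nodes whose op type is exactly 'Conv'."""
    if not op_types:
        # Refuse to report 0% for an empty graph: that would read as "no risk".
        raise ValueError("empty op list; Conv density is unknown")
    return 100.0 * Counter(op_types)["Conv"] / len(op_types)


def conv_fusions_risky(op_types: list[str], threshold: float = CONV_PCT_THRESHOLD) -> bool:
    """True when conv fusions should be skipped under the npu-006 rule."""
    return conv_pct(op_types) > threshold


# With the onnx package installed, the list could be built as:
#   op_types = [node.op_type for node in onnx.load(path).graph.node]
print(conv_fusions_risky(["Conv", "Relu", "Conv", "Add"]))        # Conv-heavy toy graph
print(conv_fusions_risky(["MatMul"] * 9 + ["Conv"]))              # attention-like toy graph
```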

npu-007: DVFS thermal noise requires session-level averaging

On QNN NPU the coefficient of variation (CV) of latency is always in the 0.1–2.0+ range. Use three 500-iteration sessions with a 30 s cool-down, and trust only gains above 10%.
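The protocol can be sketched as follows, assuming each session yields one p50 latency and that "session-level averaging" means taking the mean of those per-session p50s. `trusted_gain` and its parameters are hypothetical names. The sample figures are the BGE-small per-session values quoted in a later commit note of this PR.

```python
import statistics


def trusted_gain(baseline_p50s: list[float], candidate_p50s: list[float],
                 min_gain_pct: float = 10.0) -> tuple[float, bool]:
    """Return (gain in percent, whether it clears the trust threshold)."""
    base = statistics.mean(baseline_p50s)
    cand = statistics.mean(candidate_p50s)
    gain = (base - cand) / base * 100.0
    return gain, gain > min_gain_pct


# Three sessions each for opset17 (h0) and opset21 (h3), in ms
gain, trusted = trusted_gain([10.52, 10.32, 11.01], [10.25, 9.33, 9.94])
print(f"gain={gain:.1f}% trusted={trusted}")
```

Under the >10% rule this candidate is not trusted, which matches the "marginal / inconclusive" label the PR gives it.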

Included files

Core scripts

autoconfig.py — main search loop (ConvNext CPU baseline)
catalog_qnn_sweep.py — 8-model QNN NPU catalog sweep
analyze_graph.py — ONNX graph analysis helper
validation_sweep.py — focused npu-001/npu-006 validation sweep (NEW)
gen_report_v3.py, autoconfig_diagram.html

Knowledge base (`ep_knowledge/`)

qnn_npu.json — 7 findings (npu-001 through npu-007), continuously updated with validation data
cpu.json, dml.json, qnn_gpu.json

Benchmark results (`catalog-qnn-sweep/`)

SUMMARY.md — original 8-model sweep results
VALIDATION_SUMMARY.md — 3-model validation sweep with full per-session data and cross-model comparison table
Per-model results.json and results_v2.json for dinov2-base, rad-dino, dino-vitb16

Design docs (`docs/`)

agent-design.md — winml-cli agent layer design (Diagnostic / Decision / Cross-Device / Regression / Recommendation agents)
skills-design.md — WinML CLI Skills Design (11 skills, competitive analysis, feature gaps)
ep-knowledge-review.md — statistical audit of ep_knowledge findings

Feature gaps identified

FusedConv detection in analyze_graph.py — needed to gate npu-006 rule automatically
DVFS-aware perf protocol — current winml perf doesn't expose session-level averaging
Budget-aware sweep — skip expensive hypotheses when time budget exhausted
Mechanism investigation for npu-001 — graph dump comparing Transpose counts at opset17 vs opset21

Status: Research POC — not production code. Scripts run standalone; not integrated into the winml CLI yet.

Adds research/autoconfig/ — an automated config search POC that sweeps opset versions (17-21), execution providers, and graph optimizations to find the best winml-cli build config for a given model on Windows hardware.

Key findings from 8-model QNN NPU catalog sweep:
- npu-001: opset 21 bypass gives +25-31% on Conv+residual models (MobileViT, DINOv2)
- npu-006: conv fusions (conv-bn/add/activation) cause 4900% regression on ResNet-18 QNN NPU
- npu-007: DVFS thermal noise requires session-level averaging (3x500 iters) for reliable results

Includes ep_knowledge/ KB with confirmed findings per EP, and catalog-qnn-sweep/ with per-model benchmark results and cross-model pattern analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/agent-design.md — strategic design for the agent layer of winml-cli, covering:
- winml-cli vs Olive distinction (UX + Windows-first + explainability)
- Why autoconfig search is a sub-tool, not the agent entry point
- 5 agent types: Diagnostic, Decision Guidance, Cross-Device Confidence, Regression Detection, Model Recommendation
- Autoconfig's role within the agent framework
- Key concerns and open questions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds research/autoconfig/docs/skills-design.md — full design doc for the winml-cli skills/agent layer, including:
- 11 skill designs (use-winml-cli, optimize-for-device, ep-compatibility-check, debug-accuracy-drop, and others)
- Competitive analysis (Apple coremltools, ExecuTorch, AI Hub, NVIDIA ModelOpt, OpenVINO, Olive)
- Top 5 feature gaps
- Validation confidence levels (L1-L5)
- Structured output requirements
- QNN NPU catalog sweep findings (npu-001/006/007)
- FusedConv unfuse feature request

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ping skills

- Split skill catalog into two ranked categories by the 'does it touch code?' discriminator: User (config-only) and Contributor (code changes)
- Merge overlapping skills (12 -> 9):
  - check-model-feasibility = find-a-model + ep-compatibility-check
  - ship-to-winapp = validate-before-ship + prepare-for-winapp
  - autoconfig absorbs optimize-for-device as its manual mode
- Add self-contained HTML render of the design doc for easier reading

xieofxie · 2026-06-16T01:14:34Z

+
+    {
+      "id": "cpu-005",
+      "title": "Baseline (no extra flags) is the optimal config for ConvNext CPU",


xieofxie · 2026-06-16T01:15:04Z

+
+    {
+      "id": "cpu-001",
+      "title": "opset 19+ causes severe regression on CPU EP (3-4x slowdown)",


Critical issues found and corrected:

npu-001 (opset 21 speedup):
- mechanism_confirmed changed TRUE → FALSE. The kMaxSupportedOpset bypass requires ORT < 1.18; the sweep used onnxruntime-windowsml 1.24.5 where kMaxSupportedOpset >= 22. The bypass mechanism does not apply. The speedup for DINOv2/MobileViT is empirically real but the WHY is now unknown.
- ResNet-18 removed from 'benefits' list — sub-ms model, 3-session ranges span 4x for the same config (pure DVFS noise). Reported +20.2% was noise.
- MobileViT magnitude corrected: h1 had DVFS spike inflating median to 11.72ms; actual gain is ~20-26% not 26.5%.
- DINOv2 finding kept: 3-session data shows non-overlapping distributions.
- Added per-session raw data analysis and required follow-up experiments.

npu-002 / npu-003 (W8A16 speedup, compile speedup):
- scope changed from 'General / all vision models' to 'ConvNext only' (both findings from 1 model; magnitude claims not transferable)
- confidence reduced from 'high' to 'medium'

npu-004 (W8A8 accuracy collapse):
- confidence changed from 'medium' to 'very_low / anecdote'
- Finding has NO recorded data (experiment 'aborted early, numbers not saved'). Cannot be treated as a KB finding until re-run with recorded numbers.

npu-005 (QNN Hub comparison):
- Added fairness caveat: comparing qairt-stack model on ORT QNN EP is not a valid comparison. Finding is trivially true (use right tool for right stack) but not informative.

npu-006 (conv fusions catastrophic):
- No confidence change — this is the most statistically solid finding.
- Added session-level evidence note: h4 CV=0.016 (extremely stable, unusual for QNN NPU), consistent with deterministic CPU fallback hypothesis.

search_space_rules:
- opset recommendation changed from 'Conv+residual' to 'Conv+attention hybrid' to reflect actual validated models (DINOv2 is attention-dominant, not Conv+residual in the traditional sense)

New file: docs/ep-knowledge-review.md
- Full statistical analysis of per-session data
- ORT version dependency explained
- Additional models needed for validation
- Minimum experiment protocol

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…eneral ViT

Run validation_sweep.py across 3 new models to rigorously test npu-001 (opset21 speedup) and npu-006 (conv fusion regression) hypotheses.

KEY FINDINGS:

npu-001 (opset21 speedup):
- facebook/dinov2-base: +24.1% (opset17 34.56ms -> opset21 26.23ms); 3-session full bench, fresh quantized.onnx builds, very stable
- microsoft/rad-dino: -0.1% NEUTRAL -- model runs on CPU (~275ms), QNN NPU cannot accelerate ViT-L; opset irrelevant when CPU-bound
- facebook/dino-vitb16: -0.7% NEUTRAL -- critical control proving the speedup is NOT a general ViT property; DINOv2-specific op patterns must explain the difference

Combined with original catalog data: dinov2-small +30.6%, dinov2-base +24.1% (both confirmed), dino-vitb16 NEUTRAL (confirmed control) -> scope is DINOv2 family.

npu-006 (conv fusions):
- dinov2-base: fusions -25% (faster) -- attention-dominant, benign
- dino-vitb16: fusions +1% (neutral) -- no meaningful Conv ops to fuse

Combined with original resnet-18 +4900% -> hazard is conv-density-gated.

Script fixes in validation_sweep.py:
- bench_screen parsed d.get('p50_ms') instead of d['latency_ms']['p50']
- Reuse check accepted any .onnx (including truncated export.onnx)
- Model selection preferred optimized.onnx over quantized.onnx

Updated files:
- ep_knowledge/qnn_npu.json: npu-001 scope narrowed to DINOv2-family, validated_models expanded with dino-vitb16 (negative control) and dinov2-base (positive), rad-dino (CPU-bound); npu-006 scope updated
- catalog-qnn-sweep/VALIDATION_SUMMARY.md: full cross-model results table
- catalog-qnn-sweep/{dinov2-base,rad-dino,dino-vitb16}/results_v2.json
- catalog-qnn-sweep/.gitignore: exclude val_h*/ build artifact dirs

…nism invalidated, confidence calibrated

Merge structural improvements from local review into KB (smart merge, preserving validation sweep data from 2026-06-16):

npu-001:
- Add mechanism_invalidation field (explicit statement of INVALIDATION with cause: ORT 1.24.5 kMaxSupportedOpset>=22, bypass does not apply)
- Add critical_caveats array (4 caveats incl. DINOv2-specific scope note)
- Downgrade confidence to 'medium-high on empirical / low on mechanism' (was 'high' which was overclaiming given unknown mechanism)

npu-002/003:
- Add follow_up_required fields (FP32 baselines on MobileViT/DINOv2/ResNet)

npu-004:
- Update action_for_autoconfig: 'Do NOT use to skip W8A8 without running eval first' (was 'Treat as potentially risky' which was still prescriptive without data)

search_space_rules:
- Rename recommended_order_conv_attention_hybrid -> recommended_order_conv_residual to match local review terminology

NOTE: Validation sweep data (dinov2-base +24.1%, dino-vitb16 NEUTRAL, rad-dino CPU-bound) from 2026-06-16 is preserved — not overwritten.

…d NOT Transpose elimination

Task 3 investigation: loaded dinov2-small opset17 (h0) and opset21 (h3) optimized.onnx and quantized.onnx from catalog_qnn_sweep builds; counted op types with onnx.load().

Key finding: Transpose count is IDENTICAL (49 nodes) in both opsets.
- opset17 optimized: 391 total, 49 Transpose, 121 Reshape
- opset21 optimized: 439 total, 49 Transpose, 169 Reshape (+48)
- opset17 quantized: 1398 total, 49 Transpose, 615 DQ, 392 Q
- opset21 quantized: 1542 total, 49 Transpose, 663 DQ, 440 Q (+48 QDQ pairs)

Rules out: NHWC Transpose-elimination as speedup cause, fewer-ops as explanation.
Consistent with: QNN EP scheduling/partitioning difference triggered by +48 Reshape nodes.

Also: kMaxSupportedOpset confirmed >= 23 in ORT 1.24.4 (C:\\tmp env), reaffirming that the original bypass mechanism does NOT apply.

Updated npu-001 critical_caveats, follow_up_required, and added transpose_analysis_2026_06_16 section with raw op counts.
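The op-count comparison described above can be sketched as a per-op-type difference between two counts. `op_count_delta` is a hypothetical helper. The counts are the Transpose and Reshape figures for the optimized graphs quoted in the commit message; other op types are omitted because the message does not list them.

```python
from collections import Counter


def op_count_delta(before: Counter, after: Counter) -> dict[str, int]:
    """Op types whose node count changed, mapped to (after - before)."""
    delta = {}
    for op in sorted(set(before) | set(after)):
        diff = after.get(op, 0) - before.get(op, 0)
        if diff != 0:
            delta[op] = diff
    return delta


# With the onnx package, a Counter could be built from a model file as:
#   Counter(node.op_type for node in onnx.load(path).graph.node)
opset17 = Counter({"Transpose": 49, "Reshape": 121})
opset21 = Counter({"Transpose": 49, "Reshape": 169})

print(op_count_delta(opset17, opset21))  # {'Reshape': 48}
```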

…DINOv2-specific

New benchmark results (2026-06-17, QNN NPU Snapdragon X Elite, 3x500-iter W8A16):

BAAI/bge-small-en-v1.5 (BERT/sentence-similarity):
- h0=10.617ms [10.52, 10.32, 11.01]
- h3=9.840ms [10.25, 9.33, 9.94]
- opset21 gain +7.3% -- MARGINAL / INCONCLUSIVE (CV=0.3, ranges barely non-overlapping)
- Unusual vs all other NLP models (distilbert -0.1%, MiniLM -0.7%, roberta +0.1%)
- Needs 5+ sessions to differentiate from DVFS noise.

rizvandwiki/gender-classification (plain ViT):
- h0=14.326ms [14.15, 14.94, 13.89]
- h3=13.830ms [13.70, 13.92, 13.87]
- opset21 gain +3.5% -- NEUTRAL (ranges overlap 13.89/13.92ms, CV=0.35)

CRITICAL FINDING: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count profiles or general ViT architecture.

Combined with Transpose analysis (Task 3): opset17 and opset21 DINOv2-small have identical Transpose node counts (49). The speedup mechanism is NOT Transpose elimination. The effect is specific to DINOv2 family at a level below op-count visibility -- possibly quantization behavior, tensor layout, or QNN EP partitioning.

Also updated: models_tested list (+5 entries), validated_models sections, scope and confidence statements, task completion notes in follow_up_required.
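The "ranges overlap" judgement used for these two models can be reproduced with a one-line check on the per-session p50 values. The helper name mirrors `ranges_non_overlapping` mentioned elsewhere in this PR, but the body here is an assumption: the candidate counts as non-overlapping only if its slowest session is still faster than the baseline's fastest session.

```python
def ranges_non_overlapping(baseline_p50s: list[float], candidate_p50s: list[float]) -> bool:
    """True when every candidate session beats every baseline session."""
    return max(candidate_p50s) < min(baseline_p50s)


# Per-session p50 values (ms) quoted in the commit message above
bge_h0, bge_h3 = [10.52, 10.32, 11.01], [10.25, 9.33, 9.94]
vit_h0, vit_h3 = [14.15, 14.94, 13.89], [13.70, 13.92, 13.87]

print(ranges_non_overlapping(bge_h0, bge_h3))  # True: 10.25 < 10.32, but only barely
print(ranges_non_overlapping(vit_h0, vit_h3))  # False: 13.92 is not below 13.89
```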

…ndings, fix mechanism claims

cpu.json:
- cpu-001: mechanism_confirmed true->false. Data is real (opset 17 best) but the kMaxSupportedOpset gate hypothesis doesn't explain the non-monotonic pattern (opset22=85ms partial recovery while 19/20/21 all ~150-170ms). Two separate kMaxSupportedOpset constants exist (NHWC gate vs Transpose Optimizer gate); the CPU one is unverified. Added note on this distinction.
- cpu-006: mechanism_confirmed true->false (derived from cpu-001). Meta-rule (EP isolation) remains valid. Added note that NPU/CPU experiments used different models (DINOv2 vs ConvNext) -- comparison is directional only.

dml.json:
- dml-001: INVALIDATED as 'DML is faster'. DML p50=16.9ms vs QNN GPU p50=17.7ms: diff = 0.8ms = 0.82 sigma of GPU measurement -- distributions OVERLAP. Retained: DML IS more stable (std 0.52 vs 0.97), that difference is real.
- dml-002: HEADLINE CORRECTED. p50 with NHWC is marginally BETTER (16.5 vs 16.9ms), not worse. The actual finding is NHWC increases tail latency (p90 +19%) and variance (std 3.6x worse). Action unchanged (avoid NHWC) but for stability reasons, not p50.

qnn_gpu.json:
- gpu-003: Downgraded from medium to low confidence. Single experiment, 34% gap is above noise level but needs replication before citing as 'NEVER use compile'.
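The dml-001 arithmetic can be checked directly: the p50 difference is expressed in units of the noisier measurement's standard deviation. `delta_in_sigma` is a hypothetical helper, and the sketch assumes the 0.97 figure is the QNN GPU standard deviation in milliseconds, as the commit text implies.

```python
def delta_in_sigma(a_ms: float, b_ms: float, std_ms: float) -> float:
    """Absolute latency difference expressed in standard deviations."""
    return abs(a_ms - b_ms) / std_ms


# DML p50, QNN GPU p50, QNN GPU std (all in ms, from the commit message)
sigma = delta_in_sigma(16.9, 17.7, 0.97)
print(f"{sigma:.2f} sigma")
```

A difference below one sigma is well inside the measurement spread, which is why the "DML is faster" claim was withdrawn.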

Key corrections:
- Bench protocol: QNN NPU CV 0.10-1.2 is normal (DVFS); never reject on CV. Protocol is 3x500-iter always, not gated on CV.
- Phase 4 conv fusions: add npu-006 hard gate — FusedConv not supported by QNN EP -> CPU fallback -> +4900% regression on Conv-dense models. Rule: skip all conv-*-fusion if Conv% of total ops > 20%.
- Diagnosis table: add npu-006 catastrophic regression row.
- Gate 2 lesson: DINOv2 opset21 +24-31% is real but mechanism UNKNOWN. Two hypotheses ruled out: kMaxSupportedOpset bypass (ORT>=23), Transpose elimination (count identical opset17/21). +48 Reshape nodes only diff found. ViT models with identical op counts see no benefit -- effect below topology.
- DML vs QNN GPU: correct 'consistently faster' claim -- 0.8ms diff = 0.82sigma, distributions overlap. Real finding: DML is more stable (std 0.52 vs 0.97).
- EP table: update QNN NPU to 'architecture-dependent', add conv-fusion caveat; DML note corrected; CPU note: mechanism uncertain (two kMaxSupportedOpset).
- Actionable findings: replace 'mechanism CONFIRMED' with full invalidation log.

… loop

Phase 0 — new analyze step sets 3 EP-specific flags before any experiment:
- conv_fusions_blocked: QNN NPU + Conv% > 20% -> skip all conv-*-fusion
- nhwc_blocked: QNN GPU / DML -> skip nhwc-transformer (dml-002)
- opset_sweep_blocked: CPU EP -> never sweep opset (cpu-001, fixed at 17)
- bench_protocol: 'npu' if QNN NPU -> always 3-session, no CV gate

Phase 1 skip_set — 3 new hard blocks wired from Phase 0 flags:
- conv fusions blocked when npu-006 risk detected
- nhwc-transformer blocked for GPU/DML EPs
- opset sweep blocked for CPU EP
- Conv bottleneck queue respects conv_fusions_blocked flag

Phase 2 loop:
- Hypothesis rule 2a: start with W8A16 (not W8A8); W8A8 is high-risk for LN/GELU
- W8A8 early exit: if top-1 <= 15% on first W8A8 attempt -> skip all W8A8 variants
- PERF step: full EP-aware bench protocol with 3-session NPU path, CV gate for CPU/GPU, s0 JIT exclusion rule, and non-overlapping range requirement for KEEP
- Post-convergence: mandatory compile for QNN NPU (+1.7x validated), explicit compile-skip guard for GPU/DML (compile regresses on Adreno X1-85).
- Hypothesis generation: opset sweep is now EP-qualified — CPU always blocked, GPU/DML not validated (skip), QNN NPU full sweep 17-21 with scope note.
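The Phase 0 flags described above might be derived as in the sketch below. The EP identifier strings (`"qnn_npu"`, `"qnn_gpu"`, `"dml"`, `"cpu"`) and the `"default"` protocol name are assumptions made for illustration; only the flag names and the rules come from the commit message.

```python
def phase0_flags(ep: str, conv_pct: float) -> dict:
    """Derive EP-specific search-space flags before any experiment runs."""
    return {
        # npu-006: conv fusions are skipped on Conv-dense models for QNN NPU
        "conv_fusions_blocked": ep == "qnn_npu" and conv_pct > 20.0,
        # dml-002: nhwc-transformer is skipped on QNN GPU and DML
        "nhwc_blocked": ep in {"qnn_gpu", "dml"},
        # cpu-001: the CPU EP never sweeps opset
        "opset_sweep_blocked": ep == "cpu",
        # QNN NPU always uses the 3-session protocol without a CV gate
        "bench_protocol": "npu" if ep == "qnn_npu" else "default",
    }


print(phase0_flags("qnn_npu", conv_pct=35.0))
print(phase0_flags("cpu", conv_pct=35.0))
```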

…p script

catalog_qnn_sweep.py:
- Add NPU006_CONV_PCT_THRESHOLD constant (20%) -- npu-006 guard
- Add _count_conv_pct(): after h0 builds, count Conv ops via onnx library to assess whether h4/h5 conv fusions are safe or will catastrophically regress
- In hypothesis loop: after h0 succeeds, analyze model.onnx Conv%. If Conv% > 20%: print [npu-006] WARNING before running h4/h5. Annotate h4/h5 bench result with npu006_expected_regression=True/False.
- results dict: add conv_pct, npu006_risk, npu006_regression, npu001_ranges_non_overlapping fields
- _compute_summary: improve npu001_generalized with range-overlap check (max(h3_p50s) < min(h1_p50s)) alongside median test. DVFS-noisy NPU results where ranges overlap are reported as 'median_only' (marginal), not True -- prevents false positives like BGE-small (+7.3%, overlapping).
- _compute_summary: add npu-006 catastrophic regression detector (h4/h5 median >= 5x baseline = CPU fallback confirmed)
- write_summary: SUMMARY.md now includes Conv% column, npu-006 regression column, and range-overlap note in npu001 column. Bench protocol header updated to note DVFS expectation.
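The catastrophic-regression detector mentioned above reduces to a ratio test on medians. This is a minimal sketch; `npu006_regression` is a hypothetical function name, and the test values in the usage lines are illustrative numbers in the ranges the PR reports for resnet-18 and dinov2-base.

```python
import statistics


def npu006_regression(baseline_p50s: list[float], fused_p50s: list[float],
                      factor: float = 5.0) -> bool:
    """True when the fused config's median latency is >= factor x the baseline median."""
    return statistics.median(fused_p50s) >= factor * statistics.median(baseline_p50s)


# Illustrative session values (ms), not raw data from the PR
print(npu006_regression([1.0, 2.0, 4.0], [132.0, 133.0, 135.0]))  # Conv-dense case
print(npu006_regression([34.56], [25.92]))                        # attention-dominant case
```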

Bugs fixed (from code-review + rubber-duck analysis):

1. [CRITICAL] autoconfig.py hypothesis optim keys were kebab-case ('conv-bn-fusion') but build_config() in pipes/graph.py looks up cap.python_name (snake_case). All h1-h5 were silently benchmarking the baseline config. Fix: rename all optim keys to snake_case ('conv_bn_fusion', 'gelu_fusion', etc.)
2. [HIGH] autoconfig.py hypothesis accumulation: h2-h5 used {**cfg['optim'], ...} but each hypothesis starts from a fresh BASELINE copy where optim={}. Refactored to explicit isolated mode — each hypothesis is independent. Labels updated to remove misleading '+' prefix. Behavior now matches intent.
3. [HIGH] autoconfig.py baseline_p50 only set when i==0 AND bench passes. If iter 0 was KB-skipped, baseline_p50 stayed None forever and the perf gate never fired. Fix: set baseline_p50 on the first successful Phase B bench regardless of iteration index.
4. [HIGH] catalog_qnn_sweep.py MODEL_TIMEOUT_S=20*60 (20 min) caused all hypotheses after h0 to time out. A single hypothesis takes ~30 min minimum. Fix: raise to 180 min (3 hours for 6 hypotheses).
5. [MEDIUM] catalog_qnn_sweep.py _count_conv_pct() used a catch-all except that masked ImportError. When onnx is missing, conv_pct returns 0.0 which evaluates as 'no risk' — silently disabling the npu-006 guard. Fix: split ImportError (loud warning + treat as UNKNOWN/HIGH risk) from other exceptions (parse errors, silent fallback).

Additional fixes:
- validation_sweep.py npu-007 bug: bench_screen failure gated Phase B for QNN NPU. For QNN NPU, only non-NPU EPs should gate Phase B on screen fail.
- autoconfig.py: replace 'Likely DVFS noise' CV message with EP-aware text
- autoconfig.py: median_p50 local variable shadowed imported function — renamed to med_p50 to prevent confusion
- autoconfig.py: remove duplicate code section left by earlier refactor
- bench_utils.py: new shared module with run_cmd, bench_screen, bench_full, ScreenResult, count_conv_pct, ranges_non_overlapping, median_p50, etc. bench_full now accepts warmup/iters/cool_down_s overrides for CPU protocol
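Bug 1 is a silent-mismatch failure: an unknown key is ignored instead of rejected, so the baseline gets benchmarked. The PR's fix was to rename the keys. The sketch below shows a complementary guard, not the PR's code: normalize keys and fail loudly on anything unrecognized. `KNOWN_OPTIMS` and `normalize_optim_keys` are hypothetical names, and the set contents are just the two key names quoted in the commit message.

```python
# Hypothetical set of snake_case capability names the builder understands
KNOWN_OPTIMS = {"conv_bn_fusion", "gelu_fusion"}


def normalize_optim_keys(optim: dict) -> dict:
    """Convert kebab-case keys to snake_case and reject unknown names."""
    normalized = {}
    for key, value in optim.items():
        name = key.replace("-", "_")
        if name not in KNOWN_OPTIMS:
            # Failing here prevents a silent fall-through to the baseline config.
            raise KeyError(f"unknown optimization key: {key!r}")
        normalized[name] = value
    return normalized


print(normalize_optim_keys({"conv-bn-fusion": True}))  # {'conv_bn_fusion': True}
```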

…ume (AgenticGPUOptimizer V2)

Three improvements borrowed from AgenticGPUOptimizer V2 patterns:

1. ThroughputOnly verdict policy (bench_utils.py)
   - improvement must exceed max(1% floor, 2x screen-CV)
   - noise-level deltas (delta < stat_bar * CV) are DISCARD, not KEEP
   - marks marginal KEEPs (1x < delta < 1.5x threshold) as MARGINAL_KEEP
2. Screen phase early exit (autoconfig.py)
   - if screen improvement < 1%, skip 3x full-bench entirely
   - saves ~25-90 min per rejected hypothesis on first run
   - applied only when baseline_p50 is known (not first iter)
3. Crash-resume via SessionManager (bench_utils.py)
   - session.json written atomically after each experiment
   - on restart, completed iters are loaded and skipped
   - state includes baseline_p50, best_p50/label, consecutive_discards

Also extracts _run_phase_b() helper to reduce main() nesting depth.
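One possible reading of the ThroughputOnly verdict policy is sketched below. It assumes delta is the fractional latency improvement against the baseline, the threshold is max(1%, 2 x screen CV), and MARGINAL_KEEP covers deltas between 1x and 1.5x that threshold. The function name and signature are hypothetical, not the actual bench_utils.py API.

```python
def throughput_verdict(baseline_p50: float, candidate_p50: float,
                       screen_cv: float, floor: float = 0.01) -> str:
    """Classify a hypothesis as DISCARD, MARGINAL_KEEP, or KEEP."""
    delta = (baseline_p50 - candidate_p50) / baseline_p50  # fractional improvement
    threshold = max(floor, 2.0 * screen_cv)
    if delta < threshold:
        return "DISCARD"          # within noise, or a regression
    if delta < 1.5 * threshold:
        return "MARGINAL_KEEP"    # clears the bar, but not by much
    return "KEEP"


print(throughput_verdict(100.0, 99.5, screen_cv=0.0))   # DISCARD
print(throughput_verdict(100.0, 98.8, screen_cv=0.0))   # MARGINAL_KEEP
print(throughput_verdict(100.0, 90.0, screen_cv=0.02))  # KEEP
```

Tying the threshold to the screen CV means a noisy device raises its own bar automatically, so the fixed 1% floor only matters on quiet hardware.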

…summary.html

autoconfig_diagram.html (v3):
- Phase 2 Optimizer: screen early exit box (skip full bench when screen delta < 1%)
- Phase 2 Reviewer: ThroughputOnly verdict policy with KEEP/MARGINAL_KEEP/DISCARD/EARLY pills
- Phase 2: crash-resume session.json box (new teal row)
- Phase 0: session.json load on startup (crash-resume)
- Phase 1 skip_set: updated with empirical KB rules (npu-006 Conv% gate, cpu-002, gpu-004, etc.)
- Side panel: session.json added alongside results.tsv and ep_knowledge/
- Footnote: v3 change summary + pending features with issue references

agent-design.md:
- New Section 2.1: improved loop V3 (what it does well) vs remaining agent gaps
- Section 2.2: corrected framing (original was wrong; V3 fixes the computation layer; agent gaps are explanation/architecture-awareness/cross-device/KB self-update)
- Date updated to 2026-06-17

docs/ep-findings-summary.html (new):
- 17 findings across QNN NPU / CPU / DML / QNN GPU, only confirmed/valid
- Color-coded by EP, confidence badges (HIGH/MEDIUM/LOW)
- Per-finding: observation data, scope, autoconfig action
- 7 feature requests table with issue IDs (#155, #158, #443, #867, #868) + 2 not-yet-filed gaps (FusedConv detect, DML analyze rules)

…dings-summary.html 11 findings (npu-002/003/004, cpu-001/002/005, dml-001/002/003, gpu-001/002/003/005) are hidden by default because they derive from only 1 model (convnext-tiny-224). 6 multi-model / universal findings remain visible: npu-001 (14 models), npu-006 (4 models), npu-007 (8 models), cpu-006 (meta EP-isolation rule), dml-004 (all DML models), gpu-004 (QNN SDK limitation). A toggle button lets readers expand hidden findings on demand. sm-divider rows summarize how many are hidden per EP section.

…s-summary.html The .finding { grid rule lost its selector, breaking the 4-column layout for every finding row. Restored selector and fixed grid-template-columns to explicit 28px 70px 1fr 220px (was auto, caused action column collapse).

… condensed footnote

…ness contract

Set lifecycle SVG to full-width rendering with fixed panel height so the diagram occupies the whole card area and avoids right-side whitespace.

… reliability gates

A clean from-scratch 11-hypothesis rerun of apple/mobilevit-small (2026-06-22, fresh winml config+build, 3x500-iter) did NOT reproduce the historical +26.5%/+42.1% opset21 "win". True baseline median is 5.51ms (vs the inflated ~12ms in the original data); opset21 (h3) = 5.355ms = +2.81% with fully overlapping session ranges. The earlier speedup was a DVFS/thermal baseline artifact.

Conclusion updates (ep_knowledge/qnn_npu.json):
- npu-001: move MobileViT from benefits_from_opset21 to no_benefit_neutral
- update title/observation/scope/confidence/action_for_autoconfig and gates to drop the MobileViT-benefit claim; record the clean rerun data

Safeguards to stop recording DVFS artifacts as findings:
- catalog_qnn_sweep.py: add an effect-size gate (gain% >= 2x session-to-session CV AND non-overlapping session p50 ranges) emitting best_gain_verdict (RELIABLE / NEUTRAL_WITHIN_NOISE / UNRELIABLE_RANGES_OVERLAP); surface it in SUMMARY.md as a "Reliable?" column
- ep_knowledge/README.md: add a promotion checklist (paired same-window deltas, clean-baseline gate, effect-size > noise floor, independent reruns + L1-L5 confidence, baseline-drift invalidation) before a finding may prune search

Regenerates apple/mobilevit-small results.json, report.html, and SUMMARY.md.
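The effect-size gate could look roughly like the sketch below. Several details are assumptions: gain and CV are both treated as fractions, the session-to-session CV is computed from the baseline sessions only, and the noise check runs before the range check. Only the two conditions and the three verdict names come from the commit message.

```python
import statistics


def best_gain_verdict(baseline_p50s: list[float], candidate_p50s: list[float]) -> str:
    """Apply the two-part gate: gain >= 2 x CV, then non-overlapping session ranges."""
    base_mean = statistics.mean(baseline_p50s)
    gain = (base_mean - statistics.mean(candidate_p50s)) / base_mean
    session_cv = statistics.stdev(baseline_p50s) / base_mean  # needs >= 2 sessions
    if gain < 2.0 * session_cv:
        return "NEUTRAL_WITHIN_NOISE"
    if not max(candidate_p50s) < min(baseline_p50s):
        return "UNRELIABLE_RANGES_OVERLAP"
    return "RELIABLE"


# rizvandwiki/gender-classification sessions (ms) from an earlier commit note in this PR
print(best_gain_verdict([14.15, 14.94, 13.89], [13.70, 13.92, 13.87]))
```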

…mary.html Reflect the clean 2026-06-22 rerun in the npu-001 card: MobileViT-small opset21 is now NEUTRAL (baseline 5.51ms -> opset21 5.355ms = +2.81%, ranges overlap; matmul_transpose 6.218ms = slower). The original +28.6%/+42.1% was a DVFS/thermal baseline artifact. Updates title, scope, autoconfig action (effect-size gate), and the Last updated banner.

…sampling, promote_findings)

Aligns the autoconfig POC code with docs/self-evolution-design.html section 4. Implements the components still marked TODO; champion-config, arch pruning (build_insight) and the feature-gaps log were already done.

bench_utils.py (Fix #1/#2/#5):
- run_perf_session: atomic single-session perf primitive
- paired_ab_bench / adaptive_paired_ab_bench: interleaved baseline-vs-hypothesis A/B so DVFS drift cancels in the within-pair ratio; adaptive variant samples until the 95% CI is decisive (KEEP/DISCARD band) or MAX_PAIRS, returns verdict + CI
- thermal_classify: COOL/WARM/HOT_RUN from a cold reference latency
- session_cv: shared between-session noise-floor helper

promote_findings.py (Fix #4, NEW): reads catalog-*-sweep/*/results.json, applies the L1->L4 confidence ladder (L2 = effect-size gate, L3 = >=2 models/arch, L4 = >=3 arch classes), writes ep_knowledge/_auto_promoted.json as a draft sink that never clobbers curated KB. Tolerant of both QNN (full.p50s_ms) and GPU/CPU (full_p50s_ms) schemas.

catalog_qnn_sweep.py (Fix #1 wiring): opt-in --paired-ab flag (default off) that runs adaptive paired A/B per hypothesis against the baseline ONNX and records verdict + CI; sequential Phase B remains the default path.

README: document the self-evolution tooling and refresh the directory layout.
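The idea behind the paired A/B verdict can be sketched as a confidence interval on the within-pair latency ratio. This is not the PR's `paired_ab_bench`. It assumes a normal approximation (z = 1.96) on the mean log-ratio and a hypothetical 1% minimum gain, and all names are illustrative. An adaptive variant would call it after each new pair until the result is no longer `UNDECIDED`.

```python
import math
import statistics


def paired_ab_verdict(pairs: list[tuple[float, float]], min_gain: float = 0.01):
    """pairs holds (baseline_ms, hypothesis_ms) measured back to back.

    Returns (verdict, (ci_low, ci_high)) where the CI is on hypothesis/baseline.
    """
    logs = [math.log(hyp / base) for base, hyp in pairs]
    mean = statistics.mean(logs)
    sem = statistics.stdev(logs) / math.sqrt(len(logs))  # needs >= 2 pairs
    low, high = math.exp(mean - 1.96 * sem), math.exp(mean + 1.96 * sem)
    bar = 1.0 - min_gain  # a ratio below this counts as a real improvement
    if high < bar:
        verdict = "KEEP"
    elif low > bar:
        verdict = "DISCARD"
    else:
        verdict = "UNDECIDED"
    return verdict, (low, high)


print(paired_ab_verdict([(10.0, 8.0), (11.0, 8.8), (9.0, 7.2)])[0])
```

Because each ratio is taken within a pair measured in the same thermal window, slow drift affects numerator and denominator alike and largely cancels.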

…n against bad references

Agent-driven follow-up on the historical catalog data: picked hustvl/yolos-small (largest model at ~49ms, lowest measurement noise, and the high-value opset/fusion hypotheses were never measured — the original run timed out). Re-ran clean on QNN NPU.

Finding: no config beats the winml auto-config baseline (49.6ms, opset17/W8A16).
- opset 21 (h3): 48.6ms, paired-A/B MARGINAL -0.52% over 8 pairs -> within noise
- transformer fusions (h6 matmul_transpose / h7 bias_softmax / h8 attention): flat to slightly worse on this 99.9%-transformer model (conv%=0.1%)

The agent loop correctly REJECTED every candidate, including a tempting +47% false positive that a naive median comparison would have reported. Root-caused that false positive and hardened the methodology:
- catalog_qnn_sweep.py: npu-001 confirmation no longer fires on a degenerate opset-17 reference. It now (1) returns N/A when the explicit-opset-17 reference is HIGH-CV (yolos h1 reproducibly ran 66ms at CV>0.25 vs a stable 49ms baseline), (2) requires opset21 to also beat the auto-config baseline (h0) by the effect-size gate, not just the explicit-opset-17 stress build, and (3) respects an unbiased paired-A/B verdict.
- emit_champion_config: when no gain is RELIABLE, the champion falls back to the auto-config baseline (h0) instead of crowning a within-noise hypothesis. Records champion_verdict / reliable_improvement.
- gen_model_report.py: report KPIs surface a 'within noise — ship baseline' caveat and show h0 as the champion config when the best gain is not reliable.

Regenerated yolos + mobilevit champion configs (both now correctly ship h0 baseline) and restored the full cross-model SUMMARY table with yolos marked npu-001 N/A.
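The champion fallback rule can be illustrated with a small selection function. This is a sketch under assumptions: the `results` shape, the `pick_champion` name, and the `"champion"` key are hypothetical; only the rule itself (fall back to h0 unless some gain is RELIABLE) and the `reliable_improvement` field name come from the commit message.

```python
def pick_champion(results: dict) -> dict:
    """results maps a hypothesis label to {'p50_ms': float, 'verdict': str}.

    The baseline must be present under the label 'h0'.
    """
    reliable = {
        label: r for label, r in results.items()
        if label != "h0" and r["verdict"] == "RELIABLE"
    }
    if not reliable:
        # No gain survived the gates, so ship the auto-config baseline.
        return {"champion": "h0", "reliable_improvement": False}
    best = min(reliable, key=lambda label: reliable[label]["p50_ms"])
    return {"champion": best, "reliable_improvement": True}


# yolos-small figures from the commit message: h3 is faster on paper but only MARGINAL
yolos = {
    "h0": {"p50_ms": 49.6, "verdict": "BASELINE"},
    "h3": {"p50_ms": 48.6, "verdict": "MARGINAL"},
}
print(pick_champion(yolos))  # {'champion': 'h0', 'reliable_improvement': False}
```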

…Reviewer + add skill defs Refactor autoconfig.py's inlined Phase 2 loop into three named components that mirror autoconfig_diagram.html: - Explorer: hypothesis pool -> KB hard-blocks + Insight skip_set -> priority_queue - Optimizer: winml build -> Phase A screen -> Phase B full bench -> accuracy eval - Reviewer: ThroughputOnly verdict -> KEEP/MARGINAL/DISCARD -> KB draft main() is now a thin orchestrator wiring the three. Behavior is preserved (same constants, gates, early-exit, session.json, TSV, experiment docs). Also add skills/ -- the same architecture as composable SKILL.md defs: an autoconfig-orchestrator (the brain) delegating to autoconfig-explorer, autoconfig-optimizer, and autoconfig-reviewer sub-skills.

…restore debug-accuracy-drop skill ep-findings-summary.html: add a 'Show low-confidence findings' toggle (hidden by default) and remove the inaccurate 'winml perf: report p90/p99 + std' feature gap (winml perf already emits Avg/P50/P90/P95/P99/Min/Max/Std). skills-design.html: restore debug-accuracy-drop as rank-6 P2 in the user-skills table and remove the contradicting archived-skill appendix; update skill counts.

…drop autoconfig- prefix Moves each loop role's implementation script into its skill folder: - skills/orchestrator/autoconfig.py (Explorer/Optimizer/Reviewer + main loop) - skills/explorer/{analyze_insight,analyze_graph}.py - skills/optimizer/bench_utils.py - skills/reviewer/promote_findings.py Shared helpers -> lib/ (report_gen, gen_model_report); batch drivers -> tools/ (catalog_*_sweep, validation_sweep, gen_report_v3). Skill folders de-prefixed (orchestrator/explorer/optimizer/reviewer). Cross-imports now package-qualified via an ep_knowledge-anchored _AGENT_ROOT bootstrap; resource paths re-rooted. README directory layout + run commands and SKILL.md references updated.

…n driver Collapse catalog_{cpu,gpu,qnn}_sweep.py into a single tools/catalog_sweep.py that takes --ep/--device and reads its hypothesis matrix, model catalog, and bench protocol (screen/full iters, thermal handling, effect-size gate, paired A/B, accuracy eval, cross-checks) from the knowledge base. All per-EP behavior is preserved, config-gated in JSON. Rename ep_knowledge/<ep>.json -> ep_device_knowledge/<ep>_<device>.json so the KB can distinguish qnn_npu, qnn_gpu, dml_gpu and cpu_cpu. Each KB file now also carries sweep_config / hypotheses / models / cross_checks consumed by the driver. Update orchestrator/explorer/reviewer scripts, SKILL.md docs, KB README and the top-level README to the new single-script + renamed-KB layout.

DingmaomaoBJTU requested a review from a team as a code owner June 15, 2026 02:30

github-actions Bot and others added 2 commits June 15, 2026 10:32

github-advanced-security AI found potential problems Jun 15, 2026

Comment thread research/autoconfig/tools/gen_report_v3.py Fixed

xieofxie reviewed Jun 16, 2026

github-actions Bot and others added 2 commits June 16, 2026 14:33

github-advanced-security AI found potential problems Jun 16, 2026

Comment thread research/autoconfig/tools/validation_sweep.py Fixed

Comment thread research/autoconfig/tools/validation_sweep.py Fixed

github-actions Bot added 8 commits June 16, 2026 19:12

github-advanced-security AI found potential problems Jun 17, 2026

Comment thread research/autoconfig/autoconfig.py Fixed

github-advanced-security AI found potential problems Jun 17, 2026

Comment thread research/autoconfig/autoconfig.py Fixed

Comment thread research/autoconfig/skills/orchestrator/autoconfig.py Fixed

Comment thread research/autoconfig/skills/optimizer/bench_utils.py Fixed

github-actions Bot added 8 commits June 17, 2026 16:29

research(autoconfig): rename --report to --json, drop dml-004 FR row

2c38721

research(autoconfig): trim autoconfig_diagram.html — shorter bullets,…

5c5684a

… condensed footnote

research(autoconfig): add Pending Features badge to Phase 3 in diagram

e18e066

research(autoconfig): add local PyTorch reference FR; clarify correct…

526fcd0

…ness contract

research(autoconfig): fix Phase 0 layout — nowrap, 3 equal-width boxes

9d5148e

github-actions Bot added 30 commits June 20, 2026 17:43

research(skill-plan): make lifecycle SVG fill the frame

35dba55

Set lifecycle SVG to full-width rendering with fixed panel height so the diagram occupies the whole card area and avoids right-side whitespace.

docs: refine skills design with debug-model and appendix

b064016

docs: reorder user and contributor skill rankings

ac44feb

docs: remove fusedconv feature-request section

3ab63e5

docs: add professional skill evaluation section

3dc5b4c

docs: move and simplify skill evaluation section

b736e7e

docs: move pruning and queueing into Explorer stage

d9edd2e

docs: align phase-2 agent relationships with ppt flow

73b4bb5

docs: make phase-2 loop relationship explicit in autoconfig diagram

3f82473

docs: make lead role span full autoconfig lifecycle

bdd8e29

docs: switch lifecycle lead bar to vertical orchestrator line

c11fd26

docs: add concise agent design html and widen orchestrator lifeline

e3647d2

docs: restructure agent design flow to skills-style narrative

75d963e

docs: regenerate agent design as skills-style document

9a9f8bf

docs: enrich agent design and align I/O with diagram

1497501

docs: expand orchestration details and add customer evidence

f1d43c3

docs: sharpen winml-cli vs olive positioning

06e1ff0

docs: revise problem focus and add key user scenarios

5683ca7

docs: add execution loop and scenario-driven outcomes

600a7ab

docs: refocus solution on loop and add role/policy sections

d049a9c

docs: reorganize input section to match user-input strip

487ca02

docs: remove input details table

19bb6e5

research: add autoconfig POC with QNN NPU catalog sweep #891
DingmaomaoBJTU wants to merge 89 commits into main from dingmaomaobjtu/research-autoconfig-poc

3 participants
