feat: add NemotronH hybrid Mamba2-Transformer architecture adapter by mukund1985 · Pull Request #1434 · TransformerLensOrg/TransformerLens

mukund1985 · 2026-06-22T21:57:30Z

Implements TransformerBridge support for NemotronHForCausalLM (nvidia/Nemotron-H-8B-Base, Nemotron-H-47B-A13B).

Architecture

NemotronH is a hybrid with ~8% attention layers and ~92% Mamba-2 SSM/MoE/MLP layers. Each block has a single pre-norm (block.norm) and a single .mixer attribute whose type varies by layer (layers_block_type config field). There is no ln2 or model-level rotary module.

Adapter design

SSMBlockBridge is used as the block container — it delegates the full forward to the HF block without enforcing the ln2 that BlockBridge requires.

SSM2MixerBridge(name="mixer") wraps .mixer for all four layer types. Its forward is a pure passthrough (original_component(*args, **kwargs)), so it works correctly for attention, MLP, and MoE mixers as well as Mamba ones.

Mamba-specific submodules (in_proj, conv1d, inner_norm, out_proj) are declared optional=True so component_setup skips them gracefully on non-Mamba layers. Note: GatedRMSNormBridge.__init__ does not accept optional= (unlike the GeneralizedComponent base), so the attribute is set directly post-construction.

Stateful generation uses DynamicCache (transformers >= 5.12), which carries both KV-cache entries and SSM conv/recurrent states in one object.

applicable_phases = [] — verify_models is transformer-shaped; forward-pass correctness lives in integration tests.

Changes

transformer_lens/model_bridge/supported_architectures/nemotron_h.py — new adapter
transformer_lens/model_bridge/supported_architectures/__init__.py — export
transformer_lens/factories/architecture_adapter_factory.py — registration
transformer_lens/tools/model_registry/__init__.py — HF_SUPPORTED_ARCHITECTURES + CANONICAL_AUTHORS_BY_ARCH
tests/unit/model_bridge/supported_architectures/test_nemotron_h_adapter.py — 52 unit tests, all passing

Closes #1402

…RoPE) Closes TransformerLensOrg#1400. DeepSeek-V2, V2-Lite, and Coder-V2 all use DeepseekV2ForCausalLM. This adds a bridge adapter covering three V2-specific differences from V3: 1. Complex-exponential RoPE: V2's rotary embedding returns freqs_cis (a complex tensor via torch.polar) rather than a (cos, sin) tuple. - RotaryEmbeddingBridge.forward() now passes complex tensors through without raising, leaving them for the attention bridge to consume. - MLAAttentionBridge.forward() detects complex position_embeddings and dispatches to a new _apply_rotary_complex() helper that mirrors DeepSeek-V2's apply_rotary_emb (view_as_complex, multiply, flatten). 2. Optional Q LoRA path: V2-Lite sets q_lora_rank=None, skipping q_a_proj/q_a_layernorm/q_b_proj and using q_proj directly instead. All three Q-path submodules are marked optional=True in the adapter; q_a_layernorm uses GeneralizedComponent (which already supports optional) rather than RMSNormalizationBridge. MLAAttentionBridge already branches on q_lora_rank at runtime. 3. Gate not hookable: DeepseekV2Moe.forward() routes via nn.functional.linear(..., self.gate.weight) rather than self.gate(hidden_states), so the gate module's forward() is never called and bridge hooks cannot fire. The gate is omitted from MoEBridge submodules; shared_experts uses __call__ and hooks fine. Files changed: - supported_architectures/deepseek_v2.py (new) - supported_architectures/__init__.py: register adapter - factories/architecture_adapter_factory.py: map DeepseekV2ForCausalLM - generalized_components/mla_attention.py: complex RoPE support - generalized_components/rotary_embedding.py: complex tensor pass-through - tests/integration/model_bridge/test_deepseek_v2_adapter.py (new, 17 tests)

Implements TransformerBridge support for NemotronHForCausalLM (nvidia/Nemotron-H-8B-Base, Nemotron-H-47B-A13B). Architecture overview: - Heterogeneous layers defined by config.layers_block_type: each element is one of mamba, attention, moe, or mlp (~8% attention, ~92% SSM/MLP/MoE) - Single pre-norm (block.norm) and single residual path per block; no ln2 - Single .mixer attribute per block whose type varies by layer - No model-level rotary embedding module; attention handles RoPE internally - Stateful generation via DynamicCache (transformers >= 5.12) Key adapter decisions: - SSMBlockBridge as block container: delegates full forward to HF block, avoids ln2 enforcement that BlockBridge would apply incorrectly here - SSM2MixerBridge(name=mixer) as passthrough wrapper: works for all four mixer types since forward calls original_component(*args, **kwargs) - Mamba-specific submodules (in_proj, conv1d, inner_norm, out_proj) marked optional so component_setup skips them gracefully on non-Mamba layers - GatedRMSNormBridge.optional set post-init (its __init__ does not accept the kwarg, unlike the GeneralizedComponent base class) - positional_embedding_type=none: no model-level rotary to wire - gated_mlp=False: MLP layers use relu2, not SwiGLU - applicable_phases=[]: verify_models is transformer-shaped; integration tests cover forward-pass correctness instead Registration: - architecture_adapter_factory.py: NemotronHForCausalLM key added - supported_architectures/__init__.py: export added - tools/model_registry/__init__.py: HF_SUPPORTED_ARCHITECTURES and CANONICAL_AUTHORS_BY_ARCH entries added (canonical author: nvidia) Tests (52 unit tests, all passing): - Config attribute propagation (normalization_type, positional_embedding_type, gated_mlp, is_stateful, final_rms, mamba_intermediate_size, conv_dim, layers_block_type, applicable_phases, weight_processing_conversions) - Top-level component mapping bridge types and HF path names - Block submodule bridge types (norm, mixer; no ln2) - Mixer submodule types, names, and optional flags for all four Mamba keys - create_stateful_cache returns DynamicCache; independent per call - Factory registration and model registry constants - Guard tests: SSMBlockBridge not BlockBridge, no weight conversions, mamba_intermediate_size and conv_dim formulas Closes TransformerLensOrg#1402

mukund1985 · 2026-06-22T22:57:58Z

@jlarson4 PR is open, please review this.

jlarson4 · 2026-06-23T16:13:22Z

Solid adapter! Great reuse of the existing SSM bridge components, and the hybrid handling (single pre-norm, passthrough mixer for all four layer types, optional Mamba submodules driven by layers_block_type) is a solid solution. Setting applicable_phases=[] is the standard I set with Mamba/Mamba2, eventually we will update the verification system to verify SSM architectures, but that is a future consideration, not something you need to worry about here. I should probably open an issue for that.

The one thing I'd like to see before merging: a forward-vs-HF parity test. Right now all the tests are structural, there's no check that the bridge matches HF numerically. Please add a forward-pass logit match against HF on a NemotronH checkpoint, and ideally a short multi-token generation match to exercise the state handling, if possible. If you run into issues, document them in a PR comment and we can consider next steps.

mukund1985 · 2026-06-23T16:28:47Z

Added the forward-pass and generation parity tests in tests/integration/model_bridge/test_nemotron_h_adapter.py. The forward test asserts max diff == 0.0 against HF (same guarantee as Mamba-2, since SSM2MixerBridge is a pure passthrough). The generation test runs greedy decoding for 8 tokens and checks bit-for-bit equality with hf_model.generate(), exercising DynamicCache across all four layer types (mamba / attention / moe / mlp). Requires loading the 8B checkpoint so it is not wired into CI — run locally with pytest tests/integration/model_bridge/test_nemotron_h_adapter.py -v -s.

mukund1985 added 4 commits June 18, 2026 16:57

style: fix formatting and mypy errors

96fd927

fix: register DeepseekV2ForCausalLM in model registry

1f05ef7

mukund1985 mentioned this pull request Jun 22, 2026

[Proposal] Add Nemotron-H hybrid Mamba2-Transformer adapter #1402

Open

1 task

style: apply black + isort formatting fixes

1591b59

mukund1985 and others added 2 commits June 23, 2026 00:02

style: fix isort order for NemotronHArchitectureAdapter import

9f5f0a8

Merge remote-tracking branch 'origin/dev' into feat/nemotron-h-adapter

646a261

mukund1985 added 3 commits June 23, 2026 17:26

test: add NemotronH forward-pass and generation parity integration tests

0a750c7

test: add NemotronH forward-pass and generation parity integration tests

df7bdd3

test: add NemotronH forward-pass and generation parity integration tests

ffde25d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add NemotronH hybrid Mamba2-Transformer architecture adapter#1434

feat: add NemotronH hybrid Mamba2-Transformer architecture adapter#1434
mukund1985 wants to merge 10 commits into
TransformerLensOrg:devfrom
mukund1985:feat/nemotron-h-adapter

mukund1985 commented Jun 22, 2026

Uh oh!

mukund1985 commented Jun 22, 2026 •

edited

Loading

Uh oh!

jlarson4 commented Jun 23, 2026

Uh oh!

mukund1985 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mukund1985 commented Jun 22, 2026

Architecture

Adapter design

Changes

Uh oh!

mukund1985 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jlarson4 commented Jun 23, 2026

Uh oh!

mukund1985 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mukund1985 commented Jun 22, 2026 •

edited

Loading