atandra2000/README.md

Atandra Bharati

Deep Learning Research Engineer rebuilding frontier AI architectures from scratch — LLMs, latent diffusion, multimodal, video understanding. PyTorch-first, single-GPU heroics, paper-faithful reproductions.


🧭 Open to roles

Applied ML · ML Research · Research Engineer · GenAI Engineering. Remote-friendly; available worldwide.

🛠️ Core stack

Languages & core ML: Python · PyTorch · CUDA

Architectures: Transformers · GQA · MLA · RoPE · SwiGLU · RMSNorm · MoE · Gated Delta Net · MTP · Diffusion UNet · VAE · GAN · CycleGAN · ST-GCN

Optimization & numerics: BF16 · Flash Attention 2 · torch.compile · Gradient checkpointing · μP scaling · WSD LR · Chunked cross-entropy · Disk-backed token caching

Hardware validated: A100 80GB · RTX 5090 · RTX 6000 Ada · RTX 3090 · P100 · T4 (2×)

🔭 Now

Shipping the Autonomous ML Research Engineer platform (15 phases, 23 agents) and exploring a paper on mixture-of-depths routing for sub-1B parameter LLMs.


Highlights

  • 78% peak memory reduction (92 GB → 20 GB) for LLM pretraining via gradient checkpointing, chunked cross-entropy, and disk-backed token caching — enabling 2× batch-size headroom on a single A100 80GB.
  • Training loss 0.0947 at epoch 16 on Stable Diffusion 1.x (860M UNet) trained from scratch across a 7-phase curriculum on 2× RTX 5090.
  • 878 passing tests, 15 cooperating phases, 23 agents, 61 tools, 186 models in the Autonomous ML Research Engineer platform — a full research loop from paper to conclusions, with self-repair and provider-agnostic LLM routing.
  • 12 end-to-end projects spanning LLMs, generative vision, multimodal AI, and video — every project engineered for single-GPU feasibility.
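
To illustrate the chunked cross-entropy named in the first highlight, here is a minimal pure-Python sketch. It is not the code from any of the repositories; the function names, the list-based "tensors", and the `chunk_size` parameter are assumptions made for illustration. The idea shown is that the mean loss is a sum over tokens, so logits can be produced and discarded one chunk at a time instead of being held for the whole sequence.

```python
import math

def log_softmax_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def project(hidden, weight):
    """Logits for one token: `weight` holds one row per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def chunked_cross_entropy(hiddens, targets, weight, chunk_size):
    """Mean cross-entropy that builds logits for only `chunk_size` tokens at a time."""
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        chunk_h = hiddens[start:start + chunk_size]
        chunk_t = targets[start:start + chunk_size]
        # Only this chunk's logits exist at once; they are dropped after the sum.
        chunk_logits = [project(h, weight) for h in chunk_h]
        total += sum(log_softmax_nll(l, t) for l, t in zip(chunk_logits, chunk_t))
    return total / len(hiddens)
```

In a real PyTorch training loop the same pattern would be applied to the output projection and loss, typically combined with recomputation in the backward pass; that part is outside this sketch.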

Projects

| Category | Project | Highlight | Stack / hardware |
|---|---|---|---|
| Architecture | DeepSeek-v3-Lite (422M) | MLA + aux-loss-free MoE + MTP, end-to-end with inference absorption | PyTorch · μP · 8.4B-token Chinchilla recipe |
| Architecture | LLaMA-3-Lite (515M) | GQA · RoPE · fused SwiGLU · RMSNorm · Flash-Attn 2 · chunked CE | PyTorch · BF16 · A100 80GB |
| Architecture | FusionLLM (415.6M active / 868.6M stored) | MLA + Gated Delta Net + MoE + MTP in a 24-layer hybrid | PyTorch · NorMuon + CautiousAdamW · WSD + μP |
| Generative vision | Stable Diffusion 1.x (860M UNet) | Best loss 0.0947 at epoch 16; 42-epoch run | PyTorch · BF16 · 2× RTX 5090 |
| Generative vision | FaceAgingCycleGAN (AdaIN-conditioned) | 31 epochs on IMDB-WIKI; per-layer age conditioning, 3-scale PatchGAN | PyTorch · RTX 6000 Ada |
| Generative vision | FaceGenerationVAE (β-VAE) | 50 epochs on CelebA; recon MSE 0.0152, KL annealing 0→1 | PyTorch · bilinear-upsample decoder |
| Generative vision | DCGAN-Face-Generation | 50 epochs on 202k CelebA; D loss → ln 2 ≈ 0.693 equilibrium | PyTorch · 2× T4 |
| Multimodal | VisionLangModel (PaliGemma-style) | Trained end-to-end on COCO 2014 captions; zero pre-trained weights | PyTorch · P100 |
| NLP | TranslationLM (EN→IT seq2seq) | 20 epochs on OPUS Books; cross-attention visualizations, custom SentencePiece BPE | PyTorch · T4 |
| Foundations | GPT-From-Scratch | 200-line educational GPT-2, trained on Tiny Shakespeare | PyTorch |
| Agentic / research infra | Autonomous ML Research Engineer | 15-phase multi-agent platform: paper → plan → patch → train → evaluate → iterate → report | PyTorch · Ollama Cloud · multi-agent · 878 tests |
| In progress | ActionRecognition (ST-GCN) | Pose + ST-GCN pipeline ready; NTU RGB+D 120 benchmark pending | PyTorch |
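
Two figures in the project table can be sanity-checked with a few lines of arithmetic. The sketch below assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter, and assumes the reported DCGAN discriminator loss is a binary cross-entropy averaged over the real and fake terms. Both are assumptions, not details confirmed by the repositories, and the helper names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # assumption: common Chinchilla rule of thumb

def chinchilla_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal token budget for a model with n_params parameters."""
    return n_params * tokens_per_param

def bce(prob, label):
    """Binary cross-entropy for a single prediction `prob` against a 0/1 `label`."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def discriminator_loss_at_equilibrium():
    """At equilibrium the discriminator outputs 0.5 for real and fake samples alike."""
    real_term = bce(0.5, 1)
    fake_term = bce(0.5, 0)
    # Assumption: the two terms are averaged; if they were summed the value would be 2 ln 2.
    return 0.5 * (real_term + fake_term)
```

Under these assumptions a 422M-parameter model maps to roughly 8.4B tokens, and the equilibrium discriminator loss equals ln 2, matching the table entries.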

Writing

  • "Multi-Head Latent Attention — A Technical Deep-Dive": a 643-line reference covering KV-cache math, low-rank compression algebra, the absorption-trick derivation, decoupled RoPE mechanics, and SDPA vs manual attention path trade-offs in DeepSeek-V2/V3.
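
The KV-cache math mentioned above can be illustrated with a back-of-the-envelope count. This is a minimal sketch, not an excerpt from the write-up: it assumes standard multi-head attention caches one key and one value vector per head, while MLA caches a single compressed latent plus one decoupled RoPE key per token per layer. The function names and any dimensions passed in are hypothetical.

```python
def mha_kv_elements_per_token(n_heads, head_dim):
    """Cached elements per token per layer for standard multi-head attention (K and V)."""
    return 2 * n_heads * head_dim

def mla_kv_elements_per_token(latent_dim, rope_dim):
    """Cached elements per token per layer for MLA: compressed latent plus shared RoPE key."""
    return latent_dim + rope_dim

def kv_cache_bytes(elements_per_token, n_layers, seq_len, bytes_per_element=2):
    """Total cache size; bytes_per_element=2 corresponds to BF16/FP16 storage."""
    return elements_per_token * n_layers * seq_len * bytes_per_element
```

For example, with illustrative values of 16 heads of dimension 64, a latent of 128, and a RoPE key of 32, the per-token cache shrinks from `mha_kv_elements_per_token(16, 64)` to `mla_kv_elements_per_token(128, 32)` elements per layer.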

Connect

Portfolio · LinkedIn · Weights & Biases · Kaggle · Comet

Pinned

  1. StableDiffusion (Public)

    A Stable Diffusion 1.x-class latent diffusion model trained from scratch on 2× RTX 5090 (Blackwell) GPUs. Full UNet (~860M params), DDPM/DDIM, LAION pipeline, DDP+BF16.

    Python

  2. DeepSeek-v3-Lite (Public)

    Faithful from-scratch reimplementation of DeepSeek-V3 (MLA + MoE + MTP), scaled for Chinchilla-optimal 422M training on a single A100 80GB

    Python 1