pranavranjan13/Autonomous-Data-Engineering-Squad

🚀 Autonomous Data Engineering Squad

Executive Summary

This project demonstrates a virtual Data Engineering Department built using Microsoft AutoGen (AG2). It automates the end-to-end lifecycle of production data pipelines, from initial architecture and deterministic quality governance to multi-cloud infrastructure deployment (IaC).

By implementing a Self-Healing Feedback Loop, the system ensures every script meets strict enterprise standards for performance and cost-optimization before it reaches production.


🏗️ The Agentic Architecture

The squad utilizes a multi-layered approach to ensure code reliability and governance:

  1. The Data Architect (Agent): A Senior Engineer persona specialized in generating high-scale PySpark logic.
  2. The Local Quality Gate (Deterministic): A high-speed validation layer that enforces "Senior Standards" (partitioning, explicit schemas, and date derivation) using Python regex checks, so a script is approved only when it passes every rule.
  3. The Cloud Architect (Agent): A DevOps specialist that translates approved code into Terraform (AWS Glue) and YAML (Azure Databricks).
  4. The Admin (Orchestrator): Manages the workflow, handles agent hand-offs, and controls the self-correction logic.
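As a rough illustration of what a deterministic, regex-based gate like the one in step 2 might look like, here is a minimal sketch. The rule names, the patterns, and the `run_quality_gate` function are assumptions made for this example; they are not taken from the repository's actual checks.

```python
import re

# Hypothetical rules: names and patterns are illustrative assumptions,
# not the checks shipped in this repository.
RULES = {
    "partitioning": re.compile(r"\.partitionBy\s*\("),
    "explicit_schema": re.compile(r"\bStructType\s*\(|\.schema\s*\("),
    "date_derivation": re.compile(r"\b(to_date|year|month|dayofmonth)\s*\("),
}


def run_quality_gate(script: str) -> list[str]:
    """Return the names of the rules the script fails (empty list means pass)."""
    return [name for name, pattern in RULES.items() if not pattern.search(script)]


# A script that satisfies all three illustrative rules.
GOOD_SCRIPT = '''
schema = StructType([StructField("ts", TimestampType())])
df = spark.read.schema(schema).json("s3://bucket/events/")
df = df.withColumn("event_date", to_date("ts"))
df.write.partitionBy("event_date").parquet("s3://bucket/out/")
'''

# A script with no schema, no derived date, and no partitioning.
BAD_SCRIPT = 'df = spark.read.json("in/"); df.write.parquet("out/")'

if __name__ == "__main__":
    print("good:", run_quality_gate(GOOD_SCRIPT))
    print("bad:", run_quality_gate(BAD_SCRIPT))
```

Because the gate only inspects text, it runs without a Spark session and returns the same verdict for the same script every time, which is what makes it usable as a hard approval step.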

🛠️ Key Technical Features

1. The Self-Healing Feedback Loop

Unlike standard code generators, this system audits its own work. If the Architect's script fails a quality check (e.g., missing partitioning on a 500GB+ dataset), the system triggers an automatic rewrite cycle. The specific failure logs are fed back to the agent, and the cycle repeats until the code passes every check.
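The rewrite cycle can be sketched as a plain loop around a generator and a checker. Everything named here is hypothetical: `self_heal`, the stub functions standing in for the Architect agent and the quality gate, and the `max_rounds` cap, which is an assumption added so the sketch always terminates.

```python
from typing import Callable


def self_heal(
    generate: Callable[[str], str],
    check: Callable[[str], list[str]],
    task: str,
    max_rounds: int = 3,  # assumed safety cap, not a documented setting
) -> tuple[str, int]:
    """Regenerate until `check` reports no failures; return (script, rounds used)."""
    prompt = task
    failures: list[str] = []
    for round_no in range(1, max_rounds + 1):
        script = generate(prompt)
        failures = check(script)
        if not failures:
            return script, round_no
        # Feed the specific failures back into the next prompt.
        prompt = f"{task}\n\nFix these failed checks: {', '.join(failures)}"
    raise RuntimeError(f"still failing after {max_rounds} rounds: {failures}")


def stub_architect(prompt: str) -> str:
    """Stand-in for the LLM agent: only partitions once it is told to."""
    if "partitioning" in prompt:
        return 'df.write.partitionBy("event_date").parquet("out/")'
    return 'df.write.parquet("out/")'


def stub_check(script: str) -> list[str]:
    """Stand-in for the quality gate: one illustrative rule."""
    return [] if "partitionBy(" in script else ["partitioning"]


if __name__ == "__main__":
    script, rounds = self_heal(stub_architect, stub_check, "Write the events pipeline")
    print(f"approved after {rounds} round(s): {script}")
```

With these stubs the first draft fails the partitioning rule, the failure name is appended to the prompt, and the second draft passes.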

2. Multi-Cloud Infrastructure as Code (IaC)

Once the PySpark logic is validated, the system automatically generates:

  • AWS Glue (Terraform): Optimized with G.2X workers and IAM trust policies.
  • Azure Databricks (YAML): Configured for Standard_DS3_v2 clusters and automated CI/CD triggers.

3. Resource & Cost Governance

Engineered for production efficiency under resource constraints:

  • Sliding Context Window: Limits conversation history to prevent "Token Snowballing."
  • Token Insurance: Hard limits on session tokens to ensure cost-effective operations within a daily budget.
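The two controls above can be pictured with a small in-memory history class. This is a sketch under stated assumptions: the `BudgetedHistory` class, its parameter names, and the four-characters-per-token estimate are all illustrative, and the project may count tokens and trim history differently.

```python
from collections import deque


class BudgetedHistory:
    """Keeps only the newest messages and refuses to exceed a session token budget."""

    def __init__(self, max_messages: int, max_session_tokens: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which gives the sliding-window behaviour.
        self.window: deque = deque(maxlen=max_messages)
        self.max_session_tokens = max_session_tokens
        self.tokens_used = 0

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Crude assumption: about 4 characters per token.
        return max(1, len(text) // 4)

    def add(self, role: str, content: str) -> None:
        cost = self.estimate_tokens(content)
        if self.tokens_used + cost > self.max_session_tokens:
            raise RuntimeError("session token budget exhausted")
        self.tokens_used += cost
        self.window.append({"role": role, "content": content})

    def context(self) -> list:
        """Messages that would be sent with the next request."""
        return list(self.window)


if __name__ == "__main__":
    history = BudgetedHistory(max_messages=2, max_session_tokens=1000)
    for i in range(5):
        history.add("user", f"message number {i}")
    print(history.context())
    print("tokens used:", history.tokens_used)
```

The window bounds the size of each request, while the running total bounds the whole session; the two limits are independent, so old messages dropped from the window still count against the budget.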

🚀 Getting Started

1. Prerequisites

  • Python 3.10+
  • An API Key for a supported LLM Gateway (e.g., Euron/OpenAI)

2. Installation

git clone https://github.com/YOUR_USERNAME/Autonomous-Data-Engineering-Squad.git
cd Autonomous-Data-Engineering-Squad
python -m venv venv
# Windows: venv\Scripts\activate | Mac/Linux: source venv/bin/activate
pip install -r requirements.txt

3. Setup

Create a .env file in the root directory:

EURI_API_KEY=your_actual_key_here
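A script can fail fast when the key is missing or still set to the placeholder. The sketch below assumes the `.env` values have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`, if the project uses it); the `require_api_key` helper is hypothetical.

```python
import os

PLACEHOLDER = "your_actual_key_here"


def require_api_key(name: str = "EURI_API_KEY") -> str:
    """Return the key from the environment, or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Checking for the placeholder catches the common case where the template line was copied without being edited.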

4. Run the Squad

python production.py

📊 Verification of Outputs

The system generates three primary artifacts in your project root:

| Artifact | File Name | Description |
| --- | --- | --- |
| Audit Log | *_full_squad.txt | A complete record of the "thinking," review, and correction process. |
| Production Code | approved_script.py | The final, validated, and partition-aware PySpark script. |
| Infra Config | infra_config.txt | Terraform and YAML deployment logic generated by the Cloud Architect. |

🧠 Core Competencies Demonstrated

This project serves as a showcase of my ability to:

  • Orchestrate Multi-Agent Systems to solve complex engineering bottlenecks.
  • Enforce Data Governance (partitioning, schema-on-read, compute sizing) programmatically.
  • Bridge AI Frameworks with deterministic logic to eliminate hallucinations in production code.
  • Manage AI OpEx by optimizing token usage and context windows.
