This project demonstrates a virtual Data Engineering Department built using Microsoft AutoGen (AG2). It automates the end-to-end lifecycle of production data pipelinesโfrom initial architecture and deterministic quality governance to multi-cloud infrastructure deployment (IaC).
By implementing a Self-Healing Feedback Loop, the system ensures every script meets strict enterprise standards for performance and cost-optimization before it reaches production.
The squad utilizes a multi-layered approach to ensure code reliability and governance:
- The Data Architect (Agent): A Senior Engineer persona specialized in generating high-scale PySpark logic.
- The Local Quality Gate (Deterministic): A high-speed validation layer that enforces "Senior Standards" (Partitioning, explicit schemas, and date derivation) using Python regex to ensure 100% compliance.
- The Cloud Architect (Agent): A DevOps specialist that translates approved code into Terraform (AWS Glue) and YAML (Azure Databricks).
- The Admin (Orchestrator): Manages the workflow, handles agent hand-offs, and controls the self-correction logic.
Unlike standard code generators, this system audits its own work. If the Architect fails a quality check (e.g., missing partitioning on a 500GB+ dataset), the system triggers an automatic rewrite cycle. The specific failure logs are fed back to the agent until the code is 100% compliant.
Once the PySpark logic is validated, the system automatically generates:
- AWS Glue (Terraform): Optimized with G.2X workers and IAM trust policies.
- Azure Databricks (YAML): Configured for Standard_DS3_v2 clusters and automated CI/CD triggers.
Engineered for production efficiency under resource constraints:
- Sliding Context Window: Limits conversation history to prevent "Token Snowballing."
- Token Insurance: Hard limits on session tokens to ensure cost-effective operations within a daily budget.
- Python 3.10+
- An API Key for a supported LLM Gateway (e.g., Euron/OpenAI)
git clone [https://github.com/YOUR_USERNAME/Autonomous-Data-Engineering-Squad.git](https://github.com/YOUR_USERNAME/Autonomous-Data-Engineering-Squad.git)
cd Autonomous-Data-Engineering-Squad
python -m venv venv
# Windows: venv\Scripts\activate | Mac/Linux: source venv/bin/activate
pip install -r requirements.txt
Create a .env file in the root directory:
EURI_API_KEY=your_actual_key_here
python production.py
The system generates three primary artifacts in your project root:
| Artifact | File Name | Description |
|---|---|---|
| Audit Log | *_full_squad.txt |
A complete record of the "thinking," review, and correction process. |
| Production Code | approved_script.py |
The final, validated, and partition-aware PySpark script. |
| Infra Config | infra_config.txt |
Terraform and YAML deployment logic generated by the Cloud Architect. |
This project serves as a showcase of my ability to:
- Orchestrate Multi-Agent Systems to solve complex engineering bottlenecks.
- Enforce Data Governance (partitioning, schema-on-read, compute sizing) programmatically.
- Bridge AI Frameworks with deterministic logic to eliminate hallucinations in production code.
- Manage AI OpEx by optimizing token usage and context windows.