Your AI assistant can run SQL. But it doesn't know that cust_id contains PII, that the table was deprecated last month, or who to ask when something breaks.
mcp-data-platform fixes that. It is a single MCP server that connects AI assistants to your data infrastructure and enriches every response with business context from your semantic layer: query a table and get its meaning, owners, quality scores, and deprecation warnings in the same call.
It is a platform, not just a bridge. The same endpoint gives agents persistent memory and a governed path to write knowledge back to the catalog, proxies third-party MCP servers and REST APIs through one authentication, persona, and audit pipeline, and ships a web portal where AI-generated artifacts are saved, organized into collections, and shared with teammates.
The only required backend is DataHub as the semantic layer. Add Trino for SQL and S3 for object storage when you're ready. The documentation explains why this stack was chosen.
AI assistants are good at querying data, but they work blind. When an agent asks "What's in the orders table?", it gets column names and types. It doesn't know that customer_id is PII, that the table is deprecated in favor of orders_v2, that the quality score dropped last week, or who to contact when something looks wrong.
# Without mcp-data-platform
─────────────────────────────────────────────────────────────────────
User: "Describe the orders table"
AI: Queries Trino → gets columns and types
User: "Who owns this data?"
AI: Queries DataHub → finds owners
User: "Is this table still active?"
AI: Queries DataHub again → finds deprecation status
User: "What does customer_id actually mean?"
AI: Queries DataHub again → finds column descriptions
─────────────────────────────────────────────────────────────────────
4 round trips. Context scattered across conversations. Easy to miss warnings.
# With mcp-data-platform
─────────────────────────────────────────────────────────────────────
User: "Describe the orders table"
AI: Gets everything in one response:
→ Schema: columns and types
→ ⚠️ DEPRECATED: Use orders_v2 instead
→ Owners: Data Platform Team
→ Tags: pii, financial
→ Quality Score: 87%
→ Column meanings and business definitions
─────────────────────────────────────────────────────────────────────
1 call. Complete context. Warnings front and center.
sequenceDiagram
participant AI as AI Assistant
participant P as mcp-data-platform
participant T as Trino
participant D as DataHub
AI->>P: trino_describe_table "orders"
P->>T: DESCRIBE orders
T-->>P: columns, types
P->>D: Get semantic context
D-->>P: description, owners, tags, quality, deprecation
P-->>AI: Schema + Full Business Context
The platform intercepts tool responses at the protocol level and enriches them with context from the other services. This cross-enrichment is bidirectional:
- Trino → DataHub: query results include owners, tags, glossary terms, deprecation warnings, quality scores
- DataHub → Trino: search results include query availability and sample SQL
- S3 ↔ DataHub: object listings include matching dataset metadata, and dataset searches show storage availability
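The Trino → DataHub direction above can be pictured with a small Python sketch. It is illustrative only and is not the platform's Go implementation: the `SemanticContext` shape, the `enrich` function, and the `semantic_context` and `warnings` keys are hypothetical names chosen for this example, and the sample values are taken from the orders walkthrough earlier in this README.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SemanticContext:
    """Hypothetical shape for the metadata a semantic layer returns."""
    description: str = ""
    owners: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    quality_score: Optional[float] = None
    deprecation_note: Optional[str] = None


def enrich(tool_result: dict, context: Optional[SemanticContext]) -> dict:
    """Return a copy of a tool result with semantic context attached.

    When no context is available, the result passes through unchanged.
    """
    enriched = dict(tool_result)
    if context is None:
        return enriched
    enriched["semantic_context"] = {
        "description": context.description,
        "owners": context.owners,
        "tags": context.tags,
        "quality_score": context.quality_score,
    }
    # Surface deprecation as a top-level warning so it is hard to miss.
    if context.deprecation_note:
        enriched["warnings"] = ["DEPRECATED: " + context.deprecation_note]
    return enriched


# Hypothetical schema result, as a Trino describe call might return it.
schema = {"table": "orders", "columns": [{"name": "customer_id", "type": "bigint"}]}
context = SemanticContext(
    owners=["Data Platform Team"],
    tags=["pii", "financial"],
    quality_score=0.87,
    deprecation_note="Use orders_v2 instead",
)
result = enrich(schema, context)
print(result["warnings"])  # ['DEPRECATED: Use orders_v2 instead']
```

The sketch copies the original result rather than mutating it, so the raw tool output stays available if enrichment has to be skipped.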
Each feature links to its full documentation.
| Feature | Description |
|---|---|
| Cross-enrichment | Business context added to every tool response automatically, with session dedup to save tokens |
| Lineage inheritance | Column descriptions inherited from upstream datasets via DataHub lineage |
| Universal search | One search tool fans a query across the catalog, knowledge pages, memory, insights, assets, prompts, and APIs; fetch dereferences any result |
| Workflow gating | Session-aware guidance that steers agents to discovery before SQL, with escalating warnings |
| Tools | Full tool reference for Trino, DataHub, S3, knowledge, memory, portal, and gateway toolkits |
| Feature | Description |
|---|---|
| Memory layer | Persistent agent memory across sessions, PostgreSQL + pgvector, hybrid semantic/lexical recall |
| Knowledge capture | Agents record domain insights during sessions; approved knowledge is written back to DataHub or canonical knowledge pages |
| Governance workflow | Human-in-the-loop review, approve/reject, changeset tracking, and rollback for every applied change |
| Managed resources | Human-uploaded reference files (playbooks, samples, templates) served to agents as MCP resources |
| Feature | Description |
|---|---|
| MCP gateway | Re-expose any third-party MCP server through the platform's auth, persona, and audit pipeline |
| API gateway | Proxy REST/HTTP APIs (Salesforce, Google, GitHub, Stripe) with four tools instead of one tool per endpoint |
| API catalogs | Versioned OpenAPI bundles shared across connections, with semantic endpoint ranking |
| REST invoke shim | Call gateway endpoints from NiFi, Airflow, or curl under the same auth and audit pipeline |
| Self-configuration | Admins manage personas, connections, and prompts by asking the agent instead of clicking |
| MCP Apps | Interactive UI panels rendered inline in the MCP host |
| Go library | Import the platform as a library: custom toolkits, providers, and middleware |
| Feature | Description |
|---|---|
| Authentication | Fail-closed model: OIDC (Keycloak, Auth0, Okta, Azure AD) and API keys for service accounts |
| OAuth 2.1 server | Built-in authorization server with PKCE and Dynamic Client Registration; Claude signs in through your IdP |
| Outbound OAuth | OAuth to upstream MCPs and APIs with encrypted refresh tokens that survive restarts |
| Personas | Role-mapped allow/deny tool and connection filtering, default-deny |
| Audit logging | Every tool call logged to PostgreSQL with identity, persona, sanitized parameters, and timing |
| Observability | Prometheus metrics and optional OpenTelemetry distributed tracing |
| Session externalization | PostgreSQL-backed sessions for zero-downtime restarts, horizontal scaling, and live tool-inventory updates |
| Multi-provider | Multiple instances of each service behind one endpoint, with isolated failure domains |
| Operating modes | Standalone (no database) or file + database with hot-reloaded config overrides |
A built-in web portal serves both operators and end users. Enable it by setting portal.enabled: true.
For operators: dashboards with activity timelines and performance percentiles, a searchable audit log, an interactive tool explorer with per-persona visibility and inline test runs, knowledge insight governance, connection and persona management, API keys, and indexing health. See the Admin Portal guide.
For users: AI-generated artifacts (reports, charts, documents) are saved from any session with the save_artifact tool, organized into shareable collections, and shared with teammates or through public links. A prompt library, feedback threads on any artifact, and personal knowledge and activity views round out the User Portal.
Install (see all methods: Homebrew, Docker, source):
go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest

Create a minimal configuration. DataHub is the only required backend; ${VAR} references are expanded from the environment:
# platform.yaml
server:
  name: mcp-data-platform
  transport: stdio
semantic:
  provider: datahub
  instance: primary
toolkits:
  datahub:
    enabled: true
    instances:
      primary:
        url: "${DATAHUB_URL}"
        token: "${DATAHUB_TOKEN}"
    default: primary

Wire it to Claude Code:
claude mcp add data-platform \
  -e DATAHUB_URL=https://datahub.example.com/api/graphql \
  -e DATAHUB_TOKEN=$TOKEN \
  -- mcp-data-platform --config platform.yaml

For a hosted deployment, run the server with --transport http and enable the built-in OAuth 2.1 server so Claude and other MCP clients sign in through your identity provider. See Configuration, Deployment (Docker Compose, Kubernetes), and the OAuth 2.1 Server guide.
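To make the ${VAR} references in platform.yaml concrete, here is a minimal Python sketch of environment expansion. It is an illustration, not the platform's own code: `expand_env` is a hypothetical helper, and expanding unset variables to an empty string is an assumption of this sketch rather than documented behavior.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env=None):
    """Replace each ${NAME} in text with its value from env.

    Unset variables expand to an empty string here; that choice is an
    assumption of this sketch.
    """
    values = os.environ if env is None else env
    return _VAR_PATTERN.sub(lambda m: values.get(m.group(1), ""), text)


config_line = 'url: "${DATAHUB_URL}"'
expanded = expand_env(
    config_line, {"DATAHUB_URL": "https://datahub.example.com/api/graphql"}
)
print(expanded)  # url: "https://datahub.example.com/api/graphql"
```

Keeping secrets such as DATAHUB_TOKEN in the environment means the configuration file itself can be committed without credentials.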
The platform implements a fail-closed security model: missing or invalid credentials always result in denied access, never in a bypass of the check. Personas are default-deny, Trino and S3 support enforced read-only mode, and metadata is sanitized against prompt injection. See the Auth Overview and MCP Defense: A Case Study in AI Security for the architecture rationale.
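The fail-closed and default-deny ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the `Persona` class, the `is_tool_allowed` function, the use of glob patterns, and the tool names `trino_execute` and `s3_list_objects` are all hypothetical.

```python
import fnmatch
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Persona:
    """Hypothetical persona: glob patterns for allowed and denied tools."""
    name: str
    allow: List[str] = field(default_factory=list)
    deny: List[str] = field(default_factory=list)


def is_tool_allowed(persona: Optional[Persona], tool: str) -> bool:
    """Default-deny check: deny rules win, and no match means no access."""
    if persona is None:
        return False  # fail closed: an unresolved caller gets nothing
    if any(fnmatch.fnmatchcase(tool, pattern) for pattern in persona.deny):
        return False
    return any(fnmatch.fnmatchcase(tool, pattern) for pattern in persona.allow)


analyst = Persona("analyst", allow=["trino_*", "datahub_*"], deny=["trino_execute"])
print(is_tool_allowed(analyst, "trino_describe_table"))  # True
print(is_tool_allowed(analyst, "trino_execute"))         # False (explicit deny)
print(is_tool_allowed(analyst, "s3_list_objects"))       # False (no allow rule)
print(is_tool_allowed(None, "trino_describe_table"))     # False (fail closed)
```

The important property is that every path that is not an explicit allow returns False, so a misconfigured or missing persona reduces access instead of widening it.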
| Transport | Authentication | TLS |
|---|---|---|
| stdio | Not required (local execution) | N/A |
| HTTP | Required (Bearer token or API key) | Strongly recommended |
mcp-data-platform is the orchestration layer for a suite of open-source MCP servers that also run standalone:
- txn2/mcp-datahub: DataHub metadata (search, lineage, glossary, domains, tags, ownership)
- txn2/mcp-trino: Trino distributed SQL with configurable timeouts and row limits
- txn2/mcp-s3: S3 object storage (buckets, prefixes, objects, presigned URLs)
See Ecosystem for how they compose.
Full documentation lives at mcp-data-platform.txn2.com.
- Server Guide: architecture, configuration, deployment
- Cross-Enrichment: how automatic enrichment works
- Authentication: OIDC, API keys, OAuth 2.1
- Knowledge Capture and Memory: the agent knowledge loop
- Go Library: build custom MCP servers
- Tools API Reference: complete tool specifications
- Examples Gallery: real-world configurations
- Troubleshooting: common issues and debugging
go build -o mcp-data-platform ./cmd/mcp-data-platform   # build
go test -race ./...                                     # tests
make verify                                             # full CI-equivalent suite

Contributions for bug fixes, tests, and documentation are welcome. Please run make verify (formatting, race-detected tests, coverage, linting, security scanning) before opening a pull request.
Open source by Craig Johnston, sponsored by Deasil Works, Inc. and Plexara

