txn2/mcp-data-platform: A semantic data platform MCP server that composes multiple data tools with bidirectional cross-injection. Tool responses automatically include critical context from other services.

Documentation | Installation | Quick Start | Go Library

Your AI assistant can run SQL. But it doesn't know that cust_id contains PII, that the table was deprecated last month, or who to ask when something breaks.

mcp-data-platform fixes that. It is a single MCP server that connects AI assistants to your data infrastructure and enriches every response with business context from your semantic layer: query a table and get its meaning, owners, quality scores, and deprecation warnings in the same call.

It is a platform, not just a bridge. The same endpoint gives agents persistent memory and a governed path to write knowledge back to the catalog, proxies third-party MCP servers and REST APIs through one authentication, persona, and audit pipeline, and ships a web portal where AI-generated artifacts are saved, organized into collections, and shared with teammates.

The only required backend is DataHub, which serves as the semantic layer. Add Trino for SQL and S3 for object storage when you're ready. The documentation explains why this stack was chosen.

Why

AI assistants are powerful at querying data, but they work blind. When an agent asks "What's in the orders table?", it gets column names and types. It doesn't know that customer_id is PII, that the table is deprecated in favor of orders_v2, that the quality score dropped last week, or who to contact when something looks wrong.

# Without mcp-data-platform
─────────────────────────────────────────────────────────────────────
User:      "Describe the orders table"
AI:        Queries Trino → gets columns and types
User:      "Who owns this data?"
AI:        Queries DataHub → finds owners
User:      "Is this table still active?"
AI:        Queries DataHub again → finds deprecation status
User:      "What does customer_id actually mean?"
AI:        Queries DataHub again → finds column descriptions
─────────────────────────────────────────────────────────────────────
4 round trips. Context scattered across conversations. Easy to miss warnings.

# With mcp-data-platform
─────────────────────────────────────────────────────────────────────
User:      "Describe the orders table"
AI:        Gets everything in one response:
           → Schema: columns and types
           → ⚠️ DEPRECATED: Use orders_v2 instead
           → Owners: Data Platform Team
           → Tags: pii, financial
           → Quality Score: 87%
           → Column meanings and business definitions
─────────────────────────────────────────────────────────────────────
1 call. Complete context. Warnings front and center.

How It Works

sequenceDiagram
    participant AI as AI Assistant
    participant P as mcp-data-platform
    participant T as Trino
    participant D as DataHub

    AI->>P: trino_describe_table "orders"
    P->>T: DESCRIBE orders
    T-->>P: columns, types
    P->>D: Get semantic context
    D-->>P: description, owners, tags, quality, deprecation
    P-->>AI: Schema + Full Business Context

The platform intercepts tool responses at the protocol level and enriches them with context from the other services. This cross-enrichment is bidirectional:

Trino → DataHub: query results include owners, tags, glossary terms, deprecation warnings, quality scores
DataHub → Trino: search results include query availability and sample SQL
S3 ↔ DataHub: object listings include matching dataset metadata, and dataset searches show storage availability
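
The enrichment step can be pictured as a wrapper around a tool handler. The sketch below is illustrative only and is not the platform's actual middleware: `ToolHandler`, `ContextLookup`, `SemanticContext`, and `Enrich` are hypothetical names, and the lookup function stands in for a real semantic-layer query.

```go
package main

import (
	"fmt"
	"strings"
)

// SemanticContext is a hypothetical shape for catalog metadata.
type SemanticContext struct {
	Owners     []string
	Tags       []string
	Deprecated bool
	ReplacedBy string
}

// ToolHandler is a hypothetical handler signature: table name in, response text out.
type ToolHandler func(table string) (string, error)

// ContextLookup stands in for a semantic-layer query (for example, against DataHub).
type ContextLookup func(table string) (SemanticContext, bool)

// Enrich wraps a handler so every successful response carries semantic context.
func Enrich(next ToolHandler, lookup ContextLookup) ToolHandler {
	return func(table string) (string, error) {
		out, err := next(table)
		if err != nil {
			return "", err
		}
		ctx, ok := lookup(table)
		if !ok {
			return out, nil // no metadata found: pass the response through untouched
		}
		var b strings.Builder
		if ctx.Deprecated {
			// Warnings go first so they are hard to miss.
			fmt.Fprintf(&b, "DEPRECATED: use %s instead\n", ctx.ReplacedBy)
		}
		b.WriteString(out)
		fmt.Fprintf(&b, "\nOwners: %s", strings.Join(ctx.Owners, ", "))
		fmt.Fprintf(&b, "\nTags: %s", strings.Join(ctx.Tags, ", "))
		return b.String(), nil
	}
}

func main() {
	describe := func(table string) (string, error) {
		return "Schema: order_id bigint, customer_id bigint", nil
	}
	lookup := func(table string) (SemanticContext, bool) {
		if table != "orders" {
			return SemanticContext{}, false
		}
		return SemanticContext{
			Owners:     []string{"Data Platform Team"},
			Tags:       []string{"pii", "financial"},
			Deprecated: true,
			ReplacedBy: "orders_v2",
		}, true
	}
	out, _ := Enrich(describe, lookup)("orders")
	fmt.Println(out)
}
```

The wrapper leaves the underlying handler unaware of the semantic layer, which is the property that lets the same idea apply in either direction.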

Features

Each feature links to its full documentation.

Semantic data access

Feature	Description
Cross-enrichment	Business context added to every tool response automatically, with session dedup to save tokens
Lineage inheritance	Column descriptions inherited from upstream datasets via DataHub lineage
Universal search	One `search` tool fans a query across the catalog, knowledge pages, memory, insights, assets, prompts, and APIs; `fetch` dereferences any result
Workflow gating	Session-aware guidance that steers agents to discovery before SQL, with escalating warnings
Tools	Full tool reference for Trino, DataHub, S3, knowledge, memory, portal, and gateway toolkits
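
Session dedup can be sketched as a per-session record of which tables have already received full context. This is a minimal illustration under assumed names (`Deduper`, `FirstTime`, `Context` are hypothetical), not the platform's implementation, which may track different keys or expire entries.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers which (session, table) pairs already received full context.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]map[string]bool // session ID -> table -> already sent
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]map[string]bool)}
}

// FirstTime reports whether the table is new to the session and records it.
func (d *Deduper) FirstTime(session, table string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	tables, ok := d.seen[session]
	if !ok {
		tables = make(map[string]bool)
		d.seen[session] = tables
	}
	if tables[table] {
		return false
	}
	tables[table] = true
	return true
}

// Context returns the full text once per session and a short reference afterwards.
func (d *Deduper) Context(session, table, full string) string {
	if d.FirstTime(session, table) {
		return full
	}
	return fmt.Sprintf("(context for %s already provided in this session)", table)
}

func main() {
	d := NewDeduper()
	full := "Owners: Data Platform Team; Tags: pii, financial"
	fmt.Println(d.Context("s1", "orders", full)) // full context
	fmt.Println(d.Context("s1", "orders", full)) // short reference
	fmt.Println(d.Context("s2", "orders", full)) // full context again: different session
}
```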

Knowledge and memory

Feature	Description
Memory layer	Persistent agent memory across sessions, PostgreSQL + pgvector, hybrid semantic/lexical recall
Knowledge capture	Agents record domain insights during sessions; approved knowledge is written back to DataHub or canonical knowledge pages
Governance workflow	Human-in-the-loop review, approve/reject, changeset tracking, and rollback for every applied change
Managed resources	Human-uploaded reference files (playbooks, samples, templates) served to agents as MCP resources
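
Changeset tracking with rollback can be illustrated with a small in-memory model. Everything here is hypothetical (`Catalog`, `Changeset`, `Apply`, `Rollback`); the real workflow writes to DataHub or knowledge pages and goes through human review first.

```go
package main

import "fmt"

// Changeset records one applied edit so it can be undone later.
type Changeset struct {
	Key      string
	Previous string
	HadPrev  bool // false when the key did not exist before the change
	Applied  string
}

// Catalog stands in for a metadata store keyed by field path.
type Catalog map[string]string

// Apply writes the value and returns a changeset capturing the prior state.
func (c Catalog) Apply(key, value string) Changeset {
	prev, had := c[key]
	c[key] = value
	return Changeset{Key: key, Previous: prev, HadPrev: had, Applied: value}
}

// Rollback restores the prior state, refusing if the field changed in the meantime.
func (c Catalog) Rollback(cs Changeset) error {
	if cur := c[cs.Key]; cur != cs.Applied {
		return fmt.Errorf("rollback of %q refused: value changed since it was applied", cs.Key)
	}
	if cs.HadPrev {
		c[cs.Key] = cs.Previous
	} else {
		delete(c, cs.Key)
	}
	return nil
}

func main() {
	c := Catalog{"orders.customer_id": "Customer key"}
	cs := c.Apply("orders.customer_id", "Customer key (PII)")
	fmt.Println("after apply:", c["orders.customer_id"])
	if err := c.Rollback(cs); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("after rollback:", c["orders.customer_id"])
}
```

Refusing a rollback when the value has changed since the apply is one possible design choice; it avoids silently overwriting a later edit.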

Gateways and extensibility

Feature	Description
MCP gateway	Re-expose any third-party MCP server through the platform's auth, persona, and audit pipeline
API gateway	Proxy REST/HTTP APIs (Salesforce, Google, GitHub, Stripe) with four tools instead of one tool per endpoint
API catalogs	Versioned OpenAPI bundles shared across connections, with semantic endpoint ranking
REST invoke shim	Call gateway endpoints from NiFi, Airflow, or `curl` under the same auth and audit pipeline
Self-configuration	Admins manage personas, connections, and prompts by asking the agent instead of clicking
MCP Apps	Interactive UI panels rendered inline in the MCP host
Go library	Import the platform as a library: custom toolkits, providers, and middleware

Security and operations

Feature	Description
Authentication	Fail-closed model: OIDC (Keycloak, Auth0, Okta, Azure AD) and API keys for service accounts
OAuth 2.1 server	Built-in authorization server with PKCE and Dynamic Client Registration; Claude signs in through your IdP
Outbound OAuth	OAuth to upstream MCPs and APIs with encrypted refresh tokens that survive restarts
Personas	Role-mapped allow/deny tool and connection filtering, default-deny
Audit logging	Every tool call logged to PostgreSQL with identity, persona, sanitized parameters, and timing
Observability	Prometheus metrics and optional OpenTelemetry distributed tracing
Session externalization	PostgreSQL-backed sessions for zero-downtime restarts, horizontal scaling, and live tool-inventory updates
Multi-provider	Multiple instances of each service behind one endpoint, with isolated failure domains
Operating modes	Standalone (no database) or file + database with hot-reloaded config overrides
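
Default-deny persona filtering can be sketched with glob patterns over tool names. The `Persona` type and its fields are assumptions made for illustration, and `trino_execute` is a made-up tool name; the platform's actual persona configuration and matching rules may differ.

```go
package main

import (
	"fmt"
	"path"
)

// Persona holds hypothetical allow and deny glob patterns for tool names.
type Persona struct {
	Allow []string
	Deny  []string
}

// match reports whether any pattern matches; a malformed pattern is an error.
func match(patterns []string, tool string) (bool, error) {
	for _, p := range patterns {
		ok, err := path.Match(p, tool)
		if err != nil {
			return false, err
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

// Allowed applies deny first, then allow. Anything unmatched, or any
// pattern error, results in denial.
func (p Persona) Allowed(tool string) bool {
	denied, err := match(p.Deny, tool)
	if err != nil || denied {
		return false
	}
	allowed, err := match(p.Allow, tool)
	return err == nil && allowed
}

func main() {
	analyst := Persona{
		Allow: []string{"trino_*", "datahub_*"},
		Deny:  []string{"trino_execute"}, // hypothetical tool name
	}
	for _, tool := range []string{"trino_describe_table", "trino_execute", "s3_list_objects"} {
		fmt.Printf("%-22s allowed=%v\n", tool, analyst.Allowed(tool))
	}
}
```

A persona with no patterns at all denies every tool, which is what default-deny means in practice.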

The Portal

A built-in web portal serves both operators and end users. Enable with portal.enabled: true.


For operators: dashboards with activity timelines and performance percentiles, a searchable audit log, an interactive tool explorer with per-persona visibility and inline test runs, knowledge insight governance, connection and persona management, API keys, and indexing health. See the Admin Portal guide.

For users: AI-generated artifacts (reports, charts, documents) are saved from any session with the save_artifact tool, organized into shareable collections, and shared with teammates or through public links. A prompt library, feedback threads on any artifact, and personal knowledge and activity views round out the User Portal.

Quick Start

Install (see all methods: Homebrew, Docker, source):

go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest

Create a minimal configuration. DataHub is the only required backend; ${VAR} references are expanded from the environment:

# platform.yaml
server:
  name: mcp-data-platform
  transport: stdio

semantic:
  provider: datahub
  instance: primary

toolkits:
  datahub:
    enabled: true
    instances:
      primary:
        url: "${DATAHUB_URL}"
        token: "${DATAHUB_TOKEN}"
    default: primary
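
For readers curious how `${VAR}` expansion of this kind can work, here is a minimal sketch using Go's standard `os.Expand`. The `expandConfig` helper is hypothetical and the platform's own loader may behave differently, for example in how it treats unset variables. Note that `os.Expand` also expands the brace-less `$VAR` form.

```go
package main

import (
	"fmt"
	"os"
)

// expandConfig replaces ${VAR} references using lookup and reports
// the names that had no value, so callers can fail early.
func expandConfig(raw string, lookup func(string) (string, bool)) (string, []string) {
	var missing []string
	out := os.Expand(raw, func(name string) string {
		v, ok := lookup(name)
		if !ok {
			missing = append(missing, name)
		}
		return v
	})
	return out, missing
}

func main() {
	raw := "url: \"${DATAHUB_URL}\"\ntoken: \"${DATAHUB_TOKEN}\"\n"
	out, missing := expandConfig(raw, os.LookupEnv)
	fmt.Print(out)
	if len(missing) > 0 {
		fmt.Println("unset variables:", missing)
	}
}
```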

Wire it to Claude Code:

claude mcp add data-platform \
  -e DATAHUB_URL=https://datahub.example.com/api/graphql \
  -e DATAHUB_TOKEN=$TOKEN \
  -- mcp-data-platform --config platform.yaml

For a hosted deployment, run the server with --transport http and enable the built-in OAuth 2.1 server so Claude and other MCP clients sign in through your identity provider. See Configuration, Deployment (Docker Compose, Kubernetes), and the OAuth 2.1 Server guide.

Security

The platform implements a fail-closed security model: missing or invalid credentials always result in denied access, never a bypass. Personas are default-deny, Trino and S3 support enforced read-only mode, and metadata is sanitized against prompt injection. See the Auth Overview and MCP Defense: A Case Study in AI Security for the architecture rationale.

Transport	Authentication	TLS
stdio	Not required (local execution)	N/A
HTTP	Required (Bearer token or API key)	Strongly recommended
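
The fail-closed rule for the HTTP transport can be sketched as a `net/http` middleware that rejects a request unless a token is both present and valid. `Validator`, `RequireBearer`, `demoHandler`, and `statusFor` are hypothetical names for this illustration; the platform's real authentication chain (OIDC, API keys) is more involved.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Validator is a hypothetical token check; it reports whether the token is valid.
type Validator func(token string) bool

// RequireBearer denies any request whose bearer token is missing or invalid.
// There is no code path that reaches next without a successful validation.
func RequireBearer(validate Validator, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token, found := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !found || token == "" {
			http.Error(w, "missing credentials", http.StatusUnauthorized)
			return
		}
		if !validate(token) {
			http.Error(w, "invalid credentials", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoHandler builds a protected handler that accepts one hard-coded token.
func demoHandler() http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return RequireBearer(func(token string) bool { return token == "good-token" }, ok)
}

// statusFor sends one request with the given Authorization header value
// and returns the resulting status code.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := demoHandler()
	fmt.Println(statusFor(h, ""), statusFor(h, "Bearer bad"), statusFor(h, "Bearer good-token"))
}
```

Running the sketch prints `401 401 200`: the missing and the invalid token are both rejected, and only the valid one reaches the handler.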

Ecosystem

mcp-data-platform is the orchestration layer for a suite of open-source MCP servers that also run standalone:

txn2/mcp-datahub: DataHub metadata (search, lineage, glossary, domains, tags, ownership)
txn2/mcp-trino: Trino distributed SQL with configurable timeouts and row limits
txn2/mcp-s3: S3 object storage (buckets, prefixes, objects, presigned URLs)

See Ecosystem for how they compose.

Documentation

Full documentation lives at mcp-data-platform.txn2.com.

Server Guide: architecture, configuration, deployment
Cross-Enrichment: how automatic enrichment works
Authentication: OIDC, API keys, OAuth 2.1
Knowledge Capture and Memory: the agent knowledge loop
Go Library: build custom MCP servers
Tools API Reference: complete tool specifications
Examples Gallery: real-world configurations
Troubleshooting: common issues and debugging

Development

go build -o mcp-data-platform ./cmd/mcp-data-platform   # build
go test -race ./...                                     # tests
make verify                                             # full CI-equivalent suite

Contributions for bug fixes, tests, and documentation are welcome. Please run make verify (formatting, race-detected tests, coverage, linting, security scanning) before opening a pull request.

License

Apache License 2.0

Open source by Craig Johnston, sponsored by Deasil Works, Inc. and Plexara
