txn2/mcp-data-platform


Documentation | Installation | Quick Start | Go Library

Your AI assistant can run SQL. But it doesn't know that cust_id contains PII, that the table was deprecated last month, or who to ask when something breaks.

mcp-data-platform fixes that. It is a single MCP server that connects AI assistants to your data infrastructure and enriches every response with business context from your semantic layer: query a table and get its meaning, owners, quality scores, and deprecation warnings in the same call.

It is a platform, not just a bridge. The same endpoint gives agents persistent memory and a governed path to write knowledge back to the catalog, proxies third-party MCP servers and REST APIs through one authentication, persona, and audit pipeline, and ships a web portal where AI-generated artifacts are saved, organized into collections, and shared with teammates.

The only required backend is DataHub as the semantic layer. Add Trino for SQL and S3 for object storage when you're ready. Learn why this stack.


Why

AI assistants are good at querying data, but they work blind. When an agent asks "What's in the orders table?", it gets column names and types. It doesn't know that customer_id is PII, that the table is deprecated in favor of orders_v2, that the quality score dropped last week, or who to contact when something looks wrong.

# Without mcp-data-platform
─────────────────────────────────────────────────────────────────────
User:      "Describe the orders table"
AI:        Queries Trino → gets columns and types
User:      "Who owns this data?"
AI:        Queries DataHub → finds owners
User:      "Is this table still active?"
AI:        Queries DataHub again → finds deprecation status
User:      "What does customer_id actually mean?"
AI:        Queries DataHub again → finds column descriptions
─────────────────────────────────────────────────────────────────────
4 round trips. Context scattered across conversations. Easy to miss warnings.

# With mcp-data-platform
─────────────────────────────────────────────────────────────────────
User:      "Describe the orders table"
AI:        Gets everything in one response:
           → Schema: columns and types
           → ⚠️ DEPRECATED: Use orders_v2 instead
           → Owners: Data Platform Team
           → Tags: pii, financial
           → Quality Score: 87%
           → Column meanings and business definitions
─────────────────────────────────────────────────────────────────────
1 call. Complete context. Warnings front and center.

How It Works

sequenceDiagram
    participant AI as AI Assistant
    participant P as mcp-data-platform
    participant T as Trino
    participant D as DataHub

    AI->>P: trino_describe_table "orders"
    P->>T: DESCRIBE orders
    T-->>P: columns, types
    P->>D: Get semantic context
    D-->>P: description, owners, tags, quality, deprecation
    P-->>AI: Schema + Full Business Context

The platform intercepts tool responses at the protocol level and enriches them with context from the other services. This cross-enrichment is bidirectional:

  • Trino → DataHub: query results include owners, tags, glossary terms, deprecation warnings, quality scores
  • DataHub → Trino: search results include query availability and sample SQL
  • S3 ↔ DataHub: object listings include matching dataset metadata, and dataset searches show storage availability
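
The interception step can be pictured as a wrapper around a tool handler. The following is a minimal Go sketch under stated assumptions, not the platform's actual middleware: `ToolHandler`, `SemanticContext`, `SemanticProvider`, `mapProvider`, and `WithEnrichment` are hypothetical names invented for illustration, and the response is modelled as plain text rather than an MCP result.

```go
package main

import (
	"fmt"
	"strings"
)

// ToolHandler is a hypothetical stand-in for an MCP tool handler:
// it takes a table name and returns the raw tool response text.
type ToolHandler func(table string) (string, error)

// SemanticContext is an assumed shape for catalog metadata.
type SemanticContext struct {
	Owners     []string
	Tags       []string
	Deprecated bool
	ReplacedBy string
}

// SemanticProvider is a hypothetical lookup interface (DataHub in this README).
type SemanticProvider interface {
	Lookup(table string) (SemanticContext, bool)
}

// mapProvider is an in-memory provider used only for this sketch.
type mapProvider map[string]SemanticContext

func (m mapProvider) Lookup(table string) (SemanticContext, bool) {
	ctx, ok := m[table]
	return ctx, ok
}

// WithEnrichment wraps a handler and appends semantic context to its response.
func WithEnrichment(next ToolHandler, sem SemanticProvider) ToolHandler {
	return func(table string) (string, error) {
		out, err := next(table)
		if err != nil {
			return "", err
		}
		ctx, ok := sem.Lookup(table)
		if !ok {
			return out, nil // no metadata: return the raw response unchanged
		}
		var b strings.Builder
		b.WriteString(out)
		if ctx.Deprecated {
			// Deprecation goes first so the warning is hard to miss.
			fmt.Fprintf(&b, "\nDEPRECATED: use %s instead", ctx.ReplacedBy)
		}
		fmt.Fprintf(&b, "\nOwners: %s", strings.Join(ctx.Owners, ", "))
		fmt.Fprintf(&b, "\nTags: %s", strings.Join(ctx.Tags, ", "))
		return b.String(), nil
	}
}

func main() {
	// A fake describe handler standing in for the Trino call.
	describe := func(table string) (string, error) {
		return "Schema: order_id BIGINT, customer_id BIGINT", nil
	}
	sem := mapProvider{"orders": {
		Owners:     []string{"Data Platform Team"},
		Tags:       []string{"pii", "financial"},
		Deprecated: true,
		ReplacedBy: "orders_v2",
	}}
	out, _ := WithEnrichment(describe, sem)("orders")
	fmt.Println(out)
}
```

Because the wrapper has the same signature as the handler it wraps, the same pattern could be applied in the other directions listed above, for example decorating catalog search results with query availability.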

Features

Each feature links to its full documentation.

Semantic data access

| Feature | Description |
|---|---|
| Cross-enrichment | Business context added to every tool response automatically, with session dedup to save tokens |
| Lineage inheritance | Column descriptions inherited from upstream datasets via DataHub lineage |
| Universal search | One search tool fans a query across the catalog, knowledge pages, memory, insights, assets, prompts, and APIs; fetch dereferences any result |
| Workflow gating | Session-aware guidance that steers agents to discovery before SQL, with escalating warnings |
| Tools | Full tool reference for Trino, DataHub, S3, knowledge, memory, portal, and gateway toolkits |

Knowledge and memory

| Feature | Description |
|---|---|
| Memory layer | Persistent agent memory across sessions, PostgreSQL + pgvector, hybrid semantic/lexical recall |
| Knowledge capture | Agents record domain insights during sessions; approved knowledge is written back to DataHub or canonical knowledge pages |
| Governance workflow | Human-in-the-loop review, approve/reject, changeset tracking, and rollback for every applied change |
| Managed resources | Human-uploaded reference files (playbooks, samples, templates) served to agents as MCP resources |

Gateways and extensibility

| Feature | Description |
|---|---|
| MCP gateway | Re-expose any third-party MCP server through the platform's auth, persona, and audit pipeline |
| API gateway | Proxy REST/HTTP APIs (Salesforce, Google, GitHub, Stripe) with four tools instead of one tool per endpoint |
| API catalogs | Versioned OpenAPI bundles shared across connections, with semantic endpoint ranking |
| REST invoke shim | Call gateway endpoints from NiFi, Airflow, or curl under the same auth and audit pipeline |
| Self-configuration | Admins manage personas, connections, and prompts by asking the agent instead of clicking |
| MCP Apps | Interactive UI panels rendered inline in the MCP host |
| Go library | Import the platform as a library: custom toolkits, providers, and middleware |

Security and operations

| Feature | Description |
|---|---|
| Authentication | Fail-closed model: OIDC (Keycloak, Auth0, Okta, Azure AD) and API keys for service accounts |
| OAuth 2.1 server | Built-in authorization server with PKCE and Dynamic Client Registration; Claude signs in through your IdP |
| Outbound OAuth | OAuth to upstream MCPs and APIs with encrypted refresh tokens that survive restarts |
| Personas | Role-mapped allow/deny tool and connection filtering, default-deny |
| Audit logging | Every tool call logged to PostgreSQL with identity, persona, sanitized parameters, and timing |
| Observability | Prometheus metrics and optional OpenTelemetry distributed tracing |
| Session externalization | PostgreSQL-backed sessions for zero-downtime restarts, horizontal scaling, and live tool-inventory updates |
| Multi-provider | Multiple instances of each service behind one endpoint, with isolated failure domains |
| Operating modes | Standalone (no database) or file + database with hot-reloaded config overrides |
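
To make the audit logging row concrete, here is a rough Go sketch of recording a tool call with identity, persona, sanitized parameters, and timing. Everything in it is an assumption made for illustration: the `AuditEvent` fields, the `sensitiveKeys` list, and the `sanitize` and `Audit` helpers are hypothetical, and the event is returned to the caller rather than written to PostgreSQL.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// AuditEvent mirrors the fields named in the table above; the exact schema is assumed.
type AuditEvent struct {
	User     string
	Persona  string
	Tool     string
	Params   map[string]string
	Duration time.Duration
}

// sensitiveKeys is a hypothetical list of parameter-name fragments to redact.
var sensitiveKeys = []string{"token", "password", "secret", "api_key"}

// sanitize returns a copy of params with sensitive values redacted.
// The input map is never modified.
func sanitize(params map[string]string) map[string]string {
	out := make(map[string]string, len(params))
	for k, v := range params {
		out[k] = v
		lower := strings.ToLower(k)
		for _, s := range sensitiveKeys {
			if strings.Contains(lower, s) {
				out[k] = "[REDACTED]"
				break
			}
		}
	}
	return out
}

// Audit runs call, times it, and returns the event a real system would persist.
func Audit(user, persona, tool string, params map[string]string, call func() error) (AuditEvent, error) {
	start := time.Now()
	err := call()
	return AuditEvent{
		User:     user,
		Persona:  persona,
		Tool:     tool,
		Params:   sanitize(params),
		Duration: time.Since(start),
	}, err
}

func main() {
	params := map[string]string{"table": "orders", "api_token": "abc123"}
	ev, err := Audit("alice", "analyst", "trino_describe_table", params, func() error { return nil })
	fmt.Println(ev.Tool, ev.Params["api_token"], ev.Duration, err)
}
```

Redacting by parameter name is only one possible policy; the sketch sanitizes a copy of the parameters so the tool call itself still receives the original values.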

The Portal

A built-in web portal serves both operators and end users. Enable it with portal.enabled: true.

For operators: dashboards with activity timelines and performance percentiles, a searchable audit log, an interactive tool explorer with per-persona visibility and inline test runs, knowledge insight governance, connection and persona management, API keys, and indexing health. See the Admin Portal guide.


For users: AI-generated artifacts (reports, charts, documents) are saved from any session with the save_artifact tool, organized into shareable collections, and shared with teammates or through public links. A prompt library, feedback threads on any artifact, and personal knowledge and activity views round out the User Portal.


Quick Start

Install (see all methods: Homebrew, Docker, source):

go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest

Create a minimal configuration. DataHub is the only required backend; ${VAR} references are expanded from the environment:

# platform.yaml
server:
  name: mcp-data-platform
  transport: stdio

semantic:
  provider: datahub
  instance: primary

toolkits:
  datahub:
    enabled: true
    instances:
      primary:
        url: "${DATAHUB_URL}"
        token: "${DATAHUB_TOKEN}"
    default: primary
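
For readers curious how `${VAR}` expansion of this kind can work, the sketch below uses Go's standard `os.Expand`. It is an illustration under assumptions, not the platform's loader: `expandConfig` is a hypothetical helper, and rejecting unset variables is a design choice made here, since the README does not say how the platform treats them.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// expandConfig replaces ${VAR} references in raw config text using lookup.
// It returns an error naming every variable that lookup could not resolve,
// so a missing secret is reported instead of silently becoming "".
func expandConfig(raw string, lookup func(string) (string, bool)) (string, error) {
	missing := map[string]struct{}{}
	out := os.Expand(raw, func(name string) string {
		v, ok := lookup(name)
		if !ok {
			missing[name] = struct{}{}
		}
		return v
	})
	if len(missing) > 0 {
		names := make([]string, 0, len(missing))
		for n := range missing {
			names = append(names, n)
		}
		sort.Strings(names) // stable error message
		return "", fmt.Errorf("unset variables: %s", strings.Join(names, ", "))
	}
	return out, nil
}

func main() {
	raw := "url: \"${DATAHUB_URL}\"\ntoken: \"${DATAHUB_TOKEN}\"\n"
	// os.LookupEnv distinguishes an unset variable from an empty one.
	out, err := expandConfig(raw, os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Print(out)
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the process environment. Note that `os.Expand` also expands the bare `$VAR` form, which may or may not match the platform's behavior.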

Wire it to Claude Code:

claude mcp add data-platform \
  -e DATAHUB_URL=https://datahub.example.com/api/graphql \
  -e DATAHUB_TOKEN=$TOKEN \
  -- mcp-data-platform --config platform.yaml

For a hosted deployment, run with --transport http and enable the built-in OAuth 2.1 server so Claude and other MCP clients sign in through your identity provider. See Configuration, Deployment (Docker Compose, Kubernetes), and the OAuth 2.1 Server guide.

Security

The platform implements a fail-closed security model: missing or invalid credentials always result in denial, never a bypass. Personas are default-deny, Trino and S3 support enforced read-only mode, and metadata is sanitized against prompt injection. See the Auth Overview and MCP Defense: A Case Study in AI Security for the architecture rationale.

| Transport | Authentication | TLS |
|---|---|---|
| stdio | Not required (local execution) | N/A |
| HTTP | Required (Bearer token or API key) | Strongly recommended |
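
The fail-closed and default-deny rules can be sketched as a small authorization check. This Go example is an assumption-laden illustration, not the platform's implementation: the `Persona` struct, the glob patterns matched with `path.Match`, the `Authorize` function, and every tool name other than `trino_describe_table` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"path"
)

// Persona is a hypothetical allow/deny rule set; patterns are path.Match globs.
type Persona struct {
	Allow []string
	Deny  []string
}

// ErrDenied is returned for every refusal so callers cannot tell why access failed.
var ErrDenied = errors.New("access denied")

// matchesAny reports whether tool matches any pattern.
// A malformed pattern is returned as an error so the caller can fail closed.
func matchesAny(patterns []string, tool string) (bool, error) {
	for _, p := range patterns {
		ok, err := path.Match(p, tool)
		if err != nil {
			return false, err
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

// Authorize permits a tool call only when the role maps to a persona,
// no deny rule matches, and at least one allow rule matches.
func Authorize(personas map[string]Persona, role, tool string) error {
	p, ok := personas[role]
	if !ok {
		return ErrDenied // missing or unknown role: fail closed
	}
	denied, err := matchesAny(p.Deny, tool)
	if err != nil || denied {
		return ErrDenied // explicit deny wins; bad patterns also deny
	}
	allowed, err := matchesAny(p.Allow, tool)
	if err != nil || !allowed {
		return ErrDenied // default-deny: nothing is allowed implicitly
	}
	return nil
}

func main() {
	personas := map[string]Persona{
		// "trino_execute" is an invented tool name used to show a deny rule.
		"analyst": {Allow: []string{"trino_*", "datahub_*"}, Deny: []string{"trino_execute"}},
	}
	for _, tool := range []string{"trino_describe_table", "trino_execute", "s3_list_objects"} {
		fmt.Printf("%s: %v\n", tool, Authorize(personas, "analyst", tool))
	}
}
```

Every branch that is not an explicit allow returns the same error, which is the essence of fail-closed: configuration mistakes and missing identities reduce access rather than widen it.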

Ecosystem

mcp-data-platform is the orchestration layer for a suite of open-source MCP servers that also run standalone:

  • txn2/mcp-datahub: DataHub metadata: search, lineage, glossary, domains, tags, ownership
  • txn2/mcp-trino: Trino distributed SQL with configurable timeouts and row limits
  • txn2/mcp-s3: S3 object storage: buckets, prefixes, objects, presigned URLs

See Ecosystem for how they compose.

Documentation

Full documentation lives at mcp-data-platform.txn2.com.

Development

go build -o mcp-data-platform ./cmd/mcp-data-platform   # build
go test -race ./...                                     # tests
make verify                                             # full CI-equivalent suite

Contributions for bug fixes, tests, and documentation are welcome. Please run make verify (formatting, race-detected tests, coverage, linting, security scanning) before opening a pull request.

License

Apache License 2.0


Open source by Craig Johnston, sponsored by Deasil Works, Inc. and Plexara
