Stream Lineage

Overview

Stream Lineage renders an interactive, left-to-right graph of every producer, topic, processor (Managed Connect, Flink SQL, Tableflow), lake table, and consumer group running in your (business_unit, stage) context. It answers the question "who produces to and consumes from this topic, and what runs in between?" without leaving the self-service portal.

The graph is reconstructed on demand by querying Confluent Cloud directly: Managed Connect, Flink SQL v1, Tableflow, the Telemetry (Metrics) API, the Schema Registry, and the Kafka REST endpoint. The results are mapped to the OpenLineage object model. Nothing is persisted: every uncached page load rebuilds the graph from authoritative sources (a short cache window applies; see Caching).

Two surfaces share the same backend:

The full-lineage page at /ui/lineage — cluster-wide graph for the current context.
The Stream Lineage card on the topic page (/ui/topics/{bu}/{stage}/{topic}) — a 1-hop subgraph centred on the focused topic. It calls the same endpoint with an extra ?topic=<name> query param that pushes the topic filter server-side into the Metrics queries, so the response is small and fast.

Reference

Access

Stream Lineage is writer-gated: developers, BU admins, super-admins, and platform admins can view it. Viewers and unauthenticated users are denied. The gate applies to both the page (/ui/lineage) and the backing endpoint (GET /{business_unit}/{stage}/lineage); the sidebar link is also hidden for viewers so they normally never reach the page.

Why writer-gated?

On shared clusters, the lineage graph is cluster-wide — it returns every edge regardless of BU naming prefix. Restricting it to writer roles prevents read-only viewers from seeing other tenants' producers, consumer groups, connectors, and Flink statements.
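The gate described above amounts to an allow-list check on the caller's role. A minimal sketch, assuming hypothetical role identifiers (`developer`, `bu_admin`, `super_admin`, `platform_admin`); the portal's real role names and enforcement point may differ:

```python
from typing import Optional

# Hypothetical role identifiers; the portal's actual names may differ.
WRITER_ROLES = frozenset({"developer", "bu_admin", "super_admin", "platform_admin"})


def can_view_lineage(role: Optional[str]) -> bool:
    """Return True only for writer roles; viewers and anonymous callers are denied."""
    return role in WRITER_ROLES
```

The same predicate would need to guard both the page and the backing endpoint, since hiding the sidebar link alone does not block a direct request.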

How It Works

On every uncached page load the backend issues parallel async calls to seven Confluent sources and merges the results into a single OpenLineage event stream. The frontend then bootstraps a Cytoscape.js graph with a dagre layout.

When the request carries ?topic=<name> (the topic-page card always does), the dataflow source AND-filters every Metrics query on metric.topic so the API returns only rows involving that topic. Cluster-wide callers (/ui/lineage) omit the param and get the full graph.
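The difference between the two callers comes down to whether the `topic` query parameter is present. A minimal sketch of building the request URL for `GET /{business_unit}/{stage}/lineage`; the base URL and the helper name `lineage_url` are assumptions for illustration:

```python
from typing import Optional
from urllib.parse import quote, urlencode


def lineage_url(base_url: str, business_unit: str, stage: str,
                topic: Optional[str] = None) -> str:
    """Build the lineage endpoint URL; omit ?topic= for the cluster-wide graph."""
    path = "/".join([
        base_url.rstrip("/"),
        quote(business_unit, safe=""),
        quote(stage, safe=""),
        "lineage",
    ])
    if topic is None:
        return path  # cluster-wide caller, as on /ui/lineage
    return f"{path}?{urlencode({'topic': topic})}"  # topic-page card


# Hypothetical host and context names.
full_graph = lineage_url("https://portal.example", "retail", "dev")
topic_card = lineage_url("https://portal.example", "retail", "dev", topic="orders")
```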

Source	Confluent API	Contributes
`connect`	Managed Connect REST	Connector name + direction (source/sink) per `lcc-` id. Used to reclassify the matching `dataflow` producer / consumer-group edges into Connect source / Connect sink nodes; the topic bindings themselves come from telemetry, not the static connector config.
`flink`	Flink SQL v1 + Compute Pools (fcpm/v2)	RUNNING Flink statements → topic-to-topic edges (parsed from the SQL)
`tableflow`	Tableflow v1	Topic → Delta / Iceberg table sync edges
`dataflow`	Telemetry API `/v2/metrics/dataflow/query`	Producer & consumer-group edges + per-topic throughput facet. Three queries (`bytes_in`, `bytes_out`, `consumer_lag_offsets`) fired in parallel, matching exactly what the Confluent Cloud portal fires. Surfaces non-committing consumers (e.g. Spark Structured Streaming) that the documented `consumer_lag_offsets` metric never returns.
`dataflow_ksql`	Telemetry API (ksqlDB query metrics, gated on cluster presence)	ksqlDB query → topic edges (skipped silently when no ksqlDB cluster exists)
`schema_registry`	Schema Registry	Per-topic value & key schemas (facet enrichment)
`kafka_rest`	Kafka REST v3	Topic partition counts & replication factor (facet enrichment)

Each source runs independently. If one fails (auth, transport, 5xx), the others still contribute — see Partial Responses.
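The fan-out with per-source failure isolation can be sketched with `asyncio.gather(..., return_exceptions=True)`. This is an illustration of the pattern, not the portal's code: the `build_events` helper and the stub fetchers are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Fetcher = Callable[[], Awaitable[List[dict]]]


async def build_events(sources: Dict[str, Fetcher]) -> Tuple[List[dict], List[str]]:
    """Run every source concurrently; collect events and the names of failed sources."""
    names = list(sources)
    results = await asyncio.gather(
        *(sources[name]() for name in names), return_exceptions=True
    )
    events: List[dict] = []
    failed: List[str] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed.append(name)  # one failing source does not abort the others
        else:
            events.extend(result)
    return events, failed


async def _dataflow_stub() -> List[dict]:
    return [{"job": "orders-producer", "output": "orders"}]


async def _flink_stub() -> List[dict]:
    raise RuntimeError("compute pool listing returned 503")


events, failed = asyncio.run(
    build_events({"dataflow": _dataflow_stub, "flink": _flink_stub})
)
```

Here the `flink` stub raises, yet the `dataflow` events are still returned alongside the list of failed source names.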

Reading the Graph

Data flows left to right. Two node families render distinctly:

Dataset nodes

Kafka topics (pink), Delta tables and Iceberg tables (indigo). Topic nodes carry partition, replication, and schema facets when the enrichments succeed.

Job nodes

Anything that moves data, colour-coded by kind: producers and consumer groups (teal), bidirectional apps (cyan), Connect source connectors (amber), Connect sink connectors (orange), Flink statements (violet), ksqlDB queries (sky), Tableflow syncs (slate). Job nodes carry kind-specific facets (connector class, Flink SQL excerpt, table format, throughput, etc.).

Badge nodes (aggregated, currently disabled)

The graph supports collapsing groups of producers / consumer groups that share an identical topic-set into one PRODUCERS · N or CONSUMER GROUPS · N badge — the same rule the Confluent Cloud portal uses. Aggregation is intentionally turned off in the current build while its effect on real graphs is being evaluated; every producer and consumer group renders as an individual node. The toolbar's Expand all groups button is a no-op until aggregation is re-enabled.
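The grouping rule can be sketched as follows: clients whose topic sets are identical fall into one group, and a group of N members is rendered as a single badge. The function name, the edge representation, and the choice to leave single-member groups as individual nodes are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def collapse_by_topic_set(
    edges: Iterable[Tuple[str, str]], label: str = "PRODUCERS"
) -> List[Tuple[str, List[str]]]:
    """Return (node label, member client ids) pairs after grouping by topic set."""
    topics_by_client: Dict[str, Set[str]] = defaultdict(set)
    for client, topic in edges:
        topics_by_client[client].add(topic)

    members_by_topics: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for client, topics in topics_by_client.items():
        members_by_topics[frozenset(topics)].append(client)

    nodes: List[Tuple[str, List[str]]] = []
    for members in members_by_topics.values():
        members.sort()
        if len(members) == 1:
            nodes.append((members[0], members))  # assumption: singletons stay as-is
        else:
            nodes.append((f"{label} · {len(members)}", members))
    return nodes


nodes = collapse_by_topic_set(
    [("p1", "orders"), ("p2", "orders"), ("p3", "orders"), ("p3", "payments")]
)
```

In this example `p1` and `p2` both write only to `orders` and collapse into one badge, while `p3` writes to a different topic set and stays on its own.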

Action	Behaviour
Click a node	Opens a right side-panel listing every OpenLineage facet attached to that node.
Period selector	Switches the telemetry lookback window (presets from 10 minutes up to 24 hours). The window is sent as `?start=&end=` on the lineage fetch and drives the `dataflow` queries. The selection is not persisted — every page load resets to the surface's default: Last 10 minutes on `/ui/lineage`, Last hour on the topic card. Tight defaults keep both pages fast on entry; the Metrics API gets slow at long windows.
Search	Type-ahead search across node labels. Selecting a result centres and highlights the node.
Fit to screen	Re-runs the layout fit so the whole graph is visible in the canvas.
Hide internal resources	Hides Confluent-internal producers, consumer groups, and topics (e.g. `_confluent-*`) so the graph shows only user-owned flows.
Refresh	Cache-busts the browser fetch. The server-side 60 s cache still applies; see Caching.
Export PNG	Downloads the full graph (not just the viewport) at 2× scale.
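The Hide internal resources toggle is essentially a name filter. A minimal sketch, assuming the only internal prefix is `_confluent-` (the table gives it as an example, so the real list may be longer):

```python
from typing import Iterable, List

# Assumption: a single known prefix; the portal may match additional patterns.
INTERNAL_PREFIXES = ("_confluent-",)


def visible_nodes(names: Iterable[str], hide_internal: bool = True) -> List[str]:
    """Drop Confluent-internal resource names when the toggle is on."""
    if not hide_internal:
        return list(names)
    return [name for name in names if not name.startswith(INTERNAL_PREFIXES)]


shown = visible_nodes(["orders", "_confluent-command", "payments"])
```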

Caching

Lineage responses are cached server-side for 60 seconds via aiocache, keyed by (business_unit, stage, window.start, window.end, topic). Repeated loads within the window share a single build — important because a full graph build can fan out a dozen or more API calls against Confluent. Cluster-wide (topic=None) and topic-scoped responses occupy independent cache slots so they don't poison each other. Picking a different period in the toolbar produces a new cache entry; the default key rolls forward every minute as the window's end is snapped to the current minute.

The browser also gets Cache-Control: private, max-age=60. Newly created resources (topics, connectors, consumer groups) show up on the next build after the TTL expires.
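The cache key and the minute snapping described in this section can be sketched as below. The helper names and the ISO-8601 rendering of the window bounds are assumptions; only the key components (business_unit, stage, window start, window end, topic) come from the text.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def snap_to_minute(moment: datetime) -> datetime:
    """Round a timestamp down to the start of its minute."""
    return moment.replace(second=0, microsecond=0)


def cache_key(business_unit: str, stage: str, start: datetime, end: datetime,
              topic: Optional[str] = None) -> Tuple[str, str, str, str, Optional[str]]:
    """Build the tuple that identifies one cached lineage response."""
    return (business_unit, stage, start.isoformat(), end.isoformat(), topic)


start = datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc)
early = snap_to_minute(datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc))
late = snap_to_minute(datetime(2024, 1, 1, 12, 0, 55, tzinfo=timezone.utc))

# Two requests inside the same minute resolve to the same key.
same_slot = cache_key("retail", "dev", start, early) == cache_key("retail", "dev", start, late)
```

Because `topic` is part of the key, a cluster-wide request (`topic=None`) and a topic-scoped request never share an entry.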

Partial Responses

Source failures never return a 5xx. The endpoint always answers 200 OK with the events it could build, and adds two response headers when one or more sources failed:

X-Lineage-Partial: true
X-Lineage-Failed-Sources: dataflow,flink

A yellow banner above the graph enumerates the failed sources so you can tell at a glance whether you are looking at the full picture or a degraded view. The most common causes:

Failed source	Likely cause	Fix
`dataflow` / `dataflow_ksql`	Admin Cloud API key is missing the MetricsViewer role.	Grant `MetricsViewer` at org or env scope on the admin service account.
`flink`	Compute pool listing returned 5xx or Flink SQL v1 rejected the credential.	Verify the per-landing-zone `flink_<region>_client_id` / `_secret` entries in App Configuration.
`connect`	At least one connector's config could not be parsed.	Check application logs for the offending connector name; the entire source fails on a single parse error by design.
`schema_registry` / `kafka_rest`	Per-context credentials missing or expired.	Same credentials used by Topics & Schemas pages — verify those work first.

Flink not configured ≠ failure

A context with zero Flink compute pools, or pools whose region has no credentials in App Configuration, simply contributes no Flink edges. This is not a partial failure — no header is set, no banner appears.
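A client calling the endpoint directly can detect a degraded response from the two headers described in this section. A minimal sketch; the function name is hypothetical, and header lookup is made case-insensitive because HTTP header names are:

```python
from typing import List, Mapping


def failed_sources(headers: Mapping[str, str]) -> List[str]:
    """Return the failed source names, or an empty list for a complete response."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if lowered.get("x-lineage-partial", "").strip().lower() != "true":
        return []
    raw = lowered.get("x-lineage-failed-sources", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


partial = failed_sources(
    {"X-Lineage-Partial": "true", "X-Lineage-Failed-Sources": "dataflow,flink"}
)
complete = failed_sources({"Cache-Control": "private, max-age=60"})
```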

Limitations

Sources

No ksqlDB or self-managed Connect

EON does not run ksqlDB or self-managed Connect clusters today. Statements / connectors running outside Confluent Managed Connect & Cloud Flink will not appear.

Consumers

Spark Structured Streaming granularity

Non-committing consumers (e.g. Spark Structured Streaming, which uses assign() and an external checkpoint store) are surfaced via the dataflow bytes_out rows — one consumer-group node per StreamingQuery (queryRunId), with all Kafka sources in that query merged into a single node. Per-executor breakdown isn't shown; "Spark application" isn't a Confluent concept.

Flink

SQL parsing is regex-based

Complex Flink SQL patterns (CTEs, lateral joins, deeply nested subqueries) may not yield edges. The statement still renders as a job node — only its input / output detection is affected.
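To illustrate why keyword matching is fragile, here is a deliberately naive sketch. The patterns below are hypothetical and are not the portal's actual expressions; they only show the general shape of regex-based input / output detection.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for illustration only.
_SINK = re.compile(r"\bINSERT\s+INTO\s+`?([\w.\-]+)`?", re.IGNORECASE)
_SOURCE = re.compile(r"\b(?:FROM|JOIN)\s+`?([\w.\-]+)`?", re.IGNORECASE)


def extract_edges(sql: str) -> Tuple[List[str], List[str]]:
    """Return (inputs, outputs) found by naive keyword matching."""
    outputs = _SINK.findall(sql)
    inputs = [name for name in _SOURCE.findall(sql) if name not in outputs]
    return inputs, outputs


simple = extract_edges(
    "INSERT INTO orders_enriched SELECT * FROM orders "
    "JOIN customers ON orders.cid = customers.id"
)
with_cte = extract_edges(
    "INSERT INTO out_topic WITH recent AS (SELECT * FROM orders) "
    "SELECT * FROM recent"
)
```

The simple statement yields the two input topics and one output. In the CTE statement these patterns also pick up the alias `recent`, which is not a topic, so a matcher of this kind needs extra handling before such a name can be trusted as an edge.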

Metrics

Short default lookback

Producer and consumer edges come from the Telemetry API. Each surface defaults to a short window (Last 10 minutes on /ui/lineage, Last hour on the topic card) to keep page loads fast. Clients that haven't produced or consumed in that window won't show up; widen the window via the toolbar to look further back. Direct API calls without ?start=&end= fall back to the server-side default controlled by CONFLUENT_METRICS_LOOKBACK_MINUTES.
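The fallback for direct API calls can be sketched as below. Only the variable name `CONFLUENT_METRICS_LOOKBACK_MINUTES` comes from the text; the function, its parameters, and the `fallback_minutes` value used when the variable is unset are assumptions.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional, Tuple


def resolve_window(
    start: Optional[datetime],
    end: Optional[datetime],
    now: datetime,
    env: Mapping[str, str] = os.environ,
    fallback_minutes: int = 10,  # assumption: the real default is not documented here
) -> Tuple[datetime, datetime]:
    """Use the caller's window when both bounds are given, else the configured lookback."""
    if start is not None and end is not None:
        return start, end
    minutes = int(env.get("CONFLUENT_METRICS_LOOKBACK_MINUTES", fallback_minutes))
    return now - timedelta(minutes=minutes), now


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = resolve_window(None, None, now, env={"CONFLUENT_METRICS_LOOKBACK_MINUTES": "30"})
```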

Lineage

Topic-level only

Column / field-level lineage is not extracted. Schemas are attached as facets on topic nodes for inspection, but edges are always topic-to-job-to-topic.

Mobile

Desktop-only

Below 768 px the page shows an "open on desktop" placeholder. Large DAGs with badges are not usable on a phone screen.