Evolving to Streamhouse: Apache Fluss × Iceberg
Streaming-Native Lakehouse
Yesterday, we hosted a webinar titled “Evolving to Streamhouse: Apache Fluss × Iceberg”, bringing together contributors and practitioners from across the lakehouse and streaming ecosystems.
The idea is simple but ambitious:
How do we evolve today’s lakehouse architectures to support real-time, sub-second analytics, without duplicating data, pipelines, or operational complexity?
In this post, I’ll summarize the key ideas, architectural insights, and future directions we discussed.
🎥 You can find the full webinar recording here.
The Lakehouse Today: Powerful, but Not Real-Time Native
Apache Iceberg has become a cornerstone of modern data architectures. It enables open, vendor-neutral lakehouses with strong guarantees around schema evolution, snapshot isolation, and time travel, all on inexpensive object storage.
For analytical workloads, this model works extremely well. However, Iceberg tables are fundamentally optimized for cold storage and batch-style access patterns. Even with frequent commits, most real-world deployments still operate at minute-level freshness.
As organizations increasingly rely on real-time fraud detection, operational monitoring, personalization, and AI-driven decision-making, this latency gap becomes problematic. When the lakehouse cannot serve fresh data fast enough, teams introduce additional systems—streaming platforms, real-time OLAP databases, custom pipelines—to compensate.
This is how many platforms drift into modern Lambda architectures, with duplicated data paths and inconsistent results across systems.
Iceberg’s Evolution: Shaped by Production Pain
A key theme of the webinar was that Iceberg’s evolution is not theoretical—it is shaped by real production problems surfaced by the community.
With Iceberg Spec v3, several changes directly address incremental and streaming-adjacent workloads. One of the most important is row lineage, which gives each row a persistent identity across snapshots. This allows engines like Flink to detect exactly what changed between snapshots instead of rescanning entire partitions or tables.
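The idea behind row lineage can be sketched as a toy diff over two snapshots. This is a conceptual model only, not Iceberg's metadata format: each row carries a stable row id, so an engine can compute inserts, updates, and deletes by identity instead of rescanning and comparing values partition by partition.

```python
# Illustrative model of row lineage: every row carries a stable row id,
# so two snapshots can be diffed by identity. Conceptual sketch only;
# real engines derive this from Iceberg metadata, not in-memory dicts.

def diff_snapshots(old: dict, new: dict):
    """Each snapshot maps row_id -> row payload."""
    inserted = {rid: row for rid, row in new.items() if rid not in old}
    deleted = {rid: row for rid, row in old.items() if rid not in new}
    updated = {rid: new[rid] for rid in old.keys() & new.keys()
               if old[rid] != new[rid]}
    return inserted, updated, deleted

snap_1 = {1: {"user": "a", "score": 10}, 2: {"user": "b", "score": 20}}
snap_2 = {1: {"user": "a", "score": 15}, 3: {"user": "c", "score": 30}}

ins, upd, dels = diff_snapshots(snap_1, snap_2)
# row 3 was inserted, row 1 was updated, row 2 was deleted
```

Because identity survives across snapshots, the update to row 1 is recognized as an update rather than a delete plus an unrelated insert.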
Another major improvement is the introduction of deletion vectors, which significantly reduce the cost of updates and deletes. Instead of rewriting entire Parquet files, Iceberg can mark deleted rows using compact bitmaps that are applied at read time.
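The read-time masking can be illustrated with a minimal sketch. The class below is hypothetical; real Iceberg deletion vectors are compressed bitmaps stored alongside file metadata, but the principle is the same: deletes mutate a small side structure, never the data file itself.

```python
# Conceptual sketch of a deletion vector: a per-file set of deleted row
# positions, applied while reading, so the underlying file is never
# rewritten. (Iceberg uses compact bitmaps; this class is illustrative.)

class DataFile:
    def __init__(self, rows):
        self.rows = rows          # immutable row data, stands in for Parquet
        self.deleted = set()      # deletion vector: positions masked out

    def delete_position(self, pos):
        self.deleted.add(pos)     # cheap delete, no file rewrite

    def read(self):
        # The vector is applied at read time: deleted positions are skipped.
        return [row for pos, row in enumerate(self.rows)
                if pos not in self.deleted]

f = DataFile(["row-0", "row-1", "row-2", "row-3"])
f.delete_position(1)
f.delete_position(3)
print(f.read())  # rows at positions 1 and 3 are masked out
```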
Iceberg v3 also introduces native support for semi-structured data through the variant type, along with higher-resolution timestamps and geospatial types. These features reflect the reality that modern data is often event-driven, evolving, and not strictly tabular.
Looking ahead, Iceberg v4 proposals focus on reducing metadata overhead, improving commit scalability, and making metadata more portable. All of this moves Iceberg closer to supporting higher write rates and more dynamic workloads—but it still does not solve real-time access on its own.
Streamhouse: A Streaming-Native Lakehouse Architecture
This is where the concept of Streamhouse enters.
Streamhouse is best understood as a lakehouse with a native real-time layer. Instead of stitching together separate streaming and batch systems, Streamhouse proposes a unified architecture with a single write path and a single read path.
Apache Fluss is designed to be the real-time storage layer in this model.
Recent data lives in Fluss, backed by fast storage and optimized for sub-second reads and writes. Older data is automatically tiered into Iceberg, where it benefits from cheap object storage and mature analytical tooling. Importantly, both layers represent the same logical table.
From the application and query perspective, there is no “streaming table” versus “lakehouse table.” There is only a table.
Zero-ETL Tiering: Hot and Cold Without Pipelines
A core principle of Streamhouse is zero ETL.
In traditional architectures, moving data from streaming systems into Iceberg requires dedicated ingestion pipelines, compaction jobs, schedulers, and careful coordination to avoid conflicts. These pipelines are expensive to build and even more expensive to operate.
Fluss eliminates this complexity by embedding tiering directly into the storage layer. Data written to Fluss is continuously converted into Parquet and committed into Iceberg with exactly-once semantics. Compaction happens asynchronously within the same workflow, avoiding the classic ingestion-versus-compaction conflicts.
Operationally, this means teams can enable real-time lakehouse behavior through configuration rather than custom pipelines.
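The hand-off can be sketched as a toy tiering step. Everything below is illustrative, assuming a simplified model where the committed offset and the new file are recorded together, which is what makes replaying a batch safe:

```python
# Toy sketch of built-in tiering: log records past the committed offset
# are batched into a "lake file" and the new offset is recorded in the
# same step, so a retried batch is covered by the offset and not re-tiered.
# Names and structure are illustrative, not the Fluss implementation.

def tier_once(log, lake_files, committed_offset, batch_size=2):
    """Move the next batch of log records into the lake layer."""
    batch = [(o, r) for o, r in sorted(log.items())
             if o >= committed_offset][:batch_size]
    if not batch:
        return committed_offset
    new_file = [r for _, r in batch]   # stands in for a Parquet file
    # Commit the file and the offset together (exactly-once hand-off).
    lake_files.append(new_file)
    return batch[-1][0] + 1

log = {0: "a", 1: "b", 2: "c"}
lake = []
off = tier_once(log, lake, committed_offset=0)
off = tier_once(log, lake, committed_offset=off)
# lake now holds [["a", "b"], ["c"]] and off == 3
```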
Union Read: One Table, Always Up to Date
A central promise of Streamhouse is that users never have to think about where data lives.
This is achieved through Union Read. When a query is executed against a Fluss-backed table, the engine transparently reads historical data from Iceberg and fresh data from Fluss, merging them into a single, consistent result set.
Fluss maintains precise boundaries between hot and cold data using offsets and snapshot coordination. This ensures there are no gaps, no duplicates, and no reordering. From the user’s perspective, a single SQL query always returns the latest state of the table.
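The boundary logic can be shown with a small sketch. The names here are illustrative, not the Fluss API: cold records cover offsets below the tiering boundary, and the hot log serves everything at or beyond it, so a record retained in both layers is read exactly once.

```python
# Toy model of a union read: records tiered into Iceberg cover offsets
# [0, tiered_up_to); the hot log serves offsets >= tiered_up_to.
# Illustrative only; real engines merge file scans and log reads.

def union_read(cold: list, hot_log: dict, tiered_up_to: int):
    """cold: records already in Iceberg (offsets 0..tiered_up_to-1).
    hot_log: offset -> record still held in the real-time layer."""
    fresh = [hot_log[o] for o in sorted(hot_log) if o >= tiered_up_to]
    return cold + fresh  # one consistent result: no gap, no duplicate

cold = ["e0", "e1", "e2"]             # offsets 0..2, already in Iceberg
hot = {2: "e2", 3: "e3", 4: "e4"}     # log still retains a tiered record
print(union_read(cold, hot, tiered_up_to=3))
# ['e0', 'e1', 'e2', 'e3', 'e4']
```

Note that offset 2 exists in both layers, yet appears once in the result: the boundary, not the physical location, decides which layer serves it.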
This dramatically simplifies analytics, feature engineering, and exploratory workflows, especially when freshness matters.
Deletion Vectors in Streamhouse: Making Updates Correct at Scale
Supporting real-time analytics is hard. Supporting updates and deletes across hot and cold storage is even harder.
This is where one of the most important technical contributions discussed in the webinar comes in: Fluss’s multi-layer deletion vector framework, introduced by Yuxia.
In a Streamhouse architecture, updates and deletes first arrive in the real-time layer. However, historical data may already exist in Iceberg. The system must ensure that deleted rows never reappear when data from both layers is merged.
Fluss solves this by managing deletion vectors across three distinct layers.
First, Iceberg deletion vectors handle rows that have already been persisted and marked as deleted in Iceberg snapshots. This is standard Iceberg behavior.
Second, Log deletion vectors track deletes and updates within Fluss’s real-time changelog. These apply only to data that still resides in the hot layer.
Third, and most importantly, Lake deletion vectors bridge the two worlds. When a delete arrives for a row that already exists in Iceberg, Fluss records metadata that marks the corresponding Iceberg row as logically deleted—even before a new Iceberg snapshot is written.
During a union read, the query engine applies all relevant deletion vectors. Rows deleted in Fluss are masked out from both the hot layer and the historical Iceberg data. Over time, these logical deletes are safely materialized into Iceberg itself.
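The three layers can be sketched as a toy merge over keyed rows. Real deletion vectors track row positions per file rather than primary keys, and all names below are hypothetical, but the masking order is the point:

```python
# Illustrative sketch of applying the three deletion-vector layers during
# a union read. Each DV is modeled as a set of deleted keys; the real
# framework operates on row positions per file.

def merged_read(cold_rows, hot_rows, iceberg_dv, log_dv, lake_dv):
    """cold_rows/hot_rows map key -> value; each DV is a set of keys."""
    # Iceberg DVs: rows already marked deleted in committed snapshots.
    # Lake DVs: Iceberg rows logically deleted by the hot layer before
    # any new Iceberg snapshot has materialized the delete.
    cold = {k: v for k, v in cold_rows.items()
            if k not in iceberg_dv and k not in lake_dv}
    # Log DVs: deletes and updates still inside the real-time changelog.
    hot = {k: v for k, v in hot_rows.items() if k not in log_dv}
    # The hot layer wins for keys present in both (upsert semantics).
    return {**cold, **hot}

cold = {"k1": "old-1", "k2": "old-2", "k3": "old-3"}
hot = {"k1": "new-1", "k4": "new-4"}
result = merged_read(cold, hot,
                     iceberg_dv={"k2"},  # deleted in an Iceberg snapshot
                     log_dv=set(),
                     lake_dv={"k3"})     # deleted in Fluss, not yet in Iceberg
# k2 and k3 are masked, k1 reflects the latest value from the hot layer
```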
This approach ensures correct upsert semantics across streaming and historical data without sacrificing performance or freshness. It is a foundational building block for primary-key tables in Streamhouse.
Flink CDC: The Front Door to the Streamhouse
No Streamhouse architecture is complete without robust, real-time ingestion. This is where Flink CDC, presented by Leonard, plays a critical role.
Flink CDC provides end-to-end streaming ingestion from transactional databases using change data capture. It handles initial snapshots and continuous change streams in a unified way, ensuring downstream systems see a consistent view of the source database.
One of Flink CDC’s most powerful features is schema evolution. When upstream schemas change—columns are added, modified, or removed—Flink CDC coordinates these changes safely. It flushes in-flight data, propagates schema updates downstream, and resumes streaming without data loss or corruption.
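The flush-then-evolve sequence can be sketched in a few lines. All names below are illustrative, assuming a minimal sink that tracks its schema; the point is the ordering: drain buffered records under the old schema first, then apply the change, then resume.

```python
# Minimal sketch of the flush-then-evolve protocol for in-flight schema
# changes. Class and function names are hypothetical, not the Flink CDC API.

class Sink:
    def __init__(self):
        self.rows, self.schema = [], ["id", "name"]

    def write_all(self, rows):
        self.rows.extend(rows)

    def apply_schema(self, change):
        if change[0] == "add_column":
            self.schema.append(change[1])

def handle_schema_change(buffer, sink, change):
    sink.write_all(buffer)     # flush in-flight data under the old schema
    buffer.clear()
    sink.apply_schema(change)  # propagate the evolution downstream
    # streaming resumes; new records arrive under the evolved schema

s = Sink()
buf = [{"id": 1, "name": "a"}]
handle_schema_change(buf, s, ("add_column", "email"))
# the buffered row is safely written before the schema gains "email"
```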
In practice, this means teams can synchronize entire databases into Fluss or Iceberg using declarative YAML pipelines, rather than writing custom streaming code. Flink CDC supports a wide range of sources and sinks, including Iceberg, Fluss, and other lake formats.
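Such a pipeline definition might look roughly like the fragment below. This loosely follows the Flink CDC 3.x YAML pipeline shape; connector names, options, and availability of the Fluss sink depend on the versions in use, and all hostnames and values are placeholders.

```yaml
# Hypothetical Flink CDC pipeline definition (illustrative values only).
source:
  type: mysql
  hostname: mysql.example.internal
  port: 3306
  username: cdc_user
  password: ${MYSQL_PASSWORD}
  tables: app_db.\.*        # snapshot + stream every table in app_db

sink:
  type: fluss               # assumes a Fluss pipeline connector is available
  bootstrap.servers: fluss-coordinator.example.internal:9123

pipeline:
  name: app_db-to-streamhouse
  parallelism: 2
```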
In a Streamhouse architecture, Flink CDC becomes the ingestion backbone: capturing operational data changes and feeding them directly into the unified hot-and-cold storage layer.
From Modern Lambda and Kappa to Streamhouse for Analytics
Taken together, these components enable a shift beyond modern Lambda and Kappa architectures toward a unified, streaming-first architecture for analytics: the Streamhouse.
Instead of maintaining separate systems for streaming and batch, Streamhouse provides:
one ingestion path
one storage abstraction
one query model
one source of truth
Real-time dashboards, historical analytics, feature engineering, and AI workloads all operate on the same logical tables, with consistent semantics and predictable behavior.
Why Streamhouse Matters Now
Real-time is no longer optional. AI agents, machine learning systems, and operational applications all depend on current, trustworthy data.
Streamhouse does not replace Iceberg. It extends it. By pairing Iceberg’s strengths in open, durable analytics with Fluss’s real-time storage and ingestion capabilities, Streamhouse offers a practical path toward continuous analytics at scale.


