The End of Data Chaos
The problem
Jay Kreps is the founder and CEO of Confluent and one of the original authors of the open source Apache Kafka project. The problem Kreps attempts to resolve is one many are familiar with: data is not up to date, connections are brittle, and it is a constant struggle to utilize data to its fullest extent, because there is no cohesive structure organizing that data across the different areas of a business. One system is in place from the beginning, then new connections, new wires, are bolted on, resulting in a smorgasbord of disorganization.
Many current offerings solve only the analytical side of data but not the operational side, or vice versa. Yet businesses need to be able to access all of their data to the fullest extent, especially when many are planning to implement AI. Any AI they develop can only be as good as the data it is trained on.
Responsibilities, technologies, and processes
Others have attempted to rein in the chaos by bridging the gap between the operational estate, where the applications that run the business live, and the analytical estate, where data is analyzed and the financial side of the business is evaluated, through other means such as data lake technologies. Kreps, however, is nothing if not ambitious in tackling the problem of messy data systems and infrastructures. He has proposed a universal data product as a solution, one that uses data streaming to let businesses access and filter data in real time (which, by Kreps’ definition, simply means very fast) across both operational and analytical estates, with low latency as the standard.
According to Confluent, a data product is a reliable data set, purpose-built to be shared with and reused by other teams and services. It is a formalization of responsibilities, technologies, and processes that allows users to easily access the data they require.
Confluent’s offering is a three-part, end-to-end data streaming solution, a three-pronged attack on data chaos: the Confluent Data Streaming Platform, which provides the core streaming functionality; Apache Flink, which provides real-time processing; and Apache Iceberg, which allows the results to be shared and visualized in table formats. Architectures for AI applications span both the operational and analytical worlds and require batch processing, and streaming is a generalization of batch processing, a superset of it.
Data streaming platform
The Confluent Data Streaming Platform is a cloud-native, serverless offering built around Kora, the new Kafka engine for the cloud, which provides infinite storage and is reportedly sixteen times faster than the original Apache Kafka. Apache Kafka itself is a distributed publish-subscribe (pub/sub) system: producers send messages, real-time data movement, into the data streaming platform, and any downstream consumers that want to work with that data can access it directly, in a one-to-many fashion. The Confluent platform builds on top of Apache Kafka and delivers data streaming capabilities while reducing the need for businesses to manage the underlying system themselves, eliminating the overhead, the expense, and the other challenges of running open source software.
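To make the pub/sub model concrete, here is a minimal sketch of a producer using the standard Kafka Java client. The broker address and the "clicks" topic are hypothetical stand-ins; a Confluent Cloud cluster would use its own bootstrap endpoint plus SASL credentials.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical local broker; replace with your cluster's endpoint.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the "clicks" topic; any number of
            // downstream consumer groups can then read it independently,
            // which is the one-to-many fan-out described above.
            producer.send(new ProducerRecord<>("clicks", "user-42", "checkout"));
        }
    }
}
```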
SAP integration
A vital part of Confluent’s evolution is its SAP integration. As the SAP integration is a core offering for Confluent, it is built directly into the SAP console. Confluent allows SAP customers to access their SAP data and merge it in real time with all manner of other data sources, such as IoT data, data from marketing tools, and real-time click streams from the web, so that they can send it downstream in real time, as a full data product, to databases, data warehouses, data lakes, and AI/ML tools.
SAP customers gain the ability to access the Confluent data streaming platform and fully managed data streams directly from within SAP Datasphere. This means that when working with the Business Technology Platform (BTP), users who have access to S/4HANA, ECC, and other tools on the SAP side can additionally configure real-time writing of that data over to fully managed data streams on the Confluent side. This unlocks ERP data from within SAP and lets users move that data downstream to fuel applications and analytics with real-time data.
Flink
The second prong of the three-pronged attack, after the data streaming platform, is Apache Flink, an open source stream processing service that can also operate like a batch processing system. Confluent claims that, much like the Kora engine, its Flink offering is also sixteen times faster than the open source original, making it an add-on worth considering if speed is a priority. The product allows users to process data without writing a single line of code, making it easier to manage for staff with a less specialized skill set, although coding options are also available for those who are interested (see the sketch below). Flink also enables the processing of continuous streams of data with low latency and high throughput, with additional capabilities such as exactly-once processing semantics and support for multiple APIs, among others.
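For those who do opt to write code, the following is a minimal sketch of Flink’s DataStream API. The in-memory sample events and the filtering logic are purely illustrative; a production job would read from a Kafka topic instead of a fixed list.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ClickFilter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative events; a real pipeline would consume a Kafka source.
        env.fromElements("view:home", "click:buy", "view:cart", "click:help")
           // Keep only click events as they stream past.
           .filter(event -> event.startsWith("click:"))
           // Strip the prefix, leaving just the clicked item.
           .map(event -> event.substring("click:".length()))
           .print();

        env.execute("click-filter");
    }
}
```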
Iceberg
The final prong, after Flink, is Apache Iceberg. Iceberg is an open source project and one of the standard open table formats for the full ecosystem of analytics tools, such as Snowflake. It allows users to access tables from data stored in cloud storage and enables broad sharing. Additionally, a large community already uses the Iceberg format, and that ecosystem will continue to grow in the years to come, meaning additional options and functionality will become available to Iceberg users, including from other vendors.
Additional standout characteristics include atomic transactions, where data is either fully committed or fully rolled back to prevent data corruption or loss; schema evolution, which allows column modification without disrupting existing data or queries; and time travel, which unfortunately does not involve time machines, but rather allows users to query data as it existed at a specific point in time.
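As a rough illustration of schema evolution and time travel, here is a sketch using Iceberg’s Java API. The Hadoop catalog, warehouse path, and "demo.events" table are assumptions made for the example; any Iceberg catalog would behave similarly.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class IcebergFeaturesDemo {
    public static void main(String[] args) {
        // Hypothetical local warehouse; cloud storage works the same way.
        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("demo", "events"));

        // Schema evolution: add a column without rewriting existing files.
        // The commit is atomic; readers see the old or new schema, never a mix.
        table.updateSchema()
             .addColumn("source_system", Types.StringType.get())
             .commit();

        // Time travel: every commit produces a snapshot, and queries can
        // target any snapshot ID to see the table as it was at that moment.
        table.snapshots().forEach(s ->
            System.out.println(s.snapshotId() + " @ " + s.timestampMillis()));
    }
}
```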
The inclusion of Iceberg allows the offering to provide a unified system: streams of data in Kafka and shared tables in Iceberg. The exact same data held in the Kora cloud engine is made available as Iceberg tables. The data flow occurs in three phases. In phase one, data stored in the Kora engine flows into Iceberg. In the second phase, the flow becomes bidirectional, with the data available from both locations. In the third phase, all Iceberg data (tables and so on) is also available through Kafka.
Governance
Of course, when working with data, a company’s data governance policy is crucial. In the Confluent platform, governance is implemented, streamed in, right from the onset. Other factors at play in governance are stream quality, stream catalog, and stream lineage. Stream quality consists of data integrity, data rules, and contracts; it ensures that standards are in place for all data passing through the platform. Stream catalog and stream lineage let users view a visual representation of the data’s movements and how it has been transformed along its trajectory. The integrated and complete governance suite is part of the data product and is another way to solve the problem of data mess, by alleviating the security team’s workload.
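The article does not spell out the mechanics of those data contracts, but in the Kafka ecosystem they are commonly enforced through a schema registry. The sketch below, with a hypothetical registry URL, subject name, and Order schema, shows how such a contract might be registered using Confluent’s Java client.

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class ContractRegistration {
    public static void main(String[] args) throws Exception {
        // Hypothetical registry URL; 100 is the local schema cache size.
        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // An example contract for order events: producers whose records
        // violate this schema are rejected before bad data reaches
        // downstream consumers.
        String avro = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"order_id\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}";

        int id = client.register("orders-value", new AvroSchema(avro));
        System.out.println("Registered contract with schema id " + id);
    }
}
```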
The End of Data Chaos
For SAP users grappling with data chaos within Datasphere on BTP, this comprehensive three-part data streaming solution could be exactly the beacon of light they are looking for. Users could not only streamline their data but also alleviate the workload associated with managing complex data flows. By embracing this trifecta (Confluent’s Kafka-based data streaming platform, Flink’s stream processing capabilities, and Iceberg’s table format for data management), SAP users can gain greater control over their data and unlock newfound efficiencies, allowing them to redirect valuable time and resources toward enhancing other aspects of their operations. It is one of many interesting options available to SAP users, and E3 Magazine will be following Confluent’s future trajectory with data streaming solutions with great interest.