
The technology of a data lake

In a data warehouse, the data is stored in a relational database. That is expensive, and this is exactly where a number of products from the Big Data world come in. Parquet, Hive, SAP Vora, and Exasol are the best-known representatives in the SAP environment.
Werner Dähn, rtdi.io
January 9, 2020
This text has been automatically translated from German to English.

In general, I would divide the data storage options into three categories.

Files: The data is stored as plain files and used like tables. These files should carry information about their own structure and should ideally be indexed. The Parquet file format is a representative of this category.

Database process: Instead of working directly with the files, there is an active service on top that feels like a database. It takes care of caching frequently used data and can be queried via ODBC/JDBC. A typical representative of this type in the big data world is Apache Hive.

In-memory: For maximum performance, all data is held in memory and indexed, building something similar to Hana. Exasol and SAP Vora work according to this principle.
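As a toy analogy for two of these categories, the following sketch contrasts a plain file scanned like a table with an indexed in-memory database. It uses CSV and SQLite as stand-ins only; the real stack would be Parquet files and engines like Exasol or Vora.

```python
import csv
import os
import sqlite3
import tempfile

# Category 1 -- files: data lives in a plain file and is scanned like a table.
# (Real data lakes use columnar formats such as Parquet; CSV stands in here.)
rows = [("100001", "EUR", 250.0), ("100002", "USD", 90.0)]
path = os.path.join(tempfile.mkdtemp(), "orders.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "currency", "amount"])
    writer.writerows(rows)

# Every query is a full scan of the file: cheap storage, no index, no service.
with open(path, newline="") as f:
    eur_total = sum(float(r["amount"]) for r in csv.DictReader(f)
                    if r["currency"] == "EUR")

# Category 3 -- in-memory: the same data loaded into an in-memory database
# that indexes it and answers SQL (the role Exasol or Vora play at scale).
# Category 2 (Hive) would sit in between: a query service on top of the files.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, currency TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX idx_currency ON orders (currency)")
(eur_total_db,) = con.execute(
    "SELECT SUM(amount) FROM orders WHERE currency = 'EUR'").fetchone()

print(eur_total, eur_total_db)  # both 250.0
```

The trade-off is already visible in miniature: the file is trivial to store and share, while the database copy costs memory but can answer selective queries via the index instead of scanning everything.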

The Big Data world is built entirely on the idea that many small (and therefore inexpensive) servers form one overall system. This allows you to scale almost without limit, while hardware costs grow only linearly.

But the more nodes make up the overall system, the more expensive their synchronization becomes. A join of three or more tables can mean that each node has to fetch the matching intermediate results of the previous join from the other nodes, and the query runs for hours.

This redistribution is called a "reshuffle". Keeping the data in memory does not help here, because the intermediate results still have to travel across the network.
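A minimal, single-process sketch of why a distributed join forces a reshuffle: before any node can join locally, both tables must be re-partitioned by the join key so that matching rows land on the same node. The partitioning and table names below are illustrative, not any particular engine's API.

```python
from collections import defaultdict

def partition_by_key(rows, key_index, num_nodes):
    """Hash-partition rows across nodes by the join key. In a real cluster
    this step means rows travelling over the network -- the 'reshuffle'."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key_index]) % num_nodes].append(row)
    return nodes

def distributed_join(left, right, num_nodes=4):
    # Both tables are re-partitioned on the join key so that matching rows
    # end up on the same node; only then can each node join locally.
    left_parts = partition_by_key(left, 0, num_nodes)
    right_parts = partition_by_key(right, 0, num_nodes)
    result = []
    for l_part, r_part in zip(left_parts, right_parts):  # per-node work
        index = defaultdict(list)
        for key, *vals in r_part:
            index[key].append(vals)
        for key, *vals in l_part:
            for r_vals in index[key]:
                result.append((key, *vals, *r_vals))
    return result

orders = [(1, "A"), (2, "B"), (3, "C")]
customers = [(1, "ACME"), (3, "Globex")]
print(sorted(distributed_join(orders, customers)))
# [(1, 'A', 'ACME'), (3, 'C', 'Globex')]
```

With each additional joined table, the intermediate result itself has to be reshuffled again, which is why multi-table joins are the expensive case on a cluster.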

Hana, on the other hand, is a real database. It is extremely fast at searching, join performance is excellent, and you get full transactional consistency for reads and writes. All of this, however, requires a lot of synchronization.

Such a database therefore does not scale without limit. Many projects work around the reshuffle dilemma by storing the data pre-optimized for certain queries. That, in turn, reduces flexibility and increases costs: exactly the points that were supposed to be the advantages of a data lake.

The synchronization effort of transactional consistency is a fundamental problem. It cannot be avoided without relaxing the requirements, for example to "eventual consistency".

This trade-off is known as the CAP theorem: of the three guarantees Consistency, Availability, and Partition tolerance, a distributed system can never provide all three at once, especially while a network partition (failure) is ongoing.

A highly available, distributed system must compromise on data consistency, while a transactional database system must compromise on availability or scalability.
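The eventual-consistency side of that trade-off can be sketched in a few lines: two replicas that accept writes immediately (staying available) and reconcile later with a last-writer-wins merge. The `Replica` class and its timestamp scheme are a deliberately simplified illustration, not a real replication protocol.

```python
class Replica:
    """A node that accepts writes immediately (availability) and merges with
    peers later. Reads in between may be stale -- eventual consistency."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, ts, key, value):
        self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Anti-entropy pass: keep the newer version of every key on both
        # sides (last-writer-wins by timestamp).
        for key in set(self.store) | set(other.store):
            newest = max(self.store.get(key, (-1, None)),
                         other.store.get(key, (-1, None)))
            self.store[key] = other.store[key] = newest

a, b = Replica(), Replica()
a.write(1, "stock", 100)   # write accepted on node a
b.write(2, "stock", 95)    # later, concurrent write on node b
print(a.read("stock"), b.read("stock"))  # 100 95 -- temporarily inconsistent
a.merge(b)
print(a.read("stock"), b.read("stock"))  # 95 95  -- converged
```

A transactional database like Hana refuses to enter the inconsistent middle state in the first place, which is exactly the synchronization cost the article describes.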

The data available in Big Data systems is raw data that only becomes information through transformations beyond what SQL can express; a Big Data-based data warehouse queried purely with SQL therefore makes little sense.

The data lake is the playground for the data scientist. This person has easy access to data that was previously deleted or was difficult to access.

The data scientist can deal with all the problems that come with Big Data technology: the semantics of the data, slow performance, and finding out what data exists in the first place. Mixing Big Data and business data? No problem for them.

Coupling Hana with Vora makes little sense from this point of view. Both hold the data in-memory and allow fast searches, with corresponding costs; both have warm storage on disk (a Sybase database); and both focus on SQL queries. Vora is, moreover, no longer on SAP's price list as a stand-alone product.

Parquet files and a database, on the other hand, complement each other perfectly. The Parquet files in a data lake cost practically nothing to store, whereas storage space in the database is expensive.

A database like Hana is excellent at joins and complicated SQL queries, while for a compute cluster precisely these operations are the most expensive.

The combination of the two results in fast business intelligence queries and convenient access to all raw data. Both contribute their strengths.
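This division of labor can be sketched as a small pipeline: raw data stays in cheap files in the lake, a batch step condenses it, and only the small aggregate is loaded into the database for fast BI queries. CSV and SQLite again stand in for Parquet and Hana; the file and table names are made up for the example.

```python
import csv
import os
import sqlite3
import tempfile

# Raw data stays in cheap files in the lake (CSV stands in for Parquet).
lake = tempfile.mkdtemp()
with open(os.path.join(lake, "clicks_2020_01.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["product", "clicks"])
    w.writerows([("P1", 10), ("P2", 4), ("P1", 6)])

# Batch step: scan the raw files once and aggregate.
totals = {}
with open(os.path.join(lake, "clicks_2020_01.csv"), newline="") as f:
    for row in csv.DictReader(f):
        totals[row["product"]] = totals.get(row["product"], 0) + int(row["clicks"])

# Only the condensed business view goes into the expensive database,
# where indexes and SQL make BI queries fast.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE click_totals (product TEXT PRIMARY KEY, clicks INT)")
db.executemany("INSERT INTO click_totals VALUES (?, ?)", totals.items())
(p1,) = db.execute(
    "SELECT clicks FROM click_totals WHERE product = 'P1'").fetchone()
print(p1)  # 16
```

The raw files remain available for the data scientist's ad-hoc exploration, while the database serves the recurring business queries from its compact aggregate.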

Werner Dähn, rtdi.io

Werner Dähn is Data Integration Specialist and Managing Director of rtdi.io.


