
SAP Big Data - what is Big Data?

What exactly is meant by the term Big Data? Does Big Data simply mean mass data, i.e. "lots of data" in the data warehouse? Or is Big Data a replacement for the data warehouse?
Werner Dähn, rtdi.io
November 28, 2019

The literal translation, mass data, captures only one aspect: all the ordinary data from the ERP system and other databases is mass data, too.

In terms of data volume, we are talking about quantities that are too large for databases - too large in the absolute sense or in the cost/benefit sense.

The more interesting aspect is the degree of structure in the data. The ERP system contains 99 percent well-structured data, such as the MATART (material type) field in the MARA (material master) table.

The remaining one percent is free text, such as a delivery note. Big Data is the other extreme: the exciting information sits in the unstructured parts of the data. When and where a photo was taken is interesting, but what the picture shows is infinitely more important.

The type of data preparation changes accordingly. With databases it is a query such as "total sales per month"; in the example above, we are suddenly talking about image analysis.

Even in less extreme cases, such as log files, the work is not simple sums and counts. That makes databases the worst choice for such data.

The most important definition of Big Data, however, is "all data that is not being used today to increase company profits." Creativity is the name of the game here. One of my recent projects involved recording the utilization of servers in the data center - with the goal of reducing the number of servers.

An example: sales are to be linked with information on how intensively customers have viewed the respective product on the website. Suppose a product is advertised in the media.

Is this advertising being noticed? If so, we should see increased traffic on the associated product pages. Do prospective customers read the product page briefly, are immediately convinced, and then buy?

The web server already writes all page views to log files, but after a week they are deleted. The data would therefore be there; it is just not being used yet.

The goal is maximum effectiveness and flexibility. A few years ago, MapReduce on Hadoop was the state of the art; then came Apache Spark, which could do more and performed better.

For a long time, Apache Hive was the way to go; today it is Parquet files. In such a dynamic environment, I do not want to sink a lot of resources into a potentially short-lived solution, and I want to keep the option of switching to something new at any time.

Currently, Apache Spark is such a powerful yet open solution. With it, the web server's log files are broken down into rows and columns with a single line of code. Developing the logic that derives the reading time per page from the history of page views is the more complex part.
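A minimal sketch of both steps in PySpark, assuming the common Apache access-log format; the file path, the use of the client IP as session key, and all column names are illustrative assumptions, not the project's actual code:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("weblog-reading-time").getOrCreate()

# One statement turns the raw log lines into rows and columns:
# extract client IP, timestamp, and requested page from each line.
logs = spark.read.text("/data/weblogs/access.log").select(
    F.regexp_extract("value", r"^(\S+)", 1).alias("client_ip"),
    F.to_timestamp(
        F.regexp_extract("value", r"\[([^\]]+)\]", 1),
        "dd/MMM/yyyy:HH:mm:ss Z",
    ).alias("ts"),
    F.regexp_extract("value", r'"\w+ (\S+)', 1).alias("page"),
)

# The harder part: order each visitor's page views by time and take
# the gap to the next view as an approximation of how long the
# current page was read.
per_visitor = Window.partitionBy("client_ip").orderBy("ts")
reading_time = logs.withColumn(
    "reading_seconds",
    F.lead("ts").over(per_visitor).cast("long") - F.col("ts").cast("long"),
)

In practice a session cookie would be a better partition key than the client IP, and the last page view of each session gets no reading time, since lead() returns null there.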

If I finally add these and other key figures to the data warehouse, combined analyses become possible - for example, visualizing sales, reading time, and page impressions for a product over time.
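Rolling the page-level figures up to one row per product and day makes them easy to join with the sales figures; the /products/<id> URL pattern below is a hypothetical example:

# Continues the sketch above. Derive the product ID from the page URL,
# then aggregate page impressions and average reading time per day.
web_kpis = (
    reading_time
    .withColumn("product_id", F.regexp_extract("page", r"^/products/(\w+)", 1))
    .groupBy("product_id", F.to_date("ts").alias("day"))
    .agg(
        F.count("*").alias("page_impressions"),
        F.avg("reading_seconds").alias("avg_reading_seconds"),
    )
)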

Until recently, storing and processing such secondary data was not attractively priced: the volume of data was too large, the information density too low, and the only tools that processed data effectively were database-related.

With the Hadoop Distributed File System (HDFS), large filesystems can be built from cheap PC components instead of an expensive disk array. Apache Spark can process these large data sets with the corresponding complex algorithms, including statistical methods and machine learning.

Data warehouse tools, including those from SAP, have adapted to this situation and provide direct access to Hadoop files or send transformation tasks to a connected Spark cluster. A very simple way to read data from Hana into Spark is the SAP Hana Spark Connector.
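The connector's API varies by release, so as one hedged illustration, the sketch below reads a Hana table through Spark's generic JDBC data source with SAP's ngdbc driver instead; host, port, schema, table, and credentials are all placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hana-to-spark").getOrCreate()

# Read a Hana table into a Spark DataFrame over plain JDBC.
hana_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-host:39015")  # placeholder host and SQL port
    .option("driver", "com.sap.db.jdbc.Driver")   # class from SAP's ngdbc JDBC driver
    .option("dbtable", "SAPABAP1.MARA")           # placeholder schema and table
    .option("user", "SPARK_USER")                 # placeholder credentials
    .option("password", "***")
    .load()
)
hana_df.show(5)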

Werner Dähn, rtdi.io

Werner Dähn is a data integration specialist and the managing director of rtdi.io.


