Euclid and Hana

Clustering, i.e. finding similarities, can often prove very enlightening for large amounts of data. The trick is based on the Euclidean distance formula and can be done without Hana as a simple mental calculation.

November 2, 2023

This text has been automatically translated from German to English.

SAP TechEd 2023, Bangalore, India

At the start of TechEd 2023, SAP Chief Technology Officer Jürgen Müller said he was excited to announce one of the most important enhancements, if not the most important, to the Hana database platform. SAP set the stage at TechEd with a Hana sensation that is meant to go beyond large language models (LLMs), which merely use deep learning algorithms to summarize, order, or make predictions from large amounts of data.

Jürgen Müller argued at TechEd 2023 in Bangalore that large language models can usually only capture the past. They are trained on existing data, mostly extracted from the Internet, so an immediate, real-time answer based on operational data is difficult for them. SAP's Hana database has been delivering real-time results for many years, and now it does so with vectors!

Vectors

According to Jürgen Müller, the Hana sensation is the ability to use vectors as objects on the database platform. Well, in traditional Euclidean mathematics, vectors are not really anything new. With the tools the Hana database platform already offers, any first-year computer science student can implement a few simple vector functions. What Jürgen Müller may have meant is an extension of the database's SQL language with a few vector commands.

What is a vector? In a coordinate system with an x and a y axis, you can select any two points. If you connect these points with the shortest possible straight line and add an arrow at one end, you have a directed line segment, i.e. a vector in two-dimensional space. It is also easy to imagine a vector in three-dimensional space (x, y, and z axes), such as a pencil lying on a table. The end and the tip of the pencil can be exactly determined as points in space. The pencil would then be the vector.

Now we go into higher dimensions, which are difficult to imagine visually (a four-dimensional cube would have a three-dimensional shadow, for example). Calculating in higher dimensions is still easy, even in your head, as this editor-in-chief's blog post sets out to show.

Many parameters, many dimensions

Task: cluster a million quotes by customer group, machine utilization, sales, etc., i.e. sort them into groups of similar quotes. Each quote has specific parameters that are easy to identify. The status of the customer can be derived from the customer name: bad customers get the value zero, good customers the value nine. Small quotes under 1,000 EUR get a value of one, large quotes over one million EUR get a value of 25, and all others get a fixed grade in between. The same applies to the goods offered: stock items, one-off items, etc. At the end of this process there are ten categories, and each quote has one value per category. These ten values can be interpreted as a vector in a ten-dimensional space (with the origin as its starting point) and written down as follows: (5, 9, 3, 7, 11, 2, 42, 15, 6, 102).
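
The scoring step described above can be sketched in a few lines of Python. Only the two boundary rules (under 1,000 EUR gives 1, over one million EUR gives 25) and the customer range of zero to nine come from the text. The logarithmic steps in between, and the names score_quote_value and quote_to_vector, are assumptions made purely for illustration.

```python
import math


def score_quote_value(amount_eur):
    """Map a quote amount in EUR to a score from 1 to 25.

    The two boundary rules come from the article; the logarithmic
    grading of the amounts in between is an assumption of this sketch.
    """
    if amount_eur < 1_000:
        return 1
    if amount_eur > 1_000_000:
        return 25
    # Position of the amount between 10^3 and 10^6 on a log scale (0.0 to 1.0)
    position = (math.log10(amount_eur) - 3) / 3
    # Spread the middle range over the scores 2 to 24
    return 2 + round(position * 22)


def quote_to_vector(customer_score, amount_eur, other_scores):
    """Assemble one quote vector from its category scores (hypothetical helper)."""
    if not 0 <= customer_score <= 9:
        raise ValueError("customer score must be between 0 and 9")
    return (customer_score, score_quote_value(amount_eur), *other_scores)


# A good customer with a small quote and two further, already scored categories
print(quote_to_vector(9, 500, [3, 7]))
```

A real project would of course need one such scoring rule for each of the ten categories.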

Euclidean distance

For each quote there is now a vector in the ten-dimensional space. Now the task is to group these quotes for possible marketing actions, expected sales, or for ordering raw materials in advance. The trick of clustering, i.e. the process of grouping, is to determine the distances between the different vectors in the ten-dimensional space.

The distances between pencils lying on an office desk in three-dimensional space are easy to determine. The distance is measured with a ruler. All pencils less than ten centimeters apart belong to one group, all others to the other group. (I hear justified objections: clustering is a bit more complex, but the vector principle, as presented by SAP's Chief Technology Officer Jürgen Müller, remains very simple.)
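
The pencil rule can be tried out with a small Python sketch. It reads the rule as follows: two pencils end up in the same group if they are linked by a chain of pencils that are each less than ten centimeters apart. That reading, the function name group_by_distance and the sample coordinates are assumptions for illustration; the ruler is replaced by math.dist from the Python standard library.

```python
import math


def group_by_distance(points, max_distance):
    """Group points that are linked by a chain of neighbours,
    each less than max_distance apart. Returns lists of point indices."""
    groups = []
    unassigned = set(range(len(points)))
    while unassigned:
        start = min(unassigned)
        unassigned.remove(start)
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            # All still ungrouped points close enough to the current one
            near = [i for i in unassigned
                    if math.dist(points[current], points[i]) < max_distance]
            for i in near:
                unassigned.remove(i)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups


# Hypothetical positions of five pencil tips on a desk, in centimeters
pencils_cm = [(0, 0, 0), (3, 4, 0), (8, 4, 0), (40, 40, 0), (43, 44, 0)]
print(group_by_distance(pencils_cm, 10))  # [[0, 1, 2], [3, 4]]
```

Because math.dist accepts points of any dimension, the same function would also work for ten-dimensional quote vectors.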

Each quote vector has a data point at its tip in ten-dimensional space (see the ten numbers above). Euclidean distance is one of the measures used to determine how similar or dissimilar such data points are. But how do we calculate the distance between the data point of the example vector (5, 9, 3, 7, 11, 2, 42, 15, 6, 102) and that of a second quote with, say, the vector (7, 2, 5, 13, 25, 9, 1, 132, 55, 8)?

The first step is to calculate the difference for each dimension: the first value of the first quote vector minus the first value of the second vector, and so on, i.e. 5 minus 7, 9 minus 2, 3 minus 5, etc. These results are squared and summed: -2 squared is 4, 7 squared is 49, and so on. Finally, you take the square root of the sum of the ten squared numbers (4 plus 49 plus 4 plus 36, etc.). This result is the Euclidean distance! Eureka!
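
For anyone who would rather check the mental arithmetic, here is a minimal Python sketch of exactly these steps, applied to the two example vectors from the text. The function name euclidean_distance is chosen for illustration only; it has nothing to do with any Hana command.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences per dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


quote_1 = (5, 9, 3, 7, 11, 2, 42, 15, 6, 102)
quote_2 = (7, 2, 5, 13, 25, 9, 1, 132, 55, 8)

# Intermediate steps: squared differences and their sum
squares = [(x - y) ** 2 for x, y in zip(quote_1, quote_2)]
print(squares)       # [4, 49, 4, 36, 196, 49, 1681, 13689, 2401, 8836]
print(sum(squares))  # 26945

print(round(euclidean_distance(quote_1, quote_2), 2))  # 164.15
```

The two example quotes are therefore about 164 units apart. Whether that counts as "close" depends on the distances between all the other quotes.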

This means that there are mathematically clearly defined relationships between the one million quotes. These Euclidean distances can now be used as similarity measures for clustering. This makes it possible to create excellent maps whose areas (clusters) provide direct information about preferences, similarities and trends. About ten years ago, here at the E3 publishing house, we created a Hana map with clusters based on a survey in the SAP community (the survey results back then correspond to the one million quotes in this example), with the help of Professor Alfred Taudes from the Vienna University of Economics and Business.

Maps are not AI

What SAP Chief Technology Officer Jürgen Müller presented at TechEd 2023 in Bangalore is of great practical importance. Many Hana customers will appreciate the language extension into higher dimensions. However, what was presented is very traditional and very familiar mathematics. Presenting Euclidean distance as a milestone in Hana development is a rather odd choice.

Supplement: in addition to Euclidean distance, there is a second distance measure. The Manhattan metric is likewise a measure of proximity or distance for metric variables such as height, age, or weight. It measures distances along right-angled paths, as when walking through a grid of streets or riding in a taxi. In contrast, Euclidean distance measures the direct, shortest distance (as the crow flies). Both methods can be used to generate clusters for a utility map. (Source only in German)
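
The difference between the two measures is easy to see in a short Python sketch. The function name manhattan_distance is chosen for illustration; the two ten-dimensional vectors are the example quotes from the text.

```python
def manhattan_distance(a, b):
    """Sum of the absolute differences per dimension (taxi route)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))


# Two street corners: 3 blocks east and 4 blocks north of each other
print(manhattan_distance((0, 0), (3, 4)))  # 7 (as the crow flies it would be 5)

quote_1 = (5, 9, 3, 7, 11, 2, 42, 15, 6, 102)
quote_2 = (7, 2, 5, 13, 25, 9, 1, 132, 55, 8)
print(manhattan_distance(quote_1, quote_2))  # 339
```

For the same pair of points the Manhattan value is never smaller than the Euclidean one, because the taxi cannot take the diagonal.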
