Rather WannaSwitch than WannaCry
Malware such as ransomware and viruses, along with unintentional data leakage, has so far been primarily an endpoint problem. Recently, however, ransomware variants such as WannaCry and Petya have also caused a stir in the corporate server environment, because physical and logical access controls and purely structural and network-level safeguards are evidently not sufficient, even for high data center classes.
The vulnerability exploited by WannaCry, Petya and their ilk comes down to unpatched operating systems. Yet there are many use cases in which patches can, by definition, only be applied to production environments at long intervals.
It does not even take highly sophisticated zero-day exploits and fast-moving criminal intent for such software to take over these systems.
It is enough to ruthlessly exploit reasonably recent vulnerabilities whose patches have only been available for a few weeks or months. A data center in 7x24 operation that schedules regular maintenance windows perhaps twice a year, and does not consider the current security patches critical enough to interrupt operations for, is and remains open to such attacks.
The names of a wide variety of companies from the financial sector, the chemical industry and many other sectors, as well as public and semi-public institutions, have circulated in the press as suspected or confirmed victims.
The fact that this also includes operators of critical infrastructure offers a telling insight into how current requirements such as the German IT Security Act (IT-SiG) are being met. The question is whether security measures taken in good faith are suitable at all for dealing with the threats posed by ransomware. Two areas of tension need to be resolved:
- 7x24 IT operation vs. continuous patching
- Fending off attacks vs. reacting to them calmly
There is a comparatively simple - and above all more or less universally applicable - approach to this: architecture- and infrastructure-independent mirroring at database and application level.
This method, often dismissed as outdated in the era of virtual machines, storage mirroring and all manner of cluster variants, shows a particular strength against ransomware attacks: the logical independence of the underlying system environments.
Attackers go away empty-handed despite a successful attack
One Libelle customer did not want to end up the victim of a ransomware attack, so it rehearsed the emergency. (The company is a well-known food manufacturer that unfortunately cannot be named, because its industry is permanently exposed to extortion attempts and other attacks.)
The company's success rests on product quality and the 7x24 availability of a large number of MS-SQL databases. Patching these continuously, and constantly interrupting operations to do so, is not feasible in the company's day-to-day business.
A different approach was therefore chosen: the productive systems run "as is", while the current data is continuously transferred from the productive environment to a so-called mirror environment via data and application mirroring.
Patching still happens regularly and promptly, but not on the productive environments: it is applied to the mirror systems. BusinessShadow works completely independently of the productive environment, without shared servers and without shared storage; in short: shared nothing.
Mirroring means that the current data is physically present on the mirror side at all times, while the mirror systems can be maintained and kept at the latest patch level separately from productive operation.
If a ransomware attack on the production environment succeeds because of its low patch level, operations simply switch over to the fully patched mirror system and continue there within a few minutes. The result: the attack was not blocked, but it still failed.
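To make the principle tangible, the following Python snippet is a heavily simplified model of such shared-nothing asynchronous mirroring (class and variable names are invented for illustration; this is not BusinessShadow's actual implementation). Production commits locally and ships log records in the background, so a failover amounts to redirecting clients to the mirror:

```python
import queue
import threading
import time

# Simplified model of shared-nothing asynchronous mirroring.
# Illustrative sketch only -- not the vendor's actual code.

class Mirror:
    """Mirror side: keeps shipped log records on its own, separate storage."""
    def __init__(self):
        self.log = []  # stands in for the mirror's independent storage

    def receive(self, record):
        self.log.append(record)

class Production:
    """Production side: commits locally, ships records asynchronously."""
    def __init__(self, mirror):
        self.log = []
        self._outbox = queue.Queue()
        threading.Thread(target=self._ship, args=(mirror,), daemon=True).start()

    def commit(self, record):
        self.log.append(record)   # local commit returns immediately
        self._outbox.put(record)  # shipping happens in the background

    def _ship(self, mirror):
        while True:
            mirror.receive(self._outbox.get())

mirror = Mirror()
prod = Production(mirror)
for i in range(5):
    prod.commit(f"txn-{i}")
time.sleep(0.1)  # give the background shipper time to drain
# If ransomware now takes out production, clients simply reconnect to the
# mirror, which holds the same committed data on separately patched systems.
print("mirror holds:", mirror.log)
```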
Precaution against classic data corruption, too
The data mirroring described above is asynchronous, which has several advantages over the synchronous mirroring commonly used in storage and cluster solutions. For one thing, relaxed maintenance windows on the mirror become possible in the first place, because unlike synchronous mirroring no commit spanning both sites is required; for another, the company escapes the synchronous trap.
With synchronous mirroring, a logical error that corrupts the productive dataset automatically corrupts the dataset on the mirror as well. Executed ransomware encryption or deletion, virus infections, faulty application activity, bad data imports, malicious manual actions by internal or external users, and the like are all logical errors that, in a bad case, bring the company to a standstill.
In the even worse case, work continues on incorrect data, generating additional economic damage or even harming the company's public image.
With this asynchronous data and application mirroring, arbitrary time offsets can be defined between the production and mirror systems. The current production data is already physically present on the mirror system, but is deliberately held back in a time funnel and only logically activated once the defined time offset has elapsed.
Logically, the mirror system thus always lags behind production by exactly this time offset, but it already holds the data delta physically on its own storage and can apply it ad hoc if required.
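A minimal sketch of how such a time funnel might behave, under the assumption that records are tagged with their arrival time (a real product works on transaction-log boundaries rather than raw timestamps):

```python
import time
from collections import deque

# Minimal "time funnel" sketch (names invented for illustration):
# records are stored the moment they arrive, but only applied to the
# mirror database once they are older than the configured offset.

class TimeFunnel:
    def __init__(self, offset_seconds):
        self.offset = offset_seconds
        self.pending = deque()  # physically received, not yet applied
        self.applied = []       # the mirror database's logical state

    def receive(self, record):
        """Store a shipped record immediately on the mirror's own storage."""
        self.pending.append((time.time(), record))

    def tick(self):
        """Logically apply every record older than the configured offset."""
        cutoff = time.time() - self.offset
        while self.pending and self.pending[0][0] <= cutoff:
            self.applied.append(self.pending.popleft()[1])
```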
If a logical error of any kind occurs in the production environment, the organizationally responsible party declares the switchover case: depending on the company's structure and processes, this might be the SAP manager, the application owner, the DR officer or the IT manager.
Technically, the best possible point in time for the dataset is determined and activated on the mirror system. The database or application on the mirror can thus be brought into productive use at any point within the time funnel, with transactional accuracy and data consistency; users and other accessing applications log on again and continue working with correct data.
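Continuing the sketch above, activation might look like this: the responsible party picks a point in time just before the logical error, the mirror rolls forward exactly that far, and everything later is discarded (again a simplification; the real activation happens on transactionally consistent boundaries):

```python
# Activation on a logical error: roll the mirror forward to a chosen
# point in time just before the corruption; later records are dropped.

def activate(funnel, as_of):
    """Bring the mirror to the state it had at 'as_of', then go productive."""
    while funnel.pending and funnel.pending[0][0] <= as_of:
        funnel.applied.append(funnel.pending.popleft()[1])
    funnel.pending.clear()  # records after 'as_of' contain the corruption
    return funnel.applied   # consistent dataset at the chosen time

# Example: the error was noticed 10 minutes ago on a funnel with a
# 30-minute offset, so activate the state as of 11 minutes ago:
# dataset = activate(funnel, time.time() - 11 * 60)
```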
Another advantage of this asynchronous data and application mirroring is significantly lower latency, since the production system never has to wait for the mirror system's commit.
This makes practical and economically attractive disaster recovery (DR) concepts possible even over long distances, with modest bandwidth and quality-of-service (QoS) requirements for the network links between the systems.
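A back-of-the-envelope comparison makes the distance argument concrete (all numbers are assumptions for illustration): a synchronous commit must wait for a network round trip to the mirror, while an asynchronous commit completes locally regardless of distance.

```python
# Assumed figures only: synchronous mirroring adds a full network round
# trip to every commit, asynchronous mirroring does not.

LOCAL_COMMIT_MS = 2.0  # assumed local log-flush time
ROUND_TRIP_MS = {      # assumed round-trip times for typical link lengths
    "campus, ~5 km": 0.1,
    "metro, ~100 km": 2.0,
    "long haul, ~1000 km": 20.0,
}

for link, rtt in ROUND_TRIP_MS.items():
    sync_ms = LOCAL_COMMIT_MS + rtt  # waits for the mirror's acknowledgement
    async_ms = LOCAL_COMMIT_MS       # ships in the background
    print(f"{link}: sync ~{sync_ms:.1f} ms, async ~{async_ms:.1f} ms per commit")
```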
Challenges: logical, physical, infrastructural
The failover systems can be operated not only in the company's own data centers but also, for example, as a service at a "friendly company" or at a service provider any distance away, an approach that is particularly common among medium-sized companies.
As a result, the distance between the production site and the failover site is no longer limited by the reach of dark fiber, campus or metro cluster technologies, which typically span only a few kilometers.
Asynchronous mirroring can be extended as business requirements and corporate structure dictate, even across different tectonic plates. This makes DR concepts possible that remain effective in large-scale disasters and keep IT operations running across countries and regions, or even worldwide.
Architecture-independent data and application mirroring also frees the user from the single-point-of-failure dilemma: beyond the already recommended shared-nothing setup, the environments involved may run on different hardware architectures and infrastructures.
Alongside technological interests, economic interests must also be taken into account. Homogeneous architectures require less maintenance, but a faulty driver, firmware patch or controller software release then affects not just individual environments but all of them.
Commercial considerations also shape the requirements for the productive and emergency environments: often it is sufficient for only the productive system to be designed for permanent high-performance operation.
The failover system can be sized smaller; it just has to be "good enough" for a deployment that will hopefully never occur and, if it does, is only temporary.
In practice, these considerations often mean that the "old" productive system lives on as the new failover system as part of the usual hardware refresh cycle. Many companies thus opt for a middle ground between homogeneous and heterogeneous architecture, defining at least two hardware standards, often with components from different manufacturers.