[LINUX] Mainframe error handling

Introduction

In this article, I'll talk about how mainframes handle exceptions and how that differs from exception handling on Linux.

The immediate reason for writing this article was a request from a certain person. But I have long felt that the ideas behind mainframe error handling remain helpful even now that PC servers and cloud systems have become mainstream.

I work on the Linux kernel now, but about 20 years ago I was assigned to a mainframe OS development team, where I wrote a driver for a coprocessor on the mainframe. That gave me many opportunities to absorb the mainframe way of thinking about error handling, and I feel that the experience is still very useful 20 years later.

The mainframe is, after all, a system that has quietly supported the foundations of society for many years. For example, in core banking systems, that is, the systems that keep the ledgers of your bank accounts, mainframes still hold a large share [^4]. Mainframes have been used as systems that are expected to work normally every single day, and whose failures become social problems. What a mainframe does when things go wrong may therefore offer hints even for modern systems. I hope readers will approach this article in the spirit of the saying "study the old to learn the new" (温故知新).

[^4]: Wikipedia's "[Core banking system (勘定系システム)](https://ja.wikipedia.org/wiki/%E5%8B%98%E5%AE%9A%E7%B3%BB%E3%82%B7%E3%82%B9%E3%83%86%E3%83%A0)"

A note on terminology

Terminology gave me a great deal of trouble in writing this article. Mainframe terminology often differs from Unix/Linux terminology, and the corresponding features do not always behave exactly the same way.

For example, IBM's mainframe documentation [contains sentences like the following](https://www.ibm.com/support/knowledgecenter/ja/SSLTBW_2.2.0/com.ibm.zos.v2r2.ikjb600/ikj2k200_ESTAE_and_ESTAI_Exit_Routines.htm).

The ESTAE exit routine is set by issuing the ESTAE macro instruction.

ESTAE itself is described later; the "macro instruction" here corresponds to what Unix/Linux would call a system call [^3] or a library function call. The name comes from the fact that the mainframe system control program is written in assembler: the sequence of instructions needed to call a function is registered as an assembler macro, and that macro is used at each call site. Similarly, the term "exit routine" may be easier for readers to understand as "handler". Even something as ordinary as calling a function often goes by a different name on a mainframe OS.

Since I am now more familiar with Unix/Linux terminology, I have mostly written in those terms. Please keep in mind that this means the description is not strictly accurate from a mainframe point of view.

Also, I have been away from the mainframe for almost 20 years, so there are surely some mistakes. I would appreciate it if you could point them out.

[^3]: On the mainframe, this is called an SVC (Supervisor Call) routine rather than a system call.

Concept of mainframe exception handling

Now, let's start with the basic concept of mainframe exception handling. The following passage describes it best. It is an excerpt from ["Functions and Structure of MVS"](https://www.amazon.co.jp/%EF%BC%ADVS%E3%81%AE%E6%A9%9F%E8%83%BD%E3%81%A8%E6%A7%8B%E9%80%A0-%E5%8D%83%E7%94%B0-%E6%AD%A3%E5%BD%A6/dp/4844372890), a book about IBM's old mainframe OS, MVS [^1]. (Emphasis mine.)

[^1]: IBM's current mainframe OS is z/OS, but the basic ideas should apply to both MVS and z/OS.

MVS aims to maximize system availability and to minimize the impact on users in the event of a hardware or software failure. When a failure occurs, it first tries to recover the affected work and resources; if that does not succeed, it isolates the failing part from the system so that __the system as a whole continues to operate__ and remains available to users.

Note that "the system as a whole" here does __not__ refer to the whole of some clustered, multi-node system. This sentence is actually talking about a single __"one hardware + one OS"__ configuration. In other words, MVS aims at __operation that localizes and minimizes failures within that one system__. Where the Linux kernel will panic when things become untenable, the mainframe isolates the failing part so that the system keeps operating as much as possible.

Of course, there are limits to what one system can do, so mainframes can naturally also be made redundant through clustering. Still, it is fair to say that the mainframe aims for the behavior above even within a single system.

The idea of valuing the continued operation of a single computer as much as possible was probably born in the era when mainframes matured, when the unit price of a computer was still very high.

Exception handling handler registration (ESTAE)

To achieve the goal above, the mainframe OS lets you register a handler that runs when a critical error occurs: an __ESTAE exit routine__, registered with the __ESTAE macro__ mentioned earlier [^6]. Ordinary errors are reported through the return code of the called subroutine, but when a serious abnormal state arises, the handler registered through ESTAE is called instead. The concept may be easier to grasp if you think of a signal handler in a C program on Unix/Linux, or a handler registered with defer() in Go (a rough Unix analogy in C follows the footnote below). Unlike those, however, ESTAE can also be used inside system calls on a mainframe OS.

[^6]: According to reviewers, there is also ESPIE in addition to ESTAE (ESTAE covers a wider range of conditions). Unfortunately, I have never used ESPIE.
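To give a rough feel in Unix/Linux terms, here is a minimal C sketch using the signal handler analogy from the text above. To be clear, this is not the ESTAE API itself: sigaction() is only the Unix analogue, and the handler body is a made-up placeholder.

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical recovery handler: a rough Unix analogue of an ESTAE
 * exit routine. It runs when a serious fault occurs, instead of the
 * usual error-return-code path. */
static void recovery_handler(int sig)
{
    (void)sig;
    /* A real ESTAE routine would record diagnostics, release
     * resources, and decide whether to retry or give up. Here we
     * only use async-signal-safe calls, then give up. */
    static const char msg[] = "fault detected, cleaning up\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(EXIT_FAILURE);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = recovery_handler;
    sigemptyset(&sa.sa_mask);

    /* Registration: loosely analogous to issuing the ESTAE macro. */
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *p = NULL;
    *p = 42;               /* trigger the fault */
    return 0;
}
```

One gap in the analogy: a real ESTAE exit routine can also request that the interrupted work be retried, which a plain Unix signal handler cannot express this simply.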

This exception-handling handler is called not only for software errors such as bugs, but also for hardware failures, which is one of its defining characteristics. In the case of a hardware error, the handler is reached in the following order, as shown in the figure below.

  1. The CPU that detects the error (machine check) raises a machine check interrupt.
  2. The Machine Check Handler (MCH) determines whether recovery is possible through ECC and the like; if it is, the interrupted processing is resumed.
  3. If recovery is judged impossible and another processor exists, that processor is notified of the failure.
  4. The processor notified of the failure runs the Recovery Termination Manager (RTM).
  5. The RTM calls the registered ESTAE handler (not shown in this figure).

[Figure: mch_handling.jpg, the machine check handling flow (excerpted from "Functions and Structure of MVS")]
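As a supplement to the figure, here is a tiny, self-contained C simulation of the ordering of steps 1 to 5. Every name in it is invented for illustration; it models only the sequencing described above, not real MVS internals.

```c
#include <stdbool.h>
#include <stdio.h>

/* All names here are invented; this only simulates the order of
 * steps 1-5 above. */

static bool estae_handler(void)            /* 5. registered ESTAE handler */
{
    printf("5. ESTAE handler: attempting recovery\n");
    return true;                           /* pretend recovery succeeded */
}

static void rtm(void)                      /* 4. Recovery Termination Manager */
{
    printf("4. RTM: running on the notified processor\n");
    if (!estae_handler())
        printf("RTM: terminating the affected work\n");
}

static void machine_check(bool ecc_recoverable, bool have_peer_cpu)
{
    printf("1. machine check interrupt raised\n");

    /* 2. MCH decides whether hardware-level recovery (ECC etc.) works */
    if (ecc_recoverable) {
        printf("2. MCH: recovered by ECC, resuming interrupted work\n");
        return;
    }

    /* 3. unrecoverable: notify another processor if one exists */
    if (have_peer_cpu)
        printf("3. notifying peer processor of the failure\n");

    rtm();                                 /* steps 4 and 5 */
}

int main(void)
{
    machine_check(false, true);            /* simulate an unrecoverable error */
    return 0;
}
```

Running it with an unrecoverable error and a peer CPU present prints the steps in the order described above: interrupt, peer notification, RTM, then the ESTAE handler.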

The features of ESTAE are as follows.

  1. Multiple ESTAE handlers can be registered, nested. For example, if the module that registered ESTAE handler A calls another module, and that module registers ESTAE handler B before an error occurs, then ESTAE handler B is the one called. This is very useful when you consider that Unix/Linux signal handlers cannot be nested this way.
  2. Building on 1., if the callee's exception handler cannot recover, the caller's exception handler is called next (this is called percolation). The parent, as it were, takes responsibility for the child's failure (see the sketch after the diagram below).

A simple diagram of this situation is as follows.

[Figure: ESTAE_error_handling.jpg, nesting and percolation of ESTAE handlers]
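As promised in item 2 above, here is a minimal, self-contained C sketch of nesting and percolation. The handler stack, return codes, and function names are all invented; in reality the mechanism is managed by the OS (the RTM), not by application code like this.

```c
#include <stdio.h>

/* Invented return codes for the sketch. */
enum result { RECOVERED, NOT_RECOVERED };

typedef enum result (*handler_fn)(void);

/* A tiny stack of registered handlers: index 0 is the caller's
 * handler A, index 1 the callee's handler B (see the text above). */
static handler_fn stack[8];
static int depth;

static void register_handler(handler_fn h) { stack[depth++] = h; }

/* Percolation: try the most recently registered handler first;
 * if it cannot recover, the error "percolates" to its caller. */
static void percolate(void)
{
    for (int i = depth - 1; i >= 0; i--)
        if (stack[i]() == RECOVERED)
            return;
    printf("no handler recovered: terminate the work unit\n");
}

static enum result handler_a(void)  /* the caller's handler */
{
    printf("handler A: recovering on behalf of the callee\n");
    return RECOVERED;
}

static enum result handler_b(void)  /* the callee's handler */
{
    printf("handler B: cannot recover, percolating\n");
    return NOT_RECOVERED;
}

int main(void)
{
    register_handler(handler_a);  /* a module registers A, then calls... */
    register_handler(handler_b);  /* ...a module that registers B */
    percolate();                  /* an error occurs in the callee */
    return 0;
}
```

Handler B (the callee's) runs first and fails, so the error percolates to handler A (the caller's), which recovers; run it and you will see exactly that order.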

Exception Handling Handler Registration (FRR)

In addition to ESTAE, mainframes also have a feature called FRR. Like ESTAE, FRR registers an exception-handling handler, but it differs from ESTAE in two ways: it is for system code only and cannot be used by middleware or applications [^5], and the __handler is called while the system lock is still held__. The advantage is that, because error handling continues while holding the lock, there is no need to re-acquire the lock during recovery, and thus no risk of deadlocking on it.

Anyone who has struggled with the Unix/Linux async-signal-safety rules may appreciate the value of this. Rather than restricting which functions may be called inside a signal handler, having a dedicated handler that runs while the lock is held makes recovery much easier.

If an FRR handler is registered, it is called before any ESTAE handler. This is convenient when there is a resource that absolutely must be returned, such as the system lock. (A small analogy sketch follows the footnote below.)

[^5]: However, the z/OS manual states that ["Any program function can use SET FRR ..."](https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.ieaa400/setfrr.htm), so in z/OS it may be callable from more than just system programs.
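To illustrate why running the handler while still holding the lock is so convenient, here is a small C sketch using pthreads. This is only an analogy under names I invented; FRR is not a pthreads API, and a real FRR handler is invoked by the system, not called directly like this.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t sys_lock = PTHREAD_MUTEX_INITIALIZER;
static int protected_counter;        /* data guarded by sys_lock */

/* Hypothetical FRR-style routine: called WHILE the lock is still
 * held, so it can repair the data and release the lock itself,
 * with no risk of deadlocking on a second acquisition. */
static void frr_style_recovery(void)
{
    protected_counter = 0;           /* put the data back in a sane state */
    pthread_mutex_unlock(&sys_lock); /* return the system lock */
    fprintf(stderr, "recovered: lock released, state reset\n");
}

static int do_critical_work(void)
{
    pthread_mutex_lock(&sys_lock);
    protected_counter++;

    int failed = 1;                  /* pretend a failure happened here */
    if (failed) {
        frr_style_recovery();        /* handler runs holding the lock */
        return -1;
    }

    pthread_mutex_unlock(&sys_lock);
    return 0;
}

int main(void)
{
    do_critical_work();
    return 0;
}
```

Because the recovery routine already owns the lock when it runs, it never has to re-acquire it, which is exactly the deadlock-avoidance property described above.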

What should an exception handler do?

As for what the exception-handling exit routine actually does: in my experience writing a driver for a coprocessor, it did the following.

  1. Record debug information when an error occurs
  2. Notify components that use the failed function
  3. Repair pointers in components if possible
  4. Return used resources (locks, allocated memory, etc.)
  5. If repair is not possible, make later calls return an error

Let's look at each one in turn.

1. Record debug information when an error occurs

Partial dumps, logs, and execution traces of the relevant areas are collected so that the state at the time of the failure can be investigated later. Even a hard-to-reproduce timing failure, one that occurs, say, once a year, is simply a failure that just happened from the end user's point of view. For such cases you need to save debugging information and use it to find the cause, so it is important to collect data such as the internal state of the failing component and the related structures at that moment.

I wrote "partial dump" above: in the driver I wrote, only the relevant areas, such as the memory regions used by that driver, were collected into the dump. After collecting this partial dump, the driver would recover if possible and the system would continue operating. I cannot recall a Linux kernel feature that corresponds to this kind of partial dump.

(If anything, the closest is Linux kdump, but kdump's primary purpose is to collect a memory dump of the entire system, not a partial dump, and it assumes the system will then be rebooted.)
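As a concrete image of a "partial dump", here is a hedged C sketch: only the failing component's own state is written out, not the whole of memory. The component structure, field names, and file name are all invented for illustration.

```c
#include <stdio.h>
#include <time.h>

/* Invented component state for the sketch. */
struct component_state {
    int  requests_in_flight;
    int  last_error_code;
    char last_command[32];
};

static struct component_state comp = { 3, -5, "READ BLOCK 42" };

/* Partial dump: copy only this component's state, plus a timestamp,
 * so the failure can be analyzed later without a full-system dump. */
static void partial_dump(const struct component_state *s)
{
    FILE *f = fopen("component.dump", "w");
    if (!f)
        return;
    fprintf(f, "dump taken at %ld\n", (long)time(NULL));
    fprintf(f, "requests_in_flight=%d\n", s->requests_in_flight);
    fprintf(f, "last_error_code=%d\n", s->last_error_code);
    fprintf(f, "last_command=%s\n", s->last_command);
    fclose(f);
}

int main(void)
{
    partial_dump(&comp);   /* would be called from the error handler */
    return 0;
}
```

The point is the scope: a few hundred bytes of component state captured at the moment of failure is often enough to debug a once-a-year problem afterwards.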

2. Notify components that use the failed function

Programs using the functionality of the failed component must be notified that a failure has occurred. As mentioned above, the mainframe is designed to keep the system running as much as possible, so if you fail to do this you end up in the worst possible state: __"the system appears to be running normally, but the one component the business depends on is dead"__. If that happens, the necessary recovery processing never runs, causing major trouble across the whole business system. Therefore, for every task sleeping while it waits for the component to finish processing, you must set a return code indicating the failure and wake that task up.

In the Linux kernel, the closest equivalent is what happens on a memory error: if a process is using the affected area, that process is killed. For a PCIe error, however, nothing like this is possible, and the error can only be recorded in the log. (I suspect this is because Linux abstracts I/O in multiple layers, from the page cache down to the driver layer, so the kernel cannot determine which I/O belongs to which process. The mainframe's I/O layer is comparatively simple; for instance, it has no caching layer like the page cache.)
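Here is a minimal C sketch of the wake-everyone-with-an-error idea, using pthreads as the analogy (the mainframe mechanism is different, and the names here are invented). The handler sets a failure flag and broadcasts, so every sleeping waiter wakes up and sees an error return code instead of blocking forever.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int finished;       /* set when the request completes... */
static int failed;         /* ...or when the component dies */

/* Called from the error handler: set an error status for every
 * sleeping waiter and wake them all up. */
static void notify_failure(void)
{
    pthread_mutex_lock(&lock);
    failed = 1;
    pthread_cond_broadcast(&done);
    pthread_mutex_unlock(&lock);
}

/* A task waiting on the component; returns -1 on component failure. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!finished && !failed)
        pthread_cond_wait(&done, &lock);
    int rc = failed ? -1 : 0;
    pthread_mutex_unlock(&lock);
    printf("waiter woke up, rc=%d\n", rc);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    notify_failure();       /* simulate the component dying */
    pthread_join(t, NULL);
    return 0;
}
```

Note that the while loop re-checks both flags before sleeping, so a task that arrives after the failure also returns immediately with -1 rather than waiting forever.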

3. Repair pointers in components if possible

I can't go into much detail here, but I remember that the driver I wrote repaired pointers, such as linked lists inside components, as much as possible. I don't recall any Linux driver doing the same thing; the Linux kernel's basic stance is to panic in this situation.
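Purely to illustrate the kind of repair meant here (this is not the actual driver code, and all names are invented), the following C sketch walks a linked list with a per-node sanity marker and truncates the list at the first corrupt node:

```c
#include <stdio.h>

struct node {
    unsigned magic;        /* sanity marker for each node */
    struct node *next;
};

#define NODE_MAGIC 0x600DF00D

/* Walk the list; if a node fails the sanity check, cut the list
 * off just before it. Losing part of a list is better than
 * chasing a wild pointer and crashing the whole system. */
static int repair_list(struct node *head)
{
    int dropped = 0;
    for (struct node *n = head; n && n->next; n = n->next) {
        if (n->next->magic != NODE_MAGIC) {
            n->next = NULL;   /* truncate at the corrupt node */
            dropped = 1;
            break;
        }
    }
    return dropped;
}

int main(void)
{
    struct node c = { 0xDEADBEEF, NULL };        /* corrupted node */
    struct node b = { NODE_MAGIC, &c };
    struct node a = { NODE_MAGIC, &b };

    if (repair_list(&a))
        printf("corrupt node detected, list truncated\n");
    return 0;
}
```

Losing the tail of a list is obviously not free, but it keeps the component usable, which matches the "keep the system running" stance described earlier.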

4. Return of used resources

Resources in use must be returned even in the event of a failure. For example, if allocated memory is not returned, the leak can exhaust system memory later. Likewise, if some part of the system is still holding a lock, the system will later deadlock when something tries to acquire it. Returning held resources in the exception-handling routine is therefore essential to keeping the system running as long as possible. (A combined sketch for items 4 and 5 follows item 5 below.)

5. If repair is not possible, make later calls return an error

Detach the failed component from the system so that it cannot be used while in an abnormal state, and make sure that subsequent calls to the component return an error code.
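Items 4 and 5 fit naturally into one hedged C sketch (all names invented): on a fatal error the handler returns the component's memory and lock, and sets a "failed" flag so that every later call is rejected with an error code instead of touching broken state.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t comp_lock = PTHREAD_MUTEX_INITIALIZER;
static void *comp_buffer;
static int   comp_failed;     /* once set, the component is detached */

/* Item 4: return resources; item 5: fail all later calls. */
static void handle_fatal_error(void)
{
    free(comp_buffer);                 /* return reserved memory */
    comp_buffer = NULL;
    comp_failed = 1;                   /* detach the component */
    pthread_mutex_unlock(&comp_lock);  /* return the lock */
}

/* The component's entry point. */
static int component_call(void)
{
    if (comp_failed)
        return -1;      /* clean error code: no hang, no corruption */

    pthread_mutex_lock(&comp_lock);
    /* ... normal work would go here ... */
    handle_fatal_error();   /* simulate a fatal error mid-operation */
    return -1;
}

int main(void)
{
    comp_buffer = malloc(4096);
    printf("first call: %d\n", component_call());   /* fails, cleans up */
    printf("later call: %d\n", component_call());   /* -1 immediately */
    return 0;
}
```

The second call in main() shows the detached behavior: no hang and no corruption, just an immediate error return that the caller can handle.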

Mainframe culture

In talking about culture, let me tell a slightly old story. I have never forgotten what a senior colleague said during a desk review of source code for a mainframe feature, because it seems to capture part of mainframe culture.


Senior: "When I look at the source converted to assembler, what happens if a machine check occurs between this instruction and the next instruction (due to a hardware failure)?" I:"????" Senior: "(Because the FRR handler has not been registered yet) The system will stall because the lock cannot be returned? __ It's a bug, so let's fix it! __"


I think these lines, which my senior took completely for granted, show the attitude mainframe system developers brought to exception handling.

A hardware failure can occur at any point while software is running, so it makes no sense to apply this consideration to only the one piece of code that happened to be reviewed. Such a remark is meaningful only if essentially all the source code, top to bottom, takes hardware failures into account. The real awesomeness of mainframe reliability, I think, is that it was implemented, and kept being implemented, with the attitude that the system must not stop even while handling a hardware failure.

I also remember that, because error handling was implemented this thoroughly, it accounted for a very large share of the total source code. This is just a personal impression with n = 1, but my gut feeling is that mainframe code was roughly 3:7 normal processing to error handling. That is how much effort went into error handling. (For comparison, is the Linux kernel around 5:5? If you are interested, try measuring it.)

Critical view

So far I have described what mainframes do when failures occur, but to finish, let me introduce some critical views. Some readers may have felt uncomfortable with the explanations so far, and I heard plenty of skeptical and critical opinions even back when I was in the mainframe department. Blind faith in the mainframe is not a healthy attitude; then again, dismissing it entirely is not good engineering either.

It is also worth being conscious of this skepticism, so that you can calmly weigh the pros and cons of modern architectures, mainframes and cloud included, and choose the right technology.

"In the first place, does the abnormal handling at the time of hardware failure work?"

Since I have no data at hand, this is a qualitative story, but as for whether recovery works well when a CPU failure occurs, it seems there were quite a few cases where even the mainframe could not recover as expected. After all, a hardware failure is often a roll of the dice (like the Dragon Quest spell Pulpunte, whose effect is unpredictable), and performing recovery well is apparently difficult even for a mainframe that tries this hard. Hence the view: does it even make sense to put this much effort into recovery processing?

Personally, even if it does not work ideally, I see nothing strange in being able to say with pride that "the software was designed to recover this far"; but opinions will be divided on whether that is worthwhile engineering.


(Added 2020/5/3 20:00) I received a comment on Facebook. With the commenter's permission, I am adding it here as well.

Regarding the question "does error handling at the time of hardware failure even work in the first place?": when I worked on general-purpose machines, I was taught the following.

- Think of it less as "surviving hardware failures" and more as "leaving enough information to make failures that occur in operation explainable".
- Being accountable is essential in the professional world.
- First of all, recognize that "best effort" is not acceptable.
- That is why task links are triplicated, and past state transitions, along with the factors that caused them, are recorded.
- To that end, you must get through the hardware failure and leave the information behind.

For your information.


"Is it credible to recover on a system that claims to be abnormal?"

There is also the opinion: can a failure report be trusted in the first place? A manager of a neighboring department at the time put it this way: "Rather than trusting the testimony of someone who claims 'I am broken', isn't it more correct to have another, healthy party do the recovery? So it is strange that the system where the failure occurred recovers itself with ESTAE or FRR; recovery should be performed from another node." This is not a mistaken view at all, and some systems positioned as successors to the mainframe do seem to be designed that way.

Of course, recovery by another system raises problems of its own. I will not say much about the difficulty of recovery in clustered or distributed systems, since many readers will know it better than I do, but there seems to be no absolute answer. In the end, for "how, and how much, recovery processing should be performed on failure", perhaps the only approach is to compare the technical and monetary costs and benefits and adopt whatever is appropriate.

Summary

I have described the concepts and content of mainframe error handling, adding what Linux does in each area where I could. I hope it is helpful to readers.

References

["Functions and Structure of MVS" Modern Science Company (Impress) ISBN-13: 978-4844372899](https://www.amazon.co.jp/%EF%BC%ADVS%E3%81%AE%E6% A9% 9F% E8% 83% BD% E3% 81% A8% E6% A7% 8B% E9% 80% A0-% E5% 8D% 83% E7% 94% B0-% E6% AD% A3% E5% BD% A6 / dp / 4844372890)
