Inspired by an Advent Calendar, I would like to reflect on, and repent for, a mistake I made in a production environment.
This happened in my second year in the workforce. At the time, I was in charge of maintaining B2B systems running in a cloud environment. As I recall, the maintenance team had two infrastructure members and three application members. I was one of the infrastructure members, and the other was the team leader.
On that day, I was scheduled to change the settings of some middleware on a server. I had told the system owner that the work would cause no service outage, so we did not switch to a maintenance page or send out a maintenance notice. As usual, I emailed the owner to announce the start of the production work and then began. I proceeded according to the verified procedure.
The setting change itself was completed, and around the time of the operation check I was informed that the service was down... I had no idea what had happened, and I remember my mind going blank. I asked the leader for help, and the leader started investigating. Then the leader said, "The host name is strange."
↓ Host name at that time
^i
He noticed it from the host name shown in the prompt after he logged in to the server. I had not noticed because I never logged out of the server or opened a new prompt during the work.
When we checked the command history, the following command was there...
hostname ^i
As anyone familiar with the command will have realized, I had overwritten the host name. I meant to type "hostname -i" but typed "hostname ^i" by mistake, and I did not realize that this had rewritten the host name. At the time, I did not know that the hostname command can also change the host name.
Command execution example | Description |
---|---|
hostname | Show host name |
hostname -i | Show IP address |
hostname <string> | Change the host name to <string> |
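
To make the third row of the table harder to hit by accident, a confirmation step could go through a small wrapper that refuses anything other than display options. The following is only a minimal sketch: the function names `hostname_args_are_readonly` and `safe_hostname` are hypothetical, and the list of display options (-i, -I, -f, -s, -d) is an assumption that should be checked against the hostname implementation on the actual server.

```shell
# Hypothetical guard around hostname for confirmation work.
# Assumption: -i, -I, -f, -s and -d only display information on this system.

# Succeeds (status 0) only if every argument is a known display option.
# With no arguments it also succeeds, because plain "hostname" only shows the name.
hostname_args_are_readonly() {
    for arg in "$@"; do
        case "$arg" in
            -i|-I|-f|-s|-d) ;;      # display-only options
            *) return 1 ;;          # anything else might be taken as a new host name
        esac
    done
    return 0
}

# Runs the real hostname command only when the arguments pass the check above.
safe_hostname() {
    if hostname_args_are_readonly "$@"; then
        command hostname "$@"
    else
        echo "refused: 'hostname $*' could change the host name" >&2
        return 1
    fi
}
```

With these functions loaded, `safe_hostname -i` behaves like `hostname -i`, while `safe_hostname ^i` prints a refusal and returns a non-zero status without touching the host name.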
A host name changed with the hostname command does not persist across an OS restart (it reverts to the host name from before the change). We therefore restarted the OS, which restored the host name, and the service recovered successfully.
Causes
- I manually typed a command that was not in the procedure manual. The manual listed only the minimum commands and omitted the commands used for confirmation.
- The work was carried out by one person. Because of limited man-hours, work was basically done solo.
Recurrence prevention measures implemented at the time
- Do not execute any command that is not in the procedure manual. Write down every necessary step, including the commands used for confirmation.
- Always copy and paste commands instead of typing them by hand, to eliminate typing mistakes.
- Use commands that cannot affect the system when checking. Avoid, as far as possible, commands that can change settings.
- Always carry out the work with two people and double-check. This makes it possible to notice typos immediately, and sharing the responsibility between two people leaves each with more peace of mind.
What I can think of now
- Automate the work itself. Minimize the room for human intervention to reduce human error.
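
The rule of running only what is in the procedure manual, and the idea of automating the work, could be combined in a small helper that executes a command only if it appears verbatim in the manual. This is a rough sketch under assumptions that are not from the original incident: the manual is a plain text file with one command per line, and the function name `run_from_runbook` is hypothetical.

```shell
# Hypothetical helper: run a command only if it is listed in the procedure manual.
# Assumption: the manual is a text file containing one command per line.
run_from_runbook() {
    runbook=$1
    cmd=$2

    # Refuse empty input so that blank lines in the manual never match.
    [ -n "$cmd" ] || return 1

    # -F: fixed string, -x: whole line must match, -q: no output.
    if grep -Fxq -- "$cmd" "$runbook"; then
        printf 'running: %s\n' "$cmd" >&2
        sh -c "$cmd"
    else
        printf 'refused (not in procedure manual): %s\n' "$cmd" >&2
        return 1
    fi
}
```

For example, if the manual contains the line `hostname -i`, then `run_from_runbook manual.txt 'hostname -i'` runs it, while `run_from_runbook manual.txt 'hostname ^i'` is refused because no line matches exactly. A typo then stops the work instead of reaching the server.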
This has been my misogi (a ritual of purification). Thank you for reading.