[PYTHON] My web service was down for 20 days because of a virus infection. I'm really sorry

The web service that failed is one I operate as an individual.

Service resumed in February 2016, 20 days after the failure, but the number of active users is at 18% of its previous level, with no prospect of recovery yet. The non-redundant server was infected with a virus, and my response afterwards was slow.

This article is about the outage at the end of January 2016, an outage that was bound to happen. I apologize for the inconvenience.

■ Users do not come back [Screenshot 2016-02-22 11.58.15.png]

What kind of virus was it?

I could not confirm it directly, but I believe the server was infected with a virus that launches SYN flood attacks, and that it was attacking other websites. When the infection happened, the server's sshd was also rewritten, and I could no longer connect over ssh. I was horrified when I logged in through the console afterwards and saw the ugly, rewritten `authorized_keys`.

An outage that was bound to happen

Looking back, this outage was bound to happen.

- The firewall was turned off in the settings of the infected server.
- The server was not redundant.
- A simple web server should have been set up during the outage to explain the situation to users. Instead, the site was left unattended and returned a 404 error.
- I was the only engineer in charge of incident response.
- That one engineer traveled to a region with no Internet access (Cuba) immediately after the failure was discovered.
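The "simple web server" mentioned in the list above could look something like the following. This is only a minimal sketch using Python's standard library, not what I actually ran: the notice text, the port, and the function names are assumptions made up for illustration. It answers every GET with 503 and a `Retry-After` header instead of a 404, which tells both users and crawlers that the outage is temporary.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical notice text; replace with the real circumstances of the outage.
NOTICE = "The service is temporarily down because of a server failure. Recovery is in progress."


def build_maintenance_response(notice=NOTICE, retry_after_seconds=3600):
    """Return (status, headers, body) for a temporary-outage reply."""
    body = notice.encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        # Hint for clients and crawlers about when to try again.
        "Retry-After": str(retry_after_seconds),
    }
    return 503, headers, body


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the same outage notice."""

    def do_GET(self):
        status, headers, body = build_maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def run(port=8080):
    """Serve the notice until interrupted (port 8080 is an arbitrary choice)."""
    HTTPServer(("", port), MaintenanceHandler).serve_forever()
```

Calling `run()` on any spare machine and pointing the DNS record or load balancer at it would be enough to show the notice while the real server is rebuilt.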

How do the numbers change when a web service is down for 20 days?

■ Google Search Console data

[Screenshot 2016-02-22 12.28.37.png]

Lessons learned from this failure

I think the lessons from this failure are fundamental to operating a web service. A long site outage irreversibly damages how both existing users and Google evaluate your pages, and the site continues to be treated as unreliable for a long time after it is restored.

Because the impact of a service interruption is so large, you should run two of each server and build a redundant system. In addition, the Internet of 2016 is full of viruses. Firewalls should always be enabled. Go into production only with a server whose security has been thought through.

If possible, I think everyone would be happier with two engineers.

What was happening at the time

I am strongly attached to the services I build, so I used this one every day. Watching the page views keep climbing was fun. Those were days when my need for approval kept being satisfied. Before I knew it, I was checking Google Analytics three times a day. One day, while I was on vacation abroad, I tried to connect to the site over the airport WiFi, but for some reason there was no response. I opened my laptop and tried to connect with ssh, but could not get in.

Wondering what was wrong, I opened my email and found the following message from the company that operates the cloud server.

We have confirmed that the VPS you are using
is generating a high volume of packet traffic.

There is no upper limit on the amount of data transfer, but because this
places a load on the host machine and may affect other users,
we have shut down the VPS in question as an emergency measure.

Assuming the traffic just meant the service had become popular, I restarted the stopped server and wrote a complaint email: if the transfer volume is the problem, shouldn't the response be a bandwidth limit rather than a sudden shutdown?

I got the following reply quite quickly.

We have rechecked the matter you inquired about.
The high packet volume appears to have been traffic that looks like
an attack on external hosts (SYN flood).

If you are not aware of this activity, we would appreciate it if you could
review the security measures on your VPS and identify the cause.

I apologized to the operating company, explained that the outbound attack was not intentional but caused by a virus, and this time stopped the server myself. Since it was infected with a virus, the only option was to rebuild it. Unfortunately, it was time for my flight, so I gave up on rebuilding and went to Cuba! At the time, I figured I could fix it while drinking mojitos on the beach.

Paradise on earth = Cuba

Cuba is famous as a beach resort. Looking around at the passengers in the airport, I saw that everyone was in high spirits. Cuba, a paradise on earth, is safe; medical care is free even for foreign tourists; tuition is free for residents; and under the rationing system almost all daily necessities are handed out for free at distribution stations, so nobody starves.

■ On the beach I could take photos that look like free stock images [image2.JPG]

That's right. Cuba is a **genuine socialist country**. At the distribution station in Havana there was a line about 50 m long.

In addition, as the 7th worst country for censorship, its censorship is extremely strict. In this rigidly controlled society, **all Internet access requires permission**. All of it. Looking back, the only WiFi I found during the whole trip was one encrypted network for airport staff.


In Japan I monitor web services.
In Cuba, a web service monitors me.

Techniques for getting around Cuba

■ Public safety is extremely good, but a pack of stray dogs occasionally patrols the streets, so it is worth leveling up your stealth skill. If they spot you, do what you would do with delinquents late at night: don't make eye contact. [homeless_dog.jpg]

■ Connecting to the Internet: A blog somewhere said that if you stay at a foreigners-only hotel costing over $100 a night, you can use the Internet for about 500 yen per hour. I never found a WiFi cafe. The local young people with time on their hands were watching baseball.

■ If you use a CASA (private lodging), you can stay for about $15 per night. Of course there is no Internet. Because Cuba has no diplomatic relations with the United States, these places are often not listed on American sites. Searching in Italian turned up a lot of results. The odds of getting a place where hot water comes out of the shower were 50% (1 out of 2). Every place was run by a nice lady who was very kind. ↓ Breakfast, an omelet and grilled banana, that the lady brought to my room [image1 (1).JPG]

■ Currency exchange: Only the airport exchange office would exchange Japanese yen. Euros can be exchanged downtown. The rates are almost the same everywhere.

■ English: About 1 in 20 people spoke fluent English, even at tourist spots.

■ Cigars / shopping: At a regular shop they cost over 500 yen each. I bought three from a homeless man for $1, but two of them were damp and would not light.

■ Taxis: Fares can be negotiated, but even after 20 minutes of haggling the taxi did not come down to the local price I had been told at the inn. The bus costs about 30 yen and is extremely cheap. Taxis have no meters, so everything is negotiated in advance. In Old Havana, the biggest tourist destination, the lowest fare is 600 yen for 500 m to 2 km within the city. A long-distance shared four-seater taxi costs 1,700 yen per person for 120 km. That's crazy ... Apparently taxi drivers earn more than doctors.

■ Drinking water: You cannot drink the tap water. 2 L costs 50 yen at a local shop and 250 yen at a shop in a tourist area. A mojito, the cocktail that originated in Cuba, costs around 250 to 350 yen a glass. It is delicious and suits Japanese tastes.

1. Measures to prevent recurrence

I reflected on what happened. Making the same mistake twice would disqualify me as an engineer, so I took every measure I could think of.

1-1. Redundant application server

There was only one app server. With multiple servers, losing one would not have taken the whole service down, so I introduced a load balancer and changed to a configuration with two application servers.

[Screenshot 2016-02-22 16.56.37.png]
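To illustrate why two app servers behind a load balancer survive the loss of one, here is a rough sketch of round-robin selection that skips unhealthy backends. It is not the cloud provider's load balancer, whose internals I do not know; the class name, the backend addresses, and the health-check callable are all hypothetical.

```python
from typing import Callable, List, Optional


class RoundRobinBalancer:
    """Pick backends in turn, skipping any that fail the health check."""

    def __init__(self, backends: List[str], is_healthy: Callable[[str], bool]):
        self._backends = list(backends)
        self._is_healthy = is_healthy  # hypothetical health-check callable
        self._next_index = 0

    def next_backend(self) -> Optional[str]:
        """Return the next healthy backend, or None if all are down."""
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next_index]
            self._next_index = (self._next_index + 1) % len(self._backends)
            if self._is_healthy(candidate):
                return candidate
        return None
```

With two backends, requests alternate between them while both are healthy, and all requests go to the survivor when one dies. Only when both are down does `next_backend()` return `None`, which is the situation the single-server setup was in from the start.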

1-2. Reviewed server security settings

Previously the server had no protection at all. On cloud services that, unlike AWS, do not provide security groups, running without a firewall is too dangerous. My server was infected **within a month**. So I reviewed the configuration and changed it as follows.

No.  Setting
1    Enabled firewalld
2    Opened only the services actually in use in firewalld, as a whitelist
3    Changed the ssh connection port

-CentOS 7 firewalld settings to finish in 5 minutes
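As a rough illustration of settings 2 and 3 in the table above, the sketch below builds the `firewall-cmd` invocations for a whitelist. It only constructs the command lists and does not run them. The service names (`http`, `https`), the port 2222, and the `public` zone are assumptions for the example, not my actual values, and firewalld is assumed to be running already.

```python
from typing import List


def firewalld_whitelist_commands(
    allowed_services: List[str], ssh_port: int, zone: str = "public"
) -> List[List[str]]:
    """Build firewall-cmd calls that open only the listed services
    plus a non-default ssh port, then reload the rules."""
    if not 1 <= ssh_port <= 65535:
        raise ValueError("ssh_port must be between 1 and 65535")
    base = ["firewall-cmd", "--permanent", f"--zone={zone}"]
    commands = [base + [f"--add-service={name}"] for name in allowed_services]
    # Open the new ssh port and close the default ssh service (port 22).
    commands.append(base + [f"--add-port={ssh_port}/tcp"])
    commands.append(base + ["--remove-service=ssh"])
    # Permanent rules take effect only after a reload.
    commands.append(["firewall-cmd", "--reload"])
    return commands
```

For example, `firewalld_whitelist_commands(["http", "https"], 2222)` yields five commands that could be passed one by one to `subprocess.run(cmd, check=True)` as root. The `Port` setting in `sshd_config` has to be changed separately, and doing that before reloading the firewall avoids locking yourself out.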

1-3. Checked the nginx error page settings

I checked and verified the settings so that nginx by itself displays an error page when the HTTP server behind it dies.

1-4. Backed up the machine image

Rebuilding the server took a long time and delayed the recovery, so I took a machine image (snapshot) of each server. Setting up the Python packages with pip took especially long. On Python 3 there are many pitfalls, such as buggy dependencies and a buggy MySQL library.

2. Settings I'm glad I had in place

2-1. Git management of configuration files for each middleware

Because I did not have to repeat the trial and error for each middleware's configuration, I was able to recover quickly.

2-2. I chose a cloud service that can install a load balancer

The load balancer costs 1,000 yen per month and could be installed from the web with the push of a button.

3. Settings I was unsure about and still do not understand well

I don't know how to back up an RDB without stopping it. On AWS there are automatic DB backups, and point-in-time recovery lets you easily roll back to any moment you like. I wonder whether the only alternative is a crude method such as setting up a master/slave configuration, detaching the slave at backup time, taking the backup, and then resuming replication. I don't know the correct answer.

Amazon's RDS, which takes care of the tedious work around an RDB, is wonderful: it is expensive, but it delivers value to match. It even has Multi-AZ. A senior colleague at my company told me so.
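On the question of backing up a running RDB: one commonly used option, assuming the database is MySQL with InnoDB tables (an assumption based on the MySQL library mentioned earlier), is `mysqldump --single-transaction`, which dumps from a consistent snapshot without locking the tables for the whole run. The sketch below only builds the command. The function name and paths are hypothetical, and credentials are assumed to come from an option file such as `~/.my.cnf`. I have not verified this against my own setup, so treat it as one candidate rather than the correct answer.

```python
from typing import List


def mysqldump_command(database: str, output_path: str) -> List[str]:
    """Build a mysqldump call that takes a consistent snapshot of an
    InnoDB database without stopping the server."""
    if not database or not output_path:
        raise ValueError("database and output_path are required")
    return [
        "mysqldump",
        "--single-transaction",  # dump inside one transaction (InnoDB only)
        "--quick",               # stream rows instead of buffering whole tables
        f"--result-file={output_path}",
        database,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from a cron job. Tables using a non-transactional engine such as MyISAM are not covered by the snapshot, and this gives a dump at one moment only, not the point-in-time recovery that RDS offers.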

The web service that failed is one I operate as an individual.

This article is a sequel to "Knowledge when developing and operating a new web service by yourself in 2016". I never expected to write an article like this. The service was a curation service that held no login data, so I did not need to worry about an information leak.

I'm really sorry to all the users I inconvenienced. I could not confirm the facts, but I'm also really sorry to all the web server administrators who were hit by a SYN flood from the server I manage. m (____) m

Last but not least, thank you to the staff of the company that operates the virtual server for their prompt support and advice. Thanks to them, I was able to do the recovery work in good spirits.
