Working around the NAT gateway idle timeout (350 seconds) in ECS Fargate/Python

This is the day-20 article of the iRidge Advent Calendar 2020.

Posted by @orfx, a server-side engineer.

TL;DR

AWS NAT gateways drop connections that have been idle for more than 350 seconds.

Internet connection is interrupted after 350 seconds

Problem

The instance can access the internet, but the connection is dropped after 350 seconds of inactivity.

Cause

If a connection using a NAT gateway is left idle for more than 350 seconds, the connection times out.

Solution

You can initiate additional traffic over the connection so that it is not interrupted. Alternatively, you can enable TCP keepalives on your instance with a value of less than 350 seconds.

Troubleshooting NAT Gateway

Because of this specification, when you call an API that takes a long time to start sending its response after receiving a request, the NAT gateway may time out and drop the connection, leaving the caller waiting for a response that will never arrive.

Normal remedy

As a workaround, adjust TCP keepalive as described in the document to prevent disconnection.

For example, on Linux in an environment where you can change kernel parameters, set the following parameter to a value under 350 seconds (the Linux default is 7,200 seconds):

net.ipv4.tcp_keepalive_time 

Changing the following related kernel parameters also lets you adjust how long it takes to detect a dead connection when communication fails for some reason mid-transfer:

net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes
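As a sanity check on how the three parameters interact, the timing works out as simple arithmetic. The sketch below uses the 349/10/3 values that appear later in this article:

```python
# How the three TCP keepalive kernel parameters interact.
# The values mirror the settings used later in this article.
tcp_keepalive_time = 349    # idle seconds before the first probe is sent
tcp_keepalive_intvl = 10    # seconds between unanswered probes
tcp_keepalive_probes = 3    # unanswered probes before the peer is declared dead

# The first probe must fire before the NAT gateway's 350-second idle timeout.
assert tcp_keepalive_time < 350

# Worst-case time to detect a dead peer after the connection goes idle:
detection_time = tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
print(detection_time)  # 379
```

As long as tcp_keepalive_time is below 350, a probe packet refreshes the NAT gateway's idle timer before the connection is dropped.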

What to do with ECS Fargate

In ECS Fargate environments, as of December 2020, changing kernel parameters from the task definition is not supported (https://docs.aws.amazon.com/ja_jp/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_systemcontrols).

TCP keepalive therefore has to be configured from within the application that performs the communication.

Python socket

For communication using the Python standard library's socket, keepalive can be configured with socket.setsockopt().

As in the example below, pass the option corresponding to the target kernel parameter as the second argument and the desired value as the third.

import socket

s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 349)  # tcp_keepalive_time
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # tcp_keepalive_intvl
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # tcp_keepalive_probes
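To confirm the options actually took effect, you can read them back with socket.getsockopt(). A minimal sketch (note that TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are Linux-specific constants, so this only runs as-is on Linux):

```python
import socket

s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 349)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)

# Read each option back to verify it was applied to this socket.
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) == 1
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE) == 349
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL) == 10
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT) == 3
s.close()
```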

Python requests

The OSS library requests is built as a wrapper around another OSS library, urllib3, and by adding entries to HTTPConnection.default_socket_options you can configure TCP keepalive for communication that goes through requests as well.

urllib3 also calls sock.setsockopt() internally, so the options you add are the same tuples as in the socket example.

import socket

import requests
from urllib3.connection import HTTPConnection

HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 349),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
]
response = requests.get("http://google.com/")
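Note that mutating HTTPConnection.default_socket_options affects every urllib3 connection in the process. If you would rather limit the effect to a single Session, urllib3 accepts a socket_options pool keyword that can be passed through a custom HTTPAdapter. This is a sketch of that alternative; the KeepAliveAdapter name is my own, not part of either library:

```python
import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection

class KeepAliveAdapter(HTTPAdapter):
    """Hypothetical adapter that applies keepalive options to one Session only."""

    def init_poolmanager(self, *args, **kwargs):
        # socket_options is forwarded by urllib3 to each new connection.
        kwargs["socket_options"] = HTTPConnection.default_socket_options + [
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
            (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 349),
            (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),
            (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
        ]
        super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("http://", KeepAliveAdapter())
session.mount("https://", KeepAliveAdapter())
# session.get(...) now uses the keepalive options; other code is unaffected.
```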

Investigation history

From here on is a digression for readers who came just for the workaround: I would like to describe how this investigation unfolded.

It started with a report that a 504 Gateway Time-out occurred when downloading a CSV file from the admin screen of one of our web services.

This admin screen is a Python application running on ECS Fargate, which fetches the data needed to generate CSV files from a separate backend API.

(Architecture diagram)

The request was recorded in each component's logs as follows:

Log                     | Response time | Response code | Remarks
ALB1 access log         | -             | ELB 504       | Disconnected by idle timeout (set value: 1,200 seconds)
Admin screen nginx log  | 1200.000 sec  | HTTP 499      | Disconnected by idle timeout (set value: 1,200 seconds)
ALB2 access log         | 363.627 sec   | HTTP 200      |

These records looked strange to me, since I did not yet know the NAT gateway's specification: the 504 Gateway Time-out was returned by the ALB after its idle timeout, yet the backend API had sent its response normally in about six minutes.

Since the difficulty of an investigation like this depends greatly on whether the problem can be reproduced in a verification environment, we first looked for a way to reproduce it. We found that requests whose CSV download took more than about 360 seconds reproduced the problem with high probability. However, I could not find any middleware or infrastructure component configured with a threshold of around 360 seconds.

Next, I turned my suspicion to the application-side implementation and began verifying in a local environment, where, unfortunately, the same phenomenon reproduced. It later turned out that this was caused by a similar setting that happened to exist in the local network, but at the time I misread it as a problem in the application code and dug deeper into the libraries used for communication. Along the way I learned about the internal structure of requests, VS Code's powerful and feature-rich Python debugger, and the fact that the problem could not be reproduced with hyper or curl. Since the cause was not in the application after all, the investigation naturally stalled.

Eventually I returned to the network-side investigation, began examining traffic at the packet level with tcpdump, and on a hunch turned my attention to the NAT gateway, where I found the official AWS document describing the idle timeout specification. It was a roundabout path to the answer.

Choosing a workaround

The AWS document I found also described the TCP keepalive workaround, but I already knew from an earlier investigation that ECS Fargate, which runs the admin screen, does not support changing kernel parameters directly from the task definition. I therefore had to find another way to enable TCP keepalive, and the knowledge gained during the investigation helped me find a simple way to configure it in requests, the library we were already using for requests to the backend API.

To verify that the TCP keepalive settings took effect in the ECS Fargate environment, I ran tcpdump on an EC2 server that mimicked the backend API and confirmed that TCP Keep-Alive packets were sent from the application at the intervals matching the configured values.

(tcpdump capture showing TCP Keep-Alive packets)

Finally

After applying this fix, the phenomenon no longer occurred and the problem was solved. In the end, the cause was simply a specification of an AWS managed service and the countermeasure was a small change, but getting there took many cycles of observation, reasoning, hypothesis, verification, and review. An unlucky coincidence led me to a wrong hypothesis this time, so I could not take the shortest path to the solution, but I believe the knowledge gained on the detour is not wasted and will be useful someday, and I intend to keep tackling problems in this way.
