[LINUX] Troubleshooting techniques from (almost) nothing

This is the day 16 article of the Defense Against Dark Magic Advent Calendar 2019.


In some cases, the progress of deployment automation, container-based virtualization, and microservices has unintentionally made troubleshooting harder.

- Daily development work involves almost no awareness of application settings or access routes
- The container is so lightweight that it contains neither ps nor netstat
- Infrastructure as Code (no documentation exists)
- There is a log management platform, but it does not help, because what you want to check is something other than the logs themselves, such as whether the logs you want are being collected at all

This article discusses **how to proceed with troubleshooting in these cases, even if you are not familiar with the target system**.


You have obtained the information needed to log in to a Linux server over SSH, and the login succeeded. You can even switch to the root user! But **you know nothing beyond the login information**.

[email protected]:~ $ sudo su -
[email protected]:~ # eixt
-bash: eixt: command not found
[email protected]:~ # exit
[email protected]:~ $

Meanwhile, you have received a vague inquiry e-mail saying "Accessing a web page causes an error", and you are being asked to respond.

Let's move on.

Troubleshooting techniques

The only way to troubleshoot from scratch is to search and investigate.

Find an application

**ps** (neither docker ps nor pstree)

Use the ps command to look for the processes that are currently running (if there are many processes, the top command also works).

## Display all processes, BSD syntax, user-oriented format
$ ps aux
root      3052  3.2 19.0 1122908 89636 ?       Ssl  02:24   0:03 /usr/bin/dockerd --default-ulimit nofile=1024:4096
root      3086  0.4  4.5 1004052 21332 ?       Ssl  02:24   0:00 docker-containerd --config /var/run/docker/containerd/containerd.toml
root      3865  0.0  1.1  10632  5332 ?        Ss   02:25   0:00 nginx: master process nginx -g daemon off;
101       3914  0.0  0.5  11088  2588 ?        S    02:25   0:00 nginx: worker process

nginx may be running in a Docker container. Let's check.

The best way to check is to use docker ps or docker top CONTAINER. However, you can also display the process tree with the f (--forest) option of the ps command and check the parent-child relationships.

## Tree display of all processes, BSD syntax, user-oriented format
$ ps auxf
root      3052  0.4 16.7 1122908 78684 ?       Ssl  02:24   0:05 /usr/bin/dockerd --default-ulimit nofile=1024:4096
root      3086  0.4  4.8 1005108 22660 ?       Ssl  02:24   0:04  \_ docker-containerd --config /var/run/docker/containerd/containerd.toml
root      3829  0.0  0.8   7512  4184 ?        Sl   02:25   0:00  |   \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/contain
root      3865  0.0  1.1  10632  5332 ?        Ss   02:25   0:00  |       \_ nginx: master process nginx -g daemon off;
101       3914  0.0  0.5  11088  2588 ?        S    02:25   0:00  |           \_ nginx: worker process

It's running on Docker!
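
If ps itself is missing, the same parent-child relationship can be followed through procfs. The following is a minimal sketch, assuming a Linux /proc with the usual `status` and `comm` files; the function name `ancestry` is my own and not part of any tool.

```shell
# Print the chain of parent processes for a PID using only procfs.
# Usage: ancestry PID   (hypothetical helper name)
ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
        # /proc/PID/comm holds the short command name
        printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm")"
        # The "PPid:" line of /proc/PID/status holds the parent PID
        pid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status")
    done
}
```

Running `ancestry 3865` on the host should list nginx first and then each parent up to PID 1; a containerd shim in that chain suggests the process lives in a container.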

Find configuration files and log files

**File descriptor**

In Linux (POSIX), each process has a list of file descriptors, or in short, a "list of open files". This list can be viewed through procfs (in most cases you must be the same user that started the process, or root).

Let's take a look at the file descriptor of the nginx process.

$ sudo ls -l /proc/3865/fd
total 0
lr-x------ 1 root root 64 Dec 15 02:25 0 -> 'pipe:[16216]'
l-wx------ 1 root root 64 Dec 15 02:25 1 -> 'pipe:[16217]'
l-wx------ 1 root root 64 Dec 15 02:25 2 -> 'pipe:[16218]'
lrwx------ 1 root root 64 Dec 15 02:25 4 -> 'socket:[75443]'
lrwx------ 1 root root 64 Dec 15 02:25 5 -> 'socket:[75444]'
lrwx------ 1 root root 64 Dec 15 02:25 6 -> 'socket:[16370]'

There are only pipes and sockets here, and no regular files (as an aside, Linux treats pipes and sockets as special files).

But don't give up. [File descriptor 0 of a process is standard input (stdin), 1 is standard output (stdout), and 2 is standard error (stderr)](https://linuxjm.osdn.jp/html/LDP_man-pages/man3/stdin.3.html#lbAD). Let's try reading the standard output with cat.

$ sudo cat /proc/3865/fd/1
## (Access the web page here)
xxx.xx.xxx.xxx - - [15/Dec/2019:06:03:51 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" "-"
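## (Press Ctrl+C to stop reading)

The nginx access log appeared. Writing logs to standard output is the conventional way to run a container. Note that while cat is running it competes with the pipe's regular reader, so the lines you capture may be missing from the container's normal log output.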

If the logs are written to a file, you will see the path as follows:

$ sudo ls -l /proc/11178/fd
total 0
lrwx------ 1 root root 64 Dec 15 14:35 0 -> /dev/null
lrwx------ 1 root root 64 Dec 15 14:35 1 -> /dev/null
l-wx------ 1 root root 64 Dec 15 14:35 2 -> /var/log/nginx/error.log
l-wx------ 1 root root 64 Dec 15 14:35 44 -> /var/log/nginx/access.log
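
When a process has many descriptors open, it can help to keep only the entries that point at a path. The following is a minimal sketch, assuming a Linux /proc and the `readlink` command; the helper name `open_files` is hypothetical. For other users' processes it needs to run in a root shell.

```shell
# List the descriptors of a process that point at a filesystem path.
# Usage: open_files PID   (hypothetical helper name)
open_files() {
    for fd in "/proc/$1/fd"/*; do
        # Each entry is a symlink to whatever the descriptor refers to
        target=$(readlink "$fd") || continue
        case $target in
            /*) printf '%s -> %s\n' "${fd##*/}" "$target" ;;  # a path
            *)  ;;  # pipe:[...], socket:[...] and similar are skipped
        esac
    done
}
```

Device files such as /dev/null also start with a slash, so they appear in the output alongside log files.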


The location of the config file is not yet known. Since this is nginx, it is *almost* certainly /etc/nginx/, but let's pretend we don't know that.

The strace command can monitor system calls and signals. System calls also include reading and writing files.

$ sudo strace -t -p 3865
strace: Process 3865 attached
05:36:54 rt_sigsuspend([], 8)           = ? ERESTARTNOHAND (To be restarted if no handler)

Monitoring has started (stop it with Ctrl+C). In this state, make nginx read its configuration file again.

nginx reloads the configuration file when it receives the HUP signal. Apache uses the `USR1` signal, and other applications may also be reloadable with a specific signal.

> HUP: changing configuration, keeping up with a changed time zone (only for FreeBSD and Linux), starting new worker processes with a new configuration, graceful shutdown of old worker processes
>
> (Controlling nginx)

You can send the HUP signal with the ominously named kill command.

$ sudo kill -HUP 3865
05:37:25 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=53, si_uid=0} ---
05:37:25 rt_sigreturn({mask=[HUP INT QUIT USR1 USR2 ALRM TERM CHLD WINCH IO]}) = -1 EINTR (Interrupted system call)
05:37:25 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
05:37:25 uname({sysname="Linux", nodename="1b01e2a57209", ...}) = 0
05:37:25 openat(AT_FDCWD, "/etc/nginx/nginx.conf", O_RDONLY) = 8
05:37:25 fstat(8, {st_mode=S_IFREG|0644, st_size=643, ...}) = 0

Something happened in the terminal that was monitoring with strace. The line to focus on is the `openat` system call: you can see that nginx opened the `/etc/nginx/nginx.conf` file.

Now we know the location of the config file. Congratulations!
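
On a busy process the strace output scrolls quickly, so it can be worth filtering it down to the paths that were actually opened. The following is a minimal sketch, assuming the `openat(...) = N` line format shown in the output above; the helper name `opened_paths` is hypothetical.

```shell
# Read strace output on stdin and print each path that was opened
# successfully (calls returning -1 do not match and are dropped).
opened_paths() {
    sed -n 's/.*openat([^,]*, "\([^"]*\)".*) = [0-9][0-9]*$/\1/p'
}
```

strace writes to standard error, so a possible invocation is `sudo strace -t -e trace=openat -p 3865 2>&1 | opened_paths`. Older systems may use the `open` system call instead, in which case the pattern would need adjusting.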

Find another server

Things get more troublesome when the problem is likely to be on another server. You can check the reverse proxy settings of nginx, or you may be able to find other servers with the following methods.


Hosts in the same subnet can be found with an ARP scan, although I won't go into detail. If the contents of the ARP cache are enough, you can check them with the `arp` command.

## Check the cache contents
$ arp -a
ip-172-31-16-1.ap-northeast-1.compute.internal ( at 06:d0:4e:xx:xx:xx [ether] on eth0

## ARP scan (the arp-scan command needs to be installed separately)
### Calculate the ARP scan range (subnet)
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet  netmask  broadcast
### From the above, the subnet (network address/CIDR) is
$ sudo arp-scan
Interface: eth0, datalink type: EN10MB (Ethernet)
Starting arp-scan 1.9.2 with 4096 hosts (http://www.nta-monitor.com/tools-resources/security-tools/arp-scan/)
06:d0:4e:xx:xx:xx (Unknown)
06:4e:7e:xx:xx:xx (Unknown)
06:8b:fe:xx:xx:xx (Unknown)
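
The subnet calculation is a bitwise AND of the address and the netmask, plus a count of the set bits in the mask. The following is a minimal sketch in plain shell arithmetic; the helper name `cidr_of` is hypothetical, and the addresses in the usage note are illustrative values, not the ones from this server.

```shell
# Print "network-address/prefix-length" for an IPv4 address and netmask.
# Usage: cidr_of ADDRESS NETMASK   (hypothetical helper name)
cidr_of() {
    # Split both dotted quads into eight positional parameters
    set -- $(echo "$1 $2" | tr . ' ')
    # Network address: AND each address octet with its mask octet
    net="$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
    # Prefix length: count the 1 bits in the mask
    bits=0
    for octet in "$5" "$6" "$7" "$8"; do
        while [ "$octet" -gt 0 ]; do
            bits=$(( $bits + ($octet & 1) ))
            octet=$(( $octet >> 1 ))
        done
    done
    echo "$net/$bits"
}
```

For example, `cidr_of 172.31.16.10 255.255.240.0` yields a /20 network, which matches the size of a scan that reports 4096 hosts.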


You can use the netstat command to see TCP connections and which TCP/UDP ports are listening.

$ sudo netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0    *               LISTEN      3214/sshd
tcp        0      0        xxx.xx.xxx.xxx:58168    ESTABLISHED 32051/sshd: USERNAME
tcp6       0      0 :::22                   :::*                    LISTEN      3214/sshd
tcp6       0      0 :::80                   :::*                    LISTEN      3003/docker-proxy

This is the result of running the command on the server (outside the container) where nginx was started earlier. There is no sign of nginx, but this is normal: communication with the Docker container is NATed with iptables (nftables).

If you rush to run netstat inside the container, you may get command not found. Stay calm and check from outside the container.

You can check the NAT settings with the `iptables` command. The listing below shows the default filter table; the DNAT rules themselves live in the nat table, which `sudo iptables -t nat -L` displays.
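
When netstat is unavailable, procfs carries the same socket table. The following is a minimal sketch, assuming the /proc/net/tcp layout in which the second column is the hexadecimal local address and port and the fourth column is the state, with 0A meaning LISTEN; the helper name `listening_ports` is hypothetical. It covers IPv4 only; /proc/net/tcp6 uses the same layout for IPv6.

```shell
# Read text in /proc/net/tcp format on stdin and print the listening
# local ports in decimal.   (hypothetical helper name)
listening_ports() {
    # Skip the header, keep state 0A (LISTEN), take the hex port
    awk 'NR > 1 && $4 == "0A" { split($2, a, ":"); print a[2] }' |
    while read -r hex; do
        echo $(( 0x$hex ))   # convert hexadecimal to decimal
    done | sort -un
}
```

From outside the container, `sudo cat /proc/3865/net/tcp | listening_ports` should show the sockets as seen from the network namespace of PID 3865, since /proc/PID/net reflects that process's namespace.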

$ sudo iptables -L
Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere

Chain DOCKER (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             ip-172-17-0-2.ap-northeast-1.compute.internal  tcp dpt:http

TCP connections NATed by iptables / nftables can be found in procfs under `ip_conntrack` or `nf_conntrack`.

$ sudo cat /proc/net/nf_conntrack
ipv4     2 tcp      6 431997 ESTABLISHED src=xxx.xx.xxx.xxx dst= sport=57245 dport=80 src= dst=xxx.xx.xxx.xxx sport=80 dport=57245 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 103 TIME_WAIT src= dst=xx.xx.xx.xxx sport=46684 dport=8080 src=xx.xx.xx.xxx dst= sport=8080 dport=46684 [ASSURED] mark=0 zone=0 use=2

The first line is a connection in which access to TCP port 80 is NATed to the Docker container. The second line is the NAT entry for a connection from inside the Docker container to TCP port 8080 on an external host.

From the second line, we can infer that nginx may be reverse proxying to an external host.
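
With many tracked connections it is easier to reduce the table to the distinct destinations. The following is a minimal sketch, assuming the line format shown above, where the third field is the protocol name and the first `dst=` and `dport=` pair describes the original direction; the helper name `conntrack_dests` is hypothetical, and the addresses in any test input would be placeholders.

```shell
# Read /proc/net/nf_conntrack lines on stdin and print the unique
# "protocol destination:port" of each original direction.
conntrack_dests() {
    awk '{
        dst = ""; dport = ""; have_dst = 0; have_dport = 0
        for (i = 1; i <= NF; i++) {
            # Only the first dst= / dport= on a line is the original direction
            if (!have_dst && $i ~ /^dst=/)     { dst = substr($i, 5); have_dst = 1 }
            if (!have_dport && $i ~ /^dport=/) { dport = substr($i, 7); have_dport = 1 }
        }
        if (have_dst && have_dport) print $3, dst ":" dport
    }' | sort -u
}
```

A possible invocation is `sudo cat /proc/net/nf_conntrack | conntrack_dests`; destinations on unexpected ports are candidates for upstream servers.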

**From consoles such as AWS and GCP** (load balancers)

If the communication destination you found is a load balancer, the servers behind it cannot be discovered from here. Unfortunately, you have no choice but to look at the load balancer settings.

**From everyone's command history**

If you think "someone must have done this before", look for precedent. The command history shown by the history command is saved in .bash_history for bash.

$ sudo sh -c 'cat /home/*/.bash_history' | grep ssh
ssh -p 11122 example.com
ssh [email protected]

Take a closer look at the server

Once you reach a suspicious server, investigate it. In addition to the commands explained so far, the following commands and elements can also be used.


In addition to file descriptors and NAT sessions, the proc file system exposes various other information, such as the full command line a process was started with, including its parameters.

Besides the procfs man page, the article "linux procfs thorough introduction" on the blog "SIer but I want to do technology blog" lets you check each entry together with example output.
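
As an illustration of what procfs offers per process, the following minimal sketch prints the startup command line, working directory, and executable of a PID. It assumes a Linux /proc and the `readlink` command; the helper name `proc_summary` is hypothetical, and reading `cwd` and `exe` of other users' processes needs a root shell.

```shell
# Print basic facts about a process using only procfs.
# Usage: proc_summary PID   (hypothetical helper name)
proc_summary() {
    # cmdline separates arguments with NUL bytes; show them as spaces
    printf 'cmdline: '
    tr '\0' ' ' < "/proc/$1/cmdline"
    printf '\n'
    # cwd and exe are symlinks to the working directory and the binary
    printf 'cwd:     %s\n' "$(readlink "/proc/$1/cwd")"
    printf 'exe:     %s\n' "$(readlink "/proc/$1/exe")"
}
```

The working directory is often a good hint for where an application's relative config and log paths resolve.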


The system logs collected by syslogd (rsyslogd) contain various records. It is rare, but when the OOM Killer, which kills processes when the system runs out of memory, is triggered, it leaves a record (for the OOM Killer, see The OOM CTF and [Out Of Memory Management](https://www.kernel.org/doc/gorman/html/understand/understand016.html)).

Dec 15 10:51:07 ip-172-31-24-219 kernel: Out of memory: Kill process 6359 (bash) score 635 or sacrifice child
Dec 15 10:51:07 ip-172-31-24-219 kernel: Killed process 6360 (yes) total-vm:114636kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
Dec 15 10:51:07 ip-172-31-24-219 kernel: oom_reaper: reaped process 6360 (yes), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
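
To pull only the victims out of a long log, a small filter is enough. The following is a minimal sketch, assuming the `Killed process PID (name)` wording shown above, which can differ between kernel versions; the helper name `oom_victims` is hypothetical.

```shell
# Read kernel log lines on stdin and print "PID name" for every
# process the OOM Killer reported as killed.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

A possible invocation is `sudo dmesg | oom_victims`; the same filter works on whichever syslog file your distribution writes kernel messages to.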


Although they partially overlap with what has been covered so far, these are the major commands that output memory, CPU, I/O, and other information. I won't explain them here, but it's worth remembering that ps -L can also display LWPs (threads).

## PID: process ID, LWP: LWP (thread) ID, NLWP: number of LWPs (threads)
$ ps -efL
root      3564     1  3564  0    8 02:24 ?        00:00:00 /usr/libexec/amazon-ecs-init start
root      3564     1  3567  0    8 02:24 ?        00:00:00 /usr/libexec/amazon-ecs-init start
root      3564     1  3568  0    8 02:24 ?        00:00:00 /usr/libexec/amazon-ecs-init start


With the methods described so far, you can now find servers, applications, settings, and log files, and inspect the state of a server.

You will be able to respond to inquiries such as "Accessing a web page causes an error".

unicorn_err.log:E, [2019-12-15T21:18:43.339882 #10627] ERROR -- : worker=4 PID:14246 timeout (98s > 60s), killing
unicorn_err.log:E, [2019-12-15T21:18:43.339910 #10627] ERROR -- : worker=5 PID:14254 timeout (80s > 60s), killing

Wow, requests are taking more than 60 seconds to process!?

(Not to be continued)


The commands and elements used this time are as follows.

- Processes, services
