Background

This may be a sudden question, but how do you manage the jobs you run on a regular basis? Even in 2018, with serverless in fashion, some of you are probably still using good old cron.

In this article, I suggest a way to replace a server that runs batch jobs under cron with zero downtime. I also explain why it works by taking a quick look at the cron source code. (To keep things short, I only cover cronie on CentOS 7.)

In real operations you should design a proper job management system; what this article introduces is a crude, hands-on method.

TL;DR

Prepare a new server with crond stopped.
Once the last batch has been kicked, stop crond on the old server and start crond on the new server.
Wait for the last batch to complete, then shut down the old server completely.

What is cron?

cron is a time-based job scheduler used on UNIX-like operating systems, well suited to scheduling tasks that run at regular intervals. It is used both for system administration and for running tasks of production services.

For example, if you register the following job with crontab -e, the aggregation batch runs at minute 0 of every hour.

0 * * * * /bin/bash -l -c '/home/vagrant/bin/execute_hourly_aggregation'

For more information, see the Wikipedia article below and the documents listed in the references.

https://ja.wikipedia.org/wiki/Crontab

Prerequisites

The proposed method aims at a zero-downtime replacement in an environment where some batch is almost always running, for example because individual batches take a long time or because there are many of them.

Note, however, that the proposed method cannot achieve strictly zero downtime on a server where new jobs are kicked every minute.

Environment

I explain the method in the following environment, set up with Vagrant. The installed version differs slightly from the source code discussed later, but there are no major changes, so the mechanism should still be clear.

[vagrant@localhost ~]$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[vagrant@localhost ~]$ yum list installed | grep cron
Failed to set locale, defaulting to C
cronie.x86_64                   1.4.11-19.el7                   @anaconda
cronie-anacron.x86_64           1.4.11-19.el7                   @anaconda
crontabs.noarch                 1.11-6.20121102git.el7          @anaconda

Procedure

1. Preparing a new server with crond stopped

First, prepare a new server. It can have the same file layout as the old server, but make sure the following three conditions are satisfied.

crond on the old server is running.
The crontab on the new server has the same settings as the one on the old server.
crond on the new server is stopped.

For example, on CentOS 7 you can stop crond with the service command.

[vagrant@localhost ~]$ sudo service crond stop
Redirecting to /bin/systemctl stop crond.service
[vagrant@localhost ~]$ service crond status
Redirecting to /bin/systemctl status crond.service
● crond.service - Command Scheduler
   Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2018-09-11 17:44:33 UTC; 11s ago
  Process: 3406 ExecStart=/usr/sbin/crond -n $CRONDARGS (code=exited, status=0/SUCCESS)
 Main PID: 3406 (code=exited, status=0/SUCCESS)
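
The condition that both crontabs have the same settings can be checked mechanically before the switch. The following is a minimal sketch, assuming each crontab has already been dumped to a file with crontab -l. The function names normalize_crontab and crontabs_match are made up for this illustration and are not part of cron; comments and blank lines are ignored in the comparison.

```shell
# Hypothetical helpers (not part of cron): compare two crontab dumps.

# Print a crontab file without comments, blank lines, or
# leading/trailing whitespace.
normalize_crontab() {
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^#/d' \
        -e '/^$/d' "$1"
}

# Succeed (exit status 0) only if both files contain the same entries
# in the same order.
crontabs_match() {
    [ "$(normalize_crontab "$1")" = "$(normalize_crontab "$2")" ]
}
```

For example, after saving the output of crontab -l from each server to old.cron and new.cron (file names chosen here), crontabs_match old.cron new.cron && echo "same settings" reports whether they agree.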

2. Wait until the last job has been kicked

Wait for a moment when cron will not kick any more jobs for a while. What matters here is not when a job completes but when jobs stop being kicked. Check the processes under crond with ps aux f or similar, and confirm that the last batch on the old server has been kicked.

root      3492  0.0  0.3  26096  1704 ?        Ss   17:45   0:00 /usr/sbin/crond -n
root      3921  0.0  0.4  82144  2488 ?        S    18:05   0:00  \_ /usr/sbin/CROND -n
vagrant   3924  0.0  0.3  12992  1568 ?        Ss   18:05   0:00  |   \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant   3937  0.0  0.0   7764   352 ?        S    18:05   0:00  |   |   \_ sleep 120
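
If you would rather not read the process tree by eye, the same check can be scripted. This is only a sketch under one assumption: the per-job runner appears in ps as /usr/sbin/CROND, as in the listing above. The function name running_cron_jobs is invented for this example. It reads the output of ps -eo pid,ppid,args on standard input and prints the PID of every job whose parent is a CROND process.

```shell
# Hypothetical helper: list PIDs of jobs that run directly under a
# /usr/sbin/CROND process. Input: output of `ps -eo pid,ppid,args`.
running_cron_jobs() {
    awk '
        NR == 1 && $1 == "PID" { next }       # skip the ps header line
        {
            pid[NR]  = $1
            ppid[NR] = $2
            # remember every per-job runner process
            if ($3 == "/usr/sbin/CROND") runner[$1] = 1
        }
        END {
            # second pass, so the order of the ps lines does not matter
            for (i = 1; i <= NR; i++)
                if (ppid[i] in runner) print pid[i]
        }
    '
}

# Usage on the old server:
#   ps -eo pid,ppid,args | running_cron_jobs
```

An empty result means no job kicked by cron is currently running.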

3. Stop `crond` on the old server and start `crond` on the new server

Now for the actual replacement. Stop crond on the old server, then start crond on the new server. From this point on, the old server no longer kicks new batches.
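
The two commands can be wrapped in a small script so that they run back to back. The sketch below assumes both servers are reachable over ssh and that the login user may run sudo systemctl; the host names old-batch and new-batch, the function switch_crond, and the DRY_RUN variable are all hypothetical. By default it only prints the commands.

```shell
# Hypothetical helper: stop crond on the old host, then start it on the
# new host. With DRY_RUN=1 (the default) the commands are only printed.
switch_crond() {
    local old_host=$1 new_host=$2
    local dry_run=${DRY_RUN:-1}
    local cmd
    for cmd in "ssh $old_host sudo systemctl stop crond" \
               "ssh $new_host sudo systemctl start crond"; do
        if [ "$dry_run" = "1" ]; then
            echo "$cmd"
        else
            # abort if stopping the old crond fails, so that two crond
            # instances never run side by side
            $cmd || return 1
        fi
    done
}

# Usage:
#   switch_crond old-batch new-batch             # print the commands
#   DRY_RUN=0 switch_crond old-batch new-batch   # actually run them
```

Stopping the old crond first means that, in the worst case, a job is kicked late rather than kicked twice.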

4. Watch the batch process complete on the old server

crond has stopped, but the jobs it kicked are still running, so wait for them to complete. At this point the /usr/sbin/CROND process that runs each job has lost its parent crond, and the job itself still hangs under /usr/sbin/CROND.

root      4199  0.0  0.4  82144  2488 ?        S    18:18   0:00 /usr/sbin/CROND -n
vagrant   4201  0.0  0.3  12992  1564 ?        Ss   18:18   0:00  \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant   4214  0.0  0.0   7764   352 ?        S    18:18   0:00  |   \_ sleep 120
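
Waiting can also be automated by polling the PIDs seen in the ps output. A minimal sketch, assuming you pass in the PIDs of the remaining /usr/sbin/CROND processes by hand; wait_for_pids and POLL_INTERVAL are names invented for this example.

```shell
# Hypothetical helper: block until none of the given PIDs exists any more.
# Run it as root: `kill -0` on a process owned by another user fails with
# a permission error, which this loop would mistake for "already gone".
wait_for_pids() {
    local pid
    for pid in "$@"; do
        # kill -0 sends no signal; it only tests whether the PID exists
        while kill -0 "$pid" 2>/dev/null; do
            sleep "${POLL_INTERVAL:-5}"
        done
    done
}

# Usage (PID taken from the listing above):
#   wait_for_pids 4199 && echo "all batches finished"
```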

5. Stop the old server completely

Once the last batch process has completed, shut the old server down gracefully.

How it works

Why is this possible? Let's read a bit of the cron code and re-examine the state of the processes at runtime.

First, cron repeats the following steps until it receives a SIGINT or SIGTERM signal.

Sleep until the minute changes (about one minute)
Check the time and execute the required jobs

	while (!got_sigintterm) {
		int timeDiff;
		enum timejump wakeupKind;

		/* ... wait for the time (in minutes) to change ... */
		do {
			cron_sleep(timeRunning + 1, &database);
			set_time(FALSE);
		} while (!got_sigintterm && clockTime == timeRunning);
		if (got_sigintterm)
			break;
		timeRunning = clockTime;
		/* ... omitted ... */
		handle_signals(&database);
	}

https://github.com/cronie-crond/cronie/blob/40b7164227a17058afb4f3d837ebb3263943e2e6/src/cron.c#L354-L481

In other words, a new batch can only be started when this check runs, roughly once a minute.

A new batch is found by find_jobs() and runs as a grandchild process of the original crond, via job_runqueue(), do_command(), and child_process().
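
One practical consequence: since jobs are only kicked when the minute changes, the stop and start of crond should both fit between two minute boundaries. The sketch below is not cronie code; it is a small illustrative helper (the name seconds_until_next_minute is made up) that reports how many seconds remain before the next boundary, so you can decide whether to switch now or wait.

```shell
# Hypothetical helper: seconds left until the wall clock reaches the next
# full minute. Takes an epoch timestamp, defaulting to the current time.
seconds_until_next_minute() {
    local now=${1:-$(date +%s)}
    echo $(( 60 - now % 60 ))
}

# Examples:
#   seconds_until_next_minute 125   # prints 55
#   seconds_until_next_minute       # uses the current time
```

If the value is small, waiting for the boundary to pass before switching reduces the chance that a job scheduled for that minute is missed.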

root      3492  0.0  0.3  26096  1704 ?        Ss   17:45   0:00 /usr/sbin/crond -n
root      3921  0.0  0.4  82144  2488 ?        S    18:05   0:00  \_ /usr/sbin/CROND -n
vagrant   3924  0.0  0.3  12992  1568 ?        Ss   18:05   0:00  |   \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant   3937  0.0  0.0   7764   352 ?        S    18:05   0:00  |   |   \_ sleep 120

So what happens if you stop crond with SIGTERM at this point? Let's look at the cleanup code at the end.


#if defined WITH_INOTIFY
	if (inotify_enabled && (EnableClustering != 1))
		set_cron_unwatched(fd);

	if (fd >= 0 && close(fd) < 0)
		log_it("CRON", pid, "INFO", "Inotify close failed", errno);
#endif

	log_it("CRON", pid, "INFO", "Shutting down", 0);

	(void) unlink(_PATH_CRON_PID);

	return 0;

https://github.com/cronie-crond/cronie/blob/40b7164227a17058afb4f3d837ebb3263943e2e6/src/cron.c#L482-L495

Did you notice? The parent neither waits for nor kills its child and grandchild processes; it quietly exits, leaving the child to look after the grandchild.

root      4199  0.0  0.4  82144  2488 ?        S    18:18   0:00 /usr/sbin/CROND -n
vagrant   4201  0.0  0.3  12992  1564 ?        Ss   18:18   0:00  \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant   4214  0.0  0.0   7764   352 ?        S    18:18   0:00  |   \_ sleep 120

As a result, no new jobs are kicked, while jobs that have already been kicked run to completion. (It is a little sad, though.)
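
The same behavior can be reproduced without cron. The following sketch is only an analogy, not cronie itself: a throwaway sh process plays the parent and a sleep plays the job. The function name orphan_demo is invented for this example. It terminates the parent with SIGTERM and then checks whether the child is still alive.

```shell
# Hypothetical demo: a child process keeps running after its parent has
# been terminated with SIGTERM.
orphan_demo() {
    local pidfile parent child
    pidfile=$(mktemp) || return 1

    # The "parent" starts a long-running child in the background, records
    # the child's PID in the temp file, and then simply waits.
    sh -c 'sleep 30 & echo $! > "$1"; wait' _ "$pidfile" >/dev/null 2>&1 &
    parent=$!

    # Wait until the child's PID has been written.
    while [ ! -s "$pidfile" ]; do sleep 0.1; done
    child=$(cat "$pidfile")

    # Terminate only the parent, as stopping crond does.
    kill -TERM "$parent"
    wait "$parent" 2>/dev/null || true

    if kill -0 "$child" 2>/dev/null; then
        echo "child survived"
    else
        echo "child exited"
    fi

    # Clean up the demo processes and the temp file.
    kill "$child" 2>/dev/null
    rm -f "$pidfile"
}
```

Calling orphan_demo should report that the child survived: a signal sent to one PID does not reach that process's children, which is the property the replacement procedure relies on.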

Summary

Because crond is implemented so simply, a zero-downtime replacement is possible even with a crude method like the one proposed here. Of course, in real operations this is something that should be properly systematized, but I hope this has made you a little more curious about the familiar cron.

References

https://rpmfind.net/linux/RPM/centos/7.5.1804/x86_64/Packages/cronie-1.4.11-19.el7.x86_64.html
https://github.com/cronie-crond
https://en.wikipedia.org/wiki/Cron
https://ja.wikipedia.org/wiki/Crontab
