It's been a full month since I started working. Together with a colleague who joined at the same time, I helped release a new feature, and it has been running normally since. At one point, however, a process in the production environment could not finish. I'm writing this down as a memorandum to keep the same thing from happening again.
Simply put, I had fallen into an **infinite loop**. The feature monitored jobs running on Sidekiq to detect duplicated or lost jobs and send alerts to Slack.
The first thing I hit: Sidekiq jobs are constantly transitioning between two states:

- scheduled → running
- running → scheduled

If jobs are fetched and checked at the exact moment of either transition, healthy jobs are frequently flagged as errors. In my local environment I measured a lag of roughly 1 to 5 seconds while a job moved between these states. If the monitoring job runs during that window, a job that is operating normally is still detected as an error.
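To illustrate the race, here is a toy simulation (not the actual monitoring code; the queue name `mailers` is hypothetical). If the snapshot is taken while a job is in flight between the two sets, the job appears in neither:

```ruby
# Simulated state: one job sitting in the scheduled set, none running.
scheduled = ["mailers"]
running   = []

# The job leaves the scheduled set...
job = scheduled.pop
# ...a monitoring snapshot is taken mid-transition...
snapshot = scheduled + running
# ...and only afterwards does the job arrive in the running set.
running << job

# The healthy "mailers" job is invisible to the snapshot,
# so a naive check would report it as lost.
snapshot.include?("mailers") # => false
```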
The countermeasure I took was this: **loop while fetching jobs, sleeping 5 seconds between each of 10 fetches, and only pass the data on to the next step once all 10 fetches return identical data**. This was not a good idea...
In production, the monitored job runs every 30 seconds and finishes in about 10, bouncing between the scheduled and running states at a dizzying pace. Meanwhile, one pass of my check loop takes more than 50 seconds (10 fetches with 5-second sleeps, by design), so the 10 fetched snapshots could never all be identical. The exit condition was never satisfied: an infinite loop.
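A toy simulation of why the exit condition could never fire (hypothetical queue names; the bound on the last `break` is the safety valve the original code lacked). Because the job set changes faster than one pass of the loop, every fetch returns a different snapshot and the "all identical" condition never holds:

```ruby
attempts = 0
loop do
  attempts += 1
  # Simulate 10 fetches that always differ, as in production
  # where jobs cycled through states every ~30 seconds.
  snapshots = Array.new(10) { |i| ["queue_#{attempts}_#{i}"] }
  break if snapshots.uniq.size == 1 # original exit condition: all 10 identical
  break if attempts >= 100          # hard upper limit (missing in the original)
end
attempts # => 100: only the upper limit ever stops the loop
```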
When fixing it, I stopped **looping until all the data matched**. But I also knew that without some screening of the fetched data, Slack would end up flooded with false alerts.
So I set an upper limit on the loop and changed the spec to **pass the most frequently occurring snapshot among the fetched data to the next step**.
monitoring_sidekiq.rb

```ruby
# abridged
def sidekiq_logs_arr_in
  consistency_check_sidekiq_logs = []
  # Previously a while statement with redo; now a hard upper limit of 50 iterations.
  50.times do
    queues = []
    # Sleep to ride out the timing variation between the scheduled and running states
    sleep 2
    # Collect the queue names of jobs currently on Sidekiq (running and scheduled)
    Sidekiq::Workers.new.each { |_process_id, _thread_id, job| queues << job["queue"] }
    Sidekiq::ScheduledSet.new.each { |job| queues << job["queue"] }
    consistency_check_sidekiq_logs << queues.sort
  end
  # Return the snapshot that appeared most often across the 50 fetches
  consistency_check_sidekiq_logs.max_by { |x| consistency_check_sidekiq_logs.count x }
end
# abridged
```
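The final line is a simple majority vote: `max_by` combined with `count` picks the snapshot that occurred most often across the fetches. A standalone illustration with hypothetical queue names:

```ruby
# Three fetched snapshots; the transient extra "mailers" entry appears only once.
snapshots = [["default"], ["default", "mailers"], ["default"]]

# Pick the snapshot that occurs most frequently in the list.
most_common = snapshots.max_by { |s| snapshots.count(s) }
most_common # => ["default"]
```

Because transient state transitions only distort a minority of the fetches, the majority snapshot reflects the steady state without requiring every fetch to agree.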
With this change, the subsequent processing no longer throws exceptions during normal operation, and a consistent log can be obtained even while healthy jobs are in flight.
The first lesson: in loop processing, avoid redoing iterations until some specific condition is met whenever possible. I learned the hard way how important it is to set an upper limit and shape every loop so that it can be escaped under any circumstances.
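As a general pattern (my own sketch, not code from the feature), a bounded retry loop guarantees an exit whether or not the condition is ever met:

```ruby
# Generic bounded-retry shape: returns the first truthy result from the block,
# or nil once the upper limit is reached. The loop always terminates.
def fetch_until_stable(max_attempts: 10)
  max_attempts.times do
    data = yield
    return data if data # condition met: escape early
  end
  nil                   # upper limit reached: escape anyway
end

calls = 0
fetch_until_stable { calls += 1; calls >= 3 ? :stable : nil } # => :stable
```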
I'll be much more careful from now on: an infinite loop in a production environment can do real damage to the company.