[PYTHON] Answer to Splash memory overuse problem (draft)

I didn't expect our lab's Advent calendar to make it safely to the end. This time I will write about Splash's memory problem.

Conclusion

**If Splash eats too much memory it will crash on its own, so let it restart automatically**

What is this

I previously wrote an article about how good Scrapy is. But Scrapy has one problem: on its own, it does not execute JavaScript. A tool called Splash is often used with Scrapy when scraping pages that rely on JavaScript.

Splash is a server that renders JavaScript. By calling its web API, you can get the source of a given page after its JavaScript has run. It is written in Python 3 and seems to use Twisted and QT5.
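As a rough illustration of the API access described above, here is a minimal sketch using only the standard library. It assumes Splash is listening on localhost:8050 and goes through the render.html endpoint with its url and wait parameters; the function names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable on localhost:8050.
SPLASH_URL = "http://localhost:8050"


def build_render_url(target_url, wait=0.5, splash_url=SPLASH_URL):
    """Build the URL for Splash's render.html endpoint (hypothetical helper)."""
    # "url" is the page to render, "wait" is how many seconds to let scripts run.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5, timeout=60):
    """Return the page source after Splash has executed its JavaScript."""
    with urlopen(build_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With Splash running, calling fetch_rendered_html("https://example.com") should return the rendered HTML. In a Scrapy project you would more likely go through the scrapy-splash plugin than call the API by hand.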

Problem

As convenient as Splash is, it has one big problem: it consumes a lot of memory. The figure below is a graph of Splash's memory consumption.

[Image from Gyazo: graph of Splash's memory consumption]

The more requests you make, the more memory it consumes. At this rate, no matter how much memory you install, it will be used up in the blink of an eye. There is a similar issue on GitHub, but it seems that, because of how Python works, the memory cannot be freed. https://github.com/scrapinghub/splash/issues/674

Solution

I have not touched the underlying Python problem. If anyone knows a fix, please let me know. As mentioned in the issue above, the only way to free the memory is to shut Splash down once. However, you cannot keep shutting it down by hand, so it has to happen automatically. Writing a script that restarts it every minute with cron is one option, but it is a hassle. So what should you do in this case?

**Wait until it crashes from lack of memory, and let it restart automatically**

It's not a very smart solution, but it's the easiest.

What to do

This targets the situation where Splash runs as a Docker container managed with docker-compose.

First, set an upper limit on the memory that Splash's Docker container can use. In docker-compose, use file format version 2, because mem_limit is no longer available in version 3. Also add restart: always so that the container restarts automatically when it crashes.

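version: "2"  # mem_limit is not available in version 3
services:
    splash:
        image: "scrapinghub/splash:3.3"
        ports:
            - "8050:8050"
        # Upper limit on the memory the container may use
        mem_limit: 2g
        # Restart the container automatically whenever it goes down
        restart: always
        command: --disable-browser-caches --maxrss 4000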

Now Splash cannot eat more memory than the limit you set.
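One side effect of this setup is that Splash is briefly unreachable each time the container restarts. The following is a minimal sketch of how a client could wait for it to come back, assuming Splash is on localhost:8050 and answers on its /_ping endpoint with a JSON body. The function name and the opener and sleep parameters are hypothetical and exist only to keep the sketch testable.

```python
import json
import time
from urllib.request import urlopen


def wait_for_splash(base_url="http://localhost:8050", attempts=10, delay=3.0,
                    opener=urlopen, sleep=time.sleep):
    """Poll Splash's /_ping endpoint until it answers or attempts run out.

    Returns the decoded JSON reply on success, or None if Splash never
    came back within the given number of attempts.
    """
    for attempt in range(attempts):
        try:
            with opener(f"{base_url}/_ping", timeout=5) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except OSError:
            # Connection refused, reset, or timed out: the container is
            # probably still restarting, so wait before the next attempt.
            if attempt < attempts - 1:
                sleep(delay)
    return None
```

A scraper could call wait_for_splash() after a failed request and retry the request only once it returns something other than None.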

In closing

As I said earlier, it's not a very smart solution. Please let me know if there is a smarter and easier way to do it.
