An HTTP split downloader made with Python

Motivation

-- I signed up for a 1 Gbps fiber line at home, but when I tried to download a large file (such as a Linux ISO) over HTTP, I did not get anywhere near that speed.

Why?

-- There is a limit to the throughput that a single TCP connection can achieve.
-- On Linux, you can check the ~~TCP receive window size~~ (to be exact, the kernel receive buffer size) as follows:

$ cat /proc/sys/net/ipv4/tcp_rmem
4096    87380   6291456

The values are, from left to right, [min, default, max] in bytes.

With a window of win bytes, the maximum throughput is therefore

T_{max} = win / RTT

Therefore, for a single TCP connection to a server with RTT = 30 ms in the default state, ~~the throughput would be only about 87380 [byte] * 8 / 30 [ms] ≒ 23.3 [Mbps]~~. In fact, **if there is no congestion**, the window size keeps growing, so it will be faster than that.

Of course, TCP has a window scale option to support high-bandwidth networks. With it, the window size can be extended up to about 1 GB (whether such sizes are actually used is unclear).

In theory, bundling N connections multiplies the available throughput by N.
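To make the arithmetic concrete, here is a minimal sketch of the T_{max} = win / RTT bound. The function names are made up for illustration, and the calculation ignores congestion control, packet loss, and header overhead.

```python
def max_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound on throughput for one connection: window / RTT, in Mbps."""
    return window_bytes * 8 / rtt_seconds / 1e6


def required_window_bytes(target_bps, rtt_seconds):
    """Window size needed to sustain target_bps at the given RTT."""
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.030  # 30 ms, the RTT used in the text
    for label, window in (("default", 87380), ("max", 6291456)):
        print(label, round(max_throughput_mbps(window, rtt), 1), "Mbps")
    # Window needed to fill a 1 Gbps line at this RTT
    print(required_window_bytes(1e9, rtt), "bytes")
```

With the default buffer this gives roughly the 23.3 Mbps figure above, while filling a 1 Gbps line at 30 ms would need a window of several megabytes.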

Existing tools

wget and curl are well-known command-line download tools.
There are also tools such as aria2 that can use multiple connections (see the Qiita article "Use explosive downloader aria2, which is several times faster than curl and wget").
When establishing multiple connections, take care not to overload the server.

Building an HTTP client for this purpose

HTTP supports range requests (RFC 7233, "HTTP/1.1: Range Requests").
The implementation is in Python, since it is popular at the moment.

About Range Request

-- For example, suppose you request a 1000-byte file.
-- If you send a GET request with 'Range: bytes=0-499' in the header,
-- the server adds 'Content-Range: bytes 0-499/1000' to the response header and returns only the first 500 bytes of the file in the body.
-- The status code is '206 Partial Content'.

However, in some cases the server does not accept Range headers.

Use this feature to request different parts of a file over multiple TCP connections at the same time.
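As a rough sketch of how a file could be carved into range requests, the helpers below compute inclusive byte ranges, build the Range header, and parse a Content-Range value. The function names are hypothetical and are not taken from rangedl; the parser assumes the total size is given as a number (a server may also send `*` for an unknown size, which this sketch does not handle).

```python
def split_ranges(file_size, part_size):
    """Return inclusive (first, last) byte offsets that cover the whole file."""
    return [(start, min(start + part_size, file_size) - 1)
            for start in range(0, file_size, part_size)]


def range_header(first, last):
    """Build the request header for one part."""
    return {"Range": f"bytes={first}-{last}"}


def parse_content_range(value):
    """Parse a value such as 'bytes 0-499/1000' into (first, last, total)."""
    unit, _, rest = value.partition(" ")
    if unit != "bytes":
        raise ValueError(f"unsupported range unit: {unit!r}")
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), int(total)


if __name__ == "__main__":
    for first, last in split_ranges(1000, 500):
        print(range_header(first, last))
```

Each (first, last) pair can then be sent on a different connection, and the Content-Range in each response tells the client where that body belongs in the file.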

Multiplexing

Python has a module called selectors (in the standard library!) that wraps the select system call at a higher level. See "18.4. selectors: High-level I/O multiplexing" in the Python 3.6.1 documentation.
It monitors multiple sockets and multiplexes them.
Use it like this:

# Imagine connections to two TCP echo servers, A and B
import selectors
import socket

# address_A and address_B are (host, port) tuples (definitions omitted)
sel = selectors.DefaultSelector()
sock_A = socket.create_connection(address_A)
sock_B = socket.create_connection(address_B)

# Watch both sockets for incoming data
sel.register(sock_A, selectors.EVENT_READ)
sel.register(sock_B, selectors.EVENT_READ)

sock_A.sendall('Hello'.encode())  # send something to A
sock_B.sendall('Hello'.encode())  # send something to B

while True:
    events = sel.select()  # blocks until at least one socket is readable
    for key, mask in events:
        message = key.fileobj.recv(512)
        print(message.decode())
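For trying this out without real echo servers, the following sketch uses socket.socketpair() to create local connected sockets in place of servers A and B. The function name collect_messages and the use of the `data` argument to label each socket are choices made for this illustration only.

```python
import selectors
import socket


def collect_messages(messages):
    """Deliver each payload over its own socket pair and read them via a selector."""
    sel = selectors.DefaultSelector()
    pairs = []
    for name, payload in messages.items():
        local, remote = socket.socketpair()
        # Attach the name so we know which connection became readable
        sel.register(local, selectors.EVENT_READ, data=name)
        remote.sendall(payload)  # the peer end stands in for a server reply
        pairs.append((local, remote))

    received = {}
    while len(received) < len(messages):
        for key, _mask in sel.select(timeout=1.0):
            received[key.data] = key.fileobj.recv(512)

    for local, remote in pairs:
        sel.unregister(local)
        local.close()
        remote.close()
    sel.close()
    return received


if __name__ == "__main__":
    print(collect_messages({"A": b"Hello A", "B": b"Hello B"}))
```

The single loop handles whichever socket is ready first, which is the property the downloader relies on when many connections deliver data at different rates.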

Key points

-- The pieces of the file that come back cannot all be held in memory, so they are written to the file sequentially, starting from wherever the order is contiguous.
-- Continuing to use a poorly performing TCP connection is not a good choice, so each connection is evaluated; poorly performing ones are discarded, replaced with new connections, and their requests are resent.

Rough flow

  1. Send an HTTP HEAD request to check the file size (using an existing HTTP library here)
  2. Determine the total number of parts and the part size, and establish the connections
  3. Send the initial requests
  4. Monitor the sockets with the selectors module mentioned above, read from each socket as it becomes readable, and append the data to that socket's primary buffer.
  5. When a primary buffer holds enough data to be processed as an HTTP response, split it into a header and a body.
  6. Identify from the header which part of the file the response corresponds to, and move the body from the primary buffer to the secondary buffer
  7. Write pieces from the front of the secondary buffer to the file in order, then delete them from the secondary buffer
  8. Update the evaluation value of each connection, discard connections judged to perform poorly, and resend their requests on newly established connections.
  9. Repeat steps 4-8 until the entire file is complete
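Steps 5 to 7 can be sketched roughly as follows. This is not the rangedl code: the names split_response, range_start, and InOrderWriter are hypothetical, and the sketch assumes each response carries Content-Length and Content-Range headers and that a buffer holds at most one response.

```python
import io


def split_response(raw):
    """Return (headers, body) once a full response is buffered, else None."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None  # header block not fully received yet
    headers = {}
    for line in head.decode("iso-8859-1").split("\r\n")[1:]:  # skip status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if len(body) < int(headers.get("content-length", 0)):
        return None  # body still incomplete
    return headers, body


def range_start(headers):
    """Extract the first byte offset from 'bytes 500-999/1000'."""
    span = headers["content-range"].split()[1]
    return int(span.split("-")[0])


class InOrderWriter:
    """Secondary buffer: holds out-of-order parts, writes contiguous ones."""

    def __init__(self, out):
        self.out = out
        self.next_offset = 0
        self.pending = {}  # start offset -> body bytes

    def add(self, start, body):
        self.pending[start] = body
        # Flush every part that is now contiguous with what was written
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.out.write(chunk)
            self.next_offset += len(chunk)


def demo():
    """Feed two responses in reverse order and return what was written."""
    out = io.BytesIO()
    writer = InOrderWriter(out)
    late = (b"HTTP/1.1 206 Partial Content\r\n"
            b"Content-Range: bytes 5-9/10\r\nContent-Length: 5\r\n\r\nworld")
    early = (b"HTTP/1.1 206 Partial Content\r\n"
             b"Content-Range: bytes 0-4/10\r\nContent-Length: 5\r\n\r\nhello")
    for raw in (late, early):
        headers, body = split_response(raw)
        writer.add(range_start(headers), body)
    return out.getvalue()
```

Keying the secondary buffer by start offset keeps memory bounded by the parts that arrived early, since anything contiguous is written out and dropped immediately.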

Implementation

https://github.com/johejo/rangedl
There are still some bugs.

How to use

Environment: Python 3.6.1

$ pip install git+http://github.com/johejo/rangedl.git
$ rangedl [URL] -n [NUM_OF_CONNECTION] -s [SPLIT_SIZE_MB]

-- By default, tqdm displays a progress bar. With the -p option, the progress bar is not displayed.
-- As a safety measure, the number of connections cannot exceed 10.
-- If the split size specified by the option is smaller than 'File size / Number of connections', the split size is forcibly set to 'File size / Number of connections'.

Results

-- Depending on line conditions, I was able to download at about 200 Mbps.
-- With split_size set to 1 MB, memory usage is about 30-80 MB. The high CPU usage may be unavoidable...
