[PYTHON] SLICECAP: Splitting PCAP files for parallel processing

Do you analyze packets?

If you analyze network traffic, you have probably used the libpcap library or, even without using libpcap directly, handled PCAP files captured by tools such as [tcpdump](http://www.tcpdump.org/).

Many traffic analysis tools support the PCAP format, so data kept in this format can be used with a variety of tools later. For general analysis, existing tools are sufficient.

For more advanced analysis, however, you will want to look closely at the PCAP data and at the internal structure of the packets it contains. PCAP data is little more than packets laid out one after another in binary form, so the parts you need have to be extracted and processed by a preprocessing program.

Networks keep getting faster, and even a few minutes of capture can amount to tens of gigabytes of data. Unless the analysis method is already established, the same data will be analyzed repeatedly from different angles and with different methods. That is no problem for small data, but as the data grows, the time spent on preprocessing can no longer be ignored.

In this article, we introduce "**SLICECAP**", software for processing large PCAP files at high speed.

Divide and process

Parallel processing is the usual approach when working with large amounts of data. The original data is first divided into smaller pieces, and the pieces are then distributed across multiple threads, processes, or in some cases multiple servers, to be processed in parallel.

We would like to handle PCAP files the same way, but there is a problem.

The PCAP format is designed to record a stream of packets, so it carries no per-packet index. How many packets a PCAP file contains, or at which byte offset the 1000th packet starts, cannot be known without walking through the file from the beginning. That scan becomes very expensive when splitting a PCAP file that holds a large number of packets.
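To make the cost concrete, here is a minimal sketch of locating the n-th packet in a classic PCAP file. It is an illustration, not SLICECAP's code: the function name `nth_packet_offset` is hypothetical, and the sketch assumes the 24-byte global header and 16-byte packet headers of the classic format, with the byte order supplied by the caller.

```python
import io
import struct

GLOBAL_HEADER_LEN = 24  # fixed-size header at the start of the file
PACKET_HEADER_LEN = 16  # ts_sec, ts_usec, incl_len, orig_len (4 bytes each)


def nth_packet_offset(f, n, byte_order="<"):
    """Return the file offset at which the n-th packet header (0-based) starts.

    Hypothetical helper: every preceding packet header has to be read,
    because the length of a packet is only known from its own header.
    """
    f.seek(GLOBAL_HEADER_LEN)
    for _ in range(n):
        header = f.read(PACKET_HEADER_LEN)
        if len(header) < PACKET_HEADER_LEN:
            raise IndexError("file holds fewer packets than requested")
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack(
            byte_order + "IIII", header)
        f.seek(incl_len, io.SEEK_CUR)  # skip the captured packet bytes
    return f.tell()
```

The loop runs once per packet, so the cost of finding a split point this way grows with the number of packets in front of it.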

The PCAP format is outlined below.

PCAP_FORMAT_FIGURE

The red header in the figure is the global header, which holds information about the PCAP data as a whole. Each blue packet header, together with the white block that follows it, represents one packet; the packet header holds the information specific to that packet.
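The two kinds of header can be decoded with Python's `struct` module. The sketch below assumes the classic PCAP layout with microsecond timestamps (magic number 0xA1B2C3D4); the function names and the dictionary keys are chosen for illustration only.

```python
import struct

MAGIC_USEC = 0xA1B2C3D4  # classic PCAP magic number, microsecond timestamps


def parse_global_header(data):
    """Parse the 24-byte global header and detect the file's byte order."""
    for byte_order in ("<", ">"):
        (magic, major, minor, _thiszone, _sigfigs,
         snaplen, linktype) = struct.unpack(byte_order + "IHHiIII", data[:24])
        if magic == MAGIC_USEC:
            return {"byte_order": byte_order, "version": (major, minor),
                    "snaplen": snaplen, "linktype": linktype}
    raise ValueError("not a classic microsecond-resolution PCAP file")


def parse_packet_header(data, byte_order):
    """Parse one 16-byte packet header."""
    ts_sec, ts_usec, incl_len, orig_len = struct.unpack(
        byte_order + "IIII", data[:16])
    return {"ts_sec": ts_sec, "ts_usec": ts_usec,
            "incl_len": incl_len, "orig_len": orig_len}
```

`incl_len` is the number of packet bytes stored in the file right after the header, so it is the field that tells a reader where the next packet header begins.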

Remember the purpose

Recall that the goal was to split a PCAP file into smaller pieces. Because the pieces only need to be processed in parallel, they do not have to contain the same number of packets. SLICECAP computes the size of each piece from the number of slices specified by the user and the size of the whole PCAP file, then searches for a packet boundary that looks "**about right**" and splits there.
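The size calculation can be sketched in a few lines. This is an assumption about how such a calculation might look, not SLICECAP's actual code; the function name `tentative_split_offsets` is hypothetical, and the offsets it returns are only starting points that still have to be moved to a real packet boundary.

```python
def tentative_split_offsets(file_size, n_slices, global_header_len=24):
    """Return n_slices - 1 approximate byte offsets at which to cut the file.

    The offsets ignore packet boundaries; they are derived purely from the
    file size and the requested number of slices.
    """
    payload = file_size - global_header_len  # bytes that hold packets
    slice_size = payload // n_slices
    return [global_header_len + i * slice_size for i in range(1, n_slices)]
```

For a 1024-byte file and four slices, this gives the offsets 274, 524, and 774.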

The timestamp recorded in each packet header is used to find the packet boundary. Because packets are recorded consecutively, the timestamps of neighboring packets can be assumed to be close to one another. SLICECAP first moves the file pointer to a suitable position for the split, then advances it byte by byte until it reaches a position that holds a timestamp with a similar value. It then checks whether the other packet header fields hold abnormal values, and if the position looks "**plausible**" as a packet header, the split is fixed there.
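The search described above might look like the following sketch. The concrete checks are assumptions made for illustration (a reference timestamp `ref_ts`, a tolerance `max_gap` in seconds, and sanity checks against the snapshot length); the criteria SLICECAP really uses may differ, and both function names are hypothetical.

```python
import struct

PACKET_HEADER_LEN = 16


def looks_like_packet_header(data, pos, ref_ts, byte_order="<",
                             snaplen=65535, max_gap=60):
    """Heuristically decide whether a packet header starts at data[pos]."""
    if pos + PACKET_HEADER_LEN > len(data):
        return False
    ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
        byte_order + "IIII", data, pos)
    return (abs(ts_sec - ref_ts) <= max_gap  # timestamp near the reference
            and ts_usec < 1_000_000          # valid microsecond field
            and 0 < incl_len <= snaplen      # captured length is sane
            and incl_len <= orig_len)        # cannot capture more than was sent


def find_boundary(data, start, ref_ts, **kwargs):
    """Advance byte by byte from start until a plausible header is found."""
    for pos in range(start, len(data) - PACKET_HEADER_LEN + 1):
        if looks_like_packet_header(data, pos, ref_ts, **kwargs):
            return pos
    return None
```

A reference timestamp could be taken, for example, from the first packet header of the file. Because the test is heuristic, packet payload that happens to resemble a header can produce a false match, which is why the fields other than the timestamp are checked as well.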

Each slice is turned back into valid PCAP data by prepending a global header, and is passed to a child process through interprocess communication. Because each slice is handled by a separate process, throughput improves with the number of cores.
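As a rough illustration of this step, the sketch below prepends the global header to each slice and pipes it to the standard input of a child process. It is not SLICECAP's implementation: the function names are hypothetical, the slice boundaries are assumed to have been found already, and each slice is read into memory at once for brevity, where a tool built for very large files would more likely stream it in chunks.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

GLOBAL_HEADER_LEN = 24  # size of the PCAP global header


def feed_slice(path, start, end, command):
    """Run command with one slice on its stdin and return its stdout."""
    with open(path, "rb") as f:
        global_header = f.read(GLOBAL_HEADER_LEN)
        f.seek(start)
        body = f.read(end - start)
    # Prepending the global header makes the slice valid PCAP data by itself.
    result = subprocess.run(command, input=global_header + body,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


def process_in_parallel(path, boundaries, command):
    """Process the slices delimited by consecutive offsets in boundaries."""
    ranges = list(zip(boundaries[:-1], boundaries[1:]))
    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        return list(pool.map(lambda r: feed_slice(path, *r, command), ranges))
```

Here `boundaries` is the list of packet-boundary offsets, starting with the offset of the first packet and ending with the file size, and `command` is an argument list such as `["cat"]`.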

Let's use it

As an example, the following command splits a PCAP file into 10 smaller PCAP files.

slicecap -r source.pcap -n 10 -- "cat - > dest-{SLICE_ID}.pcap"

In real analysis, you would specify a filter program somewhat more useful than cat.
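Such a filter only has to read PCAP data from its standard input. The following sketch, which is not part of SLICECAP, counts the packets and captured bytes in the data it receives; the names `summarize` and `main` are hypothetical, and classic PCAP with microsecond timestamps is assumed.

```python
import struct
import sys


def summarize(stream):
    """Return (packet count, captured bytes) for PCAP data read from stream."""
    global_header = stream.read(24)
    magic = global_header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        byte_order = "<"  # file written in little-endian order
    elif magic == b"\xa1\xb2\xc3\xd4":
        byte_order = ">"  # file written in big-endian order
    else:
        raise ValueError("not a classic microsecond-resolution PCAP stream")
    packets = captured = 0
    while True:
        header = stream.read(16)
        if len(header) < 16:
            break  # end of the slice
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack(
            byte_order + "IIII", header)
        stream.read(incl_len)  # consume the packet bytes
        packets += 1
        captured += incl_len
    return packets, captured


def main():
    """Entry point when the script is used as a filter on standard input."""
    packets, captured = summarize(sys.stdin.buffer)
    print(f"{packets} packets, {captured} captured bytes")
```

Saved as a script that calls `main()` at the end, for example under the hypothetical name `summarize.py`, it could take the place of cat in the command above: `slicecap -r source.pcap -n 10 -- "python3 summarize.py"`.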

The figure below shows how the processing time changes as the number of parallel processes increases, when the p2c command is used to convert header information to CSV.

SLICECAP_PARA_FIGURE

The PC used for the measurement has two Intel Xeon E5-2697 v3 (14-core) processors and 256 GB of memory. A 58 GB PCAP file was parsed, the IP/TCP/UDP headers were extracted, and the result was written out in CSV format. The processing time gets shorter as the number of parallel processes increases. Hyper-threading brought little benefit, probably because the workload consists only of simple integer operations; other kinds of processing may gain a little more from it.

Where to get it

SLICECAP is registered on PyPI, so it can be installed with the pip command. The source code is available at https://github.com/keiichishima/slicecap/. Bug reports and pull requests are welcome.
