[PYTHON] Fuzzing and statistics

Introduction

This article summarizes the statistical approach to the ** seed scheduling problem ** of fuzzing. The basics of statistics and reinforcement learning, especially the binomial distribution and sample testing and multi-armed bandit problems, are known. Fuzzing is a type of software test for finding software bugs (especially vulnerabilities). It is a test in which a wide variety of inputs are given to the program under test and executed, and the inputs that cause some problems are recorded. It's a method. Of course, multiple inputs may be causing the same bug (root cause), so these recorded inputs are deduplicated to correspond 1: 1 with the bug in a post-fuzzing triage process. ) And recorded in the database. This series of steps is called ** fuzzing campaign **. ossfuzz.png Google's OSS-Fuzz example

The biggest difference from the random test is that fuzzing observes the change in state when the program is executed (Observation) and feeds it back as a hint for the next input generation. Fuzzing is classified into the following three types depending on how much information is observed when executing the program.

Blackbox Fuzzing Observe only the exit status of the program
Whitebox Fuzzing Observe even the semantics of the program
Greybox Fuzzing Observe partial information about the program

When Whitebox Fuzzing executes a program for one input, it records all the logic on the executed code flow and derives the conditions for generating another new flow with the SMT solver to input the next input. Generate. The so-called ** Symbolic Execution ** falls into this category. Partial information in Graybox Fuzzing is a rather vague definition, but most use coverage information when the program is run. This is especially called ** Coverage based (Greybox) Fuzzing **. For example, AFL observes the edge coverage that passes when an input is executed, and preferentially holds that input when it passes through an edge that has never been seen before.

FCS(Fuzzing Configuration Schedule) Now, I will explain the seed scheduling problem of the main subject. According to VALENTIN J.M [^ 11], the fuzzing algorithm is summarized as follows. fuzzalgo.png

This paper deals with the Schedule () algorithm that performs seed scheduling in the above loop. You can think of the seed in fuzzing as the input file. The father first generates the following input based on an initial seed called Initial Seed, and then adds it to the seed pool with ConfUpdate () based on the feedback information. However, there is also a method called ** Generation based Fuzzing ** where the seed uses meta information for input value generation instead of a concrete value like PEACH father, so the seed is further generalized "Fuzzing Configuraion". It is common to call it. This paper deals with the more generalized ** Fuzzing Configuration Schedule (FCS) ** as well as seed scheduling.

Grey/White box FCS The purpose of fuzzing is to detect more bugs in a given amount of time. Therefore, the question of which seed (Configuration) to select at a certain stage is an important factor in designing efficient fuzzing. As a simple example, AFL preferentially selects the ** with the shortest program execution time ** among the generated seeds. This is a strategy to generate more seeds within a given time by choosing seeds with low time cost. AFF Fast [^ 5], on the other hand, records the program path executed by the seed and preferentially selects the seed that has passed less than ** program paths **. Alternatively, there are various studies such as AFL Go [^ 7], a strategy that prioritizes seeds that can easily reach a specific program location, a combination with static analysis [^ 6], and application to White box Fuzzing [^ 4].

Black box FCS with a statistical approach

The Gray / White box FCS above used feedback such as coverage and program path, but what about the Black Box Fuzzer? As I explained at the beginning, Black Box Fuzzing only observes the exit status of the program, so the feedback information obtained is very small. Therefore, the CERT team of CMU developed a Black Box Fuzzer called BFF, generated a certain number of inputs from each seed selected, and the ratio of inputs that assigned the bugs contained in it ** (#) We proposed an algorithm that preferentially selects seeds with large bugs / # runs) **. [^ 1] Strictly speaking, a mutation for a seed with fuzzing can be regarded as a ** Bernoulli trial ** that generates a certain number of inputs from a seed that attracts bugs with a probability p, and the binomial distribution at this time is a Poisson distribution. Can be approximated to (because the probability p is very small). Below is the patch I developed when I actually dumped the binomial distribution using CERT BFF.

python


diff --git a/bff-2.8/batch.sh b/bff-2.8/batch.sh
index a7fb5ef22a..6c0af417df 100755
--- a/bff-2.8/batch.sh
+++ b/bff-2.8/batch.sh
@@ -66,7 +66,8 @@ contains() {
 scriptlocation=`echo "$(cd "$(dirname "$0")"; pwd)/"`
 echo Script location: $scriptlocation/bff.py
 platform=`uname -a`
-PINURL=https://software.intel.com/sites/landingpage/pintool/downloads/pin-3.0-76991-gcc-linux.tar.gz
+#PINURL=https://software.intel.com/sites/landingpage/pintool/downloads/pin-3.0-76991-gcc-linux.tar.gz
+PINURL=https://software.intel.com/sites/landingpage/pintool/downloads/pin-3.2-81205-gcc-linux.tar.gz
 if ( contains "$platform" "Darwin Kernel Version 11" ); then
     mypython="/Library/Frameworks/Python.framework/Versions/2.7/bin/python"
 else
diff --git a/bff-2.8/certfuzz/scoring/multiarmed_bandit/arms/base.py b/bff-2.8/certfuzz/scoring/multiarmed_bandit/arms/base.py
index 2fed07c3c3..47fd90477a 100644
--- a/bff-2.8/certfuzz/scoring/multiarmed_bandit/arms/base.py
+++ b/bff-2.8/certfuzz/scoring/multiarmed_bandit/arms/base.py
@@ -60,7 +60,8 @@ class BanditArmBase(object):
         self.successes = 0
         self.trials = 0
         self.probability = None
-
+        self.successratio = {}
+        
         # initialize probability
         self.update()
 
@@ -79,6 +80,7 @@ class BanditArmBase(object):
         '''
         self.successes += successes
         self.trials += trials
+        self.successratio[successes] = self.successratio.get(successes, 0) + trials
         self._update_p(successes, trials)
         if self.probability is None:
             logger.debug("MAB arm: %s", self)
@@ -86,7 +88,13 @@ class BanditArmBase(object):
         elif not (0.0 <= self.probability <= 1.0):
             logger.debug("MAB arm: %s", self)
             raise BanditArmError('probability must be between 0.0 <= {:f} <= 1.0'.format(self.probability))
-
+        if self.trials <= 0:
+            return
+        with open(str(id(self)) + ".dat", "w") as fp:
+            fp.write("# {} trials\n".format(self.trials))
+            for s in list(range(self.successes + 1)):
+                if s in self.successratio:
+                    fp.write("{} {}\n".format(s, float(self.successratio[s]) / float(self.trials)))
     def _update_p(self, *_unused_args):
         '''
         Internal method, ensure that self.probability gets assigned
diff --git a/bff-2.8/configs/bff.yaml b/bff-2.8/configs/bff.yaml
index 79d4d074a8..15197c559d 100644
--- a/bff-2.8/configs/bff.yaml
+++ b/bff-2.8/configs/bff.yaml
@@ -11,7 +11,7 @@
 #
 ##############################################################################
 campaign:
-    id: convert v5.2.0
+    id: LAVA-M v5.2.0
 
 
 ##############################################################################
@@ -31,8 +31,8 @@ campaign:
 #
 ##############################################################################
 target:
-    program: ~/convert
-    cmdline_template: $PROGRAM $SEEDFILE /dev/null
+    program: ~/who-lava
+    cmdline_template: $PROGRAM $SEEDFILE
 
 
 ##############################################################################
@@ -53,7 +53,7 @@ target:
 ##############################################################################
 
 directories:
-    seedfile_dir: seedfiles/examples
+    seedfile_dir: seedfiles/who
     working_dir: ~/fuzzing
     results_dir: results
 
diff --git a/bff-2.8/seedfiles/who/f1.utmp b/bff-2.8/seedfiles/who/f1.utmp
new file mode 100644
index 0000000000..b30aa294cf
Binary files /dev/null and b/bff-2.8/seedfiles/who/f1.utmp differ
diff --git a/bff-2.8/seedfiles/who/f2.utmp b/bff-2.8/seedfiles/who/f2.utmp
new file mode 100644
index 0000000000..201990c980
Binary files /dev/null and b/bff-2.8/seedfiles/who/f2.utmp differ
diff --git a/bff-2.8/seedfiles/who/f3.utmp b/bff-2.8/seedfiles/who/f3.utmp
new file mode 100644
index 0000000000..841846d9b8
Binary files /dev/null and b/bff-2.8/seedfiles/who/f3.utmp differ

patch

If there are enough input generations (trials) for each seed, (# bugs / # runs) can be used as the evaluation value as it is, but it is impossible to try all the inputs that can be generated from a certain seed. near. Therefore, they generate about 500 inputs for each seed and use the upper limit UCB (Upper Confidence Bound) of ** 95% confidence interval ** as the evaluation value as a sample test. The FCS was designed so that the larger the seed, the higher the probability that this UCB would be selected.

Multi-armed bandit problem in fuzzing

Now, the Black box FCS by sample estimation explained in the previous section is a so-called ** online algorithm **, and the correct information on how many buggy inputs a seed contains is unknown. Therefore, the higher the number of trials, the higher the quality of UCB. So which of the seeds in the seed pool should be tried more often? This is the so-called ** Exploration and Exploitation dilemma **. For each seed, the evaluation value of "easiness to allocate bugs" can be calculated by the UCB explained in the previous section. So should we always try the seed with the highest rating at some stage? Or if you choose another seed and increase the number of trials, there may be something that gives a higher evaluation value. To this dilemma, CERT BFF uses ϵ-greedy and EXP3.

mab.png

Weighted coupon collector problem and no free lunch theorem

However, there is actually a problem with the FCS by CERT BFF in the previous section. The point is that it only distinguishes between 0/1 whether the generated input is buggy or not. As explained in the first section, basically buggy inputs in fuzzing are ** multiple inputs may have the same bug **. For example, suppose the following input is generated for a seed.

python


0, 0, 0, 1, 0, 1, 0, 0, 1, 1,

This is a Bernoulli trial that generated four buggy inputs with a probability of p in 10 trials. However, if you actually triage and label each bug (root cause), it may be as follows.

python


0, 0, 0, 1, 0, 2, 0, 0, 2, 2,

** Actually, I only pulled two types of bugs, labels 1 and 2. ** That's why. However, in the FCS of CERT BFF, this is also counted as 4 and the sample test is performed. Maverick Woo [^ 2] pointed out this problem and showed that the FCS by the conventional Bernoulli trial did not perform the correct test. In fact, even in the fuzzing campaign using CERT BFF, many bugs appeared that were below the minimum number of regrets of EXP3. This is one proof that the conventional seed evaluation method is far from the number of bugs after passing through triage. So I want to use a more accurate statistical model, but what's the matter? I will post the previous trial again.

python


0, 0, 0, 1, 0, 2, 0, 0, 2, 2,

Suppose you also generate 10 inputs from other seeds, and now you have:

python


0, 0, 0, 1, 0, 2, 0, 0, 3, 4,

Apparently, this seed was able to find four types of bugs. In this case, intuitively, the latter seed should be selected first.

In other words, ** I want to prioritize the seed that attracts more types of bugs out of N types of bugs ** in a certain number of trials. This can be modeled exactly as the [Coupon Collector Problem](https://ja.wikipedia.org/wiki/Coupon Collector Problem). Since the probability of assigning each bug is different, use Weighted Coupon Collector Problem (WCCP) to be exact. .. Unfortunately, WCCP cannot be modeled without knowing the distribution of coupon weights. So what this means is that ** there is no strategy for optimizing the Black Box FCS for any program than the no free lunch theorem **. In addition to showing this, Maverick Woo [^ 2] proposed a method called Rule of Three as a 50-50 proposal, and developed a system called FuzzSim that finds the correct optimum FCS value by an offline algorithm using dynamic programming. I'm a little maniac, so if you're interested in these details, see the paper.

By the way, Dr. Maverick Woo is my academic advisor at CMU.

in conclusion

This time, I introduced the Black Box FCS as one of the statistical approaches to fuzzing. If you are interested, there is also a study using a statistical model [^ 3] as a way to select the earliest Initial Seed.

Reference

Recommended Posts

Fuzzing and statistics
Basic statistics and Gaussian distribution
Stock price and statistics (mean, standard deviation)
About _ and __
[Statistics for programmers] Conditional probabilities and multiplication theorems
[Statistics for programmers] Lorenz curve and Gini coefficient