[PYTHON] [I studied] RANSAC story

In three lines


Hello. Today I'd like to share a story from the informatics I'm doing in graduate school. The other day I suddenly thought, "Come to think of it, I've never studied this properly," so I studied and implemented RANSAC.

What is RANSAC

When working with natural data such as images in graduate-school research, the data may contain "outliers" that deviate significantly from the underlying law because of noise and the like. Outliers get in the way of finding that law in the data. RANSAC (RANdom SAmple Consensus) is a method for estimating the law (its parameters) while ignoring such outliers.

... concepts alone are hard to grasp, so let's look at a concrete example. From here on I'll refer to the "law" as the "model".

Straight line model estimation

Consider the problem of estimating, from a given point cloud, which straight line the points lie along. The points are distributed along this line, but about one sample in 100 is an outlier that jumps far above it.

The graph looks like this.

(Data-generation code in Python)

import numpy as np

# true values
_a = 0.5
_b = 0.3

# samples: points along y = _a * x + _b with Gaussian noise;
# roughly 1 in 100 points is pushed far upward as an outlier
points = np.array([
    [x, _a * x + _b + .1 * np.random.randn()
        + (np.random.randint(100) == 0) * np.random.rand() * 1000]
    for x in np.arange(0, 10, 0.01)
])
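
For reference, a scatter plot like the one shown can be drawn with a short matplotlib sketch (the plotting code is not part of the original listing):

import matplotlib.pyplot as plt

plt.scatter(points[:, 0], points[:, 1], s=2)
plt.xlabel('x')
plt.ylabel('y')
plt.show()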

Since the straight line is represented by y = ax + b, this problem is equivalent to finding the values of a and b. This time the true value of a is 0.5 and the true value of b is 0.3.

If you try to estimate (a, b) naively, the estimate gets pulled up sharply by the outliers. For example, the most orthodox approach, least squares, gives something like this.

Red is the true line and green is the estimated line. The estimated parameters are (a, b) = (0.20, 7.4). The outliers pull the estimated green line far above the true red line.

(Fitting code in Python; the squared error is minimized by gradient descent while the step size is gradually "annealed" down)

def leastSquare(data):
    # minimize the squared error by gradient descent,
    # shrinking ("annealing") the step size tau each round
    tau = 100
    model = np.zeros(2)  # model = (a, b)
    while tau >= 0.0001:
        for _ in range(10):
            # step in the direction of steepest descent
            grad = errorGrad(model, data)
            grad /= np.linalg.norm(grad)
            model -= grad * tau
        tau *= 0.1
    return model

a, b = leastSquare(points)
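
The code above relies on a helper errorGrad(model, data) that returns the gradient of the summed squared error with respect to (a, b). The notebook linked at the end has the original definition; a minimal sketch would be:

def errorGrad(model, data):
    # gradient of sum((a*x + b - y)^2) with respect to (a, b) (sketch)
    a, b = model
    x, y = data[:, 0], data[:, 1]
    residual = a * x + b - y
    return np.array([2 * np.sum(residual * x), 2 * np.sum(residual)])

For comparison, a one-line closed-form fit such as np.polyfit(points[:, 0], points[:, 1], 1) gets pulled up by the outliers in much the same way.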

Solved with RANSAC

If you estimate the parameters naively like this, the "outliers" get in the way and it does not work, so we want to ignore them. RANSAC does this with the following algorithm.

  1. Randomly pick a "small" number of samples from the data set, at least as many as are needed to determine the model (two or more here, since the unknowns are (a, b))
  2. Fit a provisional model to that small set of samples, for example by least squares
  3. Apply the provisional model to the whole data set; if not too many points turn out to be outliers, add it to the "candidate models"
  4. Repeat steps 1-3 several times
  5. Among the candidate models obtained, take the one that fits the data best as the true model

If step 3 finds too many outliers, the provisional model is discarded, because the small sample drawn in step 1 most likely included outliers itself. Repeating this neatly excludes the influence of the outliers and recovers something close to the true value. In other words, RANSAC searches for the subset of points that best represents the true model, and then fits the line to that subset.
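
As a side note, the number of repetitions k is commonly chosen so that, with high probability, at least one of the random samples contains only inliers. A small sketch of that rule of thumb (w is the assumed inlier ratio, n the sample size):

def ransac_iterations(w, n, p=0.99):
    # smallest k with P(at least one of k size-n samples is all inliers) >= p
    return int(np.ceil(np.log(1 - p) / np.log(1 - w ** n)))

# with ~1% outliers and 2 points per sample, a handful of iterations suffices
print(ransac_iterations(w=0.99, n=2))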

The line obtained with RANSAC is as follows.

Green is the estimated line. You can see that it follows the distribution of the data, and that (a, b) = (0.496228..., 0.299107...) is quite close to the true values.

The Python code looks like this.

def ransac(data,
        # parameters for RANSAC
        n = 2,   # number of samples required to determine the parameters
        k = 100, # max loop count
        t = 2.0, # error threshold for a point to count as an inlier
        d = 800  # number of inliers required to accept a candidate model
    ):

    good_models = []
    good_model_errors = []
    iterations = 0
    while iterations < k:
        # 1. randomly pick n samples and 2. fit a provisional model to them
        sample = data[np.random.choice(len(data), n, replace=False)]
        param = getParamWithSamples(sample)

        # 3. collect the points (other than the sample) that fit the model
        inliers = []
        for p in data:
            if (p == sample).all(1).any():
                continue
            if getError(param, p) > t:
                continue
            else:
                inliers.append(p)

        # keep the model as a candidate if it has enough inliers
        if len(inliers) > d:
            current_error = getModelError(param, data)
            good_models.append(param)
            good_model_errors.append(current_error)

        iterations += 1

    # 5. return the candidate with the smallest error over the whole data set
    best_index = np.argmin(good_model_errors)
    return good_models[best_index]

a, b = ransac(points)
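
The code assumes three helpers, getParamWithSamples, getError, and getModelError (see the linked notebook for the original definitions); minimal sketches might look like this:

def getParamWithSamples(sample):
    # fit y = a*x + b exactly through the two sampled points (sketch)
    (x0, y0), (x1, y1) = sample
    a = (y1 - y0) / (x1 - x0)
    b = y0 - a * x0
    return np.array([a, b])

def getError(param, p):
    # vertical distance of point p from the line y = a*x + b (sketch)
    a, b = param
    return abs(a * p[0] + b - p[1])

def getModelError(param, data):
    # mean per-point error of the model over the whole data set (sketch)
    return np.mean([getError(param, p) for p in data])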

Summary

This time I took up the RANSAC algorithm, using a two-parameter linear model as an example. It is a very important algorithm that shows up in all kinds of papers.

I've always liked algorithms, but for the year or so since entering graduate school I had been treating various algorithms as black boxes, focusing on practical matters such as implementation. From now on I want to occasionally dig into the internals of the algorithms I use like this.

By the way, the code is collected here as an IPython notebook.

https://github.com/smurakami/study-ransac/blob/master/ransac.ipynb
