[PYTHON] "BEADS": an amazing algorithm that neatly separates troublesome backgrounds from measurement data without defining a function

Recently I have had more and more chances to work with various kinds of measurement data, and with them a growing wish to "evaluate backgrounds in a batch, independent of the sample or the measurement method...". After some searching I found a paper (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.4698&rep=rep1&type=pdf) showing that, with a small trick, you can fit a background without specifying a range that avoids the peaks. I had continuous-measurement data whose peaks moved around, which made range specification awkward, so not needing to specify a range at all was exactly what I wanted. I adopted the method immediately and it worked, but it also made me wonder whether background-evaluation techniques had advanced further, so I dug a little deeper and found the algorithm BEADS. It can evaluate backgrounds model-free and fast (albeit with hyperparameters). When I tried it, it simply worked, and I thought "this deserves to be more widely known...", so here is an introduction.

Background

(If you want to see an example right away, skip ahead to "Preparation".)

Background removal as I knew it

Data handled daily in every field of experiment and measurement, whether astronomy, physics, chemistry, or biology, always contains a signal called the background (the baseline, in some fields) that you would rather not have measured at all. The background mixes into measurement data in various forms: for example, a very simple one that lifts the whole data set by a constant or by a line sloping up or down to the right, or one that looks like several gentle overlapping hills.

(Figure: examples of backgrounds)

The first step in analyzing any experimental data is to evaluate this background, but the method actually used depends on the experimental technique, the sample's shape and constituent elements, and the measurement conditions. The luckiest case is when you know the cause of the background and can calculate it theoretically, or when you can measure background-only data. The next luckiest is when the software supplied by the instrument manufacturer includes an automatic background-removal function, so background-subtracted data pops out at the push of a button. The unluckiest people, probably the majority, fit the background by least squares using functions (polynomials and the like) that are good at expressing its shape [^1]. The procedure goes roughly like this:

  1. View the measurement data
  2. Define a function that can express the background (for example, decide what order of polynomial to use)
  3. Select the measurement points to fit: background-only points that contain none of the true signal
  4. Find a good initial value by trial and error
  5. Fit using some ready-made software or package

Actually, step 3, selecting the points used for fitting, can be skipped with the method in the paper introduced at the beginning, by making the cost function asymmetric between positive and negative residuals; I won't go into that method here [^2], but a toy sketch of the idea follows below. And step 4 remains a hassle with or without step 3. This is especially inconvenient when you want to batch-process large amounts of data.
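As a toy illustration of that asymmetric-cost trick (my own minimal version with illustrative names, not the paper's exact formulation): fit a linear background through data containing a peak without excluding any points, by penalizing residuals where the data sits above the model far less, so the peak barely pulls the fit.

import numpy as np
from scipy.optimize import minimize

def asymmetric_cost(coeffs, x, y, up_weight=0.05):
    # residuals above the model (candidate peaks) are down-weighted
    res = y - np.polyval(coeffs, x)
    w = np.where(res > 0, up_weight, 1.0)
    return np.sum(w * res**2)

x = np.linspace(0, 1, 500)
y = 2 + 3*x + 10*np.exp(-((x - 0.5)/0.02)**2)  # linear background + one sharp peak
fit = minimize(asymmetric_cost, x0=np.zeros(2), args=(x, y), method='Nelder-Mead')
background = np.polyval(fit.x, x)  # follows the line, largely ignoring the peak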

The function definition also worries me. Take the example of "two hills + constant" at the bottom right of the figure above: the background added to the true signal is the two broad Gaussians plus a constant shown by the broken lines. But few people could judge, just by looking at the blue line, that "this background can be expressed as two broad Gaussians on a constant offset". In such a case you might manage reasonably well with the fallback option of a polynomial, but choosing the order and finding initial values looks painful, and batch processing of continuous measurements would still leave me uneasy.

Changes in background-evaluation techniques since around 2000

In the condensed-matter physics and materials science I have worked in, background correction is not regarded as a particular pain, and most people, myself included, probably fit with a function or select and interpolate points. In communities that deal with complicated background shapes, however, such as Raman spectroscopy, FT-IR, chromatography, mass spectrometry, and NMR, the search for better background-evaluation techniques seems to have been active since around 2000 [^3]. Among them, airPLS (adaptive iteratively reweighted penalized least squares), published in a 2010 paper, became particularly famous and has appeared in benchmarks ever since. The airPLS manuscript (blob/master/airPLS_manuscript.pdf) categorizes the main methods proposed so far. Broadly, they are:

- Polynomial fitting
- Penalized or weighted least squares
- Wavelet transform (separation in frequency space)
- Differentiation

Each of these four categories has its strengths and weaknesses. This article is not meant as a survey of background-evaluation techniques in general, so I will not go further here; if you are interested, follow the papers citing airPLS. For some measurement data, other methods may well perform better than BEADS.

BEADS

BEADS (__B__aseline __E__stimation __A__nd __D__enoising using __S__parsity) was published in 2014 in the journal Chemometrics and Intelligent Laboratory Systems, and a preprint (http://www.laurent-duval.eu/Articles/Ning_X_2014_j-chemometr-intell-lab-syst_chromatogram_bedusbeads-preprint.pdf) is also available from one of the authors, Duval. The authors come from the signal-processing community, and the algorithm was proposed as a solution to the baseline-correction problem in chromatography.

I am not very good at reading mathematical formulas, but as I understand it, the points of this algorithm are:

- it assumes that the signal itself and its derivatives are sparse;
- asymmetric penalties can be applied;
- it adopts the MM (majorization-minimization) algorithm, a convex-optimization technique with guaranteed convergence.

If you want to understand it through the mathematics, please see the original paper. From here on, I will show its power by example.
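Before the examples, for orientation only, here is my rough paraphrase of the objective the paper minimizes; treat it as a sketch and check the paper for the exact penalty definitions. With $y$ the data and $x$ the sparse peak signal, BEADS minimizes something of the form

$$
F(x) = \frac{1}{2}\lVert H(y-x)\rVert_2^2 + \lambda_0 \sum_n \theta(x_n; r) + \lambda_1 \sum_n \phi([D_1 x]_n) + \lambda_2 \sum_n \phi([D_2 x]_n)
$$

where $H$ is a high-pass filter (built from the cutoff `fc` and order `d` used below), $\theta$ is an asymmetric penalty with asymmetry ratio $r$, $\phi$ is a sparsity-promoting penalty on the first and second differences $D_1$ and $D_2$, and the background is recovered from the low-pass part of $y - \hat{x}$.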

Preparation

The original BEADS is written in MATLAB, but I found a Python translation (https://github.com/hsiaocy/Beads) and, based on it, made a version with execution speed improved to the level of the MATLAB original: https://github.com/skotaro/pybeads/. It is also published on PyPI, so it can be installed as a package with pip.

pip install pybeads

This article uses the following packages, including pybeads.

import numpy as np
import matplotlib.pyplot as plt
import pybeads

There is also a Jupyter notebook that does the same thing as this article. https://github.com/skotaro/pybeads/blob/master/sample_data/pybeads_demo.ipynb

Sample data

The sample data is https://github.com/skotaro/pybeads/blob/master/sample_data/chromatograms_and_noise.csv. It combines the real chromatography measurements `chromatograms.mat` and the artificial white-noise data `noise.mat` that accompany the original MATLAB code, converted from MATLAB format into a single CSV file (redistribution permission has been obtained). There are nine columns in total: the first eight are real data sets with different S/N ratios, and the ninth is white noise apparently created for the noise-reduction demonstration. Here I will use the fourth data set, which has a particularly large background, plus the noise [^4].

sample_data = np.genfromtxt('sample_data/chromatograms_and_noise.csv', delimiter=',')
y = sample_data[:, 3] + sample_data[:, 8]  # fourth data set plus the white-noise column
print(y.shape)

fig, axes = plt.subplots(2, 1, figsize=(7, 6))
axes[0].plot(y)
axes[1].plot(y)
axes[1].set_ylim(-10, 200)

(Figure: the sample data, full range and zoomed in)

I am not familiar with chromatography, so I don't know whether such a background is common, but fitting it with a polynomial looks a little hard, starting with finding initial values. Let's apply BEADS to this data and estimate the background.

fc = 0.006  # cutoff frequency used to build the high-pass filter
d = 1  # high-pass filter parameter; 1 is fine, see the paper for details
r = 6  # asymmetry ratio of the asymmetric penalty

# regularization parameters for the data and its first and second derivatives
amp = 0.8
lam0 = 0.5 * amp
lam1 = 5 * amp
lam2 = 4 * amp
# number of MM-algorithm iterations
Nit = 15
# penalty function
pen = 'L1_v2'

signal_est, bg_est, cost = pybeads.beads(y, d, fc, r, Nit, lam0, lam1, lam2, pen, conv=None)

fig, axes = plt.subplots(3, 1, figsize=(12, 7), sharex=True)
fig.subplots_adjust(hspace=0)
fig.patch.set_color('white')
axes[0].plot(y, c='k', label='original data')
axes[0].plot(bg_est, c='r', label='BG estimated by BEADS')
axes[0].legend()
axes[0].set_ylim(-20, 350)
axes[0].set_xlim(0, 4000)

axes[1].plot(signal_est, label='signal estimated by BEADS')
axes[1].legend()
axes[1].set_ylim(-20, 350)
axes[2].plot(y-signal_est-bg_est, label='noise estimated by BEADS')
axes[2].set_ylim(-35, 35)
axes[2].legend()

(Figure: BEADS result — original data with estimated background, estimated signal, and estimated noise)

The black line in the top graph is the sample data; the red line is the estimated background. Are you surprised? I honestly was: "Wow... this is amazing...". The execution time is about 400 ms for 4000 points. The paper reports about 120 ms for 1000 points with the MATLAB code, so the Python version seems comparable in speed. Also, though I haven't mentioned it until now, noise separation is done at the same time, as the bottom graph shows.
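If you want to check the speed on your own machine, a quick timing sketch (plain time.perf_counter; %timeit in Jupyter works too):

import time
t0 = time.perf_counter()
pybeads.beads(y, d, fc, r, Nit, lam0, lam1, lam2, pen, conv=None)
print(f'{time.perf_counter() - t0:.3f} s')  # roughly 0.4 s for these 4000 points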

Because BEADS is based on the majorization-minimization algorithm, which approaches the solution iteratively, you can plot how the cost computed from the data and its derivatives (a measure of how good the estimate is) decreases at each iteration, and thereby see how well the chosen parameters have converged. For this data it seems fair to judge that the 15 iterations specified above were enough to converge.
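Since the cost history comes back as the third return value, plotting it takes only a few lines:

plt.plot(cost, 'o-')
plt.xlabel('iteration')
plt.ylabel('cost')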

(Figure: cost versus iteration)

Where real data differs

In fact, this sample data has a property convenient for BEADS: the measured values fall gradually to 0 at both ends. As a test, let's add a constant so that the ends are no longer 0, and estimate with the same parameters.

bg_const = 50
signal_est, bg_est, cost = pybeads.beads(y+bg_const, d, fc, r, Nit, lam0, lam1, lam2, pen, conv=None)

(Figure: estimation fails when the ends are not 0)

The estimation fails near both ends; you can see the estimated background being dragged toward 0 there. Real measurement data does not always fall smoothly to 0 at the ends, so this is a problem. What to do?

If it won't fall (to 0), make it fall: extending the measurement data

When I tried BEADS on my own data it didn't work, and I lamented: "such a wonderful algorithm, and yet it demands measurement data that falls smoothly to 0 at both ends...". But then I noticed the trick in the sample data mentioned above, and it hit me: couldn't I extend the measurement data by grafting on sections that fall smoothly to 0? [^5]

First, let's make the background nastier than just raising the floor by a constant.

bg = 5e-5*(np.linspace(0, 3999, num=4000)-2000)**2  # a parabola centered at index 2000
y_difficult = y + bg
plt.plot(y_difficult)
plt.ylim(0, 350)

(Figure: a nastier background)

Now the background is properly nasty. Let's graft extensions onto this. A sigmoid function serves as the smoothly falling part. `xscale_l` and friends are parameters that stretch the falling section of the sigmoid; the larger the value, the more stretched the extension, that is, the more pseudo measurement points are added before the data reaches 0. For now, set the stretch parameter to 30 on both sides.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xscale_l, xscale_r = 30, 30  # stretch factors for the left and right extensions
dx = 1
# sigmoid tails that fall smoothly from the edge values to 0
y_difficult_l = y_difficult[0]*sigmoid(1/xscale_l*np.arange(-5*xscale_l, 5*xscale_l, dx))
y_difficult_r = y_difficult[-1]*sigmoid(-1/xscale_r*np.arange(-5*xscale_r, 5*xscale_r, dx))
y_difficult_ext = np.hstack([y_difficult_l, y_difficult, y_difficult_r])
# plot the original in the middle and the extensions in orange
len_l, len_o, len_r = len(y_difficult_l), len(y_difficult), len(y_difficult_r)
plt.plot(range(len_l, len_l+len_o), y_difficult)
plt.plot(y_difficult_l, 'C1')
plt.plot(range(len_l+len_o, len_l+len_o+len_r), y_difficult_r, 'C1')

(Figure: both ends extended)

The orange lines are the extensions. This looks promising, so let's run the estimation.

signal_est, bg_est, cost = pybeads.beads(y_difficult_ext, d, fc, r, Nit, lam0, lam1, lam2, pen, conv=None)

(Figure: close, but not quite)

Not quite there yet, but the approach looks right. Let's raise the stretch parameter to 100.

xscale_l, xscale_r = 100, 100
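After changing the scale, rebuild the extended array the same way as before and estimate again:

y_difficult_l = y_difficult[0]*sigmoid(1/xscale_l*np.arange(-5*xscale_l, 5*xscale_l, dx))
y_difficult_r = y_difficult[-1]*sigmoid(-1/xscale_r*np.arange(-5*xscale_r, 5*xscale_r, dx))
y_difficult_ext = np.hstack([y_difficult_l, y_difficult, y_difficult_r])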

(Figure: the extensions stretched further)

signal_est, bg_est, cost = pybeads.beads(y_difficult_ext, d, fc, r, Nit, lam0, lam1, lam2, pen, conv=None)

(Figure: it worked!)

It worked! With this trick, it can probably handle any measurement data (as long as the peaks are reasonably sparse).
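If you do this often, it may be convenient to wrap the recipe in a helper. Here is a hypothetical convenience function of my own (not part of pybeads), reusing the sigmoid defined above and the positional beads signature used throughout this article:

def beads_with_extension(y, d, fc, r, Nit, lam0, lam1, lam2, pen,
                         conv=None, xscale=100, dx=1):
    # graft sigmoid tails onto both ends so the data falls smoothly to 0
    tail = np.arange(-5*xscale, 5*xscale, dx)
    left = y[0] * sigmoid(tail/xscale)
    right = y[-1] * sigmoid(-tail/xscale)
    y_ext = np.hstack([left, y, right])
    signal, bg, cost = pybeads.beads(y_ext, d, fc, r, Nit,
                                     lam0, lam1, lam2, pen, conv=conv)
    keep = slice(len(left), len(left) + len(y))  # trim the extensions off again
    return signal[keep], bg[keep], cost

signal_est, bg_est, cost = beads_with_extension(
    y_difficult, d, fc, r, Nit, lam0, lam1, lam2, pen, xscale=100)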

On hyperparameter tuning

To recap, BEADS is model-free and can estimate the background without defining a function, but in exchange it has this many hyperparameters:

fc = 0.006  # cutoff frequency used to build the high-pass filter
d = 1  # high-pass filter parameter; 1 is fine, see the paper for details
r = 6  # asymmetry ratio of the asymmetric penalty

# regularization parameters for the data and its first and second derivatives
amp = 0.8
lam0 = 0.5 * amp
lam1 = 5 * amp
lam2 = 4 * amp
# number of MM-algorithm iterations
Nit = 15
# penalty function
pen = 'L1_v2'

I am still not used to the feel of these numbers, but from playing with them a little, the most important is fc, which decides below what frequency a signal counts as background, and the next most influential are the regularization parameters. In fact, depending on the data, even a tiny change in a regularization parameter (say, from 0.001 to 0.0011) can change the result drastically. In that case, try conv=3 or conv=5: the derivative is smoothed by a moving average and the result stabilizes [^6].
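For example, to enable that smoothing (conv is an option I added to pybeads; see footnote [^6]):

signal_est, bg_est, cost = pybeads.beads(y, d, fc, r, Nit, lam0, lam1, lam2, pen, conv=3)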

As for d, r, Nit, and pen, it surely depends on the data, but I have never needed to change them for the data I handle.

Summary

- BEADS lets you estimate and separate both the background and the noise in measurement data without defining a function.
- Depending on the data, preprocessing is needed to make both ends fall smoothly to 0.
- Hyperparameter tuning is required, but the parameters that really matter seem to be few.

Reference material

- Original paper (2014): https://doi.org/10.1016/j.chemolab.2014.09.014
- Preprint: laurent-duval.eu
- Project site by the paper's authors
- Original MATLAB code

[^1]: There are other primitive but reliable methods, such as spline or linear interpolation through selected points. These are not bad when the amount of data is small or you don't mind the scatter introduced by manual selection. I avoid them as much as possible because I have a lot of measurement data to process, I worry about that scatter, and above all doing it by hand makes me sad. There is also an approach using the wavelet transform, but I lost interest before looking into it closely once I was told BEADS would do.

[^2]: If you know a function suitable for the background, or if you need a functional form, the asymmetric-cost-function method is quite effective and well worth trying. There is a MATLAB toolbox by the paper's author, but you can easily write it in Python with scipy.optimize.minimize.

[^3]: These measurement techniques are widely used in fields with many researchers in industry and academia, such as chemistry, biology, medicine, and food science, so there was simply a lot of demand and the impact of any improvement was large; I suspect that is one reason progress was fast.

[^4]: Looking at the values of sample_data, there are negative values even before the white noise is added. Perhaps the real data had already been processed by subtracting a line connecting the two ends from the raw measurements.

[^5]: This does not falsify the measurement data in any way. There is no problem as long as you add the extensions only to estimate the background of the measured part and discard them after subtracting the background.

[^6]: This is a feature I added, so it is not in the original MATLAB version (for now).
