Rewrite the field creation node of SPSS Modeler with Python. Feature extraction from time series sensor data

This article extracts features from time-series sensor data using SPSS Modeler's "Field Creation" node (the Derive node in the English UI), which creates new fields from existing data using functions. It then rewrites the same processing in Python with pandas.

SPSS Modeler provides nodes for many kinds of data processing, and the "Field Creation" node is a general-purpose one that offers a high degree of freedom.

image.png

The processing pattern is selected from the "Derived" list. The word may be hard to picture; it is a translation of the English "derive" and means a processing pattern that creates a new field derived from the original data. I will explain the patterns in the order of how often I personally use them.

All of these patterns process records from top to bottom. For the Count and State types in particular, it is essential to be aware of the order in which records are processed.

Because it is a general-purpose node, it can be used for many kinds of processing, but this time we will use it to extract features from time-series sensor data.

Raw time-series sensor data carries little information as it is, so the key to analysis is to process it into effective features. It would be easy if a simple rule such as "an error occurs when the power exceeds 200W" held. In most cases, however, meaningful analysis is impossible without information on how the sensor value has actually changed: for example, whether the power is rising rapidly, staying stable, or zigzagging up and down without settling.

The data to be analyzed are as follows.
M_CD: Machine code
UP_TIME: Startup time
POWER: Power
TEMP: Temperature
ERR_CD: Error code

For each machine code, the power, the temperature, and any error are recorded in chronological order along the startup time.

image.png

This time, we will create the following features from this data.
① Conditional: Difference from the power one hour ago
② Flag type: A flag that catches the zigzag in which the power rises and falls
③ Count type: Cumulative number of zigzag occurrences
④ State type: A state that shows whether the zigzag is occurring frequently or not

image.png

Every one of these features depends on the record order, so first use the sort node to sort by machine code and startup time.
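On the pandas side, the same sorting step could look like the following. This is only a minimal sketch on made-up values; it assumes the startup-time column is named UP_TIME, and the reset_index call is my own addition so that row labels follow the sorted order.

```python
import pandas as pd

# Hypothetical miniature of the sensor data (values are made up)
df = pd.DataFrame({
    "M_CD":    [209, 204, 209, 204],
    "UP_TIME": [1, 1, 0, 0],
    "POWER":   [995, 988, 996, 991],
})

# Sort by machine code, then by startup time, and renumber the rows
df = df.sort_values(["M_CD", "UP_TIME"]).reset_index(drop=True)
print(df)
```

Renumbering the rows matters later if rows are accessed by position-like labels such as index-1.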

image.png

1m. ① Conditional: Power difference 1 hour ago Modeler version

Let's make a feature of "difference from the power one hour ago".

In the field creation node, set "Derived: Conditional". Input fields that mirror the structure of an IF statement then appear. The same IF statement can also be written with "Derived: CLEM expression", but "Derived: Conditional" is recommended because it is easier to read.

First, enter @DIFF1(POWER) in Then:. @DIFF1 is one of Modeler's built-in functions (CLEM functions) and calculates the difference from the previous row. Because consecutive rows are one hour apart, this gives the difference from the power one hour ago.

Next, set If: to M_CD = @OFFSET(M_CD, 1) and Else: to undef. @OFFSET refers to the value N rows earlier; here it refers to the previous row. undef means NULL. In other words, when M_CD is the same as in the previous row, @DIFF1(POWER) is calculated; when it differs, the difference would be taken against another machine's power and would be meaningless, so NULL is set instead.

image.png

The result is as follows. A new derived column called POWER_DIFF has been added; it contains the current POWER minus the POWER in the previous row. In the example in row 930, it contains 988W - 991W = -3W.

Row 941 contains $null$. The rows up to 940 belong to the machine with M_CD = 204 and the rows from 941 belong to the machine with M_CD = 209, so a power difference across that boundary would be meaningless and NULL is set.

image.png

By the way, let's look at two machines, M_CD = 1000 and M_CD = 229, in a time series graph.

For M_CD = 1000, the power decreases monotonically by 1W or 2W at a time from the beginning and never increases. Toward the end there are relatively large drops of 5W and 6W.

image.png

For M_CD = 229, the differences swing considerably between positive and negative, and the power repeatedly rises and falls.

image.png

1p. ① Conditional: Power difference 1 hour ago pandas version

In pandas, group by M_CD and apply diff(1) to POWER, which gives the difference from the previous row (one hour ago), and store the result in a new column, df['POWER_DIFF'].

# Difference in POWER from one hour ago, per machine
df['POWER_DIFF'] = df.groupby(['M_CD'])['POWER'].diff(1)

image.png
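To make the correspondence with the Modeler settings easier to see, here is a small sketch on made-up values. It compares the groupby version with an explicit If / Then / Else built from shift(1) and where(); the variable names same_machine and explicit are my own and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up data, already sorted by machine and time
df = pd.DataFrame({
    "M_CD":  [204, 204, 204, 209, 209],
    "POWER": [991, 988, 990, 995, 996],
})

# groupby + diff: the first row of each machine becomes NaN
df["POWER_DIFF"] = df.groupby("M_CD")["POWER"].diff(1)

# Explicit version mirroring the Conditional settings
same_machine = df["M_CD"] == df["M_CD"].shift(1)         # If:   M_CD = @OFFSET(M_CD, 1)
explicit = (df["POWER"] - df["POWER"].shift(1)).where(same_machine)  # Then: diff / Else: NaN

print(pd.DataFrame({"groupby": df["POWER_DIFF"], "explicit": explicit}))
```

Both columns hold NaN on the first row of each machine, which plays the role of $null$ in Modeler.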

2m. ② Flag type: Flag that captures the zigzag that power increases or decreases Modeler version

A machine whose power repeatedly rises and falls, like M_CD = 229, may have something wrong with its power supply. A single value of the "difference from the power one hour ago" (for example, -5W) cannot capture this zigzag. So we create a feature that indicates that the power difference has flipped from positive to negative, or from negative to positive.

In the field creation node, set "Derived: Flag type". Because I wanted to plot the field in a time series graph later, I set its data type to continuous, with 1 for true and 0 for false. If you only need a flag, you can leave the data type as a flag type. Set the true condition to POWER_DIFF * @OFFSET(POWER_DIFF,1) < 0. This multiplies the current power difference by the power difference of one hour ago and checks whether the product is negative. The product of a positive and a negative number is negative, while the product of two positive or two negative numbers is positive. The flag is therefore set when the sign flips, that is, when a zigzag occurs.

image.png

The result is as follows. A new derived column called FLUCTUATION has been added; it contains 1 when POWER_DIFF and the previous row's POWER_DIFF have different signs. In row 1195, the power increased by 5W one hour ago and increased by 5W again this time, so it is rising monotonically and the flag is 0. In row 1197, on the other hand, the power increased by 5W one hour ago but changed by -1W this time, so a zigzag is occurring and the flag is 1. The zigzag, which previously could only be seen by looking at the graph, can now be recognized from the single record in row 1197.

image.png

Let's look at two machines, M_CD = 1000 and M_CD = 229, again in a time series graph.

For M_CD = 1000, the power decreases monotonically from the beginning, so there is no zigzag.

image.png

For M_CD = 229, you can see that the power rises and falls repeatedly in small steps.

image.png

2p. ② Flag type: Flag that catches the zigzag that power increases or decreases pandas version

Creating the zigzag flag in pandas is slightly more involved. First, create a column for the POWER_DIFF of one hour ago: group by M_CD, take the previous row's POWER_DIFF with shift(1), and store it in a new column, df['PREV_POWER_DIFF'].

# Add a column holding POWER_DIFF from one hour ago
df['PREV_POWER_DIFF'] = df.groupby(['M_CD'])['POWER_DIFF'].shift(1)

Modeler does not create this column because it is not needed in the final result, but pandas needs it for the calculation.

image.png

Next, define the function func_fluctuation. Its IF statement, if x.POWER_DIFF * x.PREV_POWER_DIFF < 0:, multiplies the current power difference by the power difference of one hour ago and checks whether the product is negative.

Then apply this function to each row and store the result in a new column, df['FLUCTUATION']. Note that axis=1 makes apply pass each row to the function as a pandas.Series; without it, apply would pass each column instead.

# Return 1 when the sign of the power difference flips, otherwise 0
# (a product involving NaN is not < 0, so rows without a previous value get 0)
def func_fluctuation(x):
    if x.POWER_DIFF * x.PREV_POWER_DIFF < 0:
        return 1
    else:
        return 0

# Apply the sign-flip check to each row
df['FLUCTUATION'] = df.apply(func_fluctuation, axis=1)

I was able to generate it as follows. image.png
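As an alternative that I did not use in the notebook, the same flag can probably be computed without a row-wise function by comparing the product of the two columns directly. The sketch below uses made-up values; it relies on the fact that a comparison involving NaN evaluates to False, so the first rows of each machine get 0 just as in the function-based version.

```python
import pandas as pd

# Made-up power readings for a single machine
df = pd.DataFrame({
    "M_CD":  [229] * 6,
    "POWER": [100, 105, 110, 109, 112, 113],
})
df["POWER_DIFF"] = df.groupby("M_CD")["POWER"].diff(1)
df["PREV_POWER_DIFF"] = df.groupby("M_CD")["POWER_DIFF"].shift(1)

# Vectorized sign-flip check: True where the product is negative, then cast to 0/1
df["FLUCTUATION"] = (df["POWER_DIFF"] * df["PREV_POWER_DIFF"] < 0).astype(int)
print(df)
```

On large data this avoids calling a Python function once per row.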

3m. ③ Count type: Cumulative number of zigzag occurrences Modeler version

A machine with many zigzags, like M_CD = 229, may have a problem with its power supply, whereas a machine with only a few zigzags may be considered fine. Let's create a feature that holds the cumulative number of zigzags since startup.

In the field creation node, set "Derived: Count type". Set the increment condition to FLUCTUATION = 1 and the increment to 1, so that the counter goes up by one whenever a zigzag occurs. In addition, set M_CD /= @OFFSET(M_CD, 1) as the reset condition so that the counter returns to 0 when the machine changes.

image.png

The result is as follows. A new derived column called FLUC_COUNT has been added, and it counts up by one each time FLUCTUATION is 1. In row 1194, FLUCTUATION is 1, so FLUC_COUNT is 1. It stays at 1 through row 1196, and because FLUCTUATION is 1 again in row 1197, it increases to 2.

image.png

Now let's look at two machines with M_CD = 104 and M_CD = 229 in a time series graph.

For M_CD = 104, two zigzags occur after the 40-hour mark, and the power then decreases steadily, so FLUC_COUNT stays at 2 from about 50 hours onward.

image.png

For M_CD = 229, the power rises and falls repeatedly in small steps, and the zigzag occurs 40 times or more.

image.png

3p. ③ Count type: Cumulative number of zigzag occurrences pandas version

In pandas, the cumulative sum can be calculated with cumsum(). Group by M_CD, take the cumulative sum of FLUCTUATION with cumsum(), and store it in a new column, df['FLUC_COUNT'].

# Cumulative number of zigzags per machine
df['FLUC_COUNT'] = df.groupby(['M_CD'])['FLUCTUATION'].cumsum()

I was able to generate it as follows. image.png
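To check that grouping really takes the place of the reset condition, the sketch below compares the cumsum version with a plain loop that has an increment condition and a reset condition. The loop only illustrates the idea of the count settings on made-up values; it is not Modeler's implementation.

```python
import pandas as pd

# Made-up flags for two machines
df = pd.DataFrame({
    "M_CD":        [104, 104, 104, 104, 229, 229, 229],
    "FLUCTUATION": [0,   1,   0,   1,   0,   1,   1],
})

# pandas version: the cumulative sum restarts for each machine
df["FLUC_COUNT"] = df.groupby("M_CD")["FLUCTUATION"].cumsum()

# Loop version with explicit reset and increment conditions
count, prev_machine, looped = 0, None, []
for machine, flag in zip(df["M_CD"], df["FLUCTUATION"]):
    if machine != prev_machine:   # reset condition: the machine changed
        count = 0
    if flag == 1:                 # increment condition: a zigzag occurred
        count += 1
    looped.append(count)
    prev_machine = machine

print(df["FLUC_COUNT"].tolist(), looped)
```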

4m. ④ State type: A state in which zigzag occurs frequently or does not occur Modeler version

If the power zigzags only a little, there may be no major problem. On the other hand, if the zigzag repeats within a short period, its effect may linger even after the zigzag subsides. "Derived: State type" can express this kind of complicated situation.

In the field creation node, set "Derived: State type". Because I wanted to plot the field in a time series graph later, I made its data type continuous, with 1 for "on" and 0 for "off". If you only need a flag, you can leave the data type as a flag type. Set the switch "on" condition to FLUCTUATION = 1 and @OFFSET(FLUCTUATION,1) = 1. This means that a zigzag occurred on this row and also one hour ago; in other words, the zigzag occurred for two hours in a row.

Next, set the switch "off" condition to @SINCE(FLUCTUATION = 1) >= 5 or M_CD /= @OFFSET(M_CD,1). @SINCE returns how many rows back the expression given as its argument last held. @SINCE(FLUCTUATION = 1) >= 5 therefore means that the last zigzag occurred 5 or more rows ago; in other words, the machine is regarded as stable because there has been no zigzag for 5 hours or more in a row.

M_CD /= @OFFSET(M_CD, 1) is the reset condition: it returns the state to off when the machine changes.

The State type resembles the Flag type, but it lets the on and off conditions be asymmetric. Here the state turns on once the unstable zigzag occurs twice in a row, but it does not return to off until the stable condition has lasted five rows in a row.
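The @SINCE behaviour described above can be pictured with a few lines of plain Python. This is my own illustrative helper, not a Modeler API; it assumes that the current row itself is not counted, and it returns None while the flag has never been 1, which is a simplification chosen for this sketch.

```python
def since(flags):
    """For each row, return how many rows back the flag was last 1.

    The current row is not counted. None means the flag has not been 1
    so far (a simplification made for this sketch).
    """
    result, last_seen = [], None
    for i, flag in enumerate(flags):
        result.append(None if last_seen is None else i - last_seen)
        if flag == 1:
            last_seen = i
    return result

# Two zigzags in a row, followed by five quiet rows
flags = [1, 1, 0, 0, 0, 0, 0]
offsets = since(flags)
print(offsets)
```

With these flags, the offset first reaches 5 on the fifth quiet row, which is where the off condition >= 5 would start to hold.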

image.png

The result is as follows. A new derived column called UNSTABILITY has been added. First, look at row 902: FLUCTUATION is 1 for the second record in a row, so UNSTABILITY becomes 1. A zigzag that continues for 2 hours in a row is judged to be unstable.

Next, in rows 903 to 906 FLUCTUATION is 0, yet UNSTABILITY remains 1. In row 907, the last FLUCTUATION = 1 was 5 records ago, that is, FLUCTUATION has been 0 for 5 records in a row, so UNSTABILITY returns to 0. Because no zigzag occurred for 5 hours, the machine is judged to be stable.

image.png

Now let's look at two machines with M_CD = 204 and M_CD = 229 in a time series graph.

For M_CD = 204, two zigzags occur after the 49-hour mark, and the power then decreases steadily. UNSTABILITY therefore returns to 0 five hours after it became 1 and stays at 0.

image.png

For M_CD = 229, the power keeps rising and falling in small steps and UNSTABILITY stays at 1 for long periods, but there are three stretches with no zigzag for 5 hours in a row, and UNSTABILITY is 0 during those periods.

image.png

4p. ④ State type: Whether zigzag occurs frequently or not pandas version

Because such a complicated condition is difficult to express with ordinary pandas column operations, I decided to process the rows sequentially in a loop.

# The loop assumes row labels 0..n-1, so renumber in case df was sorted or filtered
df = df.reset_index(drop=True)

# The first row starts in the stable state
df.at[0, 'UNSTABILITY'] = 0
stable_seq_count = 0

# Loop from the second row (index=1)
for index in range(1, len(df)):
    # By default, keep the previous state
    df.at[index, 'UNSTABILITY'] = df.at[index-1, 'UNSTABILITY']

    # If a fluctuation occurred on this row
    if df.at[index, 'FLUCTUATION'] == 1:
        # Reset the count of consecutive stable rows
        stable_seq_count = 0
        # Two fluctuations in a row: judged unstable
        if df.at[index-1, 'FLUCTUATION'] == 1:
            df.at[index, 'UNSTABILITY'] = 1
    # If there is no fluctuation, increase the count of consecutive stable rows
    elif df.at[index, 'FLUCTUATION'] == 0:
        stable_seq_count += 1

    # Back to stable after 5 or more consecutive stable rows, or when the machine changes
    if stable_seq_count >= 5 or df.at[index, 'M_CD'] != df.at[index-1, 'M_CD']:
        df.at[index, 'UNSTABILITY'] = 0

I was able to generate it as follows.

image.png
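As a variation, the same on/off rule could be wrapped in a function and applied per machine with groupby, so that the machine-change check disappears from the loop. This is a sketch under my own assumptions, not the code from the notebook: the function name unstability and the parameters on_run and off_run are hypothetical, and unlike the loop above, the quiet-row count also restarts for each machine.

```python
import pandas as pd

def unstability(flags, on_run=2, off_run=5):
    """ON after `on_run` consecutive fluctuations, OFF after `off_run` quiet rows."""
    state, fluct_run, quiet_run, out = 0, 0, 0, []
    for flag in flags:
        if flag == 1:
            fluct_run += 1
            quiet_run = 0
            if fluct_run >= on_run:
                state = 1
        else:
            fluct_run = 0
            quiet_run += 1
            if quiet_run >= off_run:
                state = 0
        out.append(state)
    return pd.Series(out, index=flags.index)

# Made-up flags for two machines
df = pd.DataFrame({
    "M_CD":        [204] * 9 + [209] * 3,
    "FLUCTUATION": [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Each machine starts from the off state because the function runs per group
df["UNSTABILITY"] = df.groupby("M_CD")["FLUCTUATION"].transform(unstability)
print(df)
```

The state still has to be carried row by row inside the function, so this is tidier rather than faster than the loop.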

5. Sample

The samples are available below.

stream: https://github.com/hkwd/200611Modeler2Python/raw/master/derive/derive3.str
notebook: https://github.com/hkwd/200611Modeler2Python/blob/master/derive/derive.ipynb
data: https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/Cond4n_e.csv

■ Test environment
Modeler 18.2.2
Windows 10 64bit
Python 3.6.9
pandas 0.24.1

6. Reference information

Field creation node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_mainhelp_client_ddita/clementine/derive_overview.html
