[PYTHON] I tried to make OneHotEncoder, which is often used for data analysis, so that it can reach the itch.

Posts of Hisabisa

Background made

There are category_encoders and pandas get_dummies () for OneHot Encoding, but I find it difficult to use in the following points.

  1. get_dummies () can only be onehot for the input series → There is no consistency when the data is separated into train and test in the data analysis
  2. In category_encoders, the above 1 can be solved because it is in fit and transform format, but it is difficult to understand because the column name and category name are inconsistent.

It was itchy because I couldn't reach the itch.

I made my own ohehot encoder to solve this problem

code

class BaseEncoder():
    def __init__(self):
        pass
    
    def fit(self):
        raise Exception('not implemented')
    
    def transform(self):
        raise Exception('not implemented')
    
    def fit_transform(self):
        raise Exception('not implemented')


class OneHotEncoder(BaseEncoder):
    # library requirement
    # import pandas as pd
    # import numpy as np
    # mojimoji
    
    def __init__(self, 
                 col_name=None, 
                 categories=None, 
                 handle_unknown="summarize", 
                 handle_nan="onehot", 
                 col_order="name", 
                 col_name_type="category",
                 force_hankaku=True,
                 return_type="df",
                 handle_rare=None,
                 dummy=None,
                ):
        
        import pandas as pd
        import numpy as np
        import mojimoji
        
        #---
        #  args
        #    col_name : target column [str, default : None]      get column name from training data. If training data is np values, col is None
        #
        #    categories : encoded category list [list, default : None]
        #
        #    handle_unknown : handle unknown category method [str, default : "summarize"] 
        #      "summarize" : unknown category (not appeared in training data) is treated as "unknownCategory"
        #      "ignore" : unknown category is ignored  
        #
        #    handle_nan : handle nan method [str, default : "onehot"]  
        #      "onehot" : nan is treated as onehot
        #      "ignore" : nan is ignored
        #
        #    col_order : output order [str, default : "name"]  
        #      "name" : sorted by category name
        #      "count_asc" : sorted by ascending appearance count
        #      "count_des" : sorted by descending appearance count
        #
        #    col_name_type : column name type [str, default : "category"]
        #      "name" : return column name is category name
        #      "index" : return column name is index number (rare : -1, nan : -2, impute : -3)
        #
        #    force_hankaku : whether apply hankaku or not [bool , default : True]
        #
        #    return_type : return values type [str, default : "df"]   "pd" : pd.DataFrame , "np" : np.values
        #
        #    handle_rare : rare category  treat method [float, list, default : None]
        #      float : rare threshold of appearance category , list : this list category is treated as rare
        #
        #   dummy : dummy method [str, bool, None, defult: None]
        #      str : category name , this category is treated as dummy
        #      True : dummy is valid, and dummy category is selected automatically
        #
        
        self.col_name = col_name
        
        if type(categories) is list:
            raise Exception(f"[Error] argument categories is invalid , shuold be list, but>> {categories}")
        self.categories = categories
        
        checks = ["summarize" , "ignore"]
        if handle_unknown not in checks:
            raise Exception(f"[Error] argument handle_unknown is invalid , shuold be {checks}, but {handle_unknown}")
        self.handle_unknown = handle_unknown
        
        checks = ["onehot" , "ignore"]
        if handle_nan not in checks:
            raise Exception(f"[Error] argument handle_nan is invalid , shuold be {checks}, but {handle_nan}")
        self.handle_nan = handle_nan
        
        checks = ["name" , "count_asc", "count_des"]
        if col_order not in checks:
            raise Exception(f"[Error] argument col_order is invalid , shuold be {checks}, but {col_order}")
        self.col_order = col_order
        
        checks = ["category" , "index"]
        if col_name_type not in checks:
            raise Exception(f"[Error] argument col_name_type is invalid , shuold be {checks}, but {col_name_type}")
        self.col_name_type = col_name_type
        
        checks = [bool]
        if type(force_hankaku) not in checks:
            raise Exception(f"[Error] argument force_hankaku should be {checks} type , but {force_hankaku}")
        self.force_hankaku = force_hankaku
        
        checks = ["df" , "np"]
        if return_type not in checks:
            raise Exception(f"[Error] argument return_type is invalid , shuold be {checks}, but {return_type}")
        self.return_type = return_type
        
        checks = [int, float, list]
        if type(handle_rare) not in checks and handle_rare is not None:
            raise Exception(f"[Error] argument handle_rare should be {checks} type or None, but {handle_rare}")
        
        if type(handle_rare) in [int, float]:
            if handle_rare >= 1 or handle_rare <= 0:
                print(f"[Warning] handle_rare may be meaningless value >> {handle_rare}")
        self.handle_rare = handle_rare if handle_rare is not None else -1.
    
        checks = [str, bool]
        if type(dummy) not in checks and dummy is not None:
            raise Exception(f"[Error] argument force hankaku should be {checks} type or None , but {dummy}")
        self.dummy = dummy # True only
        
        self.encode_map = {}
        self.unknown_categories = []
        self.dummy_category = None
        
        
    def fit(self, Xs):
        _Xs = pd.Series(Xs.copy()).astype(str)
        _Xs = _Xs.apply(lambda x : mojimoji.zen_to_han(x))
            
        # get column name
        if self.col_name is None:
            self.col_name = _Xs.name
            if self.col_name is None:
                self.col_name = "onehotEncode"
                print(f"[Warning] column name is {self.col_name}")
                
        new_cols = []
                
        # if categories is inputted
        if self.categories is not None:
            cats = pd.Series(self.categories).astype(str)
            if self.force_hankaku:
                cats = _Xs.apply(lambda x : mojimoji.zen_to_han(x))
                
            for c in [x for x in cats if x not in ["nan", "None"]]:
                onehot_name = f"{self.col_name}_{c}"
                self.encode_map[c] = onehot_name
                new_cols.append(onehot_name)
            
            # handle nan
            if self.handle_nan == "onehot":
                for nan_v in ["nan", "None"]:
                    if nan_v in cats:
                        onehot_name = f"{self.col_name}_nan"
                        self.encode_map[nan_v] = onehot_name
                        new_cols.append(onehot_name)
                    

            # handle unknown
            if self.handle_unknown == "summarize":
                new_cols.append(f"{self.col_name}_unknownCategory")
                
            self.new_cols = new_cols
            
            return
            
        # get category
        vc = _Xs.value_counts(dropna=False, normalize=True)
        
        # sort category
        if self.col_order == "name":
            vc.sort_index(inplace=True)
        elif self.col_order == "count_asc":
            vc.sort_values(inplace=True, ascending=True)
        elif self.col_order == "count_des":
            vc.sort_values(inplace=True, ascending=False)

        # rare category (threshold)
        if type(self.handle_rare) is float:
            for c_ind, c in enumerate([x for x in vc[vc > self.handle_rare].index if x not in ["nan", "None"]]):
                
                # skip dummy
                if (self.dummy == True and c_ind == 0) or (self.dummy == c):
                    self.dummy_category = c
                    self.encode_map[c] = "DUMMY_CATEGORY"
                    continue
                
                if self.col_name_type == "category":
                    onehot_name = f"{self.col_name}_{c}"
                elif self.col_name_type == "index":
                    onehot_name = f"{self.col_name}_{c_ind}"
                self.encode_map[c] = onehot_name
                new_cols += [onehot_name]
                
            for c in [x for x in vc[vc <= self.handle_rare].index if x not in ["nan", "None"]]:
                if self.col_name_type == "category":
                    onehot_name = f"{self.col_name}_rareCategory"
                elif self.col_name_type == "index":
                    onehot_name = f"{self.col_name}_-1"
                    
                self.encode_map[c] = onehot_name
                if onehot_name not in new_cols:
                    new_cols += [onehot_name]
                    
        # rare category (list)
        if type(self.handle_rare) is list:
            for c_ind, c in enumerate([x for x in vc.index if x not in ["nan" , "None"] + self.handle_rare]):
                
                # skip dummy
                if (self.dummy and c_ind == 0) or (self.dummy == c):
                    self.dummy_category = c
                    continue
                
                if self.col_name_type == "category":
                    onehot_name = f"{self.col_name}_{c}"
                elif self.col_name_type == "index":
                    onehot_name = f"{self.col_name}_{c_ind}"
                    
                self.encode_map[c] = onehot_name
                new_cols += [onehot_name]
                
            for c in self.handle_rare:
                if self.col_name_type == "category":
                    onehot_name = f"{self.col_name}_rareCategory"
                elif self.col_name_type == "index":
                    onehot_name = f"{self.col_name}_-1"
                    
                self.encode_map[c] = onehot_name
                if onehot_name not in new_cols:
                    new_cols += [onehot_name]

        # handle nan
        if self.handle_nan == "onehot":
            for nan_v in ["nan", "None"]:
                if nan_v in vc.index:
                    if self.col_name_type == "category":
                        onehot_name = f"{self.col_name}_nan"
                    elif self.col_name_type == "index":
                        onehot_name = f"{self.col_name}_-2"
                    self.encode_map[nan_v] = onehot_name
                    if onehot_name not in new_cols:
                        new_cols += [onehot_name]
                
        # handle unknown
        if self.handle_unknown == "summarize":
            if self.col_name_type == "category":
                new_cols.append(f"{self.col_name}_unknownCategory")
            elif self.col_name_type == "index":
                new_cols.append(f"{self.col_name}_-3")
                
        encode_map_inv = {}
        
        for k, v in self.encode_map.items():
            if v in encode_map_inv.keys():
                encode_map_inv[v] += [k]
            else:
                encode_map_inv[v] = [k]
                
        self.new_cols = new_cols
        self.categories = list(self.encode_map.keys())
        self.encode_map_inv = encode_map_inv
        
        del _Xs
        
        
    def transform(self, Xs):
        _Xs = pd.Series(Xs.copy()).astype(str)
        if self.force_hankaku:
            _Xs = _Xs.apply(lambda x : mojimoji.zen_to_han(x))
            
        # return dataframe
        res_df = pd.DataFrame(index=range(len(_Xs)))

        for k, v in self.encode_map_inv.items():
            if k == "DUMMY_CATEGORY":
                continue
            
            res_df[k] = 0 # fill 0
            res_df.loc[_Xs.isin(v), k] = 1 # one hot
                
        # handle unknown
        if self.handle_unknown == "summarize":
            new_col = f"{self.col_name}_unknownCategory"
            res_df[new_col] = 0 # fill 0
            
            known_cats = self.categories
            if self.handle_nan == "ignore":
                known_cats += ["nan", "None"]
            
            res_df.loc[~_Xs.isin(known_cats), new_col] = 1 # one hot
            for cat in list(set(_Xs.values) - set(known_cats)):
                if cat not in self.unknown_categories:
                    self.unknown_categories += [cat]
        
        del _Xs
        
        # return type redefine
        if self.return_type == "np":
            res_df = res_df.values
            
        return res_df
    
    
    def fit_transform(self, Xs):
        self.fit(Xs)
        return self.transform(Xs)

How to use

Create sample data for training and testing as follows

The test has categories (elephant, bird, etc.) that are not found in learning (the encoder created also has a function to put these in the unknown category).

# generate sample category
import random
random.seed(42)

vals1 = ['salamander'] * 10 + ['snake'] * 8 + ['cameleon'] * 5 + ['rizard'] * 7 + ['frog'] * 2 + ['jellyfish'] * 3 + [np.nan] * 3 + [None] * 2
vals2 = ['cute'] * 4 + ['cool'] * 12 + ['colurful'] * 3 + ['nice'] * 2 + ['Wonderful'] * 3 + ['foooo'] * 3 + ['Excellent'] * 3 + [np.nan] * 6 + [None] * 4

vals3 = ['salamander'] * 13 + ['snake'] * 5 + ['cameleon'] * 7 + ['rizard'] * 5 + ['turtle'] * 3 + ['bird'] * 1 + ['elephant'] * 1 + ["jellyfish"] * 2 + [np.nan] * 1 + [None] * 2
vals4 = ['cute'] * 4 + ['cool'] * 12 + ['colorful'] * 3 + ['nice'] * 2 + ['Wonderful'] * 3 + ['foooo'] * 3 + ['Excellent'] * 1 + ['good'] * 1 + ['OK'] * 1  + [np.nan] * 3 + [None] * 7 

random.shuffle(vals1)
random.shuffle(vals2)

random.shuffle(vals3)
random.shuffle(vals4)

train_df =  pd.DataFrame({'animal' : vals1, 'feature' : vals2})
test_df =  pd.DataFrame({'animal' : vals3, 'feature' : vals4})

Use it simply

Try one hot the animal column

#Create an instance
ohe = OneHotEncoder()

# train data de
Train encode
ohe.fit(train_df['animal'])
#Actually encode. Put the training data in transform,

ohe.transform(train_df['animal'])

image.png

Concat the original data and see the result

pd.concat([train_df, ohe.transform(train_df['animal'])], axis=1)

image.png

Some nans are properly onehot, and unknown categories are also available.

Let's look at the test data

pd.concat([test_df, ohe.transform(test_df['animal'])], axis=1)

image.png

Since the same column as train is prepared, it can be used as it is with light gbm or elastic net

I want to check which column the category was encoded in

This function was scarce as it was, so I implemented it. dict returns with ohe.encode_map

ohe.encode_map

You can also see the reverse version

ohe.encode_map_inv

Get the added column name

ohe.new_cols

If you ignore nan

Specify handle_nan = "ignore"

ohe = OneHotEncoder(handle_nan="ignore")
ohe.fit(train_df['animal'])

nan column is gone

image.pngimage.png

Handling of rare categories that are infrequent

You can specify a category to be a rare category such as handle_rare = 0.1 (the number 0.1 is%)

See how often animals appear

try to encode

ohe = OneHotEncoder(handle_rare=0.1)
ohe.fit(train_df['animal'])

rareCategory has been added

image.png

If you look at encode_map, you can see what became rare

In addition, if you put a list of categories in handle_rare, the entered categories will be encoded in rareCategory.

ohe = OneHotEncoder(handle_rare=["cameleon", "frog"])
ohe.fit(train_df['animal'])

image.png

I want to ignore the unknown category

handle_unknown = "ignore", unknown is not encoded

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_df['animal'])

image.png

When you don't want the column name to be a category

If col_name_type = "index" is set, it becomes an index (same as category_encoders)

ohe = OneHotEncoder(col_name_type="index")
ohe.fit(train_df['animal'])

image.png

Specify the column name

By default, the column name of dataframe is prefix, but you can change it with col_name = "XXXX"

(If you enter a numpy value instead of a dataframe, onehotEncode will be a prefix)

ohe = OneHotEncoder(col_name="new_col")
ohe.fit(train_df['animal'])

image.png

I want to dummy encode

If you want dummy encoding (encoding that reduces the number of features by not making one category a column) set dummy = True

ohe = OneHotEncoder(dummy=True)
ohe.fit(train_df['animal'])

image.png

Dummy categories can be retrieved with ohe.dummy_category

If dummy = "xxx", the category will be dummy

Conclusion

I would like to create a library of category encodings that can be reached where it is itchy

I'm glad if you can get an impression by using the above

There are other detailed functions besides the above, but I'm tired of writing, so when I get more likes, I plan to make a library and put together usage such as git.

Recommended Posts

I tried to make OneHotEncoder, which is often used for data analysis, so that it can reach the itch.
I tried to expand the database so that it can be used with PES analysis software
I tried to predict the J-League match (data analysis)
I tried to scrape YouTube, but I can use the API, so don't do it.
I tried to summarize the code often used in Pandas
I tried to summarize the commands often used in business
I want to see something beautiful, so I tried to visualize the function used for benchmarking the optimization function.
[Flask] I tried to summarize the "docker-compose configuration" that can be created quickly for web applications
I tried to make a site that makes it easy to see the update information of Azure
I tried to summarize the methods that are often used when implementing basic algo in Quantx Factory
I tried logistic regression analysis for the first time using Titanic data
I tried to make a calculator with Tkinter so I will write it
The story that it turns blue when the data read by Pillow is converted so that it can be handled by OpenCV
I thought I could make a nice gitignore editor, so I tried to make something like MVP for the time being
I tried to summarize the operations that are likely to be used with numpy-stl
How to replace with Pandas DataFrame, which is useful for data analysis (easy)
I tried to verify the Big Bang theorem [Is it about to come back?]
I tried to process and transform the image and expand the data for machine learning
I tried to rescue the data of the laptop by booting it on Ubuntu
I didn't understand the Resize of TensorFlow so I tried to summarize it visually.
I tried to save the data with discord
I tried to make AI for Smash Bros.
I tried to visualize the lyrics of GReeeen, which I used to listen to crazy in my youth but no longer listen to it.
[LPIC 101] I tried to summarize the command options that are easy to make a mistake
I tried to make the phone ring when it was posted at the IoT post
[Updated from time to time] Python memos often used for data analysis [N division, etc.]
I tried to make a memo app that can be pomodoro, but a reflection record
A Python beginner made a chat bot, so I tried to summarize how to make it.
It's Cat Day, so I tried to make something that translates into cat-like words.
I realized that it is nonsense to use the module without thinking because it is convenient.
The tangent equation problem (high school level) is troublesome so I can solve it
I tried to make it easy to change the setting of authenticated Proxy on Jupyter
It's getting cold, so I tried to make it possible to turn on / off the AC heater automatically with Raspberry Pi!
AI Gaming I tried it for the first time
I want to say that there is data preprocessing ~
I tried "Streamlit" which turns the Python code into a web application as it is
What is the ROC curve? Why shouldn't it be used for imbalanced data? Easy-to-understand explanation
How to set variables that can be used throughout the Django app-useful for templates, etc.-
[Even in the video] Data science at the famous university level that you can learn for free [Yes, if it is coursera. ]
I tried porting the code written for TensorFlow to Theano
I tried to make various "dummy data" with Python faker
Let's make the analysis of the Titanic sinking data like that
Which should I study, R or Python, for data analysis?
Miscellaneous notes that I tried using python for the matter
I passed the Python data analysis test, so I summarized the points
I tried to analyze scRNA-seq data using Topological Data Analysis (TDA)
I made it because I want JSON data that can be used freely in demos and prototypes
[First scraping] I tried to make a VIP character of Smash Bros. [Beautiful Soup] [Data analysis]
I tried to make it possible to automatically send an email just by double-clicking the [Python] icon
Hypothesis / Verification (176) How to make a textbook that is easier than "The easiest textbook for quantum computers"
I tried to find out the difference between A + = B and A = A + B in Python, so make a note
[Python] I tried to make a simple program that works on the command line using argparse.
[Shell script] It's annoying to send the same content every week, so I tried to automate it! !! !!
I want to specify a file that is not a character string for logrotate, but is it impossible?