[PYTHON] A look at pandas' Categorical features: convenient once you get used to them (I think)

pandas is a library that supports data analysis, and the other day (2016/10/2) ver. 0.19.0 (stable) was released. Among the new features is an option for read_csv() to parse Categorical data. Speaking of categorical data, the factor type of the R language comes to mind, but I had never used the Categorical-related functions of pandas myself. I was curious, so I took the release of ver. 0.19.0 as an opportunity to look into them.

(The operating environment is pandas 0.19.0 (with 0.18.1 used in places for comparison), numpy 1.11.1, and Python 3.5.2.)

Comparing pandas' Categorical support with the R language

The R language supports a variable type called the factor type. Since this type is not found in general-purpose programming languages it takes some getting used to, but it exists for handling categorical data. Even among R programmers, however, opinions on it seem to be divided, and when reading a csv file into a data.frame quite a few people apparently specify stringsAsFactors = FALSE to suppress the conversion to factor. (Note: in read.csv() the default was stringsAsFactors = TRUE at the time of writing, unless the user had changed it; {data.table}'s fread() defaults to stringsAsFactors = FALSE.)

Of course, the Python language specification has no factor type, but pandas has supported a Categorical type (dtype) since ver. 0.15.0 (perhaps in response to requests from programmers accustomed to R's factor type). (I did not know this...)

In ver. 0.19.0, the functionality has been extended so that categories can be parsed by read_csv() and so that operations such as data concatenation take categories into account.

(Figure: excerpt quoted from the pandas documentation, pandas_cat_1.PNG)

(The pandas category type is an implementation modeled on R's factor type, but the two seem to differ in the details. I have not looked into these differences closely, so if you are interested, please refer to the pandas documentation (ver. 0.19.0).)

Checking the behavior with the "Mushroom" dataset

Now, let's check the behavior. I used "Mushroom" from the UCI Machine Learning Repository as the dataset. The task in "Mushroom" is to classify whether a mushroom is poisonous from features such as its shape; the first rows of the file look like this.

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m

The data consist entirely of alphabet letters, which makes them perfect for this article. The first column is the label: "e" means edible and "p" means poisonous.

If you read this file in the usual way, it looks like this.

import numpy as np
import pandas as pd

fn = '../Data/Mushroom/agaricus-lepiota.data'
# names for all columns
cols = ['label', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
    'veil-color', 'ring-number', 'ring-type', 'spore-print-color',
    'population', 'habitat']
# names for subset
col_subset = ['label', 'cap-shape', 'cap-surface', 'cap-color', 'bruises']

mr1 = pd.read_csv(fn, header=None, names=cols, usecols=col_subset)

In [1]: mr1.head()
Out[1]:
  label cap-shape cap-surface cap-color bruises
0     p         x           s         n       t
1     e         x           s         y       t
2     e         b           s         w       t
3     p         x           y         w       t
4     e         x           s         g       f

At this point, the data (p, x, s, n, t ...) are treated as strings. In pandas 0.19.0, on the other hand, you can read the file as follows (with dtype='category' added).

mr2 = pd.read_csv(fn, header=None, names=cols, usecols=col_subset, dtype='category')

The types cannot be seen from the head of the data, so check the dtypes.

In [4]: mr1.dtypes
Out[4]:
label          object
cap-shape      object
cap-surface    object
cap-color      object
bruises        object
dtype: object

In [5]: mr2.dtypes
Out[5]:
label          category
cap-shape      category
cap-surface    category
cap-color      category
bruises        category
dtype: object

The individual values in mr1 are str, but the dtype of each pd.Series is the generic object type. On the other hand, mr2, read with the dtype='category' option, has been properly converted to category.
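
The difference between the two ways of reading can be tried without downloading the dataset. The following is a minimal sketch that assumes a hypothetical three-row sample held in memory (the names csv_text, as_object and as_category are made up for illustration) and a pandas version of 0.19.0 or later.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as the Mushroom file
# (label, cap-shape, cap-surface); this is not the real dataset.
csv_text = "p,x,s\ne,x,s\ne,b,y\n"
names = ["label", "cap-shape", "cap-surface"]

# Default parsing: letter columns are read as plain strings.
as_object = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# dtype="category" asks the parser to build categorical columns directly.
as_category = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                          dtype="category")

print(as_object.dtypes)
print(as_category.dtypes)
```

Passing a dict such as dtype={'label': 'category'} instead should restrict the conversion to selected columns.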

By the way, even in the previous version (pandas 0.18.1) you can get the same data as mr2 by converting the types with astype after reading the file.

mr11 = mr1.apply(lambda x: x.astype('category'))

In [9]: mr11.dtypes
Out[9]:
label          category
cap-shape      category
cap-surface    category
cap-color      category
bruises        category
dtype: object

For category-type data, several functions (methods) are available through the cat (short for category) accessor. For example, the set of categories can be obtained as follows.

In : mr2['cap-shape'].cat.categories
Out: Index(['b', 'c', 'f', 'k', 's', 'x'], dtype='object')

In : mr2['cap-color'].cat.categories
Out: Index(['b', 'c', 'e', 'g', 'n', 'p', 'r', 'u', 'w', 'y'], dtype='object')

In the above operation, no particular order has been specified for the obtained categories (in this output they are simply listed alphabetically). A similar operation is unique() on the pd.Series object.

In : mr2['cap-shape'].unique()
Out:
[x, b, s, f, k, c]
Categories (6, object): [x, b, s, f, k, c]

The set of values obtained here is the same as that from xx.cat.categories above, but the order of this result is the order of first appearance while scanning the dataset (pd.Series object). (I don't think there are many cases where the order of appearance has a special meaning.)

I will check the order of categories with different data below.
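
The difference in ordering between cat.categories and unique() is easy to see on a tiny series. This is only a sketch with made-up values (shape is a hypothetical stand-in for a cap-shape column), not data from the Mushroom file.

```python
import pandas as pd

# Hypothetical sample: 'x' appears first, then 'b', then 's'.
shape = pd.Series(["x", "b", "s", "x", "b"]).astype("category")

# Categories inferred from the data are stored in sorted order.
categories = list(shape.cat.categories)

# unique() keeps the order in which values first appear in the series.
appearance = list(shape.unique())

print(categories)   # ['b', 's', 'x']
print(appearance)   # ['x', 'b', 's']
```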

By the way, when machine learning is performed in a later stage of data analysis, the dataset needs to be converted to a numerical type (int, float). A pandas category column can be converted to an integer type through the codes attribute of the cat accessor, as follows.

In : mr2_numeric = mr2['cap-shape'].cat.codes

In : mr2_numeric[:10]
Out:
0    5
1    5
2    0
3    5
4    5
5    5
6    0
7    0
8    5
9    0
dtype: int8

Missing values (NaN), which do not belong to any category, get the code -1, as shown below.

In : mr2.loc[3, 'cap-shape'] = np.nan

In : mr2.loc[6, 'cap-shape'] = np.nan

In : mr2['cap-shape'].cat.codes[:10]
Out:
0    5
1    5
2    0
3   -1
4    5
5    5
6   -1
7    0
8    5
9    0
dtype: int8
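
Because -1 is a sentinel rather than a real category, it is worth checking for it before handing the codes to a model. The sketch below uses a made-up four-element series (shape, codes and missing are illustrative names) to show that the positions with code -1 are exactly the missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in position 2.
shape = pd.Series(["x", "b", np.nan, "x"], dtype="category")

# Categories are ['b', 'x'], so 'b' -> 0, 'x' -> 1 and NaN -> -1.
codes = shape.cat.codes

# The -1 positions coincide with the missing values.
missing = shape.isna()

print(codes.tolist())    # [1, 0, -1, 1]
print(missing.tolist())  # [False, False, True, False]
```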

A look at the "ordered" category

The pandas category type has an "ordered" option. Here we check it with an example using element symbols. First, prepare the data (pd.Series).

# Function for creating a data sample
def mk_rand_elements():
    elem_dict = {1: 'H', 2: 'He', 3: 'Li', 4: 'Be', 5: 'B', 6: 'C', 7: 'N'}
    sz = 10
    # the upper bound of randint is exclusive, so 8 is needed to draw 1..7
    r = np.random.randint(1, 8, size=sz)
    rand_el = [elem_dict[i] for i in r]

    return rand_el

elem_series = pd.Series(mk_rand_elements())

Next, put the correct order you want to impose into a variable. (These are element symbols, but I can only remember them up to nitrogen ...)

elem_ord = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N']

Then create a categorical variable from the data series and the list giving the correct order.

# convert to an ordered categorical (astype keyword form as of pandas 0.19.0)
elem_cat = elem_series.astype('category', categories=elem_ord, ordered=True)

# check
In : elem_cat
Out:
0     B
1     H
2    Li
3    Be
4    He
5    Li
6     B
7    Li
8    He
9     C
dtype: category
Categories (7, object): [H < He < Li < Be < B < C < N]
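
The keyword form of astype shown above is what pandas 0.19.0 accepts. In later releases (0.21 and up, as far as I know) that form was deprecated and eventually removed in favour of passing a CategoricalDtype object. The sketch below shows the newer spelling under that assumption, with a fixed sample instead of random data so the result is reproducible.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

elem_ord = ["H", "He", "Li", "Be", "B", "C", "N"]

# Fixed sample for illustration (the article generates this randomly).
elem_series = pd.Series(["B", "H", "Li", "Be", "He"])

# The order of the list defines the order of the categories.
elem_dtype = CategoricalDtype(categories=elem_ord, ordered=True)
elem_cat = elem_series.astype(elem_dtype)

print(elem_cat.cat.ordered)           # True
print(list(elem_cat.cat.categories))  # same order as elem_ord
```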

The thing to note is the bottom line, Categories (7, object): [H < He < Li < Be < B < C < N]. The inequality signs indicate that the variable elem_cat has an ordered category dtype.

When this is encoded into a numeric type, the numerical values are properly assigned in the order of the categories.

# Encoding to numeric data
encoded = elem_cat.cat.codes

# Codes start from 0 (as with Python indexing), so shift everything by 1
encoded = encoded + 1

# Show the values before and after encoding side by side
result = pd.DataFrame(columns=['elem', 'num'])
result['elem'] = elem_series
result['num'] = encoded

In : result
Out:
  elem  num
0    B    5
1    H    1
2   Li    3
3   Be    4
4   He    2
5   Li    3
6    B    5
7   Li    3
8   He    2
9    C    6

As shown above, the series of element symbols is properly encoded into atomic numbers. By supplying the order explicitly and setting the ordered option to True in this way, the order of the categories can be maintained. For data like the shape features in the "Mushroom" dataset there is no need to care about order, but student grades such as ['A','B','C','D'], or the ratings ['AAA','AA','A','BBB','BB','B'] often used by investment companies, carry information in the order itself. In such cases, you may want to use an ordered category.
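
To illustrate what the order buys you, here is a small sketch using the grade example. It assumes that 'A' is the best grade (so the categories are listed from worst to best) and uses the CategoricalDtype spelling available in newer pandas; the sample values are made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumption: D < C < B < A, i.e. categories listed from worst to best.
grade_dtype = CategoricalDtype(categories=["D", "C", "B", "A"], ordered=True)
grades = pd.Series(["B", "A", "D", "C", "A"]).astype(grade_dtype)

# With an ordered category, max/min, comparisons and sorting follow the
# category order rather than alphabetical order.
best = grades.max()
at_least_b = (grades >= "B").tolist()
ranked = grades.sort_values().tolist()

print(best)        # A
print(at_least_b)  # [True, True, False, False, True]
print(ranked)      # ['D', 'C', 'B', 'A', 'A']
```

With an unordered category, the comparison and max() above would raise an error instead, which is a useful guard against treating nominal data as ordinal by accident.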

Summary

In the overall flow of machine learning, a dataset consisting of character strings and the like is read from a file and, after some preprocessing, is fed to a model (classification model, regression model). Since the model handles numerical data, there is no need to go through the category type at all if the strings are converted directly to numbers.

For the conversion from strings to numbers, you can write your own function and apply it, or use the functions in scikit-learn (preprocessing). However, since the pandas Categorical features examined here do the processing inside pandas, I expect them to be useful in various situations, such as working in a Jupyter notebook or plotting and inspecting the data. In other words, this is a convenient feature that you lose nothing by knowing.
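
As a sketch of the pandas-only route, every categorical column of a frame can be turned into integer codes in one step. The frame below is a made-up four-row sample with two of the Mushroom column names, and numeric is an illustrative name.

```python
import pandas as pd

# Hypothetical four-row sample, converted column by column to category
# in the same way as mr11 in the article.
df = pd.DataFrame({
    "label": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "x"],
}).apply(lambda col: col.astype("category"))

# Replace every categorical column by its integer codes.
numeric = df.apply(lambda col: col.cat.codes)

print(numeric)
```

Note that the codes of an unordered category are arbitrary labels, so for many models a one-hot encoding may be a better fit than feeding the codes in directly.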

(Since the feature was added in a version upgrade in response to a feature request, there seems to be a certain demand for it.)

(Addendum) 11/21/2016

There is also a function that supports this kind of Categorical-style conversion in pandas: pd.factorize().

>>> myseq = ['a', 'b', 'c', 'a', 'b']
>>> encoded = pd.factorize(myseq)
>>> encoded
(array([0, 1, 2, 0, 1]), array(['a', 'b', 'c'], dtype=object))

The return value is a tuple consisting of the converted integer codes (indexer) and the unique values of the original data. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html (pandas documentation)
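
One difference from cat.codes is worth noting: by default pd.factorize() numbers the values in order of first appearance, whereas the codes of an inferred category follow the sorted categories. The sketch below, with made-up data, shows that passing sort=True makes the two agree.

```python
import pandas as pd

seq = pd.Series(["b", "a", "c", "b"])

# Default: codes follow the order of first appearance.
codes, uniques = pd.factorize(seq)

# sort=True: the unique values are sorted before codes are assigned.
sorted_codes, sorted_uniques = pd.factorize(seq, sort=True)

# Codes of the inferred (sorted) categories, for comparison.
cat_codes = seq.astype("category").cat.codes

print(codes.tolist(), list(uniques))                # [0, 1, 2, 0] ['b', 'a', 'c']
print(sorted_codes.tolist(), list(sorted_uniques))  # [1, 0, 2, 1] ['a', 'b', 'c']
print(cat_codes.tolist())                           # [1, 0, 2, 1]
```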
