[PYTHON] Processing datasets with pandas (2)

Yesterday I explained dataset processing with pandas, and this article is a continuation of that.

Normalize the data

Normalization has already made casual appearances in earlier articles, but I don't think I ever explained it properly.

**Normalization** in statistics means transforming data measured on different scales onto a common scale so that it is easier to work with.

For example, let's say you scored 90 points in Japanese and 70 points in math. If you simply compare the numbers, the Japanese grade looks better, but what if the average score is 85 points in Japanese and 55 points in math? Then the Japanese score is only 5 points above average, while the math score is 15 points above. The advantage of normalization is that it lets you compare data with different baselines in this way.

Generally, it means converting the values so that the mean is 0 and the variance (and standard deviation) is 1.

This can be calculated with the following formula.

\mathrm{Normalized}(A(n)) = \frac{A(n) - \mu(A)}{\sigma(A)}

That is, subtract the mean and divide by the standard deviation. This results in a mean of 0 and a standard deviation of 1.
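As a minimal sketch of the formula above, here is how it might look in pandas. The scores are made-up values, chosen only so that the averages come out to 85 and 55 like the example earlier. The column names and the choice of ddof=0 (population standard deviation) are my own assumptions.

```python
import pandas as pd

# Hypothetical exam scores (made up for illustration).
# Row 0 is the student with 90 in Japanese and 70 in math.
scores = pd.DataFrame(
    {"japanese": [90, 85, 80, 95, 75], "math": [70, 55, 40, 65, 45]}
)

# Subtract each column's mean and divide by its standard deviation.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
normalized = (scores - scores.mean()) / scores.std(ddof=0)

print(normalized)
print(normalized.mean())       # approximately 0 for every column
print(normalized.std(ddof=0))  # approximately 1 for every column
```

With this made-up data, the normalized math score in row 0 comes out higher than the normalized Japanese score, even though the raw Japanese score is higher.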

Visualize normalization

It's best to try things hands-on and see the results with your own eyes. Let's do the same with pandas.

First, divide each row of the data frame by its row total (the sum taken across the columns), normalizing it so that every row sums to 1.

1.png

# Divide each row by its row total (sum across the columns, axis=1)
data.div(data.sum(axis=1), axis=0)

2.png
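Since the data frame behind the screenshots is not shown here, the following is a self-contained sketch of the same operation on a small made-up data frame. The values and the row and column labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data; the article's actual `data` is not reproduced here.
data = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1, 2, 9], "c": [2, 4, 12]},
    index=["x", "y", "z"],
)

row_totals = data.sum(axis=1)              # one total per row
normalized = data.div(row_totals, axis=0)  # divide each row by its own total

print(normalized)
print(normalized.sum(axis=1))  # every row sums to 1
```

Passing axis=0 to div() aligns the row totals with the row index, so each row is divided by its own total rather than by another row's.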

Normalize by the interquartile range

(data - data.quantile(0.5).values) / (data.quantile(0.75)-data.quantile(0.25)).values

7.png
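The expression above subtracts each column's median (the 0.5 quantile) and divides by the interquartile range (the 0.75 quantile minus the 0.25 quantile). Here is a small sketch with made-up numbers to show what that does. The data, including the outlier in column b, is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data with one large outlier in column "b".
data = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 20.0, 30.0, 40.0, 100.0]}
)

median = data.quantile(0.5)                      # per-column median
iqr = data.quantile(0.75) - data.quantile(0.25)  # per-column interquartile range

# Center on the median and scale by the interquartile range.
scaled = (data - median) / iqr

print(scaled)
```

Because the median and the quartiles are barely affected by extreme values, the outlier in column b does not distort the scaling of the other rows as much as it would when dividing by the standard deviation.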

Logarithmic transformation

A logarithmic transformation takes the logarithm of a variable. If the variable follows a lognormal distribution, the transformed variable follows a normal distribution.

A logarithmic transformation also makes it easier to handle and display very small and very large numbers together.

It may be easier to understand if it is expressed in code.

import numpy as np
data.apply(np.log)

8.png
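To see the effect on very small and very large values, here is a rough sketch with made-up data. The column names and values are assumptions, and note that the natural logarithm is only defined for positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical positive values spanning many orders of magnitude.
data = pd.DataFrame(
    {"small": [0.001, 0.01, 0.1], "large": [1e3, 1e6, 1e9]}
)

# Take the natural logarithm of every element.
logged = data.apply(np.log)
print(logged)

# np.exp is the inverse of np.log, so the original values can be recovered.
restored = logged.apply(np.exp)
print(restored)
```

After the transformation, values that differed by factors of a thousand sit at evenly spaced points on a much narrower scale.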

Find the rate of change

The rate of change (growth rate) is a number that indicates how much a value has changed relative to a reference value.

pct_change() converts the data frame values to rates of change. Keep in mind that the first value has nothing before it, so its rate of change is NaN. The rate of change also made a casual appearance in the previous article.

data.T.pct_change().dropna(axis=0)

As I introduced yesterday, you can tidy up the table by dropping the missing values. However, the graph is a little hard to read because its first value is large.

9.png
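Here is a self-contained sketch of the same steps on a small made-up data frame, assuming that the columns hold consecutive periods, which is why it is transposed first. The labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per item, one column per year.
data = pd.DataFrame(
    {"2019": [100.0, 50.0], "2020": [110.0, 40.0], "2021": [121.0, 60.0]},
    index=["a", "b"],
)

# Transpose so that each row is a year, then compute the change
# of every row relative to the row before it.
rates = data.T.pct_change()
print(rates)  # the first row is NaN because nothing precedes it

# Drop the rows that contain NaN.
cleaned = rates.dropna(axis=0)
print(cleaned)
```

pct_change() compares each row with the one before it, so the orientation of the data frame decides which values are compared.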

Save IPython work history

This is not directly related to dataset processing, but it is handy to be able to write out what you tried in IPython to a file. If a trial worked, you can reuse it as a script as-is or extract code from the work history, which makes your work more reusable.

import readline
readline.write_history_file("history.py")

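This saves the history of the code you typed into IPython as history.py, which is very convenient. Note that this relies on IPython using readline. On recent IPython versions, which use prompt_toolkit instead, the %history -f history.py magic command writes the session history to a file.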

Summary

This time, too, I have summarized various operations that are often used when processing datasets.
