Python Pandas Data Preprocessing Personal Notes

About this article

A personal note that I update from time to time as I learn about "preprocessing" data using pandas.

Caveats about this article

-- When you think of preprocessing, you might think of AI / machine learning. I am learning preprocessing because I want to be able to process and analyze accumulated data as I wish, so this article may be insufficient as preprocessing information for machine-learning purposes.
-- I do not work in the AI field, so if you are looking for knowledge in that area, please treat this article as reference material only.
-- Since this is a work in progress, there are many ambiguous expressions and some parts may be hard to understand. I plan to update it from time to time, and I would appreciate it if you could point out anything incorrect.

What is preprocessing?

Cleaning and transforming accumulated data so that it can be used for the desired work (machine learning, etc.). Together with data collection, it is said to be the work that machine-learning practitioners spend the most time on.

Terms that appear when learning about preprocessing

-- Data cleansing: synonymous with data cleaning. Probably refers to part of the work within "preprocessing".
-- Data cleaning: synonymous with data cleansing.
-- Data wrangling: broader than the above two. Perhaps synonymous with preprocessing, or something that includes it.

Preprocessing flow

Roughly, the flow is as follows. There may be a more efficient order, and each step breaks down into more detailed tasks.

  1. Pre-analysis
    -- Check the amount of data, its state, etc.

  2. Cleaning
    -- Rename columns
    -- Check for inconsistent notation (the same value written in different ways)
    -- Check for duplicate rows
    -- Check for missing values
    -- Delete unnecessary columns

  3. Processing / conversion / restructuring
    -- Add columns
    -- Merge other data
    -- Add features

Preparing the test environment

Test environment

Prepare a Python test environment for learning.

-- Internet environment
  -- Google Colaboratory is free and excellent.
  -- Even for a heavy process, if you write the results out to Google Drive, you can check the dumped file once the run finishes, even if you close the browser or shut down the PC in the meantime (see the sketch after this list).
  -- For how to link Google Drive and Colab, refer to my past article (Mount and operate Google Drive with Google Colab).

--Local environment

  1. Put Python on your own PC.
  2. If you install Python and Jupyter Notebook on an in-house server and allow remote access from a browser, you can use Python from a PC that has no Python installed, and you can share small scripts, with a simple explanation, with team members who do not know Python.
  3. For the remote access permission setting method, refer to here.
  4. If you want to install Python and Jupyter Notebook from scratch, it's easy to install them as a set with the Anaconda package.
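For reference, a minimal sketch of the Google Drive dump mentioned above, as run in a Colab notebook (the path and file name are my own examples):

from google.colab import drive
import pandas as pd

# Mount Google Drive (Colab only; this prompts for authorization)
drive.mount('/content/drive')

# Dump a result to Drive so it survives closing the browser or the PC
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv('/content/drive/My Drive/result.csv', index=False)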

Test data

It is handy to be able to quickly prepare test data both as a Series and as a DataFrame.
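For example, a minimal sketch (my own throwaway data):

import numpy as np
import pandas as pd

# A quick Series of random integers
s = pd.Series(np.random.randint(0, 100, size=5), name="test")
# A quick DataFrame with mixed types and a missing value
df = pd.DataFrame({
    "Product name": ["AB-Limited", "CD-new", "EF"],
    "unit price": [100.0, np.nan, 250.0],
})
print(s)
print(df)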

Syntax / method used for pre-analysis

# Summary statistics: max, min, mean, standard deviation, etc.
df.describe()
# Data type of each column
df.info()
# Browse the first few rows of data
df.head()
# Browse the last few rows of data
df.tail()
# Shape of the data (number of rows and columns)
df.shape
# Count occurrences of each value
df["column"].value_counts()
# Count occurrences of each value (same as above)
pd.value_counts(df["column"])
# List of column names
df.columns
# Number of non-null values per column (the column names are also visible)
df.count()

Syntax / methods used for handling missing values

Checking for missing values

# Count missing values (computed as "number of rows - number of non-null values")
# This lists the missing-value status of all columns at once.
df_s.shape[0] - df_s.count()

# Count missing values (a specific column)
df_s["Defective content.1"].isnull().value_counts()

# Show rows with missing values (in a specific column)
df_s.loc[df_s["Defective content.1"].isnull()]
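As a related shortcut (my own addition), isnull().sum() returns the same per-column missing counts in one call:

# Per-column missing-value counts in one call
df_s.isnull().sum()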

Deleting missing values


# Drop a row if it contains even one missing value
df.dropna(axis=0, how='any', inplace=True)

# Set a threshold: keep rows with at least 3 non-null values and drop the rest
# Without inplace=True (or reassignment), the change is not reflected in the source data.
df.dropna(thresh=3, inplace=True)

Filling in missing values


# Fill missing values with fillna
df["unit price"] = df["unit price"].fillna(0)
# The same thing written with loc
df.loc[df["unit price"].isna(), "unit price"] = 0
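fillna can also fill with a computed value instead of a constant; a minimal sketch (my own example):

# Fill missing unit prices with the column mean instead of 0
df["unit price"] = df["unit price"].fillna(df["unit price"].mean())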

Calculate by ignoring missing values

# sum() and mean() ignore missing values by default (skipna=True).
# With skipna=False, the result becomes NaN if any missing value is involved.
df.sum(skipna=False)
df.mean(skipna=False)

Syntax / methods used for handling duplicates

Check for duplicates


# Count duplicated vs. non-duplicated rows
df_s.duplicated().value_counts()
# Count values in the rows flagged as duplicates
df_s[df_s.duplicated() == True].count()
# For a single column
df_s[df_s.duplicated() == True]["parts"].count()

Duplicate removal
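For now, a minimal sketch using drop_duplicates:

# Drop fully duplicated rows (keeps the first occurrence by default)
df_s = df_s.drop_duplicates()
# Consider only specific columns when deciding what counts as a duplicate
df_s = df_s.drop_duplicates(subset=["parts"], keep="first")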

Syntax / methods used for data type checking / conversion

Check that each data type is correct and convert it to an appropriate type where necessary. When converting to a numeric type, remember that preparatory work may be needed, such as turning strings into numbers and filling null data with 0 in advance. The `astype` method is mainly used.

# Data type reference
df2.info()
# Data type conversion (adds a new column with the converted values)
df["Part number 2"] = df["Product number"].astype(object)

# Data type conversion (assign back to the original column)
df_s["test"] = df_s["test"].astype(object) # object (string)
df_s["test"] = df_s["test"].astype(str)    # object (string)
df_s["test"] = df_s["test"].astype(float)  # float
df_s["test"] = df_s["test"].astype(int)    # integer

# Convert to a numeric type, coercing values that fail to parse into NaN
df_s["test"] = pd.to_numeric(df_s["test"], errors="coerce")

Syntax / method used for data extraction / retrieval


# Extract specific columns (does not change the original data)
df[["colB","colD"]]

# Extract using a regular expression
# (capture the leading run of non-digit characters in the unit price column)
tanka_nan = df["unit price"].str.extract(r"(^\D*)", expand=False)
tanka_nan.value_counts()

# Row extraction ("rows whose unit price value is 'Bulk product'") combined
# with column extraction (the unit price column)
df.loc[df["unit price"] == "Bulk product", "unit price"]

# Keep only rows whose value appears in a list, and create a new DataFrame
# (mylist is a list of product numbers prepared beforehand)
df2 = df[df["Product number"].isin(mylist)]

Syntax / method used for adding columns and features

Add columns using functions

dftest["grade"] = dftest["Product name"].apply(lambda x : x[0:2] )

Add columns using map function


# Map prefecture names to numbers with a dictionary
dftest = pd.DataFrame({"prefecture":["hokkaidou","shizuoka","okinawa"],"city":["sapporo","shizuoka","naha"]})
dftest_map = {"hokkaidou":10,"shizuoka":20,"okinawa":30}
dftest["maptest"] = dftest.prefecture.map(dftest_map)
dftest

Add column using apply function


# Create the following function:
#   If the argument contains "Limited", return the string "Limited product".
#   If the argument contains "new", return the string "new product".
#   Otherwise, return the string "na".
def newitem(x):
  if "Limited" in str(x):
    return "Limited product"
  elif "new" in str(x):
    return "new product"
  else:
    return "na"

# Create a new column "Product category" by evaluating the product name column with the function.
df["Product category"] = df["Product name"].apply(newitem)

Column name / row name (index)


# Rename a column (specified in dictionary format: {'old name': 'new name'})
df = df.rename(columns={'old name':'new name'})
# Rename a column (change the very first column; returns a new DataFrame, so assign it to keep the change)
df.rename(columns={df.columns[0]: "Product name"})

Assign to column


df.loc[:,"unit price"] = 0

Extract specific rows and assign

# Replace "Supply" values in the unit price column with 0
df.loc[df["unit price"] == "Supply", "unit price"] = 0

Add column with regular expression

df["grade"] = df["Product name"].str.extract(
"(^[A-Z]{2}|^[A-Z]{2}|^/d/d?//d+|^[0-9][0-9]?/[0-9]*)", expand=False)

Syntax / methods used for grouping

Grouping uses the GroupBy object

-- Use df.groupby().
-- If you assign the grouping result to a variable, that variable holds a GroupBy object (not a DataFrame or a Series).


# Retrieve one group as a DataFrame
df.groupby('items').get_group('item_1')
# The result of groupby is a GroupBy object
g1 = df.groupby('items')
print(g1.groups)
# Number of rows in each group
df.groupby('items').size()
# Aggregate all columns at once (requires "import numpy as np")
df_s.groupby("Bagging location").agg(np.sum)
# Aggregate using a dictionary (a different function per column)
df_s.groupby("Bagging location").agg({"Total production number":np.sum, "lot":"first"})

Adding a column with a lambda function

train["year"]= train["datetime"].apply(lambda x : x.split("-")[0] )

Syntax / methods used to convert between long (vertical) and wide (horizontal) data formats

Mainly done with the following. (This section is still being written.)

-- melt (meaning "to melt"; turns wide data into long format)

Converting long format (vertical) to wide format (horizontal)

in preparation

Converting wide format (horizontal) to long format (vertical)

in preparation
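Until the two sections above are filled in, a minimal sketch of both directions (my own example data; melt goes wide to long, pivot goes long to wide):

import pandas as pd

wide = pd.DataFrame({"shop": ["A", "B"], "jan": [10, 20], "feb": [30, 40]})
# Wide -> long: one row per (shop, month) combination
long_df = wide.melt(id_vars="shop", var_name="month", value_name="sales")
# Long -> wide: back to one column per month
wide_again = long_df.pivot(index="shop", columns="month", values="sales")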

Syntax / method used for binning

What is binning

Dividing continuous values into a small number of groups ("bins"). Also called bin splitting. "Bin" roughly means a shelf or container. For example, if a column contains people's ages, binning can group them into age brackets such as [teens, 20s, 30s, ...]. Use cut and qcut.

Think of it as sorting the original data (age data in this case) into the prepared "shelves". Naturally, the number of bins is smaller than the number of data points.


import random
import numpy as np
import pandas as pd

# First, suppose we have age data like this.
ages = [random.randint(0, 100) for i in range(20)]
print(ages)
# [84, 73, 27, 85, 8, 17, 46, 16, 95, 62, 38, 47, 63, 44, 69, 26, 0, 40, 34, 79]

# Prepare bins to group this age data
bins = np.arange(0, 90, 10) # Generate the sequence 0, 10, ..., 80
print(bins)
# >>> [ 0 10 20 30 40 50 60 70 80]

# Cut (values outside the bins, such as 85 or 0, become NaN)
cats = pd.cut(ages, bins)
print(cats)

# [NaN, (70.0, 80.0], (20.0, 30.0], NaN, (0.0, 10.0], ..., (20.0, 30.0], NaN, (30.0, 40.0], (30.0, 40.0], (70.0, 80.0]]
# Length: 20
# Categories (8, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80]]
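The age-bracket grouping mentioned above can be done by passing labels to cut, and qcut instead splits so that each bin receives roughly the same number of rows (a minimal sketch continuing the example above):

# Name the 8 bins instead of showing intervals
labels = ["0s", "10s", "20s", "30s", "40s", "50s", "60s", "70s"]
cats_labeled = pd.cut(ages, bins, labels=labels)

# qcut: quantile-based bins with roughly equal counts per bin
quartiles = pd.qcut(ages, 4)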

What is a feature?

Features = the columns fed directly into a machine-learning prediction model.

When creating a predictive model, not all columns of the collected data are needed for the model, and conversely the existing columns may not be enough. Selecting these columns, or adding new columns to increase the features, is important in machine learning; adding feature columns is called feature creation, and creating and tuning features is called feature engineering.
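For example (my own illustration; the "quantity" column is hypothetical), new features can be derived from existing columns:

# Derive a "sales" feature from two existing columns ("quantity" is a hypothetical column)
df["sales"] = df["unit price"] * df["quantity"]
# A boolean flag feature based on the product name
df["is_limited"] = df["Product name"].str.contains("Limited")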

References

-- Books (links to Amazon)
  -- O'Reilly "Introduction to Data Analysis with Python, 2nd Edition: Data Processing Using NumPy and pandas" by Wes McKinney
  -- "Basic coding for Python data analysis / machine learning! Introduction to pandas library utilization" (impress top gear) by Daniel Y. Chen
-- Video (Udemy)
  -- [Taken by 180,000 people worldwide] Practical Python Data Science

end
