python jupyter notebook Data preprocessing championship (target site: BicCamera)

things to do

The overall flow looks like this. This time we will do "3, data preprocessing".

[Data analysis basics]
1, data collection (scraping)
2, data storage
3, data preprocessing
4, data visualization and consideration
5, conclusions and measures based on the data

By the way, the previous installment was "1, data collection (scraping)".
Previous article
Link for people who want to watch the video

The preprocessing here uses the data collected by that scraping. If you haven't read the previous article, the flow will be hard to follow, so I recommend at least skimming it first.

Creation background

Search for "python data preprocessing" and everything you find uses the same Titanic or scikit-learn datasets. That was boring, so I wanted to do preprocessing that nobody else is doing, on data that nobody else is using.

environment

Required skills & environment

1, observe the data

■ First, read the CSV file into a pandas DataFrame.

I collected the data myself, so the contents are easy to follow.

python


import pandas as pd
df = pd.read_csv("biccamera_all_laptop.csv")
df.head()

Screenshot 2019-11-09 20.09.01.png

■ Get the number of rows, check the column names, check the number of nulls

Normally you can't know this until you look at the data, but I handled it in advance during scraping so that the number of nulls is zero.

python


# Get the number of rows
print(len(df))
# Check the column names
print(df.columns)
# Check the number of nulls per column
print(df.isnull().sum())

Screenshot 2019-11-09 20.10.02.png

■ Check the DataFrame information, count the unique values in each column

You can also get an overview from info().
However, looking at the unique counts, the titles do not overlap much.
There are 30 makers, and the unique counts for price and point are different.
When I looked at the point column beforehand it said "point (10%)", so I assumed it was simply 10% of the price, but maybe that is not the case.

python


# Check the DataFrame information (info() prints by itself and returns None)
df.info()
# Count the number of unique values per column
print(df.nunique())

Screenshot 2019-11-09 20.15.28.png
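One way to test the hunch that the points are simply 10% of the price is to compute the ratio row by row. This is only a minimal sketch on made-up numbers: the column names `new_price` and `new_point` mirror the cleaned integer columns created later in this article, and the values are invented for illustration.

```python
import pandas as pd

# Made-up sample values, not real BicCamera data
sample = pd.DataFrame({
    "new_price": [100000, 59800, 128000],
    "new_point": [10000, 5980, 1280],
})

# Ratio of points to price for each row
sample["point_rate"] = sample["new_point"] / sample["new_price"]

# Rows where the rate is not (approximately) 10%
not_ten_percent = sample[(sample["point_rate"] - 0.10).abs() > 0.005]

print(sample["point_rate"].round(2).value_counts())
print(not_ten_percent)
```

If `not_ten_percent` comes back empty on the real data, the point column adds no information beyond the price. If not, the rate itself may be worth keeping as a column.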

■ Try value_counts() on various columns

Well, that's about what you'd expect.
Not much to say here (laughs)

python


# Delivery date
print(df.terms.value_counts())
# Stock information
print(df.stock.value_counts())
# Manufacturer name
print(df.maker.value_counts())

2, extract necessary data

■ The title column looks suspicious, so take a look at it

python


for t in df.title:
    print(t)
    print(len(t))
    print("*" * 100)

Screenshot 2019-11-09 20.25.17.png

■ The titles mix full-width and half-width characters, so I made a function to unify them.

It converts half-width katakana to full-width, and full-width alphanumeric characters to half-width.
I use it a lot personally.

python


import re
import jctconv

def han2zen2han(string):
    """
    Make half-width katakana full-width,
    and make full-width alphanumeric characters half-width.
    :param string: string text
    :return: string text
    """
    # Half-width katakana -> full-width (digits and ASCII left alone)
    string = jctconv.h2z(string, kana=True, digit=False, ascii=False)
    # Full-width digits and ASCII -> half-width (katakana left alone)
    string = jctconv.z2h(string, kana=False, digit=True, ascii=True)
    return string
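
If jctconv is not available, the standard library can get close to the same result. The sketch below is a different technique, not the function above: `unicodedata.normalize("NFKC", ...)` converts half-width katakana to full-width and full-width alphanumerics to half-width in one pass. NFKC also normalizes other compatibility characters (for example full-width symbols and the full-width space), so its output is not guaranteed to match `han2zen2han` exactly. The name `normalize_width` is made up for this example.

```python
import unicodedata

def normalize_width(text):
    """Hypothetical stand-in for han2zen2han, based on Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

# Half-width katakana and full-width alphanumerics mixed together
sample = "ﾉｰﾄﾊﾟｿｺﾝ　Ｃｏｒｅ　ｉ５　１５．６型"
print(normalize_width(sample))
```

Whether the extra normalization is acceptable depends on the data, so it is worth comparing both functions on a handful of real titles before switching.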

■ Get a list like [XXXX / XXXX /] from the title.

Get every [...] group with the regular expression r"\[.+?\]".
Some titles contain more than one [...] group.
The laptop's screen size is also taken with a regular expression, because some [...] groups contain the size and some do not.
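
To make the two regular expressions concrete, here is a small sketch run against a made-up title. The title text is invented for illustration and is not taken from the scraped data. It assumes the title has already been normalized to half-width alphanumerics and that the size is written with 型 or インチ. The size pattern here is a simplified one, not the exact pattern used in this article.

```python
import re

# Invented title in the same general shape as a product title
title = "ノートパソコン Example Book [15.6型 /intel Core i5 /SSD:256GB /メモリ:8GB] [Office付き]"

# Non-greedy match, so each [...] group is captured separately
brackets = re.findall(r"\[.+?\]", title)
print(brackets)

# Screen size: digits (optionally with a decimal part) followed by インチ or 型
size = re.search(r"(\d+(?:\.\d+)?)(インチ|型)", title)
print(size.group(1) if size else None)
```

The `?` after `.+` is what keeps the two bracket groups apart. With a greedy `.+` the match would run from the first `[` to the last `]`.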

python


# Apply the function to the whole Series
df.title.apply(get_spec_list)
# Take out one element and check what is inside
df.title.apply(get_spec_list)[0]

Screenshot 2019-11-09 20.30.53.png

The function is below.

python


def get_spec_list(title):
    """
    Extract the PC specs from a product title as a list.

    spec_list = every [...] group extracted from the product title
    inch_list = PC screen size text extracted from the product title
    l = the PC specs taken out of spec_list, collected into one list
    :param title: string text
    :return: list
    """
    l = []
    t = han2zen2han(title)
    spec_list = re.findall(r"\[.+?\]", t)
    # Screen size: a number followed by the unit (インチ / inch / 型)
    inch_list = re.findall(r"(\d\d\.\d|\d\d|\d\..|\d)(インチ|inch|型)", t)
    inch = "".join(inch_list[0]) if inch_list else ""
    for spec in spec_list:
        # Strip brackets and spaces, unify the separators, then split into items
        specs = spec.replace("[", "").replace("]", "").replace(" ", "").replace("・", "/").replace(":", "").split("/")
        l.extend(specs)
    if inch:
        l.append(inch)
    # Remove duplicates (set() does not preserve the original order)
    return list(set(l))
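
One thing to be aware of: `list(set(l))` removes duplicates but does not keep the original order, so the position of each spec in the result can differ from its position in the title. If the order matters, a possible variation (not what the function above does) is `dict.fromkeys`, which keeps the first occurrence of each item in order. The sample list below is made up.

```python
# Made-up spec items with one duplicate
specs = ["15.6型", "intelCorei5", "SSD256GB", "メモリ8GB", "15.6型"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(specs))
print(unique_in_order)
```

For the later steps in this article the order does not matter, because each helper searches the whole list.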

■ Writing everything up and pasting screenshots is getting tedious, so I'll stop the step-by-step explanation here.

For the time being, please try the following.

python


# Extract the list that the PC specs are based on
df["spec_list"] = df.title.apply(get_spec_list)

# Get CPU data
df["intel_cpu"] = df.spec_list.apply(get_intelcpu)
df["amd_cpu"] = df.spec_list.apply(lambda x: "".join([i for i in x if re.search(r"amd", i.lower())]))

# Get memory data (int)
df["memory"] = df.spec_list.apply(get_memory)

# Get HDD data (int)
df["hdd"] = df.spec_list.apply(get_hdd)

# Get SSD data (int)
df["ssd"] = df.spec_list.apply(get_ssd)

# Get eMMC data (int)
df["emmc"] = df.spec_list.apply(get_emmc)

# Get screen size data (float)
df["inch"] = df.spec_list.apply(get_inch)

# Get screen size data as an integer (int)
df["int_inch"] = df.inch.astype("int")

# Get manufacturer name (str)
df["new_maker"] = df.maker.apply(get_maker)

# Get PC price (int); regex=True is needed on pandas 2.0+, where patterns are literal by default
df["new_price"] = df.price.str.replace(r"\D", "", regex=True).astype("int")

# Get points awarded on purchase (int); cut at the label whether it is written ポイント or point
df["new_point"] = df.point.str.replace(r"(ポイント|point|\n).*", "", regex=True).str.replace(",", "").astype("int")

# Get PC rating (int)
df["new_ratings"] = df.ratings.str.replace(r"\D", "", regex=True).astype("int")

# Get the number of characters in the PC title (int)
df["string_len"] = df.title.str.len()

# Get the number of words in the PC title (int)
df["words_len"] = df.title.str.split().str.len()
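
The helper functions called above (`get_intelcpu`, `get_memory`, `get_hdd`, and so on) are not listed in this article. As a rough idea of what one of them could look like, here is a hypothetical sketch of a memory extractor. The name `get_memory_sketch`, the regular expression, and the assumption that memory appears as an item like `メモリ8GB` are all assumptions made for illustration, not the original implementation.

```python
import re

def get_memory_sketch(spec_list):
    """Hypothetical helper: return the memory size in GB as int, or 0 if not found."""
    for spec in spec_list:
        # Assumes memory is written like "メモリ8GB" once separators are stripped
        m = re.search(r"メモリ(\d+)GB", spec)
        if m:
            return int(m.group(1))
    return 0

print(get_memory_sketch(["15.6型", "intelCorei5", "SSD256GB", "メモリ8GB"]))
print(get_memory_sketch(["SSD256GB"]))
```

Returning 0 instead of None keeps the column an integer type, at the cost of not distinguishing "no memory listed" from a real value. Which default is better depends on how the column is used afterwards.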

The final result will be like this.

Screenshot 2019-11-09 20.34.08.png
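The price, point, and ratings columns are cleaned with `Series.str.replace` and a regular expression. The sketch below shows the idea on made-up strings. The sample values and their exact format are assumptions, not scraped data. Passing `regex=True` explicitly matters on pandas 2.0 and later, where `str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd

# Made-up raw strings; the real column format may differ
raw_price = pd.Series(["¥108,800", "¥59,800"])
raw_point = pd.Series(["10,880ポイント(10%)", "598ポイント\n(1%)"])

# Remove every non-digit character, then convert to int
price = raw_price.str.replace(r"\D", "", regex=True).astype(int)

# Drop everything from the label (or a newline) onward, then remove the thousands separator
point = (
    raw_point.str.replace(r"(ポイント|\n).*", "", regex=True)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(price.tolist(), point.tolist())
```

The `\n` alternative in the pattern is there because `.` does not match a newline, so text on a second line would otherwise survive the first replacement.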

In closing

I also posted this as a video, so if you want to see how the processing flows, please watch it on YouTube.

Video link

If you want to see the code running, go to "data processing 02" in the link above. The explanation is quite long, so watching at a faster speed is recommended.
