[PYTHON] [Introduction to Pandas] Read a csv file without a column name and give it a column name

Give a column name to a csv file without a column name

I sometimes need to read a csv file that has no column names and then assign names to the columns, but I often forget how to do it, so I am writing it down here as a memo.

Sorry, the content really is nothing special.

Data to use

The data used is the housing data (housing.data) published in the UCI Machine Learning Repository.

Reading the data

First, read the data. The values are separated by whitespace rather than commas, so pass a whitespace pattern to sep. Also, housing.data has no header row, so if it is read normally the first row of data is treated as the column names. Specify header=None to avoid that.

import pandas as pd
# housing.data is whitespace-separated and has no header row
df = pd.read_csv("housing.data", header=None, sep=r"\s+")
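
To see what header=None changes, here is a minimal sketch on a made-up two-row sample. The sample text and the variable names are assumptions for illustration and are not taken from housing.data.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated sample with no header row
sample = "0.1 18.0 2.31\n0.2 0.0 7.07\n"

# Default behaviour: the first data row is consumed as the header
with_header = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# header=None: every row is kept as data and columns are numbered from 0
no_header = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(with_header.shape)        # one row was lost to the header
print(no_header.shape)          # both rows are kept
print(list(no_header.columns))  # automatic integer column names
```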

Reading housing.data this way gives the following result:

Screenshot 2019-11-17 16.55.48.png

Numbers from 0 to 13 are automatically assigned as the column names. Replace these automatically generated names with the original column names. First, create a dictionary (labels_dict) that maps each column name before conversion to the column name after conversion. If you pass labels_dict to the columns argument of the data frame's rename method, the column names are replaced according to the mapping in the dictionary.

labels = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
# Map the automatic integer column names (0 to 13) to the original names
labels_dict = {num: label for num, label in enumerate(labels)}
df = df.rename(columns=labels_dict)
# Save the data frame with the column names added as a csv file
df.to_csv("housing_data.csv", index=False)

If you check df after running this, you can see that the column names have been changed.

Screenshot 2019-11-17 17.02.34.png
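
As a side note, read_csv also accepts a names argument, which may save the separate rename step. The sketch below compares the two approaches on a made-up three-column sample. The sample text and the shortened label list are assumptions for illustration, and this is an alternative, not the method used in this article.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated sample with no header row
sample = "0.1 18.0 2.31\n0.2 0.0 7.07\n"
labels = ["CRIM", "ZN", "INDUS"]  # first three labels only, for brevity

# Approach used in the article: read with integer columns, then rename
df_renamed = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+")
df_renamed = df_renamed.rename(columns=dict(enumerate(labels)))

# Alternative: pass the names directly; the first row is then kept as data
df_named = pd.read_csv(io.StringIO(sample), names=labels, sep=r"\s+")

print(list(df_renamed.columns))
print(list(df_named.columns))
```

Both data frames should end up with the same labels and values. The rename approach is still handy when the data frame has already been loaded.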

Bonus (please note that the following has nothing to do with the main topic of this article)

Since we have the data anyway, let's use it to roughly estimate house prices.

Let's take a quick look at the data

If you execute the following code, you can see that this data is all numerical data and there are no missing values. You can also display statistics. Please try it if you like.

from IPython.display import display
# Display the data type of each column
display(df.dtypes)
# Display the number of missing values in each column
display(df.isnull().sum())
# Display summary statistics
display(df.describe())
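
If you would rather check these two properties programmatically than read the output by eye, something like the following sketch may work. The toy data frame and the names all_numeric and no_missing are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({"RM": [6.5, 6.4, 7.1], "CHAS": [0, 0, 1]})

# True when every column has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in toy.dtypes)

# True when no cell in the frame is missing
no_missing = not toy.isnull().any().any()

print(all_numeric, no_missing)
```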

Normally you would preprocess the data while checking its statistics and only then feed it to a machine learning algorithm, but that is omitted this time. After all, this is just a bonus.

Training a linear regression model

I'm omitting various things. After all, this is just a bonus. As a minimum, the data is standardized (and reduced to 10 principal components with PCA) and the model is evaluated on test data, but hyperparameters are not tuned at all. The evaluation simply uses the mean squared error (MSE). The code is below.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

#Pipeline settings
pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("lr", LinearRegression())  # the normalize argument was removed in newer scikit-learn versions
])

#Data split
xtrain, xtest, ytrain, ytest = train_test_split(
    df.drop(columns="MEDV"), df["MEDV"], test_size=0.3, random_state=1
)

#Model learning
pipe.fit(X=xtrain, y=ytrain)

#Price forecast
ypred = pipe.predict(xtest)

#Model evaluation
display(mean_squared_error(ytest, ypred))

#View results
result = pd.DataFrame(columns=["index", "true", "pred"])
result["index"] = range(len(ytest))
result["true"] = ytest.tolist()
result["pred"] = ypred

plt.figure(figsize=(15,5))
plt.scatter(result["index"], result["true"], marker="x", label="true")
plt.scatter(result["index"], result["pred"], marker="v", label="predict")
plt.xlabel("ID")
plt.ylabel("Median price")
plt.grid()
plt.legend()
plt.show()

When I ran this, I got a mean squared error of 21.19. I can't say whether this is good or bad without examining the data properly, but for the time being I was able to measure the difference between the predicted prices and the true values.
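
The mean squared error is expressed in squared units of MEDV, so taking its square root (the RMSE) may make it easier to interpret. The sketch below does this in plain Python. The helper functions mse and rmse are hypothetical names introduced here, and only the value 21.19 comes from the run above.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Square root of the MSE reported above
reported_rmse = math.sqrt(21.19)
print(f"{reported_rmse:.2f}")
```

An MSE of 21.19 corresponds to an RMSE of roughly 4.6 in the units of MEDV.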

In addition, the predicted and true values are plotted below. At a glance, you can see that the higher the price, the larger the deviation, with the predictions coming out too low.

Screenshot 2019-11-17 18.33.46.png
