[PYTHON] Be careful when reading data with pandas (specify dtype)

When reading data with pandas, it is safer to specify dtype explicitly.

This article uses pandas 0.18.1.

If you do not specify dtype, pandas infers the type of each column on its own. For example, suppose you have the following tab-delimited data:

data_1.txt

id	x01	x02	x03	x04	x05	x06	x07	x08	x09	x10
0001	0.54	0.54	0.85	0.79	0.54	0.36	0.28	0.52	0.21	0.49
0002	0.72	0.68	0.77	0.69	0.07	na	0.29	0.42	0.32	0.51
0003	0.68	0.99	0.19	0.16	0.31	0.76	0.57	0.08	0.07	0.98
0004	0.98	na	0.49	0.47	0.09	0.52	0.42	0.35	0.83	0.64
0005	0.37	0.35	0.99	0.88	0.81	0.46	0.57	0.47	0.06	0.55

# coding: UTF-8

import pandas as pd

# Read the tab-delimited file; the string 'na' is treated as missing
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na')
print(df)

   id   x01   x02   x03   x04   x05   x06   x07   x08   x09   x10
0   1  0.54  0.54  0.85  0.79  0.54  0.36  0.28  0.52  0.21  0.49
1   2  0.72  0.68  0.77  0.69  0.07   NaN  0.29  0.42  0.32  0.51
2   3  0.68  0.99  0.19  0.16  0.31  0.76  0.57  0.08  0.07  0.98
3   4  0.98   NaN  0.49  0.47  0.09  0.52  0.42  0.35  0.83  0.64
4   5  0.37  0.35  0.99  0.88  0.81  0.46  0.57  0.47  0.06  0.55

If you do not specify the type, the result is as shown above: the leading zeros of id are dropped (0001 becomes 1). Checking the data type of id with df.dtypes shows that it was read as an integer.
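
As a minimal sketch of that check, the following reproduces the inference on a small excerpt of the data held in a string, so no file is needed. The `raw` string, the use of `io.StringIO`, and the variable names are assumptions made for illustration (Python 3 is assumed); they are not part of the original script.

```python
import io
import pandas as pd

# Hypothetical two-row, three-column excerpt of data_1.txt (tab-delimited)
raw = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# No dtype given, so pandas infers a type for every column
df = pd.read_csv(io.StringIO(raw), header=0, sep="\t", na_values="na")

# Show the inferred types; id is parsed as an integer column
print(df.dtypes)

# The zero-padded ids have become plain integers
inferred_ids = df["id"].tolist()
print(inferred_ids)
```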

In such a case, specify dtype for each column:

# Keep id as a string (object) and read the measurement columns as float
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na',
                 dtype = {'id':'object', 'x01':'float', 'x02':'float','x03':'float','x04':'float','x05':'float','x06':'float',
                          'x07':'float','x08':'float','x09':'float','x10':'float'})

print(df)
     id   x01   x02   x03   x04   x05   x06   x07   x08   x09   x10
0  0001  0.54  0.54  0.85  0.79  0.54  0.36  0.28  0.52  0.21  0.49
1  0002  0.72  0.68  0.77  0.69  0.07   NaN  0.29  0.42  0.32  0.51
2  0003  0.68  0.99  0.19  0.16  0.31  0.76  0.57  0.08  0.07  0.98
3  0004  0.98   NaN  0.49  0.47  0.09  0.52  0.42  0.35  0.83  0.64
4  0005  0.37  0.35  0.99  0.88  0.81  0.46  0.57  0.47  0.06  0.55

In this way, specifying dtype keeps the values in their original form. This corresponds to colClasses in R. In my experience, the data also seems to be read faster when dtype is specified, although I have not measured it.
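
When there are many columns, typing out every entry of the dtype dictionary gets tedious. One possible approach, sketched here under the assumption that all columns except id should be float, is to build the dictionary from a list of column names. The inline `raw` data and the names `float_cols` and `dtype_map` are hypothetical and used only for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt of data_1.txt (tab-delimited)
raw = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# Columns to read as float; id stays a string so leading zeros survive
float_cols = ["x01", "x02"]
dtype_map = {"id": "object"}
dtype_map.update({col: "float" for col in float_cols})

df = pd.read_csv(io.StringIO(raw), header=0, sep="\t",
                 na_values="na", dtype=dtype_map)

ids = df["id"].tolist()
print(ids)
print(df.dtypes)
```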

Alternatively, you can read every column as object first and then convert only the columns that need a different type.

# First, read every column as object
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na', dtype = 'object')

var_lst = ['x01','x02','x03','x04','x05','x06','x07','x08','x09','x10']
df[var_lst] = df[var_lst].astype(float)    # Convert the measurement columns to float
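
As a small variation on the same idea, the list of columns to convert can be derived from the DataFrame itself rather than typed by hand. This is only a sketch: it assumes that every column other than id holds numeric text, and the inline `raw` data is a hypothetical stand-in for data_1.txt.

```python
import io
import pandas as pd

# Hypothetical excerpt of data_1.txt (tab-delimited)
raw = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# Read every column as object so nothing is converted implicitly
df = pd.read_csv(io.StringIO(raw), header=0, sep="\t",
                 na_values="na", dtype="object")

# Assumption: everything except id should become float
var_lst = [col for col in df.columns if col != "id"]
df[var_lst] = df[var_lst].astype(float)

ids = df["id"].tolist()
print(df.dtypes)
```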
