Hello! This may be out of the blue, but have you ever heard of "**Tidy Data**" (translated as "orderly data" in Japanese)? I recently studied pandas in a course called "Pandas Data Analysis with Python Fundamentals" on Safari Books Online, and the chapter "Tidy Data" at the end introduced me to this idea. What is "Tidy Data"? It is a concept proposed by Hadley Wickham (who is revered as a god in the world of R) in his 2014 paper "Tidy Data". The paper also has a Japanese translation, which is introduced at the beginning of the translation as follows.
A complete translation of the paper "Tidy Data", which advocated the concept of "orderly data" for effective data analysis. Orderly data is a useful concept that facilitates data analysis in R and similar tools.
Tidy Data is a concept and methodology for doing the cleaning and organizing of data, a task said to occupy 80% of an analyst's work, as easily and effectively as possible. Reflecting the author's background, the paper was written with R in mind, but the Safari Online course mentioned above showed how to implement those data-tidying patterns in Python / pandas. I am ashamed to say I had never read a properly organized methodology for structuring data for analysis, so it was very inspiring, and that prompted this article.
:triangular_flag_on_post: (2019/3/2) This article was posted in April 2017, but "Pandas for Everyone", the pandas book by Daniel Y. Chen, the lecturer of the video course this article is based on, has since been translated and published in Japanese as "[Basic coding for Python data analysis / machine learning! Introduction to pandas library utilization](https://www.amazon.co.jp/dp/4295005657/)". It also contains an explanation of orderly data, so I introduce it here.
The author of the site Colorless Green Ideas explains the translation of the paper and the concept of "orderly data" in Japanese in a very easy-to-understand manner:

- "What is orderly data": you can get the general idea just by reading this one. Highly recommended.
- "[Translation] Orderly data": the full translation of the paper.

For the details, it is better to read the article "What is orderly data" above than my amateur explanation, but for busy readers I will summarize it (very roughly) below.
[Introduction (stating the obvious)] :triangular_flag_on_post: (October 2019) Changed the explanation example [^1]
[^1]: (2019/10) This section originally used height and weight from physical measurements as its example, but @acevif commented that it was inappropriate as an example, and I think that is right, so I revised it. See the comment section for details.
The following lists "treatments performed on patients at a medical institution and the resulting values". To see the result of each treatment for each person, you read along a row. Each person's results line up in the row direction; each such row is called an **observation**. The types of treatment line up in the column direction; these are called **variables**. This is a form that is easy for humans to understand.

[Meaning and structure of data] Now, given a group of observed values, "applying each treatment to each patient produced these numbers", the same data can be represented by several different **structures** when actually laid out as a dataset, as shown below.

In the figure above, structure :one: packs two observations into each row and :two: packs three observations into each row, while the rightmost :three: holds exactly one observation per row ("the result of treatment X performed on Mr. C was the value 3"). In other words, in form :three: the **"semantics" and the "structure" of the data match**. This is the key point of "orderly data": the paper argues that this form is easier to handle in data analysis (aside from whether it is easy for humans to read). (For why it is easier to handle, see the section "Importance of matching structure and meaning" in the article "What is orderly data".)
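To make the contrast concrete, here is a minimal sketch in pandas (hypothetical data and column names, not the paper's actual dataset): melting the human-friendly wide table yields the "one observation per row" shape of :three:.

python3
import pandas as pd

# Hypothetical wide table: one row per patient, one column per treatment
wide = pd.DataFrame({'patient': ['A', 'B', 'C'],
                     'treatment_X': [1, 4, 3],
                     'treatment_Y': [5, 2, 6]})

# Melt so each row holds exactly one (patient, treatment, result) observation
tidy = wide.melt(id_vars='patient', var_name='treatment', value_name='result')
print(tidy)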
[So what is "orderly data"?] To put it very roughly:

- Organize and convert the data so that its meaning and structure match, and the later analysis steps (aggregation, plotting, and so on) become easier to handle.
- The data to be analyzed should be properly normalized.

I think that is what it comes down to.
In the "Arrangement of cluttered data sets" chapter of the "Tidy Data" paper, list five "common patterns of cluttered data" and indicate the direction of each data organization. I am. In the following, this article will introduce an example of how to actually organize them with pandas.
――Of course, there are many other ways to implement it at the code level, so this is just an example. ――For the title line, I used the expression in the Japanese translation mentioned above. --The example is basically the one published in the original paper
1. Column headers are values, not variable names
Below is a count of people by "religion" and "income bracket" from a survey. In the table on the left, the income bracket runs along the horizontal axis, but it is not good that the column headers themselves, such as "<$10k" and "$10-20k", contain values. Convert this to the "orderly data" form on the right.

This conversion is easy with pandas's melt:

- id_vars=: the column(s) that remain as the axis ("religion" in this example)
- var_name=: the name for the new column holding the former column headers ("range" in this example)
- value_name=: the name for the new column holding the values ("count" in this example)
python3
In [1]: import pandas as pd
#Data reading
In [2]: df=pd.read_excel('data/02_Religion.xls')
In [3]: print(df.head())
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
0 Agnostic 27 34 60 81 76 137
1 Atheist 12 27 37 52 35 70
2 Buddhist 27 21 30 34 33 58
3 Catholic 418 617 732 670 638 1116
4 Don’tknow/refused 15 14 15 11 10 35
#Run melt
In [4]: df_long = pd.melt(df ,id_vars='religion', var_name='range', value_name='count')
# Converted to tidy data format
In [5]: print(df_long.head(n=10))
religion range count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don’tknow/refused <$10k 15
5 EvangelicalProt <$10k 575
6 Hindu <$10k 1
7 HistoricallyBlackProt <$10k 228
8 Jehovah’sWitness <$10k 20
9 Jewish <$10k 19
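Incidentally, melt is also available as a DataFrame method, so the same conversion as In [4] can be written like this (a sketch, equivalent in result):

python3
# Method-style melt: same result as pd.melt(df, ...)
df_long = df.melt(id_vars='religion', var_name='range', value_name='count')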
Below is a dataset of Billboard hit charts. For each song that entered the chart, it holds the ranking in week 1, week 2, and so on, along with data such as year, song title, and artist.

This is also converted into "orderly data". You can use melt as in Example 1, but here there are multiple axis columns, so pass them as a list: id_vars=['year','artist','track','time','date.entered']. The rest is the same as in Example 1.
python3
#Data reading
In [15]: billboard = pd.read_excel('data/04_HitChart.xls')
In [16]: print(billboard.head())
year artist track time date.entered wk1 wk2 wk3
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 87 82 72
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 91 87 92
2 2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 81 70 68
3 2000 98^0 Give Me Just One Nig... 03:24:00 2000-08-19 51 39 34
4 2000 A*Teens Dancing Queen 03:44:00 2000-07-08 97 97 96
#Run melt (pass id_vars as a list)
In [17]: billboard_long = pd.melt(billboard, id_vars=['year','artist','track','time','date.entered'], var_name='week', value_name='rank')
# Converted to tidy data format
In [18]: print(billboard_long.head(n=10))
year artist track time date.entered week rank
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk1 91
2 2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 wk1 81
3 2000 98^0 Give Me Just One Nig... 03:24:00 2000-08-19 wk1 51
4 2000 A*Teens Dancing Queen 03:44:00 2000-07-08 wk1 97
5 2000 Aaliyah I Don’t Wanna 04:15:00 2000-01-29 wk1 84
6 2000 Aaliyah Try Again 04:03:00 2000-03-18 wk1 59
7 2000 Adams, Yolanda Open My Heart 05:30:00 2000-08-26 wk1 76
8 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk2 82
9 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk2 87
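As an aside, since the week columns all share the "wk" prefix, pd.wide_to_long offers another route to the same shape. A sketch, assuming every week column follows the wk1, wk2, ... naming:

python3
# wide_to_long strips the 'wk' stub and keeps the numeric suffix as 'week'
billboard_long2 = (pd.wide_to_long(billboard, stubnames='wk',
                                   i=['year', 'artist', 'track', 'time', 'date.entered'],
                                   j='week')
                     .reset_index()
                     .rename(columns={'wk': 'rank'}))
# Note: 'week' comes out numeric (1, 2, ...) instead of 'wk1', 'wk2', ...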
2. Multiple variables are stored in one column
This is data on tuberculosis (TB): the number of TB patients by "gender" and "age group" in addition to "country" and "year". m014 means male (m), 0-14 years old; m1524 means male (m), 15-24 years old. When two variables, "gender" and "age group", are packed into a single column like this, it is awkward to analyze by gender or by age group, so the column should be split into "gender" and "age group".

- First, melt around "country" and "year" as in the previous examples.
- Next, split the variable column into two columns, "gender" and "age group". How to split depends on the case; here we use the fact that the first character of the column name indicates the gender, and split into the first character and the rest.
python3
#Data reading
In [2]: tb = pd.read_excel('data/05_TB.xls')
In [3]: print(tb.head())
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
0 AD 2000 0.0 0.0 1.0 0.0 0 0 0.0 NaN NaN
1 AE 2000 2.0 4.0 4.0 6.0 5 12 10.0 NaN 3.0
2 AF 2000 52.0 228.0 183.0 149.0 129 94 80.0 NaN 93.0
3 AG 2000 0.0 0.0 0.0 0.0 0 0 1.0 NaN 1.0
4 AL 2000 2.0 19.0 21.0 14.0 24 19 16.0 NaN 3.0
#Run melt
In [4]: tb_long = pd.melt(tb, id_vars=['country','year'], var_name='variable', value_name='value')
In [5]: print(tb_long.head(n=10))
country year variable value
0 AD 2000 m014 0.0
1 AE 2000 m014 2.0
2 AF 2000 m014 52.0
3 AG 2000 m014 0.0
4 AL 2000 m014 2.0
5 AM 2000 m014 2.0
6 AN 2000 m014 0.0
7 AO 2000 m014 186.0
8 AR 2000 m014 97.0
9 AS 2000 m014 NaN
#Extract the first character of the variable column and add the column "gender"
In [6]: tb_long['gender'] = tb_long.variable.str[0]
#Extract the second and subsequent characters of the variable column and add the column "age"
In [7]: tb_long['age'] = tb_long.variable.str[1:]
# gender and age columns have been appended
In [8]: print(tb_long.head(n=10))
country year variable value gender age
0 AD 2000 m014 0.0 m 014
1 AE 2000 m014 2.0 m 014
2 AF 2000 m014 52.0 m 014
3 AG 2000 m014 0.0 m 014
4 AL 2000 m014 2.0 m 014
5 AM 2000 m014 2.0 m 014
6 AN 2000 m014 0.0 m 014
7 AO 2000 m014 186.0 m 014
8 AR 2000 m014 97.0 m 014
9 AS 2000 m014 NaN m 014
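If the split rule is more complicated than "first character vs. the rest", str.extract with a regular expression does the same split in one step. A sketch, assuming the m014 / f1524-style column naming:

python3
# Capture gender (one letter) and age group (the rest) in a single pass
tb_long[['gender', 'age']] = tb_long['variable'].str.extract(r'([mf])(.+)')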
3. Variables are stored in both rows and columns
This is daily weather data from a weather station (MX17004) in Mexico. Day 1 (d1), day 2 (d2), and so on run along the horizontal axis against the "year" and "month" axes, while the element column holds the day's maximum and minimum temperatures as two separate rows. It is not good that observed facts are stored across both rows and columns, so organize them into the form on the right.

- First, melt around "id", "year", "month", and "element" as in the previous examples.
- Next, use pivot_table to promote tmax and tmin to columns (pivot_table can be seen as the inverse of melt).
- Finally, the index is messy right after pivot_table, so tidy it up with reset_index().
python3
#Data reading
In [10]: weather = pd.read_excel('data/06_Weather.xls')
In [11]: print(weather.head())
id year month element d1 d2 d3 d4 d5 d6 d7 d8
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN
#Run melt
In [12]: weather_melt = pd.melt(weather,id_vars=['id','year','month','element'], var_name='day', value_name='temp')
In [13]: print(weather_melt.head(n=10))
id year month element day temp
0 MX17004 2010 1 tmax d1 NaN
1 MX17004 2010 1 tmin d1 NaN
2 MX17004 2010 2 tmax d1 NaN
3 MX17004 2010 2 tmin d1 NaN
4 MX17004 2010 3 tmax d1 NaN
5 MX17004 2010 3 tmin d1 NaN
6 MX17004 2010 4 tmax d1 NaN
7 MX17004 2010 4 tmin d1 NaN
8 MX17004 2010 5 tmax d1 NaN
9 MX17004 2010 5 tmin d1 NaN
# Promote the values of the element column to columns with pivot_table
In [14]: weather_tidy = weather_melt.pivot_table(index=['id','year','month','day'], columns='element', values='temp')
# Promoted, but the index is messy
In [15]: print(weather_tidy.head(n=10))
element tmax tmin
id year month day
MX17004 2010 1 d1 NaN NaN
d2 NaN NaN
d3 NaN NaN
d4 NaN NaN
d5 NaN NaN
d6 NaN NaN
d7 NaN NaN
d8 NaN NaN
2 d1 NaN NaN
d2 27.3 14.4
# reset_index() cleans it up
In [16]: weather_tidy_flat = weather_tidy.reset_index()
In [17]: print(weather_tidy_flat.head(n=10))
element id year month day tmax tmin
0 MX17004 2010 1 d1 NaN NaN
1 MX17004 2010 1 d2 NaN NaN
2 MX17004 2010 1 d3 NaN NaN
3 MX17004 2010 1 d4 NaN NaN
4 MX17004 2010 1 d5 NaN NaN
5 MX17004 2010 1 d6 NaN NaN
6 MX17004 2010 1 d7 NaN NaN
7 MX17004 2010 1 d8 NaN NaN
8 MX17004 2010 2 d1 NaN NaN
9 MX17004 2010 2 d2 27.3 14.4
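Going one step further, the paper also combines year, month, and day into a real date. A sketch of that final touch, assuming the "d1", "d2", ... naming:

python3
# Strip the leading 'd' and build a proper date column
weather_tidy_flat['day'] = weather_tidy_flat['day'].str.lstrip('d').astype(int)
# errors='coerce' turns impossible combinations (e.g. Feb 30) into NaT
weather_tidy_flat['date'] = pd.to_datetime(
    weather_tidy_flat[['year', 'month', 'day']], errors='coerce')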
4. Multiple types of observational units are stored in the same table
The following is the conversion result of the Billboard hit chart from Example 2 of "1. Column headers are values, not variable names". Looking closely, this dataset holds the "song-related part" and the "ranking-related part" in the same table, and the song-related part is repeated many times. From a "normalization" point of view, it is better to split these into two datasets.

First, proceed as follows for the "song-related part":

- Create a new DataFrame containing only the song-related columns.
- Eliminate duplicates with drop_duplicates().
- Generate an ID column for joining with the "ranking part" later.

Then follow the steps below for the "ranking part":

- JOIN (merge) the DataFrame created above with the original DataFrame to attach the ID column.
- Then create a new DataFrame containing only the ranking-related columns.
python3
#Data confirmation
In [23]: print(billboard_long.head())
year artist track time date.entered week rank
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk1 91
2 2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 wk1 81
3 2000 98^0 Give Me Just One Nig... 03:24:00 2000-08-19 wk1 51
4 2000 A*Teens Dancing Queen 03:44:00 2000-07-08 wk1 97
#The data of the same song is repeated
In [24]: print(billboard_long[billboard_long.track =='Baby Don’t Cry'].head())
year artist track time date.entered week rank
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87
8 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk2 82
16 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk3 72
#Create a new DataFrame in "Song-related part" (repeated part still overlaps)
In [25]: billboard_songs = billboard_long[['year','artist','track','time']]
In [26]: billboard_songs.head(n=10)
Out[26]:
year artist track time
0 2000 2 Pac Baby Don’t Cry 04:22:00
1 2000 2Ge+her The Hardest Part Of ... 03:15:00
2 2000 3 Doors Down Kryptonite 03:53:00
3 2000 98^0 Give Me Just One Nig... 03:24:00
4 2000 A*Teens Dancing Queen 03:44:00
5 2000 Aaliyah I Don’t Wanna 04:15:00
6 2000 Aaliyah Try Again 04:03:00
7 2000 Adams, Yolanda Open My Heart 05:30:00
8 2000 2 Pac Baby Don’t Cry 04:22:00
9 2000 2Ge+her The Hardest Part Of ... 03:15:00
#24 rows x 4 columns
In [27]: billboard_songs.shape
Out[27]: (24, 4)
# Eliminate duplicates with drop_duplicates()
In [28]: billboard_songs = billboard_songs.drop_duplicates()
#8 rows x 4 columns
In [29]: billboard_songs.shape
Out[29]: (8, 4)
In [30]: print(billboard_songs.head())
year artist track time
0 2000 2 Pac Baby Don’t Cry 04:22:00
1 2000 2Ge+her The Hardest Part Of ... 03:15:00
2 2000 3 Doors Down Kryptonite 03:53:00
3 2000 98^0 Give Me Just One Nig... 03:24:00
4 2000 A*Teens Dancing Queen 03:44:00
#Generate and add ID columns with serial numbers
In [31]: billboard_songs['id'] = range(len(billboard_songs))
#Completion of "Song-related part"
In [32]: print(billboard_songs.head())
year artist track time id
0 2000 2 Pac Baby Don’t Cry 04:22:00 0
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 1
2 2000 3 Doors Down Kryptonite 03:53:00 2
3 2000 98^0 Give Me Just One Nig... 03:24:00 3
4 2000 A*Teens Dancing Queen 03:44:00 4
#From here on, the part related to "ranking"
# JOIN (merge) the original DataFrame with the billboard_songs DataFrame created above to make a new DataFrame
In [33]: billboard_ratings = billboard_long.merge(billboard_songs, on=['year','artist','track','time'])
#ID column added
In [34]: billboard_ratings.head()
Out[34]:
year artist track time date.entered week rank id
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87 0
1 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk2 82 0
2 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk3 72 0
3 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk1 91 1
4 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk2 87 1
#Create a new DataFrame in "Ranking part"
In [35]: billboard_ratings = billboard_ratings[['id','date.entered','week','rank']]
#Completion of "ranking part"
In [36]: print(billboard_ratings.head(n=10))
id date.entered week rank
0 0 2000-02-26 wk1 87
1 0 2000-02-26 wk2 82
2 0 2000-02-26 wk3 72
3 1 2000-09-02 wk1 91
4 1 2000-09-02 wk2 87
5 1 2000-09-02 wk3 92
6 2 2000-04-08 wk1 81
7 2 2000-04-08 wk2 70
8 2 2000-04-08 wk3 68
9 3 2000-08-19 wk1 51
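To confirm that nothing was lost in the split, the two tables can be merged back together. A quick sketch:

python3
# Joining songs and ratings on id reproduces the content of billboard_long
billboard_restored = billboard_songs.merge(billboard_ratings, on='id')
print(billboard_restored.shape)  # same row count as billboard_long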
5. A single observational unit is stored in multiple tables
It is inconvenient for analysis when observed facts about the same unit are stored separately across multiple datasets. As shown below, multiple datasets with the same column layout are concatenated into one.

This case is easy because you only have to concat the datasets, but with many datasets it is tedious to read each one with pd.read_xxx() by hand. glob returns the list of files whose names match a given pattern, so I will show how to use it to drive the concat.
python3
In [38]: import glob
#List of filenames that match the criteria
In [39]: concat_files = glob.glob('data/concat*')
In [40]: print(concat_files)
['data\\concat_1.csv', 'data\\concat_2.csv', 'data\\concat_3.csv']
In [41]: list_concat_df =[]
# Read each CSV file and append the resulting DataFrame to the list
In [42]: for file in concat_files:
...: df = pd.read_csv(file)
...: list_concat_df.append(df)
...:
# We now have a list of 3 DataFrames
In [43]: print(list_concat_df)
[ A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3, A B C D
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7, A B C D
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11]
# Concatenate the 3 DataFrames into one and renumber the index
In [44]: concat_df = pd.concat(list_concat_df, ignore_index=True)
In [45]: print(concat_df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
10 a10 b10 c10 d10
11 a11 b11 c11 d11
You can also write it as a one-liner using a list comprehension, as follows.
python3
In [46]: concat_files = glob.glob('data/concat*')
# The loop above becomes a single list comprehension
In [47]: list_concat_df = [pd.read_csv(csv_file) for csv_file in concat_files]
#After that, the same as the previous example
In [48]: print(list_concat_df)
[ A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3, A B C D
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7, A B C D
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11]
In [49]: concat_df = pd.concat(list_concat_df, ignore_index=True)
In [50]: print(concat_df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
10 a10 b10 c10 d10
11 a11 b11 c11 d11
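pathlib works just as well as glob here. A sketch (sorted() just makes the file order deterministic):

python3
from pathlib import Path

# Same result via pathlib; sorted() fixes the concatenation order
csv_paths = sorted(Path('data').glob('concat*.csv'))
concat_df = pd.concat([pd.read_csv(p) for p in csv_paths], ignore_index=True)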
That's all. There is a lot of tips-style information (the How) about pandas's melt() and pivot_table() on the net, and I learned from it, but I realized how much more powerful it becomes when combined with the background of a concept / methodology (the Why and the What) like the Tidy Data paper.

I have uploaded the contents of this article as a Jupyter Notebook on GitHub so that you can run it yourself and get a feel for it. The data is included, so it works as-is. If you are interested, please download and try it: ishida330/tidy-data-in-pandas