Hello! This may be out of the blue, but have you ever heard of "**Tidy Data**" (translated as "orderly data" in Japanese)? I recently studied pandas in a course called "Pandas Data Analysis with Python Fundamentals" on Safari Books Online, and the chapter "Tidy Data" at the end introduced me to this idea. What is "Tidy Data"? It is a concept proposed by Hadley Wickham (who is revered as a god in the world of R) in his 2014 paper "Tidy Data". The paper also has a Japanese translation, which is introduced at the beginning of the translation as follows.
A complete translation of the paper "Tidy Data", which advocated the concept of "orderly data" for effective data analysis. Orderly data is a useful concept that facilitates data analysis in R and similar tools.
Tidy Data is a concept and methodology for doing the cleaning and organizing of data, a task said to occupy 80% of an analyst's work, as easily and effectively as possible. Reflecting the author's background, the paper was written with R in mind, but the Safari Online course mentioned above showed how to implement those data-tidying patterns in Python / pandas. I am ashamed to say I had never read a properly organized methodology for structuring data for analysis, so it was very inspiring, and that prompted this article.
:triangular_flag_on_post: (2019/3/2) This article was posted in April 2017, but "Pandas for Everyone", the pandas book by Daniel Y. Chen, the lecturer of the video course this article is based on, has since been translated and published in Japanese as "[Basic coding for Python data analysis / machine learning! Introduction to pandas library utilization](https://www.amazon.co.jp/dp/4295005657/)". It also contains an explanation of orderly data, so I introduce it here.
The author of the site Colorless Green Ideas explains the translation of the paper and the concept of "orderly data" in Japanese in a very easy-to-understand manner:

- "What is orderly data": you can get the general idea just by reading this one. Highly recommended.
- "[Translation] Orderly data": the full translation of the paper.

For the details, it is better to read the article "What is orderly data" above than my amateur explanation, but for busy readers I will summarize it (very roughly) below.
[Introduction (stating the obvious)] :triangular_flag_on_post: (October 2019) Changed the explanation example [^1]
[^1]: (2019/10) This section originally used height and weight from physical measurements as its example, but @acevif commented that it was inappropriate as an example, and I think that is right, so I revised it. See the comment section for details.
The following lists "treatments performed on patients at a medical institution and the resulting values". To see the result of each treatment for each person, you read along a row. Each person's results line up in the row direction; each such row is called an **observation**. The types of treatment line up in the column direction; these are called **variables**. This is a form that is easy for humans to understand.

[Meaning and structure of data] Now, given a group of observed values, "applying each treatment to each patient produced these numbers", the same data can be represented by several different **structures** when actually laid out as a dataset, as shown below.

In the figure above, structure :one: packs two observations into each row and :two: packs three observations into each row, while the rightmost :three: holds exactly one observation per row ("the result of treatment X performed on Mr. C was the value 3"). In other words, in form :three: the **"semantics" and the "structure" of the data match**. This is the key point of "orderly data": the paper argues that this form is easier to handle in data analysis (aside from whether it is easy for humans to read). (For why it is easier to handle, see the section "Importance of matching structure and meaning" in the article "What is orderly data".)
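To make the contrast concrete, here is a minimal sketch in pandas (hypothetical data and column names, not the paper's actual dataset): melting the human-friendly wide table yields the "one observation per row" shape of :three:.

python3
import pandas as pd

# Hypothetical wide table: one row per patient, one column per treatment
wide = pd.DataFrame({'patient': ['A', 'B', 'C'],
                     'treatment_X': [1, 4, 3],
                     'treatment_Y': [5, 2, 6]})

# Melt so each row holds exactly one (patient, treatment, result) observation
tidy = wide.melt(id_vars='patient', var_name='treatment', value_name='result')
print(tidy)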
[So what is "orderly data"?] To put it very roughly:

- Organize and convert the data so that its meaning and structure match, and the later analysis steps (aggregation, plotting, and so on) become easier to handle.
- The data to be analyzed should be properly normalized.

I think that is what it comes down to.
In the "Arrangement of cluttered data sets" chapter of the "Tidy Data" paper, list five "common patterns of cluttered data" and indicate the direction of each data organization. I am. In the following, this article will introduce an example of how to actually organize them with pandas.
――Of course, there are many other ways to implement it at the code level, so this is just an example. ――For the title line, I used the expression in the Japanese translation mentioned above. --The example is basically the one published in the original paper
1. Column headers are values, not variable names
Below is a count of people by "religion" and "income bracket" from a survey. In the table on the left, the income bracket runs along the horizontal axis, but it is not good that the column headers themselves, such as "<$10k" and "$10-20k", contain values. Convert this to the "orderly data" form on the right.

This conversion is easy with pandas's melt:

- id_vars=: the column(s) that remain as the axis ("religion" in this example)
- var_name=: the name for the new column holding the former column headers ("range" in this example)
- value_name=: the name for the new column holding the values ("count" in this example)
python3
In [1]: import pandas as pd
#Data reading
In [2]: df=pd.read_excel('data/02_Religion.xls')
In [3]: print(df.head())
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
0 Agnostic 27 34 60 81 76 137
1 Atheist 12 27 37 52 35 70
2 Buddhist 27 21 30 34 33 58
3 Catholic 418 617 732 670 638 1116
4 Don’tknow/refused 15 14 15 11 10 35
#Run melt
In [4]: df_long = pd.melt(df ,id_vars='religion', var_name='range', value_name='count')
# Converted to tidy data format
In [5]: print(df_long.head(n=10))
religion range count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don’tknow/refused <$10k 15
5 EvangelicalProt <$10k 575
6 Hindu <$10k 1
7 HistoricallyBlackProt <$10k 228
8 Jehovah’sWitness <$10k 20
9 Jewish <$10k 19
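Incidentally, melt is also available as a DataFrame method, so the same conversion as In [4] can be written like this (a sketch, equivalent in result):

python3
# Method-style melt: same result as pd.melt(df, ...)
df_long = df.melt(id_vars='religion', var_name='range', value_name='count')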
Below is a dataset of Billboard hit charts. For each song that entered the chart, it holds the ranking in week 1, week 2, and so on, along with data such as year, song title, and artist.

This is also converted into "orderly data". You can use melt as in Example 1, but here there are multiple axis columns, so pass them as a list: id_vars=['year','artist','track','time','date.entered']. The rest is the same as in Example 1.
python3
#Data reading
In [15]: billboard = pd.read_excel('data/04_HitChart.xls')
In [16]: print(billboard.head())
year artist track time date.entered wk1 wk2 wk3
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 87 82 72
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 91 87 92
2 2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 81 70 68
3 2000 98^0 Give Me Just One Nig... 03:24:00 2000-08-19 51 39 34
4 2000 A*Teens Dancing Queen 03:44:00 2000-07-08 97 97 96
#Run melt (pass id_vars as a list)
In [17]: billboard_long = pd.melt(billboard, id_vars=['year','artist','track','time','date.entered'], var_name='week', value_name='rank')
# Converted to tidy data format
In [18]: print(billboard_long.head(n=10))
year artist track time date.entered week rank
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk1 91
2 2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 wk1 81
3 2000 98^0 Give Me Just One Nig... 03:24:00 2000-08-19 wk1 51
4 2000 A*Teens Dancing Queen 03:44:00 2000-07-08 wk1 97
5 2000 Aaliyah I Don’t Wanna 04:15:00 2000-01-29 wk1 84
6 2000 Aaliyah Try Again 04:03:00 2000-03-18 wk1 59
7 2000 Adams, Yolanda Open My Heart 05:30:00 2000-08-26 wk1 76
8 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk2 82
9 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk2 87
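As an aside, since the week columns all share the "wk" prefix, pd.wide_to_long offers another route to the same shape. A sketch, assuming every week column follows the wk1, wk2, ... naming:

python3
# wide_to_long strips the 'wk' stub and keeps the numeric suffix as 'week'
billboard_long2 = (pd.wide_to_long(billboard, stubnames='wk',
                                   i=['year', 'artist', 'track', 'time', 'date.entered'],
                                   j='week')
                     .reset_index()
                     .rename(columns={'wk': 'rank'}))
# Note: 'week' comes out numeric (1, 2, ...) instead of 'wk1', 'wk2', ...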
2. Multiple variables are stored in one column
This is data on tuberculosis (TB): the number of TB patients by "gender" and "age group" in addition to "country" and "year". m014 means male (m), 0-14 years old; m1524 means male (m), 15-24 years old. When two variables, "gender" and "age group", are packed into a single column like this, it is awkward to analyze by gender or by age group, so the column should be split into "gender" and "age group".

- First, melt around "country" and "year" as in the previous examples.
- Next, split the variable column into two columns, "gender" and "age group". How to split depends on the case; here we use the fact that the first character of the column name indicates the gender, and split into the first character and the rest.
python3
#Data reading
In [2]: tb = pd.read_excel('data/05_TB.xls')
In [3]: print(tb.head())
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
0 AD 2000 0.0 0.0 1.0 0.0 0 0 0.0 NaN NaN
1 AE 2000 2.0 4.0 4.0 6.0 5 12 10.0 NaN 3.0
2 AF 2000 52.0 228.0 183.0 149.0 129 94 80.0 NaN 93.0
3 AG 2000 0.0 0.0 0.0 0.0 0 0 1.0 NaN 1.0
4 AL 2000 2.0 19.0 21.0 14.0 24 19 16.0 NaN 3.0
#Run melt
In [4]: tb_long = pd.melt(tb, id_vars=['country','year'], var_name='variable', value_name='value')
In [5]: print(tb_long.head(n=10))
country year variable value
0 AD 2000 m014 0.0
1 AE 2000 m014 2.0
2 AF 2000 m014 52.0
3 AG 2000 m014 0.0
4 AL 2000 m014 2.0
5 AM 2000 m014 2.0
6 AN 2000 m014 0.0
7 AO 2000 m014 186.0
8 AR 2000 m014 97.0
9 AS 2000 m014 NaN
#Extract the first character of the variable column and add the column "gender"
In [6]: tb_long['gender'] = tb_long.variable.str[0]
#Extract the second and subsequent characters of the variable column and add the column "age"
In [7]: tb_long['age'] = tb_long.variable.str[1:]
# gender and age columns have been appended
In [8]: print(tb_long.head(n=10))
country year variable value gender age
0 AD 2000 m014 0.0 m 014
1 AE 2000 m014 2.0 m 014
2 AF 2000 m014 52.0 m 014
3 AG 2000 m014 0.0 m 014
4 AL 2000 m014 2.0 m 014
5 AM 2000 m014 2.0 m 014
6 AN 2000 m014 0.0 m 014
7 AO 2000 m014 186.0 m 014
8 AR 2000 m014 97.0 m 014
9 AS 2000 m014 NaN m 014
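If the split rule is more complicated than "first character vs. the rest", str.extract with a regular expression does the same split in one step. A sketch, assuming the m014 / f1524-style column naming:

python3
# Capture gender (one letter) and age group (the rest) in a single pass
tb_long[['gender', 'age']] = tb_long['variable'].str.extract(r'([mf])(.+)')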
3. Variables are stored in both rows and columns
This is daily weather data from a weather station (MX17004) in Mexico. Day 1 (d1), day 2 (d2), and so on run along the horizontal axis against the "year" and "month" axes, while the element column holds the day's maximum and minimum temperatures as two separate rows. It is not good that observed facts are stored across both rows and columns, so organize them into the form on the right.

- First, melt around "id", "year", "month", and "element" as in the previous examples.
- Next, use pivot_table to promote tmax and tmin to columns (pivot_table can be seen as the inverse of melt).
- Finally, the index is messy right after pivot_table, so tidy it up with reset_index().
python3
#Data reading
In [10]: weather = pd.read_excel('data/06_Weather.xls')
In [11]: print(weather.head())
id year month element d1 d2 d3 d4 d5 d6 d7 d8
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN
#Run melt
In [12]: weather_melt = pd.melt(weather,id_vars=['id','year','month','element'], var_name='day', value_name='temp')
In [13]: print(weather_melt.head(n=10))
id year month element day temp
0 MX17004 2010 1 tmax d1 NaN
1 MX17004 2010 1 tmin d1 NaN
2 MX17004 2010 2 tmax d1 NaN
3 MX17004 2010 2 tmin d1 NaN
4 MX17004 2010 3 tmax d1 NaN
5 MX17004 2010 3 tmin d1 NaN
6 MX17004 2010 4 tmax d1 NaN
7 MX17004 2010 4 tmin d1 NaN
8 MX17004 2010 5 tmax d1 NaN
9 MX17004 2010 5 tmin d1 NaN
# Promote the values of the element column to columns with pivot_table
In [14]: weather_tidy = weather_melt.pivot_table(index=['id','year','month','day'], columns='element', values='temp')
# Promoted, but the index is messy
In [15]: print(weather_tidy.head(n=10))
element tmax tmin
id year month day
MX17004 2010 1 d1 NaN NaN
d2 NaN NaN
d3 NaN NaN
d4 NaN NaN
d5 NaN NaN
d6 NaN NaN
d7 NaN NaN
d8 NaN NaN
2 d1 NaN NaN
d2 27.3 14.4
# reset_index() cleans it up
In [16]: weather_tidy_flat = weather_tidy.reset_index()
In [17]: print(weather_tidy_flat.head(n=10))
element id year month day tmax tmin
0 MX17004 2010 1 d1 NaN NaN
1 MX17004 2010 1 d2 NaN NaN
2 MX17004 2010 1 d3 NaN NaN
3 MX17004 2010 1 d4 NaN NaN
4 MX17004 2010 1 d5 NaN NaN
5 MX17004 2010 1 d6 NaN NaN
6 MX17004 2010 1 d7 NaN NaN
7 MX17004 2010 1 d8 NaN NaN
8 MX17004 2010 2 d1 NaN NaN
9 MX17004 2010 2 d2 27.3 14.4
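Going one step further, the paper also combines year, month, and day into a real date. A sketch of that final touch, assuming the "d1", "d2", ... naming:

python3
# Strip the leading 'd' and build a proper date column
weather_tidy_flat['day'] = weather_tidy_flat['day'].str.lstrip('d').astype(int)
# errors='coerce' turns impossible combinations (e.g. Feb 30) into NaT
weather_tidy_flat['date'] = pd.to_datetime(
    weather_tidy_flat[['year', 'month', 'day']], errors='coerce')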
4. Multiple types of observational units are stored in the same table
The following is the conversion result of the Billboard hit chart from Example 2 of "1. Column headers are values, not variable names". Looking closely, this dataset holds the "song-related part" and the "ranking-related part" in the same table, and the song-related part is repeated many times. From a "normalization" point of view, it is better to split these into two datasets.

First, proceed as follows for the "song-related part":

- Create a new DataFrame containing only the song-related columns.
- Eliminate duplicates with drop_duplicates().
- Generate an ID column for joining with the "ranking part" later.

Then follow the steps below for the "ranking part":

- JOIN (merge) the DataFrame created above with the original DataFrame to attach the ID column.
- Then create a new DataFrame containing only the ranking-related columns.
python3
#Data confirmation
In [23]: print(billboard_long.head())
year artist track time date.entered week rank
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk1 91
2 2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 wk1 81
3 2000 98^0 Give Me Just One Nig... 03:24:00 2000-08-19 wk1 51
4 2000 A*Teens Dancing Queen 03:44:00 2000-07-08 wk1 97
#The data of the same song is repeated
In [24]: print(billboard_long[billboard_long.track =='Baby Don’t Cry'].head())
year artist track time date.entered week rank
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87
8 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk2 82
16 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk3 72
#Create a new DataFrame in "Song-related part" (repeated part still overlaps)
In [25]: billboard_songs = billboard_long[['year','artist','track','time']]
In [26]: billboard_songs.head(n=10)
Out[26]:
year artist track time
0 2000 2 Pac Baby Don’t Cry 04:22:00
1 2000 2Ge+her The Hardest Part Of ... 03:15:00
2 2000 3 Doors Down Kryptonite 03:53:00
3 2000 98^0 Give Me Just One Nig... 03:24:00
4 2000 A*Teens Dancing Queen 03:44:00
5 2000 Aaliyah I Don’t Wanna 04:15:00
6 2000 Aaliyah Try Again 04:03:00
7 2000 Adams, Yolanda Open My Heart 05:30:00
8 2000 2 Pac Baby Don’t Cry 04:22:00
9 2000 2Ge+her The Hardest Part Of ... 03:15:00
#24 rows x 4 columns
In [27]: billboard_songs.shape
Out[27]: (24, 4)
# Eliminate duplicates with drop_duplicates()
In [28]: billboard_songs = billboard_songs.drop_duplicates()
#8 rows x 4 columns
In [29]: billboard_songs.shape
Out[29]: (8, 4)
In [30]: print(billboard_songs.head())
year artist track time
0 2000 2 Pac Baby Don’t Cry 04:22:00
1 2000 2Ge+her The Hardest Part Of ... 03:15:00
2 2000 3 Doors Down Kryptonite 03:53:00
3 2000 98^0 Give Me Just One Nig... 03:24:00
4 2000 A*Teens Dancing Queen 03:44:00
#Generate and add ID columns with serial numbers
In [31]: billboard_songs['id'] = range(len(billboard_songs))
#Completion of "Song-related part"
In [32]: print(billboard_songs.head())
year artist track time id
0 2000 2 Pac Baby Don’t Cry 04:22:00 0
1 2000 2Ge+her The Hardest Part Of ... 03:15:00 1
2 2000 3 Doors Down Kryptonite 03:53:00 2
3 2000 98^0 Give Me Just One Nig... 03:24:00 3
4 2000 A*Teens Dancing Queen 03:44:00 4
#From here on, the part related to "ranking"
# JOIN (merge) the original DataFrame with the billboard_songs DataFrame created above to make a new DataFrame
In [33]: billboard_ratings = billboard_long.merge(billboard_songs, on=['year','artist','track','time'])
#ID column added
In [34]: billboard_ratings.head()
Out[34]:
year artist track time date.entered week rank id
0 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk1 87 0
1 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk2 82 0
2 2000 2 Pac Baby Don’t Cry 04:22:00 2000-02-26 wk3 72 0
3 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk1 91 1
4 2000 2Ge+her The Hardest Part Of ... 03:15:00 2000-09-02 wk2 87 1
#Create a new DataFrame in "Ranking part"
In [35]: billboard_ratings = billboard_ratings[['id','date.entered','week','rank']]
#Completion of "ranking part"
In [36]: print(billboard_ratings.head(n=10))
id date.entered week rank
0 0 2000-02-26 wk1 87
1 0 2000-02-26 wk2 82
2 0 2000-02-26 wk3 72
3 1 2000-09-02 wk1 91
4 1 2000-09-02 wk2 87
5 1 2000-09-02 wk3 92
6 2 2000-04-08 wk1 81
7 2 2000-04-08 wk2 70
8 2 2000-04-08 wk3 68
9 3 2000-08-19 wk1 51
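To confirm that nothing was lost in the split, the two tables can be merged back together. A quick sketch:

python3
# Joining songs and ratings on id reproduces the content of billboard_long
billboard_restored = billboard_songs.merge(billboard_ratings, on='id')
print(billboard_restored.shape)  # same row count as billboard_long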
5. A single observational unit is stored in multiple tables
It is inconvenient for analysis when observed facts about the same unit are stored separately across multiple datasets. As shown below, multiple datasets with the same column layout are concatenated into one.

This case is easy because you only have to concat the datasets, but with many datasets it is tedious to read each one with pd.read_xxx() by hand. glob returns the list of files whose names match a given pattern, so I will show how to use it to drive the concat.
python3
In [38]: import glob
#List of filenames that match the criteria
In [39]: concat_files = glob.glob('data/concat*')
In [40]: print(concat_files)
['data\\concat_1.csv', 'data\\concat_2.csv', 'data\\concat_3.csv']
In [41]: list_concat_df =[]
# Read each CSV file and append the resulting DataFrame to the list
In [42]: for file in concat_files:
...: df = pd.read_csv(file)
...: list_concat_df.append(df)
...:
# We now have a list of 3 DataFrames
In [43]: print(list_concat_df)
[ A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3, A B C D
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7, A B C D
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11]
# Concatenate the 3 DataFrames into one and renumber the index
In [44]: concat_df = pd.concat(list_concat_df, ignore_index=True)
In [45]: print(concat_df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
10 a10 b10 c10 d10
11 a11 b11 c11 d11
You can also write it as a one-liner using a list comprehension, as follows.
python3
In [46]: concat_files = glob.glob('data/concat*')
# The loop above becomes a single list comprehension
In [47]: list_concat_df = [pd.read_csv(csv_file) for csv_file in concat_files]
#After that, the same as the previous example
In [48]: print(list_concat_df)
[ A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3, A B C D
0 a4 b4 c4 d4
1 a5 b5 c5 d5
2 a6 b6 c6 d6
3 a7 b7 c7 d7, A B C D
0 a8 b8 c8 d8
1 a9 b9 c9 d9
2 a10 b10 c10 d10
3 a11 b11 c11 d11]
In [49]: concat_df = pd.concat(list_concat_df, ignore_index=True)
In [50]: print(concat_df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
5 a5 b5 c5 d5
6 a6 b6 c6 d6
7 a7 b7 c7 d7
8 a8 b8 c8 d8
9 a9 b9 c9 d9
10 a10 b10 c10 d10
11 a11 b11 c11 d11
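pathlib works just as well as glob here. A sketch (sorted() just makes the file order deterministic):

python3
from pathlib import Path

# Same result via pathlib; sorted() fixes the concatenation order
csv_paths = sorted(Path('data').glob('concat*.csv'))
concat_df = pd.concat([pd.read_csv(p) for p in csv_paths], ignore_index=True)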
That's all. There is a lot of tips-style information (the How) about pandas's melt() and pivot_table() on the net, and I learned from it, but I realized how much more powerful it becomes when combined with the background of a concept / methodology (the Why and the What) like the Tidy Data paper.

I have uploaded the contents of this article as a Jupyter Notebook on GitHub so that you can run it yourself and get a feel for it. The data is included, so it works as-is. If you are interested, please download and try it: ishida330/tidy-data-in-pandas