[Linux command] Petit data analysis with grep / awk / sort command

Introduction

This is, KPI Item A:" Get used to terminal commands "Item A:" Get used to terminal commands " It is an output article of.

Main story

This time, I used grep awk sort to process the data on the Linux server.

Preparation

This time, in order to perform search and text processing, we obtained a sample CSV from the following. It is a data set that summarizes official statistics by region. This time, I deleted the header part and extracted only the data part and placed it on the Linux server.

The data format is as follows.

Region code Prefectures Municipality Total population 123 pieces(Omission) Number of children in nursery schools, etc.
R01100 Hokkaido Sapporo 1952356 ・ ・ ・ 25742

For the omitted part, refer to the explanation below as appropriate.

Extract only the specified prefecture data

If you want to extract only the data (row) of a certain prefecture, use the grep command. The Grep command is a command that displays a ** line ** that contains the specified string or pattern.

Let's extract only the data of "Wakayama prefecture".

# grep "Character string to search"File
grep "Wakayama Prefecture" data/ssdse-2019a.csv

[test@srv1 testAwk]$ grep "Wakayama Prefecture" data/ssdse-2019a.csv
R30201 Wakayama City, Wakayama Prefecture 364154 171215 192939 360592 169798 190794 44519 22757 21762 211753 102859 108894 105954 44446 61508 50822 19092 31730 2466 2727 4449 7627 7938 153089 152798 355502 91995 48369 37387 20594 22322 1819 727 20884 14 658 1751 1435 618 1554 108 1378 159 21 2628 14298 177281 326 10259 25874 1139 2240 11469 34092 5855 4467 4079 15176 6481 10411 25096 882 11457 7907 392 36138 140751 2173 1478 95.6 11.6 152643733 63453773 151658392 65642801 20551722 13750658 0 28 3665 54 1131 17614 25 688 10313 14 12329 162655 90595 72060 8131 5266 2865 132002 43765 88237 3023 37094 114600 1 2 373593 43481 3094 1519 55 34 443 235 1616 335 1172 58 6830
(Omitted below)

Extract only some data

Use the previously executed command in the redirect. This time, I will try to find the move-in ratio from the information of the move-out person and the move-in person.

I want to extract only the information of the move-out person and the move-in person (FY2017), so I use the awk command. The awk command is a command that formats the data divided by the specified delimiter.

The default is to split spaces ("") as delimiters. At the time of output, it is possible to specify a specific item such as "{print $ 1}" and output.

This time, the information to be output is the number of in-migrants, the number of out-migrants, and the number of in-migrants-the number of out-migrants. The third output outputs the difference calculation result.

Output number Data column number Item name or formula Remarks
1 3 Municipal name
2 25 Number of transferees
3 26 Number of out-migrants
4 - Number of transferees-Number of out-migrants 正数:Number of out-migrants以上のNumber of transferees
負数:Number of transferees以上のNumber of out-migrants
# awk -F'[Delimiter(Multiple)]' '{awk command}' file
[test@srv1 testAwk]$ grep "Wakayama Prefecture" data/ssdse-2019a.csv | awk '{print $3,$25,$26,$25-$26}'
Wakayama City 7627 7938-311
Hainan City 1042 1324-282
Hashimoto City 1260 1560-300
Arida City 469 713-244
Gobo City 632 787-155
Tanabe City 1691 2216-525
Shingu City 815 1068-253
Kinokawa City 1278 1689-411
Iwade City 1720 1753-33
Kimino Town 154 225-71
Katsuragi Town 360 430-70
Kudoyama Town 91 112-21
Koya Town 88 168-80
Yuasa Town 223 313-90
Hirokawa Town 142 202-60
Aridagawa Town 611 593 18
Mihama Town 275 289-14
Hidaka Town 260 239 21
Yura Town 99 169-70
Inami Town 157 204-47
Minabe Town 196 345-149
Hidakagawa Town 270 289-19
Shirahama Town 691 749-58
Kamitonda Town 636 510 126
Susami Town 101 123-22
Nachikatsuura Town 319 527-208
Taiji Town 107 127-20
Kozagawa Town 74 66 8
Kitayama Village 24 18 6
Kushimoto Town 430 584-154

Sort the output

It also uses the previously executed command for redirects.

To sort, use the sort command. The sort command is a command that sorts the data divided by the specified delimiter. This time, set the following options.

option argument Description
-k 4 Item number that serves as a sort standard
-n Compare sort criteria as numbers
-r Display in descending order
[test@srv1 testAwk]$ grep "Wakayama Prefecture" data/ssdse-2019a.csv | awk '{print $3,$25,$26,$25-$26}' | sort  -k 4 -n -r
Kamitonda Town 636 510 126
Hidaka Town 260 239 21
Aridagawa Town 611 593 18
Kozagawa Town 74 66 8
Kitayama Village 24 18 6
Mihama Town 275 289-14
Hidakagawa Town 270 289-19
Taiji Town 107 127-20
Kudoyama Town 91 112-21
Susami Town 101 123-22
Iwade City 1720 1753-33
Inami Town 157 204-47
Shirahama Town 691 749-58
Hirokawa Town 142 202-60
Yura Town 99 169-70
Katsuragi Town 360 430-70
Kimino Town 154 225-71
Koya Town 88 168-80
Yuasa Town 223 313-90
Minabe Town 196 345-149
Kushimoto Town 430 584-154
Gobo City 632 787-155
Nachikatsuura Town 319 527-208
Arida City 469 713-244
Shingu City 815 1068-253
Hainan City 1042 1324-282
Hashimoto City 1260 1560-300
Wakayama City 7627 7938-311
Kinokawa City 1278 1689-411
Tanabe City 1691 2216-525

result

I was able to process simple data on a Linux server. There are more out-migrants than I expected. I would like to know the reason why Kamitonda Town is increasing by far.

Recommended Posts

[Linux command] Petit data analysis with grep / awk / sort command
Data analysis with python 2
Data analysis with Python
[Linux] OS recovery with restore command
Parse pcap data with tshark command
Data analysis starting with python (data visualization 1)
Data analysis starting with python (data visualization 2)
I tried factor analysis with Titanic data!
How to use the Linux grep command
Data analysis starting with python (data preprocessing-machine learning)
Linux command # 4
Linux command # 3
Linux command # 5
grep command
Data analysis for improving POG 2 ~ Analysis with jupyter notebook ~
Adjust file permissions with the Linux command chmod
I tried principal component analysis with Titanic data!
[Linux] eQTL analysis with genetic statistics software PLINK
[Introduction to minimize] Data analysis with SEIR model ♬
Compress with zip command of java side linux
[Voice analysis] Collect bird song data with xeno-canto
[Linux] How to display CPU usage, display header, and not display grep process with ps command