[PYTHON] use pandas-ply

Contents

This post tries out pandas_ply.

It lets you do something like R's dplyr with pandas.

With method chains, the code is easy and fun to write.

This time, let's play with election data.

The dataset lists the candidates for the House of Representatives single-seat constituencies, together with the number of search hits for each candidate's name. Let's play with it using pandas-ply.

Code

Setup for using pandas_ply.

import pandas as pd
from pandas_ply import install_ply, X, sym_call
install_ply(pd)

Read the data.

Each row holds a candidate's name plus the number of hits returned when that name was googled.

data = pd.read_csv("../kouho.hit.list", encoding="utf-8", header=0)

print(data.head(2))
   BLOCK                  NAME                AGE  PARTY       STATUS     HIT
0  Hokkaido 1st District  Takahiro Yokomichi   73  Democratic  Former  153000
1  Hokkaido 1st District  Hiroyuki Noroda      56  Communist   New     346000

Find out the number of candidates and average age by party

Group the rows by party, then aggregate.

partySummarize = (
    data
    .groupby('PARTY')
    .ply_select(
      meanAge=X.AGE.mean(),
      candidateNum=X.NAME.size(),
      )
    )

print(partySummarize)
                          candidateNum    meanAge
PARTY
Komei                                9  52.111111
Communist                          292  53.188356
Next generation                     39  50.461538
Democratic                         178  50.595506
Independent                         45  53.177778
Life                                13  54.230769
Social Democratic Party             18  56.833333
Restoration                         77  45.311688
Liberal Democratic Party           283  53.346290
Various factions                     5  52.400000
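
For comparison, roughly the same summary can be written in plain pandas using named aggregation in `agg` (pandas 0.25 or later). This is only a sketch: `sample` is a stand-in built from four rows shown in this post, not the full dataset, and `party_summary` is a name chosen for illustration.

```python
import pandas as pd

# Stand-in data: four rows copied from the listings in this post,
# not the full candidate file.
sample = pd.DataFrame({
    "NAME": ["Takahiro Yokomichi", "Hiroyuki Noroda",
             "Takako Suzuki", "Naoyoshi Yoshida"],
    "AGE": [73, 56, 28, 27],
    "PARTY": ["Democratic", "Communist", "Democratic", "Communist"],
})

# Named aggregation: output_column=(input_column, aggregation).
party_summary = (
    sample
    .groupby("PARTY")
    .agg(
        meanAge=("AGE", "mean"),
        candidateNum=("NAME", "size"),
    )
)

print(party_summary)
```

The keyword names become the output columns, much like the keywords passed to ply_select above.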

Candidates under 30

dplyr::filter corresponds to ply_where.

# candidates under 30
print (data
    .ply_where(X.AGE < 30)
    .head(10)
    )
     BLOCK                   NAME                  AGE  PARTY                    STATUS      HIT
21   Hokkaido 7th District   Takako Suzuki          28  Democratic               Former  1670000
88   Akita 2nd District      Takashi Midorikawa     29  Democratic               New      170000
174  Saitama 1st District    Sho Matsumoto          29  Social Democratic Party  New     3070000
221  Chiba 1st District      Naoyoshi Yoshida       27  Communist                New     1690000
269  Tokyo 1st District      Takanobu Nozaki        27  Independent              New      530000
271  Tokyo 2nd District      Noriyuki Ishizawa      27  Communist                New      156000
297  Tokyo 8th District      Shingo Sawada          29  Communist                New      400000
306  Tokyo 11th District     Shimomura Mei          27  Independent              New      380000
390  Kanagawa 8th District   Yasuhisa Wakabayashi   29  Communist                New      525000
403  Kanagawa 12th District  Kotaro Amimura         25  Communist                New      106000
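
The same filter can also be written without pandas-ply by passing a callable to `.loc`, which keeps the method chain intact. A minimal sketch, assuming a small stand-in frame `sample` built from rows in this post (the names `sample` and `under_30` are just for illustration):

```python
import pandas as pd

# Stand-in data: four rows copied from the listings in this post.
sample = pd.DataFrame({
    "NAME": ["Takahiro Yokomichi", "Hiroyuki Noroda",
             "Takako Suzuki", "Naoyoshi Yoshida"],
    "AGE": [73, 56, 28, 27],
})

# .loc accepts a callable that receives the frame and returns a boolean mask,
# so the filter can sit in the middle of a chain.
under_30 = (
    sample
    .loc[lambda df: df["AGE"] < 30]
    .head(10)
)

print(under_30)
```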

Number of search hits (in units of 10,000)

The operation corresponding to dplyr::mutate can also be done with ply_select.

print (data
    .ply_select(
      NAME=X.NAME,
      HIT_x10000 = X.HIT / 10000
      )
    .head(10)
    )
   HIT_x10000  NAME
0       15.30  Takahiro Yokomichi
1       34.60  Hiroyuki Noroda
2      268.00  Toshimitsu Funahashi
3       54.30  Yoshihiro Iida
4       54.10  Takamori Yoshikawa
5      152.00  Maki Ikeda
6        7.42  Kenko Matsuki
7        5.92  Masatoshi Kanakura
8       33.50  Satoshi Arai
9       30.30  Hiroko Yoshioka

I wonder why the column order changed.
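
A guess at the cause, not verified against the pandas-ply source: the columns are passed as keyword arguments, which arrive as an unordered dict in the Python 2 used here, and both outputs in this post come out in alphabetical column order. If the order matters, one workaround in plain pandas is to add the column with `assign` and then pick the columns with an explicit list. A minimal sketch, with `sample` as a stand-in built from the first two rows above:

```python
import pandas as pd

# Stand-in data: the first two rows shown in this post.
sample = pd.DataFrame({
    "NAME": ["Takahiro Yokomichi", "Hiroyuki Noroda"],
    "HIT": [153000, 346000],
})

hits = (
    sample
    # Add the derived column (hits in units of 10,000).
    .assign(HIT_x10000=lambda df: df["HIT"] / 10000)
    # An explicit list fixes the column order of the result.
    .loc[:, ["NAME", "HIT_x10000"]]
)

print(hits)
```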

Impressions

More fun than using raw pandas.

I don't know how to sort yet. What corresponds to dplyr::arrange?
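
On the sorting question: in plain pandas, the closest counterpart to dplyr::arrange that I know of is `DataFrame.sort_values` (pandas 0.17 or later), which fits into the same kind of chain. Whether pandas-ply has its own sorting verb is not something this sketch answers. `sample` is again a stand-in built from three rows in this post.

```python
import pandas as pd

# Stand-in data: three rows copied from the listings in this post.
sample = pd.DataFrame({
    "NAME": ["Takahiro Yokomichi", "Hiroyuki Noroda", "Takako Suzuki"],
    "AGE": [73, 56, 28],
    "HIT": [153000, 346000, 1670000],
})

# Sort by hit count, largest first, then keep the top rows.
by_hits = (
    sample
    .sort_values("HIT", ascending=False)
    .head(10)
)

print(by_hits)
```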

That's all.


This post was published from GitHub to Qiita.
