[PYTHON] Correspondence between pandas and SQL

With pandas, you can perform data operations similar to SQL. This is a note on the correspondence between SQL data manipulation and pandas data manipulation, listed roughly in descending order of how often I use them.

limit -> head It is used to extract an arbitrary number of rows.

select *
from table
limit 10

↓pandas

table.head(10)

order by -> sort_values Sorting

select *
from table
order by col1

↓pandas

table.sort_values(by='col1')

Both order by and sort_values sort in ascending order by default. With order by you add desc at the end; with sort_values you set ascending=False (True means ascending).

table.sort_values(by='col1', ascending=False)

You can also sort by multiple columns:

select *
from table
order by col1, col2 desc

↓pandas

table.sort_values(by=['col1', 'col2'], ascending=[True, False])

where -> []

select *
from table
where col1 = 'a'

↓pandas

table[table.col1 == 'a']

in clause

select *
from table
where col1 in ('a', 'b')

↓pandas

table[table.col1.isin(['a', 'b'])]

Multiple conditions

select *
from table
where col1 = 'a' and col2 = 1

↓pandas

table[(table.col1 == 'a') & (table.col2 == 1)]

Enclose each condition in parentheses and connect them with &. For or, use | instead of &:

table[(table.col1 == 'a') | (table.col2 == 1)]

These condition expressions can also be reused in the delete and update operations described below.
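As a quick preview, a condition can be stored in a variable and reused for both patterns. These are two independent examples, and col3 here is a hypothetical column just for illustration:

mask = (table.col1 == 'a') & (table.col2 == 1)

table.loc[mask, 'col3'] = 0   # update: set col3 for matching rows
table = table[~mask]          # delete: keep only non-matching rows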

Aggregate functions -> agg

select
    sum(col1),
    avg(col1),
    max(col1),
    min(col2),
    count(col2),
    count(distinct col2)
from table

↓pandas

table.agg({'col1': ['sum', 'mean', 'max'], 'col2': ['min', 'count', 'nunique']})

nunique corresponds to count(distinct).

distinct itself looks like this:

select
    distinct(col1)
from table

↓pandas

table.agg({'col1': 'unique'})

In SQL, aggregate functions and distinct can't be mixed in a single select list like this, but in pandas they can:

table.agg({'col1': ['sum', 'mean', 'max', 'min', 'count', 'nunique', 'unique']})

SQL and pandas function correspondence so far: sum -> sum, avg -> mean, max -> max, min -> min, count -> count, count(distinct) -> nunique, distinct -> unique.

group by -> groupby

select
    col1,
    sum(col2)
from table
group by col1

↓pandas

table.groupby('col1').agg({'col2': 'sum'})

The column specified in groupby becomes the index of the resulting DataFrame, so passing as_index=False keeps it as a regular column and makes the result easier to work with:

table.groupby('col1', as_index=False).agg({'col2': 'sum'})

When grouping by multiple columns:

select
    col1,
    col2,
    sum(col3)
from table
group by col1, col2

↓pandas

table.groupby(['col1', 'col2'], as_index=False).agg({'col3': 'sum'})

join -> merge

select
    *
from table t
left join table2 t2
on t.col1 = t2.col1

↓pandas

pd.merge(table, table2, on='col1', how='left')

how also accepts inner, outer, and right.
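For reference, a minimal sketch of the other join types with the same tables and key as above:

pd.merge(table, table2, on='col1', how='inner')
pd.merge(table, table2, on='col1', how='outer')
pd.merge(table, table2, on='col1', how='right')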

When using multiple columns with on

select
    *
from table t
left join table2 t2
on t.col1 = t2.col1
    and t.col2 = t2.col2

↓pandas

pd.merge(table, table2, on=['col1', 'col2'], how='left')

Pass a list to on.

update -> loc[]

update table
set col1 = 'b'
where col2 = 1

↓pandas

table.loc[table.col2 == 1, 'col1'] = 'b'

Specify the where condition in the first argument of .loc and the column to be updated in the second argument.

delete -> []

delete
from table
where col1 = 'a'

↓pandas

table = table[table.col1 != 'a']

Only the rows that do not match the condition are kept, which effectively deletes the matching rows.

column alias -> rename

select
    col1 as new_col1
from table

↓pandas

table.rename(columns={'col1': 'new_col1'})

union -> concat

select
    *
from table
union
select
    *
from table2

↓pandas

pd.concat([table, table2], ignore_index=True)

Because the first argument of concat is a list, you can union three or more DataFrames at once. If ignore_index is not specified, the index values are carried over from the original DataFrames, so they can end up duplicated; it is usually better to specify it.
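One caveat: pd.concat keeps duplicate rows, so it actually corresponds to union all. A small sketch of a deduplicating union, using drop_duplicates:

pd.concat([table, table2], ignore_index=True).drop_duplicates().reset_index(drop=True)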

insert -> append

insert into
table(
    col1,
    col2
)
values(
    'a',
    '2'
)

↓pandas

table = table.append({'col1': 'a', 'col2': 2}, ignore_index=True)

ignore_index works the same way as in the union example.
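Note that DataFrame.append was deprecated and removed in pandas 2.0, so on recent versions the same insert can be written with pd.concat, roughly like this:

new_row = pd.DataFrame([{'col1': 'a', 'col2': 2}])   # one row as a single-row DataFrame
table = pd.concat([table, new_row], ignore_index=True)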

case when -> apply

select
    case when col1 = 'a' then 1
            when col1 = 'b' then 2
            else 3 end as case_col1
from table

↓pandas

def case_when(x):
    if x.col1 == 'a':
        return 1
    elif x.col1 == 'b':
        return 2
    else:
        return 3

table.apply(case_when, axis=1)

axis=1 is specified so that the data is passed row by row; with axis=0, each column would be passed as x to case_when instead. Printing x makes this easy to see; x is a pandas Series.
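A throwaway sketch of that inspection (show_row is just a hypothetical helper name):

def show_row(x):
    # with axis=1, each x is one row of the DataFrame, passed as a pandas Series
    print(type(x), x.to_dict())

table.head(2).apply(show_row, axis=1)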

It is also possible to add the column inside the function and return the whole row. The version below gives the same result as writing table['case_col1'] = table.apply(case_when, axis=1) with the function above.

select
    *,
    case when col1 = 'a' then 1
            when col1 = 'b' then 2
            else 3 end as case_col1
from table

↓pandas

def case_when(x):
    if x.col1 == 'a':
        ret = 1
    elif x.col1 == 'b':
        ret = 2
    else:
        ret = 3
    
    x['case_col1'] = ret

    return x

table.apply(case_when, axis=1)
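Note that apply returns a new DataFrame rather than modifying table in place, so assign the result back if you want to keep the added column:

table = table.apply(case_when, axis=1)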

group by case when -> groupby.apply

select
    col1,
    sum(case when col2 < 10 then col3 else 0 end) as sum_under10_col3,
    sum(case when col2 >= 10 then col3 else 0 end) as sum_over10_col3
from table
group by col1

↓pandas

def case_when(df):
    sum_under10_col3 = df[df.col2 < 10].col3.sum()
    sum_over10_col3 = df[df.col2 >= 10].col3.sum()
    return pd.Series({'sum_under10_col3': sum_under10_col3, 'sum_over10_col3': sum_over10_col3})

table.groupby('col1').apply(case_when).reset_index()

It is a bit involved, but it does the same thing. Each group is passed to case_when as a DataFrame, aggregated there, and the result is returned as a Series. reset_index() moves col1 out of the index. Note that with apply, as_index=False does not automatically add col1 as a column; you would have to include col1 in the Series returned from case_when yourself, which is a hassle, so reset_index() handles it instead.
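For reference, a rough alternative sketch that avoids apply by precomputing the case-when columns first (Series.where keeps the value where the condition holds and fills 0 elsewhere):

# precompute the conditional columns, then aggregate with a plain groupby
tmp = table.assign(
    sum_under10_col3=table.col3.where(table.col2 < 10, 0),
    sum_over10_col3=table.col3.where(table.col2 >= 10, 0),
)
tmp.groupby('col1', as_index=False)[['sum_under10_col3', 'sum_over10_col3']].sum()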
