[Python] Notes on data analysis

Purpose of this article

These are my personal notes for data analysis, written so that each snippet can be copied and pasted as-is. Short one-liners such as `df.head()` are deliberately left out.

Loading libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

import seaborn as sns
import matplotlib.pyplot as plt
import japanize_matplotlib

import os
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

Reading data

Pattern 1

filename = "hoge.csv"
df = pd.read_csv(filename, encoding='utf-8')

Pattern 2

dirname = "/foo/bar/.../"
filename = "hoge.csv"
filepath = os.path.join(dirname, filename)
df = pd.read_csv(filepath, encoding='utf-8')

Exporting data

filename = "huga.csv"
df.to_csv(filename, header=True, index=False)
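As a rough illustration of why `index=False` is passed above, the following sketch writes a small frame to a temporary directory and reads it back with and without the index. The sample data and the temporary path are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data for the round trip
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

with tempfile.TemporaryDirectory() as dirname:
    filepath = os.path.join(dirname, "huga.csv")

    # index=False: only the data columns are written
    df.to_csv(filepath, header=True, index=False)
    restored = pd.read_csv(filepath, encoding="utf-8")

    # index=True (the default): the index is written as an extra unnamed column
    df.to_csv(filepath, header=True, index=True)
    with_index = pd.read_csv(filepath, encoding="utf-8")

print(list(restored.columns))    # ['name', 'score']
print(list(with_index.columns))  # ['Unnamed: 0', 'name', 'score']
```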

Data frame manipulation

Rename column

df = df.rename(columns={"before01":"after01", "before02":"after02"})

Column data type change

df = df.astype({"col": "category"})
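A small sketch of what the conversion does, assuming a column of repeated strings (the column contents are invented for the example). For columns with few distinct values, `category` usually takes noticeably less memory than plain strings.

```python
import pandas as pd

# Invented data: 3000 rows but only 3 distinct values
df = pd.DataFrame({"col": ["red", "green", "blue"] * 1000})
before = df["col"].memory_usage(deep=True)

df = df.astype({"col": "category"})
after = df["col"].memory_usage(deep=True)

print(df["col"].dtype)                    # category
print(df["col"].cat.categories.tolist())  # ['blue', 'green', 'red']
print(after < before)                     # True
```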

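Simple concatenation of data frames

# Stack vertically (append rows)
df = pd.concat([upper, lower])

# Stack horizontally (append columns, aligned on the index)
df = pd.concat([left, right], axis=1)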
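One thing worth keeping in mind with vertical concatenation is that the original index labels are kept, so they can repeat. A minimal sketch with made-up frames, using the `ignore_index` option of `pd.concat` to renumber:

```python
import pandas as pd

# Made-up frames, each with the default index 0, 1
upper = pd.DataFrame({"col01": [1, 2]})
lower = pd.DataFrame({"col01": [3, 4]})

stacked = pd.concat([upper, lower])                        # keeps original labels
renumbered = pd.concat([upper, lower], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```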

LEFT JOIN

df = pd.merge(left, right, on="key", how='left')

df = pd.merge(left, right, left_on="lkey", right_on="rkey", how='left')

df = pd.merge(left, right, left_on=["lkey01", "lkey02"], right_on=["rkey01", "rkey02"], how='left')
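To check what a left join actually matched, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The frames below are invented for the illustration.

```python
import pandas as pd

# Invented frames: key "c" exists only on the left side
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b"], "rval": [10, 20]})

merged = pd.merge(left, right, on="key", how="left", indicator=True)

print(len(merged))                            # 3
print(merged["_merge"].astype(str).tolist())  # ['both', 'both', 'left_only']
print(merged["rval"].isna().tolist())         # [False, False, True]
```

Every left row is kept, and unmatched rows get NaN in the right-hand columns. If the right-hand key is not unique, the row count grows; passing `validate="many_to_one"` makes pandas raise an error in that case.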

GROUP BY

df = df.groupby(by="col01", as_index=False).sum()

# Aggregate value columns (col03, col04), not the grouping keys themselves
df = df.groupby(by=["col01", "col02"], as_index=False).agg({"col03": ['mean', 'count'], "col04": ['std', 'var']})

# Reset the index (usually used together with groupby)
df.reset_index(drop=True, inplace=True)
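Passing a list of functions to `agg` produces two-level (MultiIndex) column names, which can be awkward to work with afterwards. One possible alternative is pandas' named aggregation, sketched here on invented data; the output column names `col02_mean` and `col02_count` are chosen only for the example.

```python
import pandas as pd

# Invented data: two groups in col01
df = pd.DataFrame({
    "col01": ["x", "x", "y"],
    "col02": [1.0, 3.0, 5.0],
})

# List-of-functions form: columns become a two-level MultiIndex
multi = df.groupby(by="col01").agg({"col02": ["mean", "count"]})
print(multi.columns.tolist())  # [('col02', 'mean'), ('col02', 'count')]

# Named aggregation: flat, explicitly named output columns
flat = df.groupby(by="col01", as_index=False).agg(
    col02_mean=("col02", "mean"),
    col02_count=("col02", "count"),
)
print(flat.columns.tolist())  # ['col01', 'col02_mean', 'col02_count']
```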

Export to csv

filename = "hoge.csv"
df.to_csv(filename, header=True, index=False)

Other

Summary such as basic statistics

!pip install pandas-profiling
import pandas_profiling as pdp

profile = pdp.ProfileReport(df)
# Pass the path positionally: the keyword was renamed from outputfile to output_file between versions
profile.to_file("myoutputfile.html")

Run this report first, right after loading the data.

Counting occurrences

import collections

lis = ["Alice", "Alice", "Bob", "Bob", "Bob", "Carol"]
c = collections.Counter(lis)
c.most_common(3)
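For reference, here is what the counter above returns, next to the pandas equivalent `Series.value_counts()` for data that is already in a Series or DataFrame column. The same sample list is reused.

```python
import collections

import pandas as pd

lis = ["Alice", "Alice", "Bob", "Bob", "Bob", "Carol"]

# Standard library: (value, count) pairs, most frequent first
c = collections.Counter(lis)
top = c.most_common(2)
print(top)  # [('Bob', 3), ('Alice', 2)]

# pandas: the same counts as a Series sorted by frequency
counts = pd.Series(lis).value_counts()
print(counts.to_dict())  # {'Bob': 3, 'Alice': 2, 'Carol': 1}
```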

Progress bar

for i in tqdm(range(n)):
    pass  # replace with the per-iteration work

# List comprehension (foo is a placeholder expression)
[foo for i in tqdm(range(n))]

Measuring execution time

%%timeit
# Jupyter cell magic; %%timeit must be the first line of the cell
pass  # replace with the code to time
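`%%timeit` only works inside Jupyter/IPython. Outside a notebook, the standard library `timeit` module can be used in a similar way. This is a minimal sketch; the `work` function is a hypothetical stand-in for the code being measured.

```python
import timeit


def work():
    # Hypothetical stand-in for the code to be timed
    return sum(range(1000))


runs = 1000
elapsed = timeit.timeit(work, number=runs)  # total seconds for all runs
per_call = elapsed / runs

print(f"{per_call * 1e6:.1f} microseconds per call")
```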

Garbage collection

import gc

gc.collect()
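`gc.collect()` can only reclaim objects that are no longer referenced, so in practice the large object is deleted first. In CPython most objects are freed as soon as their last reference disappears; `gc.collect()` additionally cleans up reference cycles. A small sketch, with a plain list standing in for a large DataFrame:

```python
import gc

big = [0] * 1_000_000  # stand-in for a large DataFrame

del big                     # drop the last reference to the object
unreachable = gc.collect()  # force a full collection; returns the number of unreachable objects found

print(isinstance(unreachable, int))  # True
```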

Frequently used templates

list01 = []
list02 = []

for i in tqdm(range(n)):
  v01 = ...  # compute the first value
  list01.append(v01)
  v02 = ...  # compute the second value
  list02.append(v02)

df = pd.DataFrame({"col01":list01, "col02":list02})
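A variation on the template above, shown as a sketch rather than a replacement: each iteration appends one dict per row, so only a single list has to be maintained. The values (`i` and `i ** 2`) are placeholders, and `tqdm` is left out so the snippet runs without extra dependencies; `range(5)` can be wrapped in `tqdm(...)` exactly as in the template.

```python
import pandas as pd

rows = []

for i in range(5):
    # Placeholder computations; replace with the real per-iteration values
    rows.append({"col01": i, "col02": i ** 2})

# Build the DataFrame once, after the loop
df = pd.DataFrame(rows)

print(df.columns.tolist())  # ['col01', 'col02']
print(df["col02"].tolist())  # [0, 1, 4, 9, 16]
```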
