[PYTHON] Most wanted pandas functions (Part 02)

Here, I'm going to extend the last article I have written some time ago. You guys can read it in here Most wanted pandas functions (Part 01)

  1. Applying a function. When dealing with the data, we often have to use functions. There are few ways to use a function over a DataFrame. We should avoid iterating approaches since it is slow going through the panda's rows. We can use .apply() in this case.

pandas_funcs.ipynb


# Add $ to values of the Fare column.
def add_dolor_mark(money_str):
    money_str = '$ '+str(money_str)
    return money_str

df['Fare'] = df['Fare'].apply(add_dolor_mark)
df.head() 

pandas apply.png

  1. Renaming columns. Sometimes it is necessary to rename columns of our DataFrame. We can pass dictionary {‘column_1’ : ’new_column_1’, ‘column_2’ : ’new_column_2’} to .rename() Let's try to rename Embarked into Port and Fare into Price.

pandas_funcs.ipynb


df.rename(columns = {'Embarked':'Port', 'Fare':'Price'}, inplace=True)
df.head()

Screen Shot 2021-01-15 at 15.57.56.png

  1. Adding prefixes or suffixes to column names. In this case we can use add_prefix() and add_suffix()

pandas_funcs.ipynb


df = df.add_prefix('X_')
df.head()

Screen Shot 2021-01-15 at 16.01.50.png

  1. Conditional replacing. “Where” can be used to replace the values in rows or columns based on a condition. The way where is works a bit different. It is crucial to identify the way “where” works. "Where" select values based on the condition, then the remaining values are replaced with the specified value. So "where" is behaving like a masking operation.

pandas_funcs.ipynb


# we need to replace survied people with True
df['X_Survived'] = df['X_Survived'].where(df['X_Survived'] == 0 , True)
df.head()

Capture.PNG

  1. Melting This is a bit infamous function. But very useful. We can use melt to convert wide DataFrame to narrow ones. Let's assume we have measured the temperature of 5 cities and for a week. And the data is looking like this. Screen Shot 2021-01-15 at 16.46.24.png

(we can create a sample DataFrame as follows)

pandas_funcs.ipynb


import pandas as pd

temp = {'City': ['Ofuna','Tokyo','Yokohama','Okayama', 'Tsukuba'],
        'Day_1': [22,23,21,20,19],
        'Day_2': [23,22,24,29,18],
        'Day_3': [25,24,23,22,23],
        'Day_4': [21,20,25,21,20],
        'Day_5': [20,19,25,22,21],
        }

df = pd.DataFrame(temp, columns = ['City', 'Day_1', 'Day_2', 'Day_3', 'Day_4', 'Day_5’])

pandas_funcs.ipynb


df = df.melt(id_vars=['City'])
df

Screen Shot 2021-01-15 at 16.55.44.png

  1. Save as csv In some situations, we have to save processed DataFrame. In that case, we can use the to_csv function.

pandas_funcs.ipynb


df.to_csv('myfile.csv')
  1. Query Among the several filtering methods query is another filtering method that we can quickly and easily access to the subsections of the DataFrame.

pandas_funcs.ipynb


#letes assume that we want to get list of age of 10 and 15
seek_age = [10,15]
df.query("X_Age in @seek_age")

Capture2.PNG

  1. Memory usage

How much memory is used by a pandas DataFrame is really beneficial when dealing with large DataFrames. So we can avoid trouble with dead kernel due to out of memory

pandas_funcs.ipynb


df.memory_usage()

mem.PNG

For total Memory usage we can use,

pandas_funcs.ipynb


df.memory_usage().sum()
  1. Profile report It generates reports from a pandas DataFrame. The pandas df.describe() function is handy but a little basic for good exploratory data analysis. pandas_profiling extends the capability for quick data analysis.

pandas_funcs.ipynb


from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report', style={'full_width':True}, correlations={'kendall': False})
profile

prof.gif

  1. Explode Let's think that our data set holds multiple entries in a single row. But for our analysis, we need to analyze them on separate rows. In that case, we can use the explode function. (First lets create example DataFrame)

pandas_funcs.ipynb


df_ = pd.DataFrame({'id':['a','b','c','d'],
                  'humidity %':[62,55,[29,39,81],77],
                  'day':[1,2,3,4]})
df_

ex1.PNG

pandas_funcs.ipynb


df_.explode('humidity %')

ex2.PNG

  1. Crosstab It is used to group 2 or more variables and make a summary table. It is used to get insight of columns. Please note that crosstab() works with categorical data.

For example, let's think that we need to get how many people survived/not-survived according to their ports.

pandas_funcs.ipynb


#First lets replace True with 1 in X_Survived column.
df['X_Survived'] = df['X_Survived'].where(df['X_Survived'] == 0 , 1)

port_survived = pd.crosstab(df["X_Port"],df["X_Survived"])
port_survived

por.PNG

There are more many useful functions for analyzing data. I will introduce few more functions in the next article.

Hope this will help you....

Recommended Posts

Most wanted pandas functions (Part 02)
pandas series part 1
Python application: Pandas Part 1: Basic
Excel aggregation with Python pandas Part 1