[Pandas] I tried to analyze sales data with Python [For beginners]

I watched a Python tutorial video by Keith Galli, a YouTuber from overseas, and worked through it.

The video is "Solving real world data science tasks with Python Pandas!". Let's solve real-world data science problems using Python's pandas library. The difficulty is not high and the English is plain, so please take a look.

It's about an hour and a half long, and the explanations are very careful. It even shows the process of googling and finding solutions on Stack Overflow (a programming Q&A site).

The task is to analyze sales data in CSV format. The data can be downloaded from his GitHub.

Click here for GitHub

I'm running everything in a Jupyter notebook.

I've added supplementary comments to the code as I go.

Watching the video and actually trying the code yourself is, I think, the best way to practice pandas.

You can also get a feel for it just by reading this article.

Create a DataFrame by reading in the data

Load the pandas and os libraries

import pandas as pd
import os

First, check what files are included (get the list of file names)

files = [file for file in os.listdir("Sales_Data")]
for file in files:
    print(file)
Sales_April_2019.csv
Sales_August_2019.csv
Sales_December_2019.csv
Sales_February_2019.csv
Sales_January_2019.csv
Sales_July_2019.csv
Sales_June_2019.csv
Sales_March_2019.csv
Sales_May_2019.csv
Sales_November_2019.csv
Sales_October_2019.csv
Sales_September_2019.csv

Judging from the file names, the data covers January through December 2019. Let's combine it all for analysis.

First, create an empty data frame to store all the data

all_months_data = pd.DataFrame()

To combine all twelve months, append each month's CSV file to the empty DataFrame one after another.

for file in files:
    df = pd.read_csv("Sales_Data/" + file)
    all_months_data = pd.concat([all_months_data,df])
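As an aside, growing a DataFrame inside a loop re-copies the data on every iteration; collecting the frames in a list and concatenating once at the end is usually faster. A minimal sketch with toy stand-in frames (hypothetical data, not the actual monthly CSVs):

```python
import pandas as pd

# Toy stand-ins for two monthly CSVs (hypothetical data)
jan = pd.DataFrame({"Order ID": [1, 2], "Sales": [10.0, 20.0]})
feb = pd.DataFrame({"Order ID": [3], "Sales": [30.0]})

# Collect the frames in a list and concatenate once at the end
frames = [jan, feb]
all_months = pd.concat(frames, ignore_index=True)
```

With the real data, the list would be built as `frames = [pd.read_csv("Sales_Data/" + f) for f in files]`.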

The first 5 rows of the combined data look like this

all_months_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001
1 NaN NaN NaN NaN NaN NaN
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

Output the combined data (the serial-number index isn't needed here, so pass index=False)

all_months_data.to_csv("all_data.csv",index=False)

After confirming the output data, try reading the combined data again.

all_data = pd.read_csv("all_data.csv")
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001
1 NaN NaN NaN NaN NaN NaN
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

Work 1: When was the month with the highest sales and how much did you earn?

Looking at the columns, there is an Order Date column, but no month-only column, so let's create one

The first two characters of the Order Date column appear to be the month

all_data['Month'] = all_data['Order Date'].str[0:2]

Since the slice is a string, let's convert it back to a number

all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-77-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
      2 all_data.head()

(... pandas-internal stack frames omitted ...)

ValueError: cannot convert float NaN to integer

I got an error

Reading the error message, it says that NaN cannot be converted to an integer.

Combining isna and any returns True for any row that contains at least one NaN.

axis=1 makes the check run row by row

Let's display all the rows containing NaN

nan_df = all_data[all_data.isna().any(axis=1)]
nan_df
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
1 NaN NaN NaN NaN NaN NaN NaN
356 NaN NaN NaN NaN NaN NaN NaN
735 NaN NaN NaN NaN NaN NaN NaN
1433 NaN NaN NaN NaN NaN NaN NaN
1553 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
185176 NaN NaN NaN NaN NaN NaN NaN
185438 NaN NaN NaN NaN NaN NaN NaN
186042 NaN NaN NaN NaN NaN NaN NaN
186548 NaN NaN NaN NaN NaN NaN NaN
186826 NaN NaN NaN NaN NaN NaN NaN

545 rows × 7 columns

There were 545 such rows

Drop the rows where every value is NaN

all_data = all_data.dropna(how='all')
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 04
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 04
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 04
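For reference, how='all' drops only rows where every value is NaN, which is what we want here; how='any' would also drop rows that are only partially empty. A tiny sketch on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Product": ["Cable", np.nan, np.nan],
    "Sales":   [11.95,  np.nan, 5.98],
})

# how='all': only the fully-empty middle row is dropped
only_blank_dropped = df.dropna(how="all")

# how='any': the partially-empty last row is dropped as well
any_nan_dropped = df.dropna(how="any")
```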

Alright, let's go back and retry the Month conversion

all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-80-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
      2 all_data.head()

(... pandas-internal stack frames omitted ...)

ValueError: invalid literal for int() with base 10: 'Or'

Next, a new error appears

Reading it, the string 'Or' was somehow included

Let's look at the rows that contain 'Or'

temp_df = all_data[all_data['Order Date'].str[0:2] == "Or"]
temp_df.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
519 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
1149 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
1155 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
2878 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or
2893 Order ID Product Quantity Ordered Price Each Order Date Purchase Address Or

So the identity of "Or" was the "Order Date" column header: the header row of each CSV was concatenated in as data.

Let's clean the data by keeping only the rows that do not contain the "Order Date" header

all_data = all_data[all_data['Order Date'].str[0:2] != "Or"]
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 04
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 04
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 04
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 04

Third time's the charm: let's redo the numeric conversion

all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4

I got it
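As an aside (not how the video does it), parsing Order Date with pd.to_datetime and errors='coerce' handles both problems in one pass: unparseable rows such as the stray headers become NaT instead of raising. A sketch on toy data:

```python
import pandas as pd

# Toy Order Date values including a stray header row (hypothetical data)
s = pd.Series(["04/19/19 08:46", "Order Date", "12/01/19 10:00"])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(s, format="%m/%d/%y %H:%M", errors="coerce")
months = parsed.dt.month  # 4.0, NaN, 12.0 - the NaT rows can then be dropped
```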

Next: we have Quantity Ordered and Price Each columns, but no Sales column itself.

Since PQ (sales) = P (price) x Q (quantity), add a column that corresponds to PQ.

all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each'] 
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

(... pandas-internal stack frames omitted ...)

TypeError: can't multiply sequence by non-int of type 'str'


During handling of the above exception, another exception occurred:


TypeError                                 Traceback (most recent call last)

<ipython-input-84-7c1e2b69cbe2> in <module>
----> 1 all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']

(... pandas-internal stack frames omitted ...)

TypeError: can't multiply sequence by non-int of type 'str'

Once again, the columns contained strings

To get them into a computable form, use the to_numeric function

all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
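Note that plain pd.to_numeric raises on any leftover strings; passing errors='coerce' would instead turn them into NaN, which can be safer if any header rows survived. A sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy column mixing numbers with a leftover header string (hypothetical)
qty = pd.Series(["2", "Quantity Ordered", "1"])

# errors='coerce' replaces non-numeric entries with NaN instead of raising
qty_num = pd.to_numeric(qty, errors="coerce")
```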

I will try it again

all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each'] 
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99

It worked. It looks like we can finally answer the question.

To recap, here is the question again.

Work 1: When was the month with the highest sales and how much did you earn?

all_data.groupby('Month').sum()
Quantity Ordered Price Each Sales
Month
1 10903 1.811768e+06 1.822257e+06
2 13449 2.188885e+06 2.202022e+06
3 17005 2.791208e+06 2.807100e+06
4 20558 3.367671e+06 3.390670e+06
5 18667 3.135125e+06 3.152607e+06
6 15253 2.562026e+06 2.577802e+06
7 16072 2.632540e+06 2.647776e+06
8 13448 2.230345e+06 2.244468e+06
9 13109 2.084992e+06 2.097560e+06
10 22703 3.715555e+06 3.736727e+06
11 19798 3.180601e+06 3.199603e+06
12 28114 4.588415e+06 4.613443e+06

It's hard to tell at a glance, so let's visualize it

Import the plotting library

import matplotlib.pyplot as plt

Plot a bar chart, setting the range for the X axis

months = range(1,13)
results = all_data.groupby('Month').sum()
plt.bar(months,results['Sales'])
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

output_60_0.png

That was work 1.

Work 2: Which city had the highest sales?

Now that we have a Sales column, this seems like an easy task, but let's work through it

Before that, let's check again what the data was like.

all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99

It looks like we should extract part of the Purchase Address and total by it.

Let's create a City column, as we did for Sales.

Looking at the address data, the format is "street address, city, state zip", and we want only the second element, the city name.

Split on "," with the split function, and use apply to run it on every row of the column

all_data['City'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles

Only City was successfully divided!

By the way, there is one problem to consider before totaling by City.

Different states can have cities with the same name. If this is not taken into account, they will be lumped together and give unintended totals. (Example: there is a Portland in both Oregon and Maine.)

However, there is no need to create a separate column for the state. Since we are only aggregating, it is enough to include the two-letter state code in the City column itself, in the form City (State), e.g. Dallas (TX).

Earlier we wrote it as a one-line lambda, but this time let's define separate functions to get the city name and the state name. It is also kinder to the reader.

def get_city(address):
    return address.split(',')[1]

def get_state(address):
    return address.split(',')[2].split(' ')[1]
#The Purchase Address has a space after each comma, so split on the space as well to avoid a stray leading space in the state code

all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)} ({get_state(x)})")
all_data.drop("State",axis=1)  # display without the State column from an earlier cell (drop returns a new frame; all_data is unchanged)
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX)
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA)
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA)
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
... ... ... ... ... ... ... ... ... ...
186845 259353 AAA Batteries (4-pack) 3 2.99 09/17/19 20:56 840 Highland St, Los Angeles, CA 90001 9 8.97 Los Angeles (CA)
186846 259354 iPhone 1 700.00 09/01/19 16:00 216 Dogwood St, San Francisco, CA 94016 9 700.00 San Francisco (CA)
186847 259355 iPhone 1 700.00 09/23/19 07:39 220 12th St, San Francisco, CA 94016 9 700.00 San Francisco (CA)
186848 259356 34in Ultrawide Monitor 1 379.99 09/19/19 17:30 511 Forest St, San Francisco, CA 94016 9 379.99 San Francisco (CA)
186849 259357 USB-C Charging Cable 1 11.95 09/30/19 00:18 250 Meadow St, San Francisco, CA 94016 9 11.95 San Francisco (CA)

185950 rows × 9 columns

OK, now all that's left is to total as usual

all_data.groupby('City').sum()
Quantity Ordered Price Each Month Sales
City
Atlanta (GA) 16602 2.779908e+06 104794 2.795499e+06
Austin (TX) 11153 1.809874e+06 69829 1.819582e+06
Boston (MA) 22528 3.637410e+06 141112 3.661642e+06
Dallas (TX) 16730 2.752628e+06 104620 2.767975e+06
Los Angeles (CA) 33289 5.421435e+06 208325 5.452571e+06
New York City (NY) 27932 4.635371e+06 175741 4.664317e+06
Portland (ME) 2750 4.471893e+05 17144 4.497583e+05
Portland (OR) 11303 1.860558e+06 70621 1.870732e+06
San Francisco (CA) 50239 8.211462e+06 315520 8.262204e+06
Seattle (WA) 16553 2.733296e+06 104941 2.747755e+06

San Francisco is number one. You can also see that the two Portlands are kept properly separate.
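Rather than eyeballing the table, the top city can also be read off programmatically with idxmax. A sketch using toy totals (hypothetical numbers, not the real results above):

```python
import pandas as pd

# Toy per-city sales totals (hypothetical numbers)
results = pd.DataFrame(
    {"Sales": [8_262_204.0, 5_452_571.0, 4_664_317.0]},
    index=["San Francisco (CA)", "Los Angeles (CA)", "New York City (NY)"],
)

# idxmax returns the index label of the largest value
top_city = results["Sales"].idxmax()
top_sales = results["Sales"].max()
```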

Let's adapt the code we used for the monthly chart

results = all_data.groupby('City').sum()
cities = [city for city, df in all_data.groupby('City')]
#City list for the x-axis labels: the numbers and labels will mismatch unless they are in the same order as in the groupby above
plt.bar(cities,results['Sales'])
plt.xticks(cities,rotation="vertical")
#Displaying the city names horizontally would be cramped, so rotate the labels vertically
plt.ylabel('Sales')
plt.xlabel('City name')
plt.show()

output_1.png

It's done!

Work 3: At what time should an ad be displayed to maximize the likelihood that customers buy the product?

The work suddenly feels like real data science (though the reasoning here will not be very rigorous)

Now, as usual, let's look back at the data

all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001

Perhaps it would be good to look at the relationship between the Order Date column and the Sales column.

Let's change the data type so that Order Date can be handled easily as date data.

all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001

The format of the data has changed

Let's create Hour and Minute columns

When creating the Month and Sales columns the data was still strings, so we used slicing and split, but thanks to the datetime conversion above we can now use the dt accessor

all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 9 27

Let's visualize once to grasp the data

hours = [hour for hour, df in all_data.groupby('Hour')]
plt.plot(hours,all_data.groupby(['Hour']).count())
#Aggregate the number of orders (row count) per hour, with the hour on the X axis
plt.xticks(hours)
plt.grid()
#Add a grid to make the hourly trend easier to read
plt.xlabel('Hour')
plt.ylabel('Orders')
plt.show()

output_2.png

The peaks are around 11 AM and 7 PM. So it seems best to display ads around those times, when the number of orders (customers) is highest.

Work 4: Which items are most bought together as a set?

all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) TX 75001 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 9 27

It seems we can figure this out by grouping rows that share the same Order ID

Duplicated values can easily be extracted using the duplicated function.

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df.head(20)
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City State Hour Minute
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) CA 90001 14 38
18 176574 Google Phone 1 600.00 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) CA 90001 19 42
19 176574 USB-C Charging Cable 1 11.95 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) CA 90001 19 42
30 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 11 31
31 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) MA 02215 11 31
32 176586 AAA Batteries (4-pack) 2 2.99 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 5.98 San Francisco (CA) CA 94016 17 0
33 176586 Google Phone 1 600.00 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 600.00 San Francisco (CA) CA 94016 17 0
119 176672 Lightning Charging Cable 1 14.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 14.95 New York City (NY) NY 10001 11 7
120 176672 USB-C Charging Cable 1 11.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 11.95 New York City (NY) NY 10001 11 7
129 176681 Apple Airpods Headphones 1 150.00 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 150.00 Seattle (WA) WA 98101 10 39
130 176681 ThinkPad Laptop 1 999.99 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 999.99 Seattle (WA) WA 98101 10 39
138 176689 Bose SoundSport Headphones 1 99.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 99.99 New York City (NY) NY 10001 17 15
139 176689 AAA Batteries (4-pack) 2 2.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 5.98 New York City (NY) NY 10001 17 15
189 176739 34in Ultrawide Monitor 1 379.99 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 379.99 Austin (TX) TX 73301 17 38
190 176739 Google Phone 1 600.00 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 600.00 Austin (TX) TX 73301 17 38
225 176774 Lightning Charging Cable 1 14.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 14.95 Los Angeles (CA) CA 90001 15 6
226 176774 USB-C Charging Cable 1 11.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) CA 90001 15 6
233 176781 iPhone 1 700.00 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 700.00 Dallas (TX) TX 75001 7 37
234 176781 Lightning Charging Cable 1 14.95 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 14.95 Dallas (TX) TX 75001 7 37

Rows with the same ID are now lined up consecutively

Rows that share an Order ID belong to the same order (i.e., the items were bought together as a set)

After grouping by ID, create a column listing the products purchased in that order. (Example: ID 1111 → Grouped: Apple,Banana,Orange)

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
#transform is similar to apply; I'll omit the details here
df = df[['Order ID','Grouped']].drop_duplicates()
df.head(10)
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Order ID Grouped
3 176560 Google Phone,Wired Headphones
18 176574 Google Phone,USB-C Charging Cable
30 176585 Bose SoundSport Headphones,Bose SoundSport Hea...
32 176586 AAA Batteries (4-pack),Google Phone
119 176672 Lightning Charging Cable,USB-C Charging Cable
129 176681 Apple Airpods Headphones,ThinkPad Laptop
138 176689 Bose SoundSport Headphones,AAA Batteries (4-pack)
189 176739 34in Ultrawide Monitor,Google Phone
225 176774 Lightning Charging Cable,USB-C Charging Cable
233 176781 iPhone,Lightning Charging Cable

It feels good. The data is nicely organized.
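About the SettingWithCopyWarning that appeared above: taking an explicit .copy() of the filtered frame tells pandas the new column belongs to an independent object, which avoids the warning. A minimal sketch on toy data (hypothetical orders):

```python
import pandas as pd

# Toy order data: Order ID 1 has two products bought together (hypothetical)
all_data = pd.DataFrame({
    "Order ID": [1, 1, 2],
    "Product": ["Phone", "Cable", "Battery"],
})

# .copy() makes the filtered frame an independent object,
# so adding the Grouped column no longer triggers the warning
df = all_data[all_data["Order ID"].duplicated(keep=False)].copy()
df["Grouped"] = df.groupby("Order ID")["Product"].transform(lambda x: ",".join(x))
df = df[["Order ID", "Grouped"]].drop_duplicates()
```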

From this Grouped column, we need to count which products appear together, in what combinations, and how many times.

Let's aggregate using convenient standard-library tools (itertools and collections); this approach comes from a Stack Overflow answer shown in the video.

from itertools import combinations
from collections import Counter

count = Counter()
for row in df['Grouped']:
    row_list = row.split(',')
    # count every 2-product combination that appears within a single order
    count.update(Counter(combinations(row_list,2)))

count.most_common(10)
[(('iPhone', 'Lightning Charging Cable'), 1005),
 (('Google Phone', 'USB-C Charging Cable'), 987),
 (('iPhone', 'Wired Headphones'), 447),
 (('Google Phone', 'Wired Headphones'), 414),
 (('Vareebadd Phone', 'USB-C Charging Cable'), 361),
 (('iPhone', 'Apple Airpods Headphones'), 360),
 (('Google Phone', 'Bose SoundSport Headphones'), 220),
 (('USB-C Charging Cable', 'Wired Headphones'), 160),
 (('Vareebadd Phone', 'Wired Headphones'), 143),
 (('Lightning Charging Cable', 'Wired Headphones'), 92)]

The output is a little hard to read, so let's clean it up:

for key,value in count.most_common(10):
    print(key,value)
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
('iPhone', 'Apple Airpods Headphones') 360
('Google Phone', 'Bose SoundSport Headphones') 220
('USB-C Charging Cable', 'Wired Headphones') 160
('Vareebadd Phone', 'Wired Headphones') 143
('Lightning Charging Cable', 'Wired Headphones') 92
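One caveat worth knowing about this counting approach: `combinations` preserves the order in which products appear within each row, so ('A', 'B') from one order and ('B', 'A') from another land in two different Counter keys and are tallied separately. Sorting each row before taking combinations merges them. A small sketch on toy data (not the real CSV):

```python
from itertools import combinations
from collections import Counter

# Two orders containing the same pair, listed in opposite order
rows = ['iPhone,Wired Headphones', 'Wired Headphones,iPhone']

# Naive counting: the two orderings end up in separate buckets
naive = Counter()
for row in rows:
    naive.update(combinations(row.split(','), 2))

# Sorting each row first normalizes the pair key
merged = Counter()
for row in rows:
    merged.update(combinations(sorted(row.split(',')), 2))

print(len(naive))   # 2 — the pair and its reverse are counted separately
print(len(merged))  # 1 — both collapse into one sorted pair with count 2
```

Whether this matters for the real dataset depends on whether product order within an Order ID is consistent; sorting first is the safer habit.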

The iPhone and Lightning Charging Cable pair is by far the most common.

Work 5 (Final Work): Which product sells best? Why do you think it sells best?

This is the last task, and this time you also have to think for yourself (the hypothesis part). Let's visualize the data.

First, look at it on a quantity basis.

import matplotlib.pyplot as plt

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']
products = [product for product,df in product_group]
plt.bar(products,quantity_ordered)
plt.xlabel('Product')
plt.ylabel('Quantity Ordered')
plt.xticks(products,rotation="vertical",size=8)
plt.show()

[output_3.png: bar chart of total quantity ordered per product]

Batteries and cables seem to sell in the largest quantities.

I suspect that is because they are cheap. Let's overlay prices on the sales quantities to check.

prices = all_data.groupby('Product').mean()['Price Each']

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
#Graph generation that is common to the x-axis and has different y-axis values
ax1.bar(products,quantity_ordered)
ax2.plot(products,prices,'b-')

ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color="g")
ax2.set_ylabel('Price', color="b")
ax1.set_xticklabels(products,rotation="vertical",size=8)

plt.show()

[output_4.png: bar chart of quantity ordered per product, with mean price overlaid as a line on a second y-axis]
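To put a number on the "cheap items sell more" intuition, one could also compute the correlation between mean price and total quantity per product. This is a sketch on hypothetical per-product numbers, not output from the real dataset:

```python
import pandas as pd

# Toy per-product summary (hypothetical figures for illustration only)
summary = pd.DataFrame({
    'Price Each': [2.99, 11.95, 700.00, 999.99],
    'Quantity Ordered': [31000, 21000, 6800, 5500],
}, index=['AAA Batteries (4-pack)', 'USB-C Charging Cable',
          'iPhone', 'Macbook Pro Laptop'])

# Pearson correlation between price and quantity; a strongly negative value
# supports the hypothesis that cheaper products sell in larger quantities
corr = summary['Price Each'].corr(summary['Quantity Ordered'])
print(round(corr, 2))
```

On the real data one would build `summary` from `all_data.groupby('Product')` instead of typing the numbers in.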

That's all for this task.
