I watched a Python lecture video by Keith Galli, an English-speaking YouTuber, and worked through it myself: "Solving real world data science tasks with Python Pandas!" The difficulty level is not high and the English is plain, so please have a look.
It is about an hour and a half long, with very careful explanations. It even shows the process of finding solutions by searching Google and Stack Overflow (a programming Q&A site).
The task is to analyze CSV sales data, which can be downloaded from his GitHub.
I ran everything in a Jupyter notebook, adding my own supplementary comments to the code.
Watching the video and actually trying the code yourself is, I think, the best way to practice pandas,
but you can get a feel for it just by reading this article.
Load the pandas and os libraries
import pandas as pd
import os
First, check what kind of data we have (get the list of file names)
files = os.listdir("Sales_Data")
for file in files:
print(file)
Sales_April_2019.csv
Sales_August_2019.csv
Sales_December_2019.csv
Sales_February_2019.csv
Sales_January_2019.csv
Sales_July_2019.csv
Sales_June_2019.csv
Sales_March_2019.csv
Sales_May_2019.csv
Sales_November_2019.csv
Sales_October_2019.csv
Sales_September_2019.csv
Judging from the file names, the data covers January through December 2019. Let's combine it all for analysis.
First, create an empty DataFrame to hold all the data
all_months_data = pd.DataFrame()
To combine all the months, append each month's CSV file to the empty DataFrame one after another.
for file in files:
df = pd.read_csv("Sales_Data/" + file)
all_months_data = pd.concat([all_months_data,df])
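As an aside, calling pd.concat inside the loop copies the accumulated data on every iteration; collecting the frames in a list and concatenating once is the more idiomatic pattern. A minimal self-contained sketch (it builds two toy CSVs in a temp directory so it can run anywhere; with the real data you would glob "Sales_Data/*.csv" instead):

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny CSVs standing in for the real monthly files.
tmp = tempfile.mkdtemp()
pd.DataFrame({"Order ID": [1], "Sales": [9.99]}).to_csv(
    os.path.join(tmp, "Sales_January_2019.csv"), index=False)
pd.DataFrame({"Order ID": [2], "Sales": [5.00]}).to_csv(
    os.path.join(tmp, "Sales_February_2019.csv"), index=False)

# Read every CSV into a list, then concatenate once at the end.
frames = [pd.read_csv(f) for f in sorted(glob.glob(os.path.join(tmp, "*.csv")))]
all_months_data = pd.concat(frames, ignore_index=True)
print(len(all_months_data))  # 2
```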
The first 5 rows of the combined data look like this:
all_months_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
Write the combined data out to CSV (the row index is not needed here, so pass index=False)
all_months_data.to_csv("all_data.csv",index=False)
After checking the output file, read the combined data back in.
all_data = pd.read_csv("all_data.csv")
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
Looking at the columns, there is an Order Date column, but nothing holding just the month, so let's create one.
The first two characters of the Order Date column look like the month.
all_data['Month'] = all_data['Order Date'].str[0:2]
Slicing gave us a string, so let's convert it back to a number.
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-77-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
2 all_data.head()
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
5880 # else, only a single dtype is given
5881 new_data = self._data.astype(
-> 5882 dtype=dtype, copy=copy, errors=errors, **kwargs
5883 )
5884 return self._constructor(new_data).__finalize__(self)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
579
580 def astype(self, dtype, **kwargs):
--> 581 return self.apply("astype", dtype=dtype, **kwargs)
582
583 def convert(self, **kwargs):
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
436 kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
437
--> 438 applied = getattr(b, f)(**kwargs)
439 result_blocks = _extend_blocks(applied, result_blocks)
440
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
557
558 def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559 return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
560
561 def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
641 # _astype_nansafe works fine with 1-d only
642 vals1d = values.ravel()
--> 643 values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
644
645 # TODO(extension)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
705 # work around NumPy brokenness, #1987
706 if np.issubdtype(dtype.type, np.integer):
--> 707 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
708
709 # if we have a datetime/timedelta array of objects
pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
ValueError: cannot convert float NaN to integer
I got an error.
Reading the message, it says a NaN cannot be converted to an integer.
Combining the isna and any functions returns True for any row that contains at least one NaN; axis=1 makes the check run row by row.
Let's pull out all the rows containing NaN.
nan_df = all_data[all_data.isna().any(axis=1)]
nan_df
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | |
---|---|---|---|---|---|---|---|
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
356 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
735 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1433 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1553 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
185176 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
185438 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
186042 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
186548 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
186826 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
545 rows × 7 columns
There were 545 such rows.
Drop the rows where every value is NaN
all_data = all_data.dropna(how='all')
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | |
---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 04 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 04 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 04 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 04 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 04 |
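For clarity, here is a minimal sketch (on a toy frame, not the sales data) of how how='all' differs from the default how='any': 'all' only drops rows where every value is NaN, which is exactly what we want here.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 is partially NaN, row 2 is all NaN.
df = pd.DataFrame({"a": [1, np.nan, np.nan],
                   "b": [2, 3, np.nan]})

print(len(df.dropna(how="all")))  # 2: only the all-NaN row is removed
print(len(df.dropna(how="any")))  # 1: any NaN at all removes the row
```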
Alright, let's retry the earlier numeric conversion.
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-80-ffc394ccb2ad> in <module>
----> 1 all_data['Month'] = all_data['Month'].astype('int32')
(... same traceback as above, abridged ...)
ValueError: invalid literal for int() with base 10: 'Or'
And now a new error.
It seems the string 'Or' was included somewhere.
Let's look at the rows whose Order Date starts with 'Or'.
temp_df = all_data[all_data['Order Date'].str[0:2] == "Or"]
temp_df.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | |
---|---|---|---|---|---|---|---|
519 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Or |
1149 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Or |
1155 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Or |
2878 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Or |
2893 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Or |
The identity of "Or" turns out to be the literal header text "Order Date": each monthly CSV's header row was concatenated in as data.
Let's refresh the data by keeping only the rows that do not contain the header text.
all_data = all_data[all_data['Order Date'].str[0:2] != "Or"]
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | |
---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 04 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 04 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 04 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 04 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 04 |
Third time's the charm; let's redo the numeric conversion.
all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | |
---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 |
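As an aside, pd.to_numeric with errors='coerce' can handle both problems we hit (all-NaN rows and stray header text) in one pass, by turning anything non-numeric into NaN that we can then drop. A minimal sketch on a toy column:

```python
import pandas as pd

# Toy column mixing real months, a missing value, and a stray header ("Or").
month = pd.Series(["04", "12", None, "Or"])

coerced = pd.to_numeric(month, errors="coerce")  # bad values become NaN
clean = coerced.dropna().astype("int32")         # drop them, then cast
print(clean.tolist())  # [4, 12]
```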
Got it this time.
Next: we have Quantity Ordered and Price Each, but not the sales figure we actually care about.
Since sales = price × quantity (P × Q), let's add a column for it.
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in na_op(x, y)
967 try:
--> 968 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
969 except TypeError:
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
220 if use_numexpr:
--> 221 return _evaluate(op, op_str, a, b, **eval_kwargs)
222 return _evaluate_standard(op, op_str, a, b)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
126 if result is None:
--> 127 result = _evaluate_standard(op, op_str, a, b)
128
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
69 with np.errstate(all="ignore"):
---> 70 return op(a, b)
71
TypeError: can't multiply sequence by non-int of type 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-84-7c1e2b69cbe2> in <module>
----> 1 all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in wrapper(left, right)
1046
1047 with np.errstate(all="ignore"):
-> 1048 result = na_op(lvalues, rvalues)
1049 return construct_result(
1050 left, result, index=left.index, name=res_name, dtype=None
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in na_op(x, y)
968 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
969 except TypeError:
--> 970 result = masked_arith_op(x, y, op)
971
972 return missing.dispatch_fill_zeros(op, x, y, result)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py in masked_arith_op(x, y, op)
445 if mask.any():
446 with np.errstate(all="ignore"):
--> 447 result[mask] = op(xrav[mask], com.values_from_object(yrav[mask]))
448
449 else:
TypeError: can't multiply sequence by non-int of type 'str'
A string snuck in yet again.
To get these columns into a form we can calculate with, let's use the to_numeric function.
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
I will try it again
all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | |
---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 |
There it is. Now we can finally answer the original question: which month had the best sales?
Let's total up by month.
all_data.groupby('Month').sum()
Quantity Ordered | Price Each | Sales | |
---|---|---|---|
Month | |||
1 | 10903 | 1.811768e+06 | 1.822257e+06 |
2 | 13449 | 2.188885e+06 | 2.202022e+06 |
3 | 17005 | 2.791208e+06 | 2.807100e+06 |
4 | 20558 | 3.367671e+06 | 3.390670e+06 |
5 | 18667 | 3.135125e+06 | 3.152607e+06 |
6 | 15253 | 2.562026e+06 | 2.577802e+06 |
7 | 16072 | 2.632540e+06 | 2.647776e+06 |
8 | 13448 | 2.230345e+06 | 2.244468e+06 |
9 | 13109 | 2.084992e+06 | 2.097560e+06 |
10 | 22703 | 3.715555e+06 | 3.736727e+06 |
11 | 19798 | 3.180601e+06 | 3.199603e+06 |
12 | 28114 | 4.588415e+06 | 4.613443e+06 |
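Rather than eyeballing the table, idxmax can pull the best month out programmatically; on the real frame you would apply it to all_data.groupby('Month')['Sales'].sum(). A sketch on toy totals:

```python
import pandas as pd

# Toy monthly totals standing in for the real groupby result.
sales = pd.Series({1: 1.82e6, 4: 3.39e6, 12: 4.61e6}, name="Sales")

print(sales.idxmax())  # 12 -- the month with the highest total
print(sales.max())     # the corresponding sales figure
```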
The numbers alone are hard to grasp at a glance, so let's visualize them.
Import the plotting library
import matplotlib.pyplot as plt
Decide the range for the x axis and draw a bar chart
months = range(1,13)
results = all_data.groupby('Month').sum()
plt.bar(months,results['Sales'])
plt.xticks(months)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
That completes task 1.
The next task is sales by city; with the Sales column already in place it looks easy, but let's work through it.
Before that, let's check again what the data looks like.
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | |
---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 |
It looks like we need to extract part of the Purchase Address and total by it.
Let's create a City column, just as we did for Sales.
Looking at the address data, the format is "street address, city, state zip", so we want to pull out the second element, the city name.
Split on "," with the split function, and use apply to run it over every row of the column.
all_data['City'] = all_data['Purchase Address'].apply(lambda x: x.split(',')[1])
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | |
---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles |
The City column was split out successfully!
By the way, there is one pitfall to watch for when totaling by City.
Different states can contain cities with the same name. If this is ignored, those cities get summed together and produce unintended totals. (Example: there is a Portland in both Oregon and Maine.)
However, there is no need to create a separate column for the state. Since we are only aggregating, it is enough to put the two-letter state code directly into the City cell. This time we will use the form "City (State)", e.g. Dallas (TX).
Earlier I wrote the extraction as a one-line lambda, but this time let's define separate functions to get the city name and the state name. It is also kinder to the reader.
def get_city(address):
return address.split(',')[1]
def get_state(address):
return address.split(',')[2].split(' ')[1]
#Purchase Address has a space after each comma, so split again on the space to get a clean two-letter state code
all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)} ({get_state(x)})")
#Note: the State column was created in a step not shown here. Also, drop returns a new DataFrame; without reassignment, all_data itself still keeps State (it reappears in later outputs).
all_data.drop("State",axis=1)
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | |
---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas (TX) |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
186845 | 259353 | AAA Batteries (4-pack) | 3 | 2.99 | 09/17/19 20:56 | 840 Highland St, Los Angeles, CA 90001 | 9 | 8.97 | Los Angeles (CA) |
186846 | 259354 | iPhone | 1 | 700.00 | 09/01/19 16:00 | 216 Dogwood St, San Francisco, CA 94016 | 9 | 700.00 | San Francisco (CA) |
186847 | 259355 | iPhone | 1 | 700.00 | 09/23/19 07:39 | 220 12th St, San Francisco, CA 94016 | 9 | 700.00 | San Francisco (CA) |
186848 | 259356 | 34in Ultrawide Monitor | 1 | 379.99 | 09/19/19 17:30 | 511 Forest St, San Francisco, CA 94016 | 9 | 379.99 | San Francisco (CA) |
186849 | 259357 | USB-C Charging Cable | 1 | 11.95 | 09/30/19 00:18 | 250 Meadow St, San Francisco, CA 94016 | 9 | 11.95 | San Francisco (CA) |
185950 rows × 9 columns
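As an aside, the same extraction can also be done without apply, using pandas' vectorized .str accessor; a sketch on toy addresses in the same "street, city, ST zip" shape:

```python
import pandas as pd

# Toy addresses shaped like the Purchase Address column.
addr = pd.Series(["917 1st St, Dallas, TX 75001",
                  "682 Chestnut St, Boston, MA 02215"])

parts = addr.str.split(",")                 # list per row: [street, city, state zip]
city = parts.str[1].str.strip()             # second element, whitespace removed
state = parts.str[2].str.strip().str[:2]    # first two characters of "TX 75001"
label = city + " (" + state + ")"
print(label.tolist())  # ['Dallas (TX)', 'Boston (MA)']
```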
Looks good; now all that is left is the usual aggregation.
all_data.groupby('City').sum()
Quantity Ordered | Price Each | Month | Sales | |
---|---|---|---|---|
City | ||||
Atlanta (GA) | 16602 | 2.779908e+06 | 104794 | 2.795499e+06 |
Austin (TX) | 11153 | 1.809874e+06 | 69829 | 1.819582e+06 |
Boston (MA) | 22528 | 3.637410e+06 | 141112 | 3.661642e+06 |
Dallas (TX) | 16730 | 2.752628e+06 | 104620 | 2.767975e+06 |
Los Angeles (CA) | 33289 | 5.421435e+06 | 208325 | 5.452571e+06 |
New York City (NY) | 27932 | 4.635371e+06 | 175741 | 4.664317e+06 |
Portland (ME) | 2750 | 4.471893e+05 | 17144 | 4.497583e+05 |
Portland (OR) | 11303 | 1.860558e+06 | 70621 | 1.870732e+06 |
San Francisco (CA) | 50239 | 8.211462e+06 | 315520 | 8.262204e+06 |
Seattle (WA) | 16553 | 2.733296e+06 | 104941 | 2.747755e+06 |
San Francisco is number one. You can also see that the two Portlands are properly kept apart.
Reuse the plotting code from the monthly chart earlier
results = all_data.groupby('City').sum()
cities = [city for city, df in all_data.groupby('City')]
#cities gives the x-axis labels: the bars and labels will be mismatched unless they are in the same order as the groupby above
plt.bar(cities,results['Sales'])
plt.xticks(cities,rotation="vertical")
#The city names would collide if drawn horizontally, so rotate them to vertical
plt.ylabel('Sales')
plt.xlabel('City name')
plt.show()
It's done!
This suddenly feels like real data science (though admittedly the logic behind the conclusions is not very rigorous).
Now, as usual, let's look back at the data.
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | State | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas (TX) | TX 75001 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) | MA 02215 |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) | CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 |
Perhaps it would be interesting to look at the relationship between the Order Date column and sales.
Let's change the data type so that Order Date can be handled easily as datetime data.
all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | State | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas (TX) | TX 75001 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) | MA 02215 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) | CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 |
The format of the data has changed.
Next, let's create Hour and Minute columns.
When we created the Month column we had to slice strings, but thanks to the datetime conversion we can now use the dt accessor directly.
all_data['Hour'] = all_data['Order Date'].dt.hour
all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | State | Hour | Minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas (TX) | TX 75001 | 8 | 46 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) | MA 02215 | 22 | 30 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) | CA 90001 | 14 | 38 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 | 14 | 38 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 | 9 | 27 |
Let's visualize once to grasp the data
hours = [hour for hour, df in all_data.groupby('Hour')]
plt.plot(hours,all_data.groupby(['Hour']).count())
#Aggregate the number of orders (row count) per hour, with the hour on the x axis
plt.xticks(hours)
plt.grid()
#Add a grid to make the hourly trend as easy to read as possible
plt.xlabel('Hour')
plt.ylabel('Orders')
plt.show()
The peaks are around 11 AM and 7 PM. So it seems best to display advertisements around those times, when orders (and hence customers) are at their highest.
all_data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | State | Hour | Minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas (TX) | TX 75001 | 8 | 46 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) | MA 02215 | 22 | 30 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) | CA 90001 | 14 | 38 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 | 14 | 38 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 | 9 | 27 |
The next question is which products are most often sold together. Items bought together share the same Order ID, so grouping by it should reveal the sets.
Rows with a duplicated Order ID are easy to extract with the duplicated function (keep=False marks every copy of a duplicate, not just the later ones).
df = all_data[all_data['Order ID'].duplicated(keep=False)]
df.head(20)
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | State | Hour | Minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) | CA 90001 | 14 | 38 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles (CA) | CA 90001 | 14 | 38 |
18 | 176574 | Google Phone | 1 | 600.00 | 2019-04-03 19:42:00 | 20 Hill St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles (CA) | CA 90001 | 19 | 42 |
19 | 176574 | USB-C Charging Cable | 1 | 11.95 | 2019-04-03 19:42:00 | 20 Hill St, Los Angeles, CA 90001 | 4 | 11.95 | Los Angeles (CA) | CA 90001 | 19 | 42 |
30 | 176585 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 11:31:00 | 823 Highland St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) | MA 02215 | 11 | 31 |
31 | 176585 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 11:31:00 | 823 Highland St, Boston, MA 02215 | 4 | 99.99 | Boston (MA) | MA 02215 | 11 | 31 |
32 | 176586 | AAA Batteries (4-pack) | 2 | 2.99 | 2019-04-10 17:00:00 | 365 Center St, San Francisco, CA 94016 | 4 | 5.98 | San Francisco (CA) | CA 94016 | 17 | 0 |
33 | 176586 | Google Phone | 1 | 600.00 | 2019-04-10 17:00:00 | 365 Center St, San Francisco, CA 94016 | 4 | 600.00 | San Francisco (CA) | CA 94016 | 17 | 0 |
119 | 176672 | Lightning Charging Cable | 1 | 14.95 | 2019-04-12 11:07:00 | 778 Maple St, New York City, NY 10001 | 4 | 14.95 | New York City (NY) | NY 10001 | 11 | 7 |
120 | 176672 | USB-C Charging Cable | 1 | 11.95 | 2019-04-12 11:07:00 | 778 Maple St, New York City, NY 10001 | 4 | 11.95 | New York City (NY) | NY 10001 | 11 | 7 |
129 | 176681 | Apple Airpods Headphones | 1 | 150.00 | 2019-04-20 10:39:00 | 331 Cherry St, Seattle, WA 98101 | 4 | 150.00 | Seattle (WA) | WA 98101 | 10 | 39 |
130 | 176681 | ThinkPad Laptop | 1 | 999.99 | 2019-04-20 10:39:00 | 331 Cherry St, Seattle, WA 98101 | 4 | 999.99 | Seattle (WA) | WA 98101 | 10 | 39 |
138 | 176689 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-24 17:15:00 | 659 Lincoln St, New York City, NY 10001 | 4 | 99.99 | New York City (NY) | NY 10001 | 17 | 15 |
139 | 176689 | AAA Batteries (4-pack) | 2 | 2.99 | 2019-04-24 17:15:00 | 659 Lincoln St, New York City, NY 10001 | 4 | 5.98 | New York City (NY) | NY 10001 | 17 | 15 |
189 | 176739 | 34in Ultrawide Monitor | 1 | 379.99 | 2019-04-05 17:38:00 | 730 6th St, Austin, TX 73301 | 4 | 379.99 | Austin (TX) | TX 73301 | 17 | 38 |
190 | 176739 | Google Phone | 1 | 600.00 | 2019-04-05 17:38:00 | 730 6th St, Austin, TX 73301 | 4 | 600.00 | Austin (TX) | TX 73301 | 17 | 38 |
225 | 176774 | Lightning Charging Cable | 1 | 14.95 | 2019-04-25 15:06:00 | 372 Church St, Los Angeles, CA 90001 | 4 | 14.95 | Los Angeles (CA) | CA 90001 | 15 | 6 |
226 | 176774 | USB-C Charging Cable | 1 | 11.95 | 2019-04-25 15:06:00 | 372 Church St, Los Angeles, CA 90001 | 4 | 11.95 | Los Angeles (CA) | CA 90001 | 15 | 6 |
233 | 176781 | iPhone | 1 | 700.00 | 2019-04-03 07:37:00 | 976 Hickory St, Dallas, TX 75001 | 4 | 700.00 | Dallas (TX) | TX 75001 | 7 | 37 |
234 | 176781 | Lightning Charging Cable | 1 | 14.95 | 2019-04-03 07:37:00 | 976 Hickory St, Dallas, TX 75001 | 4 | 14.95 | Dallas (TX) | TX 75001 | 7 | 37 |
Rows with the same Order ID now appear in consecutive runs; each run is an order bought as a set.
After grouping by ID, create a column listing the products purchased under that ID. (Example — ID: 1111 → Grouped: Apple,Banana,Orange)
df = all_data[all_data['Order ID'].duplicated(keep=False)]
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
# transform works like apply, but returns a result aligned with the original index, so it can be assigned back as a column. Details omitted here.
df = df[['Order ID','Grouped']].drop_duplicates()
df.head(10)
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
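The warning above appears because the filtered DataFrame is a view of the original. It can be silenced by making an explicit copy before assigning the new column. Here is a minimal sketch of that fix using small, made-up toy data (the column names match the article, but the rows are illustrative):

```python
import pandas as pd

# Toy data mimicking the columns used above (values are illustrative)
all_data = pd.DataFrame({
    'Order ID': [1, 1, 2, 3, 3],
    'Product': ['iPhone', 'Lightning Charging Cable', 'AA Batteries',
                'Google Phone', 'USB-C Charging Cable'],
})

# .copy() creates an independent DataFrame, so the column assignment
# below no longer triggers SettingWithCopyWarning
df = all_data[all_data['Order ID'].duplicated(keep=False)].copy()
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df = df[['Order ID', 'Grouped']].drop_duplicates()
print(df)
```

The only change from the article's code is the trailing `.copy()`; the rest of the logic is identical.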
Order ID | Grouped | |
---|---|---|
3 | 176560 | Google Phone,Wired Headphones |
18 | 176574 | Google Phone,USB-C Charging Cable |
30 | 176585 | Bose SoundSport Headphones,Bose SoundSport Hea... |
32 | 176586 | AAA Batteries (4-pack),Google Phone |
119 | 176672 | Lightning Charging Cable,USB-C Charging Cable |
129 | 176681 | Apple Airpods Headphones,ThinkPad Laptop |
138 | 176689 | Bose SoundSport Headphones,AAA Batteries (4-pack) |
189 | 176739 | 34in Ultrawide Monitor,Google Phone |
225 | 176774 | Lightning Charging Cable,USB-C Charging Cable |
233 | 176781 | iPhone,Lightning Charging Cable |
Looks good. The data is now nicely organized.
Next, we need to count which products appear together in this Grouped column, and how often.
Let's aggregate using a couple of convenient standard-library tools: itertools.combinations and collections.Counter. (A Stack Overflow answer gives a concrete picture of how they work.)
from itertools import combinations
from collections import Counter
count = Counter()
for row in df['Grouped']:
row_list = row.split(',')
count.update(Counter(combinations(row_list,2)))
count.most_common(10)
[(('iPhone', 'Lightning Charging Cable'), 1005),
(('Google Phone', 'USB-C Charging Cable'), 987),
(('iPhone', 'Wired Headphones'), 447),
(('Google Phone', 'Wired Headphones'), 414),
(('Vareebadd Phone', 'USB-C Charging Cable'), 361),
(('iPhone', 'Apple Airpods Headphones'), 360),
(('Google Phone', 'Bose SoundSport Headphones'), 220),
(('USB-C Charging Cable', 'Wired Headphones'), 160),
(('Vareebadd Phone', 'Wired Headphones'), 143),
(('Lightning Charging Cable', 'Wired Headphones'), 92)]
The output is a little hard to read, so let's clean it up.
for key,value in count.most_common(10):
print(key,value)
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
('iPhone', 'Apple Airpods Headphones') 360
('Google Phone', 'Bose SoundSport Headphones') 220
('USB-C Charging Cable', 'Wired Headphones') 160
('Vareebadd Phone', 'Wired Headphones') 143
('Lightning Charging Cable', 'Wired Headphones') 92
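One caveat worth knowing: combinations() emits pairs in the order the products appear in each row, so ('A', 'B') and ('B', 'A') would be counted as different pairs if the row order ever varied. Sorting each pair before counting makes the tally order-insensitive. A minimal sketch, using made-up Grouped values:

```python
from itertools import combinations
from collections import Counter

# Hypothetical Grouped values; note the same pair appears in both orders
rows = ['iPhone,Lightning Charging Cable',
        'Lightning Charging Cable,iPhone',
        'Google Phone,USB-C Charging Cable']

count = Counter()
for row in rows:
    items = row.split(',')
    # sorting each pair normalizes ('A','B') and ('B','A') to one key
    count.update(tuple(sorted(pair)) for pair in combinations(items, 2))

print(count.most_common(2))
```

With the article's data this likely makes little difference, since products within an order appear in a consistent order, but it is a cheap safeguard.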
The iPhone and Lightning charging cable pair is by far the most common combination.
This is the last task. From here on, you think for yourself (form a hypothesis) and visualize the data.
First, look at it on a quantity basis.
import matplotlib.pyplot as plt

product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']
products = [product for product,df in product_group]
plt.bar(products,quantity_ordered)
plt.xlabel('Product')
plt.ylabel('Quantity Ordered')
plt.xticks(products,rotation="vertical",size=8)
plt.show()
Batteries and cables dominate, don't they?
I suspect that's because they are cheap. Let's check prices against sales volume.
prices = all_data.groupby('Product').mean()['Price Each']
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
# Two plots sharing the same x-axis but with independent y-axes
ax1.bar(products,quantity_ordered)
ax2.plot(products,prices,'b-')
ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color="g")
ax2.set_ylabel('Price', color="b")
ax1.set_xticklabels(products,rotation="vertical",size=8)
plt.show()
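One portability note: calling .mean() on a whole grouped DataFrame, as above, raises an error on non-numeric columns in recent pandas versions. Selecting the column before aggregating works across versions. A minimal sketch with made-up rows (column names match the article):

```python
import pandas as pd

# Illustrative data with the same column names as the article
all_data = pd.DataFrame({
    'Product': ['iPhone', 'iPhone', 'AAA Batteries (4-pack)'],
    'Price Each': [700.0, 700.0, 2.99],
})

# Selecting 'Price Each' first avoids aggregating non-numeric columns
prices = all_data.groupby('Product')['Price Each'].mean()
print(prices)
```

The result is the same Series of average prices per product used for the secondary axis above.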
That is all.