[PYTHON] Development memorandum ~ pandas, forecast, data structure ~

Overview

Recently, I had the opportunity to automate part of my work. The task was as follows:

-- Use the gspread library to extract schedules from Google Sheets
-- Perform simple aggregation and rearrangement, and reassemble the result into pandas data frames
-- Write the assembled data frames back to Google Sheets

It is a relatively simple job, and it is currently about 80% complete. It was harder than I expected and I learned a lot, so I am writing this article as a summary for my future self. It may be a little messy, but I hope you will read it kindly.

Knowledge

The following summarizes what I learned on the knowledge side, mainly about library and language specifications.

1.) pandas.DataFrame assignment points to the same object

This is probably obvious to anyone familiar with data frames. I had made the same mistake with lists before, but it did not occur to me here and I made it again.

The situation: I wanted to initialize a dictionary keyed by datetime with an empty data frame for each key, as shown below. My thinking was, "I want to group the people using each section by date, so let's register an empty data frame for now!"

python


import pandas as pd
import datetime


empty_user = dict(Section1=['', '', ''], Section2=['', '', ''], Section3=['', '', ''])
hour = ['9:00', '10:00', '11:00']
df = pd.DataFrame(data=empty_user, index=hour)

five_days_list = [(datetime.datetime(2020, 9, 1) + datetime.timedelta(days=1) * i) for i in range(5)]
dict_of_dataframe = {date_key : df for date_key in five_days_list}

The completed dictionary looks like this.

date_key : 2020-09-01 00:00:00
value :
      Section1 Section2 Section3
9:00                            
10:00                           
11:00                           
date_key : 2020-09-02 00:00:00
#Omitted below

About a week ago I had no doubts about this and moved on to the next task. However, once I started registering the aggregated data, nothing worked. In the one-month schedule, everything from the third day onward was garbled.

The reason is simple in hindsight: this construction reuses the same data frame object for every key. Printing id() for each value makes it obvious.

python


for date_key, dataframe in dict_of_dataframe.items():
    print(f"date_key : {date_key}")
    print(f"dataframe_id : {id(dataframe)}")

date_key : 2020-09-01 00:00:00
dataframe_id : 2124838088520
date_key : 2020-09-02 00:00:00
dataframe_id : 2124838088520
date_key : 2020-09-03 00:00:00
#Omitted below

The correct approach was to use the pandas.DataFrame copy() method.

python


dict_of_dataframe = {date_key : df.copy() for date_key in five_days_list}

date_key : 2020-09-01 00:00:00
dataframe_id : 2124838588936
date_key : 2020-09-02 00:00:00
dataframe_id : 2124838590152
date_key : 2020-09-03 00:00:00
#Omitted below

As the official pandas documentation describes, the DataFrame copy() method defaults to deep=True, that is, a deep copy. If you only want to use a data frame as a template, each key has to be given its own copy rather than the same object. When I noticed this, it felt like being hit over the head.
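To make the difference concrete, here is a minimal sketch that writes one booking into the first day of each dictionary and then looks at the second day. The names template_df, shared, and copied are my own for illustration and do not come from the original program.

```python
import datetime

import pandas as pd

hours = ['9:00', '10:00', '11:00']
template_df = pd.DataFrame({'Section1': ['', '', ''], 'Section2': ['', '', '']}, index=hours)
days = [datetime.datetime(2020, 9, 1) + datetime.timedelta(days=i) for i in range(3)]

# Every key refers to the single template object.
shared = {day: template_df for day in days}
# Every key gets its own independent copy (deep=True is the default).
copied = {day: template_df.copy() for day in days}

# Register a booking on the first day only.
shared[days[0]].loc['9:00', 'Section1'] = 'Mr. Yamada Takahashi'
copied[days[0]].loc['9:00', 'Section1'] = 'Mr. Yamada Takahashi'

# Look at the same cell on the second day.
shared_second_day = shared[days[1]].loc['9:00', 'Section1']
copied_second_day = copied[days[1]].loc['9:00', 'Section1']

print(repr(shared_second_day))  # 'Mr. Yamada Takahashi' (the booking leaked)
print(repr(copied_second_day))  # '' (still empty)
```

In the shared dictionary the booking shows up on every day because there is only one data frame behind all the keys; with copy() each day stays independent.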

As described later, this work was burdened with a fairly deep data structure and the multiply nested loops that came with it. At first I suspected only those, so I spent about two days struggling: printing intermediate results, reading the pandas documentation, writing out the loop structure on paper, and so on.

2.) for: else: is very convenient

I thought I had a reasonable grasp of basic Python syntax, but I first came across this one during development. Or perhaps I knew it and forgot. If you attach an else clause to a for statement, the else clause is executed only if the loop finishes without being exited by break.

There are probably countless uses, but I used it to log a message when there is no free section, as below.

python


client_salesman_dict = {datetime.datetime(2020, 9, 1, 9, 0) :  [('Mr. Yamada', 'Takahashi'), ('Mr. Yoshizawa', 'Ito')],
                        datetime.datetime(2020, 9, 1, 10, 0) : [('Mr. Sasaki', 'Momoyama')],
                        datetime.datetime(2020, 9, 1, 11, 0) : [('Mr. Yokota', 'Takahashi'), ('Fukuchi', 'large tree'), ('Mr. Nakayama', 'Ito'), ('Mr. Gonda', 'Ozawa')],}

section_list = ['Section1', 'Section2', 'Section3',]

for date_dt, client_salesman_tuples in client_salesman_dict.items():
    date_str = f"{date_dt.hour}:{date_dt.minute:02}"
    
    for client_salesman_tuple in client_salesman_tuples:
        client = client_salesman_tuple[0]
        salesman = client_salesman_tuple[1]
        
        for section in section_list:
            section_status = df.loc[date_str, section]
            print(f"client : {client}, salesman : {salesman}")
            print(f"time is {date_str}")
            print(f"section is {section}")
            print(f"section_status is {section_status}")
            if section_status:
                print(f"bool of section_status is {bool(section_status)}.")
                print("I will skip writing phase.")
                continue
            print(f"I have applied {client},{salesman} to {section}")
            df.loc[date_str, section] = f"{client} {salesman}"
            break
            
        else:
            print(f"There is no empty section for {client}, {salesman}. Please recheck schedule.")

client : Mr. Yamada, salesman : Takahashi
time is 9:00
section is Section1
section_status is 
I have applied Mr. Yamada,Takahashi to Section1
client : Mr. Yoshizawa, salesman : Ito
time is 9:00
section is Section1
section_status is Mr. Yamada Takahashi
bool of section_status is True.
I will skip writing phase.
#Omitted
There is no empty section for Mr. Gonda, Ozawa. Please recheck schedule.

Long ago, when I did similar processing in C or Java, I think I got by with a flag variable set inside the loop, but that seems unnecessary in Python. It is an unglamorous feature, but the for statement itself comes up so often that I would like to keep using it where appropriate.
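For comparison, here is a small sketch of the flag style next to the for/else style. The two functions and the sample rows are hypothetical and only illustrate the pattern; they are not taken from the actual program.

```python
def find_free_section_with_flag(row, section_names):
    """Flag style: remember whether a free section was found."""
    found = False
    free_section = None
    for section in section_names:
        if not row[section]:
            free_section = section
            found = True
            break
    if not found:
        free_section = None
    return free_section


def find_free_section_with_else(row, section_names):
    """for/else style: the else clause runs only when no break happened."""
    for section in section_names:
        if not row[section]:
            free_section = section
            break
    else:
        free_section = None
    return free_section


sections = ['Section1', 'Section2', 'Section3']
row_with_gap = {'Section1': 'Mr. Yamada Takahashi', 'Section2': '', 'Section3': ''}
full_row = {'Section1': 'a', 'Section2': 'b', 'Section3': 'c'}

print(find_free_section_with_else(row_with_gap, sections))  # Section2
print(find_free_section_with_else(full_row, sections))      # None
```

Note that the else clause also runs when the iterable is empty, because no break occurred in that case either.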

3.) A single leading _ on an attribute is a conventional private marker; two leading _ make it inaccessible by its usual name

I had known these existed for a long time, but I had never used them deliberately. Testing the behavior properly gives the following.

python


class TestClass:
    def __init__(self):
        self.hoge = 1
        self._fuga = 2
        self.__monge = 3
    
    def _foo1(self):
        print("_foo1 is called")
    
    def __foo2(self):
        print("__foo2 is called")

t = TestClass()

# Instance variables
print(t.hoge)
print(t._fuga)
# print(t.__monge)  # AttributeError: not accessible under this name
print(t._TestClass__monge)

# Methods
t._foo1()
# t.__foo2()  # AttributeError: not callable under this name
t._TestClass__foo2()

1
2
3
_foo1 is called
__foo2 is called

With two leading underscores, instance variables and methods can no longer be accessed by their usual names. However, they are not strictly private attributes as in Java; they can still be accessed in the following form.

python


instance._ClassName__attribute_name

Also, with a single leading underscore... you can call it too, and in the normal way at that. When I wondered, "Then what is it for?", I found the following article.

[Python] How to use underscore (_) (special attribute, dunders)

According to this:

-- With one underscore, the name merely signals that it is for internal use, and behavior does not change. The one exception is that such names are not imported when a module is imported with a wildcard (from module import *).
-- With two underscores, the name is mangled (name mangling), so it cannot be accessed under its original name. However, this is not intended to make the attribute private; it is used to avoid name conflicts between parent and child classes.

So my assumption that these were for making things private was wrong in the first place. This time I wrote the program without class inheritance, so a single underscore was enough.
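The point about parent-child name conflicts is easier to see in code. The following is a minimal sketch with hypothetical Parent and Child classes: both assign to self.__token, yet the two values do not overwrite each other because each name is mangled with its own class name.

```python
class Parent:
    def __init__(self):
        self.__token = 'parent'  # stored as _Parent__token

    def parent_token(self):
        return self.__token


class Child(Parent):
    def __init__(self):
        super().__init__()
        self.__token = 'child'  # stored as _Child__token, no clash with Parent

    def child_token(self):
        return self.__token


c = Child()
print(c.parent_token())  # parent
print(c.child_token())   # child
print(sorted(vars(c)))   # ['_Child__token', '_Parent__token']
```

If both classes had used a single underscore (self._token), the child's assignment would have replaced the parent's value.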

4.) docstring is a good culture

I vaguely knew that docstrings existed too, but this was the first time I actually used them. I referred to the following article when writing them.

[Python] Learn how to write a docstring to improve readability (NumPy style)

Since I developed this program for myself, I wrote them while wondering "Is this really necessary?", but in the end it was a good opportunity to think carefully about what each function takes, what it is for, and what it does. Until now my approach was ad hoc: start writing, run it once, then fix it. I think writing docstrings has improved that a little.
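As an illustration, here is roughly what a NumPy-style docstring looks like. The function format_time_label is a hypothetical helper I made up for this example; it wraps the time formatting used in the for/else example above.

```python
import datetime


def format_time_label(date_dt):
    """Convert a datetime into the row label used by the schedule data frame.

    Parameters
    ----------
    date_dt : datetime.datetime
        Start time of the appointment.

    Returns
    -------
    str
        Label such as ``'9:00'`` (hour without zero padding, two-digit minute).
    """
    return f"{date_dt.hour}:{date_dt.minute:02}"


print(format_time_label(datetime.datetime(2020, 9, 1, 9, 0)))  # 9:00
```

Writing the Parameters and Returns sections first forces a decision about the input and output types before any of the body is written.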

Way of thinking

The following are points I learned from experience, more "I think this way is better" than knowledge.

1.) Overly deep data structures are a bad idea

When I first started assembling the schedule, I stored the data in a dictionary like this:

python


from datetime import datetime

schedule_dict = {'1week': {datetime(2020, 9, 1) : {datetime(2020, 9, 1, 9, 0): [('Mr. Yamada', 'Terada'),('Mr. Yoshiki', 'Endo'),],
                                                    datetime(2020, 9, 1, 10, 0): [('Mr. Kudo', 'Yamashita'),],},
                            datetime(2020, 9, 2) : {datetime(2020, 9, 2, 10, 0): [('Mr. Tsurukawa', 'Honda'),],
                                                    datetime(2020, 9, 2, 11, 0): [('Mr. Endo', 'Aizawa'),],},
                            datetime(2020, 9, 3) : {datetime(2020, 9, 3, 9, 0): [('Mr. Shimoda', 'Terada'), ('Mr. Yoshikawa', 'Goda')],
                                                   }
                           },
                 '2week': ...,  # remaining weeks omitted
                }

It looks like this when accessing.

python


schedule_dict['2week'][datetime(2020, 9, 8)][datetime(2020, 9, 8, 10, 0)]

The data on the spreadsheet was laid out side by side week by week, so I followed that layout without much thought, but it was needlessly deep. Access is still tolerable, but when the dictionary is unpacked with for statements, the loop nesting gets deeper and deeper.

In the end I gave up partway through the work, dropped the week key, and made the structure one level shallower. Looking at it again now, I feel the midnight date key is not needed either. If the date is needed, it can be rebuilt from the year, month, and day attributes of the datetime.datetime key.
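A rough sketch of what the flatter layout could look like, assuming the week and midnight-date keys are both dropped and each time slot is keyed directly by its datetime. The names flat_schedule and by_day are my own, and this is not the structure the actual program ended up with.

```python
from collections import defaultdict
from datetime import datetime

# One level only: time slot -> list of (client, salesman) tuples.
flat_schedule = {
    datetime(2020, 9, 1, 9, 0): [('Mr. Yamada', 'Terada'), ('Mr. Yoshiki', 'Endo')],
    datetime(2020, 9, 1, 10, 0): [('Mr. Kudo', 'Yamashita')],
    datetime(2020, 9, 2, 10, 0): [('Mr. Tsurukawa', 'Honda')],
}

# Access needs a single key instead of three.
print(flat_schedule[datetime(2020, 9, 1, 10, 0)])  # [('Mr. Kudo', 'Yamashita')]

# The per-day grouping can be rebuilt on demand from the key itself.
by_day = defaultdict(list)
for slot_dt, client_salesman_tuples in flat_schedule.items():
    by_day[slot_dt.date()].extend(client_salesman_tuples)

print(len(by_day[datetime(2020, 9, 1).date()]))  # 3
```

The grouping that used to be baked into the data is now a three-line derivation, so it only costs attention in the places that actually need it.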

It is often said that needlessly deep loops should be avoided, but I learned that the needlessly deep data hierarchies that cause them should be avoided too. It feels as if the burden of each operation doubles every time the hierarchy gets one level deeper. It eats up my brain's memory badly...

2.) Give it a decent name even if it gets a little longer

At the beginning of development, I reused the names from my small partial tests as variable names as they were. Since the processing was separated by target, there were no name conflicts. I tried to keep names as short as possible, such as df for data frames, date for dates, and dct for dictionaries.

However, I hit a wall almost immediately. It was less a problem with the language than with my own working memory. As the processing grew more complicated, problems piled up: "Wait, what is in this dictionary?", "It's throwing an error; isn't this date a string?", "Index out of range??? Didn't this list have the number of elements I expected?"

Until now I had only written small sample programs to get a rough understanding of the specifications, so I had not cared much, but names should be chosen so that they can be understood wherever they end up. It sounds obvious when put into words, but I had never really thought about it.

I used date_dt for datetime dates and date_str for str dates. Beyond that, I named collections by what they store, such as client_salesman_tuples_list, and chose loop variable names with singular and plural in mind, as in for client_salesman_tuple in client_salesman_tuples_list:.

Thanks to this, the code is a little easier to understand than it was at first. Ideally I would avoid processing that confuses me in the first place, but where my skills fall short of that, I want to remember to compensate with careful naming.

3.) For object orientation, picture a government office or company

I'm not sure if this is really correct. However, about two weeks ago, when I started assembling the parts into a whole, I wrote about 100 flat lines and then came to my senses: this was going to turn into a mess if I carried on like this. After a little research I found the following site, which gave me a small revelation.

What is a namespace?

My feeling is that object orientation is the work of appropriately dividing up the namespace (self) by means of classes.

The phrase "appropriate division" left me at a loss for a while, but then an idea suddenly struck me: "Isn't this the same as a government office?"

Some time ago I had to go to city hall to get a resident card and wait for a while. While waiting I was idly browsing the office's website, and it was divided up very finely: citizens' affairs, commerce, industry and tourism, construction, city planning, and so on. At first glance some people may think "Aren't construction and city planning the same thing?", but in reality they are separated into such-and-such section of such-and-such department. I don't know the details of the work, but the responsibilities seem to be properly shared. This is the same as "dividing into classes".

Occasionally departments need to collaborate, but they do not hand over all of their information; that would drown everyone in paper and data. It makes more sense for the person in charge to bring only as much information as the job requires and discuss it. This is the same as "composition over inheritance".

Furthermore, "I", the person applying for a resident card, do not know what processing goes on inside. I can imagine that the clerk enters various things on a computer and fills out documents, but all I do is write the application form and pay the fee. The resident card is then issued without any problem. Perhaps even the people working at the counter do not know everything about what the completed documents and entered data are used for or how they are stored. This is the same as "information hiding".

With that image of a government office in my head, I somehow managed to divide the program into classes. Even as I write this, I am not sure my understanding is really accurate, but I did manage to get it written after a fashion, so I am calling it OK for now.
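To tie the analogy back to code, here is a very small sketch. RecordsSection and FrontDesk are hypothetical classes invented for this illustration; they are not part of the actual program and only show one way the three ideas could map onto Python.

```python
class RecordsSection:
    """Hypothetical back-office section that keeps the resident records."""

    def __init__(self):
        # Internal detail: callers are not expected to touch this directly.
        self._records = {'Yamada': 'Resident card for Yamada'}

    def issue(self, name):
        return self._records.get(name)


class FrontDesk:
    """Hypothetical counter that uses RecordsSection instead of inheriting from it."""

    def __init__(self, records_section):
        self._records_section = records_section

    def apply(self, name):
        # The applicant only sees this one method, not how records are stored.
        card = self._records_section.issue(name)
        if card is None:
            return f"No record found for {name}"
        return card


front_desk = FrontDesk(RecordsSection())
print(front_desk.apply('Yamada'))  # Resident card for Yamada
print(front_desk.apply('Sasaki'))  # No record found for Sasaki
```

Each class is one "section", FrontDesk holds a RecordsSection rather than inheriting from it, and the applicant only ever calls apply().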

Just as I was thinking about this, a timely article appeared on Qiita. I think it is a much clearer explanation, so please refer to it.

Object-oriented design prescription talked about by an object-oriented uncle with 25 years of object-oriented history

Things I want to learn going forward

Finally, I will write about something I had some knowledge of and ideas for, but in the end could not use well.

Using itertools to flatten nested for statements

I did not know this either, but the Python standard library includes a module called itertools. For an easy-to-understand explanation, refer to this post; what I focused on here is the product function in itertools.

As described in the official documentation (https://docs.python.org/3/library/itertools.html#itertools.product), product produces the same output as a typical nested loop.

python


import itertools

section_list = ['Section1', 'Section2', 'Section3']
time_list = ['9:00', '10:00', '11:00']

for section in section_list:
    for time in time_list:
        print(section, time)

print("------------")
        
for section, time in itertools.product(section_list, time_list):
    print(section, time)

The output is the same.

Section1 9:00
Section1 10:00
Section1 11:00
Section2 9:00
Section2 10:00
Section2 11:00
Section3 9:00
Section3 10:00
Section3 11:00
------------
Section1 9:00
Section1 10:00
Section1 11:00
Section2 9:00
Section2 10:00
Section2 11:00
Section3 9:00
Section3 10:00
Section3 11:00

You can also receive it as a tuple.

python


for tpl in itertools.product(section_list, time_list):
    print(tpl)

('Section1', '9:00')
('Section1', '10:00')
('Section1', '11:00')
('Section2', '9:00')
('Section2', '10:00')
('Section2', '11:00')
('Section3', '9:00')
('Section3', '10:00')
('Section3', '11:00')

I really wanted to use this to reduce the loop nesting, but I could not get it to work. As mentioned above, I learned the hard way that deeper loop nesting directly increases the mental burden, so in future I would like to make use of itertools alongside better thinking about data structures.
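As a possible starting point, here is a small sketch of how product might replace a two-level loop when searching for free slots. The booked dictionary is a hypothetical stand-in for the schedule data frame, and this is not how the actual program is written.

```python
import itertools

section_list = ['Section1', 'Section2', 'Section3']
time_list = ['9:00', '10:00', '11:00']

# Hypothetical bookings keyed by (time, section); missing keys mean the slot is free.
booked = {
    ('9:00', 'Section1'): 'Mr. Yamada Takahashi',
    ('9:00', 'Section2'): 'Mr. Yoshizawa Ito',
}

# A single flat loop over every (time, section) combination.
free_slots = [
    (time, section)
    for time, section in itertools.product(time_list, section_list)
    if (time, section) not in booked
]
print(len(free_slots))  # 7

# Combined with for/else: find the first free slot, or report that none exists.
for time, section in itertools.product(time_list, section_list):
    if (time, section) not in booked:
        first_free = (time, section)
        break
else:
    first_free = None
print(first_free)  # ('9:00', 'Section3')
```

One thing to keep in mind is that break inside a product loop leaves the whole flattened loop, whereas in a nested loop it only leaves the innermost one, so the two forms are not interchangeable when the outer loop is supposed to continue.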

Summary

A common piece of advice for programming beginners is "Why not just try making something?", and I now understand that this really is sound advice. It was about ten times harder than I imagined, but I feel I gained a lot.

If you have any opinions or advice, please leave them in the comments section.
