[PYTHON] Making an "RPA tool" because I have free time #4: Selenium edition, getting text

Introduction

Hello everyone. I think I managed to post early this time. This is enp. I want a love life. This article is a progress report for anyone trying to make an "RPA tool". Please note that it does not explain how to build RPA or RPA tools. (I may include a little source code, depending on my mood.) This time is Part.2 of getting along with selenium. What I did was "convert an address to a zip code" and "convert a zip code to an address". It will be a long article again, but I would appreciate it if you stuck with me.

I just want to get the text!!

Yes. The title says it all. I want to get the text written on a blog or on Wikipedia. That's all this article is about. __Then why are you trying to get a zip code from an address?__ you may ask. That's because, when I considered (with plenty of bias and arbitrariness) what information on the Internet is needed for work, the answer was zip codes. Depending on your job, I'm sure there is other information you need from the Internet. However, I figured that __you end up looking up zip codes no matter what job you do.__ That's just my bias. __There is also the simple reason that there is a lot of text to handle.__ __Then why are you trying to get an address from a zip code?__ You only need to find the zip code from the address. That's true. However, it may be inconvenient without the reverse function. __Also, I just couldn't feel settled without the reverse function.__ So, this time, I implemented two functions: __"conversion from address to zip code"__ and __"conversion from zip code to address"__.

(To be clear, if you look for one, I think there is a library that handles addresses and zip codes. However, this implementation is for learning selenium, so I will not do things like looking addresses up with a library. I will implement it with selenium.)

First, let's learn the site's specifications

If you don't know the site's specifications, you can't write the implementation's specifications. So let's learn how the zip code search works. First, searching for an address by zip code. (Screenshot: 郵便番号検索.png) Above is the search bar of the zip code search. You put the zip code in the text box, press the button, and you're done. Easy to understand. So what happens after the search? (Screenshot: 郵便番号検索後.png) The result is displayed like this. It looks like the address can be obtained by extracting it from the table.

Next is searching for a zip code from an address. (Screenshot: 住所検索.png) What worries me here is the prefecture selection. Implementing that selection with selenium seems a bit daunting. However, Japan Post is very kind. __You can also search by entering the whole address in the field for the city, ward, town or village.__ Thank you. I'll consider sending New Year's cards this year. Not that I have anyone to send them to. After the search, it looks like this. (Screenshot: 住所検索後.png) The zip code is not listed. What if I click the address in the town area column? (Screenshot: ボタンを押した後.png) It looks like the above. The zip code finally appears here. Converting an address to a zip code takes one more step.

Those are the specifications of the site.

(I will implement this on the premise that there are no other candidates after the search. If you search by zip code, the corresponding address should be narrowed down to one. Also, with some exceptions, an address search should be narrowed down to one as long as the search is not ambiguous. An ambiguous search is performed on the premise that the user picks a result, which I don't consider "automated", so I decided not to support it.)

Limits of element acquisition

Now, let's implement it right away! ...is what I would like to say, but there is a problem. Until now, we have used __an element's attribute value to enter characters in a text box or click a button.__ (Elements are things like div and input; attributes are things like class and id.) But __what if the same attribute value is used by multiple elements?__ For example, when the class name sample is used on various divs. In that case, you cannot pinpoint exactly where you want to type or click. __When you want to identify an element by attribute, you use the id or name, but there is no guarantee that an id or name is set. There may be no attributes at all. You can also specify an element by its tag name, but web pages use div elements all over the place, so that is not suitable for identifying a single one.__ __Then what should I do!!__ __In that case, use XPath.__ __XPath? What's that? A band?__ you might think. You may be imagining X JAPAN. So, in the next section, I will briefly explain XPath.

What is XPath?

Simply put, XPath is __the address of an element__. Let me explain with a small sample.

<html>
  <head>
    <title>Sample HTML</title>
  </head>
  <body>
    <h1>HTML sample</h1>
    <p>This is HTML Sample.</p>
    <p>This is HTML Sample.</p>
  </body>
</html>

The above is a sample HTML document. It really insists that it is a sample. I'm not explaining HTML here, so I'll skip what kind of HTML it is. __The important thing is that no attributes are used and several elements are used.__ Now, a question for you. __Doesn't each <○○> </○○> pair look like a single block?__ __Seen that way, it feels like there is a body inside the html. That feeling is XPath.__ If I say "the second p in the body in the html", you can probably tell that it means the 8th line of the sample. The same can be done with selenium. __Then how do you write it?__ If you can't write it down as text, you can't write it in a program. It is written as follows.

/html/body/p[2]

It says the same thing as before: "the second p in the body in the html". However, writing everything out from html every time is tedious. It may be feasible when the HTML is as short as the sample, but with long HTML it gets confusing. Therefore, you can use "//" to omit the XPath up to the part you need. For example, //p[2] leaves out the /html/body part. __That's it for the explanation of XPath.__ If you want to know more, please look it up yourself.
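To try the idea without opening a browser, here is a minimal sketch that uses Python's standard xml.etree.ElementTree instead of selenium. ElementTree only understands a small subset of XPath, and its paths are written relative to the root element (html here), so /html/body/p[2] becomes body/p[2] and the "//" shortcut becomes .//p[2]. The paragraph texts are changed from the sample (an assumption made for illustration) so that the two p elements can be told apart.

```python
import xml.etree.ElementTree as ET

# The sample page, with distinct paragraph texts (changed for illustration)
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample HTML</title>
  </head>
  <body>
    <h1>HTML sample</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE_HTML)  # root is the <html> element

# Equivalent of /html/body/p[2], written relative to <html>
full_path = root.find("body/p[2]")

# Equivalent of //p[2]: skip the leading path and search all descendants
short_path = root.find(".//p[2]")

print(full_path.text)
print(short_path.text)
```

Both lookups land on the same element, which is the point of "//": the path in front of the part you care about can be left out.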

Implementation!

Now let's implement it. The program is as follows.

# Library imports
import re
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Enter a zip code or an address
postHint = input("Zip code or address:")

# Start the driver (Selenium 4 passes the driver path through Service)
chrome = webdriver.Chrome(service=Service(r".\driver\chromedriver.exe"))

# Go to Japan Post's zip code search page
chrome.get("https://www.post.japanpost.jp/zipcode/")

# Branch on "zip code to address" or "address to zip code"
while True:
    if re.fullmatch(r"((\d{3})-(\d{4}))|(\d+)", postHint):  # Zip code to address
        # Remove the hyphen if there is one
        postNumber = postHint.replace("-", "")

        # Search for the address by zip code
        textarea = chrome.find_element(By.NAME, "zip")  # Text box for the zip code
        textarea.send_keys(postNumber)  # Enter the zip code
        textarea.send_keys(Keys.ENTER)  # Submit with Enter

        # Get the address from the result table
        ken = chrome.find_element(By.XPATH, "//table[@class='prefList sp-b10']/tbody/tr[2]/td[2]/small").get_attribute("textContent")
        si = chrome.find_element(By.XPATH, "//table[@class='prefList sp-b10']/tbody/tr[2]/td[3]/small").get_attribute("textContent")
        machi = chrome.find_element(By.XPATH, "//table[@class='prefList sp-b10']/tbody/tr[2]/td[4]/div/p/small/a").get_attribute("textContent")

        # Combine prefecture, city and town into one address
        address = ken + si + machi

        # Output the result
        print("search results:" + address)

        # Exit the loop
        break

    elif re.fullmatch(r".+?県.+?\d*?-?\d*?-?\d*?-?\d*?-?\d*?", postHint):  # Address to zip code (県 = "prefecture")
        # Delete the number part of the address
        if re.search(r"\d", postHint):
            postAddress = re.sub(r"\d+?-?\d*?-?\d*?-?\d*?-?\d*?", "", postHint)
        else:
            postAddress = postHint

        # Search for the zip code by address
        textarea = chrome.find_element(By.NAME, "addr")  # Text box for the address
        textarea.send_keys(postAddress)
        textarea.send_keys(Keys.ENTER)

        # Wait for 1 second (because browser processing cannot keep up)
        time.sleep(1)

        # Open the first candidate and read the zip code from its detail page
        button = chrome.find_element(By.XPATH, "//table[@class='prefList sp-b10']/tbody/tr[2]/td[2]/div/p/a")  # Get the link
        button.click()
        addressNumber = chrome.find_element(By.XPATH, "//table[@class='zip-detail']/tbody/tr[2]/td[1]/span").get_attribute("textContent")

        # Output the result with all whitespace removed
        print("search results:" + re.sub(r"\s+", "", addressNumber))

        # Exit the loop
        break

    else:
        # Neither a zip code nor an address: ask again
        # (no repr() here, its added quotes would keep the input from matching)
        postHint = input("Zip code or address:")

# Close the browser and stop the driver
chrome.quit()

I've disabled syntax highlighting, so it may look a little ugly. When I enabled it, the code was even harder to read, so I turned it off. There are better ways to wait than sleep, but this time I used sleep. No particular reason; it isn't the main topic. What the program does is simple: search, then get the information. That's all! Simple! However, this program has two fatal drawbacks. __It doesn't handle errors in the first place, so strictly speaking there are more than two.__

Now, __the first fatal drawback is a bug that occurs when an address search returns several candidates__. Candidates may appear even if the whole address is entered correctly. In that case, the correct zip code may not be output. __The reason is that the XPath always picks the first candidate.__ You can call it a bug born from assuming that there is never anything other than the first candidate. One solution is to compare the address of each candidate with the address entered by the user.
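The comparison idea can be sketched in plain Python, separately from selenium. The function name pick_candidate and the (address, zip code) tuple shape are hypothetical, and the candidates are made-up placeholders; in the real tool the candidate rows would first have to be read out of the result table.

```python
def pick_candidate(user_address, candidates):
    """Return the zip code of the candidate that best matches the input.

    candidates is a list of (address, zip_code) tuples (a hypothetical shape).
    Returns None when no candidate address is a prefix of the input.
    """
    best_zip = None
    best_length = 0
    for address, zip_code in candidates:
        # The user may have typed house numbers after the town name,
        # so a candidate matches when the input starts with it.
        # Among matching candidates, the longest (most specific) one wins.
        if user_address.startswith(address) and len(address) > best_length:
            best_zip = zip_code
            best_length = len(address)
    return best_zip


# Made-up candidates for illustration only (not real postal data)
candidates = [
    ("A県B市C町", "000-0001"),
    ("A県B市C町D", "000-0002"),
]
print(pick_candidate("A県B市C町D1-2-3", candidates))
```

Preferring the longest matching candidate is just one possible rule; it keeps a short town name from winning over a longer, more specific one.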

__The second fatal drawback is the sloppy regular expression for the address search.__ __The regular expression in question is ".+?県.+?\d*?-?\d*?-?\d*?-?\d*?-?\d*?" (県 is the character for "prefecture").__ Garbled text? you might think for a moment. I won't explain regular expressions here, but __it means "anything is OK as long as some characters follow ○○ prefecture"__. __In other words, it considers an input like "Kumamoto prefecture, Sakura" to be valid. In the extreme, even "○ prefecture?" is OK.__ __On the other hand, it does not search for anything that is not written as "○○県", such as Tokyo (東京都) or Kyoto (京都府).__ It's beyond sloppy. Of course, if you search and nothing is found, an error is thrown and the program ends. __The regular expression is sloppy because the way an address is written differs from prefecture to prefecture.__ There are patterns like ○○ prefecture ○○ city ○○, ○○ prefecture ○○ city ○○ ward ○○, ○○ prefecture ○○ district ○○, and so on. It is impossible to research and support all of them. It might be doable with some kind of list. However, at the moment there is no way other than looking them up one by one, so the regular expression ended up sloppy. __Also, I simply forgot about the problem of Tokyo not being searchable.__ I'm sorry.
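One possible way to stop rejecting Tokyo and Kyoto is to spell out the four prefectures whose names do not end in 県 and let every other prefecture fall through to a generic pattern. This is only a sketch: PREFECTURE_PATTERN and split_prefecture are hypothetical names, and the pattern does not check that the prefecture actually exists or that the rest of the address is sensible.

```python
import re

# 東京都 (Tokyo), 北海道 (Hokkaido), 大阪府 (Osaka) and 京都府 (Kyoto) are listed
# explicitly; every other prefecture is two or three characters followed by 県.
PREFECTURE_PATTERN = re.compile(r"(東京都|北海道|大阪府|京都府|.{2,3}県)(.+)")


def split_prefecture(address):
    """Split an address into (prefecture, rest), or return None if it does not fit."""
    match = PREFECTURE_PATTERN.fullmatch(address)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_prefecture("京都府京都市中京区"))
print(split_prefecture("熊本県熊本市中央区"))
print(split_prefecture("さくら"))
```

Listing 京都府 by name rather than matching "anything ending in 都, 道, 府 or 県" avoids cutting 京都府 off after 京都.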

What you need to do to implement RPA tools

By the way, I would like to list the implementation-related points based on my experience so far. Here they are as a bulleted list.

- __There is a limit to getting elements if XPath cannot be used__
- I need to explain what XPath is for people who can't program
- For people who can't program, I need to devise an easier way to write XPath
- __It may not work well without standby processing__
- I felt that the range of RPA would be wider if conditional branching and repetition were possible
- For repetition, it is convenient if break and the like can be used
- Regular expressions should be usable
- I need to explain what regular expressions are for people who can't program
- For people who can't program, I need to devise an easier way to write regular expressions
- __It is better to use get_attribute("textContent") to get the text__
- I also want a string concatenation function
- __If errors are not handled reasonably well, it will break right away__

That's all I can think of right now. As for get_attribute("textContent"): it can also get text that is not displayed in the browser, so I think it is less likely to throw an error.
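On the standby-processing point: a fixed sleep either waits longer than needed or not long enough. Selenium has its own explicit-wait helper, WebDriverWait, for this. The plain-Python sketch below only illustrates the underlying idea of polling until a condition holds; wait_until and its parameters are hypothetical names, not part of selenium, and table_is_ready is a stand-in for a real check against the page.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        time.sleep(interval)


# Stand-in for "the result table has appeared": becomes true on the third check
checks = []


def table_is_ready():
    checks.append(1)
    return len(checks) >= 3


print(wait_until(table_is_ready, timeout=2.0, interval=0.01))
```

In the tool, the condition could be a function that looks the element up and returns it once it exists, so the program continues as soon as the page is ready instead of always waiting a full second.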

Finally

Thank you for reading to the end. I think I got along well with selenium this time. However, there is one last thing I want to do with selenium, so the selenium edition will continue a little longer. I think the next article will be "Let's get along with selenium Part.3", so I hope you'll join me again. That's all from enp.
