Last time, in the article "Let's get notified of the weather in your favorite area from Yahoo Weather on LINE!", I explained how to get the URL of each region in the country from Yahoo Weather.
This time, in Part 2, I will briefly explain how to go from those region URLs to the URLs of the detailed areas (cities, wards, towns, and villages).
First, here are the URLs I got last time.
The page you get from the first of these URLs looks like the following: you can see at a glance the weather for the main municipalities that belong to the region.
If you click "Wakkanai" on this screen, the detailed weather and probability of precipitation for Wakkanai are displayed.
From here, there are two things to do:
・"Getting the names and URLs of the regions and municipalities"
・"Getting weather information from each municipality's URL"
Now let me explain how to actually get that information, starting with "getting the names and URLs of the regions and municipalities".
The program is as follows.
with open("yahooChiku.csv", "r", encoding="utf-8") as readChikuNum:
reader = csv.reader(readChikuNum)
with open("shosaiChiku.csv", "w", encoding="cp932", newline="") as schiku:
writer = csv.writer(schiku)
column = ["Rural", "Municipality", "URL"]
writer.writerow(column)
for target_url in reader:
res = requests.get(target_url[0])
soup = BeautifulSoup(res.text, 'lxml')
chiku = re.search(r".*of", str(soup.find("title").text)).group().strip("of")
elems = soup.find_all("a")
chikuList, shosaiNumList = [], []
chikuNameList = [chikuName.get_text() for chikuName in soup.find_all(class_= "name")]
for e in elems:
if re.search(r'data-ylk="slk:prefctr', str(e)):
if re.search(r'"https://.*html"', str(e)):
row = re.search(r'"https://.*html"', str(e)).group().strip('"')
chikuList.append(chiku)
shosaiNumList.append(row)
for p, e, c in zip(chikuList, chikuNameList, shosaiNumList):
writeList = [p, e, c]
writer.writerow(writeList)
The first with open reads the file of region URLs, and the second opens shosaiChiku.csv, the file where the region name, municipality name, and URL of each will be written. Next, the html of each region page is stored in soup, and the necessary pieces are pulled out in turn. The region name is extracted from the page title with a regular expression, trimming it down to just the part before the 「の」, and assigned to chiku. elems holds every a tag of the html, collected with find_all, so we can get the URLs of the municipality pages.
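As a quick check of that title handling, here is the regular expression on its own; the sample title is illustrative, assuming the usual "<region>の天気" pattern of the Yahoo Weather pages:

import re

# Illustrative title; the real one comes from soup.find("title").text
title = "道北の天気 - Yahoo!天気・災害"
chiku = re.search(r".*の", title).group().strip("の")
print(chiku)  # -> 道北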
This is where the variables written to the file come into play. chikuNameList collects, with a list comprehension, the text of every element in the region's html whose class is "name"; conveniently, all the municipality names are in "name" elements. As for the for loop: the municipality URLs live in anchors carrying the data-ylk="slk:prefctr..." attribute, so that is the condition of the first if statement. Since those anchors contain data other than the municipality URL, the second regular expression search keeps only what matches the URL format. The region name is then appended to chikuList and the municipality URL to shosaiNumList.
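To see the two if statements at work, here is a single anchor run through them; the href, the data-ylk value, and the municipality name are all illustrative, shaped after the links on the region pages:

import re
from bs4 import BeautifulSoup

# A hypothetical municipality link (all values are illustrative)
html = ('<a href="https://weather.yahoo.co.jp/weather/jp/1a/1100/1214.html" '
        'data-ylk="slk:prefctr;pos:1"><span class="name">稚内</span></a>')
e = BeautifulSoup(html, "lxml").find("a")
if re.search(r'data-ylk="slk:prefctr', str(e)):
    if re.search(r'"https://.*html"', str(e)):
        # Prints the URL with its surrounding quotes stripped
        print(re.search(r'"https://.*html"', str(e)).group().strip('"'))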
The last for statement writes the region name, municipality name, and URL stored in the lists to "shosaiChiku.csv" line by line.
And the resulting file looks like this:
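The original screenshot is not reproduced here, but assuming the columns written above, the file would start something like this (the region, municipality, and number parts of these rows are illustrative):

Rural,Municipality,URL
道北,稚内,https://weather.yahoo.co.jp/weather/jp/1a/1100/1214.html
道北,留萌,https://weather.yahoo.co.jp/weather/jp/1a/1200/1300.html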
You could access each municipality's URL as is and pull out the data you want with regular expressions and scraping, but I noticed that there are also RSS feeds, so I decided to record those as well.
import csv
import re

import pandas as pd

df = pd.read_csv("shosaiChiku.csv", encoding="cp932")
with open("dataBase.csv", "w", encoding="cp932", newline="") as DBcsv:
    writer = csv.writer(DBcsv)
    # Write the header
    columns = ["Rural", "Municipality", "URL", "RSS"]
    writer.writerow(columns)
    # Write the data (region, municipality, URL, RSS) line by line
    for place, city, url in zip(df["Rural"], df["Municipality"], df["URL"]):
        row = [place, city, url]
        rssURL = "https://rss-weather.yahoo.co.jp/rss/days/"
        # Take "<number>.html" from the URL and reshape it into "<number>.xml"
        url_pattern = re.search(r"\d*\.html", url).group()
        url_pattern = url_pattern.replace("html", "xml")
        rssURL = rssURL + url_pattern
        row.append(rssURL)
        writer.writerow(row)
Almost everything here is the same as the previous code. Since shosaiChiku.csv already contains most of the data, all that is added is the RSS URL. (I changed my mind partway and tried pandas's read_csv this time.)
The base of every RSS URL is the string "https://rss-weather.yahoo.co.jp/rss/days/", stored in rssURL.
What the program does is, first, read shosaiChiku.csv line by line and get the region, municipality, and URL.
I had noticed that the part after "days/" in an RSS URL is the same as the number part of the corresponding municipality URL.
So, next, extract just that number part from the municipality URL with a regular expression.
Also, the RSS URLs end in ".xml" rather than ".html", so the extension is converted.
Now that the RSS URL is known, append it to the row and write it out; the reshaping step is shown on its own below.
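Here is that reshaping step in isolation, with a made-up municipality URL (the number part is illustrative):

import re

url = "https://weather.yahoo.co.jp/weather/jp/1a/1100/1214.html"  # illustrative
url_pattern = re.search(r"\d*\.html", url).group()  # -> "1214.html"
rssURL = "https://rss-weather.yahoo.co.jp/rss/days/" + url_pattern.replace("html", "xml")
print(rssURL)  # -> https://rss-weather.yahoo.co.jp/rss/days/1214.xml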
Here is the resulting file. It is hard to read, but since you never open it directly, that is fine: the data needed to do what we want is now in place. (When I have time, I plan to move it into sqlite so it behaves more like a database.)
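As a minimal sketch of that sqlite idea, the finished CSV could be loaded into a table like this; the database file name, the table name, and the municipality used in the query are my own choices, not something from the article:

import sqlite3

import pandas as pd

# Load dataBase.csv into an sqlite table (file and table names are mine)
df = pd.read_csv("dataBase.csv", encoding="cp932")
with sqlite3.connect("weather.db") as conn:
    df.to_sql("chiku", conn, if_exists="replace", index=False)
    # Example query: look up the RSS URL of one municipality ("稚内" is illustrative)
    for record in conn.execute("SELECT RSS FROM chiku WHERE Municipality = ?", ("稚内",)):
        print(record[0])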
I have written quite a lot, and this has gotten long, so of the two tasks I will stop here, at "getting the names and URLs of the regions and municipalities". In the next update, I hope to explain how to get the weather information and send it on LINE.
See you next time.