[PYTHON] Scraping news of incidents on livedoor

I am interested in the incidents that happen in the world, so I decided to scrape the incident and accident news that is updated daily on livedoor NEWS. This is a memo of the workflow for my own reference. [livedoor NEWS Domestic Incidents / Accidents](https://news.livedoor.com/%E5%9B%BD%E5%86%85%E3%81%AE%E4%BA%8B%E4%BB%B6%E3%83%BB%E4%BA%8B%E6%95%85/topics/keyword/31673/)

1. Know what scraping is

Web scraping with Python ← I checked the basics of scraping on this site first.

2. Try it on the livedoor NEWS site

The site above used Yahoo's professional baseball batting averages as its example, so when I tried to reproduce the same steps on the livedoor NEWS site, a problem occurred: I made the request exactly as described there, but access was not permitted.

403 Forbidden
Forbidden
You don't have permission to access /Domestic incidents / accidents/topics/keyword/31673/
on this server.

I read the following site and solved it: [Python] 403 Forbidden: What to do when you don't have permission to access on this server. It depends on the site, but livedoor NEWS apparently refuses requests unless a User-Agent is set in the request headers. Fixed.
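The fix boils down to passing a browser-like User-Agent with the request. A minimal sketch, assuming a generic desktop-browser UA string is enough (the exact string below is illustrative, not something livedoor specifically requires):

```python
import requests

# Without a User-Agent header some sites (livedoor NEWS among them)
# answer 403 Forbidden, so we send a browser-like one explicitly.
# This UA string is an example value, not a required one.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/91.0.4472.124 Safari/537.36"
    )
}

def fetch(url):
    # headers= attaches the User-Agent to this request
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # raises on 403 and other HTTP errors
    return response.text
```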

3. I want to jump to the news link

Accessing the livedoor NEWS site and writing its contents to a file now works. The next thing I want to do is follow the links on the news list page to jump to each individual news page. One news block on the list page looks like the following, and the href attribute of the a tag is the link. How can I get it?

<li class="hasImg">
  <a href="https://news.livedoor.com/topics/detail/18794424/">
    <p class="articleListImg"><img src="https://sl.news.livedoor.com/a8affe16ad083d6f44918377f5748e09849ffbc0/small_light(p=S80)/https://image.news.livedoor.com/newsimage/stf/3/f/3f701_1675_218e1f3c1a51c13c80249b4bd8af0afe-m.jpg" onMouseDown="return false;" onSelectStart="return false;" oncontextmenu="return false;" galleryimg="no"></p>
    <div class="articleListBody">
      <h3 class="articleListTtl">Arrested 3 people on suspicion of stealing sustainability benefits. Is a total of 400 million yen illegally received?</h3>
      <p class="articleListSummary">In addition, about 400 freeters and students are suspected of being involved in fraudulent applications.</p>
      <time datetime="2020-08-26 16:57:08" class="articleListDate">16:57</time>
    </div>
  </a>
</li>

I solved it by referring to the following site: [Python] Get href value with Beautiful Soup

In short:

# First find the a tag,
link = soup.find('a')
# then read its href attribute to get the link!
link.get('href')

It was surprisingly easy.
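Applied to the list-page block quoted above, the whole round trip looks like this. A self-contained sketch; the HTML is trimmed to the parts that matter for link extraction:

```python
from bs4 import BeautifulSoup

# A trimmed copy of one <li> block from the news list page
html = '''
<li class="hasImg">
  <a href="https://news.livedoor.com/topics/detail/18794424/">
    <div class="articleListBody">
      <h3 class="articleListTtl">Arrested 3 people on suspicion of stealing sustainability benefits.</h3>
      <time datetime="2020-08-26 16:57:08" class="articleListDate">16:57</time>
    </div>
  </a>
</li>
'''

soup = BeautifulSoup(html, "html.parser")
link = soup.find('a')       # the first (and only) a tag in this block
href = link.get('href')     # the value of its href attribute
print(href)
```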

4. Completion

The following is the completed program. It can be used as is. However, since the news list continues onto pages 2 and 3, I would like to improve it so that it can also fetch the news from the past couple of days. It also takes a few minutes to fetch all the articles, so speeding that up would be another improvement.

import requests
import time
from bs4 import BeautifulSoup

def Get_soup(url):
    headers = {
        "User-Agent": "Mozilla/... Chrome/... Safari/..."
    }

    response = requests.get(url, headers=headers)

    # Guess the encoding so Japanese text is decoded correctly
    response.encoding = response.apparent_encoding

    return BeautifulSoup(response.text, "html.parser")

def Get_article(url, f):
    soup = Get_soup(url)

    title = soup.find('h1', class_="articleTtl")
    f.write(title.text)

    body = soup.find('div', class_="articleBody")
    f.write(body.text)

    # Separator between articles
    f.write('@@@@@\n')

    # Wait a second so as not to hammer the server
    time.sleep(1)


def go_through(url, f):
    # The topics page links to the full article in its footer,
    # so follow that link before saving the article
    soup = Get_soup(url)

    footer = soup.find('div', class_="articleFooter")
    link = footer.find('a')
    url_ = link.get('href')

    Get_article(url_, f)


def main():
    url = "https://news.livedoor.com/%E5%9B%BD%E5%86%85%E3%81%AE%E4%BA%8B%E4%BB%B6%E3%83%BB%E4%BA%8B%E6%95%85/topics/keyword/31673/"

    soup = Get_soup(url)

    topic_date_ = soup.find('h2', class_="keywordDate").text
    title = 'CaseNews_' + topic_date_ + '.txt'
    f = open(title, 'w', encoding='utf-8')

    articles_ = soup.find('ul', class_="articleList")
    articles = articles_.find_all('li')

    art_len = len(articles)
    dl_count = 1
    for art_i in articles:
        link_i_ = art_i.find('a')
        url_i_ = link_i_.get('href')
        # go_through: follow the topics page through to the linked article
        go_through(url_i_, f)

        print(dl_count, end='/')
        print(art_len, end=' ')
        print('downloaded.')
        dl_count += 1

    f.close()

main()
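For the pagination improvement mentioned above, one approach is to build the list-page URLs up front and loop over them. A sketch under an assumption: it guesses that pages 2 and 3 are reachable by appending a `?p=N` query parameter, which should be checked against the real site before relying on it.

```python
# Hypothetical pagination helper; the ?p=N URL pattern is a guess,
# not something confirmed from the livedoor NEWS site.
BASE_URL = ("https://news.livedoor.com/"
            "%E5%9B%BD%E5%86%85%E3%81%AE%E4%BA%8B%E4%BB%B6%E3%83%BB"
            "%E4%BA%8B%E6%95%85/topics/keyword/31673/")

def list_page_urls(pages=3):
    # Page 1 is the base URL itself; later pages append the assumed ?p=N
    urls = [BASE_URL]
    for n in range(2, pages + 1):
        urls.append(f"{BASE_URL}?p={n}")
    return urls
```

Each URL returned here could then be fed through the same `Get_soup` / `go_through` loop as page 1.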
