[PYTHON] [Recommended tagging for machine learning # 2] Extension of scraping script

Hi, this is Chogo again. It's a cold day today, a good day for programming in a warm home :)

Today's topic is scraping again, but first I'd like to explain the goal of this series. I want to build a system that suggests tags using machine learning with a Bayesian method: it learns from the articles I have already tagged, then checks new articles and suggests tags for them. I have too many things to learn, so I don't know how many articles it will take to reach that goal. I will go one step at a time.

OK, on to today's topic, which is still scraping. In article #1 I explained how to scrape articles from Hatena Blog. However, that script only worked for Hatena Blog, so I have to extend it to other web sites.

First, I'd like to show you the modified script.

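    # Python 3; needs the beautifulsoup4 and lxml packages
    import re
    import urllib.request

    from bs4 import BeautifulSoup

    # One row per site: [domain, tag name, attribute name, attribute value]
    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]

    # url holds the address of the article to scrape.
    # Find the index c of the first row whose domain appears in the URL.
    c = 0
    for domain in scraper:
        print(url, domain[0])
        if re.search(re.escape(domain[0]), url):
            break
        c += 1
    else:
        # No break happened, so no row matched
        raise ValueError("no scraping rule for " + url)

    response = urllib.request.urlopen(url)
    html = response.read()

    soup = BeautifulSoup(html, "lxml")
    print(soup.original_encoding)  # encoding BeautifulSoup detected
    tag = soup.find(scraper[c][1], {scraper[c][2]: scraper[c][3]})
    if tag is None:
        raise ValueError("article body not found in " + url)

    # Strip any remaining markup from each child node and join the text
    p = re.compile(r'<.*?>')
    text = ""
    for con in tag.contents:
        text += p.sub('', str(con))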

This script can scrape articles from Hatena Blog and Qiita. Below are the tags that wrap the article body on each site.

Hatena Blog:

    <div class="entry-content">
    CONTENTS to SCRAPE!
    </div>

Qiita:

    <div class="col-sm-9 itemsShowBody_articleColumn">
    <section class="markdownContent markdownContent-headingEnabled js-task-list-container clearfix position-relative js-task-list-enabled" id="item-xxx" itemprop="articleBody">
    CONTENTS to SCRAPE!
    </section>
    </div>

So for BeautifulSoup, I wrote it up like this. First, prepare the ingredients for the soup (the tag name, attribute name and attribute value for each site)...

    scraper = [ 
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]

Then, have the soup!

    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
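
To try this tag-and-attribute lookup without fetching a page or installing BeautifulSoup, here is a minimal sketch that uses Python's standard `html.parser` as a stand-in. It is not the script from this article: `RuleTextExtractor`, `extract_text` and the two HTML snippets are hypothetical names and data made up for illustration, and the sketch only mimics the step "find the element by tag and attribute, then keep its text".

```python
from html.parser import HTMLParser


class RuleTextExtractor(HTMLParser):
    """Collect the text inside the first element matching (tag, attr, value)."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0      # nesting depth of same-named tags inside the match
        self.done = False   # True once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Already inside the match: only track nested tags of the same name
            if tag == self.tag:
                self.depth += 1
            return
        if tag == self.tag:
            # Simplification (assumption): treat every attribute value as a
            # list of space-separated tokens, as is done for class
            tokens = (dict(attrs).get(self.attr) or "").split()
            if self.value in tokens:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_text(html, rule):
    """rule is [domain, tag, attribute, value], the same row layout as scraper."""
    parser = RuleTextExtractor(rule[1], rule[2], rule[3])
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Made-up fragments shaped like the two structures shown earlier
hatena_html = ('<body><div class="entry-content"><p>Hatena body</p></div>'
               '<div>footer</div></body>')
qiita_html = ('<div class="col-sm-9"><section class="markdownContent" '
              'itemprop="articleBody"><p>Qiita <b>body</b></p></section></div>')

hatena_text = extract_text(
    hatena_html, ["hatenablog.com", "div", "class", "entry-content"])
qiita_text = extract_text(
    qiita_html, ["qiita.com", "section", "itemprop", "articleBody"])
print(hatena_text)
print(qiita_text)
```

The same row of values drives both extractions, which is the point of keeping the per-site details in a table rather than in the code.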

Good. Now that the elements for the soup are kept per web site, I can extend the scraper to other sites just by adding their tag information to the list.
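
Supporting another site then comes down to adding one row. The following is only a sketch under assumptions: the `blog.example.com` row with its tag and attribute, the example URLs and the `find_rule` helper are all hypothetical. It also compares the URL's host name instead of searching the whole URL, so a domain that merely appears in the path or inside a longer host name does not match by accident.

```python
from urllib.parse import urlparse

scraper = [
    ["hatenablog.com", "div", "class", "entry-content"],
    ["qiita.com", "section", "itemprop", "articleBody"],
    # Hypothetical extra site, only to show how a new row is added
    ["blog.example.com", "article", "id", "main-text"],
]


def find_rule(url, rules):
    """Return the first row whose domain is the URL's host or a parent of it."""
    host = urlparse(url).hostname or ""
    for rule in rules:
        domain = rule[0]
        if host == domain or host.endswith("." + domain):
            return rule
    return None  # no row for this site


# Hypothetical URLs for illustration
hatena_rule = find_rule("https://example.hatenablog.com/entry/1", scraper)
new_site_rule = find_rule("https://blog.example.com/post/1", scraper)
unknown_rule = find_rule("https://example.org/?ref=qiita.com", scraper)
print(hatena_rule, new_site_rule, unknown_rule)
```

Returning `None` for an unknown site lets the caller decide whether to skip the article or stop, instead of indexing past the end of the list.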

That's all for today, but this series will continue.
