[PYTHON] [Recommended tagging for machine learning # 1] Scraping of Hatena blog articles

Hi, this is Bython Chogo. I need to learn English, so I'm trying to post articles in both English and Japanese :(

I'm now studying machine learning and practicing by writing test scripts for Bayesian filtering. My plan is to estimate the tags of web content after training on several posts and their tags. A sample Bayesian script is available from the Gihyo web page, which I'll introduce later. Before that, today's topic and problem is scraping the content of a post.

I found a good slide deck that describes what I'd like to say, but I've lost it ... orz. I will add it later. According to it, there are two ways to scrape the body content. One is to use the characteristic format of each site's content. I don't need header or footer data for learning words, because it is probably not useful for identifying the tag.

As an example, I'll scrape only the article body on Hatena Blog. The article sits between the tags below.

    <div class="entry-content">
    CONTENTS to SCRAPE!
    </div>

For this case, I wrote the code below.

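    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")  # html: the page source fetched beforehand
    tag = soup.find("div", {"class": "entry-content"})
    strip_tags = re.compile(r'<.*?>', re.DOTALL)  # compile once, outside the loop
    text = ""
    for con in tag.contents:
        # str(con) is the child's markup; remove anything that looks like a tag
        text += strip_tags.sub('', str(con))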

It doesn't look cool.. but it works :( Also, I have to prepare the format for every site I want to scrape, which is very tiring. So the second way is to use a learning method! But that looks difficult for me.
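As a possible tidier variant of the script above, here is a minimal sketch that lets BeautifulSoup drop the markup itself via `get_text()` instead of a regular expression. The function name `extract_entry_text`, the sample HTML, and the use of the built-in `html.parser` (so that lxml is not required) are my own assumptions for illustration, not part of the original script.

```python
from bs4 import BeautifulSoup


def extract_entry_text(html, css_class="entry-content"):
    """Return the text inside the first <div class=css_class>, or "" if absent."""
    # "html.parser" ships with Python; "lxml" also works if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_=css_class)
    if tag is None:
        # The page does not use the expected container.
        return ""
    # get_text() walks the subtree and returns only the text nodes.
    return tag.get_text(separator=" ", strip=True)


# Made-up page with a header and footer around the article body.
sample = """
<html><body>
<header>Blog title</header>
<div class="entry-content"><p>Hello <b>Hatena</b> Blog.</p><p>Second paragraph.</p></div>
<footer>2016</footer>
</body></html>
"""

print(extract_entry_text(sample))  # Hello Hatena Blog. Second paragraph.
```

Returning an empty string when the container is missing avoids the `AttributeError` that the loop over `tag.contents` would raise on a page with a different layout.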

To be continued...

Hi, my name is Mayor Umemura. Thank you. I write in English and Japanese because I am also learning English, so I hope you will overlook my poor English. I am currently studying machine learning, and as practice I am making an automatic tagging system for articles using a Bayesian filter. There are many new things to learn along the way, but a journey of a thousand ri begins with a single step, so I'm working on it steadily.

So, today's topic is extracting the articles used for training and classification, so-called scraping. It is a hot theme, and you can find articles about it in various places. I found a good article the other day, but I carelessly lost track of it. I would like to post it here later. That article introduced two methods to extract only the body of an article, ignoring the header and footer of the target page.

One is to register the enclosing tag of the article body for each site's format and extract from it. For example, in the case of Hatena Blog:

    <div class="entry-content">
    CONTENTS to SCRAPE!
    </div>

So, I wrote the following script to extract the contents from it.

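    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")  # html: the page source fetched beforehand
    tag = soup.find("div", {"class": "entry-content"})
    strip_tags = re.compile(r'<.*?>', re.DOTALL)  # compile once, outside the loop
    text = ""
    for con in tag.contents:
        # str(con) is the child's markup; remove anything that looks like a tag
        text += strip_tags.sub('', str(con))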

This is the best I can do, even though I think the code is probably ugly. With this method, you have to register the characteristic enclosing tag of each site. If that is too much trouble, I believe the article above said to use the second method, which relies on learning. However, that is difficult at my current skill level.
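Registering the enclosing tag per site could be sketched as a small lookup table keyed by host name. Everything here is hypothetical and for illustration only: the names `BODY_SELECTORS` and `selector_for`, the `hatenablog.com` host suffix, and the example URLs. Only the `div` / `entry-content` pair comes from this article, and other sites would need their own entries after inspecting their HTML.

```python
from urllib.parse import urlparse

# Hypothetical registry: host suffix -> (tag name, class) wrapping the article body.
# Only the Hatena Blog pair is taken from this article.
BODY_SELECTORS = {
    "hatenablog.com": ("div", "entry-content"),
}


def selector_for(url, registry=BODY_SELECTORS):
    """Return (tag, css_class) for the site hosting url, or None if unregistered."""
    host = urlparse(url).hostname or ""
    for suffix, selector in registry.items():
        # Match the registered domain itself or any of its subdomains.
        if host == suffix or host.endswith("." + suffix):
            return selector
    return None


print(selector_for("https://example.hatenablog.com/entry/2016/01/01/000000"))
print(selector_for("https://example.org/post/1"))
```

The returned pair can be passed straight to `soup.find(tag, {"class": css_class})`, and a `None` result marks a site whose format has not been registered yet.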

I would like to continue this as a series until the script is completed.
