[PYTHON] [Recommended tagging for machine learning # 2.5] Modification of scraping script

<ENGLISH>

Hello, I hope you are having a good day. A happy weekend should be a happy coding day :smile:

OK, today I will not move on to the next step; instead I'd like to modify the previous script. Here is the script from #2:

import re
import urllib2
from bs4 import BeautifulSoup

scraper = [ 
        ["hatenablog.com","div","class","entry-content"],
        ["qiita.com","section","itemprop", "articleBody"]
        ]
c = 0
for domain in scraper:
    print url, domain[0]
    if re.search( domain[0], url):
        break
    c += 1

response = urllib2.urlopen(url)
html = response.read()

soup = BeautifulSoup( html, "lxml" )
soup.original_encoding
tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
text = ""
for con in tag.contents:
    p = re.compile(r'<.*?>')
    text += p.sub('', con.encode('utf8'))

Yes, it works, but I want to (1) use Beautiful Soup instead of a regular expression to strip the tags, and (2) use a hash list (a Python dict) instead of counting inside the for loop.

(1) BeautifulSoup

soup = BeautifulSoup( html, "lxml" )
soup.original_encoding
tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
text = ""
for con in tag.contents:
    p = re.compile(r'<.*?>')
    text += p.sub('', con.encode('utf8'))

Regular expressions are a powerful tool, but I need to learn Beautiful Soup better. Beautiful Soup uses its own type for strings (NavigableString), and the documentation explains how to use it. I modified the code as below.

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    soup2 = BeautifulSoup(tag.encode('utf8'), "lxml")
    print "".join([string.encode('utf8') for string in soup2.strings])

Looks smarter? :satisfied: The tag is parsed into a second soup just to get its strings. Which do you like?
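As a hedged aside, the second parse may not be needed: a tag returned by `find()` exposes `.strings` and `get_text()` itself. The sketch below is written for Python 3 (the article's script is Python 2), uses a made-up HTML sample, and uses the built-in `"html.parser"` instead of `"lxml"` so it runs without extra dependencies. The sample markup and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for a downloaded article.
html = (
    '<html><body>'
    '<div class="entry-content"><p>Hello <b>scraping</b> world.</p>'
    '<p>Second paragraph.</p></div>'
    '<div class="sidebar">ignore me</div>'
    '</body></html>'
)

# "html.parser" ships with Python; "lxml" (as in the article) also works if installed.
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", {"class": "entry-content"})

# .strings walks every text node under the tag, so no second soup is required.
text = "".join(tag.strings)

# get_text() does the same concatenation in a single call.
text_via_get_text = tag.get_text()

print(text)
```

Both variables hold the text of the article body only; the sidebar is excluded because it lies outside the tag that was found.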

(2) Hash list for the splitter. Look at this part!

scraper = [ 
        ["hatenablog.com","div","class","entry-content"],
        ["qiita.com","section","itemprop", "articleBody"]
        ]
c = 0
for domain in scraper:
    print url, domain[0]
    if re.search( domain[0], url):
        break
    c += 1

To pick the splitter strings (tag, attribute, value) for each web site, I used c as a counter that is incremented inside the loop. That's not cool, so I modified it as below.

    scraper = [ 
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    numHash = {}
    for i in range(len(scraper)):
        numHash[scraper[i][0]] = i 
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            c = numHash[domain[0]]
            break

Yes, it has become longer, but I think it's much better than the previous version, isn't it?

Great. Next time I hope I can proceed to the next step, which will be getting the elements for learning.
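Another option, sketched here as an alternative rather than as the author's method: since `numHash` only maps a domain back to a list index, the rule itself can be stored in the dict, keyed by domain, so no index is needed at all. The names `SCRAPER_RULES` and `find_rule` are hypothetical, the example URLs are made up, and the code is Python 3 (`urllib.parse`), unlike the Python 2 script above.

```python
from urllib.parse import urlparse

# Same rules as the article's `scraper` list, keyed by domain.
SCRAPER_RULES = {
    "hatenablog.com": ("div", "class", "entry-content"),
    "qiita.com": ("section", "itemprop", "articleBody"),
}


def find_rule(url):
    """Return (tag_name, attr_name, attr_value) for the URL's site, or None."""
    host = urlparse(url).hostname or ""
    for domain, rule in SCRAPER_RULES.items():
        # Match the domain itself or any subdomain (e.g. example.hatenablog.com).
        if host == domain or host.endswith("." + domain):
            return rule
    return None


rule = find_rule("https://example.hatenablog.com/entry/2016/01/01/000000")
print(rule)
```

The returned tuple can be passed on as `soup.find(rule[0], {rule[1]: rule[2]})`. Comparing host names instead of running `re.search` over the whole URL also avoids the `.` in the domain matching any character.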


Hello. It's the weekend, so let's do some coding and make it a good one. Before proceeding, today I would like to modify the script I wrote in #2. Here it is.

import re
import urllib2
from bs4 import BeautifulSoup

scraper = [ 
        ["hatenablog.com","div","class","entry-content"],
        ["qiita.com","section","itemprop", "articleBody"]
        ]
c = 0
for domain in scraper:
    print url, domain[0]
    if re.search( domain[0], url):
        break
    c += 1

response = urllib2.urlopen(url)
html = response.read()

soup = BeautifulSoup( html, "lxml" )
soup.original_encoding
tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
text = ""
for con in tag.contents:
    p = re.compile(r'<.*?>')
    text += p.sub('', con.encode('utf8'))

It works as it is, but I will make two changes: (1) use Beautiful Soup instead of a regular expression for tag removal, and (2) use a hash list instead of a counter for delimiter selection.

(1) Use Beautiful Soup

soup = BeautifulSoup( html, "lxml" )
soup.original_encoding
tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
text = ""
for con in tag.contents:
    p = re.compile(r'<.*?>')
    text += p.sub('', con.encode('utf8'))

Regular expressions are very useful, but I wondered whether I could make better use of Beautiful Soup. It provides tools for extracting the strings inside a tag, but they were hard to use at first because it has its own string type. It is well documented, though, so it is just a matter of getting used to it.

And this is after the change!

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    soup2 = BeautifulSoup(tag.encode('utf8'), "lxml")
    print "".join([string.encode('utf8') for string in soup2.strings])

Doesn't it look cool? I parse the tag into a soup again and pull out the strings inside it.

(2) Use a hash list for the delimiter. This is the part in question.

scraper = [ 
        ["hatenablog.com","div","class","entry-content"],
        ["qiita.com","section","itemprop", "articleBody"]
        ]
c = 0
for domain in scraper:
    print url, domain[0]
    if re.search( domain[0], url):
        break
    c += 1

It counts up the variable c to select the delimiter entry. Hmm, that is a bit clumsy, isn't it? So here is the makeover.

    scraper = [ 
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    numHash = {}
    for i in range(len(scraper)):
        numHash[scraper[i][0]] = i 
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            c = numHash[domain[0]]
            break

The script became longer than I expected, but I like this version very much. I wonder if I can make it a little cleaner.
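On making it a little cleaner, one possibility (a sketch, not the author's code) is to keep the `scraper` list as it is and let `next()` return the first matching entry directly, which removes both `c` and `numHash`. It also covers the case where no domain matches, in which the loop above never assigns `c`. The function name `pick_rule` and the sample URLs are hypothetical, and the code is written for Python 3.

```python
import re

scraper = [
    ["hatenablog.com", "div", "class", "entry-content"],
    ["qiita.com", "section", "itemprop", "articleBody"],
]


def pick_rule(url):
    """Return the first scraper entry whose domain occurs in url, or None."""
    # re.escape keeps the dots in the domain literal instead of "any character".
    return next(
        (entry for entry in scraper if re.search(re.escape(entry[0]), url)),
        None,
    )


entry = pick_rule("http://qiita.com/items/12345")
print(entry)
```

With the entry in hand, the lookup becomes `soup.find(entry[1], {entry[2]: entry[3]})`, after checking that `entry` is not `None`.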

So this time the fixes were just for my own satisfaction. Next time I think I will move on to the next step: scraping the links and tag lists that will be the basis for learning. When will we get to machine learning...? The title is about to be called a fraud.
