Get only articles from web pages in Python

A library that allows you to easily extract text from web pages

When you scrape a web page with Python, the result often contains HTML tags and other extra information that you do not need.

In such a case, readability-lxml is all you need. This post explains how to use it.

Install first

(env)$ pip install readability-lxml

Create a utility module like the one below

utils.py


# -*- coding:utf8 -*-
import lxml.html
import readability
def get_content(html):
    """
    Return a (title, body text) tuple extracted from an HTML string.
    """

    document = readability.Document(html)
    content_html = document.summary()
    # Remove HTML tags to get only the body text.
    content_text = lxml.html.fromstring(content_html).text_content().strip()
    short_title = document.short_title()
    return short_title, content_text

Check that you can actually get the title and body using the utility module (I used an article from Yahoo News)

import utils
import requests
obj = requests.get('https://headlines.yahoo.co.jp/hl?a=20191230-00000310-oric-ent')
title,content = utils.get_content(obj.content)
print(title)
print(content)

Confirm that the title and body of the article are printed.
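Because text_content() only joins the text nodes of the extracted HTML, the body may still contain blank lines and indentation left over from the markup. If that bothers you, something like the following could tidy it up. This is only a minimal sketch using the standard library: normalize_whitespace is a hypothetical helper name that is not part of readability-lxml, and the sample string is made up for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse leftover whitespace in extracted text (hypothetical helper)."""
    lines = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs inside the line, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after trimming
            lines.append(line)
    return "\n".join(lines)


# Made-up sample imitating text extracted from HTML markup.
sample = "First paragraph.\n\n\n      \n   Second    paragraph.\t\n"
print(normalize_whitespace(sample))
```

In the test script above, you would call it as print(normalize_whitespace(content)). Note that it drops every blank line, so paragraph breaks end up as single newlines.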

Change log

- 2019/12/31: Newly created
