[PYTHON] I tried to get Web information using "Requests" and "lxml"

I'm thinking of using Scrapy, but first I tried fetching web information with just "Requests" and "lxml", as a first step in web scraping with Python.

What I did

- Fetched a web page using "Requests"
- Extracted the necessary information from the fetched HTML using "lxml"

Installation

pip install requests
pip install lxml

HTML for testing

I placed the following file on an EC2 instance and accessed it over the Internet.

test.html


<html>
    <body>
        <div id="test1">test1
            <ul id="test1_ul">test1 ul</ul>
        </div>
    </body>
</html>

Scraping code

- The URL is passed as a command-line argument, and the HTML at that URL is processed
- The User-Agent is set to that of Safari on a Mac, just in case

(Error handling, such as for a missing argument, is not implemented.)

scraping.py


import sys

import lxml.html
import requests

# Set a dummy User-Agent (Safari on a Mac)
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8'}

# Take the URL from the first command-line argument
url = ''
if len(sys.argv) > 1:
    url = sys.argv[1]

response = requests.get(url, headers=headers)
html = lxml.html.fromstring(response.content)

# Print the text of every element whose id is "test1_ul"
for element in html.xpath('//*[@id="test1_ul"]'):
    print(element.text)

Run it as follows. Any URL can be passed as the argument.

python scraping.py http://ec2******
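
Since scraping.py does no error handling, here is a minimal sketch of how the missing-argument and failed-request cases might be handled. The function names `url_from_args`, `fetch`, and `main`, the 10-second timeout, and the usage message are all assumptions made for illustration and are not part of the original script.

```python
import sys
from urllib.parse import urlparse

import requests


def url_from_args(argv):
    """Return the first argument if it looks like an http(s) URL, else None."""
    if len(argv) < 2:
        return None
    parsed = urlparse(argv[1])
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return argv[1]


def fetch(url, headers=None):
    """Return the response body as bytes; raises requests.RequestException on failure."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


def main(argv):
    """Return a process exit code: 0 on success, 1 on any error."""
    url = url_from_args(argv)
    if url is None:
        print("usage: python scraping.py <http(s) URL>", file=sys.stderr)
        return 1
    try:
        body = fetch(url)
    except requests.RequestException as error:
        print(f"request failed: {error}", file=sys.stderr)
        return 1
    print(f"fetched {len(body)} bytes")
    return 0
```

In a script, this would be wired up with `sys.exit(main(sys.argv))`, and the bytes returned by `fetch` could be handed to `lxml.html.fromstring` as in scraping.py.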

Other

Chrome's developer tools make it easy to copy the XPath or CSS selector of an element.
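
An XPath obtained this way can be checked offline before running it against a live page. The sketch below inlines the markup of test.html as a string (the name `TEST_HTML` is an assumption for illustration) and evaluates two XPath expressions against it with lxml. The second expression is just an alternative written by hand to show selecting by tag and id.

```python
import lxml.html

# The same markup as test.html, inlined so that no web server is needed.
TEST_HTML = """
<html>
    <body>
        <div id="test1">test1
            <ul id="test1_ul">test1 ul</ul>
        </div>
    </body>
</html>
"""

root = lxml.html.fromstring(TEST_HTML)

# The XPath used in scraping.py: any element whose id is "test1_ul".
ul_texts = [element.text for element in root.xpath('//*[@id="test1_ul"]')]

# Select the div by tag and id; .text holds only the text before its first
# child element, so strip() removes the trailing newline and indentation.
div_texts = [element.text.strip() for element in root.xpath('//div[@id="test1"]')]

print(ul_texts)   # ['test1 ul']
print(div_texts)  # ['test1']
```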
