Get a link's destination URL by specifying its link text with Python scraping (Beautiful Soup) + XPath

I've been practicing scraping for the past few days. I got stuck on the following problem and eventually solved it, so I'm writing it up in this article.

- I wanted to scrape the text in a table together with its link destination URL as a pair (into a pandas DataFrame).
- The same table contained several a href links and none of them had an identifying name, so they were hard to pick out even with a regular expression.
  → I decided to use XPath: specify the link text, then scrape the href of the link that has that text. (A DataFrame raises an error if its columns have different lengths, so I wanted to leave out unneeded data and reliably get only what I need.)
- Beautiful Soup itself cannot use XPath, but lxml can.
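As a minimal sketch of the idea above, the snippet below runs the XPath against an inline HTML string instead of a live site. The table markup, the link text "Details", and the name `SAMPLE_HTML` are assumptions made up for illustration; only `lxml.html.fromstring` and `xpath` come from the article's approach.

```python
import lxml.html

# Hypothetical table: several links per row, none with an id or class
SAMPLE_HTML = """
<table>
  <tr>
    <td class="xx">Item A</td>
    <td><a href="/a/map">Map</a> <a href="/a/detail">Details</a></td>
  </tr>
  <tr>
    <td class="xx">Item B</td>
    <td><a href="/b/map">Map</a> <a href="/b/detail">Details</a></td>
  </tr>
</table>
"""

dom = lxml.html.fromstring(SAMPLE_HTML)

# Keep only the anchors whose text is exactly "Details" and return their href values
detail_links = [str(href) for href in dom.xpath("//a[text()='Details']/@href")]
print(detail_links)
```

The "Map" links are skipped because the predicate `text()='Details'` matches on the visible link text, not on any attribute of the tag.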

[Sites I referred to]
http://gci.t.u-tokyo.ac.jp/tutorial/crawling/
http://www.slideshare.net/tushuhei/python-xpath
http://qiita.com/tamonoki/items/a341657a86ff7a945224

scraping.py


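# coding: utf-8
# Python 3: urllib.request replaces Python 2's urllib2
import time
import urllib.request

import lxml.html
import pandas as pd
from bs4 import BeautifulSoup

aaa = []  # cell text taken from the table
bbb = []  # link destination URLs (href values)

for page in range(1, 2):  # range(1, 2) fetches page 1 only; widen it for more pages
    url = "http://www.~~~" + str(page)
    # Download once and hand the same bytes to both parsers
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    dom = lxml.html.fromstring(html)

    # Text of every <td class="xx"> cell
    for o1 in soup.find_all("td", class_="xx"):
        aaa.append(o1.string)

    # Get the href by specifying the link text in the xxx part
    for o2 in dom.xpath("//a[text()='xxx']/@href"):
        bbb.append(o2)

    time.sleep(2)  # pause between requests

# Both lists must have the same length, otherwise DataFrame raises a ValueError
df = pd.DataFrame({"aaa": aaa, "bbb": bbb})
print(df)
df.to_csv("xxxx.csv", index=False, encoding="utf-8")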
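
The script above collects the cell text and the hrefs in two separate passes, so the two lists only line up if every row has exactly one matching link. One possible variation, sketched here under assumptions, is to walk the table row by row and take both values from the same tr, which keeps the columns the same length. The sample HTML, the link text "Details", and the helper name `extract_rows` are hypothetical and not part of the original script.

```python
import lxml.html
import pandas as pd

# Hypothetical table: the second row has no "Details" link
SAMPLE_HTML = """
<table>
  <tr>
    <td class="xx">Item A</td>
    <td><a href="/a/map">Map</a> <a href="/a/detail">Details</a></td>
  </tr>
  <tr>
    <td class="xx">Item B</td>
    <td><a href="/b/map">Map</a></td>
  </tr>
</table>
"""


def extract_rows(html, link_text):
    """Return one record per table row: cell text plus the matching href (or None)."""
    dom = lxml.html.fromstring(html)
    records = []
    # Only rows that contain a <td class="xx"> cell
    for row in dom.xpath("//tr[td[@class='xx']]"):
        name = row.xpath("string(td[@class='xx'])").strip()
        # $t is an XPath variable, filled in from the keyword argument t
        hrefs = row.xpath(".//a[text()=$t]/@href", t=link_text)
        records.append({"aaa": name, "bbb": str(hrefs[0]) if hrefs else None})
    return records


records = extract_rows(SAMPLE_HTML, "Details")
df = pd.DataFrame(records)
print(df)
```

Because each record is built from a single row, a row without the link yields an empty value in that row instead of shortening one column.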

It's a short write-up, but that's all for today.
