[PYTHON] BeautifulSoup use case: read anchor links of interest from a URL

Introduction

I have the URL of an HTML page that contains links to various courses. I am interested in extracting only the href links of the courses.

Problem solution outline

  1. The URL is https://www.jdla.org/certificate/engineer/#certificate_No04
  2. First, I will fetch the URL and save the raw HTML text in a variable
  3. Then I will instantiate BeautifulSoup and use it to get the list of div blocks with the class wp-block-group
  4. Finally, I will parse each div block and extract the href links

Problem solution code

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.jdla.org/certificate/engineer/#certificate_No04"

res = requests.get(url)
html = res.text
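# optionally, res.raise_for_status() could be called here to fail fast on HTTP errors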

soup = BeautifulSoup(html, 'lxml')
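# note: 'lxml' requires the third-party lxml package; the built-in 'html.parser' also works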

divs = soup.find_all('div', attrs={'class':'wp-block-group'})
preliminary_links = [div.find_all('a') for div in divs]

# preliminary_links is a list of lists, and some of the inner lists are empty. Flattening it also drops the empty lists in one step

import itertools

flat_links = list(itertools.chain.from_iterable(preliminary_links))

# Now I have the list of <a> tags I am interested in. Next I will extract their text and href attributes

links = [(link.get_text(), link.get_attribute_list('href')) for link in flat_links]

# get_attribute_list returns a list, so take its first element to get the actual href value
clean_links = [(title, href_list[0]) for (title, href_list) in links]
links_df = pd.DataFrame(columns=['title', 'link'], data=clean_links)
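
As a side note, a CSS selector can achieve the same result more compactly: soup.select('div.wp-block-group a[href]') returns the anchor tags inside the target div blocks directly, so no flattening step is needed, and a[href] keeps only anchors that actually carry an href attribute. Below is a minimal sketch of this alternative, assuming the same page structure as above.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.jdla.org/certificate/engineer/#certificate_No04"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# anchors that have an href attribute, inside divs whose class includes wp-block-group
anchors = soup.select('div.wp-block-group a[href]')

# a.get('href') returns the attribute value directly, so no list unpacking is needed
clean_links = [(a.get_text(), a.get('href')) for a in anchors]
links_df = pd.DataFrame(columns=['title', 'link'], data=clean_links)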
