Introduction
I have the URL of an HTML page that contains links to various courses. I want to extract only the href links of the courses.
Problem solution outline
1. Find all div blocks with the class wp-block-group.
2. Loop through each div block and extract the href links.
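The two steps can be tried offline first. The following is a minimal sketch, assuming a small hand-written HTML snippet in place of the real page. `SAMPLE_HTML`, `extract_group_links`, and the example.com URLs are hypothetical names introduced only for illustration, and the built-in `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="wp-block-group">
  <a href="https://example.com/course-a">Course A</a>
  <a href="https://example.com/course-b">Course B</a>
</div>
<div class="sidebar"><a href="https://example.com/about">About</a></div>
<div class="wp-block-group"></div>
"""

def extract_group_links(html):
    """Return (text, href) pairs for anchors inside div.wp-block-group blocks."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Step 1: find every div carrying the wp-block-group class.
    for div in soup.find_all("div", class_="wp-block-group"):
        # Step 2: collect only the anchors that have an href attribute.
        for a in div.find_all("a", href=True):
            pairs.append((a.get_text(strip=True), a["href"]))
    return pairs

course_links = extract_group_links(SAMPLE_HTML)
```

The link inside the `sidebar` div is skipped because that div does not carry the target class, and the empty `wp-block-group` div simply contributes nothing.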
Problem solution code
import requests
import pandas as pd  # needed for the DataFrame built at the end
from bs4 import BeautifulSoup
url = "https://www.jdla.org/certificate/engineer/#certificate_No04"
res = requests.get(url)
html = res.text
soup = BeautifulSoup(html, 'lxml')
divs = soup.find_all('div', attrs={'class':'wp-block-group'})
preliminary_links = [div.find_all('a') for div in divs]
# preliminary_links is a list of lists, and some of the inner lists are empty.
# Flattening it removes the empty lists at the same time.
import itertools
flat_links = list(itertools.chain.from_iterable(preliminary_links))
# flat_links now holds the anchor tags of interest. Extract the text and href attribute of each.
# get_attribute_list returns a list, so every href arrives wrapped in a one-element list.
links = [(link.get_text(), link.get_attribute_list('href')) for link in flat_links]
# Unwrap each href by taking the first element of its list
clean_links = [(title, href_list[0]) for (title, href_list) in links]
links_df = pd.DataFrame(columns=['title','link'], data=clean_links)
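As a possible alternative, the same extraction can be written with a CSS selector. This is a sketch under stated assumptions, not a replacement for the code above: `NESTED_HTML` and `select_group_links` are hypothetical names, and the markup is invented to show one case where the two approaches differ. If `wp-block-group` divs are nested inside each other, looping over every div collects the inner links more than once, whereas `select` returns each matching anchor a single time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one wp-block-group div nested inside another.
NESTED_HTML = """
<div class="wp-block-group">
  <div class="wp-block-group">
    <a href="/course-1">Course 1</a>
  </div>
  <a href="/course-2">Course 2</a>
  <a>No link here</a>
</div>
"""

def select_group_links(html):
    """Return (text, href) pairs using a single CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" matches only anchors that have an href attribute;
    # each matching element is returned once, in document order.
    anchors = soup.select("div.wp-block-group a[href]")
    return [(a.get_text(strip=True), a["href"]) for a in anchors]

selected = select_group_links(NESTED_HTML)
```

Whether the real page nests these blocks is not known here, so it is worth checking the resulting list for duplicates either way.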