[PYTHON] Notes on working with HTML in Beautiful Soup

For the basic usage of Beautiful Soup, please see Scraping with Python and Beautiful Soup.

I recently had a chance to manipulate HTML with Beautiful Soup, so this post collects tips and notes on the processing I used. I may update it from time to time.

Replace a specific tag with a different tag

for texttag in content.find_all('text'):
	texttag.name = 'p'

This renames every <text> tag to <p>. The contents and attributes of each tag are left as they are.
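As a minimal, self-contained sketch of the rename above: the sample markup and the choice of the built-in html.parser are assumptions made for illustration, not part of the original notes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; <text> is not a standard HTML tag.
html = "<div><text>first</text><text>second</text></div>"
content = BeautifulSoup(html, "html.parser")

# Assigning to .name renames the tag in place; children are untouched.
for texttag in content.find_all("text"):
    texttag.name = "p"

result = str(content)
print(result)
```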

Add tags to elements that are not enclosed by a specific tag

for imgtag in content.find_all('img'):
	if imgtag.parent.name != 'figure':
		imgtag.wrap(content.new_tag('figure'))

This finds every <img> whose direct parent is not a <figure> and wraps it in a new <figure>.
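The following sketch runs the same wrap on a small hypothetical document (the file names a.png and b.png are made up). Note that only the direct parent is checked, so an <img> nested deeper inside a <figure> would still be wrapped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <img> already in a <figure>, one not.
html = '<figure><img src="a.png"/></figure><div><img src="b.png"/></div>'
content = BeautifulSoup(html, "html.parser")

for imgtag in content.find_all("img"):
    # Only the direct parent is inspected here.
    if imgtag.parent.name != "figure":
        imgtag.wrap(content.new_tag("figure"))

figure_count = len(content.find_all("figure"))
print(content)
```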

A similar wrap can also be written with a CSS selector.

for notwrap_a in content.select("p ~ a"):
	notwrap_a.wrap(content.new_tag("p"))

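This wraps each <a> that follows a sibling <p> in its own <p>. Note that the selector p ~ a only matches an <a> that comes after a <p> under the same parent, so an <a> with no preceding sibling <p> is left unwrapped.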
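To make the selector's behavior concrete, here is a small sketch on hypothetical markup. It first shows which <a> elements "p ~ a" actually matches, then, as an alternative that is not from the original notes, wraps every <a> whose direct parent is not a <p>.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <a> before the <p>, one after it.
html = '<div><a href="#0">before</a><p>text</p><a href="#1">after</a></div>'
content = BeautifulSoup(html, "html.parser")

# "p ~ a" is the subsequent-sibling combinator: it matches only an <a>
# that has a <p> earlier among its siblings.
matched = [a["href"] for a in content.select("p ~ a")]
print(matched)

# Alternative (an assumption about the intent): check the direct parent
# so that every <a> outside a <p> gets wrapped.
for a in content.find_all("a"):
    if a.parent.name != "p":
        a.wrap(content.new_tag("p"))

print(content)
```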


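Remove all but the first element from the list

# Strip the <li> tag from the first item of each <ul>, keeping its contents
for tag in content.find_all('ul'):
	tag.find('li').unwrap()

# Remove the <ul> tags themselves, keeping their contents
for unwrap_ul in content.find_all('ul'):
	unwrap_ul.unwrap()

# Delete every remaining <li> together with its contents
for delete_li in content.find_all('li'):
	delete_li.decompose()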

The first loop finds each <ul> and strips the <li> tag from its first item with find('li').unwrap(), leaving that item's contents in place. The second loop removes the <ul> tags, and the third deletes the remaining <li> elements. At this point the first item's contents are no longer wrapped in any tag, so if you want them in a new tag instead, change

tag.find('li').unwrap()

to

first_li = tag.find('li')
first_li.name = 'p'

which turns the first <li> into a <p>.
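Putting the steps together, a minimal sketch of the rename variant might look like this. The sample list and the guard against an empty <ul> are assumptions added for illustration. Be aware that the last loop deletes every <li> left in the document, not only those that came from a <ul>.

```python
from bs4 import BeautifulSoup

# Hypothetical sample list.
html = "<ul><li>first</li><li>second</li><li>third</li></ul>"
content = BeautifulSoup(html, "html.parser")

for ul in content.find_all("ul"):
    first_li = ul.find("li")
    if first_li is not None:  # guard against an empty <ul>
        first_li.name = "p"   # keep the first item as a paragraph
    ul.unwrap()               # drop the <ul> tag, keep its children

# Delete whatever <li> elements remain, contents included.
for li in content.find_all("li"):
    li.decompose()

result = str(content)
print(result)
```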

Remove the parent element of a particular element

for p in soup.find_all('p'):
    p.parent.unwrap()

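This removes the parent element of each <p> while keeping the <p> and its siblings in place. Note that if several <p> elements share one parent, each later <p> unwraps the next ancestor up.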
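The sketch below runs the loop on hypothetical markup where each <p> has its own parent, and then shows one possible way (an assumption, not the original author's approach) to handle several <p> elements sharing a parent: collect each distinct parent once before unwrapping.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: each <p> has its own <div> parent.
content = BeautifulSoup(
    "<section><div><p>one</p></div><div><p>two</p></div></section>",
    "html.parser",
)
for p in content.find_all("p"):
    p.parent.unwrap()
separate_result = str(content)
print(separate_result)

# Hypothetical sample: two <p> elements share one <div> parent.
shared = BeautifulSoup(
    "<section><div><p>a</p><p>b</p></div></section>", "html.parser"
)
# Collect each distinct parent once (keyed by identity) before unwrapping,
# so the shared <div> is removed once and <section> is left alone.
parents = {id(p.parent): p.parent for p in shared.find_all("p")}
for parent in parents.values():
    parent.unwrap()
shared_result = str(shared)
print(shared_result)
```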

Wrap the element next to the specified element together

Suppose you have the following HTML:

<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>

To wrap each <img> in a <figure>, together with the <figcaption> that immediately follows it (if there is one), you can do the following.

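from bs4 import BeautifulSoup

# Triple quotes allow the double quotes and line breaks inside the markup
html = """<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>"""

# Name the parser explicitly; html.parser does not add <html>/<body> wrappers
content = BeautifulSoup(html, 'html.parser')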

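for img_tag in content.find_all('img'):
    # Wrap every <img> in a new <figure>
    fig = content.new_tag('figure')
    img_tag.wrap(fig)

    # If the next tag after the <img> is a <figcaption>, move it into the <figure>
    next_node = img_tag.find_next()
    if next_node and next_node.name == 'figcaption':
        fig.append(next_node)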

print(content)

Running this produces the following result (whitespace adjusted here for readability):

<figure>
   <img src="00001.jp"/>
   <figcaption>caption string1</figcaption>
</figure>
<figure><img src="00002.jp"/></figure>
<figure>
   <img src="00003.jp"/>
   <figcaption>caption string3</figcaption>
</figure>
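As a variation on the code above, the sketch below restricts the lookup to siblings of the new <figure> by using find_next_sibling(True) instead of find_next(). Passing True matches any tag, so whitespace text nodes are skipped. This is an alternative offered under the assumption that captions are always siblings of the image; it is not the original author's code.

```python
from bs4 import BeautifulSoup

html = """<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>"""

content = BeautifulSoup(html, "html.parser")

for img_tag in content.find_all("img"):
    fig = content.new_tag("figure")
    img_tag.wrap(fig)

    # Look only at the next sibling tag of the new <figure>.
    next_tag = fig.find_next_sibling(True)
    if next_tag is not None and next_tag.name == "figcaption":
        fig.append(next_tag)

# Summarize which figures ended up with a caption.
figures = content.find_all("figure")
captions = [f.figcaption.get_text() if f.figcaption else None for f in figures]
print(captions)
```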
