[PYTHON] Notes on handling HTML with Beautiful Soup

For the basic usage of Beautiful Soup, please see Scraping with Python and Beautiful Soup.

I recently had the opportunity to process HTML with Beautiful Soup, so this is a memo of the tips and operations I used at the time. I may update it from time to time.

Replace a specific tag with a different tag

for texttag in content.find_all('text'):
	texttag.name = 'p'

This replaces every <text> tag with a <p> tag.
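As a minimal self-contained sketch of the rename, assuming `bs4` is installed: the HTML fragment, its `class` attribute, and the `result` variable are made up for illustration. It shows that assigning to `.name` keeps the tag's attributes and children.

```python
from bs4 import BeautifulSoup

# Hypothetical input; <text> stands in for the non-standard tag being replaced
html = '<div><text class="lead">Hello <b>world</b></text><text>Bye</text></div>'
content = BeautifulSoup(html, 'html.parser')

for texttag in content.find_all('text'):
    # Assigning to .name renames the tag in place; attributes and children are kept
    texttag.name = 'p'

result = str(content)
print(result)
```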

Add tags to elements that are not enclosed by a specific tag

for imgtag in content.find_all('img'):
	if imgtag.parent.name != 'figure':
		imgtag.wrap(content.new_tag('figure'))

This finds each <img> whose direct parent is not a <figure> and wraps it in a <figure>.
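A small runnable sketch of the same wrap, assuming `bs4` is installed. The input fragment, the file names, and the `result` variable are hypothetical; one <img> is bare and one is already inside a <figure>.

```python
from bs4 import BeautifulSoup

# Hypothetical input: one bare <img> and one already wrapped in <figure>
html = '<div><img src="a.png"><figure><img src="b.png"></figure></div>'
content = BeautifulSoup(html, 'html.parser')

for imgtag in content.find_all('img'):
    # Only the direct parent is checked here
    if imgtag.parent.name != 'figure':
        imgtag.wrap(content.new_tag('figure'))

result = str(content)
print(result)
```

Because only the direct parent is checked, an <img> nested deeper inside a <figure> (for example inside an <a>) would be wrapped a second time. If that can happen in your data, checking `imgtag.find_parent('figure') is None` looks at all ancestors instead.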

Alternatively, a similar wrap can be done by picking the targets with a CSS selector.

for notwrap_a in content.select("p ~ a"):
	notwrap_a.wrap(content.new_tag("p"))

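This finds each <a> that follows a sibling <p>, and is therefore outside that <p>, and wraps it in its own <p>. An <a> with no preceding <p> sibling is not matched by this selector.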
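To make the selector's reach concrete, here is a hedged sketch on a made-up fragment (the `href` values and the names `matched`, `a_tag`, and `result` are illustrative). It first lists what `p ~ a` matches, then shows a parent-based alternative using `find_parent`, which is my substitution rather than the method above, for the case where every <a> outside a <p> should be wrapped.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the first <a> has no preceding <p> sibling, the second does
html = '<div><a href="#1">one</a><p>text</p><a href="#2">two</a></div>'
content = BeautifulSoup(html, 'html.parser')

# "p ~ a" matches only <a> elements that come after a sibling <p>
matched = [a['href'] for a in content.select('p ~ a')]
print(matched)

# Parent-based alternative: wrap any <a> that has no <p> ancestor at all
for a_tag in content.find_all('a'):
    if a_tag.find_parent('p') is None:
        a_tag.wrap(content.new_tag('p'))

result = str(content)
print(result)
```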

### Remove all but the first element from the list

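# 1. Strip the <li> tag from the first item of each list, keeping its contents
#    (this assumes every <ul> contains at least one <li>)
for tag in content.find_all('ul'):
	tag.find('li').unwrap()

# 2. Remove the <ul> tags themselves, keeping their contents
for unwrap_ul in content.find_all('ul'):
	unwrap_ul.unwrap()

# 3. Delete every remaining <li> together with its contents
for delete_li in content.find_all('li'):
	delete_li.decompose()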

The first loop finds each <ul> and strips the <li> tag from its first item with find('li').unwrap(). The second loop removes the <ul> tags themselves, and the third deletes all the remaining <li> elements. This leaves the first item with no enclosing tag, so if you want to give it a new tag, replace

tag.find('li').unwrap()

with

first_li = tag.find('li')
first_li.name = 'p'

and the first item becomes a <p> instead.
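Putting the three steps together on a made-up list gives a quick way to check the result. This is only a sketch: the input fragment, the `None` guard for an empty <ul>, and the names `ul_tag`, `li_tag`, and `result` are my additions.

```python
from bs4 import BeautifulSoup

# Hypothetical input list
html = '<ul><li>first</li><li>second</li><li>third</li></ul>'
content = BeautifulSoup(html, 'html.parser')

for tag in content.find_all('ul'):
    first_li = tag.find('li')
    if first_li is not None:   # guard added here: skip a <ul> with no items
        first_li.name = 'p'    # keep the first item, renamed to <p>

for ul_tag in content.find_all('ul'):
    ul_tag.unwrap()            # drop the <ul> tag, keep its children

for li_tag in content.find_all('li'):
    li_tag.decompose()         # delete the remaining items entirely

result = str(content)
print(result)
```

Note that the last loop deletes every <li> in the document, so items of any <ol> would be removed too.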

Remove the parent element of a particular element

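for p in content.find_all('p'):
    p.parent.unwrap()

This removes the parent element of each <p> while keeping the <p> itself. Note that when several <p> share one parent, each iteration unwraps whatever the current parent is at that moment, so ancestors further up are removed as well.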
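If two <p> elements share a parent, calling `p.parent.unwrap()` for each of them removes the shared parent first and then the next ancestor up. One possible way around that, sketched here on a made-up fragment (the `parents` list and `result` are hypothetical names), is to collect each distinct parent once before changing the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical input: two <p> elements share the same <div> parent
html = '<section><div><p>a</p><p>b</p></div></section>'
content = BeautifulSoup(html, 'html.parser')

# Collect each distinct parent once, before modifying the tree.
# Identity (is) is used because == on tags compares their markup, not identity.
parents = []
for p in content.find_all('p'):
    if not any(p.parent is seen for seen in parents):
        parents.append(p.parent)

for parent in parents:
    parent.unwrap()   # remove the parent tag, keep its children

result = str(content)
print(result)
```

With this input only the <div> is removed and the <section> survives.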

Wrap an element together with the element next to it

Suppose you have the following HTML:

<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>

To enclose each <img> in a <figure>, together with the <figcaption> when one directly follows it, you can do the following.

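from bs4 import BeautifulSoup

html = """<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>"""

# Name the parser explicitly; html.parser does not add <html>/<body> wrappers
content = BeautifulSoup(html, 'html.parser')

for img_tag in content.find_all('img'):
    # Wrap every <img> in a new <figure>
    fig = content.new_tag('figure')
    img_tag.wrap(fig)

    # find_next() returns the next tag in document order, skipping text nodes
    next_node = img_tag.find_next()
    if next_node and next_node.name == 'figcaption':
        # Move the caption into the same <figure>
        fig.append(next_node)

print(content)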

Running this produces the following (whitespace adjusted here for readability):

<figure>
   <img src="00001.jp"/>
   <figcaption>caption string1</figcaption>
</figure>
<figure><img src="00002.jp"/></figure>
<figure>
   <img src="00003.jp"/>
   <figcaption>caption string3</figcaption>
</figure>
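One thing to be aware of: find_next() walks the whole document, so it can also pick up a <figcaption> that is not a sibling of the <img>. If that matters for your data, a possible variant is to look up the next sibling tag before wrapping. The sketch below uses a made-up fragment, and the names `caption` and `result` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the first <figcaption> follows the <div>, not the <img> itself
html = ('<div><img src="a.png"></div><figcaption>not adjacent</figcaption>'
        '<img src="b.png"><figcaption>adjacent</figcaption>')
content = BeautifulSoup(html, 'html.parser')

for img_tag in content.find_all('img'):
    # Look up the next sibling tag before wrapping;
    # after the wrap the <img> is alone inside its <figure>
    caption = img_tag.find_next_sibling()

    fig = content.new_tag('figure')
    img_tag.wrap(fig)

    if caption is not None and caption.name == 'figcaption':
        fig.append(caption)   # move the caption into the same <figure>

result = str(content)
print(result)
```

Here the caption after the <div> stays where it is, and only the caption that is a sibling of the second <img> is moved into its <figure>.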