[Python] A function that aligns column widths by padding text that mixes full-width and half-width characters

Introduction

I was writing a program that gets a list of headlines and URLs from the Yahoo News site and displays each item on one line. I had some trouble aligning the URL column neatly, so I am writing this article as a note for future reference.

The data comes from the following site. (Image: 2020-10-15_09h07_43.png)

The final output looks like this. (Image: 2020-10-15_09h13_18.png)

Development environment

I used Python 3.7. The development environment is Visual Studio Community 2019.

Code

import requests
import unicodedata
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def main():
    base_url = 'https://news.yahoo.co.jp/'
    categories = {
        'Major': '',
        'Domestic': 'categories/domestic',
        #'Entertainment': 'categories/entertainment',
        #'international': 'categories/world',
        #'Economy': 'categories/business',
        }

    # Process each category in turn
    for cat in categories:
        url = urljoin(base_url, categories[cat])

        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')  # 'html.parser' can be used instead of lxml

        ul_tag = soup.find('div', class_='topicsList')\
                     .find('ul', class_='topicsList_main')

        print(f'==={cat}===')

        for item in ul_tag.find_all('li', class_='topicsListItem'):
            a = item.find('a')
            topic_url = a['href']
            topic_headline = a.text.strip()
            
            #print(f'{topic_headline:<18}[{topic_url}]')
            text = text_align(topic_headline, 30)
            print(f'{text}[{topic_url}]')

        print()

def get_han_count(text):
    '''
    Return the width of text in half-width units:
    full-width characters count as 2, half-width characters as 1.
    '''
    count = 0

    for char in text:
        # 'F' (Fullwidth), 'W' (Wide) and 'A' (Ambiguous) are counted as 2
        if unicodedata.east_asian_width(char) in ('F', 'W', 'A'):
            count += 2
        else:
            count += 1

    return count

def text_align(text, width, *, align=-1, fill_char=' '):
    '''
    Pad text that mixes full-width and half-width characters
    so that it reaches the specified width (in half-width units).

    width: target width in half-width units
    align: -1 -> left, 1 -> right
    fill_char: character used for padding (assumed to be half-width)

    return: the padded text ('abcde     ')
    '''

    fill_count = width - get_han_count(text)
    # Text that is already wide enough is returned unchanged
    if fill_count <= 0:
        return text

    if align < 0:
        return text + fill_char*fill_count
    else:
        return fill_char*fill_count + text

if __name__ == '__main__':
    main()

Initially, I formatted the output text as follows.

for item in ul_tag.find_all('li', class_='topicsListItem'):
    a = item.find('a')
    topic_url = a['href']
    topic_headline = a.text.strip()
            
    # This format spec misaligns the URL column when the headline contains full-width characters.
    print(f'{topic_headline:<18}[{topic_url}]')

In this case, the output looks like this. (Image: 2020-10-15_09h23_08.png)

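A format specification such as print(f'{topic_headline:<18}[{topic_url}]') pads by character count and treats full-width and half-width characters alike. Because a full-width character takes up two columns on screen, the position of the URL column shifts depending on how many full-width characters the headline contains.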

Therefore, I created functions that distinguish between full-width and half-width characters and insert the required amount of padding.

def get_han_count(text):
    '''
    Return the width of text in half-width units:
    full-width characters count as 2, half-width characters as 1.
    '''
    count = 0

    for char in text:
        # 'F' (Fullwidth), 'W' (Wide) and 'A' (Ambiguous) are counted as 2
        if unicodedata.east_asian_width(char) in ('F', 'W', 'A'):
            count += 2
        else:
            count += 1

    return count

def text_align(text, width, *, align=-1, fill_char=' '):
    '''
    Pad text that mixes full-width and half-width characters
    so that it reaches the specified width (in half-width units).

    width: target width in half-width units
    align: -1 -> left, 1 -> right
    fill_char: character used for padding (assumed to be half-width)

    return: the padded text ('abcde     ')
    '''

    fill_count = width - get_han_count(text)
    # Text that is already wide enough is returned unchanged
    if fill_count <= 0:
        return text

    if align < 0:
        return text + fill_char*fill_count
    else:
        return fill_char*fill_count + text
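For reference, the following sketch prints the category that unicodedata.east_asian_width() returns for a few characters. The sample characters are my own picks for illustration. It may help to see why get_han_count() checks for 'F', 'W' and 'A'.

```python
import unicodedata

# Sample characters chosen for illustration.
samples = {
    'a': 'half-width Latin letter',
    'ｱ': 'half-width katakana',
    'あ': 'hiragana',
    'Ａ': 'full-width Latin letter',
    'α': 'Greek letter',
}

widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}
for ch, desc in samples.items():
    print(f'{desc}: {widths[ch]}')
```

Note that 'A' means "ambiguous": such characters are often drawn two columns wide in Japanese fonts and terminals, but may take only one column in other environments. Counting them as 2 is a choice that suits the environment used here, so adjust it if your output looks off.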

Finally, the output is formatted with code like this:

for item in ul_tag.find_all('li', class_='topicsListItem'):
    a = item.find('a')
    topic_url = a['href']
    topic_headline = a.text.strip()
            
    text = text_align(topic_headline, 30)
    print(f'{text}[{topic_url}]')

This solved the problem. In addition, text_align() can right-align the text (padding on the left side) with align=1, and fill_char lets you pad with a character other than a space.

How to send the output to a text editor

By the way, the output of a Python program usually goes to the command prompt, but in this case you may want to put it into a text editor or word processor and save it.

In such a case, you can use a tool called Paster to paste the output directly at the caret position in an editor or similar application. (Image: 2020-10-15_09h50_53.png)

The data is then pasted directly, as shown here. (Image: 2020-10-15_09h13_18.png)
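If you only need the result saved as a file, an alternative to pasting is to write the lines to a text file from Python itself. The following is a minimal sketch of that idea, not what I actually did here. The helper name save_lines, the file name topics.txt, and the sample rows are all assumptions for illustration.

```python
import os
import tempfile

def save_lines(lines, path):
    # Hypothetical helper: write one item per line as UTF-8 text.
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            print(line, file=f)

# Sample rows standing in for the aligned headline/URL lines.
rows = [
    'Headline A          [https://example.com/a]',
    '見出しB             [https://example.com/b]',
]

# Written to the temporary directory here; use any path you like.
out_path = os.path.join(tempfile.gettempdir(), 'topics.txt')
save_lines(rows, out_path)
print(f'saved to {out_path}')
```

Keep in mind that the columns only line up when the file is viewed in a monospaced font that draws full-width characters at twice the width of half-width ones.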

Conclusion

Since this article is about text formatting, I did not explain the scraping code, but the find() method was sufficient for almost all of it.

You are free to use the above source code, but please do so at your own risk.

Reference site

For how to use the function that distinguishes between full-width and half-width characters (`unicodedata.east_asian_width()`), I referred to the following site.

Count the number of characters (width) as 1 half-width character and 2 full-width characters in Python

When doing web scraping, be sure to check robots.txt of the target site.

news.yahoo.co.jp/robots.txt

User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation/
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml
Sitemap: https://news.yahoo.co.jp/byline/sitemap.xml
Sitemap: https://news.yahoo.co.jp/polls/sitemap.xml
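As a rough way to check such rules from code, the standard library's urllib.robotparser can be used. The sketch below parses the rules quoted above from a string (Sitemap lines omitted) instead of downloading them, so it runs offline. To check the live file you would call set_url() and read() instead. The paths tested are just examples.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted above (Sitemap lines omitted).
rules = """\
User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation/
Disallow: /polls/widgets/
Disallow: /articles/*/comments
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

base = 'https://news.yahoo.co.jp/'
# Example paths: the top page, a category page, and a disallowed path.
for path in ['', 'categories/domestic', 'comment/plugin/']:
    print('/' + path, parser.can_fetch('*', base + path))
```

As far as I know, this parser matches rules by simple path prefix, so a wildcard rule such as /articles/*/comments may not be evaluated the way the site intends. Read the file yourself as well rather than relying only on this check.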
