Overview

I want to extract only the store names from the goToEat store list and output them to CSV.

I am using BeautifulSoup, requests, Python 3, and Windows 10.

Error details and reason

By specifying the HTML tag with the following code, I was able to extract the store names, tags included, as a list:


        urlName = "https://premium-gift.jp/eatosaka/use_store?events=page&id={}&store=&addr=&industry=".format(PageNumber)
        dataHTML = requests.get(urlName)
        soup = BeautifulSoup(dataHTML.content, "html.parser")
        elems = soup.select('h3.store-card__title')

Next, I strip the extra information and output the result to CSV. I was told that i.text can be used to get just the text content, so the manual replacements are commented out here.

    with open(r'C:\Users\daisuke\Desktop\python\eat.csv', 'w') as f:
        writer = csv.writer(f)
        for i in elems:
            """
            i = str(i)
            i = i.replace('<h3 class="store-card__title">', '')
            i = i.replace('</h3>', '')
            i = i.replace('\xa0', ' ')
            """
            print(i.text)

            try:
                writer.writerow([i.text])
            except UnicodeEncodeError:
                writer.writerow(['error'])

The following error occurs:

Live spiny lobster dish Chunagon Osaka Station 3 Building
Traceback (most recent call last):
  File "C:\Users\daisuke\Desktop\python\go_to_eat.py", line 24, in <module>
    writer.writerow(i)
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 20: illegal multibyte sequence

Reference 1, Reference 2

The pages being scraped use various character encodings, and the scraped text is decoded into Unicode strings automatically. When those strings are written to a file, they are encoded again.
The target encoding for that step is OS-dependent, and CP932 (a Shift_JIS variant) is selected on Japanese Windows when no encoding is given to open().
This is a Japanese character encoding and does not support **\xa0 (no-break space)**, which is why writing the row fails.
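
The failure can be reproduced without scraping anything. The following is a minimal sketch; the store name and the helper `can_encode` are made up for illustration and are not part of the original script.

```python
import locale

# Hypothetical store name containing U+00A0 (no-break space)
name = "Chunagon\xa0Osaka"

# Encoding that open() uses in text mode when none is given
# (typically cp932 on Japanese Windows)
print(locale.getpreferredencoding(False))


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode(name, "cp932"))  # the no-break space has no cp932 mapping
print(can_encode(name, "utf-8"))  # UTF-8 covers all Unicode characters
```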

Solution

Therefore, I replaced the no-break space with a half-width (ordinary) space as shown below. This only treats the symptom, so it is not a good fix.


        for i in elems:
            i = str(i)
            i = i.replace('<h3 class="store-card__title">', '')
            i = i.replace('</h3>', '')
            # Replace the no-break space (U+00A0) with an ordinary space
            i = i.replace('\xa0', ' ')
            print(i)

            try:
                writer.writerow([i])
            except UnicodeEncodeError:
                writer.writerow(['error'])
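
The same replacement can be applied to the text of each element rather than to the tag string. This is only a sketch: `clean_store_name` is a hypothetical helper, and the sample input stands in for what `i.text` might return.

```python
def clean_store_name(raw):
    """Replace no-break spaces with ordinary spaces and trim both ends."""
    return raw.replace("\xa0", " ").strip()


# Sample value standing in for i.text (assumed, not taken from the site)
cleaned = clean_store_name("  Chunagon\xa0Osaka Station 3 Building ")
print(cleaned)

# The cleaned text can now be encoded as cp932 without raising
encoded = cleaned.encode("cp932")
```

Any other character that CP932 lacks would still raise the same error, which is why this stays a workaround.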

Probably the best approach is to specify a character encoding that can represent the character in question. If you pass the encoding keyword argument to open() as shown below, you directly specify the encoding used for the automatic conversion, so choose one such as UTF-8 that can represent all Unicode characters.

The characters may look garbled when the CSV file is opened, but they display correctly once the viewer is set to the matching character encoding.


    with open(r'C:\Users\daisuke\Desktop\python\eat.csv', 'w', encoding='utf-8') as f:
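
A self-contained version of this idea might look like the sketch below. The function `write_names`, the sample names, and the temporary path are assumptions for illustration. It writes to a temporary directory instead of the Desktop path above.

```python
import csv
import os
import tempfile


def write_names(path, names, encoding="utf-8"):
    """Write one store name per row using an explicit encoding."""
    # newline="" follows the csv module documentation for files given to csv.writer
    with open(path, "w", encoding=encoding, newline="") as f:
        writer = csv.writer(f)
        for name in names:
            writer.writerow([name])


# Hypothetical names; the first contains a no-break space
names = ["Chunagon\xa0Osaka", "Vineyard"]
path = os.path.join(tempfile.mkdtemp(), "eat.csv")
write_names(path, names)

# Read the raw bytes back to confirm the no-break space was stored as UTF-8
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

If the file is meant for a spreadsheet application, passing `encoding="utf-8-sig"` writes a byte order mark at the start of the file, which some applications use to detect UTF-8. Whether that removes the garbling depends on the application.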

However, when reading the CSV back, an unnecessary blank row appeared after every entry, as shown below. ~~I still don't know why.~~ A knowledgeable reader explained it in the comments and the problem is solved. Thank you!

['Wolfgang Steakhouse by Wolfgang Steakhouse Osaka']
[]
['Vineyard']
[]
['Sumikoku Rotating Chicken Cuisine LUCUA']
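
The article does not record what the comment suggested. A commonly documented cause of this symptom is opening the file without newline="": csv.writer already ends each row with "\r\n", and Windows text mode then turns the "\n" into "\r\n" again. The sketch below assumes that cause. `roundtrip` is a hypothetical helper, and newline="\r\n" is used only to imitate the Windows default so the effect shows up on any OS.

```python
import csv
import os
import tempfile


def roundtrip(names, newline):
    """Write names with the given newline mode, then read the rows back."""
    path = os.path.join(tempfile.mkdtemp(), "eat.csv")
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        writer = csv.writer(f)
        for name in names:
            writer.writerow([name])
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


names = ["Vineyard", "LUCUA"]

# Imitates Windows text mode: each "\n" written becomes "\r\n"
broken = roundtrip(names, "\r\n")
# newline="" disables the translation, as the csv documentation recommends
fixed = roundtrip(names, "")

print(broken)
print(fixed)
```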
