[PYTHON] Knowing whether a string is str (UTF-8) or unicode is important

It turns out that my explanation for the garbled characters was a serious misunderstanding, and re was innocent.

The string that BeautifulSoup's renderContents() returns is a str (UTF-8 bytes) → Unless you decode it to unicode, the value will not be stored in the data store

A string obtained by picking up an attribute value and the like with BeautifulSoup is unicode → The value is stored in the data store as is, with no conversion to unicode
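As a minimal sketch of the decoding step, the snippet below uses a hand-made byte string in place of the renderContents() result, so it needs neither BeautifulSoup nor a network connection. The sample text is an assumption chosen for illustration. It is written so that it runs on Python 2.6+ and on Python 3, where the two types are called bytes and str instead of str and unicode.

```python
# -*- coding: utf-8 -*-
# Stand-in for what renderContents() returns: a UTF-8 encoded byte string.
# (Python 2 calls this type str; Python 3 calls it bytes.)
# The sample text is an assumption, not taken from the real page.
raw = u"HKT48 スケジュール".encode("utf-8")

# Decode once, right after reading, so that later code handles unicode text.
text = raw.decode("utf-8")

byte_count = len(raw)    # len() of a byte string counts bytes
char_count = len(text)   # len() of unicode text counts characters
```

Each Japanese character here takes three bytes in UTF-8, which is why the two lengths differ.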

Until you get used to it, keep track of whether each string is str or unicode as you program, or you will get stuck on this again.
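One way to keep this organized is to convert everything to unicode at the point where it enters the program. The helper below is a hypothetical sketch, not part of BeautifulSoup: the name to_unicode and the default encoding of UTF-8 are assumptions.

```python
# -*- coding: utf-8 -*-
def to_unicode(value, encoding="utf-8"):
    """Return value as unicode text, decoding it only if it is a byte string.

    Hypothetical helper. `bytes` is the same type as `str` on Python 2.6+
    and is the byte-string type on Python 3.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# A byte string (as from renderContents()) is decoded.
from_render = to_unicode(u"誕生日".encode("utf-8"))
# Text that is already unicode (as from an attribute value) passes through.
from_attr = to_unicode(u"誕生日")
```

With a helper like this, both kinds of BeautifulSoup result end up as the same type before they are stored.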

Below, for reference, is the source I used to confirm this.

test.py


# -*- coding: utf-8 -*-
import urllib2
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://www.hkt48.jp/schedule/").read())
aaa=soup.find("h3").renderContents()
# renderContents() returns a str (UTF-8 bytes)
print aaa+":len="+str(len(aaa))
print type(aaa)
# Try re, which I had wrongly blamed
split=re.split("48 ", aaa)
# The pieces returned by re.split are also str (UTF-8 bytes)
print split[0]+":len="+str(len(split[0]))
print type(split[0])
print split[1]+":len="+str(len(split[1]))
print type(split[1])

# An attribute value is unicode
bbb=soup.find('div',{"class":"categories"}).find("ul",{"class":"cf"}).find("li").find("a")["title"]
print bbb+":len="+str(len(bbb))+"type"
print type(bbb)

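【Execution result】
HKT48 Schedule:len=24
<type 'str'>
HKT:len=3
<type 'str'>
Schedule:len=18
<type 'str'>
Birthday:len=3type
<type 'unicode'>

The strings appear here translated from Japanese, so the lengths are those of the original text: for the str values len counts UTF-8 bytes, and for the unicode value it counts characters.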
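That re keeps the type of its input can also be checked offline. The sketch below is not the original test: it replaces the downloaded heading with an assumed sample string and runs re.split once on a byte string and once on unicode. The `b` and `u` prefixes make it run on Python 2.6+ as well as Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Assumed sample heading, not taken from the real page.
raw = u"HKT48 スケジュール".encode("utf-8")  # byte string
text = raw.decode("utf-8")                    # unicode

# A byte pattern on a byte string returns byte strings.
byte_parts = re.split(b"48 ", raw)
# A unicode pattern on unicode returns unicode.
text_parts = re.split(u"48 ", text)
```

In both cases the pieces have the same type as the input, so any garbling has to come from mixing the two types elsewhere.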
