[PYTHON] Where I got stuck in Collective Intelligence Chapter 3. It's not a listed typo, so I think something is wrong with my code.

Recently, to deepen my understanding of machine learning, I started working through Collective Intelligence (the Japanese edition is titled "Collective Intelligence Programming"). In Chapter 3, Hierarchical Clustering, I created clusters.py and wrote a function in it. When I ran it, it didn't work, and I got stuck on how the list elements are indexed.

Collective Intelligence has many typographical errors and few official corrections, so an unofficial errata list has been created. This problem isn't listed there either, so my code is probably what's wrong. If you find any mistakes, please let me know.

Executing readfile('blogdata.txt') in clusters.py

First, to prepare the dataset, I wrote the following code in clusters.py.

clusters.py


def readfile(filename):
  lines=[line for line in file(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]

  for line in lines[1:]:
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
  return rownames,colnames,data

Then I imported the module and ran it in the interpreter as follows:

blognames, words, data=clusters.readfile('blogdata.txt')

'could not convert string to float: looking'

It fails with the message 'could not convert string to float: looking'. Here, blogdata.txt contains the following data, which was parsed with feedparser.

	four	looking	second	here	music	until	example	want	wrong	easier	series	re	wasn	service	project	person	episode	best	country	asked	much	life	things	big	couple	had	easy	possible	right	old	people	support	later	time	leave	love	working	awesome	such	data	so	years	didn	internet	million	quite	open	future	san	say	saw	note	take	ways	going	where	many	wants	photos	single	technology	being	around	traffic	world	power	favorite	other	image	her	am	number	tv	th	large	small	past	hours	via	company	learn	states	information	its	always	found	week	really	major	also	play	plan	set	see	movie	last	whole	recent	d	continue	anything	into	link	line	posted	us	ago	having	try	video	let	great	makes	tools	next	process	high	move	doing	could	start	system	fact	should	hope	means	stuff	edition	email	less	web	government	five	become	does	chance	told	work	interview	after	order	office	then	them	they	network	another	do	away	com	voice	hand	photo	night	security	marketing	post	months	way	update	together	p	guy	change	history	live	car	write	product	remember	still	now	january	year	space	shows	friend	than	online	only	between	article	comes	these	media	real	read	early	using	business	aren	lot	trying	building	since	month	very	family	put	ve	site	help	actually	event	reason	ask	american	off	clear	pretty	during	x	close	won	probably	else	look	while	user	game	some	doesn	youtube	go	facebook	click	products	started	control	links	software	front	times	exactly	need	able	based	course	she	state	key	problem	both	well	page	twitter	home	he	friends	amp	companies	likely	even	ever	never	call	tell	give	before	better	went	side	content	isn	features	matter	don	m	points	stop	bad	said	against	three	if	make	left	human	yes	yet	deal	popular	down	digital	me	did	run	box	making	may	man	maybe	talk	nbsp	interesting	thing	think	first	long	little	anyone	were	especially	show	black	get	nearly	morning	behind	reading	across	among	those	different	same	running	money	either	users	enough	videos	film	again	important	u	public	search	two	share	coming	through	late	
someone	everyone	house	hard	idea	done	least	part	tool	most	find	please	point	simple	itself	bit	google	often	back	others	bunch	ll	day	text	including	taking	value	almost	thought	latest	add	like	works	buy	minutes	special	under	every	would	phone	must	my	keep	end	over	writing	each	group	got	free	days	already	top	too	took	talking	though	watch	amazon	report	full	however	news	quickly	several	social	everything	why	head	check	no	when	cool	posts	says	goes	sports	today	local	name	turn	place	given	released	any	ideas	sure	written	come	case	good	without	seems	blog	there	program	far	list	design	version	short	might	used	friday	feel	story	store	king	kind	nothing	windows	his	him	art	political	questions	fast	called	once	issues	apple	app	use	few	something	united	six	instead	looks	our	york	their	which	who	ones	view	available	stories	gets	know	press	because	lead	getting	own	made	book
Schneier on Security	1	0	1	2	0	2	1	2	2	1	0	5	0	1	1	0	0	2	2	0	4	0	2	1	2	2	0	1	2	1	4	1	2	6	0	0	0	0	3	2	3	1	0	6	0	0	0	3	0	1	4	0	1	1	5	4	3	0	0	0	2	3	3	0	2	1	0	6	0	0	0	2	0	0	0	1	0	0	0	1	1	2	1	9	0	0	0	0	2	3	0	1	1	3	1	1	0	1	0	0	1	2	0	0	0	15	1	1	1	0	2	0	1	1	0	3	1	1	1	9	0	1	1	9	0	1	0	0	0	0	0	12	0	2	2	0	0	5	0	0	1	1	0	5	20	2	1	5	3	1	0	3	0	1	7	0	2	2	1	0	0	0	0	1	1	1	0	0	0	0	1	2	0	4	0	0	0	4	0	7	4	2	0	6	0	1	0	0	4	0	0	2	1	1	2	0	5	0	0	0	0	1	1	0	1	0	1	3	0	1	1	0	0	0	0	2	0	1	1	0	1	2	0	0	0	0	1	1	1	0	0	0	2	0	4	1	2	0	0	2	0	4	0	5	0	0	0	5	0	0	0	1	6	0	2	2	3	1	2	2	0	0	0	1	0	2	5	0	1	0	0	3	7	1	5	1	0	2	0	0	1	0	4	0	0	9	1	0	3	3	0	1	1	0	1	3	1	3	2	0	0	8	0	1	1	4	2	0	1	0	1	1	3	4	9	0	0	5	0	1	1	0	0	1	0	2	0	4	0	2	1	2	0	1	0	2	0	0	1	1	0	5	0	0	0	0	2	0	0	2	1	1	0	0	0	1	2	1	0	0	0	0	0	3	0	0	0	0	2	1	3	1	0	0	0	0	3	0	1	2	1	0	1	2	0	0	0	0	2	0	0	0	7	1	5	1	4	0	1	5	0	0	2	14	0	0	1	0	0	0	0	0	0	0	0	0	2	0	2	2	1	1	0	2	1	1	4	2	0	0	0	0	0	5	4	1	0	0	2	0	1	0	1	1	0	1	0	0	0	2	1	0	0	0	2	1	1	1	0	0	0	3	0	11	5	13	1	1	3	2	0	7	1	7	0	0	2	0	0
PaulStamatiou.com - Technology, Design and Photography	2	21	13	69	15	38	53	120	5	23	6	115	19	21	5	15	2	47	2	12	141	26	60	29	0	100	34	11	74	29	71	21	34	159	11	31	50	2	36	52	210	28	39	7	3	26	31	17	10	22	2	18	69	12	54	91	66	11	131	13	4	50	76	9	17	18	6	95	105	3	20	13	12 … 

As I understand it, while converting the numeric data in the file to float, the code tried to convert the string "looking" to float, and the error says that this could not be done.
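
The error itself can be reproduced without the file. The following is a minimal sketch, using a made-up list of fields rather than the real blogdata.txt, that shows float() accepting numeric strings and raising ValueError on a word. It is written for Python 3, where the message puts quotes around the offending string, so the wording differs slightly from the Python 2 message above.

```python
# Hypothetical fields for illustration; not read from blogdata.txt.
fields = ['1', '0', 'looking']

converted = []
error = None
for x in fields:
    try:
        converted.append(float(x))
    except ValueError as exc:
        # float() cannot parse a word, so it raises ValueError
        error = str(exc)
        break

print(converted)  # the numeric strings converted before the failure
print(error)      # the message names the field that could not be converted
```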

The problem is that lines[1] contains string data

From here on, this is my guess at what was happening. In lines, the blogs' word list comes first, and the word-count data follow as later elements after the line break. With the original code, lines[1] contains the blog word list, that is, string data, so I suspect the error occurred when the loop tried to convert it.
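
One way to test this guess, rather than reading the file by eye, is to ask which lines contain a data field that float() rejects. The sketch below assumes a helper named non_numeric_lines (a hypothetical name, not part of the book's code) and runs it on a made-up three-line sample instead of the real file.

```python
def non_numeric_lines(lines):
    """Return (line index, first offending field) for every line whose
    data columns cannot all be converted to float."""
    bad = []
    for i, line in enumerate(lines):
        # Same splitting as readfile: strip, split on tabs, drop column 0
        fields = line.strip().split('\t')[1:]
        for x in fields:
            try:
                float(x)
            except ValueError:
                bad.append((i, x))
                break
    return bad


# Made-up sample: a header row that starts with a tab, then two data rows.
sample = [
    '\tfour\tlooking\tsecond\n',
    'Blog A\t1\t0\t1\n',
    'Blog B\t2\t21\t13\n',
]
result = non_numeric_lines(sample)
print(result)  # [(0, 'looking')]
```

In this sample the header row begins with a tab, as the listing above appears to. strip() removes that leading tab, so 'four' is treated as the row name and 'looking' becomes the first field passed to float(), which would match the error message. Whether the real file behaves the same way is an assumption. Running the helper on the lines of blogdata.txt would show which indices actually hold strings.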

So I think the for loop has to skip that first element before converting to float, which means writing something like the following (it's terrible code because I'm new to Python ...). In fact, this worked.

clusters.py


def readfile(filename):
  lines=[line for line in file(filename)]

  # First line is the column titles
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]

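  # lines[1] holds strings rather than counts, so it is skipped below
  first_line=lines[1]

  for line in lines[1:]:
    # Skip before appending to rownames so rownames and data stay aligned
    if line==first_line: continue
    p=line.strip().split('\t')
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])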
  return rownames,colnames,data
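
As an alternative to skipping one specific line, the loop could skip any row whose data fields are not numeric. The following is only a sketch under stated assumptions: read_matrix is a hypothetical name, it takes an iterable of text lines instead of a filename so it can be tried without the file, and the sample is a shortened version of the first columns of the listing above. It also uses rstrip('\r\n') instead of strip() so that a header row starting with a tab keeps its empty first cell.

```python
def read_matrix(lines):
    """Hypothetical variant of readfile that takes an iterable of lines."""
    # Remove only the line ending, so a leading tab (empty first cell) survives
    rows = [line.rstrip('\r\n').split('\t') for line in lines if line.strip()]

    # First row is the column titles; its first cell is the row-name column
    colnames = rows[0][1:]
    rownames = []
    data = []

    for p in rows[1:]:
        try:
            values = [float(x) for x in p[1:]]
        except ValueError:
            # Not a row of counts (for example a stray header row): skip it
            continue
        rownames.append(p[0])
        data.append(values)
    return rownames, colnames, data


# Shortened sample based on the first few columns of the listing.
sample = [
    '\tfour\tlooking\tsecond\n',
    'Schneier on Security\t1\t0\t1\n',
    'PaulStamatiou.com - Technology, Design and Photography\t2\t21\t13\n',
]
rownames, colnames, data = read_matrix(sample)
print(colnames)
print(rownames)
print(data)
```

The trade-off is that silently skipping rows can hide real problems in the data, so it may be worth printing or counting the skipped rows while debugging.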

From what I have been able to find, the original code seems to work fine for other people, so it is quite likely that my code is wrong. If you notice anything, please point it out. Otherwise, I hope this article helps someone.
