Introduction to Data Analysis with Python P17-P26 [ch02 1.usa.gov data from bit.ly]

module

# %load ipython_log.py
# IPython log file
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

loading file

path='./usagov_bitly_data2012-03-16-1331923249.txt'
open(path).readline()
import json
record=[json.loads(line) for line in open(path)]   # parse each line of the file as JSON
record[0]   # record is long, so look at just the first element for now
record[0]['tz']   # look up the 'tz' key of that element
time_zone=[rec['tz'] for rec in record if 'tz' in rec]   # collect tz from every record that has a 'tz' key
time_zone[:10]   # look at the first 10 entries
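
As a minimal sketch of the same loading step, the snippet below parses a tiny in-memory sample instead of the real file. The two records and the names `sample`, `sample_records` and `sample_time_zones` are made up for illustration; only the one-JSON-object-per-line layout is taken from the notes above.

```python
import io
import json

# Hypothetical two-record sample standing in for the real bit.ly file.
sample = io.StringIO(
    '{"a": "Mozilla/5.0", "tz": "America/New_York"}\n'
    '{"a": "GoogleMaps/RochesterNY"}\n'
)

# One JSON object per line; blank lines are skipped defensively.
sample_records = [json.loads(line) for line in sample if line.strip()]

# Not every record has a 'tz' key, hence the membership test.
sample_time_zones = [rec['tz'] for rec in sample_records if 'tz' in rec]
print(sample_time_zones)
```

With a real file, wrapping the read in `with open(path) as f:` closes the file handle once the list has been built.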

Count tz and store in dictionary

def get_counts(seq):
	'''
	Count how many times each string appears in seq and return the result
	as a dictionary {'string': count, ...}.

	The plain-dict version below does the same thing; defaultdict just makes it shorter.

	def get_counts(seq):
		counts={}
		for x in seq:
			if x in counts:
				counts[x]+=1
			else:
				counts[x]=1
		return counts
	'''
	from collections import defaultdict
	counts=defaultdict(int)   # missing keys start at int(), i.e. 0
	for x in seq:
		counts[x]+=1
	return counts

counts=get_counts(time_zone)
counts['America/New_York']
len(time_zone)
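
To check that the plain-dict and defaultdict approaches really agree, here is a small sketch on a made-up list. The names `count_plain`, `count_default` and `sample_tz` are hypothetical and not part of the original notes.

```python
from collections import defaultdict


def count_plain(seq):
    """Count occurrences with an ordinary dict."""
    counts = {}
    for x in seq:
        counts[x] = counts.get(x, 0) + 1
    return counts


def count_default(seq):
    """Count occurrences with defaultdict(int); missing keys start at 0."""
    counts = defaultdict(int)
    for x in seq:
        counts[x] += 1
    return counts


# Made-up time zone strings, including an empty one.
sample_tz = ['Asia/Tokyo', '', 'Asia/Tokyo', 'Europe/London']
plain = count_plain(sample_tz)
default = count_default(sample_tz)
print(plain == dict(default))
```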

Find the top 10

The approaches below look different, but they all produce the same counts

Write a function

def top_counts(count_dict,n=10):
	value_key_pairs=[(count,tz) for tz, count in count_dict.items()]
	value_key_pairs.sort()
	return value_key_pairs[-n:]

top_counts(counts)

Use the Counter class

from collections import Counter
counts= Counter(time_zone)
counts.most_common(10)
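
One difference worth noting between the two approaches: the sort-based function returns `(count, tz)` pairs in ascending order, while `Counter.most_common` returns `(tz, count)` pairs in descending order. A small sketch on made-up data (`sample_tz` and the other names are hypothetical):

```python
from collections import Counter

# Made-up time zone strings.
sample_tz = ['Asia/Tokyo', '', 'Asia/Tokyo', 'Europe/London', '', 'Asia/Tokyo']
sample_counts = Counter(sample_tz)

# Sort-based approach: (count, key) pairs, smallest count first.
pairs = sorted((count, tz) for tz, count in sample_counts.items())
top2_sorted = pairs[-2:]

# Counter approach: (key, count) pairs, largest count first.
top2_counter = sample_counts.most_common(2)

print(top2_sorted)
print(top2_counter)
```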

use pandas

from pandas import DataFrame,Series
import pandas as pd
frame=DataFrame(record)
frame['tz'][:10]
tz_counts=frame['tz'].value_counts()
tz_counts[:10]

Filling in missing values (NA)

clean_tz=frame['tz'].fillna('Missing')
clean_tz[clean_tz=='']='Unknown'
clean_tz
tz_counts=clean_tz.value_counts()
tz_counts[:10]
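
The two cleaning steps treat different cases: `fillna` handles records with no tz at all (NaN), and the boolean mask handles records whose tz is an empty string. A minimal sketch on a made-up Series (the data and names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up column with one empty string and one missing value.
sample = pd.Series(['America/New_York', '', np.nan, 'America/New_York'])

clean = sample.fillna('Missing')   # NaN -> 'Missing'
clean[clean == ''] = 'Unknown'     # empty string -> 'Unknown'

sample_counts = clean.value_counts()
print(sample_counts)
```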

Plot tz_counts

tz_counts[:10].plot(kind='barh',rot=0)
import matplotlib.pyplot as plt
# plt.show()

Element count

frame['a'][1]
frame['a'][50]
frame['a'][51]
results=Series([x.split()[0] for x in frame.a.dropna()])   # first whitespace-separated token of each user-agent string
   # .dropna() is a pandas method that drops missing (NaN) entries
   # str.split(x) splits a string into a list using x as the delimiter; with no argument it splits on whitespace
   # the list comprehension collects the first tokens, which are then wrapped in a pandas Series
results[:5]
results.value_counts()[:8]   # value_counts() counts how many times each distinct value occurs
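
A small sketch of the same token extraction on made-up user-agent strings. It also shows a vectorized alternative using the pandas `.str` accessor, which is not in the original notes; all sample data and names are hypothetical.

```python
import pandas as pd

# Made-up user-agent strings, one of them missing.
agents = pd.Series([
    'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'GoogleMaps/RochesterNY',
    None,
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)',
])

# List-comprehension version, as in the notes above.
first_tokens = pd.Series([x.split()[0] for x in agents.dropna()])

# Vectorized alternative: split on whitespace, then take element 0 of each list.
vectorized = agents.dropna().str.split().str[0]

token_counts = first_tokens.value_counts()
print(token_counts)
```

The vectorized version keeps the original index labels, while the list-comprehension version gets a fresh 0..n-1 index.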

Element count (another method)

cframe=frame[frame.a.notnull()]   # keep only the rows of frame whose column a is not null
cframe['a'].equals(frame.a.dropna())   # compare values and index to confirm cframe['a'] matches frame.a.dropna()
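
To compare the two selections reliably, `Series.equals` checks both the values and the index labels. A minimal sketch on a made-up frame (`sample_frame` and its contents are assumptions):

```python
import pandas as pd

# Made-up frame with one missing value in column 'a'.
sample_frame = pd.DataFrame({
    'a': ['Mozilla/5.0', None, 'Opera/9.80'],
    'tz': ['America/New_York', 'Asia/Tokyo', 'Europe/London'],
})

via_mask = sample_frame[sample_frame['a'].notnull()]['a']   # boolean-mask selection
via_dropna = sample_frame['a'].dropna()                     # dropna selection

same = via_mask.equals(via_dropna)
print(same)
```

Note that `bool()` applied to a `map` object is always `True`, whatever the map contains, so it cannot be used for this kind of comparison.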

'Windows' or Not?

import numpy as np
operating_system=np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')   # 'Windows' where cframe['a'] contains the string 'Windows', otherwise 'Not Windows'
   # same result as `['Windows' if 'Windows' in x else 'Not Windows' for x in cframe['a']]`
operating_system[:5]

operating_system Another Way

operating_system2=['Windows' if 'Windows' in x else 'Not Windows' for x in cframe['a']]
bool(list(operating_system)==operating_system2)   #True
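
The equivalence of the two labelling methods can be sketched on a few made-up user-agent strings (the data and the names `os_where` and `os_listcomp` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up user-agent strings with no missing values.
agents = pd.Series([
    'Mozilla/5.0 (Windows NT 6.1)',
    'Mozilla/5.0 (Macintosh)',
    'Opera/9.80 (Windows NT 5.1)',
])

# np.where(condition, value_if_true, value_if_false) works element-wise.
os_where = np.where(agents.str.contains('Windows'), 'Windows', 'Not Windows')

# Pure-Python equivalent.
os_listcomp = ['Windows' if 'Windows' in x else 'Not Windows' for x in agents]

print(list(os_where) == os_listcomp)
```

`str.contains` treats its pattern as a regular expression by default; for a plain word like 'Windows' that makes no difference.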


by_tz_os=cframe.groupby(['tz',operating_system])
agg_counts=by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
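
What `groupby(...).size().unstack().fillna(0)` produces is easier to see on a tiny table. This sketch groups by two ordinary columns rather than a column name plus an array, which is a simplification of the code above; the frame and the column name `os` are made up.

```python
import pandas as pd

# Made-up records: one row per click.
sample = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York',
           'Asia/Tokyo', 'Asia/Tokyo', 'Asia/Tokyo'],
    'os': ['Windows', 'Not Windows', 'Windows', 'Windows', 'Windows'],
})

# size() counts rows per (tz, os) pair; unstack() pivots os into columns;
# fillna(0) fills combinations that never occurred.
sample_agg = sample.groupby(['tz', 'os']).size().unstack().fillna(0)
print(sample_agg)
```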



indexer=agg_counts.sum(1).argsort()   # argsort() returns the integer positions that would sort the row totals in ascending order
   # sum(1) adds up the values across each row (axis=1), giving one total per time zone

'''
# ABOUT np.sum()

>>> np.sum([[0, 1], [0, 5]], axis=0)
array([0, 6])   # column sums: [0+0, 1+5]
>>> np.sum([[0, 1], [0, 5]], axis=1)
array([1, 5])   # row sums: [0+1, 0+5]

'''

indexer[:10]


count_subset=agg_counts.take(indexer)[-10:]   # reorder the rows of agg_counts by indexer and keep the last 10, i.e. the 10 largest totals (take = get rows by position)
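
The argsort-then-take pattern can be traced by hand on a three-row table. This is a sketch on made-up counts; converting the positions with `to_numpy()` is my own choice to make the positional intent explicit, not something the notes do.

```python
import pandas as pd

# Made-up aggregated counts per time zone.
sample_agg = pd.DataFrame(
    {'Not Windows': [5.0, 1.0, 3.0], 'Windows': [4.0, 0.0, 10.0]},
    index=['America/New_York', 'Asia/Tokyo', 'Europe/London'],
)

totals = sample_agg.sum(axis=1)   # row totals: 9, 1, 13
order = totals.argsort()          # positions in ascending order of total: 1, 0, 2

# take() selects rows by position; the last 2 rows have the largest totals.
top2 = sample_agg.take(order.to_numpy())[-2:]
print(top2)
```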

count_subset.plot(kind='barh', stacked=True)
# plt.show()

Normalization (scale each row so it sums to 1, i.e. convert counts to ratios)

normed_subset=count_subset.div(count_subset.sum(1),axis=0)
normed_subset.plot(kind='barh',stacked=True)
# plt.show()
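
A quick sketch of the row normalization on made-up counts: dividing by the row totals with `axis=0` aligns the totals with the row index, so every row ends up summing to 1. The table and names are hypothetical.

```python
import pandas as pd

# Made-up counts for two time zones.
sample_subset = pd.DataFrame(
    {'Not Windows': [1.0, 3.0], 'Windows': [3.0, 1.0]},
    index=['Asia/Tokyo', 'Europe/London'],
)

# Divide each row by its own total.
sample_normed = sample_subset.div(sample_subset.sum(axis=1), axis=0)
print(sample_normed)
```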
