A record of setting up a Python data analysis environment

Overview

My company bought me O'Reilly's "Python for Data Analysis".

I am recording the environment setup procedure here so that it can be shared in-house.

Postscript

I was told there was no procedure for Windows even though this is meant for in-house use, so I have added one. Windows users should use Cygwin. For reference, see "How to install pip and setuptools on Cygwin". Once pip is available, see the guide on how to install virtualenv.

Environment

Installing the libraries

I will skip the explanation of pip and virtualenv; make sure the mkvirtualenv and pip commands are available. I also want to get used to Python 3, so I will use python3. The O'Reilly book recommends installing Canopy Express, but I will install the libraries myself.

$ mkvirtualenv --no-site-packages --python /usr/local/bin/python3 analytics
(analytics)$ pip install numpy
(analytics)$ pip install scipy 
(analytics)$ pip install matplotlib
(analytics)$ pip install ipython
(analytics)$ pip install "ipython[notebook]"
(analytics)$ ipython
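
As a quick sanity check that the interpreter really belongs to the new environment, something like the following can be run from python or ipython. This is a minimal sketch and not part of the original procedure; the helper name in_virtualenv is made up for illustration. It relies on sys.real_prefix (set by older virtualenv releases) and sys.base_prefix (set by venv and newer virtualenv releases).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter looks like it lives in a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("prefix:", sys.prefix)
print("inside a virtual environment:", in_virtualenv())
```

Inside the analytics environment the prefix should point at the environment's directory rather than at the system Python.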

This creates a separate environment called analytics, and all the work from here on is done inside it. The commands above install ipython and the other libraries used for analysis. Check what was installed:

$ pip freeze
appnope==0.1.0
cycler==0.9.0
decorator==4.0.6
gnureadline==6.3.3
ipykernel==4.2.2
ipython==4.0.1
ipython-genutils==0.1.0
Jinja2==2.8
jsonschema==2.5.1
jupyter-client==4.1.1
jupyter-core==4.0.6
MarkupSafe==0.23
matplotlib==1.5.0
mistune==0.7.1
nbconvert==4.1.0
nbformat==4.0.1
notebook==4.0.6
numpy==1.10.4
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
Pygments==2.0.2
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
pyzmq==15.1.0
scipy==0.16.1
simplegeneric==0.8.1
six==1.10.0
terminado==0.6
tornado==4.3
traitlets==4.0.0
wheel==0.24.0
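
The same check can be done from inside Python, which also confirms that each library can actually be imported. The following is a minimal sketch; the helper name installed_versions is made up for illustration, and it assumes each library exposes a __version__ attribute (it falls back to "unknown" otherwise).

```python
import importlib


def installed_versions(names):
    """Map each module name to its __version__ string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to a placeholder.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Import names of the libraries installed above (note that IPython is capitalised).
for name, version in installed_versions(["numpy", "scipy", "matplotlib", "IPython"]).items():
    print(name, version if version is not None else "NOT INSTALLED")
```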

Check if ipython works.

Python 3.5.1 (default, Dec  7 2015, 21:59:08) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: 

Exit with Ctrl+D, then install pandas.

$ pip install pandas

Operation check

Let's check that everything works. Start ipython with the --pylab option to enable graph drawing.

$ ipython --pylab
...
RuntimeError: Python is not installed as a framework. The Mac OS X backend will

This produced an error: "Python is not installed as a framework." I solved it by following a page I found through Google: create a matplotlibrc file under ~/.matplotlib and put the following in it.

~/.matplotlib/matplotlibrc


backend : TkAgg
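
To confirm that the setting is picked up, a check along these lines may help. It is only a sketch: backend_from_rc is a hypothetical helper that reads the backend key from matplotlibrc-style text, and the second half asks matplotlib itself which rc file and backend are in effect (it is skipped if matplotlib is not installed).

```python
def backend_from_rc(text):
    """Return the value of the 'backend' key in matplotlibrc-style text, or None if absent."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() == "backend":
            return value.strip()
    return None


print(backend_from_rc("backend : TkAgg\n"))  # prints TkAgg

try:
    import matplotlib
except ImportError:
    print("matplotlib is not installed in this environment")
else:
    # Report which rc file was loaded and which backend is in effect.
    print("rc file:", matplotlib.matplotlib_fname())
    print("backend:", matplotlib.get_backend())
```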

Check the operation again.

ipython --pylab
In [1]: import pandas  # pandas can be imported
In [2]: plot(arange(10))  # matplotlib can be used

If a straight-line graph is displayed, everything is working.

Use IPython notebook

ipython notebook

A browser window opens. Create a notebook from the New menu at the upper right. This opens a page where you can type commands, so start by entering:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Run the cell with the play button at the top. Graphs can now be drawn, so next enter:

plt.plot(np.random.randn(1000))

Press the play button again. This generates 1000 random numbers drawn from a standard normal distribution and plots them. IPython notebook keeps a record of the commands you run like this. It's amazing!

[Screenshot: 2016-02-11 20.47.21.png]
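
To see in numbers what the plot shows, the sample's mean and standard deviation can be printed as well. This is a small sketch rather than part of the original notebook; the fixed seed is an assumption added only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable; the value 0 is arbitrary
samples = np.random.randn(1000)

# For a standard normal distribution the mean should be near 0
# and the standard deviation near 1.
print("count:", samples.size)
print("mean: {:.3f}".format(samples.mean()))
print("std:  {:.3f}".format(samples.std()))
```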

Try data analysis

Advance preparation

Move to a suitable working directory and clone the sample repository:

git clone https://github.com/pydata/pydata-book.git

This downloads the sample data that you can use to practice statistics.

cd pydata-book/ch02

Let's analyze usagov_bitly_data2012-03-16-1331923249.txt in this directory with Python! By the way, this file is something like a log of shortened-URL generation.
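
As a starting point only, here is a minimal sketch of loading such a log, assuming the file holds one JSON object per line. The helper name load_records and the two sample lines are made up for illustration; they are not taken from the actual file.

```python
import json


def load_records(lines):
    """Parse an iterable of text lines, one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Made-up sample lines standing in for the real file contents.
sample = [
    '{"tz": "America/New_York", "u": "http://example.com/a"}',
    '',
    '{"tz": "Asia/Tokyo", "u": "http://example.com/b"}',
]
records = load_records(sample)
print(len(records), records[0]["tz"])

# With the real file, the same helper could be fed an open file object:
# with open("usagov_bitly_data2012-03-16-1331923249.txt", encoding="utf-8") as f:
#     records = load_records(f)
```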

Analysis

I intended to write this part up, but I will omit it because from here on it would just be plagiarizing the textbook!

Installing the line profiler

This is a handy tool that comes up in Chapter 3. I am recording it here because it is part of the environment setup. During analysis, you sometimes want to see how a function behaves line by line when it performs heavy calculations. For example, if a 10 ms calculation is repeated 1 million times, shaving it down to 5 ms per call saves a lot of time in total. Improvements like this are probably even more valuable in scientific computing with large matrices. The line profiler is a convenient tool for this: it reports how long each line of a function takes.

Installation steps
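
The arithmetic behind that example can be worked out in a few lines. The figures are the ones from the paragraph above; the variable names are only illustrative.

```python
calls = 1000000   # number of repetitions
before_ms = 10    # time per call before the improvement, in milliseconds
after_ms = 5      # time per call after the improvement, in milliseconds

total_before_s = calls * before_ms / 1000
total_after_s = calls * after_ms / 1000
saved_s = total_before_s - total_after_s

print("before: {:.0f} s ({:.1f} h)".format(total_before_s, total_before_s / 3600))
print("after:  {:.0f} s ({:.1f} h)".format(total_after_s, total_after_s / 3600))
print("saved:  {:.0f} s ({:.1f} h)".format(saved_s, saved_s / 3600))
```

In other words, a run of roughly 2.8 hours drops to roughly 1.4 hours.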

pip install line_profiler
ipython profile create
vi ~/.ipython/extensions/line_profiler_ext.py

~/.ipython/extensions/line_profiler_ext.py


import line_profiler

def load_ipython_extension(ip):
    ip.define_magic('lprun', line_profiler.magic_lprun)

vi ~/.ipython/profile_default/ipython_config.py

~/.ipython/profile_default/ipython_config.py


#------------------------------------------------------------------------------
# TerminalIPythonApp configuration
#------------------------------------------------------------------------------

c.TerminalIPythonApp.extensions = [
  'line_profiler_ext',
]

Try profiling a function

In [1]: from numpy.random import randn

In [2]: def add_and_sum(x, y):
   ...:     added = x + y
   ...:     summed = added.sum(axis=1)
   ...:     return summed
   ...: 

In [5]: x = randn(3000, 3000)

In [6]: y = randn(3000, 3000)

Run the add_and_sum function defined above and measure how long it takes with the arguments x and y. This is done with the magic command %lprun.

In [16]: %lprun -f add_and_sum add_and_sum(x, y)
Timer unit: 1e-06 s

Total time: 0.036058 s
File: <ipython-input-2-19f64f63ba0a>
Function: add_and_sum at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def add_and_sum(x, y):
     2         1        28247  28247.0     78.3      added = x + y
     3         1         7809   7809.0     21.7      summed = added.sum(axis=1)
     4         1            2      2.0      0.0      return summed
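
To read this output, note that the Time column is in timer units (here 1e-06 s, i.e. microseconds) and that % Time is each line's share of the total. The following sketch recomputes both from the numbers printed above; the variable names are only illustrative.

```python
# (line number, time in timer units) copied from the %lprun output above
timer_unit_s = 1e-06
line_times = [(2, 28247), (3, 7809), (4, 2)]

total_units = sum(t for _, t in line_times)
print("total: {:.6f} s".format(total_units * timer_unit_s))

for lineno, t in line_times:
    # "% Time" is this line's share of the function's total time.
    print("line {}: {:.1f} %".format(lineno, 100 * t / total_units))
```

The element-wise addition x + y accounts for about 78% of the time in this run, so it is the first line to look at when optimizing.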
