List of Python libraries for data scientists and data engineers

This post introduces Python libraries that are useful for data analysis, data processing, machine learning, and more.

Why Python

For statistics and machine learning, R is also an option. It is a language that excels at processing, aggregating, and statistically analyzing data, and it can do a lot with its standard functions alone. With its rich set of machine learning libraries, it is without doubt a powerful choice. Python's advantage over R is the richness of its surrounding ecosystem, which extends beyond data science. For example, data processed with NumPy and Pandas can be used in a full-scale web application built with Django.

Installation of libraries

Most of the libraries listed here can be installed in bulk with Anaconda.
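
After installing, it can help to confirm which of the libraries in this post are actually importable in the current environment. The following is a minimal sketch using only the standard library. The list of import names is my own assumption about how each package is imported (for example, scikit-learn is imported as sklearn and Kafka-Python as kafka), and check_installed is a hypothetical helper, not part of Anaconda.

```python
from importlib.util import find_spec

# Import names assumed for the libraries covered in this post.
# The import name can differ from the package name (scikit-learn -> sklearn).
LIBRARIES = [
    "numpy", "pandas", "matplotlib", "plotly",
    "kafka", "pyspark", "sklearn", "tensorflow", "keras",
]


def check_installed(names):
    """Map each top-level module name to True if it can be found for import."""
    return {name: find_spec(name) is not None for name in names}


# Print one line per library without importing any of them
for name, found in check_installed(LIBRARIES).items():
    print(f"{name:12s} {'installed' if found else 'missing'}")
```

Because find_spec only locates a module without importing it, the check stays fast even for heavy libraries such as TensorFlow.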

Data processing

NumPy

NumPy is a library for efficient numerical calculations. The example here uses a one-dimensional array, but multidimensional arrays are supported as well. Vector and matrix calculations run at high speed.

In [1]: import numpy as np #Import NumPy

In [2]: arr = np.arange(10) #Vector creation

In [3]: arr #output
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: arr * 10 #Multiply every element by 10
Out[4]: array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
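
The same operations carry over to multidimensional arrays. As a small sketch with made-up values, the following builds a 2x3 matrix and shows an element-wise operation, a matrix-vector product, and a column-wise sum.

```python
import numpy as np

# A 2x3 matrix built from a flat range, and a length-3 vector
matrix = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
vector = np.array([1, 10, 100])

scaled = matrix * 10            # element-wise: every entry multiplied by 10
product = matrix @ vector       # matrix-vector product, one value per row
col_sums = matrix.sum(axis=0)   # sum down each column

print(scaled)
print(product)    # [210 543]
print(col_sums)   # [3 5 7]
```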

NumPy — NumPy

Pandas

Pandas is a library built on top of NumPy that provides functions indispensable for machine learning preprocessing, such as reading data and handling missing values. Its DataFrame object makes it easy to process and merge data, and is similar to R's data.frame.

In [1]: import pandas as pd #Import Pandas

In [2]: df = pd.DataFrame({ #Creating a data frame
   ...: 'A': [n for n in range(5)],
   ...: 'B': ['male', 'male', 'female', 'female', 'male'],
   ...: 'C': [0.3, 0.4, 1.2, 100.5, -20.0]
   ...: })

In [3]: df
Out[3]: 
   A       B      C
0  0    male    0.3
1  1    male    0.4
2  2  female    1.2
3  3  female  100.5
4  4    male  -20.0

In [4]: df.describe() #Output of basic statistics
Out[4]: 
              A           C
count  5.000000    5.000000
mean   2.000000   16.480000
std    1.581139   47.812101
min    0.000000  -20.000000
25%    1.000000    0.300000
50%    2.000000    0.400000
75%    3.000000    1.200000
max    4.000000  100.500000

In [5]: df[df['B'] == 'female'] #Select the rows where B is 'female'
Out[5]: 
   A       B      C
2  2  female    1.2
3  3  female  100.5
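
Handling missing values and merging, both mentioned in the description of Pandas, might look roughly like the following. The two data frames and their column names are hypothetical sample data, and filling with the mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one score is missing (NaN)
scores = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [0.5, np.nan, 1.5],
})
names = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
})

# Fill the missing value with the mean of the observed scores
# (mean() skips NaN by default)
filled = scores.fillna({"score": scores["score"].mean()})

# Join the two frames on the shared "id" column
merged = filled.merge(names, on="id")
print(merged)
```

Dropping incomplete rows with dropna() instead of filling them is an equally common choice, depending on how much data you can afford to lose.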

Python Data Analysis Library — pandas: Python Data Analysis Library

Reporting and visualization

Jupyter

Jupyter Notebook is a Python execution environment that records both code and its output, so it works well as a coding environment for exploratory data processing and statistical analysis. A notebook can also be exported as a report or slides.

Project Jupyter | Home

matplotlib

matplotlib is a graph drawing library. It supports various graphs such as bar graphs, scatter plots, and histograms.
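
As a rough illustration of the three graph types named above, the following sketch draws them side by side. The data is invented for the example, the Agg backend is chosen only so the script runs without a display, and plots.png is a placeholder output file name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical sample data
categories = ["a", "b", "c"]
counts = [3, 7, 5]
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
samples = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# One row with three panels, one per graph type
fig, (ax_bar, ax_scatter, ax_hist) = plt.subplots(1, 3, figsize=(12, 3))

ax_bar.bar(categories, counts)
ax_bar.set_title("Bar graph")

ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatter plot")

ax_hist.hist(samples, bins=5)
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("plots.png")  # placeholder file name
```

In a Jupyter Notebook the figure is displayed inline, so the backend line and savefig call are usually unnecessary there.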

Matplotlib: Python plotting — Matplotlib 2.0.2 documentation

plotly

plotly can draw richer and more interactive graphs than matplotlib. Graphs you create can also be shared on plot.ly.

Python Graphing Library, Plotly

Messaging and stream processing

Kafka-Python

Kafka-Python, as the name implies, is Apache Kafka's Python client.

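import json

from kafka import KafkaConsumer

# Subscribe to 'topic' on a broker running locally
consumer = KafkaConsumer('topic', bootstrap_servers='localhost:9092')

# Iterating over the consumer blocks and yields messages as they arrive
for msg in consumer:
    data = json.loads(msg.value.decode())  # msg.value is bytes; here it holds JSON
    print(data)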

PySpark

Apache Spark and Kafka have become indispensable for big data processing. PySpark is Spark's Python API, and Spark includes a machine learning library called MLlib.

Python Programming Guide - Spark 0.9.0 Documentation

Machine learning

scikit-learn

scikit-learn is a machine learning library. It offers not only the popular neural networks but many other algorithms as well. It also provides the utilities that machine learning work needs, such as splitting data into training and validation sets, cross-validation, and grid search, so it covers the details you would otherwise have to write yourself. If you want to try a machine learning library, start here.
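
To give a feel for those utilities, here is a minimal sketch that combines a train/test split, cross-validation, and grid search. The choice of the bundled iris dataset, the SVC classifier, and the small parameter grid are my own assumptions for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Split into training data and held-out test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 5-fold cross-validation on the training data with default hyperparameters
cv_scores = cross_val_score(SVC(), X_train, y_train, cv=5)

# Grid search over a small, arbitrarily chosen parameter grid
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("CV accuracy:", cv_scores.mean())
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```

Keeping the test data out of both the cross-validation and the grid search means the final score is measured on samples the model selection never saw.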

scikit-learn: machine learning in Python — scikit-learn 0.18.2 documentation

TensorFlow

TensorFlow is the well-known deep learning library.

TensorFlow

Keras

Keras is a high-level wrapper for TensorFlow, CNTK, Theano, and other backends.

Keras Documentation

Recommended books

O'Reilly Japan - Introduction to Data Analysis with Python

A book by the author of Pandas. It teaches how to use Pandas along with data analysis techniques, and also covers related libraries such as NumPy and matplotlib.

O'Reilly Japan - Machine learning starting with Python

A book by a scikit-learn core developer. It teaches how to use scikit-learn and the engineering that machine learning requires.

Beyond Python

If tweaking data in Pandas and tuning machine learning libraries is not enough for you, you will need to step outside the Python ecosystem. The world of data is deep and vast, and engineers need to cover a wider area to keep up with data scientists. Specifically, getting a handle on distributed processing infrastructure such as Hadoop, Spark, and Apex, and on fully managed data warehouses such as BigQuery and TreasureData, will broaden your field of activity.

- Count the frequency of occurrence of words in a sentence by stream processing (Apache Apex) - Bad sentence pattern
- Set up a fluentd container with Docker and save Rails logs in Treasure Data by IDCF - Bad sentence pattern
