Notes on handling large amounts of data with python + pandas

Extract data from MySQL

"""Get data from MySQL with pandas library."""
import MySQLdb
import pandas.io.sql as psql

con = MySQLdb.connect(db='work', user='root', passwd='')  # Open the DB connection
sql = """SELECT product_id, product_nm, product_features FROM electronics"""
df = psql.read_sql(sql, con)  # Load the query result as a pandas DataFrame
con.close()
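
When the result set is too large to hold in memory at once, `read_sql` can also return the rows in pieces through its `chunksize` parameter. The following is a minimal sketch, not part of the original notes: an in-memory SQLite table stands in for the MySQL `electronics` table, and the sample rows and the chunk size of 4 are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite table stands in for the MySQL table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE electronics "
    "(product_id INTEGER, product_nm TEXT, product_features TEXT)"
)
con.executemany(
    "INSERT INTO electronics VALUES (?, ?, ?)",
    [(i, f"item{i}", f"feature{i}") for i in range(10)],  # made-up sample rows
)
con.commit()

sql = "SELECT product_id, product_nm, product_features FROM electronics"

# With chunksize set, read_sql yields DataFrames of at most that many rows
# instead of building one large DataFrame.
chunk_sizes = []
for chunk in pd.read_sql(sql, con, chunksize=4):
    chunk_sizes.append(len(chunk))  # process each chunk here, then let it go

con.close()
print(chunk_sizes)
```

Each chunk can be processed and discarded before the next one is fetched, so only one chunk needs to be held as a DataFrame at a time.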

Creating a vector from data 1

When building feature vectors (for clustering, for example) from large-scale data, iterate over the rows and discard the data already processed as you go, in order to reduce memory consumption.

"""Delete rows while creating dataset."""
X = []
for index, row in df.iterrows():  # Iterate row by row
    Xi = [row.col1, row.col2, row.col3]  # col1..col3 are placeholder column names
    X.append(Xi)
    df = df.loc[index:]  # Keep only the rows from the current one onward to reduce memory consumption
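
To make the loop above concrete, here is a small self-contained sketch that runs it on toy data. The three-row DataFrame and its values are assumptions chosen only so that the result is easy to check by eye; `col1` to `col3` are the same placeholder column names as above.

```python
import pandas as pd

# Toy data (assumption): three rows with placeholder columns col1..col3.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

X = []
for index, row in df.iterrows():  # each step yields (index label, row as a Series)
    Xi = [row.col1, row.col2, row.col3]
    X.append(Xi)
    df = df.loc[index:]  # rebind df to the rows from the current label onward

print(len(X), len(df))
```

After the loop, `X` holds one list per row in the original order, and `df` refers only to the last row.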

Creating a vector from data 2 (speed improvement)

The first method keeps the code clean, but row-by-row iteration over a DataFrame is slow. Copying the columns into plain Python lists once and iterating over those is many times faster.

"""High speed row iteration in pandas DataFrame"""
# Copy the data to plain Python lists
df_index, df_col1, df_col2, df_col3 = \
    list(df.index), list(df.col1), list(df.col2), list(df.col3)
del df  # Delete the DataFrame
X = []
for _ in df_index:
    # Iterate while deleting data; pop() takes from the end,
    # so rows are appended in reverse order
    col1, col2, col3 = df_col1.pop(), df_col2.pop(), df_col3.pop()
    Xi = [col1, col2, col3]
    X.append(Xi)
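
The following sketch runs the list-based approach on toy data and checks it against a one-step conversion. The toy values are assumptions, and the `values.tolist()` call is an alternative shown here for comparison rather than the method described in these notes; it builds all rows at once, so it does not free memory as it goes.

```python
import pandas as pd

# Toy data (assumption): three rows with placeholder columns col1..col3.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

# Comparison only: convert the selected columns to a list of rows in one step.
X_direct = df[["col1", "col2", "col3"]].values.tolist()

# List-based approach: copy the columns to lists, then drop the DataFrame.
df_col1, df_col2, df_col3 = list(df.col1), list(df.col2), list(df.col3)
del df

X = []
while df_col1:
    # pop() removes the last element, so the lists shrink as X grows
    X.append([df_col1.pop(), df_col2.pop(), df_col3.pop()])
X.reverse()  # restore the original row order, since pop() reads from the end

print(X == X_direct)
```

If the order of the vectors does not matter for the downstream step, the final `reverse()` can be left out.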
