[PYTHON] How to find the memory address of a Pandas dataframe value

Pandas DataFrames are convenient, but I was not sure how they manage memory. I was curious about where and how the values are actually stored, so I looked into it.

Survey method

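import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.0], 'C': [5, 6]})
# df._data is the internal BlockManager (newer pandas versions call it df._mgr)
for block in df._data.blocks:
    # Address of the first byte of the block's underlying NumPy buffer
    memory_address = block.values.__array_interface__['data'][0]
    # Raw bytes of the buffer as a hex string
    memory_hex = block.values.data.hex()
    print(f"({id(block)}) {block}")
    print(f"<{memory_address}> {memory_hex}")
    print()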
(4886642416) FloatBlock: slice(1, 2, 1), 1 x 2, dtype: float64
<140474854679968> 00000000000008400000000000001040

(4886642608) IntBlock: slice(0, 4, 2), 2 x 2, dtype: int64
<140474585659872> 0100000000000000020000000000000005000000000000000600000000000000

The number in angle brackets is the memory address, and the string after it is the hexadecimal dump of the bytes stored there. Because columns A and C both hold int64 values, you can see that they are stored together in a single block of memory.
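As a rough check, the hex dumps above can be decoded back into values with the standard struct module. This is only a sketch: it assumes the dumps are little-endian 8-byte values (which is what they look like), and the variable names are mine.

```python
import struct

# Hex dumps copied from the output above
float_hex = '00000000000008400000000000001040'
int_hex = '0100000000000000020000000000000005000000000000000600000000000000'

# '<' = little-endian, 'd' = 8-byte float, 'q' = 8-byte signed integer
floats = struct.unpack('<2d', bytes.fromhex(float_hex))
ints = struct.unpack('<4q', bytes.fromhex(int_hex))

print(floats)  # column B
print(ints)    # column A followed by column C
```

The integers come out as column A (1, 2) followed by column C (5, 6), which matches the idea that both columns live in one block.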

Data structure

A DataFrame manages its data in blocks through a class called BlockManager. The idea behind this is explained clearly in the article "[A Roadmap for Rich Scientific Data Structures in Python](https://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/)" by the author of pandas.

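If you trace the types of the objects that appear in the code above, you get this chain: df is a DataFrame, df._data is a BlockManager, each element of df._data.blocks is a Block (FloatBlock, IntBlock and so on), block.values is a numpy.ndarray, and block.values.data is a memoryview.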
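The chain of types can be printed directly. This is a minimal sketch that relies on pandas internals, so it may break between versions. The fallback from `_mgr` to `_data` is my assumption about which attribute name your pandas version exposes, and the exact Block class name differs between releases.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.0], 'C': [5, 6]})

# Internal attribute: `_mgr` in newer pandas, `_data` in older releases
mgr = df._mgr if hasattr(df, '_mgr') else df._data
block = mgr.blocks[0]

# Walk from the DataFrame down to the raw buffer
chain = [df, mgr, block, block.values, block.values.data]
type_names = [type(obj).__name__ for obj in chain]
for name in type_names:
    print(name)
```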

You can see that each block holds a NumPy ndarray. From here on we are in NumPy territory: as described in "2.2. Advanced NumPy" of the Scipy lecture notes, you can get the memory address with ndarray.__array_interface__['data'][0]. And since ndarray.data gives you a memoryview, you can also look at the bytes stored there.
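To convince myself that this number really is the address of the values, here is a small NumPy-only sketch that reads the memory back through ctypes. The array and variable names are just for illustration; `arr.ctypes.data` is another NumPy attribute that reports the same address.

```python
import ctypes
import numpy as np

arr = np.array([1, 2, 5, 6], dtype=np.int64)

# Two ways to ask NumPy for the address of the first element
address = arr.__array_interface__['data'][0]
same_address = arr.ctypes.data

# Interpret the memory at that address as four 64-bit integers.
# `arr` must stay alive while `raw` is in use.
raw = (ctypes.c_int64 * arr.size).from_address(address)
values_read = list(raw)

print(hex(address))
print(values_read)
print(arr.data.hex())  # the same bytes, seen through the memoryview
```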

Note that when you print the memoryview, it is displayed as <memory at 0x11b6a3ad0>, but this is the address of the memoryview instance itself, which is different from the address of the values. For more information, see "[Numpy, Python3.6 - not able to understand why address is different? - Stack Overflow](https://stackoverflow.com/questions/52032545/numpy-python3-6-not-able-to-understand-why-address-is-different)".
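The difference between the two addresses is easy to see side by side. A minimal sketch, assuming CPython (where id() returns an object's address); the names are mine.

```python
import numpy as np

arr = np.array([3.0, 4.0])
mv = arr.data  # memoryview over the array's buffer

object_address = id(mv)                             # where the memoryview object lives
data_address = arr.__array_interface__['data'][0]   # where the values live

print(repr(mv))             # <memory at 0x...>
print(hex(object_address))
print(hex(data_address))
```

The first two lines should show the same number, and the third a different one.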

Experiment

Let's experiment with how the memory allocation changes by doing some simple data frame operations.

df1 = df[0:1]
(4886726416) FloatBlock: slice(1, 2, 1), 1 x 1, dtype: float64
<140474854679968> 0000000000000840

(4886727088) IntBlock: slice(0, 4, 2), 2 x 1, dtype: int64
<140474585659872> 01000000000000000500000000000000

First, a slice containing only the first row. You can see that the memory addresses have not changed and only the referenced range has become shorter. The block instances, however, are new objects.

df2 = df[1:2]
(4886798416) FloatBlock: slice(1, 2, 1), 1 x 1, dtype: float64
<140474854679976> 0000000000001040

(4886798896) IntBlock: slice(0, 4, 2), 2 x 1, dtype: int64
<140474585659880> 02000000000000000600000000000000

This is a slice containing only the second row. Every memory address is 8 bytes higher than before, so you can see that the slice refers to the same memory with only the pointer shifted.
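The same pointer shift can be reproduced with plain NumPy. In this sketch a 2 x 2 int64 array stands in for the values of the IntBlock (one row per column A and C); that stand-in and the helper `address` are my assumptions, not pandas code.

```python
import numpy as np

# Rows are the columns A and C, as in the IntBlock above
base = np.array([[1, 2], [5, 6]], dtype='<i8')

def address(a):
    """Address of the first byte an array refers to."""
    return a.__array_interface__['data'][0]

first = base[:, 0:1]   # corresponds to df[0:1]
second = base[:, 1:2]  # corresponds to df[1:2]

offset_first = address(first) - address(base)
offset_second = address(second) - address(base)

print(offset_first, offset_second)
print(first.tobytes().hex())
print(second.tobytes().hex())
print(np.shares_memory(base, second))
```

The offset of the second slice equals the item size of int64, and no values are copied.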

df['D'] = [True, False]
(4886642416) FloatBlock: slice(1, 2, 1), 1 x 2, dtype: float64
<140474854679968> 00000000000008400000000000001040

(4886642608) IntBlock: slice(0, 4, 2), 2 x 2, dtype: int64
<140474585659872> 0100000000000000020000000000000005000000000000000600000000000000

(4886800144) BoolBlock: slice(3, 4, 1), 1 x 2, dtype: bool
<140474855093504> 0100

Next, add a column. For the existing columns, neither the memory addresses nor the block instances change.

df3 = df.append(df)
(4886726224) IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
<140474855531008> 0100000000000000020000000000000001000000000000000200000000000000

(4509301648) FloatBlock: slice(1, 2, 1), 1 x 4, dtype: float64
<140474585317312> 0000000000000840000000000000104000000000000008400000000000001040

(4509301840) IntBlock: slice(2, 3, 1), 1 x 4, dtype: int64
<140474585630688> 0500000000000000060000000000000005000000000000000600000000000000

(4509301552) BoolBlock: slice(3, 4, 1), 1 x 4, dtype: bool
<140474855008224> 01000100

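Here I appended the rows of df to itself. The memory layout has changed drastically, and there are now two IntBlocks. This is fragmentation, so I would like pandas to merge them at an appropriate time. (Note that DataFrame.append was removed in pandas 2.0; pd.concat([df, df]) is the replacement there.)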

df4 = df3._consolidate()
(4509301552) BoolBlock: slice(3, 4, 1), 1 x 4, dtype: bool
<140474855008224> 01000100

(4509301648) FloatBlock: slice(1, 2, 1), 1 x 4, dtype: float64
<140474585317312> 0000000000000840000000000000104000000000000008400000000000001040

(4886728240) IntBlock: slice(0, 4, 2), 2 x 4, dtype: int64
<140475125920528> 01000000000000000200000000000000010000000000000002000000000000000500000000000000060000000000000005000000000000000600000000000000

When I called the private method _consolidate(), the Int values were merged into a single block and placed at a new memory address.
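The effect on the number of blocks can be checked roughly as follows. This is a sketch built on private pandas internals (`_mgr` or `_data`, `nblocks`, `_consolidate`), so treat it as an assumption that they exist in your version; the helper `block_count` is hypothetical.

```python
import pandas as pd

def block_count(frame):
    """Number of internal blocks (uses private pandas attributes)."""
    mgr = frame._mgr if hasattr(frame, '_mgr') else frame._data
    return mgr.nblocks

df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.0]})
df['C'] = [5, 6]  # a second int64 column, added after construction

before = block_count(df)
merged = df._consolidate()  # private method, as in the experiment above
after = block_count(merged)

print(before, after)
```

The values themselves are unchanged; only how they are grouped in memory differs.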
