Formatting data for color plots in Python

Introduction

In experiments, a measured quantity $ Z $ often depends on two parameters, $ X $ and $ Y $. You can summarize such results separately as graphs of $ Z $ against $ X $ and $ Z $ against $ Y $, but combining them into a single color plot of $ X $ vs $ Y $ vs $ Z $ [^ 1] gives a bird's-eye view of the data.

Color plots take some effort, though: the 2D data of $ (X, Y, Z) $ they require almost always has to be prepared by preprocessing the raw data. This article therefore explains how to format data for color plots, using two cases often encountered in practice as examples.

Data required to create a color plot

[Figure: the target data layout, a grid of $ Z $ values indexed by $ X $ and $ Y $]

A color plot needs the data structure shown above: mesh-like data in which a value of $ Z $ is stored for every combination of the two axes $ X $ and $ Y $. The rest of this article formats the data with this shape as the goal.
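As a minimal sketch of this target layout, the toy grid below stores $ Z $ in a pandas DataFrame with $ X $ as the index and $ Y $ as the columns. The axis values and the numbers in `z_grid` are made up for illustration and are not taken from the sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy axes: 3 values of X and 2 values of Y.
x_values = [0, 1, 2]
y_values = [0.0, 0.1]

# Z is stored on the mesh: one row per X, one column per Y.
z_grid = np.array([[10.0, 11.0],
                   [20.0, 21.0],
                   [30.0, 31.0]])

mesh = pd.DataFrame(z_grid,
                    index=pd.Index(x_values, name='X'),
                    columns=pd.Index(y_values, name='Y'))

# Look up Z at (X=1, Y=0.1).
print(mesh.loc[1, 0.1])
```

Both cases discussed later end with a data frame of this shape, just with 101 rows and 101 columns.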

Common cases

Two cases commonly require preprocessing. The sample files below use the DAT extension, but the processing code also notes what to change for CSV. The processing itself is the same; only the initial loading step differs.

1. The data file is split

[Figure: data files split by parameter, with the value of $ Y $ recorded in each filename]

The first example is when the data is split into separate files by one of the parameters, $ X $ or $ Y $. In this case you need to combine the split files. Here the value of $ Y $ is recorded only in the filename, so you also need to extract it from there to build the list of $ Y $ values.

`sample1.py`


# Running this file generates the sample data for Example 1.
# Create a suitable working directory before running it.

import numpy as np

# Parameter definitions
intensity = 50  # Peak intensity
HWHM = 3  # Half width at half maximum
a = 3  # Magnitude of the random noise

# Create the data files
for Y in np.arange(0, 10.1, 0.1):
    # Format Y with one decimal to avoid names such as Y=0.30000000000000004
    filename = 'sample1_Y={:.1f}.dat'.format(Y)
    X0 = (200 * Y * Y + 2500) ** 0.5 - 50  # Peak position for this Y
    with open(filename, 'w') as file:
        file.write('X\tZ\n')
        for X in range(0, 101):
            Z = intensity * HWHM ** 2 / ((X - X0) ** 2 + HWHM ** 2) \
                + 20 + a * np.random.rand()
            file.write('{}\t{}\n'.format(X, Z))

2. All data is connected in one file

[Figure: a single data file containing $ X $, $ Y $, and $ Z $ columns]

The second example is when $ X $, $ Y $, and $ Z $ are all written in one file. In this case the data must be reshaped so that the values of $ Y $ run along the horizontal axis (the columns) and $ Z $ fills the grid.

`sample2.py`


# Running this file generates the sample data for Example 2.
# Create a suitable working directory before running it.

import numpy as np

# Parameter definitions
intensity = 50  # Peak intensity
HWHM = 3  # Half width at half maximum
a = 3  # Magnitude of the random noise

# Create the data file
with open('sample2.dat', 'w') as file:
    file.write('X\tY\tZ\n')
    for Y in np.arange(0, 10.1, 0.1):
        X0 = (200 * Y * Y + 2500) ** 0.5 - 50  # Peak position for this Y
        for X in range(0, 101):
            Z = intensity * HWHM ** 2 / ((X - X0) ** 2 + HWHM ** 2) \
                + 20 + a * np.random.rand()
            # Write Y with one decimal to avoid values such as 0.30000000000000004
            file.write('{}\t{:.1f}\t{}\n'.format(X, Y, Z))

[Digression] A Lorentzian peak was used as the sample data this time. sample2.py produces a single file, whereas sample1.py produces split data: 101 DAT files in steps of 0.1 from Y = 0.0 to 10.0. Opening the file for Y = 5.5 shows contents like the figure on the left. The graph on the right plots $ Z $ against $ X $. The position of the peak along $ X $ shifts with $ Y $.

[Figure: left, contents of the data file for Y = 5.5; right, plot of $ Z $ against $ X $]
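To see how the peak moves with $ Y $, the sketch below evaluates the same formula as the sample scripts with the random noise term left out. The helper names `peak_center` and `noise_free_z` are introduced here for illustration only.

```python
import numpy as np

# Same parameters as the sample scripts; the random noise term is omitted.
intensity = 50
HWHM = 3

def peak_center(Y):
    """Peak position X0 for a given Y, as defined in the sample scripts."""
    return (200 * Y * Y + 2500) ** 0.5 - 50

def noise_free_z(X, Y):
    """Lorentzian peak plus the constant offset of 20, without noise."""
    X0 = peak_center(Y)
    return intensity * HWHM ** 2 / ((X - X0) ** 2 + HWHM ** 2) + 20

X = np.arange(0, 101)
for Y in (0.0, 5.5, 10.0):
    Z = noise_free_z(X, Y)
    # Print Y, the analytic peak center, and the grid point where Z is largest.
    print(Y, round(peak_center(Y), 2), X[np.argmax(Z)])
```

The peak center runs from X = 0 at Y = 0 to X = 100 at Y = 10, which is the curved ridge that shows up in the color plots.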

Data processing and color plot

Using these two cases as examples, we will now format the data. The work is done in Jupyter Notebook.

1. When the data file is divided

`In[1]`


import re, glob
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Load the libraries.

`In[2]`


filelist = glob.glob('sample1_Y=*.dat')  # Match only the Example 1 files
df = pd.DataFrame()
for file in filelist:
    match = re.match(r'sample1_Y=(.*)\.dat', file)  # Capture Y from the filename
    df_sub = pd.read_table(file)  # For CSV, use pd.read_csv()
    df[float(match.group(1))] = df_sub.Z
df.columns.name = 'Y'
df.index = df_sub.X
df

`Out[2]`


Y	0.0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	...	9.0	9.1	9.2	9.3	9.4	9.5	9.6	9.7	9.8	9.9
X																					
0	71.307065	71.624977	72.568247	70.073268	71.388264	71.429283	69.430455	66.104600	64.251044	61.960019	...	20.892164	21.724984	20.259025	21.625291	22.658143	22.641024	20.799494	21.042593	20.667364	20.451245
1	66.347248	65.184597	66.907600	67.807422	67.879276	70.401552	72.100718	72.617697	72.195462	70.692071	...	20.409888	22.230631	21.106551	22.801198	21.159110	20.973036	21.779757	20.625188	21.405971	21.577096
2	54.815612	54.960281	55.477640	57.619689	59.971637	60.601975	63.984228	65.729155	67.846441	69.637961	...	21.854349	20.668861	20.172761	20.416828	21.374005	21.202518	21.688063	21.056256	22.637612	20.305400
3	46.311290	46.916455	47.512971	47.175870	48.731614	50.673641	52.572572	55.803255	59.562894	62.597950	...	22.427942	20.156526	21.141887	22.187281	21.712688	22.921697	22.876228	22.972608	22.592168	21.185094
4	40.442910	38.820936	38.994950	41.859569	40.333883	42.995725	45.152994	47.650007	49.414120	53.309453	...	21.873397	20.659303	21.022158	20.543980	23.023661	21.418374	22.771670	20.218522	22.349163	21.412955
5	35.598731	33.633262	35.782304	36.352184	35.778557	37.035520	38.947890	40.425551	41.307669	43.021355	...	22.766781	20.876074	20.208458	21.890359	22.792392	22.499805	22.652404	22.497508	22.339281	21.668357
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

95	21.804124	21.797596	23.010401	20.258773	20.073975	21.918238	22.169896	21.988170	20.120070	20.975194	...	28.402855	28.041022	32.390451	36.689843	48.158836	58.781772	71.028003	67.065172	52.942427	42.138778
96	21.250080	21.630843	21.553191	22.952056	21.329605	20.270100	21.658320	20.191202	21.166837	20.145893	...	27.438835	27.227526	31.282917	32.428334	40.363609	50.900390	63.092738	70.580351	63.513627	48.526807
97	22.389748	21.693057	20.886997	21.460203	22.610140	20.102447	23.021290	22.793081	22.306881	20.704143	...	25.180590	26.366878	28.042743	30.121939	36.192960	40.735346	53.298041	67.425563	70.242088	60.144091
98	21.265201	21.367930	21.225976	20.466155	21.115541	20.294466	20.556839	22.789051	20.945778	21.343996	...	23.949042	24.200023	26.902070	28.732446	32.388426	35.483425	44.613251	56.697203	68.448203	70.253491
99	22.688459	22.243006	22.604197	22.114754	22.967067	22.538572	21.954847	22.286714	22.779653	20.139557	...	23.386930	24.568997	25.573001	27.852061	30.095048	32.057468	37.229132	47.773504	61.152210	71.645167
100	21.503768	20.480336	21.507903	21.943483	21.158995	20.880028	22.613661	21.468507	22.059082	20.855645	...	24.196745	25.333946	24.965214	27.570366	27.059141	31.368592	34.169847	40.928392	48.591212	62.390900
101 rows × 101 columns

glob finds the DAT files in the directory, and each one is read as a data frame df_sub. The second column ($ Z $) of df_sub is then added to df, the data frame that will finally be plotted. The value of $ Y $ that re.match extracts from the filename is used as the column name.
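As an alternative sketch, the same table can be assembled in one step by collecting each $ Z $ series in a dict and joining them with `pd.concat`, which avoids inserting columns one at a time. The function name `load_split_files` and the tiny files written to a temporary directory are assumptions made so that the example runs on its own; they are not part of the article's sample data.

```python
import glob
import os
import re
import tempfile

import pandas as pd

def load_split_files(directory):
    """Combine sample1_Y=<value>.dat files into one X-by-Y table of Z."""
    pattern = re.compile(r'sample1_Y=(.*)\.dat')
    columns = {}
    for path in glob.glob(os.path.join(directory, 'sample1_Y=*.dat')):
        match = pattern.fullmatch(os.path.basename(path))
        if match is None:
            continue  # Skip files that do not follow the naming scheme
        df_sub = pd.read_table(path)
        # Key each Z series by the Y value taken from the filename.
        columns[float(match.group(1))] = df_sub.set_index('X')['Z']
    # Join all Z series side by side; the dict keys become the column labels.
    df = pd.concat(columns, axis=1)
    df.columns.name = 'Y'
    return df

# Tiny hypothetical files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for Y in (0.0, 0.1, 10.0):
        name = 'sample1_Y={:.1f}.dat'.format(Y)
        with open(os.path.join(tmp, name), 'w') as f:
            f.write('X\tZ\n0\t{}\n1\t{}\n'.format(Y + 20, Y + 21))
    df = load_split_files(tmp)

print(df)
```

Because each series is indexed by $ X $ before joining, files are aligned on their $ X $ values rather than on row position.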

`In[3]`


plt.pcolor(df.index, df.columns, df.T)
plt.colorbar()
plt.axis('tight')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

Finally, matplotlib draws the color plot like this.

The data frame appears to have no column for Y = 10.0, but it is there, between Y = 1.9 and Y = 2.0. This is because the filenames found by glob are ordered as strings, so the numeric parts are not compared as numbers. When plotting, each column is placed according to its value of Y, so the plot here was not affected. If the column order does cause a problem (newer Matplotlib versions may warn that the coordinates are not monotonic), sort the columns separately.
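A minimal sketch of that sorting step, assuming the columns are numeric $ Y $ values as in the data frame above. The small data frame here is hypothetical and only mimics the string-sorted column order.

```python
import pandas as pd

# Hypothetical columns in string-sorted order: 10.0 sits between 1.9 and 2.0.
df = pd.DataFrame([[1.0, 2.0, 4.0, 3.0]],
                  columns=pd.Index([0.0, 1.9, 10.0, 2.0], name='Y'))

# Sort the columns by their numeric Y value.
df_sorted = df.sort_index(axis=1)
print(list(df_sorted.columns))
```

`sort_index(axis=1)` returns a new data frame and leaves the original untouched, so assign the result back if you want to keep the sorted order.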

2. When all data are connected

`In[1]`


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Load the libraries.

`In[2]`


df = pd.read_table('sample2.dat')  # For CSV, use pd.read_csv()
df

`Out[2]`


|  | X | Y | Z |
|:--|:--|:--|:--|
| 0 | 0 | 0.0 | 72.891364 |
| 1 | 1 | 0.0 | 66.015389 |
| 2 | 2 | 0.0 | 56.577833 |
| 3 | 3 | 0.0 | 47.967175 |
| 4 | 4 | 0.0 | 40.049795 |
| 5 | 5 | 0.0 | 33.520995 |
| ... | ... | ... | ... |
| 10195 | 95 | 10.0 | 34.230043 |
| 10196 | 96 | 10.0 | 39.323960 |
| 10197 | 97 | 10.0 | 47.548997 |
| 10198 | 98 | 10.0 | 55.833268 |
| 10199 | 99 | 10.0 | 66.757378 |
| 10200 | 100 | 10.0 | 70.632926 |
10201 rows × 3 columns

The data has been read. Next, aggregate it by $ X $ and $ Y $ using pandas' pivot_table().

`In[3]`


df_pivot = pd.pivot_table(data=df, values='Z', columns='Y', index='X', aggfunc='mean')
df_pivot

`Out[3]`


Y	0.0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	...	9.1	9.2	9.3	9.4	9.5	9.6	9.7	9.8	9.9	10.0
X																					
0	72.891364	71.124620	70.654984	70.037212	70.172797	69.732972	68.793112	65.933899	64.488065	59.392308	...	23.045598	22.673641	22.600140	22.112334	21.315886	21.963097	21.105755	21.827151	21.567903	21.151945
1	66.015389	66.330797	67.099211	69.468310	68.399146	68.998129	70.942877	71.911890	70.655064	68.509530	...	20.235786	21.015988	22.415627	20.175461	20.249661	21.286285	22.163261	20.167906	22.193590	22.611962
2	56.577833	55.291176	57.559546	57.364896	58.140628	61.156353	63.832460	66.498951	67.410308	69.306595	...	20.598574	21.103155	21.149578	21.014833	21.009504	21.841099	21.587648	22.296160	21.123641	22.874411
3	47.967175	47.952907	45.950706	47.029444	48.391456	51.034951	52.411894	56.019204	59.728020	63.807839	...	22.514587	22.240905	22.201533	21.571261	22.403295	21.390697	20.246681	22.210926	21.520711	21.784959
4	40.049795	38.779545	41.234613	40.129730	41.675496	43.363557	44.458340	45.149790	49.194773	53.192819	...	20.296393	21.070061	20.863386	21.854448	21.168673	22.133117	21.882360	20.162296	21.350260	20.466510
5	33.520995	34.508065	33.640439	34.172906	37.379081	35.589519	38.723906	38.893507	40.820806	44.050499	...	21.119756	20.837089	22.140866	23.018667	21.209434	22.741423	20.494395	21.803438	20.179044	21.418150
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	21.609598	21.595324	21.822186	21.381549	21.119773	22.047828	22.708401	21.714110	22.622491	21.829242	...	28.407967	32.270296	37.226492	48.122027	60.701823	69.480529	65.041743	52.447009	40.167799	34.230043
96	22.725472	20.298792	22.131073	20.807929	21.241496	20.429434	21.873849	20.708636	21.940816	21.854451	...	27.856948	29.345740	33.471768	39.978015	50.693978	64.438375	70.241885	63.729539	48.403309	39.323960
97	22.734694	22.755155	21.598300	20.712057	22.349692	21.692798	22.985825	22.995810	20.447362	22.031959	...	27.358939	27.096302	30.500497	35.350520	41.393291	54.023276	65.693318	69.426084	60.709163	47.548997
98	20.429320	20.835029	22.714230	22.396262	22.322744	21.048957	22.671866	21.613990	20.339620	22.711587	...	24.988529	28.050123	29.537614	32.430894	36.235623	44.963558	56.700622	68.762960	69.940436	55.833268
99	21.826368	22.945654	22.277211	20.131568	21.019710	21.633040	21.798181	21.139721	20.183818	22.055120	...	25.848098	27.116651	27.592164	29.924541	31.438265	39.224727	45.971381	60.573153	70.092905	66.757378
100	22.964366	22.522586	22.005465	20.918149	21.038924	22.418933	21.325841	22.340799	20.054492	22.689244	...	23.969884	26.387081	25.109298	26.920826	28.967549	34.624010	38.952119	49.455348	64.123081	70.632926
101 rows × 101 columns

The data now has the same shape as in the first case. All that remains is to plot it with matplotlib.
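To make the reshaping step concrete, here is a small sketch of `pd.pivot_table` on hypothetical long-format data. One point is deliberately measured twice to show what the aggregation function does; the values are made up.

```python
import pandas as pd

# Hypothetical long-format data; the point (X=0, Y=0.0) appears twice.
df_long = pd.DataFrame({
    'X': [0, 0, 1, 0, 1],
    'Y': [0.0, 0.0, 0.0, 0.1, 0.1],
    'Z': [10.0, 12.0, 20.0, 30.0, 40.0],
})

# Rows become X, columns become Y, and duplicate (X, Y) pairs are averaged.
df_wide = pd.pivot_table(data=df_long, values='Z', index='X', columns='Y',
                         aggfunc='mean')
print(df_wide)
```

In the sample data every (X, Y) pair occurs exactly once, so taking the mean simply passes each $ Z $ through unchanged.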

`In[4]`


plt.pcolor(df_pivot.index, df_pivot.columns, df_pivot.T)
plt.colorbar()
plt.axis('tight')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

That completes the color plots for both examples.
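Both plotting cells pass the transposed data frame because the color array must have one row per $ Y $ value and one column per $ X $ value. The sketch below shows this on a hypothetical 3 by 2 grid, using `pcolormesh` with `shading='nearest'` (an assumption that requires Matplotlib 3.3 or later) instead of the `pcolor` call used in the article. The `Agg` backend and the in-memory buffer are only there so the sketch runs without a display; in Jupyter Notebook they are not needed.

```python
import io

import matplotlib
matplotlib.use('Agg')  # Non-interactive backend so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical 3 (X) by 2 (Y) grid in the same layout as df and df_pivot.
grid = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                    index=pd.Index([0, 1, 2], name='X'),
                    columns=pd.Index([0.0, 0.1], name='Y'))

# The color array needs shape (len(Y), len(X)), hence the transpose.
C = grid.T.to_numpy()

fig, ax = plt.subplots()
mesh = ax.pcolormesh(grid.index, grid.columns, C, shading='nearest')
fig.colorbar(mesh, ax=ax)
ax.set_xlabel('X')
ax.set_ylabel('Y')

# Render to an in-memory PNG instead of opening a window.
buf = io.BytesIO()
fig.savefig(buf, format='png')
plt.close(fig)
```

With `shading='nearest'`, each cell is centered on its (X, Y) coordinate, so no row or column of the grid is dropped.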

[Promotion] I have also written the following article, so please take a look if you are interested: Experimental data analysis with DataFrame of Python / pandas (for physicists and engineers)

[^ 1]: Also called an image plot, color map, or heat map. This article uses the term color plot throughout.