Introduction to Pandas¶

Before we jump into Pandas, let us review what we have considered so far.

First, we learned how to read data from files into numpy arrays. We learned how to use variables to store that data, and to either slice the array into a few variables, or use slices themselves for something. We also learned how to make a record array that enabled us to access columns of the array by a name.

When we loaded a json file, we got a dictionary data structure, which also allowed us to access data by a name.

Second, we imported a visualization library, and made plots that used the arrays as arguments.

For “small” data sets, i.e. not too many columns, this is a perfectly reasonable thing to do. For larger datasets, however, it can be tedious to create a lot of variable names, and it is also hard to remember what is in each column.

Many tasks are pretty standard, e.g. read a data set, summarize and visualize it. It would be nice if we had a simple way to do this, with few lines of code, since those lines will be the same every time.

The Pandas library was developed to address all these issues. From the website: “pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”

Review of the numpy array way¶

Let’s review what we learned already.

import matplotlib.pyplot as plt
import numpy as np

data = np.loadtxt("raman.txt")

wavenumber, intensity = data.T  # the transpose has data in rows for unpacking
ind = (wavenumber >= 1000) & (wavenumber < 1500)


plt.figure()
plt.plot(wavenumber[ind], intensity[ind])
plt.xlabel("Wavenumber")
plt.ylabel("Intensity");

Now, with Pandas¶

We will unpack this code shortly. For now, look how short it is to create this plot. Note that we have condensed all the code in the example above basically into three lines of code. That is pretty remarkable, but should give you some pause. We now have to learn how to use such a dense syntax!

import pandas as pd

df = pd.read_csv(
    "raman.txt", delimiter="\t", index_col=0, names=["wavenumber", "intensity"]
)
df[(df.index >= 1000) & (df.index < 1500)].plot();

And to summarize.

df.describe()

	intensity
count	7620.000000
mean	558.388392
std	982.659031
min	77.104279
25%	113.666535
50%	219.409340
75%	552.632848
max	15275.059000

What is the benefit of this dense syntax? Because it is so short, it is faster to type (at least, when you know what to type). That means it is also faster for you to read.

The downside is that it is like learning a whole new language within Python, and a new mental model for how the data is stored and accessed. You have to decide if it is worthwhile doing that. If you do this a lot, it is probably worthwhile.

Pandas¶

The main object we will work with is called a DataFrame.

type(df)

pandas.core.frame.DataFrame

Jupyter notebooks can show you a fancy rendering of your dataframe.

df

	intensity
wavenumber
500.00000	294.378690
500.39374	288.922000
500.78751	286.066220
501.18124	275.222840
501.57501	275.119380
...	...
3498.42500	86.151878
3498.81880	85.178947
3499.21240	87.969734
3499.60620	83.638931
3500.00000	84.009064

7620 rows × 1 columns

The dataframe combines a few ideas we used from arrays and dictionaries. First, we can access a column by name. When we do this, we get a Series object.

type(df["intensity"])

pandas.core.series.Series

You can extract the values into a numpy array like this.

df["intensity"].values

array([294.37869 , 288.922   , 286.06622 , ...,  87.969734,  83.638931,
        84.009064])

A Series (and DataFrame) are like numpy arrays in some ways, and different in others. Suppose we want to see the first five entries of the intensity. If we want to use integer-based indexing like we have so far, you have to use the iloc attribute on the series like this. iloc is for integer location.

df["intensity"].iloc[0:5]

wavenumber
00000    294.37869
39374    288.92200
78751    286.06622
18124    275.22284
57501    275.11938
Name: intensity, dtype: float64

What about the wavenumbers? These are called the index of the dataframe.

df.index

Index([    500.0, 500.39374, 500.78751, 501.18124, 501.57501, 501.96875,
       502.36252, 502.75626, 503.15002, 503.54376,
       ...
       3496.4563, 3496.8501, 3497.2437, 3497.6375, 3498.0313,  3498.425,
       3498.8188, 3499.2124, 3499.6062,    3500.0],
      dtype='float64', name='wavenumber', length=7620)

You can index the index with integers as you can with an array.

df.index[0:5]

Index([500.0, 500.39374, 500.78751, 501.18124, 501.57501], dtype='float64', name='wavenumber')

Finally, you can combine these so that you index a column with a slice of the index like this.

df["intensity"][df.index[0:5]]

wavenumber
00000    294.37869
39374    288.92200
78751    286.06622
18124    275.22284
57501    275.11938
Name: intensity, dtype: float64

In summary, we can think of a dataframe as a hybrid array/dictionary where we have an index which is like the independent variable, and a set of columns that are like dependent variables. You can access the columns like a dictionary.

Dataframes and visualization¶

Dataframes also provide easy access to visualization. The simplest method is to just call the plot method on a dataframe. Note this automatically makes the plot with labels and a legend. If there are many columns, you will have a curve for each one of them. We will see that later.

df.plot();

Reading data in Pandas¶

Let’s get back to how we got the data into Pandas. Let’s retrieve the data file we used before with several columns in it.

fname = "p-t.dat"
url = "https://www.itl.nist.gov/div898/handbook/datasets/MODEL-4_4_4.DAT"

import urllib.request

urllib.request.urlretrieve(url, fname);

Let’s refresh our memory of what is in this file:

! head p-t.dat

 Run          Ambient                            Fitted

Order  Day  Temperature  Temperature  Pressure    Value    Residual

    1      23.820      54.749      225.066   222.920     2.146

    1      24.120      23.323      100.331    99.411     0.920

    1      23.434      58.775      230.863   238.744    -7.881

    1      23.993      25.854      106.160   109.359    -3.199

    1      23.375      68.297      277.502   276.165     1.336

    1      23.233      37.481      148.314   155.056    -6.741

    1      24.162      49.542      197.562   202.456    -4.895

    1      23.667      34.101      138.537   141.770    -3.232

We use Pandas.read_csv to read this, similar to how we used numpy.loadtxt. It also takes a lot of arguments to fine-tune the output. We use spaces as the delimiter here. '\s+' is a regular expression for multiple spaces. We still skip two rows, and we have to manually define the column names. We do not specify an index column here, we get a default one based on integers. Pandas is smart enough to recognize the first two columns are integers, so we do not have to do anything special here.

df = pd.read_csv(
    "p-t.dat",
    delimiter="\s+",
    skiprows=2,
    names=[
        "Run order",
        "Day",
        "Ambient Temperature",
        "Temperature",
        "Pressure",
        "Fitted Value",
        "Residual",
    ],
)
df

	Run order	Day	Ambient Temperature	Temperature	Pressure	Fitted Value	Residual
0	1	1	23.820	54.749	225.066	222.920	2.146
1	2	1	24.120	23.323	100.331	99.411	0.920
2	3	1	23.434	58.775	230.863	238.744	-7.881
3	4	1	23.993	25.854	106.160	109.359	-3.199
4	5	1	23.375	68.297	277.502	276.165	1.336
5	6	1	23.233	37.481	148.314	155.056	-6.741
6	7	1	24.162	49.542	197.562	202.456	-4.895
7	8	1	23.667	34.101	138.537	141.770	-3.232
8	9	1	24.056	33.901	137.969	140.983	-3.014
9	10	1	22.786	29.242	117.410	122.674	-5.263
10	11	2	23.785	39.506	164.442	163.013	1.429
11	12	2	22.987	43.004	181.044	176.759	4.285
12	13	2	23.799	53.226	222.179	216.933	5.246
13	14	2	23.661	54.467	227.010	221.813	5.198
14	15	2	23.852	57.549	232.496	233.925	-1.429
15	16	2	23.379	61.204	253.557	248.288	5.269
16	17	2	24.146	31.489	139.894	131.506	8.388
17	18	2	24.187	68.476	273.931	276.871	-2.940
18	19	2	24.159	51.144	207.969	208.753	-0.784
19	20	2	23.803	68.774	280.205	278.040	2.165
20	21	3	24.381	55.350	227.060	225.282	1.779
21	22	3	24.027	44.692	180.605	183.396	-2.791
22	23	3	24.342	50.995	206.229	208.167	-1.938
23	24	3	23.670	21.602	91.464	92.649	-1.186
24	25	3	24.246	54.673	223.869	222.622	1.247
25	26	3	25.082	41.449	172.910	170.651	2.259
26	27	3	24.575	35.451	152.073	147.075	4.998
27	28	3	23.803	42.989	169.427	176.703	-7.276
28	29	3	24.660	48.599	192.561	198.748	-6.188
29	30	3	24.097	21.448	94.448	92.042	2.406
30	31	4	22.816	56.982	222.794	231.697	-8.902
31	32	4	24.167	47.901	199.003	196.008	2.996
32	33	4	22.712	40.285	168.668	166.077	2.592
33	34	4	23.611	25.609	109.387	108.397	0.990
34	35	4	23.354	22.971	98.445	98.029	0.416
35	36	4	23.669	25.838	110.987	109.295	1.692
36	37	4	23.965	49.127	202.662	200.826	1.835
37	38	4	22.917	54.936	224.773	223.653	1.120
38	39	4	23.546	50.917	216.058	207.859	8.199
39	40	4	24.450	41.976	171.469	172.720	-1.251

The default plot is not that nice.

df.plot();

The default is to plot each column vs the index, which is not that helpful for us. Say we just want to plot the pressure vs. the temperature.

df.plot(x="Temperature", y="Pressure", style="b.");

We can add multiple plots to a figure, but we have to tell the subsequent calls which axes to put them on. To do that, save the first one, and pass it as an argument in subsequent plots. That also allows you to fine-tune the plot appearance, e.g. add a y-label. See the matplotlib documentation to learn how to set all of these.

p1 = df.plot(x="Temperature", y="Pressure", style="b.")
df.plot(x="Temperature", y="Fitted Value", ax=p1)

p1.set_ylabel("values");

It is a reasonable question to ask if this is simpler than what we did before using arrays, variables and plotting commands. Dataframes are increasingly common in data science, and are the data structure used in many data science/machine learning projects.

Another real-life example¶

LAMMPS is a molecular simulation code used to run molecular dynamics. It outputs a text file that is somewhat challenging to read. There are variable numbers of time steps that depend on how the simulation was setup.

Start by downloading and opening this file. It is a molecular dynamics trajectory at constant volume, where the pressure, temperature and energy fluctuate.

Open this file log1.lammpsto get a sense for what is in it. The data starts around:

timestep 0.005
run ${runSteps}
run 500000
Per MPI rank memory allocation (min/avg/max) = 4.427 | 4.427 | 4.427 Mbytes
Step v_mytime Temp Press Volume PotEng TotEng v_pxy v_pxz v_pyz v_v11 v_v22 v_v33 CPU
       0            0         1025    601.28429    8894.6478   -1566.6216   -1500.5083    2065.6285    1713.4095    203.00499 1.3408976e-05 9.2260011e-06 1.2951038e-07            0 w

And it ends around this line.

  500000         2500    978.62359   -2100.7614    8894.6478   -1570.5382   -1507.4162   -252.80665    614.87398    939.65393 0.00045263648 0.00043970796 0.00044228719    1288.0233
Loop time of 1288.02 on 1 procs for 500000 steps with 500 atoms

Our job is to figure out where those lines are so we can read them into Pandas. There are many ways to do this, but we will stick with a pure Python way. The strategy is to search for the lines, and keep track of their positions.

start, stop = None, None
with open("log1.lammps") as f:
    for i, line in enumerate(f):
        if line.startswith("Step v_mytime"):
            start = i
        if line.startswith("Loop time of "):
            stop = i - 1  # stop on the previous line
            break
start, stop

(69, 2570)

This gets tricky. We want to skip the rows up to the starting line. At that point, the line numbers restart as far as Pandas is concerned, so the header is in line 0 then, and the number of rows to read is defined by the stop line minus the start line. The values are separated by multiple spaces, so we use a pattern to indicate multiple spaces. Finally, we prevent the first column from being the index column by setting index_col to be False. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>for all the details.

df = pd.read_csv(
    "log1.lammps",
    skiprows=start,
    header=0,
    nrows=stop - start,
    delimiter="\s+",
    index_col=False,
)
df

/tmp/ipykernel_2193/86227813.py:1: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
  df = pd.read_csv(

	Step	v_mytime	Temp	Press	Volume	PotEng	TotEng	v_pxy	v_pxz	v_pyz	v_v11	v_v22	v_v33	CPU
0	0	0	1025.00000	601.28429	8894.6478	-1566.6216	-1500.5083	2065.62850	1713.409500	203.00499	0.000013	0.000009	1.295104e-07	0.000000
1	200	1	1045.85100	-1974.43580	8894.6478	-1569.6934	-1502.2352	2530.16720	-2203.376800	-2193.88770	0.000423	-0.000564	9.109558e-04	0.505391
2	400	2	1050.44480	2974.54030	8894.6478	-1564.3755	-1496.6210	1446.73930	637.829780	1794.70610	0.001115	0.000125	4.583668e-04	1.018666
3	600	3	1071.37780	2386.37510	8894.6478	-1566.4325	-1497.3278	599.73943	-462.748090	558.54192	0.000766	0.000387	3.082071e-04	1.532061
4	800	4	1055.52810	-661.78795	8894.6478	-1569.3172	-1501.2348	1775.43870	-1551.263500	-493.01032	0.000569	0.000380	2.913082e-04	2.051839
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2496	499200	2496	977.95894	-1747.91220	8894.6478	-1570.0162	-1506.9370	1411.85600	1154.883400	-2265.50500	0.000452	0.000440	4.424962e-04	1285.960300
2497	499400	2497	1066.50870	-77.15260	8894.6478	-1568.3816	-1499.5910	4071.88400	3847.295900	-1279.02860	0.000452	0.000440	4.428113e-04	1286.474100
2498	499600	2498	1052.18860	1958.97410	8894.6478	-1565.8013	-1497.9343	-2152.54460	925.775780	-1162.76120	0.000453	0.000439	4.427215e-04	1286.987000
2499	499800	2499	1057.30140	-1709.95670	8894.6478	-1570.6800	-1502.4832	-1530.34090	-71.479217	731.44735	0.000453	0.000440	4.423706e-04	1287.503200
2500	500000	2500	978.62359	-2100.76140	8894.6478	-1570.5382	-1507.4162	-252.80665	614.873980	939.65393	0.000453	0.000440	4.422872e-04	1288.023300

2501 rows × 14 columns

Visualizing the data¶

Plot a column¶

The effort was worth it though; look how easy it is to plot the data!

df.plot(x="Step", y="Press");

import matplotlib.pyplot as plt

fig, (ax0, ax1) = plt.subplots(1, 2)
df.plot(x="Temp", y="PotEng", style="b.", ax=ax0)
df.plot(x="Press", y="PotEng", style="b.", ax=ax1)
plt.tight_layout();

Plot distributions of a column¶

We can look at histograms of properties as easily.

df.hist("PotEng", xrot=45, bins=20, density=True);

Plot column correlations¶

This is just the beginning of using Pandas. Suppose we want to see which columns are correlated (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html). With variables this would be tedious.

plt.matshow(df.corr());

It can be helpful to see what these correlations mean. Here we plot all the columns against each other. Note, it is not possible to plot a column against itself with Pandas (I think this is a bug https://github.com/pandas-dev/pandas/issues/22088), so here I use matplotlib functions for the plotting. This should be symmetric, so I only plot the upper triangle.

keys = df.keys()

fig, axs = plt.subplots(13, 13)
fig.set_size_inches((8, 8))
for i in range(13):
    for j in range(i, 13):
        axs[i, j].plot(df[keys[i]], df[keys[j]], "b.", ms=2)
        # remove axes so it is easier to read
        axs[i, j].axes.get_xaxis().set_visible(False)
        axs[i, j].axes.get_yaxis().set_visible(False)
        axs[j, i].axes.get_xaxis().set_visible(False)
        axs[j, i].axes.get_yaxis().set_visible(False)

Getting parts of a Pandas DataFrame¶

We have seen how to get a column from a DataFrame like this:

df["Press"]

      601.28429
    -1974.43580
     2974.54030
     2386.37510
     -661.78795
           ...    
 -1747.91220
   -77.15260
  1958.97410
 -1709.95670
 -2100.76140
Name: Press, Length: 2501, dtype: float64

In this context, the DataFrame is acting like a dictionary. You can get a few columns by using a list of column names.

df[["Press", "PotEng"]]

	Press	PotEng
0	601.28429	-1566.6216
1	-1974.43580	-1569.6934
2	2974.54030	-1564.3755
3	2386.37510	-1566.4325
4	-661.78795	-1569.3172
...	...	...
2496	-1747.91220	-1570.0162
2497	-77.15260	-1568.3816
2498	1958.97410	-1565.8013
2499	-1709.95670	-1570.6800
2500	-2100.76140	-1570.5382

2501 rows × 2 columns

What about a row? This is what we would have done with a numpy array, but it just doesn’t work here.

Error

You get a KeyError here.

df[0]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3652, in Index.get_loc(self, key)
   3651 try:
-> 3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[28], line 1
----> 1 df[0]

File /opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3654, in Index.get_loc(self, key)
   3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:
-> 3654     raise KeyError(key) from err
   3655 except TypeError:
   3656     # If we have a listlike key, _check_indexing_error will raise
   3657     #  InvalidIndexError. Otherwise we fall through and re-raise
   3658     #  the TypeError.
   3659     self._check_indexing_error(key)

KeyError: 0

The problem is that as a dictionary, the keys are for the columns.

df.keys()

Index(['Step', 'v_mytime', 'Temp', 'Press', 'Volume', 'PotEng', 'TotEng',
       'v_pxy', 'v_pxz', 'v_pyz', 'v_v11', 'v_v22', 'v_v33', 'CPU'],
      dtype='object')

One way to get the rows by their integer index is to use the integer location attribute for a row.

df.iloc[0]

Step        0.000000e+00
v_mytime    0.000000e+00
Temp        1.025000e+03
Press       6.012843e+02
Volume      8.894648e+03
PotEng     -1.566622e+03
TotEng     -1.500508e+03
v_pxy       2.065628e+03
v_pxz       1.713409e+03
v_pyz       2.030050e+02
v_v11       1.340898e-05
v_v22       9.226001e-06
v_v33       1.295104e-07
CPU         0.000000e+00
Name: 0, dtype: float64

We can use slices on this.

df.iloc[0:5]

	Step	v_mytime	Temp	Press	Volume	PotEng	TotEng	v_pxy	v_pxz	v_pyz	v_v11	v_v22	v_v33	CPU
0	0	0	1025.0000	601.28429	8894.6478	-1566.6216	-1500.5083	2065.62850	1713.40950	203.00499	0.000013	0.000009	1.295104e-07	0.000000
1	200	1	1045.8510	-1974.43580	8894.6478	-1569.6934	-1502.2352	2530.16720	-2203.37680	-2193.88770	0.000423	-0.000564	9.109558e-04	0.505391
2	400	2	1050.4448	2974.54030	8894.6478	-1564.3755	-1496.6210	1446.73930	637.82978	1794.70610	0.001115	0.000125	4.583668e-04	1.018666
3	600	3	1071.3778	2386.37510	8894.6478	-1566.4325	-1497.3278	599.73943	-462.74809	558.54192	0.000766	0.000387	3.082071e-04	1.532061
4	800	4	1055.5281	-661.78795	8894.6478	-1569.3172	-1501.2348	1775.43870	-1551.26350	-493.01032	0.000569	0.000380	2.913082e-04	2.051839

This example may be a little confusing, because our index does include 0, so we can in this case also use the row label with the location attribute. You can use any value in the index for this.

df.index

RangeIndex(start=0, stop=2501, step=1)

df.loc[0]

Step        0.000000e+00
v_mytime    0.000000e+00
Temp        1.025000e+03
Press       6.012843e+02
Volume      8.894648e+03
PotEng     -1.566622e+03
TotEng     -1.500508e+03
v_pxy       2.065628e+03
v_pxz       1.713409e+03
v_pyz       2.030050e+02
v_v11       1.340898e-05
v_v22       9.226001e-06
v_v33       1.295104e-07
CPU         0.000000e+00
Name: 0, dtype: float64

We can access the first five rows like this.

df.loc[0:4]

	Step	v_mytime	Temp	Press	Volume	PotEng	TotEng	v_pxy	v_pxz	v_pyz	v_v11	v_v22	v_v33	CPU
0	0	0	1025.0000	601.28429	8894.6478	-1566.6216	-1500.5083	2065.62850	1713.40950	203.00499	0.000013	0.000009	1.295104e-07	0.000000
1	200	1	1045.8510	-1974.43580	8894.6478	-1569.6934	-1502.2352	2530.16720	-2203.37680	-2193.88770	0.000423	-0.000564	9.109558e-04	0.505391
2	400	2	1050.4448	2974.54030	8894.6478	-1564.3755	-1496.6210	1446.73930	637.82978	1794.70610	0.001115	0.000125	4.583668e-04	1.018666
3	600	3	1071.3778	2386.37510	8894.6478	-1566.4325	-1497.3278	599.73943	-462.74809	558.54192	0.000766	0.000387	3.082071e-04	1.532061
4	800	4	1055.5281	-661.78795	8894.6478	-1569.3172	-1501.2348	1775.43870	-1551.26350	-493.01032	0.000569	0.000380	2.913082e-04	2.051839

And a slice of a column like this.

df.loc[0:4, "Press"]

   601.28429
 -1974.43580
  2974.54030
  2386.37510
  -661.78795
Name: Press, dtype: float64

We can access a value in a row and column with the at function on a DataFrame.

df.at[2, "Press"]

2974.5403

Or if you know the row and column numbers you can use iat.

df.iat[2, 3]

2974.5403

Operating on columns in the DataFrame¶

Some functions just work across the columns. For example, DataFrames have statistics functions like this.

df.mean()

Step        250000.000000
v_mytime      1250.000000
Temp          1025.136938
Press          188.035304
Volume        8894.647800
PotEng       -1567.935373
TotEng       -1501.813207
v_pxy           16.669575
v_pxz           -4.852837
v_pyz           14.175372
v_v11            0.000465
v_v22            0.000429
v_v33            0.000428
CPU            644.103768
dtype: float64

We should tread carefully with other functions that work on arrays. For example consider this example that computes the mean of an entire array.

a = np.array([[1, 1, 1], [2, 2, 2]])
np.mean(a)

1.5

It does not do the same thing on a DataFrame. The index and column labels are preserved with numpy functions.

import numpy as np

np.mean(df)  # takes mean along axis 0

18497.01204734716

np.max(df)

500000.0

np.exp(df)

/opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/core/internals/blocks.py:329: RuntimeWarning: overflow encountered in exp
  result = func(self.values, **kwargs)

	Step	v_mytime	Temp	Press	Volume	PotEng	TotEng	v_pxy	v_pxz	v_pyz	v_v11	v_v22	v_v33	CPU
0	1.000000e+00	1.000000	inf	1.362854e+261	inf	0.0	0.0	inf	inf	1.458636e+88	1.000013	1.000009	1.000000	1.000000
1	7.225974e+86	2.718282	inf	0.000000e+00	inf	0.0	0.0	inf	0.000000e+00	0.000000e+00	1.000423	0.999436	1.000911	1.657634
2	5.221470e+173	7.389056	inf	inf	inf	0.0	0.0	inf	1.013804e+277	inf	1.001116	1.000125	1.000458	2.769498
3	3.773020e+260	20.085537	inf	inf	inf	0.0	0.0	2.907536e+260	1.074133e-201	3.729699e+242	1.000766	1.000387	1.000308	4.627705
4	inf	54.598150	inf	3.882801e-288	inf	0.0	0.0	inf	0.000000e+00	7.732831e-215	1.000569	1.000380	1.000291	7.782200
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2496	inf	inf	inf	0.000000e+00	inf	0.0	0.0	inf	inf	0.000000e+00	1.000452	1.000440	1.000443	inf
2497	inf	inf	inf	3.112086e-34	inf	0.0	0.0	inf	inf	0.000000e+00	1.000452	1.000440	1.000443	inf
2498	inf	inf	inf	inf	inf	0.0	0.0	0.000000e+00	inf	0.000000e+00	1.000453	1.000439	1.000443	inf
2499	inf	inf	inf	0.000000e+00	inf	0.0	0.0	0.000000e+00	9.056711e-32	inf	1.000453	1.000440	1.000442	inf
2500	inf	inf	inf	0.000000e+00	inf	0.0	0.0	1.612378e-110	1.087368e+267	inf	1.000453	1.000440	1.000442	inf

2501 rows × 14 columns

2 * df

	Step	v_mytime	Temp	Press	Volume	PotEng	TotEng	v_pxy	v_pxz	v_pyz	v_v11	v_v22	v_v33	CPU
0	0	0	2050.00000	1202.56858	17789.2956	-3133.2432	-3001.0166	4131.25700	3426.819000	406.00998	0.000027	0.000018	2.590208e-07	0.000000
1	400	2	2091.70200	-3948.87160	17789.2956	-3139.3868	-3004.4704	5060.33440	-4406.753600	-4387.77540	0.000846	-0.001128	1.821912e-03	1.010782
2	800	4	2100.88960	5949.08060	17789.2956	-3128.7510	-2993.2420	2893.47860	1275.659560	3589.41220	0.002231	0.000249	9.167335e-04	2.037332
3	1200	6	2142.75560	4772.75020	17789.2956	-3132.8650	-2994.6556	1199.47886	-925.496180	1117.08384	0.001531	0.000774	6.164143e-04	3.064122
4	1600	8	2111.05620	-1323.57590	17789.2956	-3138.6344	-3002.4696	3550.87740	-3102.527000	-986.02064	0.001137	0.000760	5.826165e-04	4.103678
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2496	998400	4992	1955.91788	-3495.82440	17789.2956	-3140.0324	-3013.8740	2823.71200	2309.766800	-4531.01000	0.000905	0.000880	8.849924e-04	2571.920600
2497	998800	4994	2133.01740	-154.30520	17789.2956	-3136.7632	-2999.1820	8143.76800	7694.591800	-2558.05720	0.000905	0.000879	8.856225e-04	2572.948200
2498	999200	4996	2104.37720	3917.94820	17789.2956	-3131.6026	-2995.8686	-4305.08920	1851.551560	-2325.52240	0.000906	0.000879	8.854430e-04	2573.974000
2499	999600	4998	2114.60280	-3419.91340	17789.2956	-3141.3600	-3004.9664	-3060.68180	-142.958434	1462.89470	0.000905	0.000879	8.847412e-04	2575.006400
2500	1000000	5000	1957.24718	-4201.52280	17789.2956	-3141.0764	-3014.8324	-505.61330	1229.747960	1879.30786	0.000905	0.000879	8.845744e-04	2576.046600

2501 rows × 14 columns

We can apply a function to the DataFrame. The default is the columns (axis=0). Either way, we get a new DataFrame.

def minmax(roworcolumn):
    return np.min(roworcolumn), np.max(roworcolumn)


df.apply(minmax)

	Step	v_mytime	Temp	Press	Volume	PotEng	TotEng	v_pxy	v_pxz	v_pyz	v_v11	v_v22	v_v33	CPU
0	0	0	883.05875	-6247.6768	8894.6478	-1576.8745	-1513.6955	-6747.6227	-7547.8097	-7509.4468	0.000013	-0.000564	1.295104e-07	0.0000
1	500000	2500	1154.17900	6673.7634	8894.6478	-1559.1555	-1488.9586	7428.1767	6523.4548	6229.5043	0.001115	0.000531	9.109558e-04	1288.0233

Here we analyze across the rows.

df.apply(minmax, axis=1)

     (-1566.6216, 8894.6478)
     (-2203.3768, 8894.6478)
     (-1564.3755, 8894.6478)
     (-1566.4325, 8894.6478)
     (-1569.3172, 8894.6478)
                 ...           
    (-2265.505, 499200.0)
   (-1568.3816, 499400.0)
   (-2152.5446, 499600.0)
   (-1709.9567, 499800.0)
   (-2100.7614, 500000.0)
Length: 2501, dtype: object

Summary¶

Pandas is a multipurpose data science tool. In many ways it is like a numpy array, and in many ways it is different. In some ways it is like a dictionary.

The similarities include the ability to do some indexing and slicing. This is only a partial similarity though.

The differences include integrated plotting.

Data science and machine learning in science and engineering

Introduction to Pandas

Contents

Introduction to Pandas¶

Review of the numpy array way¶

Now, with Pandas¶

Pandas¶

Dataframes and visualization¶

Reading data in Pandas¶

Another real-life example¶

Visualizing the data¶

Plot a column¶

Plot distributions of a column¶

Plot column correlations¶

Getting parts of a Pandas DataFrame¶

Operating on columns in the DataFrame¶

Summary¶