Intermediate Pandas¶

Today we look at some ways to use Pandas DataFrames like databases.

Revisiting a previous example with batches of data¶

We start with the example we looked at before: a dataset from a set of experiments, where the experiments are grouped by the Day they were run on. We will use Pandas to do some analysis by day.

import pandas as pd

df = pd.read_csv(
    "p-t.dat",
    delimiter=r"\s+",
    skiprows=2,
    names=[
        "Run order",
        "Day",
        "Ambient Temperature",
        "Temperature",
        "Pressure",
        "Fitted Value",
        "Residual",
    ],
)
df
Run order Day Ambient Temperature Temperature Pressure Fitted Value Residual
0 1 1 23.820 54.749 225.066 222.920 2.146
1 2 1 24.120 23.323 100.331 99.411 0.920
2 3 1 23.434 58.775 230.863 238.744 -7.881
3 4 1 23.993 25.854 106.160 109.359 -3.199
4 5 1 23.375 68.297 277.502 276.165 1.336
5 6 1 23.233 37.481 148.314 155.056 -6.741
6 7 1 24.162 49.542 197.562 202.456 -4.895
7 8 1 23.667 34.101 138.537 141.770 -3.232
8 9 1 24.056 33.901 137.969 140.983 -3.014
9 10 1 22.786 29.242 117.410 122.674 -5.263
10 11 2 23.785 39.506 164.442 163.013 1.429
11 12 2 22.987 43.004 181.044 176.759 4.285
12 13 2 23.799 53.226 222.179 216.933 5.246
13 14 2 23.661 54.467 227.010 221.813 5.198
14 15 2 23.852 57.549 232.496 233.925 -1.429
15 16 2 23.379 61.204 253.557 248.288 5.269
16 17 2 24.146 31.489 139.894 131.506 8.388
17 18 2 24.187 68.476 273.931 276.871 -2.940
18 19 2 24.159 51.144 207.969 208.753 -0.784
19 20 2 23.803 68.774 280.205 278.040 2.165
20 21 3 24.381 55.350 227.060 225.282 1.779
21 22 3 24.027 44.692 180.605 183.396 -2.791
22 23 3 24.342 50.995 206.229 208.167 -1.938
23 24 3 23.670 21.602 91.464 92.649 -1.186
24 25 3 24.246 54.673 223.869 222.622 1.247
25 26 3 25.082 41.449 172.910 170.651 2.259
26 27 3 24.575 35.451 152.073 147.075 4.998
27 28 3 23.803 42.989 169.427 176.703 -7.276
28 29 3 24.660 48.599 192.561 198.748 -6.188
29 30 3 24.097 21.448 94.448 92.042 2.406
30 31 4 22.816 56.982 222.794 231.697 -8.902
31 32 4 24.167 47.901 199.003 196.008 2.996
32 33 4 22.712 40.285 168.668 166.077 2.592
33 34 4 23.611 25.609 109.387 108.397 0.990
34 35 4 23.354 22.971 98.445 98.029 0.416
35 36 4 23.669 25.838 110.987 109.295 1.692
36 37 4 23.965 49.127 202.662 200.826 1.835
37 38 4 22.917 54.936 224.773 223.653 1.120
38 39 4 23.546 50.917 216.058 207.859 8.199
39 40 4 24.450 41.976 171.469 172.720 -1.251

The first aggregation we will look at is how to make groups of data that are related by values in a column. We use the groupby method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby) and specify a column to group on. The result is a DataFrameGroupBy object, which we then work with.

groups = df.groupby("Day")
type(groups)
pandas.core.groupby.generic.DataFrameGroupBy

The groups can describe themselves. Here we see there are 4 groups, one for each day, along with some statistics about each group. We do not need those for now.

groups.describe()
Run order Ambient Temperature ... Fitted Value Residual
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Day
1 10.0 5.5 3.02765 1.0 3.25 5.5 7.75 10.0 10.0 23.6646 ... 217.80400 276.165 10.0 -2.9823 3.452383 -7.881 -5.17100 -3.2155 -0.06350 2.146
2 10.0 15.5 3.02765 11.0 13.25 15.5 17.75 20.0 10.0 23.7758 ... 244.69725 278.040 10.0 2.6827 3.606824 -2.940 -0.23075 3.2250 5.23400 8.388
3 10.0 25.5 3.02765 21.0 23.25 25.5 27.75 30.0 10.0 24.2883 ... 205.81225 225.282 10.0 -0.6690 3.948274 -7.276 -2.57775 0.0305 2.13900 4.998
4 10.0 35.5 3.02765 31.0 33.25 35.5 37.75 40.0 10.0 23.5207 ... 206.10075 231.697 10.0 0.9687 4.255487 -8.902 0.55950 1.4060 2.40275 8.199

4 rows × 48 columns
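
As a quick aside (not in the original analysis), if all you want is the number of rows in each group, the size method gives a much more compact summary.

groups.size()  # a Series with the row count for each Day; here, 10 per day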

We can get a dictionary mapping each group name to the row labels in that group from the groups attribute.

groups.groups
{1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 2: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 3: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 4: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]}

We can get the subset of rows from those group labels.

df.loc[groups.groups[2]]
Run order Day Ambient Temperature Temperature Pressure Fitted Value Residual
10 11 2 23.785 39.506 164.442 163.013 1.429
11 12 2 22.987 43.004 181.044 176.759 4.285
12 13 2 23.799 53.226 222.179 216.933 5.246
13 14 2 23.661 54.467 227.010 221.813 5.198
14 15 2 23.852 57.549 232.496 233.925 -1.429
15 16 2 23.379 61.204 253.557 248.288 5.269
16 17 2 24.146 31.489 139.894 131.506 8.388
17 18 2 24.187 68.476 273.931 276.871 -2.940
18 19 2 24.159 51.144 207.969 208.753 -0.784
19 20 2 23.803 68.774 280.205 278.040 2.165
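
A more direct way to get the same subset is the get_group method, which takes a group key and returns the corresponding rows.

groups.get_group(2)  # same rows as df.loc[groups.groups[2]]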

We don’t usually work with groups that way, though; it is more common to do some analysis on each group.

Suppose we want to plot Pressure vs. Temperature for each group, so we can see visually if there are any trends that could be attributed to the group. To do this, we iterate over the groups and plot each one.

A DataFrameGroupBy is iterable; when you loop over it, you get the key each group was grouped on and a DataFrame containing the rows in that group. Here we loop over the groups and plot each one in a different color.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for day, group in groups:
    group.plot("Temperature", "Pressure", ax=ax, label=f"{day}", style="o")
plt.ylabel("Pressure");
[Figure: Pressure vs. Temperature, with each Day plotted in a different color]

The point of this plot is that we cannot see any visual clustering of the groups by day. That is important, because if we could, it would suggest something was different on that day.
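
We can back up that visual check with a numeric one. As a small addition to the original analysis, aggregating the residuals by day shows whether any day is systematically offset from the fit.

groups["Residual"].agg(["mean", "std"])  # per-day mean and spread of the residuals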

Combining data sets¶

Siddhant Lambor provided data from two experiments conducted to measure the properties of a worm-like micelle solution. He carried out experiments on a rheometer to measure the viscosity of the solution in a Couette cell geometry and a cone and plate geometry. Ideally, there should be no difference, since viscosity is intrinsic to the fluid. Analysis of this data will confirm whether that is true. First, we read the data in from the two data files.

  • cp.xls Cone and plate data

  • couette.xls Couette data

couette = pd.read_excel(
    "couette.xls", sheet_name="Flow sweep - 1", header=1
)  # note the sheet name is case sensitive

couette
Stress Shear rate Viscosity Step time Temperature Normal stress
0 Pa 1/s Pa.s s °C Pa
1 0.009981 0.04953 0.201504 35.1469 25.001 -0.001263
2 0.015817 0.079875 0.198028 70.1689 25.001 -0.000835
3 0.025071 0.127313 0.196926 105.253 25 -0.000913
4 0.039734 0.204094 0.194685 140.26 25 -0.001132
5 0.062977 0.327253 0.192442 175.313 25 -0.002026
6 0.099806 0.530364 0.188183 210.32 24.998 -0.002429
7 0.158165 0.875494 0.180658 245.373 25 -0.001837
8 0.250637 1.49135 0.168061 280.426 25.002 -0.001655
9 0.397131 2.64707 0.150027 315.433 25.001 -0.001889
10 0.629111 5.00904 0.125595 350.501 25.003 -0.002233
11 0.996153 10.3993 0.09579 385.508 24.996 -0.001807
12 1.57528 24.5624 0.064134 420.608 24.998 -0.001823
13 2.47652 67.2972 0.0368 455.63 25.003 -0.00182
14 3.75092 210.991 0.017778 490.668 25 -0.002939

We can drop the row at index 0; it just has the units in it. With this syntax, we have to save the resulting DataFrame back into the variable, or the change will not be kept.

couette = couette.drop(0)
couette
Stress Shear rate Viscosity Step time Temperature Normal stress
1 0.009981 0.04953 0.201504 35.1469 25.001 -0.001263
2 0.015817 0.079875 0.198028 70.1689 25.001 -0.000835
3 0.025071 0.127313 0.196926 105.253 25 -0.000913
4 0.039734 0.204094 0.194685 140.26 25 -0.001132
5 0.062977 0.327253 0.192442 175.313 25 -0.002026
6 0.099806 0.530364 0.188183 210.32 24.998 -0.002429
7 0.158165 0.875494 0.180658 245.373 25 -0.001837
8 0.250637 1.49135 0.168061 280.426 25.002 -0.001655
9 0.397131 2.64707 0.150027 315.433 25.001 -0.001889
10 0.629111 5.00904 0.125595 350.501 25.003 -0.002233
11 0.996153 10.3993 0.09579 385.508 24.996 -0.001807
12 1.57528 24.5624 0.064134 420.608 24.998 -0.001823
13 2.47652 67.2972 0.0368 455.63 25.003 -0.00182
14 3.75092 210.991 0.017778 490.668 25 -0.002939
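
As an aside, you could avoid reading the units row in the first place. Here is a minimal sketch, assuming the units occupy the third row of the sheet (row index 2); we use a hypothetical separate variable so the rest of this section keeps working with the drop(0) version.

# alternative (not used below): skip the units row at read time, so the
# columns come in as numbers instead of strings
couette_alt = pd.read_excel(
    "couette.xls", sheet_name="Flow sweep - 1", header=1, skiprows=[2]
)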

There is a second file called cp.xls that we want to combine with this one. Here, we chain the drop call onto the read so it all happens in one line.

conePlate = pd.read_excel("cp.xls", sheet_name="Flow sweep - 1", header=1).drop(0)
conePlate.head(5)
Stress Shear rate Viscosity Step time Temperature Normal stress
1 0.009984 0.053193 0.187693 34.9909 25 1.2658
2 0.015822 0.082 0.19295 70.0909 25.001 0.569149
3 0.025078 0.129969 0.192952 105.16 25 0.295899
4 0.039759 0.20176 0.19706 140.213 25 1.0171
5 0.062988 0.336087 0.187415 175.282 25 0.546196

For this analysis, we are only interested in the shear rate, stress, and viscosity values, so let us drop the other columns. We do that by name, and specify inplace=True, which modifies the DataFrame itself.

conePlate.drop(["Temperature", "Step time", "Normal stress"], axis=1, inplace=True)
# If we did not use inplace=True, drop would by default return a new DataFrame and
# leave this one unchanged, so we would have to assign the result to capture the change.
conePlate.head(5)
Stress Shear rate Viscosity
1 0.009984 0.053193 0.187693
2 0.015822 0.082 0.19295
3 0.025078 0.129969 0.192952
4 0.039759 0.20176 0.19706
5 0.062988 0.336087 0.187415

We also do that for the couette data. Here we did not use inplace=True, so we have to save the result back into the variable to keep the change.

couette = couette.drop(
    ["Temperature", "Step time", "Normal stress"], axis=1
)  # without using inplace = True
couette.head(5)
Stress Shear rate Viscosity
1 0.009981 0.04953 0.201504
2 0.015817 0.079875 0.198028
3 0.025071 0.127313 0.196926
4 0.039734 0.204094 0.194685
5 0.062977 0.327253 0.192442
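
An equivalent approach is to select only the columns you want to keep, rather than dropping the ones you do not.

couette[["Stress", "Shear rate", "Viscosity"]]  # same result as the drop above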

We can see info about each DataFrame like this.

couette.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 1 to 14
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Stress      14 non-null     object
 1   Shear rate  14 non-null     object
 2   Viscosity   14 non-null     object
dtypes: object(3)
memory usage: 468.0+ bytes
conePlate.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 1 to 17
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Stress      17 non-null     object
 1   Shear rate  17 non-null     object
 2   Viscosity   17 non-null     object
dtypes: object(3)
memory usage: 540.0+ bytes
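
Note the Dtype column: each column has dtype object, because the units row forced pandas to read these columns as mixed text and numbers, and dropping that row does not change the dtype. The plots below still work, but for numeric work it is worth converting explicitly. A minimal sketch, as an addition to the original analysis:

# convert every remaining column to a numeric dtype
couette = couette.apply(pd.to_numeric)
conePlate = conePlate.apply(pd.to_numeric)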

We could proceed to analyze the DataFrames separately, but instead, we will combine them into one DataFrame. Before doing that, we need to add a column to each one so we know which data set is which. Simply assigning a value to a new column name will do that.

couette["type"] = "couette"
couette
Stress Shear rate Viscosity type
1 0.009981 0.04953 0.201504 couette
2 0.015817 0.079875 0.198028 couette
3 0.025071 0.127313 0.196926 couette
4 0.039734 0.204094 0.194685 couette
5 0.062977 0.327253 0.192442 couette
6 0.099806 0.530364 0.188183 couette
7 0.158165 0.875494 0.180658 couette
8 0.250637 1.49135 0.168061 couette
9 0.397131 2.64707 0.150027 couette
10 0.629111 5.00904 0.125595 couette
11 0.996153 10.3993 0.09579 couette
12 1.57528 24.5624 0.064134 couette
13 2.47652 67.2972 0.0368 couette
14 3.75092 210.991 0.017778 couette
conePlate["type"] = "cone"

Now, we can combine these into a single DataFrame. This is not critical, and you can get by without it, but I want to explore the idea and illustrate that it is possible.

df = pd.concat([conePlate, couette])
df
Stress Shear rate Viscosity type
1 0.009984 0.053193 0.187693 cone
2 0.015822 0.082 0.19295 cone
3 0.025078 0.129969 0.192952 cone
4 0.039759 0.20176 0.19706 cone
5 0.062988 0.336087 0.187415 cone
6 0.09982 0.556749 0.179291 cone
7 0.15819 0.925403 0.170942 cone
8 0.250701 1.57526 0.159148 cone
9 0.397263 2.782 0.142798 cone
10 0.629346 5.16333 0.121888 cone
11 0.996863 10.4029 0.095826 cone
12 1.57768 23.6558 0.066693 cone
13 2.49053 62.5901 0.039791 cone
14 3.85502 194.265 0.019844 cone
15 5.45425 795.474 0.006857 cone
16 8.62182 2011.14 0.004287 cone
17 13.7498 3633.47 0.003784 cone
1 0.009981 0.04953 0.201504 couette
2 0.015817 0.079875 0.198028 couette
3 0.025071 0.127313 0.196926 couette
4 0.039734 0.204094 0.194685 couette
5 0.062977 0.327253 0.192442 couette
6 0.099806 0.530364 0.188183 couette
7 0.158165 0.875494 0.180658 couette
8 0.250637 1.49135 0.168061 couette
9 0.397131 2.64707 0.150027 couette
10 0.629111 5.00904 0.125595 couette
11 0.996153 10.3993 0.09579 couette
12 1.57528 24.5624 0.064134 couette
13 2.47652 67.2972 0.0368 couette
14 3.75092 210.991 0.017778 couette
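
Notice the row labels 1-14 appear twice, because concat keeps each frame's original index by default. That is harmless here, but if you prefer a unique index, concat accepts ignore_index=True.

df = pd.concat([conePlate, couette], ignore_index=True)  # rows renumbered 0..30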

Finally, we are ready for the visualization. We will group the DataFrame and then make plots for each group. Here we illustrate several new plotting options, including log-log axes, a secondary y-axis, colored tick labels, and multiple legends.

g = df.groupby("type")
ax1 = g.get_group("cone").plot(
    "Shear rate", "Viscosity", logx=True, logy=True, style="b.-", label="CP viscosity"
)

g.get_group("couette").plot(
    "Shear rate",
    "Viscosity",
    logx=True,
    logy=True,
    style="g.-",
    ax=ax1,
    label="Couette viscosity",
)

ax2 = g.get_group("cone").plot(
    "Shear rate",
    "Stress",
    secondary_y=True,
    logx=True,
    logy=True,
    style="r.-",
    ax=ax1,
    label="CP stress",
)

g.get_group("couette").plot(
    "Shear rate",
    "Stress",
    secondary_y=True,
    logx=True,
    logy=True,
    style="y.",
    ax=ax2,
    label="Couette Stress",
)

# Setting y axis labels
ax1.set_ylabel("Viscosity (Pa.s)", color="b")
[ticklabel.set_color("b") for ticklabel in ax1.get_yticklabels()]

ax2.set_ylabel("Stress (Pa)", color="r")
[ticklabel.set_color("r") for ticklabel in ax1.get_yticklabels()]

# setting legend locations
ax1.legend(loc=6)
ax2.legend(loc=7)

ax1.set_xlabel("Shear rate (1/s)")
plt.title("Comparison of Cone and Plate with Couette Cell")
Text(0.5, 1.0, 'Comparison of Cone and Plate with Couette Cell')
[Figure: log-log plot of Viscosity (left axis) and Stress (right axis) vs. Shear rate for the two geometries]

So, in fact, we can see that these two experiments give practically equivalent results.
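
To put a rough number on that visual agreement, here is one approach (an addition, not part of the original notebook): interpolate the cone and plate viscosity at the Couette shear rates in log-log space, and look at the relative difference. The variable names cp and cou are just for illustration.

import numpy as np

cp = g.get_group("cone")
cou = g.get_group("couette")

# interpolate log(Viscosity) vs log(Shear rate) from the cone and plate data
# at the Couette shear rates, then compare with the measured Couette viscosity
log_visc = np.interp(
    np.log(cou["Shear rate"].astype(float)),
    np.log(cp["Shear rate"].astype(float)),
    np.log(cp["Viscosity"].astype(float)),
)
rel_diff = np.exp(log_visc) / cou["Viscosity"].astype(float) - 1
print(rel_diff.abs().max())  # a small value means the two geometries agree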