Using Pandas DataFrames as a small database¶

Pandas DataFrames can serve as small databases. You can use them to construct tables from which you can run queries to compute things. In this example, we will explore a database of molecules and atomic element properties.

A Table for the chemical elements and their atomic masses¶

We will use the Atomic Simulation Environment library for this. It is already installed in Deepnote.

We will use it because it has data about the chemical elements, and some molecules we will use.

First, we make a DataFrame containing the chemical elements, and their atomic masses.

import ase
import numpy as np
import pandas as pd

dtypes = np.dtype([("symbol", str), ("atomic mass", float)])
data = np.empty(0, dtype=dtypes)

elements = pd.DataFrame(data)
elements["symbol"] = ase.data.chemical_symbols
elements["atomic mass"] = ase.data.atomic_masses
elements.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   symbol       119 non-null    object 
 1   atomic mass  119 non-null    float64
dtypes: float64(1), object(1)
memory usage: 2.0+ KB

You can use a query function to select rows from the database that meet some criteria.

?elements.query

For example, to get Carbon we can use a query like this.

elements.query('symbol == "C"')
symbol atomic mass
6 C 12.011

We can do some things with python variables like this.

sym = "H"
elements.query("symbol == @sym")
symbol atomic mass
1 H 1.008

Although, f-strings also work for this.

sym = "H"
elements.query(f'symbol == "{sym}"')
symbol atomic mass
1 H 1.008

The atomic mass is trickier to work with because it has a space in it. We have to use back-ticks to “quote” this.

elements.query("`atomic mass` < 4")
symbol atomic mass
0 X 1.000
1 H 1.008

Note it does not appear you can use the @ or f-string on column names. Also note the mysterious element X in the table. We can ignore that.

A Table for some molecules¶

Next, let’s build the molecule database. This will be a table where each row corresponds to an atom in a molecule. We will be able to get the atoms in a molecule by aggregating these on a Molecule-ID. We build this table up row by row, by first getting a molecule, and then iterating over each atom in the molecule. It is conventional to use an integer for an ID, but we will just use the molecular formula in this example. That has some limitations for larger databases (e.g. isomers have different properties, but the same molecular formula), but we will not have that problem here.

df = pd.DataFrame(columns=["Molecule-ID", "Atom symbol", "x", "y", "z"])

from ase.build import molecule

i = 0

for mlc in ["H2O", "NH3", "CH4"]:
    for atom in molecule(mlc):
        df.loc[i] = [mlc, atom.symbol, atom.x, atom.y, atom.z]
        i += 1
df
Molecule-ID Atom symbol x y z
0 H2O O 0.000000 0.000000 0.119262
1 H2O H 0.000000 0.763239 -0.477047
2 H2O H 0.000000 -0.763239 -0.477047
3 NH3 N 0.000000 0.000000 0.116489
4 NH3 H 0.000000 0.939731 -0.271808
5 NH3 H 0.813831 -0.469865 -0.271808
6 NH3 H -0.813831 -0.469865 -0.271808
7 CH4 C 0.000000 0.000000 0.000000
8 CH4 H 0.629118 0.629118 0.629118
9 CH4 H -0.629118 -0.629118 0.629118
10 CH4 H 0.629118 -0.629118 -0.629118
11 CH4 H -0.629118 0.629118 -0.629118

What molecules are in the database?¶

Ok, now we are ready to do some queries. First, let’s see what molecules we have in our database. We want the unique values of the ‘Molecule-ID’ column.

df["Molecule-ID"].unique()
array(['H2O', 'NH3', 'CH4'], dtype=object)

Which molecules have three H atoms?¶

Now, how do we find molecules that have 3 H atoms? We need to do some grouping. First, we select the H rows, and then we group by the Molecule-ID Then, we need a count of each sub group. I prefer the size function for this over count. size returns a Series, while count seems to return a DataFrame.

tf = df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]).size()
tf
Molecule-ID
CH4    4
H2O    2
NH3    3
dtype: int64

Finally, we can select the rows that have 3 hydrogen atoms.

tf[tf == 3]
Molecule-ID
NH3    3
dtype: int64

Getting the molecular weight¶

Getting the molecular weight requires us to combine information from two DataFrames. To do this, we need to merge them, aligning the rows on a common key. That key is the Atom symbol in the molecule DataFrame, and symbol in the elements DataFrame. Then we have to do the right grouping, and use the sum aggregation method on each group.

mf = pd.merge(df, elements, how="inner", left_on="Atom symbol", right_on="symbol")
mf
Molecule-ID Atom symbol x y z symbol atomic mass
0 H2O O 0.000000 0.000000 0.119262 O 15.999
1 H2O H 0.000000 0.763239 -0.477047 H 1.008
2 H2O H 0.000000 -0.763239 -0.477047 H 1.008
3 NH3 H 0.000000 0.939731 -0.271808 H 1.008
4 NH3 H 0.813831 -0.469865 -0.271808 H 1.008
5 NH3 H -0.813831 -0.469865 -0.271808 H 1.008
6 CH4 H 0.629118 0.629118 0.629118 H 1.008
7 CH4 H -0.629118 -0.629118 0.629118 H 1.008
8 CH4 H 0.629118 -0.629118 -0.629118 H 1.008
9 CH4 H -0.629118 0.629118 -0.629118 H 1.008
10 NH3 N 0.000000 0.000000 0.116489 N 14.007
11 CH4 C 0.000000 0.000000 0.000000 C 12.011

Now, we group by the Molecule-ID, select the atomic mass column, and aggregate with the sum.

MW = mf.groupby("Molecule-ID")["atomic mass"].sum()
MW
Molecule-ID
CH4    16.043
H2O    18.015
NH3    17.031
Name: atomic mass, dtype: float64

Here is one of many ways to print this in a different format:

for mlc, mw in MW.items():
    print(f"The molecular weight of {mlc} is {mw} gm/mol")
The molecular weight of CH4 is 16.043 gm/mol
The molecular weight of H2O is 18.015 gm/mol
The molecular weight of NH3 is 17.031 gm/mol

Chaining commands in Pandas¶

So far, we have mostly seen sequential commands in Pandas

tf = df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]).size()
tf[tf == 3]
Molecule-ID
NH3    3
dtype: int64

The query lets us chain these all into one line.

df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]).count().query(
    "`Atom symbol` == 3"
)
Atom symbol x y z
Molecule-ID
NH3 3 3 3 3

It is common to see this syntax where parentheses allow us to separate these into multiple lines. This may enhance readability.

(
    df.query('`Atom symbol` == "H"')
    .groupby(["Molecule-ID"])
    .count()
    .query("`Atom symbol` == 3")
)
Atom symbol x y z
Molecule-ID
NH3 3 3 3 3

The main benefit of chaining is that you do not have to define temporary variables that exist only so you can reuse them in subsequent lines. The downside is it is more challenging to debug them, and it is common to build them iteratively in a notebook.

Here is another example of chaining to get the molecular weight of water.

(
    pd.merge(df, elements, how="inner", left_on="Atom symbol", right_on="symbol")
    .groupby("Molecule-ID")["atomic mass"]
    .sum()["H2O"]
)
18.015

Subtle points¶

Pandas offers many ways to do what appear to be the same thing, but they are not. For example, this works:

(
    df.query('`Atom symbol` == "H"')
    .groupby(["Molecule-ID"])
    .count()
    .query("`Atom symbol` == 3")
)
Atom symbol x y z
Molecule-ID
NH3 3 3 3 3

And this doesn’t.

(
    df.query('`Atom symbol` == "H"')
    .groupby(["Molecule-ID"])
    .size()
    .query("`Atom symbol` == 3")
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2249/4122089245.py in ?()
----> 2 (
      3     df.query('`Atom symbol` == "H"')
      4     .groupby(["Molecule-ID"])
      5     .size()

/opt/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
   5985             and name not in self._accessors
   5986             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987         ):
   5988             return self[name]
-> 5989         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'query'

The problem is the size function here returns Series, and you cannot query a series. We can get back to this with some acrobatics.

(
    df.query('`Atom symbol` == "H"')
    .groupby(["Molecule-ID"])
    .size()  # This is a series
    .rename("counts")  # we give the Series a name
    .to_frame()  # Convert to dataframe so we can query
    .query("counts == 3")
)
counts
Molecule-ID
NH3 3

In the beginning, it will be challenging to remember all of this, and figure out how to debug it. With practice, it will get easier!

The Pandas manual (https://pandas.pydata.org/docs/pandas.pdf) is ~3000 pages long! You cannot learn it all, and most likely you don’t need to as it covers a lot of use cases that may fall outside your needs.

It is also challenging that there are many ways to do the same thing. For example, here we solve this problem in a different way that has a subtly different syntax. You cannot just cut and paste bits of code between these two examples without knowing what each one does.

(
    df.query('`Atom symbol` == "H"')
    .groupby(["Molecule-ID"])
    .agg({"Atom symbol": "size"})  # Now this is dataframe
    .query("`Atom symbol` == 3")
)
Atom symbol
Molecule-ID
NH3 3

How do you learn/remember these? One way is reading lots of code, and running lots of code. You can read code in the manual. You can also use the notebook to access documentation on these methods.

Here are some of the commands we used today,

?pd.DataFrame.query
?pd.DataFrame.groupby

Getting help on the agg command is a little trickier. There are several agg functions, so we want to make sure we get the one that is relevant to the result from a groupby call. First, we get the type of things that is returned:

type(df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]))
pandas.core.groupby.generic.DataFrameGroupBy

Then, we get the help for that thing.

?pd.core.groupby.generic.DataFrameGroupBy.agg
?pd.Series.to_frame
?pd.Series.rename