Using Pandas DataFrames as a small database
Contents
Using Pandas DataFrames as a small database¶
Pandas DataFrames can serve as small databases. You can use them to construct tables from which you can run queries to compute things. In this example, we will explore a database of molecules and atomic element properties.
A Table for the chemical elements and their atomic masses¶
We will use the Atomic Simulation Environment library for this. It is already installed in Deepnote.
We will use it because it has data about the chemical elements, and some molecules we will use.
First, we make a DataFrame containing the chemical elements, and their atomic masses.
import ase
import numpy as np
import pandas as pd
dtypes = np.dtype([("symbol", str), ("atomic mass", float)])
data = np.empty(0, dtype=dtypes)
elements = pd.DataFrame(data)
elements["symbol"] = ase.data.chemical_symbols
elements["atomic mass"] = ase.data.atomic_masses
elements.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symbol 119 non-null object
1 atomic mass 119 non-null float64
dtypes: float64(1), object(1)
memory usage: 2.0+ KB
You can use a query
function to select rows from the database that meet some criteria.
?elements.query
For example, to get Carbon we can use a query like this.
elements.query('symbol == "C"')
symbol | atomic mass | |
---|---|---|
6 | C | 12.011 |
We can do some things with python variables like this.
sym = "H"
elements.query("symbol == @sym")
symbol | atomic mass | |
---|---|---|
1 | H | 1.008 |
Although, f-strings also work for this.
sym = "H"
elements.query(f'symbol == "{sym}"')
symbol | atomic mass | |
---|---|---|
1 | H | 1.008 |
The atomic mass is trickier to work with because it has a space in it. We have to use back-ticks to âquoteâ this.
elements.query("`atomic mass` < 4")
symbol | atomic mass | |
---|---|---|
0 | X | 1.000 |
1 | H | 1.008 |
Note it does not appear you can use the @ or f-string on column names. Also note the mysterious element X in the table. We can ignore that.
A Table for some molecules¶
Next, letâs build the molecule database. This will be a table where each row corresponds to an atom in a molecule. We will be able to get the atoms in a molecule by aggregating these on a Molecule-ID
. We build this table up row by row, by first getting a molecule, and then iterating over each atom in the molecule. It is conventional to use an integer for an ID, but we will just use the molecular formula in this example. That has some limitations for larger databases (e.g. isomers have different properties, but the same molecular formula), but we will not have that problem here.
df = pd.DataFrame(columns=["Molecule-ID", "Atom symbol", "x", "y", "z"])
from ase.build import molecule
i = 0
for mlc in ["H2O", "NH3", "CH4"]:
for atom in molecule(mlc):
df.loc[i] = [mlc, atom.symbol, atom.x, atom.y, atom.z]
i += 1
df
Molecule-ID | Atom symbol | x | y | z | |
---|---|---|---|---|---|
0 | H2O | O | 0.000000 | 0.000000 | 0.119262 |
1 | H2O | H | 0.000000 | 0.763239 | -0.477047 |
2 | H2O | H | 0.000000 | -0.763239 | -0.477047 |
3 | NH3 | N | 0.000000 | 0.000000 | 0.116489 |
4 | NH3 | H | 0.000000 | 0.939731 | -0.271808 |
5 | NH3 | H | 0.813831 | -0.469865 | -0.271808 |
6 | NH3 | H | -0.813831 | -0.469865 | -0.271808 |
7 | CH4 | C | 0.000000 | 0.000000 | 0.000000 |
8 | CH4 | H | 0.629118 | 0.629118 | 0.629118 |
9 | CH4 | H | -0.629118 | -0.629118 | 0.629118 |
10 | CH4 | H | 0.629118 | -0.629118 | -0.629118 |
11 | CH4 | H | -0.629118 | 0.629118 | -0.629118 |
What molecules are in the database?¶
Ok, now we are ready to do some queries. First, letâs see what molecules we have in our database. We want the unique values of the âMolecule-IDâ column.
df["Molecule-ID"].unique()
array(['H2O', 'NH3', 'CH4'], dtype=object)
Which molecules have three H atoms?¶
Now, how do we find molecules that have 3 H atoms? We need to do some grouping. First, we select the H rows, and then we group by the Molecule-ID Then, we need a count of each sub group. I prefer the size
function for this over count
. size
returns a Series, while count
seems to return a DataFrame.
tf = df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]).size()
tf
Molecule-ID
CH4 4
H2O 2
NH3 3
dtype: int64
Finally, we can select the rows that have 3 hydrogen atoms.
tf[tf == 3]
Molecule-ID
NH3 3
dtype: int64
Getting the molecular weight¶
Getting the molecular weight requires us to combine information from two DataFrames. To do this, we need to merge them, aligning the rows on a common key. That key is the Atom symbol
in the molecule DataFrame, and symbol
in the elements DataFrame. Then we have to do the right grouping, and use the sum aggregation method on each group.
mf = pd.merge(df, elements, how="inner", left_on="Atom symbol", right_on="symbol")
mf
Molecule-ID | Atom symbol | x | y | z | symbol | atomic mass | |
---|---|---|---|---|---|---|---|
0 | H2O | O | 0.000000 | 0.000000 | 0.119262 | O | 15.999 |
1 | H2O | H | 0.000000 | 0.763239 | -0.477047 | H | 1.008 |
2 | H2O | H | 0.000000 | -0.763239 | -0.477047 | H | 1.008 |
3 | NH3 | H | 0.000000 | 0.939731 | -0.271808 | H | 1.008 |
4 | NH3 | H | 0.813831 | -0.469865 | -0.271808 | H | 1.008 |
5 | NH3 | H | -0.813831 | -0.469865 | -0.271808 | H | 1.008 |
6 | CH4 | H | 0.629118 | 0.629118 | 0.629118 | H | 1.008 |
7 | CH4 | H | -0.629118 | -0.629118 | 0.629118 | H | 1.008 |
8 | CH4 | H | 0.629118 | -0.629118 | -0.629118 | H | 1.008 |
9 | CH4 | H | -0.629118 | 0.629118 | -0.629118 | H | 1.008 |
10 | NH3 | N | 0.000000 | 0.000000 | 0.116489 | N | 14.007 |
11 | CH4 | C | 0.000000 | 0.000000 | 0.000000 | C | 12.011 |
Now, we group by the Molecule-ID
, select the atomic mass
column, and aggregate with the sum.
MW = mf.groupby("Molecule-ID")["atomic mass"].sum()
MW
Molecule-ID
CH4 16.043
H2O 18.015
NH3 17.031
Name: atomic mass, dtype: float64
Here is one of many ways to print this in a different format:
for mlc, mw in MW.items():
print(f"The molecular weight of {mlc} is {mw} gm/mol")
The molecular weight of CH4 is 16.043 gm/mol
The molecular weight of H2O is 18.015 gm/mol
The molecular weight of NH3 is 17.031 gm/mol
Chaining commands in Pandas¶
So far, we have mostly seen sequential commands in Pandas
tf = df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]).size()
tf[tf == 3]
Molecule-ID
NH3 3
dtype: int64
The query
lets us chain these all into one line.
df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]).count().query(
"`Atom symbol` == 3"
)
Atom symbol | x | y | z | |
---|---|---|---|---|
Molecule-ID | ||||
NH3 | 3 | 3 | 3 | 3 |
It is common to see this syntax where parentheses allow us to separate these into multiple lines. This may enhance readability.
(
df.query('`Atom symbol` == "H"')
.groupby(["Molecule-ID"])
.count()
.query("`Atom symbol` == 3")
)
Atom symbol | x | y | z | |
---|---|---|---|---|
Molecule-ID | ||||
NH3 | 3 | 3 | 3 | 3 |
The main benefit of chaining is that you do not have to define temporary variables that exist only so you can reuse them in subsequent lines. The downside is it is more challenging to debug them, and it is common to build them iteratively in a notebook.
Here is another example of chaining to get the molecular weight of water.
(
pd.merge(df, elements, how="inner", left_on="Atom symbol", right_on="symbol")
.groupby("Molecule-ID")["atomic mass"]
.sum()["H2O"]
)
18.015
Subtle points¶
Pandas offers many ways to do what appear to be the same thing, but they are not. For example, this works:
(
df.query('`Atom symbol` == "H"')
.groupby(["Molecule-ID"])
.count()
.query("`Atom symbol` == 3")
)
Atom symbol | x | y | z | |
---|---|---|---|---|
Molecule-ID | ||||
NH3 | 3 | 3 | 3 | 3 |
And this doesnât.
(
df.query('`Atom symbol` == "H"')
.groupby(["Molecule-ID"])
.size()
.query("`Atom symbol` == 3")
)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_2249/4122089245.py in ?()
----> 2 (
3 df.query('`Atom symbol` == "H"')
4 .groupby(["Molecule-ID"])
5 .size()
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
5985 and name not in self._accessors
5986 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5987 ):
5988 return self[name]
-> 5989 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'query'
The problem is the size
function here returns Series, and you cannot query a series. We can get back to this with some acrobatics.
(
df.query('`Atom symbol` == "H"')
.groupby(["Molecule-ID"])
.size() # This is a series
.rename("counts") # we give the Series a name
.to_frame() # Convert to dataframe so we can query
.query("counts == 3")
)
counts | |
---|---|
Molecule-ID | |
NH3 | 3 |
In the beginning, it will be challenging to remember all of this, and figure out how to debug it. With practice, it will get easier!
The Pandas manual (https://pandas.pydata.org/docs/pandas.pdf) is ~3000 pages long! You cannot learn it all, and most likely you donât need to as it covers a lot of use cases that may fall outside your needs.
It is also challenging that there are many ways to do the same thing. For example, here we solve this problem in a different way that has a subtly different syntax. You cannot just cut and paste bits of code between these two examples without knowing what each one does.
(
df.query('`Atom symbol` == "H"')
.groupby(["Molecule-ID"])
.agg({"Atom symbol": "size"}) # Now this is dataframe
.query("`Atom symbol` == 3")
)
Atom symbol | |
---|---|
Molecule-ID | |
NH3 | 3 |
How do you learn/remember these? One way is reading lots of code, and running lots of code. You can read code in the manual. You can also use the notebook to access documentation on these methods.
Here are some of the commands we used today,
?pd.DataFrame.query
?pd.DataFrame.groupby
Getting help on the agg
command is a little trickier. There are several agg
functions, so we want to make sure we get the one that is relevant to the result from a groupby
call. First, we get the type of things that is returned:
type(df.query('`Atom symbol` == "H"').groupby(["Molecule-ID"]))
pandas.core.groupby.generic.DataFrameGroupBy
Then, we get the help for that thing.
?pd.core.groupby.generic.DataFrameGroupBy.agg
?pd.Series.to_frame
?pd.Series.rename