Databases in molecular simulations

The continued increase in computing power has made it possible to generate massive amounts of computational data. Some of these data are accessible in papers or on websites, e.g. https://cmr.fysik.dtu.dk.

Our Vasp module works natively with the ASE database module (ase.db), so writing an entry to a database takes only a few lines.

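! rm -f example-1.db
from vasp import Vasp
from ase.db import connect
# Load the atoms from an existing calculation directory
calc = Vasp(label='molecules/simple-co')
atoms = calc.load_atoms()
print(calc.results)
# Write the atoms as a new row in the database file
con = connect('example-1.db')
con.write(atoms)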

{}

! ase db example-1.db

id|age|formula|calculator|natoms|pbc| volume|charge|  mass
 1| 3s|CO     |vasp      |     2|TTT|216.000| 0.000|28.010
Rows: 1

Vector Databases for Materials Discovery#

Traditional databases store and query data by exact matches or numerical ranges. Vector databases add similarity search: finding materials with similar structures or properties even when they don’t match exactly.
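The difference between the two query styles can be illustrated with a few lines of plain Python. The records, the `energy` field, and the three-component vectors are hypothetical and chosen only to make the contrast visible. This is a sketch of the idea, not of any particular database engine.

```python
import math

# Hypothetical records: a scalar property plus a short feature vector.
records = [
    {"name": "A", "energy": 1.1, "vector": [0.9, 0.1, 0.0]},
    {"name": "B", "energy": 3.2, "vector": [0.8, 0.2, 0.1]},
    {"name": "C", "energy": 1.2, "vector": [0.0, 0.1, 0.9]},
]


def range_query(rows, key, low, high):
    """Traditional query: keep rows whose value lies in [low, high]."""
    return [r["name"] for r in rows if low <= r[key] <= high]


def similarity_query(rows, query_vector, k):
    """Vector query: rank rows by Euclidean distance to the query vector."""
    ranked = sorted(rows, key=lambda r: math.dist(r["vector"], query_vector))
    return [r["name"] for r in ranked[:k]]


print(range_query(records, "energy", 1.0, 1.5))         # ['A', 'C']
print(similarity_query(records, [1.0, 0.0, 0.0], k=2))  # ['A', 'B']
```

The range query returns A and C because their scalar values fall in the window, while the similarity query returns A and B because their vectors lie closest to the query, regardless of the scalar value.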

The vasp package includes a built-in VectorAtomDatabase that stores per-atom embeddings for similarity search. This is particularly powerful for:

Finding atoms with similar local chemical environments
Discovering materials with similar bonding patterns
Building recommendation systems for materials discovery

Using VectorAtomDatabase#

The vasp package provides embedders that convert atomic structures into fixed-length vectors:

MockEmbedder - deterministic embeddings for testing
FairChemEmbedder - production embeddings using FairChem pretrained models
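To make "deterministic embeddings for testing" concrete, here is a minimal sketch of how a mock embedder could work: hash a description of the atom and its neighbors into a seed, then draw a fixed-length vector from a generator seeded with it. The function `mock_embedding` and its arguments are hypothetical and are not taken from the vasp package.

```python
import hashlib
import random


def mock_embedding(symbol, neighbor_symbols, dim=8):
    """Return a deterministic unit-length vector for an atom and its neighbors.

    Hypothetical stand-in for an embedder; not the vasp package's code.
    """
    # Sort the neighbors so that their order does not change the result.
    key = symbol + "|" + ",".join(sorted(neighbor_symbols))
    # Hash the description into an integer seed for a private generator.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    vector = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector]


a = mock_embedding("C", ["O"])
b = mock_embedding("C", ["O"])
c = mock_embedding("O", ["H", "H"])
print(len(a), a == b, a == c)
```

Vectors produced this way are reproducible, which is what tests need, but they carry no chemical information. Meaningful similarity requires a trained model.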

from vasp import Vasp
from vasp.database import VectorAtomDatabase, MockEmbedder

# Create an embedder (MockEmbedder for demonstration, FairChemEmbedder for production)
embedder = MockEmbedder(dim=128)

# Create a vector database with the embedder
db = VectorAtomDatabase("vasp_atoms.db", embedder=embedder)
db.connect()

# Load and store structures from our VASP calculations
for label in ['molecules/simple-co', 'molecules/O2-sp-triplet', 'molecules/h2o_relax']:
    try:
        calc = Vasp(label=label)
        atoms = calc.load_atoms()
        energy = calc.results.get('energy', None)
        forces = calc.results.get('forces', None)
        
        structure_id = db.write(atoms, energy=energy, forces=forces, source=label)
        print(f"Added {atoms.get_chemical_formula()} (id={structure_id}) from {label}")
    except Exception as e:
        print(f"Could not load {label}: {e}")

print(f"\nDatabase contains {db.count_structures()} structures, {db.count_atoms()} atoms")

Added CO (id=1) from molecules/simple-co
Added O2 (id=2) from molecules/O2-sp-triplet
Added H2O (id=3) from molecules/h2o_relax

Database contains 3 structures, 7 atoms

Finding Similar Atomic Environments#

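The key feature of VectorAtomDatabase is finding atoms with similar local chemical environments, which makes it possible to discover similar bonding patterns across different structures. Because this example uses MockEmbedder, whose embeddings are intended for testing, the distances below should be read as an illustration of the API rather than as chemically meaningful similarity: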

from ase.build import molecule

# Create a query structure
query_atoms = molecule('CO')
query_atoms.cell = [6, 6, 6]
query_atoms.pbc = True

# Find atoms similar to the carbon atom (index 0) in our query
similar = db.find_similar_environments(query_atoms, center_index=0, k=5)

print("Atoms similar to C in CO molecule:")
print("-" * 50)
for match in similar:
    struct = db.get_structure(match.structure_id)
    print(f"  {match.symbol} in {struct.get_chemical_formula()}")
    print(f"    Structure ID: {match.structure_id}, Atom index: {match.atom_index}")
    print(f"    Distance: {match.distance:.4f}")
    print()

Atoms similar to C in CO molecule:
--------------------------------------------------
  O in O2
    Structure ID: 2, Atom index: 0
    Distance: 0.1992

  O in O2
    Structure ID: 2, Atom index: 1
    Distance: 0.9729

  O in CO
    Structure ID: 1, Atom index: 1
    Distance: 0.9988

  O in H2O
    Structure ID: 3, Atom index: 2
    Distance: 1.0165

  H in H2O
    Structure ID: 3, Atom index: 1
    Distance: 1.0216
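Conceptually, a search like this ranks every stored atom by the distance between its embedding and the query embedding. The following brute-force sketch assumes Euclidean distance and hypothetical two-dimensional embeddings. The `Match` tuple mirrors the attribute names printed above but is not the package's class.

```python
import heapq
import math
from typing import NamedTuple


class Match(NamedTuple):
    structure_id: int
    atom_index: int
    symbol: str
    distance: float


# Hypothetical per-atom entries: (structure_id, atom_index, symbol, embedding).
entries = [
    (1, 0, "C", [1.0, 0.0]),
    (1, 1, "O", [0.0, 1.0]),
    (2, 0, "O", [0.9, 0.1]),
    (2, 1, "O", [0.1, 0.9]),
]


def nearest_atoms(query, k):
    """Return the k entries closest to `query`, nearest first."""
    matches = (
        Match(sid, idx, sym, math.dist(vec, query))
        for sid, idx, sym, vec in entries
    )
    return heapq.nsmallest(k, matches, key=lambda m: m.distance)


for match in nearest_atoms([0.8, 0.2], k=2):
    print(match.structure_id, match.atom_index, match.symbol,
          f"{match.distance:.4f}")
```

A linear scan like this is adequate for a few thousand atoms. Larger collections usually call for an approximate nearest-neighbor index.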

Filtering by Element#

You can also filter similarity searches by element type to find, for example, all oxygen atoms with similar environments:

# Find only oxygen atoms with similar environments to O in CO
similar_O = db.find_similar_environments(query_atoms, center_index=1, k=5, symbol='O')

print("Oxygen atoms similar to O in CO molecule:")
print("-" * 50)
for match in similar_O:
    struct = db.get_structure(match.structure_id)
    print(f"  {match.symbol} in {struct.get_chemical_formula()}")
    print(f"    Position: {match.position}")
    print(f"    Distance: {match.distance:.4f}")
    print()

Oxygen atoms similar to O in CO molecule:
--------------------------------------------------
  O in H2O
    Position: [-0.00025969  0.          0.        ]
    Distance: 0.9379

  O in CO
    Position: [1.2 0.  0. ]
    Distance: 0.9662

  O in O2
    Position: [6.22723468 5.         5.        ]
    Distance: 1.0077

  O in O2
    Position: [4.99276532 5.         5.        ]
    Distance: 1.1035
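The output lists four matches although k=5 was requested, which is consistent with the database holding only four oxygen atoms (one in CO, two in O2, one in H2O). The sketch below shows one way an element filter can interact with k: filter first, then rank. The entries and the function `similar_atoms` are hypothetical, and the package's actual filtering order is not documented here.

```python
import math

# Hypothetical per-atom entries: (symbol, embedding).
atom_entries = [
    ("C", [1.0, 0.0]),
    ("O", [0.9, 0.1]),
    ("O", [0.0, 1.0]),
    ("H", [0.7, 0.3]),
]


def similar_atoms(query, k, symbol=None):
    """Rank atoms by distance to `query`, optionally restricted to one element."""
    # Filter before ranking so that atoms of other elements never occupy
    # one of the k result slots.
    candidates = [a for a in atom_entries if symbol is None or a[0] == symbol]
    ranked = sorted(candidates, key=lambda a: math.dist(a[1], query))
    return ranked[:k]


print(len(similar_atoms([1.0, 0.0], k=3)))              # 3
print(len(similar_atoms([1.0, 0.0], k=3, symbol="O")))  # 2, only two O atoms exist
```

Filtering after ranking would instead take the k nearest atoms of any element and then discard the non-matching ones, which can return fewer matching atoms than are actually available.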

Adding Bulk Materials#

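VectorAtomDatabase can hold molecules and periodic bulk structures in the same database. In the run shown here, the bulk calculation directories contained no POSCAR/CONTCAR files, so the loop reports each failure and the database contents are unchanged: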

# Add bulk materials to the same database
bulk_dirs = ['bulk/fcc-Cu', 'bulk/fcc-Pt', 'bulk/fcc-Pd']

for label in bulk_dirs:
    try:
        calc = Vasp(label=label)
        atoms = calc.load_atoms()
        energy = calc.results.get('energy', None)
        
        structure_id = db.write(atoms, energy=energy, source=label)
        print(f"Added {atoms.get_chemical_formula()} (id={structure_id})")
    except Exception as e:
        print(f"Could not load {label}: {e}")

print(f"\nDatabase now contains {db.count_structures()} structures, {db.count_atoms()} atoms")

Could not load bulk/fcc-Cu: No POSCAR/CONTCAR in /home/jovyan/dft-book/notebooks/bulk/fcc-Cu
Could not load bulk/fcc-Pt: No POSCAR/CONTCAR in /home/jovyan/dft-book/notebooks/bulk/fcc-Pt
Could not load bulk/fcc-Pd: No POSCAR/CONTCAR in /home/jovyan/dft-book/notebooks/bulk/fcc-Pd

Database now contains 3 structures, 7 atoms

# Retrieve and inspect a stored structure
struct_id = 1
atoms = db.get_structure(struct_id)
if atoms is not None:
    print(f"Structure {struct_id}: {atoms.get_chemical_formula()}")
    print(f"  Number of atoms: {len(atoms)}")
    print(f"  Cell: {atoms.cell.lengths()}")
    
    # Get embedding for a specific atom
    emb = db.get_atom_embedding(struct_id, atom_index=0)
    if emb is not None:
        print(f"  Embedding dim: {len(emb)}")

Structure 1: CO
  Number of atoms: 2
  Cell: [6. 6. 6.]
  Embedding dim: 128
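One way per-atom embeddings could be persisted is as binary blobs in an SQLite table keyed by structure id and atom index. The table layout and the helper names `put_embedding` and `get_embedding` are assumptions for illustration and are not taken from VectorAtomDatabase.

```python
import sqlite3
from array import array

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atom_embeddings ("
    " structure_id INTEGER, atom_index INTEGER, embedding BLOB,"
    " PRIMARY KEY (structure_id, atom_index))"
)


def put_embedding(structure_id, atom_index, vector):
    """Store a vector as packed 64-bit floats (native byte order)."""
    blob = array("d", vector).tobytes()
    conn.execute(
        "INSERT INTO atom_embeddings VALUES (?, ?, ?)",
        (structure_id, atom_index, blob),
    )


def get_embedding(structure_id, atom_index):
    """Return the stored vector as a list, or None if there is no such atom."""
    row = conn.execute(
        "SELECT embedding FROM atom_embeddings"
        " WHERE structure_id = ? AND atom_index = ?",
        (structure_id, atom_index),
    ).fetchone()
    if row is None:
        return None
    values = array("d")
    values.frombytes(row[0])
    return values.tolist()


put_embedding(1, 0, [0.25, -0.5, 1.0])
print(get_embedding(1, 0))  # [0.25, -0.5, 1.0]
print(get_embedding(9, 9))  # None
```

Returning None for a missing atom matches the `is not None` check used when reading an embedding back.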

Applications of Vector Databases in Computational Materials Science#

Vector databases support several workflows for materials discovery:

Structure Matching: Find known materials similar to an experimental or hypothetical structure
Property Prediction: Use k-nearest neighbors in embedding space to predict properties of new materials
Anomaly Detection: Identify unusual structures that don’t match known patterns
Dataset Curation: Find and remove duplicate or near-duplicate structures
Active Learning: Select diverse structures for DFT calculations to maximize information gain
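As an illustration of the property-prediction item, the sketch below estimates a property as the inverse-distance-weighted mean over the k nearest neighbors in embedding space. The training pairs, the function `knn_predict`, and the choice of weighting are assumptions. Other weightings and distance measures are equally plausible.

```python
import math

# Hypothetical training set: (embedding, known property value).
known = [
    ([0.0, 0.0], 1.0),
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], 3.0),
    ([5.0, 5.0], 10.0),
]


def knn_predict(query, k=3, eps=1e-12):
    """Predict a property as the inverse-distance-weighted mean of k neighbors."""
    nearest = sorted(known, key=lambda item: math.dist(item[0], query))[:k]
    # `eps` avoids division by zero when the query coincides with a known point.
    weights = [1.0 / (math.dist(vec, query) + eps) for vec, _ in nearest]
    total = sum(weights)
    return sum(w * value for w, (_, value) in zip(weights, nearest)) / total


print(round(knn_predict([0.5, 0.5]), 3))  # 2.0
```

The query point is equidistant from the first three entries, so the prediction is their plain average. The distant fourth entry is excluded by k=3.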

Combining ASE’s traditional database queries with vector search covers both exact lookups and similarity-based exploration of computational materials data.

# Clean up
import os

db.close()

if os.path.exists("vasp_atoms.db"):
    os.remove("vasp_atoms.db")
    print("Cleaned up vasp_atoms.db")

if os.path.exists("example-1.db"):
    os.remove("example-1.db")
    print("Cleaned up example-1.db")

Cleaned up vasp_atoms.db
Cleaned up example-1.db