Databases in molecular simulations#
The continued increase in computing power has made it possible to generate massive amounts of computational data. Some of this data is accessible in papers or on websites, e.g. https://cmr.fysik.dtu.dk.
Our Vasp module works natively with the ASE database (ase.db), so writing an entry to a database takes only a few lines.
! rm -f example-1.db
from vasp import Vasp
from ase.db import connect
calc = Vasp(label='molecules/simple-co')
atoms = calc.load_atoms()
print(calc.results)
con = connect('example-1.db')
con.write(atoms)
{}
1
! ase db example-1.db
id|age|formula|calculator|natoms|pbc| volume|charge| mass
1| 3s|CO |vasp | 2|TTT|216.000| 0.000|28.010
Rows: 1
Vector Databases for Materials Discovery#
Traditional databases store and query data by exact matches or numerical ranges. Vector databases add similarity search: finding materials with similar structures or properties even when they do not match exactly.
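To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module and a small in-memory table. The table name, columns, and the O2 and H2O rows are illustrative assumptions and not the ASE database schema; only the CO row mirrors the entry written above. The last query orders rows by closeness to a target value, which is the one-dimensional analogue of what a vector database does in embedding space.

```python
import sqlite3

# Illustrative table; the schema and rows are assumptions, not the ASE format.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE systems (formula TEXT, natoms INTEGER, mass REAL)")
con.executemany(
    "INSERT INTO systems VALUES (?, ?, ?)",
    [("CO", 2, 28.010), ("O2", 2, 31.998), ("H2O", 3, 18.015)],
)

# Exact match: only rows whose formula is exactly 'CO'.
exact = con.execute(
    "SELECT formula FROM systems WHERE formula = ?", ("CO",)
).fetchall()

# Numerical range: rows with a mass between 25 and 35.
in_range = con.execute(
    "SELECT formula FROM systems WHERE mass BETWEEN ? AND ? ORDER BY mass",
    (25.0, 35.0),
).fetchall()

# Similarity-style question: which row is closest in mass to a target value?
target = 29.0
closest = con.execute(
    "SELECT formula FROM systems ORDER BY ABS(mass - ?) LIMIT 1", (target,)
).fetchone()

print("exact match:", exact)
print("range query:", in_range)
print("closest to target:", closest)
con.close()
```

The exact and range queries need a predicate that either holds or does not, whereas the last query always returns the best available candidate, even when nothing matches the target exactly.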
The vasp package includes a built-in VectorAtomDatabase that stores per-atom embeddings for similarity search. This is particularly powerful for:
Finding atoms with similar local chemical environments
Discovering materials with similar bonding patterns
Building recommendation systems for materials discovery
Using VectorAtomDatabase#
The vasp package provides embedders that convert atomic structures into fixed-length vectors:
MockEmbedder - deterministic embeddings for testing
FairChemEmbedder - production embeddings using FairChem pretrained models
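As a rough illustration of what "deterministic, fixed-length" means, the sketch below maps an atom and its neighbor list to a unit vector derived from a hash. The function `toy_embed` and its arguments are hypothetical and written only for this example; this is not how MockEmbedder or FairChemEmbedder work internally. Hash-derived vectors carry no chemical meaning, which is why an embedder of this kind is suitable for testing a pipeline but not for drawing conclusions.

```python
import hashlib
import math


def toy_embed(symbol, neighbor_symbols, dim=8):
    """Map an atom and its neighbors to a deterministic unit vector.

    Hypothetical helper for illustration only; not the vasp package API.
    """
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (SHA-256 digest size)")
    # The same (symbol, neighbors) pair always yields the same key.
    key = symbol + "|" + ",".join(sorted(neighbor_symbols))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Turn the first `dim` bytes into values in [-1, 1).
    raw = [(b - 128) / 128.0 for b in digest[:dim]]
    # Normalize to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


v1 = toy_embed("C", ["O"])
v2 = toy_embed("C", ["O"])
v3 = toy_embed("O", ["H", "H"])

print(len(v1))    # fixed length, regardless of the input
print(v1 == v2)   # identical inputs give identical vectors
print(v1 == v3)   # different environments give different vectors
```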
from vasp import Vasp
from vasp.database import VectorAtomDatabase, MockEmbedder
# Create an embedder (MockEmbedder for demonstration, FairChemEmbedder for production)
embedder = MockEmbedder(dim=128)
# Create a vector database with the embedder
db = VectorAtomDatabase("vasp_atoms.db", embedder=embedder)
db.connect()
# Load and store structures from our VASP calculations
for label in ['molecules/simple-co', 'molecules/O2-sp-triplet', 'molecules/h2o_relax']:
    try:
        calc = Vasp(label=label)
        atoms = calc.load_atoms()
        energy = calc.results.get('energy', None)
        forces = calc.results.get('forces', None)
        structure_id = db.write(atoms, energy=energy, forces=forces, source=label)
        print(f"Added {atoms.get_chemical_formula()} (id={structure_id}) from {label}")
    except Exception as e:
        print(f"Could not load {label}: {e}")
print(f"\nDatabase contains {db.count_structures()} structures, {db.count_atoms()} atoms")
Added CO (id=1) from molecules/simple-co
Added O2 (id=2) from molecules/O2-sp-triplet
Added H2O (id=3) from molecules/h2o_relax
Database contains 3 structures, 7 atoms
Finding Similar Atomic Environments#
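The key feature of VectorAtomDatabase is finding atoms with similar local chemical environments, which makes it possible to discover similar bonding patterns across different structures. Each match reports the stored atom's symbol, structure ID, atom index, and its distance from the query in embedding space, with smaller distances listed first. Because this example uses MockEmbedder, which is intended for testing, the ranking in the output should not be read as chemically meaningful: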
from ase.build import molecule
# Create a query structure
query_atoms = molecule('CO')
query_atoms.cell = [6, 6, 6]
query_atoms.pbc = True
# Find atoms similar to the carbon atom (index 0) in our query
similar = db.find_similar_environments(query_atoms, center_index=0, k=5)
print("Atoms similar to C in CO molecule:")
print("-" * 50)
for match in similar:
    struct = db.get_structure(match.structure_id)
    print(f" {match.symbol} in {struct.get_chemical_formula()}")
    print(f" Structure ID: {match.structure_id}, Atom index: {match.atom_index}")
    print(f" Distance: {match.distance:.4f}")
    print()
Atoms similar to C in CO molecule:
--------------------------------------------------
O in O2
Structure ID: 2, Atom index: 0
Distance: 0.1992
O in O2
Structure ID: 2, Atom index: 1
Distance: 0.9729
O in CO
Structure ID: 1, Atom index: 1
Distance: 0.9988
O in H2O
Structure ID: 3, Atom index: 2
Distance: 1.0165
H in H2O
Structure ID: 3, Atom index: 1
Distance: 1.0216
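Conceptually, a search like the one above ranks every stored atom by its distance to the query vector and keeps the first k. The following is a minimal brute-force sketch of that idea, not the package's implementation. The Euclidean metric, the `k_nearest` helper, and the two-dimensional vectors are assumptions made for illustration; the `Match` fields simply mirror the attributes printed above.

```python
import math
from typing import NamedTuple


class Match(NamedTuple):
    structure_id: int
    atom_index: int
    symbol: str
    distance: float


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, records, k=5):
    """Return the k stored atoms whose vectors are closest to `query`.

    `records` is an iterable of (structure_id, atom_index, symbol, vector).
    """
    scored = [
        Match(sid, idx, sym, euclidean(query, vec))
        for sid, idx, sym, vec in records
    ]
    scored.sort(key=lambda m: m.distance)  # smallest distance first
    return scored[:k]


# Made-up 2D vectors purely for illustration; real embeddings are much longer.
records = [
    (1, 0, "C", [0.0, 1.0]),
    (1, 1, "O", [1.0, 0.0]),
    (2, 0, "O", [0.9, 0.1]),
    (3, 0, "H", [-1.0, 0.0]),
]

matches = k_nearest([1.0, 0.0], records, k=2)
for m in matches:
    print(f"{m.symbol} (structure {m.structure_id}, atom {m.atom_index}): {m.distance:.4f}")
```

A brute-force scan costs one distance evaluation per stored atom, which is fine for small collections; larger ones typically rely on an approximate nearest-neighbor index instead.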
Filtering by Element#
You can also filter similarity searches by element type to find, for example, all oxygen atoms with similar environments:
# Find only oxygen atoms with similar environments to O in CO
similar_O = db.find_similar_environments(query_atoms, center_index=1, k=5, symbol='O')
print("Oxygen atoms similar to O in CO molecule:")
print("-" * 50)
for match in similar_O:
    struct = db.get_structure(match.structure_id)
    print(f" {match.symbol} in {struct.get_chemical_formula()}")
    print(f" Position: {match.position}")
    print(f" Distance: {match.distance:.4f}")
    print()
Oxygen atoms similar to O in CO molecule:
--------------------------------------------------
O in H2O
Position: [-0.00025969 0. 0. ]
Distance: 0.9379
O in CO
Position: [1.2 0. 0. ]
Distance: 0.9662
O in O2
Position: [6.22723468 5. 5. ]
Distance: 1.0077
O in O2
Position: [4.99276532 5. 5. ]
Distance: 1.1035
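When combining a filter with a top-k search, the order of the two steps matters. The sketch below contrasts filtering after ranking with filtering before ranking on made-up one-dimensional vectors. The helper names and the data are hypothetical, and the source does not say which strategy VectorAtomDatabase uses internally; the point is only to show why the two can give different answers.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, records):
    """Sort (symbol, vector) records by distance to the query."""
    return sorted(records, key=lambda r: distance(query, r[1]))


def post_filter(query, records, k, symbol):
    """Take the top k first, then drop other elements (may return fewer than k)."""
    return [r for r in rank(query, records)[:k] if r[0] == symbol]


def pre_filter(query, records, k, symbol):
    """Restrict to one element first, then take the top k among those."""
    return rank(query, [r for r in records if r[0] == symbol])[:k]


# Made-up 1D vectors for illustration only.
records = [("H", [0.1]), ("H", [0.2]), ("C", [0.3]), ("O", [0.5]), ("O", [0.9])]
query = [0.0]

after = post_filter(query, records, k=3, symbol="O")
before = pre_filter(query, records, k=3, symbol="O")

print("filter after ranking: ", after)
print("filter before ranking:", before)
```

Here the three nearest atoms overall are H, H, and C, so filtering afterwards leaves no oxygen atoms at all, while filtering first returns both of them.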
Adding Bulk Materials#
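The VectorAtomDatabase can store molecules and periodic bulk structures in the same database. Each label must point to a calculation directory that contains a POSCAR or CONTCAR file; otherwise loading fails and the structure is skipped, which is what happens in the run shown here: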
# Add bulk materials to the same database
bulk_dirs = ['bulk/fcc-Cu', 'bulk/fcc-Pt', 'bulk/fcc-Pd']
for label in bulk_dirs:
    try:
        calc = Vasp(label=label)
        atoms = calc.load_atoms()
        energy = calc.results.get('energy', None)
        structure_id = db.write(atoms, energy=energy, source=label)
        print(f"Added {atoms.get_chemical_formula()} (id={structure_id})")
    except Exception as e:
        print(f"Could not load {label}: {e}")
print(f"\nDatabase now contains {db.count_structures()} structures, {db.count_atoms()} atoms")
Could not load bulk/fcc-Cu: No POSCAR/CONTCAR in /home/jovyan/dft-book/notebooks/bulk/fcc-Cu
Could not load bulk/fcc-Pt: No POSCAR/CONTCAR in /home/jovyan/dft-book/notebooks/bulk/fcc-Pt
Could not load bulk/fcc-Pd: No POSCAR/CONTCAR in /home/jovyan/dft-book/notebooks/bulk/fcc-Pd
Database now contains 3 structures, 7 atoms
# Retrieve and inspect a stored structure
struct_id = 1
atoms = db.get_structure(struct_id)
if atoms:
    print(f"Structure {struct_id}: {atoms.get_chemical_formula()}")
    print(f" Number of atoms: {len(atoms)}")
    print(f" Cell: {atoms.cell.lengths()}")
    # Get embedding for a specific atom
    emb = db.get_atom_embedding(struct_id, atom_index=0)
    if emb is not None:
        print(f" Embedding dim: {len(emb)}")
Structure 1: CO
Number of atoms: 2
Cell: [6. 6. 6.]
Embedding dim: 128
Applications of Vector Databases in Computational Materials Science#
Vector databases enable powerful workflows for materials discovery:
Structure Matching: Find known materials similar to an experimental or hypothetical structure
Property Prediction: Use k-nearest neighbors in embedding space to predict properties of new materials
Anomaly Detection: Identify unusual structures that don’t match known patterns
Dataset Curation: Find and remove duplicate or near-duplicate structures
Active Learning: Select diverse structures for DFT calculations to maximize information gain
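Two of the workflows listed above, property prediction and dataset curation, reduce to simple operations on distances between embeddings. The sketch below shows one possible form of each, assuming Euclidean distance and using made-up vectors and property values. The functions `knn_predict` and `near_duplicates` are hypothetical helpers for this example and are not part of the vasp package.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, labeled, k=3):
    """Predict a property as the mean over the k nearest labeled vectors.

    `labeled` is a list of (vector, value) pairs.
    """
    if not labeled:
        raise ValueError("need at least one labeled vector")
    nearest = sorted(labeled, key=lambda item: distance(query, item[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)


def near_duplicates(vectors, tol=1e-3):
    """Return index pairs (i, j), i < j, whose vectors lie within `tol`."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if distance(vectors[i], vectors[j]) < tol:
                pairs.append((i, j))
    return pairs


# Made-up embeddings and property values (e.g. energies) for illustration.
labeled = [
    ([0.0, 0.0], -1.0),
    ([0.1, 0.0], -1.2),
    ([0.0, 0.1], -0.8),
    ([1.0, 1.0], -3.0),
]
prediction = knn_predict([0.05, 0.05], labeled, k=3)
print(f"predicted value: {prediction:.3f}")

vectors = [[0.0, 0.0], [0.0, 0.0005], [1.0, 1.0]]
duplicates = near_duplicates(vectors, tol=1e-3)
print("near-duplicate pairs:", duplicates)
```

The quality of either result depends entirely on the embedding: neighbors are only informative if nearby vectors correspond to chemically similar environments.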
The combination of ASE’s traditional database capabilities with vector search creates a comprehensive system for managing and exploring computational materials data.
# Clean up
import os
db.close()
if os.path.exists("vasp_atoms.db"):
    os.remove("vasp_atoms.db")
    print("Cleaned up vasp_atoms.db")
if os.path.exists("example-1.db"):
    os.remove("example-1.db")
    print("Cleaned up example-1.db")
Cleaned up vasp_atoms.db
Cleaned up example-1.db