Searching for directories by tags

| categories: python | tags:

Today I explore searching for directories by using tags. We will create a TAGS table by first walking through the directories and tagging each directory with all of the relative path components. Then, we will use set algebra to identify specific directories.

First, let us make a directory setup to use. We will nest some calculations that might be typical. These will be nested directories that might contain planewave (pw) and k-points (kpts) convergence and equation of state (eos) directories, with the sets of calculations inside those. The idea then is that the directory components will form at least some of the tags.

import os

os.mkdir('tags')

for cmpd in ['A', 'B', 'C', 'D']:
    for c in ['pw', 'kpts', 'eos']:
        os.makedirs(os.path.join('tags', cmpd, c))

for cmpd in ['A', 'B', 'C', 'D']:
    for run in [1, 2, 3, 4, 5]:
        os.makedirs(os.path.join('tags', cmpd, 'eos', str(run)))

for cmpd in ['A', 'B', 'C', 'D']:
    for run in [200, 250, 300, 350, 400]:
        os.makedirs(os.path.join('tags', cmpd, 'pw', str(run)))

for cmpd in ['A', 'B', 'C', 'D']:
    for run in ['2x2x2', '4x4x4', '8x8x8']:
        os.makedirs(os.path.join('tags', cmpd, 'kpts', str(run)))

Let us just double check what this directory tree looks like for one compound A.

ls tags/A/*
tags/A/eos:
1
2
3
4
5

tags/A/kpts:
2x2x2
4x4x4
8x8x8

tags/A/pw:
200
250
300
350
400

Now, we will walk through the directories, and split the path components to create a TAGS structure. I will store these as sets within a dictionary. We will save the structure in a pickle file to reuse it later.

import os
import pickle

TAGS = {}
TAGS['all'] = set()

for root, dirs, files in os.walk('tags'):
    base, tail = os.path.split(root)
    TAGS['all'].add(root)
    while base:
        if tail in TAGS:
            TAGS[tail].add(root)
        else:
            TAGS[tail] = set([root])
        base, tail = os.path.split(base)

with open('TAGS.pkl', 'w') as f:
    f.write(pickle.dumps(TAGS))

print TAGS.keys()
['A', '1', 'all', 'B', '250', 'pw', '2x2x2', '300', 'C', '400', 'kpts', '8x8x8', 'eos', '3', '2', '5', '4', '350', '200', '4x4x4', 'D']

Now we have a lot of keys that tag each directory. Each tag is a set of directories, and we can do set algebra to get specific results. For example, we can find a result by appropriate differences, intersections and unions of the sets. An advantage of this approach is that order of the tags is not relevant (unlike the path, where each component must be in the right order).

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

print TAGS['A'].intersection(TAGS['pw']).intersection(TAGS['300'])
print TAGS['300'].intersection(TAGS['pw']).intersection(TAGS['A'])
set(['tags\\A\\pw\\300'])
set(['tags\\A\\pw\\300'])

The syntax here is a tad heavy because of the chained dot notation operations. You can also use the logical operators like this:

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

# 300 and pw and A
print TAGS['300'] & TAGS['pw'] & TAGS['A']
set(['tags\\A\\pw\\300'])

We can get a set of calculations, for example an equation of state like this:

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

# 300 and pw and A
print TAGS['A'] & TAGS['eos']
set(['tags\\A\\eos', 'tags\\A\\eos\\5', 'tags\\A\\eos\\4', 'tags\\A\\eos\\1', 'tags\\A\\eos\\3', 'tags\\A\\eos\\2'])

Now, let us construct some more complex queries. With sets we use intersections for and and we construct unions of queries that are like an or. We examine different notations to see which one is better.

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

# find calculation 3 in eos for B and C
print (TAGS['3']
       .intersection(TAGS['eos'])
       .intersection(TAGS['B'])
       .union
       (TAGS['3']
        .intersection(TAGS['eos'])
        .intersection(TAGS['C'])))

# this notation makes more sense to me. ^ = or
print TAGS['3'] & TAGS['eos'] & (TAGS['B'] ^ TAGS['C'])
set(['tags\\C\\eos\\3', 'tags\\B\\eos\\3'])
set(['tags\\C\\eos\\3', 'tags\\B\\eos\\3'])

You can see the two approaches give the same results. The logical operator syntax is more concise and (I think) more readable. Let us consider a query with "not". We can use a difference operator for that. We subtract all the paths with tag "B" from the set containing "4x4x4", which will give us paths tagged with "4x4x4" but not "B".

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

# find calculation 4x4x4 but not B
print (TAGS['4x4x4'].difference(TAGS['B']))

# this makes more sense 4x4x4 but not B, 
# i.e. subtract paths tagged B from those tagged 4x4x4
print TAGS['4x4x4'] -  TAGS['B']
set(['tags\\D\\kpts\\4x4x4', 'tags\\A\\kpts\\4x4x4', 'tags\\C\\kpts\\4x4x4'])
set(['tags\\D\\kpts\\4x4x4', 'tags\\A\\kpts\\4x4x4', 'tags\\C\\kpts\\4x4x4'])

Note it is not so obvious how to get results not tagged with "A". We need to subtract the tagged calculations from some set.

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

print TAGS['all'] - TAGS['A']  # not A
set(['tags\\D\\kpts', 'tags\\D\\pw\\200', 'tags\\D\\pw\\350', 'tags\\D\\pw\\250', 'tags\\D\\kpts\\8x8x8', 'tags\\C\\eos', 'tags\\D\\eos\\2', 'tags\\D\\eos\\3', 'tags\\D\\eos\\4', 'tags\\D\\eos\\5', 'tags\\B\\kpts\\2x2x2', 'tags\\C\\kpts\\4x4x4', 'tags\\C\\eos\\3', 'tags\\C\\eos\\2', 'tags\\C\\eos\\1', 'tags\\C\\kpts\\8x8x8', 'tags\\C\\eos\\5', 'tags\\C\\eos\\4', 'tags\\B\\kpts', 'tags\\C\\pw\\200', 'tags\\B\\eos\\2', 'tags\\B\\pw\\350', 'tags\\B\\eos\\1', 'tags\\B\\kpts\\8x8x8', 'tags\\C\\pw\\300', 'tags\\B\\eos\\4', 'tags\\B\\eos\\5', 'tags\\C\\kpts', 'tags\\D\\pw\\300', 'tags\\B\\kpts\\4x4x4', 'tags\\C\\kpts\\2x2x2', 'tags\\D\\kpts\\4x4x4', 'tags\\B\\pw\\250', 'tags', 'tags\\D\\pw\\400', 'tags\\D\\eos', 'tags\\C\\pw\\400', 'tags\\D\\kpts\\2x2x2', 'tags\\D\\pw', 'tags\\C\\pw\\250', 'tags\\C\\pw\\350', 'tags\\C\\pw', 'tags\\D\\eos\\1', 'tags\\B\\pw\\400', 'tags\\B\\pw', 'tags\\B\\eos', 'tags\\B\\pw\\300', 'tags\\B\\eos\\3', 'tags\\C', 'tags\\B', 'tags\\D', 'tags\\B\\pw\\200'])

We can also look at "or". This is done with the union function, which is the set of elements in either set. The logical operator is |.

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

print TAGS['1'].union(TAGS['2'])
print TAGS['1'] | TAGS['2']        # 1 or 2
set(['tags\\B\\eos\\2', 'tags\\B\\eos\\1', 'tags\\A\\eos\\1', 'tags\\A\\eos\\2', 'tags\\D\\eos\\1', 'tags\\D\\eos\\2', 'tags\\C\\eos\\2', 'tags\\C\\eos\\1'])
set(['tags\\B\\eos\\2', 'tags\\B\\eos\\1', 'tags\\A\\eos\\1', 'tags\\A\\eos\\2', 'tags\\D\\eos\\1', 'tags\\D\\eos\\2', 'tags\\C\\eos\\2', 'tags\\C\\eos\\1'])

There is an xor operator too. xor is not the same as or, it means A xor B means "A or B but not both". The symmetric_difference function gives this behavior. The logical operator is ^.

import pickle

with open('TAGS.pkl', 'r') as f:
    TAGS = pickle.loads(f.read())

# I think this is like the xor, A or 2 but not both
print TAGS['A'].symmetric_difference(TAGS['2']), '\n'
print TAGS['A'] ^ TAGS['2']
set(['tags\\A\\pw\\300', 'tags\\A\\pw\\250', 'tags\\A\\eos', 'tags\\B\\eos\\2', 'tags\\A\\eos\\4', 'tags\\A\\kpts\\2x2x2', 'tags\\A\\pw\\400', 'tags\\A\\kpts\\4x4x4', 'tags\\A\\eos\\3', 'tags\\A\\kpts\\8x8x8', 'tags\\A\\pw\\350', 'tags\\A\\eos\\5', 'tags\\D\\eos\\2', 'tags\\A\\kpts', 'tags\\A', 'tags\\C\\eos\\2', 'tags\\A\\eos\\1', 'tags\\A\\pw', 'tags\\A\\pw\\200']) 

set(['tags\\A\\pw\\300', 'tags\\A\\pw\\250', 'tags\\A\\eos', 'tags\\B\\eos\\2', 'tags\\A\\eos\\4', 'tags\\A\\kpts\\2x2x2', 'tags\\A\\pw\\400', 'tags\\A\\kpts\\4x4x4', 'tags\\A\\eos\\3', 'tags\\A\\kpts\\8x8x8', 'tags\\A\\pw\\350', 'tags\\A\\eos\\5', 'tags\\D\\eos\\2', 'tags\\A\\kpts', 'tags\\A', 'tags\\C\\eos\\2', 'tags\\A\\eos\\1', 'tags\\A\\pw', 'tags\\A\\pw\\200'])

To summarize, this is one approach to using tags with the technical infrastructure of the set. The advantage is flexibility, that you do not need to know the full path to a result, provided you know the set of tags that refers to it. You can use the tags in any order.

There are many alternative approaches to implementing this idea. One could create a sqlite table and do SQL queries. You could also store lists in the dictionary, and use python code to find the matches. The syntax here is varied. Using the functional approach, the syntax gets heavy with all the dot notation. With the logical operators, the syntax is a little lighter.

Copyright (C) 2014 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.5h

Discuss on Twitter