Collaborative database framework (CDF)
We have a vision for the future of cyberinfrastructure-enabled collaborative research. This site is part of our effort to make that vision a reality. Below I lay out our goals for the CDF by describing how different users might use it. We welcome collaborative efforts to define and develop this concept.
1. User experiences utilizing the Collaborative Database Framework
The CDF is a web-based portal to facilitate computational research by providing:
- Community developed, reviewed and rated content relevant to research including tutorials on using codes, performing analysis, software installation guides, etc…
- Social networking capabilities to identify collaborators and researchers with similar interests.
- A web-based and scripting framework to set up, run, monitor and analyze calculations.
- A web-based and scripting framework for searching for data (particularly computational results) in the database.
- A framework for collaborating and sharing data with other users, groups and the public using fine-grained permissions to control access.
The CDF does not exist yet, but it is technically feasible with existing software tools today. Without worrying too much about how this could be implemented, I describe here what using the CDF would be like for a diverse set of users.
1.1. New student in a computational research group experience
A new student logs into the web-based portal hosted at a National Supercomputing facility where they are presented with content related to the computing system they are working on, announcements of maintenance, training opportunities, tutorial sections relevant to the codes that are available, community specific modeling tutorials (e.g. materials simulation, reaction simulations, code development, best practices, etc…), and community developed and rated links to external content, literature articles, new publications, etc… This information is developed, reviewed and rated by the community through wiki-like interfaces or by uploaded files that are indexed and searchable. The student can eventually customize the start page to reflect their needs, and the information they want through user-configurable portlets that may contain smart updated searches on keywords, RSS feeds to other sites, lists of running or finished jobs, etc…
The student then navigates to their personal folder where all their work will be stored. They fill in some optional profile information about themselves, their research interests, etc… This information is used to identify particular content on the site they might be interested in, as well as potential collaborators, reviewers, and other researchers with similar interests.
The student creates a new folder titled “Tutorial-1” in their personal folder through simple mouse clicks using intuitive menus. Following a video tutorial in another window, the student adds a calculation object to that folder, defining the geometry of the atoms and unit cell through a form that is automatically generated by the portal. The student uses a structure factory accessible in another part of the portal to generate the positions of the atoms in a 2x1 unit cell of the (211) surface, and then uses an interactive visualization tool in the browser to place an adsorbate near the step edge. Another student imports a structure from the Protein database, while a third student directly enters fractional coordinates computed by hand. With the geometry defined, the student selects a computational method (e.g. Vasp, Dacapo, Wien2K, NWChem, GPAW, Abinit, Siesta, Gaussian, etc…) that will be used for the calculation. A new form appears with the parameters that need to be chosen for the particular calculator, which the student fills in. A portlet shows the student all calculations that exist in the database with similar geometries (but not necessarily similar computational parameters) to the one they have just set up. No calculation exactly matches what the student needs, but one of them has a relaxed geometry, which the student uses as their initial guess instead of the ideal positions used before. Satisfied with the setup, the student now submits the job through the portal. A set of portlets on the screen shows the student the status of their job and alerts the student when the job has completed or if errors have occurred. Another student finds exactly the calculation they want in the database and, with the press of a button, converts the calculation they have just set up into a link to the other calculation.
After the job is completed, a set of tabs across the top of the window provides different views of the results, e.g. a simple summary of the geometry (with an interactive visualization of the geometry), forces, and energy, a configurable plot of the density of states, a plot of the charge density, etc… Metadata about the calculations, e.g. which host the calculation was run on, software versions, etc…, is available in one of the tabs. Each view has a data download link so the data can be used in other applications. The student downloads the data needed for the tutorial and analyzes it in whatever program they are comfortable with. Their results do not make sense, and when their advisor logs in to examine the results she notices that the student used inconsistent parameters between the calculations. After the student fixes the problem, the results look good. The student submits the calculations for public release. After the advisor approves the release, the results are available to anyone with web access.
Advanced students will utilize portal interfaces to geometry optimization, phonon, vibrational, transition state search algorithms, etc… for sophisticated calculations beyond the single point calculations. These sets of calculations will be automatically linked in the CDF making them easy to assemble and analyze. They will be able to write limited python scripts through the web in the CDF for analyzing data, for example collecting a set of results for download, or calculating reaction energies. The CDF will allow them to edit objects with external editors, e.g. xemacs, after the student installs a small client program. Advanced students will annotate their calculation objects with references to literature or similar calculations, as well as research notes about why a calculation was run and user-defined tags connecting datasets to make them easier to find in the future.
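As an illustration of the kind of short analysis script described above, here is a minimal sketch of a reaction energy calculation. The energies dictionary and the function name are hypothetical, and the numbers are made-up placeholders rather than database results.

```python
# Hypothetical total energies (eV), standing in for values collected from
# a set of calculation objects. The numbers are placeholders, not real data.
energies = {"CO": -14.80, "O2": -9.86, "CO2": -22.96}


def reaction_energy(reactants, products, energies):
    """Return E(products) - E(reactants).

    Each side maps a species name to its stoichiometric coefficient.
    """
    def side_total(side):
        return sum(coeff * energies[name] for name, coeff in side.items())

    return side_total(products) - side_total(reactants)


# CO + 1/2 O2 -> CO2
delta_e = reaction_energy({"CO": 1, "O2": 0.5}, {"CO2": 1}, energies)
print(f"Reaction energy: {delta_e:.2f} eV")
```

A negative value indicates an exothermic reaction under the usual sign convention for total energies.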
The CDF does not limit the objects in the user’s personal space to calculations. They may also create files (e.g. word documents, excel spreadsheets, presentations, etc…) that can be edited in the native programs that created them. They can upload pdfs of relevant literature articles, and create wikis or blogs in their personal folders. All of these are indexed by the CDF and they are viewable and searchable by people with appropriate view permissions.
1. Behind the scenes
Simply creating the calculation setup puts an entry in the CDF that is searchable by the user, facilitating finding the results later. The user can easily create custom portals that show all unfinished calculations meeting particular criteria, e.g. tagged with “surface oxide”. They can then navigate to those calculation objects and perform various actions as needed. Using the CDF to manage the job submission and execution means the results are in the CDF without further user intervention or action. The only user action that is required is setting the permissions on the results to give the level of desired access.
Many users can log in to the CDF and all of them store data in it. Access to user data is controlled by permissions and the “state” of the objects. Objects start in a “private” state which is only viewable and editable by the owner, managers and anyone explicitly given permission to see the data. Users have the option to change the state to make it internally visible to logged-in members, or to submit the data to be publicly visible to anyone in the world. If the user has a “Reviewer” role (normally reserved for supervisors/advisors, although they can give their students the Reviewer role if desired) the user can release the data to the world. Users can also give access to specific users or groups of users.
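The visibility rules in this paragraph can be sketched in a few lines. This is only an illustration under stated assumptions: the class, function, and state names ("private", "internal", "public") are hypothetical and do not correspond to an existing CDF or Plone API.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    owner: str
    state: str = "private"  # hypothetical states: "private", "internal", "public"
    shared_with: set = field(default_factory=set)  # users explicitly granted access


def can_view(obj, user=None, managers=frozenset()):
    """Decide whether `user` may view `obj`; user=None means an anonymous visitor."""
    if obj.state == "public":
        return True
    if user is None:
        return False  # anonymous visitors only see public data
    if obj.state == "internal":
        return True  # any logged-in member
    # Private: owner, managers, and explicitly listed users only.
    return user == obj.owner or user in managers or user in obj.shared_with


def release(obj, user_roles):
    """Make an object public; only a user holding the Reviewer role may do so."""
    if "Reviewer" not in user_roles:
        raise PermissionError("Reviewer role required to release data")
    obj.state = "public"


result = DataObject(owner="student")
print(can_view(result, "student"), can_view(result, "stranger"))
```

In a real deployment these checks would presumably be delegated to the security machinery of the underlying framework rather than written by hand.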
The CDF uses workflows to manage states of objects, permissions on them, and actions upon them. For example, when a user submits a job, the calculation parameters should no longer be editable because then they would be inconsistent with the submitted job. Thus, on submission, the edit permission would be removed. If the user wanted to change something they would have to “retract” the submitted job (perhaps resulting in a running job being killed) to edit it. Similarly, when a job is finished the parameters should not be edited, otherwise they would be inconsistent with the stored results; the results are what they are. Finished jobs cannot be retracted, although if desired they could be deleted by a manager or moved to another folder. New calculations can be started from those results with different parameter sets.
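The job workflow described here amounts to a small state machine. The following is a minimal sketch, assuming hypothetical state names ("setup", "submitted", "finished") and action names; it is not the workflow definition of any existing system.

```python
# Allowed transitions: (current state, action) -> new state.
# State and action names are illustrative assumptions.
TRANSITIONS = {
    ("setup", "submit"): "submitted",
    ("submitted", "retract"): "setup",
    ("submitted", "finish"): "finished",
}
EDITABLE_STATES = {"setup"}  # parameters may only change before submission


class Calculation:
    def __init__(self, parameters):
        self.state = "setup"
        self.parameters = dict(parameters)

    def do(self, action):
        """Apply a workflow action, rejecting transitions that are not allowed."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a calculation in state {self.state!r}")
        self.state = TRANSITIONS[key]

    def set_parameter(self, name, value):
        """Edit a parameter; refused once the job is submitted or finished."""
        if self.state not in EDITABLE_STATES:
            raise PermissionError(f"parameters are read-only in state {self.state!r}")
        self.parameters[name] = value


calc = Calculation({"kpts": 4})
calc.do("submit")   # parameters are now frozen
calc.do("retract")  # back to "setup", so editing is allowed again
calc.set_parameter("kpts", 8)
```

Because there is no ("finished", "retract") entry in the table, a finished job cannot be retracted, which matches the rule stated above.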
The workflow for submission would be responsible for assembling the input files needed for a simulation, securely copying the files to a (possibly remote) file system, and submitting the job to the queue system on behalf of the user. When the job is done, it alerts the CDF and the results are uploaded for the user. Only users with permission are allowed to submit jobs to particular systems. Others will simply not see an option to submit a job. The CDF also allows users to generate and download the input files to run on their own cluster, and to upload the results when they are done.
The CDF provides wrappers around computational codes. We separate the atomic geometry aspects of calculations (which are completely independent of codes, although there are many equivalent ways to represent them and there may be code-specific requirements on the representation) from the code-specific input parameters. The CDF provides translation adapters to present the atomic geometry in different forms as needed. Packages can be installed in the CDF for each calculation code. These packages provide interfaces to the input parameters of the code, and code for generating the input files. Finally, packages will exist for submitting jobs to computational resources. These will be specific to particular systems and the packages will be responsible for secure communication with the resource and queue system. Users will likely have to provide some configuration data that is used for authentication, etc…
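To make the idea of a translation adapter concrete, here is a minimal sketch that converts fractional coordinates to Cartesian coordinates and writes them in the common XYZ text format. The function names and the example cell are assumptions for illustration only.

```python
def fractional_to_cartesian(fractional, cell):
    """Convert fractional coordinates to Cartesian ones.

    `cell` holds the three lattice vectors as rows; each position is
    u*a1 + v*a2 + w*a3.
    """
    return [
        tuple(sum(f[i] * cell[i][k] for i in range(3)) for k in range(3))
        for f in fractional
    ]


def to_xyz(symbols, positions):
    """Render atoms in XYZ format: atom count, comment line, one atom per line."""
    lines = [str(len(symbols)), "written by a hypothetical geometry adapter"]
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines)


# Illustrative cubic cell with a 4.0 Å edge (made-up value).
cell = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
positions = fractional_to_cartesian([(0.0, 0.0, 0.0), (0.5, 0.5, 0.0)], cell)
print(to_xyz(["Pd", "Pd"], positions))
```

Keeping the geometry in one internal form and generating each code's representation on demand is what allows the same structure to be reused across different calculators.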
1.2. Educational use of the CDF
The CDF enables anonymous users to access publicly visible data. An instructor teaching a course on molecular simulations can create assignments where students use the search interface to find data sets for further analysis. For example, in an exercise on equations of state the students would search for calculations similar to fcc Pd in a 1x1x1 unit cell at different volumes. They would probably get a large number of results, which could be downloaded into Excel and further refined through the use of Pivot tables, or refined through the web by restricting the results to particular codes, particular exchange-correlation functionals, etc… The students would then use the refined data to fit an equation of state and compare the results to experimental data. They may examine the convergence of a lattice parameter or bulk modulus as a function of different computational parameters, e.g. k-point density. Finally, the CDF could be used to help students discover how the lattice constant depends on other choices such as the exchange-correlation functional, different pseudo/PAW potentials, all-electron calculations, etc…
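The equation-of-state exercise can be sketched as follows. This is a deliberately simplified example: it fits a parabola E(V) by least squares instead of a proper form such as Birch-Murnaghan, it uses synthetic data rather than database results, and it assumes energies in eV and volumes in Å^3. All function names are hypothetical.

```python
EV_PER_A3_TO_GPA = 160.21766  # 1 eV/Å^3 expressed in GPa


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        x[r] = (a[r][3] - sum(a[r][k] * x[k] for k in range(r + 1, 3))) / a[r][r]
    return x


def fit_parabolic_eos(volumes, energies):
    """Least-squares fit of E = a*(V - Vm)**2 + b*(V - Vm) + c.

    Returns (V0, E0, B0), where B0 = V0 * d2E/dV2 is converted to GPa.
    """
    vm = sum(volumes) / len(volumes)  # centre the volumes for better conditioning
    x = [v - vm for v in volumes]

    def s(p):
        return sum(xi ** p for xi in x)

    def t(p):
        return sum(e * xi ** p for xi, e in zip(x, energies))

    a, b, c = solve3(
        [[s(4), s(3), s(2)], [s(3), s(2), s(1)], [s(2), s(1), s(0)]],
        [t(2), t(1), t(0)],
    )
    v0 = vm - b / (2 * a)            # minimum of the parabola
    e0 = c - b * b / (4 * a)         # energy at the minimum
    b0 = 2 * a * v0 * EV_PER_A3_TO_GPA
    return v0, e0, b0


# Synthetic data generated from a known parabola, not real calculations.
volumes = [13.0, 14.0, 15.0, 16.0, 17.0]
energies = [0.02 * (v - 15.0) ** 2 - 5.0 for v in volumes]
v0, e0, b0 = fit_parabolic_eos(volumes, energies)
print(f"V0 = {v0:.3f} Å^3, E0 = {e0:.3f} eV, B0 = {b0:.1f} GPa")
```

A parabola is only reasonable close to the equilibrium volume; for data spanning a wider volume range, students would want a physically motivated equation of state.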
Other examples of exercises include:
- Download the datasets from several papers on surface oxides to first reproduce the analysis from the paper, and then to examine discrepancies between the papers or find similarities in datasets that could not have been identified before when access to the complete dataset was unavailable.
- Find all bulk structures of Ag-Pd alloys in the database, construct a phase diagram from the results, and parameterize a cluster expansion using the ATAT or UNCLE code for finite-temperature Monte Carlo simulations.
- Compute molecular reaction energies from database results and use the results to parameterize a simple group additivity model.
- Find vibrational calculation results for different molecules and use them to construct rules of thumb for interpreting experimental spectra.
- Download molecular dynamics trajectories for local analysis.
The CDF also allows anonymous users to create their own accounts. These accounts have limited permissions, and quotas, but they allow the users to save queries and links to calculations, as well as use their personal folders for collaborations.
1.3. Power user experience
Through-the-web computation is not likely to be useful for power users. These users typically have sophisticated script/command-line interfaces they have developed to facilitate their work. Furthermore, power users have the skills needed to run and maintain their own CDF. The CDF provides a framework for subscriptions which allows CDF servers to communicate results to one another.
The power user creates the input files needed for their calculations using their own tools on their local file system. They use a python module downloaded from the CDF to query the database (and any CDFs they are subscribed to) for similar calculations, and adjust their input files as needed, or use the results they find. For the calculations that need to be run, they upload the input files into their personal directory on their local CDF server through webDAV [1] or sftp, organizing the files as they see fit. The CDF automatically creates the calculation objects from the input files. The power user submits the calculations all at once either through the CDF or via a script they have created. The CDF provides convenient monitoring portlets, but the power user prefers to use their own scripts which query the database for their jobs by status (submitted, running, recently finished, completed, etc…). When the jobs are done, the results are also available through mapped network drives via webDAV. This allows the power user to use their own analysis scripts on the results as if they existed on their local filesystem or to download the results to their local filesystem for backup and I/O-intensive analysis. The tabs shown in the CDF for completed calculations are simply views triggered by visiting specific URLs. The power user can use these URLs in scripts to download specific data if desired.
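A power user's monitoring script might look something like the sketch below. Everything in it is an assumption for illustration: the job records stand in for whatever a database query would return, and the host name, folder paths, and the "summary" view name are hypothetical.

```python
from collections import defaultdict
from urllib.parse import quote

# Stand-in for the result of a query against the database (hypothetical records).
jobs = [
    {"path": "alice/Tutorial-1/calc-001", "status": "finished"},
    {"path": "alice/Tutorial-1/calc-002", "status": "running"},
    {"path": "alice/oxides/calc-003", "status": "submitted"},
    {"path": "alice/oxides/calc-004", "status": "finished"},
]


def group_by_status(jobs):
    """Return a dict mapping each status to the list of job paths in that status."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["status"]].append(job["path"])
    return dict(groups)


def view_url(base, path, view):
    """Build the URL of one view of a calculation object (hypothetical URL layout)."""
    return f"{base.rstrip('/')}/{quote(path)}/{view}"


by_status = group_by_status(jobs)
urls = [
    view_url("https://cdf.example.org", path, "summary")
    for path in by_status.get("finished", [])
]
for url in urls:
    print(url)
```

Since each view is just a URL, the resulting list could be handed to any HTTP client to download the corresponding data.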
The power user collaborates with two other groups that are geographically and cybergraphically (i.e. users outside the power user’s network) distant. The CDF can use OpenID [2] (among other schemes) for authentication, so the external users are easily granted access to the power user’s local CDF installation without requiring a new account/password. The power user simply grants appropriate permissions for their collaborators’ login IDs to give them access only to the data they are collaboratively developing. One of the collaborators is an experimentalist who uses his personal space to upload PowerPoint presentations containing STM images and other experimental results that the computational collaborators are trying to simulate. After their brilliant work is published, they “publish” their experimental and calculation results in the CDF, tagged with a reference to the peer-reviewed publication so it can be appropriately cited by others who may use the data.
After seeing how useful the CDF is for sharing data, the power user decides to upload all of their legacy data. The CDF provides scripts that can scrape a local file system, identify computational results, and upload them to the CDF with a user-defined permission policy. Alternatively, the power user may choose to manually upload the results to maintain a folder hierarchy that is useful for organization of the results for the user. This is easily done with traditional tools like rsync/scp or through file explorers using mapped network drives.
1.4. Data-mining the CDF
An advanced user could mine the CDF to look for trends in reactivity, correlations in structures, or to test ideas of reactivity. For example, carbon tolerant alloy catalysts might be found by searching the database for results used to identify alloys with weak carbon binding energies. If the electronic structures are available for those calculations, it might be possible to develop models connecting electronic structure features of the alloys to the chemical properties. If not, the geometries would provide good starting points for electronic structure calculations.
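The carbon-tolerance search could be prototyped along these lines. The records, alloy labels, and threshold are made-up placeholders, and the sketch assumes the common sign convention that a less negative binding energy means weaker binding.

```python
# Placeholder records standing in for search results; values are invented.
records = [
    {"alloy": "alloy-A", "adsorbate": "C", "binding_energy": -7.4},
    {"alloy": "alloy-B", "adsorbate": "C", "binding_energy": -5.9},
    {"alloy": "alloy-C", "adsorbate": "O", "binding_energy": -4.2},
    {"alloy": "alloy-D", "adsorbate": "C", "binding_energy": -6.3},
]


def weak_binders(records, adsorbate, threshold):
    """Return records for `adsorbate` that bind more weakly than `threshold` (eV).

    Results are sorted with the weakest (least negative) binding first.
    """
    hits = [
        r for r in records
        if r["adsorbate"] == adsorbate and r["binding_energy"] > threshold
    ]
    return sorted(hits, key=lambda r: r["binding_energy"], reverse=True)


candidates = weak_binders(records, "C", threshold=-6.5)
for record in candidates:
    print(record["alloy"], record["binding_energy"])
```

The shortlisted alloys would then be the ones to examine for electronic structure features, as described above.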
1.5. Developers
The CDF would be built upon the Zope [3] web application framework and Plone [4], a content management system that runs on Zope. Many of the features discussed so far (security, webDAV, automatic form generation) are built in. The objects are indexed in the ZODB (Zope Object Database), which makes them searchable, although multiple storage and database options exist, including relational databases that utilize SQL. Searches only display results which the user has permission to view. Which data is indexed is configurable by the developers. One also has the option to walk the object database to make sure nothing is missed.
Zope supports customized content objects, workflows, skins (appearances), as well as adapters that modify or extend existing objects. A power user could write a Product that adds functionality to existing packages without modifying the original source code.
Developers of the CDF would include a large community of open-source software developers (Zope/Plone), human-computer interface specialists, and computational scientists. A dedicated group would focus on the user interface, how it was used, and how it could facilitate collaboration. Research groups will be able to develop content for the CDF through the web.
The CDF itself does not need to run any intensive calculations, so it can be installed on standard Windows, Mac or Linux based servers, including personal desktop machines for personal research data management or laptops for portability. The CDF is modular, so only Products that are needed by the research group need to be installed. Longer term efforts will be needed to enable transparent and secure communications between CDF servers, and perhaps to provide mirroring capabilities for data security and redundancy.
2. Broader impacts
Although the description above is focused on computations, the cyberinfrastructure described could be used by nearly any field. The data sharing mechanisms are independent of the data themselves, so the CDF could just as easily be used simply for sharing data that individual groups use.
3. Size of effort
The effort to create the CDF would be substantial and on par with the iPlant collaborative project [5] and nanohub.org [6]. A significant difference in the philosophy of the CDF is that a single software package would be available for anyone to download, install and use on their own server if desired. The foundation of the CDF will allow other 3rd party Zope/Plone packages to be installed which could provide additional functionalities such as calendars, instant messaging, etc…
I envision a Center-scale effort that includes workshops for the various stakeholders including computational code developers (to discuss ways the CDF might be incorporated into future versions of codes as well as information that is useful as metadata for data mining), computational researchers (to identify features that are needed/desired) and students (to teach them computational principles and methods).
4. References
1. WebDAV. Available from: http://www.webdav.org/
2. OpenID. Available from: http://openid.net/
3. Zope. Available from: www.zope.org
4. Plone. Available from: http://plone.org
5. iPlant Collaborative. Available from: http://www.iplantcollaborative.org/
6. nanohub.org. Available from: http://nanohub.org/

