Using swish-e to index org files as html

| categories: search, emacs | tags:

When we wrote about using swish-e before , we just indexed the org files as text. This worked pretty well, but we lost some resolution, e.g. being able to search for text in a headline. that is more possible if we index html or xml. So, here we try indexing the org files as html. It will be slower to index because we will filter each org file through a command that exports it to html, but hopefully it will be worth it for the enhanced search capability.

We will need a filter shell command that takes an org-file and spits out html. This command is shown as an emacs-lisp script here. This is a pretty bare bones export, and would lack the export of all my custom links from org-ref. I tried this, but org-ref outputs a lot of stuff to stdout when it loads, and unless I can figure out how to suppress that I don't want it here for now.

:;exec emacs -batch -l $0 -f main "$@"
(require 'org)
;(add-to-list 'load-path "/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa")
;(add-to-list 'load-path "/Users/jkitchin/Dropbox/kitchingroup/jmax/org-ref")
;(setq package-user-dir "/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa")
;(package-initialize)
;(require 'org-ref)
(defun main ()
  (find-file (car command-line-args-left))
  (org-html-export-as-html nil nil nil t)
  (switch-to-buffer "*Org HTML Export*")
  (print (buffer-string)))

;; Local Variables:
;; mode: emacs-lisp
;; End:

We try it out here:

./org2html.el index-org-as-html.org
"<div id=\"table-of-contents\">
<h2>Table of Contents</h2>
<div id=\"text-table-of-contents\">
<ul>
<li><a href=\"#sec-1\">1. Using swish-e to index org files as html</a></li>
</ul>
</div>
</div>
<div id=\"outline-container-sec-1\" class=\"outline-2\">
<h2 id=\"sec-1\"><span class=\"section-number-2\">1</span> Using swish-e to index org files as html</h2>
<div class=\"outline-text-2\" id=\"text-1\">
<p>
When we wrote about using swish-e <a href=\"http://kitchingroup.cheme.cmu.edu/blog/2015/06/25/Integrating-swish-e-and-Emacs/\">before</a>, we just indexed the org files as text. This worked pretty well, but we lost some resolution, e.g. being able to search for text in a headline. that is more possible if we index html or xml. So, here we try indexing the org files as html. It will be slower to index because we will filter each org file through a command that exports it to html, but hopefully it will be worth it for the enhanced search capability.
</p>

<p>
We will need a filter shell command that takes an org-file and spits out html. This command is shown as an emacs-lisp script here. This is a pretty bare bones export, and would lack the export of all my custom links
</p>

<p>
cite:dauenhauer-2006-renew
</p>

<div class=\"org-src-container\">

<pre class=\"src src-emacs-lisp\">:;exec emacs -batch -l $0 -f main \"$@\"
(require 'org)
;(add-to-list 'load-path \"/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa\")
;(add-to-list 'load-path \"/Users/jkitchin/Dropbox/kitchingroup/jmax/org-ref\")
;(setq package-user-dir \"/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa\")
;(package-initialize)
;(require 'org-ref)
(defun main ()
  (find-file (car command-line-args-left))
  (org-html-export-as-html nil nil nil t)
  (switch-to-buffer \"*Org HTML Export*\")
  (print (buffer-string)))

;; Local Variables:
;; mode: emacs-lisp
;; End:
</pre>
</div>


<div class=\"org-src-container\">

<pre class=\"src src-sh\">./org2html.el index-org-as-html.org
</pre>
</div>

<div class=\"org-src-container\">

<pre class=\"src src-text\"># Example configuration file

# Tell Swish-e what to directories to index
IndexDir /Users/jkitchin/blogofile-jkitchin.github.com

# where to save the index
IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/index.swish-e

# What to index
IndexOnly .org

# Tell Swish-e that .txt files are to use the text parser.
IndexContents TXT* .org

FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/org2html.el

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9
</pre>
</div>
</div>
</div>
"

I think that looks good. Now, let's configure a swish indexer.

# Example configuration file

# Tell Swish-e what to directories to index
IndexDir /Users/jkitchin/blogofile-jkitchin.github.com

# where to save the index
IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/index.swish-e

# What to index
IndexOnly .org

# Tell Swish-e that .txt files are to use the HTML parser.
IndexContents HTML* .org

FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/org2html.el

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9

MetaNames class swishtitle
HTMLLinksMetaName links

PropertyNames author subjects

StoreDescription HTML <body>

And now, run the index command. I did this at the command line. A lot of output! mostly not being able to fontify source blocks because htmlize was not on the path, and a bunch of attribute parsing errors, and a few utf-8 errors.

swish-e -c swish-org-html.conf

And a test search for files with "selector" in a headline.

swish-e -f index.swish-e -x '%r\t%p\n' -w selector -t h
# SWISH format: 2.4.7
# Search words: selector
# Removed stopwords:
# Number of hits: 4
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000	/Users/jkitchin/blogofile-jkitchin.github.com/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_deploy/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog-2014.org
.

A phrase in a headline.

swish-e -f index.swish-e -x '%r\t%p\n' -w "information for all documents" -t h
# SWISH format: 2.4.7
# Search words: information for all documents
# Removed stopwords:
# Number of hits: 5
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog.org
921	/Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog-2014.org
794	/Users/jkitchin/blogofile-jkitchin.github.com/org/2015/04/03/Getting-data-from-the-Scopus-API.org
794	/Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/04/03/Getting-data-from-the-Scopus-API.org
794	/Users/jkitchin/blogofile-jkitchin.github.com/_deploy/org/2015/04/03/Getting-data-from-the-Scopus-API.org
.

Sweet. How about all documents containing this citation:

swish-e -f index.swish-e -x '%r\t%p\n' -w cite:kitchin-2004-modif-pt
# SWISH format: 2.4.7
# Search words: cite:kitchin-2004-modif-pt
# Removed stopwords:
# Number of hits: 3
# Search time: 0.000 seconds
# Run time: 0.008 seconds
1000	/Users/jkitchin/blogofile-jkitchin.github.com/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_deploy/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
.

Super nice.

Copyright (C) 2015 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.10

Discuss on Twitter