Using swish-e to index org files as html
Posted July 03, 2015 at 10:13 AM | categories: search, emacs | tags:
When we wrote about using swish-e before, we just indexed the org files as text. This worked pretty well, but we lost some resolution, e.g. being able to search for text in a headline. That becomes possible if we index html or xml. So, here we try indexing the org files as html. It will be slower to index because we will filter each org file through a command that exports it to html, but hopefully it will be worth it for the enhanced search capability.
We will need a filter shell command that takes an org-file and spits out html. This command is shown as an emacs-lisp script here. It is a pretty bare-bones export, and it lacks the export of all my custom links from org-ref. I tried loading org-ref, but it outputs a lot of stuff to stdout when it loads, and unless I can figure out how to suppress that I don't want it here for now.
:;exec emacs -batch -l $0 -f main "$@"

(require 'org)

;(add-to-list 'load-path "/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa")
;(add-to-list 'load-path "/Users/jkitchin/Dropbox/kitchingroup/jmax/org-ref")
;(setq package-user-dir "/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa")
;(package-initialize)
;(require 'org-ref)

(defun main ()
  (find-file (car command-line-args-left))
  (org-html-export-as-html nil nil nil t)
  (switch-to-buffer "*Org HTML Export*")
  (print (buffer-string)))

;; Local Variables:
;; mode: emacs-lisp
;; End:
We try it out here:
./org2html.el index-org-as-html.org
"<div id=\"table-of-contents\"> <h2>Table of Contents</h2> <div id=\"text-table-of-contents\"> <ul> <li><a href=\"#sec-1\">1. Using swish-e to index org files as html</a></li> </ul> </div> </div> <div id=\"outline-container-sec-1\" class=\"outline-2\"> <h2 id=\"sec-1\"><span class=\"section-number-2\">1</span> Using swish-e to index org files as html</h2> <div class=\"outline-text-2\" id=\"text-1\"> <p> When we wrote about using swish-e <a href=\"http://kitchingroup.cheme.cmu.edu/blog/2015/06/25/Integrating-swish-e-and-Emacs/\">before</a>, we just indexed the org files as text. This worked pretty well, but we lost some resolution, e.g. being able to search for text in a headline. that is more possible if we index html or xml. So, here we try indexing the org files as html. It will be slower to index because we will filter each org file through a command that exports it to html, but hopefully it will be worth it for the enhanced search capability. </p> <p> We will need a filter shell command that takes an org-file and spits out html. This command is shown as an emacs-lisp script here. 
This is a pretty bare bones export, and would lack the export of all my custom links </p> <p> cite:dauenhauer-2006-renew </p> <div class=\"org-src-container\"> <pre class=\"src src-emacs-lisp\">:;exec emacs -batch -l $0 -f main \"$@\" (require 'org) ;(add-to-list 'load-path \"/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa\") ;(add-to-list 'load-path \"/Users/jkitchin/Dropbox/kitchingroup/jmax/org-ref\") ;(setq package-user-dir \"/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa\") ;(package-initialize) ;(require 'org-ref) (defun main () (find-file (car command-line-args-left)) (org-html-export-as-html nil nil nil t) (switch-to-buffer \"*Org HTML Export*\") (print (buffer-string))) ;; Local Variables: ;; mode: emacs-lisp ;; End: </pre> </div> <div class=\"org-src-container\"> <pre class=\"src src-sh\">./org2html.el index-org-as-html.org </pre> </div> <div class=\"org-src-container\"> <pre class=\"src src-text\"># Example configuration file # Tell Swish-e what to directories to index IndexDir /Users/jkitchin/blogofile-jkitchin.github.com # where to save the index IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/index.swish-e # What to index IndexOnly .org # Tell Swish-e that .txt files are to use the text parser. IndexContents TXT* .org FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/org2html.el # Ask libxml2 to report any parsing errors and warnings or # any UTF-8 to 8859-1 conversion errors ParserWarnLevel 9 </pre> </div> </div> </div> "
I think that looks good. Now, let's configure a swish indexer.
# Example configuration file

# Tell Swish-e which directories to index
IndexDir /Users/jkitchin/blogofile-jkitchin.github.com

# where to save the index
IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/index.swish-e

# What to index
IndexOnly .org

# Tell Swish-e that .org files are to use the HTML parser.
IndexContents HTML* .org

FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/org2html.el

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9

MetaNames class swishtitle
HTMLLinksMetaName links
PropertyNames author subjects
StoreDescription HTML <body>
And now, run the index command. I did this at the command line. It produced a lot of output: mostly complaints that source blocks could not be fontified because htmlize was not on the path, plus a bunch of attribute parsing errors and a few utf-8 errors.
swish-e -c swish-org-html.conf
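One possible contributor to the attribute parsing errors mentioned above: Emacs Lisp's print writes a string in read syntax, wrapped in double quotes with every inner quote escaped as \", which matches the filter output shown earlier. Whether that is what the parser actually complained about is an assumption I have not verified; princ is the usual way to emit a string without that quoting. Here is a minimal sketch for checking what a filter emits. The helper name looks_like_raw_html is hypothetical, and the demo input is made up.

```shell
#!/bin/sh
# looks_like_raw_html is a hypothetical helper, not part of swish-e or Emacs.
# It succeeds when the first non-blank character on stdin is "<" and fails
# otherwise, e.g. when the text starts with a double quote the way a
# Lisp-printed string does.
looks_like_raw_html() {
    first=$(awk 'NF { sub(/^[ \t]+/, ""); print substr($0, 1, 1); exit }')
    [ "$first" = "<" ]
}

# Demo with made-up input: raw HTML versus a Lisp-printed string.
printf '\n<div id="x"></div>\n' | looks_like_raw_html && echo "raw html"
printf '\n"<div id=\\"x\\"></div>"\n' | looks_like_raw_html || echo "quoted string"
```

To check the real filter, pipe it through the helper: ./org2html.el index-org-as-html.org | looks_like_raw_html. A failing exit status would suggest the indexer is seeing a quoted string rather than plain html.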
And a test search for files with "selector" in a headline.
swish-e -f index.swish-e -x '%r\t%p\n' -w selector -t h
# SWISH format: 2.4.7
# Search words: selector
# Removed stopwords:
# Number of hits: 4
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_deploy/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog-2014.org
.
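The same post shows up three times in these hits, once under org/ and again under _site/ and _deploy/. If those two directories are just generated copies (an assumption about this blog layout), the results can be thinned out after the search. Here is a rough sketch of a filter for output produced with -x '%r\t%p\n'. The helper name filter_hits is hypothetical, and the demo paths are made up.

```shell
#!/bin/sh
# filter_hits is a hypothetical helper, not part of swish-e.
# It reads swish-e output (rank, tab, path per line) on stdin and keeps only
# result lines whose path is not under a _site/ or _deploy/ directory.
filter_hits() {
    awk -F '\t' '
        /^#/ || /^\.$/ || NF < 2 { next }    # header comments, end marker, blanks
        $2 ~ /\/_(site|deploy)\// { next }   # assumed generated copies
        { print }
    '
}

# Demo on a few lines shaped like swish-e output (paths are made up).
printf '# Number of hits: 3\n1000\t/blog/org/post.org\n1000\t/blog/_site/org/post.org\n1000\t/blog/_deploy/org/post.org\n.\n' | filter_hits
```

With a real index, the search is piped through the helper: swish-e -f index.swish-e -x '%r\t%p\n' -w selector -t h | filter_hits. Filtering after the search leaves the index untouched; excluding those directories at index time would be the other option.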
A phrase in a headline.
swish-e -f index.swish-e -x '%r\t%p\n' -w "information for all documents" -t h
# SWISH format: 2.4.7
# Search words: information for all documents
# Removed stopwords:
# Number of hits: 5
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog.org
921 /Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog-2014.org
794 /Users/jkitchin/blogofile-jkitchin.github.com/org/2015/04/03/Getting-data-from-the-Scopus-API.org
794 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/04/03/Getting-data-from-the-Scopus-API.org
794 /Users/jkitchin/blogofile-jkitchin.github.com/_deploy/org/2015/04/03/Getting-data-from-the-Scopus-API.org
.
Sweet. How about all documents containing this citation:
swish-e -f index.swish-e -x '%r\t%p\n' -w cite:kitchin-2004-modif-pt
# SWISH format: 2.4.7
# Search words: cite:kitchin-2004-modif-pt
# Removed stopwords:
# Number of hits: 3
# Search time: 0.000 seconds
# Run time: 0.008 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_deploy/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
.
Super nice.
Copyright (C) 2015 by John Kitchin. See the License for information about copying.
Org-mode version = 8.2.10