Indexing text in screenshots so they are searchable

| categories: search, image | tags:

I had an interesting conversation with someone yesterday about using screenshots to document a process. This has the appeal of simplicity, since it is agnostic to the process, but I wondered about the long-term utility. If the documentation is valuable, you will want to search it in the future, and so far we have limited capability to search images. But if there is important text in an image, we might be able to recognize that text and use it to index the image. Then the images would become searchable, at least for the text they contain. Don't get me wrong, it would be better if we could store that text directly, but suppose that just isn't possible, maybe because it only exists in some kind of GUI, or it is locked up in proprietary software. Here we explore the potential utility of this idea. We only explore getting text out of images, not the actual indexing part. See this post for how we could integrate this into an actual index and search tool.

First we need an OCR engine. Tesseract (tesseract-ocr/tesseract) is supposed to be pretty good, and it easily installs via brew:

brew install tesseract
(shell-command-to-string "tesseract -v")
tesseract 3.02.02
 leptonica-1.72
  libjpeg 8d : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.5

We will test it on some screenshots taken with the default settings on my Mac. Here is an example. It has text in colored boxes, and some shadowing on the left side, which is an OS effect caused by another window that was near it.

And here is the simplest text extraction.

tesseract ./screenshots/date-23-07-2015-time-19-56-52.png test
cat test.txt
rm test.txt
M. ~ an an. -.2... nuIunIrv- I.nd=d¢wn, a.....=.-p. u|u.=u_x. van" bar imam-

smm-ry ~ nu... ma, Camilla man Qu-In lui. Sdlwnu u._-. /-x/mu



master node q)l9amesh.chemc.cmn.edn up 9: days, mas, n nsersv

cutflmltc nude: men): :1

bpxoc node sum 17 dmln , o.1s,u.:n up
cuxque node state: n17 dawn

tuxqng jobs: 207 max, m2 nmmnq. as queued
aggregate laud: 799.55

That is not too good. I am skeptical it would be useful for anything. This page suggests improvements can be made with image cleanup, but that image is pretty clear to me. There is a little shadow on the left side, but otherwise it is very readable. Let us try doubling the size of the image; maybe that will help. ImageMagick lets us resize the image pretty easily.

convert -resize 200% ./screenshots/date-23-07-2015-time-19-56-52.png ./screenshots/doubled.png

tesseract ./screenshots/doubled.png test
cat test.txt
rm test.txt ./screenshots/doubled.png
- 2D Node status summary. Rnd=down, Groun=up, BIu¢=boot. Vdlow bur indium --



Summlry - Hnflnr Mods, Compuh Nodu, Quuue, Land, Sofiwan Ric:-as, /dzc/maul

master node gilqnmesh.cheme.cmu.edu up 91 days, 10:45, 11 users, .'

compute nodes total: 31

bproc node state: 17 down , 0-16,18-30 up
torque node state: n17 down

torque jobs: 207 total, 102 running, 88 queued
aggregate load: 799.55

That is a remarkable improvement! This looks very useful, and could even be used for indexing. Interestingly, the white text on colored backgrounds is not recognized as well. That could be important to remember when designing the GUI front end, if you have any say in it. Let's try a few others.
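The resize-then-recognize recipe used above can be wrapped in a small shell function. This is only a sketch under some assumptions: the function name `ocr_scaled` and the scratch file names are made up for illustration, and it assumes `convert` (ImageMagick) and `tesseract` are on the PATH, called the same way as in the commands above.

```shell
# ocr_scaled IMAGE [PERCENT]
# Upscale IMAGE with ImageMagick, run tesseract on the enlarged copy, and
# print the recognized text on stdout. PERCENT defaults to 200, the value
# that helped for the screenshot above; other images may need other values.
# The body runs in a subshell, so its variables do not leak to the caller.
ocr_scaled() (
    img=$1
    percent=${2:-200}
    # Hypothetical scratch directory, removed before the function finishes.
    tmpdir=$(mktemp -d) || exit 1
    # tesseract appends ".txt" to the output base name it is given.
    convert -resize "${percent}%" "$img" "$tmpdir/scaled.png" &&
        tesseract "$tmpdir/scaled.png" "$tmpdir/out" >/dev/null 2>&1 &&
        cat "$tmpdir/out.txt"
    status=$?
    rm -rf "$tmpdir"
    exit "$status"
)
```

With this defined, `ocr_scaled ./screenshots/date-23-07-2015-time-19-56-52.png` would repeat the doubled-size experiment, and a second argument such as `300` would try a different scale.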

convert -resize 200% ./screenshots/date-23-07-2015-time-20-01-43.png ./screenshots/doubled.png

tesseract ./screenshots/doubled.png test
cat test.txt
rm test.txt ./screenshots/doubled.png
Fllbn V Q I-"Ind a repository...

tesseract
Tesseract Open Source OCR Enghe (main repository)
Updated 43 minutes ago

That is also pretty good. Again, the text on colored backgrounds is less good than that on a white background. And one more from Twitter.

convert -resize 200% ./screenshots/date-23-07-2015-time-20-05-59.png ./screenshots/doubled.png

tesseract ./screenshots/doubled.png test
cat test.txt
rm test.txt ./screenshots/doubled.png
chEnected AIchE .-.ChEnected - 5h
C h E AlChE Academy: Take AlChE‘s Pneumatic Conveying 0! Bulk Solids Course and
learn practical design principles. r1m.ag/|IJcN

Overall pretty good. I am pretty convinced that the right screenshots could be an easy way to document some processes conveniently and that they could be searched in the future. Some testing would need to be done on how to do the OCR, and it does not appear that one recipe will work for all images.

Nevertheless, a program could pretty easily be developed to index a directory of screenshots, do the OCR on each image, and index the resulting text so it can be searched. It is even conceivable that several indexes could be made using different OCR methods that might work better for different types of images. Another improvement would be if some metadata could be associated with the screenshot including what instrument it is a shot of, who took it, etc… This starts to move in a direction that requires dedicated screenshot tools, and away from the simplicity of the PrintScreen button, but it adds value to the data that makes it easier to find later.
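A minimal sketch of such a program is shown here. It is not a real index: it stores the OCR text in a sidecar file next to each image and searches those files with `grep`. The function names and the `IMAGE.png.txt` naming scheme are assumptions made for this sketch, and the 200% scale is simply the value that worked in the examples above.

```shell
# index_screenshots DIR
# OCR every PNG in DIR and store the text next to it as IMAGE.png.txt.
# Images that already have a sidecar file are skipped.
index_screenshots() (
    dir=$1
    tmpdir=$(mktemp -d) || exit 1
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # the glob matched nothing
        [ -e "$img.txt" ] && continue  # text was already extracted
        # tesseract writes its result to "<output base>.txt".
        convert -resize 200% "$img" "$tmpdir/scaled.png" &&
            tesseract "$tmpdir/scaled.png" "$img" >/dev/null 2>&1
    done
    rm -rf "$tmpdir"
)

# search_screenshots DIR WORD
# Print the screenshots in DIR whose extracted text contains WORD
# (case-insensitive, fixed-string match).
search_screenshots() {
    grep -l -i -F -- "$2" "$1"/*.png.txt 2>/dev/null | sed 's/\.txt$//'
}
```

For example, `index_screenshots ./screenshots` followed by `search_screenshots ./screenshots torque` would list the screenshots whose recognized text mentions torque. Replacing `grep` with a real indexer would be the next step.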

The beauty of the screenshot is that it is what you saw at the time, and it captures things as they were. It doesn't capture anything "behind the scenes" or off-screen of course, so there is certainly room to lose information this way. A careful analysis of what information is captured and what is lost would be important to assess the long-term value of capturing the process this way as opposed to any other way. There certainly is a balance of how much effort it takes to capture it and the value of the information in the future, and cost to rerun it if it isn't found in the future.

Copyright (C) 2015 by John Kitchin. See the License for information about copying.

org-mode source

Org-mode version = 8.2.10

Discuss on Twitter

An xml representation of an org document for indexing with swish-e

| categories: search, emacs | tags:

Swish-e can index xml data and enable searching by tag. Here we push our org-mode indexing idea a little further. Initially we indexed org files as text. Then we exported them to html and indexed the html, which enabled some richer searching. Now we will create an xml representation of the org file for indexing. This will let us use a custom tag system and search for specific text in tables, src-blocks, or headlines, or for headlines with certain tags, todo states or properties.

Incidentally, this is a general strategy for indexing arbitrary files. You just make an xml representation of the file containing the data to be indexed, and use swish-e to index that xml.
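To make the general strategy concrete, here is a small sketch that wraps a plain text file in xml. The tag and attribute names (`document`, `filename`, `content`) and the function names are made up for illustration. The only real requirement is that the text is escaped so the result is well-formed xml.

```shell
# xml_escape
# Filter stdin to stdout, replacing the characters that are special in XML
# text and attribute values. The ampersand has to be replaced first.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# text_to_xml FILE
# Print a small XML document on stdout that holds the name and the
# contents of FILE.
text_to_xml() {
    printf '<document filename="%s">' "$(printf '%s' "$1" | xml_escape)"
    printf '<content>'
    xml_escape < "$1"
    printf '</content></document>\n'
}
```

A script built around a function like `text_to_xml` could then play the same role for plain text files that the org filter script below plays for org files.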

Let us start with code to generate xml. I adapted this from some code in Land of Lisp. First, a function that simply prints a tag with attributes.

(defun print-tag (name attrs &optional closingp)
  "Print an xml tag with symbol NAME and ATTRS (a cons list of (attribute . value)).
if CLOSINGP print the closing tag instead."
  (format
   "<%s%s%s>"
   (if closingp "/" "")
   name
   (if (and attrs (not closingp))
       (concat
        " "
        (mapconcat
         (lambda (x)
           (format "%s=\"%s\""
                   (car x)
                   (xml-escape-string (cdr x))))
         attrs
         " "))
     "")))

(print-tag 'html '((color . "blue") (label . "test")))
<html color="blue" label="test">

XML tags almost always come in pairs. We define a macro to make this happen here. The macro prints the opening tag, evaluates the body, and prints the closing tag. Note that the body may contain other tags, or a string. The string should be escaped to avoid illegal xml characters.

(defmacro tag (name attributes &rest body)
  `(format "%s%s%s"
           (print-tag ,name ,attributes nil)
           (concat
           ,@body)
           (print-tag ,name nil t)))

;; example usage
(tag "xml" '((test . "id"))
     (tag "body" nil
          (tag "p" nil (xml-escape-string "paragraph & < 1"))
          (tag "p" nil "paragraph 2")))
<xml test="id"><body><p>paragraph &amp; &lt; 1</p><p>paragraph 2</p></body></xml>

Now, we can use this to get an xml representation of the source blocks, e.g.

(mapconcat 'identity
           (org-element-map
               (org-element-parse-buffer)
               'src-block
             (lambda (element)
               (tag
                'src-block
                `((language . ,(org-element-property :language element)))
                (tag 'contents ()
                     (xml-escape-string
                      (org-element-property :value element))))))
           "")
<src-block language="emacs-lisp"><contents>(defun print-tag (name attrs &amp;optional closingp)
  &quot;Print an xml tag with symbol NAME and ATTRS (a cons list of (attribute . value)).
if CLOSINGP print the closing tag instead.&quot;
  (format
   &quot;&lt;%s%s%s&gt;&quot;
   (if closingp &quot;/&quot; &quot;&quot;)
   name
   (if (and attrs (not closingp))
       (concat
	&quot; &quot;
	(mapconcat
	 (lambda (x)
	   (format &quot;%s=\&quot;%s\&quot;&quot;
		   (car x)
		   (xml-escape-string (cdr x))))
	 attrs
	 &quot; &quot;))
     &quot;&quot;)))

(print-tag &apos;html &apos;((color . &quot;blue&quot;) (label . &quot;test&quot;)))
</contents></src-block><src-block language="emacs-lisp"><contents>(defmacro tag (name attributes &amp;rest body)
  `(format &quot;%s%s%s&quot;
	   (print-tag ,name ,attributes nil)
           (concat
	   ,@body)
	   (print-tag ,name nil t)))

(tag &quot;xml&quot; &apos;((test . &quot;id&quot;))
     (tag &quot;body&quot; nil
	  (tag &quot;p&quot; nil (xml-escape-string &quot;paragraph &amp; &lt; 1&quot;))
	  (tag &quot;p&quot; nil &quot;paragraph 2&quot;)))
</contents></src-block><src-block language="emacs-lisp"><contents>(mapconcat &apos;identity
	   (org-element-map
	       (org-element-parse-buffer)
	       &apos;src-block
	     (lambda (element)
	       (tag
		&apos;src-block
		`((language . ,(org-element-property :language element)))
		(tag &apos;contents ()
		     (xml-escape-string
		      (org-element-property :value element))))))
	   &quot;&quot;)
</contents></src-block><src-block language="emacs-lisp"><contents>(let ((xml (tag &apos;root `((filename . ,(buffer-file-name))
			(indexed-on . ,(current-time-string)))
		;; map the headlines
		(mapconcat
		 &apos;identity
		 (org-map-entries
		  (lambda ()
		    (let* ((tags (org-get-tags))
			   (heading-components (org-heading-components))
			   (title (nth 4 heading-components))
			   (level (nth 0 heading-components))
			   (properties (org-entry-properties))
			   (elem (org-element-at-point))
			   (bp (org-element-property :contents-begin elem))
			   (ep (org-element-property :contents-end elem))
			   (content (buffer-substring bp ep)))
		      (tag &apos;heading `((level . ,level))
			   (tag &apos;title () (xml-escape-string title))
			   (tag &apos;tags () (mapconcat &apos;identity tags &quot; &quot;))
			   (tag &apos;properties ()
				(mapconcat
				 (lambda (x)
				   (tag &apos;property `((label . (car ,x))) (cdr x)))
				 properties
				 &quot;&quot;))
			   (tag &apos;content ()
				(format &quot;%s&quot; (xml-escape-string content)))))))
		 &quot;&quot;)

		;; map specific element types
		(tag &apos;source-blocks ()
		     (mapconcat
		      &apos;identity
		      (org-element-map
			  (org-element-parse-buffer)
			  &apos;src-block
			(lambda (element)
			  (tag &apos;src-block
			       `((language .
					   ,(org-element-property
					     :language element)))
			       (tag &apos;contents ()
				    (xml-escape-string
				     (org-element-property :value element)))))) &quot;&quot;))

		(tag &apos;tables ()
		     (mapconcat
		      &apos;identity
		      (org-element-map
			  (org-element-parse-buffer)
			  &apos;table
			(lambda (element)
			  (tag &apos;table ()
			       (when (org-element-property :caption element)
				 (tag &apos;caption ()
				(caaar (org-element-property :caption element))))
			       (xml-escape-string
				(buffer-substring
				 (org-element-property :contents-begin element)
				 (org-element-property :contents-end element))))))
		      &quot;&quot;))

		(tag &apos;paragraphs ()
		     (mapconcat
		      &apos;identity
		      (org-element-map
			  (org-element-parse-buffer)
			  &apos;paragraph
			(lambda (element)
			  (tag &apos;paragraph ()
			       (xml-escape-string
				(buffer-substring
				 (org-element-property :contents-begin element)
				 (org-element-property :contents-end element))))))
		      &quot;&quot;
		      ))
		)))
  (with-temp-file &quot;org2xml.xml&quot;
    (insert xml)))
</contents></src-block><src-block language="emacs-lisp"><contents>(xml-parse-file &quot;org2xml.xml&quot;)
</contents></src-block>

So, finally we can map the entries to get some information about them, e.g. the tags, properties, todo state, etc… Then we create xml representing all that information so we can have a more precise search. Instead of looking for a word, we can specify that the word be in a property for example. Then, we make xml representations of the tables, src-blocks and paragraphs.

I am going to follow the example here that we worked out before on html and create a filter function that takes an org-file and spits out xml at the command line.

:;exec emacs -batch -l $0 -f main "$@"
(require 'org)
(require 'xml)

(defun print-tag (name attrs &optional closingp)
  "Print an xml tag with symbol NAME and ATTRS (a cons list of (attribute . value)).
if CLOSINGP print the closing tag instead.
You should use `xml-escape-string' on text going into the attributes to avoid errors."
  (format
   "<%s%s%s>"
   (if closingp "/" "")
   name
   (if (and attrs (not closingp))
       (concat
        " "
        (mapconcat
         (lambda (x)
           (format "%s=\"%s\"" (car x) (cdr x)))
           attrs
           " "))
     "")))

(defmacro tag (name attributes &rest body)
  `(format "%s%s%s"
           (print-tag ,name ,attributes nil)
           (concat
           ,@body)
           (print-tag ,name nil t)))

(defun main ()
  (find-file (car command-line-args-left))
  (princ (tag 'root `((filename . ,(xml-escape-string (buffer-file-name)))
                      (indexed-on . ,(current-time-string)))
              ;; map the headlines
              (mapconcat
               'identity
               (org-map-entries
                (lambda ()
                  (let* ((tags (org-get-tags))
                         (heading-components (org-heading-components))
                         (todo (nth 2 heading-components))
                         (headline (nth 4 heading-components))
                         (thislevel (nth 0 heading-components))
                         (properties (org-entry-properties)))
                    (tag 'heading `((level . ,thislevel))
                         (tag 'headline () (xml-escape-string headline))
                         (tag 'tags () (mapconcat 'identity tags " "))
                         (when todo
                           (tag 'todo () todo))
                         (tag 'properties ()
                              (mapconcat
                               (lambda (x)
                                 (tag 'property `((name . ,(xml-escape-string (car x))))
                                      (xml-escape-string (cdr x))))
                               properties
                               ""))))))
               "")

              ;; get file keywords, TITLE, authors, etc...
              (tag 'file-keywords ()
                   (mapconcat 'identity
                              (org-element-map (org-element-parse-buffer 'element) 'keyword
                                (lambda (keyword)
                                  (tag (xml-escape-string (org-element-property :key keyword)) ()
                                       (xml-escape-string (org-element-property :value keyword)))))
                              ""))

              ;; map specific element types
              (tag 'source-blocks ()
                   (mapconcat
                    'identity
                    (org-element-map
                        (org-element-parse-buffer)
                        'src-block
                      (lambda (element)
                        (tag 'src-block
                             `((language .
                                         ,(org-element-property
                                           :language element)))
                             (tag 'contents ()
                                  (xml-escape-string
                                   (org-element-property :value element)))))) ""))

              (tag 'tables ()
                   (mapconcat
                    'identity
                    (org-element-map
                        (org-element-parse-buffer)
                        'table
                      (lambda (element)
                        (tag 'table ()
                             (when (org-element-property :caption element)
                               (tag 'caption ()
                                    (format
                                     "%s"
                                     (org-element-property
                                      :caption element))))
                             (xml-escape-string
                              (buffer-substring
                               (org-element-property :contents-begin element)
                               (org-element-property :contents-end element))))))
                    ""))

              (tag 'paragraphs ()
                   (mapconcat
                    'identity
                    (org-element-map
                        (org-element-parse-buffer)
                        'paragraph
                      (lambda (element)
                        (tag 'paragraph ()
                             (xml-escape-string
                              (buffer-substring
                               (org-element-property :contents-begin element)
                               (org-element-property :contents-end element))))))
                    ""
                    )))))

We could do more, e.g. links, or images, but this is pretty good for now. Now, let's configure a swish indexer. We instruct swish-e to use some metanames, and attributes so we can search on them later.

# Example configuration file

# Tell Swish-e which directories to index
IndexDir /Users/jkitchin/blogofile-jkitchin.github.com/_site

# where to save the index
IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/index-org2xml.swish-e

# What to index
IndexOnly .org

# Tell Swish-e that .org files are to use the XML parser.
IndexContents XML* .org

FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/org2xml.el

# index all tags for searching
UndefinedMetaTags auto
UndefinedXMLAttributes auto

And now, run the index command. I did this at the command line. There might be some problems with the script, as there were some warnings about non-zero exits, but there were only a few, so we ignore them for now.

swish-e -c swish-org2xml.conf

1 Examples of searching for org-files

1.1 Files with words in the filename

Here we look for filenames with the word "Extracting" in them.

swish-e -f index-org2xml.swish-e -w root.filename=Extracting
# SWISH format: 2.4.7
# Search words: root.filename=Extracting
# Removed stopwords:
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/02/19/Extracting-bibtex-file-from-an-org-buffer.org "Extracting-bibtex-file-from-an-org-buffer.org" 6094
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org "notes.org" 195515
.

Or, thanks to the date being in the path, we can find files by year and month. How about July of 2012?

swish-e -f index-org2xml.swish-e -w root.filename="(2012/07)"
# SWISH format: 2.4.7
# Search words: root.filename=(2012/07)
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2012/07/15/Professor-Kitchin-was-awarded-the-Presidential-Early-Career-Award-for-Scientists-and-Engineers-(PECASE).org "Professor-Kitchin-was-awarded-the-Presidential-Early-Career-Award-for-Scientists-and-Engineers-(PECASE).org" 311
.

Interestingly, we have to use the parentheses here.

1.2 Files with headlines containing a word

Now, let's find documents with "Compiled" in the title of a level 2 heading.

swish-e -f index-org2xml.swish-e -w heading.level=2 title=Compiled -m5
# SWISH format: 2.4.7
# Search words: heading.level=2 title=Compiled
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-07-12-Org-mode-is-awesome/why-org-mode.org "why-org-mode.org" 13522
.

1.3 Headlines marked TODO

We can find documents with headlines marked TODO:

swish-e -f index-org2xml.swish-e  -w "todo=TODO" -m 5
# SWISH format: 2.4.7
# Search words: todo=TODO
# Removed stopwords:
# Number of hits: 12
# Search time: 0.000 seconds
# Run time: 0.008 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-01-27-Clocking-your-time-in-org-mode/blog.org "blog.org" 134160
624 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/02/16/A-dynamic-snippet-for-a-task-due-7-days-from-now.org "A-dynamic-snippet-for-a-task-due-7-days-from-now.org" 2587
425 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/02/16/END.org "END.org" 1531
269 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/02/01/Handling-multiple-selections-in-helm.org "Handling-multiple-selections-in-helm.org" 3290
269 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/01/30/More-adventures-in-helm---more-than-one-action.org "More-adventures-in-helm---more-than-one-action.org" 3236
.

1.4 For a table

Here we find documents with the word "energy" in a table.

swish-e -f index-org2xml.swish-e -w table="energy"
# SWISH format: 2.4.7
# Search words: table=energy
# Removed stopwords:
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/08/21/Using-org-entries-like-a-database.org "Using-org-entries-like-a-database.org" 53035
633 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/07/04/Estimating-uncertainties-in-equations-of-state.org "Estimating-uncertainties-in-equations-of-state.org" 3117
.

1.5 Tagged headlines

Find entries with a "slide" tag.

swish-e -f index-org2xml.swish-e -w "tags=slide"
# SWISH format: 2.4.7
# Search words: tags=slide
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.009 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-07-12-Org-mode-is-awesome/why-org-mode.org "why-org-mode.org" 13522
.

Evidently there is one file where I talk about slides in org-show.

1.6 Headlines with a property

Here I find documents with headlines that have thermodynamics in the property "categories".

swish-e -f index-org2xml.swish-e -w "property.label=categories property=thermodynamics"
# SWISH format: 2.4.7
# Search words: property.label=categories property=thermodynamics
# Removed stopwords:
# Number of hits: 10
# Search time: 0.000 seconds
# Run time: 0.009 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/01/Water-gas-shift-equilibria-via-the-NIST-Webbook.org "Water-gas-shift-equilibria-via-the-NIST-Webbook.org" 10789
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/03/01/Gibbs-energy-minimization-and-the-NIST-webbook.org "Gibbs-energy-minimization-and-the-NIST-webbook.org" 5441
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/03/01/Finding-equilibrium-composition-by-direct-minimization-of-Gibbs-free-energy-on-mole-numbers.org "Finding-equilibrium-composition-by-direct-minimization-of-Gibbs-free-energy-on-mole-numbers.org" 6155
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/27/Reading-parameter-database-text-files-in-python.org "Reading-parameter-database-text-files-in-python.org" 3947
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/18/The-Gibbs-free-energy-of-a-reacting-mixture-and-the-equilibrium-composition.org "The-Gibbs-free-energy-of-a-reacting-mixture-and-the-equilibrium-composition.org" 8230
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/18/Calculating-a-bubble-point-pressure-of-a-mixture.org "Calculating-a-bubble-point-pressure-of-a-mixture.org" 3203
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/15/The-equal-area-method-for-the-van-der-Waals-equation.org "The-equal-area-method-for-the-van-der-Waals-equation.org" 5737
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/12/Using-constrained-optimization-to-find-the-amount-of-each-phase-present.org "Using-constrained-optimization-to-find-the-amount-of-each-phase-present.org" 5210
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/02/05/Constrained-minimization-to-find-equilibrium-compositions.org "Constrained-minimization-to-find-equilibrium-compositions.org" 5666
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/09/23/Generating-an-atomic-stoichiometric-matrix.org "Generating-an-atomic-stoichiometric-matrix.org" 3487
.

That seems about right. According to http://kitchingroup.cheme.cmu.edu/categories.html there are 9 documents in that category, while the search finds 10. I am not sure why they don't totally agree, but I can live with it.

Here are documents containing headlines with the property "TOTAL_ENERGY"

swish-e -f index-org2xml.swish-e -w property.label=TOTAL_ENERGY
# SWISH format: 2.4.7
# Search words: property.label=TOTAL_ENERGY
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.008 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/08/21/Using-org-entries-like-a-database.org "Using-org-entries-like-a-database.org" 53035
.

1.7 Documents with a Python source block containing a word

Find org files with diffusion in a python source block.

swish-e -f index-org2xml.swish-e -w src-block.language=python -w src-block=diffusion
# SWISH format: 2.4.7
# Search words: src-block.language=python src-block=diffusion
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.011 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2013/04/02/Transient-diffusion---partial-differential-equations.org "Transient-diffusion---partial-differential-equations.org" 3660
.

1.8 An org-file with a UUID

swish-e -f index-org2xml.swish-e -w  property="(38FCCF3D-7FC5-49BF-BB77-486BBAA17CD9)"
# SWISH format: 2.4.7
# Search words: property=(38FCCF3D-7FC5-49BF-BB77-486BBAA17CD9)
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2014/11/23/Machine-gradable-quizzes-in-emacs+org-modex.org "Machine-gradable-quizzes-in-emacs+org-modex.org" 5743
.

Interestingly, the parentheses are again necessary to find a match, I think because of the dashes. The next example is similar, but finds an entry with a bibtex key in a CUSTOM_ID property.

swish-e -f index-org2xml.swish-e -w  property="(mantina-2008-first-princ)"
# SWISH format: 2.4.7
# Search words: property=(mantina-2008-first-princ)
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.010 seconds
1000 /Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org "notes.org" 195515
.

2 Summary

This is pretty cool. There are still some bugs to work out in the indexing filter I think, but this demonstrates you can index org-files, and have pretty refined searches to find your files. There is still some thinking to do on how to schedule an incremental indexing, and whether we need more or better metanames. The indexing is not fast, but that is probably because I am running this through a FileFilter, rather than the -S prog option in swish-e. This is super promising to me though. Imagine building an agenda from files found with TODO headlines in them; a global todo list! Or, grabbing contacts from wherever they are. No more losing files you have not used in a while. Find all documents containing a citation. With some extra work, you could index links, citations, chemical formulas, or other types of identifiable content.
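As a first step toward that global todo list, the file paths can be pulled out of the swish-e output. This is a sketch only: the function names are made up, the result-line layout is taken from the search output shown above, and paths containing spaces are not handled.

```shell
# swish_hit_paths
# Read swish-e output on stdin and print the file path of each hit.
# Result lines look like:  RANK PATH "TITLE" SIZE
# Header lines start with "#" and the listing ends with a lone ".".
swish_hit_paths() {
    awk '/^[0-9]+ / { print $2 }'
}

# todo_files INDEX
# Print the files in INDEX that have at least one headline marked TODO,
# using the same query as in the TODO example above.
todo_files() {
    swish-e -f "$1" -w "todo=TODO" | swish_hit_paths
}
```

Running `todo_files index-org2xml.swish-e` would then give a list of candidate files that could be handed to Emacs to build the agenda.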

The logical conclusion of this work might be an ox-swish-e-xml export engine to render the org-file into xml, rather than the script I used here. It would be really great to get some refined output, e.g. rather than just get matching documents, get location information so you could open the document to the matching element. That might be out of reach for swish-e, but could be in reach for other programs like Sphinx that are more integrated with a database. There is a very interesting project here: https://github.com/wvxvw/sphinx-mode to integrate org-mode with the Sphinx search engine (http://sphinxsearch.com).


Using swish-e to index org files as html

| categories: search, emacs | tags:

When we wrote about using swish-e before, we just indexed the org files as text. This worked pretty well, but we lost some resolution, e.g. being able to search for text in a headline. That is more feasible if we index html or xml. So, here we try indexing the org files as html. It will be slower to index because we will filter each org file through a command that exports it to html, but hopefully it will be worth it for the enhanced search capability.

We will need a filter shell command that takes an org-file and spits out html. This command is shown as an emacs-lisp script here. This is a pretty bare-bones export, and it lacks the export of all my custom links from org-ref. I tried loading org-ref, but it outputs a lot of stuff to stdout when it loads, and unless I can figure out how to suppress that I don't want it here for now.

:;exec emacs -batch -l $0 -f main "$@"
(require 'org)
;(add-to-list 'load-path "/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa")
;(add-to-list 'load-path "/Users/jkitchin/Dropbox/kitchingroup/jmax/org-ref")
;(setq package-user-dir "/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa")
;(package-initialize)
;(require 'org-ref)
(defun main ()
  (find-file (car command-line-args-left))
  (org-html-export-as-html nil nil nil t)
  (switch-to-buffer "*Org HTML Export*")
  (print (buffer-string)))

;; Local Variables:
;; mode: emacs-lisp
;; End:

We try it out here:

./org2html.el index-org-as-html.org
"<div id=\"table-of-contents\">
<h2>Table of Contents</h2>
<div id=\"text-table-of-contents\">
<ul>
<li><a href=\"#sec-1\">1. Using swish-e to index org files as html</a></li>
</ul>
</div>
</div>
<div id=\"outline-container-sec-1\" class=\"outline-2\">
<h2 id=\"sec-1\"><span class=\"section-number-2\">1</span> Using swish-e to index org files as html</h2>
<div class=\"outline-text-2\" id=\"text-1\">
<p>
When we wrote about using swish-e <a href=\"http://kitchingroup.cheme.cmu.edu/blog/2015/06/25/Integrating-swish-e-and-Emacs/\">before</a>, we just indexed the org files as text. This worked pretty well, but we lost some resolution, e.g. being able to search for text in a headline. that is more possible if we index html or xml. So, here we try indexing the org files as html. It will be slower to index because we will filter each org file through a command that exports it to html, but hopefully it will be worth it for the enhanced search capability.
</p>

<p>
We will need a filter shell command that takes an org-file and spits out html. This command is shown as an emacs-lisp script here. This is a pretty bare bones export, and would lack the export of all my custom links
</p>

<p>
cite:dauenhauer-2006-renew
</p>

<div class=\"org-src-container\">

<pre class=\"src src-emacs-lisp\">:;exec emacs -batch -l $0 -f main \"$@\"
(require 'org)
;(add-to-list 'load-path \"/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa\")
;(add-to-list 'load-path \"/Users/jkitchin/Dropbox/kitchingroup/jmax/org-ref\")
;(setq package-user-dir \"/Users/jkitchin/Dropbox/kitchingroup/jmax/elpa\")
;(package-initialize)
;(require 'org-ref)
(defun main ()
  (find-file (car command-line-args-left))
  (org-html-export-as-html nil nil nil t)
  (switch-to-buffer \"*Org HTML Export*\")
  (print (buffer-string)))

;; Local Variables:
;; mode: emacs-lisp
;; End:
</pre>
</div>


<div class=\"org-src-container\">

<pre class=\"src src-sh\">./org2html.el index-org-as-html.org
</pre>
</div>

<div class=\"org-src-container\">

<pre class=\"src src-text\"># Example configuration file

# Tell Swish-e what to directories to index
IndexDir /Users/jkitchin/blogofile-jkitchin.github.com

# where to save the index
IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/index.swish-e

# What to index
IndexOnly .org

# Tell Swish-e that .txt files are to use the text parser.
IndexContents TXT* .org

FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/org2html.el

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9
</pre>
</div>
</div>
</div>
"

I think that looks good. Now, let's configure a swish indexer.

# Example configuration file

# Tell Swish-e which directories to index
IndexDir /Users/jkitchin/blogofile-jkitchin.github.com

# where to save the index
IndexFile /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/index.swish-e

# What to index
IndexOnly .org

# Tell Swish-e that .org files are to use the HTML parser.
IndexContents HTML* .org

FileFilter .org /Users/jkitchin/blogofile-jkitchin.github.com/_blog/swish-org/org2html.el

# Ask libxml2 to report any parsing errors and warnings or
# any UTF-8 to 8859-1 conversion errors
ParserWarnLevel 9

MetaNames class swishtitle
HTMLLinksMetaName links

PropertyNames author subjects

StoreDescription HTML <body>

And now, run the index command. I did this at the command line. There was a lot of output, mostly complaints about not being able to fontify source blocks because htmlize was not on the path, plus a bunch of attribute parsing errors and a few utf-8 errors.

swish-e -c swish-org-html.conf

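And a test search for files with "selector" in a headline. The -t h option limits the search to HTML heading tags, and the -x format prints the rank and path of each hit.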

swish-e -f index.swish-e -x '%r\t%p\n' -w selector -t h
# SWISH format: 2.4.7
# Search words: selector
# Removed stopwords:
# Number of hits: 4
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000	/Users/jkitchin/blogofile-jkitchin.github.com/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_deploy/org/2015/03/14/A-helm-mu4e-contact-selector.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog-2014.org
.
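
The same post shows up three times in these results because the generated copies under _site and _deploy were indexed too. Excluding those directories at index time would be cleaner; as a stopgap, here is a rough sketch of a post-filter. The function name filter_hits is made up for this example, and the sample input is rebuilt from the output above.

```shell
# Keep only result lines from swish-e output read on stdin.
# Drops "#" header lines, the lone "." end marker, and any path under
# /_site/ or /_deploy/ (the generated copies seen in the results above).
filter_hits() {
  grep -v -e '^#' -e '^\.$' -e '/_site/' -e '/_deploy/'
}

# Rebuild a sample of the search output shown above (rank, tab, path).
root=/Users/jkitchin/blogofile-jkitchin.github.com
post=org/2015/03/14/A-helm-mu4e-contact-selector.org
sample=$(
  printf '# Number of hits: 4\n'
  printf '1000\t%s\n' "$root/$post" "$root/_site/$post" \
    "$root/_deploy/$post" "$root/_blog/blog-2014.org"
  printf '.\n'
)

kept=$(printf '%s\n' "$sample" | filter_hits)
printf '%s\n' "$kept"
```

In practice the real search would be piped straight into filter_hits. The hit in _blog/blog-2014.org is kept, since only the two generated directories are dropped.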

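Several words in a headline. The shell strips the double quotes here, so swish-e receives four separate words rather than a quoted phrase.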

swish-e -f index.swish-e -x '%r\t%p\n' -w "information for all documents" -t h
# SWISH format: 2.4.7
# Search words: information for all documents
# Removed stopwords:
# Number of hits: 5
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog.org
921	/Users/jkitchin/blogofile-jkitchin.github.com/_blog/blog-2014.org
794	/Users/jkitchin/blogofile-jkitchin.github.com/org/2015/04/03/Getting-data-from-the-Scopus-API.org
794	/Users/jkitchin/blogofile-jkitchin.github.com/_site/org/2015/04/03/Getting-data-from-the-Scopus-API.org
794	/Users/jkitchin/blogofile-jkitchin.github.com/_deploy/org/2015/04/03/Getting-data-from-the-Scopus-API.org
.
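
One thing to watch with a multi-word query like this: swish-e marks a phrase with double quotes inside the query itself, and the shell removes unprotected double quotes before swish-e ever sees them. The sketch below does not call swish-e at all. It uses a made-up helper, show_args, to print the arguments exactly as a program would receive them under the two quoting styles.

```shell
# Print each argument on its own line, in brackets, as the program receives it.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Double quotes alone: the shell removes them, so no phrase delimiters remain.
plain=$(show_args -w "information for all documents")

# Single quotes around double quotes: the inner quotes survive the shell.
phrase=$(show_args -w '"information for all documents"')

printf '%s\n' "$plain" "$phrase"
```

So a true phrase search would most likely need the second form, -w '"information for all documents"', and it may return fewer hits than the search shown above.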

Sweet. How about all documents containing this citation:

swish-e -f index.swish-e -x '%r\t%p\n' -w cite:kitchin-2004-modif-pt
# SWISH format: 2.4.7
# Search words: cite:kitchin-2004-modif-pt
# Removed stopwords:
# Number of hits: 3
# Search time: 0.000 seconds
# Run time: 0.008 seconds
1000	/Users/jkitchin/blogofile-jkitchin.github.com/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_site/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
1000	/Users/jkitchin/blogofile-jkitchin.github.com/_deploy/media/2014-02-19-Extracting-bibtex-file-from-an-org-buffer/notes.org
.

Super nice.
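
The headline searches above repeat the same flags each time, so a small wrapper might save some typing. This is only a sketch: the function name search_headlines and the SWISH_E and SWISH_INDEX variables are made up here, and they default to the command and index file used above.

```shell
# Hypothetical helper: search the index for words in HTML headings.
# SWISH_E and SWISH_INDEX are invented for this sketch; they default to the
# swish-e command and the index.swish-e file used in the searches above.
search_headlines() {
  "${SWISH_E:-swish-e}" -f "${SWISH_INDEX:-index.swish-e}" \
    -x '%r\t%p\n' -w "$@" -t h
}

# Example (needs swish-e and an index in the current directory):
#   search_headlines selector
```

Overriding SWISH_E also makes it easy to point the wrapper at a different swish-e binary, or at a stub when checking what arguments it passes.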

Copyright (C) 2015 by John Kitchin. See the License for information about copying.

Org-mode version = 8.2.10
