I needed a library to extract metadata and plaintext transcript from various file formats, for indexing purposes.
After looking around for a while, I found out that Apache Tika might be the right tool for the job (or, at least, it does quite a good job in extracting information from files).
Sadly though, that thing is written in Java.
At first, I tried it by running the jar via subprocess and then parsing the json output. I quickly discarded that approach, as:
- It required to launch the process twice (once for extracting metadata, and once to extract the plain-text version)
- Continuously re-launching JVMs is not exactly what you call "lightweight"
- The json output is sometimes malformed :( (I'll open an issue for that).
- I tried using the XML format, but then the returned "text format" is HTML, and I want plain text for putting in the indexer.
Then, I tried using it in --server mode, but the problems are pretty much the same: you should run two instances for metadata/text, or use the xml version that contains html tags, ...
So, I decided to try some method of directly calling the Java library from Python.
There are many ways to do this, including:
- Using jython but, even if it's only a worker and not the whole application, you lose the ability to use a bunch of Python modules, ..
- Creating a wrapper via jcc: I found some people that did that with older versions of Tika, but I wasn't able to quickly use that on version 1.3 (I'm pretty sure I did something wrong, but I'm not very expert in Java..)
- Using py4j: I had a look at py4j but.. it looks like sort-of a rpc library, talking a custom protocol over tcp.. wtf?
- Using jnius: finally, I remembered of a library I spotted once, called jnius, that should be made exactly for that purpose: using Java libraries from Python, without the need of wrappers, running the whole thing in a JVM, etc.. at the end, I opted for doing this way.
Setting up pyjnius
Setting things up was pretty straight-forward, as it was just a matter of:
pip install cython pip install git+git://github.com/kivy/pyjnius.git
Then, I downloaded the tika-app jar, and put it somewhere.
From that point, using the library was a breeze:
## If you put the jar in a non-standard location, you need to ## prepare the CLASSPATH **before** importing jnius import os os.environ['CLASSPATH'] = "/path/to/tika-app.jar" from jnius import autoclass ## Import the Java classes we are going to need Tika = autoclass('org.apache.tika.Tika') Metadata = autoclass('org.apache.tika.metadata.Metadata') FileInputStream = autoclass('java.io.FileInputStream') tika = Tika() meta = Metadata() text = tika.parseToString(FileInputStream(filename), meta)
That's it! Now, you can just access the text transcript from text, and the file metadata is stored in meta (have a look at the .names() and .get(name) methods).
Integrating this with django and celery tasks was straightforward.
Of course, have a look at the Tika API Documentation for more information on the available methods, signatures, etc.