Extract text from MS Office format .doc and .ppt files using the PHP Java Bridge and Apache POI.
Posted by nicholsr on January 27th, 2009.
Recently I was writing a sMash application where I needed to extract the text from Microsoft Word .doc and PowerPoint .ppt files. The text was then going to be indexed using Apache Lucene, but that is another story. I am not aware of a way to do this using standard PHP, so I looked to the Java world. I found that by using the sebring Java Bridge with Apache POI I could do this in just a few lines of PHP.
I thought I’d share the code. Here is what I did:
- Download Apache POI. I used the latest development stream code from here: http://www.apache.org/dyn/closer.cgi/poi/
  I used the code from the dev/bin branch and downloaded poi-bin-3.5-beta4-20081128.zip.
- From this zip I extracted the three jars I needed, which were:
    poi-3.5-beta4-20081128.jar
    poi-contrib-3.5-beta4-20081128.jar
    poi-scratchpad-3.5-beta4-20081128.jar
  I placed these three jars into the /lib directory of my sMash application.
  EDIT: Michael pointed out that these JARs are actually available in Maven, so I could have pulled them in from there.
- I added a dependency for zero.php to config/ivy.xml.
- I created public/wordExtract.php with the following contents:
    <?php
    // Demonstrate the use of POI HWPF to extract text from a ms-word format file.
    // Note that there are other methods of WordExtractor that can be used to iterate through paragraphs
    java_import("org.apache.poi.hwpf.extractor.WordExtractor");
    $fs = new Java("java.io.FileInputStream", "c:/temp/poitest.doc");
    $we = new WordExtractor($fs);
    $text = $we->getText();
    echo $text;
- I created public/powerpointExtract.php with the following contents:
    <?php
    // Demonstrate the use of POI HSLF to extract text from a powerpoint format file.
    java_import("org.apache.poi.hslf.HSLFSlideShow");
    java_import("org.apache.poi.hslf.usermodel.SlideShow");
    $fs = new Java("java.io.FileInputStream", "c:/temp/poitest.ppt");
    $hsss = new HSLFSlideShow($fs);
    $ss = new SlideShow($hsss);
    foreach ($ss->getSlides() as $slide) {
        echo $slide->getTitle();
        echo "<br><br>";
        foreach ($slide->getTextRuns() as $textRun) {
            echo $textRun->getRawText();
            echo "<br>";
        }
    }
- I tested this code with a number of real-world PowerPoint and Word documents that I had.
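The comment in wordExtract.php mentions that WordExtractor has other methods for iterating through paragraphs. A minimal sketch of that variant follows, using POI's getParagraphText(), which returns an array of paragraph strings. It assumes the same jars and bridge setup as the steps above, and it assumes the bridge lets foreach walk a Java String[] the same way it walks the array returned by getSlides(). I have not run this against the 3.5-beta4 jars, so treat it as a starting point rather than tested code.

```php
<?php
// Sketch only: print a Word document one paragraph at a time
// instead of as a single block from getText().
// Assumes the same sMash PHP Java bridge setup and POI jars as wordExtract.php.
java_import("org.apache.poi.hwpf.extractor.WordExtractor");

$fs = new Java("java.io.FileInputStream", "c:/temp/poitest.doc");
$we = new WordExtractor($fs);

// getParagraphText() returns a Java String[]; it is assumed here that the
// bridge makes it iterable with foreach, as it does for getSlides().
foreach ($we->getParagraphText() as $paragraph) {
    echo $paragraph;
    echo "<br>";
}

// Release the underlying file handle once extraction is done.
$fs->close();
```

Working paragraph by paragraph may be handy if each paragraph is to be indexed or filtered separately, rather than handing the whole document text over in one string.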



October 1st, 2009 at 1:03 pm
I tried this… I get the following error when trying to run the script…
Fatal error: Call to undefined function: java_import() in /home/thrive/public_html/pptext.php on line 6
How do I make the java_import() function accessible to the script?
October 1st, 2009 at 2:25 pm
Did you add PHP support to your CLI environment? Refer to the instructions in the last section of the Getting Started guide. Also, refer to the PHP to Java bridge section in the documentation for usage.
March 26th, 2010 at 2:17 am
Hi Nichol,
We are looking to implement something like this as a connector module that can accept any document format as a plug-in and extract its text.
If you have already done so, would you be able to share it with me?
March 26th, 2010 at 11:28 am
Hey Bala,
We don’t have a generic connector module as you described. The team would be happy to see what you come up with in your project and will help you along the way. Please start a forum thread to track progress and continue the discussion.
Thanks,
Ryan B.