
Recently I was writing a sMash application where I needed to extract the text from Microsoft Word (.doc) and PowerPoint (.ppt) files. The text was then going to be indexed using Apache Lucene, but that is another story. I am not aware of a way to do this using standard PHP, so I looked to the Java world. I found that by using the sebring Java Bridge with Apache POI I could do this in just a few lines of PHP.

I thought I’d share the code. Here is what I did:

  1. Download Apache POI. I used the latest development stream code from here: http://www.apache.org/dyn/closer.cgi/poi/
    I used the code from the dev/bin branch and downloaded poi-bin-3.5-beta4-20081128.zip.

  2. From this zip I extracted the three jars I needed, which were:
        poi-3.5-beta4-20081128.jar
        poi-contrib-3.5-beta4-20081128.jar
        poi-scratchpad-3.5-beta4-20081128.jar
    

    I placed these three jars into the /lib directory of my sMash application.
    EDIT: Michael pointed out that these JARs are actually available in Maven, so I could have pulled them in from there.

  3. I added a dependency for zero.php to config/ivy.xml
        
  4. I created public/wordExtract.php with the following contents:
    <?php
     
    // Demonstrate the use of POI HWPF to extract text from an MS Word format file.
    // Note that there are other methods of WordExtractor that can be used to iterate through paragraphs.
     
    java_import("org.apache.poi.hwpf.extractor.WordExtractor");
    $fs = new Java("java.io.FileInputStream", "c:/temp/poitest.doc");
    $we = new WordExtractor($fs);
    $text = $we->getText();
    echo $text;
  5. I created public/powerpointExtract.php with the following contents:
    <?php
     
    // Demonstrate the use of POI HSLF to extract text from a PowerPoint format file.
     
    java_import("org.apache.poi.hslf.HSLFSlideShow");
    java_import("org.apache.poi.hslf.usermodel.SlideShow");
     
    $fs = new Java("java.io.FileInputStream", "c:/temp/poitest.ppt");
    $hsss = new HSLFSlideShow($fs);
    $ss = new SlideShow($hsss);
    foreach ($ss->getSlides() as $slide) {
        // Print the slide title, then each text run on the slide.
        echo $slide->getTitle();
        echo "<br><br>";
        foreach ($slide->getTextRuns() as $textRun) {
            echo $textRun->getRawText();
            echo "<br>";
        }
    }
  6. I tested this code with a number of real-world PowerPoint and Word documents that I had.
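
The comment in wordExtract.php mentions that WordExtractor has other methods for walking through paragraphs. Here is a minimal sketch of that variant, assuming the same jars in /lib, the same zero.php dependency and the same c:/temp/poitest.doc test file as above. It relies on WordExtractor's getParagraphText() method, which returns the document's paragraphs as an array of strings. The file name public/wordParagraphs.php is just a name I picked for illustration, and the sketch should be treated as untested outside that setup.

```php
<?php

// Sketch: public/wordParagraphs.php (hypothetical file name).
// Same setup as wordExtract.php, but prints the document one paragraph at a time
// instead of as a single block of text.

java_import("org.apache.poi.hwpf.extractor.WordExtractor");

$fs = new Java("java.io.FileInputStream", "c:/temp/poitest.doc");
$we = new WordExtractor($fs);

// getParagraphText() returns a Java String[] with one entry per paragraph.
foreach ($we->getParagraphText() as $paragraph) {
    echo $paragraph;
    echo "<br>";
}

// Release the file handle once the text has been read.
$fs->close();
```

Having the text split by paragraph could be handy if you want to index or display each paragraph separately rather than the whole document at once.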

4 Responses to “Extract text from MS Office format .doc and .ppt files using the PHP Java Bridge and Apache POI.”

  1. hhh Says:

    I tried this… I get the following error when trying to run the script…

    Fatal error: Call to undefined function: java_import() in /home/thrive/public_html/pptext.php on line 6

    how do I get the java_import() function accessible to the script?

  2. admin Says:

    Did you add PHP support to your CLI environment? Refer to the instructions in the last section of the Getting Started guide. Also, refer to PHP to Java bridge in the documentation for usage.

  3. Bala Says:

    Hi Nichol,

    We are looking to implement something like this as a connector module which can take any document format as plug-in and get the text extracted.

    If you have already done so, would you be able to share with me?

  4. admin Says:

    Hey Bala,
    We don’t have a generic connector module as you described. The team would be happy to see what you come up with in your project and will help you along the way. Please start a forum thread to track progress and continue the discussion.

    Thanks,
    Ryan B.