Wednesday, January 15, 2014

Free Form Data Ingestion with Pentaho Data Integration

There is a plugin available from the PDI Marketplace that you can use to bring in data from free-form document types such as PDF, Word, and PowerPoint. It uses the open source Apache Tika project. This is a perfect use case for Big Data as well: you can extract a vast amount of data, load it into a Big Data source (Hadoop), and monitor it for certain conditions (terms being used). A financial institution might monitor for inside or confidential information that may be out on storage devices, in emails, Word docs, etc. A government might look for terrorist activity. The sky's the limit here. (Our own Matt Burgess created this plugin for the Marketplace.)

Here I am pointing to a directory and pulling out all the text from every .pptx document in it.

Then I select what I want to capture. I can capture the data within the document or metadata about the document (selecting JSON returns only metadata about the document).

Now we can see that only two fields are coming back: File Content and File Size.
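To make the shape of the result concrete, here is a rough Python sketch of the same idea outside PDI. It does not use the plugin or Tika. It relies on the fact that a .pptx file is a zip archive whose slide text sits in DrawingML `<a:t>` elements, and it only handles .pptx. The function names (`extract_pptx_text`, `ingest_directory`) and the row keys (`file_content`, `file_size`) are my own, chosen to mirror the two fields above.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# DrawingML namespace used for paragraphs and text runs inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def _slide_number(name):
    # "ppt/slides/slide12.xml" -> 12, so slides sort numerically, not as text.
    digits = "".join(ch for ch in Path(name).stem if ch.isdigit())
    return int(digits) if digits else 0


def extract_pptx_text(path):
    """Return the slide text of one .pptx file as a single string."""
    lines = []
    with zipfile.ZipFile(path) as archive:
        slide_names = sorted(
            (n for n in archive.namelist()
             if n.startswith("ppt/slides/slide") and n.endswith(".xml")),
            key=_slide_number,
        )
        for name in slide_names:
            root = ET.fromstring(archive.read(name))
            # Each <a:p> is a paragraph; its <a:t> descendants hold the text.
            for para in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text:
                    lines.append(text)
    return "\n".join(lines)


def ingest_directory(directory):
    """Yield one row per .pptx file with two fields: content and size."""
    for path in sorted(Path(directory).glob("*.pptx")):
        yield {
            "file_content": extract_pptx_text(path),
            "file_size": path.stat().st_size,  # size in bytes
        }
```

Calling `list(ingest_directory("some/folder"))` gives one dictionary per presentation, which is roughly what each row coming out of the step looks like. Tika itself covers many more formats and edge cases than this sketch.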

In my example I am streaming the output to a servlet and running it in a browser so I can demo the results easily.

I leave the field length blank.

Then I hit the transformation in my browser, which is an easy way to demo the output. For this to work you have to copy the plugin to the server as well, because the Marketplace only installs it in Spoon.
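If you would rather script the request than use a browser, something like the following Python sketch could work. It assumes the server is a Carte-style server exposing an `executeTrans` URL under `/kettle/`, protected by basic authentication. The host, port, transformation name, repository name, and credentials shown are placeholders, not values from my setup, so check the URL shape against your own server before relying on it.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_execute_url(host, port, trans, params=None):
    """Build the URL you would otherwise paste into the browser.

    Assumption: the server exposes /kettle/executeTrans/ and takes the
    transformation in a 'trans' query parameter; extra parameters
    (for example a repository name) are passed through unchanged.
    """
    query = {"trans": trans}
    query.update(params or {})
    return "http://{}:{}/kettle/executeTrans/?{}".format(
        host, port, urlencode(query)
    )


def fetch_transformation_output(url, user, password, timeout=60):
    """Request the URL with basic auth and return the response body as text."""
    raw = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = Request(url, headers={"Authorization": "Basic " + token})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    url = build_execute_url("localhost", 8081, "tika_demo", {"rep": "demo_repo"})
    print(url)
    # Uncomment once a server is actually running at that address:
    # print(fetch_transformation_output(url, "my_user", "my_password"))
```

The URL building is kept separate from the network call so you can print the URL and try it in the browser first.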