There is a plugin available from the PDI Marketplace that you
can use to bring in data from free-form document types such as PDF, Word, and
PowerPoint. It uses the open source Apache Tika project. This is a
perfect use case for Big Data as well: you can extract a vast amount of data,
load it into a Big Data source (Hadoop), and monitor it for certain conditions
(terms being used). A financial institution might monitor for inside or
confidential information that may be out on storage devices, in emails, Word docs,
etc. A government might look for terrorist activity. The sky's the limit here. (Our
own Matt Burgess created this plugin for the Marketplace.)
Here I am pointing to a directory and pulling out all the
text within every .pptx document in it.
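If you want a feel for what "pull the text out of every pptx in a directory" means outside of PDI, here is a minimal Python sketch. To be clear, this is not what the plugin does internally (Tika handles far more formats and edge cases). It only relies on the fact that a .pptx file is a zip archive of XML, and the function names `pptx_text` and `extract_directory` are my own, purely for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def pptx_text(path):
    """Return the text of every slide in one .pptx file, one line per text run."""
    lines = []
    with zipfile.ZipFile(path) as zf:
        # Slides live at ppt/slides/slideN.xml; plain name order is used here,
        # so slide10 sorts before slide2.
        slide_names = sorted(
            n for n in zf.namelist()
            if n.startswith("ppt/slides/slide") and n.endswith(".xml")
        )
        for name in slide_names:
            root = ET.fromstring(zf.read(name))
            lines.extend(t.text for t in root.iter(A_NS + "t") if t.text)
    return "\n".join(lines)


def extract_directory(directory):
    """Map each .pptx file name in a directory to its extracted text."""
    return {p.name: pptx_text(p) for p in sorted(Path(directory).glob("*.pptx"))}
```

Calling `extract_directory("/some/folder")` (a hypothetical path) gives you one entry per presentation, which is roughly the shape of the "one row per file" stream the step produces.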
Then I select what I want to capture. I can capture the
data within the document or metadata about the document (selecting JSON will
return only metadata about the document).
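To make the content-versus-metadata distinction concrete, here is a small sketch that reads only the metadata of an Office Open XML file and returns it as JSON. Again, this is an illustration under assumptions and not the plugin's output: the keys Tika returns will differ, `pptx_metadata_json` is a made-up name, and the sketch assumes the file carries the usual `docProps/core.xml` part.

```python
import json
import zipfile
import xml.etree.ElementTree as ET


def pptx_metadata_json(path):
    """Return the core document properties of an OOXML file as a JSON string."""
    with zipfile.ZipFile(path) as zf:
        # Raises KeyError if the archive has no docProps/core.xml part.
        root = ET.fromstring(zf.read("docProps/core.xml"))
    # Drop the "{namespace}" prefix from each tag, e.g. "{...}creator" -> "creator".
    props = {child.tag.split("}")[-1]: (child.text or "") for child in root}
    return json.dumps(props, indent=2, sort_keys=True)
```

Notice that no slide text is touched here at all, which is the same trade-off as picking JSON in the step: you get facts about the document, not what it says.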
Now we can see that there are only two fields coming
back: File Content and File Size.
In my example I am streaming the output to a servlet and running it
in a browser so I can demo the results easily.
I leave the field length blank.
Then I hit the transformation in my browser, which is an easy way to demo the
output. (For this to work you have to copy the plugin to the server as
well; the Marketplace only installs it in Spoon.)
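For reference, "hitting the transformation in the browser" just means requesting a URL on the server. The sketch below builds such a URL, assuming the server is Carte and that it exposes the `executeTrans` servlet under `/kettle/executeTrans/` with a `trans` parameter. The post does not name the server or the URL, so treat the endpoint, host, port, and the .ktr path as assumptions to check against your own setup.

```python
from urllib.parse import urlencode


def carte_execute_url(host, port, trans_path):
    """Build a URL for Carte's executeTrans servlet (endpoint is an assumption)."""
    # urlencode percent-encodes the path so it is safe inside a query string.
    query = urlencode({"trans": trans_path})
    return f"http://{host}:{port}/kettle/executeTrans/?{query}"


# Hypothetical host, port, and transformation path, for illustration only.
print(carte_execute_url("localhost", 8080, "/opt/pdi/transforms/tika_demo.ktr"))
```

Paste the printed URL into a browser and, if the plugin is installed on that server, the step output should stream straight back to the page.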