Monday, April 29, 2013

The Dirty Truth About Data and How To Clean It Using Pentaho!


Anyone who has worked with data has been there: you are trying to bring data into your organization and merge it with other data so that you can provide a complete picture of:
  • Your Organization
  • Your Customers
  • Your Industry
  • How all of the above relate to one another
Achieving this complete picture requires relying on data that originates, and lives, outside of your organization, for example data brought in from Twitter, Facebook, LinkedIn, or YouTube.  While we all know that the data within our own organizations is always clean ;) the external data is usually full of "bad" or "dirty" records.  This demonstration shows how you can use the power of Pentaho Data Integration to clean your data as you merge, enrich, and analyze it.

Setting the Stage

In this example, I am going to consume a flat file (CSV) provided by a third-party vendor that I am paying to do sentiment analysis on my products.  The fictitious company, Big Wireless, sells wireless products (cell phones, tablets, notebooks, etc.) and services (cell phone plans, home lines, etc.).

The purpose of this exercise is to bring in the data provided by this third party (which I receive on a daily basis).  While processing the data, I need to capture any records that contain bad or malformed data and report them back to the vendor.  In other words, I am paying for a service, and this lets me verify that I am getting what I pay for and that the vendor is living up to their QoS agreement.
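To make the dirty-record capture concrete, here is a minimal sketch of the idea in Python.  The file name, column layout, and sentiment vocabulary are all hypothetical stand-ins for whatever the vendor's spec actually defines; in the demo itself this is handled inside Pentaho Data Integration (step error handling plus a validation step), not hand-written code.

```python
import csv

# Hypothetical vendor file layout and sentiment vocabulary -- stand-ins
# for whatever the vendor's spec actually defines.
EXPECTED_COLUMNS = ["tweet_id", "tweet_date", "location", "product", "sentiment", "text"]
VALID_SENTIMENTS = {"positive", "negative", "neutral"}

def problems_in(row):
    """Return a list of problems found in a row; an empty list means clean."""
    if len(row) != len(EXPECTED_COLUMNS):
        return ["wrong column count: %d" % len(row)]
    record = dict(zip(EXPECTED_COLUMNS, row))
    problems = []
    if not record["tweet_id"].strip():
        problems.append("missing tweet_id")
    if record["sentiment"].strip().lower() not in VALID_SENTIMENTS:
        problems.append("unexpected sentiment: %r" % record["sentiment"])
    return problems

clean_rows, dirty_rows = [], []
with open("vendor_feed.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        problems = problems_in(row)
        if problems:
            dirty_rows.append((row, problems))
        else:
            clean_rows.append(row)

# The dirty rows become the report sent back to the vendor.
with open("vendor_error_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(EXPECTED_COLUMNS + ["problems"])
    for row, problems in dirty_rows:
        writer.writerow(row + ["; ".join(problems)])
```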

Below is a recorded demonstration of the following steps, based on the information above.  (Please excuse the tunnel voice effect in the recording :)


  1. Read in the CSV file from my third-party vendor
  2. Keep track of any "dirty" data (the error-capture sketch above shows the idea)
  3. Validate that each record carries an expected sentiment value
  4. Do a fuzzy lookup in order to standardize on my company's product names (see the fuzzy-match sketch after this list)
  5. Enrich the data through several lookups (see the lookup sketch below):
    1. Look up detailed product information
    2. Look up the geocode of where the tweet originated
  6. Create some new time dimensions (see the time-dimension sketch below)
  7. Load the data into my database for further analysis
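Step 4 is done in the demo with PDI's fuzzy-matching capability; conceptually, it snaps a free-text product mention to the closest canonical name.  Here is a rough Python equivalent using the standard library's difflib, with a hypothetical master product list standing in for Big Wireless's actual product master data:

```python
import difflib

# Hypothetical canonical product names; in the demo these would come
# from Big Wireless's own product master data.
MASTER_PRODUCTS = ["Galaxy Tab 2", "iPhone 5", "ThinkPad X1 Carbon", "Lumia 920"]

def standardize(raw_name, cutoff=0.7):
    """Snap a free-text product mention to the closest canonical name.

    Returns None when nothing is similar enough, so the row can be
    routed to the dirty-data stream instead of being silently mis-mapped.
    """
    by_lower = {p.lower(): p for p in MASTER_PRODUCTS}
    matches = difflib.get_close_matches(raw_name.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None

print(standardize("galaxy tab2"))   # -> "Galaxy Tab 2"
print(standardize("some gadget"))   # -> None (route to the dirty-data stream)
```

The cutoff is the knob that matters: too low and unrelated mentions get silently mapped to a product; too high and minor typos get rejected as dirty data.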
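Step 5's enrichment lookups are, conceptually, just keyed joins against reference data (the territory of PDI's lookup steps).  A sketch with hypothetical in-memory reference tables standing in for the product catalog and geocoding source:

```python
# Hypothetical reference data; in the demo these lookups run against the
# product catalog and a geocoding source, not in-memory dictionaries.
PRODUCT_DETAILS = {
    "Galaxy Tab 2": {"category": "tablet", "launch_year": 2012},
    "iPhone 5": {"category": "cell phone", "launch_year": 2012},
}
LOCATION_GEOCODES = {
    "Orlando, FL": (28.5384, -81.3789),
}

def enrich(record):
    """Attach detailed product info and an origin geocode to a cleaned record."""
    record["product_detail"] = PRODUCT_DETAILS.get(record["product"])
    record["geocode"] = LOCATION_GEOCODES.get(record.get("location", ""))
    return record

row = {"product": "Galaxy Tab 2", "location": "Orlando, FL"}
print(enrich(row)["geocode"])  # -> (28.5384, -81.3789)
```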

 
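Finally, step 6's new time dimensions are simple derivations from the tweet's timestamp.  The same idea in Python, assuming (hypothetically) that the vendor's date field parses as an ISO-8601 timestamp:

```python
from datetime import datetime

def time_dimensions(tweet_date):
    """Derive extra time attributes for slicing the data later.

    Assumes a hypothetical ISO-8601 timestamp such as "2013-04-29T14:05:00".
    """
    dt = datetime.strptime(tweet_date, "%Y-%m-%dT%H:%M:%S")
    return {
        "year": dt.year,
        "quarter": (dt.month - 1) // 3 + 1,
        "month": dt.month,
        "day_of_week": dt.strftime("%A"),
        "hour_of_day": dt.hour,
    }

print(time_dimensions("2013-04-29T14:05:00"))
# {'year': 2013, 'quarter': 2, 'month': 4, 'day_of_week': 'Monday', 'hour_of_day': 14}
```

Attributes like quarter, day of week, and hour of day are what make it easy to ask questions such as "when do negative tweets spike?" once the data lands in the database.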
