Anyone who has worked with data has been there. You are trying to bring data into your organization in order to merge it with other data so that you can provide a complete picture of:
- Your Organization
- Your Customers
- Your Industry
- How all of the above relate to one another
Setting the Stage
In this example, I am going to consume information from a flat file (CSV) provided by a third-party vendor that I pay to do sentiment analysis on my products. My fictitious company, Big Wireless, sells wireless products (cell phones, tablets, notebooks, etc.) and services (cell phone, home line, etc.).
The purpose of this exercise is to bring the data that this third party provides (which I receive on a daily basis) into my database. When processing the data, I need to capture any records that have bad or malformed data and report them back to the vendor. In other words, I am paying for a service, and this lets me verify that I am getting what I pay for and that the vendor is living up to its quality of service (QoS) commitments.
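To make the capture-and-report idea concrete, here is a minimal Python sketch of splitting a vendor file into clean and "dirty" rows. It is only an illustration and not the tool used in the demonstration. The column names (`tweet_id`, `created_at`, `product`, `sentiment`), the set of accepted sentiment labels, and the sample rows are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical column names and sentiment labels; the vendor's real layout may differ.
REQUIRED_FIELDS = ("tweet_id", "created_at", "product", "sentiment")
VALID_SENTIMENTS = {"positive", "negative", "neutral"}


def split_clean_and_dirty(csv_file):
    """Return (clean_rows, dirty_rows); each dirty row carries a reason."""
    clean, dirty = [], []
    reader = csv.DictReader(csv_file)
    # Line 1 is the header, so data starts at line 2 (assumes one record per line).
    for line_no, row in enumerate(reader, start=2):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            reason = "missing " + ", ".join(missing)
            dirty.append({"line": line_no, "reason": reason, "row": row})
        elif row["sentiment"].strip().lower() not in VALID_SENTIMENTS:
            dirty.append({"line": line_no, "reason": "unexpected sentiment", "row": row})
        else:
            clean.append(row)
    return clean, dirty


# Made-up sample standing in for the daily vendor file.
sample = io.StringIO(
    "tweet_id,created_at,product,sentiment\n"
    "1,2015-03-01T10:15:00,BW Phone X,positive\n"
    "2,2015-03-01T11:00:00,,negative\n"
    "3,2015-03-01T12:30:00,BW Home Line,meh\n"
)
clean_rows, dirty_rows = split_clean_and_dirty(sample)
print(len(clean_rows), "clean;", len(dirty_rows), "to report back to the vendor")
```

Keeping the line number and a reason with each rejected row means the rejects can be sent back to the vendor as they are, and counting them per day gives a simple measure of the quality of each delivery.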
Below is a recorded demonstration of the following (based on the information above):
- Read in the CSV file from my third-party vendor
- Keep track of any "dirty" data
- Validate the expected sentiment
- Do a fuzzy lookup in order to standardize on my company's product names
- Enrich the data through several lookups:
  - Look up detailed product information
  - Look up the geocode of where the tweet originated
- Create some new time dimensions
- Put it in my database for further analysis
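
Two of the steps above, the fuzzy product lookup and the new time dimensions, can be sketched in a few lines of Python using only the standard library. This is a rough illustration under stated assumptions, not the method used in the recording. The product catalog, the `cutoff` value, the function names, and the ISO-style timestamp format are all hypothetical. The product-detail lookup, geocoding, and database load are left out because they depend on systems not described here.

```python
import difflib
from datetime import datetime

# Hypothetical catalog of standardized Big Wireless product names.
CATALOG = ["BW Phone X", "BW Tablet 10", "BW Notebook Pro", "BW Home Line"]
_BY_LOWER = {name.lower(): name for name in CATALOG}

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def standardize_product(raw_name, cutoff=0.6):
    """Return the closest catalog name, or None if nothing is similar enough."""
    candidates = difflib.get_close_matches(
        raw_name.strip().lower(), list(_BY_LOWER), n=1, cutoff=cutoff
    )
    return _BY_LOWER[candidates[0]] if candidates else None


def time_dimensions(timestamp):
    """Derive simple time dimensions from an ISO 8601 timestamp string."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "date": dt.date().isoformat(),
        "year": dt.year,
        "quarter": (dt.month - 1) // 3 + 1,
        "month": dt.month,
        "day_of_week": DAY_NAMES[dt.weekday()],
        "hour": dt.hour,
    }


print(standardize_product("bw phne x"))
print(time_dimensions("2015-03-01T10:15:00"))
```

`difflib.get_close_matches` scores candidates by a similarity ratio between 0 and 1. A lower `cutoff` standardizes more misspellings but raises the risk of matching the wrong product. Rows that come back as `None` are good candidates for the "dirty" data report.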