Tuesday, July 12, 2016

Data OnBoarding

Data is growing at an exponential rate.  When President Obama was asked about the State Department's ability to keep information classified in light of the Hillary Clinton email scandal, here is what he said:

"That was an interesting question so -- first of all, with respect to the State Department, I am concerned.  And the challenge that we've got is primarily driven by the changing nature of how information flows.  Look, the advent of email and texts and smartphones is just generating enormous amounts of data.  Now, it is hugely convenient.  It means that in real time I'm getting information that some of my predecessors might not have gotten for weeks. But what it also is doing is creating this massive influx of information on a daily basis, putting enormous pressure on the department to sort through it, classify it properly, figure out what are the various points of entry because of the cyber-attack risks that these systems have, knowing that our adversaries are constantly trying to hack into these various systems.  If you overclassify, then all the advantages of this new information suddenly go away because it's taking too long to process.

And so we've been trying to think about this in a smart way.  And I think Secretary Kerry has got a range of initiatives to try to get our arms around this.  It reflects a larger problem in government.  We just recently, for example -- I just recently signed a bill about FOIA requests -- Freedom of Information Act requests that built on a number of reforms that we've put in place.  We're processing more Freedom of Information Act requests and doing so faster than ever before.  The problem is the volume of requests has skyrocketed.  The amount of information that answers the request has multiplied exponentially."

There are many topics to cover when talking about this growth of data, but today I want to focus on the ability to "onboard" the data.  What is data onboarding?  Simply stated, data onboarding is the process by which organizations ingest data sources into their systems in order to turn them into information.  It sounds simple, but as President Obama pointed out, there are many challenges in bringing in that data, such as classifying it properly and then knowing which data needs immediate attention and who needs to pay attention to it.

One big issue is that every time there is a new source of data, a process has to be written to handle that specific source.  There is a direct relationship between the number of data sources and the number of processes that have to be developed and maintained.  This is where I am glad to be working for a company like Pentaho.  Pentaho makes these processes much smarter by allowing them to be partly automated based on the data that is being consumed, regardless of where it comes from.
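
To make the idea concrete, here is a minimal sketch in plain Python (not Pentaho) of a single ingestion routine driven by per-source metadata, so that adding a source means adding a metadata entry rather than writing a new process. The source names, delimiters and field lists are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical per-source metadata. Adding a new source means adding an
# entry here instead of writing a new process.
SOURCES = {
    "vendor_a": {"delimiter": ",", "fields": {"id": int, "product": str, "score": float}},
    "vendor_b": {"delimiter": "|", "fields": {"sku": str, "qty": int}},
}


def ingest(source_name, stream):
    """Parse `stream` using the metadata registered for `source_name`."""
    meta = SOURCES[source_name]
    reader = csv.DictReader(stream, delimiter=meta["delimiter"])
    rows = []
    for raw in reader:
        # Convert each declared field with the type named in the metadata.
        rows.append({name: cast(raw[name]) for name, cast in meta["fields"].items()})
    return rows


if __name__ == "__main__":
    sample = io.StringIO("sku|qty\nA-100|3\nB-200|7\n")
    print(ingest("vendor_b", sample))
```

The same `ingest` function serves both sources; only the metadata differs.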

Below is an introductory demonstration of this capability.

Wednesday, January 15, 2014

Free Form Data Ingestion with Pentaho Data Integration

There is a plugin available from the PDI Marketplace that you can use to bring in data from free-form document types such as PDF, Word, and PowerPoint.  It uses the open source Apache Tika project.  This is a perfect use case for Big Data as well: you can extract a vast amount of data, load it into a Big Data source (Hadoop), and monitor it for certain conditions (terms being used).  A financial institution might monitor for inside or confidential information that may be out on storage devices, emails, Word docs, etc.  A government might look for terrorist activity.  The sky's the limit here (our own Matt Burgess created this plugin for the Marketplace):

Here I am pointing to a directory and pulling out all the text within all pptx document types:

Then I am selecting what I want to capture…I can capture the data within the document or metadata about the document (Selecting JSON will return only metadata about the document):

Now we can see that there are only two fields coming back: File Content and File Size.

In my example I am streaming it to a servlet and running it in a browser so I can demo the results easily:

I leave the Field Length blank.

Then I hit the transformation in my browser, which is an easy way to demo the output.  For this to work you have to copy the plugin to the server as well, because the Marketplace only installs it in Spoon.
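
For readers without PDI at hand, the directory scan above can be roughly approximated in plain Python. This is only a stand-in and does not use Tika or the plugin: it relies on the fact that a .pptx file is a zip archive whose slide text sits in `<a:t>` elements, and it returns the same two fields (content and size). The function and key names are my own.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_text(path):
    """Return all slide text from one .pptx file as a single string."""
    parts = []
    with zipfile.ZipFile(path) as zf:
        # Sort slides numerically so slide10 comes after slide2.
        slides = sorted(
            (n for n in zf.namelist() if SLIDE_RE.match(n)),
            key=lambda n: int(SLIDE_RE.match(n).group(1)),
        )
        for name in slides:
            root = ET.fromstring(zf.read(name))
            parts.extend(t.text for t in root.iter(A_NS + "t") if t.text)
    return "\n".join(parts)


def scan_directory(directory):
    """Yield one record per .pptx file: name, content and size in bytes."""
    for path in sorted(Path(directory).glob("*.pptx")):
        yield {
            "file_name": path.name,
            "file_content": pptx_text(path),
            "file_size": path.stat().st_size,
        }
```

Calling `list(scan_directory("some/folder"))` gives one dictionary per presentation. Tika handles many more formats (PDF, Word and so on), which this sketch does not attempt.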

Monday, April 29, 2013

The Dirty Truth About Data and How To Clean It Using Pentaho!

 Anyone who has worked with data has been there.  You are trying to bring data into your organization in order to merge it with other data so that you can provide a complete picture of:
  • Your Organization
  • Your Customers
  • Your Industry
  • How all of the above relate to one another

To achieve this complete picture, you will need to rely on data that originated and exists outside of your organization.  Some examples are data from Twitter, Facebook, LinkedIn, YouTube, etc.  While we all know that the data within our own organizations is always clean ;) we also know that external data is usually full of "bad" or "dirty" data.  This demonstration shows how you can use the power of Pentaho Data Integration to help clean your data as you merge, enrich and analyze it.

Setting the Stage

In this example, I am going to consume information from a flat file (CSV) that has been provided to me by a third-party vendor that I am paying to do sentiment analysis on my products.  My fictitious company, Big Wireless, sells wireless products (cell phones, tablets, notebooks, etc.) and services (cell phone, home line, etc.).

The purpose of this exercise is to bring in the data provided by this third party, which I receive on a daily basis.  When processing the data, I need to capture any records that have bad or malformed data and report them back to the vendor.  In other words, I am paying them for a service, and this lets me verify that I am getting what I am paying for and that they are living up to their QoS commitments.

Below is a recorded demonstration of the following (based on the information above):

  1. Read in the CSV file from my third-party vendor
  2. Keep track of any "dirty" data
  3. Validate the expected sentiment
  4. Do a fuzzy lookup in order to standardize on my company's product names
  5. Enrich the data through several lookups
    1. Look up detailed product information
    2. Look up the geocode of where the tweet originated
  6. Create some new time dimensions
  7. Put it in my database for further analysis
(Please excuse the tunnel voice effect :)
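
For anyone who cannot watch the recording, here is a rough sketch of steps 1 through 4 and step 6 in plain Python rather than PDI. The column names (`tweet_time`, `product`, `sentiment`), the product list and the 0.6 similarity cutoff are assumptions made for illustration. The fuzzy lookup uses the standard library's `difflib`, which is not necessarily the algorithm the PDI step uses.

```python
import csv
import difflib
import io
from datetime import datetime

# Hypothetical reference data; real names would come from a product table.
PRODUCTS = ["Galaxy Tablet", "Home Line", "Smart Phone"]
SENTIMENTS = {"positive", "neutral", "negative"}


def clean(stream):
    """Split vendor rows into (good, dirty); dirty rows carry a reason."""
    good, dirty = [], []
    for row in csv.DictReader(stream):
        try:
            ts = datetime.strptime(row["tweet_time"], "%Y-%m-%d %H:%M:%S")
        except (KeyError, TypeError, ValueError):
            dirty.append({**row, "reason": "bad timestamp"})
            continue
        sentiment = (row.get("sentiment") or "").strip().lower()
        if sentiment not in SENTIMENTS:
            dirty.append({**row, "reason": "unexpected sentiment"})
            continue
        # Fuzzy lookup: map misspelled names onto the standard product list.
        match = difflib.get_close_matches(
            row.get("product") or "", PRODUCTS, n=1, cutoff=0.6
        )
        if not match:
            dirty.append({**row, "reason": "unknown product"})
            continue
        good.append({
            "product": match[0],
            "sentiment": sentiment,
            # New time dimensions derived from the timestamp.
            "year": ts.year,
            "month": ts.month,
            "weekday": ts.strftime("%A"),
        })
    return good, dirty
```

The `dirty` list is what would be reported back to the vendor, and the `good` list is what would go on to the enrichment lookups and the database.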


Monday, April 22, 2013

The Future of Business Analytics Changes Today!

Pentaho Acquires Dashboard and UI Specialist Partner Webdetails
Portugal-based consultancy provides visual development expertise, consulting services and a new community leader

  • Pentaho is hiring and seeking superstars worldwide. Visit our careers page to learn more.

Orlando, Fla — April 22, 2013 — Delivering the future of analytics, Pentaho announced today that it has completed the acquisition of its Portugal-based consulting partner Webdetails. Pentaho will benefit from Webdetails’ visual interface development expertise and international consulting services provided by its 20-strong team. Webdetails’ founder Pedro Alves is a high-profile member of Pentaho’s open source community and will take on the new role of Senior VP, Community for Pentaho. As both parties are privately-held, the financial terms of the deal are undisclosed.

Raising Pentaho’s “visibility”

The Webdetails acquisition will complement and accelerate Pentaho’s research and development plans to enrich the user experience for both IT and business users of its business analytics platform and big data integration tools. This will include expanding Pentaho’s range of data visualizations available in dashboards, making visual development tools like Instaview even easier to use and delivering new visual interfaces to help new customers get started.

Webdetails has been designing plug-ins for Pentaho for several years, most notably its Community Tools or “CTools” series for creating and managing dashboards and reports. Last November, Webdetails collaborated with Pentaho to launch the Pentaho Marketplace, a destination on GitHub where developers can share, install and load cool plug-ins.

Meeting growing demand for Pentaho’s consulting services

As demand for advanced, big data and embedded analytics services continues to soar, Webdetails provides an experienced, international team to bolster Pentaho’s existing consulting services. Webdetails provides services worldwide, with most of its revenue coming from the US and Europe. Its clients include 4SightBI, St. Antonius Hospital, and Pentaho’s award-winning customer Stonegate Senior Living.

Redoubling community support

In addition to continuing his role as General Manager for Webdetails, founder Pedro Alves will take on the new role of Senior VP, Community. In this latter role, Alves will be the chief advocate and interface to Pentaho’s active open source developer community.

Doug Johnson, EVP and COO, Pentaho commented, “Everything about Webdetails perfectly complements our operations as we continue to scale to meet demand fueled by the big data revolution. Webdetails’ expertise in high-end visualizations brings capabilities to help customers roll out exceptional visualizations with all data sources, particularly in Big Data. With Webdetails joining the Pentaho family we gain visual development talent, international consulting services and a highly respected open source community leader in Pedro.”

Pedro Alves (@pmalves), commented, “After five years as consultants and advocates for Pentaho in the business and open source communities, my team is incredibly proud to be officially joining the company. On a personal note, I am delighted to be taking on the role as community leader and look forward to the opportunity and challenge that this presents.”

Webdetails will continue doing business under its existing brand, but as a Pentaho company.

Sunday, April 14, 2013

Pentaho Big Data Forum - Washington D.C.

Featured Speakers:
  • Michael Lazar, Senior Systems Engineer, Cloudera
  • Will LaForest, Senior Director, 10gen
  • Ruhollah Farchtchi, Director of Federal Systems, Unisys
  • Wayne Johnson, Sales Consultant, Pentaho
  • Will Gorman, VP Chief Architect, Pentaho
  • Matt Casters, PDI Architect, Pentaho

Join Pentaho for a half-day big data forum in Washington D.C. Do not miss out on the opportunity to connect with key Pentaho leaders and hear the latest big data hot topics from our featured partners, Cloudera, 10gen and Unisys.

Tuesday, April 23, 2013
1101 Wilson Blvd.
Arlington, VA 22209

For questions or more information, please contact Laura Tuohy at ltuohy@pentaho.com
Time                       Agenda Item
8:00 a.m. - 8:30 a.m.      Breakfast & Registration
8:30 a.m. - 9:30 a.m.      Pentaho Big Data Update
9:30 a.m. - 10:15 a.m.     Cloudera Big Data Presentation
10:15 a.m. - 10:30 a.m.    Pentaho Business Analytics Update & Demo
10:30 a.m. - 10:45 a.m.    Coffee Break
10:45 a.m. - 11:30 a.m.    10gen Big Data Presentation
11:30 a.m. - 12:00 p.m.    Unisys Big Data Presentation
12:00 p.m. - 1:00 p.m.     Lunch & Kettle Presentation on PDI for Big Data

Friday, July 27, 2012

Olympic Analysis

In honor of the Olympics starting today, here is an analytical dashboard I created to compare and contrast the medaling countries from 1976 to 2008.  Enjoy!