Tuesday, July 12, 2016

Data OnBoarding

Data is growing at an exponential rate. In fact, when President Obama was asked about the State Department's ability to keep information classified in light of the Hillary Clinton email scandal, here is what he said:

"That was an interesting question so -- first of all, with respect to the State Department, I am concerned.  And the challenge that we've got is primarily driven by the changing nature of how information flows.  Look, the advent of email and texts and smartphones is just generating enormous amounts of data.  Now, it is hugely convenient.  It means that in real time I'm getting information that some of my predecessors might not have gotten for weeks. But what it also is doing is creating this massive influx of information on a daily basis, putting enormous pressure on the department to sort through it, classify it properly, figure out what are the various points of entry because of the cyber-attack risks that these systems have, knowing that our adversaries are constantly trying to hack into these various systems.  If you overclassify, then all the advantages of this new information suddenly go away because it's taking too long to process.

"And so we've been trying to think about this in a smart way.  And I think Secretary Kerry has got a range of initiatives to try to get our arms around this.  It reflects a larger problem in government.  We just recently, for example -- I just recently signed a bill about FOIA requests -- Freedom of Information Act requests that built on a number of reforms that we've put in place.  We're processing more Freedom of Information Act requests and doing so faster than ever before.  The problem is the volume of requests has skyrocketed.  The amount of information that answers the request has multiplied exponentially."

There are many topics to cover when talking about this growth of data, but today I want to focus on the ability to "onboard" the data. What is data onboarding? Simply stated, data onboarding is the process by which organizations ingest data sources into their systems in order to turn that data into information. It sounds simple, but as President Obama put it, there are many difficulties and challenges in bringing in that data, such as classifying the data properly and then knowing which data needs immediate attention and who needs to pay attention to it.

One big issue is that every time there is a new source of data, a process has to be written to handle that specific source, so the number of processes to develop and maintain grows in step with the number of data sources. This is where I am glad to be working for a company like Pentaho. Pentaho makes these processes much smarter by allowing them to be partly automated based on the data that is being consumed, regardless of where it is coming from.
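To make the idea of a data-driven process concrete, here is a minimal sketch in plain Python. It is not Pentaho code and does not use any Pentaho API; the function names `infer_type` and `onboard` and the sample data are hypothetical, chosen only for illustration. Instead of one hand-written process per source, a single routine reads the header and values of whatever CSV it is given and derives the column types from the data itself.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type (int, then float, then str) that fits every non-empty value."""
    for caster in (int, float):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return caster
        except ValueError:
            continue
    return str


def onboard(csv_text):
    """Ingest any CSV that has a header row; the schema comes from the data, not from code."""
    reader = csv.DictReader(io.StringIO(csv_text), restval="")
    rows = list(reader)
    columns = reader.fieldnames or []
    # One inferred type per column, based on all values seen in that column.
    schema = {col: infer_type([r[col] for r in rows]) for col in columns}
    # Convert each value with its column's type; empty cells become None.
    records = [
        {col: (schema[col](r[col]) if r[col] != "" else None) for col in columns}
        for r in rows
    ]
    return schema, records


# Two unrelated "sources" handled by the same routine (made-up sample data).
orders = "id,amount,customer\n1,9.50,alice\n2,12,bob\n"
sensors = "sensor,reading\nA,\nB,3\n"

order_schema, order_records = onboard(orders)
sensor_schema, sensor_records = onboard(sensors)
```

The same `onboard` call handles both inputs without a source-specific process. A real onboarding flow would also need to deal with dates, encodings, and malformed rows, which this sketch leaves out.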

Below is an introductory demonstration of this capability.


