Working in the software industry often requires radical, “out of the box” thinking in order to innovate, solve problems, or stay ahead of the competition. Today we start a series of interviews with some of the thinkers who are reshaping the software industry.
I welcome your thoughts—please leave a comment below, and I’ll respond as soon as I can.
In Dan’s words, a brief overview of his data warehouse/data management experience:
Over 25 years in IT, with more than 20 years in data warehousing and data management. CBIP Master. Worked for and in government agencies around the world, including FAA, FDA, USDA, CMS (Center for Medicare/Medicaid), DoD, and others. Also worked for many commercial organizations, including AAA, Nike, Washington Mutual, Expedia, Pepsi Bottling Company, JP Morgan Chase, and others. Experience in petabyte (PB)-sized systems design and architecture, performance and tuning, and data architecture. Taught at DAMA, TDWI, and Informatica conferences.
Here’s the rest of our interview:
JG. Could you briefly explain the Data Vault Methodology and its main features?
I created and designed the Data Vault Model and Methodology originally to meet the needs of the National Security Agency (NSA) while working for the US Government. The Data Vault Model’s real name is “Common Foundational Warehouse Modeling Architecture.” It is a hybrid design based on third normal form constructs and dimensional constructs, combining the best of breed from both modeling techniques to provide a flexible, scalable, and auditable solution.
The design is tied directly to the notion of horizontal business keys, and therefore is tied to business processes and business performance management (BPM)/business activity monitoring (BAM), and uses key performance areas (KPAs)/key performance indicators (KPIs) to manage, measure, and monitor the results. The resulting data model can scale to the petabyte level, and has been implemented on Oracle, Teradata, SQL Server, MySQL, DB2 UDB, and Netezza.
JG. What was the main issue to be solved with the creation of the Data Vault Methodology?
The main issue to be solved with the creation of the Data Vault Model was to enable flexibility in highly classified systems: to absorb all the commercial world changes (loading code, model, data, etc.) without impacting any of the “classified extensions” made to the model, data, and loading code behind the scenes. A secondary (but just as important) reason was to provide audit trails for all the raw data sets arriving from both internal and external feeds.
The main issue to be solved with the creation of the Data Vault Methodology was to enable the use of Software Engineering Institute (SEI)/Capability Maturity Model Integration (CMMI) principles, Six Sigma, Total Quality Management (TQM), and so on at the project level for data warehousing. The methodology is built to manage the project, and enable the people within the project to be lean, measurable, and repeatable. We had to repeat the successful builds of data warehouses across different organizations, over and over again, and we needed a methodology to make that happen.
JG. How do you think the Data Vault Methodology fits in terms of how it deals with unstructured or semi-structured types of data (especially coming from social media sources) and the semantic Web model?
It’s not the methodology that fits; it’s the Data Vault Modeling techniques. The flexibility of the Data Vault Model allows you to take the results of unstructured data mining operations, and connect them with the structured world, easily and quickly. The flexibility also allows you to organically change the Data Vault Model itself (and how this data is “hooked in” or related to the existing structured data sets) at any time, without losing history. A good Data Vault Model will have been built using ontologies from the lines of business that the business is currently engaged in. Any data coming from social media or the semantic Web must first be mined.
Once the results of the mining operations are available, you can easily hook those results in with the structured world. Raw data from video feeds is not useful unless you have a program that “reads the format,” interprets the patterns, and can then hook those patterns to the structured data set in the existing data vault model.
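The idea of hooking mining results into the structured world can be sketched in a few lines of Python. This is only an illustration, not part of the Data Vault standards: the function name `hook_mined_results`, the `business_key` field, and the sample keys are all hypothetical. It assumes the mining step has already produced records that carry a business key, and it attaches them as new relationships without touching the existing structured rows.

```python
def hook_mined_results(structured, mined_results, key_field="business_key"):
    """Relate mined records to existing structured records by business key.

    `structured` maps business key -> existing structured row (left unchanged).
    Returns (links, unmatched): links are new relationship rows, unmatched are
    mined records whose key is not yet known to the structured side.
    """
    links, unmatched = [], []
    for result in mined_results:
        key = result.get(key_field)
        if key in structured:
            # Add a new relationship; the structured row itself is not modified,
            # so existing history stays intact.
            links.append({"business_key": key, "mined": result})
        else:
            # Keep what cannot be related yet instead of discarding it.
            unmatched.append(result)
    return links, unmatched


# Hypothetical sample data: one structured customer, two mined social posts.
structured = {"CUST-1001": {"name": "Acme Corp"}}
mined = [
    {"business_key": "CUST-1001", "source": "social", "sentiment": "positive"},
    {"business_key": "CUST-9999", "source": "social", "sentiment": "negative"},
]
links, unmatched = hook_mined_results(structured, mined)
print(len(links), len(unmatched))
```

Keeping the relationships in their own list, separate from the structured rows, is one way to mirror the point that the model can change without losing history.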
JG. For companies that have traditional data warehouse strategies in place, what would be the first thing to consider in moving to a data vault-based infrastructure?
The first thing to consider is: SCOPE. Downsize the “scope” of the project; never boil the ocean. Find the point with the most pain (either the most pain in absorbing changes, or pain in absorbing real-time, or pain in volume, or pain in maintenance costs)…
Once the pain point is identified, scope the data vault model to be built around solving that one specific pain point. The Data Vault Model is flexible in this nature, and can be organically scaled over time, replacing the other inflexible/brittle, non–growth-oriented, or problematic architecture over the allotted time frame set by the business. The ease of handling changes will come on the heels of the second project (adding to the existing data vault model). The IT team will be able to do this more rapidly than before, prompting the users to ask: “Wow, how did you do that so fast?” This is where the real wins of having a data vault model in place begin to show themselves. Use maintenance dollars: when the business says “fix this, this is where the pain is,” take advantage of that pinpoint to add to the data vault, or to build an initial data vault model.
The point is, at the end of the day, the data vault model is built to be a back-office data warehouse. The users should neither know nor care that the model is built for processing and flexibility advantages; they should, however, feel the new speed that IT can offer in building data marts downstream for business disbursement.
JG. Can the Data Vault Methodology help companies to address specific needs, such as real-time requirements and complex data integration processes?
Yes. The Data Vault Methodology provides rules around how to load, when to load, along with standards—for the model, the project plan, the extract-transform-load (ETL) layers, etc… all the ingredients for the recipe to be successful in implementation. Because the Data Vault Methodology talks about moving the business rules downstream (pulling data from the data vault/enterprise data warehouse (EDW), transforming it, then putting it into data marts), the IT team has the best of all worlds.
Because the data vault standards dictate that “raw but integrated by business key” data be loaded to the data vault model, we are now free to load real-time data as fast as it arrives. Furthermore, the complex data integration process is now broken down into bite-sized, solvable chunks. All the details about arrival timing, cross-system dependencies, and data availability pretty much disappear. Parallel loading techniques are introduced, and this leaves the IT professionals to truly focus on what matters to the business: business rule implementation, pulling data from a single source (the data vault), and building data marts as they see fit. We got to a point within our team where we could turn around a 2-page requirements document and build a new data mart for the business inside of 45 minutes. That was in 1997; just imagine what you can do with the tools today.
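As a rough illustration of “raw but integrated by business key,” the Python sketch below stores each record exactly as it arrives, tagged with its load time and source, and applies no business rules on the way in. Everything here is an assumption made for the example: the `RawVault` class, the `load_raw` and `history_for` functions, and the field names are hypothetical and are not taken from the Data Vault standards.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RawVault:
    # business key -> metadata recorded the first time the key is loaded
    keys: dict = field(default_factory=dict)
    # append-only list of raw records, one entry per arrival
    history: list = field(default_factory=list)


def load_raw(vault, business_key, payload, record_source, load_ts=None):
    """Store a raw record as it arrives, integrated only by business key."""
    load_ts = load_ts or datetime.now(timezone.utc)
    # Register the business key once; later arrivals reuse it.
    vault.keys.setdefault(
        business_key, {"first_loaded": load_ts, "source": record_source}
    )
    # No business rules are applied: the payload is kept as received, with
    # load time and source attached so the record can be audited later.
    vault.history.append(
        {
            "business_key": business_key,
            "load_ts": load_ts,
            "record_source": record_source,
            "payload": dict(payload),
        }
    )


def history_for(vault, business_key):
    """Return every raw record for a business key, ordered by load time."""
    rows = [r for r in vault.history if r["business_key"] == business_key]
    return sorted(rows, key=lambda r: r["load_ts"])
```

Because each call only appends and depends on nothing but the business key, feeds from different systems can be loaded in any order, which is the property that makes arrival timing and cross-system dependencies less of a concern in this sketch. Business rules would then be applied downstream, when data marts are built from the stored history.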
JG. What is your opinion in regard to cloud computing, especially regarding its use for data warehouse functions like data integration (like Informatica Cloud) or data analysis (Vertica, Infobright)? Advantages and disadvantages? Is the cloud inevitable?
Cloud computing needs to overcome the privacy concerns and the hacking/data breaches that plague it today. However, it is making headway in this direction. I see it as the future landscape for businesses to utilize the best-of-breed of new hardware. It’s the laws of supply and demand hard at work: the business wants to pay for computing power only when it uses it. It also wants to lower overhead costs, and outsource much of its enterprise software (including maintenance and upgrades). Who better to maintain the upgrades of the software than the vendor themselves? This is where cloud computing makes a lot of sense.
The advantages are clear: IF the security issues can be fixed, or the data does not need to be “that secure,” then cloud computing makes perfect sense. I believe the cloud is inevitable, especially with the rise of personal computing devices (like mobile platforms). They will continually have “slightly more computing power,” but ever more connectivity. These mobile devices are pushing businesses to move specific functions to the cloud. Today it might be communications; tomorrow it might very well be extract-transform-load/extract-load-transform (ETL/ELT) and database functionality. I know there are vendors already there, and the customers are rolling in that direction.
JG. Large companies are struggling to manage extremely large amounts of information and there is a lot of noise about “big data” problems. How do you think companies should start facing this problem? What elements should be considered in order to put a strategy in place?
I’ve been in systems and architecture design for a long, long time. I’ve dealt with “big data problems” for many years; in fact, in 1997 we thought “big data” was 5 to 50 terabytes. Today, it’s becoming common to see 1 PB, 2 PB, and more being discussed as big data. Now let me back up and define what I mean by big data… I’m not talking about “data that was loaded, and just sits there accumulating.” There is plenty of that, and yes, while managing disks with stagnant data is a problem, I don’t define this as big data.
Big data for me means: it’s active, it’s on the move, it’s being used. In other words, for a data warehouse: it’s being loaded, it’s being queried, it’s being compressed/decompressed, analyzed, delivered to users, rolled up, etc… Companies who are facing this problem should put the pressure on their database vendors (whether it’s traditional DB, NoSQL data store, cloud vendor) or file management systems.
These vendors need to step up and deliver hardware-based compression/decompression (some are already doing this; they just need to get better at it). Unfortunately, regulations require that big data be with us for some time to come.
Another step companies should take is to institute a good data governance and data management program. This is an absolute MUST. If the data is “dead” (not accessed) and there’s no legal regulation to keep it, then DUMP IT: back it up, and get rid of it. It MUST be managed! Elements include: training people on data governance and data management; changing service level agreements (SLAs) with your vendors so that they feel the need to improve compression for you; and seeking out those vendors (even in the cloud) who already “get it” and shifting some critical responsibility to them to assist with handling the big data (off-load/outsource the data management, BUT PICK A GOOD FIRM WITH SOLID KNOWLEDGE for this). There are consulting companies out there who claim to have “good data governance and management,” but when it comes down to brass tacks, it’s just marketing spin (unfortunately).
I would say: go back to basics, examine the use of the data, and try to find out WHY it’s being stored, and IF it’s being accessed/used—and if not, then again: dump it, clean it out… It’s like getting rid of junk that has sat in the attic of your house for years… You have it, you have to manage it, it collects dust, but you never use it… so why is it there? Managing big data is all about due diligence, governance, and asking the question: how much money is it costing me to keep this around, and is that cost justified?
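The “is that cost justified?” question can be turned into a simple review pass over an inventory of data sets. The Python sketch below is purely illustrative: the `DataSet` fields, the `review` function, the per-terabyte cost, and the one-year idle threshold are all made-up assumptions, and a real governance program would set its own rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataSet:
    name: str
    size_tb: float
    last_accessed: date
    legal_hold: bool  # True if a regulation requires keeping the data


def yearly_cost(ds, cost_per_tb_month):
    """Storage cost of keeping the data set for one year."""
    return ds.size_tb * cost_per_tb_month * 12


def review(datasets, today, cost_per_tb_month, idle_days=365):
    """Return (name, yearly cost, suggested action) for each data set."""
    report = []
    for ds in datasets:
        days_idle = (today - ds.last_accessed).days
        if ds.legal_hold:
            action = "keep (regulation)"
        elif days_idle > idle_days:
            # "Dead" data with no legal reason to keep it.
            action = "back up, then remove"
        else:
            action = "keep (in use)"
        report.append((ds.name, yearly_cost(ds, cost_per_tb_month), action))
    return report
```

For example, calling `review` with a 10 TB data set last read three years ago, no legal hold, and an assumed cost of 20 per TB per month would list it at 2400 per year with the action “back up, then remove.”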