I think many data warehousing shops get off on the wrong foot in defining their architecture. When we set out to create the Northwestern Medical Enterprise Data Warehouse, we identified three principles that govern everything we do. These were not developed in a vacuum. They were provided to us by Dale Sanders, who really kick-started the project. He had years of experience he could pass down, and that experience turned into our principles. Only very rarely and deliberately do we violate those principles, and even then, we often live to regret the decision.
The three principles are:
- Always load an exact copy of the data
- Never change any data
- Always show your work
This is the first in a series discussing those principles, including the whys and wherefores.
Always load an exact copy of the data
We duplicate the tables, the columns, and the data from the source system into the EDW. We replicate the entire thing, including weirdo naming conventions, null data, binary data, etc. We put a lot of thought (read: debate) into whether to bring over stored procedures and views, but we typically bring those over as well. Doing so provides a number of advantages: you accelerate your time to go-live with the data (though the data will often be harder to use), you get cleaner and more easily understood data, and you get data structures that folks on the source system teams can relate to and help document.
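To make that concrete, here is a minimal sketch of what a full load of a single table can look like. In practice our SSIS packages do this work; the linked server, database, and table names below are made up purely for illustration.

```sql
-- Hypothetical full load of one table into the ODS: same columns, same rows,
-- no renaming, no filtering. All object names here are invented for the example.
IF OBJECT_ID('ods.PATIENT_VISIT', 'U') IS NOT NULL
    DROP TABLE ods.PATIENT_VISIT;

SELECT *
INTO   ods.PATIENT_VISIT                         -- exact copy lands in the EDW
FROM   SourceSystem.SourceDb.dbo.PATIENT_VISIT;  -- four-part name via a linked server
```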
Shorter time to go-live
What we end up doing is loading a copy of the source system into the EDW, table for table, row for row. It's a tweak of the traditional ETL process: we perform ELT instead. We call the destination an Operational Data Store (ODS), though others might disagree with that terminology. More importantly, we don't call it staged data; every ODS is part of the EDW, and people with the right level of need are granted access to it for everyday, traditional reporting. By including an exact copy of a source system directly in the EDW we facilitate a number of things:
1) We get a jump on making the data available. It may not be pretty, but people can immediately report on what they need by querying the ODS version of the data. This isn’t just PR (though that is an added bonus) – the day you can report on the data is the day you begin adding value. There is definitely a long road ahead in terms of making the data truly analytics-friendly, but having the data available is the first step of the journey.
2) We can very easily spin out transformed, value-added data marts. In many cases, those data marts are populated by a single (albeit very complex) query; a simplified sketch follows this list. It lets the database do what it does best: make the data dance. Additionally, we can more easily explore the data for those data marts by just writing SQL inside the EDW. This facilitates the spaghetti-against-the-wall phenomenon, where a developer can show a proof of concept to his or her peers and get quick feedback. Because this is the real value-add of a data warehouse, the capacity to quickly create data marts is critical to the success of a project.
3) We can track down oddball issues very easily. Suppose a user says that a report is missing a row. In doing the investigation as to why the row is missing, one question that can come up is “was it brought over to the EDW in the first place?” By having an exact copy, we’re well positioned to answer that. Even something as simple as a rowcount can help one understand the scope and nature of potentially missing data. Ownership of these problems becomes clear as well. If we’re missing a row of data that exists in the source system, we know it’s our challenge and responsibility to document and resolve the problem.
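As promised above, here is a deliberately simplified sketch of a data mart being populated by one query over ODS tables. The real queries are far more complex, and every table and column name below is hypothetical.

```sql
-- Hypothetical data mart load: a single query over exact-copy ODS tables.
-- mart.EncounterSummary, ods.ENCOUNTER, and ods.DEPARTMENT are invented names.
INSERT INTO mart.EncounterSummary (PatientId, AdmitDate, Department, LengthOfStayDays)
SELECT  e.PATIENT_ID,
        e.ADMIT_DATE,
        d.DEPT_NAME,
        DATEDIFF(DAY, e.ADMIT_DATE, e.DISCHARGE_DATE)
FROM    ods.ENCOUNTER  e
JOIN    ods.DEPARTMENT d ON d.DEPT_ID = e.DEPT_ID;
```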
One of the biggest challenges fledgling data warehousing projects face is missed deadlines. While data definitions and project scope are stuck in committee, the clock marches on, and projects are placed at risk because of a lack of delivery. By loading an exact copy of the data into the warehouse and manipulating the data there, you can mitigate the largest risks to the success of this kind of project.
Provide data structures your source system teams can immediately relate to
Equally tempting is the thought of renaming tables and columns when bringing them into a data warehouse. One of the systems we use on the campus has a naming convention in which a column is named ID_504. This column has a corresponding dictionary lookup table called DICT_504. There is a rhyme and reason to the naming convention, though it's certainly not something I'd select. More importantly, these numbers actually do mean something to many folks who use the system. This particular system is used for registration, like when you call to set up an appointment with your physician. A few years ago, while on the phone with the registration department, I had to change my home phone number and address on file. When the registrar was finished, she confirmed that she had successfully updated my "ID_35 and ID_60." I was stunned: they actually use those numbers in a meaningful way. It's not difficult to extrapolate how this might play out: a year later, someone from registration will need a report of all the ID_35 changes. If we had renamed ID_35 to something else, we would now have the added challenge of tracking down what it was called in the first place, which rather defeats the purpose of renaming the column. And ultimately, documenting what the column contains is a job for metadata, not the column name itself (though if designing a system de novo, I would absolutely expect it to have a sensible name).
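To show why keeping the original names pays off, here is a hypothetical query that decodes ID_504 through its dictionary table. The layout of the registration table and of DICT_504 is assumed purely for illustration.

```sql
-- Hypothetical: report a coded column under exactly the name the source team knows.
-- ods.REGISTRATION and the DICT_504 columns are assumptions, not the real schema.
SELECT  r.PATIENT_ID,
        r.ID_504,                      -- the code, named as it is in the source system
        d.DESCRIPTION AS id_504_label  -- its human-readable meaning from the dictionary
FROM    ods.REGISTRATION r
LEFT JOIN ods.DICT_504 d
       ON d.DICT_ID = r.ID_504;
```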
There is an added benefit here: the team that supports the source system in question already understands the data model. Because of this, they can naturally use the data in the EDW. There are two value-adds here: 1) taking the reporting load off the production system, and 2) beginning to query integrated data sets. The source system team can quickly transition to providing analytics across the enterprise if necessary.
Cleaner and more easily understood data
A more fundamental concept underlies all three of our principles: what exists in the EDW should also exist in the source system. There's a reciprocal nature to this. If a datum exists in the source system and an end user can point to it on the screen, they should be able to find it in the EDW. They should be empowered to say, "give me this." The reverse is also true: if we report a datum to a user, we have the responsibility to be able to point to it on the screen of the source system. By loading a complete and identical copy of a source system into the EDW, we are naturally positioned to perform both of those actions.
Consider the age-old debate among database professionals about the meaning of null. Off the top of my head, null could mean "no", "unanswered", "unknown", "undefined", or "not asked." Each of these possible definitions means something slightly different. And each of the definitions could mean something different depending on the source application. So if a source system has a column full of nulls, one could argue that there's no value in loading it. We've come to disagree with that notion. Null data means something, though it likely takes a domain expert to identify its meaning. It's wholly appropriate for a domain expert to weigh in and provide metadata on such a column so that all other users of the data can understand what null means in that case. (Sidebar: People defining version 3 of HL7 call this "flavors of null", which I think is a fantastic term.)
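One lightweight way to capture that expertise is a column-level note in a metadata table. The table and columns below are assumptions for illustration, not our actual metadata model.

```sql
-- Hypothetical metadata entry recording what NULL means for one column.
-- meta.ColumnNotes and the example table/column are invented for this sketch.
INSERT INTO meta.ColumnNotes (SchemaName, TableName, ColumnName, Note)
VALUES ('ods', 'PATIENT_SURVEY', 'SMOKING_STATUS',
        'NULL means the question was never asked, not that the answer was "no".');
```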
An application with a column full of nulls often means that a feature is not yet implemented or enabled. When that feature is turned on, it will be turned on for a reason and users will invariably want to start reporting on that new data point. If you’ve specifically excluded it in your ETL, you now need to go back and retrofit your code to accommodate the new column, which will take time and add complexity to the project.
We also explored excluding certain columns that we presumed were not required. At the end of the day, who are we to tell our end users what data are useful to them? Our rule of thumb is that if a user is going to take the time to enter data, we should be prepared to report on those data. That’s not to say that we would have a data mart dedicated to the data to facilitate easy and quick reporting of it, just that it would be available. I’d much rather have to tell an end user, “Sure, we have it, but it’s going to take a week to transform it into a useful format,” than “Unfortunately, we don’t have that in the EDW. I’ll fill out a request form on your behalf with the source system team and get an estimate for when they can amend the process. I’ll get back to you when I hear back from them.”
The painful alternative
A handful of times, we’ve backed down from this principle. This is usually for systems that are a bit more arcane in nature (mainframes), or systems for which there is concern about us having direct access. In those cases, we’ve been provided with flat files representing the changes to the data. And in those cases, we’ve always come to regret it. Every. Single. Time.
Flat files are inherently dangerous and troublesome. You're now relying on someone else to decide what data you need, how to represent it, and to send it to you in a reliable time frame. We've seen files that don't have the corresponding lookup entries, so we end up having patients with a discharge disposition of "19". We've also had files that weren't updated to reflect new data being collected in the source system. This is a particularly thorny issue: six months after new data have been collected, you realize they're missing from the flat file extracts. You now need to go back and ask the source system team to give you the historical data and to amend the extract going forward to include the new data. That's not just a pain; it's also dangerous with regard to accurately representing all the data.
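A simple data-quality check can surface that kind of problem, for example a query that looks for codes with no dictionary entry. The table and column names below are made up for illustration.

```sql
-- Hypothetical check: coded values in the extract with no matching dictionary row
-- (the "discharge disposition of 19" situation). All names invented for this sketch.
SELECT  v.DISCHARGE_DISPOSITION, COUNT(*) AS affected_rows
FROM    ods.HOSPITAL_VISIT v
LEFT JOIN ods.DISCHARGE_DISPOSITION_DICT d
       ON d.CODE = v.DISCHARGE_DISPOSITION
WHERE   v.DISCHARGE_DISPOSITION IS NOT NULL
  AND   d.CODE IS NULL
GROUP BY v.DISCHARGE_DISPOSITION;
```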
Our technical implementation of this principle
First and foremost, we use SSIS to connect directly to the source system and pull data with queries over OLE DB, ODBC, or whatever options we have at our disposal. We've gone through three main iterations of this process.
Iteration 1 was the basic approach: We wrote SSIS packages by hand to pull over all the data. It was boring. It was tedious. It was difficult to maintain. We used this approach from 2006-2009.
Iteration 2 was far more advanced: Eric Just (EDW alum) wrote a tool called PackageWriter that allowed us to point it at the source and programmatically write all the ETL. It generated SSIS packages for all the tables in question. This was a huge step forward for us, but it also had some maintenance issues. More importantly, it didn't handle incremental loads, which we had to write by hand. Officially, we're still using this approach. It will soon be replaced by:
Iteration 3 is ETL Assist, which Eric Whitley just blogged about here. In short, it not only performs full loads of tables but also handles incrementals, all from a single interface with minimal code. It's basically all config-file driven.
Next up is Principle #2: Never change any data.