In a team discussion this morning we decided to incur some technical debt. This particular example is a nice illustration of the concept, so I thought I would share it.
The Context
We’re working on a proof-of-concept project that will extract some data from a file input, transform this dataset by adding data from various other places, and then load the enhanced data into an output file. (The eagle-eyed may have spotted that this is a classic extract–transform–load (ETL) pattern.)
The input data will be coming to us in a single, large file of comma-separated values (CSV). To deal with the volumes of data, we will be reading this file into a database for processing.
The additional data are extremely heterogeneous: some sources give us a single numeric value per record, while one gives us more than a thousand data points. We expect to preserve the rough shape of the data throughout the process, though there may be some translation between the domain languages of the systems we interact with. These datasets are delivered to us in JavaScript Object Notation (JSON) format.
The output data format will also be CSV, but as the data will be significantly larger and more complex, we haven’t yet agreed how many files will be produced, or exactly how they will be structured.
There are currently plans for just one successful run of this process. As soon as we have an output file we are happy with, the project will be complete, and no further use of this code is planned. For this reason, we are not building this as a production system, but rather as a tool that can be run on a development machine. However, there’s a strong possibility that, if this project is successful, we’ll be asked to carry out similar work in future, so we want to create decently designed, adequately tested code so we can pick it up again. We also hope that this proof-of-concept will create demand for a production system with these responsibilities, so we would like to use this project as a learning exercise.
The Question
As I mentioned, one of the external datasets contains more than a thousand data points. We receive the data as a set of code-value pairs, with the same set returned for every record. We needed to decide how to store this dataset in our database.
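To make the discussion concrete, here is a rough sketch of the shape of that dataset as we might receive it. The record identifier, codes and values below are invented for illustration; they are not taken from the real feed.

```python
import json

# A hypothetical illustration of one record's code-value pairs; the real feed
# returns over a thousand codes, the same set for every record.
raw = """
{
  "record_id": "R0001",
  "datapoints": [
    {"code": "ABC", "value": 42.0},
    {"code": "DEF", "value": 7},
    {"code": "GHI", "value": 0.125}
  ]
}
"""

record = json.loads(raw)
print(len(record["datapoints"]), "code-value pairs for record", record["record_id"])
```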
If this dataset had had five data points, then this wouldn’t have needed discussion: we would simply have created five new fields in the appropriate table. However, maintaining even a dozen additional fields increases the risk of errors, as you need to guarantee consistency across data models, SQL creation scripts, object-to-SQL mappings, test data, and so on. Adding another data point creates further work, as you may need to make updates in all of these places. (If our axis of change is the set of supported data points, then this design breaks the open-closed principle: extending the system’s capabilities entails modifying existing implementation.)
If these are problems for a dataset with a dozen data points, then a dataset with over a thousand data points doesn’t bear thinking about. We needed to weigh up our options.
The Options
We identified three options, including our default choice for comparison (each is sketched briefly after the list):
- Option 0: Create a field in the appropriate table for each code.
- Option 1: Create a linked table with foreign key, code and value fields.
- Option 2: Store the data in a blob.
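To give a feel for the three shapes, here is a minimal sketch of how each might look as a table definition, using SQLite and invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 0: one column per code. Every new data point means a new column here,
# plus matching changes to models, mappings, creation scripts and test data.
conn.execute("""
    CREATE TABLE record_wide (
        record_id TEXT PRIMARY KEY,
        abc REAL,
        ghi REAL
        -- ...and a thousand or so more columns, one per code
    )
""")

# Option 1: a linked table of code-value pairs. New codes become new rows,
# so the schema itself never changes.
conn.execute("""
    CREATE TABLE record_datapoint (
        record_id TEXT,   -- foreign key back to the main record table
        code      TEXT,
        value     TEXT
    )
""")

# Option 2: a single blob column holding the dataset exactly as delivered.
conn.execute("""
    CREATE TABLE record_blob (
        record_id TEXT PRIMARY KEY,
        raw_json  TEXT
    )
""")
```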
We weighed these options by both their Cost to Implement and their potential Cost of Change. We know we will incur the cost to implement, whereas the cost of change is more speculative. There is a chance that we will revisit this code in future, and that new data points may be added to the dataset, but we don’t know when or whether this will happen.
As I’ve already explained, Option 0 has a high cost to implement and a high cost of change.
Option 1 has a moderate cost to implement, as it means creating an additional table, plus code for mapping code-value pairs from the application model into this database schema. On the other hand, the cost of change is fairly small: we might want to add a value or two to our test data sets. (This design observes the open-closed principle: extending the behaviour of the system to support another code can be accommodated without modifying any existing code.)
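Continuing the sketch above, that mapping code for the linked table is modest but real work; something along these lines, assuming the invented structure from earlier:

```python
def store_datapoints(conn, record_id, datapoints):
    """Write a record's code-value pairs into the linked table (Option 1).

    Assumes the parsed structure sketched earlier; the real model may differ.
    """
    conn.executemany(
        "INSERT INTO record_datapoint (record_id, code, value) VALUES (?, ?, ?)",
        [(record_id, dp["code"], str(dp["value"])) for dp in datapoints],
    )

store_datapoints(conn, record["record_id"], record["datapoints"])
```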
Option 2 has low cost to implement, as we can create a single field and simply dump the data in there. Its cost of change seems to be low as well, as any additional data points would automatically get passed through. (It’s worth noting that this might be a terrible design for a long-running production system, as any previously stored data would be missing the new data points, and it would be pretty messy to update them; for a one-off run this isn’t a problem, and the design is perfectly acceptable.)
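By contrast, the blob version barely needs any code at all, and any new codes arriving in the raw JSON simply ride along:

```python
def store_raw_dataset(conn, record_id, raw_json):
    """Dump the raw dataset into a single column as received (Option 2).

    We never look inside the payload, so new data points pass straight through.
    """
    conn.execute(
        "INSERT INTO record_blob (record_id, raw_json) VALUES (?, ?)",
        (record_id, raw_json),
    )

store_raw_dataset(conn, record["record_id"], raw)
```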
Based on these criteria, Option 2 seems to be the winner: it has a low cost to implement and a low cost of change. What’s not to like?
Mixing Concerns
Option 2 manifests a significant design smell: it mixes the concerns of data retrieval, data storage and data presentation — the Transform and Load phases of the ETL pipeline. This smell is not shared by options 0 and 1, as they don’t make any reference to the output structure.
We receive the raw dataset as a JSON file, and will be outputting it as CSV. However, we haven’t yet agreed the exact structure of this CSV, and a huge dataset like this may raise another set of questions about file structure and normalisation. We don’t want to delay this work while these conversations continue.
Furthermore, the raw data we receive use a set of short codes that only make sense if you refer to the documentation. The recipients of our output data may be happy to receive the data in this format, but they may want us to use more explicit codes.
This has various implications:
We know for certain that our output format is different from the raw format of the dataset, so some mapping will have to take place.
We have three choices for where this mapping takes place (the mapping itself is sketched after the list):
- We could map the raw dataset to the output format before we store it, and then add it verbatim to the output. This keeps the Load phase of the ETL job very lightweight, but introduces output logic into the Transform phase. It also means the Transform phase becomes aware of the output model, which is a mixing of concerns.
- We could store the raw dataset, and then map it before adding it to the output. This keeps the Transform phase lightweight, and introduces mapping logic into the Load phase. It also means the Load phase becomes aware of the raw format, which is another mixing of concerns.
- We could map the raw dataset to a third format before storing it, and then map it from that format before outputting it. This avoids the mixing of concerns, but at the expense of creating two sets of mapping.
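Whichever of the three we choose, the mapping itself might look something like the sketch below. The short codes and explicit names are invented, since we don’t yet know what the recipients will ask for:

```python
# Hypothetical translation from the feed's short codes to more explicit output
# codes; the real lists would come from the source documentation and from
# whatever the recipients of the output eventually request.
CODE_TRANSLATIONS = {
    "ABC": "a_more_descriptive_name",
    "DEF": "another_descriptive_name",
}

def translate_datapoints(datapoints):
    """Rename each code-value pair; unknown codes pass through unchanged."""
    return [
        {"code": CODE_TRANSLATIONS.get(dp["code"], dp["code"]), "value": dp["value"]}
        for dp in datapoints
    ]
```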
All of this mapping is an extra layer on top of the object-to-SQL mapping we would already be doing, and so it adds to the cost of change for any alteration to the output structure. As we can only guess at the eventual output structure, the likelihood that we incur this cost is close to 100%.
Taking out the Loan
We had quite a conversation about these risks. Was this guaranteed cost of change enough to swing us towards another option?
We decided it wasn’t. Even with the additional cost of adapting this implementation to our preferred output structure, the overall cost of Option 2 is lower than that of the other options.
This doesn’t change the fact that we’re incurring technical debt: we can guarantee that we will have to change this code, and that there will be rework — waste — when we do so; but we’re confident that what we gain now is worth the small additional cost we’ll pay in a couple of weeks’ time.