Deliberate Technical Debt: an example

In a team discussion this morning we decided to incur some technical debt. This particular example is a nice illustration of the concept, so I thought I would share it.

The Context

We’re working on a proof-of-concept project that will extract some data from a file input, transform this dataset by adding data from various other places, and then load the enhanced data into an output file. (The eagle-eyed may have spotted that this is a classic extract–transform–load (ETL) pattern.)

The input data will be coming to us in a single, large file of comma-separated values (CSV). To deal with the volumes of data, we will be reading this file into a database for processing.

The additional data are extremely heterogeneous: some sources give us a single numeric value per record, while one gives us more than a thousand data points. We expect to preserve the rough shape of the data throughout the process, but there may be some translation between the domain languages of the systems we interact with. These datasets are delivered to us in JavaScript Object Notation (JSON) format.

The output data format will also be CSV, but as the data will be significantly larger and more complex, we haven’t yet agreed how many files will be produced, or exactly how they will be structured.

There are currently plans for just one successful run of this process. As soon as we have an output file we are happy with, the project will be complete, and no further use of this code is planned. For this reason, we are not building this as a production system, but rather as a tool that can be run on a development machine. However, there’s a strong possibility that, if this project is successful, we’ll be asked to carry out similar work in future, so we want to create decently designed, adequately tested code so we can pick it up again. We also hope that this proof-of-concept will create demand for a production system with these responsibilities, so we would like to use this project as a learning exercise.

The Question

As I mentioned, one of the external datasets contains more than a thousand data points. We receive the data as a set of code-value pairs, with the same set returned for every record. We needed to decide how to store this dataset in our database.

If this dataset had had five data points, then this wouldn’t have needed discussion: we would have decided to create five new fields in the appropriate table. However, maintaining even a dozen additional fields increases the risk of errors, as you need to guarantee consistency across data models, SQL creation scripts, object-to-SQL mappings, test data, and so on. Adding another data point means further work, as you may need to make updates in all of these locations. (If our axis of change is the set of supported data points, then this design breaks the open-closed principle, as extending the system’s capabilities entails modifying existing implementation.)

If these are problems for a dataset with a dozen data points, then a dataset with over a thousand data points doesn’t bear thinking about. We needed to weigh up our options.

The Options

We identified three options, numbering them from zero, with our default choice as Option 0 for comparison:

  0. Create a field in the appropriate table for each code.
  1. Create a linked table with foreign key, code and value fields.
  2. Store the data in a blob.
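
As a concrete sketch of the last two options, the linked-table design and the blob design might look like this in SQLite. The post doesn’t give a schema, so all the table and column names, codes and values below are invented for illustration:

```python
import sqlite3

# Illustrative sketch only: the real schema isn't given in the post, so the
# table and column names here are invented.
conn = sqlite3.connect(":memory:")

# Linked-table design: one row per (record, code) pair. A new code arriving
# in the feed needs no schema change, just more rows.
conn.execute("""
    CREATE TABLE record_datapoint (
        record_id INTEGER NOT NULL,
        code      TEXT    NOT NULL,
        value     TEXT,
        PRIMARY KEY (record_id, code)
    )
""")

# Blob design: a single text field holding the raw JSON payload verbatim.
conn.execute("""
    CREATE TABLE record_blob (
        record_id INTEGER PRIMARY KEY,
        payload   TEXT
    )
""")

conn.execute("INSERT INTO record_datapoint VALUES (?, ?, ?)",
             (1, "ABC123", "42"))
conn.execute("INSERT INTO record_blob VALUES (?, ?)",
             (1, '{"ABC123": "42"}'))
```

Either way, extending the dataset with another code touches no DDL; the difference is in how much the rest of the pipeline has to know about the payload’s structure.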

We weighed these options by both their Cost to Implement and their potential Cost of Change. We know we will incur the cost to implement, whereas the cost of change is more speculative. There is a chance that we will revisit this code in future, and that new data points may be added to the dataset, but we don’t know when or whether this will happen.

As I’ve already explained, Option 0 has a high cost to implement and a high cost of change.

Option 1 has a moderate cost to implement, as it means creating an additional table, plus code for mapping code-value pairs from the application model into this database schema. On the other hand, the cost of change is fairly small: we might want to add a value or two to our test data sets. (This design observes the open-closed principle: the system can be extended to support another code without modifying any existing code.)

Option 2 has low cost to implement, as we can create a single field and simply dump the data in there. Its cost of change seems to be low as well, as any additional data points would automatically get passed through. (It’s worth noting that this might be a terrible design for a long-running production system, as any previously stored data would be missing the new data points, and it would be pretty messy to update them; for a one-off run this isn’t a problem, and the design is perfectly acceptable.)

Based on these criteria, Option 2 seems to be a winner: it has low cost to implement, and low cost of change. What’s not to like?

Mixing Concerns

Option 2 manifests a significant design smell: it mixes the concerns of data retrieval, data storage and data presentation — the Transform and Load phases of the ETL pipeline. This smell is not shared by options 0 and 1, as they don’t make any reference to the output structure.

We receive the raw dataset as a JSON file, and will be outputting it as CSV. However, we haven’t yet agreed the exact structure of this CSV, and a huge dataset like this may raise another set of questions about file structure and normalisation. We don’t want to delay this work while these conversations continue.

Furthermore, the raw data we receive use a set of short codes that only make sense if you refer to the documentation. The recipients of our output data may be happy to receive the data in this format, but they may want us to use more explicit codes.

This has various implications:

We know for certain that our output format is different from the raw format of the dataset, so some mapping will have to take place.

We have three choices for mapping the data:

  • We could map the raw dataset to the output format before we store it, and then add it verbatim to the output. This keeps the Load phase of the ETL job very lightweight, but introduces output logic to the Transform phase. It also means the Transform phase becomes aware of the output model, which is a mixing of concerns.
  • We could store the raw dataset, and then map it before adding it to the output. This keeps the Transform phase lightweight, and introduces mapping logic into the Load phase. It also means the Load phase becomes aware of the raw format, which is another mixing of concerns.
  • We could map the raw dataset to a third format before storing it, and then map it from that format before outputting it. This avoids the mixing of concerns, but at the expense of creating two sets of mapping.
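
The third choice can be sketched as a pair of small mapping functions, one per phase. The codes and names below are invented for illustration, not taken from the real dataset:

```python
# Illustrative sketch of the third choice: map raw codes to a neutral
# internal format on the way in, and internal names to output columns on
# the way out. All codes and names here are hypothetical.

RAW_TO_INTERNAL = {"XB1": "balance_gbp"}               # Transform-phase mapping
INTERNAL_TO_OUTPUT = {"balance_gbp": "Balance (GBP)"}  # Load-phase mapping

def transform(raw: dict) -> dict:
    """Store under internal names; knows nothing about the output CSV."""
    return {RAW_TO_INTERNAL[code]: value for code, value in raw.items()}

def load(stored: dict) -> dict:
    """Produce output columns; knows nothing about the raw feed."""
    return {INTERNAL_TO_OUTPUT[name]: value for name, value in stored.items()}

row = load(transform({"XB1": "125.00"}))
```

Neither function needs to know the other’s vocabulary, which is exactly the separation of concerns being bought, at the price of maintaining two mapping tables instead of one.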

All of this mapping is an extra layer on top of the object-to-SQL mapping we would already be doing, and so it adds to the cost of change for any alterations to the output structure. As we can only guess at the planned output structure, the likelihood we incur this cost is close to 100%.

Taking out the Loan

We had quite a conversation about these risks. Was this guaranteed cost of change enough to swing us towards another option?

We decided it wasn’t. Even with the additional cost of adapting this implementation to our preferred output structure, the overall cost of option 2 is less than that of the other options.

This doesn’t remove the fact that we’re incurring technical debt: we can guarantee that we will have to change this code, and that there will be rework — waste — when we do so; but we’re confident that what we gain now is worth the small additional cost we’ll pay in a couple of weeks’ time.

i10

The other week, during some time we had off work, Peer and I went on a cultural expedition to the two Tates to see the Henry Moore, Chris Ofili and van Doesburg exhibitions. The last of these gave us the most inspiration: Peer was transfixed by the animated films, and I found plenty of typography and print design to keep me excited.

So here’s a quick attempt to reproduce one of the designs: César Domela’s cover for the 4th edition of i10:

And here’s my attempt:

i10 as a web page

http://playground.matthewbutt.com/i10.htm

Nothing terribly exciting in this one: just web fonts and some use of :nth-child.

Constructivism Part III

Today’s study is the cover of Generation X’s 1977 single, Your Generation:

Black, red and grey construction with 'Generation X' in bold white text

Here is my version:

http://playground.matthewbutt.com/generation.htm

Please forgive any inconsistencies in the font rendering: I couldn’t find an open-licensed font with quite the right geometry, so I fell back on Helvetica, which may render unpredictably on different platforms.

This composition uses a handful of blocks with multiple backgrounds:

  • The upper part of the image, which looks a little like the Polydor logo, is a pseudo-element with four backgrounds, from top to bottom:
    • The small red circle
    • A white masking area, which covers the lower halves of:
    • The large black circle
    • The red block on the side.
  • The title strip is actually an h1 (‘Generation’) with a span inside (‘X’). The stripes are hand-coded linear gradients, and the cross-hatching in the small square is two repeating gradient stripes laid on top of each other.
  • The red triangle is another pseudo-element, with a single, angled gradient background.
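
In today’s standard syntax (the post predates it), the cross-hatching could be sketched along these lines; the colours and stripe widths are invented:

```css
/* Illustrative sketch only: two repeating stripe gradients, one vertical
   and one horizontal, overlaid to form a cross-hatch. */
.crosshatch {
  width: 40px;
  height: 40px;
  background-image:
    repeating-linear-gradient(0deg,  #000 0, #000 2px, transparent 2px, transparent 6px),
    repeating-linear-gradient(90deg, #000 0, #000 2px, transparent 2px, transparent 6px);
}
```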

The key lesson from this exercise was how tricky it can be to get webkit gradients right. The -webkit-gradient syntax is much less intuitive than the -moz-xxx-gradient syntaxes, and the repeated gradient declaration is also something of a fiddle. As to angling the red triangle, I couldn’t be bothered with more trig, so I just used trial and error.
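
For comparison, here is one and the same simple fade in the two vendor syntaxes of the era; the colours are invented:

```css
/* The same vertical red-to-black fade, once per vendor syntax.
   -webkit-gradient enumerates explicit start/end points and colour stops;
   -moz-linear-gradient just takes a starting edge and a colour list. */
.fade {
  background: -webkit-gradient(linear, left top, left bottom, from(#c00), to(#000));
  background: -moz-linear-gradient(top, #c00, #000);
}
```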

CSS Seesaw

It’s Friday night, so not a big one tonight.

I thought I would have a quick play with CSS transitions, and nothing seemed a better demonstration than a seesaw.

Here it is (warning: this only works in Webkit browsers at the time of writing):

A red seesaw with a black ball on it, all balanced on a black pivot

http://playground.matthewbutt.com/seesaw.htm

This little animation uses two transitions: one to tip the seesaw back and forth, and the other to roll the ball from one end to the other. There were only two slightly complex matters here: I needed to brush up on my A-level trig to get all the distances right, and a little sleight of hand was in order to create the triangular pivot (fire up Firebug if you want to see how I did it).
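
The gist of the two transitions might look something like this; the selectors, angles and distances are invented rather than taken from the page:

```css
/* Sketch: toggling a class tips the plank and rolls the ball. Both
   properties animate over the same duration so they stay in sync.
   WebKit-only prefixes, matching the warning above. */
#plank {
  -webkit-transition: -webkit-transform 1.5s ease-in-out;
  -webkit-transform: rotate(5deg);
}
#plank.tipped {
  -webkit-transform: rotate(-5deg);
}
#ball {
  position: absolute;
  -webkit-transition: left 1.5s ease-in-out;
  left: 10px;
}
#plank.tipped #ball {
  left: 250px;
}
```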

There are a few small bugs: poor aliasing, and a ghost top border on the seesaw element, but they can wait for another day.

And now, goodnight!

Look mum: no images!

One of the downsides of specialising is leaving behind areas of knowledge you used to love. As I have specialised in the programming side of web development, I have found myself getting further and further away from working with HTML and CSS, and letting my markup skills get rather rusty.

Of course, over the last couple of years, all sorts of exciting things have been happening with HTML and CSS, as browsers implement more features of HTML 5 and CSS 3, and the skills and techniques I used to be able to boast of are now anything but cutting edge.

So I thought I should do something about it, and there’s no better way to deal with something like this than to have a good play around. The following pages are rendered entirely using HTML and CSS; no images were harmed in the preparation of these pages.

Note: these pages render correctly in Firefox 3.6 and Safari 4.0.4. They probably don’t work in older versions and are most definitely NSFIE.

Landscape

First, then, I thought I would have a bit of a play with web fonts, gradients and transformations. Here’s the result:

Pale blue sky with yellow sun made from the word 'sunshine' repeated 13 times; green earth fades to brown with 'earth' in large brown letters

http://playground.matthewbutt.com/landscape.htm

It took a little adjustment to get the rays of sunshine correctly positioned: the key was to position the transform origin at 50% of the height of the text.
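
A sketch of the idea, with invented class names and angles (13 repetitions of the word works out at roughly 27.7° per ray):

```css
/* Sketch: every ray shares a transform origin at the left edge, halfway
   up the text, so rotating each one fans them around a common centre. */
.ray {
  position: absolute;
  -webkit-transform-origin: 0 50%;
  -moz-transform-origin: 0 50%;
}
.ray:nth-child(2) {
  -webkit-transform: rotate(27.7deg);
  -moz-transform: rotate(27.7deg);
}
/* ...and so on for the remaining rays, each a further 27.7deg round. */
```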

Constructivism

Next I thought I would try my hand at a little faux-constructivist design. The simplicity and clear colours of constructivist design are perfect for online material, but the jaunty angles have always posed a major problem: either render the text as images, or give up. With CSS 3’s transformations, this is no longer a problem, and there’s scope for a Russian revival:

'Construction site' in red and black crosses at an angle with 'a page of experimental stuff'; six yellow-and-black blocks float above

http://playground.matthewbutt.com/construction.htm

Again, getting everything to line up took a little getting used to, and it boiled down to the same issue: getting the transform origin of every element in the same place, and then rotating around that point.

A real-world example: Magazine’s Touch and Go

Ok, my constructivist sketch isn’t exactly high design, so I thought I would find something that had already been done, and have a go at copying it.

Here is the cover of Magazine’s 1977 debut single, Touch and Go:

Five red and black blocks stand side by side but at different vertical positions; 'Magazine' runs through them; underneath 'touch and go' and 'goldfinger' are arranged asymmetrically

And here is a quick HTML version:

http://playground.matthewbutt.com/magazine.htm

The only advanced technique here is the use of web fonts, although I’ve made liberal use of the :nth-child pseudo-class to apply styles to the coloured panels. I have to confess to using a few non-essential spans for this one, but I think the result is pretty pleasing for an image-free page.
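
As a sketch of the :nth-child approach (the selectors and values here are invented, not taken from the page):

```css
/* Sketch: colour and position each panel by its place in the document,
   with no per-panel classes in the markup. */
.panel:nth-child(odd)  { background: #c00; }   /* red panels   */
.panel:nth-child(even) { background: #000; }   /* black panels */
.panel:nth-child(1) { margin-top: 60px; }      /* stagger the  */
.panel:nth-child(3) { margin-top: 20px; }      /* vertical     */
.panel:nth-child(5) { margin-top: 80px; }      /* positions    */
```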

And that’s my lot for today. There’s plenty more excitement in the latest implementations of HTML and CSS, so I’ll post some more experiments when I have a moment.
