Every start-up gets to this point. At first the design was perfect. Perfect for the MVP, and you know what? It was even perfect for the full prototype. Then things didn’t go as planned – some adjustments were made for the test-phase product, then again for the first user group of the soft release, and those were (of course) nothing compared to the changes you made after the user results came back. Now you have a deployment to a big (and paying) client and the entire thing is out of whack – it takes eight fully trained engineers and half an EC2 Region just to keep the whole thing running at quarter speed during the on-site integration phase. You became too big too fast, and you NEED a redesign.
on the move
First of all, you’ve done nothing wrong, and the stress (which I’m sure you are feeling) will not help. What will help, oddly enough, is putting pen to paper. Write down (or draw) the journey of a single user input – for every user input you have. What is it doing? Where is it going? What is it waiting on, and where? Follow it all the way through the system until it becomes output again (consider long-term DB storage as output).
You want to see the different data flows through your system. See which tasks are shared (by all flows? by some?), and which are unique. See where different flows unite and where one flow waits for another. All of these are friction points, and we will have to address them sooner or later.
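The flow-mapping exercise above can even be automated. Here is a minimal sketch: each flow is written down as an ordered list of tasks, and a quick pass finds which tasks are shared, which appear in every flow, and which are unique. The flow and task names are made up for illustration; yours will come from your own diagrams.

```python
from collections import Counter

# Hypothetical flows: each user input mapped to the ordered tasks it passes through.
flows = {
    "upload_photo": ["auth", "validate", "resize", "store", "notify"],
    "post_comment": ["auth", "validate", "store", "notify"],
    "search":       ["auth", "query_index", "rank"],
}

# Count how many flows each task appears in.
task_counts = Counter(task for tasks in flows.values() for task in tasks)

shared = {t for t, n in task_counts.items() if n > 1}            # common to some flows
common_to_all = set.intersection(*(set(t) for t in flows.values()))
unique = {t for t, n in task_counts.items() if n == 1}

print("shared (friction candidates):", sorted(shared))
print("in every flow:", sorted(common_to_all))
print("unique:", sorted(unique))
```

Tasks that show up in every flow (here, `auth`) are the first places to look for contention, since every single input waits on them.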
understanding the enemy
In this series of articles we are going to look at a case study – a news agency – the Null Media Enterprise (yes, N.M.E.).
This up-and-coming news agency is having some problems with the time it takes to write articles on the latest subjects, and even greater problems answering queries from reporters trying to get those articles for their publishers.
The flows look something like this:
We can see some friction points, and we can see some loops. There is some room for improvement by rebuilding some databases, or by doubling the resources on the tasks that work hardest (the ones that run twice for every message, for instance) or that seem to wait the longest. But adding resources only produces effective results when it is done precisely at the task that is jamming the system (a.k.a. the bottleneck).
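A toy illustration (with made-up numbers) of why resources only pay off at the bottleneck: a pipeline's throughput is capped by its slowest stage, so doubling any other stage changes nothing.

```python
# Per-stage capacity in messages/sec (illustrative numbers, not from N.M.E.).
capacity = {"ingest": 500, "enrich": 120, "index": 400}

def throughput(caps):
    """End-to-end throughput of a serial pipeline = its slowest stage."""
    return min(caps.values())

print(throughput(capacity))   # 120 — limited by "enrich"

# Doubling a non-bottleneck stage buys nothing:
capacity["index"] = 800
print(throughput(capacity))   # still 120

# Doubling the bottleneck actually helps:
capacity["enrich"] = 240
print(throughput(capacity))   # 240
```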
Breaking it to bits
Most cloud providers today will let you auto-scale the overloaded parts of your system, but for that those parts need to be scalable themselves. This means we must be able to add or remove computing units (which we will call nodes from now on) without harming the other nodes doing the same task. In other words – each task should be a self-contained process. Sometimes that is easier said than done, but the basic idea is to find where different events/messages do interact, and to move that interaction as far from the compute-critical areas as possible.
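To make the "self-contained process" idea concrete, here is a minimal sketch using threads as stand-in nodes: workers share nothing and receive work only through a queue, so adding or removing a worker never disturbs the others. The task itself (uppercasing a string) is a placeholder for your real work.

```python
import queue
import threading

work = queue.Queue()
results = queue.Queue()

def worker():
    """A self-contained node: pulls a message, processes it, touches no shared state."""
    while True:
        msg = work.get()
        if msg is None:              # shutdown signal
            break
        results.put(msg.upper())     # the placeholder task

# "Scaling" is just changing the number of identical workers.
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for msg in ["breaking", "news", "story"]:
    work.put(msg)
for _ in workers:                    # one shutdown signal per worker
    work.put(None)
for w in workers:
    w.join()

out = []
while not results.empty():
    out.append(results.get())
print(sorted(out))                   # arrival order may vary, content won't
```

Because no worker knows (or cares) how many siblings it has, the range(3) could just as well be an auto-scaler's decision.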
In this process, some good leading questions might include:
- how big is the interaction between two random messages? Does this change if they’re closer in time? And if they arrived at the exact same time?
- what parts of the system have no interaction at all?
- what is the compute-critical part of the system? (i.e., which tasks are the most computationally complex parts of the system)
- is any part of my system doing more than one task?
- should my datastores be fast-in-slow-out? or slow-in-fast-out?
- must I have a datastore, or would a cache do just fine? can the entire data needed on a node fit into its memory? if not, what would it take to make it fit?
- what happens if I fail to answer a small part of these questions? What is the most I can leave unanswered?
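For the "can the data fit into a node's memory?" question, a back-of-the-envelope estimate is usually enough to decide between a cache and a datastore. All of the numbers below are assumptions you would replace with your own measurements.

```python
# Assumed workload characteristics (replace with measured values).
record_bytes = 2_000           # average serialized record size
record_count = 50_000_000      # records the node must serve
overhead = 1.5                 # hash-map / fragmentation overhead factor

needed_gib = record_bytes * record_count * overhead / 2**30
print(f"{needed_gib:.1f} GiB needed")   # compare against the node's RAM
```

If the estimate lands within a single node's RAM, a cache may do just fine; if not, the same arithmetic tells you roughly how much you would need to shrink records (or shard the data) to make it fit.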
We will use the answers we receive, and the conclusions we draw from them, to sketch a second system diagram – one you can use to make amendments to your system right away (and probably even see some results). But the really interesting part – the entire change of concept behind the system – is what we will address in the next part of the series
(to be published shortly!)