It occurred to me this morning that there are actually quite a few parallels between functional programming and infrastructure design and management.
It all started with something I realized I had said while talking about environments: Production is meant to go from one stable, working, vetted version of code to another stable, working, vetted version of code. Any state between those two is invalid and should (preferably) never occur.
If you turn that over for a minute, you start to see that most deployment processes you know of violate this One Basic Rule(tm).
I posit that if you are deploying new code to currently running hosts that are handling traffic, you are doing it wrong.
Think about it like this: what is the one core feature of every highly scalable functional programming language? Every one has (or has developed patterns which essentially create) immutable values.
So when we scale this out of software and apply it to infrastructure, your code is the value of your server. If you are changing the value of your server while other processes are trying to access it, you're going to run into concurrency issues. Ask any developer about sharing mutable data between threads and they'll quickly tell you it's difficult. Why, then, are we so comfortable sharing a server's state between releases of our software?
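To make the analogy concrete, here's a tiny Python sketch (purely illustrative, not pulled from any real tooling): mutating a server's "value" in place exposes an invalid in-between state to whoever is reading it, while building the complete new value and swapping a single reference is atomic.

```python
import threading

# In-place "deploy": the server's value changes while requests are reading it.
server_state = {"release": "v1", "assets": "v1-assets"}

def deploy_in_place(release, assets):
    server_state["release"] = release  # a request arriving here sees v2 code...
    server_state["assets"] = assets    # ...paired with v1 assets: an invalid in-between state

# Immutable "deploy": build the complete new value, then swap the reference atomically.
current = {"release": "v1", "assets": "v1-assets"}
_swap_lock = threading.Lock()

def deploy_atomic(release, assets):
    global current
    new_value = {"release": release, "assets": assets}  # fully built and vetted first
    with _swap_lock:
        current = new_value  # readers see the old value or the new one, never a mix of both
```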
The simple answer is that you have two options for atomic deployments that follow the rules of immutability:
- Drop the servers you are deploying to out of the flow of traffic. This is the easiest, but it still fails to honor the spirit of immutability: the value of the server is still changing; it's just changing while nobody is looking.
- Spin up new instances and slowly work them into live traffic, confirming along the way that you are in fact getting the expected behavior out of the code (a rough sketch of this follows the list).
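The second option is essentially what blue-green and canary deployments automate. Here's a rough Python sketch of the shape of it; `provision_instance`, `is_healthy`, and `set_traffic_weight` are placeholders for whatever your provisioning and load-balancing layer actually exposes, not a real API.

```python
import time

def provision_instance(release):
    # Placeholder: in reality, boot a fresh instance baked with `release`.
    return f"instance-{release}"

def is_healthy(instance_id):
    # Placeholder: in reality, hit the instance's health/smoke-test endpoint.
    return True

def set_traffic_weight(instance_id, weight):
    # Placeholder: in reality, update the load balancer's weighting.
    print(f"{instance_id} now receives {weight:.0%} of traffic")

def deploy(release, count=3, steps=(0.10, 0.50, 1.00)):
    # Build complete new "values" alongside the old fleet; never touch running hosts.
    new_instances = [provision_instance(release) for _ in range(count)]

    # Vet them before they see a single live request.
    if not all(is_healthy(i) for i in new_instances):
        raise RuntimeError("new release failed health checks; old fleet untouched")

    # Work them into traffic slowly, confirming behavior at each step.
    for weight in steps:
        for instance in new_instances:
            set_traffic_weight(instance, weight / count)
        time.sleep(60)  # stands in for watching error rates and metrics
        if not all(is_healthy(i) for i in new_instances):
            raise RuntimeError("draining new instances; old fleet still serving")
```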
Now, I know this is all hand-wavy because it glosses over the important problem of data migration: I don't have an answer there yet. I suspect the answer to that part of the puzzle is something to the effect of being able to seamlessly decouple your entire system from write traffic (using a request proxy that can 'pause' calls) for some period of time while the data updates run.
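I don't want to pretend this solves it, but to make the 'pause' idea less abstract, here is one possible shape for it in Python: a proxy that keeps serving reads but holds write requests on a gate while the migration runs. `backend` and `request.is_write` are assumptions made purely for the sake of the sketch.

```python
import threading

class PausableWriteProxy:
    # Sketch of a request proxy that can 'pause' writes during a data migration.

    def __init__(self, backend):
        self.backend = backend             # callable that actually serves the request
        self._writes_open = threading.Event()
        self._writes_open.set()            # writes flow by default

    def handle(self, request):
        if request.is_write:
            self._writes_open.wait()       # hold writes until the migration window closes
        return self.backend(request)       # reads pass straight through

    def pause_writes(self):
        self._writes_open.clear()

    def resume_writes(self):
        self._writes_open.set()
```

Callers issuing writes during the window simply block until `resume_writes()` is called, which is the 'pause' behavior I mean, with the obvious caveat that client timeouts put a hard bound on how long that window can be.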
What if, to create a truly fault-tolerant design, you simply built a nearly 100% asynchronous API? All requests come in, go into a processing queue, and are handled from there. That way you never have to turn off traffic to do an atomic update of your software, because you can simply tell the queue workers to stop processing while the update runs.
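As a sketch of what that could look like (again Python, again just illustrative; `handle` stands in for the real business logic): the endpoint only enqueues and acknowledges, a worker drains the queue, and an update simply closes the processing gate while the queue keeps accepting work.

```python
import queue
import threading

requests = queue.Queue()       # every API call is enqueued and acknowledged immediately
processing = threading.Event()
processing.set()               # workers run by default

def api_endpoint(payload):
    requests.put(payload)      # caller gets an ack; the actual work happens later
    return {"status": "accepted"}

def handle(job):
    # Placeholder for the real business logic.
    print("processed", job)

def worker():
    while True:
        job = requests.get()
        processing.wait()      # if an update is in progress, hold this job until it finishes
        handle(job)
        requests.task_done()

def begin_update():
    processing.clear()         # the queue keeps accepting work; nothing gets processed

def finish_update():
    processing.set()           # the backlog drains through the new code
```

Start `worker` in a background thread (`threading.Thread(target=worker, daemon=True).start()`), call `begin_update()` before swapping releases, and `finish_update()` once the new code is live.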
Thoughts?