Systems Design

I'm a software engineer who is interested in learning from the experience, knowledge, and perspectives of others.
Site Reliability Engineer at Box. Into DevOps strategy, distributed systems, Python, coffee. vim > emacs, zsh > bash.

EJ (she/her)
22:21, 10 Aug 22
Hi hi! This has been a while coming and I'm glad we're getting into it. I wanted to start us off by talking about what we want to accomplish here. Previously we discussed titling this Taaalk as "The Philosophy of Robust Systems Design"; I think that still works as a pretty good subtitle and gives us a good direction for this. I'd like for us to first discuss systems design and site reliability engineering in general (since I think you can't really talk about one without involving the other) and then drill in a bit to the systems design of taaalk.co itself using the lessons & practices from the aforementioned discussion, if that's ok with you!
Joshua Summers
13:42, 12 Aug 22
That is ok with me!
At the moment I don't know much about these topics at all. Starting off with things at their most basic, what is systems design?
EJ (she/her)
23:27, 12 Aug 22 (edit: 23:29, 12 Aug 22)
So at a high level, systems design is the study and practice of building software systems (not just a single application or service in itself, but the whole ecosystem required for it to function) to meet product requirements. I phrase this as “meeting product requirements” because 1. integrating reliability, scalability, etc. into product requirements is a good thing and 2. systems design isn’t necessarily focused solely on reliability or any other -ility.
This is to say: systems design is about trade-offs and the balancing of requirements, and knowing what those trade-offs are and when to make certain ones. A really common one comes in the form of the CAP theorem. The CAP theorem says that a requirements triangle forms between consistency (“every read after a given write will return information from that write”), availability (“every request will be successfully handled”), and partitionability (“the system will still function properly if there is a failure in the network”) and, most importantly, you can only build a system to handle two of these. So you can have a highly consistent and available system, or a highly consistent and partitionable system, or a highly available and partitionable system, but not all three.
There are two important things to take away from the CAP theorem in my opinion: the first is that you will never be able to build a completely reliable network, so your system must be able to handle network failures, and therefore your trade-off is really between consistency and reliability/availability. The second is that these really should be treated as spectrums rather than just discrete categories; so there are trade-offs between the degree of consistency and the degree of availability. Financial systems, for example, will need to be highly consistent so as not to have duplicate writes, but maybe can make sacrifices on availability. Likewise a news website or status page application would need to be highly available, but probably doesn't need to be as consistent as a financial system.
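To make that concrete, here's a toy sketch in Python (the Replica and Cluster classes are made up purely for illustration, not any real datastore) of how a replicated key-value store might behave during a partition, depending on whether it favors consistency or availability:

```python
# Toy model of the consistency/availability trade-off under a partition.
# Replica and Cluster are illustrative stand-ins, not a real datastore.
class PartitionError(Exception):
    """Raised when a consistency-favoring system refuses to serve."""

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True  # flip to False to simulate a partition

class Cluster:
    def __init__(self, replicas, mode):
        self.replicas = replicas
        self.mode = mode  # "CP" favors consistency, "AP" favors availability

    def _quorum(self):
        reachable = [r for r in self.replicas if r.reachable]
        return reachable, len(reachable) > len(self.replicas) // 2

    def write(self, key, value):
        reachable, has_quorum = self._quorum()
        # A CP system rejects writes it can't replicate to a majority;
        # an AP system accepts them and reconciles divergence later.
        if self.mode == "CP" and not has_quorum:
            raise PartitionError("no quorum; rejecting write to stay consistent")
        for replica in reachable:
            replica.data[key] = value

    def read(self, key):
        reachable, has_quorum = self._quorum()
        if self.mode == "CP" and not has_quorum:
            raise PartitionError("no quorum; rejecting read to stay consistent")
        # An AP read stays available but may return stale data.
        return reachable[0].data.get(key)

nodes = [Replica("a"), Replica("b"), Replica("c")]
store = Cluster(nodes, mode="AP")
store.write("balance", 100)
nodes[0].reachable = nodes[1].reachable = False  # partition off two nodes
store.write("balance", 150)   # accepted: only node c ever sees this write
print(store.read("balance"))  # 150 from c, but a and b still believe 100
```

With mode="CP" that same partitioned write would raise an error instead of silently diverging, which is exactly the financial-system behavior I mentioned.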
The CAP theorem is not the only trade-off we encounter; there’s the classic project management maxim of “Between good, cheap, and fast, you can only pick two”, and various relations between security, usability, maintainability, simplicity, cost (again), performance, latency, and a few other things I’m not mentioning. Some of these trade-offs aren’t very linear like the CAP theorem is, some of these are false dichotomies and aren’t trade-offs at all, and all of them are spectrums.
But back to systems design itself real quick. Given all this information, systems design in practice has typically meant learning how to select the appropriate configuration of servers, load balancers, proxies, caches, and databases to meet requirements for a given app. Increasingly this also means selecting CI/CD systems and deployment techniques, and building observability & security concerns into the architecture of your system as well.
Joshua Summers
16:27, 20 Aug 22
Ok, so what I take away from your answer is that a system must be partitionable. And the rest of it (consistency/availability) is a spectrum that comes down to your business requirements.
As partitionability seems to be the bedrock of systems design, it would be great to hear it explained in its most basic form and to hear why it is an essential requirement.
EJ (she/her)
23:51, 07 Sep 22
That's an excellent takeaway, yes!
Partitionability (or partition tolerance) is at its most basic the ability to withstand a partition. In this context we're talking about a network partition, that is, a situation in which a network is divided, or partitioned, into two or more subnets, which are semi-independent networks within a network. This can happen because that's simply how the network is designed or because of a failure in the network. When talking about partitionability in a distributed systems or systems design sense we're referring to the latter type of partition, one caused by failure.
So for example if we send traffic to server A, which then sends traffic to server B through a network switch, a partitionable system is one that still can function if the network switch between A and B fails. Either there's a redundant network switch, or a server C that A can send traffic to instead, or A automatically starts doing the work B was doing, or one of many other failure strategies (or realistically a combination of these).
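As a tiny illustrative sketch (hypothetical hostnames, using the well-known requests library), a failover strategy on server A might look something like:

```python
# Hypothetical failover sketch: try the primary backend first, then a
# redundant peer if the network path to it has failed. Hostnames are made up.
import requests

BACKENDS = ["http://server-b.internal", "http://server-c.internal"]

def forward(payload):
    last_error = None
    for backend in BACKENDS:
        try:
            resp = requests.post(f"{backend}/work", json=payload, timeout=2)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # partition or backend failure; try the next path
    # Every path failed: do B's work locally, queue it, or surface the error.
    raise RuntimeError("all backends unreachable") from last_error
```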
Being able to withstand a network partition is essential because, simply, hardware fails. Cables get cut, switches go down, backbones lose power. The fallacies of distributed computing say it best: a network is never going to be reliable. You're going to have to deal with fault tolerance for any system at scale because outages just happen.
Joshua Summers
08:27, 24 Sep 22
So, at the moment this feels very theoretical. As a software engineer who is used to working on the application itself, rather than the system behind it, I can't imagine how system design happens in practice.
Is it something that you do in a server room somewhere? Or is it something you configure in a YAML file, or perhaps a bit of both?
When you are thinking about the first bit of code you touch as a system designer, what language is it in? And what framework is it part of?
EJ (she/her)
22:09, 11 Oct 22
That's a good question and I hate to give it a lawyer's answer, but truthfully: it depends. Let's use your question about code as a jumping-off point for a theoretical systems design process at an enterprise-ready level.
The first bit of code you'll touch is Markdown, because the first thing you need to do is figure out what you need to do. Requirements gathering is step 0 of systems design. Like I said before, a lot of your decisions are going to depend on context, so it's basically mandatory to have an idea of the context and what's required. You'll start thinking about what the system needs to do at a high level, what some of the functional and non-functional requirements are, and what the acceptance criteria for these are. If this sounds like how you start off with software architecture, it's because it is; there is ideally no difference between the process for architecting the system and architecting the software itself, because they really should happen at the same time. This collaboration between the people who build the software and the people who run the software is what "DevOps" is all about, but I digress.
From your requirements document you might start writing some UML or Mermaid markup, or Python using the Diagrams package, to work out control flow and the high-level architecture of the system. You might bundle this all into a 4+1 architecture doc to capture some specific views of the system, or you might skip that and start implementing your architecture.
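As a small taste, the hello-world of the Diagrams package (this topology is just an example, not a recommendation) looks like:

```python
# Minimal architecture diagram with the Diagrams package
# (https://diagrams.mingrammer.com); rendering requires Graphviz installed.
from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Web Service", show=False):
    # >> draws the arrows: load balancer -> web tier -> database
    ELB("lb") >> EC2("web") >> RDS("userdb")
```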
At the implementation point you'll want to start on capacity planning, and maybe write some more Markdown or create a growth model using Excel or Google Sheets or Python or R. This will inevitably spiral into yet more Markdown and spreadsheets as you translate growth & usage into server cores, GBs of memory, and GBs of disk space.
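A toy version of that growth model (every number here is a made-up planning assumption) might be as simple as:

```python
# Toy capacity model: turn a growth assumption into a server count.
# All of these constants are invented planning assumptions.
import math

MONTHLY_GROWTH = 0.08      # 8% month-over-month growth in peak traffic
CURRENT_PEAK_RPS = 1_200   # measured peak requests per second today
RPS_PER_CORE = 150         # throughput of one core from load testing
CORES_PER_SERVER = 16
HEADROOM = 0.30            # keep 30% spare capacity for spikes

for month in (6, 12, 18):
    peak_rps = CURRENT_PEAK_RPS * (1 + MONTHLY_GROWTH) ** month
    cores_needed = peak_rps * (1 + HEADROOM) / RPS_PER_CORE
    servers = math.ceil(cores_needed / CORES_PER_SERVER)
    print(f"month {month}: ~{peak_rps:,.0f} rps -> {servers} servers")
```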
While this is going on you'll probably start building out some infrastructure. This will undoubtedly happen as Infrastructure as Code. Terraform tends to be the default choice nowadays with HCL, its JSON-meets-Go-templates mashup of a language, and a declarative approach that isn't locked into a specific cloud provider, but Pulumi is very nice and lets you work in the same declarative multi-cloud way but with an actual programming language. You might also need to pull in Packer to create machine images, and will definitely be writing Helm charts or Kubernetes manifests or Docker Compose specs or something more exotic like cdk8s to get your software running on the infrastructure, and will probably also be troubleshooting or guiding Dockerfile specs (or whatever alternate container system you're using).
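For a flavor of the Pulumi approach, a minimal program with its Python SDK (the resource name is arbitrary) is just:

```python
# Minimal Pulumi program using the Python SDK: declares one S3 bucket
# and exports its name. Deployed with `pulumi up`; the name is arbitrary.
import pulumi
import pulumi_aws as aws

bucket = aws.s3.Bucket("app-assets")
pulumi.export("bucket_name", bucket.id)
```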
If you're never, ever going to move off your cloud provider, AWS, GCP, and Azure each offer SDKs to provision and control infrastructure in at least Python, JS/TS, Java, Go, and .NET, each with their own paradigms. DigitalOcean and Heroku, and I'm sure other cloud providers, have SDKs as well.
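For example, with AWS's boto3 you can provision a machine directly (the AMI ID below is a placeholder; you'd look up a real image for your region):

```python
# Direct provisioning with AWS's Python SDK, boto3.
# The AMI ID is a placeholder, not a real image.
import boto3

ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "web"}],
    }],
)
print(instances[0].id)
```

The catch is that imperative SDK scripts don't track state the way Terraform or Pulumi do, which is why they're usually reserved for glue and tooling rather than primary provisioning.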
You'll shove all this into source control, ideally, and probably write some CI/CD config in YAML (or whatever the hell a Jenkinsfile is) depending on your CI/CD choices; after bootstrapping some IAM roles manually, ideally you'll just start creating PRs and merging them and your infrastructure will deploy itself! And if you're cool you'll also start autogenerating docs and architecture diagrams based on what you actually deployed, comparing them with your initial diagrams, and adjusting your implementation as needed before anything bad happens.
Now, if you're unlucky you'll get stuck writing Puppet modules (a weird Ruby-ish DSL, basically) or Ansible playbooks (all YAML) and having to talk with your DC Engineering team about hardware choices. Or be caught in the middle of migrating all the legacy Puppet