Systems Design

I'm a software engineer who is interested in learning from the experience, knowledge and perspectives of others.
Site Reliability Engineer at Box. Website here. Into DevOps strategy, distributed systems, Python, coffee. vim > emacs, zsh > bash.
EJ (she/her)
22:21, 10 Aug 22
Hi hi! This has been a while coming and I'm glad we're getting into it. I wanted to start us off by talking about what we want to accomplish here. Previously we discussed titling this Taaalk as "The Philosophy of Robust Systems Design"; I think that still works as a pretty good subtitle and gives us a good direction for this. I'd like for us to first discuss systems design and site reliability engineering in general (since I think you can't really talk about one without involving the other) and then drill in a bit to the systems design of taaalk.co itself using the lessons & practices from the aforementioned discussion, if that's ok with you!
Joshua Summers
13:42, 12 Aug 22
That is ok with me!
At the moment I don't know much about these topics at all. Starting off with things at their most basic, what is systems design?
EJ (she/her)
23:27, 12 Aug 22 (edit: 23:29, 12 Aug 22)
So at a high level, systems design is the study and practice of building software systems—so not just a single application or service in itself, but the ecosystem required for it to function—to meet product requirements. I phrase this as “meeting product requirements” because 1. integrating reliability, scalability, etc. into product requirements is a good thing and 2. systems design isn’t necessarily focused solely on reliability or any other -ility.
This is to say: systems design is about trade-offs and the balancing of requirements, and knowing what those trade-offs are and when to make certain ones. A really common one comes in the form of the CAP theorem. The CAP theorem says that a requirements triangle forms between consistency (“every read after a given write will return information from that write”), availability (“every request will be successfully handled”), and partitionability (“the system will still function properly if there is a failure in the network”) and, most importantly, you can only build a system to handle two of these. So you can have a highly consistent and available system, or a highly consistent and partitionable system, or a highly available and partitionable system, but not all three.
There are two important things to take away from the CAP theorem in my opinion: the first is that you will never be able to build a completely reliable network, so your system must be able to handle network failures; and therefore your trade-off is really between consistency and reliability/availability. The second is that these really should be treated as spectrums rather than just discrete categories; so there are trade-offs between the degree of consistency and the degree of availability. Financial systems, for example, will need to be highly consistent so as not to have duplicate writes, but maybe can make sacrifices on availability. Likewise a news website or status page application would need to be highly available, but probably doesn’t need to be as consistent as a financial system.
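To make that trade-off a little more concrete, here's a toy sketch in Python (everything in it is invented for illustration, it's not code from any real system): three in-memory "replicas" of a key-value store, where asking more replicas to agree on a read pushes you toward consistency, and asking fewer keeps you more available when a replica is unreachable.

```python
import random

# Toy illustration only: three in-memory "replicas" of a key-value store.
REPLICAS = [dict(), dict(), dict()]

def reachable_replicas():
    # Pretend a network partition randomly cuts us off from one of the three.
    return random.sample(REPLICAS, k=2)

def write(key, value, quorum=2):
    replicas = reachable_replicas()
    if len(replicas) < quorum:
        raise RuntimeError("write unavailable: quorum not reachable")
    for r in replicas:
        r[key] = value

def read(key, quorum=2):
    replicas = reachable_replicas()
    if len(replicas) < quorum:
        raise RuntimeError("read unavailable: quorum not reachable")
    sampled = random.sample(replicas, k=quorum)
    values = [r[key] for r in sampled if key in r]
    return values[0] if values else None  # None means we hit a stale replica

write("balance", 100)
print(read("balance", quorum=2))  # the read set overlaps the write set, so this sees 100
print(read("balance", quorum=1))  # cheaper and more available, but can return None (stale)
```

Asking for a quorum of 3 on either side would make reads and writes fail outright whenever a replica is cut off, which is the availability cost of demanding maximum consistency.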
The CAP theorem is not the only trade-off we encounter; there’s the classic project management maxim of “Between good, cheap, and fast, you can only pick two”, and various relations between security, usability, maintainability, simplicity, cost (again), performance, latency, and a few other things I’m not mentioning. Some of these trade-offs aren’t very linear like the CAP theorem is, some of these are false dichotomies and aren’t trade-offs at all, and all of them are spectrums.
But back to systems design itself real quick. Given all this information, systems design in practice has typically meant learning how to select the appropriate configuration of servers, load balancers, proxies, caches, and databases to meet requirements for a given app. Increasingly this also means selecting CI/CD systems and deployment techniques, and building observability & security concerns into the architecture of your system as well.
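If it helps to see those pieces side by side, here's a deliberately miniature sketch (again, all of it made up for illustration): a round-robin "load balancer" policy, a cache tier, and a pool of app servers squeezed into a few lines of Python. In a real system each of these would be a separate component you select and configure, which is exactly the work systems design ends up being.

```python
import itertools

# Illustrative stand-ins for components you would normally pick and configure separately.
app_servers = ["app-1", "app-2", "app-3"]   # made-up server names
round_robin = itertools.cycle(app_servers)  # the load balancer's scheduling policy
cache = {}                                  # stands in for a cache tier

def handle_request(key):
    if key in cache:                 # cache hit: no app server or database involved
        return cache[key]
    server = next(round_robin)       # load balancer picks a backend
    result = f"{key} handled by {server}"  # pretend this did real work against a database
    cache[key] = result
    return result

print(handle_request("/profile/42"))  # handled by app-1
print(handle_request("/profile/42"))  # served from the cache this time
```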
Joshua Summers
16:27, 20 Aug 22
Ok, so what I take away from your answer is that a system must be partitionable. And the rest of it (consistency/availability) is a spectrum which is down to you/business requirements. 
As partitionability seems to be the bedrock of systems design, it would be great to hear it explained in its most basic form and to hear why it is an essential requirement.
EJ (she/her)
23:51, 07 Sep 22
That's an excellent takeaway, yes!
Partitionability (or partition tolerance) is at its most basic the ability to withstand a partition. In this context we're talking about a network partition, that is, a situation in which a network is divided, or partitioned, into two or more subnets, which are semi-independent networks within a network. This can happen because that's simply how the network is designed or because of a failure in the network. When talking about partitionability in a distributed systems or systems design sense we're referring to the latter type of partition, one caused by failure.
So for example if we send traffic to server A, which then sends traffic to server B through a network switch, a partitionable system is one that still can function if the network switch between A and B fails. Either there's a redundant network switch, or a server C that A can send traffic to instead, or A automatically starts doing the work B was doing, or one of many other failure strategies (or realistically a combination of these).
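As a rough sketch of what one of those strategies can look like in code (the hostnames and timeout here are placeholders I'm inventing for the example), a client-side failover is often just "try the next path when the first one is unreachable":

```python
import urllib.request
import urllib.error

# Placeholder URLs for the example: server B is the primary path, server C the fallback.
BACKENDS = [
    "http://server-b.internal/work",
    "http://server-c.internal/work",
]

def call_with_failover(urls, timeout=2):
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as err:
            last_error = err  # this path looks partitioned away; try the next one
    raise RuntimeError(f"all backends unreachable: {last_error}")

# call_with_failover(BACKENDS) returns the primary's response when the path to B is healthy,
# and quietly falls back to C if the switch in front of B fails.
```

Real systems usually layer several of these: redundant hardware underneath, plus retries and failover in the software on top.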
Being able to withstand a network partition is essential because, simply, hardware fails. Cables get cut, switches go down, backbones lose power. The fallacies of distributed computing say it best: a network is never going to be reliable. You're going to have to deal with fault tolerance for any system at scale because outages just happen.
Joshua Summers
08:27, 24 Sep 22
So, at the moment this feels very theoretical. As a software engineer who is used to working on the application itself, rather than the system behind it, I can't imagine how system design happens in practice.
Is it something that you do in a server room somewhere? Or is it something you configure in a yaml file, or perhaps a bit of both?
When you are thinking about the first bit of code you touch as a system designer, what language is it in? And what framework is it part of?