
I have a lot of experience with Haskell and a little bit of experience with C. The latter is my language of choice when it comes to hardware and hardware-adjacent tasks. In particular, I have used C to program microcontrollers and to interface with Vulkan and Wayland. I have no issue writing C programs in principle, but I find it hard to manage interactions with the underlying hardware or hardware-adjacent systems. I think there is a bit of wisdom I am missing.

The recurrent problem I face every once in a while when programming in C is that my program crashes or has no effect, without any indication of error from the compiler. This is radically different from Haskell, where I can expect a program that compiles to work — so long as I do not use general recursion or error calls, I have a guarantee that my functions will compute something. How do I achieve this in Haskell? By making sure there are no conditions on inputs. A function that breaks down on some values of its arguments is considered a bad function, and I avoid such functions. And, of course, a function in Haskell cannot read any kind of state. This guarantee is impossible to offer in the world of hardware — hardware cannot be operated in any way other than by changing its state.

For example, consider the procedure zwlr_layer_shell_v1_get_layer_surface from a Wayland extension for making status bars and backgrounds:

After creating a layer_surface object and setting it up, the client must perform an initial commit without any buffer attached. The compositor will reply with a layer_surface.configure event. The client must acknowledge it and is then allowed to attach a buffer to map the surface.

wlr layer shell protocol | Wayland Explorer

This is really a condition on the inputs to wl_surface_attach, defined in the official Wayland protocol. So, presumably, wl_surface_attach can be used safely unless you use it with a surface that happens to be a layer surface, in which case you need to have done the dance hinted at above.

This is an example of conditions on state. Apparently, a surface in general fulfills the conditions needed to attach a buffer, but in the particular case of a layer shell surface it does not necessarily do so. Why? I shall never know. The state is hidden from me. There is not even a procedure that would check if my surface is good to draw on. The only way for me to be sure my program will work is to make sure certain requests are sent to the underlying system in a certain order.

There are two hardships here:

  1. It is hard to know what is required of me. The conditions are written in plain text (if at all) and scattered across the entire mass of documentation for the system I am interacting with. Enumerating all the relevant conditions and precisely understanding what they mean is an impossible task.
  2. It is hard to know whether my program matches the requirements. A non-trivial program can run in infinitely many different ways, leaving behind one of infinitely many execution traces. So, formally speaking, I must make a judgement on an infinite language of execution traces. This is again an impossible task.

There are some ways I have managed to make progress.

  • I use the integrated validation and tracking facilities where they are available. Vulkan offers a validation layer, Wayland has tracing that can be enabled with an environment variable, and my own code can be instrumented with -finstrument-functions, letting me record an execution trace.
  • I use the assert macro to check all inputs and also the state I have access to. If my program has state, all procedures that depend on it will be covered by assertions. I have assertions before the body of the procedure to check requirements and assertions just before the return statement to sanity check the procedure itself. This adds up to a lot of lines of code, but it is the only way to detect broken state early.

I can approach the problem of zwlr_layer_shell_v1_get_layer_surface by reflecting the relevant hidden state in my own state, by adding flags like «initial commit performed» and «configure event acknowledged». Then I can wrap wl_surface_attach in assertions and at least know whether my program crashed because it has not performed an initial commit or has not acknowledged the configure event. But this still does not tell me how to construct my programs in such a way that they never hit any assertions.
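This flag-and-assertion approach can be sketched in C. Everything here is hypothetical naming on my part (`surface_state`, `checked_attach`); the real `wl_surface_attach` call is elided, since the point is only the shape of the guard:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the real Wayland types; in an actual client these
 * come from wayland-client.h and the layer-shell protocol header. */
struct wl_surface;
struct wl_buffer;

/* Hand-tracked mirror of the hidden compositor-side state. */
struct surface_state {
    bool initial_commit_done;
    bool configure_acked;
};

/* Hypothetical wrapper: refuse to attach a buffer unless the
 * layer-shell handshake has been completed. */
void checked_attach(struct wl_surface *s, struct wl_buffer *b,
                    const struct surface_state *st)
{
    assert(st->initial_commit_done && "initial commit not performed");
    assert(st->configure_acked && "configure event not acknowledged");
    /* wl_surface_attach(s, b, 0, 0) would go here in a real client. */
    (void)s;
    (void)b;
}
```

The assertion messages make it immediately clear which precondition was violated when the program aborts.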

What else can I do? How do people who write major system programs think about this kind of problem?


Since the underlying system is, practically, not reliable, we cannot ask for perfection. But we can ask for either of these two criteria:

  • synchronous correctness If the program crashes, I can argue that it is not my fault.

    Maybe there is an exception in the underlying system that is not documented. Maybe cosmic rays have gotten in the way. But I can decisively argue in a court of law that my program performed the initial commit and acknowledged the configure event, or would have done so if not for outside issues. The question then is how I should argue in my defense on a case by case basis.

    With Haskell, I address this criterion by writing folds instead of recursion where possible, and proving termination by hand where recursion cannot be avoided. This takes care of non-termination. Exceptional cases, meanwhile, are enumerated and dealt with one by one until none are left.

  • diachronous correctness If there is a fault in my program, I can decisively fix it while introducing, on average, less than 1 new fault.

    If I introduce more faults than I fix, then my program will eventually become too broken to be useful. But if I fix at least a little bit more than I introduce, then, given enough time, I can achieve any level of quality. The question then is how to stay on the good side of the introduced to fixed fault rate as the code base evolves.

    With Haskell, I address this criterion by avoiding global state, so that my program is effectively made of many small, completely independent programs that are hierarchically wired together. If there is a fault, it is either in one of the small programs or in a layer of wiring. Either way, it can be localized and then synchronous correctness methods can be applied to the isolated part.

So, one possible answer to my question would be to offer ways to address these criteria in the setting of system programming. Perfection not required!

9
  • 11
    I did not vote down, but this question is very broad and open-ended and most likely does not have a generally acceptable answer. What C programmers do is probably "deal with it", i.e. if hardware can't be used safely without reading all the obscure documentation then you need to read that documentation. With Rust, you are still pretty close to the hardware but its type and lifetime systems may protect you a little more. Commented Apr 2 at 8:58
  • 5
    @Steve not really, programmer mistakes lead to weird application behavior more often than to exploitable weaknesses. If the type system is used to distinguish between an uninitialized piece of hardware and an initialized one, and the problematic functions are only available for initialized hardware, the compiler can catch possible bugs. Commented Apr 2 at 10:08
  • 2
    I'm not sure what you mean by "hardware". Neither Vulkan nor Wayland is hardware. These are just libraries, APIs. They ultimately interact with hardware, but what doesn't? Also, it is irrelevant to the caller. It is no different from working with any other lib. And it doesn't require different skills. And C is inherently unsafe, regardless of what lib you use. Commented Apr 2 at 15:04
  • 1
    @freakish   By hardware I mean something that cannot practically be programmed in a high level language like Haskell. Commented Apr 2 at 15:15
  • 1
    This question is not about systems programming, but more about bad C APIs. Commented Apr 4 at 9:49

7 Answers

11

Since the underlying system is, practically, not reliable.

Whilst there are unreliable systems, it doesn't sound like you are interacting with that much unreliability. What you are interacting with is complexity. The job of a programmer is to manage complexity.

So, formally speaking, I must make a judgement on an infinite language of execution traces. This is again an impossible task.

You will have to categorize the infinite language of execution traces into a finite set of categories. Whilst there are infinitely many ways of doing this categorization, you will have to pick one that fits your requirements.

Enumerating all the relevant conditions and precisely understanding what they mean is an impossible task.

It's not impossible. It may be lengthy, but for every system I have interacted with there is a finite amount of documentation.

If what you observe isn't covered by documentation, record what you did to make what you observed. That is a start to making your own documentation.

In almost all projects, you don't need to do this all up front. Unless your project is safety-critical, you can try something, see if it meets your requirements, and tweak that.

3
  • So, what you are saying is that I should indeed analyze execution traces and scrutinize documentation. Right? Commented Apr 2 at 10:51
  • Yes. Writing tests is a particularly good way of analyzing execution traces. Commented Apr 2 at 10:52
  • (1) Then, please, write this in bold font somewhere in your answer, so the point does not get lost. (2) What kind of tests do you have in mind here? How might such a test look concretely? Commented Apr 2 at 11:13
9

What else can I do? [to avoid runtime errors] How do people who write major system programs think about this kind of problems?

Unit tests.

Conceptually, the specific example you give could theoretically be amenable to static checks on your code, i.e. a better compiler.

The other common way, which you already use, is to wrap the problematic code, ensuring that the error is handled, avoided, or otherwise turned into a compile-time error, e.g. with a builder class.

C is a fairly low-level language, so it lacks some of the compile-time abstractions that Haskell or an OOP language has.

But in general, you can just add a unit/integration test to check that no error is thrown at runtime for given sets of parameters.

You can limit the range of these parameters by wrapping the direct function calls in units which restrict the ways in which they can be used and only calling and testing those units rather than the underlying functions.

For instance with your example: (excuse my pseudo code)

get_layer_surface takes the problematic parameter surface object<wl_surface> which must be uninitialized. But we could wrap the call in a function which instead takes a new class: wl_surface_uninitialised written by ourselves.

This class would have its own methods to attach/commit a buffer which would return a wl_surface_initialised object.

Wrap all this up in some immutability, add unit tests to make sure it works, and you now have the compile-time checks you were looking for.
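Since the question is about C, which has no classes, the same idea can be approximated with two distinct wrapper struct types, so the compiler itself rejects an out-of-order call. All names below are my own invention (`surface_unconfigured`, `surface_configured`, `surface_do_handshake`), and the actual Wayland requests are elided:

```c
/* Hypothetical stand-in for the protocol object. */
struct wl_surface { int configured; };

/* Two distinct wrapper types: the compiler now refuses to pass an
 * unconfigured surface where a configured one is required. */
struct surface_unconfigured { struct wl_surface *raw; };
struct surface_configured   { struct wl_surface *raw; };

/* Performing the handshake is the only way to obtain the
 * "configured" type, so the ordering is enforced at compile time.
 * (Initial commit, waiting for configure, and the ack are elided.) */
struct surface_configured surface_do_handshake(struct surface_unconfigured u)
{
    u.raw->configured = 1;
    return (struct surface_configured){ u.raw };
}

/* Attaching a buffer only accepts the configured type. */
void surface_attach_buffer(struct surface_configured s)
{
    (void)s; /* wl_surface_attach(...) would go here in a real client */
}
```

Passing a `struct surface_unconfigured` to `surface_attach_buffer` is now a type error rather than a runtime protocol violation.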

15
  • 8
    Nitpick: Tests in general, not specifically only unit tests. Commented Apr 2 at 16:02
  • 7
    Unit tests are absolutely useless when working on the area OP is talking about. It is all about integrating two disparate complex systems. There is no way to mock anything out without adding (usually invalid) assumptions to a unit test. Commented Apr 3 at 8:16
  • 4
    @AndrewHenle you've missed my point. Tests that verify nothing are useless. Unit tests do not help with reliability and determinism here, because they establish reliability and determinism of mocks. Components certainly will crash in production even if unit tests do not. Commented Apr 3 at 11:38
  • 2
    @Ewan To test a Wayland client, one either needs a real Wayland compositor or a mock of it. If a real compositor is used, it is not a unit test. And a mock is as hard to implement correctly as a real compositor. Commented Apr 3 at 15:49
  • 3
    General point: integration problems are not tested with Unit tests. Commented Apr 3 at 15:54
8

So first of all, I don't think this question has much to do with hardware. Vulkan and Wayland are no different from any other C lib, in the sense that from the caller's perspective you just call functions and you are expected to follow the rules. It looks to me like this is more about differences between languages like Haskell and C.

C, C++, and to some extent Rust (as in: unsafe Rust) are very different from languages like Haskell, Java, C# (except for unsafe C#) or Python. The big difference is that these low-level languages have a built-in concept of undefined behaviour; the latter languages do not. Every line of code written in Haskell, Java, C# or Python has a well defined behaviour. This is an extremely useful property, which C does not have.

C does not protect you from doing stupid things. For example, you are allowed to go outside of array bounds. The reason for this is performance: the compiler will always assume that you (as a programmer) behave as expected, so that it can properly optimize the code. Without this assumption, every time you access an element of an array by index, the compiler would have to generate code that verifies that the index is inside bounds. But that requires additional bookkeeping and takes non-zero time. And indeed, higher-level languages actually do that. And that is one of the reasons they are slower than C, C++ or Rust.
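The bounds check that C omits by default can, of course, be written by hand where you want it. `checked_get` below is a hypothetical helper, not a standard function; it trades the undefined behaviour of an out-of-bounds read for a defined fallback value:

```c
/* Bounds-checked array access: the run-time check C omits for
 * performance, added back explicitly. Returns `fallback` instead
 * of invoking undefined behaviour on a bad index. */
int checked_get(const int *a, int len, int i, int fallback)
{
    if (i < 0 || i >= len)
        return fallback;
    return a[i];
}
```

This is essentially what the runtimes of higher-level languages do on every index operation, which is where part of their overhead comes from.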

But this has a nasty side effect: the compiler and runtime are less helpful. Code crashing is actually a good situation: you at least know you have a bug. But going outside an array does not always mean a crash, unfortunately. Sometimes the code looks like it's doing exactly what is expected, right up until it runs in production. And that is very bad.

This means that C, C++ and Rust demand a lot more from the software engineer. They demand lots of discipline. And sure, tests help. But they won't catch all undefined behaviours. Sure, analyzers like valgrind help. But they won't catch all undefined behaviours. In general, the "does my code contain undefined behavior" problem is undecidable (for example because of pointer arithmetic). Even though there are tools that catch most common cases.

What else can I do? How do people who write major system programs think about this kind of problems?

They have lots of years of experience, and they try to avoid common pitfalls. They write tests wherever it is possible. And they use tools, e.g. static and dynamic analyzers. That's all.

But don't be misguided: even hardcore coders make mistakes all the time. In 2025, the Linux kernel recorded around 5000 security-specific bugs, half of them due to concurrency issues.

You can also try to isolate problematic pieces of code, and actually call them from a higher level language through ffi. For example Python does this all the time with all the AI and numerical libs, which are actually implemented in C.

Rust was created to address some of those issues while still being at a low enough level. If you can use Rust instead of C, then I strongly encourage you to do that. Both Vulkan and Wayland have Rust bindings. With Rust you can create safe abstractions around unsafe code in a way that is impossible with C. For example, Rust references are always valid in safe Rust. Rust will also prevent you from shooting yourself in the foot in many situations (e.g. via the borrow checker). Not all, though.

The main point stands though: this is simply hard.

synchronous correctness If the program crashes, I can argue that it is not my fault.

You absolutely should never assume that. That's delusional.

diachronous correctness If there is a fault in my program, I can decisively fix it while introducing, on average, less than 1 new fault.

Well, that would be the ideal, and tests help here a lot. But such a thing can never be guaranteed.

1
  • Given the fact that “I’m using a library used by 100 people, combined with my own code, and it doesn’t work”, it’s much more likely your fault than the library’s fault. Most likely you fix it by stating “I’m doing something wrong. What am I doing wrong?” Commented Apr 6 at 12:43
4

TLDR: How to manage complexity? With intuition, incorrect abstractions and grit.

I find appeals to language or technology naive.

A real problem here is the task. When dealing with external systems, the communication protocol has to fit both systems and is always low-level. There are plenty of serialization, message passing and other IPC tools, but those never address the problem of state space, and there is always a need to bridge the abstraction gap between the low-level protocol and the high-level ideas the programmer would like to work with.

State is only one subset of the communication protocol's parameter space. Sure, adding a single bit of state doubles the API surface, but so does a boolean parameter.

You have never observed similar problems in Haskell, because the average task in Haskell is plain and simple in comparison. Compare web-page rendering, where there is one input and one output, with handling an IO interrupt that could be triggered by any device in the system, has to produce drastically different results depending on the situation, and where most wrong actions result in a crash, and you will see my point. Alternatively, imagine the Linux kernel state as an object in an IO monad; that also illustrates the difference nicely.

Do not get me wrong - web devs do deal with complexity, sure. But it comes from different problem space of business logic and usability, not from protocol complexity or state management.

And here comes the annoying truth - software engineers are hired to manage complexity and the normal tools and recipes apply to manage your particular example. Usually the main tool is abstraction layering.

  • External system has state? Wrap it in an idempotent layer, at the cost of a minor performance hit.
  • Too many parameters? Hide them behind a simplified layer that has none, at the cost of reduced flexibility.
  • Too many preconditions? Introduce a validation layer, at the cost of a performance hit and reduced usability.

The list goes on and is ridiculously well-trodden.
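As a minimal sketch of the first bullet, here is an idempotent layer over a stateful call in C. `device_power_on` is a hypothetical external request, not a real API; the wrapper lets callers stop caring whether the request was already sent:

```c
#include <stdbool.h>

bool powered = false;       /* our mirror of the external state      */
int power_on_calls = 0;     /* instrumentation, for illustration only */

/* Hypothetical stand-in for a stateful request to an external system. */
void device_power_on(void) { power_on_calls++; }

/* Idempotent layer: callers may invoke this as often as they like;
 * the underlying stateful request is sent at most once. */
void ensure_powered(void)
{
    if (!powered) {
        device_power_on();
        powered = true;
    }
}
```

The cost is the extra flag and branch; the benefit is that call-site ordering no longer matters for this piece of state.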

With each layer of abstraction, you get a less flexible but more robust system. There is a great risk of selecting the wrong abstraction for a particular job, and that's why we are paid to understand the problem space in enough detail and to redo the systems when we inevitably choose wrong.

There are plenty of books and opinions on selection of abstractions and layering, but unfortunately, they are either too vague or too specialized to be useful for a particular task. We read them to train our intuition and broaden horizons.

My main point here is: system engineers deal with complexity just like everybody else; they just have a different problem space.

12
  • 1
    "Fundamentally, most programming languages are Turing complete, so should be able to handle any task." That's a fundamental misunderstanding of what Turing complete means. It literally just means it can perform any mathematical computation. That's all. Most definitely it does not mean it can perform any task. For example your language might be Turing complete and yet not allow any hardware access. Which means it is useless, you won't be able to do networking, file or any i/o, including printing to terminal. Commented Apr 3 at 11:41
  • 1
    Secondly, ergonomics and ease of use is not a thing one should ignore. Machine code is Turing complete. So what? Probably no one writes raw machine code, maybe some masochists. People at least use assembly languages. And finally: performance matters. And it has nothing to do with Turing completeness. For example, Python is inherently slow, by design. And nothing will change that. The differences between languages are not trivial, and should not be waved off with "everything is the same". It is not. Commented Apr 3 at 11:55
  • @freakish We all know what Turing complete is and how it applies to English expressions. Performance is not a consideration for the problem as stated by OP. Ergonomics is a part of the process of selecting a proper abstraction. It includes the choice of language. Commented Apr 3 at 13:33
  • (1) English expressions? I have no idea what you are trying to say here. Turing completeness hardly means anything useful. In practice you won't even see a non-Turing-complete programming language. But that doesn't mean that all languages can do the same things; that's bs. (2) As for performance: it is always relevant, regardless of whether it is explicitly mentioned or not. That's the whole point of the industry. (3) I don't understand your "choose a language" argument. So languages are actually different? And that choice is relevant? You literally just tried to say it doesn't matter. Commented Apr 3 at 14:18
  • 1
    @IgnatInsarov I hoped for the opposite effect, actually. The answer was intended to demonstrate that the usual practices of software engineering apply and there is no silver bullet. I will revisit specific concerns in your question and try to go deeper, but it might be more efficient to ask a more specific question. Commented Apr 4 at 9:58
3

I can represent state either explicitly, with another variable next to it, implicitly by the sequence of operations, or, if I'm feeling adventurous, implicitly by some other state that I already know will be updated at a similar time.

So, I could write the setup for the surface object either as a sequence

  1. send commit without buffer attached
  2. wait for reply
  3. acknowledge reply
  4. configure buffer according to the demands of the compositor

At the end of the sequence, I have a working connection to the compositor, and because nothing can interrupt that sequence, I do not have to care about the intermediate state.
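The four-step sequence might look like this in C. The protocol calls are replaced here by hypothetical stubs that merely track the current step; in a real client they would be `wl_surface_commit`, `wl_display_roundtrip` (or an event-loop wait), `zwlr_layer_surface_v1_ack_configure`, and `wl_surface_attach`:

```c
#include <assert.h>

/* The steps of the layer-shell handshake, in order. */
enum setup_step { START, COMMITTED, CONFIGURED, ACKED, READY };

enum setup_step step = START;

/* Hypothetical stubs; each asserts that it is called in the right
 * order and advances the step. Real protocol requests are elided. */
void commit_without_buffer(void) { assert(step == START);      step = COMMITTED;  }
void wait_for_configure(void)    { assert(step == COMMITTED);  step = CONFIGURED; }
void ack_configure(void)         { assert(step == CONFIGURED); step = ACKED;      }
void attach_buffer(void)         { assert(step == ACKED);      step = READY;      }

/* Because the whole sequence lives in one function, no other code
 * can observe or disturb the intermediate states. */
void setup_surface(void)
{
    commit_without_buffer();
    wait_for_configure();
    ack_configure();
    attach_buffer();
}
```

After `setup_surface` returns, the rest of the program only ever sees the `READY` state.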

I could also build an event-driven program, where the reply is handled in the generic event handler, and any drawing functions check whether the buffer is already allocated -- if it isn't, then I don't even know what the output viewport size is going to be, so any attempt at drawing would be premature.
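A minimal sketch of this event-driven variant, with the buffer check guarding the drawing function (all names are hypothetical, and the actual allocation and rendering are elided):

```c
#include <stdbool.h>

bool buffer_allocated = false;

/* Generic event handler: once the compositor has told us the
 * viewport size, we can allocate the buffer. */
void on_configure_event(int width, int height)
{
    (void)width;
    (void)height;
    /* allocate a width x height buffer here ... */
    buffer_allocated = true;
}

/* Drawing checks the flag instead of assuming the setup sequence
 * has already run; returns false when drawing would be premature. */
bool try_draw(void)
{
    if (!buffer_allocated)
        return false;   /* too early: viewport size still unknown */
    /* render into the buffer ... */
    return true;
}
```

The explicit flag turns "has the configure event happened yet?" from a hidden protocol fact into a local, checkable condition.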

I can combine these approaches and set up the rendering pipeline including the final viewport size as part of the sequence, and drawing functions check for the presence of a working pipeline; the presence of the buffer then becomes an implementation detail that lets me abstract the drawing away from the presentation.

In an event-driven program I can also express the state of the rendering pipeline by proxy, for example by only subscribing to events that cause me to begin rendering when it is safe to do so.

That has some potential for bugs if the proxy is not consistent with the state of the system (e.g. I have an event handler registered that tries to draw, but I don't have a surface), but a lot of people prefer this implicit style over silently absorbing inconsistent states, at least for debug builds, because it makes errors more visible.

3
  • What do you mean by «set up the rendering pipeline including the final viewport size as part of the sequence» and by «express the state of the rendering pipeline by proxy»? Commented Apr 4 at 1:23
  • I'm using your example, but extrapolating the extra step of "now that we have an output buffer, and we know the dimensions we're rendering to, we need to do more setup on our end". We can attach that to the sequence, so the (ready/not ready) state of the buffer and the remaining setup become synchronized for the rest of the program, and only the current step in the sequence encodes it for the brief duration where these go out of sync. Commented Apr 4 at 4:10
  • The "by proxy" approach instead stores this state inside the compositor, in the form of "don't send me any events that would cause me to draw anything, because I would have to ignore them, and it is a waste of resources to generate them." This kind-of absolves you from tracking the state yourself, because no decisions need to be made based on it. Commented Apr 4 at 4:13
3

A practical approach to this problem is to adopt a different paradigm for expressing your problem.

You are not dealing with clean mathematical abstractions. You are operating very finite machinery: it happens to be electronic, rather than mechanical, and it operates very fast, but it still has a bunch of moving parts that can fail.

An effective way to write code for this is to think of your task as a search for errors, which happens to accomplish work as a side-effect.

This is very different from Haskell, where a complete lack of side effects is, I understand, a major point of the language. This may have something to do with Haskell, and other functional languages, being quite rare in any kind of programming for commercial purposes.

-2

So, tldr, real-world programming is hard? (I'm the upvoter btw, not the downvoter.)

The reality of computer programming is not to produce something that works correctly for the life of the universe, but to assist human activity for the time being and to "save labour", and do so with enough reliability to be useful.

Haskell programs may be guaranteed to run in their own terms, but the hardware those programs drive is not guaranteed to work correctly, and the program itself is not guaranteed to be correct (in the sense of doing what it should, or striking a reasonable balance in the circumstances between competing and irreconcilable goals).

You've clearly acquired involvement in programming via academia (hence the use of Haskell), where there has been decades of lack of common sense about what real-world programming is about.

You might therefore have found that you are talented in ways that won't make you a good industrial programmer. Or you might be perfectly capable of being a good industrial programmer, but about to find that an academic course of study (and association with academics) has conveyed very little relevant experience and competency, has in fact twisted your concepts and expectations, and that you are now about to start again.

There is no simple way of boiling down a professional area of practice, but amongst various things that professionals do do, is managing the complexity of solutions so as not to exhaust the available budget of programmer labour available (i.e. your labour), and controlling projects which (when kept within a budget) will work too unreliably to be useful (computers can be unreliable either because they produce wrong results too often or with too great consequences, or because they stop processing and require intervention and alteration too frequently or urgently).

Doing these tasks skilfully is what the job consists of.

There's obviously lots more that can be said, but obviously one of the main mistakes you're making is assuming your programs should be truly perfect and that programming computers will be easy, and worrying that this is not the case. Experienced computer programmers don't think like that.


EDIT:

On the points added to the question, the problem is that a "crash" is only conventional in definition. You can avoid a "crash" simply by removing or swallowing all the exceptions from your program.

Proving that a program "terminates" is rarely useful. Some programs are supposed to run indefinitely. Others must not merely terminate before the death of the Sun, but within an acceptable budget of time.

As for reducing bugs by decomposition, the problem is that you can solve every problem with another independent component, except the problem of too many independent components.

And the vast majority of useful programs have to be globally stateful, respond to external events, and evolve their states in ways that are not fully determined at the outset of operation.

What you're mentioning are the common canards peddled by academics about functional languages, like "global state", "side effect" data flows, and so on, and experienced practitioners know (to be blunt) it's all a load of nonsense!

Professional practice is about working competently with what they dub "global state" and "side effects", including managing the complexity of these and moderating their usage to what is reasonably essential (rather than eliminating them on the theory they are inessential).

19
  • I agree with you on your last sentence. My question is, more or less, about how exactly I can manage the complexity of global state. Particularly in situations where this state is mostly hidden (implicit hardware state, memory hidden in libraries) and library procedures depend on this hidden state in non-trivial ways. Commented Apr 2 at 13:59
  • @IgnatInsarov, a lot of it is about being self-aware enough to notice when the complexity is exceeding your own grasp. When this happens is to some extent sensitive to talent, experience, and your degree of cognitive fitness (i.e. the extent to which you've been dealing with that particular kind of task heavily and recently). If the complexity is specifically associated with "global state", then what you typically have are data flows (also considering the associated control flows/causation/"drivetrains" for these data flows) that are too difficult to analyse. (1/3) Commented Apr 3 at 9:46
  • The complexity arises not intrinsically from mere non-locality of the flow, but from the number of flows that gather in one place from many other places for many different reasons, or which scatter from one place and have many disparate consequences, or both (and perhaps both recursively). Good design decomposes a program into modules and ensures that the only non-local flows between modules are essential ones. There are always essential flows, otherwise connected modules could be severed into completely independent programs. (2/3) Commented Apr 3 at 9:46
  • 1
    @Steve I don't really understand what is your point. It sounds to me like some philosophical debate. Yeah, obviously everything about computers, or any engineering originates in human actions and decisions. Its not really helpful though. Yeah, the term "crash" is just a label - again, not helpful. Yeah, there's nothing about anything that intrinsically requires anything - again, not helpful. And yes, we all aim at minimizing errors - but again, that is not helpful. How any of that adresses OP's questions and concerns? Commented Apr 3 at 21:56
  • 1
    @Steve   I think the point you are making about complexity of data flows is particularly valuable. Please put it into the answer itself, so that it does not get lost or deleted. Commented Apr 4 at 1:36
