I have a lot of experience with Haskell and a little bit of experience with C. The latter is my language of choice when it comes to hardware and hardware-adjacent tasks. In particular, I have used C to program microcontrollers and to interface with Vulkan and Wayland. I have no issue writing C programs in principle, but I find it hard to manage interactions with the underlying hardware or hardware-adjacent systems. I think there is a bit of wisdom I am missing.
The recurrent problem I face every once in a while when programming in C is that my program crashes or has no effect without any indication of error from the compiler. This is radically different from Haskell, where I can expect a program that compiles to work — so long as I do not use general recursion or error calls, I have a guarantee that my functions will compute something. How do I achieve this with Haskell? By making sure there are no conditions on inputs. A function that breaks down on some values of its arguments is considered a bad function, and I avoid such functions. And, of course, a function in Haskell cannot read any kind of state. This guarantee is impossible to offer in the world of hardware — hardware is operated in no other way than by changing its state.
For example, consider the procedure zwlr_layer_shell_v1_get_layer_surface from a Wayland extension for making status bars and backgrounds:
After creating a layer_surface object and setting it up, the client must perform an initial commit without any buffer attached. The compositor will reply with a layer_surface.configure event. The client must acknowledge it and is then allowed to attach a buffer to map the surface.
This is really a condition on the inputs to wl_surface_attach, defined in the official Wayland protocol. So, presumably, wl_surface_attach can be used safely unless you use it with a surface that happens to be a layer surface, in which case you need to have done the dance hinted at above.
This is an example of conditions on state. Apparently, a surface in general fulfills the conditions needed to attach a buffer, but in the particular case of a layer shell surface it does not necessarily do so. Why? I shall never know. The state is hidden from me. There is not even a procedure that would check if my surface is good to draw on. The only way for me to be sure my program will work is to make sure certain requests are sent to the underlying system in a certain order.
There are two hardships here:
- It is hard to know what is required of me. The conditions are written in plain text (if at all) and scattered across the whole massive of documentation to the system I am interacting with. Enumerating all the relevant conditions and precisely understanding what they mean is an impossible task.
- It is hard to know whether my program matches the requirements. A non-trivial program can run in infinitely many different ways, leaving behind one of infinitely many execution traces. So, formally speaking, I must make a judgement on an infinite language of execution traces. This is again an impossible task.
There are some ways I have managed to make progress.
- I use the integrated validation and tracking facilities where they are available. Vulkan offers a validation layer, Wayland has tracing that can be enabled with an environment variable, and my own code can be instrumented with
-finstrument-functions, letting me record an execution trace. - I use the
assertmacro to check all inputs and also the state I have access to. If my program has state, all procedures that depend on it will be covered by assertions. I have assertions before the body of the procedure to check requirements and assertions just before the return statement to sanity check the procedure itself. This adds up to a lot of lines of code, but it is the only way to detect broken state early.
I can approach the problem of zwlr_layer_shell_v1_get_layer_surface by reflecting the relevant hidden state in my own state, by adding flags like «initial commit performed» and «configure event acknowledged». Then I can wrap wl_surface_attach in assertions and at least know whether my program crashed because it has not performed an initial commit or has not acknowledged the configure event. But this still does not tell me how to construct my programs in such a way that they never hit any assertions.
What else can I do? How do people who write major system programs think about this kind of problems?
Since the underlying system is, practically, not reliable, we cannot ask for perfection. But we can ask for either of these two criteria:
synchronous correctness If the program crashes, I can argue that it is not my fault.
Maybe there is an exception in the underlying system that is not documented. Maybe cosmic rays have gotten in the way. But I can decisively argue in the court of law that my program performed the initial commit and acknowledged the configure event, or would have done so if not for outside issues. The question then is how I should argue in my defense on a case by case basis.
With Haskell, I address this criterion by writing folds instead of recursion where possible, and proving termination by hand where recursion cannot be avoided. This takes care of non-termination. Exceptional cases, meanwhile, are enumerated and dealt with one by one until none are left.
diachronous correctness If there is a fault in my program, I can decisively fix it while introducing, on average, less than 1 new fault.
If I introduce more faults than I fix, then my program will eventually become too broken to be useful. But if I fix at least a little bit more than I introduce, then, given enough time, I can achieve any level of quality. The question then is how to stay on the good side of the introduced to fixed fault rate as the code base evolves.
With Haskell, I address this criterion by avoiding global state, so that my program is effectively made of many small, completely independent programs that are hierarchically wired together. If there is a fault, it is either in one of the small programs or in a layer of wiring. Either way, it can be localized and then synchronous correctness methods can be applied to the isolated part.
So, one possible answer to my question would be to offer ways to address these criteria in the setting of system programming. Perfection not required!