Hello, welcome. I hope things are good for you.
This is the website of Brian L. Troutwine, or blt depending on whether you know me in the flesh or online. I like to play optimization games and program computers and have managed to make a living, so far, at doing both. I keep busy with a bunch of side projects and research interests that don't necessarily involve computers. This website started out as a labs notebook -- taking inspiration from James Munns who took inspiration from whitequark -- but has evolved with time to encompass most of my public writing.
Debugging
I've got a particular way I go about debugging systems, less about tools and more about process. I'm not totally sure how to explain it, hence this note.
What Debugging Is
Debugging is the process of repairing your mental model of a software system. That is, you might believe that your piece of software behaves some way for some inputs but can observe that it does not. Coming to an understanding of exactly how and why your mental model differs is the end goal of debugging. You may, also, want to come up with a way of adjusting the software to match your incorrect model, but that's a secondary goal. Implicit in this are some assumptions:
- Software and the computers they run on are machines and can be understood as such. They have no self-animation and, while not deterministic, obey a limited set of rules in their operation.
- Software and the computers they run on are human artifacts and were meant to be understood by humans.
- Empiricism is the name of the game. Your reasoning -- the mental model -- has to be informed by experimental data. Your reasoning is faulty and can't be trusted in isolation. Your experiments are also faulty, but to a different degree.
- Magical thinking is a curse and we all participate in it. While computers are simply machines following rules, their complexity is staggering and everyone, at some point, has to cut an abstraction line and consider everything past that line a magical black box with behaviors that may or may not reflect how that part of things actually functions. If your mental model of one of these black-box areas is wildly wrong but involved in your mental model for the component under discussion, you're going to have a bad time.
- Debugging is best done as a social activity. My areas of magical thinking are not yours; my mental model is not yours. Involving other people in debugging work -- especially if they are familiar with but not close to the problem -- almost always bears fruit.
- Mental models are good, testable models are better. Insofar as is practical, if you're investing in testing try to invest in randomized model approaches, like QuickCheck. Done right you'll be able to translate your mental model into code and lean on the computer to find defects in it; see the sketch just after this list.
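As a concrete illustration of that last point, here's a minimal sketch using the proptest crate, one of several QuickCheck-style tools for Rust. The crate choice, function names and toy property are mine, purely illustrative: the "model" is the claim that reversing a vector twice returns the original, and proptest generates random inputs and shrinks any counterexample it finds.

use proptest::prelude::*;

fn double_reverse(input: Vec<u32>) -> Vec<u32> {
    let mut v = input;
    v.reverse();
    v.reverse();
    v
}

proptest! {
    // The property under test: reversing twice is the identity. proptest
    // feeds in random vectors and, on failure, shrinks to a minimal case.
    #[test]
    fn double_reverse_is_identity(xs in proptest::collection::vec(any::<u32>(), 0..64)) {
        prop_assert_eq!(double_reverse(xs.clone()), xs);
    }
}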
How I Debug
I debug following this rough process.
- Get a statement from people familiar with the system about how things are going wrong, ideally something you can reproduce the issue with but, if not, at least something to point you in the general direction of things being goofy.
- Isolate known good areas of the program from known bad. Add tests, telemetry or drop into a live debugger to find seams in your program where your model for the program holds versus where it does not. If you're lucky the resulting deviant bit is pretty small, but more often than not it isn't. At this point you should have the ability to reproduce an issue -- maybe not the issue -- against the isolated part of the program, preferably in an automated way.
- Talk with the people that wrote the deviant bit of the program, if it isn't you. Are they surprised by its behavior? If not, they may have programmed the isolated chunk to act this way intentionally. Why? If they're surprised, see if they have ideas for how to isolate parts of the chunk further. Ask them for a tour of the chunk. Authors have a lot of deep insight into a program.
- Read the source code for the program chunk, taking special care to note how the inputs from your reproducible issue flow through the code. This, hopefully, will give you rough ideas on where the program can be further subdivided.
- If you found areas to subdivide, go back to step 1. Some of your new sub-chunk programs will work perfectly in accord with your mental model for them, others will not. Ideally only one will not, in which case there is only one deviance, but more often than not you'll find several working in concert. Be sure to work only one avenue of exploration at a time. Have the discipline to change only one thing at a time to keep your experiments clean.
- If you're stuck, take a walk. Take a nap. Pack it away for the day. Our minds need time to rest and there's only so much that can be done in one continuous marathon, however much we mythologize the lone, sleepless hacker. If these things don't work, explain your results so far to someone else. The "confessional method" of debugging will often trigger new ideas and, more, the person you're talking with will have fresh notions.
- If you've subdivided the program into fine parts and have a reliable, ideally automated, way of demonstrating how your mental model is off, congratulations, you've debugged the program. If you find that the "leaves" of this subdivision process are perfectly fine but it's some integration of them together that's dodgy, congratulations also. Software faults often occur at interfaces and you should, at this point, have some method for demonstrating how these combined components don't quite fit right together.
- Tell folks about your findings. It may be that the program is actually behaving correctly and the sense of what the program should do is off; it might be that it's just buggy and needs to be repaired. Usually by demonstrating an issue you've also demonstrated some kind of avenue for fixing it, but laying out your results to other people will often bear fruit in this regard that you, now very close to the program, will not have considered.
Some Stray Thoughts
I cannot emphasize enough how important it is in this process to do only one thing at a time. Explore one chunk of the program at once, change only one thing at a time, and keep as many of the conditions in which the failure was first described intact as possible until you can demonstrate they're co-incident and not essential features of the problem. It's often tempting to change multiple things at once because you "know" that the program won't be affected, but it's just plain wrong to assume that, sure, my model of how the program works is wrong over here but over there it's accurate. I can't tell you how many debugging sessions I've entered into with people and requested, just for giggles, to probe something they "know" is reliable, only to find that it's actually the thing that's goofed. I can't tell you how many times I've done this to myself.
Because we're dealing with rule bound machines it might happen that you need to unbox the mechanism in one of your black box areas, especially if you don't have access to a domain expert to talk with.
Things take the time they take. In a world infected with Taylorism there is a strong sense in this kind of exploration work that the "clock is ticking" and you'd better get results fast. It may be that there's time pressure, but that's a concern independent of, and with no control over, the time you need to reason, perform experiments and chat with colleagues. If you're under no external time pressure but still feel an urgent need to get results, maybe think on that. We are not, ourselves, repeatable mechanisms.
Branchless Programming
Modern processors have very deep pipelines, making branch mispredictions costly. Compilers are pretty good at removing branches but not perfect at it. Here's a list of interesting things I've read on "branchless" programming:
- Branchless programming, an article
- Branchless Programming, a Github repo
- A loopless and branchless O(1) algorithm to generate the next Dyck word
- Branchless Equivalents of Simple Functions
- Branchless Rust
- Branchless Conditionals (Compiler Optimization Technique)
- Making Code Faster: Taming Branches
- Chess Programming Wiki: Avoiding Branches
The Chess Programming wiki is probably the most obsessive piece in this list with additional references around the internet. I mean this in a good way. Chess folks do some of the most unusual high-performance programming around.
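To make the idea concrete, here's a minimal sketch of the classic trick in Rust; the function names are mine and the example is illustrative only. Whether the branchless form actually wins depends on how predictable the branch is, and the compiler may already emit a conditional move for the branchy version, so measure before committing to either.

// Branchy version: clamp negative values to zero with a conditional.
fn clamp_branchy(x: i32) -> i32 {
    if x < 0 { 0 } else { x }
}

// Branchless version: an arithmetic shift turns the sign bit into a mask,
// so negative inputs are zeroed without a jump.
fn clamp_branchless(x: i32) -> i32 {
    x & !(x >> 31)
}

fn main() {
    for x in [-5, -1, 0, 7, i32::MAX] {
        assert_eq!(clamp_branchy(x), clamp_branchless(x));
    }
    println!("both versions agree");
}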
Allocators
An "allocator" is a library that sets aside an area in computer memory -- called
the "heap" as opposed to the "stack" and "static" memory -- for use by some
other program. An allocator may or may not keep track of these "allocations" in
its private mechanism and it may or may not allow for callers to "return"
allocations to the allocator, thus freeing up that memory for us by another
caller. If you've ever called malloc
, free
et all in C you've interacted
with an allocator. If you've used a language with a runtime that manages memory
you have done so indirectly. If you've only programmed for embedded devices that
manage memory themselves and allocate only at startup then, well, kudos.
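For the curious, here's roughly what talking to an allocator looks like from Rust with no collection type in the way: a minimal sketch using the std::alloc API, with the caveat that real code almost never needs to do this by hand.

use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Ask the global allocator for 64 bytes with 8-byte alignment -- the
    // moral equivalent of malloc(64).
    let layout = Layout::from_size_align(64, 8).unwrap();
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null(), "allocation failed");
        // Use the memory: write a byte into the first slot.
        ptr.write(42);
        // Hand the allocation back -- the moral equivalent of free(ptr).
        dealloc(ptr, layout);
    }
}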
The third chapter of my new book introduces the memory hierarchy of a computer, starting with static memory, working into the stack and out to the heap. By the end of the chapter the reader's got the ability to swap in new allocators but no ability to write them, on account of the book hasn't yet built up any expertise in thread-safe programming. The next chapter aims to teach that very thing and, in so doing, will build up several allocators. We'll see how successful I am. Anyhow, I've been going back through the literature on allocators. Here's the interesting material I've turned up so far, in no particular order:
- Andrei Alexandrescu: Policy–Based Memory Allocation
- David Gay et al: Memory Management with Explicit Regions
- Emery Berger et al: Composing High-Performance Memory Allocators
- Jason Evans: A Scalable Concurrent malloc(3) Implementation for FreeBSD
- Jeff Bonwick: Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources
- Jeff Bonwick: The Slab Allocator: An Object-Caching Kernel Memory Allocator
- Maged Michael: Scalable Lock-Free Dynamic Memory Allocation
- Paul Lietar et al: snmalloc: A Message Passing Allocator
- Trishul M. Chilimbi et al: Cache-Conscious Structure Definition
Rust Allocators
These are interesting Rust allocators that I'm aware of:
- bitpool
- mycelium's buddy
- wee_alloc
- sp-allocator
- basicalloc
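Swapping one of these in is pleasantly mechanical. A minimal sketch, assuming wee_alloc as the dependency; the same #[global_allocator] pattern applies to any allocator that implements GlobalAlloc.

// Replace the default global allocator for the whole program; every Box,
// Vec and String below is served by wee_alloc instead of the system malloc.
#[global_allocator]
static ALLOC: wee_alloc::WeeAlloc = wee_alloc::WeeAlloc::INIT;

fn main() {
    let greeting: String = "allocated by wee_alloc".to_string();
    println!("{}", greeting);
}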
Concurrency
This is probably my favorite area of systems programming. It's a niche that rewards an understanding of hardware, algorithms and the internal life of operating systems. Great fun.
Exclusion
Measuring Performance
One of the more difficult aspects of systems programming is figuring out how to optimize work, and sometimes even how to define what "optimize" means. Do we want clearer code, lower memory consumption, lower-latency responses, less variance in real-time responses, better throughput, etc.? What happens when optimization goals conflict?
Anyhow, here's a list of interesting resources for measuring performance:
- Performance Engineering Requires Stable Benchmarks
- CI for performance: Reliable benchmarking in noisy environments
- Criterion.rs v0.3.4 And Iai 0.1.0
- Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation
- Always-on Profiling for Production Systems
- flamegraph-rs
- COZ: Finding Code that Counts with Causal Profiling
The flamegraph-rs README has an excellent section on performance work.
Here's a list of interesting reading on performance:
- Actix Web: Optimization Amongst Optimizations
- fasthello
- Notes on io-uring
- An introduction to Data Oriented Design with Rust
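Since Criterion.rs comes up in the list above, here's a minimal benchmark sketch, assuming criterion as a dev-dependency and this file living under benches/; the fibonacci function is a stand-in workload, not a recommendation.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// A stand-in workload to measure.
fn fibonacci(n: u64) -> u64 {
    (0..n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
}

fn bench_fib(c: &mut Criterion) {
    // black_box keeps the optimizer from constant-folding the input away.
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, bench_fib);
criterion_main!(benches);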
Testing
Testing and software correctness are different but related things. "Correct" software is proven true in some important sense, like seL4. Correctness is something our software culture still struggles to achieve. Testing is a process to demonstrate the absence of error for some inputs and previous state. It's a piecemeal process, where "correctness" is global.
I'm very fond of testing tools that use randomness or exhaustiveness to explore the state space of a program, since human beings are (and I am especially) bad at coming up with failure causing inputs for programs. Here's some:
- Automated property based testing for Rust (with shrinking)
- disorderfs
- Concurrency permutation testing tool for Rust
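The last of these is the permutation-testing idea in miniature. A minimal sketch using loom as the concrete tool (the crate choice and the counter example are my own): the model runner executes the closure once per legal interleaving of the two threads, so an assertion that only fails under a rare ordering still gets caught. Such tests usually live behind a test cfg and run under cargo test.

#[cfg(test)]
mod tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn counter_is_two_under_every_interleaving() {
        // loom::model re-runs the closure for every interleaving it can find.
        loom::model(|| {
            let counter = Arc::new(AtomicUsize::new(0));
            let other = counter.clone();
            let t = thread::spawn(move || {
                other.fetch_add(1, Ordering::SeqCst);
            });
            counter.fetch_add(1, Ordering::SeqCst);
            t.join().unwrap();
            assert_eq!(counter.load(Ordering::SeqCst), 2);
        });
    }
}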
Vector Leaks Memory
The vector project is written in Rust, a memory-safe systems language. Unless you're writing Rust code with unsafe blocks or liberal std::mem::forget statements the compiler keeps track of allocations and where they've got to, ensuring that every allocation gets a free when that allocation goes out of scope. Memory leaks in Rust are, then, generally expressed either when there's unsafe code involved, there's atomic code involved or someone keeps allocating into a collection without ever draining it. That last possibility is something that Rust shares with garbage collected languages.
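A minimal sketch of that last category, entirely safe Rust and purely hypothetical, just to show the shape of the problem:

use std::collections::HashMap;

fn main() {
    // No unsafe, no std::mem::forget -- but nothing ever removes entries,
    // so resident memory climbs until the OS steps in.
    let mut seen: HashMap<u64, Vec<u8>> = HashMap::new();
    for request_id in 0u64.. {
        seen.insert(request_id, vec![0u8; 1024]);
    }
}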
Anyhow, on March 9 we got an error report from @karlmartink that vector was leaking memory with the Kubernetes logs source enabled. A little background on vector. Vector's a data movement tool, something that is intended to take data from a small army of "sources", normalize into an internal schema, transform the data as configured by the end user and egress it into another small army of "sinks". Vector's got a handful of niches it fits in. A tool like vector allows the infra wing of an engineering group to vary the details of the observability stack without impacting application software, lets you pre-process observability data prior to it crossing the network boundary (either for PCI management or because network IO is costly in your environment) or lets you move your observability data around without being particularly greedy about system resources. We aim to make vector best in the field for throughput and do so in a minimal amount of space. But sometimes that goes sideways.
The bug report was interesting. The boon to end-users of vector's configurability ends up being a challenge for testing and triage. We offer a handful of build variants, mostly to do with what libc vector's linked to, and so the main thing we need to understand what vector's up to is the user's configuration, or something representative. Very helpfully Karl-Martin gave us multiple example configurations tested across the variants we publish. The major variation between the configs was which sink the logs terminated into, seen here. When terminating into the http sink -- meaning vector ships the data out to some http speaking thing -- vector's memory use gradually climbs and does so until vector is killed for consuming too much memory. Classic leak, right? Now, when logs were sunk into the blackhole sink -- vector does nothing with the data, merely deallocates it -- vector's memory consumption was pretty well flat, so at least that was working.
Unfortunately it took me a few days to give the bug report some proper attention since I had other work ongoing when the report came in. My thoughts immediately went to the http sink being buggy and I asked Karl-Martin for telemetry relating to that sink and then spent the rest of the day reading the source code for the http sink, here. If you've done any HTTP work in Rust / Tokio with hyper the sink code here should look fairly boring, which is a good thing I'd argue. I read the code with an eye out toward any state with a lifetime that lasted longer than a single request but didn't notice any. Hopeful that the requested experiments would turn up interesting info I set this ticket back down and waited.
Three days later we got more data but the results ruined any kind of hypothesis I had. There was clearly some bug and it got worse if we flipped on vector's internal telemetry, but there was nothing to suggest that the http sink was at fault, either in the new data or my reading of the source code. I had run vector through valgrind dhat by this point with a config that aped the user's but the results were murky. At this time vector still used pre-1.0 tokio -- another engineer on the team was busy making the upgrade to 1.x when this ticket came in -- and our version of tokio made many small, short-term allocations that made it difficult to understand what I was seeing. There were three main areas of allocation according to dhat:
- tokio timers (we use a lot of timers and they used to imply allocation)
- smallish BTreeMap instances (the vector internal data model stores metadata in these)
- metrics-rs histograms
The last was a surprise to me and I discounted it as noise, especially because the paths into these allocations came through tokio. Vector relies on metrics-rs to supply our internal telemetry, anything that comes through the "internal metrics" source. Whether or not this source is enabled those metrics are collected, combined as well with tracing information. We generate a non-trivial amount of telemetry for every event that comes into vector, a situation I had looked at previously to reduce costs there. I'd fixated here on the tokio signal and resolved to pick the bug back up once the upgrade to tokio 1.x was completed, which finally happened on April 1. Once this upgrade was in vector master I re-ran my tests and found that the timer allocations were gone from dhat output, as expected, but the overall problem remained.
A day later I had managed to get a minimal example put together but:
I can observe that vector does gradually increase its memory by a few bytes here and there which does seem roughly correlated to the number of requests out that vector makes, which is why I have the batch size set so low in my config. Unfortunately there's not a direct correlation and the accumulation process does take a while as your graphs show, so we do apologize for the slow progress here. We've just finished an upgrade to vector's tokio which, as @lukesteensen pointed out to me, should resolve some known sources of fragmentation. I'll be running vector with the above config and setup under valgrind; this profile should be clearer now but it will take some time to get results.
When I wrote this my test rig was already running and I fully expected to need 24 hours to get results. Not so! The work the tokio folks put in to reduce allocations in their library really made the issue pop. After a few hours I had a lead:
The upgrade to tokio 1.0 has really paid off. @jszwedko relevant to our recent work, if you pull this massif dump open you'll see its recording vector at 1.5Gb of heap and the majority of that is in the tracing subsystem.
Massif output had suddenly become explicable after the upgrade and wow was it clear where memory was going:
1.5GiB spent in the metric_tracing_context histogram implementation. My assumption at this point was that metrics-rs had a bucketing histogram, an array of counters for points that fall within some range, and that we'd goofed our integration with metrics-rs. I had used their prometheus integration before without observing similar leak behavior and took a peek at how their exporter drained histograms. We called a function read_histogram that leaves the underlying storage of the histogram untouched where their prometheus exporter called read_histogram_clear, a function that drains the underlying storage. At this point I was still working with the understanding that metrics-rs used an array of counters internally, though there's a hint in these two functions that this understanding was not at all accurate. But, when something's a black box it's hard to shake incorrect beliefs even when in retrospect it's clear that belief is wrong. I adjusted the way vector calls metrics-rs and got hardly anywhere.
It's here I realized I needed to understand how metrics-rs actually functioned. Now, metrics-rs has this notion of a Recorder, a kind of global thing that sits in your program and catches changes to telemetry. Vector at this point had its own, but as a wrapper over the Handle concept from metrics_util. All we really did was keep track of a "cardinality counter", the total number of unique keys being tracked, and then deferred entirely to metrics-rs' code. That said, I wasn't at all familiar with the metrics-rs code base -- even if I had made a PR to it -- and it was clearly time to fix that. Turns out, I was totally wrong about how the histogram works:
We're calling read_histogram here which leaves the underlying samples in place -- the metrics-rs histogram is a linked list of fixed sized blocks of samples that grows as you add more samples -- where metrics-rs' own exporters call read_histogram_with_clear, a function that clears up the internal space of the histogram. Experimentation shows this doesn't quite do the trick but it does help some. We're behind current metrics-rs and may be hitting a bug that upstream has fixed.
I had figured that the histogram in use was this one or something like it but in fact if you trace the code paths through Handle a histogram is an Arc<AtomicBucket<f64>>! And that AtomicBucket? It's a pretty bog-standard atomic linked-list of Block<T> instances:
pub struct AtomicBucket<T> {
    tail: Atomic<Block<T>>,
}
where Atomic is from crossbeam-epoch. What's a Block<T>? It's this:
/// Discrete chunk of values with atomic read/write access.
struct Block<T> {
    // Write index.
    write: AtomicUsize,
    // Read bitmap.
    read: AtomicUsize,
    // The individual slots.
    slots: [UnsafeCell<T>; BLOCK_SIZE],
    // The "next" block to iterate, aka the block that came before this one.
    next: Atomic<Block<T>>,
}
Every node in the linked list has a fixed-size slots array that stores T instances, in this case always f64. As vector pushes data into a histogram metrics-rs will silently allocate more underlying storage as needed. Reclamation happens in clear_with. Now, freeing memory in atomic code is a Hard Problem. You can only free once, but part of the point of writing atomic code is to reduce coordination between threads. Embedded in the statement "you can only free once" is an implicit coordination. Something has to be the thing that frees and it has to be sure that when it does so there is nothing that will interact with the storage. The metrics-rs approach is to use epoch-based reclamation, explained by crossbeam-epoch like so:
An interesting problem concurrent collections deal with comes from the remove operation. Suppose that a thread removes an element from a lock-free map, while another thread is reading that same element at the same time. The first thread must wait until the second thread stops reading the element. Only then it is safe to destruct it.
Programming languages that come with garbage collectors solve this problem trivially. The garbage collector will destruct the removed element when no thread can hold a reference to it anymore.
This crate implements a basic memory reclamation mechanism, which is based on epochs. When an element gets removed from a concurrent collection, it is inserted into a pile of garbage and marked with the current epoch. Every time a thread accesses a collection, it checks the current epoch, attempts to increment it, and destructs some garbage that became so old that no thread can be referencing it anymore.
Epoch-based reclamation is cool. It's used in the Linux kernel and is high performance as these things go. This reclamation technique is similar to "Quiescent State Based Reclamation", the primary differentiator being that in QSBR it's the responsibility of the application to declare a "quiescent period" -- a point when the thread(s) are done processing and cleaning up their garbage is safe -- whereas EBR has to detect quiescent periods without application input. There's an obvious benefit to EBR if you're implementing a collection that just happens to be full of atomic goodies. The downside of these quiescent approaches -- however that quiescence is detected -- is that they'll keep accumulating garbage until such a quiet time happens. If a quiet period never happens then you'll keep allocating. A leak!
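To make the mechanism concrete, here's a minimal crossbeam-epoch sketch; it's my own toy, not metrics-rs code. A value is swapped out of an atomic slot and its destruction deferred until the epoch machinery decides no pinned thread can still observe it.

use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

fn main() {
    // One atomic slot holding a heap-allocated value.
    let slot: Atomic<u64> = Atomic::new(41);

    {
        // Pin this thread into the current epoch.
        let guard = epoch::pin();
        // Swap in a new value. The old allocation can't be freed right
        // away: another pinned thread might still be reading it.
        let old = slot.swap(Owned::new(42), Ordering::AcqRel, &guard);
        // Mark the old block as garbage tagged with the current epoch. It
        // is reclaimed only once no pinned thread can still see it -- if
        // threads never go quiet, garbage like this piles up.
        unsafe { guard.defer_destroy(old) };
    }

    // Clean up the value still held in the slot.
    unsafe { drop(slot.into_owned()) };
}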
At this point I figured we'd found it. I opened up a PR to adjust our code to call read_histogram_with_clear and get a playground to run more experiments in. We pinged @tobz with our findings so far and they confirmed we were using the correct read mechanism as well as saying:
In practice, under benchmark scenarios, there are still "quiescent" periods. As soon as you read/clear the histogram, and those blocks are no longer being read, they become eligible for reclamation. If you only have one "task" doing the read/clear, then you have no concurrent readers, and thus no contention, and thus they're almost immediately reclaimed.
That's sort of what I expected. Quiescent approaches can be memory hungry but they're only rarely leakers. We were behind current metrics-rs and I figured we'd upgrade to confirm that the issue was present on that project's tip. It was. I did manage to put together a smaller project that demonstrated the issue and, just as importantly, the conditions under which it didn't appear. When you use metrics-rs without incorporating tracing metadata everything works fine, but as soon as you do the histograms accumulate without end. Bummer.
I'm no Aaron Turon but I'm reasonably competent with atomic data structures and figured I'd need to spend the next few days understanding what the tracing integration does and then diagnosing the odd-ball behavior I was seeing. Maybe I'd add some loom tests to the code base to help figure it out. I was not looking forward to it. Atomic programming is hard and debugging these kinds of odd conditions is even harder. All the while I was debugging this, Karl-Martin was still seeing vector eat up RAM, so it wasn't an academic exercise I could play with. Moreover, there was an idea growing at the edge of my conscious thought. So, I took my dog for a walk, had dinner with my wife and went to bed early.
When I woke up the idea had been birthed: vector doesn't need perfect sampling. My teammate Bruce had put the notion in my mind, pointing out that the sole consumers of this data were our sinks. I had seen that the sinks took the sample stream coming out of metrics-rs and processed it into buckets. It makes sense for metrics-rs to collect every sample, since it can't know how the data is going to be exported, but vector absolutely knows how it will export the data. So, why not just collect the histogram data in the way we're going to export it? I felt reasonably confident that the problem was in the metrics-rs histogram and that if we stopped using that histogram the problem would go away.
I spent the rest of the day doing just that. While metrics-rs' loose coupling can make its use a little confusing the first time you encounter the library, it's astonishingly easy to swap bits and pieces out as needed. I replaced metrics-rs' Handle with a vector-specific Handle. The gauge and counter implementations are quite similar to upstream but the vector histogram is a boxed fixed-size array of atomic integers. Even if the type ends up being a little wide in the end it's still a guaranteed fixed size per histogram, no matter how un-quiescent vector is.
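For flavor, here's a minimal sketch of the fixed-storage idea; the type, field names and bucket bounds are hypothetical, not vector's actual Handle. Recording a sample touches one pre-allocated counter and never allocates.

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical type for illustration only.
struct FixedHistogram {
    bounds: Vec<f64>,          // upper bound of each bucket
    counts: Box<[AtomicU64]>,  // one counter per bucket, plus an overflow slot
}

impl FixedHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts = (0..=bounds.len())
            .map(|_| AtomicU64::new(0))
            .collect::<Vec<_>>()
            .into_boxed_slice();
        FixedHistogram { bounds, counts }
    }

    fn record(&self, value: f64) {
        // Find the first bucket that holds the value; the final slot
        // catches everything larger. Storage never grows, no matter how
        // many samples arrive or how busy the process stays.
        let idx = self
            .bounds
            .iter()
            .position(|b| value <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx].fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let hist = FixedHistogram::new(vec![0.5, 1.0, 5.0]);
    for sample in [0.1, 0.7, 3.0, 100.0] {
        hist.record(sample);
    }
    let totals: Vec<u64> = hist.counts.iter().map(|c| c.load(Ordering::Relaxed)).collect();
    println!("bucket counts: {:?}", totals);
}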
To my relief, it worked out. Vector went from allocating without bound to consuming just over 6MiB pretty constantly, the bulk of that actually being in pre-compiled regexes.
Our own custom Handle didn't make vector any faster but it now sips at RAM. When I'd disabled the code paths that enabled tracing integration I'd see vector commonly use just shy of 200MiB in tests, so down to 6MiB is a substantial win for our end users.
As I write this I'm still waiting for results back from Karl-Martin but I have high hopes. At the very least one of the bugs contributing to that issue is solved.
Fun stuff.
Writing and Talking
Over the years I've written and talked about a number of things, mostly technical. I haven't been very careful about keeping an index of these things, since to me they feel like a kind of rocket exhaust. Anyhow.
Writings
"Build Good Software: Of Politics and Methods"
I wrote this essay in 2017 to accompany a keynote I was invited to give at Lambda Days. I think a good deal of the software industry is dominated by a kind of unreflective techno-utopianism which I find to be both deleterious to the public good and harmful to the process of becoming good at writing software as a civilization. There are seeds in this work of my later interest in communal anti-mammon Christian social thought, though I didn't really have a sense of this at the time of writing. That would only come after a serious reading of Pr. Dietrich Bonhoeffer's "Nachfolge".
Anyhow, the essay is still something that I'm proud of and I'd probably approach it in a broadly similar fashion today.
You can read it online here
"Hands-On Concurrency with Rust"
This is my first book for Packt. You can find it here. I wrote about the process of writing it here, why and how I did it. Fun project.
"Systems Programming for Normal People"
This is my in-progress book for No Starch Press. The idea is to introduce systems programming to practicing software engineers that may not have had the opportunity to work down-stack but would like an inroads. I am also trying to orient the book toward what I see as the reality of systems programming today: migration of complex kernel workflows into a hybrid userspace model, minimization of memory copies, containerization and distribution.
Talks
The Charming Genius of the Apollo Guidance Computer
This is probably my most watched talk. I had previously spoken at Erlang Factory conferences on, well, Erlang things and was invited by the Code Mesh folks to let loose a little. I was very interested in the hardware of the Apollo project at the time -- as a result of David Mindell's "Digital Apollo" book -- and spoke about the architecture of the Apollo guidance computer, a quirky little machine even for the time. There are some factual errors in the talk but overall I'm still a fan. London is quite a bit around the world from Berkeley, California and I was very tired. I've subsequently learned how to hide my notes from the audience, allowing me to avoid weird little goofs, but I hadn't hit on them here yet. Ah well, such is the peril of live performance. Amusingly this is one of the few talks I've given twice and the second time I had hit on my note method but the secondary screen failed, denying me my notes yet again.
You can view this talk online. The second version is here.
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step
This is my favorite talk I've ever given. In 40 minutes I give a history of aeronautics from the Wright Brothers' flight through to the Moon landing. I recall my goal at the time was to describe how long-term technological change happens -- something I'm very interested in -- as well as push just how much information I could deliver to an audience in one sitting. If I remember correctly there are some 200 slides in the talk, so it's more like a very slow animation. You can view this talk online.
I later gave a PechaKucha variant of this talk, but not in public.
Build Good Software: Of Politics and Methods
More detail in this section.
You can view this talk online.
Why Things Fail
Bruce Tate and I were doing a video series for a while called "Why Things Fail". One of my research interests is the failure of complex systems and this series is me and Bruce having a conversation about one serious failure in-depth. We've recorded three videos so far. We had planned to talk about the Dust Bowl but that got interrupted.
- Why Things Fail: The Bug Heard 'round the World
- Why Things Fail: The Great Molasses Flood of 1919
- Why Things Fail (and what we can do about it)
andweorc - a causal profiler
2021/12/17
Causal profiling, at least to my knowledge, was introduced by Curtsinger and Berger in their 2015 paper COZ: Finding Code that Counts with Causal Profiling. Traditionally profilers measure the CPU time taken by code and report on that. This works well when CPUs are relatively deterministic and programs aren't multi-threaded in themselves. On a modern machine all programs are parallel programs, in the sense of Amdahl's law applying. Even a totally serial program -- no threads -- is still running on a superscalar, out-of-order device, hence tricks around loop fission and what not. The "serialization" points of our programs have an outsized impact on total program performance in a way that is related to but not totally explained by CPU time. A very sleepy part of a program might drive its overall runtime while a busy part of the program lights up a traditional profiler. Think of a busy-loop in one thread waiting on an atomic bool to flip before causing the program to exit, and a sleep followed by a bool flip in another thread. Causal profiling aims not to indicate where a program spent its CPU time but, instead, to explain to what degree the performance of the program would change if such and such line's performance were changed. That way we, the software engineers, learn where to focus our effort from the computer, rather than guessing whether such and such change would materially improve total-program performance.
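That busy-loop example is easy to make concrete. In the sketch below, a toy of my own construction, the spinning thread consumes nearly all the CPU time a traditional profiler would report, while the total runtime is determined entirely by the sleeping thread:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

static DONE: AtomicBool = AtomicBool::new(false);

fn main() {
    // The sleepy thread decides when the program ends.
    let sleeper = thread::spawn(|| {
        thread::sleep(Duration::from_secs(5));
        DONE.store(true, Ordering::Release);
    });
    // The busy loop burns nearly all the CPU time a traditional profiler
    // will report, but speeding it up does nothing for total runtime.
    while !DONE.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    sleeper.join().unwrap();
}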
Does causal profiling work?
It does. You can experiment with the coz program yourself. This software depends on libelfin which doesn't understand more recent DWARF versions, but if you use an old-ish Ubuntu version you should be alright. The paper linked above describes the experience of using coz, as well, if you can't get it to function.
How does a coz-like causal profiler work?
How coz itself works is really interesting but, more generally, how does any causal profiler work? The key insight to causal profiling is that, while you can't speed up program sub-components, you can slow them down, and if you refrain from slowing down some sub-component while slowing all the rest down you've "sped" that sub-component up. A causal profiler uses this insight plus some mechanism to slow down the sub-components -- line, function, module etc -- you're investigating, a table of "speedup" factors to apply, some "progress points" to track during execution and some coordinator to keep track of which bits have been "sped up" and their impact on the progress points, with enough repeats of the same speedup to get statistically interesting results.
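Here's a toy, single-process sketch of that insight; it's my own construction, not coz's mechanism. Slow every component except the one under investigation by a fixed factor, compare progress against a uniformly slowed baseline, and the component whose sparing buys back the most time is the one most responsible for total runtime. Sleep granularity makes the numbers rough, but the ordering comes out right.

use std::thread;
use std::time::{Duration, Instant};

// Toy "components" standing in for parts of a real program.
fn parse() { thread::sleep(Duration::from_millis(3)); }
fn render() { thread::sleep(Duration::from_millis(1)); }

// Run the workload, slowing every component *except* `spared` by `factor`
// of its baseline cost. Sparing one component while slowing the rest is
// the relative "speedup" described above.
fn run(spared: &str, factor: f64, iters: u32) -> Duration {
    let pad = |base_ms: f64| Duration::from_micros((base_ms * factor * 1000.0) as u64);
    let start = Instant::now();
    for _ in 0..iters {
        parse();
        if spared != "parse" { thread::sleep(pad(3.0)); }
        render();
        if spared != "render" { thread::sleep(pad(1.0)); }
        // Each loop iteration stands in for a "progress point".
    }
    start.elapsed()
}

fn main() {
    let baseline = run("", 0.25, 50); // slow everything uniformly
    let spare_parse = run("parse", 0.25, 50);
    let spare_render = run("render", 0.25, 50);
    // The more time sparing a component buys back relative to baseline,
    // the more of total runtime that component is responsible for.
    println!("sparing parse saves  {:?}", baseline.saturating_sub(spare_parse));
    println!("sparing render saves {:?}", baseline.saturating_sub(spare_render));
}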
The way coz achieves this is by managing experiments with a "global profiler". This thing runs in its own OS thread, polls the progress points and keeps a record of the speedup experiments already run and their effect on the program. Through an LD_PRELOAD override coz injects overridden POSIX and pthread functions to
- make select functions participate in ongoing experiments (by sleeping for an experimentally determined time, ie, slowing down) and
- start a per-thread profiler complete with timer interrupt.
This per-thread profiler lives at the "top" of every thread, collecting Linux perf samples -- instruction pointer, callchain -- and sleeping in coordination with the global profiler, reporting results back to it. The interrupted thread goes back to normal operation once the interrupt is handled. Because of the use of Linux perf, coz can only understand whatever an instruction pointer points to that can also be resolved into a symbol, thus a line. You can imagine an alternative implementation that fiddles with functions at the compiler level to insert a delay period, or more invasive application-level changes to get the same result, so long as you can preserve the interrupt timer notion.
What is andweorc?
The andweorc project is my attempt to adapt the coz idea into a workflow that looks something more like what we have with criterion. I focus on single-machine performance concerns at Datadog and have been working pretty single-mindedly on Vector since I joined on. The Vector team is excellent and we've made some serious strides in improving Vector's throughput performance. For instance, we have statistically stable, reasonably fast integrated performance monitoring and regression detection for each PR, called soak tests. Super useful. They chuck out comments like this that inform us whether our PR changes have adjusted performance and with what statistical certainty. Very handy, but it doesn't tell us why or what to do about why. That's the job of a causal profiler.
For causal profiling to work and work well for something like the Vector project it has to hook up well with our CI, and to hook up with our CI it needs to be cargo friendly. I have a hunch that by building a causal profiler that relaxes some of the coz constraints we can get a really useful tool for Vector and Rust generally. Specifically I'm thinking:
- support of non-Rust languages is a non-goal
- CI / CLI mungable results are key
- automatic diffing in profile-points between versions is key
- requiring the user to modify their program is o-kay.
So far I've got to the point in the repo where I have all the hard bits and bobs proved -- save one, see below -- and roughly linked together. The code is still very, very rough and the dream of it is still mostly in my mind, I think, but the outline's clear in a way it wasn't, say, a year and change (and two abandoned repos) ago when I first started seriously thinking about this attempt.
Why can't I use andweorc now?
Well, there are two important things missing in the implementation, three if you count the cargo runner. The first is I don't have the "experiment" notion built, but the global-profiler exists and the tracking for that should be relatively easy to piece together. I've already proved out resolving callchains to symbols to my satisfaction, so what remains is setup. The really hard, missing piece is interruption. I need to set up a timer on an interval per thread, have that timer send a signal to the thread and then delay (potentially), collect samples and ship them up to the global-profiler. That's missing.
And! It turns out it's kinda hard to do that today in Rust. Relevant bugs against nix and libc:
- https://github.com/nix-rust/nix/issues/1424
- https://github.com/rust-lang/libc/issues/2576
Anyway, progress is halted on andweorc while I poke at those. The eagle-eyed reader of the codebase will note that I'm also using a branch of perf-event to pull samples from Linux perf, but that seems less rough to manage than the signal thing.
Build Good Software: Of Politics and Methods
Thank you to Hope Waggoner and Mike Sassak for their kind review of this essay. It wouldn't be half what it is without their help.
I'd like to speak a word for good software systems. I would like here to discover the meaning behind "good" and put forward my idea of how we can go about achieving it within the context in which we work. I take here as inspiration Henry David Thoreau, an American philosopher. Thoreau worked in the 19th century, before my nation's Civil War. His contemporaries held him to be a crank, an idler who lived in an odd manner and did very little work. Though the first half is true I take issue with the second. Thoreau's work was Abolition, done for the most part in private on account of its illegality. We can forgive Thoreau's contemporaries for confusing him with a bean cultivating idler. It seems to me that Thoreau -- in his views on civil society, individual behavior and the influence of invention on both -- is an exceptionally important philosopher for an age of techne.
Thoreau's most influential essay is his "On the Duty of Civil Disobedience". Thoreau posits that government is a machine of sorts, mankind -- voluntarily or not -- used as the works. The start to this essay is well-known:
I heartily accept the motto, "That government is best which governs least;" and I should like to see it acted up to more rapidly and systematically. Carried out, it finally amounts to this, which also I believe, "That government is best which governs not at all;" and when men are prepared for it, that will be the kind of government which they will have.[1]
You'll find this excerpted, trotted out in defense of the "shrinking" of government or its anarchical overthrow, depending. Excerption loses the importance that Thoreau places on "when men are prepared for it."
Government is at best but an expedient.[2]
Governments are a tool, "the mode which the people have chosen to execute their will". Government, to the American notion of it, is a set of norms and laws that loosely bind a people together and to it. It is the Will of the People, viewed in the ideal American fashion, that seeks out Justice and Freedom. Yet, this is not so in practice. The flaw of Government is the flaw of the People, especially as it is "a sort of wooden gun to the people themselves".
The authority of government, even such as I am willing to submit to (...) is still an impure one: to be strictly just, it must have the sanction and consent of the governed.[3]
Government, in the Thoreauvian sense, exists to carry forward the norms of the People. It does so without examination of these norms.
But a government in which the majority rule in all cases cannot be based on justice, even as far as men understand it.[4]
The norms of a political body, in Thoreau's analysis, move forward into the future by their own means, disconnected from moral impulse. To make one's self subordinate to the political body is to make one's self subordinate to these means, to smother your own sense of right from wrong. "We should be men first, and subjects afterward," Thoreau declares. That is, we ought, as individuals, to seek out Good as we know it. To effect Good we must have the help of others, some kind of political body to pool resources and action. Mass enough people into this political body and you'll find the outlines of a Government. To that end:
I ask for, not at once no government, but at once a better government. Let every man make known what kind of government would command his respect, and that will be one step toward obtaining it.[5]
What has this got to do with software systems? Well, when we talk about ourselves, we speak of "communities". Do we not organize among ourselves to pool our resources? I see here a faint outline of a body politic. In Thoreau's spirit of making known, I would like to examine two fundamental questions in the development of software today:
- How do we make software that makes money?
- How do we make software of quality?
There is a tension here, in which tradeoffs in one reflect in the other. A dynamic balance between risk and profit and craft is at play when we cozy up to our keyboards. I wager that we all, at some point in our careers, have faced obligation to ship before completing a software project to our satisfaction. I've shipped software that I did not have complete confidence in. Worse, I've shipped software that I did not believe was safe. This, for want of testing or lived experience, driven by deadlines or a rush to be first to market. Compromise weighted with compromise. "How do we make software that makes money?" embeds the context we find our work placed in: economic models that tie the safety of our lives to the work of our hands. Every piece of software that we write -- indeed, every engineering artifact generally -- is the result of human creation. This creation is produced by the human culture that sustains and limits it, the politics of its context. What is made reflects the capability of those that made it and the intentions of those that commissioned it. The work of our hands holds a reflection of the context in which we worked.
Two instances of this come to mind, both well-studied. I'll start near to my lived experience and move out. The Bay Area Rapid Transit or BART is the light-rail train in the San Francisco Bay Area, meant,
to connect the East Bay suburban communities with the Oakland metropolis and to link all of these with San Francisco by means of the Transbay Tube under San Francisco Bay.[6]
per the Office of Technological Assessment's study "Automatic Train Control in Rail Rapid Transit". Automation, where applied to tedious, repetitive tasks, eliminates a certain class of accident. This class of accident has in common failure owing to sudden loss of focus, mistaken inputs or fatigue: these are human failures. Automation, when applied to domains needing nuanced decisions, introduces a different kind of accident: inadequate or dangerous response to unforeseen circumstance. What automation lacks is nuance; what humans lack is endurance. Recognizing this, systems that place humans in a supervisory role over many subsystems performing tedious, repetitive tasks are designed to exploit the skill of both. Automation carries out its tasks, reporting upward toward the operator who, in turn, provides guidance to the executors of the tasks. Such systems keep humans "in the loop" and are safer than those that do not. The BART, as designed, has a heavy reliance on automation, giving human operators "no effective means" of control. The BART operator is only marginally less along for the ride than the train passengers
... except to bring the train to an emergency stop and thus degrade the performance (and perhaps the safety) of the system as a whole."[7]
The BART's supervisory board disregarded concerns with over-automation in a utopian framing common to California. It is axiomatic to this technical culture that technology is, in itself, a Good and will bring forward Good. Irrigation greens the desert, bringing fertile fields and manicured cities out of sand. The Internet spans us all, decentralizing our communication from radio, books, newspapers and TV, democratizing it in the process. Technology, the thinking goes, applied in a prompt manner and with vigor, will necessarily improve the life of the common man. Even death can be obsoleted! Never mind, of course, the Salton Sea blighting the land, made for want of caution. Never mind communication re-centering on Facebook, becoming dominated again from the center, but now by opaque voices. Progress is messy!
Concerns centered especially around the BART's Automatic Train Control system. The ATC controls the movement of the train, its stopping at stations, its speed and the opening of its doors. The Office of Technological Assessment study declares the ATC to be "basically unsafe". Holger Hjortsvang, an engineer for the in-construction BART, said of the ATC's specification:
[it] was weakened by unrealistic requirements . . . "terms like: 'The major control functions of the system shall be fully automatic . . . with absolute assurance of passenger and train safety, high levels of reliability . . . and 'the control system shall be based on the principles which permit the attainment of fail-safe operation in all known failure modes.' This is specifying Utopia!"[8]
To demand no realistic safety norms for a system invites a kind of blindness into all involved in its construction. The technical side of the organization will tend to view the system with optimism, becoming unable to see modes of failure. The political side of the organization will devalue reports of possible failures requiring reconsideration of said system. This is a general pattern of techno-political organizations. True to this, the Board of Supervisors devalued reports of the ATC's unreliability.
As early as 1971, the three BART employees in question became concerned with the design of the system's ATC (automatic train control). As the story unfolded, these engineers' fears eventually became public and all three were fired.
The BART management apparently felt that its three critics had jumped the gun, that the bugs in the system were in the process of being worked out, and that the three had been unethical in their release of information. [9]
Inconvenient truths are conveniently pushed aside by denying the validity of the messenger and thereby the message. The employment relationship offers an immediate method of devaluation: termination. This is a huge disparity of power between employee and employer. Such disparity lends weight in favor of the "this is fine" political narrative. Yet, the system retains its reality, independent of the prevailing narrative.
... less than a month after the inauguration of service when a train ran off the end of the track at the Fremont Station. There were no fatalities and only minor injuries, but the safety of the ATC system was opened to serious question.[10]
No one at the BART set out to make a dangerous train. It happened because of the nature of the BART's governance. A techno-political organization that is not balanced in its political / technical dynamic will lurch from emergency to emergency. The reality of the underlying system will express itself. The failure of the ATC was not a disaster: no one died. But the failures of the BART's decision making process were made open to the public in a way that they were not previous to the accident. It is very hard to hide a train that has gone off the rails. The BART supervisors wished to deliver a train to the public that voted to build it and would be punished by that same public for being late. Blindness had set in and the train would be safe-enough. The engineers were not under the same pressure from the public and instead were seeking to deliver a safe train, ideally on time but late is better than dead. Once public, this imbalance in the BART was addressed through strengthening the technical staff of the BART politically and by introducing redundancies into the mechanism of the ATC. Yet, to this day, the BART remains a flawed system. Failures are common, limiting efficient service during peak hours. No one dies, of course, which is the important thing. An introspective organization -- like the BART -- will recognize its flawed balance and set out to correct itself. Such organizations seek a common understanding between their technical and political identities, even if the balance ultimately remains weighted toward the political end. The technical is "in the loop": the nexus of control is not wholly in the political domain. In a perfect world this balance would exist from the outset and be reflected in the technical system. But, late is better than dead.
Organizations which achieve some balance between the technical and political are the ideal. Such organizations allow the underlying technical system to express its real nature. That is, the resulting system will only be as safe as its design allows it to be. Every system carries in its design a set of inevitable accidents. This is the central thesis of Charles Perrow's "Normal Accidents: Living with High-Risk Technologies".
The odd term normal accident is meant to signal that, given the system characteristics, multiple and unexpected interactions of failure are inevitable. This is an expression of an integral characteristic of the system, not a statement of frequency. It is normal for us to die, but we only do it once. System accidents are uncommon, even rare; yet this is not all that reassuring, if they can produce catastrophes.[11]
The "system characteristics" Perrow mentions are quite simple: interactive complexity and tight coupling between system components. Every system viewed with omniscience is comprehensible, in time. No one person is omniscient, of course, and operators must make due with a simplified model of their system. System models are constructed of real-time telemetry, prior domain expertise and lived experience. They are partially conceived in the design stage of the system and partially a response to the system as it is discovered to be. Ideally, models capture the important characteristics of the system, allowing the operator to build an accurate mental model of the running system. The accuracy of this mental model determines the predictability of the system by a given operator. It is by prediction, and prediction alone, that we interact with and control the things we build. Mental models for simple systems -- say, a dipping bird hitting a button on a control panel -- are straightforward. Consider only the dipping bird and the button and we have high confidence in predictions made about the system. Consider also the system under the control of our dipping bird -- say, a small-town nuclear power plant -- and our predictive confidence drops. Why? The power plant is complex. It is a composition of many smaller subsystems interacting semi-independently from one another. The subsystems, individually, demand specialized and distinct knowledge to comprehend. The interactions between subsystems demand greater levels of knowledge to comprehend and, worse, may not have been adequately explored by the designers. Linear interactions -- where one subsystem affects the next which affects the next -- are ideal: they are straightforward to design and reason about. Linear interactions predominate in well-designed systems. Non-linear interactions often cannot be avoided.
[T]hese kinds of interactions [are] complex interactions suggesting that there are branching paths, feedback loops, jumps from one linear sequence to another because of proximity (...) The connections are not only adjacent, serial ones, but can multiply as other parts or units or subsystems are reached.[12]
Or, more succinctly:
Linear interactions are those in expected and familiar production or maintenance sequence, and those that are quite visible even if unplanned.
Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences, and either not visible or immediately comprehensible.[13]
Of note here is the implicit characteristic of unknowing. Complex interactions are not designed but are an emergent system property with unknown behavior. Complex interactions in a system restrict the human operators' ability to predict the system's reaction in a given circumstance. Of importance is the coupling between subsystems, a familiar concept in the construction of software.
Loose coupling (...) allows certain parts of the system to express themselves according to their own logic or interests. (...) Loosely coupled systems, whether for good or ill, can incorporate shocks and failures and pressures for change without destabilization.[14]
Loosely coupled subsystems are not independent but have a tolerance for error that tightly coupled subsystems do not. Interdependence around time, invariant sequencing and strict precision in interaction make for tightly coupled subsystems. Coupling has great importance for recovery from failure.
In tightly coupled systems the buffers and redundancies and substitutions must be designed in; they must be thought of in advance. In loosely coupled systems there is a better chance that expedient, spur-of-the-moment buffers and redundancies and substitutions can be found, even though they were not planned ahead of time.[15]
It is not possible to plan for every failure a system will encounter. A well-designed system will make the probability of a system accident low, but that is the best that can be done. A complex system is one in which no one person can have a perfect mental model of said system. Complex systems are not necessarily a function of bad design. Rather, they are complex because they address some complex social need: power generation, control of financial transactions, logistics. Complex systems are an artifact of a political decision to fund and carry out the construction of some solution to a perceived need. They are what C. West Churchman called solutions to "wicked problems":
(...) social problems which are ill formulated, where the information is confusing, where there are many clients and decision-makers with conflicting values, and where the ramifications in the whole system are thoroughly confusing. (...) The adjective 'wicked' is supposed to describe the mischievous and even evil quality of these problems, where proposed 'solutions' often turn out to be worse than the symptoms.[16]
This brings us around to the second example of the political context of a system affecting its operation. The Reaktor Bolshoy Moshchnosti Kanalnyy (RBMK) nuclear reactor is a Soviet design, intended to use normal water as coolant and graphite as moderator for a lightly enriched uranium fission reaction. This design is cheap -- explaining why many were built -- but suffers from a serious defect: the reactor requires active cooling. Without power, unless otherwise specially prepared, the reactor enters a positive void coefficient feedback loop. The cooling water heats and flashes into steam, and because the water absorbs neutrons that the graphite-moderated reaction would otherwise lose, less water means a hotter reaction. The added heat flashes more water into steam, further increasing the reactivity. This cycle continues until the reactor vessel is breached.
The most famous RBMK reactor is no. 4 in the Chernobyl complex. This reactor exploded on 26 April 1986, having been driven by a combination of political demand and operator action into an explosive feedback loop. The accident occurred during an experiment into the generation of electricity for the purposes of emergency cooling. As Grigori Medvedev explains in his "The Truth About Chernobyl":
If all power is cut off to the equipment in a nuclear power station, as can happen in normal operations, all machinery stops, including the pumps that feed cooling water through the reactor core. The resulting meltdown of the core is a nuclear accident of the utmost gravity.
As electricity must be generated by any means available in such circumstances, the experiment using the residual inert force of the turbine is an attempt to provide a solution. As long as the turbine blades continue to spin, electricity is generated. It can and must be used in critical situations.[17]
The techno-political administration responsible for drawing up output plans for generation facilities in the Soviet Union was "staffed by some well-trained and experienced people", but key decision makers were people working in an unfamiliar field, in pursuit of "prestige, money, convenience."
Yu. A. Izmailov, a veteran of Glavatomenergo, the central directorate for nuclear power, used to joke about it: "Under Veretennikov it was practically impossible for us to find anyone in the central directorate who knew much about reactors and nuclear physics. At the same time, however, the bookkeeping, supply and planning department grew to an incredible size."[18]
The Chernobyl facility and other generation facilities were administered by a combination of the sloppily inept and those cowed into silence for fear of being replaced by more compliant technicians. The experimental program drawn up for the Chernobyl no. 4 test was written not with an eye toward safety but toward political success. A successful test of the inertial spin-down method would demonstrate the superior management and the daring, proper spirit of the plant's chief engineer, Nikolai Fomin. The experimental program intentionally switched off safety systems prior to engaging the test to give "pure" results. From Medvedev:
- The protection systems triggered by the preset water levels and steam pressure in the drum-separators were blocked, in an attempt to proceed with the test despite the unstable condition of the reactor; the reactor protection system based on heat parameters was cut off.
- The MPA protection system, for the maximum design-basis accident, was switched off, in an attempt to avoid spurious triggering of the ECCS during the test, thereby making it impossible to limit the scope of the probable accident.
- Both emergency diesel-generators were blocked, together with the operating and start-up/standby transformers, thus disconnecting the unit from the grid... [19]
The RBMK reactor was designed to fill a planned need for cheap electricity and the compromises inherent in its design to achieve this aim were irreparable. There is trouble with unsafe systems made "safe" by augmentation, rather than fundamental redesign. Operators can, at their discretion or through coercion, disable safety devices. Perrow notes in Normal Accidents: "Safety systems (...) are necessary, but they have the potential for deception. (...) Any part of the system might be interacting with the other parts in unanticipated ways." A spacecraft's emergency escape system may be accidentally triggered by an elbow, say. Or, a software threshold alarm might fire during a promotion due to increased customer demand but lead operators, unaware of the promotion, to throttle traffic. Procedures go out of date or are poorly written from the start. From David E. Hoffman's "The Dead Hand":
One (Chernobyl) operator (...) was confused by the logbook (on the evening before the 26 April experiment). He called someone else to inquire.
"What shall I do?" he asked. "in the program there are instructions of what to do, and then a lot of things crossed out."
The other person thought for a minute, then replied, "Follow the crossed out instructions."[20]
The BART ATC, though flawed, was made safe by incorporating non-negotiable redundancies into its mechanism. Such an approach cannot be taken with the RBMK reactor. Correction of its flaws requires fundamental redesign of the type. Such systems persist only so long as the balance between technical and political is held and even then this is no guarantee that a low probability event will not occur. Chernobyl had no such balance.
Perrow advocates for a technical society which will refuse to build systems whose catastrophic risk is deemed too high. This is admirable but, I believe, ultimately unworkable given the employment issue discussed above, in addition to an implied separation between political and technical aims which does not exist in practice. When asked to construct something which, according to political constraints, will not be fit for purpose I might choose to refuse but someone else may not. My own hands being clean does not mean good has been done. More, the examples given above are outsized in their scope -- a faulty train for a metro area, a nuclear volcano -- and the implications of their failure are likely beyond the scope of what most software engineers work on. Private concern for the fitness of some small system might be kept private with the perception that its impact will be limited. There are also development ideologies that stress do now, think later approaches, most typified by the mantra "Move Fast and Break Things". These objections are valid in the small but contribute to a slow-motion disaster in aggregate. Consider how many legacy software systems there are in the world which are finicky, perform their function poorly and waste the time of users by crashing. How many schemes are made to replace such systems -- to finally do things right -- only for this aim to be frustrated by "temporary" hacks, tests that will come Real Soon Now or documentation that will never come? What's missing here is a feeling for what Hans Jonas in his "Imperative of Responsibility" called the "altered nature of human action":
All previous ethics (...) had these interconnected tacit premises in common: that the human condition, determined by the nature of man and the nature of things, was given once and for all; that the human good on that basis was readily determinable; and that the range of human actions and therefore responsibility was narrowly circumscribed. [21]
Jonas argues that the tacit premise of human action existing in an inviolable world has been broken by the effective scale of modern technology. Humanity -- able to remake its environment on a lasting, global scale -- has rendered existing ethics inadequate, ethics that measure action with no temporal component.
(T)echnological power has turned what used and ought to be tentative, perhaps enlightening plays of speculative reasoning into competing blueprints for projects, and in choosing between them we have to choose between extremes of remote effects. (...) In consequence of the inevitably "utopian" scale of modern technology, the salutary gap between everyday and ultimate issues, between occasions for common prudence and occasions for illuminated wisdom, is steadily closing. Living now constantly in the shadow of unwanted, built-in, automatic utopianism we are constantly confronted with issues whose positive choice requires supreme wisdom -- an impossible situation for man in general, because he does not possess that wisdom (...) We need wisdom the most when we believe in it the least.[22]
Jonas' concern is with the global environment and the looming disaster coming with regard to such. "Mankind Has No Right to Suicide" and "The Existence of 'Man' Must Never Be Put at Stake" are eye-catching section titles. Jonas concludes that the present generation has an imperative responsibility to ensure the next generation's existence at no less a state than we enjoy, without forfeiting said future existence. It is a detailed argument and well worth reading. Of interest to this essay is the association by Jonas of progress with "Baconian utopianism" as well as the logical framework that Jonas constructs to reach his ultimate conclusion. Progress is an ideal so deeply embedded in our society that it's axiomatic. Progress is broadly understood as an individual process, discussed on individual terms. The individual strives to discern knowledge from wisdom, to act with justice. These individual aims are then reflected into cooperative action but cooperative action will, necessarily, be tainted by those that lack wisdom or do not hope for justice. Thoreau's thoughts sit comfortably here. In Thoreau there is also a broader sense of "progress", as elsewhere in post-Enlightenment Western thought.
While there is hardly a civilization anywhere and at any time which does not, or did not, speak of individual progress on paths of personal improvement, for example, in wisdom and virtue, it seems to be a special trait of modern Western man to think of progress preeminently as an attribute -- actual or potential -- of the collective-public realm: which means endowing this macrodimension with its transgenerational continuity with the capacity, the disposition, even the inbuilt destination to be the substratum of that form of change we call progress. (...) The connection is intriguing: with the judgement that the general sense of past change was upward and toward net improvement, there goes the faith that this direction is inherent in the dynamics of the process, thus bound to persist in the future -- and at the same time a commitment to this same persistence, to promoting it as a goal of human endeavor. [23]
That this becomes bound up with technological progress should be no surprise. Especially once the Industrial Revolution was well-established, most new technology enabled greater and greater material comfort. Technology became associated with the means toward Progress, with Progress in itself. Note the vigorous self-congratulation of the early industrialist or the present self-congratulation of the software engineer. Marxist thought counters that this Progress is only true if you are able to afford it and it's hard to disagree. Of note is that Marxist thought does not reject the connection of technology with progress; it contests only which political / economic system most efficiently brings Technological Progress about. In Jonas' analysis the ultimate failing of this ideal of progress is that, while comfort may be gained, it comes at the expense not only of the comfort but of the very existence of future generations: successes become greater and greater but the failures, likewise, grow in scope. Our ethics are unable to cope with these works.
(T)hese are not undertaken to preserve what exists or to alleviate what is unbearable, but rather to continually improve what has already been achieved, in other words, for progress, which at its most ambitious aims at bringing about an earthly paradise. It and its works stand therefore under the aegis of arrogance rather than necessity...[24]
Existing ethics are "presentist": that is, they are concerned with actions in the present moment between those who are now present. In such an ethics, it is no less morally laudable to sacrifice in the present for the well-being of the future than it is to sacrifice the well-being of the future for the present. What Jonas attempted to construct was an ethics which had in itself a notion of responsibility toward the future of mankind as a whole. The success of the project is, I think, apparent in the modern sense of conservation that pervades our thinking about pollution and its impact on the biosphere. The failure of the project is also apparent.
Jonas' logical framework is independent of the scope of his ultimate aim. This framework is intuitively familiar to working engineers, living, as we do, with actions which, though applied today, will not come to fruition for quite some time.
(D)evelopments set in motion by technological acts with short-term aims tend to make themselves independent, that is, to gather their own compulsive dynamics, an automotive momentum, by which they become not only, as pointed out, irreversible but also forward-pushing and thus overtake the wishes and plans of the initiators. The motion once begun takes the law of action out of our hands (...)[25]
Every technical construction, as we have established, is some reflection of the political process that commissioned it. This technical artifact will go forward into the future, eclipsing the context in which it was made and influencing the contexts that are to come. The BART of the 1960s was intended to run infrequently -- for morning and evening commutes -- and to service lower middle income areas. The BART of the present period runs for twenty hours a day and the areas around its stations have become very desirable, attracting higher income residents and businesses. The Chernobyl disaster, most immediately, destroyed the planned city of Pripyat but has left a centuries long containment and cleanup project for Europe. No technology is without consequence. We see this most clearly in software with regard to automating jobs that are presently done by people. Whole classes of work which once gave means to millions -- certain kinds of clerical work, logistics, manufacturing -- have gone with no clear replacement. Such "creative destruction" seems only the natural order of things -- as perhaps it is -- but it must be said that it likely does not seem so natural if you are made to sit among the destruction.
What is wanted is some way of making software well. This has two meanings. Let's remind ourselves of the two questions this essay set out to address:
- How do we make software that makes money?
- How do we make software of quality?
In the first sense of "well" we treat with the first question. Restated, we wish to make software whose unknown behaviors are limited so that we can demonstrate fitness for purpose and be rewarded for our labors. In the second sense of "well" we treat with the second question. What we wish to make is software whose unknown consequences are limited. This latter sense is much more difficult.
How do we restrict unknown behavior in our software? Per Perrow there will always be such and I believe that we must look, today, at those working in high-criticality software systems for some clue of the way forward. I think it will be no controversial thing to say that most software systems made today are not as good as they could be. Even with great personal effort and elaborate software development rituals -- red/green testing, agile, scrum, mob, pair, many eyes make all bugs shallow, benevolent dictatorship, RUP, RAD, DSDM, AUP, DAD etc. etc. -- most software is still subpar. Why? In 1981 STS-1 -- the first flight of the Space Transportation System, as the Space Shuttle was officially known -- was stalled on the launch pad for want of a computer synchronization. Per John R. Garman in "The 'BUG' Heard 'Round the World":
On April 10, 1981, about 20 minutes prior to the scheduled launching of the first flight of America's Space Transportation System, astronauts and technicians attempted to initialize the software system which "backs-up" the quad-redundant primary software system ...... and could not. In fact, there was no possible way, it turns out, that the BFS (Backup Flight Control System) in the fifth onboard computer could have been initialized properly with the PASS (Primary Avionics Software System) already executing in the other four computers. There was a "bug" - a very small, very improbable, very intricate, and very old mistake in the initialization logic of the PASS. [26]
The Shuttle computer system is an oddity, a demonstration of a technique called "N-version" programming that is no longer in use. The Shuttle was a fly-by-wire craft, an arrangement where the inputs of the pilot are moderated by the computer before being sent to control surfaces or reaction thrusters. The PASS was a cluster of four identical computers running identical software. Each PASS computer controlled a "string" of avionics equipment with some redundancy. Most equipment received the coverage of a partial subset of the PASS computers while very important equipment, like the main engines, received four-way coverage. The PASS computers performed their computations independently but compared the results among one another. This was done to control defects in the computer hardware: were a disagreement to be found in the results the computers -- or, manually, the pilot -- could vote a machine out of control over its string, assuming it to be defective. Simultaneously to all this the BFS received the same inputs and performed its own computations. The BFS computer used identical hardware to the PASS, was constructed in the same HAL/S programming language but ran software of independent construction on a distinct operating system. This is N-version programming. The hope was that software constructed by different groups would prove to be defective in distinct ways, averting potential crisis owing to software defect.
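The voting arrangement is simple to sketch. What follows is a toy illustration only -- the PASS was HAL/S running on flight hardware, and nothing below reproduces it -- but it shows the shape of the idea: each channel computes the same quantity, the channels compare answers within a tolerance, and a channel that disagrees with the majority is voted out of control of its string. The channel names, tolerance and values are all invented.

```python
# Toy majority voting among redundant channels. Illustrative only; the
# names, tolerance and values bear no relation to the actual PASS.

def vote(results, tolerance=1e-6):
    """Given {channel: value}, return (agreed_value, outvoted_channels).

    A channel is outvoted when its value differs from the largest
    agreeing group by more than `tolerance`.
    """
    channels = list(results.items())
    best_value, best_supporters = None, []
    for _, value in channels:
        supporters = [n for n, v in channels if abs(v - value) <= tolerance]
        if len(supporters) > len(best_supporters):
            best_value, best_supporters = value, supporters
    outvoted = [n for n, _ in channels if n not in best_supporters]
    return best_value, outvoted

# Four channels compute the same quantity; a hardware fault skews the third.
readings = {"gpc1": 101.25, "gpc2": 101.25, "gpc3": 98.40, "gpc4": 101.25}
value, failed = vote(readings)
print(value, failed)  # 101.25 ['gpc3']
```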
There are five onboard computers (called "GPC's" by everyone - with few remembering that they really were "general purpose") -- four operate with identical software loads during critical phases. That approach is excellent for computer or related hardware failures - but it doesn't fit the bill if one admits to the possibility of catastrophic software bugs ("the bug" of this article certainly is not in that class). The thought of such a bug "bringing down" four otherwise perfect computer systems simultaneously and instantly converting the Orbiter to an inert mass of tiles, wires, and airframe in the middle of a highly dynamic flight phase was more than the project could bear. So, in 1976, the concept of placing an alternate software load in the fifth GPC, an otherwise identical component of the avionics system, was born.[27]
The PASS was asynchronous, the four computers kept in sync by continually exchanging sync codes with one another during operation, losing an effective 6% of operating capacity but gaining loose coupling between systems that, conceptually, should be tightly coupled on time to one another. The BFS was a synchronous time-slotted system wherein processes are given pre-defined durations in which they will run. Synchronizing asynchronous and synchronous machines is a notoriously hard problem to solve and the shuttle system did so by building compromises into the PASS, requiring it to emulate synchronicity in its high-priority processes in order to accommodate the BFS.
The changes to the PASS to accommodate BFS happened during the final and very difficult stages of development of the multi-computer software.[28]
In order for the BFS to synchronize initially with the PASS it had to calculate the precise moment to begin listening on the same bus as the PASS. That the computers' clocks were identically driven made staying in sync somewhat easier, though this did nothing to address initial synchronization at startup. The solution taken by the BFS programmers was to calculate the offset of the current time from the time when the next sync between PASS and BFS was to occur and simply wait. The number of cycles for this calculation was known and, therefore, the time to wait could be made to take into account the time to compute the time to wait. Except that, sometimes, in rare circumstances, the timing would be slightly off and the sync would remain one clock cycle out. Resolution required both the PASS and BFS to be power-cycled. Once the sync was achieved it was retained and on 12 April 1981 John Young and Robert Crippen flew the space shuttle Columbia into low-earth orbit.
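The shape of that calculation is easy to render in miniature. The sketch below is not the BFS logic -- the frame length and calculation cost are invented numbers -- but it shows why the trick is delicate: the wait must be discounted by the cost of computing the wait, and if that assumed cost is off by even a cycle the listener wakes away from the boundary, frame after frame.

```python
# Miniature rendering of "wait until the next sync point, minus the time
# spent working that out". The constants are invented for illustration.

CYCLES_PER_SYNC_FRAME = 1_000  # hypothetical frame length, in clock cycles
CALCULATION_COST = 12          # hypothetical cycles spent in this very routine

def cycles_to_wait(now_cycles):
    """Cycles to wait so the wake-up lands exactly on the next frame
    boundary, after accounting for the cost of this calculation."""
    offset = CYCLES_PER_SYNC_FRAME - (now_cycles % CYCLES_PER_SYNC_FRAME)
    return (offset - CALCULATION_COST) % CYCLES_PER_SYNC_FRAME

# If CALCULATION_COST is wrong by one, every wake-up lands one cycle off
# the boundary and the machines never line up -- the failure mode described
# above, cleared only by power-cycling both machines.
print(cycles_to_wait(31_337))  # 651: cycles to the next boundary, less the cost
```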
Another subsystem, especially one as intricately woven into the fabric of the avionics as is the BFS, carries with it solutions to some problems, but the creation of others. While it certainly increased the reliability of the system with respect to generic software failures, it is still argued academically within the project whether the net reliability is any higher today than it would have been had the PASS evolved to maturity without the presence of its cousin - either as a complicating factor...or a crutch. On the other hand, almost everyone involved in the PASS-side "feels" a lot more comfortable![29]
In 1986 John C. Knight and Nancy G. Leveson published "An Experimental Evaluation of the Assumption of Independence in Multiversion Programming". N-version software, as mentioned, was assumed to reduce the risk from software bugs in critical systems not by removing the bugs but by making them different between versions. This assumption drove increased complication into the shuttle flight computer system, delaying the initial flight and complicating the operation of the shuttle as well as the maintenance of the flight system through the rest of its lifetime.
The great benefit that N-version programming is intended to provide is a substantial improvement in reliability. (...) We are concerned that this assumption might be false. Our intuition indicates that when solving a difficult intellectual problem (such as writing a computer program), people tend to make the same mistakes (for example, incorrect treatment of boundary conditions) even when they are working independently. (...) It is interesting to note that, even in mechanical systems where redundancy is an important technique for achieving fault tolerance, common design faults are a source of serious problems. An aircraft crashed recently because of a common vibration mode that adversely affected all three parts of a triply redundant system.[30]
Knight and Leveson's experiment to check this assumption is delightfully simple. Graduate students at the University of Virginia and the University of California at Irvine were asked to write programs from a common specification and each program was subjected to one million randomly generated test cases. The programmers were given common acceptance tests to check their programs against but were not given access to the randomly generated test cases. Once the programs were submitted:
(o)f the twenty seven, no failures were recorded by six versions and the remainder were successful on more than 99% of the tests. Twenty three of the twenty seven were successful on more than 99.9% of the tests.[31]
Knight and Leveson examined the failing cases and determined that they tended to cluster around the same logic mistakes. For example
The first example involves the comparison of angles. In a number of cases, the specifications require that angles be computed and compared. As with all comparisons of real quantities, the limited precision real comparison function was to be used in these cases. The fault was the assumption that comparison of the cosines of angles is equivalent to comparison of the angles. With arbitrary precision this is a correct assumption of course but for this application it is not since finite precision floating point arithmetic was used and the precision was limited further for comparison.[32]
That is, the independently programmed systems displayed correlated failures. Further correlated failures were the result of misunderstandings of geometry and of trigonometry.
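The cosine fault quoted above is worth making concrete. A toy sketch, with an invented tolerance standing in for the specification's limited-precision comparison: because cosine is nearly flat around zero, two angles that differ by an amount the comparison was meant to detect can have cosines that the same tolerance declares equal.

```python
import math

# Two angles that differ by roughly 0.6 degrees -- a real difference the
# comparison is supposed to notice.
a = math.radians(0.0)
b = math.radians(0.6)

EPSILON = 1e-4  # invented stand-in for the limited-precision tolerance

angles_equal = abs(a - b) <= EPSILON                       # False: the angles differ
cosines_equal = abs(math.cos(a) - math.cos(b)) <= EPSILON  # True: cosine is flat near zero

print(angles_equal, cosines_equal)  # False True -- comparing cosines is not
                                    # the same as comparing the angles
```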
For the particular problem that was programmed for this experiment, we conclude that the assumption of independence of errors that is fundamental to the analysis of N-version programming does not hold.[33]
The PASS' software was compromised not for want of care in its construction -- NASA used the best available understanding of how to assemble a reliable software system at the time -- but because this understanding proved to be defective. The shuttle was forever more complicated than it should otherwise have been, adding to the difficulty of its operation and to the expense of its maintenance. There are surely many systems which, now in the world, were constructed with the best of intentions, with the best of available knowledge, but are compromised in a similar manner. Recall that safety systems are not, themselves, independent but become a part of the system, interactive with the existing components in ways potentially unanticipated. In the design and construction of systems we must strive to limit complexity, must push back on its inclusion, especially late in a project. The simpler a system is the more capable we'll be of predicting its behavior, of controlling its failures.
The shuttle computer system is an example of a technical system in which the techno-political organization had a great degree of balance in its technical and political sub-cultures. (The Shuttle itself was not so, being made too heavy in order to accommodate DoD payloads with an eye toward "paying for itself" through flights. This constraint was added by the US Congress and exacerbated by unrealistic flight rates put forward by NASA. This topic is outside the scope of this essay, but "Into the Black" by Rowland White is a fine book, as well as "Space Shuttle Legacy: How We Did It and What We Learned" by Launius, Krige and Craig). We have seen what happens to technical systems where this is not so: fitness for purpose is compromised for expedience along political lines. The PASS / BFS sync issue as well as our lived experience should give a sense that the opposite is not true: a perfect balance between technical and political sub-cultures will not produce a defect free system. In such cases where do defects creep in and why? Robyn R. Lutz' "Analyzing Software Requirements Errors in Safety Critical Embedded Systems" is of particular interest.
This paper examines 387 software errors uncovered during integration and system testing of two spacecraft, Voyager and Galileo. A software error is defined to be a software related discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition.[34]
The Voyager probes, launched in 1977 to study the outer solar system, carried imaging equipment, radio transceivers and other miscellaneous equipment common to spacecraft. Galileo was a later spacecraft, launched in 1989, and as such was a more capable scientific instrument meant to study Jupiter and its moons but, broadly, is not dissimilar to Voyager for our purposes here. The software of both devices was safety-critical, in that the computer system monitored and controlled equipment which, if misused, would cause the loss of the device. Every precaution was taken in the construction of the probes as both technical and political incentives aligned toward achieving the greatest safety. This is a common feature of techno-political organizations when the system at the core of the organization is perceived both to be very important and to be very sensitive to failure. Because of this great precaution Lutz was able to catalog each defect identified by the development teams, breaking them down into sub-categories. The spread of time between the Voyager and Galileo projects gives confidence that the results are generally applicable.
Safety related software errors account for 56% of the total software errors for Voyager and 48% of the total software errors observed for Galileo during integration and system testing.[35]
The kinds of errors discovered are of great interest to us here. Broadly they are broken down into three schemes:
- Internal Faults
- Interface Faults
- Functional Faults
Internal Faults are coding errors internal to "a software module". A function that returns the wrong result for some input or an object that mismanages its internal state are examples of such. Lutz notes that there are so few of these that the paper does not address them. That these are basically non-existent by the stage of integration and system testing is a testament to the effectiveness of code review and careful reasoning. It is also indicative of the notion that simply testing small units of a larger project is insufficient for catching serious errors in software systems. The combination of systems is not necessarily well-understood, though the individual systems might be. Interface Faults are the unexpected interaction of systems along their interaction pathways, their interfaces. In software systems this is the transfer or transformation of data between components or the incorrect transfer of control from one path to another. Functional Faults are missing or unnecessary operations, incorrect handling of conditions or behavior that does not conform to requirements.
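A small, entirely invented illustration of the distinction: in the sketch below both functions are internally correct -- no internal fault -- yet the pair fails at their interface, because one side hands over degrees and the other assumes radians. Each module's unit tests pass in isolation; the defect only shows itself at integration, which is exactly where Lutz's errors were caught.

```python
import math

def measured_sun_angle():
    """Sensor team's module: returns the sun angle in degrees."""
    return 30.0

def pointing_error(sun_angle_radians):
    """Attitude team's module: expects the sun angle in radians."""
    return abs(math.sin(sun_angle_radians))

# Interface fault: degrees handed to a function expecting radians. Both
# functions are correct on their own terms; the units disagree between them.
print(pointing_error(measured_sun_angle()))                # wrong: ~0.988
print(pointing_error(math.radians(measured_sun_angle())))  # intended: 0.5
```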
At a high level of detail, safety-related and non-safety-related software errors display similar proportions of interface and functional faults. Functional faults (...) are the most common kind of software error. Behavioral faults account for about half of all the functional faults on both spacecraft (52% on Voyager; 47% on Galileo). (...) The analysis also identifies interface faults (...) as a significant problem (36% of the safety-related program faults on Voyager; 19% on Galileo).[36]
Lutz' analysis goes on to demonstrate that the majority of interface faults in both projects -- 93% on Voyager, 72% on Galileo -- are due to communication errors between engineering teams. That the primary failing of software in these projects is incorrect functionality per the specification -- ambiguities in or a misinterpretation of said specification -- puts human communication as the primary defect source in an ideal techno-political organization. The root problem here is ambiguity in human speech, either in written specification or in agreements between peers with regard to the action of computers. This being a dire problem, recognized relatively early in the field of software engineering, a body of work has gone toward its resolution. The most obvious approach is to remove the ambiguity. That is, if we could but produce a document which would unambiguously declare the behavior of the system we wished to have on hand -- as well as all the necessary sub-machines -- then we would do a great deal toward removing the primary source of defects. This notion is very much of the formalist school of mathematics and suffers from the same defect. Namely, unambiguous specification is a monstrously complicated undertaking, far harder than you might think at first. The most generally useful formal specification language today is Z, pronounced Zed. Z does not have wide use and the literature around its use is infamous for applying Z to simplistic examples. Jonathan Bowen wrote "Formal Specification and Documentation Using Z: A Case Study Approach" to remedy this, noting in his introduction that:
The formal methods community has, in writing about the use of discrete mathematics for system specification, committed a number of serious errors. The main one is to concentrate on problems which are too small, for example it has elevated the stack to a level of importance not dreamt of by its inventors.[37]
Bowen's work is excellent -- chapter 9, "The Transputer Instruction Set", is especially fun -- but reading the book you cannot help but feel a certain hopelessness. This is the same hopelessness that creeps in when discussing dependently typed languages or proof tools with program extraction. Namely, these tools are exceptionally technical, requiring dedicated study by practitioners to be used. It seems hopeless to expect that the political side of a techno-political organization will be able or willing to use formal specification tools, excepting in exceptional circumstances. This is not to say that such tools are not valuable -- they are, even if only today as an avenue of research -- but that they lack a political practicality, excepting in, again, exceptional circumstances. What does have political practicality is "testing" as such, understood broadly to be necessary for the construction of software. Existing methodology focuses on hand-constructed case testing of smallish units of systems as well as hand-constructed case testing of full systems. Design methodologies -- domain-driven design, for example -- are also increasingly understood to be good for flushing out unspoken assumptions in specifications. This is excellent. Where the current dominant testing culture fails is in the same areas as above: in input boundary conditions, in unexpected interaction between components, in unexplored paths. Testers are often the same individuals as those who wrote the initial system. They are biased toward the success of the system, it being of their own devising. It takes a great deal more than most have to seek out the failings in something of emotional worth. That is, manually constructed test cases often test a 'happy path' in the system because that is what is believed most likely to occur in practice and because imagining cases outside of that path is difficult.
In the same spirit as formal specification, early approaches to the challenge of creating effective test cases centered on efficient exhaustive testing. Black-box testing -- where test inputs are derived from knowledge of interfaces only -- and white-box (or structural) testing -- where test inputs are derived as in black-box testing plus knowledge of the software's internal structure -- are still employed today. Defining the domain of a program's input and covering it in a constrained amount of time is the great challenge to both approaches. Equivalence partitioning of inputs cuts down on the runtime issue but defining equivalence effectively is a large, potentially error-prone task in itself. Pre-work for the purposes of testing is a negative with respect to political practicality. The underlying assumption is the need for exhaustiveness to make the probability of detecting faults in software high. Joe E. Duran and Simeon C. Ntafos' 1984 paper "An Evaluation of Random Testing" determined that this assumption does not hold. Their method is straightforward. Duran and Ntafos took a series of programs common in the testing literature of the time and produced tests using then best-practice methods, as well as tests which simply generated random instances from the domain of input. Their results showed that randomized testing "performed better than branch testing in four and better than required pairs testing in one program" but "was least effective in two triangle classification programs where equal values for two or three of the sides of the triangle are important but difficult to generate randomly."
The results compiled so far indicate that random testing can be cost effective for many programs. Also, random testing allows one to obtain sound reliability estimates. Our experiments have shown that random testing can discover some relatively subtle errors without a great deal of effort. We can also report that for the programs so far considered, the sets of random test cases which have been generated provide very high segment and branch coverage.[38]
Key is that both sub-cultures in a techno-political organization have their needs met. The technical side of the organization is able to achieve high confidence that the potential state-space of the system under test has been explored, and that with very little development effort. The political side of the organization receives the same high assurance but, again, with very little effort, that is, cost. Tooling has improved in recent years, making randomized testing more attractive as a complement to existing special-case testing: property testing -- introduced in Claessen and Hughes' "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs" -- libraries are common in mainstream languages and automatic white-box testing tools like American Fuzzy Lop are quick to set up and reap immediate benefits. As Duran and Ntafos note, "the point of doing the work of partition testing is to find errors," and this is true of test methods in general. Restated, the point of testing is to uncover unexpected behavior, whether introduced through ambiguity or accident. Randomized testing is a brute force solution, one that can be effectively applied without specialized technique -- though property testing can require a fair bit of model building, as noted in Hughes' follow-up papers -- and at all levels of the software system. Such an approach can probe unexpected states and detect the results of ambiguity in human communication, limited only by the scope of the simulation environment built up around the system under test.
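The whole approach fits in a handful of lines. What follows is a hand-rolled sketch rather than any particular library's API -- the classifier, the generator and the property are all invented -- but it shows the shape shared by QuickCheck-style tools: a generator over the input domain, a property that should hold for every input, and a large number of random draws. The generator deliberately mixes in equal sides, since uniform draws would almost never produce them, the very weakness Duran and Ntafos saw in the triangle classification programs.

```python
import random

def classify(a, b, c):
    """Toy triangle classifier under test; invented for illustration."""
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

def random_sides(rng):
    """Generator over the input domain, biased toward the degenerate
    cases (equal sides) that uniform draws would essentially never hit."""
    a = rng.uniform(0.1, 100.0)
    roll = rng.random()
    if roll < 0.2:
        return a, a, a
    if roll < 0.5:
        return a, a, rng.uniform(0.1, 100.0)
    return a, rng.uniform(0.1, 100.0), rng.uniform(0.1, 100.0)

def classification_is_order_independent(a, b, c):
    """The property: the classification must not depend on argument order."""
    perms = [(a, b, c), (a, c, b), (b, a, c), (b, c, a), (c, a, b), (c, b, a)]
    return len({classify(*p) for p in perms}) == 1

rng = random.Random(0)
for _ in range(10_000):
    sides = random_sides(rng)
    assert classification_is_order_independent(*sides), sides
print("10,000 random cases, property held")
```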
The limit to probing for ambiguity will come from the engineer's ability to construct a simulation environment and from the political sub-culture's impatience, which pushes to avoid constructing it at all. We've come now back to our second question, that of constructing software whose unknown consequences are limited. In no small sense we, the technical side of the techno-political organization, must understand that the consequences of a system cannot be understood if its behaviors are not. It is worth keeping in mind that one consequence will be of supreme importance to the political organization: its continued existence, whether for profit or for the satisfaction of a constituency. Leveson's "The Role of Software in Spacecraft Accidents" is comprehensive; of interest to this essay is the first flight of the Ariane 5. This flight ended forty seconds after it began with the spectacular explosion of the rocket.
The accident report describes what they called the "primary cause" as the complete loss of guidance and attitude information 37s after start of the main engine ignition sequence (30 seconds after liftoff). The loss of information was due to specification and design errors in the software of the inertial reference system. The software was reused from the Ariane 4 and included functions that were not needed for Ariane 5 but were left in for "commonality."[39]
Why, when Ariane 4 had been such a successful launch system, was its successor's guidance system knocked together?
Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk. This phenomenon is not new, and it is extremely difficult to counter when it enters the engineering culture in an organization. Complacency is the root cause of most of the other accident factors described in this paper and was exhibited in all the accidents studied. (...) The Ariane 5 accident report notes that software was assumed to be correct until it was shown to be faulty. As noted by the Ariane accident investigation board, the opposite assumption is more realistic.[40]
More damning,
While management may express their concern for safety and mission risks, true priorities are shown during resource allocation. (...) A culture of denial arises in which any evidence of significant risk is dismissed.[41]
Leveson's specific focus in "The Role of Software" is on the failure of management to contain risk in the destruction of safety-critical systems. If we set our minds to speak of consequence then this applies equally well to us as we are those who now make the world what it will be. I mean this not in a self-congratulatory "software is eating the world" sense but in the more modest sense that everyone now living, through their action or inaction, effects some change on what is to come. The discipline of engineering is special, if not unique, for the construction of artifacts that will be carried forward into the future, bringing with them the unarticulated assumptions of the present. Consider the "multi-core crisis" where an assumption of sequential machines met unavoidably with a world of superscalar, multi-level cached, multi-core machines. Algorithms developed in the sequential era continue to work when the machines have been fundamentally redesigned but are no longer necessarily optimal, requiring a rewrite of existing software and a retraining of existing thought to meet with present machines on their own terms. This imposes an inefficiency burden in a personal and non-personal sense. Consider as well the trend toward non-binary gender expression where it meets with software constructed in an era assuming binary gender. Whether intended now or not this enforces a gendering scheme from the past onto the present. Adapting such systems demands work from those whose gender identity does not conform to the old model -- convincing those unaffected is a great labor; the minority is often made to bear the disproportionate weight of the work for equality, to put it mildly -- and from those who steward such systems.
Where Jonas contends that technology tends to gather up its own momentum and "overtake the wishes and plans of the initiators" we here further contend that if we view ourselves not as the beginning of some future or the end of some past but a people in the middle of a future and the past then the weight of the choices made by the past is borne most heavily when making choices to effect the future. No technology is neutral. In seeking to solve some problem it encodes at once what its originator viewed as a problem and what was also a valid solution to the problem as framed. No technology can be fully controlled, as per Perrow's notion of the "normal accident". What is made will express its reality and affect the reality that it is placed in. Jerry Mander expresses this clearly in his "Four Arguments for the Elimination of Television" though now, maybe, the framing of this argument will seem out of date. Mander worked as an advertising executive at a time when mass media was ceding primacy from newspapers and radio toward television. The early assumptions of television were that it would have great, positive effects on mass culture. It would be the means of tele-education, lend an immediacy to politics and spread high culture to all classes which had been cut off from it for want of leisure time. Television did not simply extend the existing culture into a new medium but invented one, bringing what parts of the old culture were suitable into the new one created and informed by television.
In one generation, out of hundreds of thousands in human evolution, America had become the first culture to have substituted secondary, mediated versions of experience for direct experience of the world. Interpretations and representations of the world were being accepted as experience, and the difference between the two was obscure to most of us.[42]
As I say, though, Mander's argument and, more, Neil Postman's in his "Amusing Ourselves to Death", while important, are difficult to comprehend on their own terms. We've entered a time when television as the dominant medium is in decline and the culture it made, let alone obscured, has become distant. What has more immediacy for us now living is the change brought by the Internet, by the re-centering of mass culture on its norms. The early Internet was dominated culturally by a certain kind of person: educated, often technical and living in areas of the world with ready access to telephone networks and cheap electricity. Of these -- for reasons that are beyond this essay -- many were anti-authoritarian in mood, suspicious of existing power structures but perfectly comfortable setting up new power structures centered around their strengths: capability with computers and casual indifference to the needs of others -- coded as cleverness -- foremost. Allison Parrish's "Programming is Forgetting: Toward a New Hacker Ethic" is an excellent work in this direction. Of interest to our purpose here is the early utopian scheme of the Internet: in "throwing off" existing power structures the Internet would be free to form a more just society. As John Perry Barlow said in "A Declaration of the Independence of Cyberspace"
Governments of the Industrial World, you weary giants of flesh and steel, I come from Cyberspace, the new home of Mind. (...) You have not engaged in our great and gathering conversation, nor did you create the wealth of our marketplaces.[43]
That the global Internet descended from the DARPANET, a United States funded project to build a distributed communication network that could survive a nuclear shooting war, underlies this claim somewhat. But, granting it, Barlow continues
Where there are real conflicts, where there are wrongs, we will identify them and address them by our means. We are forming our own Social Contract. (...) Our world is different. (...)
We are creating a world that all may enter without privilege or prejudice accorded by race, economic power, military force, or station of birth.
We are creating a world where anyone, anywhere may express his or her beliefs, no matter how singular, without fear of being coerced into silence or conformity.[44]
Barlow's utopia of radical freedom -- specifically centered around radical freedom of speech -- did not exist at the time and has ultimately not come to fruition. The failure of Barlow's conception is that the Internet was not different from the world that spawned it. Much like television, the Internet did not extend a culture into a new medium but created a new culture, cannibalizing the old to make itself. As we now are acutely aware, class and race were not left behind as signifiers but were changed and in being changed were not made less important. Silence and conformity are not incidental features of the world shaped by the "Governments of the Industrial World" but are, seemingly, of human nature. And even if they are not, mediating human interaction via computer networks does not bring forward people's best selves. Computers are not magic. Of no less importance in the failure of Barlow's vision is that it is necessarily gated by a capital requirement. Especially in 1996, when Barlow's declaration was published, access to the global Internet was not a cheap thing, requiring semi-specialized computer hardware and knowledge to interact with effectively. In reality, Barlow's different world was merely exclusive and invested exclusivity with a kind of righteousness.
The Internet did become available to a broad swath of humanity -- which, tellingly, the original "inhabitants" of the Internet refer to as the Eternal September, a reference to the period of initiation that college students would go through at the start of every school year, September -- but not in the manner that Barlow expected. Initial inclusion was brought by companies like AOL which built "walled gardens" of content, separated off from the surrounding Internet and exclusive to paid subscription. Search engines eroded this business model but brought a new norm: "relevance" as defined by PageRank or similar algorithms. Content as such was only of value if it was referred to by other Content. Value became a complicated graph problem, substituting what was originally a matter for human discernment for a problem of computation. This value norm we see reflected in the importance of "viral" materials on our discourse today. Google succeeded in prying the walled gardens away from their subscription models. However free information wants to be, paying for the computer time to make it so is not, and it is no accident that, today, the largest advertising platforms on the Internet are either walled gardens -- Facebook -- or search engines -- Google. Advertisement subsidizes the "free" access to content, in much the same manner that advertisement subsidizes television. The makers of content inevitably, in both mediums, change their behavior to court advertisers, elsewise they cannot exist. What is different about the Internet -- and this is entirely absent in early utopian notions -- is the capability for surveillance. Advertising models on the Internet are different from those of previous mediums, which relied on statistical models of demographics to hit target audiences. So-called "programmatic" Internet advertising is built on a model of surveillance, where individual activity is tracked and recorded for long periods, compiled into machine learning models of "intent" and subsequently paired with advertisers' desires. Facebook might well be a meeting place for humanity but it is also a convenient database of user-submitted likes and dislikes, of relationships and deep personal insights to be fed into a machine whose purpose is convincing humanity to buy trivialities. Information becomes compressed to drive collection of information; collection of information becomes a main purpose of creation of information.
Technically speaking, building systems to interact in the world of programmatic advertising is difficult. Most exchanges -- the Googles and Facebooks of the world -- work by auction. Websites make available slots on their sites where ads may go and, in the 100 milliseconds that exist between the start of a page load and the moment human perception registers an empty space, an auction occurs. A signal with minimally identifying information goes out from the exchanges to bidders. The bidders must use this signal to look up identifying information in their private databases and, from this information, make a bid. This happens billions of times a day. These bidders, built by private companies and working off pools of information collected by those same companies, work to drive, largely, clicks. That is, the ads they display after winning bids are meant to be "relevant" to the user they're displayed to, enough to make them click on the ads and interact with whatever it was that the advertiser put on the other side of the link. So do the wheels of commerce now turn. Building a system that is capable of storing identifying information about humanity in local memory and performing a machine learning computation over that information in the space of approximately 35 milliseconds -- if you are to respond in the 100 milliseconds available you must take into account the time to transmit on both sides of the transaction -- is no small matter. It takes real dedication to safety analysis of complex systems, to automated coping with the catastrophic failure of firm real-time systems, to make this possible at profitable scale. It is easy, in the execution of such a system, to confuse the difficulty of its construction with its inherent quality. It is this confusion that must be fought. Is it in fact a social good to build a surveillance database on the whole of humanity to drive the sale of trinkets, ultimately so that content on the Internet can be "free" in vague accordance with the visions of wishful utopians? Maybe it is. But, if we can use the relevance norm of the Internet against itself, we note that some 11% of all Internet users now use adblock software and this percentage is growing year by year.
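The latency arithmetic above is worth writing down, if only as a back-of-the-envelope sketch: the 100 millisecond deadline is the only figure taken from the description above, while the transit time and safety margin are invented. The posture of a firm real-time system falls out of it: a late answer is worth nothing, so the bidder must prefer a cheap no-bid over blowing the deadline.

```python
# Back-of-the-envelope latency budget for a bidder. Only the 100 ms
# auction deadline comes from the text; the other figures are invented.

AUCTION_DEADLINE_MS = 100.0  # bid request sent to response required
NETWORK_ONE_WAY_MS = 30.0    # hypothetical transit time, each direction
SAFETY_MARGIN_MS = 5.0       # hypothetical cushion for jitter, queueing, pauses

compute_budget_ms = AUCTION_DEADLINE_MS - 2 * NETWORK_ONE_WAY_MS - SAFETY_MARGIN_MS
print(compute_budget_ms)  # 35.0 ms left for lookup, feature assembly and scoring

# Firm real-time posture: if lookup or scoring threatens to run past the
# budget, return a no-bid rather than a late bid, which is worth nothing.
```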
That the consequence of a technology is intimately tied to its initial fitness for purpose but ultimately untethered from such should be no surprise. Yet, the phrase "We really ought to keep politics out of technology," is often spoken and assumed to be correct. This is part and parcel of the reductivist mindset, one which works well in scientific discipline but, being effective in that limited domain, is carried outward and misapplied. Reductionism works only so long as we are interested in the question of how a thing is, not why a thing is. Put another way, reductionism is an intellectual ideology that is suitable for a method of learning in which learning has no effect on the underlying system. Discovering the laws of optics does not change the laws of optics. This is fundamentally unsuitable to the project of engineering. A technical artifact will change the world it is placed in and the question of why that thing is becomes fundamentally a part of it.
Our inventions are wont to be pretty toys, which distract our attention from serious things. They are but improved means to an unimproved end, an end which it was already but too easy to arrive at; as railroads lead to Boston or New York. We are in great haste to construct a magnetic telegraph from Maine to Texas; but Maine and Texas, it may be, have nothing important to communicate. (...) We are eager to tunnel under the Atlantic and bring the Old World some weeks nearer to the New; but perchance the first news that will leak through into the broad, flapping American ear will be that the Princess Adelaide has the whooping cough.[45]
If we wish to construct software of quality -- whose unknown consequences are limited -- we must understand two things. Firstly, we must be aware of the immediate results of our construction. This demands a thinking through of the effect of the system to be constructed, what community it will affect and what community it might well create. This demands, as well, care toward initial behavior, of which specification and testing are key components. Secondly, we must understand that the consequence of the systems we make will grow well beyond what we can see at the outset but that this will be colored by the ambitions of the technical and political organization that spawned it. We must have introspection, not as a secondary feature but as a first class consideration of the engineering discipline. Reductionism is a powerful tool for reasoning but it is a tool with intentional limitations. It is a great mistake to reduce oneself to the merely mechanical mindset implied as the correct avenue by such reason. There is so much more to the human mind.
I'd like to speak a word for good software systems. By this I mean something very simple, though it took an awfully long time to get here. A good software system is one that has been constructed with great care -- constrained through cooperation with others, constrained through probing of its possible states -- to be fit for some well-intentioned purpose. That purpose is necessarily political, solves some problem to be found in the world by someone. The technical method applied toward the software system's construction will be affected by the political demands and the approach toward those demands drawn around by the technical constraints of the modern day. It is essential to question the assumptions that are the genesis of the software system, to apply to them the best reason of your own sense of right from wrong, to probe the world that they seek by their own momentum to bring about and find it in accord with the one you seek to inhabit. The engineer of a good software system will understand that this "goodness" is fleeting, made up of the needs of a certain time and a certain place and gone almost as soon as it arrives. Knowledge grows and what is thought best changes. Randomized testing, currently, is one of the best methods for testing in its trade-offs between cost and effectiveness at probing for bugs. Improvements to the method are published regularly -- "Beginner's Luck: A Language for Property-Based Generators" by Hughes et al starts to cope with the generation issue noted by Duran and Ntafos -- but should human knowledge truly advance then, someday, randomized testing will seem overly simplistic and ineffective. That is the progression of the works of pure reason. The progression of the work of politics does not advance in this way and it is common to discount it as "trivial", thereby. This "triviality" is in fact a mask over true complexity, a domain of knowledge where there is no clear right from wrong but a slow climb up out of the shadows toward wisdom. If we are to build good software systems in this sense then we must understand that, no matter how good our intentions or, perhaps harder, no matter how fine our craft, a thing well made might in fact be a social ill. That is, the problem to be solved may not be a problem -- if it ever was -- or, should it still be, may not have been solved in a way that functions now, if it ever did.
In the Thoreauvian spirit of making known what is good, I say this: the sense that the pursuit of engineering is purely an exercise of reason is wrong and we would do well to abandon this fantasy. The software we create will be made for others; it is a wooden gun to a wooden gun. The sense that the techno-political balance must be worked through superior technology is wrong too. Consider that most software systems are under-tested. This is so because of the incentives of the political organization that surrounds them. Consider as well that this very same political organization is often not capable of engaging with technical artifacts in any deep fashion. Should you like to do randomized testing, say, but find no political support for it, well, then build a testing tool that encodes randomized testing fundamentally, inextricably. Reality will move toward your tool, the frog will be boiled and the future will be encoded with your norm. The future might well rue this, understanding more than we do, but such is the nature of true progress: the past, however advanced it seemed at the time, takes on a sense of triviality and pointlessness. Progress, true progress, is not done out of arrogance -- as a demonstration of one's own talent -- but out of duty for the well-being of the future. The sense that ultimate political aims do not matter is also wrong. These, no less than the technical details of a project, must be fully understood and thought through, our own works no less than the works of others. The political aims of a technical system are a fundamental part of the design. We must probe these and make known what we find. Perhaps, say, an always on voice-activated assistant is a good to the world. Why, then, does it record everything it hears and transmit this to an unaccountable other? Was this device made to make the lives of people better or to collect information about them and entice them into a cycle of want and procurement? Just as technique moves on, so too politics. This, because we who people the world change: our needs change and what was once good may no longer be. What seems good may not be. In choosing to inflict some thing on the future -- whether by construction or by support of construction -- we must strive to make a freedom for the future to supplant it at need. We must keep the future of humanity "in the loop" of the technologies that shape their world. The techniques that we develop today are those which are the very foundation of what is to come. The politics of today makes the technique of today, the techniques the politics.
In the spirit of making known I say this: what is good is that which seeks the least constraint for those to come and advances, at no harm to others, knowledge in the present day. We are those who make the future. Our best will not be good enough but, in struggling to meet our limit, we set the baseline higher for those who will come after us. It is our responsibility so to struggle.
Henry David Thoreau, Civil Disobedience (The Library of America, 2001), 203.
Thoreau, Civil Disobedience, 203.
Thoreau, Civil Disobedience, 224.
Thoreau, Civil Disobedience, 204.
Thoreau, Civil Disobedience, 204.
Olin E. Teague, et al., Automatic Train Control in Rail Rapid Transit, (Office of Technology Assessment, 1976), 45.
Olin E. Teague, et al., Automatic Train Control in Rail Rapid Transit, 48.
G. D. Friedlander, The case of the three engineers vs. BART, (IEEE Spectrum, 11(10), 1974), 70.
G. D. Friedlander, The case of the three engineers vs. BART, 69.
Olin E. Teague, et al., Automatic Train Control in Rail Rapid Transit, 48.
Charles Perrow, Normal Accidents: Living with High-Risk Technologies, (Princeton University Press, 1999), 5.
Perrow, Normal Accidents, 75.
Perrow, Normal Accidents, 76.
Perrow, Normal Accidents, 92.
Perrow, Normal Accidents, 94.
C. West Churchman, Wicked Problems, (Management Science, 1967), 14.
Grigori Medvedev, The Truth About Chernobyl (BasicBooks, 1990), 32.
Medvedev, The Truth About Chernobyl, 37.
Medvedev, The Truth About Chernobyl, 58-59.
David E. Hoffman, The Dead Hand: The Untold Story of the Cold War Arms Race and Its Dangerous Legacy, (Anchor Books, 2009), 245.
Hans Jonas, The Imperative of Responsibility: In Search of an Ethics for the Technological Age, (University of Chicago Press, 1984), 1.
Jonas, The Imperative of Responsibility, 24.
Jonas, The Imperative of Responsibility, 163.
Jonas, The Imperative of Responsibility, 36.
Jonas, The Imperative of Responsibility, 32.
John R. Garman, The BUG Heard 'Round the World: Discussion of The Software Problem Which Delayed the First Shuttle Orbital Flight, (ACM SIGSOFT Software Engineering Notes, 6(5), 1981), 3.
Garman, The BUG Heard 'Round the World, 4.
Garman, The BUG Heard 'Round the World, 5.
Garman, The BUG Heard 'Round the World, 6.
John C. Knight and Nancy G. Leveson, An experimental evaluation of the assumption of independence in multiversion programming, (Software Engineering, SE-12(1), 96-109), 2.
Knight and Leveson, An experimental evaluation of the assumption of independence in multiversion programming, 10.
Knight and Leveson, An experimental evaluation of the assumption of independence in multiversion programming, 16.
Knight and Leveson, An experimental evaluation of the assumption of independence in multiversion programming, 14.
Robyn R. Lutz, Analyzing software requirements errors in safety-critical, embedded systems, (Proceedings of the IEEE International Symposium on Requirements Engineering, 1993), 1.
Lutz, Analyzing software requirements errors in safety-critical, embedded systems, 4.
Lutz, Analyzing software requirements errors in safety-critical, embedded systems, 4.
Jonathan P. Bowen, Formal Specification and Documentation using Z, (International Thomson Computer Press, 1996), ix.
Joe W. Duran and Simeon C. Ntafos, An Evaluation of Random Testing, (Software Engineering, SE-10(4), 1984), 443.
Nancy G. Leveson, The Role of Software in Spacecraft Accidents, (Journal of Spacecraft and Rockets, 41(4), 2004), 2.
Leveson, The Role of Software in Spacecraft Accidents, 4.
Leveson, The Role of Software in Spacecraft Accidents, 5.
Jerry Mander, Four Arguments for the Elimination of Television, (William Morrow Paperbacks; Reprint Edition, 1978), 18.
John P. Barlow, A Declaration of the Independence of Cyberspace (retrieved from https://www.eff.org/cyberspace-independence, 2017).
Barlow, A Declaration of the Independence of Cyberspace.
Henry David Thoreau, Walden; or, Life in the Woods, (The Library of America, 1985), 363-364.
Peculiar Books Reviewed
Way back in 2014 I was published in the Huffington Post. They never paid me anything but they did let me write book reviews for odd books and they've kept the content online. You can find it here. This work got me mentioned in the New York Times. Unfortunately, it was the obituary for Henry S.F. Cooper.
I've got all my old essays mirrored here. I look on them now not necessarily as reflective of my current thought but I am pleased to see the through lines.
Anyhow, may you never learn that an author whose work you respect has died because you're mentioned in his obituary.
Peculiar Books Reviewed: David A. Mindell's "Digital Apollo"
Originally published May 31, 2014
This is the first of a series of monthly book reviews intended to make the case for expanding the canon of Software Engineering texts. Don't get me wrong, books like Code Complete or the Mythical Man Month are venerable and valuable, but I contend that the corpus should be more inclusive of interdisciplinary studies. What does this mean? I believe we can create better software more rapidly, by studying the works of other fields and learning, as much as possible, from their mistakes and triumphs. If this sounds suspiciously like a liberal arts reading list for engineering then your suspicions are accurate. By way of justification, I will simply note that the best engineers I have ever had the privilege of working with were, respectively, a military historian and a philosopher.
The US space program is a treasure trove of insight into engineering at the extremes of human ability. It is a field which concerns itself deeply with human-machine interaction. Spacecraft are not fully automated, nor are they under the total control of the human operators (the astronauts "in the can" and the ground control crew). Rather, they are sophisticated semi-autonomous machines, the machine performing background tasks and translating human commands into sensible actions. The balance between human and machine is not immediately obvious, as David A. Mindell explores in Digital Apollo: Human and Machine in Spaceflight. Mindell's book is concerned with the interaction between the test pilots (later, astronauts) and the rocketry and guidance control engineers of N.A.C.A. (later, NASA) and the MIT Instrumentation Laboratory, and with their struggle to design extremely reliable aircraft (later, spacecraft) in the presence of environmental unknowns and human fallibility. The guiding phrase was "stable, but not too stable": the craft should be autonomous enough to avoid overwhelming the pilot, but unstable enough to be responsive to the pilot's commands.
For Apollo, NASA and its contractors built a "man-machine" system that combined the power of a computer and its software with the reliability and judgment of a human pilot. Keeping the astronauts "in the loop," overtly and visibly in command with their hands on the stick, was no simple matter of machismo and professional dignity (though it was that too). It was a well-articulated technical philosophy.
Mindell traces the history of this philosophy through the X-15 project--a rocket-powered plane which left and re-entered the atmosphere variably under full human and full computer control, but successfully only in hybrid operation--the Mercury project--relatively short "spam in a can" orbital and sub-orbital flights with extensive ground-based observation and secondary computer control--and the Gemini project--a series of moon-trip length flights and computer-aided orbital rendezvous--up through the last Apollo flight, Apollo 17. Project Gemini was particularly influential in solidifying this philosophy.
Unlike Mercury, where the craft reentered the atmosphere in an open-loop, ballistic fashion, Gemini would be steered by the pilot right down to the point of landing.
Re-entry is tricky: the wrong angle will send you bouncing back into space. Sufficient instrumentation allowed the human pilots to control this angle manually. The rendezvous missions proved more difficult.
Intuitive piloting alone proved inadequate for rendezvous. Following Grissom and Young's successful demonstration of manual maneuvering on Gemini III, on Gemini IV astronaut Jim McDivitt attempted to rendezvous with a spent booster. He envisioned the task as "flying formation essential in space," but quickly found that his aviation skills would not serve him (...) Orbital dynamics created a strange brew of velocity, speed and range between two objects and called for a new kind of piloting. Catching up to a spacecraft ahead, for example, might actually require flying slow, to change orbit. (...) A successful rendezvous would require (...) numbers, equations and calculations. It would require simulators, training devices and electronics.
The Gemini computer--the first digital onboard flight computer--translated the pilot's intuitive commands into the proper orbital responses. Mercury's extensive instrumentation and Gemini's onboard flight control would be synthesized in the Apollo project, the craft of which could be largely automated but kept the human crew "in the loop," aware of craft operations, in control and able to respond creatively with deep knowledge of the running system in times of accident. Wernher von Braun, creator of the Saturn V launch vehicle, had imagined future space travelers as mere passengers in fully automated machines. Indeed, Neil Armstrong was deeply disappointed that the Saturn V could not be flown off the platform by astronauts, but previous simulations had demonstrated that human reflexes were too slow. These passenger astronauts could not service the craft as it flew, nor would they be aware of its operations; ground control would be the sole human element "in the loop."
While the X-15 had suggested this was folly--the solely computer-controlled craft was capable only in situations designers had anticipated, having a tendency to skip off the atmosphere--the Apollo 13 accident and the highly trained astronauts' role in their survival demonstrated this unequivocally: had Apollo 13 carried mere passengers, ground control could only have sat helplessly as they asphyxiated on a pre-proved flight path.
Digital Apollo is a detailed study of a complex organization's struggle to find the right balance between abstraction--via automation--and skilled human oversight to create a more functional system in a complex environment. It's here that Mindell's work finds its applicability to the craft of creating software: designing good systems with informed humans' interaction in mind is fiendishly difficult to get right.
Simply swamping the human with information and requests is unfeasible, but setting the machine off as an automaton without oversight is, while possible, only justifiable until an accident occurs, seemingly without warning and with no clear path toward resolution. It is essential, if we are to tackle complex new frontiers, to get this balance right. It's the humans' knowledge of the system, their understanding of when to trust the machine and when to silence it--as Mindell notes in his opening chapter on the Apollo 11 landing--that leads to a more capable machine. As we rely increasingly on semi-automated systems, the lessons learned in the Space Race have great bearing for the designers and implementors of these systems. We should not seek to cut humanity out but keep our hands on the stick, as it were, and trust to our genius and our learning to go further than we might either alone or as mere passengers of machines. Mindell does a fine job detailing how NASA succeeded in striking this balance in the service of landing on the Moon. We software engineers can learn a thing or two for our own moonshots.
Peculiar Books Reviewed: Charles Perrow's "Normal Accidents"
Originally published July 1, 2014
Wheels roll over feet, kitchen knives slice into fingers, heaters catch houses on fire, software crashes losing work and chemical plants blow up. Each of these things is man-made and each performs actions it was not intended to perform. As Charles Perrow terms it in Normal Accidents: Living with High-Risk Technologies, these "accidents" are intrinsic to the technology--wheels, knives, etc. Why do the things we build fail and why, when they do, are we so often surprised by the ways in which they fail? Can we build systems which function perfectly in all circumstances? Can we avoid all accidents with enough time, enough information, enough practice running the system or enough process around it?
I would wager that for many of the practicing engineers reading this the intuitive answer will be, no, we cannot. Long experience and deliberate effort to create perfectly functioning systems, which fail even so, tend to make one dubious about the possibility of pulling it off in practice. We invent new methods of developing software in the hopes of making this more likely and new formalisms to make verification of system behavior possible. While systematic testing in development (but not TDD, which its creators say is really more about design) puts the system under test through its paces before it reaches a production environment, the paces it's put through are determined by the human engineers and limited by their imaginations or experience. Similarly, sophisticated formalisms like dependent types or full blown proof assistants make the specification of the system extremely exact. However they can do nothing about externally coupled systems and their behaviors, which may be unpredictable and drive our finely crafted systems right off a cliff--literally in the case of a self-driving car hit by a semi-truck. Finally, the organization of humans around the running system provides a further complication in terms of mistaken assumptions about necessary maintenance, inattentive monitoring or, rarely, sabotage.
Perrow's contention in Normal Accidents is that certain system faults are intrinsic; there is no way to build a perfect system:
There are many improvements we can make that I will not dwell on, because they are fairly obvious--such as better operator training, safer designs, more quality control and more effective regulation. ... I will dwell upon characteristics of high-risk technologies that suggest that no matter how effective conventional safety devices are, there is a form of accident that is inevitable.1
These inevitable accidents, which persist despite all effort, are what Perrow terms 'normal', giving the book its title and strongly suggesting that systems cannot be effectively designed or evaluated without considering their potential failures. The characteristics that cause these normal accidents are, briefly stated:
- linear and complex interactions within the system
- tight and loose coupling with internal sub-systems and external systems
The definitions of these and their subtle interactions--irony of ironies--represent the bulk of Perrow's work. (The book is not as gripping a read as Digital Apollo--which I reviewed last month--but it does make a very thorough examination of the Three Mile Island accident, petrochemical facilities, the airline industry and large geo-engineering projects like mines and dams to elucidate the central argument.) With regard to interactions in systems Perrow points out that:
Linear interactions are those in expected and familiar production or maintenance sequence, and those that are quite visible even if unplanned.
Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences and either not visible or not immediately comprehensible.2
Having previously established that his text follows the use of 'coupling' in the engineering sense as a connection, Perrow states:
Loosely coupled systems, whether for good or ill, can incorporate shocks and failures and pressure for change without destabilization (of the system). Tightly coupled systems will respond more quickly to these perturbations ... Loose coupling, then, allows certain parts of the system to express themselves according to their own logic or interests.
Roughly, you can think of the interactions occurring within the system and the coupling being centered on external interfaces. When we test our systems--and we should all be testing--in small components and in a full, integrated environment, what we're doing, to use Perrow's terms, is driving the system through its linear interaction space. We can address the complex interactions with extensive batteries of testing in a production-like environment and use sophisticated monitoring. Yet testing can lull us into a false sense of mastery over the system, and instrumentation, an integrated component of the system, is not immune to normal accidents of its own. The Three Mile Island accident, for example, was caused largely by the complex interaction of the plant's instrumentation, its human operators and a redundant failure-handling sub-system:
Since there had been problems with this relief valve before ... an indicator had been recently added to the valve to warn operators if it did not reset. ... (S)ince nothing is perfect, it just so happened that this time the indicator itself failed ... Safety systems, such as warning lights, are necessary but they have the potential for deception. If there had been no light assuring them the valve had closed, the operators would have taken other steps to check the status of the valve ... (A)ny part of the system might be interacting with other parts in unanticipated ways.3
Perrow's book is studded with various asides. Some date the book in detrimental ways--the small political rants about US President Reagan come to mind--but others are particularly chilling:
Had the accident at Three Mile Island taken place in one of the plants near Moscow, it would have exposed the operators to potentially lethal doses, and irradiated a large population.4
From the publication date, it would be yet two years before the Chernobyl disaster.
The key contribution of Perrow's Normal Accidents is not merely more material on how to avoid or mitigate accidents, though there is a bit of that. Instead, Perrow's ultimate focus is the determination not of how a system should be constructed but if. Given that every system will eventually fail--and fail in ways we cannot predict--are the failure conditions of the system something we can live with? If not, then we might rethink building the system.
It's here that we find Perrow's primary applicability to software engineering. In our work we are often tasked with building critical systems for the health of companies or of humanity in general. Each new feature brings a possibility of new failures, each new system in the architecture more so. Would the OpenSSL Heartbleed vulnerability have happened if there had been greater concern with having fewer complex interactions within the code-base and the integration of experimental features? Perhaps not. Was the global cost worth it? Probably not.
Failure is inevitable in any system. The next time you're engineering something new or adapting something old, ask yourself, "What are the normal accidents here? Is there a design that meets the same end without these faults? If not, can I live with the results?" To recognize, in a systematic way, that failures are intrinsic and not something that can be simply worked away with enough testing or documentation or instrumentation has been incredibly influential in how I approach my work. Perhaps Normal Accidents: Living with High-Risk Technologies will be similarly influential to you.
Charles Perrow, Normal Accidents: Living with High-Risk Technologies (Princeton University Press, 1984), 3.
Perrow, Normal Accidents, 78.
Perrow, Normal Accidents, 21.
Perrow, Normal Accidents, 41.
Peculiar Books Reviewed: Alain de Botton's "Status Anxiety"
Originally published June 2, 2014
I went to a conference earlier this month where everyone was uniformly lovely and brilliant and interesting and everyone agreed that it was an excellent conference and damn near everyone felt like surely, soon, all the other uniformly lovely and brilliant and interesting people would realize that they, and they alone, didn't belong. If you, gentle reader, have never felt anything like this then, bless your heart, may you never. Everyone else, if you don't know, this is called Impostor Syndrome and it's generally understood to be an inability to properly accept one's accomplishments and talents as one's own, believing that the recognition of such from others is a mistake. The conference I refer to was Write the Docs in Portland, Oregon and the unique thing about it, in my experience, was how open and communicative the crowd was with one another. Fitting, I suppose, for a conference dedicated to communication. After a speaker admitted to feeling like a fraud, it became one of the large side-conversations of the conference -- as you might expect from a room full of people who care deeply about communication -- with folks sharing that same feeling and admitting it with palpable relief. It was a singular experience.
Throughout, I thought of Alain de Botton's "Status Anxiety". It's a charming book, split into two parts: the first is a historical discussion of the changing standards of high status through history and the second a series of themed essays for resolving the personal anguish caused by having low status. To say 'resolve' is something of a mistake, however: Alain de Botton is no Tony Robbins. "Status Anxiety" works as a survey of historical thought--trailing off somewhere in the early 20th century--and doesn't instruct readers so much as put them in a reflective frame of mind. If de Botton attempts to stress anything as invariant through time it is the universality of concerns about status and the variability of just what it is that 'status' means.
Different societies have awarded status to different groups: hunters, fighters, ancient families, priests, knights, fecund women. Increasingly (status in the West) has been awarded in relation to financial achievement.
Stopping as it does in the early 20th century, de Botton's text sets the stage for considering one's own Impostor Syndrome but does not address it. de Botton leaves off in the Industrial era, where money and its accumulation are the primary indicators of status and their lack a sign of moral deviance or other personal failings.
If the successful merited their success, it necessarily followed that failures had to merit their failure. In a meritocratic age, an element of justice appeared to enter into the distribution of poverty no less than that of wealth. Low status came to seem not merely regrettable but also deserved. ... To the injury of poverty, a meritocratic system now added the insult of shame.
Perhaps Impostor Syndrome is unique, a new thing we're adding on top of the mountain of anxieties--money, religious morality, personal appearance--that our species has accumulated as it's built more and more complex societies. Jean-Jacques Rousseau, as de Botton remarks on at length, asserted that in mankind's most simple states we had perfect wealth--and, therefore, no status anxiety--as our desires were exactly matched by our meager ambitions.
For the benefit of those who might wish to explain this away as the absurdly romantic fantasy of a pastoral author unreasonably ... it is worth noting here that if the eighteenth century paid attention to Rousseau's argument, it was in part because it had before it a single, stark example of its evident truths, in the fate of the indigenous populations of North America. ... (Native American) society ... (was a) materially modest yet psychologically rewarding ... Even a chief might own no more than a spear and a few pots. ... Within only a few decades of the arrival of the first Europeans ... (w)hat mattered most was no longer an individual's wisdom ... but ownership of weapons, jewelry and whiskey.
Later in the argument,
(T)he Indians, no different in their psychological makeup from other humans, had succumbed to the easy lure of the trinkets of modern civilization.
It seems more likely that Impostor Syndrome, rather than being a totally new thing, is a very old, latent behavior waiting for the right circumstances to become evident. Impostor Syndrome very often coincides with anxieties about intellectual prowess or experiences that might lead to greater intellectual ability. The modern professional workplace, in which duels to the death are discouraged as a way of defending or gaining status, provides a perfect environment for noticing and fretting about the distinctions in intelligence or know-how (along with all the other status-fixing concerns like wealth, appearance etc).
For example, I have a friend, a mathematician by training, who regrets that his childhood lacked 'practical' preparations. This admission occurred when I casually mentioned that I could repair drywall and had learned to do so when I broke a wall as a child and patched it up before anyone was the wiser using the craftsmanship books my father had on hand. My friend, you see, grew up relatively well-off in an urban environment and I did not on both counts. Where his parents focused on the common concerns of the Politically Correct in the 1990s, my own stressed individual reliance and deliberate progress. My friend the mathematician argues that these are a 'simpler' kind of values--hear the drums of Rousseau's Primitive Man pounding in the distance--and are, therefore, to be envied. Worse, my friend feels behind the curve on becoming a Whole Person for the lack of these practical sorts of skills and despairs of ever learning them. The curious thing here is that I envy my friend his more esoterically intellectual childhood. I should have liked to have learned a foreign language and cannot speak one now. I have never been outside of the United States--save the time I accidentally invaded Canada in the Boy Scouts. I feel, analogously to my friend, behind the curve on being a Whole Person. Each of us, I think, looks out onto the world and, feeling desire, feels oneself to be short of that desire, peculiar to each.
Impostor Syndrome is a reflection of this affliction, combined with an overconfidence in the contentment of our fellow humans. To feel before a crowd that you don't Belong is to think, surely, that they must have their acts together, they who are so uniformly lovely and brilliant and interesting. The trick here is that they do not, not if there's any ambition in their hearts. Excepting the rare stoic--and perhaps even then, felt if not expressed--each member of the crowd will look out with fresh eyes, weigh themselves against an immediate, superficial impression and find themselves lacking in some measure which is close to the desire of their heart.
Software Engineering is a profession struggling to gain diversity on every front. It strikes me that this shared recognition of Impostor Syndrome at Write the Docs is remarkably important. There, a few hundred people all came together to speak about documentation and ended up largely agreeing that, "Yes, I feel pretty awkward and scared most of the time," and it wasn't that big of a deal. Folks empathized, shared their stories and what suggestions, however meager, they had for overcoming status oriented discomfort.
However disgruntled or puzzled a social hierarchy may leave us feeling, we are apt to go along with it on the resigned assumption that it is ... somehow natural. ... (U)nderstanding may also be a first step towards an attempt to shift, or tug at, society's ideals, and thus to bring about a world in which it will be marginally less likely that veneration and honor will be dogmatically or unskeptically surrendered to those who are (already honored).
The craft of putting software together is a task of cooperation, between ourselves in the moment and those strangers in the future that will inherit our systems. To value those that can produce the most code or stir up the most controversy with aggressive rhetoric--to call this natural--is to lose those who write contemplatively or speak rarely and passively. If you enjoy programming machines but don't much care to "crush code"--what does this even mean?--by no means does that make you a fraud.
(Philosophy, art, politics, religion and bohemia) have helped to lend legitimacy to those who, in every generation, may be unable or unwilling to comply dutifully with the dominant notion of high status, but who may yet deserve to be categorized under something other than the brutal epithet of "loser" or "nobody". They have provided us with persuasive and consoling reminders that there is more than one way ... of succeeding at life.
If you're different from me and we're different from all the other people we're building things of worth with, we'll probably make it, in so far as anyone ever does.
Peculiar Books Reviewed: Francis Spufford's "Backroom Boys"
Originally published August 31, 2014
What is the soul of software engineering as a discipline? That is, who is it that the software engineer can esteem? What characteristics are laudable and worthy of emulation?
Physicists have their heroes: Niels Bohr, say, careful to the point of being paralyzed by indecision but brilliant nonetheless, or Richard Feynman, a person, by his own account, in love with the possibility of understanding, in a strictly physical sense, the Universe. Mathematicians can look to Ramanujan or Euler or Erdos. What of software engineers? Two individuals come to my mind. Foremost is Admiral Grace Hopper, inventor of the compiler, promoter of standardization and, least of all, originator of the maxim "It's easier to ask forgiveness than it is to get permission." Another is Donald Knuth, a computer scientist and engineer both, if you'll grant the distinction, who has over decades made smooth pearls of the fundamentals of the field, inventing a good chunk of it besides. From Admiral Hopper I, at least, take wit, savvy in bureaucratic environs and clever, clear explanations of complex topics. From Professor Knuth, I take patience and diligence. Of course, these are just a handful of traits from a few people. There's more to each of them and more than each of them, besides. These estimable characteristics are highly specific to myself, as well, and say less about the soul of the discipline than they do about my own soul. Perhaps you value Woz' legendary ability to do more with less over Professor Knuth's patience, for example.
It's on this difficulty that Francis Spufford's Backroom Boys: The Secret Return of the British Boffin ultimately runs aground. Spufford's ambition, according to his preface, is to tell the story of the backroom boys, "what industrial-age Britain used to call the ingenious engineers who occupied the draughty buildings at the edge of factory grounds and invented the technologies of the future" and "what happened to the backroom boys as the world of aircraft factories and steel mills faded." Spufford attempts to tell the story of the "adaptation" of the engineers beginning with the sputtering out of the British Black Arrow rocketry program shortly before achieving its sole successful flight, through the French/British Concorde, to David Braben and Ian Bell's Elite (a video game), the expansion of the Vodafone cellular network, through the Human Genome Project out and through the British-built, ESA-launched Beagle Mars probe. Through it all, Spufford claims an essential engineer-ness to the people profiled, describing a group of engineers--which included the author Arthur C. Clarke--who cheered--yes, they got up and cheered--a German V2 strike against London as it vindicated their theories that rocketry was, in fact, practically possible by saying that "(t)hey had the tunnel vision of the engineer, with its exclusive focus on what is technically possible." Elsewhere, a radio propagation specialist is described in similarly broad terms: "modification to his house that declares to his neighbors ... that an engineer resides within, is a custom TV aerial, self-built ... Like other engineers, he prefers to have things just so, intellectually as well as technologically." Braben and Bell are portrayed as a bit otherworldly, eager to explain that Elite could have supported 2 to the 48th power galaxies, not because it would make the game better--"Acornsoft could see that having 282,000,000,000,000 galaxies would rub the player's nose in the artificiality of what they were enjoying"--but because it was possible to do. Even more shaky is the distinction Spufford draws between scientists and engineers: "He was a scientist, not an engineer, so he put knowing above making as the highest, the most central of motivations."
Nonsense. Although he set out to examine the changing circumstances of the British boffin through the post-war decline, Thatcher-era austerity and up through the Information Economy boom of the 1990s, Spufford has really only succeeded in this regard in shoehorning complex people into a broad stereotype, furthering the mythology of the Engineer as different and better from you and me. Why Spufford has done this is clear: without an Archetype to track through time, there'd be no "adaptation" to discuss. I alluded to it a bit above, but this is how the book turns out: it's really just about people that happen to build things or, tellingly, about the business people who employ them. The Concorde, Human Genome Project and Cellular Network chapters are each more strongly focused on the business opportunity posed by a certain technical challenge than the engineers that went about creating a solution for the challenge.
It's an awful shame, really, because while the core of the book is dodgy it's really quite a good read. Each chapter is self-contained and carries along well, even when the underlying technical problem isn't terrifically stimulating. Spufford interviewed a huge number of people for the book, evident in the stories that were banal to those involved, but outlandish to outsiders. Typically:
Dribbles of (high-test peroxide, a rocket fuel) left behind after a test in the twists of a pipe assembly would drain onto the sleeve of a person taking it apart: 'Instantly the whole sleeve catches fire, pooff, as quickly as that. So everybody worked in twos, with one of them holding a running hose, and you just flicked the hose onto your mate when he was on fire and he'd go, "Oh, that was a nuisance."'
and:
At one point, with the Soviet Union falling, it even looked as if they might be able to get Russian spy sat pictures to use (for mapping Britain for cellular coverage purposes). ... Causebrook vividly remembers what arrived when he asked the Russians for a sample of their photographic workmanship. 'They sent the Pentagon! A very good image of the Pentagon where you could see the cars in the carpark...'
Throughout, however, are scattered small pearls of hard won wisdom:
If you don't know what to do, do something, and measure it.
Spufford falls into the trap of discussing Engineering through the tiny lens of a handful of the personal stories of engineers, leaving aside the difficulties of doing such work without, as Spufford puts it, resorting to "selective amorality." Even while Spufford chides individual engineers for building things just to build them, with no consideration of the consequences, he is unable to elaborate how the field might be differently exercised. The lens is too focused and, in the end, they're just people. People, sure, with a certain career and probably similar backgrounds, but people just the same. Was Admiral Grace Hopper a better archetypical software engineer than Professor Knuth is now? It's a nonsense question but this is the question that, to its detriment, lies at the center of Backroom Boys. Spufford's chiding of "technologists" for their selective amorality is ironic as he commits a similar error, considering complex individuals only in terms of their service to the narrative flow of the book.
What then is the soul of software engineering, the question posed so long ago at the start of this essay? Why, it is the soul of engineering in general, which is the creation of something out of "obdurate matter" partially for its own sake and partially to meet some need. This is a very Human impulse and therein lies the failure of this whole distinction: at no point is the engineer fundamentally different from other people. Rather, the engineer is merely specialized in a domain of interest and, perhaps, more specially educated. No more, no less. How does a practicing engineer address the problem of "selective amorality" in the pursuit of Engineering? How do we avoid cheering our own V2 strikes? That's another book and another review, I'm afraid.
Peculiar Books Reviewed: Henry S. F. Cooper Jr.'s "Thirteen: The Apollo Flight that Failed"
Originally published October 1, 2014
In the first Peculiar Books Reviewed we discussed David A. Mindell's delightful book "Digital Apollo" and, in particular, took from the book this lesson: a technical system which puts the human component in a position of supremacy over the machine is more capable of achieving the aim of the system than one which holds humans in a subservient position. That is, ignoring any moral or dystopian considerations, putting people in a role in which they are made to serve machines creates worse outcomes in terms of what has been built. Mindell puts this as the difference between "engineering" and "pilot" mentalities, the former being in favor of full automation–think Wernher von Braun's desire to have mere passengers aboard a clockwork spacecraft–and the latter in favor of manual control of the craft. The "pilot" mentality works fine in systems that are relatively simple but as the complexity increases human ability to cope with the banal demands of operations falls off: we can only do so much. The "engineer" mentality succeeds up until the system encounters a situation not expected by the engineers and the mechanism, being unable to cope, falls back to human operators who may or may not be paying attention at the time of the crisis or capable of adapting the mostly automated system to their immediate needs.
This idea, that the role of people in a complex system–spacecraft, software only, industrial etc–can be considered in purely technical terms is important enough that I'm going to spend this review and the next elaborating on it. There's a moral argument to be made as well, as hinted at in the review on Francis Spufford's "Backroom Boys", but the time is not ripe yet for that.
At a little after nine central standard time on the night of Monday, April 13, 1970, there was, high in the western sky, a tiny flare of light that in some respects resembled a star exploding far away in our galaxy.
Thus begins Henry S. F. Cooper, Jr.'s "Thirteen: The Apollo Flight That Failed", one of the best technical explanations of a catastrophic failure and its resolution ever written. This "tiny flare of light" was a rapidly expanding cloud of frozen oxygen coming from the now seriously damaged Service Module (SM). A tank failure ("Later, in describing what happened, NASA engineers avoided using the word 'explosion;' they preferred the more delicate and less dramatic term 'tank failure'…") of Oxygen Tank No. 2 damaged the shared line between the two primary oxygen tanks and the three fuel cells. Immediately after the failure two of three fuel cells began producing reduced amounts of electricity as a set of reactant valves which fed them were jostled shut, permanently, by the force of the failure. Another valve, meant to isolate Oxygen Tank No. 1 from No. 2 failed because of the same mechanical jarring, but was left in an open position. Over the next two hours, both tanks vented into space, pushing the craft off course and ruining the Service Module.
The subsequent flight which Cooper so expertly lays out was a "ground show", in the words of the astronauts themselves. Usual operation of the flight is a delicate balance between the on-board astronauts–in physical possession of the craft and able to manipulate it–and the flight controllers, receiving constant telemetry from the craft, thinking through consequences and making recommendations. Cooper describes this by saying "Astronauts are more like officers aboard a large ship… (and) there were about as many astronauts and flight controllers as there are officers aboard a big vessel (…) In fact, one of the controllers, the Flight Director, in some respects might have been regarded as the real skipper of the spacecraft…" Apollo craft could have operated independently of any ground crew but only in planned-for situations. Post-failure, it became the flight controllers' task to find a plan to land the astronauts safely and the crew's job to carry this out.
Plan they did. With the service module ruined it was abandoned and the crew began to use the Lunar Module (LM) as a life-boat, an eventuality never seriously considered.
Aside from some tests a year earlier (…) no one had ever experimented to see how long the LM could keep men alive–the first thing one needs to know about a lifeboat.
Almost entirely through luck the LM was equipped sufficiently to make the trip back survivable and possible. Cooper was likely unaware, but as Mindell pointed out the LM and the command module carried duplicates of the same computer, meaning that the LM computer, not being a special-purpose lunar-landing device, could make rocket burns to return the craft to Earth. The rigging of various internal systems–made famous in the Apollo 13 film: the CO2 scrubbers were incompatible between modules and had to be adapted–careful rationing of electricity and continuous drift from a landing flight-path kept Mission Control busy creating and testing new flight checklists.
Cooper's real interest is the people involved in this story and their interplay through the crisis. Astronauts rushed in to man simulators to test flight controller theories about rocket firings, computer teams kept telemetry gathering systems, flight projection calculators and the CMS/LMS Integrator which "would insure that the instructions for the two modules dovetailed–that there were no conflicts between them" humming. Cooper is telling the story of a complex organization manning a damaged complex system, with human lives at risk. Implicit in all of this are the machines these people are using: tools being adapted to new situations and the spacecraft being repurposed in ways never intended.
In a basic sense, the Apollo spacecraft was a couple of habitable tin cans, some rockets and two computers to control said rockets. The computer was 'programmed' by calling up subroutines and feeding in input parameters, all augmented by feedback from the pilot. Normal flight operations dictated the call-up of subroutines and the parameters input, with a feedback loop dictated by real-time telemetry from the craft and astronauts' expert opinions. The Apollo computer could not demand nor decide; it was instructed. To deal with this 'limitation' NASA was forced to invest in training of all flight staff and ensure that the craft could be flexibly programmed by the astronauts. This, of course, meant that the craft and crew were not rigidly locked into a fixed plan but could use their human understanding to change course (literally, in this case) as reason dictated.
In documenting the catastrophic failure of Apollo 13, Cooper has likewise documented the exquisite working of a complex organization in a position of mastery over a complex system. These human-oriented complex systems are arranged to take our instructions, to guide but not command. In a crisis, this proves invaluable: we humans may apply our intelligence to the problem at hand and use the machine as just another tool in the solution, keeping in mind, of course, the limitations of the machine but never once struggling to bend it to our informed wills. We may also choose to opt out of the tool's use. Only Jim Lovell, commander of the Apollo 13 mission, intended to make use of the LM's ability to automatically land itself. He never got the chance, of course, but there's something telling in the notion that every other astronaut who landed on the Moon–comfortable with and pleased by the craft's theoretical abilities, all–would choose to go down manually.
As a society, we're building more and more purely automatic complex systems. In the best case they take no input from humans and function in so far as the system's engineers were able to imagine failures. In the worst case, they demand input from humans but do so within the limited confines of the system engineers' imagination, implicitly invalidating any and all expert opinion of the human component. Such systems are brittle. Such systems are, indeed, not maintainable in the long-term: the world changes and knowledge of their operation is lost as none of the humans involved in the system ever truly were responsible for understanding its mechanism.
What Cooper has done is craft an engaging story about a very nearly fatal six day trip round the moon in a faulty craft. What he has also done is to give a vision of the effective interplay between human and machine in a way which enhances the overall capability of the people involved, extending their strengths and making up for their weaknesses. This is the valuable contribution of Cooper's book: a rough blueprint, seen through a particular accident, for complex systems that must be tolerant of faults in the fulfillment of the designer's aims. Machine-oriented systems are fine, maybe even less onerous to run in the average case, but in failure scenarios seriously bad things happen.
More on that, next review.
Peculiar Books Reviewed: Grigori Medvedev's "The Truth About Chernobyl"
Originally published November 5, 2014
Let me tell you a joke.
Days after the Chernobyl plant melted down General Tarakanov, aware of the extreme importance of clearing the reactor roof of radioactive graphite ahead of the weather, began accepting offers of robots from other nations to do the job.
The West Germans, very confident, delivered a tele-presence robot designed for coal mining in dangerous conditions. The robot was lifted onto the roof and set to work, pushing blocks toward the crater. After only a few minutes it ceased to function, ruined by the radiation.
The Japanese, also very confident, delivered a autonomous robot to do the job. Placed on the roof it enjoyed some success, pushing over a ton of radioactive material back into the breach before succumbing to the radiation.
Seeing that the roof was beginning to be cluttered with dead, radioactive robots General Tarakanov said, "This is nonsense! Soviet Science has developed the perfect robot for situations such as this!"
Up went the Soviet's robot and, indeed, it performed beautifully. Though less strong than the Japanese robot it functioned far longer and managed to push several tons of materials back into the reactor. After several hours, seeing that the robot was beginning to be affected by the radiation, General Tarakanov retrieved his bullhorn and called up to the roof, "Private Sidorov, you may come down now!"
It's not very funny, I'm afraid, but such were the jokes of the Liquidators, the soldiers and volunteers tasked with the cleanup of the Chernobyl Disaster. Said in another fashion, earnestly, by the chairman of the State Committee on the Use of Nuclear Energy, A. M. Petrosyants: "Science requires victims."
"That is something one cannot forget,"1 says Grigori Medvedev in his "The Truth About Chernobyl". The jacket blurb for my particular copy boasts that Medvedev's book is "an exciting minute-by-minute account ... of the world's largest nuclear disaster and coverup." Make no mistake, Medvedev's book is the primary source we have, especially in the English-speaking world, for insight into individual reactor operators' actions on April 26, 1986. The book is written with great skill and insight into the plant's operation but it is not exactly what I would call exciting. Rather, it is a work of creeping horror.
Why did Unit No. 4 of the Chernobyl Nuclear Power Plant melt down? With a nod to Charles Perrow's "Normal Accidents" there were several System Accidents lurking. The most apparent is the instability of the reactor design, a "high-power channel-type reactor" (RBMK). "The core of an RBMK ... is tightly packed with graphite columns, each of which contains a tubular channel. The nuclear fuel bundles are loaded into these channels ... The tubular openings ... receive control rods, which absorb neutrons. When all the rods are lowered within the core, the reactor is shut down. As the rods are withdrawn, the chain reaction of nuclear fission begins ... The higher the rods are withdrawn, the greater the power of the reactor."2 A little later, Medvedev notes that RBMK-type reactors suffer from a "series of positive reactivity coefficients..."3 In particular, this last means that the reactor--which can only be shut down through the intentional movement of control rods, which must be perfectly aligned--has a tendency to increase its output, to the point of "positive shutdown": the RBMK's default state is to blow up.
More concerning, those in charge of the reactor were not nuclear experts. Chief Engineer Fomin, in control of Unit 4 at the time of the accident, was an electrical engineer by training, with no advanced understanding of nuclear facilities. "I talked to Fomin and warned him that a nuclear power station was a radioactive and extremely complex facility. ... With a knowing smile he replied that a nuclear power station was a prestigious and ultramodern place to work; and that, in any case, you didn't have to be a genius to run one ..."4 The plant's manager, Bryukhanov, was "specialized in turbines"5. The plant, poorly understood by those who supervised it, was not operated with due respect to its danger. Equipment vital for the diagnosis of failures and for the safety of individuals was simply unavailable. "I asked whether those were the only radiometers they had," from the testimony of a shift foreman, "and (they) told me that they had some but they were in the safe, which had been buried in the rubble after the explosion. (They) felt that the people in charge of the plant had never expected such a serious accident."6
Medvedev treats the proximate cause of the accident with care, going to great lengths to establish the sequence of events of that night, explaining how a relatively common test of latent inertial energy potential of the plant's steam turbines became catastrophic: over-confidence in the infallibility of the plant and incompetence in its operation. "Thus, the emergency core cooling system was disconnected deliberately ... Apparently (the operators) were confident the reactor would not fail them. ... It is clear that the operational staff did not fully understand the physics of the reactor..."7 is a theme played in its various repetitions throughout the text: chronic failures of communication, deliberate misinformation and an unwillingness to believe what was right before one's eyes. After the reactor's destruction the operational staff continued to insist that the reactor had not been breached, no matter that radiation readings were off the scale of the devices on-hand and graphite, which could only have come from the very center of the reactor's core, was strewn about the plant. When told that the radiation situation was fatal after limited exposure in many places, the plant manager responded, "There's something wrong with your instrument. Fields that high are just impossible. ... Get that thing (a radiometer capable of measuring up to 250 roentgens) out of here, or toss it in the garbage!"8 Over and over, Medvedev documents the slow movement toward understanding of those responding to the disaster, of its immense scale. Even once the breach was accepted many people were needlessly irradiated--often fatally--for want of a clear-eyed grasp of the situation, of both the immediate and lasting dangers of radiation. "People tend to see only what is convenient for them to see--even if it costs them their lives!"9 laments Medvedev.
The Chernobyl reactor is one of the most dramatic examples of a system designed with humans as a mere service component. The reactor, requiring constant intervention to avoid going critical, was not arranged internally such that those on-duty at the time of any incident could properly diagnose what had happened, nor had the operators been sufficiently trained--or screened for training--to make a good go at keeping the reactor in a steady state. Medvedev notes, even, that the system of feedback in the event of minor accidents was entirely severed: accidents were hushed up and no one could learn from them, all the while a 'spotless' record engendered a casual over-confidence. The reactor was presumed to be so safe, effortlessly efficient, that it had never been designed with mundane human intervention in mind, requiring, instead, in the end, heroics on a massive scale. Had the plant been designed with humans in a more elevated role it would still have, eventually, suffered a catastrophic failure--system accidents, being as they are--but fewer people would have died from radiation burns, having been issued proper equipment, and fewer civilians would have been dangerously irradiated, having been informed of the radiation risks and evacuated.
Ultimately, that's the real trick to such systems. Disasters will occur; it is the nature of the response to the disaster and the effects of the disaster which change. Consider that the response to the Apollo 13 accident had many of the same features as the Chernobyl response: redundant systems that weren't and led experts off on tangents, high trust in the correct operation of the machine and a disbelief in the instrumentation of the system post-failure. Unlike the Chernobyl operators, Mission Control and the Flight Astronauts were all highly trained domain experts, equipped with an intuitive understanding of the machine they were operating and provided by said machine with tools to override its behavior. With no small effort, the lunar module was re-purposed, mid-flight, into a life-boat. However, machine-oriented systems simply are how they are. When this is good, it's very good: humans can enjoy the ride, feeling positively about how technologically clever we are. When this is bad, though, it's dreadful: not truly understanding the machine, we're at its mercy as it travels, by itself, to its default failure state. Maybe, if we're lucky, this is a quiet failure.
When reading "The Truth About Chernobyl" think very carefully through the descriptions of nuclear tans which turn into full-body necrotizing flesh, of firemen irradiated to death in a matter of hours, of shifts clearing debris so ruinous that the soldier's entire military mobilization lasts a mere forty-five seconds because any further work would have irradiated them too much to be of any further use. Think this all through and consider how the people tied up in the systems you build end up when the systems suffer their inevitable accidents. Very few things are so deadly serious as a nuclear reactor, of course, but failures must be considered, with great care, in the pursuit of technical excellence. Failure as a first-class concern in the design of a system adapts the system to meet the challenges human operators will face, giving them tools, insight and, ultimately, a position of supreme control over the mechanism. This is a natural result of seriously considering failure and any system which subverts the human to the machine has not been designed with graceful failure in mind, necessarily. Medvedev's "The Truth About Chernobyl" charts the progress of one such machine-oriented system, through its inevitable, catastrophic failure and on through the struggle to contain the damage. Medvedev gives the reader an outsized example of a general concern for anyone knocking mechanisms together.
Grigori Medvedev, The Truth About Chernobyl (BasicBooks, 1991), 5.
Medvedev, The Truth About Chernobyl, 34.
Medvedev, The Truth About Chernobyl, 35.
Medvedev, The Truth About Chernobyl, 43,44.
Medvedev, The Truth About Chernobyl, 42.
Medvedev, The Truth About Chernobyl, 129.
Medvedev, The Truth About Chernobyl, 49.
Medvedev, The Truth About Chernobyl, 113-114.
Medvedev, The Truth About Chernobyl, 131.
Writing "Hands-On Concurrency with Rust"
Hey, I wrote a book!
From late November 2017 to early May 2018 I did very little other than write my book, Hands-On Concurrency with Rust. In this post I'd like to tell you a little bit about the book itself—which, miraculously, is available for purchase—but also about the process of writing it. I've done a number of elaborate conference talks over the years and written a good deal of software, but this was the first time I'd seriously attempted a book. Truthfully, I didn't know what I was doing at the outset but I did it with gusto and kept at it and here we are. There's a book now. Let me tell you all about the work.
What's the book about?
First, what's Hands-On Concurrency with Rust about? Hopefully you can guess some of the subject from the title. It's a Rust-focused book that's meant to teach you, as of 2018, what you can do in Rust to fiddle with modern, commodity parallel machines. The ambition I had at the outset of the project was to lay out all the work I do professionally when I'm doing parallel programming. How do you go about imagining the machine itself and structuring your data effectively for it? How do you validate the good function of concurrent implementations, especially across different CPUs? How do you debug these things? What crates in the Rust ecosystem ought you to know about? How do these crates actually work?
The book aims to teach mechanical sympathy for modern machines. We do a deep-dive into the Rust compiler, we use atomic primitives to build up mutexes and the like, examine atomic-safe garbage collection methods, investigate the internal mechanism of Rayon and a ton more. Here are the chapter titles, which I tried to make as on-the-nose as possible:
- Preliminaries: Machine Architecture and Getting Started with Rust
- Sequential Rust Performance and Testing
- The Rust Memory Model: Ownership, References and Manipulation
- Sync and Send: the Foundation of Rust Concurrency
- Locks: Mutex, Condvar, Barriers and RWLock
- Atomics: The Primitives of Synchronization
- Atomics: Safely Reclaiming Memory
- High-Level Parallelism: Threadpools
- FFI and Embedding: Combining Rust and Other Languages
- Futurism: Near-Term Rust
When writing out the book outline, I imagined the concepts arranged in an inverted pyramid: the first chapter—Preliminaries—is the base of everything. In this chapter I discuss the CPU itself, how it operates and what features of that operation are especially apropos to the book. The second chapter introduces the consequences of, say, the cache hierarchies introduced in Preliminaries and motivates perf, valgrind and the use of lldb/gdb from that. The chapter Locks discusses the semantics of coarse locking and Atomics: The Primitives of Synchronization then takes all that material and re-builds it from scratch. But, once you start doing this it becomes clear that memory reclamation is a hard problem, motivating Atomics: Safely Reclaiming Memory. In this fashion, the book builds on itself.
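To give a flavor of that rebuilding, here is a minimal spin-lock sketch of the general sort the Atomics chapters work up to. This is an illustration only, not code from the book: a single AtomicBool stands in for the lock word and contention is handled by spinning.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// A toy spin-lock, for illustration only: not the book's implementation.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Spin until we win the race to flip `locked` from false to true.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        // Release the lock; pairs with the Acquire ordering in `lock`.
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let lock = SpinLock::new();
    lock.lock();
    // ... critical section ...
    lock.unlock();
}
```

The interesting questions, like why those orderings and what to do about contention, are the sort of ground the Locks and Atomics chapters cover.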
Okay, well, I mentioned an outline? How exactly does one go about writing a technical book?
How 'Hands-On Concurrency' was written.
The kind folks at Packt Publishing reached out to me in late November 2017 wanting to develop a video course on concurrent programming in Rust. Putting together conference presentations is something I adore doing so I was immediately interested in putting together a video script. I found, also, through my work on cernan that I was having repeat conversations about the structure and validation of parallel Rust systems and had been (unsuccessfully) putting notes together on the topic. The lack of success was a result of size limitations: I figured a note on the subject suitable for cernan development should be no more than 50 pages or so. This very blog contains some of that work in the hopper series, which needs but one final article to be complete.
Anyhow, the topic is one I'd been thinking on for a while, so a video course falling into my lap seemed real opportune.
From video course to book.
Why did I end up writing a book? I had a very enjoyable Skype conversation with my initial acquisition editor—this is the person that scouts new authors and shops subjects to the same—and it became clear to them that my personal ambitions were too grand for a video course. By, like, a lot. Here's what I wrote in my journal on the subject:
I aim to produce a path into the field. Each chapter should demand the next. Each chapter, it should be clear, must sit on top of a great body of research.
The eventual book came to just under 100k words—Thoreau's Walden, for reference, is 114k—and 7k lines of source code. Imagine how horrible it would have been to sit through the videos of all that. (The audiobooks of Walden, for reference, run around 11 hours.) The acquisition editor was forthright and said that it was plainly too much material for a video course but that Packt was interested in developing the project as a book as well.
In 2015 I put together one talk a month for six months, flying around to present them in the process. I remember that time as difficult, but doable. "How hard can a book be?" I asked myself. "Treat a chapter like a talk and work them one after the other. Doable."
Here's the secret about my talks: they are simple subjects presented in great detail. Each talk is one clean and small idea supported by careful selection of photographs, production of illustrations and typographic design. The longest-to-prepare talk I've ever done, Getting Uphill on a Candle, took around 200 hours, give or take. It's also the most complex I've put together, being that it's a history of aeronautics research through the 20th century. Most every other talk takes around 100 hours, which I put together over weekends and nights after work. That's not a modest amount of time but it is possible to turn around in a month with familiar subject material. If you were to include the background research time in the preparation estimate, well, you'd be looking at closer to 1k hours, easy. Most are the result of years-long cycles of reading, trips to the book store and talking over the particulars with friends.
I forgot that.
Here's my first naive mistake with the book and I did it in the initial phone call: I assumed that because I could use a body of knowledge in my work I was familiar enough with the material to teach it.
"Sounds fun. Let me put together an outline." I told the editor.
What is an outline?
At least with Packt, the book outline serves two purposes. Firstly, it informs the editors of the intention of your book project: these are the readers I expect, these are the things I intend to cover, these are the things I don't intend to cover. Secondly, the outline serves to structure the writing project itself. I wrote a summary of each chapter, describing how the chapters would call back to previous ones and influence those to come. This was very valuable four months in, when I was tired and had maybe lost the bigger sense of the project.
The outline didn't end up being the book I wrote. I'd assumed that the seventh chapter, Atomics: Safely Reclaiming Memory, would be a smaller section in the sixth, Atomics: The Primitives of Synchronization. As it happens, chapter seven is one of the longest in the book!
If you're interested, you can find the outline we went with on Dropbox, here. The book eventually shipped with ten chapters, not twelve, and the working title Rust Concurrency didn't survive through to the end of the project.
But! By December 23 I had word the publisher was enthusiastic for the book idea, contracts were signed and I was off.
Laying down chapters
To help me keep tempo on the project I wrote a little program called coach that would inspect a yaml file with chapter/date goals and tell me how much I needed to write each day. Here's the schedule file:
---
global:
  total_pages: 350
  words_per_page: 300
chapters:
  - sequence_number: 01
    total_pages: 20
    preliminary_draft: 2018-01-10
    final_draft: 2018-04-29
  - sequence_number: 02
    total_pages: 35
    sufficient_wordcount: true
    preliminary_draft: 2018-01-22
    final_draft: 2018-05-02
  - sequence_number: 03
    total_pages: 35
    preliminary_draft: 2018-02-01
    final_draft: 2018-05-05
  - sequence_number: 04
    total_pages: 25
    preliminary_draft: 2018-02-09
    final_draft: 2018-05-08
  - sequence_number: 05
    total_pages: 30
    preliminary_draft: 2018-02-19
    final_draft: 2018-05-11
  - sequence_number: 06
    total_pages: 40
    preliminary_draft: 2018-03-04
    final_draft: 2018-05-14
  - sequence_number: 07
    total_pages: 20
    preliminary_draft: 2018-03-11
    final_draft: 2018-05-17
  - sequence_number: 08
    total_pages: 15
    preliminary_draft: 2018-03-16
    final_draft: 2018-05-20
  - sequence_number: 09
    total_pages: 45
    preliminary_draft: 2018-03-29
    final_draft: 2018-05-23
  - sequence_number: 10
    total_pages: 20
    preliminary_draft: 2018-04-05
    final_draft: 2018-05-26
  - sequence_number: 11
    total_pages: 40
    preliminary_draft: 2018-04-18
    final_draft: 2018-05-29
  - sequence_number: 12
    total_pages: 20
    preliminary_draft: 2018-04-25
    final_draft: 2018-06-01
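coach itself isn't reproduced here, but the arithmetic it has to do is small. The sketch below is a guess at that arithmetic rather than the actual implementation: the chapter-one numbers are hard-coded in place of YAML parsing and a bare day count stands in for real date handling. The inputs are picked so the answer lines up with the 500 word days mentioned next.

```rust
// Hypothetical sketch of a coach-style daily target. Not the real tool:
// it skips the YAML and date handling and just does the division.
fn words_per_day(total_pages: u32, words_per_page: u32, words_written: u32, days_left: u32) -> u32 {
    let target = total_pages * words_per_page;
    let remaining = target.saturating_sub(words_written);
    if days_left == 0 {
        remaining
    } else {
        // Round up so the final day carries no surprise backlog.
        (remaining + days_left - 1) / days_left
    }
}

fn main() {
    // Chapter 01: 20 pages at 300 words per page. Suppose 2,000 words are
    // already written and eight days remain before the preliminary draft.
    println!("{} words today", words_per_day(20, 300, 2_000, 8));
}
```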
The schedule here is intense in a way I didn't fully appreciate at the outset. But, the general notion of doing a little bit of work each day—usually in the morning—is a style of work I find very comfortable. Up through chapter three, approximately, I managed to keep up with the schedule. coach would say it was a 500 word day and I'd crank through that before work, over lunch breaks (carting my personal laptop back and forth to work) and again after work. Saturdays were fairly well taken over as uninterrupted writing days. This pattern held all through the project—I don't have children and my wife works long hours as a chef with weekends offset from my own—but the schedule itself broke down once I'd got through chapter three.
I forgot about software.
If you have a look at the book's code repository and take a peek in the chapter sub-directories you'll find that the amount of software I wrote custom for the chapters increased markedly between chapters three and four and then again from chapter four to five. The software written for the book is tested (with quickcheck, often enough), probed with fuzzers and verified in a variety of other ways I go into in the text. Meaning, I found a good deal of bugs in the custom software and spent much more time than expected on, say, a single section out of a chapter. I wrote the book serially—hah!—and without any co-authors: if I couldn't figure out a bug I couldn't advance the book.
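For anyone who hasn't used it, a quickcheck test states a property about your code and lets the library go hunting for counter-examples. The property below is a toy, not one from the book's repository, but it is the shape of thing I mean:

```rust
// quickcheck is assumed to be a dev-dependency in Cargo.toml.
#[cfg(test)]
mod tests {
    use quickcheck::quickcheck;

    quickcheck! {
        // Toy property: reversing a vector twice gives back the original.
        fn double_reverse_is_identity(xs: Vec<u32>) -> bool {
            let mut ys = xs.clone();
            ys.reverse();
            ys.reverse();
            ys == xs
        }
    }
}
```

The book's own properties target the concurrent structures themselves, which is where this style of testing earns its keep.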
Judging by my commits, here's how the book actually proceeded:
Chapter Name | Started | First Draft Complete |
---|---|---|
Preliminaries: Machine Architecture and Getting Started with Rust | December 30, 2017 | January 9, 2018 |
Sequential Rust Performance and Testing | January 10 | January 20 |
The Rust Memory Model: Ownership, References and Manipulation | January 23 | February 04 |
Sync and Send: the Foundation of Rust Concurrency | February 20 | February 25 |
Locks: Mutex, Condvar, Barriers and RWLock | February 26 | March 01 |
Atomics: The Primitives of Synchronization | March 02 | March 17 |
Atomics: Safely Reclaiming Memory | March 21 | April 07 |
High-Level Parallelism: Threadpools | April 22 | April 28 |
FFI and Embedding: Combining Rust and Other Languages | April 30 | May 08 |
Futurism: Near-Term Rust | May 09 | May 11 |
Each time you see a big jump between the end of a chapter and beginning of the next, that's me struggling to get the software for that chapter ready. Without the software, there's not much to talk about in Sync and Send. Each time there's a chapter that goes on for a while, like the Atomics chapters, I'm moving slowly from one section to the next as I produce the software to write about it.
Ultimately, the schedule the publisher and I worked up was just not relevant to the day-to-day work of putting the book together and I started ignoring it for my own well-being. I'm sure that my editors would be not at all surprised by that information.
My writing environment
Packt Publishing has a workflow built around a thing called Type Cloud, which, near as I can tell, is built on top of WordPress. Type Cloud is a WYSIWYG browser-based editor and ended up being very effective for the final proofing stage of the book: going through and fixing up formatting issues or incorporating technical reviewer feedback. For writing, at least for me, Type Cloud did not present a good working environment. Part of the project deal was that I'd send my content editor—more on the different kinds of editors below—the specially formatted HTML-like markup that Type Cloud uses internally and they'd import it into the WYSIWYG environment.
My understanding is that other publishers work end-to-end in Asciidoc, LaTeX or similar. That sounds nice.
I wrote in Pandoc markdown, one file per chapter. The source code lived in a separate directory structure from the chapter source and I included it with pandoc-include-code. I'm not at all sure how I would have produced the book without this plugin.
A great deal of care was taken to ensure that readers wishing to type out the source code of the book could get a working project on the other end. Me, I find it beneficial to type along with a text and get really frustrated when the source code is missing something important. Also, it's not impossible that the source code for a book won't be available online a few decades after it was initially published: even worse! I, my editors and the technical reviewers all went through and checked that you could read the book in this way.
The editing team
There were a good number of other people that worked on this book with me, each with different jobs. I worked with most everyone through email or Type Cloud's inline commenting scheme. The roles:
- acquisition editor, discussed the initial project with me, worked to finalize the outline
- content editor, my day-to-day contact, who helped shepherd the overall tone of the book
- technical editor, where the 'content' editor was concerned with the prose of the book, the 'technical' editor was concerned with the software side of the house
- copy editors, this small crowd of people went through the text, correcting typos, asking about awkward wording, ensuring the typographic conventions were uniform throughout and asking me not to begin sentences with "Which, ..."
- technical reviewer, this person typed out all the book source code, called out any questionable assertions in the text and generally read the book as a reader would, advocating for the reader's position
Each of these people had a different take on the book. Some takes I agreed with, some strayed from my original intentions. The copy edit stage—this happens after the first draft is submitted and before the draft is considered final—was kind of rough in this regard. I fully admit I am particular about how I word things and generally read material aloud to make sure the language flows in the way I'd say it. (I will have read this article aloud a few times before you've read it.) My other major naive mistake was to believe that my sense of a chapter would make it through from the first draft to the final. It mostly did, but if a couple of copy editors insisted on a phrase change I'd have to ask myself if this was vital to my intention or just part of my offbeat idiolect.
Final thoughts
Now that I've gone through all the work to write a book it's clear to me that I would love to write another. I'm not sure on what subject but I guess we'll see here in the next five years or so. I think I could, very reasonably, take a few of the chapters from Hands-On Concurrency with Rust and expand them into something stand-alone.
That said, I absolutely will not write one in five or six months again and probably couldn't in the future. I bet the life circumstances that allowed the book to be written in as brief a time as I took are temporary. In retrospect I can't rightly understand how the book was written so quickly and I hope, as people go through it, they don't come away feeling that it was rushed. Me, I've read the book through several times and don't think it reads rushed, but I also don't have much space from it.
The process of producing a book in secret and then gating access to the material behind payment seems odd to me. I know that's how it's mostly done. But, a model like Learn You Some Erlang for Great Good!, Real World Haskell or Crafting Interpreters—where the text of the book is available online during the writing and you can contact the authors meanwhile but if you want a physical copy you gotta pony up—seems like a better way to spread the ideas of a book and maybe a way to build a broader audience. This impression is driven partly by the collaborative, open model of software development I've grown up around and a sense that I won't make a living writing books. My motivation is to disseminate ideas. Making money sweetens the pot. It's similar for talks: I want to get some idea across and being able to travel around to do so is a really nice treat on the side.
Anyhow, if you have read or intend to read Hands-On Concurrency with Rust I sure do thank you. It was a real pleasure to put it together and I hope it has or will manage to teach you something.
An Incomprehensive Bibliography
Written: 2015-07-24
This is an incomprehensive bibliography for the books and articles that have influenced my thinking around Instrumentation by Default for complex systems. As this is an area of ongoing personal research, this bibliography is necessarily a vague snapshot.
Books
The following texts are sorted—roughly—in order of importance:
- Charles Perrow, "Normal Accidents: Living with High-Risk Technologies" review
- Joseph Tainter, "The Collapse of Complex Societies"
- Henry David Thoreau, generally
- David A. Mindell, "Digital Apollo: Human and Machine in Spaceflight" review
The following texts are sorted—precisely—as they occurred to me:
- Charles Perrow, "Complex Organizations: A Critical Essay"
- Svetlana Alexievich, "Voices from Chernobyl"
- Henry S.F. Cooper Jr., "Thirteen: The Apollo Flight that Failed"
- David E. Hoffman, "The Dead Hand: The Untold Story of the Cold War Arms Race and its Dangerous Legacy"
- Igor Kostin, "Chernobyl: Confessions of a Reporter"
- William Vollmann, "Europe Central"
- Kurt Vonnegut, "Player Piano"
- Various, "Report of the PRESIDENTIAL COMMISSION on the Space Shuttle Challenger Accident"
- Richard Feynman, "Surely You're Joking, Mr. Feynman"
- Eric Schlosser, "Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety"
- Hermann Kopetz, "Real-Time Systems: Design Principles for Distributed Embedded Applications"
- Various, "The Practice of Programming"
- Deepwater Horizon Study Group, "Final Report on the Investigation of the Macondo Well Blowout"
- Various, "Library of American: The Debate on the Constitution I & II"
- Sanora Babb, "Whose Names are Unknown"
- Lockheed Martin Corporation, "Joint Strike Fighter Air Vehicle C++ Coding Standards"
Articles/Essays
- Joe Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors"
- C. West Churchman, "'Guest Editorial' of Management Science (Vol. 14, No. 4, December 1967)"
- G. H. Hardy, "A Mathematician's Apology"
On Deck
There are several texts that I haven't gotten a chance to read yet. I can't vouch for them with complete confidence, but I'm excited to read them.
- Czeslaw Milosz, "The Captive Mind"
- Francis Spufford, "Backroom Boys: The Secret Return of the British Boffin"
- Samuel C. Florman, "The Existential Pleasures of Engineering"