[linux-elitists] ELC: SpaceX lessons learned

Eugen Leitl eugen at leitl.org
Mon Mar 25 04:54:08 PDT 2013


http://lwn.net/Articles/540368/

ELC: SpaceX lessons learned

By Jake Edge

March 6, 2013

On day two of the 2013 Embedded Linux Conference, Robert Rose of SpaceX spoke
about the "Lessons Learned Developing Software for Space Vehicles". In his
talk, he discussed how SpaceX develops its Linux-based software for a wide
variety of tasks needed to put spacecraft into orbit—and eventually beyond.
Linux runs everywhere at SpaceX, he said, on everything from desktops to
spacecraft.

Rose is the lead for the avionics flight software team at SpaceX. He is a
former video game programmer, and said that some lessons from that work were
valuable in his current job. He got his start with Linux in 1994 with
Slackware.

SpaceX as a company strongly believes in making humans into a multi-planetary
species. A Mars colony is the goal, but in order to get there, you need
rockets and spaceships, he said. It is currently expensive to launch space
vehicles, so there is a need to "drive costs down" in order to reach the
goal.

The company follows a philosophy of reusability, which helps in driving costs
down, Rose said. That has already been tried to some extent with the space
shuttle program, but SpaceX takes it further. Not only are hardware
components reused between different spacecraft, but the software is shared as
well. The company builds its rockets from the ground up at its facility,
rather than contracting out various pieces. That allows for closer and more
frequent hardware-software integration.

One thing that Rose found hard to get used to early on in his time at SpaceX
is the company's focus on the "end goal". When decisions are being made,
people will often bring it up: "is this going to work for the Mars mission?"
Mars doesn't always win, but that concern is always examined, he said.

Challenges

Some of the challenges faced by the company are extreme, because the safety
of people and property is involved. The spacecraft are dangerous vehicles
that could cause serious damage if their fuel were to explode, for example.
There is "no undo", no second chance to get things right; once the rocket
launches "it's just gonna go". Another problem that he didn't encounter until
he started working in the industry is the effect of radiation in space,
which can "randomly flip bits", something that the system design needs to
take into account.

There are some less extreme challenges that SpaceX shares with other
industries, Rose said. Dealing with proprietary hardware and a target
platform that is not the same as the development platform are challenges
shared with embedded Linux, for example. In addition, the SpaceX team has had
to face the common problem that "no one outside of software understands
software".

SpaceX started with the Falcon rocket and eventually transitioned the
avionics code to the Dragon spacecraft. The obvious advantage of sharing code
is that bugs fixed on one platform are automatically fixed on the other. But
there are differences in the software requirements for the launch vehicles
and spacecraft, largely having to do with the different reaction times
available. As long as a spacecraft is not within 250 meters of the
International Space Station (ISS), it can take some time to react to any
problem. For a rocket, that luxury is not available; it must react in short
order.

False positives are one problem that needs to be taken into account. Rose
mentioned the heat shield indicator on the Mercury-Atlas 6 mission (the first
US manned orbital flight) which showed that the heat shield had separated. NASA
tried to figure out a way to do a re-entry with no heat shield, but
"eventually just went for it". It turned out to be a false positive. Once
again, the amount of time available to react is different for launch vehicles
and spacecraft.

Gathering data

Quoting Fred Brooks (of The Mythical Man-Month fame), Rose said "software is
invisible". To make software more visible, you need to know what it is doing,
he said, which means creating "metrics on everything you can think of". With
a rocket, you can't just connect via JTAG and "fire up gdb", so the software
needs to keep track of what it is doing. Those metrics should cover areas
like performance, network utilization, CPU load, and so on.

The metrics gathered, whether from testing or real-world use, should be
stored, because it is "incredibly valuable" to be able to go back through
them, he said. For his systems, telemetry data is stored with the program
metrics, as
is the version of all of the code running so that everything can be
reproduced if needed.

SpaceX has programs to parse the metrics data and raise an alarm when
"something goes bad". It is important to automate that, Rose said, because
forcing a human to do it "would suck". The same programs run on the data
whether it is generated from a developer's test, from a run on the
spacecraft, or from a mission. Any failures should be seen as an opportunity
to add new metrics. It takes a while to "get into the rhythm" of doing so,
but it is "very useful". He likes to "geek out on error reporting", using
tools like libSegFault and ftrace.
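
As a rough illustration of that kind of automated check, here is a minimal
Python sketch. Nothing in it comes from the talk: the metric names, the
limits, and the sample layout are all assumptions made up for the example.
The point is only that the same function can be run over data from any
source, whether a developer's test or a real mission.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Allowed range for one metric (hypothetical names and values)."""
    low: float
    high: float


def find_alarms(samples, limits):
    """Return (timestamp, metric, value) for every out-of-range sample.

    samples: iterable of dicts with "t", "metric", and "value" keys.
    limits:  dict mapping a metric name to its Limit.
    Metrics without a configured limit are ignored.
    """
    alarms = []
    for sample in samples:
        limit = limits.get(sample["metric"])
        if limit is None:
            continue
        # Written as "not (low <= v <= high)" so that NaN also raises an alarm.
        if not (limit.low <= sample["value"] <= limit.high):
            alarms.append((sample["t"], sample["metric"], sample["value"]))
    return alarms


# Illustrative limits and recorded samples; a real system would load these
# from stored telemetry rather than hard-code them.
LIMITS = {"cpu_load": Limit(0.0, 0.85), "net_util": Limit(0.0, 0.70)}
SAMPLES = [
    {"t": 0.0, "metric": "cpu_load", "value": 0.42},
    {"t": 0.1, "metric": "cpu_load", "value": 0.93},
    {"t": 0.1, "metric": "net_util", "value": 0.10},
]

for t, metric, value in find_alarms(SAMPLES, LIMITS):
    print(f"ALARM t={t} {metric}={value}")
```

Each failure that slips past such a check is a hint about which limit or
metric to add next.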

Automation is important, and continuous integration is "very valuable", Rose
said. He suggested building for every platform all of the time, even for
"things you don't use any more". SpaceX does that and has found interesting
problems when building unused code. Unit tests are run from the continuous
integration system any time the code changes. "Everyone here has 100% unit
test coverage", he joked, but running whatever tests are available, and
creating new ones, is useful. When he worked on video games, they had a test
to just "warp" the character to random locations in a level and have it look
in the four directions, which regularly found problems.
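
The "warp" test is essentially a randomized smoke test. A minimal Python
sketch of the idea follows; the render_view callback, the level dimensions,
and the direction names are hypothetical stand-ins, since the talk gave no
details of the game or its test harness.

```python
import random

DIRECTIONS = ("north", "east", "south", "west")


def warp_test(render_view, level_width, level_height, trials=100, seed=0):
    """Call render_view at random positions, facing each direction in turn.

    render_view(x, y, direction) is a hypothetical callback standing in for
    whatever the game does to draw a frame. Any exception it raises is
    recorded as a failure instead of stopping the run.
    """
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    failures = []
    for _ in range(trials):
        x = rng.uniform(0, level_width)
        y = rng.uniform(0, level_height)
        for direction in DIRECTIONS:
            try:
                render_view(x, y, direction)
            except Exception as exc:
                failures.append((x, y, direction, repr(exc)))
    return failures


def always_ok(x, y, direction):
    """Stand-in renderer that never fails."""
    return None


print(len(warp_test(always_ok, 100.0, 100.0)), "failures")
```

Seeding the generator keeps the test cheap to write while still making any
failure reproducible.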

"Automate process processes", he said. Things like coding standards, static
analysis, spaces vs. tabs, or detecting the use of Emacs should be done
automatically. SpaceX has a complicated process where changes cannot be made
without tickets, code review, signoffs, and so forth, but all of that is
checked automatically. If static analysis is part of the workflow, make it
such that the code will not build unless it passes that analysis step.
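
To make the idea of a build-blocking check concrete, here is a small Python
sketch of a style gate for the spaces vs. tabs case. It is not SpaceX's
tooling; the function names and the rule itself are assumptions chosen for
illustration. A build system would run it as a required step and stop on a
non-zero result.

```python
def find_tab_indentation(source_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


def check_files(paths):
    """Print every violation and return 1 if any file fails, else 0.

    The return value is meant to be used as the process exit status, for
    example via sys.exit(check_files(sys.argv[1:])) in a wrapper script.
    """
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number in find_tab_indentation(handle.read()):
                print(f"{path}:{number}: tab used for indentation")
                failed = True
    return 1 if failed else 0
```

The same pattern extends to any check that can report a pass or fail status,
including a real static analyzer.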

When the build fails, it should "fail loudly" with a "monitor that starts
flashing red" and email to everyone on the team. When that happens, you
should "respond immediately" to fix the problem. In his team, they have a
full-size Justin Bieber cutout that gets placed facing the team member who
broke the build. They found that "100% of software engineers don't like
Justin Bieber", and will work quickly to fix the build problem.

Project management

In his transition to becoming a manager, Rose has had to learn to worry about
different things than he did before. He pointed to the "Make the Invisible
More Visible" essay from the 97 Things Every Programmer Should Know project
as a source of inspiration. For hardware, it's obvious what its integration
state is because you can look at it and see, but that's not true for
software. There is "no progress bar for software development". That has led
his team to experiment with different methods to try to do project planning.

Various "off the shelf" project management methodologies and ways to estimate
how long projects will take do not work for his team. It is important to set
something up that works for your people and set of tasks, Rose said. They
have tried various techniques for estimating time requirements, from Wideband
Delphi to evidence-based scheduling and found that no technique by itself
works well for the group. Since they are software engineers, "we wrote our
own tool", he said with a chuckle, that is a hybrid of several different
techniques. There is "no silver bullet" for scheduling, and it is "unlikely
you could pick up our method and apply it" to your domain. One hard lesson he
learned is that once you have some success using a particular scheduling
method, you "need to do a sales job" to show the engineers that it worked.
That will make it work even better the next time because there will be more
buy-in.

Some technical details

Linux is used for everything at SpaceX. The Falcon, Dragon, and Grasshopper
vehicles use it for flight control, the ground stations run Linux, as do the
developers' desktops. SpaceX is "Linux, Linux, Linux", he said.

Rose went on to briefly describe the Dragon flight system, though he said he
couldn't give too many details. It is a fault-tolerant system in order to
satisfy NASA requirements for when it gets close to the ISS. There are rules
about how many faults a craft needs to be able to tolerate and still be
allowed to approach the station. It uses triply redundant computers to
achieve the required level of fault tolerance. The Byzantine generals'
algorithm is used to handle situations where the computers do not agree. That
situation could come about because of a radiation event changing memory or
register values, for example.
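
The article gives no implementation details, so the following Python sketch
shows only the simplest building block of such a design: majority voting
across replica outputs. This is deliberately not the Byzantine generals'
algorithm mentioned in the talk, which also has to cope with replicas that
report different values to different peers. The function name and the sample
values are made up for the example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, dissenters) for a list of replica outputs.

    value is the output reported by a strict majority of replicas, or None
    if no strict majority exists. dissenters lists the indices of replicas
    that disagree with the majority (all indices when there is none).
    Outputs must be hashable.
    """
    if not outputs:
        return None, []
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Three computers produce the same command word, but one has a flipped bit.
print(majority_vote([0x3F, 0x3F, 0x7F]))  # (63, [2])
```

With three replicas, one faulty value is outvoted and its source identified;
if all three disagree, the function reports that no majority exists.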

For navigation, Dragon uses positional information that it receives from the
ISS, along with GPS data it calculates itself. As it approaches the station,
it uses imagery of the ISS and the relative size of the station to compute
the distance to the station. Because it might well be in darkness, Dragon
uses thermal imaging as the station is slightly warmer than the background.
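
The talk did not say how the distance is computed. One textbook way to turn
apparent size into range is the pinhole-camera relation, sketched here in
Python. The function name and all of the numbers are illustrative
assumptions, not Dragon's actual parameters.

```python
def range_from_apparent_size(real_width_m, apparent_width_px, focal_length_px):
    """Estimate the distance to an object of known size (pinhole model).

    distance = focal_length * real_width / apparent_width, with the focal
    length and the apparent width both expressed in pixels.
    """
    if apparent_width_px <= 0:
        raise ValueError("object must span at least a fraction of a pixel")
    return focal_length_px * real_width_m / apparent_width_px


# Hypothetical numbers: a 100 m wide target spanning 400 pixels, seen by a
# camera with a focal length of 1000 pixels.
print(range_from_apparent_size(100.0, 400.0, 1000.0))  # 250.0
```

The estimate gets more precise as the target fills more of the image, which
suits a sensor used during the final approach.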

His team does not use "off-the-shelf distro kernels". Instead, they spend a
lot of time evaluating kernels for their needs. One of the areas they focus
on is scheduler performance. They do not have hard realtime requirements, but
do care about wakeup latencies, he said. There are tests they use to quantify
the performance of the scheduler under different scenarios, such as while
stressing the network. Once a kernel is chosen, "we try not to change it".
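
The tests themselves were not described. As a rough sketch of what measuring
wakeup latency means, the Python below asks to sleep for a fixed interval
and records how late each wakeup arrives. The function names and parameters
are assumptions for illustration; a real kernel evaluation would more likely
use a dedicated tool such as cyclictest, run while the system is under load.

```python
import time


def measure_wakeup_overshoot(interval_s=0.001, iterations=200):
    """Return how late each wakeup was, in microseconds.

    Each iteration requests a sleep of interval_s seconds and measures the
    time actually spent asleep; the difference approximates wakeup latency
    plus interpreter overhead.
    """
    overshoots_us = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        time.sleep(interval_s)
        elapsed_ns = time.perf_counter_ns() - start
        overshoots_us.append((elapsed_ns - interval_s * 1e9) / 1000.0)
    return overshoots_us


def summarize(values):
    """Return min, median, 99th percentile, and max of a non-empty list."""
    ordered = sorted(values)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }


print(summarize(measure_wakeup_overshoot()))
```

Comparing the tail values (p99 and max) between an idle run and a run with
heavy network traffic is what distinguishes one candidate kernel from
another.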

The development tools they use are "embarrassingly non-sophisticated", Rose
said. They use GCC and gdb, while "everyone does their own thing" in terms of
editors and development environments. Development has always targeted Linux,
but it was not always the desktop used by developers, so they have also
developed a lot of their own POSIX-based tools. The main reason for switching
to Linux desktops was the development tools that "you get out of the box",
such as ftrace, gdb (which can be directly attached to debug your target
platform), netfilter, and iptables.

Rose provided an interesting view inside the software development for a large
and complex embedded Linux environment. In addition, his talk was more open
than a previous SpaceX talk we covered, which was nice to see. Many of the
techniques used by the company will sound familiar to most programmers, which
makes it clear that the process of creating code for spacecraft is not
exactly rocket science.

[ I would like to thank the Linux Foundation for travel assistance to attend
ELC. ]

