What you will learn:
- The main challenges designers face when creating equipment for orbit.
- Types of radiation that can lead to “events” that damage equipment.
- How to mitigate the impact of these events.
- How to deal with software errors.
It’s an understatement to say that launching equipment into space doesn’t come cheap. Even with falling costs in recent years, a satellite and its launch can still run between $10 million and $400 million, and a weather-monitoring satellite costs about $290 million. The stakes are real: a geomagnetic storm in February 2022 is estimated to have caused around $50 million in damage to SpaceX’s low Earth orbit (LEO) communications satellites.
And that’s before maintenance costs, with the failure of one component requiring either in-space repair or the cancellation of the entire system. In short, all equipment must be incredibly reliable and able to withstand the extreme environment.
There are basically three key differences that chip and systems engineers face when creating equipment for orbit rather than for the ground.
The first is temperature: the equipment must be protected from extreme swings (on the order of 150°C) while remaining within its operational range. The second is the vacuum of space.
Together, these create a cooling regime different from the one on the ground, relying solely on thermal radiation rather than air convection, which requires somewhat different heat-dissipation calculations. Vacuum also creates moisture issues: moisture absorbed by the packaging on the ground outgasses once in orbit, potentially delaminating the package from the board. So a separate qualification step is required to make sure no moisture is trapped in the packaging before launch.
Both of these issues are relatively simple to mitigate through packaging and insulation. The third difference (and arguably the most difficult) encountered by electronic components in space is radiation.
The Earth’s magnetosphere concentrates charged particles into two main belts (Fig.1). The lower belt is mostly protons and the upper belt mostly electrons, with most of these particles coming from the sun via the solar wind and solar flares. The remainder arrive as galactic cosmic rays from beyond the solar system.
While it’s difficult to replicate the full environment in ground tests – generating some of these particles would require energies in the region of giga-electron-volts – the belts are at least very well understood and have been studied by NASA since the beginning of its space programs. Depending on its orbit, equipment will traverse these belts at different rates.
Anything operating in a polar orbit, such as a spy satellite, will regularly pass through the concentrated radiation belts and will need to be shielded against a higher radiation dose. LEO satellites, by contrast, operate between 1,000 and 1,500 km and therefore experience lower levels of radiation.
Depending on these factors, a satellite can absorb between 1 and 10 kilorads per year, so we need to calculate the total dose the satellite will receive over its lifetime.
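To make that lifetime-dose arithmetic concrete, here is a minimal sketch in Python. The 1 to 10 krad/year range comes from the text above; the mission length and the 2x design margin are illustrative assumptions, not figures from the article.

```python
# Estimate total ionizing dose (TID) over a mission's lifetime.
# The 1-10 krad/year dose rates come from the article; the mission
# length and 2x design margin below are illustrative assumptions.
def lifetime_dose_krad(annual_dose_krad, mission_years, design_margin=2.0):
    """Return the dose, in kilorads, a part should be qualified for."""
    return annual_dose_krad * mission_years * design_margin

# A hypothetical 15-year mission at the high end of the article's range:
required = lifetime_dose_krad(annual_dose_krad=10, mission_years=15)
print(required)  # 300.0 -> the part's TID rating must exceed this
```

A part qualified only to, say, 100 krad would clearly be a poor fit for such a mission, which is why this calculation is done before component selection.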
Effects of radiation
In addition to dose, we need to look at the types of radiation and the damage they cause.
Let’s first look at ionizing radiation (non-ionizing radiation is covered near the end of the article), which can come from protons or electrons. These particles strike the semiconductor’s gate oxide and do damage through the accumulation of trapped charge in the MOSFET gates (Fig.2).
In a PMOS transistor, this shifts the threshold voltage and makes the device harder to turn on. In an NMOS transistor, the reverse happens: it turns on at a lower threshold.
The probability of this damage is proportional to the gate area, so older process nodes, with their larger gate geometries, tend to have a higher probability of gate-oxide radiation damage.
In analog designs, you can also see bandgap-reference shifts, changes in bias currents, leakage side effects, and increased 1/f noise.
Types of events
There are two possible outcomes following an impact: a non-destructive or a destructive event (Fig.3). A non-destructive event can be a single-event upset in a storage element – a soft error in which the radiation causes a noise spike and flips a memory location from zero to one.
A destructive event can be a single-event gate rupture, which primarily affects power devices. Or, if the particle’s impact energy is high enough, it can cause single-event latch-up, turning a device on permanently until a power cycle is undertaken. Depending on the device, this can be catastrophic.
Thus, you must detect and protect the equipment against these events. These single-event effects tend to come more from the heavier particles, which create electron-hole pairs and a transient conduction path; flipping latches and flip-flops is also an issue. And that’s usually how most of these effects are tested: putting a lot of memory on a device, streaming patterns through it, and then measuring the number of errors per megabit of memory.
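The memory-based test method just described can be sketched as follows. The beam itself obviously can’t be reproduced in software, so a random bit-flip injector stands in for the irradiation step; all names here are illustrative.

```python
import random

def count_upsets(memory, expected_pattern):
    """Count bit flips between what was written and what is read back."""
    upsets = 0
    for word, expected in zip(memory, expected_pattern):
        upsets += bin(word ^ expected).count("1")  # number of differing bits
    return upsets

# Fill a (simulated) memory with a known pattern...
pattern = [0xAAAA] * 1024
memory = list(pattern)
# ...expose it to the beam (here: random bit flips stand in for radiation)...
rng = random.Random(42)
for _ in range(5):
    addr = rng.randrange(len(memory))
    memory[addr] ^= 1 << rng.randrange(16)
# ...then read everything back and count the errors per amount of memory.
errors = count_upsets(memory, pattern)
print(errors)
```

In a real beam test, this error count divided by the memory size and exposure gives the upset cross-section used to predict in-orbit error rates.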
Latch-up is a separate test, and you need to build in safeguards that detect current surges and retain the ability to reset critical devices.
Several first-line mitigation measures are available. The first is transistor selection: with a power transistor that must not turn on inadvertently, for example, it is better to use PMOS than NMOS.
Also, as probability increases with size, there is some benefit to going to smaller process nodes, but it also comes with other risks.
And there are specific design mitigation techniques, such as avoiding gates with a large number of inputs (high fan-in).
Beyond these basic steps, a multitude of other mitigation techniques target both single-event upsets and single-event latch-up.
Noteworthy among these is the triple-redundant flip-flop (Fig.4). This structure (and its many variations) has long been used in the aerospace industry, as it offers the best available protection against single-event upsets.
But there are drawbacks that prevent using the triple-redundant flip-flop throughout a design: it triples the area of the solution and increases the power demands of the entire system. It should therefore be reserved for the critical areas where key decisions are made.
To prevent a clock glitch from propagating the same error into all three copies, it is also possible to skew or delay the clocks of the individual flip-flops so that they sample independently. The disadvantage of this method is that it doubles the rate of safe failures.
Finally, you could build a fully redundant system with separate microprocessors whose results are compared, with the output used only when a majority of them agree. Again, this is an expensive solution.
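The common element behind both the triple-redundant flip-flop and the three-processor scheme is a 2-of-3 majority voter, which reduces to a simple bitwise function. This is a behavioral sketch, not a gate-level design:

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit takes the value
    that at least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)

# One replica suffers a single-event upset (bit 2 flips)...
good = 0b1011
upset = good ^ 0b0100
# ...but the voted result still matches the uncorrupted value,
# because the other two replicas outvote it.
print(majority_vote(good, upset, good) == good)  # True
```

The same expression describes the voter whether the three inputs are flip-flop outputs or whole processor results; what changes is only the cost of the replication around it.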
In short, the mitigation technique employed will depend on the criticality of the component and the likelihood of a destructive event affecting it.
Prediction and correction of software errors
When an incoming particle hits RAM, what happens next depends on whether the next operation on that location is a read or a write. If the corrupted data is read, you have a problem; if the next operation is a write, the error is simply overwritten. So how do you predict what will happen?
That is entirely interdependent with the system’s software operation. And, of course, you can implement memory-scrubbing (refresh) strategies to try to keep your memory clean.
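A scrubbing pass can be sketched as below. The `read_corrected` callable stands in for a hardware ECC read path (which would correct single-bit errors); the bit-clearing stand-in used here is purely illustrative.

```python
def scrub_pass(memory, read_corrected):
    """One scrubbing pass: read every word through the correcting path
    and write the clean value back, so single-bit upsets cannot
    accumulate into uncorrectable multi-bit errors."""
    for addr in range(len(memory)):
        memory[addr] = read_corrected(memory[addr])

# Illustrative stand-in for a hardware ECC read path: here we just
# clear a hypothetical "upset" bit 7 to show the write-back mechanism.
fix_bit7 = lambda word: word & ~0x80
ram = [0x85, 0x12, 0xFF]
scrub_pass(ram, fix_bit7)
print(ram)  # [5, 18, 127]
```

The key design choice is the scrub interval: it must be short enough that the chance of a second upset landing in the same word before the next pass stays acceptably low.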
Going a step further, for functions where the data is not essential, you can simply skip the corrupted value and continue.
For more critical items, you can deploy an error-correcting code (ECC) built into the memory to provide error protection (Fig.5). Note that you will not always have enough parity bits to correct every error. Even then, however, such an approach can still catch the error, alerting you to a problem and preventing certain actions or functions from being performed at critical steps.
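A minimal single-error-correcting code along these lines is the classic Hamming(7,4), sketched below. Real space-grade memories typically use wider SECDED codes, so treat this only as an illustration of the principle:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword
    (parity bits at positions 1, 2 and 4, 1-indexed)."""
    d = [(nibble >> i) & 1 for i in range(4)]     # data bits d0..d3
    code = [0] * 8                                # positions 1..7 used
    code[3], code[5], code[6], code[7] = d[0], d[1], d[2], d[3]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return sum(code[i] << (i - 1) for i in range(1, 8))

def hamming74_decode(word):
    """Correct up to one flipped bit and return the 4 data bits."""
    code = [0] + [(word >> (i - 1)) & 1 for i in range(1, 8)]
    syndrome = (code[1] ^ code[3] ^ code[5] ^ code[7]) \
             | (code[2] ^ code[3] ^ code[6] ^ code[7]) << 1 \
             | (code[4] ^ code[5] ^ code[6] ^ code[7]) << 2
    if syndrome:                  # non-zero syndrome = error position
        code[syndrome] ^= 1
    return code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3

# A radiation-flipped bit in the stored word is corrected on read-back:
stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 5)         # single-event upset on one stored bit
print(hamming74_decode(upset) == 0b1011)  # True
```

A second flip in the same word, however, would exceed this code's correction capability, which is exactly why scrubbing and wider SECDED codes are paired with it in practice.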
Additional measures include implementing cyclic redundancy checks (CRCs) on communication channels, or encoding state machines with a Hamming distance greater than 1 between states, which prevents a single bit flip from accidentally switching the machine into another valid state.
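Both measures can be illustrated briefly. The CRC below uses Python's standard `zlib.crc32`; the state machine's names and encodings are hypothetical, chosen so that every pair of state codes differs in at least two bits:

```python
import zlib

# 1) CRC on a communication frame: any single-bit error changes the CRC.
frame = b"telemetry packet payload"
crc = zlib.crc32(frame)
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # flip one bit in transit
print(zlib.crc32(corrupted) != crc)  # True -> corruption detected

# 2) State encodings with pairwise Hamming distance >= 2: a single upset
#    can never turn one valid state code into another valid state code.
STATES = {"IDLE": 0b000, "ARM": 0b011, "ACTIVE": 0b101, "SAFE": 0b110}

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

codes = list(STATES.values())
min_dist = min(hamming_distance(x, y)
               for i, x in enumerate(codes) for y in codes[i + 1:])
print(min_dist)  # 2
```

With distance-2 encodings, a single flipped bit always lands on an invalid code, which the state machine can trap and route to a safe recovery state.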
Beyond that, it is possible to run software self-test procedures, as well as have the hardware check the software and the software check the hardware. Of course, there are also external watchdogs, as in conventional embedded systems. Finally, an internal clock monitor or an independent backup clock can be used to guard against clock failures.
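An external watchdog's behavior can be modeled with a simple tick-driven sketch. A real watchdog is an independent hardware timer; this model is only illustrative:

```python
class Watchdog:
    """Tick-driven model of an external watchdog timer: the main loop
    must 'kick' it periodically, or it fires a reset."""
    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.counter = 0
        self.reset_fired = False

    def kick(self):
        self.counter = 0            # healthy software restarts the count

    def tick(self):
        self.counter += 1
        if self.counter > self.timeout:
            self.reset_fired = True  # hung software -> force a reset

wd = Watchdog(timeout_ticks=3)
for _ in range(10):                  # software has hung: kick() never runs
    wd.tick()
print(wd.reset_fired)  # True
```

The crucial property is independence: the watchdog's timer and clock must not share the failure modes of the processor it supervises, or a single event could disable both.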
Coping with non-ionizing radiation
As mentioned above, non-ionizing radiation also has effects. Displacement damage leads to more gradual degradation, with pieces of the silicon lattice being damaged over time and the resulting leakage causing decreased gain in bipolar transistors.
This effect has been documented in satellites, especially in CMOS imaging sensors that degrade over time. In such cases, the component may eventually need to be replaced, or larger devices can be used to provide spare pixels and extend the life of the satellite.
Again, a host of mitigation techniques can be deployed, including incorporating some form of redundancy or ECC into the system. On the analog side, it’s a good idea to monitor voltages and currents carefully – if there’s a latch-up, you can detect it more easily and shut things down quickly.
The arrangement of the silicon also plays a crucial role. Here on Earth, there’s always a push to bring metal tracks closer together with every generation, with designs pushed to the absolute limits of what the technology can handle. For electronics intended for use in space, however, it is advisable to increase the separation of critical nodes to provide better protection.
So, is your chip space-qualified?
Can you use something off the shelf? It’s a little (actually, much) harder than that, and the answer, of course, is “it depends.” NASA defines four mission classes: A, B, C, and D (Fig.6), depending on the mission and its lifespan. The James Webb Space Telescope, for example, is Class A (see opening image).
The ionizing radiation that electronic components experience in space can cause significant damage if it is not taken into account from the start of the design.
Several mitigation techniques can be deployed, but cost and size limitations preclude their use throughout a system. Careful cost-versus-risk analyses should therefore be made when developing the system as a whole.
*James Webb Space Telescope aperture image credited to NASA/Desiree Stover.