Challenges in Keeping Software Clean and A New Hope
The recent global outage caused by an update to the CrowdStrike Falcon agent prompted me to reflect on software development and design practices, with a focus on stability. In May 2023, CISA director Jen Easterly said, “We don’t have a cyber problem, we have a technology and culture problem. Because at the end of the day, we have allowed speed to market and features to really put safety and security in the backseat.” I agree with this assessment, and I think the outage puts software safety and reliability front and center.
There are significant, tangible challenges in developing reliable and resilient code. We will discuss them, but not dwell on them for too long, because I’d like to focus on solutions. Note that this article is focused on the challenges of building security agents. While massive, distributed databases may share some of these challenges and benefit from many of the techniques described here, there are notable differences. Let’s dive in!
Challenges
Three significant challenges development teams face while building resilient products are:
1. Fast Delivery: The first bane of software quality is our competitive market. Organizations are in competition with each other for business, and winning an account might mean releasing a feature a few weeks before it is truly ready. The desire to deliver quickly means fewer quality controls and a greater risk of stability issues.
2. Turnover: Engineering teams are not static; they change constantly. People retire, move, and find new opportunities. New team members arrive with little experience in an existing codebase, and need to be onboarded in a way that gradually introduces them to the product and lets them make changes without a high probability of causing damage through the incorrect use of some archaic function.
3. Product Complexity: Any CS101 student, and many people who never took a computer science course in their life, can write a simple “Hello World!” program in a language of their choice. If this program were a product that someone was willing to pay for, it could be built and delivered without much worry of it breaking in the future or causing system instability. Of course, products that add value are far more sophisticated. With this complexity comes the opportunity for unexpected second- or third-order effects. A change in one area of a codebase may have unintended consequences in a different product managed by a different team that works in a different time zone. Worse, the problem might not be deterministic, and might only appear in very specific circumstances.
Combatting the three challenges of fast delivery, turnover, and product complexity requires several different techniques:
- Deterministic startup and shutdown
- Testing
- Specialized build types
- Documentation
- Code reviews
- Static code scanning
- Diagnostics and logging
- Defensive programming techniques
For this first piece in the series, we are going to focus on deterministic startup and shutdown, testing, and specialized build types.
Solutions, Part 1
Deterministic Startup and Shutdown
When developing the software used on the first manned mission to the moon, NASA engineers faced significant reliability requirements. One particularly sophisticated application was the software running the Apollo Guidance Computer (AGC), which managed navigation and guidance for the mission. One technique the engineers found that significantly improved reliability was ensuring that application startup and shutdown were deterministic, explicit, and well thought out. The software for the AGC required the availability of numerous sensors and hardware interfaces. When the application started, each interface was validated and brought up, one at a time. On shutdown, each interface was shut down, one at a time. Each shutdown process involved freeing the memory and other resources used by the interface.
Having a clean startup and shutdown is useful for a few reasons. One of the most common issues experienced by security agents is “it didn’t start properly”. With a deterministic startup, these issues are normally straightforward to identify, and can be due to missing dependencies or missing operating system resources. Clean, deterministic shutdown is important because it helps ensure the data model and object models used by an application are resilient. Forcing an engineering organization to make shutdown deterministic also causes memory to be freed in a standard manner. This level of determinism, together with a well-thought-out data model, is critical at runtime too, when applications are normally handling events.
Like the hardware interfaces used by the AGC software, security agents are often composed of many services running within a single application. Examples may include a logging service that writes logs on the local machine, a threat response service that handles attacks, and an event handling service that analyzes activity on the monitored system. Ensuring deterministic startup and shutdown verifies that each of these services can be created or cleaned up in a repeatable, reliable manner that is free of issues such as race conditions. Deterministic startup and shutdown also enable effective testing, the topic of our next section.
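As a minimal sketch of this idea, the Python below brings services up one at a time in a fixed order and tears them down in exactly the reverse order, including when a startup step fails partway through. The `Service` and `Agent` classes, the service names, and the `fail_on_start` flag are hypothetical names used only for illustration; a real agent would wrap threads, file handles, and sockets rather than an in-memory event list.

```python
from typing import List


class Service:
    """Hypothetical service; records lifecycle events so the order is observable."""

    def __init__(self, name: str, events: List[str], fail_on_start: bool = False) -> None:
        self.name = name
        self.events = events
        self.fail_on_start = fail_on_start  # simulates a missing dependency
        self.running = False

    def start(self) -> None:
        if self.fail_on_start:
            raise RuntimeError(f"{self.name}: missing dependency")
        self.running = True
        self.events.append(f"start:{self.name}")

    def stop(self) -> None:
        self.running = False
        self.events.append(f"stop:{self.name}")


class Agent:
    """Starts services in list order and stops them in reverse order."""

    def __init__(self, services: List[Service]) -> None:
        self._services = services
        self._started: List[Service] = []

    def start(self) -> None:
        for service in self._services:
            try:
                service.start()
            except Exception:
                # Unwind whatever already started, then report the failure.
                self.stop()
                raise
            self._started.append(service)

    def stop(self) -> None:
        # Pop from the end so the last service started is the first stopped.
        while self._started:
            self._started.pop().stop()


events: List[str] = []
agent = Agent([Service("logging", events), Service("events", events), Service("response", events)])
agent.start()
agent.stop()
print(events)
```

Because the order is explicit, a failed startup names the exact service that could not come up, and the same `stop` path runs in both the normal and the failure case.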
Testing
There are three types of tests worth covering in this section:
- Unit tests
- Functional tests
- End-to-end tests
Some organizations swear by unit testing and insist that every single function be independently testable. In our experience, this expectation is ineffective and unrealistic for runtime security agents. Often, this approach leads to breaking up functions into smaller subfunctions that are difficult to understand until they are used together in a larger method. This makes it harder to onboard new engineers and handle churn in the engineering team. That said, unit tests are quite useful for methods that are self-contained and reused often. For example, a function that parses a string input using a regular expression and returns a result is relatively straightforward to unit test.
Functional tests are often the most useful form of testing for runtime security agents. Functional tests involve starting the agent, exposing it to some system behavior, and ensuring the response is appropriate. For example, a functional test may involve starting an agent, monitoring a specific application, running malware from that application, and verifying the agent responds correctly. Functional testing is more difficult to automate than unit testing, but it exercises significant portions of the code base in a manner similar to a production system. If a system is broken up in an appropriate way and the interfaces are well defined, functional testing of independent components goes a long way toward ensuring effective production performance.
End-to-end tests involve testing the entire system. These tests are the most difficult to automate. An example of an end-to-end test would be starting up an agent, monitoring an application, launching malware through that application, generating an alert from the agent that is sent to a management console, and automating a user’s interaction with that alert to select a response. This type of testing requires automating the instantiation of the malware and the user activity in the UI. It also involves automating the setup and teardown of an agent and a management console. If effective functional testing of the management console and agent is already implemented, setting up end-to-end tests is typically simpler. Thus, it is better for teams to start with unit tests, layer on functional tests, and add end-to-end tests only after functional tests are in place.
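To illustrate the kind of self-contained function that suits unit testing, here is a small sketch using Python’s standard `re` and `unittest` modules. The log line format, the `parse_event` function, and the test names are assumptions made up for this example, not part of any real agent.

```python
import re
import unittest
from typing import Optional, Tuple

# Hypothetical event line format: "pid=1234 action=exec path=/usr/bin/curl"
_EVENT_RE = re.compile(r"^pid=(\d+) action=(\w+) path=(\S+)$")


def parse_event(line: str) -> Optional[Tuple[int, str, str]]:
    """Return (pid, action, path), or None if the line does not match."""
    match = _EVENT_RE.match(line.strip())
    if match is None:
        return None
    pid, action, path = match.groups()
    return int(pid), action, path


class ParseEventTest(unittest.TestCase):
    def test_valid_line(self) -> None:
        self.assertEqual(
            parse_event("pid=1234 action=exec path=/usr/bin/curl"),
            (1234, "exec", "/usr/bin/curl"),
        )

    def test_malformed_line(self) -> None:
        # A non-numeric pid and a missing path must be rejected, not guessed at.
        self.assertIsNone(parse_event("pid=abc action=exec"))
```

Saved as a module, these tests can be run with `python -m unittest`. The function has no dependencies on agent state, which is what makes it cheap to test in isolation.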
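The shape of such a functional test can be sketched as follows. Everything here is a hypothetical stand-in: `FakeAgent` replaces a real agent process, `ProcessEvent` replaces real system activity, and the hash strings replace a harmless test sample. A real test would launch the actual agent and drive real behavior on a disposable machine, but the start, expose, verify, and stop steps would look similar.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ProcessEvent:
    parent: str      # application that launched the process
    executable: str  # name of the launched binary
    sha256: str      # hash of the launched binary (placeholder strings here)


class FakeAgent:
    """Hypothetical in-process stand-in for a real security agent."""

    def __init__(self, monitored_apps: Set[str], known_bad_hashes: Set[str]) -> None:
        self._monitored = monitored_apps
        self._known_bad = known_bad_hashes
        self._running = False
        self.alerts: List[str] = []

    def start(self) -> None:
        self._running = True

    def stop(self) -> None:
        self._running = False

    def handle(self, event: ProcessEvent) -> str:
        if not self._running:
            raise RuntimeError("agent is not running")
        if event.parent in self._monitored and event.sha256 in self._known_bad:
            self.alerts.append(f"blocked {event.executable} launched by {event.parent}")
            return "block"
        return "allow"


def run_functional_test() -> None:
    agent = FakeAgent(monitored_apps={"editor.exe"}, known_bad_hashes={"bad-hash"})
    agent.start()
    try:
        benign = ProcessEvent("editor.exe", "spellcheck.exe", "good-hash")
        malicious = ProcessEvent("editor.exe", "dropper.exe", "bad-hash")
        # Expose the agent to behavior and verify each response.
        assert agent.handle(benign) == "allow"
        assert agent.handle(malicious) == "block"
        assert agent.alerts == ["blocked dropper.exe launched by editor.exe"]
    finally:
        # Teardown always runs, so one failed test cannot poison the next.
        agent.stop()


run_functional_test()
```

Note that the test checks the benign case as well as the malicious one; an agent that blocks everything would otherwise pass.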
Build Types
While most users will only interact with the released version of a product, it is essential for an engineering organization to have multiple build types. In addition to the release build, I would suggest having at least:
- Debug build
- Address sanitizer build
- Thread sanitizer build
- Trace build
Debug builds are meant to aid in debugging. These executables are not stripped and include extra symbols to enhance the debugging experience. Aside from this, debug builds are the closest to normal production builds.
Address sanitizer builds include extra instrumentation to detect memory leaks and memory corruption. These builds are extraordinarily useful for identifying insidious issues that are hard to diagnose simply by reading logs.
Thread sanitizer builds are similar to address sanitizer builds in that they include extra instrumentation. However, this instrumentation is used to diagnose data races, lock inversion issues, and many other problems that occur in parallel code. Thread sanitizer builds are an absolute must for diagnosing hard-to-reproduce race conditions or deadlocks.
Finally, we have the trace build. The trace build is a normal debug build with an extreme amount of logging. In a trace build, you want to log entry to and exit from every function. Entry logs should include the function parameter values, and exit logs should include the return value. Similarly, any operation that exits the current scope, such as a thrown exception, should have its own trace statement. Trace statements should be categorized by the subsystem they relate to, so the logs can be parsed with automated tools to make them easier to read. Trace builds are incredibly useful for addressing issues when the other build types are not enough. Further, inspection of the logs can lead to a better understanding of program behavior. It is not uncommon for engineers to think a function is called once, only to review the trace logs and see it is in fact called dozens of times. Trace logging provides better insight into how a program is operating and can be used to improve the application’s efficiency by removing unnecessary function calls. Thus, trace logging should be reviewed periodically, not just when an issue is identified in the field. This periodic review ensures engineers have a solid understanding of how their application operates and won’t be surprised by unexpected behavior.
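One way to sketch trace logging is a decorator that logs entry, exit, and exceptions for each function, tagged by subsystem. The example below uses Python’s standard `logging` and `functools` modules. The `traced` decorator, the `TRACE_ENABLED` switch, and the `classify` function are hypothetical names for illustration; in a compiled agent the equivalent would more likely be a macro that is compiled out of release builds.

```python
import functools
import logging
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])

TRACE_ENABLED = True  # hypothetical stand-in for a build-time switch


def traced(subsystem: str) -> Callable[[F], F]:
    """Log entry, exit, and exceptions under the logger 'trace.<subsystem>'."""
    logger = logging.getLogger(f"trace.{subsystem}")

    def decorator(func: F) -> F:
        if not TRACE_ENABLED:
            return func  # non-trace builds pay no wrapper cost

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # Leaving the scope via an exception gets its own trace line.
                logger.debug("raise %s exc=%r", func.__name__, exc)
                raise
            logger.debug("exit %s result=%r", func.__name__, result)
            return result

        return wrapper  # type: ignore[return-value]

    return decorator


@traced("events")
def classify(path: str) -> str:
    if not path:
        raise ValueError("empty path")
    return "suspicious" if path.endswith(".tmp.exe") else "benign"
```

Calling `logging.basicConfig(level=logging.DEBUG)` before invoking `classify` prints the trace lines. Because each subsystem has its own logger name, the output can be filtered or counted per subsystem, which is how an unexpectedly frequent call would show up during a periodic review.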
________________________________________________________________________________________________
That’s all we have time for today, but please stay tuned as I will post additional pieces in this series around the other solutions:
- Documentation
- Code reviews
- Static code scanning
- Defensive programming techniques
- Diagnostics and logging