How much test coverage is enough? SQLite maintains 92 million lines of test code for 155,000 lines of production code. In their words, a 590:1 ratio. That sounds absurd until you understand how a small team maintains code deployed on billions of devices while making “major changes to the structure of the code” without fear of breaking things.

When I wrote about proving your code works before submission, I focused on the minimum bar of manual testing plus automated tests that demonstrate correctness. SQLite’s testing approach reveals what “proven” means at the other end of the spectrum, and, more importantly, when that extreme investment actually makes economic sense.

What Aviation-Grade Actually Means Link to heading

Richard Hipp describes SQLite’s testing as following DO-178B, the FAA standard for safety-critical aircraft software. In his 2019 SIGMOD interview, he explains:

It’s a very detailed design spec about the processes you do in developing the software. And the key point is that you have to test it to 100 percent modified condition/decision coverage (100% MC/DC). Which, basically, means you have to test the machine level such that every branch instruction has been taken and falls through at least once.

MC/DC is stronger than branch coverage. For a condition like if (a > b && c != 25), branch coverage requires two tests (one where the whole expression is true, one where it is false), but MC/DC requires showing that each condition independently affects the decision: you need a test pair where flipping only a > b flips the outcome (with c != 25 held true), and another where flipping only c != 25 flips the outcome (with a > b held true).
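
To make that concrete, here’s a minimal sketch; the decide() helper and the specific values are mine, not SQLite’s:

```c
#include <assert.h>

/* Hypothetical decision under test, mirroring the condition above. */
static int decide(int a, int b, int c) {
  return (a > b) && (c != 25);
}

int main(void) {
  /* Branch coverage is satisfied by two tests: one true outcome, one false. */
  assert(decide(2, 1, 0) == 1);  /* a>b true,  c!=25 true  -> true  */
  assert(decide(0, 1, 0) == 0);  /* a>b false, c!=25 true  -> false */

  /* MC/DC needs a third test so each condition is shown to matter on its own:
  ** tests 1 and 2 differ only in a>b and the result flips;
  ** tests 1 and 3 differ only in c!=25 and the result flips. */
  assert(decide(2, 1, 25) == 0); /* a>b true,  c!=25 false -> false */
  return 0;
}
```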

The standard evolved for avionics because in aircraft software, untested code paths can kill people. SQLite adopted it because embedded deployment creates similar constraints: you can’t patch a car’s firmware or a phone’s database engine with the ease of deploying a web service. As Hipp notes, “It will amaze you how many bugs pop up when your software is deployed on two billion cell phones.” It was a “fun” day when I realized our AWS service volume meant we’d probably hit a once-in-a-trillion bug every day.

The Counterintuitive Benefit: Speed Link to heading

Hipp’s framing also flipped the conventional wisdom for me:

Aviation grade testing allows us to move fast, which is important because in this business you either move fast or you’re disrupted. So, we’re able to make major changes to the structure of the code that we deliver and be confident that we’re not breaking things because we had these intense tests.

Half their development time goes to writing tests. They’ve spent 17 years accumulating this suite. And it enables a handful of developers to maintain what might require 30-50 people with traditional approaches.

The mechanism is confidence. When the test suite is trusted and comprehensive, you can refactor the query planner without three rounds of manual QA. You can change internal data structures knowing that if something breaks, you’ll know within minutes, not months. The tests become the specification: precise, executable, and constantly verified.

Compare this to PostgreSQL’s approach. Hipp asked them how they prevent bugs, and learned they rely on “a very elaborate peer review process, and if they’ve got code that has worked for 10 years they just don’t mess with it, leave it alone, it works.” Both strategies achieve quality. SQLite’s enables fearless change while PostgreSQL’s favors stability through conservatism. I sadly see the latter more often than necessary.

Constraints enabling automation show up here too. Extensive testing creates forcing functions. You can’t make ad-hoc changes when every modification must pass millions of tests. That constraint clarifies intent. The test suite becomes the contract, and violations are caught immediately rather than discovered in production.

But test architecture can be deliberate. You can design your test suite to act as speed bumps in critical paths (testing core database integrity checks forces you to think hard before changing them) while letting you refactor freely in others (well-isolated component tests that don’t cascade failures). The worst outcome is adding tests mindlessly (especially AI/LLM-generated ones), creating a brittle suite where every refactor breaks hundreds of unrelated tests. That gives you a maintenance nightmare rather than any shred of confidence.

Trade-Offs You Can’t Avoid Link to heading

SQLite’s testing documentation acknowledges something most teams rarely discuss upfront: different testing strategies are often in conflict.

1. MC/DC vs Fuzzing Link to heading

The testing page states it plainly: “Fuzz testing and 100% MC/DC testing are in tension with one another.”

MC/DC requires testing every branch and proving each condition independently affects outcomes. This discourages defensive code because branches designed to handle never-possible cases are unreachable during normal testing, hurting coverage metrics. But fuzz testing does the opposite, specifically targeting edge cases and malformed inputs that require exactly those defensive checks normal code paths never hit.

SQLite’s solution? Keep both. They maintain 100% MC/DC for the core while running continuous fuzzing with dbsqlfuzz and Google’s OSS-Fuzz. The fuzzer runs billions of mutations daily, catching adversarial inputs that structured tests miss: malformed SQL, unexpected state transitions, corrupted data.
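
For a feel of what that looks like, here’s a minimal libFuzzer-style harness that throws arbitrary bytes at an in-memory database; this is an illustrative sketch, not SQLite’s actual dbsqlfuzz or OSS-Fuzz harness:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sqlite3.h>

/* libFuzzer entry point: treat the fuzzer's byte blob as a SQL script and run
** it against a throwaway in-memory database. SQL errors are expected and fine;
** crashes, hangs, and sanitizer reports are the bugs we're hunting. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  char *sql = malloc(size + 1);
  if (sql == NULL) return 0;
  memcpy(sql, data, size);
  sql[size] = '\0';              /* sqlite3_exec needs a NUL-terminated string */

  sqlite3 *db = NULL;
  if (sqlite3_open(":memory:", &db) == SQLITE_OK) {
    sqlite3_exec(db, sql, NULL, NULL, NULL);
  }
  sqlite3_close(db);
  free(sql);
  return 0;
}
```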

SQLite resolves the problem with ALWAYS() and NEVER() macros: in coverage-measurement builds they compile down to constant true/false so the defensive branch never shows up as untested, in debug builds they assert if the “impossible” case ever fires, and in release builds they pass the expression through unchanged so the protective check still ships. Not elegant, but honest about the tension between provable correctness and robust failure handling.
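
The pattern looks roughly like this (flag names here approximate SQLite’s; treat it as a sketch of the idea rather than their exact source):

```c
#include <assert.h>

#if defined(COVERAGE_TEST)
  /* Coverage builds: constants, so no branch instruction is generated and
  ** MC/DC metrics aren't penalized for defensive code. */
# define ALWAYS(X)  (1)
# define NEVER(X)   (0)
#elif !defined(NDEBUG)
  /* Debug/test builds: fail loudly if the "impossible" case ever happens. */
# define ALWAYS(X)  ((X) ? 1 : (assert(0), 0))
# define NEVER(X)   ((X) ? (assert(0), 1) : 0)
#else
  /* Release builds: pass-through, so the defensive check still protects users. */
# define ALWAYS(X)  (X)
# define NEVER(X)   (X)
#endif

/* Usage: a defensive fallback that "can't" happen in practice. */
int clamp_page_size(int n) {
  if (NEVER(n <= 0)) return 512;
  return n;
}
```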

2. Defensive Code and Unreachable Branches Link to heading

SQLite also explicitly keeps additional defensive code even when it complicates coverage. They document using testcase() macros to mark boundaries and bitmask effects that should be tested but might be hard to trigger naturally.
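
The mechanism behind testcase() is simple; here’s a sketch of the idea, with a hypothetical coverage hook (log_coverage_hit is my placeholder, not SQLite’s function):

```c
/* In coverage builds, record that this line was reached with the condition
** true; in normal builds the macro compiles away to nothing. */
#ifdef COVERAGE_TEST
  void log_coverage_hit(int line);           /* hypothetical logging hook */
# define testcase(X)  if (X) { log_coverage_hit(__LINE__); }
#else
# define testcase(X)
#endif

#define FLAG_READONLY  0x01
#define FLAG_MEMORY    0x02

void open_with_flags(int nBytes, unsigned flags) {
  testcase(nBytes == 0);            /* prove the boundary value is exercised */
  testcase(flags & FLAG_READONLY);  /* prove each bitmask flag occurs in tests */
  testcase(flags & FLAG_MEMORY);
  /* ... */
  (void)nBytes; (void)flags;        /* silence unused warnings in non-coverage builds */
}
```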

I’ve seen teams face this differently. Either they remove defensive checks to hit coverage targets (“this can’t happen, so we don’t need the check”), or they skip coverage for entire sections (“too hard to test, we’ll just do this manually”). At my previous workplace I saw sister teams push back on assertions because they would bring coverage down. SQLite’s approach is more disciplined: they mark what should be tested, build the infrastructure to test it, and track coverage with any exceptions documented. It’s pragmatic and balanced.

3. Static Analysis and Warning Chasing Link to heading

Here’s another striking claim from the testing page that only a mature codebase can make:

More bugs have been introduced into SQLite while trying to get it to compile without warnings than have been found by static analysis.

Early projects benefit from static analysis because it catches obvious mistakes. But once you’ve fixed the low-hanging fruit, eliminating further warnings by refactoring code you don’t fully understand is a fool’s errand. It goes back to avoiding mindless upgrades to satisfy linter heuristics. That kind of warning-chasing was one of the common triggers for AWS outages until they scaled back some of their more arduous release checklists.

SQLite still uses Valgrind, memory sanitizers, and undefined behavior detection, but they’re selective. They explicitly mention that running full Valgrind on every test would slow things down 25x, making the suite impractical. Instead they run it on the “veryquick” subset and TH3 (their proprietary test harness designed for 100% MC/DC coverage, more on this below), accepting the trade-off that some memory bugs might slip through in paths only exercised by slower tests.

Two Paths to Quality: PostgreSQL vs SQLite Link to heading

These trade-offs reveal a deeper question: how do you actually achieve quality? We can go back to the comparison with PostgreSQL to contrast two competing philosophies:

PostgreSQL’s approach is designed for a larger, more distributed team and pushes for extensive peer review before changes merge. There’s conservatism around proven code and a high value placed on stability through careful human review. SQLite’s approach suits a tiny but deep team, where it’s possible to justify a minimal peer review process. Fearless refactoring is enabled by extensive tests, and stability comes through automation.

Neither is universally better. PostgreSQL’s model works for a community-driven project where contributors come and go. Peer review socializes knowledge and catches design issues that tests miss. SQLite’s model works for a small, stable team with deep domain expertise and the discipline to maintain comprehensive tests.

The economic constraint differs too. PostgreSQL’s model requires organizational coordination but spreads cost across many contributors. SQLite’s model requires upfront investment (developing TH3, the constant test maintenance) but enables a lean operation. For a widely-deployed library supporting itself through support contracts, focusing on depth over size matters. For a more community-driven project, engaging many contributors matters more.

At Block, I see both models at work. Services with smaller, stable teams tend toward SQLite’s approach: heavy automation and fearless refactoring. Services with rotating contributors (especially those worked on by many teams) lean toward the PostgreSQL model: extensive design reviews and an “if it ain’t broke, don’t touch it” mentality. Neither feels wrong, though I do see the smaller, stable teams move much faster because they save on coordination overhead (until they’re forced to work with “that other team”).

Where Should Your Code Land on This Spectrum? Link to heading

Google’s testing blog suggests 70-80% coverage for most projects, noting that going higher shows diminishing returns. Empirical studies found that increasing coverage beyond this range is “time consuming and therefore leads to a relatively slow bug detection rate.”

But those recommendations come with caveats: higher coverage targets are recommended for safety-critical systems and anywhere the cost of failure is high, such as medical (IEC 62304) or automotive (ISO 26262) products, or widely deployed ones. Arguably, SQLite falls into this bracket more easily now because of the effort they put into testing.

For most other application code, 70-80% coverage with good TDD practices works. TDD may slow initial development 15-30% but speeds overall delivery through fewer bugs and faster maintenance. That’s the right trade-off when you can iterate quickly and fix issues post-deployment.

For infrastructure code (think libraries, frameworks, systems that others depend on), the cost of a bug multiplies across all downstream users. The inability to patch quickly (embedded systems, long release cycles) changes the risk profile. The internal version of Ruler took years to upgrade because of how deeply embedded it was, even while the library itself was getting monthly updates back then. SQLite’s extreme test-to-code ratio made a lot of sense to me because of the environments it gets deployed in.

Test What You Fly, Fly What You Test Link to heading

TH3, SQLite’s proprietary test harness, embodies another aviation principle: “test what you fly and fly what you test.” Rather than testing the source code directly, TH3 validates the compiled object code using only public APIs.

This catches compiler bugs, configuration drift, and platform-specific issues. When you’re deployed across iOS, Android, Windows, Linux, and embedded systems with different compilers and optimization flags, testing the source alone misses entire categories of bugs. You need to verify the actual binary behavior across configurations.
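
In practice, “only public APIs” means tests that look something like this sketch: nothing reaches into internal headers or data structures, just documented sqlite3_* calls against whatever binary got linked (the specific checks here are mine, not TH3’s):

```c
#include <assert.h>
#include <string.h>
#include <sqlite3.h>

/* A black-box check in the spirit of "test what you fly": exercise the linked
** library strictly through its public API, so the same test can run against
** any build, compiler, or optimization level. */
int main(void) {
  sqlite3 *db = NULL;
  assert(sqlite3_open(":memory:", &db) == SQLITE_OK);

  assert(sqlite3_exec(db,
      "CREATE TABLE t(x TEXT); INSERT INTO t VALUES('fly what you test');",
      NULL, NULL, NULL) == SQLITE_OK);

  sqlite3_stmt *stmt = NULL;
  assert(sqlite3_prepare_v2(db, "SELECT x FROM t;", -1, &stmt, NULL) == SQLITE_OK);
  assert(sqlite3_step(stmt) == SQLITE_ROW);
  assert(strcmp((const char *)sqlite3_column_text(stmt, 0),
                "fly what you test") == 0);

  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}
```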

Most teams don’t need this. If you control deployment and use consistent toolchains, source-level testing suffices. But for libraries deployed everywhere, the additional rigor pays off. The SQLite team runs full test suites on multiple platforms, including “antique hardware like ancient MacBooks still using PowerPC processors”, and a complete validation run takes 3-4 days(!!). Some might think it’s overkill, but it’s why SQLite can claim to be “built into all mobile phones and most computers.”

If you’re iterating on application features weekly, this would kill velocity. For infrastructure code where bugs cascade, it’s what reliability actually costs.

The Economics Nobody Talks About Link to heading

When someone shows you their code and says “it works,” the natural follow-up is “how do you know?” For a prototype, “I ran it once and it didn’t crash” might suffice. For infrastructure code supporting 911 calls, it’s catastrophically insufficient.

SQLite’s approach clarifies what most teams miss: testing rigor is an economic decision with specific enabling conditions. The extreme investment works because they’ve amortized it over 17 years, the team stayed tiny, the domain is stable (database engines don’t fundamentally change), and bugs generate expensive support calls. Remove any of those conditions and the equation changes.

The interesting part is what enables what. SQLite’s testing doesn’t just prevent bugs; it enabled a business model. Hipp initially thought he’d sell the test suite. That didn’t work, but the testing did become the product differentiator, allowing a handful of developers to support billions of devices because the tests do the quality assurance work that would otherwise require organizations of people. I’m pretty sure companies pay for SQLite support and licenses not because of the public domain codebase, but because they’re buying the confidence that comes from such rigor.

Most other teams face the opposite constraints: changing requirements, large distributed teams, tolerance for some bugs, the ability to patch quickly. If you’re building a startup with rapidly changing requirements, SQLite’s approach would be fatal. You’d spend months perfecting tests for features you’ll pivot away from. We shipped EventBridge with limited testing (and paid for that, btw), but it was the right call. When we launched multi-region endpoints for recovery, no amount of testing felt like enough. The key question tended to be “what’s the cost of being wrong?”

And it’s a spectrum: “manual testing” on one end, “100% MC/DC with continuous fuzzing and multi-platform validation” on the other. Where your code lands depends on deployment constraints, failure costs, and team dynamics, not on what testing blog posts say you “should” do.

Still, I’m curious whether the aviation-grade approach will become more common as AI generates more code. When generation is cheap, perhaps verification becomes the differentiator. SQLite spent half their time writing tests because humans were writing the production code slowly. If AI writes the production code in minutes, does the test investment equation change? Or does it just mean we need even more rigorous verification because we’re generating code faster than we can understand it?

And what happens to the PostgreSQL model of peer review and conservatism when AI agents can participate in code review? Do we get the best of both worlds (automated testing plus AI-augmented review), or does the automation just shift the bottleneck somewhere else we haven’t identified yet?