Quamina: Parsing Numbers

This series is an excuse to build a writing habit while exploring a codebase I’ve wanted to understand better.

Automata operate on discrete symbols, typically bytes in a sequence. But numbers present a challenge in that their byte-level representations don’t preserve numeric ordering. Take IEEE 754 floating-point. The sign bit sits in the most significant position, which inverts the ordering of negative numbers. A negative value with all bits set (representing a large negative number) appears greater than zero in lexicographic comparison, even though it’s numerically smaller. Similarly, different decimal representations of the same number (like 35, 35.0, or 3.5e1) would follow different paths through an automaton despite being semantically identical.

For pattern matching systems like Ruler and Quamina, this poses a problem. The system needs to match numeric ranges (x > 100) and equality checks (x = 35.0) using byte-by-byte automaton traversal. The solution was a two-stage transformation: 1/ convert IEEE 754 bits into a lexicographically comparable form (numbits) and then 2/ encode that form as a variable-length byte sequence that’s UTF-8 safe and compact. The result is what Quamina calls Q numbers—byte sequences where lexicographic ordering exactly matches numeric ordering across the full float64 range.

How the Transformation Works Link to heading

Consider the number 42.5. In IEEE 754 format, it’s represented as 0x4045400000000000: a sign bit of 0 (positive), an exponent field, and a mantissa encoding the fractional component. The first transformation step converts this to numbits through branchless bit manipulation. The algorithm computes a mask based on the sign bit: for positive numbers, the mask is 0x8000000000000000 (just the sign bit set); for negative numbers, it’s 0xFFFFFFFFFFFFFFFF (all bits set). An XOR operation applies this mask—positive numbers get their sign bit flipped, moving them to the upper half of the uint64 space, while negative numbers get bitwise inverted, reversing their ordering so that more negative values produce smaller numbits.

For 42.5, the transformation yields 0xC045400000000000. This numbits value now has the property that lexicographic comparison on the raw uint64 matches numeric comparison. Meaning, all positive numbers fall above 0x8000000000000000, all negative numbers fall below it, and within each group, magnitude ordering is preserved. Zero sits exactly at 0x8000000000000000, the midpoint.

The second transformation converts numbits to a Q number using base-128 encoding. The algorithm processes the 64-bit value in 7-bit groups (septets) from right to left, identifying trailing zero groups that can be omitted. For 0xC045400000000000, this yields four significant septets: [0x01, 0x40, 0x22, 0x50]. Each byte is guaranteed to be less than 0x80, making the entire sequence valid UTF-8 and avoiding conflicts with Quamina’s internal markers. The big-endian ordering ensures that shorter encodings (fewer significant bits) naturally sort before longer ones in lexicographic comparison.

The variable-length property matters for practical performance. Zero encodes to a single byte [0x01]. Small integers like 1.0 encode to 2-3 bytes. Only numbers with many significant bits or extreme exponents require the full 10-byte encoding. In typical JSON workloads dominated by small integers and simple decimals, this compactness reduces memory allocation and comparison overhead.

Design Evolution and Attribution Link to heading

Early versions of Quamina (pre-1.3) used a home-grown format with 14 hexadecimal digits covering a limited range of -5 billion to +5 billion with at most 5 fractional digits. This used to be the way Ruler worked for eons as well. Any number outside this range was treated as a string, which broke semantic matching. The numbits approach replaced this with full float64 coverage. The initial implementation worked but used a multiplication-based mask computation. Later use of arithmetic right shift (to leverage CPU sign extension) reduced the transformation to three ARM64 instructions (ASR, ORR, EOR) with no conditional branches.

The base-128 encoding was a revelation to me. The idea is to find an encoding that satisfied three constraints: 1/ preserve the ordering established by numbits, 2/ stay within the UTF-8 safe byte range (0x00-0x7F), and 3/ allow variable length to avoid wasting space on common values. Base-128 allows for all three, requiring at most 10 bytes for the worst case (ceil(64/7)). I vagulely recall someone later discovered that an equivalent representation had been used in DB2’s disk format, though I’ve lost who and where they found the documentation.

Implementation Mechanics Link to heading

The core transformation in numbitsFromFloat64 (numbits.go:22-31) extracts the raw IEEE 754 bits via math.Float64bits(f), then computes the mask as uint64(int64(u)>>63) | (1 << 63). The cast to int64 before shifting ensures arithmetic (sign-extending) shift behavior. If the high bit is set, the shift produces all ones; otherwise, all zeros. The OR with 1 << 63 forces the sign bit position to be set in both cases. The final XOR with this mask performs the transformation: positive numbers XOR with 0x80…00 (flipping just the sign bit), negative numbers XOR with 0xFF…FF (bitwise NOT).

func numbitsFromFloat64(f float64) numbits {
    u := math.Float64bits(f)
    mask := uint64(int64(u)>>63) | (1 << 63)
    return numbits(u ^ mask)
}

The arithmetic right shift int64(u)>>63 sign-extends the high bit across all 64 bits. For positive numbers (high bit 0), this produces 0x00…00. For negative numbers (high bit 1), it produces 0xFF…FF. The OR with 1 << 63 ensures the sign bit position is always set, giving 0x80…00 for positive, 0xFF…FF for negative.

The base-128 encoding in toQNumber (numbits.go:40-61) walks the numbits value in two passes. First, it identifies trailing zero septets by repeatedly checking nb & 0x7f and shifting right by 7 until a non-zero septet appears. This determines the output length. The second pass encodes the significant septets into a byte slice, extracting 7 bits at a time via nb & 0x7f and storing as byte values guaranteed to be less than 0x80. The result is a big-endian sequence where the most significant bits appear first, making lexicographic byte comparison equivalent to numeric comparison.

func (nb numbits) toQNumber() qNumber {
    trailingZeroes := 0
    var index int
    for index = MaxBytesInEncoding - 1; index >= 0; index-- {
        if nb&0x7f != 0 {
            break
        }
        trailingZeroes++
        nb >>= 7
    }

    b := make([]byte, MaxBytesInEncoding-trailingZeroes)
    for ; index >= 0; index-- {
        b[index] = byte(nb & 0x7f)
        nb >>= 7
    }
    return b
}

The first loop counts trailing zero septets (groups of 7 bits), shifting right by 7 each time. The second loop fills the byte slice from right to left, masking 7 bits at a time with nb & 0x7f.

Performance measurements show the full matching cycle, including JSON parsing, flattening, Q number conversion, and automaton traversal, completes in approximately 615 nanoseconds per operation (measured via BenchmarkNumberMatching on Apple M3 Max). The benchmark simulates “realistic” workloads with 10 pattern numbers compiled into an automaton, with 50% of incoming events matching. Each operation allocates 1908 bytes across 10 allocations, covering the entire event processing pipeline. The branchless nature of the numbits transformation contributes to this performance: assembly analysis confirms just three ARM64 instructions (ASR, ORR, EOR) with no conditional branches.

Edge Cases and Constraints Link to heading

The implementation deliberately avoids handling NaN and Infinity because JSON itself prohibits these values. The JSON specification has no representation for NaN or Infinity, and Go’s encoding/json package rejects them during parsing. Quamina relies on this upstream rejection rather than detecting and handling these cases explicitly.

Negative zero presents a more subtle edge case. IEEE 754 distinguishes between positive zero (0x0000000000000000) and negative zero (0x8000000000000000). Go’s strconv.ParseFloat preserves this distinction: parsing the string "-0.0" yields actual negative zero. The numbits transformation converts these to different values: positive zero becomes 0x8000000000000000 (Q number [0x01]), while negative zero becomes 0x7FFFFFFFFFFFFFFF (Q number [0x00, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F]). In lexicographic ordering, this makes negative zero appear less than positive zero, which is semantically questionable but rarely matters in practice since JSON rarely expresses negative zero explicitly.

Subnormal numbers (as in values smaller than 2^-1022 with denormalized mantissas) work correctly. The transformation treats them as any other bit pattern, and explicit tests in numbits_test.go:44-52 verify ordering preservation for the full subnormal range. The algorithm makes no assumptions about IEEE 754 special cases beyond requiring that the sign bit occupies bit 63.

The base-128 encoding naturally avoids collision with Quamina’s value terminator byte (0xF5). Since all Q number bytes are less than 0x80, and 0xF5 is 245, there’s no possibility of confusion. This allows the automaton to use 0xF5 as a sentinel marking the end of field values without risking ambiguity with encoded numbers.

What This Enables Link to heading

By converting numbers to Q number form, Quamina brings numeric values into the same representation space as strings and other byte sequences. The automaton doesn’t need special-case logic for numeric comparisons, it simply processes Q numbers as byte sequences, and the transformation guarantees that lexicographic ordering matches semantic numeric ordering. This architectural consistency allows numeric range queries, equality checks, and prefix-based matching to all use the same automaton traversal code path.

The transformation happens in two strategic phases. During pattern compilation, pattern numbers are converted to Q numbers and added to the automaton alongside their original string forms. During event matching, the system checks whether the matcher has any numeric patterns (vmFields.hasNumbers) and whether the incoming field is numeric (eventField.IsNumber). Only if both conditions hold does Quamina incur the cost of Q number conversion. This dual-guard optimization avoids unnecessary work when patterns contain only strings or when events contain only strings.

The next component in Quamina’s architecture addresses a different representation problem: JSON’s hierarchical structure. Automata naturally process flat sequences, but JSON nests objects and arrays. The flattening system transforms hierarchical JSON into a sequence of field-path and value pairs, bridging this structural gap while preserving the information needed for pattern matching.