Article
Binary Tower Fields are the Future of Verifiable Computing
32-bit Binary Towers are 5x more hardware efficient than Mersenne31 multipliers
10/1/24
Irreducible Team
TL;DR
Small fields enable faster multiplication, directly improving STARKs’ performance.
Hardware efficiency comparison shows that 32-bit Binary Towers are 5x more efficient than Mersenne31 multipliers.
The lack of underlying integer multiplication and its inherent carry propagation makes Binary Towers a winning choice for hardware-friendly, verifiable computing systems.
Introduction
STARKs have seen remarkable performance gains in recent years by transitioning from large prime fields to small prime fields. The introduction of the 64-bit Goldilocks field in 2022 enabled Polygon's plonky2 to achieve a 100x speed-up over state-of-the-art proofs using alternate techniques. Since then, proof systems such as RISC Zero and Polygon’s Plonky3 have further boosted performance by moving to the BabyBear and Mersenne 31-bit prime fields. Consequently, two zkVM projects using STARKs over the BabyBear field, RISC Zero and Succinct's SP1, can emulate millions of RISC-V CPU cycles in mere seconds when run on a single computer. StarkWare also recently announced that their Stwo proof system, which uses the 31-bit Mersenne field, is 1,000x faster than their first generation Stone proof system that relied on a 252-bit prime field.
The immense impact of small field adoption comes from the substantial number of field multiplications required during proving. Finite field multiplications are the basis of all prover computations.
Binius is a new proof system we are developing at Irreducible based on Binary Tower fields. This post dives into the hardware efficiency of binary tower arithmetic and explains why it is 5x more efficient than the arithmetic of the most hardware-friendly 32-bit prime fields.
The Conventional Option: Prime Fields
Each multiplication consists of an integer multiplication and a prime reduction. Integer multiplication becomes much slower as the field size increases. For a quick validation, consider how the long multiplication method from grade school requires as many full additions as there are digits. The result of the integer multiplication is twice the field size, requiring prime reduction. Certain prime numbers known as Mersenne primes permit the most computationally efficient reduction method, which involves just one extra field addition.
STARK protocols generally require finite fields that are at least 31 bits. Recently published research on Circle STARKs introduces an elegant variation of the STARK protocol using the 31-bit Mersenne prime. This effectively makes the Mersenne-31 field the most efficient prime field compatible with STARKs.
Figure 1. Prime Multiplier Block Diagram
The Carryless Option: Binary Tower Fields
Binary tower fields, like those used in Binius, have several intrinsic advantages over prime fields. The most important advantage is that addition in binary fields is carryless. For integer adder circuits, we need to worry about the carry propagation, which requires the 1-bit full-adder circuit. On the other hand, in binary fields, bitwise addition is achieved by a single, much cheaper, XOR gate. Figure 2 shows a schematic for both sum (S) and carry (Cout) logic functions of a 1-bit full-adder. The full-adder requires two XOR, two AND, and one OR logic gates, while the Binary Tower adder requires a single XOR gate.
Figure 2. 1-bit Full Adder
The second advantage is the depth of the critical path of field element adders. Binary Towers’ critical path is fixed, one gate deep, while the critical path of the integer adder is proportional to the field element size. In the case of the full-adders, presented in Figure 2, the sum and carry calculations create a critical path two logic gates deep. For the 32-bit field elements, the Binary Tower adder will have a critical path one gate deep, while the ripple-carry adder will have a critical path 64 gates deep.
The third advantage is related to the reduction, which is performed as recursion progresses up the tower field, preventing intermediate results from exceeding the field element size. In contrast, prime fields produce results twice the input size, and reduction cannot be parallelized with multiplication (as presented in Figure 1).
Efficiency Comparison: Mersenne31 vs 32-bit Binary Towers
This report examines the hardware efficiency of multiplier designs. The hardware efficiency of a digital circuit is defined as the ratio of the number of operations per second (mult/s) to silicon area (µm2) utilized by the digital logic. More efficient hardware multipliers execute more multiplications per dollar spent.
Hardware efficiency has been well-studied in the past. In 2002, Gerardo Orlando wrote about Efficient Elliptic Curve Processor Architectures for FPGAs in his PhD thesis, which compares the hardware efficiency of FPGA implementations. In 2011, Michael Mühlberghuber wrote about Comparing ECDSA Hardware Implementations based on Binary and Prime Fields in his MSc thesis, which compares the hardware efficiency in 180nm process node.
We present an updated analysis based on modern silicon process nodes. We synthesized Mersenne31 (M31) and 32-bit Binary Tower (BT32) multipliers in 32nm and 14nm process nodes, comparing their maximum operating frequency and silicon area utilization. For this experiment, we used the Synopsys cloud platform that hosts the Design Compiler synthesis tool. This design environment provides access to the Synopsys DesignWare IP library, the ASIC industry-leading collection of high-performance digital circuits. This library provides advanced adder designs (CSA, CLA, RCA) that could be used to synthesize multiplier circuits.
Figure 3. Mersenne31 multiplier, implemented with the DesignWare CLA & CSA adders
Figure 4. 4-bit, 8-bit, 16-bit, and 32-bit Binary Tower multipliers, implemented recursively with fancy XOR and NAND gates
Maximum Operating Frequency Analysis: Our multiplier designs operate in a single-cycle regime, crucial for exposing the design's inherent critical path. We only register the input operands and the output result, with the multiplication digital logic being purely combinatorial.
Silicon Area Utilization Analysis: Logic gates, such as 2-input NAND gates, vary significantly in physical size between 32nm and 14nm processes due to transistor shrinking. We report silicon area in square micrometers (µm2).
Table 1. Summary of the Synthesis Results.
The efficiency comparison indicates a 5x advantage for Binary Tower fields. Given the definition of hardware efficiency, we conclude that if we spend the same amount of money to manufacture two chips of the same size, one running M31 and one running BT32 multiplications, the BT32 will perform 5x more multiplications per second. This comparison demonstrates that Binary Towers are much more hardware-friendly than the most efficient prime field, Mersenne31.
A Note on Critical Path: BT32 shows a critical path of only 15 physical gates for each process node. For comparison, see the estimated critical paths using ideal two-input AND and XOR logic gates in Table 2. The numbers are remarkably close for the 32-bit case. For the larger field element sizes, the critical path increases approximately logarithmically.
Table 2. Estimated logic gates count in the critical path.
A Note on Pipelining: One could argue that the M31 multiplier could be pipelined to cut the critical path. After pipelining, the number of logic levels may match BT32, improving the efficiency ratio in favor of M31. However, inserting four pipeline stages would require a relatively large amount of registers to capture the partial products of the integer multiplication (each 31bit + 1bit for carry).
In addition to a significant increase in silicon area, we must review this digital circuit’s clock frequency. In 14nm, BT32 can achieve a single-clock operation at 1.4GHz with a 15 logic levels critical path. Pipelining the M31 multiplier to reach that operational frequency would require significant effort to deliver clocks to newly inserted pipeline registers. Managing clock setup and hold times becomes an implementation challenge, especially at 1.4GHz. This intervention could cause issues with self-heating, power density, IR drop, and similar low-level silicon effects, irrelevant to this report.
The BT32 multiplier could also be further pipelined, requiring fewer registers. While pipelining could benefit both multipliers, the efficiency ratio of 5x will be at least preserved, if not improved, in favor of BT32.
Conclusion
We at Irreducible are certain that novel high-performance proof systems will be built using Binary Tower fields. Binius, our open-source proof system, utilizes Binary Tower fields for their superior hardware efficiency compared to prime fields.
Binary fields, particularly Binary Towers, will always be a preferred choice for highly-efficient hardware arithmetic. The lack of underlying integer multiplication and carry propagation makes them ideal for hardware-friendly cryptographic applications. The Advanced Encryption Standard uses the GF(8) binary field and is a hardware-friendly block cipher. Modern CPUs support AES Instructions designed to efficiently perform AES encryption and decryption operations.
We did not discuss the hardware efficiency of smaller binary towers, such as 16-bit, 8-bit, and 1-bit fields, which are also used in the Binius proof system and offer unprecedented levels of hardware performance. We focused on the 32-bit Binary Tower field to directly compare it with the Mersenne31 prime field. This report leaves no doubt: Binary Tower fields are the future of verifiable computing!