CryptoDB
Papers from Transactions on Cryptographic Hardware and Embedded Systems 2025
Year
Venue
Title
2025
TCHES
A Code-Based ISE to Protect Boolean Masking in Software
Abstract
Side-Channel Attacks (SCAs) pose a significant threat to data security in embedded environments. To counteract the power-based SCAs, masking is a widely used defense technique, that introduces randomness to obscure the sidechannel information generated during the processing of secret data. However, in practice, some challenges exist when implementing masking schemes. For example, in the implementation of Boolean masking, they may refer to low noise level and implementation flaws. To address the said implementation challenges, we present an effective and efficient solution that incorporates the code-based masking technique: We mask the shares of Boolean masking with code-based masking and then use a selfdesigned Instruction Set Extension (ISE) to perform efficient private computations within this masked domain. Based on a 32-bit RISC-V Ibex core, we develop a prototype implementation of our ISE, whereby it mainly wraps the ALU with three code-based encoders/decoders and integrates a leakage-resilient pseudo-random generator (PRG). Compared to the base core (vanilla Ibex), the hardware overhead of the ISE implementation is only 8%. The security evaluation based on formal verification and practical evaluation demonstrates that our ISE can provide a more robust practical security guarantee. Furthermore, our approach significantly reduces the signal-to-noise ratio (SNR) of each share, decreasing it to just 2% of the original SNR on the base core.
2025
TCHES
A TRAP for SAT: On the Imperviousness of a Transistor-Level Programmable Fabric to Satisfiability-Based Attacks
Abstract
Locking-based intellectual property (IP) protection for integrated circuits (ICs) being manufactured at untrusted facilities has been largely defeated by the satisfiability (SAT) attack, which can retrieve the secret key needed for instantiating proprietary functionality on locked circuits. As a result, redaction-based methods have gained popularity as a more secure way of protecting hardware IP. Among these methods, transistor-level programming (TRAP) prohibits the outright use of SAT attacks due to the mismatch between the logic-level at which SAT attack operates and the switch-level at which the TRAP fabric is programmed. Herein, we discuss the challenges involved in launching SAT attacks on TRAP and we propose solutions which enable expression of TRAP in propositional logic modeling in a way that accurately reflects switch-level circuit capabilities. Results obtained using a transistor-level SAT attack tool-set that we developed and are releasing corroborate that SAT attacks can be launched against TRAP. However, the increased complexity of switch-level circuit modeling prevents the attack from realistically compromising all but the most trivial IP-protected designs.
2025
TCHES
AETHER: An Ultra-High Throughput and Low Energy Authenticated Encryption Scheme
Abstract
In this paper, we introduce AETHER, an authenticated encryption scheme that achieves ultra-high throughput and low energy consumption, supporting a 256- bit key and a 128-bit tag. While inspired by an AEGIS-like structure, AETHER stands out with a completely redesigned round-update function. We replace the AES round function with a new inner function optimized for ultra-low latency and energy consumption. This function incorporates Orthros’s S-box and a 16x16 binary matrix from Akleylek et al., leading to a 1.56 times reduction in energy consumption and a 1.25 times reduction in delay compared to the AES round function. To further optimize hardware performance, we design the general construction of the roundupdate function to be more hardware-friendly, allowing parallel execution of the inner function on all 128-bit words, thereby enhancing both throughput and security against collision-based forgery attacks. AETHER achieves a throughput of 2.1 Tbit/s and an energy consumption of only 204.31 nJ, in the Nangate 15 nm standard cell library and a throughput of 5.23 Tbit/s and energy consumption of 1.83 nJ using the CNFET-OCL 5nm library, outperforming all existing AEADs.
2025
TCHES
All-You-Can-Compute: Packed Secret Sharing for Combined Resilience
Abstract
Unprotected cryptographic implementations are vulnerable to implementation attacks, such as passive side-channel attacks and active fault injection attacks. Recently, countermeasures like polynomial masking and duplicated masking have been introduced to protect implementations against combined attacks that exploit leakage and faults simultaneously. While duplicated masking requires O (t · e) shares to resist an adversary capable of probing t values and faulting e values, polynomial masking requires only O (t · e) shares, which is particularly beneficial for affine computation. At CHES’24, Arnold et al. showed how to further improve the efficiency of polynomial masking in the presence of combined attacks by embedding two secrets into one polynomial sharing. This essentially reduces the complexity of previous constructions by half. The authors also observed that using techniques from packed secret sharing (Grosso et al., CHES’13) cannot easily achieve combined resilience to encode an arbitrary number of secrets in one polynomial encoding. In this work, we resolve these challenges and show that it is possible to embed an arbitrary number of secrets in one encoding and propose gadgets that are secure against combined attacks. We present two constructions that are generic and significantly improve the computational and randomness complexity of existing compilers, such as the laOla compiler presented by Berndt et al. at CRYPTO’23 and its improvement by Arnold et al. For example, for an AES evaluation that protects against t probes and e faults, we improve the randomness complexity of the state-of-the-art construction when t + e > 3, leading to an improvement of up to a factor of 2.41.
2025
TCHES
CHERI-Crypt: Transparent Memory Encryption on Capability Architectures
Abstract
Capability architectures such as CHERI (Capability Hardware Enhanced RISC Instructions) are an emerging technology designed to provide memory safety protection at the hardware level and are equipped to eradicate approximately 70% of the current software vulnerability attack surface. CHERI is an instruction set architecture extension and has been applied to a small number of processors, including various versions of RISC-V. One of the benefits of CHERI is that it inherently provides segregation or compartmentalisation of software, making it suitable for supporting other types of applications such as Trusted Execution Environments, where sensitive data and computation is conducted inside a secure enclave, away from the rest of the untrusted operating system and services. To prevent untrusted software from accessing these compartments or secure regions of memory CHERI uses the mechanism of sealed capabilities. Trusted execution environments however, have been proven vulnerable to not just software-based attacks, but hardware attacks as well. In this paper we present our CHERI-Crypt design, an encryption engine extension to a CHERI-RISC-V 32-bit processor, for transparent memory encryption of sealed CHERI capabilities to additionally protect sensitive data in memory against physical hardware attacks. We show that our CHERI-Crypt design can run an enclave test program within an encrypted CHERI seal and invoke process, requiring 626 additional clock cycles with a batch size of 32 bytes. Adding CHERI-Crypt reduces the maximum frequency of the base CPU by only 6 MHz, and requires approximately 3.5x more flip flops and LUTs.
2025
TCHES
Constant time lattice reduction in dimension 4 with application to SQIsign
Abstract
In this paper we propose a constant time lattice reduction algorithm for integral dimension-4 lattices. Motivated by its application in the SQIsign postquantum signature scheme, we provide for the first time a constant time LLLlike algorithm with guarantees on the length of the shortest output vector. We implemented our algorithm and ensured through various tools that it indeed operates in constant time. Our experiments suggest that in practice our implementation outputs a Minkowski reduced basis and thus can replace a non constant time lattice reduction subroutine in SQIsign.
2025
TCHES
Designing a General-Purpose 8-bit (T)FHE Processor Abstraction
Abstract
Making the most of TFHE programmable bootstrapping to evaluate functions or operators otherwise challenging to perform with only the native addition and multiplication of the scheme is a very active line of research. In this paper, we systematize this approach and apply it to build an 8-bit FHE processor abstraction, i.e., a software entity that works over FHE-encrypted 8-bit data and presents itself to the programmer by means of a conventional-looking assembly instruction set. In doing so, we provide several homomorphic LookUp Table (LUT) dereferencing operators based on variants of the tree-based method and show that they are the most efficient option for manipulating encryptions of 8-bit data (optimally represented as two basis 16 digits). We then systematically apply this approach over a set of around 50 instructions, including, notably, conditional assignments, divisions, or fixed-point arithmetic operations. We further test the approach on several simple algorithms, including the execution of a neuron with a sigmoid activation function with 16-bit precision. We conclude the paper by comparing our work to the FHE compilers available in the state of the art. Finally, this work reveals that a very limited set of functional bootstrapping patterns is versatile and efficient enough to achieve general-purpose FHE computations beyond the boolean circuit approach. As such, these patterns may be an appropriate target for further works on advanced software optimizations or hardware implementations.
2025
TCHES
Higher-Order Time Sharing Masking
Abstract
At CHES 2024, Time Sharing Masking (TSM) was introduced as a novel low-latency masking technique for hardware circuits. TSM offers area and randomness efficiency, as well as glitch-extended PINI security, but it is limited to first-order security. We address this limitation and generalize TSM to higher-order security while maintaining all of TSM’s advantages. Additionally, we propose an area-latency tradeoff. We prove HO-TSM glitch-extended PINI security and successfully evaluate our circuits using formal verification tools. Furthermore, we demonstrate area- and latency-efficient implementations of the AES S-box, which do not exhibit leakage in TVLA on FPGA. Our proposed tradeoff enables a first-order secure implementation of a complete AES-128 encryption core with 92 kGE, 920 random bits per round, and 20 cycles of latency, which does not exhibit leakage in TVLA on FPGA.
2025
TCHES
Improving MPCitH with Preprocessing: Mask Is All You Need
Abstract
The MPC-in-the-head with preprocessing (MPCitH-PP) paradigm presents a novel approach for constructing post-quantum digital signatures like Picnic3. This paper revisits the MPCitH-PP construction, analyzing both its offline and online phases and proposing a reformulation of the protocol. By identifying redundant computations in these phases, we optimize them into a single phase, thereby enhancing the efficiency of MPCitH-PP. Furthermore, we explore the independence of the mask, demonstrating that it can be calculated in parallel, which also enables the optimization of the masked witness calculation.Our optimized implementation of Picnic3 shows significant improvements. At the L1 security level, the optimal software implementation reduces MPCitH-PP calculation time to about 30% of the previous implementation. The optimal signature implementation costs about 78% of the previous implementation time. At the L5 security level, MPCitH-PP with parallelism optimal is reduced to about 26% of the previous solution’s time, and the optimal signature implementation runs at about 53% of the previous solution’s time. For the hardware implementation, our optimizations reduce the clock cycles of MPCitH-PP from r sequential rounds to a single parallel round, where r denotes the number of rounds in the LowMC algorithm, with little change in hardware usage, and perform better in AT product, especially for parallel computing.
2025
TCHES
Information Theoretic Analysis of PUF-Based Tamper Protection
Abstract
PUFs enable physical tamper protection for high-assurance devices without needing a continuous power supply that is active over the entire lifetime of the device. Several methods for PUF-based tamper protection have been proposed together with practical quantization and error correction schemes. In this work we take a step back from the implementation to analyze theoretical properties and limits. We apply zero leakage output quantization to existing quantization schemes and minimize the reconstruction error probability under zero leakage. We apply wiretap coding within a helper data algorithm to enable a reliable key reconstruction for the legitimate user while guaranteeing a selectable reconstruction complexity for an attacker, analogously to the security level for a cryptographic algorithm for the attacker models considered in this work. We present lower bounds on the achievable key rates depending on the attacker’s capabilities in the asymptotic and finite blocklength regime to give fundamental security guarantees even if the attacker gets partial information about the PUF response and the helper data. Furthermore, we present converse bounds on the number of PUF cells. Our results show for example that for a practical scenario one needs at least 459 PUF cells using 3 bit quantization to achieve a security level of 128 bit.
2025
TCHES
KyberSlash: Exploiting secret-dependent division timings in Kyber implementations
Abstract
This paper presents KyberSlash1 and KyberSlash2 – two timing vulnerabilities in several implementations (including the official reference code) of the Kyber Post-Quantum Key Encapsulation Mechanism, recently standardized as ML-KEM. We demonstrate the exploitability of both KyberSlash1 and KyberSlash2 on two popular platforms: the Raspberry Pi 2 (Arm Cortex-A7) and the Arm Cortex-M4 microprocessor. Kyber secret keys are reliably recovered within minutes for KyberSlash2 and a few hours for KyberSlash1. We responsibly disclosed these vulnerabilities to maintainers of various libraries and they have swiftly been patched. We present two approaches for detecting and avoiding similar vulnerabilities. First, we patch the dynamic analysis tool Valgrind to allow detection of variable-time instructions operating on secret data, and apply it to more than 1000 implementations of cryptographic primitives in SUPERCOP. We report multiple findings. Second, we propose a more rigid approach to guarantee the absence of variable-time instructions in cryptographic software using formal methods.
2025
TCHES
Leading Degree: A Metric for Model Performance Evaluation and Hyperparameter Tuning in Deep Learning-Based Side-Channel Analysis
Abstract
Side-channel analysis benefits a lot from deep learning techniques, which assist attackers in recovering the secret key with fewer attack traces than before, but it remains a problem to precisely measure deep learning model performance, so as to obtain a high-performance model. Commonly used evaluation metrics for deep learning like accuracy and precision cannot well meet the demand due to their deviation in side-channel analysis, and classical evaluation metrics for side-channel analysis like guessing entropy, success rate and TGE1 are not generic because they effectively evaluate model performance in only one of the two situations: whether models manage to recover the secret key with given attack traces or not, and not efficient because they need to be performed multiple times to counteract randomness. To attain an effective generic side-channel evaluation metric, we investigate the deterministic component of power consumption, find that the elements of score vector under a key follow a linearly transformed chi-square distribution approximately, and some wrong key hypotheses usually with top scores provide great assistance in model performance evaluation, and finally we propose a new metric called Leading Degree (LD) as well as its simplified version LD-simplified for measuring model performance, which offers similar accuracy but much better generality and efficiency compared with the classical side-channel benchmark metric TGE1, and offers similar generality and efficiency but significantly better accuracy compared with recently proposed sidechannel metrics like Label Correlation and Cross Entropy Ratio. LD/LD-simplified can be easily deployed in early stopping to avoid overfitting phenomena, and we build a bridge between LD/LD-simplified and TGE1, by observing an exponential relationship, which significantly shortens the estimating time for TGE1. At last, we apply LD as a reward function to better solve the reward function design problem in reinforcement learning-based model hyperparameter tuning of side-channel analysis, and obtain better CNN model architectures compared with the state-of-the-art models obtained by previous hyperparameter tuning methods.
2025
TCHES
Leaky McEliece: Secret Key Recovery From Highly Erroneous Side-Channel Information
Abstract
The McEliece cryptosystem is a strong contender for post-quantum schemes, including key encapsulation for confidentiality of key exchanges in network protocols. A McEliece secret key is a structured parity check matrix that is transformed via Gaussian elimination into an unstructured public key. We show that this transformation is highly critical with respect to side-channel leakage. We assume leakage of the elementary row operations during Gaussian elimination, motivated by McEliece implementations in the cryptographic libraries Classic McEliece and Botan.We propose a novel decoding algorithm to reconstruct a secret key from its public key with information from a Gaussian transformation leak. Even if the obtained side-channel leakage is extremely noisy, i.e., each bit is flipped with probability as high as r ≈ 0.4, we succeed to recover the secret key in a matter of minutes for all proposed (Classic) McEliece instantiations. Remarkably, for high-security McEliece parameters, our attack is more powerful in the sense that it can tolerate even larger r . We demonstrate our attack on the constant-time reference implementation of Classic McEliece in a single-trace setting, using an STM32L592 ARM processor.Our result stresses the necessity of properly protecting highly structured code-based schemes such as McEliece against side-channel leakage.
2025
TCHES
MulLeak: Exploiting Multiply Instruction Leakage to Attack the Stack-optimized Kyber Implementation on Cortex-M4
Abstract
CRYSTALS-Kyber, one of the NIST PQC standardization schemes, has garnered considerable attention from researchers in recent years for its side-channel security. Various targets have been explored in previous studies; however, research on extracting secret information from stack-optimized implementations targeting the Cortex-M4 remains scarce, primarily due to the lack of memory access operations, which increases the difficulty of attacks.This paper shifts the focus to the leakage of multiply instructions and present a novel cycle-level regression-based leakage model for the following attacks. We target the polynomial multiplications in decryption process of the stack-optimized implementation targeting the Cortex-M4, and propose two regression-based profiled attacks leveraging known ciphertext and chosen ciphertext methodologies to recover the secret coefficients individually. The later one can also be extended to the protected implementation.Our practical evaluation, conducted on the stack-optimized Kyber-768 implementation from the pqm4 repository, demonstrates the effectiveness of the proposed attacks. Focusing on the leakage from the pair-pointwise multiplication, specifically the macro doublebasemul_frombytes_asm, we successfully recover all secret coefficients with a success rate exceeding 95% using a modest number of traces for each attack. This research underscores the potential vulnerabilities in PQC implementations against side-channel attacks and contributes to the ongoing discourse on the physical security of cryptographic algorithms.
2025
TCHES
New Quantum Cryptanalysis of Binary Elliptic Curves
Abstract
This paper improves upon the quantum circuits required for the Shor’s attack on binary elliptic curves. We present two types of quantum point addition, taking both qubit count and circuit depth into consideration.In summary, we propose an in-place point addition that improves upon the work of Banegas et al. from CHES’21, reducing the qubit count – depth product by more than 73% – 81% depending on the variant. Furthermore, we develop an out-of-place point addition by using additional qubits. This method achieves the lowest circuit depth and offers an improvement of over 92% in the qubit count – quantum depth product (for a single step).To the best of our knowledge, our work improves from all previous works (including the CHES’21 paper by Banegas et al., the IEEE Access’22 paper by Putranto et al., and the CT-RSA’23 paper by Taguchi and Takayasu) in terms of circuit depth and qubit count – depth product.Equipped with the implementations, we discuss the post-quantum security of the binary elliptic curve cryptography. Under the MAXDEPTH metric (proposed by the US government’s NIST), the quantum circuit with the highest depth in our work is 224, which is significantly lower than the MAXDEPTH limit of 240. For the gate count – full depth product, a metric for estimating quantum attack cost (proposed by NIST), the highest complexity in our work is 260 for the curve having degree 571 (which is comparable to AES-256 in terms of classical security), considerably below the post-quantum security level 1 threshold (of the order of 2156).
2025
TCHES
OPTIMSM: FPGA hardware accelerator for Zero-Knowledge MSM
Abstract
The Multi-Scalar Multiplication (MSM) is the main barrier to accelerating Zero-Knowledge applications. In recent years, hardware acceleration of this algorithm on both FPGA and GPU has become a popular research topic and the subject of a multi-million dollar prize competition (ZPrize). This work presents OPTIMSM: Optimized Processing Through Iterative Multi-Scalar Multiplication. This novel accelerator focuses on the acceleration of the MSM algorithm for any Elliptic Curve (EC) by improving upon the Pippenger algorithm. A new iteration technique is introduced to decouple the required buckets from the window size, resulting in fewer EC computations for the same on-chip memory resources. Furthermore, we combine known optimizations from the literature for the first time to achieve additional latency improvements. Our enhanced MSM implementation significantly reduces computation time, achieving a speedup of up to x12.77 compared to recent FPGA implementations. Specifically, for the BLS12-381 curve, we reduce the computation time for an MSM of size 224 to 914 ms using a single compute unit on the U55C FPGA or to 231 ms using four U55C devices. These results indicate a substantial improvement in efficiency, paving the way for more scalable and efficient Zero-Knowledge proof systems.
2025
TCHES
Protection of Oscillator-Based PUFs against Side Channel Analyses by Random Interruption
Abstract
Oscillation-based physical unclonable functions (PUFs) are known to be sensitive to power trace side channel analyses (SCAs). Although previous work investigated on countermeasures, these required significant additional amount of hardware or were just able to obscure sign information of a frequency comparison, while the magnitude information remains available to the attacker. As recent innovation on oscillation-based PUFs also require the magnitude-information beside the sign, e.g., to increase the reliability, the need arises to protect both. We present a new protection approach to hide both sign and magnitude information of oscillation-based PUF from an attacker. By introducing random interruptions in the oscillation, the power spectrum is blurred while the quality of the PUF is maintained. In addition to concept simulations and the discussion of different implementations, we use the example of a loop-PUF to show that the presented countermeasure can withstand several attack scenarios.
2025
TCHES
REED: Chiplet-based Accelerator for Fully Homomorphic Encryption
Abstract
Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All these prior works put every component needed for FHE onto one chip (monolithic), hence offering high performance. However, they encounter common challenges associated with large-scale chip design, such as inflexibility, low yield, and high manufacturing costs. In this paper, we present the first-of-its-kind multi-chiplet-based FHE accelerator ‘REED’ for overcoming the limitations of prior monolithic designs. To utilize the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism design to enhance performance.Our instruction-set and power simulations experiments with a prelayout netlist indicate that REED 2.5D microprocessor consumes 96.7mm2 chip area, 49.4Waverage power in 7nm technology. It could achieve a remarkable speedup of up to 2,991x compared to a CPU (24-core 2xIntel X5690) and offer 1.9x better performance, along with a 50% reduction in development costs when compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first instance of benchmarking an encrypted deep neural network (DNN) training. Overall, the REED architecture design offers a highly effective solution for accelerating FHE, thereby significantly advancing the practicality and deployability of FHE in real-world applications.
2025
TCHES
Rudraksh: A compact and lightweight post-quantum key-encapsulation mechanism
Abstract
Resource-constrained devices such as wireless sensors and Internet of Things (IoT) devices have become ubiquitous in our digital ecosystem. These devices generate and handle a major part of our digital data. However, due to the impending threat of quantum computers on our existing public-key cryptographic schemes and the limited resources available on IoT devices, it is important to design lightweight post-quantum cryptographic (PQC) schemes suitable for these devices.In this work, we explored the design space of learning with error-based PQC schemes to design a lightweight key-encapsulation mechanism (KEM) suitable for resourceconstrained devices. We have done a scrupulous and extensive analysis and evaluation of different design elements, such as polynomial size, field modulus structure, reduction algorithm, and secret and error distribution of an LWE-based KEM. Our explorations led to the proposal of a lightweight PQC-KEM, Rudraksh, without compromising security. Our scheme provides security against chosen ciphertext attacks (CCA) with more than 100 bits of Core-SVP post-quantum security and belongs to the NIST-level-I security category (provide security at least as much as AES-128). We have also shown how ASCON can be used for lightweight pseudo-random number generation and hash function in the lattice-based KEMs instead of the widely used Keccak for lightweight design. Our FPGA results show that Rudraksh currently requires the least area among the PQC KEMs of similar security. Our implementation of Rudraksh provides a ~3x improvement in terms of the area requirement compared to the state-of-the-art areaoptimized implementation of Kyber, can operate at 63%-76% higher frequency with respect to high-throughput Kyber, and improves time-area-product ~2x compared to the state-of-the-art compact implementation of Kyber published in HPEC 2022.
2025
TCHES
SeaFlame: Communication-Efficient Secure Aggregation for Federated Learning against Malicious Entities
Abstract
Secure aggregation is a popular solution to ensuring privacy for federated learning. However, when considering malicious participants in secure aggregation, it is difficult to achieve both privacy and high efficiency. Therefore, we propose SeaFlame, a communication-efficient secure aggregation protocol against malicious participants. Inspired by the state-of-the-art work, ELSA, SeaFlame also utilizes two non-colluding servers to ensure privacy against malicious entities and provide defenses against boosted gradients. Crucially, to improve communication efficiency, SeaFlame uses arithmetic sharing together with arithmetic-to-arithmetic share conversion to reduce client communication, and uses the random linear combination to reduce server communication.Security analysis proves that our SeaFlame guarantees privacy against malicious clients colluding with one malicious server. Experimental evaluation demonstrates that, compared to ELSA, SeaFlame optimizes communication by up to 10.5, 6.00, and 8.17 times for a client, a server, and all entities, at the expense of 1.25-1.86 times additional end-to-end runtime.
2025
TCHES
Shortcut2Secrets: A Table-based Differential Fault Attack Framework
Abstract
Recently, Differential Fault Attacks (DFAs) have proven highly effective against stream ciphers designed for Hybrid Homomorphic Encryption (HHE). In this work, we present a table-based DFA framework called the shortcut attack, which generalizes the attack proposed by Wang and Tang on the cipher Elisabeth. The framework applies to a broad sub-family of ciphers following the Group Filter Permutator (GFP) paradigm and enhances previous DFAs by improving both the fault identification and path generation steps. Notably, the shortcut attack circumvents the issue of function representation, allowing successful attacks even when the cipher’s filter function cannot be represented over the ring it is defined on.Additionally, we provide complexity estimates for the framework and apply the shortcut attack to Elisabeth-4 and its patches. As a result, we optimize the DFA on Elisabeth-4, requiring fewer keystreams and running faster than previous methods. Specifically, we achieve a DFA that requires only 3000 keystreams, which is one-fifth of the previous best result. We also successfully mount a practical DFA on Gabriel-4 and provide a theoretical DFA for Elisabeth-b4.For the latest patch, Margrethe-18-4, which follows the more general Mixed Filter Permutator (MFP) paradigm, we present a DFA in a stronger model. To the best of our knowledge, these are the first DFA results on the patches of Elisabeth-4. Finally, we derive security margins to prevent shortcut attacks on a broad sub-family of MFP ciphers, which can serve as parameter recommendations for designers.
2025
TCHES
Sieving with Streaming Memory Access
Abstract
We implement an optimized BGJ (Becker–Gama–Joux 2015) sieve and analyze its behavior in a study of RAM access overheads (and their minimization) in sieving algorithms for large lattice problems. Both experiment and theory points to BGJ’s inherent structure being much more memory-efficient than the BDGL (Becker–Ducas– Gama–Laahoven 2016) sieve, which uses asymptotically the fewest logical operations. In particular, a dimension-n BGJ sieve uses only 20.2075n+o(n) streaming (non-random) main memory accesses. A key insight: Bucket sizes decrease by orders of magnitude after each BGJ filtering layer, so that sub-buckets fit into successively much smaller (hence faster) storage areas. Our refined BGJ is competitive at cryptographic sizes and should outperform BDGL for all practically achievable dimensions.The above is corroborated by the results from our efficient CPU-based BGJ implementation in an optimized framework, which saves about 40% RAM footprint and is ≥ 24.5x more efficient gate-count-wise compared to the Ducas–Stevens–van Woerden 2021 4-GPU implementation, which like most prior sieving-based SVP computations is a HK3 (Herold–Kirshanova 2017) sieve. Notably, we solved the 183-dimensional SVP Darmstadt Challenge in 30 days on a 112-core server and 0.87 TB of RAM; similarly we also found a short vector in the 796-dimensional Ideal-SVP Challenge. Our implementation may offer further insights into the behavior of asymptotically “fast” sieving algorithms when applied to large-scale problems. Moreover, our refined cost estimation of SVP based on this implementation suggests that some NIST PQC candidates (e.g. Falcon-512), are not sure to meet NIST’s security requirements.
2025
TCHES
SimdMSM: SIMD-accelerated Multi-Scalar Multiplication Framework for zkSNARKs
Abstract
Multi-scalar multiplication (MSM) is the primary building block in many pairing-based zero-knowledge proof (ZKP) systems. MSM at large scales has become the main bottleneck in ZKP implementations. Inspired by existing SIMD-accelerated work, we are focused on accelerating MSM computing efficiency using SIMD instructions in a single CPU environment. First, we propose a SIMD-accelerated MSM computing architecture with no write conflicts and constant memory overheads. This architecture utilizes multithreading to achieve task-level and loop-level parallelism and employs a three-tier buffer mechanism to maximize the utilization of the SIMD engine. Instanced with AVX512-IFMA instructions, we implement six SIMD elliptic curve arithmetic engines for different point addition in three coordinate systems and two groups. Moreover, we integrate our AVX-MSM implementation into the libsnark library, naming it AVX-ZK. In more detail, point deduplication and “Three-Stage” memory optimization are proposed to address problems existing in practical applications. Based on the RELIC library, our performance results on the BLS12-381 curve show that our AVX-MSM achieves up to 27.86x speedup over the most popular Pippenger algorithm. Compared with libsnark, our AVX-ZK implementation achieves over 11.53x (up to 20.26x) speedup under standard benchmarks.
2025
TCHES
Skyscraper: Fast Hashing on Big Primes
Abstract
Arithmetic hash functions defined over prime fields have been actively developed and used in verifiable computation (VC) protocols. Among those, ellipticcurve- based SNARKs require large (256-bit and higher) primes. Such hash functions are notably slow, losing a factor of up to 1000 compared to regular constructions like SHA-2/3.In this paper, we present the hash function Skyscraper, which is aimed at large prime fields and provides major improvements compared to Reinforced Concrete and Monolith. First, the design is exactly the same for all large primes, which simplifies analysis and deployment. Secondly, it achieves a performance comparable to cryptographic hash standards by using low-degree non-invertible transformations and minimizing modulo reductions. Concretely, it hashes two 256-bit prime field (BLS12-381 curve scalar field) elements in 135 nanoseconds, whereas SHA-256 needs 42 nanoseconds on the same machine.The low circuit complexity of Skyscraper, together with its high native speed, should allow a substantial reduction in many VC scenarios, particularly in recursive proofs.
2025
TCHES
TFHE Gets Real: an Efficient and Flexible Homomorphic Floating-Point Arithmetic
Abstract
Floating-point arithmetic plays a central role in computer science and is used in various domains where precision and computational scale are essential. One notable application is in machine learning, where Fully Homomorphic Encryption (FHE) can play a crucial role in safeguarding user privacy. In this paper, we focus on TFHE and develop novel homomorphic operators designed to enable the construction of precise and adaptable homomorphic floating-point operations. Integrating floating-point arithmetic within the context of FHE is particularly challenging due to constraints such as small message space and the lack of information during computation. Despite these challenges, we were able to determine parameters for common precisions (e.g., 32-bit, 64-bit) and achieve remarkable computational speeds, with 32-bit floating-point additions completing in 2.5 seconds and multiplications in approximately 1 second in a multi-threaded environment. These metrics provide empirical evidence of the efficiency and practicality of our proposed methods, which significantly outperform previous efforts. Our results demonstrate a significant advancement in the practical application of FHE, making it more viable for real-world scenarios and bridging the gap between theoretical encryption techniques and practical usability.