A leap year check in three instructions

With the following code, we can check whether a year 0 ≤ y ≤ 102499 is a leap year with only about 3 CPU instructions:

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

How does this work? The answer is surprisingly complex. This article explains it, mostly to have some fun with bit-twiddling; at the end, I'll briefly discuss the practical use.

This is how a leap year check is typically implemented:

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

We are using the proleptic Gregorian calendar, which extends the Gregorian calendar backward from its introduction in 1582 and includes a year 0. Thus, we don't need to treat years before 1582 any differently. For simplicity, we ignore negative years and use an unsigned year.

Optimizing the standard approach

Let's first do some simple speedups so that we get a good baseline. I'm not sure where to assign credit – these tricks have probably been reinvented independently many times.

We can replace (y % 100) != 0 by (y % 25) != 0: we already know that y is a multiple of 2², so if it is also a multiple of 5², it is a multiple of 2² ⋅ 5² = 100. Similarly, we can replace (y % 400) == 0 by (y % 16) == 0: we already know that y is a multiple of 5², so if it is also a multiple of 2⁴, it is a multiple of 5² ⋅ 2⁴ = 400.

bool is_leap_year1(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 25) != 0) return true;
    if ((y % 16) == 0) return true;
    return false;
}
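
As a quick sanity check, here is a Python sketch with direct ports of both functions; they agree over many full 400-year cycles:

def is_leap_year(y):
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

def is_leap_year1(y):
    return y % 4 == 0 and (y % 25 != 0 or y % 16 == 0)

assert all(is_leap_year1(y) == is_leap_year(y) for y in range(100000))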

This is useful because we can now replace the modulo by 4 and by 16 with bit masking. There is a further trick, well known to compiler implementors, for lowering the modulo by 25. Compiling (x % 25) != 0 with gcc and translating back to C, we get x * 3264175145 > 171798691. With multiplication at a typical latency of 3 cycles and modulo at 20 or more cycles, this is a great improvement. I will only give an intuition on how it works; more details can be found in Hacker's Delight or in Granlund and Montgomery's paper "Division by Invariant Integers using Multiplication".

Where do the magic numbers come from? We have

2³² ⋅ 19/25 = 3264175144.96 (exactly).

So by multiplying with 3264175145, we calculate approximately 2³² times the fractional part of 19y/25. If this were an exact multiplication, we would get a fractional part of zero for multiples of 25. We multiply with a number that is too high by 0.04 though, so the error can be up to 0.04 ⋅ (2³² − 1) = 171798691.8, which is where the second magic number comes from.

This trick doesn't work quite as nicely for x % 100, where we need one more fixup instruction, so the reduction from y % 100 to y % 25 was helpful.
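
We can verify the constants with a quick Python sketch. Equivalently to the 19/25 view, 3264175145 is the multiplicative inverse of 25 modulo 2³², and 171798691 is ⌊(2³² − 1)/25⌋:

M = 2**32
assert (25 * 3264175145) % M == 1    # inverse of 25 modulo 2^32
assert (M - 1) // 25 == 171798691    # the comparison threshold

# Spot-check the multiply-and-compare test against y % 25.
for y in list(range(1000000)) + [M - 25, M - 1]:
    assert ((y * 3264175145) % M > 171798691) == (y % 25 != 0)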

Our leap year check now looks like this:

bool is_leap_year2(uint32_t y) {
    if ((y & 3) != 0) return false;
    if (y * 3264175145u > 171798691u) return true;
    if ((y & 15) == 0) return true;
    return false;
}

Note that modern compilers like gcc or clang will produce something like is_leap_year2 from is_leap_year1, so there is not much point in doing this in C source, but it might be useful in other programming languages.

This code typically compiles to branching assembly. In practice, inputs are often fairly predictable, so that is not necessarily bad. If we want to avoid branch misprediction penalties, at the price of also slowing down the best case, we can shuffle things around a bit and get branchless code (and also a decent code golf candidate):

bool is_leap_year3(uint32_t y) {
    return !(y & ((y % 25) ? 3 : 15));
}
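
A direct Python port shows that the branchless version agrees with the standard check (a quick sketch):

def is_leap_year(y):
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

def is_leap_year3(y):
    return (y & (3 if y % 25 else 15)) == 0

assert all(is_leap_year3(y) == is_leap_year(y) for y in range(100000))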

If you want to see some more calendar calculation speedups, make sure to check out https://jhpratt.dev/blog/optimizing-with-novel-calendrical-algorithms/ by Jacob Pratt.

Finding a bit-twiddling approach

Can we improve the leap year calculation by giving up on correctness for all inputs? After all, we typically don't care whether the year 3584536493 is a leap year; indeed, Python, C#, and Go only support years 0 (or 1) to 9999 (at which point the drift relative to the seasons is already more than 4 days). My thought was that if something shorter exists, it will basically look like some weird hashing using magic constants, so I wanted to try some small forms and guess the constants by brute force. The form (y * f) <= t seems useful but is not powerful enough yet. One of my candidates was adding a mask: ((y * f) & m) <= t. Now we have 96 bits to guess, which is not feasible by brute force alone. Let's use z3, an SMT solver with support for bitvector constraints, which is perfectly suited for this:

import z3

BITS = 32
f, m, t, y = z3.BitVecs('f m t y', BITS)

# The reference leap year check, as a bitvector formula.
def target(y):
    return z3.And((y & 3) == 0, z3.Or(z3.URem(y, 25) != 0, (y & 15) == 0))

# The candidate form with unknown constants f, m, t.
def candidate(x):
    return z3.ULE((x * f) & m, t)

solver = z3.Solver()
# Require agreement with the reference check for all years up to 400.
solver.add(z3.ForAll(y, z3.Implies(z3.ULE(y, 400),
                                   candidate(y) == target(y))))

if solver.check() == z3.sat:
    print(f'found solution: {solver.model()}')
else:
    print('no solution found')

In a few seconds, this finds constants that yield the correct result for a nontrivial year range. Extending the range, I eventually found constants that yield the correct result from year 0 up to the year 102499, in about half an hour of computation time, and proved that this is the optimum for 32 bits:

bool is_leap_year_fast(uint32_t y) {
    const uint32_t f = 1073750999u;
    const uint32_t m = 3221352463u;
    const uint32_t t = 126976u;
    return ((y * f) & m) <= t;
}
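
A quick Python sketch confirms the claimed range: the fast check agrees with the standard one for every year up to 102499 and first fails at 102500, a multiple of 100 but not of 400:

def is_leap_year(y):
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

def is_leap_year_fast(y):
    return ((y * 1073750999) & 3221352463) <= 126976

assert all(is_leap_year_fast(y) == is_leap_year(y) for y in range(102500))
assert is_leap_year_fast(102500) and not is_leap_year(102500)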

Explanation

How does it work? It seems surprising, almost magic, that we can squeeze all these computations into three instructions. However, the considerations above give us most of the tools to understand it.

Here are our constants in binary, with four relevant bit ranges indicated by letters:

   A                 D
f: 01000000000000000010001111010111
m: 11000000000000011111000000001111
t: 00000000000000011111000000000000
                  B            C

Let's first consider what the masking with m and the comparison with t do to the product p := y ⋅ f. In block A, the bits in t are 0, and therefore, the result is false as soon as any bit in A is set in p. Otherwise, block B becomes relevant. Here all bits in t are 1, so the result is true as soon as any bit in B is unset in p. Otherwise, for block C we require that all bits are unset in p. In this way, a bunch of comparisons of bit ranges are all united into a single <=.

Thus, we could rewrite is_leap_year_fast as follows:

bool is_leap_year_fast2(uint32_t y) {
    uint32_t p = y * 1073750999u;
    const uint32_t A = 0b11000000000000000000000000000000;
    const uint32_t B = 0b00000000000000011111000000000000;
    const uint32_t C = 0b00000000000000000000000000001111;
    if ((p & A) != 0) return false;
    if ((p & B) != B) return true;
    if ((p & C) == 0) return true;
    return false;
}
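
This rewriting is exact: for every 32-bit input, not just the valid year range, is_leap_year_fast2 computes the same value as is_leap_year_fast. A Python sketch sampling the full 32-bit range:

import random

def fast(y):
    return ((y * 1073750999) & 3221352463) <= 126976

def fast2(y):
    p = (y * 1073750999) % 2**32
    A = 0b11000000000000000000000000000000
    B = 0b00000000000000011111000000000000
    C = 0b00000000000000000000000000001111
    if p & A: return False
    if (p & B) != B: return True
    return (p & C) == 0

assert all(fast2(y) == fast(y)
           for y in (random.randrange(2**32) for _ in range(10**6)))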

That looks suspiciously like is_leap_year2! And indeed, the three conditions have exactly the same purpose. We'll show

  1. (p & A) != 0 triggers when (y % 4) != 0;
  2. (p & B) != B triggers when (y % 100) != 0;
  3. (p & C) == 0 triggers when (y % 16) == 0 (and therefore (y % 400) == 0, since we already know y is a multiple of 25).

Two easy cases: (1) and (3)

(1): The 1 bit of f in block A reproduces the two low bits of y in block A of p. This cannot be clobbered by the result of the multiplication with the bits in D: the maximum value we can get is 102499 ⋅ (f & D) = 940428325, which has only 30 bits. Thus, checking that A is zero in p is equivalent to checking whether y is 0 modulo 4.

(3): Checking that none of the lowest 4 bits is set in p is checking whether p is 0 modulo 16. However, we actually want to check y. This is no problem: it suffices to look only at the lowest 4 bits of f, which are 0111₂ = 7. Multiplying with the odd number 7 introduces no additional factor of 2, so divisibility by 16 is unchanged.
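
Both easy cases can be checked exhaustively over the valid range with a small Python sketch:

f = 1073750999
for y in range(102500):
    p = (y * f) % 2**32
    assert p >> 30 == y % 4                  # case (1): block A holds y mod 4
    assert (p % 16 == 0) == (y % 16 == 0)    # case (3): odd factor 7 preserves divisibility by 16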

The interesting case: (2)

Next, let's try to find out for which numbers p & B = B. For this, the 1 bit in f & A does not play a role, so consider the bits in f & D. They are 10001111010111₂ = 9175. Let's see for which numbers the test hits.

>>> B = 0b00000000000000011111000000000000
>>> s = [y for y in range(5000) if ((y * 9175) & B) == B]
>>> for i in range(0, len(s), 16): print(*(f'{n:4d}' for n in s[i:i+16]))
  14   57   71  100  114  157  171  200  214  257  271  300  314  357  371  400
 414  457  471  500  514  557  571  600  614  657  671  700  714  757  771  800
 814  857  871  900  914  957  971 1000 1014 1057 1071 1100 1114 1157 1171 1200
1214 1257 1271 1300 1314 1357 1371 1400 1414 1457 1471 1500 1514 1557 1571 1600
1614 1657 1671 1700 1714 1757 1771 1800 1814 1857 1871 1900 1914 1957 1971 2000
2014 2057 2071 2100 2114 2157 2171 2200 2214 2257 2271 2300 2314 2357 2371 2400
2414 2457 2471 2500 2514 2557 2571 2600 2614 2657 2671 2700 2714 2757 2771 2800
2814 2857 2871 2900 2914 2957 2971 3000 3014 3057 3071 3100 3114 3157 3171 3200
3214 3257 3271 3300 3314 3357 3371 3400 3414 3457 3471 3500 3514 3557 3571 3600
3614 3657 3671 3700 3714 3757 3771 3800 3814 3857 3871 3900 3914 3957 3971 4000
4014 4057 4071 4100 4114 4157 4200 4214 4257 4300 4314 4357 4400 4414 4457 4500
4514 4557 4600 4614 4657 4700 4714 4757 4800 4814 4857 4900 4914 4957

Multiples of 100 are here as desired, but also a bunch of other numbers. This is not a problem as long as none of them are a multiple of 4, since those were already filtered out in the previous step. Also, 0 is missing, but that's not a problem either, since 0 is also a multiple of 400.
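
Continuing the session above, we can immediately confirm that, at least in this sample, no multiple of 4 other than the multiples of 100 shows up:

>>> any(y % 4 == 0 and y % 100 != 0 for y in s)
False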

Let's try to understand the pattern. At first glance, it looks very simple: we have *14, *57, *71, and *00. However, from 4171 on, *71 drops out (did you notice?). We also have new patterns popping up later. Let's analyze this. The following Python program

def test(y):
    B = 126976
    return ((y * 9175) & B) == B

active = set()
for y in range(120000):
    r = y % 100
    if test(y):
        if r not in active:
            print(f'{y:6}: started *{r:02}')
            active.add(r)
    else:
        if r in active:
            print(f'{y:6}: stopped *{r:02}')
            active.remove(r)
outputs
    14: started *14
    57: started *57
    71: started *71
   100: started *00
  4171: stopped *71
 32843: started *43
 36914: stopped *14
 65586: started *86
 69657: stopped *57
 98329: started *29
102500: stopped *00

So, from 102500 on, we no longer catch multiples of 100, which explains why 102499 is the last number for which is_leap_year_fast returns the correct result. We also see that below that bound, none of the caught numbers is a multiple of 4 except for the multiples of 100 (conveniently, we can check that knowing only the last two decimal digits). This concludes the proof of condition (2), if we trust this brute-force enumeration; but let's continue and understand better why we get exactly these numbers.
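
If we prefer not to trust the enumeration, condition (2) can also be verified exhaustively over the valid range in a few lines of Python:

B = 126976
hits = {y for y in range(102500) if ((y * 9175) & B) == B}
# Every multiple of 100 is caught (except 0, where returning true is correct anyway)...
assert all(y in hits for y in range(100, 102500, 100))
# ...and among the caught numbers, the multiples of 4 are exactly the multiples of 100.
assert all(y % 100 == 0 for y in hits if y % 4 == 0)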

Let's look into why we're getting the multiples of 100 first. The factor 9175 is close to a multiple of 1/100 in 17-bit fixed point representation:

2¹⁷ ⋅ 7/100 = 9175.04 (exactly).

Multiplying a multiple of 100 by 9175.04 gives an integer (a multiple of 7) at bits 17 and above and 17 zero bits below that. E.g.

9175.04 ⋅ 500 = 10001100000000000000000₂, with 100011₂ = 35 = 5 ⋅ 7.

Multiplying a multiple of 100 by 9175 gives slightly less:

9175 ⋅ 500 = 10001100000000000000000₂ − 500 ⋅ 0.04 = 10001011111111111101100₂.

In general, subtracting a little bit from a number ending in a lot of zeroes creates a number that ends in a lot of ones, except at the very end. Here, we check the 5 bits in B. For y a multiple of 100, these are guaranteed to be all 1 until the accumulated error reaches the lower end of B, which happens only after y = 2¹² / 0.04 = 102400, so this fits.

Now where do the other numbers like 14, 57, and 71 come from? Let's look at it from a different angle. We have 9175 = 2¹⁷ ⋅ 0.06999969482421875 (exactly) and B = 2¹⁷ ⋅ 0.96875, so

p & B = B
⇔ {y ⋅ 0.06999969482421875} ≥ 0.96875, where {x} is the fractional part of x
⇔ 6.999969482421875 ⋅ y mod 100 ≥ 96.875

This is another way to see why multiples of 100 are accepted: For them, 7y mod 100 is 0, so 6.999969482421875y mod 100 comes out slightly below 100, and drops below 96.875 only after y = (100 − 96.875) / (7 − 6.999969482421875) = 102400.
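
This reformulation can be checked numerically as well; note that 6.999969482421875 = 229375/2¹⁵ is exactly representable as a double, so the floating-point computation below is exact for the range tested (a sketch):

B = 126976
for y in range(200000):
    bit_test = ((y * 9175) & B) == B
    frac_test = (6.999969482421875 * y) % 100 >= 96.875
    assert bit_test == frac_test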

To understand the other numbers showing up in our sequence, let's first consider what the solutions would be if we had a clean 7 in this inequality:

7y mod 100 ≥ 96.875 ⇔ 7y mod 100 ∈ {97, 98, 99}.

To find the solutions to this, we first need the modular inverse of 7 modulo 100, that is, the number x such that 7x mod 100 = 1. We can calculate it by using the extended Euclidean algorithm, or just using some online calculator, which tells us the result is 43. Then the solutions are 43 ⋅ 97 (mod 100), 43 ⋅ 98 (mod 100), and 43 ⋅ 99 (mod 100), which is 71, 14, and 57 (mod 100), respectively. This explains why we're initially seeing numbers of the form *14, *57, and *71. It also explains why we stop seeing e.g. *71 after 4071: while 7 ⋅ 4171 = 29197, we have 6.999969482421875 ⋅ 4171 = 29196.872711181640625, which (modulo 100) is less than 96.875. Similarly, 32843 pops up because the accumulated error (7 − 6.999969482421875) ⋅ 32843 = 1.002288818359375 exceeds one: for y ≡ 43 (mod 100), we have 7y mod 100 = 1, so once the error exceeds one, the value 6.999969482421875 ⋅ y mod 100 wraps around to just below 100. With a bit more effort here, we could manually reproduce the output of the Python program above and check that none of these numbers is a multiple of 4.
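
Instead of an online calculator, Python's built-in pow can compute the modular inverse directly (since Python 3.8), and with it the residues:

>>> pow(7, -1, 100)
43
>>> sorted(43 * r % 100 for r in (97, 98, 99))
[14, 57, 71]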

Extension to other bitwidths

Now that we understand how the trick works, we can try to find parameters for other bitwidths. The variable parts are the location of block B and the fraction n/100 encoded in f & D (n in the code below).

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BITS 64

// The standard check from above, extended to a 64-bit year.
static bool is_leap_year(uint64_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

// Returns the first year for which the candidate constants give a wrong result.
uint64_t score(uint64_t f, uint64_t m, uint64_t t) {
    for (uint64_t y = 0; ; y++)
        if ((((y * f) & m) <= t) != is_leap_year(y))
            return y;
}

int main() {
    uint64_t best_score = 0;
    // Try all locations of block B (bits k2 to k - 1) and all fractions n/100.
    for (int k = 0; k < BITS; k++) {
        for (int k2 = 0; k2 < k; k2++) {
            uint64_t t = (1ULL << k) - (1ULL << k2);
            uint64_t m = (0b11ULL << (BITS - 2)) | t | 0b1111;
            for (int n = 0; n < 100; n++) {
                uint64_t f = (0b01ULL << (BITS - 2)) | (((1ULL << k) * n) / 100);
                uint64_t new_score = score(f, m, t);
                if (new_score > best_score) {
                    printf("%" PRIu64 " %" PRIu64 " %" PRIu64 ": %" PRIu64 " (%d %d %d)\n",
                           f, m, t, new_score, k, k - k2, n);
                    best_score = new_score;
                }
            }
        }
    }
    return 0;
}

For BITS = 64, in about 7 minutes we find f = 4611686019114582671, m = 13835058121854156815, t = 66571993088, which is correct up to y = 5965232499. This is nice because 5965232499 > 2³², so any 32-bit year can be tested correctly with this variant.
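
Here is a partial Python check of these constants; verifying all years up to 5965232499 in pure Python takes a while, so this only samples the low range and the boundary:

f, m, t = 4611686019114582671, 13835058121854156815, 66571993088

def is_leap_year(y):
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)

def fast64(y):
    return (((y * f) % 2**64) & m) <= t

assert all(fast64(y) == is_leap_year(y) for y in range(1000000))
assert all(fast64(y) == is_leap_year(y)
           for y in range(5965232499 - 10000, 5965232500))
assert fast64(5965232500) != is_leap_year(5965232500)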

Is this the highest we can go with 64 bits? Perhaps there are other constants that work even better? I couldn't immediately find a way to prove it, so I employed the tried-and-true method of getting someone else to do it for me by posting it to Code Golf StackExchange. Indeed, after only an hour, user ovs posted a very good result, and two days later user Exalted Toast posted a proof, also employing the z3 solver, that 5965232499 is indeed the best possible range for 64 bits.

Benchmark

Getting meaningful benchmarks is tricky, since the function takes just a tiny amount of time, and moreover, for the branching versions, execution time depends on the patterns of the input. Let's try two extreme cases: always the year 2025, and completely random years. These are the results of the benchmark on an i7-8700K (Coffee Lake, 4.7 GHz), compiled with g++ -O3 -fno-tree-vectorize:

                    2025 (ns)   random (ns)
is_leap_year           0.65        2.61
is_leap_year2          0.65        2.75
is_leap_year3          0.67        0.88
is_leap_year_fast      0.69        0.69

Some weird things stand out here: for random input, is_leap_year2 comes out slightly slower than plain is_leap_year, even though the compiler should generate essentially the same code for both. I have no explanation for this other than that benchmarking is hard.

The new function is_leap_year_fast is 3.8 times faster than the standard implementation for random data, and about 6% slower for the perfectly predictable input. Overall, that seems pretty solid.

Conclusion

So, is this really worthwhile? Should we replace e.g. the CPython datetime implementation with this trick? Well, it depends. Most years being queried in practice will be the current year, or at least fairly predictable, in which case we don't have a great advantage. To fully justify a change, ideally we would have a benchmark with realistic data that employs a leap year check as a subroutine, rather than just microbenchmarking. I would be happy to hear about any such results!