Related: this is the Python version of the same C++ question.
Given a number, num, what is the fastest way to strip off the trailing zeros from its binary representation?
For example, let num = 232. We have bin(num) equal to 0b11101000 and we would like to strip the trailing zeros, which would produce 0b11101. This can be done via string manipulation, but it'd probably be faster via bit manipulation. So far, I have thought of something using num & -num.
Assuming num != 0, num & -num produces the binary 0b1<trailing zeros>. For example (showing -num as its two's complement, truncated to 8 bits),

 num   0b11101000
-num   0b00011000
   &   0b00001000
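A quick check in the REPL confirms that num & -num isolates the lowest set bit:

>>> num = 232
>>> bin(num & -num)
'0b1000'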
If we have a dict having powers of two as keys and the powers as values, we could use that to know by how much to right bit shift num in order to strip just the trailing zeros:
#          0b1  0b10  0b100  0b1000
POW2s = {1: 0, 2: 1, 4: 2, 8: 3, ...}

def stripTrailingZeros(num):
    pow2 = num & -num
    pow_ = POW2s[pow2]  # equivalent to math.log2(pow2), but hopefully faster
    return num >> pow_
The use of dictionary POW2s trades space for speed - the alternative is to use math.log2(pow2).
Is there a faster way?
Perhaps another useful tidbit is num ^ (num - 1), which produces 0b1!<trailing zeros> where !<trailing zeros> means take the trailing zeros and flip them into ones. For example,

  num  0b11101000
num-1  0b11100111
    ^  0b00001111
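Again easy to verify interactively:

>>> num = 232
>>> bin(num ^ (num - 1))
'0b1111'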
Yet another alternative is to use a while loop:

def stripTrailingZeros_iterative(num):
    while num & 0b1 == 0:  # equivalent to `num % 2 == 0`
        num >>= 1
    return num
Ultimately, I need to execute this function on a big list of numbers. Once I do that, I want the maximum. So if I have [64, 38, 22, 20] to begin with, I would have [1, 19, 11, 5] after performing the stripping. Then I would want the maximum of that, which is 19.
There's really no answer to questions like this in the absence of specifying the expected distribution of inputs. For example, if all inputs are in range(256), you can't beat a single indexed lookup into a precomputed list of the 256 possible cases.
If inputs can be two bytes, but you don't want to burn the space for 2**16 precomputed results, it's hard to beat (assuming that_table[i] gives the count of trailing zeroes in i):
low = i & 0xff
result = that_table[low] if low else 8 + that_table[i >> 8]
And so on.
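For concreteness, here is one way the 256-entry table might be built. This helper is an assumption on my part; the answer only stipulates that that_table[i] gives the count of trailing zeros in i:

# Hypothetical construction of the 256-entry lookup table.
# that_table[0] is never consulted by the lookup above (the `if low`
# branch falls through to the high byte), so any filler works there.
that_table = [0] * 256
for i in range(1, 256):
    that_table[i] = (i & -i).bit_length() - 1  # count of trailing zeros in i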
You do not want to rely on log2(). The accuracy of that is entirely up to the C library on the platform CPython is compiled for.
What I actually use, in a context where ints can be up to hundreds of millions of bits:

assert d
if d & 1 == 0:
    ntz = (d & -d).bit_length() - 1
    d >>= ntz
A while loop would be a disaster in this context, taking time quadratic in the number of bits shifted off. Even one needless shift in that context would be a significant expense, which is why the code above first checks to see that at least one bit needs to be shifted off. But if ints "are much smaller", that check would probably cost more than it saves. "No answer in the absence of specifying the expected distribution of inputs."
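Packaged as a function for reuse (a minimal sketch; the function name is mine, the body is exactly the snippet above):

def strip_trailing_zeros(d):
    # Hypothetical wrapper around the snippet above.
    assert d
    if d & 1 == 0:
        d >>= (d & -d).bit_length() - 1
    return d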
On my computer, a simple integer divide is fastest:

>>> import timeit
>>> timeit.timeit(setup='num = 232', stmt='num // (num & -num)')
0.1088077999993402
>>> timeit.timeit(setup='d = {1: 0, 2: 1, 4: 2, 8: 3, 16: 4, 32: 5}; num = 232', stmt='num >> d[num & -num]')
0.13014470000052825
>>> timeit.timeit(setup='import math; num = 232', stmt='num >> int(math.log2(num & -num))')
0.2980690999993385
You say that ultimately you "execute this function on a big list of numbers to get odd numbers and find the maximum of said odd numbers."
So why not simply:
from random import randint
numbers = [randint(0, 10000) for _ in range(5000)]
odd_numbers = [n for n in numbers if n & 1]
max_odd = max(odd_numbers)
print(max_odd)
To do what you say you ultimately want, there seems to be little point in performing the "shift right until the result is odd" operation, unless you want the maximum of that operation applied to all elements, which is not what you stated.
I agree with @TimPeters' answer, but if you put Python through its paces and actually generate some data sets and try the various solutions proposed, they keep their relative ranking regardless of integer size when using Python ints, so your best option is integer division for numbers up to 32 bits; after that, see the chart below:
from pandas import DataFrame
from timeit import timeit
import math
from random import randint

def reduce0(ns):
    return [n // (n & -n)
            for n in ns]

def reduce1(ns, d):
    return [n >> d[n & -n]
            for n in ns]

def reduce2(ns):
    return [n >> int(math.log2(n & -n))
            for n in ns]

def reduce3(ns, t):
    return [n >> t.index(n & -n)
            for n in ns]

def reduce4(ns):
    return [n if n & 1 else n >> ((n & -n).bit_length() - 1)
            for n in ns]

def single5(n):
    while (n & 0xffffffff) == 0:
        n >>= 32
    if (n & 0xffff) == 0:
        n >>= 16
    if (n & 0xff) == 0:
        n >>= 8
    if (n & 0xf) == 0:
        n >>= 4
    if (n & 0x3) == 0:
        n >>= 2
    if (n & 0x1) == 0:
        n >>= 1
    return n

def reduce5(ns):
    return [single5(n)
            for n in ns]

numbers = [randint(1, 2 ** 16 - 1) for _ in range(5000)]
d = {2 ** n: n for n in range(16)}
t = tuple(2 ** n for n in range(16))
assert(reduce0(numbers) == reduce1(numbers, d) == reduce2(numbers) == reduce3(numbers, t) == reduce4(numbers) == reduce5(numbers))

df = DataFrame([{}, {}, {}, {}, {}, {}])
for p in range(1, 16):
    p = 2 ** p
    numbers = [randint(1, 2 ** p - 1) for _ in range(4096)]
    d = {2 ** n: n for n in range(p)}
    t = tuple(2 ** n for n in range(p))
    df[p] = [
        timeit(lambda: reduce0(numbers), number=100),
        timeit(lambda: reduce1(numbers, d), number=100),
        timeit(lambda: reduce2(numbers), number=100),
        timeit(lambda: reduce3(numbers, t), number=100),
        timeit(lambda: reduce4(numbers), number=100),
        timeit(lambda: reduce5(numbers), number=100)
    ]
    print(f'Complete for {p} bit numbers.')

print(df)
df.to_csv('test_results.csv')
Result (when plotted in Excel): [chart omitted]
Note that the plot that was previously here was wrong! The code and data were not, though. The code has been updated to include @MarkRansom's solution, since it turns out to be the optimal solution for very large numbers (over 4k-bit numbers).
while (num & 0xffffffff) == 0:
    num >>= 32
if (num & 0xffff) == 0:
    num >>= 16
if (num & 0xff) == 0:
    num >>= 8
if (num & 0xf) == 0:
    num >>= 4
if (num & 0x3) == 0:
    num >>= 2
if (num & 0x1) == 0:
    num >>= 1
The idea here is to perform as few shifts as possible. The initial while loop handles numbers that are over 32 bits long, which I consider unlikely, but it has to be provided for completeness. After that, each statement shifts half as many bits; if you can't shift by 16, the most you could shift by is 15, which is 8+4+2+1. All possible cases are covered by those 5 if statements.
Is there any way to make this code run faster?
g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
print(g**a%n)
It runs for too long; I want to make it faster.
I also tried:
import math
g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
print(math.pow(g**a)%n)
and
g = 53710316114328094
a = 995443176435632644
n = 926093738455418579
def power(a, b):
    ans = 1
    for i in range(b):
        ans *= a
    return ans

print(power(g, a) % n)
All of this code runs for a very long time.
First of all, you need to know about the binary exponentiation algorithm. The idea is that instead of computing e.g. 5^46 as 5*5*5*5... 46 times, you can do
5^46 == 5^2 * 5^4 * 5^8 * 5^32
The key here, is that you can compute 5^2 fast from 5 (just square it), then 5^4 fast from 5^2 (just square it), then 5^8 from 5^4 (just square it) and so on. To determine which 5^K numbers you should multiply and which not, you can represent the power as a binary number, and multiply to the final result only those components, that correspond to 1 in this binary representation. E.g.
decimal 46 == binary 101110
Thus
5^1 is skipped (corresponds to the rightmost 0), 5^2 is multiplied (the rightmost 1), 5^4 is multiplied (second 1 from the right), 5^8 is multiplied (third 1 from the right), 5^16 is skipped (the leftmost 0) and 5^32 is multiplied (the leftmost 1).
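This decomposition is easy to verify directly:

>>> bin(46)
'0b101110'
>>> 5**46 == 5**2 * 5**4 * 5**8 * 5**32
True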
Next, you need to compute a very huge power, it's impractically big. But there is a shortcut, since you use modulo operation.
You see, there's a rule that
(a*b % n) == ( (a % n)*(b % n) ) % n
So these should be equivalent
5^46 % n == ( ( ( (5^2 % n) * (5^4 % n) % n) * (5^8 % n) % n) * (5^32 % n) % n)
Notice that each number we multiply never exceeds n, so the overall multiplication chain will not take forever: n is big, but not even remotely as gigantic as g**a.
In code, all of that looks like this (and it computes instantly):

def pow_modulo_n(base, power, n):
    result = 1
    multiplier = base
    while power > 0:
        power, binary_digit = divmod(power, 2)
        if binary_digit == 1:
            result = (result * multiplier) % n
        multiplier = (multiplier**2) % n
    return result % n

g = 53710316114328094
a = 995443176435632644
n = 926093738455418579

print(pow_modulo_n(g, a, n))
This prints
434839845697636246
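For what it's worth, Python's built-in pow already implements modular exponentiation when given three arguments, so the same computation can be written in one line:

print(pow(g, a, n))  # same result via the built-in three-argument pow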
Examples:

Input = 4
Output = 111

Explanation:
1 = 1³ (divisors of 1)
2 = 1³ + 2³ (divisors of 2)
3 = 1³ + 3³ (divisors of 3)
4 = 1³ + 2³ + 4³ (divisors of 4)
------------------------
sum = 111 (output)

Input = 5
Output = 237

Explanation:
1 = 1³ (divisors of 1)
2 = 1³ + 2³ (divisors of 2)
3 = 1³ + 3³ (divisors of 3)
4 = 1³ + 2³ + 4³ (divisors of 4)
5 = 1³ + 5³ (divisors of 5)
-----------------------------
sum = 237 (output)
x = int(raw_input().strip())
tot = 0
for i in range(1, x+1):
    for j in range(1, i+1):
        if i % j == 0:
            tot += j**3
print tot
Using this code I can find the answer for small numbers, less than one million. But I want to find the answer for very large numbers. Is there an algorithm to solve this efficiently for large numbers?
Offhand I don't see a slick way to make this truly efficient, but it's easy to make it a whole lot faster. If you view your examples as matrices, you're summing them a row at a time. This requires, for each i, finding all the divisors of i and summing their cubes. In all, this requires a number of operations proportional to x**2.
You can easily cut that to a number of operations proportional to x, by summing the matrix by columns instead. Given an integer j, how many integers in 1..x are divisible by j? That's easy: there are x//j multiples of j in the range, so divisor j contributes j**3 * (x // j) to the grand total.
def better(x):
    return sum(j**3 * (x // j) for j in range(1, x+1))
That runs much faster, but still takes time proportional to x.
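A quick sanity check against the examples in the question:

>>> better(4)
111
>>> better(5)
237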
There are lower-level tricks you can play to speed that in turn by constant factors, but they still take O(x) time overall. For example, note that x // j == 1 for all j such that x // 2 < j <= x. So about half the terms in the sum can be skipped, replaced by closed-form expressions for a sum of consecutive cubes:
def sum3(x):
    """Return sum(i**3 for i in range(1, x+1))."""
    return (x * (x+1) // 2)**2

def better2(x):
    result = sum(j**3 * (x // j) for j in range(1, x//2 + 1))
    result += sum3(x) - sum3(x//2)
    return result
better2() is about twice as fast as better(), but to get faster than O(x) would require deeper insight.
Quicker
Thinking about this in spare moments, I still don't have a truly clever idea. But the last idea I gave can be carried to a logical conclusion: don't just group together divisors with only one multiple in range, but also those with two multiples in range, and three, and four, and ... That leads to better3() below, which does a number of operations roughly proportional to the square root of x:
def better3(x):
    result = 0
    for i in range(1, x+1):
        q1 = x // i
        # value i has q1 multiples in range
        result += i**3 * q1
        # which values have i multiples?
        q2 = x // (i+1) + 1
        assert x // q1 == i == x // q2
        if i < q2:
            result += i * (sum3(q1) - sum3(q2 - 1))
        if i+1 >= q2:  # this becomes true when i reaches roughly sqrt(x)
            break
    return result
Of course O(sqrt(x)) is an enormous improvement over the original O(x**2), but for very large arguments it's still impractical. For example better3(10**6) appears to complete instantly, but better3(10**12) takes a few seconds, and better3(10**16) is time for a coffee break ;-)
Note: I'm using Python 3. If you're using Python 2, use xrange() instead of range().
One more
better4() has the same O(sqrt(x)) time behavior as better3(), but does the summations in a different order that allows for simpler code and fewer calls to sum3(). For "large" arguments, it's about 50% faster than better3() on my box.
def better4(x):
    result = 0
    for i in range(1, x+1):
        d = x // i
        if d >= i:
            # d is the largest divisor that appears `i` times, and
            # all divisors less than `d` also appear at least that
            # often. Account for one occurrence of each.
            result += sum3(d)
        else:
            i -= 1
            lastd = x // i
            # We already accounted for i occurrences of all divisors
            # < lastd, and all occurrences of divisors >= lastd.
            # Account for the rest.
            result += sum(j**3 * (x // j - i)
                          for j in range(1, lastd))
            break
    return result
It may be possible to do better by extending the algorithm in "A Successive Approximation Algorithm for Computing the Divisor Summatory Function". That takes O(cube_root(x)) time for the possibly simpler problem of summing the number of divisors. But it's much more involved, and I don't care enough about this problem to pursue it myself ;-)
Subtlety
There's a subtlety in the math that's easy to miss, so I'll spell it out, but only as it pertains to better4().
After d = x // i, the comment claims that d is the largest divisor that appears i times. But is that true? The actual number of times d appears is x // d, which we did not compute. How do we know that x // d in fact equals i?
That's the purpose of the if d >= i: guarding that comment. After d = x // i we know that
x == d*i + r
for some integer r satisfying 0 <= r < i. That's essentially what floor division means. But since d >= i is also known (that's what the if test ensures), it must also be the case that 0 <= r < d. And that's how we know x // d is i.
This can break down when d >= i is not true, which is why a different method needs to be used then. For example, if x == 500 and i == 51, d (x // i) is 9, but it's certainly not the case that 9 is the largest divisor that appears 51 times. In fact, 9 appears 500 // 9 == 55 times. While for positive real numbers
d == x/i
if and only if
i == x/d
that's not always so for floor division. But, as above, the first does imply the second if we also know that d >= i.
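The counterexample above is easy to check interactively:

>>> x, i = 500, 51
>>> d = x // i
>>> d
9
>>> x // d  # not 51: floor division does not invert here
55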
Just for Fun
better5() rewrites better4() for about another 10% speed gain. The real pedagogical point is to show that it's easy to compute all the loop limits in advance. Part of the point of the odd code structure above is that it magically returns 0 for a 0 input without needing to test for that. better5() gives up on that:
def isqrt(n):
    """Return floor(sqrt(n)) for int n > 0."""
    g = 1 << ((n.bit_length() + 1) >> 1)
    d = n // g
    while d < g:
        g = (d + g) >> 1
        d = n // g
    return g

def better5(x):
    assert x > 0
    u = isqrt(x)
    v = x // u
    return (sum(map(sum3, (x // d for d in range(1, u+1)))) +
            sum(x // i * i**3 for i in range(1, v)) -
            u * sum3(v-1))
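Aside: on Python 3.8 and later, the hand-rolled isqrt() above can be replaced by the standard library's math.isqrt, which also returns floor(sqrt(n)):

from math import isqrt  # Python 3.8+
assert isqrt(10**16) == 10**8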
def sum_divisors(n):
    total = 0
    for i in range(1, n):
        if n % i == 0:
            total += i
    # Return the sum of all divisors of n, not including n.
    return total

print(sum_divisors(0))
# 0
print(sum_divisors(3))    # Should be 1
# 1
print(sum_divisors(36))   # Should be sum of 1+2+3+4+6+9+12+18
# 55
print(sum_divisors(102))  # Should be sum of 1+2+3+6+17+34+51
# 114
You are given two 32-bit numbers, N and M, and two bit positions, i and j. Write a method to set all bits between i and j in N equal to M (e.g., M becomes a substring of N located at i and starting at j).

EXAMPLE:
Input: N = 10000000000, M = 10101, i = 2, j = 6
Output: N = 10001010100
This problem is from Cracking the Coding interview. I was able to solve it using the following O(j - i) algorithm:
def set_bits(a, b, i, j):
    if not b: return a
    while i <= j:
        if b & 1 == 1:
            last_bit = (b & 1) << i
            a |= last_bit
        else:
            set_bit = ~(1 << i)
            a &= set_bit
        b >>= 1
        i += 1
    return a
The author gave this O(1) algorithm as a solution:
def update_bits(n, m, i, j):
    max = ~0  # All 1s
    # 1s through position j, then zeroes
    left = max - ((1 << j) - 1)
    # 1s after position i
    right = ((1 << i) - 1)
    # 1s, with 0s between i and j
    mask = left | right
    # Clear i through j, then put m in there
    return (n & mask) | (m << i)
I noticed that for some test cases the author's algorithm seems to output the wrong answer. For example, for N = 488, M = 5, i = 2, j = 6 it outputs 468, when the output should be 404, as my O(j - i) algorithm produces.
Question: Is there a way to get a constant time algorithm which works for all cases?
I think the author of the algorithm assumes the bound j (six in your example) to be exclusive; this boils down to the question of whether a range from 2 to 6 should include 6 (in Python, it does not). In other words, if the algorithm is modified to:
def update_bits(n, m, i, j):
    max = ~0  # All 1s
    # 1s through position j, then zeroes
    left = max - ((1 << (j+1)) - 1)
    # 1s after position i
    right = ((1 << i) - 1)
    # 1s, with 0s between i and j
    mask = left | right
    # Clear i through j, then put m in there
    return (n & mask) | (m << i)
It works.
Nevertheless you can speed up things a bit as follows:
def update_bits(n, m, i, j):
    # 1s through position j, then zeroes
    left = (~0) << (j+1)
    # 1s after position i
    right = ((1 << i) - 1)
    # 1s, with 0s between i and j
    mask = left | right
    # Clear i through j, then put m in there
    return (n & mask) | (m << i)
In this example, we simply shift the ones out of the register.
Note that you made an error in your own algorithm: in case b = 0, that does not mean you can simply return a, since for that range the bits should be cleared. Say a = 0b1011001111101111 and b = 0b0, and i and j are 6 and 8 respectively; one expects the result to be 0b1011001000101111. The algorithm thus should be:
def set_bits(a, b, i, j):
    while i <= j:
        if b & 1 == 1:
            last_bit = (b & 1) << i
            a |= last_bit
        else:
            set_bit = ~(1 << i)
            a &= set_bit
        b >>= 1
        i += 1
    return a
If I make this modification and test the program with 10,000,000 random inputs, both algorithms always produce the same result:

from random import randint

for i in range(10000000):
    m = randint(0, 65536)
    i = randint(0, 15)
    j = randint(i, 16)
    n = randint(0, 2**(j-i))
    if set_bits(m, n, i, j) != update_bits(m, n, i, j):
        # This line is never printed.
        print((bin(m), bin(n), i, j, bin(set_bits(m, n, i, j)), bin(update_bits(m, n, i, j))))
Of course this is not a proof that both algorithms are equivalent (perhaps there is a tiny corner case where they differ), but I'm quite confident that for valid input (i and j positive, i < j, etc.) both should always produce the same result.
I think there is one mistake in the proposed solution.
It should be:
def update_bits(n, m, i, j):
    max = ~0  # All 1s
    # 1s through position j + 1, then zeroes
    left = max - ((1 << (j + 1)) - 1)
    # 1s after position i
    right = ((1 << i) - 1)
    # 1s, with 0s between i and j
    mask = left | right
    # Clear i through j, then put m in there
    return (n & mask) | (m << i)
Because the problem says to populate from j down to i inclusive, we need to clear bit j as well. The result is then 404, as expected.
To go a little bit further, in case m has more than (j - i + 1) bits, we need to change the return statement:
return (n & mask) | ((m << i) & ~mask)
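A quick illustration of why the extra & ~mask matters (the values here are hypothetical): with n = 0, m = 0b111111, i = 2, j = 4, the slot is only 3 bits wide, so an unmasked m << i spills set bits above position j:

n, m, i, j = 0, 0b111111, 2, 4
mask = (~0 - ((1 << (j + 1)) - 1)) | ((1 << i) - 1)  # 0s between i and j
print(bin((n & mask) | (m << i)))            # 0b11111100: bits above j corrupted
print(bin((n & mask) | ((m << i) & ~mask)))  # 0b11100: only bits i..j written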
1. Create mask m which has set bits for all bits between <i, j>.
   You can use arithmetic shift left to create powers of 2, exploiting that a power of 2 minus one is a number with all bits set up to exponent-1. So set all bits <0, j> and then clear the bits up to i-1.
2. Copy bits from M to N.
   Use m to clear the bits in N and then copy the M bits in their place. Do not forget to shift M left by i to match your case...
In C++ (sorry, I do not use Python) an O(1) version looks like this:
DWORD bitcopy(DWORD N, DWORD M, int i, int j)
{
    DWORD m;
    // set bits <0..j>
    m = (2 << j) - 1;
    // clear bits <0..i)
    if (i) m ^= (2 << (i - 1)) - 1;
    // clear space for the copied bits
    N &= 0xFFFFFFFF - m;
    // copy bits M -> N
    N |= (M << i) & m;
    return N;
}
You can also use a LUT for the i, j bit parts of m instead... as you've got 32-bit numbers it needs just 32 or 64 entries, if you are not comfortable with the bit shifts...
This version seems to work well too, provided i <= j
def set_bits(n, m, i, j):
    mask = (1 << (j + 1)) - (1 << i)
    return n & ~mask | (m << i) & mask
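Checking it against the failing case from the question:

>>> set_bits(488, 5, 2, 6)
404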
It's easy; you can implement it on your own once you have an idea of what to do...
Here's a 32-bit example:
Given:

n = 10000000000000000000000000000000
m = 10101
i = 2, j = 6

Step 1: Create the mask ->

int x = ~0;                      // All ones: 11111111111111111111111111111111
int left = (1 << i) - 1;         // 11
int right = x - ((1 << j) - 1);  // 11111111111111111111111111000000
int mask = left | right;         // 11111111111111111111111111000011

Step 2: Clear the bits between i and j in the given number n ->

int cleared = n & mask;  // 10000000000000000000000000000000

Step 3: Put m into n between i and j (the cleared bits) ->

int ans = cleared | (m << i);  // 10000000000000000000000001010100
This is one of the algorithms given in Wikipedia for generating prime numbers:
def eratosthenes_sieve(n):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(n) need be checked.
    candidates = [i for i in range(n + 1)]
    fin = int(n ** 0.5)

    # Loop over the candidates, marking out each multiple.
    for i in range(2, fin + 1):
        if not candidates[i]:
            continue
        candidates[i + i::i] = [None] * (n // i - 1)

    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]
I changed the algorithm slightly.
def eratosthenes_sieve(n):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(n) need be checked.
    candidates = [i for i in range(n + 1)]
    fin = int(n ** 0.5)

    # Loop over the candidates, marking out each multiple.
    candidates[4::2] = [None] * (n // 2 - 1)
    for i in range(3, fin + 1, 2):
        if not candidates[i]:
            continue
        candidates[i + i::i] = [None] * (n // i - 1)

    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]
I first marked off all the multiples of 2, and then I considered odd numbers only. When I timed both algorithms (I tried 40,000,000) the first one was always better (albeit very slightly). I don't understand why. Can somebody please explain?
P.S.: When I try 100,000,000, my computer freezes. Why is that? I have a Core Duo E8500, 4 GB RAM, Windows 7 Pro 64-bit.
Update 1: This is Python 3.
Update 2: This is how I timed:
start = time.time()
a = eratosthenes_sieve(40000000)
end = time.time()
print(end - start)
UPDATE: Following valuable comments (especially by nightcracker and Winston Ewert), I managed to code what I intended in the first place:
def eratosthenes_sieve(n):
    # Create a candidate list within which non-primes will be
    # marked as None; only c below sqrt(n) need be checked.
    c = [i for i in range(3, n + 1, 2)]
    fin = int(n ** 0.5) // 2

    # Loop over the c, marking out each multiple.
    for i in range(fin):
        if not c[i]:
            continue
        c[c[i] + i::c[i]] = [None] * ((n // c[i]) - (n // (2 * c[i])) - 1)

    # Filter out non-primes and return the list.
    return [2] + [i for i in c if i]
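A quick check on a small input:

>>> eratosthenes_sieve(30)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]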
This algorithm improves the original algorithm (mentioned at the top) by (usually) 50%. (Still, worse than the algorithm mentioned by nightcracker, naturally).
A question to Python Masters: Is there a more Pythonic way to express this last code, in a more "functional" way?
UPDATE 2: I still couldn't decode the algorithm mentioned by nightcracker. I guess I'm too stupid.
The question is, why would it even be faster? In both examples you are filtering multiples of two the hard way. It doesn't matter whether you hardcode candidates[4::2] = [None] * (n // 2 - 1) or let it be executed in the first iteration of for i in range(2, fin + 1):.
If you are interested in an optimized sieve of Eratosthenes, here you go:
def primesbelow(N):
    # https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    # """ Input N >= 6, returns a list of primes, 2 <= p < N """
    correction = N % 6 > 1
    N = (N, N-1, N+4, N+3, N+2, N+1)[N%6]
    sieve = [True] * (N // 3)
    sieve[0] = False
    for i in range(int(N ** .5) // 3 + 1):
        if sieve[i]:
            k = (3 * i + 1) | 1
            sieve[k*k // 3::2*k] = [False] * ((N//6 - (k*k)//6 - 1)//k + 1)
            sieve[(k*k + 4*k - 2*k*(i%2)) // 3::2*k] = [False] * ((N // 6 - (k*k + 4*k - 2*k*(i%2))//6 - 1) // k + 1)
    return [2, 3] + [(3 * i + 1) | 1 for i in range(1, N//3 - correction) if sieve[i]]
Explanation here: Porting optimized Sieve of Eratosthenes from Python to C++
The original source is here, but there was no explanation. In short this primesieve skips multiples of 2 and 3 and uses a few hacks to make use of fast Python assignment.
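For example, using the snippet as-is:

>>> primesbelow(30)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]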
You do not save a lot of time avoiding the evens. Most of the computation time within the algorithm is spent doing this:
candidates[i + i::i] = [None] * (n // i - 1)
That line causes a lot of action on the part of the computer. Whenever the number in question is even, this is not run as the loop bails on the if statement. The time spent running the loop for even numbers is thus really really small. So eliminating those even rounds does not produce a significant change in the timing of the loop. That's why your method isn't considerably faster.
When Python produces numbers for range it uses a formula: start + index * step. Multiplying by two, as in your case, is going to be slightly more expensive than multiplying by one, as in the original case.
There is also quite possibly a small overhead to having a longer function.
Neither of those is a really significant speed issue, but they outweigh the very small benefit your version brings.
It's probably slightly slower because you are performing extra setup to do something that was done in the first case anyway (marking off multiples of two). That setup time might be what you see, if the difference is as slight as you say.
Your extra step is unnecessary, and it traverses the whole collection of n elements once doing that "get rid of evens" operation, rather than just operating on the roughly n^(1/2) outer-loop candidates.