Optimizing bigint calls - python

I am currently using the book 'Programming in D' for learning D. I tried to solve a problem of summing up the squares of numbers from 1 to 10000000. I first made a functional approach to solve the problem with the map and reduce but as the numbers get bigger I have to cast the numbers to bigint to get the correct output.
long num = 10000001;
BigInt result;
result = iota(1,num).map!(a => to!BigInt(a * a)).reduce!((a,b) => (a + b));
writeln("The sum is : ", result);
The above takes 7 s to finish when compiled with dmd -O. I profiled the program and most of the time is spent in BigInt calls. Although the square of each number fits into a long, I have to convert it to BigInt so that reduce accumulates and returns the correct sum. The Python program takes only 3 seconds. With num = 100000000 the D program takes 1 minute and 13 seconds. Is there a way to optimize the calls to BigInt? I tried pushing the squares of the numbers into a BigInt array, but that is also slower. I also tried converting all the numbers to BigInt first:
auto bigs_map_nums = iota(1,num).map!(a => to!BigInt(a)).array;
auto bigs_map = sum(bigs_map_nums.map!(a => (a * a)).array);
But it's also slower. I read the answers at How to optimize this short factorial function in scala? (Creating 50000 BigInts). Is the implementation of multiplication for bigger integers a problem in D too? Is there a way to optimize the function calls to BigInt?
python code :
timeit.timeit('print sum(map(lambda num : num * num, range(1,10000000)))',number=1)
333333283333335000000
3.58552622795105
The code was executed on a dual-core 64 bit linux laptop with 2 GB RAM.
python : 2.7.4
dmd : DMD64 D Compiler v2.066.1

Without range coolness: foreach(x; 0 .. num) result += x * x;
With range cool(?)ness:
import std.functional: reverseArgs;
result = iota(1, num)
.map!(a => a * a)
.reverseArgs!(reduce!((a, b) => a + b))(BigInt(0) /* seed */);
The key is to avoid BigInting every element, of course.
The range version is a little slower than the non-range one. Both are significantly faster than the python version.
Edit: Oh! Oh! It can be made much more pleasant with std.algorithm.sum:
result = iota(1, num)
.map!(a => a * a)
.sum(BigInt(0));
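As a cross-check on the expected total, here is a small Python sketch. The closed form n(n+1)(2n+1)/6 is a standard identity and is not part of the answers above; the function names are made up for illustration.

```python
def sum_of_squares_loop(limit):
    """Sum i*i for i in 1 .. limit-1, mirroring range(1, limit) / iota(1, limit)."""
    return sum(i * i for i in range(1, limit))


def sum_of_squares_closed_form(limit):
    """Same sum via the identity 1^2 + ... + n^2 = n(n+1)(2n+1)/6, with n = limit - 1."""
    n = limit - 1
    # The product of three such factors is always divisible by 6,
    # so floor division is exact here.
    return n * (n + 1) * (2 * n + 1) // 6
```

With limit = 10_000_000 the closed form agrees with the 333333283333335000000 printed by the Python snippet in the question, which makes it a cheap way to validate any optimized version.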

The python code is not equivalent to the D code, in fact it does a lot less.
Python 2 starts with an int, which CPython stores as a native C long (at least 32 bits wide), and promotes it to the arbitrary-precision long type only when a result no longer fits. Until that overflow happens, ordinary CPU instructions can be used for the multiplication, and they are much faster than BigInt multiplication.
D's BigInt implementation treats the numbers as BigInt from the start and uses the expensive multiplication operation from 1 until the end. Much more work to be done there.
It's interesting how complicated the multiplication can be when we talk about BigInts.
The D implementation is
https://github.com/D-Programming-Language/phobos/blob/v2.066.1/std/internal/math/biguintcore.d#L1246
Python starts by doing
static PyObject *
int_mul(PyObject *v, PyObject *w)
{
long a, b;
long longprod; /* a*b in native long arithmetic */
double doubled_longprod; /* (double)longprod */
double doubleprod; /* (double)a * (double)b */
CONVERT_TO_LONG(v, a);
CONVERT_TO_LONG(w, b);
/* casts in the next line avoid undefined behaviour on overflow */
longprod = (long)((unsigned long)a * b);
... //check if we have overflowed
{
const double diff = doubled_longprod - doubleprod;
const double absdiff = diff >= 0.0 ? diff : -diff;
const double absprod = doubleprod >= 0.0 ? doubleprod :
-doubleprod;
/* absdiff/absprod <= 1/32 iff
32 * absdiff <= absprod -- 5 good bits is "close enough" */
if (32.0 * absdiff <= absprod)
return PyInt_FromLong(longprod);
else
return PyLong_Type.tp_as_number->nb_multiply(v, w);
}
}
and if the product is bigger than what a long can hold, it falls back to arbitrary-precision long multiplication, which uses Karatsuba multiplication for sufficiently large operands. Implementation in:
http://svn.python.org/projects/python/trunk/Objects/longobject.c (k_mul function)
The equivalent code would wait to use BigInts until there are no native data types that can hold the number in question.
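The "multiply natively, fall back only on overflow" idea can be sketched in Python as follows. This is a loose emulation under stated assumptions, not CPython's actual code: the 64-bit width, the function names, and the exact-equality shortcut are all choices made for this illustration. Only the 1/32 tolerance is taken from the C snippet above.

```python
def checked_native_mul(a, b, bits=64):
    """Multiply as a wrapping signed `bits`-wide integer and report overflow.

    Returns (wrapped_product, overflowed). Inputs are assumed to fit in `bits`.
    """
    mask = (1 << bits) - 1
    wrapped = (a * b) & mask            # the low bits a native multiply would keep
    if wrapped >= 1 << (bits - 1):      # reinterpret the bit pattern as signed
        wrapped -= 1 << bits
    approx = float(a) * float(b)        # double-precision estimate of the true product
    if float(wrapped) == approx:
        return wrapped, False
    diff = abs(float(wrapped) - approx)
    # Same "5 good bits" tolerance as the C snippet: accept if within 1/32.
    return wrapped, not (32.0 * diff <= abs(approx))


def multiply_with_fallback(a, b):
    """Use the native-width result when it is valid, else the big-integer product."""
    wrapped, overflowed = checked_native_mul(a, b)
    return a * b if overflowed else wrapped
```

The double-precision product is never used as the result; it only serves as a coarse estimate that tells a correct native product apart from a wrapped one.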

DMD's backend does not emit highly optimized code. For fast programs, compile with GDC or LDC.
On my computer, I get these timings:
Python: 3.01
dmd -O -inline -release: 3.92
ldmd2 -O -inline -release: 2.14

Related

Problem in handling large number in Python

I was solving a problem on codeforces:- Here is the Question
I wrote python code to solve the same:-
n=int(input())
print(0 if ((n*(n+1))/2)%2==0 else 1)
But it failed for the test-case: 1999999997 See Submission-[TestCase-6]
Why it failed despite Python can handle large numbers effectively ? [See this Thread]
Also the similar logic worked flawlessly when I coded it in CPP [See Submission Here]:-
#include<bits/stdc++.h>
using namespace std;
int main(){
int n;
cin>>n;
long long int sum=1ll*(n*(n+1))/2;
if(sum%2==0) cout<<0;
else cout<<1;
return 0;
}
Ran a test based on the insight from @juanpa.arrivillaga and this has been a great rabbit hole:
n = 1999999997
temp = n * (n+1)
# type(temp) is int, temp is 3999999990000000006. We can clearly see that after dividing by 2 we should get an odd number, and therefore output 1
divided = temp / 2
# type(divided) is float. Printing divided for me gives 1.999999995e+18
# divided % 2 is 0.0
divided_int = temp // 2
# type(divided_int) is int. Printing divided_int for me gives 1999999995000000003
The // operator forces integer (floor) division and, for int operands, always returns an int: 7 // 2 is equal to 3, not 3.5.
As per the other answer you have linked, the int type in python can handle very large numbers.
Float can also handle large numbers, but a float (an IEEE 754 double) carries only 53 bits of precision, so it cannot represent every integer above 2^53 (about 9.0e15) exactly. The crux of it is that not all values can be captured accurately: in many scenarios the difference between 1.999999995e+18 and 1.999999995000000003e+18 is so minute it won't matter, but this is a scenario where it does, as you care a lot about the final digit of the number.
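The effect can be reproduced with a few lines of Python 3 (assumed here because the question uses input() and print() as functions). The variable names are illustrative only.

```python
n = 1999999997
temp = n * (n + 1)           # exact: Python ints have arbitrary precision

as_float = temp / 2          # true division always produces a float
as_int = temp // 2           # floor division keeps the result an int

# The float cannot hold all 19 significant digits, so the low digits are lost.
float_is_exact = int(as_float) == as_int

parity_from_float = as_float % 2   # parity of the rounded value
parity_from_int = as_int % 2       # parity of the exact value

# Integers above 2**53 are not all representable as doubles.
exceeds_exact_range = as_int > 2 ** 53
```

Only parity_from_int reflects the true last digit; the float has been rounded to a nearby even value before the modulo is taken.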
You can learn more about this by watching this video
As mentioned by @juanpa.arrivillaga and @DarrylG in the comments, I should have used the floor division operator // for integer division; the anomaly was caused by the float division performed by the / operator.
So, the correct code should be:-
n=int(input())
print(0 if (n*(n+1)//2)%2==0 else 1)

For loop computing recurrence relation takes very long

Q(x)=[Q(x−1)+Q(x−2)]^2
Q(0)=0, Q(1)=1
I need to find Q(29). I wrote a code in python but it is taking too long. How to get the output (any language would be fine)?
Here is the code I wrote:
a=0
b=1
for i in range(28):
    c=(a+b)*(a+b)
    a=b
    b=c
print(b)
I don't think this is a tractable problem with programming. The reason why your code is slow is that the numbers grow very rapidly, and Python uses arbitrary-precision integers, so it takes its time computing the result.
Try your code with double-precision floats:
a=0.0
b=1.0
for i in range(28):
    c=(a+b)*(a+b)
    a=b
    b=c
print(b)
The answer is inf. This is because the answer is much, much larger than the largest representable double-precision number, which is roughly 10^308. You could try using fixed-width integers, but those have an even smaller representable maximum. Note that using doubles will lead to loss of precision, but surely you don't want to know every single digit of your huuuge number (side note: I happen to know that you do, making your job even harder).
So here's some math background for my skepticism: Your recurrence relation goes
Q[k] = (Q[k-2] + Q[k-1])^2
You can formulate a more tractable sequence from the square root of this sequence:
P[k] = sqrt(Q[k])
P[k] = P[k-2]^2 + P[k-1]^2
If you can solve for P, you'll know Q = P^2.
Now, consider this sequence:
R[k] = R[k-1]^2
Starting from the same initial values, this will always be smaller than P[k], since
P[k] = P[k-2]^2 + P[k-1]^2 >= P[k-1]^2
(but this will be a "pretty close" lower bound as the first term will always be insignificant compared to the second). We can construct this sequence:
R[k] = R[k-1]^2 = R[k-2]^4 = R[k-3]^8 = R[k-m]^(2^m) = R[0]^(2^k)
Since P[1 give or take] starts with value 2, we should consider
R[k] = 2^(2^k)
as a lower bound for P[k], give or take a few exponents of 2. For k=28 this is
P[28] > 2^(2^28) = 2^(268435456) = 10^(log10(2)*2^28) ~ 10^80807124
That's at least 80807124 digits for the final value of P, which is the square root of the number you're looking for. That makes Q[28] larger than 10^1.6e8. If you printed that number into a text file, it would take more than 150 megabytes.
If you imagine you're trying to handle these integers exactly, you'll see why it takes so long, and why you should reconsider your approach. What if you could compute that huge number? What would you do with it? How long would it take python to print that number on your screen? None of this is trivial, so I suggest that you try to solve your problem on paper, or find a way around it.
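The growth estimate above can be checked numerically for small indices. The sketch below is only an illustration: the helper name and the cut-off at Q(12) are arbitrary choices, and it relies on math.log10 accepting integers that are too large to convert to a float.

```python
import math


def q_sequence(last_index):
    """Return [Q(0), ..., Q(last_index)] for Q(x) = (Q(x-1) + Q(x-2))**2."""
    values = [0, 1]
    while len(values) <= last_index:
        values.append((values[-1] + values[-2]) ** 2)
    return values[: last_index + 1]


qs = q_sequence(12)

# log10 of each term from Q(2) on; Q(0) = 0 has no logarithm.
log_digits = [math.log10(q) for q in qs[2:]]

# Ratio of successive logarithms, starting at Q(5)/Q(4) to avoid log10(1) = 0.
# It approaches 2, i.e. the number of digits doubles with every step.
ratios = [later / earlier for earlier, later in zip(log_digits[2:], log_digits[3:])]

# The crude lower bound from the text: number of digits of 2^(2^28).
bound_digits = int(math.log10(2) * 2 ** 28)
```

Because the digit count doubles per step, each extra term costs a multiplication of numbers twice as long as the previous one, which is where the running time goes.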
Note that you can use a symbolic math package such as sympy in python to get a feeling of how hard your problem is:
import sympy as sym
a,b,c,b0 = sym.symbols('a,b,c,b0')
a = 0
b = b0
for k in range(28):
    c = (a+b)**2
    a = b
    b = c
print(c)
This will take a while, but it will fill your screen with the explicit expression for Q[k] with only b0 as parameter. You would "only" have to substitute your values into that monster to obtain the exact result. You could also try sym.simplify on the expression, but I couldn't wait for that to return anything meaningful.
During lunch time I let your loop run, and it finished. The result has
>>> import math
>>> print(math.log10(c))
49287457.71120789
So my lower bound for k=28 is a bit large, probably due to off-by-one errors in the exponent. The memory needed to store this integer is
>>> import sys
>>> sys.getsizeof(c)
21830612
that is roughly 20 MB.
This can be solved with brute force but it is still an interesting problem since it uses two different "slow" operations and there are trade-offs in choosing the correct approach.
There are two places where the native Python implementation of algorithm is slow: the multiplication of large numbers and the conversion of large numbers to a string.
Python uses the Karatsuba algorithm for multiplication. It has a running time of O(n^1.585) where n is the length of the numbers. It does get slower as the numbers get larger but you can compute Q(29).
The algorithm for converting a Python integer to its decimal representation is much slower. It has running time of O(n^2). For large numbers, it is much slower than multiplication.
Note: the times for conversion to a string also include the actual calculation time.
On my computer, computing Q(25) requires ~2.5 seconds but conversion to a string requires ~3 minutes 9 seconds. Computing Q(26) requires ~7.5 seconds but conversion to a string requires ~12 minutes 36 seconds. As the size of the number doubles, multiplication time increases by a factor of 3 and the running time of string conversion increases by a factor of 4. The running time of the conversion to string dominates. Computing Q(29) takes about 3 minutes and 20 seconds but conversion to a string will take more than 12 hours (I didn't actually wait that long).
One option is the gmpy2 module, which provides access to the very fast GMP library. With gmpy2, Q(26) can be calculated in ~0.2 seconds and converted into a string in ~1.2 seconds. Q(29) can be calculated in ~1.7 seconds and converted into a string in ~15 seconds. Multiplication of very large numbers in GMP is roughly O(n*ln(n)). Conversion to decimal is faster than Python's O(n^2) algorithm but still slower than multiplication.
The fastest option is Python's decimal module. Instead of using a radix-2, or binary, internal representation, it uses a radix-10 (actually a power of 10) internal representation. Calculations are slightly slower but conversion to a string is very fast; it is just O(n). Calculating Q(29) requires ~9.2 seconds but calculating and conversion together only requires ~9.5 seconds. The time for conversion to string is only ~0.3 seconds.
Here is an example program using decimal. It also sums the individual digits of the final value.
import decimal
decimal.getcontext().prec = 200000000
decimal.getcontext().Emax = 200000000
decimal.getcontext().Emin = -200000000
def sum_of_digits(x):
    return sum(map(int, (t for t in str(x))))
a = decimal.Decimal(0)
b = decimal.Decimal(1)
for i in range(28):
    c = (a + b) * (a + b)
    a = b
    b = c
    temp = str(b)
    print(i, len(temp), sum_of_digits(temp))
I didn't include the time for converting the millions of digits into strings and adding them in the discussion above. That time should be the same for each version.
This WILL take too long, since the sequence grows extremely fast: each term is roughly the square of the previous one, so the number of digits about doubles at every step.
Example:
a=0
b=1
c=1*1 = 1
a=1
b=1
c=2*2 = 4
a=1
b=4
c=5*5 = 25
a=4
b=25
c= 29*29 = 841
a=25
b=841
.
.
.
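You can check if c%10==0 and then divide it, and in the end multiply it back the number of times you divided it, but in the end it'll be the same large number. If you really need to do this calculation, try using C++; it should run faster than Python.
Here's your code written in C++ (note that a 64-bit long long overflows after only a few iterations, since Q(8) already exceeds 10^23, so this version cannot produce the exact value without a big-integer library):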
#include <cstdlib>
#include <iostream>
using namespace std;
int main(int argc, char *argv[])
{
long long int a=0;
long long int b=1;
long long int c=0;
for(int i=0;i<28;i++){
c=(a+b)*(a+b);
a=b;
b=c;
}
cout << c;
return 0;
}

Long numbers in C++

I have this program in Python:
# ...
print 2 ** (int(input())-1) % 1000000007
The problem is that this program takes a long time on big numbers. I rewrote my code in C++, but sometimes I get a wrong answer. For example, for the number 12345678 the Python code gives 749037894, which is correct, but the C++ code gives -291172004.
This is the C++ code:
#include <iostream>
#include <cmath>
using namespace std;
const int MOD = 1e9 + 7;
int main() {
// ...
long long x;
cin >> x;
long long a =pow(2, (x-1));
cout << a % MOD;
}
As already mentioned, your problem is that for large exponent you have integer overflow.
To overcome this, remember that modular multiplication has such property that:
(A * B) mod C = ((A mod C) * (B mod C)) mod C
And then you can implement 'e to the power p modulo m' function using fast exponentiation scheme.
Assuming no negative powers:
long long powmod(long long e, long long p, long long m){
    if (p == 0){
        return 1;
    }
    long long a = 1;
    while (p > 1){
        if (p % 2 == 0){
            e = (e * e) % m;
            p /= 2;
        } else{
            a = (a * e) % m;
            e = (e * e) % m;
            p = (p - 1) / 2;
        }
    }
    return (a * e) % m;
}
Note that remainder is taken after every multiplication, so no overflow can occur, if single multiplication doesn't overflow (and that's true for 1000000007 as m and long long).
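For comparison, the same square-and-multiply scheme can be sketched in Python. The function below is an illustrative port, not the code from the answer above. Python's built-in three-argument pow(base, exp, mod) already does modular exponentiation, so in practice the hand-written version is only useful for understanding the algorithm.

```python
def powmod(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by repeated squaring.

    The remainder is taken after every multiplication, so intermediate
    values never grow beyond modulus**2.
    """
    if exponent < 0:
        raise ValueError("negative exponents are not handled in this sketch")
    result = 1 % modulus          # handles modulus == 1 correctly
    base %= modulus
    while exponent > 0:
        if exponent & 1:          # current lowest bit of the exponent is set
            result = (result * base) % modulus
        base = (base * base) % modulus
        exponent >>= 1
    return result


MOD = 1000000007
x = 12345678
answer = powmod(2, x - 1, MOD)    # same value as pow(2, x - 1, MOD)
```

The original Python one-liner was slow only because 2 ** (x - 1) builds the full multi-million-bit number before reducing it; pow(2, x - 1, MOD) avoids that.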
You seem to be dealing with positive numbers and those are overflowing the number of bits you've allocated for their storage. Also keep in mind that there is a difference between Python and C/C++ in the way a modulo on a negative value is computed. To get a similar computation, you will need to add the Modulo to the value so it's positive before you take the modulo which is the way it works in Python:
cout << (a+MOD) % MOD;
You may have to add MOD n times until the temporary value is positive before taking its modulo; equivalently, ((a % MOD) + MOD) % MOD always gives a non-negative result.
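The difference between the two remainder conventions can be illustrated with a short Python sketch. The helper names are hypothetical, and the emulation of the C/C++ behaviour assumes a positive modulus.

```python
def c_style_remainder(a, m):
    """Emulate C/C++ `a % m` for m > 0: the result takes the sign of `a`."""
    r = abs(a) % m
    return -r if a < 0 else r


def normalized_mod(a, m):
    """Map a C-style remainder into the range 0 .. m-1, as Python's % does."""
    return (c_style_remainder(a, m) + m) % m
```

For example, c_style_remainder(-7, 3) is -1 while Python's -7 % 3 is 2. Note that this normalization only changes how a negative value is reported; it does not repair a value that is already wrong because of overflow.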
Like has been mentioned by many of the other answers, your problem lies in integer overflow.
You can do like suggested by deniss and implement your own modmul() and modpow() functions.
If, however, this is part of a project that will need to do plenty of calculations with very large numbers, I would suggest using a "big number library" like GNU GMP or mbedTLS Bignum library.
In C++ the various fundamental types have fixed sizes. For example, a long long is typically 64 bits wide, but the width varies with system type and other factors. As suggested above, you can check <climits> for your particular environment's limits.
Raising 2 to the power 12345677 would involve shifting the binary number 10 left by 12345676 bits which wouldn't fit in a 64 bit long long (and I suspect is unlikely to fit in most long long implementations).
Another factor to consider is that pow returns a double (or long double) depending on the overload used. You don't say what compiler you are using but most likely you got a warning about possible truncation or data loss when the result of calling pow is assigned to the long long variable a.
Finally, even if pow is returning a long double, I suspect the result of raising 2 to the power 12345677 is too large to be represented in a long double, so pow is probably returning positive infinity, which then gets converted to some bit pattern that will fit in a long long. You could certainly check that by introducing an intermediate long double variable to receive the value of pow, which you could then examine in a debugger.
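The floating-point limit can be inspected from Python, whose float is a C double on common platforms. This is only a sketch: the helper name is invented, long double is not modelled, and Python raises OverflowError where C's pow would return infinity, so the helper maps the exception to math.inf to mimic the C behaviour.

```python
import math
import sys

# Largest exponent e such that 2.0 ** (e - 1) is still a finite double.
max_exp = sys.float_info.max_exp


def float_power_of_two(exponent):
    """Return 2.0 ** exponent, or math.inf where the result overflows a double."""
    try:
        return 2.0 ** exponent
    except OverflowError:
        return math.inf


largest_finite = float_power_of_two(max_exp - 1)
overflowed = float_power_of_two(12345677)
```

Any exponent at or above max_exp overflows, so 12345677 is far beyond what the floating-point result can hold.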

Translate a hashing algorithm from C to Python

My client is a Python programmer and I have created a C++ backend for him which includes license generation and checking. For additional safety, the Python front-end will also perform a validity check of the license.
The license generation and checking algorithm however is based on hashing methods which rely on the fact that an integer is of a fixed byte size and bit-shifting a value will not extend the integers byte count.
This is a simplified example code:
unsigned int HashString(const char* str) {
unsigned int hash = 3151;
while (*str != 0) {
hash = (hash << 3) + (*str << 2) * 3;
str++;
}
return hash;
}
How can this be translated to Python? The direct translation obviously yields a different result:
def hash_string(str):
    hash = 3151
    for c in str:
        hash = (hash << 3) + (ord(c) << 2) * 3
    return hash
For instance:
hash_string("foo bar spam") # 228667414299004
HashString("foo bar spam") // 3355459964
Edit: The same would also be necessary for PHP since the online shop should be able to generate valid licenses, too.
Mask the hash value with &:
def hash_string(str, _width=2**32-1):
    hash = 3151
    for c in str:
        hash = ((hash << 3) + (ord(c) << 2) * 3)
    return hash & _width
This manually cuts the hash back to size. You only need to limit the result once; it's not as if those higher bits make a difference for the final result.
Demo:
>>> hash_string("foo bar spam")
3355459964
The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.
The easiest fix is just to correct at the end:
return hash % (1 << 32)
For very large strings, it may be a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of 1 will easily outweigh the cost of dealing with a 48-bit int.
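Both variants, masking once at the end and masking after every step, can be written side by side. This sketch assumes a 32-bit unsigned int on the C side and ASCII input (for bytes of 128 and above, a signed char in C would behave differently); the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF  # keeps the low 32 bits, like a 32-bit unsigned int


def hash_string_mask_once(text):
    """Let the Python int grow freely and truncate only the final value."""
    value = 3151
    for ch in text:
        value = (value << 3) + (ord(ch) << 2) * 3
    return value & MASK32


def hash_string_mask_each_step(text):
    """Truncate after every iteration so the int never exceeds 32 bits."""
    value = 3151
    for ch in text:
        value = ((value << 3) + (ord(ch) << 2) * 3) & MASK32
    return value
```

The two give identical results because shifts and additions only ever carry information from lower bits to higher bits, so discarding the high bits early cannot change the low 32.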
PHP may have the same problem, or a different one.
PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)
But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.
Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.

Trouble porting a function in Python to C using the Python C API

I have a checksum function in Python:
def checksum(data):
    a = b = 0
    l = len(data)
    for i in range(l):
        a += ord(data[i])
        b += (l - i)*ord(data[i])
    return (b << 16) | a, a, b
that I am trying to port to a C module for speed. Here's the C function:
static PyObject *
checksum(PyObject *self, PyObject *args)
{
int i, length;
unsigned long long a = 0, b = 0;
unsigned long long checksum = 0;
char *data;
if (!PyArg_ParseTuple(args, "s#", &data, &length)) {
return NULL;
}
for (i = 0; i < length; i++) {
a += (int)data[i];
b += (length - i) * (int)data[i];
}
checksum = (b << 16) | a;
return Py_BuildValue("(Kii)", checksum, (int)a, (int)b);
}
I use it by opening a file and feeding it a 4096 block of data. They both return the same values for small strings, but when I feed it binary data straight from a file, the C version returns wildly different values. Any help would be appreciated.
I would guess that you have some kind of overflow in your local variables. Probably b gets too large. Just dump the values for debugging purposes and you should see if that's the problem. You mention that you are porting the method for performance reasons: have you checked psyco? It might be fast enough and much easier. There are other tools that compile parts of Python code to C on the fly, but I don't have the names in my head.
I'd suggest that the original checksum function is "incorrect". The value returned for checksum is of unlimited size (for any given size in MB, you could construct an input for which the checksum will be at least of this size). If my calculations are correct, the value can fit in 64 bits for inputs of less than 260 MB, and b can fit in an integer for anything less than 4096 bytes. Now, I might be off with the number, but it means that for larger inputs the two functions are guaranteed to work differently.
To translate the first function to C, you'd need to keep a and b in Python integers, and to perform the last calculation as a Python expression. This can be improved, though:
You could use C long long variables to store an intermediate sum and add it to the Python integers after a certain number of iterations. If the number of iterations is n, the maximum value for a is n * 255, and for b is len(data) * n * 255. Try to keep those under 2**63-1 when storing them in C long long variables.
You can use long long instead of unsigned long long, and raise a RuntimeError every time it gets negative in debug mode.
Another solution would be to limit the Python equivalent to 64 bits by using a & 0xffffffffffffffff and b & 0xffffffffffffffff.
The best solution would be to use another kind of checksum, like binascii.crc32.
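The 64-bit-limited variant can be sketched in Python 3 as below, where iterating over a bytes object yields integers directly, so ord is not needed. The signed_char flag is an addition for illustration and is not raised in the answers above: it models the assumption that char is signed on the C side, in which case (int)data[i] is negative for bytes of 128 and above, another possible reason for differing results on binary data. The final (int) casts in Py_BuildValue are not modelled.

```python
import binascii

MASK64 = 0xFFFFFFFFFFFFFFFF  # width of unsigned long long on common platforms


def checksum_unbounded(data):
    """The original Python checksum, with integers of unlimited size."""
    a = b = 0
    length = len(data)
    for i, byte in enumerate(data):
        a += byte
        b += (length - i) * byte
    return (b << 16) | a, a, b


def checksum_c_like(data, signed_char=False):
    """Emulate the C version: 64-bit wrap-around, optionally with signed char."""
    a = b = 0
    length = len(data)
    for i, byte in enumerate(data):
        # Hypothetical signed-char behaviour: bytes >= 128 become negative.
        value = byte - 256 if (signed_char and byte >= 128) else byte
        a = (a + value) & MASK64
        b = (b + (length - i) * value) & MASK64
    return ((b << 16) | a) & MASK64, a, b


def crc32_checksum(data):
    """The suggested alternative: a standard 32-bit CRC."""
    return binascii.crc32(data) & 0xFFFFFFFF
```

For short ASCII input the two checksum functions agree; with signed_char=True they diverge as soon as the data contains a byte of 128 or above, which matches the symptom of small strings working and binary files failing.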
