I have some code like:
import math, csv, sys, re, time, datetime, pickle, os, gzip
from numpy import *
x = [1, 2, 3, ... ]
y = sum(x)
The sum of the actual values in x is 2165496761, which is larger than the maximum value of a 32-bit integer. The reported y value is -2129470535, implying integer overflow.
Why did this happen? I thought the built-in sum was supposed to use Python's arbitrary-size integers?
See How to restore a builtin that I overwrote by accident? if you've accidentally done something like this at the REPL (interpreter prompt).
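(The short version of that fix, at the interactive prompt, is simply to delete the name that shadows the builtin so name lookup falls back to it; a quick sketch:)
>>> from numpy import *
>>> sum is __builtins__.sum
False
>>> del sum            # remove the global that shadows the builtin
>>> sum is __builtins__.sum
True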
Doing from numpy import * causes the built-in sum function to be shadowed by numpy.sum:
>>> sum(xrange(10**7))
49999995000000L
>>> from numpy import sum
>>> sum(xrange(10**7)) # assuming a 32-bit platform
-2014260032
To verify that numpy.sum is in use, check the type of the result:
>>> sum([721832253, 721832254, 721832254])
-2129470535
>>> type(sum([721832253, 721832254, 721832254]))
<type 'numpy.int32'>
To avoid this problem, don't use a star import.
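With a namespaced import the built-in stays in place and numpy's version is only used when you ask for it explicitly; a quick sketch:
>>> import numpy as np                          # numpy's names stay behind the np. prefix
>>> sum([721832253, 721832254, 721832254])      # the built-in sum: arbitrary precision
2165496761
>>> np.sum([721832253, 721832254, 721832254])   # numpy's sum, called explicitly when wanted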
If you must use numpy.sum and want an arbitrary-sized integer result, specify a dtype for the result like so:
>>> sum([721832253, 721832254, 721832254], dtype=object)
2165496761L
or refer to the builtin sum explicitly (possibly giving it a more convenient binding):
>>> __builtins__.sum([721832253, 721832254, 721832254])
2165496761L
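Note that __builtins__ behaves differently at the interactive prompt and inside a module, so a more portable spelling is to import the builtins module explicitly; for instance:
>>> import builtins                      # Python 3 (on Python 2: import __builtin__ as builtins)
>>> py_sum = builtins.sum                # a convenient binding to the real built-in
>>> py_sum([721832253, 721832254, 721832254])
2165496761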
The reason you get this invalid value is that np.sum is operating on int32 values. Nothing forces you to stay with np.int32; you can ask for a wider dtype such as np.int64 when summing. You could for example just use
np.asarray(x).sum(dtype=np.int64)
On a side note, please make sure you never use from numpy import *. It's a bad habit worth dropping as soon as possible: with from ... import * you can silently overwrite Python built-ins, which makes bugs very hard to track down. Shadowing functions like sum or max is the typical example.
Python handles large numbers with arbitrary precision:
>>> sum([721832253, 721832254, 721832254])
2165496761
Just sum them up!
To make sure you don't use numpy.sum, try __builtins__.sum() instead.
Related
I know what random.seed(int) does, like below:
random.seed(10)
But I saw some code which uses random.seed([list of int]), like below:
random.seed([1, 2, 1000])
What is the difference between passing a list and an int to random.seed?
The answer is basically in the comments, but putting it together: it appears the code you found imports random from numpy, instead of importing the standard Python random module:
from numpy import random
random.seed([1, 2, 1000])
This is not recommended, precisely to avoid the confusion you're running into.
numpy can use a 1-d array of integers as a seed, presumably because it uses a different pseudo-random generator than Python itself and that generator accepts a more complex seed, as described in the documentation for numpy.random.RandomState.
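A minimal sketch contrasting the two modules (each seed form is only valid for its own module):
>>> import random
>>> random.seed(10)                  # stdlib random: an int seed is fine
>>> random.seed([1, 2, 1000])        # stdlib random: raises TypeError, a list is not a usable seed
>>> import numpy.random
>>> numpy.random.seed([1, 2, 1000])  # numpy.random: a 1-d sequence of ints is an accepted seed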
Request
I was wondering if it's possible to take the default jsons.dump behavior and make it idempotent (return the input string) for Python IP addresses.
This would let me use an object while in Python and use the same string in all serializations and deserializations. That way, when we load the serialized JSON, we don't need different control paths for the first program that loads the data and for the N subsequent programs that load it.
Current Behavior
>>> import ipaddress
>>> import jsons
>>> ipaddress.IPv4Address("192.0.0.1")
IPv4Address('192.0.0.1')
>>> jsons.dump(ipaddress.IPv4Address("192.0.0.1"))
{'_ip': 3221225473}
>>> jsons.load(jsons.dump(ipaddress.IPv4Address("192.0.0.1")))
{'_ip': 3221225473}
Desired Behavior
>>> jsons.load(jsons.dump(ipaddress.IPv4Address("192.0.0.1")))
"192.0.0.1"
Desired but Probably Asking too Much
>>> jsons.load(jsons.dump(ipaddress.IPv4Address("192.0.0.1")))
IPv4Address('192.0.0.1')
Current workaround
I've changed the __repr__ method to do the type conversion to a string for now. But this means I have to call jsons.dump(repr(<variable>)), and it means other developers working with my code have a potential landmine they need to be aware of.
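One possible route, assuming the jsons library's set_serializer / set_deserializer hooks for registering per-type converters (a sketch, not tested here; the lambdas are illustrative only):
import ipaddress
import jsons

# Register converters so IPv4Address serializes to a plain dotted-quad string
# and loads back into an IPv4Address object.
jsons.set_serializer(lambda obj, **kwargs: str(obj), ipaddress.IPv4Address)
jsons.set_deserializer(lambda obj, cls, **kwargs: ipaddress.IPv4Address(obj), ipaddress.IPv4Address)

jsons.dump(ipaddress.IPv4Address("192.0.0.1"))   # '192.0.0.1'
jsons.load('192.0.0.1', ipaddress.IPv4Address)   # IPv4Address('192.0.0.1')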
Is it possible, and if so, how: can I make scipy use numpy.float128 by default? For example:
>>> from scipy.stats import norm
>>> type(norm.pdf(10, 10, 1))
<class 'numpy.float64'>
and I want it to be
>>> from scipy.stats import norm
>>> type(norm.pdf(10, 10, 1))
<class 'numpy.float128'>
If not, I will need to implement the norm.pdf function myself, which is easy but does not solve my problem.
In general, you can't. Some of the scipy routines are wrappers around code written in C or Fortran that is only available in double precision. Even if you figure out which ones are pure Python + NumPy, and manage to ensure that every operation in the computation preserves the data type, you'll find that many of the functions use hardcoded 64-bit constants such as numpy.pi.
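If the goal is just extra precision in the pdf itself, a hand-rolled version kept in np.longdouble (exposed as numpy.float128 on platforms where the C long double is wider than 64 bits) is a reasonable sketch, with the caveat that np.pi itself is still only a 64-bit constant:
import numpy as np

def norm_pdf_longdouble(x, loc=0.0, scale=1.0):
    # Hand-rolled normal pdf kept entirely in extended precision.
    x, loc, scale = (np.longdouble(v) for v in (x, loc, scale))
    z = (x - loc) / scale
    return np.exp(-z * z / 2) / (scale * np.sqrt(2 * np.longdouble(np.pi)))

type(norm_pdf_longdouble(10, 10, 1))   # <class 'numpy.float128'> where the platform provides it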
For a very large integer range, xrange should be used in Python 2; it is renamed to range in Python 3. I assumed the six module could provide a consistent way of writing this.
But I found six.moves.builtins.range returns a list in Python 2 and an iterable non-list object in Python 3, just as the name range does.
Also, six.moves.builtins.xrange does not exist in Python 2.
Was I using the wrong function in six? Or does six simply not provide a solution for the range and xrange functions?
I know I can test sys.version[0] and rename the function accordingly. I was just looking for a "Don't Repeat Yourself" solution.
APPEND
As mentioned by mgilson:
>>> import six
>>> six.moves.range
AttributeError: '_MovedItems' object has no attribute 'range'
Is it something related to the version of six, or is there no such thing as six.moves.range?
I believe you just want six.moves.range, not six.moves.builtins.range.
>>> # tested on python2.x..
>>> import six.moves as sm
>>> sm.range
<type 'xrange'>
The reason is that six.moves.builtins is the version-agnostic "builtins" module. It just gives you access to the builtins -- it doesn't actually change what any of the builtins are.
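A quick illustration with a reasonably recent six (older releases lacked six.moves.range, which would explain the AttributeError in the question):
>>> import six
>>> six.moves.builtins.range is range   # just the interpreter's own builtin, unchanged
True
>>> six.moves.range                     # the renamed object: xrange on Python 2, range on Python 3
<type 'xrange'>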
Typically, I don't feel the need to introduce the external dependency in cases like this. I usually just add something like this to the top of my source file:
try:
    xrange
except NameError:  # python3
    xrange = range
Is there a way to make Python floating point numbers follow numpy's rules regarding +/- Inf and NaN? For instance, making 1.0/0.0 = Inf.
>>> from numpy import *
>>> ones(1)/0
array([ Inf])
>>> 1.0/0.0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: float division
Numpy's divide function gives divide(1.0, 0.0) = inf; however, it is not clear whether it can be enabled globally, in the way from __future__ import division changes the division operator.
You should have a look at how Sage does it. IIRC they wrap the Python REPL in their own preprocessor.
I tried to do something similar, and I never figured out how to do it nicely. But, I can tell you a few things I tried, that didn't work:
Setting float = numpy.float -- Python still uses the old float.
Trying to replace float.__div__ with a user-defined function -- "TypeError: can't set attributes of built-in/extension type 'float'". Python also doesn't like you mucking with the __dict__ of built-in objects.
I decided to go in and change the actual cpython source code to have it do what I wanted, which is obviously not practical, but it worked.
I think the reason why something like this is not possible is that float/int/list are implemented in C in the background, and their behavior cannot be changed cleanly from inside the language.
You could wrap all your floats in numpy.float64, which is the numpy float type.
a = float64(1.)
a/0 # Inf
In fact, you only need to wrap the float on the left-hand side of each arithmetic operation.
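A rough end-to-end sketch (numpy signals these cases with a RuntimeWarning rather than an exception, which you can silence or tune with np.errstate):
import numpy as np

a = np.float64(1.0)
a / 0.0                  # inf  (IEEE semantics; numpy may emit a RuntimeWarning)
np.float64(0.0) / 0.0    # nan
(a / 0.0) - (a / 0.0)    # nan -- once the left operand is a numpy scalar, results stay numpy scalars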