Question
Suppose we are given a numpy array arr of doubles and a small positive integer n. I am looking for an efficient way to set the n least significant bits of the mantissa of each element of arr to 0 or to 1. Is there a ufunc for that? If not, are there suitable C functions that I could apply to the elements from Cython?
Motivation
Below I will provide the motivation for the question. If you find that an answer to the question above is not needed to fulfill the end goal, I am happy to receive respective comments. I will then create a separate question in order to keep things sorted.
The motivation for this question is to implement a version of np.unique(arr, True) that accepts a relative tolerance parameter. The second argument of np.unique matters here: I need the indices of the unique elements (first occurrence!) in the original array. Whether the elements end up sorted is not important.
I am aware of questions and solutions on np.unique with tolerance. However, I have not found a solution that also returns the indices of the first occurrences of the unique elements in the original array. Furthermore, the solutions I have seen are based on sorting, which runs in O(arr.size log(arr.size)), whereas an expected O(arr.size) solution is possible with a hash map.
The idea is to round each element of arr both up and down and to put both rounded values into a hash map. If either value is already in the hash map, the element is ignored; otherwise it is included in the result. As insertion and lookup run in constant average time for hash maps, this method should in theory be faster than a sorting-based one.
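In plain Python, the idea would look roughly like this (just a sketch to illustrate the logic, not the fast version):
def unique_tol_sketch(lower, higher):
    # lower[i]/higher[i] are the element rounded down/up, as in the call further below
    seen = set()
    vals, indices = [], []
    for i, (lo, hi) in enumerate(zip(lower, higher)):
        if lo not in seen and hi not in seen:
            vals.append(lo)
            indices.append(i)
            # remember both rounded values so near-duplicates are skipped later
            seen.add(lo)
            seen.add(hi)
    return vals, indices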
Below find my Cython implementation:
import numpy as np
cimport numpy as np
cimport cython
from libcpp.unordered_map cimport unordered_map

ctypedef np.float64_t DOUBLE_t
ctypedef np.int64_t INT_t

@cython.boundscheck(False)
@cython.wraparound(False)
def unique_tol(np.ndarray[DOUBLE_t, ndim=1] lower,
               np.ndarray[DOUBLE_t, ndim=1] higher):
    cdef long i, count
    cdef long endIndex = lower.size
    cdef unordered_map[double, short] vals = unordered_map[double, short]()
    cdef np.ndarray[DOUBLE_t, ndim=1] result_vals = np.empty_like(lower)
    cdef np.ndarray[INT_t, ndim=1] result_indices = np.empty_like(lower,
                                                                  dtype=np.int64)
    count = 0
    for i in range(endIndex):
        if not vals.count(lower[i]) and not vals.count(higher[i]):
            # insert in result
            result_vals[count] = lower[i]
            result_indices[count] = i
            # put the lower and the higher value in the hash map
            vals[lower[i]]
            vals[higher[i]]
            # update the index in the result
            count += 1
    return result_vals[:count], result_indices[:count]
This method, called with appropriately rounded inputs, does the job. For example, if differences smaller than 10^-6 shall be ignored, we would write
unique_tol(np.round(a, 6), np.round(a+1e-6, 6))
Now I would like to replace np.round with a relative rounding procedure based on manipulation of the mantissa. I am aware of alternative ways of relative rounding, but I think manipulating the mantissa directly should be more efficient and elegant. (Admittedly, I do not think that the performance gain is significant. But I would be interested in the solution.)
EDIT
The solution by Warren Weckesser works like a charm. However, the result is not applicable in the way I was hoping for, since two numbers with a very small difference can have different exponents; unifying their mantissas then does not map them to similar numbers. I guess I have to stick with the relative rounding solutions that are out there.
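For illustration, here is a minimal example of the issue with two adjacent doubles around 1.0:
import numpy as np

a = np.array([np.nextafter(1.0, 0.0), 1.0])  # the two values differ by one ulp
print(np.frexp(a)[1])                        # -> [0 1]: different exponents
u = a.view(np.uint64)
u &= ~np.uint64(2**21 - 1)                   # clear the 21 lowest mantissa bits
print(a[0] == a[1])                          # -> False: still two distinct values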
"I am looking for an efficient way to set the n least significant entries of each element of arr to 0 or to 1."
You can create a view of the array with data type numpy.uint64, and then manipulate the bits in that view as needed.
For example, I'll set the lowest 21 bits in the mantissa of this array to 0.
In [46]: np.set_printoptions(precision=15)
In [47]: x = np.array([0.0, -1/3, 1/5, -1/7, np.pi, 6.02214076e23])
In [48]: x
Out[48]:
array([ 0.000000000000000e+00, -3.333333333333333e-01,
2.000000000000000e-01, -1.428571428571428e-01,
3.141592653589793e+00, 6.022140760000000e+23])
Create a view of the data in x with data type numpy.uint64:
In [49]: u = x.view(np.uint64)
Take a look at the binary representation of the values.
In [50]: [np.binary_repr(t, width=64) for t in u]
Out[50]:
['0000000000000000000000000000000000000000000000000000000000000000',
'1011111111010101010101010101010101010101010101010101010101010101',
'0011111111001001100110011001100110011001100110011001100110011010',
'1011111111000010010010010010010010010010010010010010010010010010',
'0100000000001001001000011111101101010100010001000010110100011000',
'0100010011011111111000011000010111001010010101111100010100010111']
Set the lower n bits to 0, and take another look.
In [51]: n = 21
In [52]: u &= ~np.uint64(2**n-1)
In [53]: [np.binary_repr(t, width=64) for t in u]
Out[53]:
['0000000000000000000000000000000000000000000000000000000000000000',
'1011111111010101010101010101010101010101010000000000000000000000',
'0011111111001001100110011001100110011001100000000000000000000000',
'1011111111000010010010010010010010010010010000000000000000000000',
'0100000000001001001000011111101101010100010000000000000000000000',
'0100010011011111111000011000010111001010010000000000000000000000']
Because u is a view of the same data as in x, x has also been modified in-place.
In [54]: x
Out[54]:
array([ 0.000000000000000e+00, -3.333333332557231e-01,
1.999999999534339e-01, -1.428571428405121e-01,
3.141592653468251e+00, 6.022140758954589e+23])
Similar to @WarrenWeckesser's answer, but without the black magic, using "official" ufuncs instead. Downside: I'm pretty sure it's slower, quite possibly significantly so:
>>> a = np.random.normal(size=10)**5
>>> a
array([ 9.87664561e-12, -1.79654870e-03, 4.36740261e-01, 7.49256141e+00,
-8.76894617e-01, 2.93850753e+00, -1.44149959e-02, -1.03026094e-03,
3.18390143e-03, 3.05521581e-03])
>>>
>>> mant,expn = np.frexp(a)
>>> mant
array([ 0.67871792, -0.91983293, 0.87348052, 0.93657018, -0.87689462,
0.73462688, -0.92255974, -0.5274936 , 0.81507877, 0.78213525])
>>> expn
array([-36, -9, -1, 3, 0, 2, -6, -9, -8, -8], dtype=int32)
>>> a_binned = np.ldexp(np.round(mant,5),expn)
>>> a_binned
array([ 9.87667590e-12, -1.79654297e-03, 4.36740000e-01, 7.49256000e+00,
-8.76890000e-01, 2.93852000e+00, -1.44150000e-02, -1.03025391e-03,
3.18390625e-03, 3.05523437e-03])
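If you want to plug this binning into the question's unique_tol, one way to build the two rounded arrays might be the following sketch (my own untested helper, mirroring the absolute-rounding call np.round(a, 6) / np.round(a + 1e-6, 6) from the question):
import numpy as np

def relative_round_pair(a, digits=5):
    # round the mantissa returned by frexp down/up by one unit in the last kept digit
    mant, expn = np.frexp(a)
    lower = np.ldexp(np.round(mant, digits), expn)
    higher = np.ldexp(np.round(mant + 10.0**(-digits), digits), expn)
    return lower, higher

# lower, higher = relative_round_pair(a)
# unique_vals, first_indices = unique_tol(lower, higher)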
Related
I have a 2 * N integer array ids representing intervals, where N is about a million. It looks like this
0 2 1 ...
3 4 3 ...
The ints in the arrays can be 0, 1, ... , M-1, where M <= 2N - 1. (Detail: if M = 2N, then the ints span all the 2N integers; if M < 2N, then there are some integers that have the same values.)
I need to calculate a kind of inverse map from ids: viewing the columns of ids as intervals, the "inverse map" relates each inner point to the indices of the intervals that contain it.
Intuition
Intuitively,
0 2 1
3 4 3
can be seen as
0 -> 0, 1, 2
1 -> 2, 3
2 -> 1, 2
where the right-hand-side endpoints are excluded for my problem. The "inverse" map would be
0 -> 0
1 -> 0, 2
2 -> 0, 1, 2
3 -> 1
Code
I have a piece of Python code that attempts to calculate the inverse map in a dictionary inv below:
for i in range(ids.shape[1]):
    for j in range(ids[0][i], ids[1][i]):
        inv[j].append(i)
where each inv[j] is an array-like container that is initialized as empty before the nested loop. Currently I use Python's built-in arrays to initialize it:
for i in range(M): inv[i]=array.array('I')
Question
The nested loop above performs terribly. In my problem setting (image processing), the first loop has a million iterations and the second about 3000. Not only does it take a lot of memory (because inv is huge), it is also slow. I would like to focus on speed in this question. How can I accelerate this nested loop, e.g. with vectorization?
You could try the option below, in which your outer loop is hidden away inside numpy's apply_along_axis(). I'm not sure about the performance benefit; only a test at a decent scale can tell (especially as there's some initial overhead involved in converting lists to numpy arrays):
import numpy as np
import array

ids = [[0,2,1],[3,4,3]]
ids_arr = np.array(ids)  # Convert to numpy array. Expensive operation?

range_index = 0  # Initialize. To be bumped up by each invocation of my_func()

inv = {}
for i in range(np.max(ids_arr)):
    inv[i] = array.array('I')

def my_func(my_slice):
    global range_index
    for i in range(my_slice[0], my_slice[1]):
        inv[i].append(range_index)
    range_index += 1

np.apply_along_axis(my_func, 0, ids_arr)
print(inv)
Output:
{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]),
3: array('I', [1])}
Edit:
I feel that using a dictionary might not be a good idea here. I suspect that in this particular context, dictionary-indexing might actually be slower than numpy array indexing. Use the below lines to create and initialize inv as a numpy array of Python arrays. The rest of the code can remain as-is:
inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=array.array)
for i in range(inv_len):
    inv[i] = array.array('I')
(Note: This assumes that your application isn't doing dict-specific stuff on inv, such as inv.items() or inv.keys(). If that's the case, however, you might need an extra step to convert the numpy array into a dict)
To avoid the for loop, here is just a pandas sample:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "A": np.random.randint(0, 100, 100000),
    "B": np.random.randint(0, 100, 100000)
})
df.groupby("B")["A"].agg(list)
Since the order of N is large, I've come up with what seems like a practical approach; let me know if there are any flaws.
Store the ith interval [x, y] as [x, y, i]. Sort the intervals by their start (and end) values. This takes O(N log N) time.
Create a frequency array freq[2*N+1]. For each interval, update the frequencies using a range update (difference array), which costs O(1) per interval; building the final frequencies then takes O(N).
Determine a threshold based on your data. According to that value, each point is classified as either sparse or frequent. For sparse points, do nothing. For frequent points only, store the intervals in which they occur.
During lookup, a frequent point can directly access its pre-computed list. A sparse point can be searched for among the intervals in O(log N) time, since the intervals are sorted and their indices were attached in step 1.
This seems like a practical approach to me; the rest depends on your usage, e.g. the amortized time complexity you need per query. A rough sketch follows.
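Here is how the steps above might look in NumPy (my own interpretation; the threshold and the sparse lookup are placeholders to tune, and intervals are taken as half-open [start, end) as in the question):
import numpy as np
from collections import defaultdict

ids = np.array([[0, 2, 1], [3, 4, 3]])          # example from the question
starts, ends = ids[0], ids[1]

# step 1: keep the original index with each interval and sort by start
order = np.argsort(starts)
intervals = np.stack([starts[order], ends[order], order], axis=1)

# step 2: coverage frequency of every point via a difference array (range updates)
M = ends.max()
diff = np.zeros(M + 1, dtype=np.int64)
np.add.at(diff, starts, 1)                      # add.at accumulates duplicate indices correctly
np.add.at(diff, ends, -1)
freq = np.cumsum(diff)[:M]                      # freq[p] = number of intervals covering p

# step 3: precompute lists only for "frequent" points
threshold = 2                                   # choose based on your data
frequent = np.nonzero(freq >= threshold)[0]
precomputed = defaultdict(list)
for s, e, i in intervals:
    for p in frequent[(frequent >= s) & (frequent < e)]:
        precomputed[p].append(i)

# step 4: lookup; sparse points fall back to scanning the sorted intervals
# (a binary search on the start values could bound this scan)
def lookup(p):
    if freq[p] >= threshold:
        return precomputed[p]
    return [i for s, e, i in intervals if s <= p < e]

print(lookup(2), lookup(3))                     # intervals covering point 2 and point 3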
I have an array of Cartesian coordinates
xy = np.array([[0,0], [2,3], [3,4], [2,5], [5,2]])
which I want to convert into an array of complex numbers representing the same:
c = np.array([0, 2+3j, 3+4j, 2+5j, 5+2j])
My current solution is this:
c = np.sum(xy * [1,1j], axis=1)
This works but seems crude to me, and probably there is a nicer version with some built-in magic using np.complex() or similar, but the only way I found to use this was
c = np.array(list(map(lambda c: np.complex(*c), xy)))
This doesn't look like an improvement.
Can anybody point me to a better solution, maybe using one of the many numpy functions I don't know by heart (is there a numpy.cartesian_to_complex() working on arrays I haven't found yet?), or maybe using some implicit conversion when applying a clever combination of operators?
Recognize that complex128 is just a pair of floats. You can then do this using a "view" which is free, after converting the dtype from int to float (which I'm guessing your real code might already do):
xy.astype(float).view(np.complex128)
The astype() converts the integers to floats, which requires construction of a new array, but once that's done the view() is "free" in terms of runtime.
The above gives you shape=(n,1); you can np.squeeze() it to remove the extra dimension. This is also just a view operation, so takes basically no time.
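For example, with the xy from the question the whole thing might look like this:
import numpy as np

xy = np.array([[0, 0], [2, 3], [3, 4], [2, 5], [5, 2]])
c = np.squeeze(xy.astype(float).view(np.complex128))  # shape (5,) instead of (5, 1)
print(c)  # [0.+0.j 2.+3.j 3.+4.j 2.+5.j 5.+2.j]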
How about
c=xy[:,0]+1j*xy[:,1]
xy[:,0] will give an array of all elements in the 0th column of xy and xy[:,1] will give that of the 1st column.
Multiply xy[:,1] by 1j to make it imaginary and then add the result to xy[:,0].
I have a dataset on which I'm trying to apply some arithmetical method.
The thing is, it gives me relatively large numbers, and when I do it with numpy, they are stored as 0.
The weird thing is, when I compute the numbers individually, they have an int value; they only become zeros when I compute them using numpy.
x = np.array([18,30,31,31,15])
10*150**x[0]/x[0]
Out[1]:36298069767006890
vector = 10*150**x/x
vector
Out[2]: array([0, 0, 0, 0, 0])
I have of course checked their types:
type(10*150**x[0]/x[0]) == type(vector[0])
Out[3]:True
How can I compute this large numbers using numpy without seeing them turned into zeros?
Note that if we remove the factor 10 at the beginning, the problem slightly changes (but I think the reason is similar):
x = np.array([18,30,31,31,15])
150**x[0]/x[0]
Out[4]:311075541538526549
vector = 150**x/x
vector
Out[5]: array([-329406144173384851, -230584300921369396, 224960293581823801,
-224960293581823801, -368934881474191033])
The negative numbers indicate that the largest number of the int64 type in Python has been exceeded, don't they?
As Nils Werner already mentioned, numpy's native integer types cannot store numbers that large, but Python itself can, since its int objects use an arbitrary-length implementation.
So what you can do is tell numpy not to convert the numbers to its native types but to use the Python objects instead. This will be slower, but it will work.
In [14]: x = np.array([18,30,31,31,15], dtype=object)
In [15]: 150**x
Out[15]:
array([1477891880035400390625000000000000000000L,
191751059232884086668491363525390625000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
28762658884932613000273704528808593750000000000000000000000000000000L,
437893890380859375000000000000000L], dtype=object)
In this case the numpy array will not store the numbers themselves but references to the corresponding int objects. When you perform arithmetic operations they won't be performed on the numpy array but on the objects behind the references.
I think you're still able to use most of the numpy functions with this workaround but they will definitely be a lot slower than usual.
But that's what you get when you're dealing with numbers that large :D
Maybe somewhere out there is a library that can deal with this issue a little better.
Just for completeness, if precision is not an issue, you can also use floats:
In [19]: x = np.array([18,30,31,31,15], dtype=np.float64)
In [20]: 150**x
Out[20]:
array([ 1.47789188e+39, 1.91751059e+65, 2.87626589e+67,
2.87626589e+67, 4.37893890e+32])
150 ** 28 is way beyond what an int64 variable can represent (it's in the ballpark of 8e60 while the maximum possible value of an unsigned int64 is roughly 18e18).
Python may be using an arbitrary length integer implementation, but NumPy doesn't.
As you deduced correctly, negative numbers are a symptom of an int overflow.
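For reference, you can check the integer limits directly, which makes the overflow easy to see:
import numpy as np

print(np.iinfo(np.int64).max)    # 9223372036854775807, roughly 9.2e18
print(np.iinfo(np.uint64).max)   # 18446744073709551615, roughly 1.8e19
# 150**18 is already about 1.5e39, so the int64 computation wraps around silently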
Basically I have an array that may vary between any two numbers, and I want to preserve the distribution while constraining it to the [0,1] space. The function to do this is very very simple. I usually write it as:
def to01(array):
    array -= array.min()
    array /= array.max()
    return array
Of course it can and should be more complex to account for tons of situations, such as all the values being the same (divide by zero) and float vs. integer division (use np.subtract and np.divide instead of operators). But this is the most basic.
The problem is that I do this very frequently across stuff in my project, and it seems like a fairly standard mathematical operation. Is there a built in function that does this in NumPy?
Don't know if there's a builtin for that (probably not, it's not really a difficult thing to do as is). You can use vectorize to apply a function to all the elements of the array:
import numpy

def to01(array):
    a = array.min()
    # ignore the Runtime Warning
    with numpy.errstate(divide='ignore'):
        b = 1. / (array.max() - array.min())
    if not numpy.isfinite(b):
        b = 0
    return numpy.vectorize(lambda x: b * (x - a))(array)
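A quick usage check with made-up values:
print(to01(numpy.array([2.0, 4.0, 6.0])))  # 2 -> 0, 4 -> 0.5, 6 -> 1
print(to01(numpy.array([4.0, 4.0, 4.0])))  # all-equal input maps to zeros instead of raising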
I have three arrays that are processed with a mathematical function to get a final result array. Some of the arrays contain NaNs and some contain 0. However, while a division by zero raises a warning, a calculation with NaN just gives NaN. So I'd like to handle the parts of the arrays where zeros are involved differently:
r=numpy.array([3,3,3])
k=numpy.array([numpy.nan,0,numpy.nan])
n=numpy.array([numpy.nan,0,0])
1.0*n*numpy.exp(r*(1-(n/k)))
e.g. in cases where k == 0, I'd like to get 0 as the result. In all other cases I'd like to calculate the function above. So what is the way to do such calculations on parts of the array (via indexing) to get a single final result array?
import numpy
r=numpy.array([3,3,3])
k=numpy.array([numpy.nan,0,numpy.nan])
n=numpy.array([numpy.nan,0,0])
indxZeros=numpy.where(k==0)
indxNonZeros=numpy.where(k!=0)
d=numpy.empty(k.shape)
d[indxZeros]=0
d[indxNonZeros]=n[indxNonZeros]/k[indxNonZeros]
print(d)
Is the following what you need?
>>> rv = 1.0*n*numpy.exp(r*(1-(n/k)))
>>> rv[k==0] = 0
>>> rv
array([ nan, 0., nan])
So, you may think that the solution to this problem is to use numpy.where, but the following:
numpy.where(k==0, 0, 1.0*n*numpy.exp(r*(1-(n/k))))
still gives a warning, as the expression is actually evaluated for the cases where k is zero, even if those results aren't used.
If this really bothers you, you can use numexpr for this expression, which will actually branch on the where statement and not evaluate the k==0 case:
import numexpr
numexpr.evaluate('where(k==0, 0, 1.0*n*exp(r*(1-(n/k))))')
Another way, based on indexing as you asked for, involves a little loss in legibility
result = numpy.zeros_like(k)
good = k != 0
result[good] = 1.0*n[good]*numpy.exp(r[good]*(1-(n[good]/k[good])))
This can be bypassed somewhat by defining a gaussian function:
def gaussian(r, k, n):
    return 1.0*n*numpy.exp(r*(1-(n/k)))
result = numpy.zeros_like(k)
good = k != 0
result[good] = gaussian(r[good], k[good], n[good])