I'm coding a hash-table-ish indexing mechanism that returns an integer's interval number (0 to n), according to a set of splitting points.
For example, if integers are split at value 3 (one split point, so two intervals), we can find the interval number for each array element using a simple comparison:
>>> import numpy as np
>>> x = np.array(range(7))
>>> [int(i>3) for i in x]
[0, 0, 0, 0, 1, 1, 1]
When there are many intervals, we can define a function as below:
>>> def get_interval_id(input_value, splits):
...     for i, split_point in enumerate(splits):
...         if input_value < split_point:
...             return i
...     return len(splits)
...
>>> [get_interval_id(i, [2,4]) for i in x]
[0, 0, 1, 1, 2, 2, 2]
But this solution does not look elegant. Is there any Pythonic (better) way to do this job?
Since you're already using numpy, I would suggest you use its digitize function:
>>> import numpy as np
>>> np.digitize(np.array([0, 1, 2, 3, 4, 5, 6]), [2, 4])
array([0, 0, 1, 1, 2, 2, 2])
From the documentation:
Return the indices of the bins to which each value in input array
belongs.
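Note that, by default, a value equal to a split point lands in the upper interval (which is why 2 and 4 map to 1 and 2 above). If you want the boundary value to stay in the lower interval instead, as in the int(i>3) example from the question, digitize takes a right keyword for that; a quick sketch:
>>> np.digitize(np.arange(7), [3])
array([0, 0, 0, 1, 1, 1, 1])
>>> np.digitize(np.arange(7), [3], right=True)
array([0, 0, 0, 0, 1, 1, 1])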
Python itself does not have a dedicated built-in function for this process, which is called binning. If you wanted, you could wrap your function into a one-line command, but it's more readable this way.
However, data frame packages usually have full-featured binning methods; the most popular one in Python is pandas. It lets you bin values into equal-width intervals, equal-sized bins (the same number of entries in each bin), or custom split values (your case). See this question for a good discussion and examples.
Of course, this means that you'd have to install and import pandas and convert your list to a data frame. If that's too much trouble, just keep your current implementation; it's readable, straightforward, and reasonably short.
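For reference, a minimal sketch of the custom-splits case with pandas.cut (the -inf/+inf edges make the outer intervals open-ended, and labels=False returns the integer bin ids):
import numpy as np
import pandas as pd

x = np.arange(7)
ids = pd.cut(x, bins=[-np.inf, 2, 4, np.inf], right=False, labels=False)
print(ids.tolist())   # [0, 0, 1, 1, 2, 2, 2]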
How about wrapping the whole process inside of one function instead of only half the process?
>>> get_interval_ids([0, 1, 2, 3, 4, 5, 6], [2, 4])
[0, 0, 1, 1, 2, 2, 2]
and your function would look like
def get_interval_ids(values, splits):
    def get_interval_id(input_value):
        for i, split_point in enumerate(splits):
            if input_value < split_point:
                return i
        return len(splits)
    return [get_interval_id(val) for val in values]
Related
I have a little bit of a tricky problem here...
Given two arrays A and B
A = np.array([8, 5, 3, 7])
B = np.array([5, 5, 7, 8, 3, 3, 3])
I would like to replace the values in B with the index of that value in A. In this example case, that would look like:
[1, 1, 3, 0, 2, 2, 2]
For the problem I'm working on, A and B contain the same set of values and all of the entries in A are unique.
The simple way to solve this is to use something like:
B_new = np.empty_like(B)
for idx in range(len(A)):
    ind = np.where(B == A[idx])[0]
    B_new[ind] = idx   # store the index of the value in A, not the value itself
But the B array I'm working with contains almost a million elements and using a for loop gets super slow. There must be a way to vectorize this, but I can't figure it out. The closest I've come is to do something like
np.intersect1d(A, B, return_indices=True)
But this only gives me the first occurrence of each element of A in B. Any suggestions?
The solution of @mozway is good for small arrays but not for big ones, as it runs in O(n**2) time (i.e. quadratic time; see time complexity for more information). Here is a much better solution for big arrays, running in O(n log n) time (i.e. quasi-linear), based on a fast binary search:
unique_values, index = np.unique(A, return_index=True)
result = index[np.searchsorted(unique_values, B)]
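A quick check with the arrays from the question:
A = np.array([8, 5, 3, 7])
B = np.array([5, 5, 7, 8, 3, 3, 3])
unique_values, index = np.unique(A, return_index=True)
print(index[np.searchsorted(unique_values, B)])   # [1 1 3 0 2 2 2]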
Use numpy broadcasting:
np.where(B[:, None]==A)[1]
NB. the values in A must be unique
Output:
array([1, 1, 3, 0, 2, 2, 2])
Though I can't tell exactly what the complexity of this is, I believe it will perform quite well:
A.argsort()[np.unique(B, return_inverse = True)[1]]
array([1, 1, 3, 0, 2, 2, 2], dtype=int64)
Suppose I have the following numpy array:
Space = np.arange(7)
Question: How could I generate a set of N samples from Space such that:
Each sample consists only of increasing or decreasing consecutive numbers.
The sampling is done with replacement, so the sample need not be monotonically increasing or decreasing.
Each sample ends with a 6 or 0, and
There is no limitation on the length of the samples (however each sample terminates once a 6 or 0 has been selected).
In essence, I'm creating a Markov reward process via numpy sampling. (There is probably a more efficient package for this, but I'm not sure what it would be.) For example, if N = 3, a possible sampled set would look something like this:
Sample = [[1,0],[4, 3, 4, 5, 6],[4, 3, 2, 1, 2, 1, 0]]
I can accomplish this with something not very elegant like this:
N = len(Space)
Set = []
for i in range(3):
    X = np.random.randint(N)
    if (X == 0) | (X == 6):
        Set.append(X)
    else:
        Sample = []
        while (X != 0) & (X != 6):
            Next = np.array([X-1, X+1])
            X = np.random.choice(Next)
            Sample.append(X)
        Set.append(Sample)
return(Set)
But I was wondering what a more efficient/Pythonic way to go about this type of sampling would be, perhaps without so many loops? Or alternatively, are there better Python libraries for this sort of thing? Thanks.
Numpy doesn't seem to be helping much here; I'd just use the standard random module. The main reason is that random is faster when working with single values, as this algorithm does, and there doesn't seem to be any need to pull in an extra dependency.
from random import randint, choice
def bounded_path(lo, hi):
    # r covers the interior space
    r = range(lo+1, hi)
    n = randint(lo, hi)
    result = [n]
    while n in r:
        n += choice((-1, 1))
        result.append(n)
    return result
This seems to do the right thing for me; e.g. evaluating the above 10 times, I get:
[0]
[4, 3, 4, 3, 2, 1, 0]
[5, 6]
[2, 3, 4, 3, 4, 5, 4, 3, 4, 3, 2, 1, 0]
[1, 0]
[1, 0]
[4, 3, 4, 3, 4, 3, 2, 3, 2, 1, 0]
[3, 2, 3, 2, 1, 0]
[6]
[4, 5, 4, 3, 4, 3, 2, 1, 0]
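To build the full set of N sample paths over Space = np.arange(7) (so the endpoints are 0 and 6), a comprehension is enough:
N = 3
Set = [bounded_path(0, 6) for _ in range(N)]
print(Set)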
Just did quick benchmark of random number generation comparing:
def rng_np(X):
    for _ in range(10):
        X = np.random.choice(np.array([X-1, X+1]))
    return X

def rng_py(X):
    for _ in range(10):
        X += choice((-1, +1))
    return X
The Numpy version is ~30 times slower. Numpy has to do a lot of extra work on every iteration: building a Python list, converting it to a Numpy array, and going through the dispatch in choice that allows for fancy vectorisation. Python knows that the (-1, +1) tuple in the vanilla version is constant, so it's just built once (e.g. dis is useful to see what's going on inside).
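For reference, a minimal timing harness for the two functions above (assuming rng_np and rng_py are defined as shown; exact numbers vary by machine and NumPy version):
import timeit

print("numpy :", timeit.timeit("rng_np(3)", globals=globals(), number=10000))
print("python:", timeit.timeit("rng_py(3)", globals=globals(), number=10000))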
You might be able to get somewhere by working with larger blocks of numbers, but I doubt it would be much faster. Maintaining the uniformity of the starting point seems awkward, but you could probably do something if you were really careful! Numpy starts to break even when each call is vectorised over approximately 10 values, and really shines when you have more than 100 values.
Assume I have an array like [2,3,4]. I am looking for a way in NumPy (or Tensorflow) to convert it to [0,0,1,1,1,2,2,2,2], in order to apply tf.math.segment_sum() on a tensor that has a size of 2+3+4.
No elegant idea comes to my mind, only loops and list comprehension.
Would something like this work for you?
import numpy
arr = numpy.array([2, 3, 4])
numpy.repeat(numpy.arange(arr.size), arr)
# array([0, 0, 1, 1, 1, 2, 2, 2, 2])
You don't need to use numpy. You can use nothing but list comprehensions:
>>> foo = [2,3,4]
>>> sum([[i]*foo[i] for i in range(len(foo))], [])
[0, 0, 1, 1, 1, 2, 2, 2, 2]
It works like this:
You can create expanded lists by multiplying a single-element list by an integer, so [0] * 2 == [0, 0]. So for each index in the array, we expand with [i]*foo[i]. In other words:
>>> [[i]*foo[i] for i in range(len(foo))]
[[0, 0], [1, 1, 1], [2, 2, 2, 2]]
Then we use sum to reduce the lists into a single list:
>>> sum([[i]*foo[i] for i in range(len(foo))], [])
[0, 0, 1, 1, 1, 2, 2, 2, 2]
Because we are "summing" lists, not integers, we pass [] to sum to make an empty list the starting value of the sum.
(Note that this will likely be slower than numpy, though I have not personally compared it to something like @Patol75's answer.)
I really like the answer from @Patol75 since it's neat. However, there is no pure TensorFlow solution yet, so I'll provide one, which may be a bit complex. Just for reference and fun!
BTW, I didn't see a tf.repeat API in tf master. Please check this PR, which adds tf.repeat support equivalent to numpy.repeat.
import tensorflow as tf
repeats = tf.constant([2,3,4])
values = tf.range(tf.size(repeats)) # [0,1,2]
max_repeats = tf.reduce_max(repeats) # max repeat is 4
tiled = tf.tile(tf.reshape(values, [-1,1]), [1,max_repeats]) # [[0,0,0,0],[1,1,1,1],[2,2,2,2]]
mask = tf.sequence_mask(repeats, max_repeats) # [[1,1,0,0],[1,1,1,0],[1,1,1,1]]
res = tf.boolean_mask(tiled, mask) # [0,0,1,1,1,2,2,2,2]
Patol75's answer uses Numpy but Gort the Robot's answer is actually faster (on your example list at least).
I'll keep this answer up as another solution, but it's slower than both.
Given that a = [2,3,4] this could be done using a loop like so:
b = []
for i in range(len(a)):
    for j in range(a[i]):
        b.append(range(len(a))[i])
Which, as a list comprehension one-liner, is this diabolical thing:
b = [range(len(a))[i] for i in range(len(a)) for j in range(a[i])]
Both end up with b = [0,0,1,1,1,2,2,2,2].
The type of matrix I am dealing with was created from a vector as shown below:
Start with a 1-d vector V of length L.
To create a matrix A from V with N rows, make the i'th column of A the N consecutive entries of V starting from the i'th entry of V, so long as there are enough entries left in V to fill up the column. This means A has L - N + 1 columns.
Here is an example:
V = [0, 1, 2, 3, 4, 5]
N = 3
A =
[0 1 2 3
1 2 3 4
2 3 4 5]
Representing the matrix this way requires more memory than my machine has. Is there any reasonable way of storing this matrix sparsely? I am currently storing N * (L - N + 1) values, when I only need to store L values.
You can take a view of your original vector as follows:
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>>
>>> v = np.array([0, 1, 2, 3, 4, 5])
>>> n = 3
>>>
>>> a = as_strided(v, shape=(n, len(v)-n+1), strides=v.strides*2)
>>> a
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5]])
This is a view, not a copy of your original data, e.g.
>>> v[3] = 0
>>> v
array([0, 1, 2, 0, 4, 5])
>>> a
array([[0, 1, 2, 0],
[1, 2, 0, 4],
[2, 0, 4, 5]])
But you have to be careful not to do any operation on a that triggers a copy, since that would send your memory use through the roof.
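If you want to check whether a later operation kept the view, np.shares_memory is a quick test (a small sketch using the session above):
>>> np.shares_memory(a, v)
True
>>> b = a + 0   # arithmetic materialises a full (n, len(v)-n+1) copy
>>> np.shares_memory(b, v)
False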
If you're already using numpy, use its strided or sparse arrays, as Jaime explained.
If you're not already using numpy, you may want to strongly consider using it.
If you need to stick with pure Python, there are three obvious ways to do this, depending on your use case.
For strided or sparse-but-clustered arrays, you could do effectively the same thing as numpy.
Or you could use a simple run-length-encoding scheme, plus maybe a higher-level list of runs, or a list of pointers to every Nth element, or even a whole stack of such lists (one for every 100 elements, one for every 10000, etc.).
But for sparse arrays whose non-default entries are spread roughly uniformly, the easiest thing is to simply store a dict or defaultdict mapping indices to values. Random-access lookups and updates are still O(1), albeit with a higher constant factor, and the storage you waste by keeping (in effect) a hash, key, and value instead of just a value for each non-default element is more than made up for by not storing anything for the default elements, as long as the density stays below roughly 1/3.
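A minimal sketch of that last approach, using a plain dict with .get so that looking up a missing index doesn't insert a new key (which defaultdict would do):
default = 0
sparse = {}                        # index -> value, only for non-default entries

sparse[2] = 7
sparse[100000] = 3

def lookup(i):
    return sparse.get(i, default)  # O(1) random access, default elsewhere

print(lookup(2), lookup(50), lookup(100000))   # 7 0 3
print(len(sparse))                             # only 2 entries actually stored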
I use itertools.product to generate all possible variations of 4 elements of length 13. The 4 and 13 can be arbitrary, but as it is, I get 4^13 results, which is a lot. I need the result as a Numpy array and currently do the following:
c = it.product([1,-1,np.complex(0,1), np.complex(0,-1)], repeat=length)
sendbuf = np.array(list(c))
With some simple profiling code shoved in between, it looks like the first line is pretty much instantaneous, whereas the conversion to a list and then Numpy array takes about 3 hours.
Is there a way to make this quicker? It's probably something really obvious that I am overlooking.
Thanks!
The NumPy equivalent of itertools.product() is numpy.indices(), but it will only get you the product of ranges of the form 0,...,k-1:
numpy.rollaxis(numpy.indices((2, 3, 3)), 0, 4)
array([[[[0, 0, 0],
         [0, 0, 1],
         [0, 0, 2]],
        [[0, 1, 0],
         [0, 1, 1],
         [0, 1, 2]],
        [[0, 2, 0],
         [0, 2, 1],
         [0, 2, 2]]],
       [[[1, 0, 0],
         [1, 0, 1],
         [1, 0, 2]],
        [[1, 1, 0],
         [1, 1, 1],
         [1, 1, 2]],
        [[1, 2, 0],
         [1, 2, 1],
         [1, 2, 2]]]])
For your special case, you can use
a = numpy.indices((4,)*13)
b = 1j ** numpy.rollaxis(a, 0, 14)
(This won't run on a 32-bit system, because the array is too large. Extrapolating from the size I can test, it should run in less than a minute, though.)
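As a small sanity check of the 1j ** trick, here it is with length 2 instead of 13 (note the symbols come out in the order 1, 1j, -1, -1j, which differs from the order in the question but covers the same set of products):
import numpy

a = numpy.indices((4,) * 2)
b = 1j ** numpy.rollaxis(a, 0, 3)
print(b.reshape(-1, 2))   # 16 rows: every pair drawn from {1, 1j, -1, -1j}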
EDIT: Just to mention it: the call to numpy.rollaxis() is more or less cosmetic, to get the same output as itertools.product(). If you don't care about the order of the indices, you can just omit it (but it is cheap anyway, as long as you don't have any follow-up operations that would transform your array into a contiguous array).
EDIT2: To get the exact analogue of
numpy.array(list(itertools.product(some_list, repeat=some_length)))
you can use
numpy.array(some_list)[numpy.rollaxis(
    numpy.indices((len(some_list),) * some_length), 0, some_length + 1)
    .reshape(-1, some_length)]
This got completely unreadable -- just tell me whether I should explain it any further :)
The first line seems instantaneous because no actual operation takes place there: a generator object is merely constructed, and the work only happens when you iterate through it. As you said, you get 4^13 = 67108864 numbers, and all of these are computed and made available during your list call. I see that np.array takes only a list or a tuple, so you could try creating a tuple out of your iterator and passing it to np.array to see if there is any performance difference. This can only be determined by trying it for your use case, though there are some reports that tuples are slightly faster.
To try with a tuple, instead of list just do
sendbuf = np.array(tuple(c))
You could speed things up by skipping the conversion to a list:
numpy.fromiter(c, count=…) # Using count also speeds things up, but it's optional
With this function, the NumPy array is first allocated and then initialized element by element, without having to go through the additional step of a list construction.
PS: fromiter() does not handle the tuples returned by product(), so this might not be a solution, for now. If fromiter() did handle dtype=object, this should work, though.
PPS: As Joe Kington pointed out, this can be made to work by putting the tuples in a structured array. However, this does not appear to always give a speed up.
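For illustration, a sketch of what that structured-array variant might look like; the comma-separated complex128 dtype and the final view/reshape are my own choices here, shown with a small length rather than 13:
import itertools as it
import numpy as np

length = 3
values = [1, -1, 1j, -1j]
c = it.product(values, repeat=length)

dt = np.dtype(','.join(['complex128'] * length))   # one field per tuple position
records = np.fromiter(c, dtype=dt, count=len(values)**length)

# Reinterpret the records as an ordinary (4**length, length) complex array.
sendbuf = records.view(np.complex128).reshape(-1, length)
print(sendbuf.shape)   # (64, 3)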
Let numpy.meshgrid do all the work:
length = 13
x = [1, -1, 1j, -1j]
mesh = numpy.meshgrid(*([x] * length))
result = numpy.vstack([y.flat for y in mesh]).T
On my notebook, it takes ~2 minutes.
You might want to try a completely different approach: first create an empty array of the desired size:
result = np.empty((4**length, length), dtype=complex)
then use NumPy's slicing abilities to fill out the array yourself:
# Set up of the last "digit":
result[::4, length-1] = 1
result[1::4, length-1] = -1
result[2::4, length-1] = 1j
result[3::4, length-1] = -1j
You can do similar things for the other "digits" (i.e. the elements of result[:, 2], result[:, 1], and result[:, 0]). The whole thing could certainly be put in a loop that iterates over each digit.
Transposing the whole operation (np.empty((length, 4**length)…)) is worth trying, as it might bring a speed gain (through a better use of the memory cache).
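A sketch of that digit loop, using np.repeat and np.tile to build each column's repeating pattern instead of spelling out the slices by hand (the full (4**13, 13) complex array is on the order of 13 GB, so it is as memory-hungry as the original approach):
import numpy as np

length = 13
symbols = np.array([1, -1, 1j, -1j])
result = np.empty((4**length, length), dtype=complex)

for j in range(length):
    block = 4**(length - 1 - j)             # how long each symbol repeats in column j
    pattern = np.repeat(symbols, block)     # block copies of each symbol, in order
    result[:, j] = np.tile(pattern, 4**j)   # repeat the pattern down the whole column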
Probably not optimized, but much less reliant on Python type conversions:
ints = [1, 2, 3, 4]
repeat = 3

def prod(ints, repeat):
    w = repeat
    l = len(ints)
    h = l**repeat
    ints = np.array(ints)
    A = np.empty((h, w), dtype=int)
    rng = np.arange(h)
    for i in range(w):
        x = l**i
        idx = np.mod(rng, l*x) // x   # integer division, so idx can be used as an index
        A[:, i] = ints[idx]
    return A
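A quick check against itertools.product with the example inputs; note the columns come out in reverse order, since this version varies the first column fastest:
import itertools as it
import numpy as np

expected = np.array(list(it.product([1, 2, 3, 4], repeat=3)))[:, ::-1]
print(np.array_equal(prod([1, 2, 3, 4], 3), expected))   # True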