Let's say I have two arrays: a = array([1,2,3,0,4,5,0]) and b = array([1,2,3,4,0,5,6]). I am interested in removing the instances where a and b are 0. But I also want to remove the corresponding instances from both arrays. Therefore what I want to end up with is a = array([1,2,3,5]) and b = array([1,2,3,5]). This is because a[3] == 0 and a[6] == 0, so both b[3] and b[6] are also deleted. Likewise, since b[4] == 0, a[4] is also deleted. It's simple to do this for, say, two arrays:
import numpy as np
a = np.array([1,2,3,0,4,5,0])
b = np.array([1,2,3,4,0,5,6])
ix = np.where(b == 0)
b = np.delete(b, ix)
a = np.delete(a, ix)
ix = np.where(a == 0)
b = np.delete(b, ix)
a = np.delete(a, ix)
However, this solution doesn't scale up if I have many arrays (which I do). What would be a more elegant way to do this?
If I try the following:
import numpy as np
a = np.array([1,2,3,0,4,5,0])
b = np.array([1,2,3,4,0,5,6])
arrays = [a,b]
for array in arrays:
    ix = np.where(array == 0)
    b = np.delete(b, ix)
    a = np.delete(a, ix)
I get a = array([1, 2, 3, 4]) and b = array([1, 2, 3, 0]), not the answers I need. Any idea where this is wrong?
Assuming both/all arrays always have the same length, you can use masks:
ma = a != 0 # mask elements which are not equal to zero in a
mb = b != 0 # mask elements which are not equal to zero in b
m = ma * mb # assign the intersection of ma and mb to m
print(a[m], b[m]) # [1 2 3 5] [1 2 3 5]
You can of course also do it in one line
m = (a != 0) * (b != 0)
Or use the inverse
ma = a == 0
mb = b == 0
m = ~(ma + mb) # not the union of ma and mb
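For more than two arrays, the same mask idea can be sketched with np.logical_and.reduce. The list name arrays, the names keep and filtered, and the third array c are assumptions added for illustration; they are not part of the question.

```python
import numpy as np

a = np.array([1, 2, 3, 0, 4, 5, 0])
b = np.array([1, 2, 3, 4, 0, 5, 6])
c = np.array([7, 0, 7, 7, 7, 7, 7])  # hypothetical third array, for illustration only

arrays = [a, b, c]

# True only at positions where every array is non-zero
keep = np.logical_and.reduce([arr != 0 for arr in arrays])

# Apply the same mask to every array
filtered = [arr[keep] for arr in arrays]
print(filtered)
```

This still assumes that all arrays have the same length, as in the answer above.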
This happens because np.delete returns a new array, which you assign to a and b inside the loop. The list stored in the arrays variable, however, still holds the original arrays; rebinding a and b does not update it. Hence, in every iteration the zeros are looked up in the original arrays, while the deletion is applied to the already shortened ones. The first iteration returns the correct indices of 0 in a, but the second iteration returns ix as 4 (the position of 0 in the original b), which no longer matches the shortened arrays. If you display the arrays variable in each iteration, it remains the same.
You need to reassign arrays once you are done processing one array so that the change is taken into account in the next iteration. Here's how you'd do it:
a = np.array([1, 2, 3, 0, 4, 5, 0])
b = np.array([1, 2, 3, 4, 0, 5, 6])
arrays = [a,b]
for i in range(0, len(arrays)):
    ix = np.where(arrays[i] == 0)
    b = np.delete(b, ix)
    a = np.delete(a, ix)
    arrays = [a, b]
Of course you can automate what happens inside the loop. I just wanted to give an explanation of what was happening.
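One possible way to automate the loop body for any number of arrays is sketched below. It rebuilds the whole list in each iteration instead of naming a and b explicitly; this is an illustration of the idea, not the only way to do it.

```python
import numpy as np

a = np.array([1, 2, 3, 0, 4, 5, 0])
b = np.array([1, 2, 3, 4, 0, 5, 6])
arrays = [a, b]

for i in range(len(arrays)):
    # Zeros are looked up in the current (possibly shortened) version of array i
    ix = np.where(arrays[i] == 0)
    # Delete those positions from every array and rebuild the list,
    # so the next iteration sees the shortened arrays
    arrays = [np.delete(arr, ix) for arr in arrays]

a, b = arrays
print(a, b)
```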
A slow method involves operating over the whole list twice, first to build an intermediate list of indices to delete, and then second to delete all of the values at those indices:
import numpy as np
a = np.array([1,2,3,0,4,5,0])
b = np.array([1,2,3,4,0,5,6])
arrays = [a, b]
vals = []
for array in arrays:
    ix = np.where(array == 0)
    vals.extend([y for x in ix for y in x.tolist()])
vals = list(set(vals))
new_array = []
for array in arrays:
    new_array.append(np.delete(array, vals))
Building up on top of Christoph Terasa's answer, you can use array operations instead of for loops:
arrays = np.vstack([a,b]) # ...long list of arrays of equal length
zeroind = (arrays==0).max(0)
pos_arrays = arrays[:,~zeroind] # a 2d array containing only those columns where none of the rows contained zeros
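A self-contained version of the stacking approach, using the two arrays from the question, might look like this. The names stacked, has_zero, kept, a_new and b_new are illustrative, and .any(axis=0) is used here in place of .max(0), which gives the same result on a boolean array.

```python
import numpy as np

a = np.array([1, 2, 3, 0, 4, 5, 0])
b = np.array([1, 2, 3, 4, 0, 5, 6])

stacked = np.vstack([a, b])            # shape (2, 7): one row per input array
has_zero = (stacked == 0).any(axis=0)  # True for columns that contain a zero
kept = stacked[:, ~has_zero]           # drop every column that contains a zero

a_new, b_new = kept                    # unpack the rows back into separate arrays
print(a_new, b_new)
```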
Is there any better way to do this? Like replacing that list comprehension with numpy functions? I'd assume that for a small number of elements, the difference is insignificant, but for larger chunks of data it takes too much time.
>>> rows = 3
>>> cols = 3
>>> target = [0, 4, 7, 8] # each value represent target index of 2-d array converted to 1-d
>>> x = [1 if i in target else 0 for i in range(rows * cols)]
>>> arr = np.reshape(x, (rows, cols))
>>> arr
[[1 0 0]
 [0 1 0]
 [0 1 1]]
Another way:
shape = (rows, cols)
arr = np.zeros(shape)
arr[np.unravel_index(target, shape)] = 1
Since x comes from a range, you can index an array of zeros to set the ones:
x = np.zeros(rows * cols, dtype=bool)
x[target] = True
x = x.reshape(rows, cols)
Alternatively, you can create the proper shape up front and assign to the raveled array:
x = np.zeros((rows, cols), dtype=bool)
x.ravel()[target] = True
If you want actual zeros and ones, use a dtype like np.uint8 or whatever else suits your needs other than bool.
The approach shown here would apply even to your list example to make it more efficient. Even if you turned target into a set, you are performing O(N) lookups, with N = rows * cols. Instead, you only need M assignments with no lookups, with M = len(target):
x = [0] * (rows * cols)
for i in target:
    x[i] = 1
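As a quick sanity check, the sketch below compares the list comprehension from the question with the two NumPy approaches. The names expected, via_unravel and via_ravel are made up for this example, and np.uint8 is just one possible integer dtype.

```python
import numpy as np

rows, cols = 3, 3
target = [0, 4, 7, 8]

# List comprehension from the question, used as the reference result
expected = np.reshape([1 if i in target else 0 for i in range(rows * cols)],
                      (rows, cols))

# np.unravel_index converts flat indices into (row, col) index arrays
via_unravel = np.zeros((rows, cols), dtype=np.uint8)
via_unravel[np.unravel_index(target, (rows, cols))] = 1

# ravel() returns a view here because the new array is contiguous,
# so the assignment writes through to the 2-d array
via_ravel = np.zeros((rows, cols), dtype=np.uint8)
via_ravel.ravel()[target] = 1

print(np.array_equal(expected, via_unravel), np.array_equal(expected, via_ravel))
```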
Pandas groupby "ngroup" function tags each group in "group" order.
I'm looking for similar behaviour but need the assigned tags to be in original (index) order, how can I do so efficiently (this will happen often with large arrays) in pandas and numpy?
> df = pd.DataFrame(
      {"A": [9,8,7,8,9]},
      index=list("abcde"))
   A
a  9
b  8
c  7
d  8
e  9
> df.groupby("A").ngroup()
a 2
b 1
c 0
d 1
e 2
# LOOKING FOR ###################
a 0
b 1
c 2
d 1
e 0
How can I achieve the desired output with a single dimension numpy array?
arr = np.array([9,8,7,8,9])
# looking for [0,1,2,1,0]
Perhaps a better way is factorize:
df['A'].factorize()[0]
Output:
array([0, 1, 2, 1, 0])
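For the plain NumPy array from the question, the top-level pd.factorize function can be applied directly, without building a DataFrame first. A minimal sketch (the names codes and uniques are chosen here for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([9, 8, 7, 8, 9])

# codes: group number per element, numbered by first appearance
# uniques: the distinct values in order of first appearance
codes, uniques = pd.factorize(arr)
print(codes)    # [0 1 2 1 0]
print(uniques)  # [9 8 7]
```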
You can use np.unique -
In [105]: a = np.array([9,8,7,8,9])
In [106]: u,idx,tags = np.unique(a, return_index=True, return_inverse=True)
In [107]: idx.argsort().argsort()[tags]
Out[107]: array([0, 1, 2, 1, 0])
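To see why the double argsort works, it may help to look at the intermediate values. The sketch below repeats the same computation step by step; the names rank and result are introduced only for this illustration.

```python
import numpy as np

a = np.array([9, 8, 7, 8, 9])
u, idx, tags = np.unique(a, return_index=True, return_inverse=True)
# u    -> [7 8 9]      sorted unique values
# idx  -> [2 1 0]      position of the first occurrence of each unique value
# tags -> [2 1 0 1 2]  index into u for every element of a

# argsort of argsort gives the rank of each first-occurrence position,
# i.e. the order in which each unique value first appears in a
rank = idx.argsort().argsort()  # [2 1 0]: 7 appears third, 8 second, 9 first

result = rank[tags]
print(result)  # [0 1 2 1 0]
```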
You can pass sort=False to groupby():
df.groupby('A', sort=False).ngroup()
a 0
b 1
c 2
d 1
e 0
dtype: int64
As far as I can tell, there isn't a direct equivalent of groupby in numpy. For a pure numpy version, you can use numpy.unique() to get the unique values. numpy.unique() has the option to return the inverse, basically the array of indices that would recreate your input array, but it sorts the unique values first, so the result is the same as using the regular (sorted) pandas.groupby() command.
To get around this, you can capture the index values of the first occurrence of each unique value. Sort the index values and use these as indices into the original array to get the unique values in their original order. Create a dictionary to map between the unique values and the group numbers and then use that dictionary to convert the values in the array to the appropriate group numbers.
import numpy as np
arr = np.array([9, 8, 7, 8, 9])
_, i = np.unique(arr, return_index=True) # get the indexes of the first occurrence of each unique value
groups = arr[np.sort(i)] # sort the indexes and retrieve the values from the array so that they are in the array order
m = {value:ngroup for ngroup, value in enumerate(groups)} # create a mapping of value:groupnumber
np.vectorize(m.get)(arr) # use vectorize to create a new array using m
array([0, 1, 2, 1, 0])
I've benchmarked the suggested solutions:
Turns out that:
— factorize is the fastest for array sizes > 10³
— unique-argsort is the fastest for array sizes < 10³ (but slower by a factor of 10 for larger ones),
— ngroup is always slower, but for array sizes >3*10³ it has roughly the same speed as factorize.
from contextlib import contextmanager
from time import perf_counter as clock
from itertools import count
import numpy as np
import pandas as pd
def f1(s):
    # pandas factorize
    return s.factorize()[0]
def f2(s):
    # groupby without sorting, then ngroup
    return s.groupby(s, sort=False).ngroup().values
def f3(s):
    # np.unique plus double argsort
    u, idx, tags = np.unique(s.values, return_index=True, return_inverse=True)
    return idx.argsort().argsort()[tags]
@contextmanager
def bench(r):
    # append the elapsed time of the with-block to r
    t1 = clock()
    yield
    t2 = clock()
    r.append(t2-t1)
res = []
for i in count():
    n = 2**i
    a = np.random.randint(0, n, n)
    s = pd.Series(a)
    rr = []
    for j in range(5):
        r = []
        with bench(r):
            a1 = f1(s)
        with bench(r):
            a2 = f2(s)
        with bench(r):
            a3 = f3(s)
        rr.append(r)
        if max(r) > 0.5:
            break
    res.append(np.min(rr, axis=0))
    if np.max(rr) > 0.4:
        break
np.save('results.npy', np.array(res))
Is it possible to create an automatically updating NumPy array?
For example:
a = numpy.array([1,2,3,4])
b = numpy.array([a[0]+1,a[1]+2,a[2]+3,a[3]+4])
a[0] = 5
Output:
>>>print(b)
>>>[6, 4, 6, 8]
Not if the array elements are stored by value. However, NumPy supports arrays of objects, so you could store lambdas or something similar and achieve a comparable effect, although it's probably not what you want.
Ex:
a = np.array([1])
b = np.array([ lambda: a[0] + 1 ])
a[0] = 5
print (b[0]())
# 6
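A simpler alternative, if calling a function is acceptable, is to recompute the derived array on demand rather than storing it. This is only a sketch of that idea: b is turned into a function here, and offsets is a hypothetical name for the per-element increments from the question.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
offsets = np.array([1, 2, 3, 4])  # hypothetical name for the increments in the question

def b():
    # Recomputed from the current contents of a on every call
    return a + offsets

a[0] = 5
print(b())  # [6 4 6 8]
```

The trade-off is that the result is not cached: every call to b() performs the addition again.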
For all the folks who rock at vectorizing loops: I have two NumPy arrays of shape (N,) that contain indices to each other. Say we have a = np.asarray([0, 1, 2]) and b = np.array([1, 2, np.nan]). The function should first look at a[0] to get 0, then do b[0] to get 1, then again a[1] to get 2, and so on until we get np.nan. So the function is simply a[b[a[b[a[0]]]]] = np.nan. The output should contain two lists of values that were called for a and b respectively. Indices in b are always greater than in a, such that the process cannot get stuck.
I wrote a simple function that can do just this (wrapped with numba - 18.2 µs):
import numpy as np
a = np.array([0, 1, 2, 3, 4])
b = np.array([2., 3., 4., np.nan, np.nan])
lst = []
while True:
    if len(lst) > 0:
        idx = lst[-1]
    else:
        idx = 0
    if len(lst) % 2 == 0:
        if idx < len(a) - 1:
            next_idx = a[idx]
            if np.isnan(next_idx):
                break
            lst.append(int(next_idx))
        else:
            break
    else:
        if idx < len(b) - 1:
            next_idx = b[idx]
            if np.isnan(next_idx):
                break
            lst.append(int(next_idx))
        else:
            break
The first list is lst[::2]:
[0, 2]
The second is lst[1::2]:
[2, 4]
Any way to vectorize this? Both arrays in inputs as well as both lists in output always have the same shape.
This is not a vectorized solution, but as a Numba solution it should be considerably faster and simpler. I changed the code slightly to use integers and -1 instead of np.nan. It is trivial to switch to this representation with something like b = np.where(np.isnan(b), -1, b).astype(int), and it makes the code more efficient. Instead of having a growing structure within the Numba function, I preallocate the output array in advance, so the loop can run much faster.
import numpy as np
import numba as nb
def point_each_other(a, b):
    # Convert inputs to array if they are not already
    a = np.asarray(a)
    b = np.asarray(b)
    # Make output array in advance
    out = np.empty(len(a) + len(b), dtype=a.dtype)
    # Call Numba function
    n = point_each_other_nb(a, b, out)
    # Return relevant part of the output
    return out[:n]
@nb.njit
def point_each_other_nb(a, b, out):
    curr = 0
    i = 0
    while curr >= 0:
        # You can do bad input checking with the following
        # if i >= len(out):
        #     raise ValueError
        # Save current index
        out[i] = curr
        # Get the next index
        curr = a[curr]
        # Swap arrays
        a, b = b, a
        # Advance counter
        i += 1
    # Return number of stored indices
    return i - 1
# Test
a = np.array([0, 1, 2, 3, 4])
b = np.array([2, 3, 4, -1, -1])
out = point_each_other(a, b)
print(out[::2])
# [0 2 4]
print(out[1::2])
# [0 2]
Not vectorized, but here's recursive solution:
import numpy as np
a = np.array([0,1,2,3,4])
b = np.array([2,3,4,np.nan, np.nan])
def rec(i,b,a, a_out, b_out):
    if np.isnan(b[i]): return
    else:
        if not np.isnan(b[i]): a_out.append(i)
        rec(int(b[i]), a, b, b_out, a_out)
    return a_out, b_out
print(rec(0,b,a,[],[]))
Output
([0, 2], [2, 4])
I have an array like this:
A = [[1,0,2,3],
[2,0,1,1],
[3,1,0,0]]
and I want to get the position of one of the cells with the value == 1 such as A[0][0] or A[1][2] and so on ...
So far I did this:
import random
import numpy as np
A = np.array([[1,0,2,3],
              [2,0,1,1],
              [3,1,0,0]])
B = np.where(A == 1)
C = []
for i in range(len(B[0])):
    Ca = [B[0][i], B[1][i]]
    C.append(Ca)
D = random.choice(C)
But now I want to reuse D for getting a cell value back. Like:
A[D] (which does not work) should return the same as A[1][2]
Does someone know how to fix this, or know of a better solution?
This should work for you.
import random
import numpy as np
A = np.array([[1,0,2,3],
              [2,0,1,1],
              [3,1,0,0]])
B = np.where(A == 1)
C = []
for i in range(len(B[0])):
    Ca = [B[0][i], B[1][i]]
    C.append(Ca)
D = random.choice(C)
print(A[D[0]][D[1]])
This gives the output.
>>> print(A[D[0]][D[1]])
1
Since D has the form [X, Y], the value can be obtained from the matrix as A[D[0]][D[1]].
It seems you are trying to randomly select one of the cells where A is 1. You can use NumPy for this all the way through instead of resorting to for-loops:
B = np.array(np.where(A == 1))
>>> B
array([[0, 1, 1, 2],
[0, 2, 3, 1]])
Now to randomly select a column corresponding to one of the cells, we can use np.random.randint
column = np.random.randint(B.shape[1])
D = B[:, column]
>>> D
array([1, 2]) # corresponds to the second index pair in B
Now you can simply index into A using the tuple of D (to correspond to the dimensions' indices) as
>>> A[tuple(D)]
1
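A slightly more compact variant of the same idea, sketched here with np.argwhere and NumPy's Generator API: np.argwhere returns one (row, col) pair per matching cell, and rng.choice picks one of those rows. The names rng and coords are chosen for this example.

```python
import numpy as np

A = np.array([[1, 0, 2, 3],
              [2, 0, 1, 1],
              [3, 1, 0, 0]])

rng = np.random.default_rng()

coords = np.argwhere(A == 1)  # one (row, col) pair per matching cell
D = rng.choice(coords)        # picks one row of coords, i.e. one index pair

print(D, A[tuple(D)])         # the value at the chosen position is always 1
```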