Finding items in one array based upon a second array

I have two arrays A and B:
A=array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B=array([[ 1., 1., 2.],
[ 3., 2., 1.]])
Anywhere there is a "1" in B I want to sum the same row and column locations in A.
So, for example, for the 1's the answer would be 5+5+9=19
I would want this to continue for 2,3....n (all unique values in B)
So for the 2's... it would be 9+5=14 and for the 3's it would be 8
I found the unique values by using:
numpy.unique(B)
I realize this may take multiple steps, but I can't really wrap my head around using the index matrix to sum those locations in another matrix.

For each unique value x, you can do
A[B == x].sum()
Example:
>>> A[B == 1.0].sum()
19.0
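To make the mechanics concrete, here is a minimal sketch (Python 3, using the arrays from the question) of the boolean mask that B == x produces and how it selects elements of A. The names mask, selected and total are only for illustration.

```python
import numpy as np

A = np.array([[5., 5., 5.],
              [8., 9., 9.]])
B = np.array([[1., 1., 2.],
              [3., 2., 1.]])

# B == 1.0 is a boolean array with the same shape as B
mask = (B == 1.0)
# Indexing A with the mask returns a 1-D array of the matching elements
selected = A[mask]
total = selected.sum()

print(mask)
print(selected, total)
```

The mask can be reused, so if several arrays share B's shape you only need to compute it once per unique value.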

I think numpy.bincount is what you want. If B is an array of small integers, like in your example, you can do something like this:
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1, 1, 2],
[ 3, 2, 1]])
print numpy.bincount(B.ravel(), weights=A.ravel())
# [ 0. 19. 14. 8.]
or if B has anything but small integers you can do something like this
import numpy
A = numpy.array([[ 5., 5., 5.],
[ 8., 9., 9.]])
B = numpy.array([[ 1., 1., 2.],
[ 3., 2., 1.]])
uniqB, inverse = numpy.unique(B, return_inverse=True)
print uniqB, numpy.bincount(inverse, weights=A.ravel())
# [ 1. 2. 3.] [ 19. 14. 8.]
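The same idea can be wrapped in a small helper for Python 3. This is only a sketch: sums_by_label is a made-up name, and the ravel() on inverse is a precaution, on the assumption that some NumPy releases return inverse with the shape of the input rather than flattened.

```python
import numpy as np

def sums_by_label(values, labels):
    """Return (unique labels, sum of values for each label)."""
    uniq, inverse = np.unique(labels, return_inverse=True)
    # ravel() keeps this working whether inverse is 1-D or shaped like labels
    sums = np.bincount(inverse.ravel(), weights=np.ravel(values))
    return uniq, sums

A = np.array([[5., 5., 5.],
              [8., 9., 9.]])
B = np.array([[1., 1., 2.],
              [3., 2., 1.]])

uniq, sums = sums_by_label(A, B)
print(uniq, sums)
```

Because the labels go through numpy.unique first, this version does not need B to hold small non-negative integers.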

[(val, np.sum(A[B==val])) for val in np.unique(B)] gives you a list of tuples where the first element is one of the unique values in B, and the second element is the sum of elements in A where the corresponding value in B is that value.
>>> [(val, np.sum(A[B==val])) for val in np.unique(B)]
[(1.0, 19.0), (2.0, 14.0), (3.0, 8.0)]
The key is that you can use A[B==val] to access items in A at positions where B equals val.
Edit: If you just want the sums, just do [np.sum(A[B==val]) for val in np.unique(B)].

I'd use numpy masked arrays. These are standard numpy arrays with a mask associated with them that blocks off certain values. The process is pretty straightforward: create a masked array using
numpy.ma.masked_array(data, mask)
where mask is generated by using a masked function
mask = numpy.ma.masked_not_equal(B, 1).mask
and data is A
for i in numpy.unique(B):
    print numpy.ma.masked_array(A, numpy.ma.masked_not_equal(B, i).mask).sum()
19.0
14.0
8.0

I found an old question here, and this is one of its answers:
def sum_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    values.cumsum(out=values)
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    values[1:] = values[1:] - values[:-1]
    return values, groups
In your case, you can flatten your arrays first:
aflat = A.flatten()
bflat = B.flatten()
sum_by_group(aflat, bflat)

Related

Joining Array In Python

Hi, I want to join multiple arrays in Python, using numpy, to form a multidimensional array. The joining happens inside a for loop. This is pseudocode:
import numpy as np
h = np.zeros(4)
for x in range(3):
    x1 = some array of length of 4 returned from a previous function (3,5,6,7)
    h = np.concatenate((h,x1), axis =0)
The first iteration goes fine, but during the second iteration on the for loop I get the following error,
ValueError: all the input arrays must have same number of dimensions
The output array should look something like this
[[0,0,0,0],[3,5,6,7],[6,3,6,7]]
etc
So how can I join the arrays?
Thanks

You need to use vstack. It allows you to stack arrays. You take a sequence of arrays and stack them vertically to make a single array
import numpy as np
h = np.zeros(4)
for x in range(3):
    x1 = [3,5,6,7]
    h = np.vstack((h,x1))
    # not h = np.concatenate((h,x1), axis =0)
print h
Output:
[[ 0. 0. 0. 0.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]]
If you want to use concatenate only, you can do it the following way as well:
import numpy as np
h1 = np.zeros(4)
for x in range(3):
    x1 = np.array([3,5,6,7])
    h1= np.concatenate([h1,x1.T], axis =0)
print h1.shape
print h1.reshape(4,4)
Output:
(16,)
[[ 0. 0. 0. 0.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]]
Both have different applications. You can choose according to your need.

There are multiple ways of doing this. I'll list a few examples:
First, we import numpy and define a function that generates those arrays of length 4.
import numpy as np
def previous_function_returning_array_of_length_4(x):
    return np.array(range(4)) + x
The first way involves creating a list of arrays, then calling numpy.array() to convert the list to a 2D array.
h0 = np.zeros(4)
arrays = [h0]
for x in range(3):
    x1 = previous_function_returning_array_of_length_4(x)
    arrays.append(x1)
h = np.array(arrays)
You can do the same with np.vstack():
h0 = np.zeros(4)
arrays = [h0]
for x in range(3):
    x1 = previous_function_returning_array_of_length_4(x)
    arrays.append(x1)
h = np.vstack(arrays)
Alternatively, if you know how many arrays you are going to create, you can create the 2D array first and fill in the values:
h = np.zeros((4, 4))
for ii in range(3):
    x1 = previous_function_returning_array_of_length_4(ii)
    h[ii + 1, ...] = x1
There are more ways, but hopefully, this will give you an idea of what to do.

It is best to collect values in a list, and perform the concatenate or array creation once, at the end.
h = [np.zeros(4)]
for x in range(3):
    x1 = np.array([3, 5, 6, 7])  # stand-in for the array of length 4 returned from a previous function
    h.append(x1)  # list.append works in place and returns None, so do not reassign h
h = np.array(h)
# or h = np.vstack(h)
All the concatenate/stack/array functions take a list of multiple items. It is faster to append to a list than to concatenate 2 items on every iteration.
======================
Let's try your approach step by step:
In [189]: h=np.zeros(4)
In [190]: h
Out[190]: array([ 0., 0., 0., 0.]) # 1d array (4,) shape
In [191]: x1=np.array([3,5,6,7]) # another 1d
In [192]: h1=np.concatenate((h,x1),axis=0)
In [193]: h1
Out[193]: array([ 0., 0., 0., 0., 3., 5., 6., 7.])
In [194]: h1.shape
Out[194]: (8,) # also a 1d array, but with 8 items
In [195]: x1=np.array([6,3,6,7])
In [196]: h1=np.concatenate((h1,x1),axis=0)
In [197]: h1
Out[197]: array([ 0., 0., 0., 0., 3., 5., 6., 7., 6., 3., 6., 7.])
In this case I'm adding (4,) arrays one after the other, still getting a 1d array.
If I go back and create x1 as 2d (1,4):
In [198]: h=np.zeros(4)
In [199]: x1=np.array([[6,3,6,7]])
In [200]: h1=np.concatenate((h,x1),axis=0)
...
ValueError: all the input arrays must have same number of dimensions
I get this dimension error right away.
The fact that you get the error on the 2nd iteration suggests that the 1st x1 is (4,), but the 2nd is 2d.
When you have dimensions errors like this, check the shapes.
vstack adds dimensions to the inputs, as needed, so you can build 2d arrays:
In [207]: h=np.zeros(4)
In [208]: x1=np.array([3,5,6,7])
In [209]: h=np.vstack((h,x1))
In [210]: h
Out[210]:
array([[ 0., 0., 0., 0.],
[ 3., 5., 6., 7.]])
In [211]: x1=np.array([6,3,6,7])
In [212]: h=np.vstack((h,x1))
In [213]: h
Out[213]:
array([[ 0., 0., 0., 0.],
[ 3., 5., 6., 7.],
[ 6., 3., 6., 7.]])
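If you would rather stay with concatenate, one option is to make every input 2d yourself. The following is a minimal sketch in Python 3; the rows list is a stand-in for whatever the previous function returns.

```python
import numpy as np

h = np.zeros((1, 4))  # start as 2d: one row, four columns
# Stand-ins for the (4,) arrays returned by the previous function
rows = [np.array([3, 5, 6, 7]), np.array([6, 3, 6, 7])]

for x1 in rows:
    # x1[np.newaxis, :] turns shape (4,) into (1, 4) so the dimensions match
    h = np.concatenate((h, x1[np.newaxis, :]), axis=0)
    print(h.shape)

print(h)
```

This is roughly what vstack does for you, which is why vstack is usually the more convenient choice here.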

Simple NumPy array reference

I am having issues understanding how X and y are referenced for training.
I have a simple csv file with 5 numeric columns that I am loading into a NumPy array as follows:
url = "http://www.xyz/shortDataFinal.data"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:3] #Does this mean columns 1-4?
y = dataset[:,4] #Is this the 5th column?
I think I am referencing my X values incorrectly.
Here is what I need:
X values reference columns 1-4 and my y value is the last column, which is the 5th. If I understand correctly, I should be referencing array indices 0:3 for the X values and number 4 for the y as I have done above. However, those values aren't correct. In other words, the values returned by the array don't match the values in the data - they are off by one column (index).

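Your interpretation is nearly correct. dataset is a 2-D array in this case, so the numpy indexing operators ([]) use the conventional row, column format. The catch is that a slice excludes its stop index.
X = dataset[:,0:3] is interpreted as "all rows for columns 0 through 2" (three columns), so to get the first four columns you need dataset[:,0:4]. y = dataset[:,4] is interpreted as "all rows for column 4", which is the 5th column.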

Using a multiline string as a standin for a csv file:
In [332]: txt=b"""0, 1, 2, 4, 5
.....: 6, 7, 8, 9, 10
.....: """
In [333]: data=np.loadtxt(txt.splitlines(), delimiter=',')
In [334]: data
Out[334]:
array([[ 0., 1., 2., 4., 5.],
[ 6., 7., 8., 9., 10.]])
In [335]: data.shape
Out[335]: (2, 5)
In [336]: data[:,0:4]
Out[336]:
array([[ 0., 1., 2., 4.],
[ 6., 7., 8., 9.]])
In [337]: data[:,4]
Out[337]: array([ 5., 10.])
numpy indexing starts at 0; [0:4] is the same (more or less) as the list of numbers starting at 0, up to, but not including 4.
In [339]: np.arange(0,4)
Out[339]: array([0, 1, 2, 3])
Another way to get all but the last column is to use -1 indexing
In [352]: data[:,:-1]
Out[352]:
array([[ 0., 1., 2., 4.],
[ 6., 7., 8., 9.]])
Often a CSV file is a mix of numeric and string values. The loadtxt dtype parameter has a short explanation of how you can load and access that as a structured array. genfromtxt is easier to use for that (though no less confusing).
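As a rough illustration of the structured-array route, here is a minimal Python 3 sketch with genfromtxt. The CSV text and its column names (a, b, label) are made up for the example.

```python
import io
import numpy as np

# Hypothetical CSV text with a header row and one string column
csv_text = "a,b,label\n1.5,2,x\n3.5,4,y\n"

# names=True takes field names from the header; dtype=None infers a type per column
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                     names=True, dtype=None, encoding="utf-8")

print(data.dtype.names)  # field names taken from the header row
print(data["a"])         # one column, accessed by name
print(data["label"])
```

The result is a 1-D array of records, so columns are selected by field name rather than by a column index such as data[:, 0].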

NumPy array sum reduce

I have a numpy array with three columns of the form:
x1 y1 f1
x2 y2 f2
...
xn yn fn
The (x,y) pairs may repeat. I would need another array such that each (x,y) pair appears once and the corresponding third column is the sum of all the f values that appeared next to (x,y).
For example, the array
1 2 4.0
1 1 5.0
1 2 3.0
0 1 9.0
would give
0 1 9.0
1 1 5.0
1 2 7.0
The order of rows is not relevant. What is the fastest way to do this in Python?
Thank you!

This would be one approach to solve it -
import numpy as np
# Input array
A = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
# Extract xy columns
xy = A[:,0:2]
# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(xy.T)
sorted_xy = xy[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_xy,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# OR df2 = np.append([True],np.logical_or(df1[:,0]!=0,df1[:,1]!=0),0)
# OR df2 = np.append([True],np.dot(df1!=0,[True,True]),0)
# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1
# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels
# Get unique indices
unq_idx = sorted_idx[df2]
# Get counts and unique rows and setup output array
counts = np.bincount(labels, weights=A[:,2])
unq_rows = xy[unq_idx,:]
out = np.append(unq_rows,counts.ravel()[:,None],1)
Input & Output -
In [169]: A
Out[169]:
array([[ 1., 2., 4.],
[ 1., 1., 5.],
[ 1., 2., 3.],
[ 0., 1., 9.]])
In [170]: out
Out[170]:
array([[ 0., 1., 9.],
[ 1., 1., 5.],
[ 1., 2., 7.]])
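On NumPy versions where numpy.unique accepts an axis argument, the sort, diff and label steps above can probably be collapsed into a single call. This is a sketch of that alternative, not the method used above; the ravel() is a precaution in case inverse is not returned as a 1-D array.

```python
import numpy as np

A = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])

# Unique (x, y) rows, plus for each original row the index of its unique row
unq_rows, inverse = np.unique(A[:, :2], axis=0, return_inverse=True)
# Sum the third column per group of identical (x, y) pairs
sums = np.bincount(inverse.ravel(), weights=A[:, 2])
out = np.column_stack([unq_rows, sums])

print(out)
```

The rows of out come back sorted by (x, y), which is fine here since the order of rows is not relevant.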

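Thanks to @hpaulj, I finally found the simplest solution. If d contains the 3-column data (one row per (x, y, f) triple):
import numpy as np
ind = d[:, 0:2].astype(int).T  # shape (2, n): x indices, then y indices
x = np.zeros(shape=(N, N))
np.add.at(x, tuple(ind), d[:, 2])  # repeated (x, y) pairs accumulate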
This solution assumes that the (x,y) indices in the first two columns are integer and smaller than N. This is what I need and should have mentioned in the post.
Edit: Note that the above solution produces an N x N matrix that is mostly zeros, with the sum values at position (x,y) within the matrix.
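For reference, here is a self-contained Python 3 sketch of this approach on the example data from the question. It assumes N = 3 and integer indices in range(N); the names grid, xi and yi are only illustrative.

```python
import numpy as np

d = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])
N = 3  # assumed: every x and y index is an integer in range(N)

xi = d[:, 0].astype(int)
yi = d[:, 1].astype(int)
grid = np.zeros((N, N))
# Unbuffered in-place add: repeated (x, y) pairs accumulate instead of overwriting
np.add.at(grid, (xi, yi), d[:, 2])

# Recover the compact three-column form (drops pairs whose sum is exactly zero)
xs, ys = np.nonzero(grid)
out = np.column_stack([xs, ys, grid[xs, ys]])
print(out)
```

A plain grid[xi, yi] += d[:, 2] would not work here, because with repeated index pairs only one of the additions is kept.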

Certainly easily done in Python:
arr = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
d={}
for x, y, z in arr:
    d.setdefault((x,y), 0)
    d[x,y]+=z
>>> d
{(1.0, 2.0): 7.0, (0.0, 1.0): 9.0, (1.0, 1.0): 5.0}
Then translate back to numpy:
>>> np.array([[x,y,d[(x,y)]] for x,y in d.keys()])
array([[ 1., 2., 7.],
[ 0., 1., 9.],
[ 1., 1., 5.]])

If you have scipy, the sparse module does this kind of addition - again for an array where the 1st 2 columns are integers - ie. indexes.
from scipy import sparse
M = sparse.csr_matrix((d[:,2], (d[:,0].astype(int), d[:,1].astype(int)))) # (data, (row, col))
M = M.tocoo() # there may be a short cut to this csr coo round trip
x = np.column_stack([M.row, M.col, M.data]) # needs testing
For convenience in constructing certain kinds of linear algebra matrices, the csr sparse array format sums values with duplicate indices. It's implemented in compiled code so should be fairly fast. But putting the data into M and taking it back out might slow it down.
(ps. I haven't tested this script since I'm writing this on a machine without scipy).

optimizing python file related to heapq.nlargest and extend using loop

My objective is to find the few (3 in this example) largest values in one list, fourire, identify their positions in the list, and obtain the corresponding (position-wise) values in the other list, freq, so the printout should look like
2. 27.
9. 25.
4. 22.
The attached Python code works fine... sort of.
** Note that I am dealing with numpy arrays, so index() does not work.
Is there a way to improve the following?
import heapq
freq = [ 2., 8., 1., 6., 9., 3., 6., 9., 4., 8., 12.]
fourire = [ 27., 3., 2., 7., 4., 9., 10., 25., 22., 5., 3.]
out = heapq.nlargest(3, enumerate(fourire), key=lambda x:x[1])
elem_fourire = []
elem_freq = []
for i in range(len(out)):
    (key, value) = out[i]
    elem_freq.extend([freq[key]])
    elem_fourire.extend([value])
for i in range(len(out)):
    print elem_freq[i], elem_fourire[i]

import numpy as np
fourire = np.array(fourire)
freq = np.array(freq)
ix = fourire.argsort(kind='heapsort')[-3:][::-1]
for a, b in zip(freq[ix],fourire[ix]):
    print a, b
prints
2.0 27.0
9.0 25.0
4.0 22.0
If you want to use heapq instead of numpy, a slight modification of your code above yields:
ix = heapq.nlargest(3,range(len(freq)),key=lambda x: fourire[x])
for x in ix:
    print freq[x], fourire[x]
results in the same output
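For large arrays, a full sort is more work than needed when only a few top values are wanted. As a possible alternative (not part of the answer above), here is a Python 3 sketch using numpy.argpartition; k and top are illustrative names.

```python
import numpy as np

freq = np.array([2., 8., 1., 6., 9., 3., 6., 9., 4., 8., 12.])
fourire = np.array([27., 3., 2., 7., 4., 9., 10., 25., 22., 5., 3.])
k = 3

# Indices of the k largest values, in no particular order
top = np.argpartition(fourire, -k)[-k:]
# Order just those k indices from largest to smallest value
top = top[np.argsort(fourire[top])[::-1]]

for f, v in zip(freq[top], fourire[top]):
    print(f, v)
```

Only the k selected entries are sorted at the end, so the cost of ordering them stays small no matter how long the input is.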

assign index dependant value to each index in numpy array

I want to center multi-dimensional data in an n x m matrix (<class 'numpy.matrixlib.defmatrix.matrix'>), let's say X. I defined a new array ones(645), let's say centVector, to hold the mean of every row in matrix X. Now I want to iterate over every row in X, compute the mean, and assign this value to the corresponding index in centVector. Isn't this possible in a single line in scipy/numpy? I am not used to this language and am thinking of something like:
centVector = ones(645)
for key, val in X:
    centVector[key] = centVector[key] * (val.sum/val.size)
Afterwards I just need to subtract the mean from every row:
X = X - centVector
How can I simplify this?
EDIT: And besides, the above code is not actually working - for a key-value loop I need something like enumerate(X). And I am not sure if X - centVector is returning the proper solution.

First, some example data:
>>> import numpy as np
>>> X = np.matrix(np.arange(25).reshape((5,5)))
>>> print X
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
numpy conveniently has a mean function. By default however, it'll give you the mean over all the values in the array. Since you want the mean of each row, you need to specify the axis of the operation:
>>> np.mean(X, axis=1)
matrix([[ 2.],
[ 7.],
[ 12.],
[ 17.],
[ 22.]])
Note that axis=1 says: find the mean along the columns (for each row), where 0 = rows and 1 = columns (and so on). Now, you can subtract this mean from your X, as you did originally.
Unsolicited advice
Usually, it's best to avoid the matrix class (see docs). If you remove the np.matrix call from the example data, then you get a normal numpy array.
Unfortunately, in this particular case, using an array slightly complicates things because np.mean will return a 1D array:
>>> X = np.arange(25).reshape((5,5))
>>> r_means = np.mean(X, axis=1)
>>> print r_means
[ 2. 7. 12. 17. 22.]
If you try to subtract this from X, r_means gets broadcast to a row vector, instead of a column vector:
>>> X - r_means
array([[ -2., -6., -10., -14., -18.],
[ 3., -1., -5., -9., -13.],
[ 8., 4., 0., -4., -8.],
[ 13., 9., 5., 1., -3.],
[ 18., 14., 10., 6., 2.]])
So, you'll have to reshape the 1D array into an N x 1 column vector:
>>> X - r_means.reshape((-1, 1))
array([[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.]])
The -1 passed to reshape tells numpy to figure out this dimension based on the original array shape and the rest of the dimensions of the new array. Alternatively, you could have reshaped the array using r_means[:, np.newaxis].
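Another option that avoids the reshape altogether is the keepdims argument of mean. A minimal Python 3 sketch on the same example data (row_means and centered are illustrative names):

```python
import numpy as np

X = np.arange(25).reshape((5, 5))
# keepdims=True leaves the reduced axis in place, so the result has shape (5, 1)
row_means = X.mean(axis=1, keepdims=True)
# The (5, 1) column broadcasts across the columns of X
centered = X - row_means

print(row_means.shape)
print(centered)
```

Since the means already form a column, the subtraction broadcasts the way the question intended.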
