Faster way to extract one numpy array using another - python

I have two large one-dimensional numpy arrays (~1e8+ elements) of the same size. Array 1 (a1) holds integer indicator values that tell me the meaning of the values in array 2 (a2), which holds measurement values. I want to separate the values in a2 based on the values in a1: iterate over the possible values in a1 and select the corresponding values (by index) from a2. Because of the data size and the number of times I have to do this, it needs to be as fast as possible.
Here's a simple example to hopefully explain my problem more clearly:
a1 = [0, 2, 1, 1, 0, 2] # actual values range from 0 to 4095
a2 = [0.5, 2.4, 1.0, 1.2, 0.4, 2.6] # dummy values for example
I would like to separate the values in a2 based on the values in a1. I am saving each separated array into an HDF5 dataset.
So, in the end, I need an array for each value in a1:
out = [0.5,0.4] # a1 == 0
out = [1.0, 1.2] # a1 == 1
out = [2.4, 2.6] # a1 == 2
Currently, I've tried the following:
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
for i in range(0, 4096):
    ind = a1 == i
    out = a2[ind]
    # more code here to save out to h5 file
Based on some timeit testing, np.extract works just slightly faster than this boolean masking approach for large arrays:
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
for i in range(0, 4096):
    ind = a1 == i
    out = np.extract(ind, a2)
    # more code here to save out to h5 file
I also tried putting this into a pandas series with a1 as the index, but this was much slower.
import pandas as pd
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
s = pd.Series(a2, index=a1)
for i in range(0, 4096):
    out = s.loc[i]  # this is WAY slower than numpy
    # more code here to save out to h5 file
I'm wondering if there's a faster way of doing this? Maybe with sorting and searchsorted? Is it possible to generate a "database-like index" for a numpy array?
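For reference, here is a rough sketch of the sort-based idea I'm wondering about (untested; it assumes the a1 values are non-negative integers below 4096, as in my case): sort once with argsort, then find each value's block with searchsorted:
import numpy as np
size = int(1e8)
a1 = np.random.randint(0, 4096, size=size)
a2 = np.random.rand(size)
# one O(n log n) sort instead of 4096 full scans of a1
order = np.argsort(a1, kind='stable')
a2_sorted = a2[order]
# block boundaries: bounds[i] is where value i starts in the sorted order
bounds = np.searchsorted(a1[order], np.arange(4097))
for i in range(4096):
    out = a2_sorted[bounds[i]:bounds[i + 1]]
    # more code here to save out to h5 file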

As far as I can see, both of your source arrays are plain Python lists,
so the first step is to convert them to NumPy arrays:
arr1 = np.array(a1)
arr2 = np.array(a2)
Then, to get your expected result as a list of Numpy arrays, run:
rng = arr1.max() + 1
result = [ arr2[np.nonzero(arr1 == i)] for i in range(rng) ]
Details:
np.nonzero(arr1 == i) - generates the array of indices at which arr1
equals i,
arr2[...] - retrieves the elements of arr2 at those indices,
[ ... for i in range(rng) ] - a list comprehension generating
one output array for each value from 0 through arr1.max().
For your data sample you will get:
[array([0.5, 0.4]), array([1. , 1.2]), array([2.4, 2.6])]
Or if you want your result as a list of plain lists, change the above
code to:
result = [ arr2[np.nonzero(arr1 == i)].tolist() for i in range(rng) ]
This time you will get:
[[0.5, 0.4], [1.0, 1.2], [2.4, 2.6]]

Related

Fast, efficient approach to average values in one array selected using a key in another array

Sorry for the cryptic description...
I'm working in Python and need a fast solution for the below problem.
I have float values in one array (this array can contain millions of values):
values = [0.1, 0.2, 5.7, 12.9, 3.5, 100.6]
Each value represents an estimate of a quantity at a particular location, where the location is identified by an ID. Multiple estimates per location are possible/common:
locations = [1, 5, 3, 1, 1, 3]
I need to average all of the values that share the same location id.
I can use numpy.where to do this for one location value
average_value_at_location = np.average(values[np.where(locations == 1)])
And of course I could loop over all of the unique values in locations, but I'm looking for a fast (vectorized) way of doing this and can't figure out how to compose the numpy functions to do it without looping in Python.
I'm not tied to numpy for this solution.
Any help will be gratefully received.
Thanks,
Doug
Assuming locations go from 0 to a maximum value of locmax (e.g. locmax=5), you can create a 2d array of nans to store the values at the corresponding location:
placements = np.zeros((values.size, locmax+1)) * np.nan
Then assign all the values using indexing:
placements[np.arange(values.size), locations] = values
Finally, calculate the np.nanmean along axis 0:
means = np.nanmean(placements, axis=0)
For your example this results in:
array([ nan, 5.5 , nan, 53.15, nan, 0.2 ])
Using add.reduceat for every group.
Preparing the arrays
import numpy as np
values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])
Getting the indices to sort the arrays in groups
locsort = np.argsort(locations)
# locations[locsort] -> [ 1, 1, 1, 3, 3, 5]
# values[locsort] -> [0.1, 12.9, 3.5, 5.7, 100.6, 0.2]
Computing the start index for each group
i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
# [0, 3, 5]
# (note: prepend=0 works here because the ids start at 1; use prepend=-1 if 0 is a valid location id)
Adding values per group and dividing by the group size
np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))
# [ 16.5, 106.3, 0.2] / [3, 2, 1]
Output
array([ 5.5 , 53.15, 0.2 ])
OK - I've tried four solutions based on the replies here. So far, the pandas groupby approach is the winner, but the numpy add.reduceat solution proposed by Michael S is a close second.
Using pandas (from the link provided by Ben T)
import numpy as np
import pandas as pd
from timeit import default_timer as timer

# Set up the data arrays
rng = np.random.default_rng(12345)
values = rng.random(size = 100000)
locations = rng.integers(low = 1, high = 25000, size = 100000)
#Create the pandas dataframe
df = pd.DataFrame({"locations":locations, "values": values})
# groupby and mean
start=timer()
average_by_location_pandas = df.groupby(["locations"]).mean()
end=timer()
print("Pandas time :", end-start)
Pandas time : 0.009602722000000008
Using numpy boolean indexing and a list comprehension to loop over unique locations
unique_locations = np.unique(locations)
average_by_location_numpy = [(i, values[locations==i].mean()) for i in unique_locations]
Numpy time : 2.644003632
Using numpy_indexed (link provided by Ben T)
import numpy_indexed as npi
average_by_location_numpy_indexed = npi.group_by(locations).mean(values)
Numpy_indexed time : 0.03701074199999965
Using numpy add.reduceat (solution proposed by Michael S)
locsort = np.argsort(locations)
i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
out = np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))
Numpy add_reduceat time : 0.01057279099999997
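Not benchmarked above, but another compact option is np.bincount; here's a minimal sketch, assuming the location ids are small non-negative integers (ids with no observations come out as nan from the 0/0 division):
import numpy as np
values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])
sums = np.bincount(locations, weights=values)   # per-id sums
counts = np.bincount(locations)                 # per-id counts
with np.errstate(invalid='ignore'):
    means = sums / counts                       # nan where count == 0
# means -> [ nan, 5.5, nan, 53.15, nan, 0.2]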

tensorflow collect similar values from a list

I have a tensor as follows:
arr = [[1.5,0.2],[2.3,0.1],[1.3,0.21],[2.2,0.09],[4.4,0.8]]
I would like to collect small arrays whose difference of first elements are within 0.3 and second elements are within 0.03.
For example [1.5,0.2] and [1.3,0.21] should belong to a same category. The difference of their first elements is 0.2<0.3 and second 0.01<0.03.
I want a tensor that looks like this:
arr = {[[1.5,0.2],[1.3,0.21]],[[2.3,0.1],[2.2,0.09]]}
How to do this in tensorflow? Eager mode is ok.
I found a way which is a bit ugly and slow:
samples = np.array([[1.5,0.2],[2.3,0.1],[1.3,0.2],[2.2,0.09],[4.4,0.8],[2.3,0.11]],dtype=np.float32)
ini_samples = samples
samples = tf.split(samples,2,1)
a = samples[0]
b = samples[1]
find_match1 = tf.reduce_sum(tf.abs(tf.expand_dims(a,0) - tf.expand_dims(a,1)),2)
a = tf.logical_and(tf.greater(find_match1, tf.zeros_like(find_match1)),tf.less(find_match1, 0.3*tf.ones_like(find_match1)))
find_match2 = tf.reduce_sum(tf.abs(tf.expand_dims(b,0) - tf.expand_dims(b,1)),2)
b = tf.logical_and(tf.greater(find_match2, tf.zeros_like(find_match2)),tf.less(find_match2, 0.03*tf.ones_like(find_match2)))
x,y = tf.unique(tf.reshape(tf.where(tf.logical_or(a,b)),[1,-1])[0])
r = tf.gather(ini_samples, x)
Does tensorflow have more elegant functions?
You cannot get a result composed of "groups" of vectors with different sizes. Instead, you can make a "group id" tensor that classifies each vector into a group according to your criteria. The part that makes this a bit more complicated is that you have to "fuse" groups with common elements, which I think can only be done with a loop. This code does something like that:
import tensorflow as tf

def make_groups(correspondences):
    # Multiply each row by its index
    m = tf.to_int32(correspondences) * tf.range(tf.shape(correspondences)[0])
    # Pick the largest index for each row
    r = tf.reduce_max(m, axis=1)
    # While loop accounts for transitive correspondences
    # (e.g. if A and B go together and B and C go together, then A, B and C go together)
    # The loop makes sure every element gets the largest common group id
    r_prev = -tf.ones_like(r)
    r, _ = tf.while_loop(lambda r, r_prev: tf.reduce_any(tf.not_equal(r, r_prev)),
                         lambda r, r_prev: (tf.gather(r, r), tf.identity(r)),
                         [r, r_prev])
    # Use unique indices to make sequential group ids starting from 0
    return tf.unique(r)[1]
# Test
with tf.Graph().as_default(), tf.Session() as sess:
    arr = tf.constant([[1.5 , 0.2 ],
                       [2.3 , 0.1 ],
                       [1.3 , 0.21],
                       [2.2 , 0.09],
                       [4.4 , 0.8 ],
                       [1.1 , 0.23]])
    a = arr[:, 0]
    b = arr[:, 1]
    cond = (tf.abs(a - a[:, tf.newaxis]) < 0.3) | (tf.abs(b - b[:, tf.newaxis]) < 0.03)
    groups = make_groups(cond)
    print(sess.run(groups))
    # [0 1 0 1 2 0]
So in this case, the groups would be:
[1.5, 0.2], [1.3, 0.21] and [1.1, 0.23]
[2.3, 0.1] and [2.2, 0.09]
[4.4, 0.8]
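To actually materialize those groups outside the graph (as plain numpy arrays, since a ragged tensor of groups isn't possible here), a small sketch using the printed group ids:
import numpy as np
samples = np.array([[1.5, 0.2], [2.3, 0.1], [1.3, 0.21],
                    [2.2, 0.09], [4.4, 0.8], [1.1, 0.23]])
group_ids = np.array([0, 1, 0, 1, 2, 0])  # the ids from sess.run(groups)
groups = [samples[group_ids == g] for g in np.unique(group_ids)]
# groups[0] contains rows 0, 2 and 5; groups[1] rows 1 and 3; groups[2] row 4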

Compare two numpy arrays by first Column and create a third numpy array by concatenating two arrays

I have two 2d numpy arrays which is used to plot simulation results.
The first column of both arrays a and b contains the time intervals and the second column contains the data to be plotted. The two arrays have different shapes: a is (500, 2) and b is (600, 2). I want to compare the two arrays by first column and create a third array with the matches found on the first column of a. If no match is found, add 0 to the third column.
Is there any numpy trick to do this?
For instance:
a = [[0.002, 0.998],
     [0.004, 0.997],
     [0.006, 0.996],
     [0.008, 0.995],
     [0.010, 0.993]]

b = [[0.002, 0.666],
     [0.004, 0.665],
     [0.0041, 0.664],
     [0.0042, 0.664],
     [0.0043, 0.664],
     [0.0044, 0.663],
     [0.0045, 0.663],
     [0.0005, 0.663],
     [0.006, 0.663],
     [0.0061, 0.662],
     [0.008, 0.661]]

expected output:

c = [[0.002, 0.998, 0.666],
     [0.004, 0.997, 0.665],
     [0.006, 0.996, 0.663],
     [0.008, 0.995, 0.661],
     [0.010, 0.993, 0]]
I can quickly think of the following solution:
import numpy as np

a = np.array([[0.002, 0.998],
              [0.004, 0.997],
              [0.006, 0.996],
              [0.008, 0.995],
              [0.010, 0.993]])

b = np.array([[0.002, 0.666],
              [0.004, 0.665],
              [0.0041, 0.664],
              [0.0042, 0.664],
              [0.0043, 0.664],
              [0.0044, 0.663],
              [0.0045, 0.663],
              [0.0005, 0.663],
              [0.0006, 0.663],
              [0.00061, 0.662],
              [0.0008, 0.661]])
c = []
for row in a:
    index = np.where(b[:, 0] == row[0])[0]
    if np.size(index) != 0:
        c.append([row[0], row[1], b[index[0], 1]])
    else:
        c.append([row[0], row[1], 0])
print(c)
As pointed out in the comments above, there seems to be a data entry error
import numpy as np
i = np.intersect1d(a[:,0], b[:,0])
overlap = np.vstack([i, a[np.in1d(a[:,0], i), 1], b[np.in1d(b[:,0], i), 1]]).T
underlap = np.setdiff1d(a[:,0], b[:,0])
underlap = np.vstack([underlap, a[np.in1d(a[:,0], underlap), 1], underlap*0]).T
fast_c = np.vstack([overlap, underlap])
This works by taking the intersection of the first column of a and b using intersect1d, and then using in1d to cross-reference that intersection with the second columns.
vstack stacks the elements of the input vertically, and the transpose is needed to get the right dimensions (very fast operation).
Then find times in a that are not in b using setdiff1d, and complete the result by putting 0s in the third column.
This prints out
array([[ 0.002,  0.998,  0.666],
       [ 0.004,  0.997,  0.665],
       [ 0.006,  0.996,  0.   ],
       [ 0.008,  0.995,  0.   ],
       [ 0.01 ,  0.993,  0.   ]])
The following works both for numpy arrays and simple python lists.
c = [[*x, y[1]] for x in a for y in b if x[0] == y[0]]
d = [[*x, 0] for x in a if x[0] not in [y[0] for y in b]]
c.extend(d)
Someone braver than I am could try to make this one line.
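For what it's worth, one hedged attempt at that one-liner, using next() with a default of 0 for unmatched times:
c = [[*x, next((y[1] for y in b if y[0] == x[0]), 0)] for x in a]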

python numpy weighted average with nans

First things first: this is not a duplicate of NumPy: calculate averages with NaNs removed, and I'll explain why.
Suppose I have an array
a = array([1,2,3,4])
and I want to average over it with the weights
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
2.0
ok. So this is pretty straightforward. But now I have something like this:
a = array([1,2,nan,4])
calculating the average with the usual method of course yields nan. Can I avoid this?
In principle I want to ignore the nans, so I'd like to have something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
1.75
One option is to use a MaskedArray, as such:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
First find the indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by @mtrw in comments, it would be cleaner to use a boolean mask here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
I would offer another solution, which scales better to bigger dimensions (e.g. when averaging over a different axis). The attached code works with a 2D array, which may contain nans, and takes the average over axis=0.
import numpy as np
a = np.random.randint(5, size=(3, 2)).astype(float)  # generate some random 2D array (float, so it can hold nans)
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0] + 1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is a vector with the weighted nan-averages of array a taken along axis=0
Expanding on @Ashwini's and @Nicolas's answers, here is a version that also handles the edge case where all the data values are np.nan, and that is designed to work with a pandas DataFrame without type-related issues:
from typing import List, Union

import numpy as np
import pandas as pd

def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in `measures`' values, the weights are
    recalculated based only on the weights for the non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
        calc_wa_ignore_nan approach:
            (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
        The ignoring-nans approach:
            (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
        ...                    "meas2": [2, 2],
        ...                    "meas3": [3, 3],
        ...                    "meas4": [np.nan, 0],
        ...                    "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])
    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Coerce to np.float64 instead of python's float to avoid
    # "ufunc 'isnan' not supported for the input types ..." errors
    data = np.array(df[measures].values, dtype=np.float64)
    # Make a 2d array with the same weights for each row (cast for safety and better errors)
    weights = np.array([weights] * data.shape[0], dtype=np.float64)
    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)
    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan, otherwise those elements
    # will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)
    return weighted_avgs
All the solutions above are very good, but they don't handle the case where there are nans in the weights. To do so, using pandas:
import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
    den = 0
    num = 0
    for index, row in df.iterrows():
        if ~np.isnan(row[col_weight]) & ~np.isnan(row[col_value]):
            den = den + row[col_weight]
            num = num + row[col_weight] * row[col_value]
    return num / den
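For comparison, a vectorized sketch of the same idea, masking out the positions where either the value or the weight is nan (plain numpy arrays assumed):
import numpy as np

def weighted_average_ignoring_nan_vec(values, weights):
    # keep only positions where both the value and its weight are non-nan
    mask = ~(np.isnan(values) | np.isnan(weights))
    return np.average(values[mask], weights=weights[mask])

v = np.array([1.0, 2.0, np.nan, 4.0])
w = np.array([4.0, 3.0, 2.0, np.nan])
print(weighted_average_ignoring_nan_vec(v, w))  # (1*4 + 2*3) / (4 + 3) = 1.428...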
Since you're looking for the mean, another idea is to simply replace all the nan values with 0's:
>>> import numpy as np
>>> a = np.array([[ 3.,  2.,  5.], [np.nan,  4., np.nan], [np.nan, np.nan, np.nan]])
>>> w = np.array([[ 1.,  2.,  3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>> a[np.isnan(a)] = 0
>>> w[np.isnan(w)] = 0
>>> np.average(a, weights=w)
3.6666666666666665
This can be used with the axis functionality of the average function, but be careful that your weights don't sum up to 0. Note also that a nan value with a nonzero weight would be counted as 0, so this only matches the masked approaches when the nan values have nan (or zero) weights, as in the example above.

How to add column to numpy array

I am trying to add one column to the array created from recfromcsv. In this case it's an array: [210,8] (rows, cols).
I want to add a ninth column. Empty or with zeroes doesn't matter.
from numpy import genfromtxt
from numpy import recfromcsv
import numpy as np
import time
if __name__ == '__main__':
    print("testing")
    my_data = recfromcsv('LIAB.ST.csv', delimiter='\t')
    array_size = my_data.size
    #my_data = np.append(my_data[:array_size], my_data[9:], 0)
    new_col = np.sum(x, 1).reshape((x.shape[0], 1))
    np.append(x, new_col, 1)
I think that your problem is that you are expecting np.append to add the column in-place. Because of how numpy data is stored, it instead creates a copy of the joined arrays:
Returns
-------
append : ndarray
    A copy of `arr` with `values` appended to `axis`. Note that `append`
    does not occur in-place: a new array is allocated and filled. If
    `axis` is None, `out` is a flattened array.
so you need to save the output all_data = np.append(...):
my_data = np.random.random((210,8)) #recfromcsv('LIAB.ST.csv', delimiter='\t')
new_col = my_data.sum(1)[...,None] # None keeps (n, 1) shape
new_col.shape
#(210,1)
all_data = np.append(my_data, new_col, 1)
all_data.shape
#(210,9)
Alternative ways:
all_data = np.hstack((my_data, new_col))
#or
all_data = np.concatenate((my_data, new_col), 1)
I believe that the only difference between these three functions (as well as np.vstack) is their default behavior for when axis is unspecified (see the sketch after this list):
concatenate assumes axis = 0
hstack assumes axis = 1 unless inputs are 1d, then axis = 0
vstack assumes axis = 0 after adding an axis if inputs are 1d
append flattens array
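A quick sketch illustrating those defaults (resulting shapes in the comments):
import numpy as np
x = np.ones((2, 3))
np.concatenate((x, x)).shape  # (4, 3) -- axis=0 by default
np.hstack((x, x)).shape       # (2, 6) -- axis=1 for 2d inputs
np.vstack((x, x)).shape       # (4, 3)
np.append(x, x).shape         # (12,)  -- flattened when axis is None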
Based on your comment, and looking more closely at your example code, I now believe that what you are probably looking to do is add a field to a record array. You imported both genfromtxt which returns a structured array and recfromcsv which returns the subtly different record array (recarray). You used the recfromcsv so right now my_data is actually a recarray, which means that most likely my_data.shape = (210,) since recarrays are 1d arrays of records, where each record is a tuple with the given dtype.
So you could try this:
import numpy as np
from numpy.lib.recfunctions import append_fields
x = np.random.random(10)
y = np.random.random(10)
z = np.random.random(10)
data = np.array( list(zip(x,y,z)), dtype=[('x',float),('y',float),('z',float)])
data = np.recarray(data.shape, data.dtype, buf=data)
data.shape
#(10,)
tot = data['x'] + data['y'] + data['z'] # sum(axis=1) won't work on recarray
tot.shape
#(10,)
all_data = append_fields(data, 'total', tot, usemask=False)
all_data
#array([(0.4374783740738456 , 0.04307289878861764, 0.021176067323686598, 0.5017273401861498),
# (0.07622262416466963, 0.3962146058689695 , 0.27912715826653534 , 0.7515643883001745),
# (0.30878532523061153, 0.8553768789387086 , 0.9577415585116588 , 2.121903762680979 ),
# (0.5288343561208022 , 0.17048864443625933, 0.07915689716226904 , 0.7784798977193306),
# (0.8804269791375121 , 0.45517504750917714, 0.1601389248542675 , 1.4957409515009568),
# (0.9556552723429782 , 0.8884504475901043 , 0.6412854758843308 , 2.4853911958174133),
# (0.0227638618687922 , 0.9295332854783015 , 0.3234597575660103 , 1.275756904913104 ),
# (0.684075052174589 , 0.6654774682866273 , 0.5246593820025259 , 1.8742119024637423),
# (0.9841793718333871 , 0.5813955915551511 , 0.39577520705133684 , 1.961350170439875 ),
# (0.9889343795296571 , 0.22830104497714432, 0.20011292764078448 , 1.4173483521475858)],
# dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8'), ('total', '<f8')])
all_data.shape
#(10,)
all_data.dtype.names
#('x', 'y', 'z', 'total')
If you have an array, a of say 210 rows by 8 columns:
import numpy
a = numpy.empty([210, 8])
and want to add a ninth column of zeros you can do this:
b = numpy.append(a,numpy.zeros([len(a),1]),1)
The easiest solution is to use numpy.insert().
The advantage of np.insert() over np.append() is that you can insert the new columns at custom indices.
import numpy as np
X = np.arange(20).reshape(10,2)
X = np.insert(X, [0,2], np.random.rand(X.shape[0]*2).reshape(-1,2)*10, axis=1)
np.append or np.hstack expects the appended column to be the proper shape, that is N x 1. We can use np.zeros to create this zeros column (or np.ones to create a ones column) and append it to our original matrix (2D array).
def append_zeros(x):
    zeros = np.zeros((len(x), 1))  # zeros column as 2D array
    return np.hstack((x, zeros))   # append column
I add a new column of ones to a matrix array in this way:
Z = np.append([[1 for _ in range(len(Z))]], Z.T, 0).T
Maybe it is not that efficient?
It can be done like this:
import numpy as np
# create a random matrix:
A = np.random.normal(size=(5,2))
# add a column of zeros to it:
print(np.hstack((A,np.zeros((A.shape[0],1)))))
In general, if A is an m*n matrix and you need to add a column, you have to create an m*1 matrix of zeros, then use "hstack" to add the matrix of zeros to the right of the matrix A.
Similar to some of the other answers suggesting using numpy.hstack, but more readable:
import numpy as np
# declare 10 rows x 3 cols integer array of all 1s
arr = np.ones((10, 3), dtype=np.int64)
# get the number of rows in the original array (as if we didn't know it was 10 or it could be different in other cases)
numRows = arr.shape[0]
# declare the new array which will be the new column, integer array of all 0s so it's visually distinct from the original array
additionalColumn = np.zeros((numRows, 1), dtype=np.int64)
# use hstack to tack on the additional column
result = np.hstack((arr, additionalColumn))
print(result)
result:
$ python3 scratchpad.py
[[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]]
Here's a shorter one-liner:
import numpy as np
data = np.random.rand(210, 8)
data = np.c_[data, np.zeros(len(data))]
This is something I often use to convert points to homogeneous coordinates, with np.ones instead of np.zeros.
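For example, a minimal sketch of that homogeneous-coordinates use:
import numpy as np
points = np.random.rand(5, 2)                      # some 2d points
homogeneous = np.c_[points, np.ones(len(points))]  # shape (5, 3)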
