I have a numpy array a, a.shape=(17,90,144). I want to find the maximum magnitude of each column of cumsum(a, axis=0), but retaining the original sign. In other words, if for a given column a[:,j,i] the largest magnitude of cumsum corresponds to a negative value, I want to retain the minus sign.
The code np.amax(np.abs(a.cumsum(axis=0))) gets me the magnitude, but doesn't retain the sign. Using np.argmax instead will get me the indices I need, which I can then plug into the original cumsum array. But I can't find a good way to do so.
The following code works, but is dirty and really slow:
max_mag_signed = np.zeros((90,144))
indices = np.argmax(np.abs(a.cumsum(axis=0)), axis=0)
for j in range(90):
for i in range(144):
max_mag_signed[j,i] = a.cumsum(axis=0)[indices[j,i],j,i]
There must be a cleaner, faster way to do this. Any ideas?
I can't find any alternatives to argmax but at least you can fasten that with a more vectorized approach:
# store the cumsum, since it's used multiple times
cum_a = a.cumsum(axis=0)
# find the indices as before
indices = np.argmax(abs(cum_a), axis=0)
# construct the indices for the second and third dimensions
y, z = np.indices(indices.shape)
# get the values with np indexing
max_mag_signed = cum_a[indices, y, z]
Related
From a matrix filled with values (see picture), I want to obtain a matrix with at most one value for every row and column. If there is more than one value, the maximum should be kept and the other set to 0. I know I can do that with np.max and np.argmax, but I'm wondering if there is some clever way to do it that I'm not aware of.
Here's the solution I have for now:
tmp = np.zeros_like(matrix)
for x in np.argmax(matrix, axis=0): # get max on x axis
for y in np.argmax(matrix, axis=1): # get max on y axis
tmp[x][y] = matrix[x][y]
matrix = tmp
The sparse structure may be used for efficiency, however right now I see a contradiction between at most one value for every row and column and your current implementation which may leave more than one value per row/column.
Either you need an order to prefer rows over columns or to go along an absolute sorting of all matrix values.
Just an idea which produces for sure at most one entry per row and column would be to, firstly select the maxima of rows, and secondly select from this intermediate matrix the maxima of columns:
import numpy as np
rows=5
cols=5
matrix=np.random.rand(rows, cols)
rowmax=np.argmax(matrix, 1)
rowmax_matrix=np.zeros((rows, cols))
for ri, rm in enumerate(rowmax):
rowmax_matrix[ri,rm]=matrix[ri, rm]
colrowmax=np.argmax(rowmax_matrix, 0)
colrowmax_matrix=np.zeros((rows, cols))
for ci, cm in enumerate(colrowmax):
colrowmax_matrix[cm, ci]=rowmax_matrix[cm, ci]
This is probably not the final answer, but may help to formulate the desired algorithm precisely.
Looking to print the minimum values of numpy array columns.
I am using a loop in order to do this.
The array is shaped (20, 3) and I want to find the min values of columns, starting with the first (i.e. col_value=0)
I have coded
col_value=0
for col_value in X:
print(X[:, col_value].min)
col_value += 1
However, it is coming up with an error
"arrays used as indices must be of integer (or boolean) type"
How do I fix this?
Let me suggest an alternative approach that you might find useful. numpy min() has axis argument that you can use to find min values along various
dimensions.
Example:
X = np.random.randn(20, 3)
print(X.min(axis=0))
prints numpy array with minimum values of X columns.
You don't need col_value=0 nor do you need col_value+=1.
x = numpy.array([1,23,4,6,0])
print(x.min())
EDIT:
Sorry didn't see that you wanted to iterate through columns.
import numpy as np
X = np.array([[1,2], [3,4]])
for col in X.T:
print(col.min())
Transposing the axis of the matrix is one the best solution.
X=np.array([[11,2,14],
[5,15, 7],
[8,9,20]])
X=X.T #Transposing the array
for i in X:
print(min(i))
I have a list of indices (list(int)) and a list of summing indices (list(list(int)). Given a 2D numpy array, I need to find the sum over indices in the second list for each column and add them to the corresponding indices in the first column. Is there any way to vectorize this?
Here is the normal code:
indices = [1,0,2]
summing_indices = [[5,6,7],[6,7,8],[4,5]]
matrix = np.arange(9*3).reshape((9,3))
for c,i in enumerate(indices):
matrix[i,c] = matrix[summing_indices[i],c].sum()+matrix[i,c]
Here's an almost* vectorized approach using np.add.reduceat -
lens = np.array(map(len,summing_indices))
col = np.repeat(indices,lens)
row = np.concatenate(summing_indices)
vals = matrix[row,col]
addvals = np.add.reduceat(vals,np.append(0,lens.cumsum()[:-1]))
matrix[indices,np.arange(len(indices))] += addvals[indices.argsort()]
Please note that this has some setup overhead, so it would be best suited for 2D input arrays with a good number of columns as we are iterating along the columns.
*: Almost because of the use of map() at the start, but computationally that should be negligible.
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
pd.expanding_quantile(df.T.unstack(), integer/100.0)
for integer in range(0,101,1)})
percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)
for integer in range(0,100,1):
percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
(df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analagous to finding the position we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast with dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data, and with each row iteration, add it to our existing list and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
""" Reconstruct a DataFrame of expanding quantiles by row """
# Construct skeleton of DataFrame what we'll fill with quantile values
quantile_df = pd.DataFrame(np.NaN, index=df.index, columns=df.columns)
# Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
num_valid = np.sum(~np.isnan(df.values))
sorted_array = np.empty(num_valid)
# We want to maintain that sorted_array[:length] has data and is sorted
length = 0
# Iterates over ndarray rows
for i, row_array in enumerate(df.values):
# Extract non-NaN numpy array from row
row_is_nan = np.isnan(row_array)
add_array = row_array[~row_is_nan]
# Add new data to our sorted_array and sort.
new_length = length + len(add_array)
sorted_array[length:new_length] = add_array
length = new_length
sorted_array[:length].sort(kind="mergesort")
# Query the relative positions, divide by length to get quantiles
quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(np.float) / length
# Insert values into quantile_df
quantile_df.iloc[i][~row_is_nan] = quantile_row
return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(np.float) / (length * 2)
Yet not quite clear, but do you want a cumulative sum divided by total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezeing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)
for current_index in df.index:
preceding_rows = df.loc[:current_index, :]
# Combine values from all columns into a single 1D array
# * 2 should be * N if you have N columns
combined = preceding_rows.values.reshape((1, len(preceding_rows) *2)).squeeze()
a_percentile[current_index] = stats.percentileofscore(
combined,
df.loc[current_index, 'a'],
kind='weak'
)
b_percentile[current_index] = stats.percentileofscore(
combined,
df.loc[current_index, 'b'],
kind='weak'
)
Does anyone know how to combine integer indices in numpy? Specifically, I've got the results of a few np.wheres and I would like to extract the elements that are common between them.
For context, I am trying to populate a large 3d array with the number of elements that are between boundary values of each cell, i.e. I have records of individual events including their time, latitude and longitude. I want to grid this into a 3D frequency matrix, where the dimensions are time, lat and lon.
I could loop round the array elements doing an np.where(timeCondition & latCondition & lonCondition), population with the length of the where result, but I figured this would be very inefficient as you would have to repeat a lot of the wheres.
What would be better is to just have a list of wheres for each of the cells in each dimension, and then loop through the logically combining them?
as #ali_m said, use bitwise and should be much faster, but to answer your question:
call ravel_multi_index() to convert the multi-dim index into 1-dim index.
call intersect1d() to get the index that in both condition.
call unravel_index() to convert the 1-dim index back to multi-dim index.
Here is the code:
import numpy as np
a = np.random.rand(10, 20, 30)
idx1 = np.where(a>0.2)
idx2 = np.where(a<0.4)
ridx1 = np.ravel_multi_index(idx1, a.shape)
ridx2 = np.ravel_multi_index(idx2, a.shape)
ridx = np.intersect1d(ridx1, ridx2)
idx = np.unravel_index(ridx, a.shape)
np.allclose(a[idx], a[(a>0.2) & (a<0.4)])
or you can use ridx directly:
a.ravel()[ridx]