itmPaths is a list with 719255 (integer) values.
Pt is a 719255x1 matrix/array with float64 values.
C is a 719255x1 matrix/array with float64 values.
I would like to find the index values where Pt > C, use those indices to pull the corresponding values out of itmPaths, and store the result in a new array called exPaths. I have tried using the following code:
exPaths = itmPaths[index for index,value in enumerate(Pt-C) if value > 0]
In Matlab I can successfully do this using:
exPaths = itmPaths(Pt>C);
I would like to keep the code as efficient as possible. Thanks.
Using a list comprehension, you could do this. I don't know the exact structure of what you call a matrix, so you may need to adapt it, but zipping Pt and C lets you keep track of the index (to extract the value afterwards) together with the values (to apply the condition):
exPaths = [itmPaths[idx] for idx, (pt, c) in enumerate(zip(Pt, C)) if pt > c]
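If Pt and C are NumPy arrays (as the 719255x1 shape suggests), a boolean mask gets you much closer to the MATLAB one-liner and avoids the Python-level loop entirely. A minimal sketch under that assumption, reusing the names from the question:

import numpy as np

mask = (Pt > C).ravel()               # flatten the Nx1 comparison to a 1-D boolean mask
exPaths = np.asarray(itmPaths)[mask]  # boolean-index the list as an array

If you need exPaths back as a plain list, wrap the result in list(...) or use [p for p, keep in zip(itmPaths, mask) if keep].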
I have kind of an odd numpy ndarray, it looks like this:
[[0, 1, 3, list([0, 1])],
 [0, 0, 0, list([0])],
 [0, 0, 0, list([])],
 [1, 1, 1, list([1, 2, 3, 4, 5, 6, 7])]]
so, the column at index 3 is actually of type list.
I would like to extract the rows which contain a key value in the list contained in index 3, so for instance, something like magic_function(matrix, 0) might return [[0,1,3,list([0,1])], [0,0,0,list([0])]], as the rows at index 0 and 1 contain the value 0 in the list at index 3.
I tried using a combination of np.where and np.isin, but couldn't quite get it working in a way that was elegant (things like matrix[np.where(np.isin(matrix[:,3], 0))]). I would prefer the fastest approach, which I believe would be an approach using numpy rather than iteration in Python.
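One possible approach, sketched under the assumption that matrix is an object-dtype ndarray: np.isin can't look inside list-valued cells, so the membership test has to happen at the Python level, but you can still build a boolean mask and index with it (magic_function is just the hypothetical name from the question):

import numpy as np

def magic_function(matrix, key):
    # Per-row membership test on the list stored in column 3
    mask = np.fromiter((key in row[3] for row in matrix), dtype=bool, count=len(matrix))
    return matrix[mask]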
I am trying to add/subtract a random number from existing elements (floats) in a pandas DataFrame (Python).
indices is a random subset of the index, and modify_columns is a list of the columns I wish to modify. My DataFrame is as follows (active_set.loc[indices,modify_columns]):
        Values
380977     0.0
683042     0.0
234012     0.0
16517      0.0
...        ...
I would like to add or subtract a randomly generated integer (either -1 or 1) from these values.
I have tried using (2*np.random.randint(0,2,size=(count))-1) to generate an array of these random numbers, and add them:
active_set.loc[indices,modify_columns] = active_set.loc[indices,modify_columns] + (2*np.random.randint(0,2,size=(count))-1)
This does not work; it raises ValueError: Unable to coerce to Series, length must be 1: given 180. I think I could simply create a second DataFrame with the random numbers, or iterate, but both seem inefficient, and there must be a way to use .apply, so I am asking for some help on how to do this.
More generally:
df.loc[indexes,columns] = df.loc[indexes,columns] + 2*np.random.randint(0,50,size=(len(indexes),len(columns)))
If you want to add different random values, you can make your random.randint output the same shape as the selection (rows x columns).
Create an array whose shape matches the selection by passing a 2-D size parameter (number of indices by number of columns):
arr = 2*np.random.randint(0,2,size=(len(indices), len(modify_columns)))
active_set.loc[indices,modify_columns] += arr
The easiest solution is to use the pandas.DataFrame.add function, which lets you specify the axis to align the added values on:
vector_to_add = 2*np.random.randint(0, 2, size=(count)) - 1
df.loc[indices, modify_columns] = df.loc[indices, modify_columns].add(vector_to_add, axis='index')
I'm implementing a simple FEA code and I need to zero out particular rows and columns of a SymPy matrix to apply boundary conditions.
I tried my_matrix[:,1] = 0, but it raises an error: ValueError: unexpected value: 0.
Can someone guide me on how to set columns and rows to zero?
SymPy Matrix objects don't appear to support assigning a scalar to multiple entries the way NumPy arrays do.
Try my_matrix[:,1] = [0]*my_matrix.shape[0] instead, which generates a list of 0s of length equal to the number of rows of my_matrix.
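A minimal sketch of the idea on a small made-up SymPy matrix (the 3x3 values are only for illustration):

import sympy as sp

K = sp.Matrix([[4, -1, 0],
               [-1, 4, -1],
               [0, -1, 4]])

n = K.shape[0]
K[:, 1] = [0] * n          # zero the second column, as suggested above
K[1, :] = sp.zeros(1, n)   # a 1xn zero matrix works for zeroing the row
K[1, 1] = 1                # for Dirichlet BCs the diagonal entry is often set back to 1

The last line is optional; it just keeps the system non-singular when you later solve it.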
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
      a   b
0    99  99
..   ..  ..
499  58  84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
    pd.expanding_quantile(df.T.unstack(), integer/100.0)
    for integer in range(0, 101, 1)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)

for integer in range(0, 100, 1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly-sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, with each row iteration, add the new values to the existing array and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row"""
    # Construct skeleton of DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)

    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0

    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]

        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")

        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length

        # Insert values into quantile_df (positional boolean indexing avoids chained assignment)
        quantile_df.iloc[i, ~row_is_nan] = quantile_row

    return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
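A quick illustration of the difference on a tiny made-up array: with duplicates, 'left' returns the first suitable insertion position and 'right' the last.

import numpy as np

data = np.array([1, 2, 2, 2, 3])
np.searchsorted(data, 2, side="left")    # -> 1
np.searchsorted(data, 2, side="right")   # -> 4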
It's not quite clear what you're after, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met. b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array, and all arrays are 71x71x166.
a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))

for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, but without changing its shape.
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array, basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between.
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
I'm not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4, 3))
data[mask == 0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b != 1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b != 1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin in bins:
    a[b == bin] = np.mean(c[b == bin])
np.empty sometimes fills the array with 0's; the contents of an empty() array are undefined, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead, masking out the entries where the condition fails:
a = np.ma.masked_where(b == 0, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
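For what it's worth, a masked array also plays nicely with the bin-averaging goal described above, since reductions simply skip masked entries. A small sketch with toy shapes (the real arrays are 71x71x166):

import numpy as np

b = np.random.randint(0, 2, (4, 3, 2))   # 0/1 bin indicator, toy size
c = np.random.random((4, 3, 2))

masked = np.ma.masked_where(b == 0, c)   # hide values outside the bin
bin_mean = masked.mean()                 # masked entries are ignored by the reduction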