Searching for the index of the first and last value greater than half the maximum - Python Pandas

I have 250 files with 1000 values each, which together form a Gaussian curve, and I need to find the first and last index of the values that are bigger than half of the maximum. I loaded the files as a list of DataFrames and found the maximum using maxValues = dataframes_temp[1].max(). I was able to find the value closest to the HotM (half of the maximum) using index_value_min = (dataframes_temp[1] - b[i]).apply(abs).idxmin(), but that value isn't necessarily greater than HotM, which is the first problem.
The second problem: I wanted to find the last index near HotM using:
dataframes_temp = dataframes_list[1]
dataframes_temp2 = dataframes_temp.loc[::-1]
index_value_max = (dataframes_temp2[1] - b[i]).apply(abs).idxmin()
but it didn't work and just found the same value as in the first part.
So how can I find the indexes of the first and last values bigger than HotM?

How about, instead of abs, using lambda x: float('inf') if x <= 0 else x, so that values smaller than HotM will never be selected?
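A minimal sketch of that idea, assuming dataframes_temp is one of the loaded DataFrames with the curve values in column 1 and HotM is half of the column maximum; the boolean-mask lines at the end are an alternative (not mentioned above) that returns the first and last index above HotM directly:
s = dataframes_temp[1]          # curve values (assumed to live in column 1)
hotm = s.max() / 2              # half of the maximum ("HotM")
# Suggested fix: map non-positive differences to +inf so that idxmin
# can only pick a value strictly greater than HotM.
penalised = (s - hotm).apply(lambda x: float('inf') if x <= 0 else x)
closest_above = penalised.idxmin()
# Alternative: a boolean mask gives the first and last index above HotM directly.
above = s > hotm
first_index = above.idxmax()        # first True in the mask
last_index = above[::-1].idxmax()   # first True scanning from the end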

Related

Adding parameters for finding key points in data

I have a large data file that consists of many cycles. I am trying to find some key points in each of these cycles. I would like a way to narrow down the points I have identified, but have not been successful.
# Determine relevant points A and B
Data3['Diff'] = Data3['Value'].diff(2)
Data3['Diff2'] = Data3.groupby('Cycle#', as_index=False)['Value'].transform(lambda x: x - 2*x.shift(1) + x.shift(2))
Data3['POI'] = Data3.groupby('Cycle#').apply(
    lambda g: np.select([g['Value'] == g['Value'].max(),
                         g['Diff2'] == g['Diff2'].min()],
                        ['A', 'B'], default='-')
).explode().set_axis(Data3.index, axis=0)
I have tried this but would like to refine it more. Right now it goes through my column labeled 'Value' and computes the first and second derivatives. You can see me assigning point A to the maximum of the value, and B to the minimum of the second derivative.
I want to refine this further: in addition to A being the maximum of the value, I want to ensure that the two values lagging behind it are less than that maximum. So, for example, if the maximum is 1280, ensure the 2 rows behind it are less than this value. This helps get rid of some outliers in the data.
For B, I want to get rid of any second-derivative values less than -83. This also helps get rid of outliers.
The question is: where would I add this information? Would it be in this expression (see the sketch after the code below)?
Data3['POI'] = Data3.groupby('Cycle#').apply(
    lambda g: np.select([g['Value'] == g['Value'].max(),
                         g['Diff2'] == g['Diff2'].min()],
                        ['A', 'B'], default='-')
)
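One way to fold both extra conditions into that expression is to move the lambda into a named function and build the conditions explicitly. This is only a sketch under the assumptions stated in the question (the two preceding rows must be below the group maximum for A, and second-derivative values below -83 are discarded before picking B):
import numpy as np

def label_points(g):
    # A: the row holds the group maximum AND the two preceding
    # values are strictly below that maximum (filters outlier spikes).
    group_max = g['Value'].max()
    cond_a = (g['Value'] == group_max) \
        & (g['Value'].shift(1) < group_max) \
        & (g['Value'].shift(2) < group_max)
    # B: the row holds the minimum of Diff2 *after* discarding
    # second-derivative values below -83 (treated as outliers).
    diff2_ok = g['Diff2'] >= -83
    cond_b = diff2_ok & (g['Diff2'] == g.loc[diff2_ok, 'Diff2'].min())
    return np.select([cond_a, cond_b], ['A', 'B'], default='-')

Data3['POI'] = Data3.groupby('Cycle#').apply(label_points).explode().set_axis(Data3.index, axis=0)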

Masking a few non-zero elements of certain rows of a matrix

I have a 3x3 matrix of 1s and 0s, A = [[1,0,1],[0,1,1],[1,0,0]], and an array indicating the limit on each row sum, B = [1,2,1]. I want to find the rows of A whose sum exceeds the corresponding value in B, and set enough of the non-zero elements of A to zero so that each row sum matches B. Finding the rows of A that exceed the sum is easy; masking the elements to adjust the sum is what I need help with. How can this be achieved? (I want to scale it to larger matrices and tensors.)
I would do something like this:
import numpy as np
A = np.array([[1,0,1],[0,1,1],[1,0,0]])
B = np.array([1,2,1])
# a cumulative sum of each row will tell you how many
# ones were in that row up to each point.
A_cs = np.cumsum(A, axis = 1)
# thresholding according to the sum vector will let
# you know where you should start omitting values since
# at that point the sum of the row exceeds its limit.
A_th = A_cs > B[:, None]
# then you can use the boolean array to create a new
# array where the appropriate values in the original
# array are set to zero to reduce the row sum.
A_nw = A * (1 - A_th)
output:
A_nw =
[[1 0 0]
[0 1 1]
[1 0 0]]
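As a quick sanity check (not part of the original answer), each row sum of A_nw now respects its limit in B:
print(A_nw.sum(axis=1))                 # [1 2 1]
print(np.all(A_nw.sum(axis=1) <= B))    # True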
Unrelated note:
The following note is here to help the OP improve their dev-related search skills.
I can answer some questions instantaneously, but this was not one of them. I'm telling you this because I reached the answer through a simple Google search for "python find the i th non zero element in each row", which led me to a post that in turn led me very quickly to an answer. You don't have to try to be a better, more independent code writer. But if you want to, know that you can.

Selecting Pandas DF between two values

I'm trying to subset a column of values that were extracted from a correlation matrix. I want to get values greater than 0.75 or less than -0.75. I tried the first line of code below and it only gave me positive values greater than 0.75. The second line errored out without a result.
Corr_matrix1 = Corr_matrix1[(Corr_matrix1['Coefficient'] >= abs(0.75))]
Corr_matrix1 = Corr_matrix1[(Corr_matrix1['Coefficient'] >= 0.75) & (Corr_matrix1['Coefficient'] <= -0.75)]
Any help would be appreciated.
You can do this with the DataFrame.query method, one of my favorite features of pandas; it's pretty slept on. Here's an example:
df.corr().query(
    'Coefficient <= -0.75 '
    'or Coefficient >= 0.75'
)
It's kind of odd: you pass the query as string literals with no commas in between, and Python concatenates adjacent literals into a single query string (note the trailing space inside the first literal). If you need a variable in the query, you can use an f-string.
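For reference, a minimal sketch of the same filter with plain boolean indexing, which only needs | (or) rather than & since no value can be both >= 0.75 and <= -0.75 at once; the 'Coefficient' column name is taken from the question:
threshold = 0.75
mask = (Corr_matrix1['Coefficient'] >= threshold) | (Corr_matrix1['Coefficient'] <= -threshold)
Corr_matrix1 = Corr_matrix1[mask]
# or, with .query and an f-string:
Corr_matrix1 = Corr_matrix1.query(f'Coefficient >= {threshold} or Coefficient <= -{threshold}')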
Take a look at IntervalIndex:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html

apply max to varying-dimension subsets of pandas dataframe

For a DataFrame with an index column containing repeated values, I'm trying to get the maximum value found in a different column, per index value, and assign it to a third column, so that for any given row we can see the maximum value found in any row with the same index.
I'm doing this over a very large data set and would like it to be vectorized if possible. For now, I can't get it to work at all:
multiindexDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,7,10,15,11,25,89]]).transpose()
multiindexDF.columns = ['theIndex','theValue']
multiindexDF['maxValuePerIndex'] = 0
uniqueIndices = multiindexDF['theIndex'].unique()
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF[matchingIndices == i]['theValue'].max()
    multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
This fails, telling me I should use .loc when I'm already using it. I'm not sure what the error means, or how I can fix this so I don't have to loop through everything and can vectorize it instead.
I'm looking for this:
targetDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,7,10,15,11,25,89],[5,6,10,10,89,89,89,89]]).transpose()
targetDF
Looks like this is a good case for groupby transform: it computes the maximum value per index group and broadcasts the result back onto the original index (rather than the grouped index):
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby("theIndex")["theValue"].transform("max")
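For completeness, a runnable version on the question's data; the printed column matches the third column of targetDF above:
import pandas as pd

multiindexDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,7,10,15,11,25,89]]).transpose()
multiindexDF.columns = ['theIndex', 'theValue']
# Broadcast each group's maximum back onto every row of that group.
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby('theIndex')['theValue'].transform('max')
print(multiindexDF['maxValuePerIndex'].tolist())   # [5, 6, 10, 10, 89, 89, 89, 89]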
The reason you're getting the SettingWithCopyWarning is that in your .loc call you're taking a slice of a slice and setting the value there; see the two pairs of square brackets in:
multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
So it tries to assign the value to the slice rather than the original DataFrame, because you're doing a .loc and then another [] after it in a chain.
So using your original approach:
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF.loc[matchingIndices, 'theValue'].max()
    multiindexDF.loc[matchingIndices, 'maxValuePerIndex'] = maxValue
(Notice I've also changed the first .loc where you were incorrectly using the boolean index)

Pandas - expanding inverse quantile function

I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
      a   b
0    99  99
..   ..  ..
499  58  84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
    pd.expanding_quantile(df.T.unstack(), integer/100.0)
    for integer in range(0,101,1)})

percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)

for integer in range(0,100,1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly-sorted data (and at inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, on each row iteration, add the new values to it and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but a pure-Python implementation will probably be slower than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row."""
    # Construct the skeleton of the DataFrame we'll fill with quantile values
    quantile_df = pd.DataFrame(np.NaN, index=df.index, columns=df.columns)
    # Pre-allocate a numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)
    # We maintain the invariant that sorted_array[:length] holds data and is sorted
    length = 0
    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract the non-NaN numpy array from the row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(np.float) / length
        # Insert values into quantile_df
        quantile_df.iloc[i][~row_is_nan] = quantile_row
    return quantile_df
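A quick usage sketch on the DataFrame from the question (the column names 'a' and 'b' are assumed as above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantile_df = quantiles_by_row(df)   # same shape as df, every value in [0, 1)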
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has 'left' and 'right' options which determine whether the prospective insertion position should be the first or the last suitable position. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution takes the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(np.float) / (length * 2)
It's not quite clear, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing to 1D seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # (* 2 should be * N if you have N columns)
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
