Checking multiple condition with numpy array - python

I have a Dataframe with several lines and columns and I have transformed it into a numpy array to speed-up the calculations.
The first five columns of the Dataframe looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking to use
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write the condition_on_parameters such that it returns a True of False when at least 2 out of the 5 parameters are larger than the threshold. I thought to use the sum() function on the condition_on_parameters but I am not sure how to write such condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.

Assuming condition_on_parameters returns an array the sames size as a with entries as True or False, you can use np.sum(condition_on_parameters, axis=1) to sum over the true values (True has a numerical values of 1) of each row. This provides a 1D array with entries as the number of columns that meet the condition. This array can then be used with where to get the row numbers you are looking for.
df['result'] = np.where(np.sum(condition_on_parameters, axis=1) > 2)

Can you exploit pandas functionalities? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here some sample code:
import pandas as pd
df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98,1.3, 0.56]], columns=['par1', 'par2', 'par3'])
# custom_condition, e.g. value less or equal than threshold
def leq(x, t):
return x<=t
condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)
# filter
df.loc[condition >=2]
I think this should be equivalent to numpy in terms of efficiency as pandas is ultimately build on top of that, however I'm not entirely sure...

It seems you are looking for numpy.any
a = np.array(\
[[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
[1.330468, 1.460574, 0.917349, 1.172675, 0.766603 ],
[1.212440, 1.457865, 0.947623, 1.235930, 0.890041 ],
[1.222362, 1.348485, 0.963692, 1.241781, 0.892205 ]]);
df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1) # append the result column
Gives the following dataframe

Related

Check condition on each pandas DataFrame cell

I have a pandas DataFrame corr which collects correlations between 2k variables. Since I didn't create it, I would like to check whether it satisfies the usual consistency properties of a correlation matrix (simmetry, all numeric values in [-1,1],no missing values,...). How can I check such conditions efficiently, since my actual code involves two nested loops?
For the sake of completeness I generate a df below with an example of my actual checks.
import pandas as pd
df = pd.DataFrame(np.random.normal(size=(10000,2000)), columns = ["var"+str(i) for i in range(0,2000)])
corr = df.corr()
inconsistent_cells=[]
for row in corr.index:
for col in corr.columns:
value = corr.loc[row,col]
if not isinstance(value,float) or (value<-1 or value>1):
inconsistent_cells.append((value, (row,col)))
I think one possible solution would be to use itertuples(), but then I would loose info about cell coordinates. The same is true for apply().
Any suggestion is appreciate, thanks.
Write a custom function to check:
def check_df(df):
#symmetry
if not df.eq(df.T).all().all():
return False
#between -1 and 1
if not df.apply(lambda x: x.between(-1,1,inclusive="both").all()).all():
return False
#null values
if df.isnull().any().any():
return False
return True

How do you filter rows in a dataframe based on the column numbers from a Python list?

I have a Pandas dataframe with two columns, x and y, that correspond to a large signal. It is about 3 million rows in size.
Wavelength from dataframe
I am trying to isolate the peaks from the signal. After using scipy, I got a 1D Python list corresponding to the indexes of the peaks. However, they are not the actual x-values of the signal, but just the index of their corresponding row:
from scipy.signal import find_peaks
peaks, _ = find_peaks(y, height=(None, peakline))
So, I decided I would just filter the original dataframe by setting all values in its y column to NaN unless they were on an index found in the peak list. I did this iteratively, however, since it is 3000000 rows, it is extremely slow:
peak_index = 0
for data_index in list(data.index):
if data_index != peaks[peak_index]:
data[data_index, 1] = float('NaN')
else:
peak_index += 1
Does anyone know what a faster method of filtering a Pandas dataframe might be?
Looping in most cases is extremely inefficient when it comes to pandas. Assuming you just need filtered DataFrame that contains the values of both x and y columns only when y is a peak, you may use the following piece of code:
df.iloc[peaks]
Alternatively, if you are hoping to retrieve an original DataFrame with y column retaining its peak values and having NaN otherwise, then please use:
df.y = df.y.where(df.y.iloc[peaks] == df.y.iloc[peaks])
Finally, since you seem to care about just the x values of the peaks, you might just rework the first piece in the following way:
df.iloc[peaks].x

Pandas - expanding inverse quantile function

I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
a b
1 0.277438 0.042671
.. ... ...
499 0.570952 0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
a b
0 99 99
.. .. ..
499 58 84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
pd.expanding_quantile(df.T.unstack(), integer/100.0)
for integer in range(0,101,1)})
percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)
for integer in range(0,100,1):
percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
(df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analagous to finding the position we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast with dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data, and with each row iteration, add it to our existing list and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
""" Reconstruct a DataFrame of expanding quantiles by row """
# Construct skeleton of DataFrame what we'll fill with quantile values
quantile_df = pd.DataFrame(np.NaN, index=df.index, columns=df.columns)
# Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
num_valid = np.sum(~np.isnan(df.values))
sorted_array = np.empty(num_valid)
# We want to maintain that sorted_array[:length] has data and is sorted
length = 0
# Iterates over ndarray rows
for i, row_array in enumerate(df.values):
# Extract non-NaN numpy array from row
row_is_nan = np.isnan(row_array)
add_array = row_array[~row_is_nan]
# Add new data to our sorted_array and sort.
new_length = length + len(add_array)
sorted_array[length:new_length] = add_array
length = new_length
sorted_array[:length].sort(kind="mergesort")
# Query the relative positions, divide by length to get quantiles
quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(np.float) / length
# Insert values into quantile_df
quantile_df.iloc[i][~row_is_nan] = quantile_row
return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determines whether you want your prospective inserted position to be the first or last suitable position possible. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(np.float) / (length * 2)
Yet not quite clear, but do you want a cumulative sum divided by total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
ditto for b
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezeing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)
for current_index in df.index:
preceding_rows = df.loc[:current_index, :]
# Combine values from all columns into a single 1D array
# * 2 should be * N if you have N columns
combined = preceding_rows.values.reshape((1, len(preceding_rows) *2)).squeeze()
a_percentile[current_index] = stats.percentileofscore(
combined,
df.loc[current_index, 'a'],
kind='weak'
)
b_percentile[current_index] = stats.percentileofscore(
combined,
df.loc[current_index, 'b'],
kind='weak'
)

Selecting dataframe rows based on multiple columns, where new functions should be created to handle conditions in some columns

I have a dataframe that consists of multiple columns. I want to select rows based on conditions in multiple columns. Assuming that I have four columns in a dataframe:
import pandas as pd
di={"A":[1,2,3,4,5],
"B":['Tokyo','Madrid','Professor','helsinki','Tokyo Oliveira'],
"C":['250','200//250','250//250//200','12','200//300'],
"D":['Left','Right','Left','Right','Right']}
data=pd.DataFrame(di)
I want to select Tokyo in column B, 200 in column C, Left in column D. By that, the first row will be only selected. I have to create a function to handle column C. Since I need to check the first value if the row contains a list with //
To handle this, I assume this can be done through the following:
def check_200(thecolumn):
thelist=[]
for i in thecolumn:
f=i
if "//" in f:
#split based on //
z=f.split("//")
f=z[0]
f=float(f)
if f > 200.00:
thelist.append(True)
else:
thelist.append(False)
return thelist
Then, I will create the multiple conditions:
selecteddata=data[(data.B.str.contains("Tokyo")) &
(data.D.str.contains("Left"))&(check_200(data.C))]
Is this the best way to do that, or there is an easier pandas function that can handle such requirements ?
I don't think there is a most pythonic way to do this, but I think this is what you want:
bool_idx = ((data.B.str.contains("Tokyo")) &
(data.D.str.contains("Left")) & (data.C.str.contains("//")
& (data.C.str.split("//")[0].astype(float)>200.00))
selecteddata=data[bool_idx]
Bruno's answer does the job, and I agree that boolean masking is the way to go. This answer keeps the code a little closer to the requested format.
import numpy as np
def col_condition(col):
col = col.apply(lambda x: float(x.split('//')[0]) > 200)
return col
data = data[(data.B.str.contains('Tokyo')) & (data.D.str.contains("Left")) &
col_condition(data.C)]
The function reads in a Series, and converts each element to True or False, depending on the condition. It then returns this mask.

Creating a set of new columns based on other sets of columns and based on the IF THEN condition

I need to create a third set of columns based on two other sets of columns and based on the additional IF THEN condition. It works, but I am not sure how to introduce a condition that would limit division to only those rows where both list_A and list_B are higher than zero. In other words, I want to execute computation only if the numbers in both sets A and B are higher than zero because that's the condition that makes the division meaningful. And, if either list A or B is zero then I would like list C be zero. I tried NumPy where approach, but I am getting error. Thank you. Here is example
dates = pd.date_range('1/1/2000', periods=100, freq='M')
dft = pd.DataFrame(np.random.randn(100, 4),index=dates, columns=['var_A1',
'var_A2', 'var_B1', 'var_B2'])
list_A=['var_A1', 'var_A2']
list_B=['var_B1', 'var_B2']
list_C=['var_C1', 'var_C2']
C=pd.DataFrame(data=dft[list_A].values/dft[list_B].values,columns=list_C,
index=dft.index)
dft=pd.concat([dft,C],axis=1)
c = pd.DataFrame(data=np.where(((dft[list_A].values>0) & (dft[list_B].values>0)), dft[list_A].values/dft[list_B].values, 0),columns=list_C,index=dft.index)
dft=pd.concat([dft,c],axis=1)

Categories