Pandas loc dynamic conditional list - python

I have a Pandas DataFrame and I want to find all rows where the i'th column's value is 10 times greater than the values in all the other columns of that row. Here is an example of my DataFrame (reproduced in the answer below).
For example, looking at column i=0, the value at row 1 (0.344) is 10x greater than the values in the same row in the other columns (0.001, 0, 0.009, 0). So I would like:
my_list_0=[False,True,False,False,False,False,False,False,False,False,False]
The number of columns might change, hence I don't want a solution like:
#This is good only for a DataFrame with 4 columns.
my_list_i = data.loc[(data.iloc[:, i] > 10 * data.iloc[:, (i+1) % num_cols]) &
                     (data.iloc[:, i] > 10 * data.iloc[:, (i+2) % num_cols]) &
                     (data.iloc[:, i] > 10 * data.iloc[:, (i+3) % num_cols])]
Any ideas? Thanks.

Given the df:
import pandas as pd

df = pd.DataFrame({'cell1': [0.006209, 0.344955, 0.004521, 0, 0.018931, 0.439725, 0.013195, 0.009045, 0, 0.02614, 0],
                   'cell2': [0.048043, 0.001077, 0, 0.010393, 0.031546, 0.287264, 0.016732, 0.030291, 0.016236, 0.310639, 0],
                   'cell3': [0, 0, 0.020238, 0, 0.03811, 0.579348, 0.005906, 0, 0, 0.068352, 0.030165],
                   'cell4': [0.016139, 0.009359, 0, 0, 0.025449, 0.47779, 0, 0.01282, 0.005107, 0.004846, 0],
                   'cell5': [0, 0, 0, 0.012075, 0.031668, 0.520258, 0, 0, 0, 2.728218, 0.013418]})
i = 0
You can use
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).all(1)
To get
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
dtype: bool
for any number of columns. This drops column i, multiplies the remaining frame by 10, and checks row-wise whether each value is less than the value in column i, returning True only if all values in the row satisfy this. So it returns a boolean vector with True for each row where the condition holds and False otherwise.
If you want to give an arbitrary threshold, you can sum the Trues and divide by the number of columns - 1, then compare with your threshold:
thresh = 0.5 # or whatever you want
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).sum(1) / (df.shape[1] - 1) > thresh
0 False
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
dtype: bool
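Since the question asks for my_list_i for an arbitrary column i, the one-liner can also be wrapped in a small helper that builds the mask for every column at once. A minimal sketch (the function name dominant_rows is mine):

def dominant_rows(df, factor=10):
    # For each column, True where that column's value is more than
    # `factor` times every other column's value in the same row.
    return pd.DataFrame({col: (factor * df.drop(columns=col)).lt(df[col], axis=0).all(axis=1)
                         for col in df.columns})

masks = dominant_rows(df)
my_list_0 = masks['cell1'].tolist()  # [False, True, False, ...] as above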

Related

Pandas : Get binary OR/AND for all the columns in a dataframe

Say I have a DataFrame (the original DataFrame has 91 columns and 1000 rows):
0 1 2 3
0 False False False True
1 True False False False
2 True False False False
3 False False True False
4 False True True False
5 False False False False
6 True True True True
I need to get the AND/OR values for all the columns in my dataframe. So the resultant OR, AND values would be.
OR AND
0 True False
1 True False
2 True False
3 True False
4 True False
5 False False
6 True True
I can do this by looping over all my columns and calculating the boolean for each column, but I was looking for a more DataFrame-level approach that avoids going through the columns one by one.
You can use any and all.
df = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))
You can sum along the columns and then the OR is indicated by sum > 0, and AND is indicated by sum == len(df.columns):
total = df.sum(axis=1)
res = pd.DataFrame({"OR": total > 0, "AND": total == len(df.columns)})
If you have many columns this is more efficient as it only iterates over the entire matrix once (in the worst case, depending on the input distribution and implementation of any/all iterating twice can be faster).
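To make the comparison concrete, here is a runnable sketch of both approaches on the example frame above; both produce the OR/AND columns shown:

import pandas as pd

df = pd.DataFrame({0: [False, True, True, False, False, False, True],
                   1: [False, False, False, False, True, False, True],
                   2: [False, False, False, True, True, False, True],
                   3: [True, False, False, False, False, False, True]})

# any/all directly
res1 = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))

# one sum, then two cheap comparisons
total = df.sum(axis=1)
res2 = pd.DataFrame({'OR': total > 0, 'AND': total == len(df.columns)})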

Mark True from conditions satisfy on two consecutive values till another two consecutive values

I have a float column in a DataFrame, and I want to add another boolean column that becomes True once a condition holds on two consecutive values and stays True until another condition holds on the next two consecutive values.
For example, I have a DataFrame that looks like this:
index  Values %
0      0
1      5
2      11
3      9
4      14
5      18
6      30
7      54
8      73
9      100
10     100
11     100
12     100
13     100
Now I want to mark True from where two consecutive values satisfy the condition df['Values %'] >= 10, until the next two consecutive values satisfy the second condition, i.e. df['Values %'] == 100.
So the final result will look something like this:
index  Values %  Flag
0      0         False
1      5         False
2      11        False
3      9         False
4      14        False
5      18        True
6      30        True
7      54        True
8      73        True
9      100       True
10     100       True
11     100       False
12     100       False
13     100       False
Not sure how exactly the second part of your question is supposed to work, but here is how to achieve the first part.
Example data:
import pandas as pd

s = pd.Series([0, 5, 11, 9, 14, 18, 2, 14, 16, 18])
Solution:
# create a True/False series for the first condition and take the cumulative sum
x = (s >= 10).cumsum()
# compare each element of x with the element 2 positions before; the difference is
# exactly 2 for elements that belong to a streak of 2 or more Trues
condition = x - x.shift(2) == 2
condition looks like this
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
dtype: bool
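The cumulative-sum trick generalizes to streaks of any length k. A minimal sketch (the helper name streak is mine; shift's fill_value needs pandas 0.24+):

def streak(mask, k):
    # True where the current element ends a run of k consecutive Trues.
    c = mask.cumsum()
    return c - c.shift(k, fill_value=0) == k

condition = streak(s >= 10, 2)  # same result as above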
I have a rather inefficient way of doing this. It's not vectorised, so not ideal, but it works:
import numpy as np

# Convert the values column to a 1D NumPy array for ease of use.
values_np = df["Values %"].to_numpy()
# Initialize a flags array of the same size as values_np, all 0s.
# Uses the integer form of booleans, i.e. 0 = False and 1 = True.
flags = np.zeros(values_np.shape[0], dtype=int)
# Iterate from the 1st (not 0th) row to the last row.
for i in range(1, values_np.shape[0]):
    # First set the flag to 1 (True) if both consecutive values are >= 10.
    if values_np[i] >= 10 and values_np[i-1] >= 10:
        flags[i] = 1
    # Then if both consecutive values are >= 100, set the flag back to 0 (False).
    if values_np[i] >= 100 and values_np[i-1] >= 100:
        flags[i] = 0
# Turn flags into boolean form (i.e. convert 0 and 1 to False and True).
flags = flags.astype(bool)
# Add flags as a new column in df.
df["Flags"] = flags
One thing -- my method gives False for row 10, because both row 9 and row 10 >= 100. If this is not what you wanted, let me know and I can change it so that the flag is True only if the previous two values and the current value (3 consecutive values) are all >= 100.

Count how many consecutive TRUEs on each row in a dataframe

I am trying to count how many consecutive TRUEs there are in each row. I solved that part myself, but I need a solution for this part: if a row starts with FALSE, then the result must be 0. There is a sample dataset below. Can you recommend a way to solve this?
PS: my original question is at the link below.
how to find number of consecutive decreases(increases)
Sample data, .csv file
idx,Expected Results,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1002,3,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,4,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1006,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1007,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1008,1,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE
1010,1,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE
1013,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1014,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1017,2,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
After John's solution:
How can I count the Trues until I see the first "False"?
result = df.where(df[0], 0)
idx,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,0,0,0,0,0,0,0,0,0,0,0
1002,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,0,0,0,0,0,0,0,0,0,0,0
1006,0,0,0,0,0,0,0,0,0,0,0,0
1007,0,0,0,0,0,0,0,0,0,0,0,0
1008,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,0,0,0,0,0,0,0,0,0,0,0
1010,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,0,0,0,0,0,0,0,0,0,0,0
1013,0,0,0,0,0,0,0,0,0,0,0,0
1014,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,0,0,0,0,0,0,0,0,0,0,0
1017,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,0,0,0,0,0,0,0,0,0,0,0
You can use np.argmin. You needn't prefilter your df; it will handle rows starting with False correctly.
df.loc[:, 'M_1':'M_12'].values.argmin(1)
#array([0, 3, 1, 4, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0])
Note that this assumes there is at least one False in every row.
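If a row could be all True, argmin would wrongly return 0 for it (all values compare equal, so the first position wins). A hedged fix is to special-case those rows:

import numpy as np

arr = df.loc[:, 'M_1':'M_12'].to_numpy()
counts = np.where(arr.all(axis=1), arr.shape[1], arr.argmin(axis=1))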
Another option: np.logical_and.accumulate keeps each row's values True only until the first False, so summing each row gives the length of the leading run of Trues:
df.loc[:, 'M_1':'M_12'].apply(np.logical_and.accumulate, axis=1).sum(axis=1)
Reverse the values of columns M_1 to M_12 using negation '~', i.e. True to False and vice versa. Then apply cummax to separate out the first group of consecutive Trues (note: at this point True represents a False value and False represents a True value). Apply another negation to the result of cummax and finally sum:
(~(~df.drop(['idx'], axis=1)).cummax(axis=1)).sum(axis=1)
Out[503]:
0 0
1 3
2 1
3 4
4 0
5 0
6 0
7 1
8 0
9 1
10 0
11 0
12 1
13 1
14 0
15 2
16 0
dtype: int64
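To see why this works, here is a step-by-step sketch on a single row (row 1002 from the sample data):

import pandas as pd

row = pd.Series([True, True, True, False, False, False,
                 True, True, True, False, False, False])
inverted = ~row                 # the row's first False becomes the first True here
seen_false = inverted.cummax()  # True from the first original False onward
leading = ~seen_false           # True only on the leading run of Trues
leading.sum()                   # 3, the expected result for row 1002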

Count occurrences of False or True in a column in pandas

given
patient_id test_result has_cancer
0 79452 Negative False
1 81667 Positive True
2 76297 Negative False
3 36593 Negative False
4 53717 Negative False
5 67134 Negative False
6 40436 Negative False
How do I count the False or True values in a column in Python?
I had been trying:
# number of patients with cancer
number_of_patients_with_cancer= (df["has_cancer"]==True).count()
print(number_of_patients_with_cancer)
So you need value_counts?
df.has_cancer.value_counts()
Out[345]:
False 6
True 1
Name: has_cancer, dtype: int64
If has_cancer has NaNs:
false_count = (~df.has_cancer).sum()
If has_cancer does not have NaNs, you can optimise by not having to negate the masks beforehand.
false_count = len(df) - df.has_cancer.sum()
And similarly, if you want just the count of True values, that is
true_count = df.has_cancer.sum()
If you want both, it is
fc, tc = df.has_cancer.value_counts().sort_index().tolist()
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
If the pandas Series above is called example:
example.sum()
This outputs 1, since there is only one True value in the series. To get the count of False values:
len(example) - example.sum()
number_of_patients_with_cancer = df.has_cancer[df.has_cancer==True].count()
Considering the above data frame as df:
True_Count = df[df.has_cancer == True]
len(True_Count)
Just sum the column for a count of the Trues. False is just a special case of 0 and True a special case of 1. The False count would be your row count minus that. Unless you've got na's in there.
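One caveat worth demonstrating: with NaNs in the column, len minus the sum of Trues overcounts the Falses, so count them explicitly. A small sketch:

import numpy as np
import pandas as pd

s = pd.Series([True, False, False, np.nan])
(s == True).sum()   # 1 (NaN compares unequal, so it is not counted)
(s == False).sum()  # 2
len(s) - s.sum()    # 3, wrong here: the NaN gets counted as a False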

Index DataFrame with MultiIndex Rows and Columns via another DataFrame containing row and column indices as columns

I have a list of particle pairs, where each pair is referred to by a combination of a chain index and an intra-chain index for both particles. I have saved those in a DataFrame (let's call it index_array), and now I want to plot a matrix of all particle pairs, where every matrix element that corresponds to a pair in the list gets one color and all others get another color. My idea was to produce a DataFrame (let's call it to_fill) with chain and intra-chain index as a MultiIndex for both rows and columns, which thus has two entries per pair, and then use index_array to index to_fill and change the corresponding values, so that I can plot the values of to_fill via matplotlib.pyplot.pcolormesh.
So to break it down into a more or less well-defined problem: I have a boolean DataFrame to_fill that has multiindexed rows and columns (2 levels each) that contains only Falses. I also have another DataFrame index_array that has four columns, containing the index values for the levels of both rows and columns. Now I want to set all elements pointed to by index_array to True. A toy version of those could for example be produced with the code below:
import numpy as np
import pandas as pd
lengths = pd.Series(data=[2, 4], index=[1, 2]) # Corresponds to the chains' lengths
index = pd.MultiIndex.from_tuples([(i, j) for i in lengths.index
                                   for j in np.arange(1, lengths.loc[i]+1)])
to_fill = pd.DataFrame(False, index=index, columns=index)  # all-False boolean frame
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 False False False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False False False
# 4 False False False False False False
index_array = pd.DataFrame([[1, 1, 1, 1],
                            [1, 1, 1, 2],
                            [2, 3, 2, 3],
                            [2, 3, 2, 4]],
                           columns=["i_1", "j_1", "i_2", "j_2"])
print(index_array)
# i_1 j_1 i_2 j_2
# 0 1 1 1 1
# 1 1 1 1 2
# 2 2 3 2 3
# 3 2 3 2 4
Now I want to set all entries in to_fill that correspond to (i_1, j_1), (i_2, j_2) for a row in index_array to True. So basically, index_array refers to entries in to_fill that should be changed. The expected result would thus be:
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 True True False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False True True
# 4 False False False False False False
But I did not manage to properly use index_array as an index. How can I tell to_fill to treat the indexing arrays i_1, j_1, i_2, and j_2 as corresponding index values for the levels of the row and column MultiIndex respectively?
This is a little better - hmm perhaps not really:
tuples = [tuple(x) for x in index_array.values]
stacked = to_fill.stack(level=0).stack() # double stack carefully ordered
stacked.loc[tuples] = True
result = stacked.unstack(level=2).unstack().dropna(axis=1) #unstack and drop NaN cols
This is not great, as I'd rather not use iterrows() if it can be helped:
idx = pd.IndexSlice
for row in index_array.iterrows():
    r = row[1]
    i_1 = r.loc['i_1']
    j_1 = r.loc['j_1']
    i_2 = r.loc['i_2']
    j_2 = r.loc['j_2']
    to_fill.loc[idx[i_1, j_1], idx[i_2, j_2]] = True
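A vectorized alternative, as a sketch (pd.MultiIndex.from_frame needs pandas 0.24+): translate the label pairs into positional indices once, then set all the cells with a single NumPy assignment:

import pandas as pd

row_pos = to_fill.index.get_indexer(
    pd.MultiIndex.from_frame(index_array[['i_1', 'j_1']]))
col_pos = to_fill.columns.get_indexer(
    pd.MultiIndex.from_frame(index_array[['i_2', 'j_2']]))

filled = to_fill.to_numpy().copy()
filled[row_pos, col_pos] = True
to_fill = pd.DataFrame(filled, index=to_fill.index, columns=to_fill.columns)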
