Count how many consecutive TRUEs on each row in a dataframe - python

I am trying to count how many consecutive TRUEs there are on each row. I solved that part myself, but I need a solution for this part: if a row starts with FALSE, then the result must be 0. There is a sample dataset below. Can you recommend tips on how to solve this?
P.S. My original question is at the link below:
how to find number of consecutive decreases(increases)
Sample data (.csv file):
idx,Expected Results,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1002,3,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,4,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1006,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1007,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1008,1,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE
1010,1,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE
1013,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1014,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,1,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1017,2,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
After John's solution:
How can I count the TRUEs until I see the first FALSE?
result = df.where(df[0], 0)
idx,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
1001,0,0,0,0,0,0,0,0,0,0,0,0
1002,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE
1003,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1004,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1005,0,0,0,0,0,0,0,0,0,0,0,0
1006,0,0,0,0,0,0,0,0,0,0,0,0
1007,0,0,0,0,0,0,0,0,0,0,0,0
1008,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1009,0,0,0,0,0,0,0,0,0,0,0,0
1010,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE
1011,0,0,0,0,0,0,0,0,0,0,0,0
1013,0,0,0,0,0,0,0,0,0,0,0,0
1014,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1015,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1016,0,0,0,0,0,0,0,0,0,0,0,0
1017,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
1018,0,0,0,0,0,0,0,0,0,0,0,0

You can use np.argmin. You needn't prefilter your df; argmin handles rows starting with False correctly, because the first False is at position 0 and so the count is 0.
df.loc[:, 'M_1':'M_12'].values.argmin(1)
#array([0, 3, 1, 4, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0])
Note that this assumes there is at least one False in every row.
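If a row could instead be all True, a sketch of one guard (an addition, not part of the original answer) is to pad each row with an explicit False before calling argmin:
import numpy as np
# pad each row with a trailing False so argmin is well-defined
# even when a row is entirely True
vals = df.loc[:, 'M_1':'M_12'].to_numpy()
padded = np.column_stack([vals, np.zeros(len(vals), dtype=bool)])
counts = padded.argmin(axis=1)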

df.loc[:, 'M_1':'M_12'].apply(np.logical_and.accumulate, axis=1).sum(axis=1)
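This works because the running AND along each row stays True exactly until the first False, so summing counts the leading run. As a sketch (not from the original answer), the same idea runs without apply using NumPy directly:
import numpy as np
vals = df.loc[:, 'M_1':'M_12'].to_numpy()
counts = np.logical_and.accumulate(vals, axis=1).sum(axis=1)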

Reverse the values of columns M_1 to M_12 using negation (~), i.e. True to False and vice versa. Apply cummax along the rows to separate out the first group of consecutive True values (note: at this point True represents a False value and False represents a True value). Apply another negation to the result of cummax and finally sum along the rows.
(~(~df.drop(columns=['idx'])).cummax(axis=1)).sum(axis=1)
Out[503]:
0 0
1 3
2 1
3 4
4 0
5 0
6 0
7 1
8 0
9 1
10 0
11 0
12 1
13 1
14 0
15 2
16 0
dtype: int64
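To see why this works, here is a minimal walk-through on a single row (a sketch, not part of the original answer):
s = pd.Series([True, True, False, True])
~s                       # False, False, True, False
(~s).cummax()            # False, False, True, True   (flags everything from the first False on)
~(~s).cummax()           # True, True, False, False   (keeps only the leading run of True)
(~(~s).cummax()).sum()   # 2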

Related

How to count value change during a for loop in python?

I have a dataframe with one column consisting of 0s and 1s.
They are structured like [0,0,1,1,0,0,1,1,1].
My goal is to count only the first 1 in each run of repeating 1s in a loop.
So in this example of [0,0,1,1,0,0,1,1,1] it should count a total of 2. How can I do this with a for loop and an if condition?
A simple for loop:
out = [0]+[int(j-i==1) for i,j in zip(lst,lst[1:])]
Output:
[0, 0, 1, 0, 0, 0, 1, 0, 0]
Also, you can assign a pd.Series to a DataFrame column like:
df['col'] = (pd.Series(lst).diff() == 1).astype(int)
Found a messy way to do it where I can create a translated list and count the sum.
def FirstValue(data):
    counter = []  # translated list: 1 marks the start of each run of 1s
    for index, item in enumerate(data):
        if item == 1 and data[index-1] == 1:
            counter.append(0)
        elif item == 1 and data[index-1] == 0:
            counter.append(1)
        else:
            counter.append(0)
    return counter
(As @Erfan mentioned in the comments:)
>>> df
col
0 0
1 0
2 1
3 1
4 0
5 0
6 1
7 1
8 1
>>> df['col'].diff().eq(1).sum()
2
>>> df['col'].diff().eq(1).astype(int)
0 0
1 0
2 1
3 0
4 0
5 0
6 1
7 0
8 0
Name: col, dtype: int64

How to solve for unknowns in a 5x5 matrix in python

Here is a 5x5 matrix with all cells unknown; it looks something like this:
A1+B1+C1+D1+E1| 1
A2+B2+C2+D2+E2| 0
A3+B3+C3+D3+E3| 1
A4+B4+C4+D4+E4| 3
A5+B5+C5+D5+E5| 2
_______________
2 1 2 1 1
So, the row sums can be seen on the right, and the column sums on the bottom. Each cell can only be 0 or 1; as an example, here is the solution to the specific puzzle typed out above:
0+0+1+0+0| 1
0+0+0+0+0| 0
1+0+0+0+0| 1
1+1+0+0+1| 3
0+0+1+1+0| 2
____________
2 1 2 1 1
As you can see, summing the rows and columns gives the results on the right and bottom.
My question: How would you go about entering the original matrix with unknowns and having python iterate each cell with 0 or 1 until the puzzle is complete?
You don't really need a matrix -- just use vectors (tuples) of length 25. They can represent 5x5 matrices according to the following scheme:
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
These are the indices of such tuples. Note that the row and column of an index can be obtained from the function divmod.
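For example (an illustration, not part of the original answer):
row, col = divmod(13, 5)  # index 13 lives at row 2, column 3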
You can use product from itertools to iterate over the 2**25 possible ways of filling in the matrix.
These ideas lead to the following code:
from itertools import product

# n x n matrices will be represented by tuples of length n**2,
# in row-major order

# the following function calculates row and column sums:
def all_sums(array, n):
    row_sums = [0]*n
    col_sums = [0]*n
    for i, x in enumerate(array):
        q, r = divmod(i, n)
        row_sums[q] += x
        col_sums[r] += x
    return row_sums, col_sums

# in what follows, row_sums and col_sums are lists of target values
def solve_puzzle(row_sums, col_sums):
    n = len(row_sums)
    for p in product(range(2), repeat=n*n):
        if all_sums(p, n) == (row_sums, col_sums):
            return p
    return "no solution"

solution = solve_puzzle([1,0,1,3,2], [2,1,2,1,1])
for i in range(0, 25, 5):
    print(solution[i:i+5])
Output:
(0, 0, 0, 0, 1)
(0, 0, 0, 0, 0)
(0, 0, 0, 1, 0)
(1, 1, 1, 0, 0)
(1, 0, 1, 0, 0)
In this case brute-force was feasible. If you go much beyond 5x5 it would no longer be feasible, and more sophisticated algorithms would be required.
This is a special case of an integer linear programming problem. The special case of 0-1 integer linear programming is unfortunately still NP-complete, though many algorithms exist, including heuristic ones. You can use an existing solver library to do this for you.
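As a sketch of the 0-1 formulation using SciPy's MILP solver (this assumes SciPy >= 1.9 and is not part of the original answer):
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_puzzle_ilp(row_sums, col_sums):
    n = len(row_sums)
    # one 0-1 variable per cell, row-major order
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        A[i, i*n:(i+1)*n] = 1   # row-sum constraint for row i
        A[n + i, i::n] = 1      # column-sum constraint for column i
    b = np.array(row_sums + col_sums, dtype=float)
    res = milp(c=np.zeros(n * n),                      # pure feasibility problem
               constraints=LinearConstraint(A, b, b),  # equality: lb == ub
               integrality=np.ones(n * n),             # all variables integer
               bounds=Bounds(0, 1))                    # 0-1 variables
    return res.x.reshape(n, n).round().astype(int) if res.success else None

print(solve_puzzle_ilp([1, 0, 1, 3, 2], [2, 1, 2, 1, 1]))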

Pandas loc dynamic conditional list

I have a Pandas DataFrame and I want to find all rows where the i'th column values are 10 times greater than other columns.
Here is an example of my DataFrame:
For example, looking at column i=0, row B (0.344): it is 10x greater than the values in the same row but in other columns (0.001, 0, 0.009, 0). So I would like:
my_list_0=[False,True,False,False,False,False,False,False,False,False,False]
The number of columns might change, hence I don't want a solution like:
#This is good only for a DataFrame with 4 columns.
my_list_i = data.loc[(data.iloc[:,i]>10*data.iloc[:,(i+1)%num_cols]) &
(data.iloc[:,i]>10*data.iloc[:,(i+2)%num_cols]) &
(data.iloc[:,i]>10*data.iloc[:,(i+3)%num_cols])]
Any ideas? Thanks.
Given the df:
df = pd.DataFrame({'cell1': [0.006209, 0.344955, 0.004521, 0, 0.018931, 0.439725, 0.013195, 0.009045, 0, 0.02614, 0],
                   'cell2': [0.048043, 0.001077, 0, 0.010393, 0.031546, 0.287264, 0.016732, 0.030291, 0.016236, 0.310639, 0],
                   'cell3': [0, 0, 0.020238, 0, 0.03811, 0.579348, 0.005906, 0, 0, 0.068352, 0.030165],
                   'cell4': [0.016139, 0.009359, 0, 0, 0.025449, 0.47779, 0, 0.01282, 0.005107, 0.004846, 0],
                   'cell5': [0, 0, 0, 0.012075, 0.031668, 0.520258, 0, 0, 0, 2.728218, 0.013418]})
i = 0
You can use
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).all(1)
To get
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
dtype: bool
for any number of columns. This drops column i, multiplies the remaining df by 10, and checks row-wise whether each scaled value is less than the value in column i, then returns True only if all values in the row are True. So it returns a vector with True for each row where this holds and False for the others.
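If you need this test for every column rather than a fixed i, one way (a sketch, not part of the original answer) is to build the mask once per column:
masks = {col: (10 * df.drop(columns=[col])).lt(df[col], axis=0).all(axis=1)
         for col in df.columns}
masks['cell1']  # the same Series as above for i = 0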
If you want to give an arbitrary threshold, you can sum the Trues and divide by the number of columns - 1, then compare with your threshold:
thresh = 0.5 # or whatever you want
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).sum(1) / (df.shape[1] - 1) > thresh
0 False
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
dtype: bool

How to iterate over a pandas dataframe while referencing previous rows?

I am iterating over a Python dataframe and finding it to be extremely slow. I understand that in Pandas you try to vectorize everything, but in this case I specifically need to iterate (or if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal". If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signal keeps its previous value. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('head', 1)
df
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
The logic here divides your series into groups that start wherever A and B differ ((df.A != df.B).cumsum()), and each group's signal is the first value of A within that group.
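As a sketch of the intermediate step on the sample data (not part of the original answer), the group ids look like this:
(df.A != df.B).cumsum()
# 0    1
# 1    2
# 2    2
# 3    2
# 4    3
# 5    3
# 6    4
# 7    4
# 8    5
# 9    6
Every row where A and B differ starts a new group, and taking the first value of A in each group propagates the most recent on/off event forward.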
You don't need to iterate at all; you can do this with Boolean indexing:
# set condition for A
indata.loc[indata.A == 1, 'signal'] = 1
# set condition for B
indata.loc[indata.B == 1, 'signal'] = 0
# forward-fill the remaining NaN values
indata['signal'] = indata['signal'].ffill()
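Rows before the first A or B event stay NaN even after the forward fill; a sketch of one way to default them to 0 (an assumption about the desired starting state, not part of the original answer):
indata['signal'] = indata['signal'].fillna(0).astype(int)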
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed's and djk's answers were faster than my solution (about 4.5 ms), but in practice I'll probably just iterate through series even though that is not optimal.

Pandas : determine mapping from unique rows to original dataframe

Given the following inputs:
In [18]: input
Out[18]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
2 1 5 9 1
3 1 5 9 1
In [26]: df = input.drop_duplicates()
Out[26]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
How would I go about getting an array that maps each row of input to the index of its equivalent row in the de-duplicated subset, e.g.:
resultant = [0, 1, 0, 0]
I.e. the '1' here states that row 1 of input equals row 1 of df. Since there are fewer unique rows than original rows, multiple values in resultant will map to the same row of df; i.e. (row[k] in input) == (row[k+N] in input) == (row[1] in df) could be the case.
I am looking for the actual row-number mapping from input to df.
While this example is trivial, in my case I have a ton of dropped mappings that might map to one index.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.
One way would be to treat it as a groupby on all columns:
>>> df.groupby(list(df.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
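If you want the resultant array of group ids directly, ngroup does this in one step (a sketch, not part of the original answer; sort=False keeps first-seen order so the ids line up with drop_duplicates):
>>> input.groupby(list(input.columns), sort=False).ngroup()
0    0
1    1
2    0
3    0
dtype: int64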
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = df.sort_values(list(df.columns))
>>> eqs = (ds != ds.shift()).all(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64
Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[slimmed_row.index]):
            return index

def rows_mappings(df, df_slimmed):
    # map each row of the full frame onto the deduplicated frame
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))
This is if you are interested in generating the resultant list in your example; I don't quite follow the latter part of your reasoning.
