Python reset cumulative sum over intervals in a column - python

I am trying to compute a cumulative sum by intervals, i.e. with the cumsum being reset to zero whenever the next value to accumulate is 0. Below is an example, followed by the desired result. I have tried using numpy 'convolve' and 'groupby' but can't come up with a way to do the reset except by writing a function that loops over all the rows. Is there a clever approach I'm missing? Note that the real data in column 'x' are real numbers separated by 0's.
import numpy as np
import pandas as pd
a = pd.DataFrame(
    [[0,0],[1,0],[1,0],[1,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],
     [0,0],[0,0],[0,0],[0,0],[1,0],[1,0],[0,0]],
    columns=["x","y"])

def patch(k):
    # plain cumulative sum: it never resets
    k["z"] = k.x.cumsum()
    return k

print(patch(a))
Current output:
    x  y  z
0   0  0  0
1   1  0  1
2   1  0  2
3   1  0  3
4   0  0  3
5   0  0  3
6   0  0  3
7   0  0  3
8   0  0  3
9   0  0  3
10  0  0  3
11  0  0  3
12  0  0  3
13  0  0  3
14  1  0  4
15  1  0  5
16  0  0  5
Desired output:
    x  y  z
0   0  0  0
1   1  0  1
2   1  0  2
3   1  0  3
4   0  0  0
5   0  0  0
6   0  0  0
7   0  0  0
8   0  0  0
9   0  0  0
10  0  0  0
11  0  0  0
12  0  0  0
13  0  0  0
14  1  0  1
15  1  0  2
16  0  0  0

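Do a groupby on the cumulative count of zeros: every 0 in x starts a new group, and cumsum then restarts within each group: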
a['z'] = a.groupby(a['x'].eq(0).cumsum())['x'].cumsum()
Output:
    x  y  z
0   0  0  0
1   1  0  1
2   1  0  2
3   1  0  3
4   0  0  0
5   0  0  0
6   0  0  0
7   0  0  0
8   0  0  0
9   0  0  0
10  0  0  0
11  0  0  0
12  0  0  0
13  0  0  0
14  1  0  1
15  1  0  2
16  0  0  0
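Since the question mentions that the real data are real numbers separated by zeros, here is a minimal sketch of the same grouping idea on float data. The sample values and the names df and group_id are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: real-valued runs separated by zeros.
df = pd.DataFrame({"x": [0.0, 1.5, 2.0, 0.0, 0.0, 0.5, 0.25, 0.0]})

# Each zero starts a new group: the running count of zeros is the group label.
group_id = df["x"].eq(0).cumsum()

# The cumulative sum restarts inside every group.
df["z"] = df.groupby(group_id)["x"].cumsum()
print(df)
```

Because each group begins with the zero that opened it, the first value of every group is 0 and the sum builds up from there.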

Related

Pandas: copy value from row above of another column when condition match

When the column "event" has a non-zero value and the row above has the value 5 in column "input1", I need to copy the value of column "label" from the row above.
What I have now:
input1 input2 input3 event label
0 0 0 0 0
5 5 0 0 2
5 5 0 0 2
0 0 0 24 0
0 0 0 0 0
5 0 5 0 3
5 0 5 0 3
5 0 5 0 3
0 0 0 25 0
0 0 0 0 0
What I need to happen:
input1 input2 input3 event label marker
0 0 0 0 0 0
5 5 0 0 2 0
5 5 0 0 2 0
0 0 0 24 0 2
0 0 0 0 0 0
5 0 5 0 3 0
5 0 5 0 3 0
5 0 5 0 3 0
0 0 0 25 0 3
0 0 0 0 0 0
You can use boolean masks eq/ne, and shift to get the previous values, then select with where:
# is the previous input1 equal to 5?
m1 = df['input1'].shift().eq(5)
# is event not 0?
m2 = df['event'].ne(0)
# get the previous label if both conditions are true, else 0
df['marker'] = df['label'].shift(fill_value=0).where(m1&m2, 0)
# OR
# df['marker'] = df['label'].shift().where(m1&m2, 0).convert_dtypes()
output:
input1 input2 input3 event label marker
0 0 0 0 0 0 0
1 5 5 0 0 2 0
2 5 5 0 0 2 0
3 0 0 0 24 0 2
4 0 0 0 0 0 0
5 5 0 5 0 3 0
6 5 0 5 0 3 0
7 5 0 5 0 3 0
8 0 0 0 25 0 3
9 0 0 0 0 0 0
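For reference, a self-contained sketch of the shift/where approach, with the DataFrame rebuilt by hand from the table in the question (the construction of df is an assumption about how the data is loaded):

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "input1": [0, 5, 5, 0, 0, 5, 5, 5, 0, 0],
    "input2": [0, 5, 5, 0, 0, 0, 0, 0, 0, 0],
    "input3": [0, 0, 0, 0, 0, 5, 5, 5, 0, 0],
    "event":  [0, 0, 0, 24, 0, 0, 0, 0, 25, 0],
    "label":  [0, 2, 2, 0, 0, 3, 3, 3, 0, 0],
})

prev_input_is_5 = df["input1"].shift().eq(5)   # condition on the row above
event_nonzero = df["event"].ne(0)              # condition on the current row

# Take the previous label where both conditions hold, otherwise 0.
df["marker"] = df["label"].shift(fill_value=0).where(prev_input_is_5 & event_nonzero, 0)
print(df)
```

Passing fill_value=0 to shift keeps the column as integers; without it the first row becomes NaN and the column is upcast to float.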

Convert one-hot encoded data-frame columns into one column

In the pandas data frame, the one-hot encoded vectors are present as columns, i.e:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
4 0 0 0 0 1
How can I convert these columns into a single label-encoded column in Python? i.e.:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
Also need suggestion on this that some rows have multiple 1s, how to handle those rows because we can have only one category at a time.
Try with argmax:
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
4 0 0 0 0 1 5
argmax is the way to go; adding another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
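A small runnable sketch comparing the two approaches on the question's data. The frame is rebuilt by hand here, and the names onehot, by_position, by_label and by_label_pos are illustrative only.

```python
import pandas as pd

# One-hot columns copied from the question.
onehot = pd.DataFrame(
    [[0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]],
    columns=list("ABCDE"),
)

# Position-based: index of the largest value in each row, shifted to 1-based.
by_position = onehot.to_numpy().argmax(axis=1) + 1

# Label-based: column name of the largest value, then mapped back to a position.
by_label = onehot.idxmax(axis=1)
by_label_pos = onehot.columns.get_indexer(by_label) + 1

print(by_position, by_label.tolist(), by_label_pos)
```

Both variants return the first maximum in a row, so a row with several 1s is silently assigned to the leftmost one, and an all-zero row is assigned to the first column.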
Also need suggestion on this that some rows have multiple 1s, how to
handle those rows because we can have only one category at a time.
In this case you dot your DataFrame of dummies with an array of all the powers of 2 (based on the number of columns). This ensures that the presence of any unique combination of dummies (A, A+B, A+B+C, B+C, ...) will have a unique category label. (Added a few rows at the bottom to illustrate the unique counting)
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
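The powers-of-two encoding can be reproduced with the following sketch, which rebuilds the nine rows shown above by hand (the variable names dummies and weights are illustrative):

```python
import numpy as np
import pandas as pd

# Rows copied from the table above, including the rows with several 1s.
dummies = pd.DataFrame(
    [[0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 1, 0, 0, 1]],
    columns=list("ABCDE"),
)

# One weight per column: A=1, B=2, C=4, D=8, E=16.
weights = 2 ** np.arange(dummies.shape[1])

# Each row becomes the sum of the weights of its active columns.
category = dummies.dot(weights)
print(category.tolist())
```

The dot product is computed on a separate frame of dummy columns so that the weights line up with exactly those columns and nothing else.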
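Another readable solution, on top of the other great solutions provided, that works for any type of variable in your dataframe. It assumes every row has exactly one non-zero entry, because np.where returns one column index per non-zero cell: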
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5

Accumulate 1 and Reset to 0 once condition is met

I have the dataset below and want to accumulate a count while ColA is 0, and reset the count to 0 (restart counting) whenever ColA is 1 again.
ColA
1
0
1
1
0
1
0
0
0
1
0
0
0
My expected result is as below.
ColA Accumulate
1 0
0 1
1 0
1 0
0 1
1 0
0 1
0 2
0 3
1 0
0 1
0 2
0 3
The current code I use
test['Value'] = np.where ( (test['ColA']==1),test['ColA'].cumsum() ,0)
ColA Value
1 0
0 1
1 0
1 0
0 2
1 0
0 3
0 4
0 5
1 0
0 6
0 7
0 8
Use cumsum if performance is important:
a = df['ColA'] == 0
cumsumed = a.cumsum()
df['Accumulate'] = cumsumed-cumsumed.where(~a).ffill().fillna(0).astype(int)
print (df)
ColA Accumulate
0 1 0
1 0 1
2 1 0
3 1 0
4 0 1
5 1 0
6 0 1
7 0 2
8 0 3
9 1 0
10 0 1
11 0 2
12 0 3
This should do it:
test['Value'] = (test['ColA']==0) * 1 * (test['ColA'].groupby((test['ColA'] != test['ColA'].shift()).cumsum()).cumcount() + 1)
It is an adaption of this answer.
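The one-liner above can be unpacked as follows. This is a sketch of the same run-labelling idea, written with where instead of multiplying by the mask; run_id and position_in_run are illustrative names.

```python
import pandas as pd

test = pd.DataFrame({"ColA": [1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]})
col = test["ColA"]

# A new run starts whenever the value differs from the previous row.
run_id = col.ne(col.shift()).cumsum()

# 1-based position of each row inside its run.
position_in_run = col.groupby(run_id).cumcount() + 1

# Keep the position for runs of zeros, force 0 for runs of ones.
test["Value"] = position_in_run.where(col.eq(0), 0)
print(test)
```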

How do I open a binary matrix and convert it into a 2D array or a dataframe?

I have a binary matrix in a txt file that looks as follows:
0011011000
1011011000
0011011000
0011011010
1011011000
1011011000
0011011000
1011011000
0100100101
1011011000
I want to make this into a 2D array or a dataframe where there is one number per column and the rows are as shown. I've tried using numpy and pandas, but the output has only one column that contains the whole number. I want to be able to call an entire column as a number.
One of the codes I've tried is:
with open("a1data1.txt") as myfile:
    dat1 = myfile.read().split('\n')
dat1 = pd.DataFrame(dat1)
Use read_fwf with parameter widths:
df = pd.read_fwf("a1data1.txt", header=None, widths=[1]*10)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
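After reading the text file into a single-column DataFrame of strings (called df here, with the strings in column 0), you can split each string into its characters. Note that the resulting cells are the strings '0' and '1'; convert them with astype(int) if you need numbers: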
pd.DataFrame(df[0].apply(list).values.tolist())
Out[846]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
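Both ideas can be tried without a file on disk. The sketch below feeds a few of the question's rows through io.StringIO instead of a1data1.txt (an assumption made only to keep the example self-contained) and compares read_fwf with a manual character split.

```python
from io import StringIO

import pandas as pd

# Three rows taken from the question, standing in for the text file.
raw = "0011011000\n1011011000\n0100100101\n"

# Fixed-width parsing: ten columns, each one character wide.
df_fwf = pd.read_fwf(StringIO(raw), header=None, widths=[1] * 10)

# Manual alternative: split every line into characters and convert to int.
rows = [[int(ch) for ch in line] for line in raw.splitlines() if line]
df_manual = pd.DataFrame(rows)

print(df_fwf)
print(df_manual[1].tolist())  # a whole column, one digit per row
```

For a file with a different number of digits per line, the widths list would need to match that length.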

Pivotting via Python and Pandas

I have a table like this:
ID Word
1 take
2 the
3 long
4 long
5 road
6 and
7 walk
8 it
9 walk
10 it
I want to use a pivot table in pandas to get the distinct words as columns and 1 and 0 as values, something like this matrix:
ID Take The Long Road And Walk It
1 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0
5 0 0 0 1 0 0 0
and so on
Trying to use pivot table but not familiar with pandas syntax yet:
import pandas as pd
data = pd.read_csv('dataset.txt', sep='|', encoding='latin1')
table = pd.pivot_table(data,index=["ID"],columns=pd.unique(data["Word"].values),fill_value=0)
How can I rewrite pivot table function to deal with it?
You can use concat with str.get_dummies:
print(pd.concat([df['ID'], df['Word'].str.get_dummies()], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
Or as Edchum mentioned in comments - pd.get_dummies:
print(pd.concat([df['ID'], pd.get_dummies(df['Word'])], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
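A self-contained sketch of the get_dummies approach, with the table typed in by hand rather than read from dataset.txt. Recent pandas versions return boolean columns from pd.get_dummies by default, so dtype=int is passed here to get the 0/1 values shown above.

```python
import pandas as pd

# Table from the question.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Word": ["take", "the", "long", "long", "road",
             "and", "walk", "it", "walk", "it"],
})

# One indicator column per distinct word, as integers.
dummies = pd.get_dummies(df["Word"], dtype=int)

# Put the ID column back in front of the indicator columns.
out = pd.concat([df["ID"], dummies], axis=1)
print(out)
```

The indicator columns come out in sorted order (and, it, long, ...), not in order of first appearance; reindex the columns if the original word order matters.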
