I have the dataset below and want to accumulate a count while ColA is 0, resetting the count to 0 (and starting again) whenever ColA is 1.
ColA
1
0
1
1
0
1
0
0
0
1
0
0
0
My expected result is as below.
ColA Accumulate
1 0
0 1
1 0
1 0
0 1
1 0
0 1
0 2
0 3
1 0
0 1
0 2
0 3
The current code I use
test['Value'] = np.where ( (test['ColA']==1),test['ColA'].cumsum() ,0)
ColA Value
1 0
0 1
1 0
1 0
0 2
1 0
0 3
0 4
0 5
1 0
0 6
0 7
0 8
Use cumsum if performance is important:
a = df['ColA'] == 0
cumsumed = a.cumsum()
df['Accumulate'] = cumsumed-cumsumed.where(~a).ffill().fillna(0).astype(int)
print(df)
ColA Accumulate
0 1 0
1 0 1
2 1 0
3 1 0
4 0 1
5 1 0
6 0 1
7 0 2
8 0 3
9 1 0
10 0 1
11 0 2
12 0 3
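To see why the subtraction works, it can help to look at the intermediate series. The following is a minimal sketch on the question's data; the names `is_zero`, `running` and `baseline` are chosen here for illustration and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({'ColA': [1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

is_zero = df['ColA'] == 0
# Running count of all zeros seen so far (this never resets).
running = is_zero.cumsum()
# Value of the running count at the most recent 1, carried forward over the zeros.
# fillna(0) covers the case where the column starts with zeros.
baseline = running.where(~is_zero).ffill().fillna(0).astype(int)
# Subtracting the baseline leaves the number of zeros since the last 1.
df['Accumulate'] = running - baseline

print(df.assign(running=running, baseline=baseline))
```

Everything here is a vectorized operation on the whole column, which is presumably why the answer recommends it when performance matters.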
This should do it:
test['Value'] = (test['ColA']==0) * 1 * (test['ColA'].groupby((test['ColA'] != test['ColA'].shift()).cumsum()).cumcount() + 1)
It is an adaptation of this answer.
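The one-liner above packs several steps together. A possible way to unpack it, assuming the same `test` frame as in the question (the names `run_id` and `pos_in_run` are illustrative only):

```python
import pandas as pd

test = pd.DataFrame({'ColA': [1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# A new run starts wherever the value differs from the previous row.
run_id = (test['ColA'] != test['ColA'].shift()).cumsum()
# Position of each row within its run, counted from 1.
pos_in_run = test.groupby(run_id).cumcount() + 1
# Keep the position only for runs of zeros; rows where ColA is 1 get 0.
test['Value'] = pos_in_run.where(test['ColA'] == 0, 0)

print(test)
```

Using `where` in the last step plays the same role as multiplying by the boolean mask in the original expression.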
Related
I have a dataframe like this
ID Q001 Q002 Q003 Q004 Q005 Q006 Q007 Q008 Win
A 1 1 1 1 1 1 1 1 Yes
B 0 1 0 1 0 1 0 1 No
C 0 1 0 1 0 1 0 1 No
D 1 1 0 1 1 1 1 1 Yes
E 1 1 0 1 1 1 1 1 Yes
F 1 1 0 1 1 1 1 1 Yes
G 0 0 1 0 0 0 0 0 No
H 0 0 0 0 0 0 0 0 No
I 1 0 1 0 1 0 1 0 No
In the above dataframe, I want to create the column 'Win' and assign the value 'Yes' if the sum of Q001 and Q002 is greater than or equal to 2, and 'No' if it is lower than 2. How can I do this in Python?
Use np.where() to return a value conditional on other columns.
df['Win'] = np.where(df['Q001'] + df['Q002'] >= 2, 'Yes', 'No')
Check
df['Win'] = np.where(df[['Q001','Q002']].sum(1)>=2,'Yes','No')
df
Out[680]:
ID Q001 Q002 Q003 Q004 Q005 Q006 Q007 Q008 Win
0 A 1 1 1 1 1 1 1 1 Yes
1 B 0 1 0 1 0 1 0 1 No
2 C 0 1 0 1 0 1 0 1 No
3 D 1 1 0 1 1 1 1 1 Yes
4 E 1 1 0 1 1 1 1 1 Yes
5 F 1 1 0 1 1 1 1 1 Yes
6 G 0 0 1 0 0 0 0 0 No
7 H 0 0 0 0 0 0 0 0 No
8 I 1 0 1 0 1 0 1 0 No
Simply use:
import numpy as np
cols = ['Q001', 'Q002']
df['Win'] = np.where(df[cols].sum(axis=1).ge(2),
'Yes', 'No')
You can scale this up to any number of columns.
Output:
ID Q001 Q002 Q003 Q004 Q005 Q006 Q007 Q008 Win
0 A 1 1 1 1 1 1 1 1 Yes
1 B 0 1 0 1 0 1 0 1 No
2 C 0 1 0 1 0 1 0 1 No
3 D 1 1 0 1 1 1 1 1 Yes
4 E 1 1 0 1 1 1 1 1 Yes
5 F 1 1 0 1 1 1 1 1 Yes
6 G 0 0 1 0 0 0 0 0 No
7 H 0 0 0 0 0 0 0 0 No
8 I 1 0 1 0 1 0 1 0 No
You could calculate the column as a Boolean series and replace the values with Yes and No (if you must):
df['Win'] = (df['Q001'] + df['Q002'] >= 2).replace({False: 'No', True: 'Yes'})
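The approaches in this thread all build a boolean mask first and differ only in how they turn it into labels. A small self-contained sketch, assuming a cut-down copy of the question's table with just the two relevant columns, and using `Series.map` as one more option alongside `np.where`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': list('ABCDEFGHI'),
    'Q001': [1, 0, 0, 1, 1, 1, 0, 0, 1],
    'Q002': [1, 1, 1, 1, 1, 1, 0, 0, 0],
})

cols = ['Q001', 'Q002']
# Boolean Series: True where the row sum reaches 2.
mask = df[cols].sum(axis=1) >= 2

win_where = np.where(mask, 'Yes', 'No')          # NumPy array of labels
win_map = mask.map({True: 'Yes', False: 'No'})   # pandas Series of labels

df['Win'] = win_map
print(df)
```

Keeping the boolean `mask` around can be convenient for later filtering, for example `df[mask]`.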
I am trying to compute a cumulative sum by intervals, i.e. with the cumsum being reset to zero if the next value to accumulate is 0. Below is an example, followed by the desired result. I have tried using numpy 'convolve' and 'groupby' but can't come up with a way to do the reset except by writing a function that loops over all the rows. Is there a clever approach I'm missing? Note that the real data in column 'x' are real numbers separated by 0's.
import numpy as np
import pandas as pd
a = pd.DataFrame([[0,0],[1,0],[1,0],[1,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],
                  [0,0],[0,0],[0,0],[0,0],[1,0],[1,0],[0,0]], columns=["x","y"])
def patch(k):
    k["z"] = k.x.cumsum()
    return k
print(patch(a))
Current output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 3
6 0 0 3
7 0 0 3
9 0 0 3
10 0 0 3
12 0 0 3
13 1 0 4
15 1 0 5
16 0 0 5
Desired output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 0
6 0 0 0
7 0 0 0
9 0 0 0
10 0 0 0
12 0 0 0
13 1 0 1
15 1 0 2
16 0 0 0
Do groupby on cumsum:
a['z'] = a.groupby(a['x'].eq(0).cumsum())['x'].cumsum()
Output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 0
6 0 0 0
7 0 0 0
9 0 0 0
10 0 0 0
12 0 0 0
13 1 0 1
15 1 0 2
16 0 0 0
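Since the question mentions that the real data are real numbers separated by zeros, here is a minimal sketch of the same idea on made-up floating-point values (the data and the name `group_id` are hypothetical):

```python
import pandas as pd

# Hypothetical real-valued data separated by zeros.
a = pd.DataFrame({'x': [0.0, 1.5, 2.0, 0.0, 0.0, 0.5, 0.25, 0.0]})

# Every zero increments the group id, so each zero starts a new group.
group_id = a['x'].eq(0).cumsum()
# The cumulative sum is taken within each group and therefore restarts at every zero.
a['z'] = a.groupby(group_id)['x'].cumsum()

print(a)
```

Because each group begins with the zero itself, the reset row gets `z == 0` without any special handling.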
Here is an example.
a b k c
0 0 0 0
0 1 1 0
0 2 0 0
0 3 0 0
0 4 1 0
0 5 0 0
0 0 0 1
0 1 1 1
0 2 0 1
0 3 0 1
0 4 1 1
0 5 0 1
1 0 0 0
1 1 1 0
1 2 0 0
1 3 1 0
1 4 0 0
1 0 0 1
1 1 1 1
1 2 0 1
1 3 1 1
1 4 0 1
Here, "a" is the user id, "b" is time, "c" is the product, and "k" is a binary indicator flag. For each c, "b" is guaranteed to be consecutive, and the flag "k" for a given pair (a, b) is the same across products, which means it is independent of "c". What I want to get is this:
a b k c diff_b
0 0 0 0 nan
0 1 1 0 nan
0 2 0 0 1
0 3 0 0 2
0 4 1 0 3
0 5 0 0 1
0 0 0 1 nan
0 1 1 1 nan
0 2 0 1 1
0 3 0 1 2
0 4 1 1 3
0 5 0 1 1
1 0 0 0 nan
1 1 1 0 nan
1 2 0 0 1
1 3 1 0 2
1 4 0 0 1
1 0 0 1 nan
1 1 1 1 nan
1 2 0 1 1
1 3 1 1 2
1 4 0 1 1
So diff_b is a time-difference variable: it shows the duration between the current time point and the last earlier time point with an action. If there has been no earlier action, it is nan. diff_b is grouped by a: it is calculated independently for each user, and for the same user it should also be calculated independently for each product.
Thank you.
You just need to add c to the grouping keys in the second step:
df['New'] = df.b.loc[df.k == 1]  # keep b only where k equals 1
# Forward-fill within each (a, c) group, then shift so a row only sees earlier actions.
# transform keeps the original index, so the result aligns with df.
df['New'] = df.groupby(['a', 'c'])['New'].transform(lambda x: x.ffill().shift())
df['diff_b'] = df.b - df['New']
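For anyone who wants to reproduce this, the following is a self-contained sketch that rebuilds the question's example and computes `diff_b`. It uses `transform` on the grouped series so that the result keeps the original row index; the helper names `k_by_user`, `action_time` and `last_action` are illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the example: for each (a, c) pair, b runs 0..n-1 and k flags an action.
k_by_user = {0: [0, 1, 0, 0, 1, 0], 1: [0, 1, 0, 1, 0]}
rows = [
    {'a': user, 'b': b, 'k': k, 'c': product}
    for user, flags in k_by_user.items()
    for product in (0, 1)
    for b, k in enumerate(flags)
]
df = pd.DataFrame(rows)

# Time of each action (NaN on rows where k == 0).
action_time = df['b'].where(df['k'] == 1)
# Within each (a, c) group, carry the last action time forward, then shift by one
# row so that a row only sees actions strictly before it.
last_action = action_time.groupby([df['a'], df['c']]).transform(
    lambda s: s.ffill().shift()
)
df['diff_b'] = df['b'] - last_action

print(df)
```

The `shift()` is what makes the row at an action time (k == 1) report the distance to the previous action rather than 0.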
I have a binary matrix in a txt file that looks as follows:
0011011000
1011011000
0011011000
0011011010
1011011000
1011011000
0011011000
1011011000
0100100101
1011011000
I want to turn this into a 2D array or a dataframe with one digit per column and the rows as shown. I've tried using numpy and pandas, but the output has only one column that contains the whole number. I want to be able to select an entire column of digits.
One of the attempts I've tried is:
with open("a1data1.txt") as myfile:
    dat1=myfile.read().split('\n')
dat1=pd.DataFrame(dat1)
Use read_fwf with parameter widths:
df = pd.read_fwf("a1data1.txt", header=None, widths=[1]*10)
print(df)
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
After reading the txt file into a one-column DataFrame of strings (here df), you can split each string into its characters:
pd.DataFrame(df[0].apply(list).values.tolist())
Out[846]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
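Both answers can be tried without the file on disk. The sketch below uses a short in-memory string as a stand-in for the contents of `a1data1.txt` (an assumption made here so the example is self-contained) and also converts the characters to integers, which the character-splitting approach does not do by itself.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the file contents (first, second and ninth rows of the question's data).
text = "0011011000\n1011011000\n0100100101\n"

# Option 1: split each line into characters and convert them to integers.
arr = np.array([[int(ch) for ch in line] for line in text.split()])
df_chars = pd.DataFrame(arr)

# Option 2: read_fwf with one-character-wide fields, as in the first answer.
n_cols = len(text.split()[0])
df_fwf = pd.read_fwf(io.StringIO(text), header=None, widths=[1] * n_cols)

# A single column of the matrix can now be selected by its position.
print(df_chars[2].tolist())
```

Deriving `n_cols` from the first line avoids hard-coding the width, under the assumption that every row has the same length.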
I have a table like this:
ID Word
1 take
2 the
3 long
4 long
5 road
6 and
7 walk
8 it
9 walk
10 it
I want to use a pivot table in pandas to get the distinct words as columns, with 1 and 0 as the values, something like this matrix:
ID Take The Long Road And Walk It
1 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0
5 0 0 0 1 0 0 0
and so on
I'm trying to use a pivot table, but I'm not familiar with pandas syntax yet:
import pandas as pd
data = pd.read_csv('dataset.txt', sep='|', encoding='latin1')
table = pd.pivot_table(data,index=["ID"],columns=pd.unique(data["Word"].values),fill_value=0)
How can I rewrite pivot table function to deal with it?
You can use concat with str.get_dummies:
print(pd.concat([df['ID'], df['Word'].str.get_dummies()], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
Or as Edchum mentioned in comments - pd.get_dummies:
print(pd.concat([df['ID'], pd.get_dummies(df['Word'])], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
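Since the question asked specifically about a pivot-style call, `pd.crosstab` is another option worth considering. This is a sketch under the assumption that the data has been loaded into a frame with `ID` and `Word` columns as in the question; it is not taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 11),
    'Word': ['take', 'the', 'long', 'long', 'road', 'and', 'walk', 'it', 'walk', 'it'],
})

# crosstab counts occurrences of each (ID, Word) pair; with unique IDs that is 0 or 1.
table = pd.crosstab(df['ID'], df['Word'])

# Optionally reorder the columns by first appearance instead of alphabetically.
table = table[df['Word'].unique().tolist()]

print(table)
```

Here `ID` becomes the index rather than a regular column; `table.reset_index()` would turn it back into a column if that layout is preferred.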