Let's say I have something that looks like this:
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Ref': [False, False, False, False, True, False, False, False, True, False]})
What I want to do is create a new column which is the difference in Number from the row where Ref is True, within each group. For the A group the True is on the last row, so the column would read -4, -3, -2, -1, 0. I have been thinking of doing the following:
for col in df.groupby('Event'):
    temp = col[1]
    reference = temp[temp.Ref == True]
    dist1 = temp.apply(lambda x: x.Number - reference.Number, axis=1)
This seems to calculate correctly for each group, but I am not sure how to join the results back into df.
In your case
df['new']=(df.set_index('Event').Number-df.query('Ref').set_index('Event').Number).to_numpy()
df
Event Number Ref new
0 A 1 False -4
1 A 2 False -3
2 A 3 False -2
3 A 4 False -1
4 A 5 True 0
5 B 6 False -3
6 B 7 False -2
7 B 8 False -1
8 B 9 True 0
9 B 10 False 1
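This works because pandas aligns the two Series on their shared 'Event' index: every row's Number lines up with its group's reference Number, and .to_numpy() strips that index so the result can be assigned back to df. A quick look at the right-hand side of the subtraction (a sketch using the same df as above):

df.query('Ref').set_index('Event').Number
Event
A    5
B    9
Name: Number, dtype: int64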
You could do the following:
df["new"] = df.Number - df.Number[df.groupby('Event')['Ref'].transform('idxmax')].reset_index(drop=True)
print(df)
Output
Event Number Ref new
0 A 1 False -4
1 A 2 False -3
2 A 3 False -2
3 A 4 False -1
4 A 5 True 0
5 B 6 False -3
6 B 7 False -2
7 B 8 False -1
8 B 9 True 0
9 B 10 False 1
This: df.groupby('Event')['Ref'].transform('idxmax') will find, for each group, the index where Ref is True. idxmax returns the index of the maximum value, and given that True == 1 and False == 0, it finds the index of the True value in each group.
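To see that intermediate step (a quick check, assuming the same df as in the question):

df.groupby('Event')['Ref'].transform('idxmax')
0    4
1    4
2    4
3    4
4    4
5    8
6    8
7    8
8    8
9    8
Name: Ref, dtype: int64

Indexing df.Number with these positions gives each row its group's reference Number (5 for A, 9 for B); the reset_index(drop=True) is needed so the subtraction aligns on the original 0-9 index rather than on the repeated 4s and 8s.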
Try where and groupby transform first:
s = df.Number.where(df.Ref).groupby(df.Event).transform('first')
df.Number - s
Out[319]:
0 -4.0
1 -3.0
2 -2.0
3 -1.0
4 0.0
5 -3.0
6 -2.0
7 -1.0
8 0.0
9 1.0
Name: Number, dtype: float64
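To attach it to the frame, assign the difference as a new column:

df['new'] = df.Number - s

The result comes back as float because where introduces NaN before the group-wise fill; since every group here has exactly one True, a trailing .astype(int) would be safe if integer output is preferred.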
Related
I am trying to set the value of a column "C" based on the idxmax() of column "B" within a groupby on column "A". To make it a bit more complicated though, in the event of a NaN or 0, I would like it to return the min value excluding the NaN or 0, if such a value exists. Here is an example dataframe:
   Index  A     B      C
0      0  1   5.0  False
1      1  1  10.0  False
2      2  2   9.0  False
3      3  2   NaN  False
4      4  3   3.0  False
5      5  3   5.0  False
6      6  4   NaN  False
7      7  4   NaN  False
8      8  5   0.0  False
9      9  5   5.0  False
I am trying to set column "C" to True for the idxmax() of column B, split by a groupby on column "A":
   A     B      C
0  1   5.0   True
1  1  10.0  False
2  2   9.0   True
3  2   NaN  False
4  3   3.0   True
5  3   5.0  False
6  4   NaN   True
7  4   NaN  False
8  5   0.0  False
9  5   5.0   True
Thanks!
Let's use groupby with transform like this:
df['C_new'] = df.groupby('A')['B'].transform('idxmax') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False False
1 1 1 10.0 False True
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False False
5 5 3 5.0 False True
Or, with idxmin:
df['C_new'] = df.groupby('A')['B'].transform('idxmin') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False True
1 1 1 10.0 False False
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False True
5 5 3 5.0 False False
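Note that the desired output in the question actually matches idxmin excluding NaN and 0 (with a fallback to the group's first row when every value is NaN or 0), rather than a plain idxmax. A sketch under that reading, using a hypothetical helper pick_idx:

def pick_idx(s):
    # drop NaN and 0 before choosing; fall back to the first row if nothing is left
    valid = s[s.notna() & s.ne(0)]
    return valid.idxmin() if not valid.empty else s.index[0]

df['C'] = df.index.isin(df.groupby('A')['B'].apply(pick_idx))

On the example frame this marks indices 0, 2, 4, 6 and 9 as True, matching the expected output.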
What I have is a dataframe like:
total_sum pid
5 2
1 2
6 7
3 7
1 7
1 7
0 7
5 10
1 10
1 10
What I want is another column pos like:
total_sum pid pos
5 2 1
1 2 2
6 7 1
3 7 2
1 7 3
1 7 3
0 7 4
5 10 1
1 10 2
1 10 2
The logic behind it is:
The initial pos value for a new pid is 1.
If pid does not change but total_sum changes, pos is incremented by 1 (see the first two rows); otherwise pos keeps its previous value (see the last two rows).
What I tried:

import numpy as np

df['pos'] = 1
df['pos'] = np.where((df.pid.diff(-1) == 0) & (df.total_sum.diff(-1) == 0),
                     df.pos,
                     np.where(df.total_sum.diff(1) < 1, df.pos + 1, df.pos))
Currently, I am doing it in an Excel sheet, where I initially write 1 manually in the first cell of pos and then put this formula in the second cell of pos:
=IF(A3<>A2,1,IF(B3=B2,C2,C2+1))
Explanation:
Do a groupby on pid to split the rows into one group per pid, then apply the following chain of operations to each group:
_ Call diff on each group. diff returns numbers or NaN indicating the difference between two consecutive rows. The first row of each group has no previous row, so diff always returns NaN for the first row of each group:
df.groupby('pid').total_sum.transform(lambda x: x.diff())
Out[120]:
0 NaN
1 -4.0
2 NaN
3 -3.0
4 -2.0
5 0.0
6 -1.0
7 NaN
8 -4.0
9 0.0
Name: total_sum, dtype: float64
_ ne(0) checks whether each value is not equal to 0. It returns True where the value is not 0 (and NaN counts as not equal to 0):
df.groupby('pid').total_sum.transform(lambda x: x.diff().ne(0))
Out[121]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: total_sum, dtype: bool
_ cumsum is a cumulative sum, which successively adds up the rows. In Python, True is interpreted as 1 and False as 0. The first row of each group is always True (its diff is NaN, which is not equal to 0), so the cumsum starts from 1 in each group and adds up row by row to get the desired output:
df.groupby('pid').total_sum.transform(lambda x: x.diff().ne(0).cumsum())
Out[122]:
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 1
8 2
9 2
Name: total_sum, dtype: int32
Chain all the commands into a one-liner as follows:
df['pos'] = df.groupby('pid').total_sum.transform(lambda x: x.diff().ne(0).cumsum())
df
Out[99]:
total_sum pid pos
0 5 2 1
1 1 2 2
2 6 7 1
3 3 7 2
4 1 7 3
5 1 7 3
6 0 7 4
7 5 10 1
8 1 10 2
9 1 10 2
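The lambda is not strictly needed either: groupby has its own diff, so an equivalent (a sketch, same assumptions as above) diffs within each pid group and then cumsums the change flags per group:

df['pos'] = df.groupby('pid').total_sum.diff().ne(0).groupby(df.pid).cumsum()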
I need to compute the cumulative mean, but only while column A is different from zero. Each time it is zero, the cumulative mean should restart. Thanks so much in advance; I am not very good with Python.
Input:
ColumnA
0 5
1 6
2 7
3 0
4 0
5 1
6 2
7 3
8 0
9 5
10 10
11 15
Expected Output:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
You can try cumsum to make the groups and then expanding + mean to compute the cumulative mean:
groups=df.ColumnA.eq(0).cumsum()
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
Details:
Make groups where the column equals 0 using eq and cumsum; eq gives you a mask of True and False values, and cumsum treats those as 1 and 0:
groups=df.ColumnA.eq(0).cumsum()
groups
0 0
1 0
2 0
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ColumnA, dtype: int32
Then group by those groups and use apply to take the cumulative mean over the elements different from 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean())
ColumnA
0 5.0
1 5.5
2 6.0
3 NaN
4 NaN
5 1.0
6 1.5
7 2.0
8 NaN
9 5.0
10 7.5
11 10.0
And finally use fillna to fill the NaN values with 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
ColumnA
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
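If you would rather avoid apply, an equivalent sketch (assuming a reasonably recent pandas, which supports expanding directly on a groupby) masks the zeros first:

groups = df.ColumnA.eq(0).cumsum()
masked = df.ColumnA.where(df.ColumnA.ne(0))   # zeros become NaN and are skipped by mean
df['CumulativeMean'] = (masked.groupby(groups)
                              .expanding().mean()
                              .droplevel(0)   # drop the group level to realign with df
                              .fillna(0))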
You can use boolean indexing to compare rows that are == 0 and != 0 against the previous rows with .shift(). Then, just take the .cumsum() to separate ColumnA into groups, according to where the zeros are.
mask = (((df['ColumnA'].shift() != 0) & (df['ColumnA'] == 0)) |
        ((df['ColumnA'].shift() == 0) & (df['ColumnA'] != 0)))
df['CumulativeMean'] = df.groupby(mask.cumsum())['ColumnA'].apply(lambda x: x.expanding().mean())
Out[6]:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
I've broken down the logic of the boolean indexing inside the .groupby statement into multiple columns that build up to the final result in the column abcd_cumsum. From there, ['ColumnA'].apply(lambda x: x.expanding().mean()) takes the mean of the group up to any given row within that group. For example, the second row (index 1) takes the grouped mean of the first and second rows, but excludes the third row.
df['a'] = (df.shift()['ColumnA'] != 0)
df['b'] = (df['ColumnA'] == 0)
df['ab'] = (df['a'] & df['b'])
df['c'] = (df.shift()['ColumnA'] == 0)
df['d'] = (df['ColumnA'] != 0)
df['cd'] = (df['c'] & df['d'])
df['abcd'] = (df['ab'] | df['cd'])
df['abcd_cumsum'] = (df['ab'] | df['cd']).cumsum()
df['CumulativeMean'] = (df.groupby(df['abcd_cumsum'])['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[7]:
ColumnA a b ab c d cd abcd abcd_cumsum \
0 5 True False False False True False False 0
1 6 True False False False True False False 0
2 7 True False False False True False False 0
3 0 True True True False False False True 1
4 0 False True False True False False False 1
5 1 False False False True True True True 2
6 2 True False False False True False False 2
7 3 True False False False True False False 2
8 0 True True True False False False True 3
9 5 False False False True True True True 4
10 10 True False False False True False False 4
11 15 True False False False True False False 4
CumulativeMean
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
Having the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(10).reshape(10, 1), columns=['A'])
df.loc[2, 'A'] = 0
df.loc[6, 'A'] = 0
A
0 1
1 1
2 0
3 1
4 1
5 1
6 0
7 1
8 1
9 1
I am trying to add a new column B which contains the number of "1"-occurrences in column A since the previous "0", written on the last row of each run and 0 elsewhere. Expected output should be like this:
A B
0 1 0
1 1 2
2 0 0
3 1 0
4 1 0
5 1 3
6 0 0
7 1 0
8 1 0
9 1 3
Any efficient vectorized way to do this?
You can use:
a = df.A.groupby((df.A != df.A.shift()).cumsum()).cumcount() + 1
print (a)
0 1
1 2
2 1
3 1
4 2
5 3
6 1
7 1
8 2
9 3
dtype: int64
b = ((~df.A.astype(bool)).shift(-1).fillna(df.A.iat[-1].astype(bool)))
print (b)
0 False
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 True
Name: A, dtype: bool
df['B'] = ( a * b )
print (df)
A B
0 1.0 0
1 1.0 2
2 0.0 0
3 1.0 0
4 1.0 0
5 1.0 3
6 0.0 0
7 1.0 0
8 1.0 0
9 1.0 3
Explanation:
#difference with shifted A
df['C'] = df.A != df.A.shift()
#cumulative sum
df['D'] = (df.A != df.A.shift()).cumsum()
#cumulative count each group
df['a'] = df.A.groupby((df.A != df.A.shift()).cumsum()).cumcount() + 1
#invert and convert to boolean
df['F'] = ~df.A.astype(bool)
#shift
df['G'] = (~df.A.astype(bool)).shift(-1)
#fill last nan
df['b'] = (~df.A.astype(bool)).shift(-1).fillna(df.A.iat[-1].astype(bool))
print (df)
A B C D a F G b
0 1.0 0 True 1 1 False False False
1 1.0 2 False 1 2 False True True
2 0.0 0 True 2 1 True False False
3 1.0 0 True 3 1 False False False
4 1.0 0 False 3 2 False False False
5 1.0 3 False 3 3 False True True
6 0.0 0 True 4 1 True False False
7 1.0 0 True 5 1 False False False
8 1.0 0 False 5 2 False False False
9 1.0 3 False 5 3 False NaN True
The last NaN is problematic, so I check the last value of column A with df.A.iat[-1] and convert it to boolean. If it is 0, the output is False and finally 0; if it is 1, the output is True and the last value of a is used.
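A more compact variant sidesteps the trailing NaN entirely: ne treats the NaN produced by shift(-1) as "not equal", so the last row is automatically flagged as the end of a run (a sketch, using the same df):

run = df.A.groupby(df.A.ne(df.A.shift()).cumsum()).cumcount() + 1
end_of_run = df.A.ne(df.A.shift(-1))   # True on each run's last row, including the final row
df['B'] = run * (end_of_run & df.A.astype(bool))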
I want to delete duplicate adjacent rows in a dataframe. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True), but shift() is not behaving the way I intended.
Look at the following example
In [11]: df
Out[11]:
x y
0 a 1
1 b 2
2 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
df.x[3] equals df.x[4], but the y values differ. Yet the output is the following:
In [13]: df[df.shift() != df]
Out[13]:
x y
0 a 1
1 b 2
2 NaN NaN
3 e 4
4 NaN 5
5 f 6
6 g 7
7 h 8
I want to delete a row only if the whole row duplicates the previous one, not if it merely shares some values with it. Any idea?
Well, look at df.shift() != df:
>>> df.shift() != df
x y
0 True True
1 True True
2 False False
3 True True
4 False True
5 True True
6 True True
7 True True
This is a 2D object, not 1D, so when you use it as a filter on a frame you keep the cells that are True and get NaN where it is False. It sounds like you want to keep the rows where any column is True, which is a 1D object:
>>> (df.shift() != df).any(axis=1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
x y
0 a 1
1 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
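Plugging this back into the original attempt, the full replacement would be:

df = df[(df.shift() != df).any(axis=1)].reset_index(drop=True)

Equivalently, df.ne(df.shift()).any(axis=1) expresses the same mask in method form.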