I want to delete duplicate adjacent rows in a dataframe. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True), but shift() is not behaving the way I intended.
Look at the following example:
In [11]: df
Out[11]:
   x  y
0  a  1
1  b  2
2  b  2
3  e  4
4  e  5
5  f  6
6  g  7
7  h  8
df.x[3] equals df.x[4], but the numbers differ, so the rows are not true duplicates. Yet the output is the following:
In [13]: df[df.shift() != df]
Out[13]:
     x    y
0    a    1
1    b    2
2  NaN  NaN
3    e    4
4  NaN    5
5    f    6
6    g    7
7    h    8
I want to delete a row only if it really is a duplicate of the previous row, not if it merely shares some values with it. Any idea?
Well, look at df.shift() != df:
>>> df.shift() != df
       x      y
0   True   True
1   True   True
2  False  False
3   True   True
4  False   True
5   True   True
6   True   True
7   True   True
This is a 2D object, not 1D, so when you use it as a filter on a frame you keep the cells that are True and get NaN in the cells that are False. It sounds like you want to keep the rows where any cell is True -- which is a 1D object:
>>> (df.shift() != df).any(axis=1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
   x  y
0  a  1
1  b  2
3  e  4
4  e  5
5  f  6
6  g  7
7  h  8
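Putting it together with the reset_index(drop=True) from the original attempt, a minimal self-contained sketch (rebuilding the sample frame):

import pandas as pd

df = pd.DataFrame({'x': list('abbeefgh'), 'y': [1, 2, 2, 4, 5, 6, 7, 8]})

# keep a row only when it differs from the previous row in at least one column
deduped = df[(df.shift() != df).any(axis=1)].reset_index(drop=True)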
Related
In a Pandas DataFrame I have one series, A:
Index   A
0       2
1       1
2       6
3       3
4       2
5       7
6       1
7       3
8       2
9       1
10      3
I would like to check whether each 2 in column A lies between two 1s, and write the result to column B like this:
Index   A   B
0       2   FALSE
1       1   FALSE
2       6   FALSE
3       3   FALSE
4       2   TRUE
5       7   FALSE
6       1   FALSE
7       3   FALSE
8       2   TRUE
9       1   FALSE
10      3   FALSE
I thought of using rolling(), but can rolling() work with a variable-sized window of values (ranges)?
mask = (df["A"] == 1) | (df["A"] == 2)
df = df.assign(B=df.loc[mask, "A"].sub(df.loc[mask, "A"].shift()).eq(1)).fillna(False)
>>> df
A B
0 2 False
1 1 False
2 6 False
3 3 False
4 2 True
5 7 False
6 1 False
7 3 False
8 2 True
9 1 False
10 3 False
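For clarity, the same logic split into steps (a sketch, using the mask defined above):

sub = df.loc[mask, "A"]         # just the 1s and 2s, in original order
b = sub.sub(sub.shift()).eq(1)  # True where a 2 directly follows a 1 (2 - 1 == 1)
df["B"] = b.reindex(df.index, fill_value=False)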
IIUC, use np.where to set True for the 2s in 'A' that appear after the first 1 is found:
df['B'] = np.where(df['A'].eq(1).cumsum().ge(1) & df['A'].eq(2), True, False)
df:
A B
0 2 False
1 1 False
2 6 False
3 3 False
4 2 True
5 7 False
6 1 False
7 3 False
8 2 True
9 1 False
10 3 False
Imports and DataFrame used:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2, 1, 6, 3, 2, 7, 1, 3, 2, 1, 3]})
Here is a very basic and straightforward way to do it:
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 6, 3, 2, 7, 1, 3, 2, 1, 3]})

for i, cell in enumerate(df["A"]):
    # a 2 counts only if there is a 1 somewhere before it and somewhere after it
    if cell == 2 and (1 in list(df.loc[:i, "A"])) and (1 in list(df.loc[i:, "A"])):
        df.at[i, "B"] = True
    else:
        df.at[i, "B"] = False
Given the following data
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [4, 5, 9, 5, 6, 4, 0]})
df["split_by"] = df["b"].eq(9)
which looks as
a b split_by
0 1 4 False
1 2 5 False
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
I would like to create two dataframes as follows:
a b split_by
0 1 4 False
1 2 5 False
and
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
Clearly this is based on the value in column split_by, but I'm not sure how to subset using this.
My approach is:
split_1 = df.index < df[df["split_by"].eq(True)].index.to_list()[0]
split_2 = ~split_1
df1 = df[split_1]
df2 = df[split_2]
Use argmax as:
true_index = df['split_by'].argmax()
df1 = df.loc[:true_index-1, :]
df2 = df.loc[true_index:, :]
print(df1)
a b split_by
0 1 4 False
1 2 5 False
print(df2)
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
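Note that Series.argmax returns a position while .loc slices by label; the two coincide here only because of the default RangeIndex. A position-based variant (a sketch) that does not depend on the index labels:

pos = df['split_by'].argmax()  # position of the first True
df1 = df.iloc[:pos]
df2 = df.iloc[pos:]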
Another approach:
i = df[df['split_by']==True].index.values[0]
df1 = df.iloc[:i]
df2 = df.iloc[i:]
This assumes you have only one True. If there is more than one True, this code will still split df into just two dataframes, using only the first True.
Use groupby with cumsum. Note that if you have more than one True, this will split the dataframe into n+1 dfs (for n Trues):
d={x : y for x , y in df.groupby(df.split_by.cumsum())}
d[0]
a b split_by
0 1 4 False
1 2 5 False
d[1]
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
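If a list is more convenient than a dict, the same idea can be written as (a sketch):

parts = [g for _, g in df.groupby(df['split_by'].cumsum())]
df1, df2 = parts  # exactly two frames here, since there is a single True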
I'm using Python and need to cumsum() the Value column, resetting the running total each time the Bool column changes from True back to False. How can I solve this task?
   Bool   Value  Expected_cumsum
0  False  1      1
1  False  2      3
2  False  4      7
3  True   1      8
4  False  3      3   << reset from here
5  False  5      8
6  True   2      10
....
Thanks all!
You can try this:
a = df.Bool.eq(True).cumsum().shift().fillna(0)
df['Expected_cumsum']= df.groupby(a)['Value'].cumsum()
df
Output
   Bool   Value  Expected_cumsum
0  False  1      1
1  False  2      3
2  False  4      7
3  True   1      8
4  False  3      3
5  False  5      8
6  True   2      10
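For reference, a self-contained version of the same approach (reconstructing the sample frame from the question):

import pandas as pd

df = pd.DataFrame({'Bool': [False, False, False, True, False, False, True],
                   'Value': [1, 2, 4, 1, 3, 5, 2]})

# segment id: bump the counter after each True so the True row still closes its own segment
segment = df['Bool'].cumsum().shift().fillna(0)
df['Expected_cumsum'] = df.groupby(segment)['Value'].cumsum()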
Let's say I have something that looks like this:
df = pd.DataFrame({'Event':['A','A','A','A', 'A' ,'B','B','B','B','B'], 'Number':[1,2,3,4,5,6,7,8,9,10],'Ref':[False,False,False,False,True,False,False,False,True,False]})
What I want to do is create a new column which is the difference between Number and the Number of the Ref == True row within each group. So for the A group, the True is the last row, and the column would read -4, -3, -2, -1, 0. I have been thinking of doing the following:
for col in df.groupby('Event'):
    temp = col[1]
    reference = temp[temp.Ref == True]
    dist1 = temp.apply(lambda x: x.Number - reference.Number, axis=1)
This seems to correctly calculate for each group, but I am not sure how to join the result into the df.
In your case
df['new']=(df.set_index('Event').Number-df.query('Ref').set_index('Event').Number).to_numpy()
df
Event Number Ref new
0 A 1 False -4
1 A 2 False -3
2 A 3 False -2
3 A 4 False -1
4 A 5 True 0
5 B 6 False -3
6 B 7 False -2
7 B 8 False -1
8 B 9 True 0
9 B 10 False 1
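Unpacked, here is roughly what that one-liner is doing (a sketch):

per_event_ref = df.query('Ref').set_index('Event')['Number']  # the reference Number for each Event
# subtraction aligns on the Event label, then we drop the index again
df['new'] = (df.set_index('Event')['Number'] - per_event_ref).to_numpy()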
You could do the following:
df["new"] = df.Number - df.Number[df.groupby('Event')['Ref'].transform('idxmax')].reset_index(drop=True)
print(df)
Output
Event Number Ref new
0 A 1 False -4
1 A 2 False -3
2 A 3 False -2
3 A 4 False -1
4 A 5 True 0
5 B 6 False -3
6 B 7 False -2
7 B 8 False -1
8 B 9 True 0
9 B 10 False 1
This: df.groupby('Event')['Ref'].transform('idxmax') will find, per group, the index of the row where Ref is True. Basically it finds the indices of the max values; given that True = 1 and False = 0, it finds the indices of the True values.
Try where and groupby transform first:
s = df.Number.where(df.Ref).groupby(df.Event).transform('first')
df.Number - s
Out[319]:
0 -4.0
1 -3.0
2 -2.0
3 -1.0
4 0.0
5 -3.0
6 -2.0
7 -1.0
8 0.0
9 1.0
Name: Number, dtype: float64
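To attach the result back to the frame: the where step introduces NaN (hence the float dtype), but transform('first') fills every group, so casting back to int is safe:

df['new'] = (df.Number - s).astype(int)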
I have two DataFrames in pandas:
dfm_one
data group_a group_b
0 3 a z
1 1 a z
2 2 b x
3 0 b x
4 0 b x
5 1 b z
6 0 c x
7 0 c y
8 3 c z
9 3 c z
dfm_two
data group_a group_b
0 4 a x
1 4 a y
2 4 b x
3 4 b x
4 4 b y
5 1 b y
6 1 b z
7 1 c x
8 4 c y
9 3 c z
10 2 c z
As output I want a boolean column that indicates, for each row of dfm_one, whether there is a matching data entry (i.e. one with the same value) in dfm_two for that group_a/group_b combination.
So my expected output is:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
I'm guessing the code should look something like:
dfm_one.groupby(['group_a','group_b']).apply(lambda x: ??)
and that the function inside apply should make use of the isin method.
Another solution might be to merge the two datasets but I think this is not trivial since there is no unique identifier in the DataFrame.
OK, this is a slight hack: if we cast the df to str dtype, we can call sum to concatenate each row into a single string. We can then use the resulting string as a kind of unique identifier and call isin against the other df, again converted to str:
In [91]:
dfm_one.astype(str).sum(axis=1).isin(dfm_two.astype(str).sum(axis=1))
Out[91]:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
dtype: bool
Output from the conversions:
In [92]:
dfm_one.astype(str).sum(axis=1)
Out[92]:
0 3az
1 1az
2 2bx
3 0bx
4 0bx
5 1bz
6 0cx
7 0cy
8 3cz
9 3cz
dtype: object
In [93]:
dfm_two.astype(str).sum(axis=1)
Out[93]:
0 4ax
1 4ay
2 4bx
3 4bx
4 4by
5 1by
6 1bz
7 1cx
8 4cy
9 3cz
10 2cz
dtype: object
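As the question anticipates, a merge also works even without a unique key, by merging on all three columns; a sketch using the indicator flag (the match column name is just for illustration):

# deduplicate the right side first so duplicate matches don't multiply left rows
matched = dfm_one.merge(dfm_two.drop_duplicates(),
                        on=['data', 'group_a', 'group_b'],
                        how='left', indicator=True)
dfm_one['match'] = matched['_merge'].eq('both').to_numpy()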