How to remove rows with multiple occurrences in a row with pandas - python

I have this data:
   A
1  1
2  1
3  1
4  2
5  2
6  1
I expect to get:
   A
1  1
-  -   -> (drop)
3  1
4  2
5  2
6  1
I want to drop all rows in column 'A' whose value repeats consecutively, but keep the first and the last row of each run.
Until now I used:
df = df.loc[df[col].shift() != df[col]]
but it also removes the last row of each run.
Thanks in advance.

Looks like you have the same problem as this question: Pandas drop_duplicates. Keep first AND last. Is it possible?.
The suggested solution is:
pd.concat([
    df['A'].drop_duplicates(keep='first'),
    df['A'].drop_duplicates(keep='last'),
])
Update after clarification (the goal is to keep the first and last row of each consecutive run, not just the global first and last occurrence of each value):
First get the boolean masks for your described criteria:
is_last = df['A'] != df['A'].shift(-1)
is_duplicate = df['A'] == df['A'].shift()
And drop the rows based on these:
df.drop(df.index[~is_last & is_duplicate]) # note the ~ to negate is_last
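Putting the two masks together on the sample data gives the expected result; a minimal, self-contained sketch (the DataFrame construction is assumed from the sample above):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 1]}, index=range(1, 7))

# last row of each consecutive run: the next value differs
is_last = df['A'] != df['A'].shift(-1)
# row repeats the value of the row directly above it
is_duplicate = df['A'] == df['A'].shift()

print(df.drop(df.index[~is_last & is_duplicate]))
#    A
# 1  1
# 3  1
# 4  2
# 5  2
# 6  1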

Basically you need to group consecutive numbers, which can be achieved by diff and cumsum:
print (df.groupby(df["A"].diff().ne(0).cumsum(), as_index=False).nth([0, -1]))
   A
1  1
3  1
4  2
5  2
6  1
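To see why this works, look at the intermediate group labels; a small sketch on the sample data (DataFrame construction assumed from the question):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 1]}, index=range(1, 7))

# diff() is non-zero (or NaN) exactly where the value changes,
# so cumsum() over that boolean assigns one id per consecutive run
groups = df['A'].diff().ne(0).cumsum()
print(groups.tolist())  # [1, 1, 1, 2, 2, 3]

# nth([0, -1]) then keeps the first and last row of each run
print(df.groupby(groups, as_index=False).nth([0, -1]))  # same output as above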

Looping with an if statement over a dataframe

I'm running into an issue when iterating over rows in a pandas DataFrame.
This is the code I am trying to run:
data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2],
       }
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
df['combined'].astype('float64')
df
for index, row in df.iterrows():
    if row['test']>=1 & row['test2']>=1:
        row['combined'] /= 2
    else:
        pass
So it should divide combined by 2 if both test and test2 have a value of 1 or more; however, it doesn't divide all the rows that should be divided.
Am I making a mistake somewhere?
This is the outcome when I run the code (columns are test, test2 and combined):
   test  test2  combined
0     1      0         1
1     1      2         3
2     0      0         0
3     0      1         1
4     3      1         2
5     1      2         3
6     0      7         7
7     3      3         3
8     0      2         2
You are using &, the bitwise AND operator, and & binds more tightly than the comparisons: row['test']>=1 & row['test2']>=1 is parsed as row['test'] >= (1 & row['test2']) >= 1. That is what makes the if statement give an answer you don't expect. Either parenthesize each comparison, (row['test']>=1) & (row['test2']>=1), or use the boolean and operator, since these are scalar values.
What you are doing is also bad practice in general: iterating over the rows should be avoided for performance reasons unless strictly necessary, and the row that iterrows yields is a copy, so writing to it is not guaranteed to modify the original frame. The solution is to define a mask with your conditions and operate within the mask using .loc:
data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2],
       }
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
# astype returns a copy, so assign the result back
df['combined'] = df['combined'].astype('float64')
mask = (df['test'] >= 1) & (df['test2'] >= 1)
df.loc[mask, 'combined'] /= 2
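Continuing from the block above, the resulting frame is (values computed from the sample data):

print(df)
#    test  test2  combined
# 0     1      0       1.0
# 1     1      2       1.5
# 2     0      0       0.0
# 3     0      1       1.0
# 4     3      1       2.0
# 5     1      2       1.5
# 6     0      7       7.0
# 7     3      3       3.0
# 8     0      2       2.0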

Pandas How to find duplicate row in group

Everyone, I am trying to find duplicate rows in a doubly grouped DataFrame and I don't understand how to do it.
df_part[df_part.income_flag==1].groupby(['app_id', 'month_num'])['amnt'].duplicate()
(The example df and the desired output were shown as screenshots.)
Using the code below I can see that there are two rows with the same 'amnt' value, 0.387677, but in different months; that is the information I need:
df_part[(df_part.income_flag==2) & df_part.duplicated(['app_id','amnt'], keep=False)].groupby(['app_id', 'amnt', 'month_num'])['month_num'].count().head(10)
app_id  amnt      month_num
0       0.348838  3            1
        0.387677  6            1
                  10           2
        0.426544  2            2
        0.475654  2            1
        0.488173  1            1
1       0.297589  1            1
                  4            1
        0.348838  2            1
        0.426544  8            3
Name: month_num, dtype: int64
Thanks all.
I think you need to chain another mask with & (bitwise AND) using DataFrame.duplicated, and then use GroupBy.size:
df = (df_part[(df_part.income_flag==1) & df_part.duplicated(['app_id','amnt'], keep=False)]
          .groupby('app_id')['amnt']
          .size()
          .reset_index(name='duplicate_count'))
print (df)
   app_id  duplicate_count
0      12                2
1      13                3
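A small sketch of what duplicated(..., keep=False) marks, since that is what drives the filter (toy data, assumed for illustration):

import pandas as pd

df = pd.DataFrame({'app_id': [0, 0, 0, 1],
                   'amnt':   [0.38, 0.38, 0.42, 0.38]})

# keep=False flags every member of a duplicated (app_id, amnt) pair,
# not just the second and later occurrences
print(df.duplicated(['app_id', 'amnt'], keep=False).tolist())
# [True, True, False, False]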

Pandas delete first n rows until condition on columns is fulfilled

I am trying to delete some rows from my dataframe. In fact I want to delete the first n rows, where n is the row number at which a certain condition is first fulfilled: the dataframe should start with the row that contains the x-y values xEnd, yEnd, and all earlier rows shall be dropped. Somehow I do not get the solution. This is what I have so far.
Example:
import pandas as pd
xEnd=2
yEnd=3
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
n=df["id"].iloc[df["x"]==xEnd and df["y"]==yEnd]
df = df.iloc[n:]
I want my code to reduce the dataframe from
{'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]}
to
{'x':[2,2,2], 'y':[3,4,3], 'id':[3,4,5]}
Use & instead of and.
Use loc instead of iloc. You can use iloc, but it could break depending on the index.
Use idxmax to find the first position.
# idxmax finds the index label of the first True; labels go with loc
df.loc[((df['x'] == xEnd) & (df['y'] == yEnd)).idxmax():]

   id  x  y
3   3  2  3
4   4  2  4
5   5  2  3
Here is an iloc variation
# values.argmax finds the integer position of the first True; positions go with iloc
df.iloc[((df['x'] == xEnd) & (df['y'] == yEnd)).values.argmax():]

   id  x  y
3   3  2  3
4   4  2  4
5   5  2  3
Using cummax
df[((df['x'] == xEnd) & (df['y'] == yEnd)).cummax()]
Out[147]:
   id  x  y
3   3  2  3
4   4  2  4
5   5  2  3
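The cummax trick works because a boolean mask, once it turns True, stays True under a cumulative maximum; a quick sketch on the question's data:

import pandas as pd

xEnd, yEnd = 2, 3
df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 3, 4, 3],
                   'id': [0, 1, 2, 3, 4, 5]})

mask = (df['x'] == xEnd) & (df['y'] == yEnd)
print(mask.tolist())           # [False, False, False, True, False, True]
print(mask.cummax().tolist())  # [False, False, False, True, True, True]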

Pandas multi index Dataframe - Select and remove

I need some help with cleaning a DataFrame that has a MultiIndex.
It looks something like this:
                          cost
location         season
Thorp park       autumn    £12
                 spring    £13
                 summer    £22
Sea life centre  summer    £34
                 spring    £43
Alton towers     ...and so on
location and season are index columns. I want to go through the data and remove any locations that don't have "season" values of all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Also another question, my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
     col
a 1    0
  2    1
b 1    3
  2    4
  3    5
c 2    7
  3    8
# select the column so that v is a flat, one-dimensional array
v = df.groupby(level=0)['col'].transform('count').values
df = df[v == 3]
df
     col
b 1    3
  2    4
  3    5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g.count() == 3)
     col
b 1    3
  2    4
  3    5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
Option 2
Robust for the general case with an undetermined number of seasons.
This uses the groupby.pipe method introduced in pandas 0.21: pipe hands the GroupBy object itself to the lambda, so the maximum group size is available while filtering.
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
     col
b 1    3
  2    4
  3    5
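A short sketch of the two pieces, continuing with the toy frame above, to show what the pipe gives access to:

g = df.groupby(level=0)
print(g.size())        # a: 2, b: 3, c: 2 rows per group
print(g.size().max())  # 3 -> the size of a complete group, used by the filter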

Pandas - Select rows based on other rows

Let's say I have a dataframe:
df = pd.DataFrame({'A':[1,2,3,3,6,8,1,4]})
df
   A
0  1
1  2
2  3
3  3
4  6
5  8
6  1
7  4
And I want to select rows which are preceded by a row that contains a one:
selection
   A
1  2
7  4
Right now, I can solve this by selecting rows with ones, getting their indices, adding one to the indices, and then using iloc:
df.iloc[df[df.A == 1].index + 1]
But I'm wondering if there is a "more pandas" way to do this. Further, if the search was more complicated like: select all rows preceded by a 1 and followed by a 3. Or what if the index wasn't just integers, but timestamps. How do I express inter-row dependencies cleanly?
Solution
df[df.shift().A == 1]
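For the more complicated search mentioned in the question (rows preceded by a 1 and followed by a 3), the same idea composes with a forward and a backward shift; a sketch using the df defined above:

# previous row is 1 AND next row is 3
print(df[(df.A.shift() == 1) & (df.A.shift(-1) == 3)])
#    A
# 1  2

Because shift is positional, this also works unchanged when the index is timestamps rather than integers.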
