How to remove rows based on some specific criteria - Python

I have a data frame like the table below. Based on the ranks, I want to remove rows as described below.
Data:
name date rank
angel 7/25/2017 3
maggie 8/8/2017 2
maggie 8/8/2017 1
maggie 8/8/2017 2
maggie 8/8/2017 3
smith 8/16/2017 1
smith 8/16/2017 3
laura 9/26/2017 2
laura 9/26/2017 1
laura 9/26/2017 2
laura 9/27/2017 3
lisa 9/5/2017 1
lisa 9/5/2017 3
bill 7/20/2017 1
bill 7/20/2017 3
bill 7/21/2017 3
bill 7/31/2017 3
bill 8/1/2017 3
bill 8/7/2017 1
tomy 8/1/2017 3
What I want to do is this: for every name, if a date appears only once, keep that row; but if the same name and date appear with different ranks, keep only the row with the lowest rank and remove the rest. For example, if bill has 4 rows on the same date with different ranks, I want to remove the others and keep only rank "1" with all of its row information.
The output I want is like this:
name date rank
angel 7/25/2017 3
maggie 8/8/2017 1
smith 8/16/2017 1
laura 9/26/2017 1
laura 9/27/2017 3
lisa 9/5/2017 1
bill 7/20/2017 1
bill 8/7/2017 1
tomy 8/1/2017 3
Can someone please help me with that?

I was able to get this answered by the following
`data = df.loc[df.groupby(['name', 'date'])['rank'].idxmin()]`
However, I would still like to know whether a for loop could achieve the same result. I am new to Python and would love to learn more.
Thanks
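A plain loop can reproduce the same result as the groupby/idxmin one-liner, although the one-liner is the idiomatic and faster option. A minimal sketch, assuming the frame is called df and rank is numeric (the sample below is retyped and shortened, so treat it as illustrative):

import pandas as pd

df = pd.DataFrame({
    'name': ['maggie', 'maggie', 'bill', 'bill', 'tomy'],
    'date': ['8/8/2017', '8/8/2017', '7/20/2017', '7/20/2017', '8/1/2017'],
    'rank': [2, 1, 1, 3, 3],
})

# loop equivalent of df.loc[df.groupby(['name', 'date'])['rank'].idxmin()]:
# remember, for each (name, date) pair, the index of the lowest-ranked row seen so far
best = {}
for idx, row in df.iterrows():
    key = (row['name'], row['date'])
    if key not in best or row['rank'] < df.loc[best[key], 'rank']:
        best[key] = idx

data = df.loc[sorted(best.values())]
print(data)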

Related

Assign values (1 to N) for similar rows in a dataframe Pandas [duplicate]

I have a dataframe df:
Name   Place   Price
Bob    NY      15
Jack   London  27
John   Paris   5
Bill   Sydney  3
Bob    NY      39
Jack   London  9
Bob    NY      2
Dave   NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name   Place   Price  Value
Bob    NY      15     1
Jack   London  27     1
John   Paris   5      1
Bill   Sydney  3      1
Bob    NY      39     2
Jack   London  9      2
Bob    NY      2      3
Dave   NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a cumulative count grouped on Name and Place, adding 1 because the count starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
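For a self-contained check, the same thing can be reproduced end to end (the sample data is retyped here, so treat the snippet as a sketch):

import pandas as pd

df = pd.DataFrame({
    'Name':  ['Bob', 'Jack', 'John', 'Bill', 'Bob', 'Jack', 'Bob', 'Dave'],
    'Place': ['NY', 'London', 'Paris', 'Sydney', 'NY', 'London', 'NY', 'NY'],
    'Price': [15, 27, 5, 3, 39, 9, 2, 7],
})

# cumcount() numbers rows within each (Name, Place) group starting at 0,
# so add 1 to get the 1-based counter
df['Value'] = df.groupby(['Name', 'Place']).cumcount().add(1)
print(df)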

How to drop duplicates in column with respect to values in another column in pandas?

I have a database with person names and the date of their visit. I need to remove duplicated rows in the "Visit_date" column with respect to each person in the other column. The database is very large, so I need code that works at scale. I've spent several days trying to do this with no result. Here is a sample:
Person Visit_date
0 John 11.09.2020
1 John 11.09.2020
2 John 11.08.2020
3 Andy 11.07.2020
4 Andy 11.09.2020
5 Andy 11.09.2020
6 George 11.09.2020
7 George 11.09.2020
8 George 11.07.2020
9 George 11.07.2020
The code should return:
Person Visit_date
0 John 11.09.2020
1 John 11.08.2020
2 Andy 11.07.2020
3 Andy 11.09.2020
4 George 11.09.2020
5 George 11.07.2020
Hope this helps. Use df.drop_duplicates() and then df.reset_index(drop=True):
import pandas as pd
df = pd.DataFrame({"Person" :['John','John','John','Andy','Andy','Andy','George','George','George','George'],"Visit_date" :['11.09.2020','11.09.2020','11.08.2020','11.07.2020','11.09.2020','11.09.2020','11.09.2020','11.09.2020','11.07.2020','11.07.2020']})
df=df.drop_duplicates()
df=df.reset_index(drop=True)
print(df)
[Result]:
Person Visit_date
0 John 11.09.2020
1 John 11.08.2020
2 Andy 11.07.2020
3 Andy 11.09.2020
4 George 11.09.2020
5 George 11.07.2020
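Note that drop_duplicates() with no arguments compares entire rows, which is fine here because the frame has only these two columns. If the real database has additional columns and you want one row per person per visit date, restricting the comparison with the subset argument should work (column names assumed from the sample):

df = df.drop_duplicates(subset=['Person', 'Visit_date']).reset_index(drop=True)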

Pandas group by a specific value in any of given columns

Given the pandas dataframe as follows:
Partner1 Partner2 Interactions
0 Ann Alice 1
1 Alice Kate 8
2 Kate Tony 9
3 Tony Ann 2
How can I group by a specific partner, let's say to find the total number of interactions of Ann?
Something like
gb = df.groupby(['Partner1'] or ['Partner2']).agg({'Interactions': 'sum'})
and getting the answer:
Partner Interactions
Ann 3
Alice 9
Kate 17
Tony 11
You can use melt together with groupby. First melt:
df = pd.melt(df, id_vars='Interactions', value_vars=['Partner1', 'Partner2'], value_name='Partner')
This will give:
Interactions variable Partner
0 1 Partner1 Ann
1 8 Partner1 Alice
2 9 Partner1 Kate
3 2 Partner1 Tony
4 1 Partner2 Alice
5 8 Partner2 Kate
6 9 Partner2 Tony
7 2 Partner2 Ann
Now, group by Partner and sum:
df.groupby('Partner')[['Interactions']].sum()
Result:
Partner Interactions
Alice 9
Ann 3
Kate 17
Tony 11
You can also merge the dataframe with itself:
# join the df to itself
join_df = df.merge(df, left_on='Partner1', right_on='Partner2', suffixes=('', '_'))
# get sum
join_df['InteractionsSum'] = join_df[['Interactions', 'Interactions_']].sum(axis=1)
join_df = join_df[['Partner1', 'InteractionsSum']].copy()
print(join_df)
Partner1 InteractionsSum
0 Ann 3
1 Alice 9
2 Kate 17
3 Tony 11
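Note that this self-merge only yields the full totals when each partner appears exactly once in each column, as in this sample. An alternative to melt is to stack the two partner columns with pd.concat and then group; a minimal sketch (sample data retyped here):

import pandas as pd

df = pd.DataFrame({'Partner1': ['Ann', 'Alice', 'Kate', 'Tony'],
                   'Partner2': ['Alice', 'Kate', 'Tony', 'Ann'],
                   'Interactions': [1, 8, 9, 2]})

# stack Partner1 and Partner2 into a single Partner column, then sum per partner
stacked = pd.concat([
    df[['Partner1', 'Interactions']].rename(columns={'Partner1': 'Partner'}),
    df[['Partner2', 'Interactions']].rename(columns={'Partner2': 'Partner'}),
])
print(stacked.groupby('Partner')['Interactions'].sum())

This gives the same totals as the melt approach (Alice 9, Ann 3, Kate 17, Tony 11).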

Create a cumulative count column in pandas dataframe

I have a dataframe set up similar to this
Person Value
Joe 3
Jake 4
Patrick 2
Stacey 1
Joe 5
Stacey 6
Lara 7
Joe 2
Stacey 1
I need to create a new column 'x' which keeps a running count of how many times each person's name has appeared so far in the list.
Expected output:
Person Value x
Joe 3 1
Jake 4 1
Patrick 2 1
Stacey 1 1
Joe 5 2
Stacey 6 2
Lara 7 1
Joe 2 3
Stacey 1 3
All I've managed so far is to create an overall count, which is not what I'm looking for.
Any help is appreciated
You could let
df['x'] = df.groupby('Person').cumcount() + 1
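Since cumcount() numbers rows within each group in order of appearance, no sorting is needed. A quick self-contained check (the sample rows are retyped here):

import pandas as pd

df = pd.DataFrame({'Person': ['Joe', 'Jake', 'Patrick', 'Stacey', 'Joe', 'Stacey', 'Lara', 'Joe', 'Stacey'],
                   'Value': [3, 4, 2, 1, 5, 6, 7, 2, 1]})

df['x'] = df.groupby('Person').cumcount() + 1
print(df)

This reproduces the expected output above.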

How to assign a unique ID to detect repeated rows in a pandas dataframe?

I am working with a large pandas dataframe, with several columns pretty much like this:
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
Homer Bart 2 3
Lisa John 5 0
Homer Bart 2 3
Homer Bart 2 3
Tom Maggie 1 4
How can I assign a unique id to each repeated row? For example:
A B C D new_id
John Tom 0 1.2 1
Homer Bart 2 3.0 2
Tom Maggie 1 4.2 3
Lisa John 5 0 4
Homer Bart 2 3 5
Lisa John 5 0 4
Homer Bart 2 3.0 2
Homer Bart 2 3.0 2
Tom Maggie 1 4.1 6
I know that I can use duplicated() to detect the duplicated rows; however, I cannot see where those rows are repeated. I tried:
df.assign(id=(df.columns).astype('category').cat.codes)
df
However, this is not working. How can I get a unique id for detecting groups of duplicated rows?
For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
groupby is more efficient for larger dataframes:
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
Group by the columns you are trying to find duplicates over and use ngroup:
df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
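Note that ngroup() follows the groupby's sort parameter: by default the ids follow the sorted order of the group keys, whereas sort=False (as in the previous snippet) numbers groups in order of first appearance. If you want appearance order and a 1-based id with explicit columns, something like this should do it:

df['new_id'] = df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup() + 1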
