I'm reading the documentation to understand the filter method when used with groupby. To understand it, I've set up the scenario below:
I'm trying to get the duplicate names, grouped by city, from my DataFrame df.
Here is my attempt:
df = pd.DataFrame({
'city':['LA','LA','LA','LA','NY', 'NY'],
'name':['Ana','Pedro','Maria','Maria','Peter','Peter'],
'age':[24, 27, 19, 34, 31, 20],
'sex':['F','M','F','F','M', 'M'] })
df_filtered = df.groupby('city').filter(lambda x: len(x['name']) >= 2)
df_filtered
The output I'm getting is:
city name age sex
LA Ana 24 F
LA Pedro 27 M
LA Maria 19 F
LA Maria 34 F
NY Peter 31 M
NY Peter 20 M
The output I'm expecting is:
city name age sex
LA Maria 19 F
LA Maria 34 F
NY Peter 31 M
NY Peter 20 M
It's not clear to me in which cases I have to use different column names in the groupby call versus in the len inside the filter method.
Thank you
How about just duplicated:
df[df.duplicated(['city', 'name'], keep=False)]
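Here keep=False marks every occurrence of a duplicated ('city', 'name') pair rather than keeping only the first or last one, so on your sample df it returns exactly the rows you expect:
df[df.duplicated(['city', 'name'], keep=False)]
  city   name age sex
2   LA  Maria  19   F
3   LA  Maria  34   F
4   NY  Peter  31   M
5   NY  Peter  20   M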
You should group by the two columns, 'city' and 'name':
Yourdf = df.groupby(['city','name']).filter(lambda x: len(x) >= 2)
Yourdf
Out[234]:
city name age sex
2 LA Maria 19 F
3 LA Maria 34 F
4 NY Peter 31 M
5 NY Peter 20 M
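The reason this works: filter now sees one sub-DataFrame per (city, name) pair, so len(x) counts how many rows share that exact pair, whereas your original groupby('city') counted all the names in each city. You can check the group sizes on your sample df:
df.groupby(['city','name']).size()
city  name
LA    Ana      1
      Maria    2
      Pedro    1
NY    Peter    2
dtype: int64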
I'm working on transforming a dataframe to show the top 3 earners.
The dataframe looks like this
data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie','Evelyn'], 'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)
print(df)
Name Sale
0 Allistair 20
1 Bob 21
2 Carrie 19
3 Diane 18
4 Allistair 5
5 Bob 300
6 Carrie 35
7 Evelyn 22
In my actual dataset, I have several more columns and rows, and I want to print out and get to
something like
Name Sale
0 Bob 321
1 Carrie 35
2 Allistair 25
Every variation that I've searched through doesn't quite get there, because I get the error:
'Name' is both an index level and a column label, which is ambiguous.
Use groupby:
>>> df.groupby('Name').sum().sort_values('Sale', ascending=False)
Sale
Name
Bob 321
Carrie 54
Allistair 25
Evelyn 22
Diane 18
Thanks to @Andrej Kasely above:
df.groupby("Name")["Sale"].sum().nlargest(3)
I'll try my best to explain this, as I had trouble phrasing the title. I have two dataframes. What I would like to do is insert the columns from df1 into df2, interleaved between every other column.
For example, df1 looks like this :
Age City
0 34 Sydney
1 30 Toronto
2 31 Mumbai
3 32 Richmond
And after adding in df2 it looks like this:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
In terms of code, I wasn't quite sure where to even start.
'''Concatenating the dataframes'''
for i in range(len(df2)):
    pos = i + 1
    df3 = df2.insert  # incomplete - not sure what arguments to pass
#df2 = pd.concat([df1, df2], axis=1).sort_index(axis=1)
#df2.columns = np.arange(len(df2.columns))
#print (df2)
I was originally going to run it through a loop, but I wasn't quite sure how to do it. Any help would be appreciated!
You can use itertools.zip_longest. For example:
from itertools import zip_longest
new_columns = [
    v
    for v in (c for a in zip_longest(df2.columns, df1.columns) for c in a)
    if v is not None
]
df_out = pd.concat([df1, df2], axis=1)[new_columns]
print(df_out)
Prints:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
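The trick is that zip_longest pads the shorter column list with None, so the generator yields the columns already interleaved and the if v is not None clause drops the padding. A quick illustration of the pairing it produces:
from itertools import zip_longest
# pairs df2's columns with df1's, padding the leftover with None
print(list(zip_longest(['Name', 'Clicks', 'Country'], ['Age', 'City'])))
# [('Name', 'Age'), ('Clicks', 'City'), ('Country', None)]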
I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
"""Shift row, given by index_to_shift, to bottom of df."""
idx = df.index.tolist()
idx.pop(index_to_shift)
df = df.reindex(idx + [index_to_shift])
return df
def shift_row_to_top(df, index_to_shift):
"""Shift row, given by index_to_shift, to top of df."""
idx = df.index.tolist()
idx.pop(index_to_shift)
df = df.reindex([index_to_shift] + idx)
return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country' : ['USA', 'GE', 'Russia', 'BR', 'France'],
'ID' : ['11', '22', '33','44', '55'],
'City' : ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
'short_name' : ['NY', 'Ber', 'Mosc','Lon', 'Pa']
})
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
This is my dataframe. Now, apply the function for the first time and move the row with index 0 to the bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply function again. This time move row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when I apply the function a second time. The problem is analogous for the function shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a plain Python list, and idx.pop(index_to_shift) removes the element at position index_to_shift, not the element whose value is index_to_shift; after the first shift the positions no longer match the labels, as in the second case.
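You can see it with the index from your example. After the first shift the index list is [1, 2, 3, 4, 0], so popping position 2 removes the label 3, not the row labelled 2:
idx = [1, 2, 3, 4, 0]  # df_shifted.index after the first shift
idx.pop(2)             # removes the element at position 2, which is the label 3
print(idx + [2])       # [1, 2, 4, 0, 2] -> label 2 ends up duplicated
Reindexing with that list drops the BR row and repeats Russia, which is exactly the output you got.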
Try this function:
def shift_row_to_bottom(df, index_to_shift):
idx = [i for i in df.index if i!=index_to_shift]
return df.loc[idx+[index_to_shift]]
# call the function twice
for i in range(2): df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
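For symmetry, the same fix applies to shift_row_to_top; a sketch along the same lines:
def shift_row_to_top(df, index_to_shift):
    """Shift row, given by its index label, to top of df."""
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]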
I have a large df called data which looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
I have another dataframe called updates. In this example the dataframe has updated information for data for a couple of records and looks like:
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 05/09/14
1 10610.0 Cooper Amy 16/08/12
I'm trying to find a way to update data with the updates df so the resulting dataframe looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
As you can see the Date change field for Bob in the data df has been updated with the Date change from the updates df.
What can I try next?
A while back I was dealing with that too. The straight-up .update was giving me issues (sorry, I can't remember the exact issue I had; I think it was that .update relies on the indexes matching, and they didn't match in my two separate dataframes, so I wanted to use certain columns as my index to update on).
But I made a function to deal with it. It might be overkill for what's needed here, but try it and see if it works.
I'm also assuming the date you want to take from the updates dataframe should be 15/09/14, not 05/09/14, so I used that in my sample data below.
Also, I'm assuming Identifier is a unique key. If not, you'll need to include multiple columns as your unique key.
import pandas as pd
data = pd.DataFrame([[12233.0,'Smith','Bob','','FT','NW'],
[54213.0,'Jones','Sally','15/04/15','FT','NW'],
[12237.0,'Evans','Steve','26/08/14','FT','SE'],
[10610.0,'Cooper','Amy','16/08/12','FT','SE']],
columns = ['Identifier','Surname','First names(s)','Date change','Work Pattern','Region'])
updates = pd.DataFrame([[12233.0,'Smith','Bob','15/09/14'],
[10610.0,'Cooper','Amy','16/08/12']],
columns = ['Identifier','Surname','First names(s)','Date change'])
def update(df1, df2, keys_list):
    df1 = df1.set_index(keys_list)
    df2 = df2.set_index(keys_list)
    # Index.get_duplicates() was removed in newer pandas; use duplicated() instead
    dup_idx1 = df1.index[df1.index.duplicated()].unique()
    dup_idx2 = df2.index[df2.index.duplicated()].unique()
    if len(dup_idx1) > 0 or len(dup_idx2) > 0:
        print('\n' + '#'*50 + '\nError! Duplicate indices:')
        for element in dup_idx1:
            print('df1: %s' % (element,))
        for element in dup_idx2:
            print('df2: %s' % (element,))
        print('#'*50 + '\n\n')
    df1.update(df2, overwrite=True)
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    return df1
# the 3rd input is a list, in case you need multiple columns as your unique key
df = update(data, updates, ['Identifier'])
Output:
print (data)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
print (updates)
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 15/09/14
1 10610.0 Cooper Amy 16/08/12
df = update(data, updates, ['Identifier'])
In [19]: print (df)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
Using DataFrame.update.
First set index:
data.set_index('Identifier', inplace=True)
updates.set_index('Identifier', inplace=True)
Then update:
data.update(updates)
print(data)
Surname First names(s) Date change Work Pattern Region
Identifier
12233.0 Smith Bob 15/09/14 FT NW
54213.0 Jones Sally 15/04/15 FT NW
12237.0 Evans Steve 26/08/14 FT SE
10610.0 Cooper Amy 16/08/12 FT SE
If you need multiple columns to create a unique index you can just set them with a list. For example:
data.set_index(['Identifier', 'Surname'], inplace=True)
updates.set_index(['Identifier', 'Surname'], inplace=True)
data.update(updates)
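Note that data and updates keep Identifier (or the Identifier/Surname pair) as their index afterwards. If you want it back as a regular column, reset the index:
data.reset_index(inplace=True)
updates.reset_index(inplace=True)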
I am new to python pandas. I have one dataframe like below:
df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
'age': ['25', '22','21','32','37','26','24','30']})
print(df)
Name age
0 football 25
1 ramesh 22
2 suresh 21
3 pankaj 32
4 cricket 37
5 rakesh 26
6 mohit 24
7 mahesh 30
"Name" column contains "sports name" and "sport person name" also. I want to split it into two different columns like below:
Expected Output:
sports_name sport_person_name age
football    ramesh            22
            suresh            21
            pankaj            32
cricket     rakesh            26
            mohit             24
            mahesh            30
If I group by the "Name" column I don't get the expected output, which is obviously because there are no duplicates in the "Name" column. What do I need to use to get the expected output?
Edit: if you don't want to hardcode the sports names:
df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
'age': ['', '22','21','32','','26','24','30']})
df = df.replace('', np.nan, regex=True)  # np is numpy (import numpy as np)
nan_rows = df[df.isnull().any(axis=1)]   # rows where any column is NaN
sports = nan_rows['Name'].tolist()
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
I simply checked which rows contain NaN in every column except "Name"; those rows are necessarily the sports names. I built a list of those names and used the solution below to create the sports_name and sport_person_name columns.
You can use:
#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 football suresh 21
2 football pankaj 32
3 cricket rakesh 26
4 cricket mohit 24
5 cricket mahesh 30
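To see why this works, look at the intermediate result of where before ffill runs: every value that is not a sport becomes NaN, and ffill then carries the last sport seen downwards. Using the original df:
print(df['Name'].where(df['Name'].isin(['football','cricket'])))
0    football
1         NaN
2         NaN
3         NaN
4     cricket
5         NaN
6         NaN
7         NaN
Name: Name, dtype: object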
Similar solution with DataFrame.insert - then reorder is not necessary:
#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 football suresh 21
2 football pankaj 32
3 cricket rakesh 26
4 cricket mohit 24
5 cricket mahesh 30
If you want each sport name to appear only once (as in the expected output), add limit=1 to ffill and replace the remaining NaNs with an empty string:
sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 suresh 21
2 pankaj 32
3 cricket rakesh 26
4 mohit 24
5 mahesh 30
The output you want is really a dictionary, not a dataframe.
The dictionary will look like:
{'Sport' : {'Player' : age, 'Player2' : age}}
If you really want a dataframe, and the sport name always comes before its players:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['football', 'ramesh', 'suresh', 'pankaj', 'cricket',
                            'rakesh', 'mohit', 'mahesh'],
                   'age': ['25', '22', '21', '32', '37', '26', '24', '30']})
sports = ['football', 'cricket']

# build a {player: sport} mapping by remembering the last sport seen
wanted_dict = {}
current_sport = ''
for val in df['Name']:
    if val in sports:
        current_sport = val
    else:
        wanted_dict[val] = current_sport
# Now you got - {name: sport_name, ...}

# tag every row with its sport; the sport rows themselves keep the
# placeholder 'sport' and are dropped afterwards
df['sports_name'] = 'sport'
for val in df['Name']:
    if val not in sports:
        df['sports_name'] = np.where(df['Name'] == val,
                                     wanted_dict[val], df['sports_name'])
df = df[df['sports_name'] != 'sport']
df = df.rename(columns={'Name': 'sport_person_name'}).reset_index(drop=True)
df = df[['sports_name', 'sport_person_name', 'age']]
What it should look like:
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30