Find extras between two columns of two dataframes - subtract [duplicate] - python

This question already has answers here:
Diff of two Dataframes
(7 answers)
Closed 4 years ago.
I have 2 dataframes (df_a and df_b) with 2 columns: 'Animal' and 'Name'.
The bigger dataframe contains more animals of each type than the other. How do I find the extra animals of the same type, by name? i.e. (df_a - df_b)
Dataframe A
Animal Name
dog john
dog henry
dog betty
dog smith
cat charlie
fish tango
lion foxtrot
lion lima
Dataframe B
Animal Name
dog john
cat charlie
dog betty
fish tango
lion foxtrot
dog smith
In this case, the extra would be:
Animal Name
dog henry
lion lima
Attempt: I tried using
df_c = df_a.subtract(df_b, axis='columns')
but got the following error "unsupported operand type(s) for -: 'unicode' and 'unicode'", which makes sense since they are strings not numbers. Is there any other way?

You are looking for a left_only merge.
merged = pd.merge(df_a, df_b, how='outer', indicator=True)
merged.loc[merged['_merge'] == 'left_only', ['Animal', 'Name']]
Output
Animal Name
1 dog henry
7 lion lima
Explanation:
merged = pd.merge(df_a, df_b, how='outer', indicator=True)
Gives:
Animal Name _merge
0 dog john both
1 dog henry left_only
2 dog betty both
3 dog smith both
4 cat charlie both
5 fish tango both
6 lion foxtrot both
7 lion lima left_only
The extra animals are in df_a only, which is denoted by left_only.
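A self-contained sketch of this approach, with the example frames rebuilt from the question's data:

```python
import pandas as pd

# Rebuild the example frames from the question
df_a = pd.DataFrame({
    'Animal': ['dog', 'dog', 'dog', 'dog', 'cat', 'fish', 'lion', 'lion'],
    'Name': ['john', 'henry', 'betty', 'smith', 'charlie', 'tango', 'foxtrot', 'lima'],
})
df_b = pd.DataFrame({
    'Animal': ['dog', 'cat', 'dog', 'fish', 'lion', 'dog'],
    'Name': ['john', 'charlie', 'betty', 'tango', 'foxtrot', 'smith'],
})

# indicator=True adds a '_merge' column recording each row's origin:
# 'both', 'left_only', or 'right_only'
merged = pd.merge(df_a, df_b, how='outer', indicator=True)

# The extras are the rows that came only from df_a
extra = merged.loc[merged['_merge'] == 'left_only', ['Animal', 'Name']]
print(extra)
```

The same trick works in the other direction: `right_only` gives rows present only in df_b.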

Using isin on the row-wise string concatenation of the two columns (df1/df2 here correspond to the question's df_a/df_b):
df1[~df1.sum(axis=1).isin(df2.sum(axis=1))]
Out[611]:
Animal Name
1 dog henry
7 lion lima
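One caveat with the sum trick: it concatenates the strings, so two different (Animal, Name) pairs that happen to produce the same combined string would collide. A safer variant of the same isin idea, sketched below, compares whole (Animal, Name) pairs via a MultiIndex:

```python
import pandas as pd

df_a = pd.DataFrame({'Animal': ['dog', 'dog', 'dog', 'dog', 'cat', 'fish', 'lion', 'lion'],
                     'Name': ['john', 'henry', 'betty', 'smith', 'charlie', 'tango', 'foxtrot', 'lima']})
df_b = pd.DataFrame({'Animal': ['dog', 'cat', 'dog', 'fish', 'lion', 'dog'],
                     'Name': ['john', 'charlie', 'betty', 'tango', 'foxtrot', 'smith']})

# Compare (Animal, Name) tuples instead of concatenated strings
mask = df_a.set_index(['Animal', 'Name']).index.isin(
    df_b.set_index(['Animal', 'Name']).index)
extra = df_a[~mask]
print(extra)
```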

Related

filter data frame for common values and rank

If I have data that looks like this:
name times_used gender
0 Sophia 42261 Girl
1 Jacob 42164 Boy
2 Emma 35951 Girl
3 Ethan 34523 Boy
4 Mason 34195 Boy
5 William 34130 Boy
6 Olivia 34128 Girl
7 Jayden 33962 Boy
8 Michael 33842 Boy
9 Noah 33098 Boy
10 Alexander 32292 Boy
11 Daniel 30907 Boy
12 Aiden 30868 Boy
13 Ava 30765 Girl
Could someone give me a tip on how to use Pandas to find names (say, the top 10) that are used for both the Girl and Boy genders? The times_used column is an int count of how many times that name was chosen for a child.
df = pd.read_csv('../resource/lib/public/babynames.csv')
cols = ['name','times_used','gender']
df.columns = cols
print(df)
here is one way to do it (this selects the top N names per gender, ranked by times_used):
top_count = 3
df[df.groupby(['gender'])['times_used'].transform(
       lambda x: x.nlargest(top_count)).notna()
  ].sort_values(['gender', 'times_used'], ascending=False)
name times_used gender
0 Sophia 42261 Girl
2 Emma 35951 Girl
6 Olivia 34128 Girl
1 Jacob 42164 Boy
3 Ethan 34523 Boy
4 Mason 34195 Boy
df = pd.read_csv('../resource/lib/public/babynames.csv')
cols = ['name','times_used','gender']
df.columns = cols
#identify rows whose 'name' appears more than once (i.e. under both genders)
duplicateRows = df[df.duplicated(['name'])]
#view duplicate rows
print(duplicateRows)
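For the "used by both genders" part of the question, a possible sketch; the sample rows here are made up since babynames.csv isn't available:

```python
import pandas as pd

# Hypothetical sample standing in for babynames.csv
df = pd.DataFrame({'name': ['Jordan', 'Jordan', 'Taylor', 'Taylor', 'Emma', 'Noah'],
                   'times_used': [5000, 4000, 3000, 2500, 35951, 33098],
                   'gender': ['Boy', 'Girl', 'Boy', 'Girl', 'Girl', 'Boy']})

# Keep only names that appear under more than one gender
both = df[df.groupby('name')['gender'].transform('nunique') > 1]

# Rank those names by total usage and take the top N
top = both.groupby('name')['times_used'].sum().nlargest(10)
print(top)
```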

How to rename Pandas columns based on mapping?

I have a dataframe where column Name contains values such as the following (the rest of the columns do not affect how this question is answered I hope):
Chicken
Chickens
Fluffy Chicken
Whale
Whales
Blue Whale
Tiger
White Tiger
Big Tiger
Now, I want to ensure that we rename these entries to be like the following:
Chicken
Chicken
Chicken
Whale
Whale
Whale
Tiger
Tiger
Tiger
Essentially substituting anything that has 'Chicken' to just be 'Chicken', anything with 'Whale' to be just 'Whale', and anything with 'Tiger' to be just 'Tiger'.
What is the best way to do this? There are almost 1 million rows in the dataframe.
Sorry just to add, I have a list of what we expect i.e.
['Chicken', 'Whale', 'Tiger']
They can appear in any order in the column
What I should also add is, the column might contain things like "Mushroom" or "Eggs" that do not need substituting from the original list.
Try with str.extract (the column is Name in the question):
l = ['Chicken', 'Whale', 'Tiger']
df['new'] = df['Name'].str.extract('(' + '|'.join(l) + ')')[0]
Out[10]:
0 Chicken
1 Chicken
2 Chicken
3 Whale
4 Whale
5 Whale
6 Tiger
7 Tiger
8 Tiger
Name: 0, dtype: object
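Since the question notes that values like "Mushroom" or "Eggs" should pass through untouched, the NaN that str.extract produces for non-matching rows can be filled back with the original value; a sketch:

```python
import pandas as pd

l = ['Chicken', 'Whale', 'Tiger']
df = pd.DataFrame({'Name': ['Fluffy Chicken', 'Blue Whale', 'White Tiger',
                            'Mushroom', 'Eggs']})

# Extract the first matching keyword; rows with no match become NaN
extracted = df['Name'].str.extract('(' + '|'.join(l) + ')')[0]

# Fall back to the original value where nothing matched
df['Name'] = extracted.fillna(df['Name'])
print(df)
```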

Pandas Get List of Unique Values in Column A for each Unique Value in Column B

I'm finding this problem easy to write out, but difficult to apply with my Pandas Dataframe.
When searching for anything 'unique values' and 'list' I only get answers for getting the unique values in a list.
There is a brute force solution with a double for loop, but there must be a faster Pandas solution than n^2.
I have a DataFrame with two columns: Name and Likes Food.
As output, I want a list of unique Likes Food values for each unique Name.
Example Dataframe df
Index Name Likes Food
0 Tim Pizza
1 Marie Pizza
2 Tim Pasta
3 Tim Pizza
4 John Pizza
5 Amy Pizza
6 Amy Sweet Potatoes
7 Marie Sushi
8 Tim Sushi
I know how to aggregate and groupby the unique count of Likes Food:
df.groupby( by='Name', as_index=False ).agg( {'Likes Food': pandas.Series.nunique} )
df.sort_values(by='Likes Food', ascending=False)
df.reset_index( drop=True )
>>>
Index Name Likes Food
0 Tim 3
1 Marie 2
2 Amy 2
3 John 1
But given that, what ARE the foods for each Name in that DataFrame? For readability, expressed as a list makes good sense. List sorting doesn't matter (and is easy to fix probably).
Example Output
<code here>
>>>
Index Name Likes Food Food List
0 Tim 3 [Pizza, Pasta, Sushi]
1 Marie 2 [Pizza, Sushi]
2 Amy 2 [Pizza, Sweet Potatoes]
3 John 1 [Pizza]
To obtain the output without the counts, just try unique
df.groupby("Name")["Likes Food"].unique()
Name
Amy [Pizza, Sweet Potatoes]
John [Pizza]
Marie [Pizza, Sushi]
Tim [Pizza, Pasta, Sushi]
Name: Likes Food, dtype: object
additionally, you can use named aggregation (nunique for the count, unique for the list, to match the output below):
df.groupby("Name").agg(**{"Likes Food": pd.NamedAgg(column='Likes Food', aggfunc='nunique'),
                          "Food List": pd.NamedAgg(column='Likes Food', aggfunc='unique')}).reset_index()
Name Likes Food Food List
0 Amy 2 [Pizza, Sweet Potatoes]
1 John 1 [Pizza]
2 Marie 2 [Pizza, Sushi]
3 Tim 3 [Pizza, Pasta, Sushi]
To get both columns, also sorted, try this:
df = (df.groupby("Name")["Likes Food"]
        .agg(counts='nunique', food_list='unique')
        .reset_index()
        .sort_values(by='counts', ascending=False))
df
Name counts food_list
3 Tim 3 [Pizza, Pasta, Sushi]
0 Amy 2 [Pizza, Sweet Potatoes]
2 Marie 2 [Pizza, Sushi]
1 John 1 [Pizza]

How to get the SQL SELF JOIN equivalent in pandas? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have a simple data frame with first and last names. I would like to get the equivalent of a SQL SELF JOIN in pandas.
Here goes the full example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'first_name': ['Rose','Summer','Jane','Kim','Jack'],
'last_name': ['Howard','Solstice','Kim','Cruz','Rose'],
'customer_id': [1,2,3,4,5]})
df
first_name last_name customer_id
0 Rose Howard 1
1 Summer Solstice 2
2 Jane Kim 3
3 Kim Cruz 4
4 Jack Rose 5
REQUIRED OUTPUT
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
1 Rose Howard 5 Jack Rose
4 Kim Cruz 3 Jane Kim
Using SQL
select a.first_name, a.last_name, b.first_name, b.last_name
from df as a, df as b
where a.first_name = b.last_name
My attempt
(pd.concat([df[['first_name', 'last_name']],
            df[['first_name', 'last_name']].add_suffix('_1')],
           axis=1, ignore_index=False))
first_name last_name first_name_1 last_name_1
0 Rose Howard Rose Howard
1 Summer Solstice Summer Solstice
2 Jane Kim Jane Kim
But,
(pd.concat( [ df,df.add_suffix('_1')], axis=1)
.query(" first_name == last_name_1 ")
)
This gives empty output to my surprise!!
I want two rows and four columns, as given by the SQL.
Use left_on and right_on
df.merge(df, left_on='first_name', right_on='last_name')
Result:
  first_name_x last_name_x  customer_id_x first_name_y last_name_y  customer_id_y
0         Rose      Howard              1         Jack        Rose              5
1          Kim        Cruz              4         Jane         Kim              3
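To get column names closer to the required output (base names for the left side, a _1 suffix for the right), merge's suffixes parameter can be used; a sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Rose', 'Summer', 'Jane', 'Kim', 'Jack'],
                   'last_name': ['Howard', 'Solstice', 'Kim', 'Cruz', 'Rose'],
                   'customer_id': [1, 2, 3, 4, 5]})

# Self-join on first_name == last_name; suffixes label the two sides,
# with the left side keeping its original column names
out = df.merge(df, left_on='first_name', right_on='last_name',
               suffixes=('', '_1'))
print(out)
```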

Pandas: Concatenate two dataframes with different column names

I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName and df2.directorName should be stacked into one column, and likewise actorID / directorID. How can I do this?
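This question appears here without an answer, so the following is only a possible sketch (not from the thread): rename one frame's columns to match the other's, then concat. Whether to keep df1's names, as here, or invent combined names like "actorID-directorID" is a choice left to the reader.

```python
import pandas as pd

df1 = pd.DataFrame({'actorID': ['annie_potts', 'bill_farmer'],
                    'actorName': ['Annie Potts', 'Bill Farmer']})
df2 = pd.DataFrame({'directorID': ['john_lasseter', 'joe_johnston'],
                    'directorName': ['John Lasseter', 'Joe Johnston']})

# Map df2's columns onto df1's positionally so concat stacks them
# into the same columns instead of creating four separate ones
renamed = df2.rename(columns=dict(zip(df2.columns, df1.columns)))
combined = pd.concat([df1, renamed], ignore_index=True)
print(combined)
```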
