How to get the SQL SELF JOIN equivalent in pandas? [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have a simple data frame with first and last names. I would like to get the equivalent of a SQL SELF JOIN in pandas.
Here goes the full example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'first_name': ['Rose','Summer','Jane','Kim','Jack'],
'last_name': ['Howard','Solstice','Kim','Cruz','Rose'],
'customer_id': [1,2,3,4,5]})
df
first_name last_name customer_id
0 Rose Howard 1
1 Summer Solstice 2
2 Jane Kim 3
3 Kim Cruz 4
4 Jack Rose 5
REQUIRED OUTPUT
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
1 Rose Howard 5 Jack Rose
4 Kim Cruz 3 Jane Kim
Using SQL
select a.first_name, a.last_name, b.first_name, b.last_name
from df as a, df as b
where a.first_name = b.last_name
My attempt
(pd.concat( [ df[['first_name','last_name']],
df[['first_name','last_name']].add_suffix('_1')
], axis=1, ignore_index=False)
)
first_name last_name first_name_1 last_name_1
0 Rose Howard Rose Howard
1 Summer Solstice Summer Solstice
2 Jane Kim Jane Kim
3 Kim Cruz Kim Cruz
4 Jack Rose Jack Rose
But,
(pd.concat( [ df,df.add_suffix('_1')], axis=1)
.query(" first_name == last_name_1 ")
)
To my surprise, this gives empty output!
I want two rows and four columns, as given by the SQL query.

Use left_on and right_on
df.merge(df, left_on='first_name', right_on='last_name')
Result:
first_name_x last_name_x customer_id_x first_name_y last_name_y \
0 Rose Howard 1 Jack Rose
1 Kim Cruz 4 Jane Kim
customer_id_y
0 5
1 3
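To get column names closer to the required output, the same merge can be given explicit suffixes. This is a minimal sketch that rebuilds the df from the question; the suffixes ('', '_1') and the final column order are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Rose', 'Summer', 'Jane', 'Kim', 'Jack'],
                   'last_name': ['Howard', 'Solstice', 'Kim', 'Cruz', 'Rose'],
                   'customer_id': [1, 2, 3, 4, 5]})

# Self join: match each row's first_name against every row's last_name.
# suffixes=('', '_1') leaves the left-hand names unchanged and tags the right-hand ones.
result = df.merge(df, left_on='first_name', right_on='last_name',
                  suffixes=('', '_1'))

# Reorder columns to mirror the layout requested in the question.
result = result[['customer_id', 'first_name', 'last_name',
                 'customer_id_1', 'first_name_1', 'last_name_1']]
print(result)
```

The pd.concat(..., axis=1) attempt in the question comes back empty because concat aligns rows by index, so the query only ever compares a row with itself; merge compares every left row with every right row.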

Related

filter data frame for common values and rank

If I have data that looks like this:
name times_used gender
0 Sophia 42261 Girl
1 Jacob 42164 Boy
2 Emma 35951 Girl
3 Ethan 34523 Boy
4 Mason 34195 Boy
5 William 34130 Boy
6 Olivia 34128 Girl
7 Jayden 33962 Boy
8 Michael 33842 Boy
9 Noah 33098 Boy
10 Alexander 32292 Boy
11 Daniel 30907 Boy
12 Aiden 30868 Boy
13 Ava 30765 Girl
Could someone give me a tip on how to use pandas to find the names (say, the top 10) that are used for both the Girl and Boy genders? The times_used column is an int value of how many times that name was chosen for a child.
df = pd.read_csv('../resource/lib/public/babynames.csv')
cols = ['name','times_used','gender']
df.columns = cols
print(df)
Here is one way to do it:
top_count=3
df[df.groupby(['gender'])['times_used'].transform(
lambda x: x.nlargest(top_count)).notna()
].sort_values(['gender','times_used'], ascending=False)
name times_used gender
0 Sophia 42261 Girl
2 Emma 35951 Girl
6 Olivia 34128 Girl
1 Jacob 42164 Boy
3 Ethan 34523 Boy
4 Mason 34195 Boy
df = pd.read_csv('../resource/lib/public/babynames.csv')
cols = ['name','times_used','gender']
df.columns = cols
#identify rows whose 'name' value duplicates an earlier row
duplicateRows = df[df.duplicated(['name'])]
#view duplicate rows
print(duplicateRows)
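Neither snippet above directly selects names that occur under both genders. One possible approach, sketched here on made-up sample rows (the real babynames.csv is not shown, so the data and the names per_name and both are assumptions), groups by name, keeps names seen with two distinct genders, and ranks them by total usage.

```python
import pandas as pd

# Hypothetical sample rows; the real file is not shown in the question.
df = pd.DataFrame({
    'name': ['Jordan', 'Jordan', 'Emma', 'Riley', 'Riley', 'Jacob'],
    'times_used': [500, 300, 900, 200, 450, 800],
    'gender': ['Boy', 'Girl', 'Girl', 'Boy', 'Girl', 'Boy'],
})

# Per name: how many distinct genders it appears under, and its total usage.
per_name = df.groupby('name').agg(
    genders=('gender', 'nunique'),
    total_used=('times_used', 'sum'),
)

# Keep names used for both genders, then take the 10 most used.
both = per_name[per_name['genders'] == 2].nlargest(10, 'total_used')
print(both)
```

Ranking by the summed times_used is only one choice; ranking by the smaller of the two per-gender counts would favour names that are common for both genders rather than dominated by one.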

How to groupby on several columns to know value on other column? [duplicate]

This question already has answers here:
Get rows based on distinct values from one column
(2 answers)
Closed 1 year ago.
I have a dataframe with thousands of rows like this:
city zip_code name
paris 1 John
paris 1 Eric
paris 2 David
LA 3 David
LA 4 David
LA 4 NaN
How can I group by city and zip_code and get the names for each (city, zip_code) group?
Expected output: a dataframe with rows with unique city and unique zip_code and corresponding names in other column (one row per name)
city zip_code name
paris 1 John
Eric
paris 2 David
LA 3 David
LA 4 David
IIUC, you want to know the existing combinations of city and zip_code?
[k for k,_ in df.groupby(['city', 'zip_code'])]
output: [('LA', 3), ('LA', 4), ('paris', 1), ('paris', 2)]
edit following your change in the question:
It looks like you want:
df.drop_duplicates().dropna()
output:
city zip_code name
0 paris 1 John
1 paris 1 Eric
2 paris 2 David
3 LA 3 David
4 LA 4 David
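If one row per (city, zip_code) pair with the names collected together is preferred, a groupby aggregation is another option. This is a minimal sketch that rebuilds the sample frame from the question; collecting the names into a list is an assumption about the desired shape.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['paris', 'paris', 'paris', 'LA', 'LA', 'LA'],
    'zip_code': [1, 1, 2, 3, 4, 4],
    'name': ['John', 'Eric', 'David', 'David', 'David', np.nan],
})

# Drop rows with no name, then gather the distinct names of each group.
# sort=False keeps the groups in order of first appearance.
names = (df.dropna(subset=['name'])
           .groupby(['city', 'zip_code'], sort=False)['name']
           .agg(lambda s: list(s.unique()))
           .reset_index())
print(names)
```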

groupby data with columns that have mixed data types

Let's say I have this sample of a mixed dataset:
df:
Property Name Date of entry Old data Updated data
City Jim 1/7/2021 Jacksonville Miami
State Jack 1/8/2021 TX CA
Zip Joe 2/2/2021 11111 22222
Address Harry 2/3/2021 123 lane 123 street
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Email Tammy 3/2/2021 tammy#yahoo.com tammy#gmail.com
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Order count Tammy 3/4/2021 2 3
I'd like to group by all this data starting with property and have it look like this:
grouped:
Property Name Date of entry Old data Updated Data
City names1 date 1 data 1 data 2
names2 date 2 data 1 data 2
names3 date 3 data 1 data 2
State names1 date 1 data 1 data 2
names2 date 2 data 1 data 2
names3 date 3 data 1 data 2
grouped = pd.DataFrame(df.groupby(['Property','Name','Date of entry','Old data','Updated data'])
.size(),columns=['Count'])
grouped
I get a TypeError saying: '<' not supported between instances of 'int' and 'datetime.datetime'
Is there some formatting that I need to apply to the df['Old data'] and df['Updated data'] columns so that they can be used in the groupby?
added data types:
Property: Object
Name: Object
Date of entry: datetime
Old data: Object
Updated data: Object
*I modified your initial data to get a better view of the output.
You can try with pivot_table instead of groupby:
df.pivot_table(index = ['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x)
Output:
Old data Updated data
Property Name Date of entry
Address Harry 2/3/2021 123 lane 123 street
Lisa 2/3/2021 123 lane 123 street
City Jack 1/8/2021 TX Miami
Jim 1/7/2021 Jacksonville Miami
Tammy 1/8/2021 TX Miami
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Email Tammy 3/2/2021 tammy#yahoo.com tammy#gmail.com
Order count Jack 3/4/2021 2 3
Tammy 3/4/2021 2 3
State Jack 1/8/2021 TX CA
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Zip Joe 2/2/2021 11111 22222
The whole code:
import pandas as pd
from io import StringIO
txt = '''Property Name Date of entry Old data Updated data
City Jim 1/7/2021 Jacksonville Miami
City Jack 1/8/2021 TX Miami
State Jack 1/8/2021 TX CA
Zip Joe 2/2/2021 11111 22222
Order count Jack 3/4/2021 2 3
Address Harry 2/3/2021 123 lane 123 street
Telephone Lisa 3/1/2021 111-111-11111 333-333-3333
Address Lisa 2/3/2021 123 lane 123 street
Email Tammy 3/2/2021 tammy#yahoo.com tammy#gmail.com
Date Product Ordered Lisa 3/3/2021 2/1/2021 2/10/2021
Order count Tammy 3/4/2021 2 3
City Tammy 1/8/2021 TX Miami
'''
df = pd.read_csv(StringIO(txt), header=0, skipinitialspace=True, sep=r'\s{2,}', engine='python')
print(df.pivot_table(index = ['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x))
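The error in the question suggests that the object columns hold mixed Python types (int, str, datetime) that groupby cannot sort. One possible workaround, which is not part of the answer above, is to cast those columns to strings before grouping. The small frame below is hypothetical and only mimics the mixed types.

```python
import datetime as dt
import pandas as pd

# Hypothetical frame whose object columns mix str, int and datetime values.
df = pd.DataFrame({
    'Property': ['City', 'Zip', 'Date Product Ordered', 'Order count'],
    'Name': ['Jim', 'Joe', 'Lisa', 'Tammy'],
    'Old data': ['Jacksonville', 11111, dt.datetime(2021, 2, 1), 2],
    'Updated data': ['Miami', 22222, dt.datetime(2021, 2, 10), 3],
})

mixed_cols = ['Old data', 'Updated data']
# Casting to str gives every value one comparable type,
# so sorting the group keys no longer compares int with datetime.
df[mixed_cols] = df[mixed_cols].astype(str)

grouped = (df.groupby(['Property', 'Name', 'Old data', 'Updated data'])
             .size()
             .to_frame('Count'))
print(grouped)
```

The trade-off is that the original types are lost: after the cast, 11111 and '11111' fall into the same group, and dates appear in their string form.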

Find extras between two columns of two dataframes - subtract [duplicate]

This question already has answers here:
Diff of two Dataframes
(7 answers)
Closed 4 years ago.
I have 2 dataframes (df_a and df_b) with 2 columns: 'Animal' and 'Name'.
The bigger dataframe contains more animals of some types than the other one. How do I find the extra animals of the same type by name? i.e. (df_a - df_b)
Dataframe A
Animal Name
dog john
dog henry
dog betty
dog smith
cat charlie
fish tango
lion foxtrot
lion lima
Dataframe B
Animal Name
dog john
cat charlie
dog betty
fish tango
lion foxtrot
dog smith
In this case, the extra would be:
Animal Name
dog henry
lion lima
Attempt: I tried using
df_c = df_a.subtract(df_b, axis='columns')
but got the following error "unsupported operand type(s) for -: 'unicode' and 'unicode'", which makes sense since they are strings not numbers. Is there any other way?
You are looking for a left_only merge.
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
merged.loc[merged['_merge'] == 'left_only'][['Animal', 'Name']]
Output
Animal Name
1 dog henry
7 lion lima
Explanation:
merged = pd.merge(df_a,df_b, how='outer', indicator=True)
Gives:
Animal Name _merge
0 dog john both
1 dog henry left_only
2 dog betty both
3 dog smith both
4 cat charlie both
5 fish tango both
6 lion foxtrot both
7 lion lima left_only
The extra animals are in df_a only, which is denoted by left_only.
Using isin (string columns are concatenated row-wise by sum, then compared):
df_a[~df_a.sum(axis=1).isin(df_b.sum(axis=1))]
Output
Animal Name
1 dog henry
7 lion lima
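For reference, the merge-based approach can be run end to end as follows. This is a minimal sketch that rebuilds both frames from the question; it uses how='left' rather than how='outer', which is enough when only the rows missing from df_b are wanted, and the name extras is an illustrative choice.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Animal': ['dog', 'dog', 'dog', 'dog', 'cat', 'fish', 'lion', 'lion'],
    'Name': ['john', 'henry', 'betty', 'smith', 'charlie', 'tango', 'foxtrot', 'lima'],
})
df_b = pd.DataFrame({
    'Animal': ['dog', 'cat', 'dog', 'fish', 'lion', 'dog'],
    'Name': ['john', 'charlie', 'betty', 'tango', 'foxtrot', 'smith'],
})

# Left merge with an indicator column: rows of df_a with no match in df_b
# are flagged 'left_only'.
merged = df_a.merge(df_b, on=['Animal', 'Name'], how='left', indicator=True)
extras = merged.loc[merged['_merge'] == 'left_only', ['Animal', 'Name']]
print(extras)
```

Unlike the sum/isin variant, this compares the columns separately, so two different (Animal, Name) pairs whose concatenated strings happen to coincide are not confused.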

Pandas: Concatenate two dataframes with different column names

I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName and df2.directorName should be joined together, and likewise actorID and directorID. How can I do this?
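One way to make the column pairing explicit is to rename both frames to a shared set of column names before concatenating. This is a minimal sketch on the first two rows of each frame; the shared names personID and personName are hypothetical and can be replaced by anything suitable.

```python
import pandas as pd

df1 = pd.DataFrame({'actorID': ['annie_potts', 'bill_farmer'],
                    'actorName': ['Annie Potts', 'Bill Farmer']})
df2 = pd.DataFrame({'directorID': ['john_lasseter', 'joe_johnston'],
                    'directorName': ['John Lasseter', 'Joe Johnston']})

# Each mapping states which source column feeds which shared column.
# 'personID' and 'personName' are hypothetical names chosen for illustration.
combined = pd.concat(
    [df1.rename(columns={'actorID': 'personID', 'actorName': 'personName'}),
     df2.rename(columns={'directorID': 'personID', 'directorName': 'personName'})],
    ignore_index=True,  # renumber rows 0..n-1 instead of repeating 0, 1
)
print(combined)
```

Without the rename, pd.concat aligns on column names and would produce four columns padded with NaN instead of two stacked ones.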
