I don't see the below case mentioned in Pandas Merging 101.
I'm having trouble understanding the Pandas documentation for doing a left outer join.
import pandas as pd
left_df = pd.DataFrame({'user_id': ['Peter', 'John', 'Robert', 'Anna']})
right_df = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna']})
pd.merge(left_df, right_df, on='user_id', how='left')
Output is:
user_id
0 Peter
1 John
2 Robert
3 Anna
Expected output:
user_id
0 Peter
1 Robert
What am I missing? Is the indicator = True parameter a must (to create a _merge column to filter on) for left outer joins?
You can use merge with indicator=True and keep only the rows marked left_only, but that is not the most direct way. Instead, use isin to build a boolean mask and invert it:
>>> left_df[~left_df['user_id'].isin(right_df['user_id'])]
user_id
0 Peter
2 Robert
With merge:
>>> (left_df.merge(right_df, on='user_id', how='left', indicator='present')
.loc[lambda x: x.pop('present') == 'left_only'])
user_id
0 Peter
2 Robert
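What the question asks for is usually called a left anti join: the rows of the left frame that have no match in the right frame. Below is a minimal sketch, using the frames from the question, that checks that the isin and merge approaches return the same rows. The helper names anti_join_isin and anti_join_merge are made up for this illustration and are not pandas functions.

```python
import pandas as pd

left_df = pd.DataFrame({'user_id': ['Peter', 'John', 'Robert', 'Anna']})
right_df = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna']})

def anti_join_isin(left, right, key):
    # Keep the left rows whose key value never appears in right[key].
    return left[~left[key].isin(right[key])]

def anti_join_merge(left, right, key):
    # Left join against the distinct right keys, then keep the unmatched rows.
    merged = left.merge(right[[key]].drop_duplicates(), on=key,
                        how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

by_isin = anti_join_isin(left_df, right_df, 'user_id')
by_merge = anti_join_merge(left_df, right_df, 'user_id')
print(by_isin['user_id'].tolist())
print(by_merge['user_id'].tolist())
```

Deduplicating the right keys before the merge is a precaution so that repeated keys on the right cannot multiply rows on the left.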
Related
I have 2 dataframes which I need to join using left join. In sql I have the query as
SELECT A.* INTO NewTable FROM A LEFT JOIN B ON A.id=B.id WHERE B.id IS NULL;
I have the 2 dataframes as:
df1:
id  name
1   one
2   two
3   three
4   four
df2:
id
2
3
What I am expecting is:
id  name
1   one
4   four
What I have tried:
common = df1.merge(df2, on=['id', 'id'])
result = df1[~df1.id.isin(common.id)]
I get more results from this than the query returns. Any help is appreciated.
You have the right solution; you are only interpreting the results incorrectly.
This prints the result without the index:
import pandas as pd
d = {'id': [1, 2,3,4], 'col2': ['one','two','three','four']}
d1 = {'id': [2,3]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d1)
result = df1[~df1.id.isin(df2.id)]
print(result.to_string(index=False))
You can use left join with .merge() with indicator= parameter turned on. Then, filter the indicator values equal to "left_only" with .query(), as follows:
df1.merge(df2, on='id', how='left', indicator='ind').query('ind == "left_only"')
Result:
id name ind
0 1 one left_only
3 4 four left_only
Optionally, you can remove the indicator column, as follows:
df1.merge(df2, on='id', how='left', indicator='ind').query('ind == "left_only"').drop('ind', axis=1)
Result:
id name
0 1 one
3 4 four
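The indicator approach also extends to keys made of several columns, where an isin check on a single column is not enough. A minimal sketch of that case follows; the region column and all of the data are hypothetical, added only for illustration.

```python
import pandas as pd

# Hypothetical data: the key is the (id, region) pair, not id alone.
df1 = pd.DataFrame({'id': [1, 2, 2, 3],
                    'region': ['eu', 'eu', 'us', 'us'],
                    'name': ['one', 'two', 'two-us', 'three']})
df2 = pd.DataFrame({'id': [2, 3], 'region': ['eu', 'us']})

keys = ['id', 'region']
merged = df1.merge(df2[keys].drop_duplicates(), on=keys,
                   how='left', indicator='ind')
# Keep only the rows of df1 whose (id, region) pair is absent from df2.
result = merged.query('ind == "left_only"').drop(columns='ind')
print(result)
```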
Try:
print(df1[~df1["id"].isin(df2["id"])])
Prints:
id name
0 1 one
3 4 four
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['one', 'two', 'three', 'four']})
df2 = pd.DataFrame({'id': [2, 3]})
# drop() removes rows by index label, so index df1 by id before dropping df2's ids
df1 = df1.set_index('id').drop(df2['id'], errors='ignore').reset_index()
df1
Say I have a simple dataframe with the names of people. I perform a groupby on name
import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3,4,5,6,7], 'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim'] })
df1 = df.groupby('name')
Question: How can I select out a table of specific names out of a list which contains a string subset of the names, either 2 or 3 characters?
e.g say I have the following list where both Tim & Geo are the first 3 characters of some entries in the name column and Jo is the first 2 characters of a certain entry in the name column.
list = ['Jo', 'Tim', 'Geo']
Attempted: My initial thought was to create new columns in the original dataframe which were either a 2 or 3 character subset of the name column and then try grouping by that however since 2 and 3 string characters are different the grouping wouldn't output the correct result.
Not sure whether it would be better to use some if condition such as if v in list is len(2) groupby(2char) else groupby(3char) and output the result as 1 dataframe.
list
# pseudocode: choose the grouping column by the length of the prefix
df1['name_2char_subset'] = df1['name'].str[0:2]
df1['name_3char_subset'] = df1['name'].str[0:3]
if v in list is len(2):
    df2 = df1.groupby('name_2char_subset')
else:
    df2 = df1.groupby('name_3char_subset')
Desired Output: Since there are 2 counts of each of Jo, Geo & Tim. The output should group by each case. ie for Jo there are both John & Joe hence a count of 2 in the groupby.
df3 = pd.DataFrame({'name': ['Jo', 'Tim', 'Geo'], 'col1': [2,2,2]})
How could we group by name and output the entries in name which have the given initial characters as in the list?
Any alternative ways of doing this would be helpful. For example, can this be done within the groupby, or by extracting the values in the list after the groupby has been performed?
First, don't use list as a variable name, because it shadows the Python built-in. Then use Series.str.extract with ^ to match the start of each string, and count the matches with Series.value_counts:
L = ['Jo', 'Tim', 'Geo']
pat = '|'.join(r"^{}".format(x) for x in L)
df = (df['name'].str.extract('('+ pat + ')', expand=False)
.dropna()
.value_counts()
.reindex(L, fill_value=0)
.rename_axis('name')
.reset_index(name='col1'))
print (df)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
Your solution:
L = ['Jo', 'Tim', 'Geo']
s1 = df['name'].str[:2]
s2 = df['name'].str[:3]
df = (s1.where(s1.isin(L)).fillna(s2.where(s2.isin(L)))
.dropna()
.value_counts()
.reindex(L, fill_value=0)
.rename_axis('name')
.reset_index(name='col1'))
print (df)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
A solution from a deleted answer, changed to use Series.str.startswith to test whether each name starts with a string from the list:
L = ['Jo', 'Tim', 'Geo']
df3 = pd.DataFrame({'name': L})
df3['col1'] = df3['name'].apply(lambda x: sum(df['name'].str.startswith(x)))
print (df3)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
EDIT: If you need to aggregate more columns, use the first or second solution, assign the result back to the name column, and use named aggregation in GroupBy.agg:
df = pd.DataFrame({'age' : [1,2,3,4,5,6,7],
'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim'] })
print (df)
L = ['Jo', 'Tim', 'Geo']
pat = '|'.join(r"^{}".format(x) for x in L)
df['name'] = df['name'].str.extract('('+ pat + ')', expand=False)
df = df.groupby('name').agg(sum_age=('age','sum'), col1=('name', 'count'))
print (df)
sum_age col1
name
Geo 7 2
Jo 6 2
Tim 10 2
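The regex-based solutions above join the prefixes into a pattern as they are. If a prefix could contain regex metacharacters (a dot, for example), escaping each one with re.escape keeps the match literal. A minimal sketch under that assumption, using the data from the question; the variable names are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7],
                   'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim']})
prefixes = ['Jo', 'Tim', 'Geo']

# One anchored group with every prefix escaped, e.g. ^(Jo|Tim|Geo)
pattern = '^(' + '|'.join(re.escape(p) for p in prefixes) + ')'

# Names that match no prefix become NaN, which value_counts() skips.
matched = df['name'].str.extract(pattern, expand=False)
counts = matched.value_counts().reindex(prefixes, fill_value=0)
print(counts)
```

Note that regex alternation picks the first alternative that matches, so if one prefix is itself the start of another, the order of the list decides which one a name is counted under.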
I would like to create a DataFrame from a DataFrame I already have in Python.
The DataFrame I have looks like below:
Nome Dept
Maria A1
Joao A2
Anna A1
Jorge A3
The DataFrame I want to create is like the below:
Dept Funcionario 1 Funcionario 2
A1 Maria Anna
A2 Joao
I tried the below code:
df_func.merge(df_dept, how='inner', on='Dept')
But I got the error: TypeError: merge() got multiple values for argument 'how'
Would anyone know how I can do this?
Thank you in Advance! :)
Even if you get that merge to run, it will not give you the right answer. A self-merge pairs every row with every row that shares its key, so dept 1 ends up duplicated four times:
import pandas as pd
d = {'Name': ['maria', 'joao', 'anna', 'jorge'], 'dept': [1, 2, 1, 3]}
df = pd.DataFrame(d)
df.merge(df, how='inner', on='dept')
Out[8]:
Name_x dept Name_y
0 maria 1 maria
1 maria 1 anna
2 anna 1 maria
3 anna 1 anna
4 joao 2 joao
5 jorge 3 jorge
A better way is to use groupby:
dd = df.groupby('dept').agg(list)
Out[10]:
Name
dept
1 [maria, anna]
2 [joao]
3 [jorge]
Then apply pd.Series to spread each list across columns:
dd['Name'].apply(pd.Series)
Out[21]:
0 1
dept
1 maria anna
2 joao NaN
3 jorge NaN
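As an alternative to building lists, the same wide layout can be sketched by numbering the employees within each department and pivoting on that number. This is one possible approach, not the only one; the helper column name slot is made up, and the frame below is reconstructed from the question.

```python
import pandas as pd

df = pd.DataFrame({'Nome': ['Maria', 'Joao', 'Anna', 'Jorge'],
                   'Dept': ['A1', 'A2', 'A1', 'A3']})

# Number each person within their department: 1, 2, ...
numbered = df.assign(slot=df.groupby('Dept').cumcount() + 1)

# One row per department, one column per position within the department.
wide = numbered.pivot(index='Dept', columns='slot', values='Nome')
wide.columns = [f'Funcionario {n}' for n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Departments with fewer people than the widest one get NaN in the remaining columns.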
This is how I have merged two data frames recently.
rpt_data = connect_to_presto() # returned data from a db
df_rpt = pd.DataFrame(rpt_data, columns=["domain", "revenue"])
# add the sellers.json "sellers" records to a pandas DataFrame
sj_data = data  # response returned by the requests module
df_sj = pd.json_normalize(sj_data, record_path="sellers", errors="ignore")
# merge both DataFrames
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
Notice how I have stored the data into a variable each time, then created a DataFrame out of that? Then merged them like so
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
This may not be the best approach but it works.
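One thing to keep in mind: with how="inner", every row of the result matched on both sides, so the _merge column added by indicator=True is always "both". A small sketch with made-up data (the domains, revenue figures and seller_id column are hypothetical) showing that how="outer" is what makes the indicator informative:

```python
import pandas as pd

# Hypothetical stand-ins for the report data and the sellers.json data.
df_rpt = pd.DataFrame({'domain': ['a.com', 'b.com'], 'revenue': [10.0, 20.0]})
df_sj = pd.DataFrame({'domain': ['b.com', 'c.com'], 'seller_id': ['s1', 's2']})

inner = df_rpt.merge(df_sj, how='inner', on='domain', indicator=True)
outer = df_rpt.merge(df_sj, how='outer', on='domain', indicator=True)

# Map each domain to where it was found: left_only, right_only or both.
found_in = dict(zip(outer['domain'], outer['_merge'].astype(str)))
print(inner['_merge'].astype(str).tolist())
print(found_in)
```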
I've the following dataframe:
id;name;parent_of
1;John;3
2;Rachel;3
3;Peter;
The column "parent_of" holds the id of the parent. What I want is the parent's name instead of the id in the column "parent_of".
Basically I want to get this:
id;name;parent_of
1;John;Peter
2;Rachel;Peter
3;Peter;
I already wrote a solution, but it is not the most efficient way:
import pandas as pd
d = {'id': [1, 2, 3], 'name': ['John', 'Rachel', 'Peter'], 'parent_of': [3,3,'']}
df = pd.DataFrame(data=d)
df_tmp = df[['id', 'name']]
df = pd.merge(df, df_tmp, left_on='parent_of', right_on='id', how='left').drop('parent_of', axis=1).drop('id_y', axis=1)
df=df.rename(columns={"name_x": "name", "name_y": "parent_of"})
print(df)
Do you have any better solution to achieve this?
Thanks!
Use map:
df['parent_of']=df.parent_of.map(df.set_index('id')['name'])
df
Out[514]:
id name parent_of
0 1 John Peter
1 2 Rachel Peter
2 3 Peter NaN
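The same lookup can be sketched with a plain dictionary, which makes the id-to-name mapping explicit. Ids with no entry in the mapping come back as NaN; the fillna('') step is an assumption about the desired output, added to reproduce the empty cell shown in the question.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['John', 'Rachel', 'Peter'],
                   'parent_of': [3, 3, '']})

# Build the lookup once: {1: 'John', 2: 'Rachel', 3: 'Peter'}
id_to_name = dict(zip(df['id'], df['name']))

# Values missing from the lookup (here the empty string) map to NaN.
df['parent_of'] = df['parent_of'].map(id_to_name).fillna('')
print(df)
```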
I am trying to calculate fuzz ratios for multiple rows in 2 data frames:
df1:
id name
1 Ab Cd E
2 X.Y!Z
3 fgh I
df2:
name_2
abcde
xyz
I want to calculate the fuzz ratio between all the values in df1.name and df2.name_2:
To do that I have code:
for i in df1['name']:
    for r in df2['name_2']:
        print(fuzz.ratio(i, r))
But I want the final result to have the ids from df1 as well. It would ideally look like this:
final_df:
id name name_2 score
1 Ab Cd E abcde 50
1 Ab Cd E xyz 0
2 X.Y!Z abcde 0
2 X.Y!Z xyz 60
3 fgh I abcde 0
3 fgh I xyz 0
Thanks for the help!
You can solve your problem like this:
Create an empty DataFrame:
import pandas
from fuzzywuzzy import fuzz
final = pandas.DataFrame({'id': [], 'name': [], 'name_2': [], 'score': []})
Iterate through the two DataFrames inserting the id, names, and score and concatenating it onto the final DataFrame:
for id, name in zip(df1['id'], df1['name']):
    for name2 in df2['name_2']:
        # wrap the scalars in lists so pandas can build a one-row DataFrame
        tmp = pandas.DataFrame({'id': [id], 'name': [name], 'name_2': [name2], 'score': [fuzz.ratio(name, name2)]})
        final = pandas.concat([final, tmp], ignore_index=True)
print(final)
There is probably a cleaner and more efficient way to do this, but I hope this helps.
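One cleaner variant is to collect the rows in a list and build the DataFrame once, which avoids re-copying the growing frame on every concat. The sketch below is self-contained, so it swaps in difflib.SequenceMatcher from the standard library as a stand-in scorer; it is not fuzz.ratio and its scores may differ. The function name score is made up for the example.

```python
from difflib import SequenceMatcher
from itertools import product

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ab Cd E', 'X.Y!Z', 'fgh I']})
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})

def score(a, b):
    # Stand-in for fuzz.ratio: a 0-100 similarity from the standard library.
    return round(100 * SequenceMatcher(None, a, b).ratio())

# One dict per (df1 row, df2 row) pair, then a single DataFrame construction.
rows = [
    {'id': row_id, 'name': name, 'name_2': name2, 'score': score(name, name2)}
    for (row_id, name), name2 in product(zip(df1['id'], df1['name']), df2['name_2'])
]
final = pd.DataFrame(rows)
print(final)
```

To use the real scorer, replace the body of score with a call to fuzz.ratio(a, b).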
I don't fully understand the application of lambda functions in pd.apply, but after some SO searching, I think this is a reasonable solution.
import pandas as pd
from fuzzywuzzy import fuzz
d = [{'id': 1, 'name': 'Ab Cd e'}, {'id': 2, 'name': 'X.Y!Z'}, {'id': 3, 'name': 'fgh I'}]
df1 = pd.DataFrame(d)
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})
This is a cross join in pandas; a temporary key column is required (see "pandas cross join no columns in common"):
df1['tmp'] = 1
df2['tmp'] = 1
df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)
You can .apply the function fuzz.ratio to columns in the df (see "Pandas: How to use apply function to multiple columns"):
df['fuzz_ratio'] = df.apply(lambda row: fuzz.ratio(row['name'], row['name_2']), axis = 1)
df
I also tried setting an index on df1, but that resulted in its exclusion from the cross-joined df.
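On newer pandas versions (1.2 and later, as far as I know), merge accepts how='cross', so the temporary key column is no longer needed and the id column is carried through as an ordinary column. A minimal sketch, assuming such a version is installed and using the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ab Cd E', 'X.Y!Z', 'fgh I']})
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})

# Every row of df1 paired with every row of df2; no helper column required.
pairs = df1.merge(df2, how='cross')
print(pairs)
```

The score column can then be added to pairs with the same apply call shown above.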