Python Pandas - Conditional Join

I would like to create a DataFrame from a DataFrame I already have in Python.
The DataFrame I have looks like this:

Nome   Dept
Maria  A1
Joao   A2
Anna   A1
Jorge  A3

The DataFrame I want to create looks like this:

Dept  Funcionario 1  Funcionario 2
A1    Maria          Anna
A2    Joao

I tried the code below:

df_func.merge(df_dept, how='inner', on='Dept')

But I got the error: TypeError: merge() got multiple values for argument 'how'
Would anyone know how I can do this?
Thank you in advance! :)

Even if you fix that call and it runs, you will not get the right answer: because the key is duplicated, the join multiplies the matching rows. For example:

d = {'Name': ['maria', 'joao', 'anna', 'jorge'], 'dept': [1, 2, 1, 3]}
df = pd.DataFrame(d)
df.merge(df, how='inner', on='dept')
Out[8]:
  Name_x  dept Name_y
0  maria     1  maria
1  maria     1   anna
2   anna     1  maria
3   anna     1   anna
4   joao     2   joao
5  jorge     3  jorge

The best way around this is to groupby:

dd = df.groupby('dept').agg(list)
Out[10]:
               Name
dept
1     [maria, anna]
2            [joao]
3           [jorge]

Then apply pd.Series to expand each list into its own column:

dd['Name'].apply(pd.Series)
Out[21]:
          0     1
dept
1     maria  anna
2      joao   NaN
3     jorge   NaN
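The two steps above can be put together on the question's original data. This is a sketch assuming the Nome/Dept column names from the question; the final Funcionario headers are renamed by hand to match the desired output:

```python
import pandas as pd

df = pd.DataFrame({'Nome': ['Maria', 'Joao', 'Anna', 'Jorge'],
                   'Dept': ['A1', 'A2', 'A1', 'A3']})

# Collect the names per department, then expand each list into columns
wide = df.groupby('Dept')['Nome'].agg(list).apply(pd.Series)

# apply(pd.Series) yields integer column labels 0, 1, ...; relabel them
wide.columns = [f'Funcionario {i + 1}' for i in wide.columns]
print(wide)
```

Departments with fewer employees than the widest one get NaN in the trailing columns, matching the blanks in the desired table.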

This is how I merged two DataFrames recently:

rpt_data = connect_to_presto()  # returned data from a db
df_rpt = pd.DataFrame(rpt_data, columns=["domain", "revenue"])

# adding the sellers.json "sellers" records into a pandas df
sj_data = data  # returned response from the requests module
df_sj = pd.json_normalize(sj_data, record_path="sellers", errors="ignore")

# merging both dataframes
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)

Notice how I stored the data into a variable each time, then created a DataFrame out of it, and finally merged the two. This may not be the best approach, but it works.
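Since the Presto connection and the requests response above are not reproducible, here is a minimal self-contained sketch of the same flow; the payload shape, domains, and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical report rows, standing in for the database result
df_rpt = pd.DataFrame({"domain": ["example.com", "other.net"],
                       "revenue": [100.0, 50.0]})

# Hypothetical sellers.json-style payload, standing in for the requests response
sj_data = {"sellers": [{"domain": "example.com", "seller_type": "PUBLISHER"},
                       {"domain": "shop.test", "seller_type": "INTERMEDIARY"}]}
df_sj = pd.json_normalize(sj_data, record_path="sellers")

# inner join keeps only domains present in both frames;
# indicator=True adds a _merge column recording each row's origin
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
print(df_merged)
```

Only example.com appears in both frames, so the inner join returns a single row with its _merge value set to "both".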


Left Outer Join with two single columned dataframes

I don't see the case below mentioned in Pandas Merging 101, and I'm having trouble understanding the Pandas documentation for doing a left outer join.

import pandas as pd

left_df = pd.DataFrame({'user_id': ['Peter', 'John', 'Robert', 'Anna']})
right_df = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna']})
pd.merge(left_df, right_df, on='user_id', how='left')

Output is:
  user_id
0   Peter
1    John
2  Robert
3    Anna

Expected output:
  user_id
0   Peter
1  Robert

What am I missing? Is the indicator=True parameter a must (to create a _merge column to filter on) for left outer joins?
You can use merge with indicator=True and keep only the rows whose value is left_only, but that is not the most direct way. You can use isin to get a boolean mask, then invert it:

>>> left_df[~left_df['user_id'].isin(right_df['user_id'])]
  user_id
0   Peter
2  Robert

With merge:

>>> (left_df.merge(right_df, on='user_id', how='left', indicator='present')
...     .loc[lambda x: x.pop('present') == 'left_only'])
  user_id
0   Peter
2  Robert

Pandas AttributeError: 'str' object has no attribute 'loc'

This is my code:

DF['CustomerId'] = DF['CustomerId'].apply(str)
print(DF.dtypes)
for index, row in merged.iterrows():
    DF = DF.loc[(DF['CustomerId'] == str(row['CustomerId'])), 'CustomerId'] = row['code']

My goal is to do this: if DF['CustomerId'] is equal to row['CustomerId'], then change the value of DF['CustomerId'] to row['code']; else leave it as it is.
Both row['CustomerId'] and DF['CustomerId'] should be strings. I know that loc does not work with strings the way I am using it, but how can I do this with string types?
Thanks
You can approach this without looping by merging the two dataframes on the common CustomerId column using .merge(), and then updating the CustomerId column with the code column that came from the merged dataframe using .update(), as follows:

df_out = DF.merge(merged, on='CustomerId', how='left')
df_out['CustomerId'].update(df_out['code'])

Demo

Data preparation:

data = {'CustomerId': ['11111', '22222', '33333', '44444'],
        'CustomerInfo': ['Albert', 'Betty', 'Charles', 'Dicky']}
DF = pd.DataFrame(data)
print(DF)

  CustomerId CustomerInfo
0      11111       Albert
1      22222        Betty
2      33333      Charles
3      44444        Dicky

data = {'CustomerId': ['11111', '22222', '44444'],
        'code': ['A1011111', 'A1022222', 'A1044444']}
merged = pd.DataFrame(data)
print(merged)

  CustomerId      code
0      11111  A1011111
1      22222  A1022222
2      44444  A1044444

Run the new code:

# ensure the CustomerId columns are strings, as you did
DF['CustomerId'] = DF['CustomerId'].astype(str)
merged['CustomerId'] = merged['CustomerId'].astype(str)
df_out = DF.merge(merged, on='CustomerId', how='left')
print(df_out)

  CustomerId CustomerInfo      code
0      11111       Albert  A1011111
1      22222        Betty  A1022222
2      33333      Charles       NaN
3      44444        Dicky  A1044444

df_out['CustomerId'].update(df_out['code'])
print(df_out)

# `CustomerId` column updated as required where there are corresponding entries in dataframe `merged`
  CustomerId CustomerInfo      code
0   A1011111       Albert  A1011111
1   A1022222        Betty  A1022222
2      33333      Charles       NaN
3   A1044444        Dicky  A1044444
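For completeness, the same replacement can be done in one step with map plus fillna, which avoids carrying the extra code column in the result. This is a sketch on the same demo data as above:

```python
import pandas as pd

DF = pd.DataFrame({'CustomerId': ['11111', '22222', '33333', '44444'],
                   'CustomerInfo': ['Albert', 'Betty', 'Charles', 'Dicky']})
merged = pd.DataFrame({'CustomerId': ['11111', '22222', '44444'],
                       'code': ['A1011111', 'A1022222', 'A1044444']})

# Build a CustomerId -> code lookup; ids without a match map to NaN,
# so fall back to the original CustomerId there
lookup = merged.set_index('CustomerId')['code']
DF['CustomerId'] = DF['CustomerId'].map(lookup).fillna(DF['CustomerId'])
print(DF)
```

The result keeps 33333 unchanged and replaces the other three ids with their codes, with no extra column to drop afterwards.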

Merge between columns from the same dataframe

I have the following dataframe:

id;name;parent_of
1;John;3
2;Rachel;3
3;Peter;

where the column parent_of holds the id of that row's parent. What I want is the name instead of the id in the parent_of column. Basically, I want to get this:

id;name;parent_of
1;John;Peter
2;Rachel;Peter
3;Peter;

I already wrote a solution, but it is not the most effective way:

import pandas as pd

d = {'id': [1, 2, 3], 'name': ['John', 'Rachel', 'Peter'], 'parent_of': [3, 3, '']}
df = pd.DataFrame(data=d)
df_tmp = df[['id', 'name']]
df = pd.merge(df, df_tmp, left_on='parent_of', right_on='id', how='left').drop('parent_of', axis=1).drop('id_y', axis=1)
df = df.rename(columns={"name_x": "name", "name_y": "parent_of"})
print(df)

Do you have any better solution to achieve this?
Thanks!
Check with map:

df['parent_of'] = df.parent_of.map(df.set_index('id')['name'])
df
Out[514]:
   id    name parent_of
0   1    John     Peter
1   2  Rachel     Peter
2   3   Peter       NaN

Python: Compare two dataframes in Python with different numbers of rows and a composite key

I have two different dataframes which I need to compare. These two dataframes have different numbers of rows and don't have a single PK; it's a composite primary key of (id || ver || name || prd || loc).

df1:
id ver name  prd loc
a  1   surya 1a  x
a  1   surya 1a  y
a  2   ram   1a  x
b  1   alex  1b  z
b  1   alex  1b  y
b  2   david 1b  z

df2:
id ver name  prd loc
a  1   surya 1a  x
a  1   surya 1a  y
a  2   ram   1a  x
b  1   alex  1b  z

I tried the code below, and it works when there are the same number of rows, but in a case like the above it does not:

df1 = pd.DataFrame(Source)
df1 = df1.astype(str)  # converting all elements to strings for easy comparison
df2 = pd.DataFrame(Target)
df2 = df2.astype(str)  # converting all elements to strings for easy comparison
header_list = df1.columns.tolist()  # list of column names from df1, as both dfs have the same structure
df3 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
for x in range(len(header_list)):
    df3[header_list[x]] = np.where(df1[header_list[x]] == df2[header_list[x]], 'True', 'False')
df3.to_csv('Output', index=False)

Please let me know how to compare the datasets when there are different numbers of rows.
You can try this:

~df1.isin(df2)
# df1[~df1.isin(df2)].dropna()

Let's consider a quick example:

df1 = pd.DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl'],
    'Quantity': [18, 3, 5]})
#   Buyer  Quantity
# 0  Carl        18
# 1  Carl         3
# 2  Carl         5

df2 = pd.DataFrame({
    'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
    'Quantity': [2, 1, 18, 5]})
#   Buyer  Quantity
# 0  Carl         2
# 1  Mark         1
# 2  Carl        18
# 3  Carl         5

~df2.isin(df1)
#    Buyer  Quantity
# 0  False      True
# 1   True      True
# 2  False      True
# 3   True      True

df2[~df2.isin(df1)].dropna()
#   Buyer  Quantity
# 1  Mark         1
# 3  Carl         5

Another idea is to merge on all of the shared column names.
Sure, tweak the code to your needs. Hope this helped :)
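The merge idea mentioned above can be sketched with indicator=True, which tags each row by its origin and, unlike isin, does not depend on row positions lining up (duplicated rows on both sides would still need extra handling, e.g. a cumcount key):

```python
import pandas as pd

df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'],
                    'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
                    'Quantity': [2, 1, 18, 5]})

# Outer merge on every column; _merge records whether a row came from
# the left frame, the right frame, or both
diff = df1.merge(df2, how='outer', on=list(df1.columns), indicator=True)

# Rows present only in df2
only_in_df2 = diff[diff['_merge'] == 'right_only'].drop(columns='_merge')
print(only_in_df2)
```

On this data the rows ('Carl', 2) and ('Mark', 1) are flagged as right_only; note that ('Carl', 5) is correctly recognized as present in both frames, which the position-based isin comparison above misses.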

How to create a data frame with links between data in two different data frames

I have one pandas dataframe for persons like:

pid name job
1   Mike A
2   Lucy A
3   Jeff B

And a second one for jobs like:

id name
1  A
2  B
3  C

What I want to produce is a third dataframe where I list the connections between people and jobs, so in this dummy example the desired result will be:

personid jobid
1        1
2        1
3        2

How can I accomplish this with pandas? I don't understand how to join in this case, since it's not a row-by-row thing...
Try this with pandas, supposing you have df1 and df2:

import pandas as pd

df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')
print(df1)
print(df2)

df1:
   pid  name job
0    1  Mike   A
1    2  Lucy   A
2    3  Jeff   B

and df2:
   id name
0   1    A
1   2    B
2   3    C

then:

df2['job'] = df2['name']
df_result = df1.merge(df2, on='job', how='left')
print(df_result[['pid', 'id']])

It will print out:
   pid  id
0    1   1
1    2   1
2    3   2
output = pd.merge(persons, jobs, how='left', left_on='job', right_on='name')[['pid', 'id']]
Output:
pid id
0 1 1
1 2 1
2 3 2
The two given dataframes are the following:

import pandas as pd

people_df = pd.DataFrame([[1, "Mike", "A"], [2, "Lucy", "A"], [3, "Jeff", "B"]], columns=["pid", "name", "job"])
jobs_df = pd.DataFrame([[1, "A"], [2, "B"], [3, "C"]], columns=["id", "name"])

You can get the desired result with the merge method:

merged_df = pd.merge(people_df, jobs_df, left_on='job', right_on='name')
result = merged_df[['pid', 'id']].rename(columns={'pid': 'personid', 'id': 'jobid'})  # extract and rename the columns

merge performs an "inner join" by default; you can use the how option if you want another kind of join.
