Merge between columns from the same dataframe - python

I have the following dataframe:
id;name;parent_of
1;John;3
2;Rachel;3
3;Peter;
The column "parent_of" holds the id of that row's parent. What I want is the parent's name instead of its id in the "parent_of" column.
Basically I want to get this:
id;name;parent_of
1;John;Peter
2;Rachel;Peter
3;Peter;
I already wrote a solution, but it's not the most efficient way:
import pandas as pd
d = {'id': [1, 2, 3], 'name': ['John', 'Rachel', 'Peter'], 'parent_of': [3,3,'']}
df = pd.DataFrame(data=d)
df_tmp = df[['id', 'name']]
df = pd.merge(df, df_tmp, left_on='parent_of', right_on='id', how='left').drop('parent_of', axis=1).drop('id_y', axis=1)
df=df.rename(columns={"name_x": "name", "name_y": "parent_of"})
print(df)
Do you have any better solution to achieve this?
Thanks!

Check with map
df['parent_of']=df.parent_of.map(df.set_index('id')['name'])
df
Out[514]:
id name parent_of
0 1 John Peter
1 2 Rachel Peter
2 3 Peter NaN
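For completeness, a self-contained version of the map approach, using the question's example data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['John', 'Rachel', 'Peter'],
                   'parent_of': [3, 3, '']})
# Build an id -> name lookup Series, then map it over parent_of;
# ids with no match (like the empty string) become NaN
df['parent_of'] = df['parent_of'].map(df.set_index('id')['name'])
print(df)
```

This avoids the merge, the temporary frame, and the column renaming entirely.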

Related

Left Outer Join with two single columned dataframes

I don't see the below case mentioned in Pandas Merging 101.
I'm having trouble understanding the Pandas documentation for doing a left outer join.
import pandas as pd
left_df = pd.DataFrame({
    'user_id': ['Peter', 'John', 'Robert', 'Anna']
})
right_df = pd.DataFrame({
    'user_id': ['Paul', 'Mary', 'John', 'Anna']
})
pd.merge(left_df, right_df, on = 'user_id', how = 'left')
Output is:
user_id
0 Peter
1 John
2 Robert
3 Anna
Expected output:
user_id
0 Peter
1 Robert
What am I missing? Is the indicator = True parameter a must (to create a _merge column to filter on) for left outer joins?
You can use merge with indicator=True and keep only the rows where the value is left_only, but that's not the best way. You can instead use isin to get a boolean mask, then invert it:
>>> left_df[~left_df['user_id'].isin(right_df['user_id'])]
user_id
0 Peter
2 Robert
With merge:
>>> (left_df.merge(right_df, on='user_id', how='left', indicator='present')
.loc[lambda x: x.pop('present') == 'left_only'])
user_id
0 Peter
2 Robert
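Putting the isin version together as a runnable sketch with the question's data:

```python
import pandas as pd

left_df = pd.DataFrame({'user_id': ['Peter', 'John', 'Robert', 'Anna']})
right_df = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna']})

# Anti-join: keep only the rows of left_df whose user_id does not appear in right_df
result = left_df[~left_df['user_id'].isin(right_df['user_id'])]
print(result)
```

Note that a plain left join keeps every left row by design; it is the filtering step (isin or the indicator column) that turns it into the "left excluding" result the question expects.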

Frequency of values in a column across multiple pandas data frames

I have multiple pandas data frames (more than 70), each with the same columns. Let's say there are only 10 rows in each data frame. I want to count how often each value occurs in a given column across all the data frames and list the results. Example:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
data = [['sam', 12], ['nick', 15], ['juli', 14]]
df2 = pd.DataFrame(data, columns = ['Name', 'Age'])
I am expecting the output as
Name Age
tom 1
sam 1
nick 2
juli 2
You can do the following:
from collections import Counter
d={'df1':df1, 'df2':df2, ..., 'df70':df70}
l=[list(d[i]['Name']) for i in d]
m=sum(l, [])
result=Counter(m)
print(result)
Do you want value counts of Name column across all dataframes?
main = pd.concat([df,df2])
main["Name"].value_counts()
juli 2
nick 2
sam 1
tom 1
Name: Name, dtype: int64
This can work if your data frames are not costly to concat:
pd.concat([x['Name'] for x in [df,df2]]).value_counts()
nick 2
juli 2
tom 1
sam 1
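A self-contained version of the concat-and-count idea, using the question's two example frames (with 70 frames you would build the list programmatically instead of writing it out):

```python
import pandas as pd

df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['sam', 12], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])

# Stack the Name columns from every frame, then count each value
counts = pd.concat([x['Name'] for x in [df, df2]]).value_counts()
print(counts)
```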
You can try this:
df = pd.concat([df, df2]).groupby('Name', as_index=False).count()
df.rename(columns={'Age': 'Count'}, inplace=True)
print(df)
Name Count
0 juli 2
1 nick 2
2 sam 1
3 tom 1
You can try this:
df = pd.concat([df, df2])
df = df.groupby(['Name'])['Age'].count().to_frame().reset_index()
df = df.rename(columns={"Age": "Count"})
print(df)

Dropping the first row of a dataframe when looping through a list of dataframes

I am trying to loop through a list of dataframes containing tables I pulled from a website using pd.read_html. I want to drop the first row in each dataframe and tried the code below, but it's not working. Does anyone know why?
for df in df_list:
    df.columns = df.iloc[0]
    df.drop(df.index[0])
df_list[0]
**Hospital/Location Specialty**
0 Hospital/Location Specialty
1 Maimonides Med Ctr-NY Maimonides Med Ctr-NY Medicine-Preliminary Anesthesiology
2 Jacobi Med Ctr/Einstein-NY Pediatrics
3 Jacobi Med Ctr/Einstein-NY Pediatrics
4 Temple Univ Hosp-PA Internal Medicine
You need to assign it back to df.
Like this,
df=df.drop(df.index[0])
This removes index 0 from the dataframe, so it now starts at index 1. Assigning it back inside the loop:
for idx, df in enumerate(df_list):
    df.columns = df.iloc[0]
    df_list[idx] = df.drop(df.index[0])
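A minimal, runnable version of that pattern, using a small made-up frame in place of the scraped tables:

```python
import pandas as pd

# Toy stand-in for the list of tables returned by pd.read_html
df_list = [pd.DataFrame({0: ['Hospital', 'Maimonides', 'Jacobi'],
                         1: ['Specialty', 'Anesthesiology', 'Pediatrics']})]

for idx, df in enumerate(df_list):
    df.columns = df.iloc[0]              # promote the first row to the header
    df_list[idx] = df.drop(df.index[0])  # drop that row and store the result back

print(df_list[0])
```

The key point is that drop returns a new dataframe rather than mutating in place, so the result must be stored back into the list.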
Why not use a comprehension?
# test data:
df1 = pd.DataFrame({0: ['col1', 'A', 'B'], 1: ['col2', '1', '2']})
df2 = pd.DataFrame({0: ['colA', 'a', 'b'], 1: ['colB', 'hello', 'goodbye']})
dfs = [df1, df2]
renamed = [d.rename(columns=d.iloc[0]).drop(0) for d in dfs]
for df in renamed:
    print(df)
# outputs:
col1 col2
1 A 1
2 B 2
colA colB
1 a hello
2 b goodbye

Pandas: How can I iterate a for loop over 2 different data-frames?

I am trying to calculate fuzz ratios for multiple rows in 2 data frames:
df1:
id name
1 Ab Cd E
2 X.Y!Z
3 fgh I
df2:
name_2
abcde
xyz
I want to calculate the fuzz ratio between all the values in df1.name and df2.name_2:
To do that I have this code:
for i in df1['name']:
    for r in df2['name_2']:
        print(fuzz.ratio(i, r))
But I want the final result to have the ids from df1 as well. It would ideally look like this:
final_df:
id name name_2 score
1 Ab Cd E abcde 50
1 Ab Cd E xyz 0
2 X.Y!Z abcde 0
2 X.Y!Z xyz 60
3 fgh I abcde 0
3 fgh I xyz 0
Thanks for the help!
You can solve your problem like this:
Create an empty DataFrame:
final = pandas.DataFrame({'id': [], 'name': [], 'name_2': [], 'score': []})
Iterate through the two DataFrames inserting the id, names, and score and concatenating it onto the final DataFrame:
for id, name in zip(df1['id'], df1['name']):
    for name2 in df2['name_2']:
        tmp = pandas.DataFrame({'id': [id], 'name': [name], 'name_2': [name2],
                                'score': [fuzz.ratio(name, name2)]})
        final = pandas.concat([final, tmp], ignore_index=True)
print(final)
There is probably a cleaner and more efficient way to do this, but I hope this helps.
I don't fully understand the application of lambda functions in pd.apply, but after some SO searching, I think this is a reasonable solution.
import pandas as pd
from fuzzywuzzy import fuzz
d = [{'id': 1, 'name': 'Ab Cd e'}, {'id': 2, 'name': 'X.Y!Z'}, {'id': 3, 'name': 'fgh I'}]
df1 = pd.DataFrame(d)
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})
This is a cross join in pandas; a temporary key column is required (see "pandas cross join no columns in common"):
df1['tmp'] = 1
df2['tmp'] = 1
df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)
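As a side note, pandas 1.2+ supports how='cross' directly, which avoids the temporary key column entirely (a sketch, assuming a recent pandas version):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ab Cd e', 'X.Y!Z', 'fgh I']})
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})

# Cartesian product: every row of df1 paired with every row of df2
df = df1.merge(df2, how='cross')
print(df)
```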
You can .apply the function fuzz.ratio to columns in the df (see "Pandas: How to use apply function to multiple columns"):
df['fuzz_ratio'] = df.apply(lambda row: fuzz.ratio(row['name'], row['name_2']), axis = 1)
df
I also tried setting an index on df1, but that resulted in its exclusion from the cross-joined df.

Pandas: Converting Columns to Rows based on ID

I am new to pandas.
I have the following dataframe:
df = pd.DataFrame([[1, 'name', 'peter'], [1, 'age', 23], [1, 'height', '185cm']], columns=['id', 'column','value'])
id column value
0 1 name peter
1 1 age 23
2 1 height 185cm
I need to create a single row for each ID. Like so:
id name age height
0 1 peter 23 185cm
Any help is greatly appreciated, thank you.
You can use pivot_table with ','.join as the aggregation function:
df = pd.DataFrame([[1, 'name', 'peter'],
                   [1, 'age', 23],
                   [1, 'height', '185cm'],
                   [1, 'age', 25]], columns=['id', 'column', 'value'])
print (df)
id column value
0 1 name peter
1 1 age 23
2 1 height 185cm
3 1 age 25
df1 = df.astype(str).pivot_table(index="id",columns="column",values="value",aggfunc=','.join)
print (df1)
column age height name
id
1 23,25 185cm peter
Another solution with groupby + apply join and unstack:
df1 = df.astype(str).groupby(["id","column"])["value"].apply(','.join).unstack(fill_value=0)
print (df1)
column age height name
id
1 23,25 185cm peter
Assuming your dataframe is named "df", the line below would help:
df.pivot(index="id", columns="column", values="value")
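With the question's column names, that pivot would look like this as a runnable sketch (reset_index flattens the result back into an ordinary row per id):

```python
import pandas as pd

df = pd.DataFrame([[1, 'name', 'peter'], [1, 'age', 23], [1, 'height', '185cm']],
                  columns=['id', 'column', 'value'])

# Spread the 'column' labels into their own columns, one row per id
out = df.pivot(index='id', columns='column', values='value').reset_index()
print(out)
```

Note that plain pivot raises an error on duplicate (id, column) pairs, which is why the pivot_table answer above uses an aggregation function.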
