Overlay of two dataframes using Python

Problem:
I have two dataframes, 'infm' and 'ufl'. In 'ufl', the Age and Salary columns for Ben and Creg have been updated. I want to update the corresponding rows in 'infm' too.
Approach Taken:
I am iterating through each row of 'infm' and using the 'Name' column to match the two dataframes. If the names match, I update the Age column of 'infm' with the value from 'ufl'.
Input:
NAME AGE SALARY COUNTRY
Adam 24 25000 x
Ben 25 30000 y
Creg 23 22000 x
Dawood 25 30000 w
Update on two rows of Input:
NAME AGE SALARY COUNTRY
Ben 36 90000 y
Creg 34 92000 x
Expected Output:
NAME AGE SALARY COUNTRY
Adam 24 25000 x
Ben 36 90000 y
Creg 34 92000 x
Dawood 25 30000 w
Actual output:
NAME AGE SALARY COUNTRY
Adam 24 25000 x
Ben 25 30000 y
Creg 23 22000 x
Dawood 25 30000 w
Code used:
import pandas as pd
infm=pd.read_excel('D:/data/test.xls')
ufl=pd.read_excel('D:/data/test1.xls')
for row in infm.iterrows():
    a=row[1]['Name']
    b=ufl['Name'].unique().tolist()
    for i in b:
        if i==a:
            row[1]['Age']=(ufl['Age'][ufl['Name']==a]).tolist()[0]
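A likely reason the loop leaves 'infm' unchanged is that iterrows() yields each row as a separate Series, so assigning to row[1]['Age'] generally changes a copy rather than the dataframe itself. A minimal sketch of an alternative, assuming 'Name' uniquely identifies a row in both frames, is to index both frames by name and call DataFrame.update. The sample frames below are rebuilt by hand from the tables in the question instead of being read from the Excel files, and the name `result` is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question's tables (assumption: the Excel
# files contain these columns, spelled 'Name', 'Age', 'Salary', 'Country').
infm = pd.DataFrame({
    "Name": ["Adam", "Ben", "Creg", "Dawood"],
    "Age": [24, 25, 23, 25],
    "Salary": [25000, 30000, 22000, 30000],
    "Country": ["x", "y", "x", "w"],
})
ufl = pd.DataFrame({
    "Name": ["Ben", "Creg"],
    "Age": [36, 34],
    "Salary": [90000, 92000],
    "Country": ["y", "x"],
})

# Index both frames by Name so that update() aligns rows by name.
result = infm.set_index("Name")

# update() overwrites values in place wherever ufl has a non-missing value;
# rows of result that do not appear in ufl are left as they are.
result.update(ufl.set_index("Name"))

# Turn Name back into an ordinary column.
result = result.reset_index()
print(result)
```

Depending on the pandas version, the numeric columns may come back as floats after update(); casting them with astype(int) afterwards is one way to restore integers if that matters.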

Related

How can I loop through the columns in each row using Python

I have a dataframe like this:
import pandas as pd
employees = [('jack', 34, 'Sydney', 800),
             ('Riti', 31, 'Delhi', 800),
             ('Aadi', 16, 'New York', 800),
             ('Mohit', 32, 'Delhi', 1500),
            ]
empDfObj = pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'], index=['a', 'b', 'c', 'd'])
How can I loop through the columns in each row and get a result like this using pandas? Maybe by collecting each row into a small list.
a Name jack Age 34 City Sydney Salary 800
b Name Riti Age 31 City Delhi Salary 800
c Name Aadi Age 16 City New York Salary 800
d Name Mohit Age 32 City Delhi Salary 1500

You could use DataFrame.to_dict with orient set to 'index' (df below stands for the dataframe from the question).
The resulting dict has the form:
{ idx1 : {col1:val1, col2:val2 ... coln:valn},
  idx2 : {col1:val1, col2:val2 ... coln:valn},
  ...
}
Loop through the dict and build a list of strings if you would like to store them as a list:
[
    f'{idx} {" ".join([str(v) for t in vals.items() for v in t])}'
    for idx, vals in df.to_dict("index").items()
]
# output
# ['a Name jack Age 34 City Sydney Salary 800',
# 'b Name Riti Age 31 City Delhi Salary 800',
# 'c Name Aadi Age 16 City New York Salary 800',
# 'd Name Mohit Age 32 City Delhi Salary 1500']
If you only want to print them, you don't need to build a list of strings. You could do:
for idx, vals in df.to_dict('index').items():
    print(idx, *[v for t in vals.items() for v in t], sep=" ")
#output
# a Name jack Age 34 City Sydney Salary 800
# b Name Riti Age 31 City Delhi Salary 800
# c Name Aadi Age 16 City New York Salary 800
# d Name Mohit Age 32 City Delhi Salary 1500
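To connect this answer to the question's data, here is a self-contained sketch that rebuilds the frame, calls to_dict("index"), and formats one string per row. The variable names `records` and `lines` are only illustrative, and the frame is assumed to match the one in the question.

```python
import pandas as pd

# Frame rebuilt from the question (assumed to match the asker's data).
employees = [
    ("jack", 34, "Sydney", 800),
    ("Riti", 31, "Delhi", 800),
    ("Aadi", 16, "New York", 800),
    ("Mohit", 32, "Delhi", 1500),
]
df = pd.DataFrame(
    employees,
    columns=["Name", "Age", "City", "Salary"],
    index=["a", "b", "c", "d"],
)

# One inner dict per row, keyed by the row's index label.
records = df.to_dict("index")

# Join "column value" pairs for each row, prefixed with the index label.
lines = [
    f"{idx} " + " ".join(f"{col} {val}" for col, val in vals.items())
    for idx, vals in records.items()
]

for line in lines:
    print(line)
```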

I kept it simple:
s=''
for index, row in df.iterrows():
    if index in s:
        pass
    else:
        s+=str(index)
    for key, value in row[:].items():
        s+=" "+ key+" "+str(value)
    print(s)
    s=''
Output:
a Name jack Age 34 City Sydney Salary 800
b Name Riti Age 31 City Delhi Salary 800
c Name Aadi Age 16 City New York Salary 800
d Name Mohit Age 32 City Delhi Salary 1500

Combine text using delimiter for duplicate column values

What I'm trying to achieve is to combine the Name values into one comma-delimited value whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input :
pd.DataFrame({'Name': {0: 'John',1: 'Steven',2: 'Ibrahim',3: 'George',4: 'Nancy',5: 'Mo',6: 'Khalil'},
'Country': {0: 'USA',1: 'UK',2: 'UK',3: 'France',4: 'Ireland',5: 'Ireland',6: 'Ireland'},
'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output :
Rows 1 & 2 (in the input) are grouped into one since the Country column is duplicated, and the Salary column is summed up.
The same goes for rows 4, 5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What I have tried so far (I'm not sure how to combine the text in the Name column):
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100

Use groupby() and agg():
out=df.groupby('Country',as_index=False).agg({'Name':', '.join,'Salary':'sum'})
If you need only the unique values of the 'Name' column in each group, use:
out=(df.groupby('Country',as_index=False)
       .agg({'Name':lambda x:', '.join(set(x)),'Salary':'sum'}))
Note: use pd.unique() in place of set() if the order of the unique values is important, since a set does not preserve order.
Output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
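The note about pd.unique() can be sketched as follows. The data here is a made-up variant of the question's frame with some names repeated inside a country, purely so that the de-duplication is visible; it is not the asker's data.

```python
import pandas as pd

# Hypothetical variant of the question's data with repeated names.
df = pd.DataFrame({
    "Name": ["John", "Steven", "Ibrahim", "Steven", "Nancy", "Mo", "Nancy"],
    "Country": ["USA", "UK", "UK", "UK", "Ireland", "Ireland", "Ireland"],
    "Salary": [100, 200, 200, 50, 50, 100, 10],
})

# pd.unique keeps the first occurrence of each name in order of appearance,
# whereas set() gives no ordering guarantee.
out = df.groupby("Country", as_index=False).agg(
    {"Name": lambda x: ", ".join(pd.unique(x)), "Salary": "sum"}
)
print(out)
```

Salaries are still summed over every row of the group, including the rows whose names were de-duplicated.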

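Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
To restore the original column order, add [df.columns] at the end. Passing sort=False to groupby keeps the groups in their order of first appearance instead of sorting them by Country:
df.groupby(['Country'], as_index=False, sort=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]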
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160

Add a column from an existing dataframe into another between every other column

I'll try my best to explain this as I had trouble phrasing the title. I have two dataframes. What I would like to do is add a column from df1 into df2 between every other column.
For example, df1 looks like this :
Age City
0 34 Sydney
1 30 Toronto
2 31 Mumbai
3 32 Richmond
And after adding in df2 it looks like this:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
In terms of code, I wasn't quite sure where to even start.
'''Concatenating the dataframes'''
for i in range(len(df2)):
    pos = i+1
    df3 = df2.insert  # incomplete: this is where I got stuck
#df2 = pd.concat([df1, df2], axis=1).sort_index(axis=1)
#df2.columns = np.arange(len(df2.columns))
#print (df2)
I was originally going to run it through a loop, but I wasn't quite sure how to do it. Any help would be appreciated!

You can use itertools.zip_longest. For example:
from itertools import zip_longest
new_columns = [
    v
    for v in (c for a in zip_longest(df2.columns, df1.columns) for c in a)
    if v is not None
]
df_out = pd.concat([df1, df2], axis=1)[new_columns]
print(df_out)
Prints:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
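Since the question started from DataFrame.insert, here is a hedged sketch of that route as an alternative to reordering after concat. It assumes df2 originally holds the Name, Clicks and Country columns (the question only shows the combined result), and `df_out` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Age": [34, 30, 31, 32],
    "City": ["Sydney", "Toronto", "Mumbai", "Richmond"],
})
# Assumed original contents of df2; the question shows only the final frame.
df2 = pd.DataFrame({
    "Name": ["Ali", "Lori", "Asher", "Lylah"],
    "Clicks": [10, 20, 45, 33],
    "Country": ["Australia", "Canada", "United States", "United States"],
})

df_out = df2.copy()  # leave df2 itself untouched
for i, col in enumerate(df1.columns):
    # Target positions 1, 3, 5, ... so each df1 column follows one df2 column.
    # Clamp to the current width in case df1 has more columns than df2.
    loc = min(2 * i + 1, df_out.shape[1])
    df_out.insert(loc, col, df1[col])

print(df_out)
```

insert() modifies the frame in place and returns None, which is why its result is not assigned to a new variable.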

Pandas: Merge many-to-one

Let's say I have 2 data frames:
df1:
Name Age
Pete 19
John 30
Max 24
df2:
Name Subject Grade
Pete Math 90
Pete History 100
John English 90
Max History 90
Max Math 80
I want to merge df2 into df1, many to one, to end up with something like this:
Name Age Subject Grade
Pete 19 Math 90
Pete 19 History 100
John 30 English 90
Max 24 History 90
Max 24 Math 80
I don't want to group them by Subject and Grade, I need to duplicate them so it would keep everything.

You could simply use pd.merge as follows:
import pandas as pd
if __name__ == '__main__':
    df1 = pd.DataFrame({"Name": ["Pete", "John", "Max"],
                        "Age": [19, 30, 24]})
    df2 = pd.DataFrame({"Name": ["Pete", "Pete", "John", "Max", "Max"],
                        "Subject": ["Math", "History", "English", "History", "Math"],
                        "Grade": [90, 100, 90, 90, 80]})
    df3 = pd.merge(df1, df2, how="right", on="Name")
    print(df1)
    print(df2)
    print(df3)
Result:
Name Age
0 Pete 19
1 John 30
2 Max 24
Name Subject Grade
0 Pete Math 90
1 Pete History 100
2 John English 90
3 Max History 90
4 Max Math 80
Name Age Subject Grade
0 Pete 19 Math 90
1 Pete 19 History 100
2 John 30 English 90
3 Max 24 History 90
4 Max 24 Math 80
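If you want pandas to check that the relationship really is one row in df1 to many rows in df2, merge accepts a validate argument. The sketch below reuses the frames from this answer; using how="left" and the name `merged` are choices made for this example, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Pete", "John", "Max"],
                    "Age": [19, 30, 24]})
df2 = pd.DataFrame({"Name": ["Pete", "Pete", "John", "Max", "Max"],
                    "Subject": ["Math", "History", "English", "History", "Math"],
                    "Grade": [90, 100, 90, 90, 80]})

# validate="one_to_many" asserts that Name is unique in the left frame (df1);
# each Age is then repeated for every matching row of df2.
merged = pd.merge(df1, df2, on="Name", how="left", validate="one_to_many")
print(merged)
```

If a name appeared twice in df1, this call would raise a MergeError instead of silently multiplying rows.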

How to split single column of pandas dataframe into multiple columns with group?

I am new to python pandas. I have one dataframe like below:
df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
'age': ['25', '22','21','32','37','26','24','30']})
print(df)
Name age
0 football 25
1 ramesh 22
2 suresh 21
3 pankaj 32
4 cricket 37
5 rakesh 26
6 mohit 24
7 mahesh 30
The "Name" column contains both sport names and sportsperson names. I want to split it into two different columns like below:
Expected Output:
sports_name sport_person_name age
football ramesh 25
suresh 22
pankaj 32
cricket rakesh 26
mohit 24
mahesh 30
If I group by the "Name" column I don't get the expected output, which is not surprising because there are no duplicates in "Name". What do I need to use to get the expected output?
Edit: If you don't want to hardcode the sports names:
import numpy as np
df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
'age': ['', '22','21','32','','26','24','30']})
#treat empty strings as missing values
df = df.replace('', np.nan)
#rows with a missing value are the sports names
nan_rows = df[df.isnull().T.any().T]
sports = nan_rows['Name'].tolist()
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
I checked which rows contain NaN in the columns other than "Name"; those rows are the sports names. I collected them into a list and then used the solution below to create the sports_name and sport_person_name columns.

You can use:
#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 football suresh 21
2 football pankaj 32
3 cricket rakesh 26
4 cricket mohit 24
5 cricket mahesh 30
Similar solution with DataFrame.insert - then reorder is not necessary:
#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 football suresh 21
2 football pankaj 32
3 cricket rakesh 26
4 cricket mohit 24
5 cricket mahesh 30
If you want the sport to appear only once per group, add limit=1 to ffill and replace the remaining NaNs with empty strings:
sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
sports_name sport_person_name age
0 football ramesh 22
1 suresh 21
2 pankaj 32
3 cricket rakesh 26
4 mohit 24
5 mahesh 30
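To see what the where/isin/ffill step does on its own, here is a small sketch on a shortened, made-up Series. The names `masked` and `filled` are only for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the Name column.
names = pd.Series(["football", "ramesh", "suresh", "cricket", "rakesh"])
sports = ["football", "cricket"]

# where() keeps values for which the condition is True and puts NaN elsewhere,
# so only the sport names survive this step.
masked = names.where(names.isin(sports))

# ffill() copies the last seen sport name down into the NaN rows below it.
filled = masked.ffill()

print(pd.DataFrame({"Name": names, "masked": masked, "filled": filled}))
```

Rows where `filled` equals `names` are the sport header rows, which is why the answer drops them with the != comparison.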

The output you want is really a dictionary and not a dataframe.
The dictionary would look like this:
{'Sport' : {'Player' : age,'Player2' : age}}
If you really want a dataframe, and the sport name always comes before its players:
import pandas as pd
df = pd.DataFrame({'Name': ['football','ramesh','suresh','pankaj','cricket'
                            ,'rakesh','mohit','mahesh'],
                   'age': ['25', '22','21','32','37','26','24','30']})
sports=['football', 'cricket']
#rename the column first so the lookups below work
df = df.rename(columns={'Name':'sport_person_name'})
wanted_dict={}
current_sport=''
for val in df['sport_person_name']:
    if val in sports:
        current_sport=val
    else:
        wanted_dict[val]=current_sport
#Now you got - {name:sport_name,...}
#sport rows are not keys of the dict, so map() gives them NaN
df['sports_name']=df['sport_person_name'].map(wanted_dict)
#drop the sport rows and put the columns in the wanted order
df = df.dropna(subset=['sports_name'])
df = df[['sports_name','sport_person_name','age']]
What it should look like:
sports_name sport_person_name age
football ramesh 22
football suresh 21
football pankaj 32
cricket rakesh 26
cricket mohit 24
cricket mahesh 30
