Multiple VLOOKUPs on a main dataframe from multiple sources? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have a master dataset for which I wish to do VLOOKUP-style lookups to add additional columns.
Here's what I am aiming to achieve:
Master Data:
Ctry Product
CN BTL
VN HP
Ref table 1:
Ctry Country
AU Australia
CN China
VN Vietnam
Ref table 2:
ProductID Product
BTL Bottles
HP Handphone
PRN Printer
How do I combine all of these into the Master Data as below?
Expected Output:
Ctry Product Country Product
CN BTL China Bottles
VN HP Vietnam Handphone
My code below only references one table and I'm stuck. How do I go about adding the additional columns to the existing master data sheet?
import pandas as pd

# IMPORT DATA
df1 = pd.read_excel("Masterdata.xlsx")
df2 = pd.read_excel("Ref_table_1.xlsx")
left_join = pd.merge(df1, df2, on='Ctry', how='left')
left_join.to_excel("Output.xlsx", index=False)

You can use this and then drop the columns that aren't required:
import pandas as pd
df1 = pd.DataFrame({'Ctry': ['CN', 'VN'], 'Product': ['BTL', 'HP']})
df2 = pd.DataFrame({'Ctry': ['AU', 'CN', 'VN'], 'Country': ['Australia', 'China', 'Vietnam']})
df3 = pd.DataFrame({'ProductID': ['BTL', 'HP', 'PRN'], 'Product': ['Bottles', 'Handphone', 'Printer']})
# Merge on the shared 'Ctry' column, then map product codes to product names
m1 = df1.merge(df2, how='left')
m2 = m1.merge(df3, how='left', left_on='Product', right_on='ProductID')
print(m2)
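Since m1 and df3 both contain a Product column, the second merge suffixes them as Product_x (the code) and Product_y (the full name). As a minimal follow-up sketch to match the expected output (the ProductCode name is just my choice for disambiguation):
# Drop the helper key and rename the suffixed columns
out = m2.drop(columns=['ProductID']).rename(columns={'Product_x': 'ProductCode', 'Product_y': 'Product'})
print(out)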

How do I filter a dataframe based on complicated conditions? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 months ago.
Right now my dataframes look like this (I simplified them because the originals have hundreds of rows):
import pandas as pd

Winner = [[1938, "Italy"], [1950, "Uruguay"], [2014, "Germany"]]
df = pd.DataFrame(Winner, columns=['Year', 'Winner'])
print(df)

MatchB = [[1938, "Germany", 1.0], [1938, "Germany", 2.0], [1938, "Brazil", 1.0],
          [1950, "Italy", 2.0], [1950, "Spain", 2.0], [1950, "Spain", 1.0],
          [1950, "Spain", 1.0], [1950, "Brazil", 1.0], [2014, "Italy", 2.0],
          [2014, "Spain", 3.0], [2014, "Germany", 1.0]]
df2B = pd.DataFrame(MatchB, columns=['Year', 'Away Team Name', 'Away Team Goals'])
print(df2B)
I would like to filter df2B so that I keep only the rows where the Year and Away Team Name match a Year/Winner pair in df:
Filtered list (simplified; originally shown as an image)
I checked Google but couldn't find anything useful.
You can merge.
df = pd.merge(left=df, right=df2B, left_on=["Year", "Winner"], right_on=["Year", "Away Team Name"])
print(df)
Output:
Year Winner Away Team Name Away Team Goals
0 2014 Germany Germany 1.0
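If you would rather keep only df2B's rows and columns (the question asks to filter df2B, not to combine it with df), a minimal sketch using a MultiIndex membership test:
# Keep the df2B rows whose (Year, Away Team Name) pair appears as a
# (Year, Winner) pair in df; df2B's columns are left untouched.
mask = df2B.set_index(['Year', 'Away Team Name']).index.isin(
    df.set_index(['Year', 'Winner']).index)
print(df2B[mask])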

Pandas - Add columns for mean and std after groupby statement [duplicate]

This question already has answers here:
Multiple aggregations of the same column using pandas GroupBy.agg()
(4 answers)
Closed 1 year ago.
I have the following dataframe:
import pandas as pd

d = {'City': ['Paris', 'London', 'NYC', 'Paris', 'NYC'], 'ppl': [3000, 4646, 33543, 85687568, 34545]}
df = pd.DataFrame(data=d)
df_mean = df.groupby('City').mean()
Now, instead of just calculating the mean (and .std()) of the ppl column, I want to have the city, mean, and std together in my dataframe (with the cities grouped, of course). If this is not possible, it would be OK to at least add the .std() column to my resulting dataframe.
You can use GroupBy.agg(), as follows:
df.groupby('City').agg({'ppl': ['mean', 'std']})
If you don't want the column City to be the index, you can do:
df.groupby('City').agg({'ppl': ['mean', 'std']}).reset_index()
or
df.groupby('City')['ppl'].agg(['mean', 'std']).reset_index()
Result:
City mean std
0 London 4646 NaN
1 NYC 34044 7.085210e+02
2 Paris 42845284 6.058814e+07
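Note that the dict form agg({'ppl': ['mean', 'std']}) produces a MultiIndex on the columns. If you prefer flat, self-describing column names, named aggregation (available since pandas 0.25) is one option; the mean_ppl/std_ppl names here are just examples:
# Each keyword becomes an output column: name=(source_column, aggregation)
df.groupby('City').agg(mean_ppl=('ppl', 'mean'), std_ppl=('ppl', 'std')).reset_index()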

Pandas to lookup and return corresponding values from many dataframes

I have a list of names and want to retrieve each name's corresponding information from different dataframes, to form a new dataframe.
I converted the list into a one-column dataframe, intending to look up its corresponding values in the other dataframes.
The idea is visualized as: (diagram originally shown as an image)
I have tried:
import pandas as pd

data = {'Name': ["David", "Mike", "Lucy"]}
data_h = {'Name': ["David", "Mike", "Peter", "Lucy"],
          'Hobby': ['Music', 'Sports', 'Cooking', 'Reading'],
          'Member': ['Yes', 'Yes', 'Yes', 'No']}
data_s = {'Name': ["David", "Lancy", "Mike", "Lucy"],
          'Speed': [56, 42, 35, 66],
          'Location': ['East', 'East', 'West', 'West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print(df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter each lookup table to just the key column and the columns you want to append, like:
df = (pd.merge(df, df_hobby[['Name', 'Hobby']], on='Name')
        .merge(df_speed[['Name', 'Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use this solution with the same column filtering:
from functools import reduce

dfList = [df,
          df_hobby[['Name', 'Hobby']],
          df_speed[['Name', 'Location']]]
df = reduce(lambda df1, df2: pd.merge(df1, df2, on='Name'), dfList)
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
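As a sketch of another option for many lookup tables, assuming Name is unique within each table: index every lookup frame on the key and join them in one call. join defaults to a left join on the index, so names missing from a lookup table come back as NaN instead of dropping the row:
lookups = [df_hobby.set_index('Name')[['Hobby']],
           df_speed.set_index('Name')[['Location']]]
# DataFrame.join accepts a list of frames and joins them all on the index
result = df.set_index('Name').join(lookups).reset_index()
print(result)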

Derive pandas column from another

I have a column in my dataframe like the following (originally shown as an image).
What I want is to create a new column, country, based on the language column. If someone's language is "eng", the country column should be filled with "UK".
Desired output (originally shown as an image)
NB: This is a sample I created in Excel; I am working with pandas in a Jupyter notebook.
Considering this to be your df:
In [1359]: df = pd.DataFrame({'driver':['Hamilton', 'Sainz', 'Giovanazi'], 'language':['eng', 'spa', 'ita']})
In [1360]: df
Out[1360]:
driver language
0 Hamilton eng
1 Sainz spa
2 Giovanazi ita
And this to be your language-country mapping:
In [1361]: mapping = {'eng': 'UK', 'spa': 'Spain', 'ita': 'Italy'}
You can use Series.map to solve it:
In [1363]: df['country'] = df.language.map(mapping)
In [1364]: df
Out[1364]:
driver language country
0 Hamilton eng UK
1 Sainz spa Spain
2 Giovanazi ita Italy
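One caveat: any language code missing from mapping comes back as NaN. If you want a default instead, you can chain fillna (the 'Unknown' placeholder here is just an example value):
In [1365]: df['country'] = df.language.map(mapping).fillna('Unknown')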

How do I deal with a column in Pandas that equals "NA"? [duplicate]

This question already has an answer here:
pandas redefine isnull to ignore 'NA'
(1 answer)
Closed 2 years ago.
I know this sounds dumb, but I can't figure out what to do about data in a spreadsheet that equals "NA" (in my case, it's an abbreviation for "North America"). When I do a Pandas "read_excel", the data gets brought in as "NaN" instead of "NA".
Is "NA" also considered "Not a Number" like NaN is?
The input Excel sheet cells contain NA. The dataframe contains "NaN".
Any way to avoid this?
Solution
You can switch off auto-detection of NA values by passing keep_default_na=False to pandas.read_excel(), as follows.
I am using the demo test.xlsx file that I created in the Dummy Data section.
pd.read_excel('test.xlsx', keep_default_na=False)
## Output
# Region Country
# 0 NA Canada
# 1 NA USA
# 2 SA Brazil
# 3 EU Sweden
# 4 AU Australia
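One trade-off to be aware of: keep_default_na=False switches off all default NA detection, so genuinely empty cells come back as empty strings rather than NaN. If you still want blanks treated as missing while keeping the literal string "NA", a sketch passing na_values explicitly:
# Only the empty string is treated as NA; "NA" survives as text
pd.read_excel('test.xlsx', keep_default_na=False, na_values=[''])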
Dummy Data
import pandas as pd

# Create a dummy dataframe for demo purposes
df = pd.DataFrame({'Region': ['NA', 'NA', 'SA', 'EU', 'AU'],
                   'Country': ['Canada', 'USA', 'Brazil', 'Sweden', 'Australia']})
# Create an Excel file with this data
df.to_excel('test.xlsx', index=False)
# Show the dataframe
print(df)
Output
Region Country
0 NA Canada
1 NA USA
2 SA Brazil
3 EU Sweden
4 AU Australia
