Trying to achieve join / vlookup in Python Pandas (Updated)

Disclaimer: I am brand new to SO and Python.
I am trying to convert my SQL LEFT JOIN query to Python.
Example:
df1 is a dataframe which contains the columns: City, Event, date
df2: City, Zip Code, State, Country, etc.
SQL:
SELECT Events.City, Events.Event, Events.Date, Masterlist.State, Masterlist.Country, Masterlist.[Zip Code]
FROM Events LEFT JOIN Masterlist ON Events.City = Masterlist.City
PYTHON:
df1 = pd.read_csv('Events.csv')
df2 = pd.read_csv('Masterlist.csv')
df3 = df1.join(df2, how='left')
df3 output:
City, Event, date, Zip Code, State, Country
Fremont, Charity, 6/11, 99999, CA, US
Oakland, Protest, 6/11, 99998, CA, US
Fremont, Concert, 6/12, null, null, null
Oakland, Concert, 6/12, null, null, null
Ideal output is that it references df2 and returns the value based on City. It is currently only returning values for the first row found with each City. How can I get every row to populate with its respective State, Zip Code, and Country?

You didn't specify on; I believe the below works:
import pandas as pd
df1 = pd.DataFrame({'City': ["Tucson", "Tucson", "Portland", "San Diego"],
                    "Event": [1, 5, 3, 2],
                    "date": [1, 2, 3, 4]})
df2 = pd.DataFrame({"City": ["San Diego", "Tucson", "Portland"],
                    "zip": [1, 2, 3], "state": ["CA", "AZ", "OR"],
                    "country": ["USA", "USA", "USA"]})
pd.merge(df1, df2, how="left", on="City")
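On this sample data every row gets its own match, not just the first one per City:
        City  Event  date  zip state country
0     Tucson      1     1    2    AZ     USA
1     Tucson      5     2    2    AZ     USA
2   Portland      3     3    3    OR     USA
3  San Diego      2     4    1    CA     USA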
It is also best to provide a minimal dataset (like I did above) to make it easy for people to work with and help you out.

Related

Find strings in a column of one dataframe from another column in a different dataframe

I'm trying to create a mapping file.
The main task is to compare the two dataframes on one column, then return a file of all matching strings in both dataframes alongside some columns from each dataframe.
Example data
df1 = pd.DataFrame({
    'Artist': ['50 Cent', 'Ed Sheeran', 'Celine Dion', '2 Chainz', 'Kendrick Lamar'],
    'album': ['Get Rich or Die Tryin', '+', 'Courage', 'So Help Me God!', 'DAMN'],
    'album_id': ['sdf34', '34tge', '34tgr', '34erg', '779uyj']
})
df2 = pd.DataFrame({
    'Artist': ['Beyonce', 'Ed Sheeran', '2 Chainz', 'Kendrick Lamar', 'Jay-Z'],
    'Artist_ID': ['frd345', '3te43', '32fh5', '235he', '345fgrt6']
})
So the main idea is to create a function that builds a mapping file: take each artist name in df1, check the Artist column of df2 for matches, and create a mapping dataframe containing the shared Artist column, the album_id, and the Artist_ID.
I tried the code below, but I'm new to Python and got lost in the function. I would appreciate some help with a new function or a build-up on what I was trying to do.
Thanks!
Code I failed to build:
def get_mapping_file(df1, df2):
    # I don't know what I'm doing :'D
    for i in df2['Artist']:
        if i == df1['Artist'].any():
            name = i
            df1_id = df1.loc[df1['Artist'] == name, ['album_id']]
            id_to_use = df1_id.album_id[0]
            df2.loc[df2['Artist'] == i, 'Artist_ID'] = id_to_use
    return df2
The desired output is:
Artist          Artist_ID  album_id
Ed Sheeran      3te43      34tge
2 Chainz        32fh5      34erg
Kendrick Lamar  235he      779uyj
I am not sure if this is actually what you need, but your desired output is an inner join between the two dataframes:
pd.merge(df1, df2, on='Artist', how='inner')
This will give you the rows for Artists present in both dataframes.
Another way to get to the same result:
frame = df1.merge(df2, how='inner')
frame = frame.drop('album', axis=1)
and then you'll have your result. Thanks!
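For reference, the inner merge with the album column dropped gives exactly the desired mapping:
           Artist album_id Artist_ID
0      Ed Sheeran    34tge     3te43
1        2 Chainz    34erg     32fh5
2  Kendrick Lamar   779uyj     235he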

Substituting column value if particular column exists in two DataFrames with Pandas

I have 2 data frames representing CSV files as such:
# 1.csv
id,email
1,someone@email.com
2,someoneelse@email.com
...
# 2.csv
id,email
3,someone@otheremail.com
4,someone@email.com
...
What I'm trying to do is to merge both tables into one DataFrame using Pandas based on whether a particular column (in this case column 2, email) is identical in both DataFrames.
I need the merged DataFrame to choose the id from 2.csv.
For example, using the sample data above, since the email column value someone@email.com exists in both CSVs, I need the merged DataFrame to output the following:
# 3.csv
id,email
4,someone@email.com
2,someoneelse@email.com
3,someone@otheremail.com
What I have so far is as follows:
df1 = pd.read_csv('/path/to/1.csv')
print("df1 has {} rows".format(len(df1.index)))
# "df1 has 14072 rows"
df2 = pd.read_csv('/path/to/2.csv')
print("df2 has {} rows".format(len(df2.index)))
# "df2 has 56766 rows"
join = pd.merge(df1, df2, on="email", how="inner")
print("join has {} rows".format(len(join.index)))
# "join has 321 rows"
The problem is that the join DataFrame contains only the rows where the email field exists in both DataFrames. What I expect is an output DataFrame with 56766 + 14072 - 321 = 70517 rows, with the id values taken from 2.csv whenever the email field is identical in both source DataFrames. I tried changing the merge to how="left" and how="right", but the results are identical.
from datatable import dt, f, by
df1 = dt.Frame("""
id,email
1,someone@email.com
2,someoneelse@email.com
""")
df1['csv'] = 1
df2 = dt.Frame("""
id,email
3,someone@otheremail.com
4,someone@email.com
""")
df2['csv'] = 2
df_all = dt.rbind(df1, df2)
df_result = df_all[-1, ['id'], by('email')]
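A pandas sketch of the same idea, assuming the two frames from the question: stack df1 on top of df2, then keep only the last occurrence of each email, so the 2.csv row wins whenever an email appears in both (row order aside, this matches the desired 3.csv):
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2],
                    'email': ['someone@email.com', 'someoneelse@email.com']})
df2 = pd.DataFrame({'id': [3, 4],
                    'email': ['someone@otheremail.com', 'someone@email.com']})
# df2 is concatenated last, so keep='last' prefers its id for duplicated emails
merged = pd.concat([df1, df2], ignore_index=True)
merged = merged.drop_duplicates(subset='email', keep='last')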
Resolved it by uploading the files to Google Sheets and using VLOOKUP.

Python : How to set dataframe as parameter to a function in python?

I have a CSV with 4 columns and I want to pass it as a parameter to a function in Python. The 'key' should be the first column of the CSV.
df = pd.DataFrame({'Country': ['US', 'France', 'Germany'],
                   'daycount': ['Actual360', 'Actual365', 'ActualFixed'],
                   'frequency': ['Annual', 'Semi', 'Quart'],
                   'calendar': ['United', 'FRA', 'Ger']})
From the above dataframe I want to populate the following variables, using 'Country' as the key, with the corresponding values from the other columns. I need a function or loop to populate them; the values will be used later in the program.
day_count = Actual360
comp_frequency = Annual
gl_calendar = UnitedStates
If I understood correctly:
def retrieve_value(attribute, country, df):  # input attribute and country as str
    return df.loc[df['Country'] == country, attribute].iloc[0]
Ex:
retrieve_value('daycount', 'Germany', df) -> 'ActualFixed'
This?
def populate(df, country):
    # .iloc[0] is positional; a plain [0] is a label lookup and fails
    # for countries whose row is not at index 0
    day_count = df[df['Country'] == country]['daycount'].iloc[0]
    comp_frequency = df[df['Country'] == country]['frequency'].iloc[0]
    gl_calendar = df[df['Country'] == country]['calendar'].iloc[0]
    return (day_count, comp_frequency, gl_calendar)
populate(df,'US')
Out: ('Actual360', 'Annual', 'United')
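With .iloc[0] the lookup also works for rows that are not first in the frame:
populate(df, 'Germany')
Out: ('ActualFixed', 'Quart', 'Ger')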
I'm not sure I got your question, let me try to reformulate it.
You have a pandas DataFrame with 4 columns, one of which (Country) acts as an index (=primary key in DB language). You would like to iterate on all the rows, and retrieve for each row the corresponding values in the other 3 columns.
If I didn't betray your intent, here is code that will do the job. Note the DataFrame.set_index(<column_name>) call: it tells pandas to index the rows by that column (instead of the default numeric index).
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'Country': ['US','France','Germany'],'daycount':['Actual360','Actual365','ActualFixed'],'frequency':['Annual','Semi','Quart'], 'calendar':['United','FRA','Ger']}).set_index('Country')
In [3]: df
Out[3]:
            daycount frequency calendar
Country
US         Actual360    Annual   United
France     Actual365      Semi      FRA
Germany  ActualFixed     Quart      Ger
In [4]: for country, attributes in df.iterrows():
   ...:     day_count = attributes['daycount']
   ...:     comp_frequency = attributes['frequency']
   ...:     # idem for the last value
   ...:     print(f"{country} -> {day_count}, {comp_frequency}")
   ...:
US -> Actual360, Annual
France -> Actual365, Semi
Germany -> ActualFixed, Quart
In [5]: df.loc['US', 'daycount'] # use df.loc[<country>, <attribute>] to retrieve specific value
Out[5]: 'Actual360'

Save multiple dataFrames in a loop using to_pickle

Hi, I have 4 pandas dataframes: df1, df2, df3, df4.
What I'd like to do is save these dataframes in a for loop using to_pickle.
What I did is this:
out = 'mypath\\myfolder\\'
r = [orders, adobe, mails, sells]
for i in r:
    i.to_pickle(out + '\\i.pkl')
The command runs, but it does not save each dataframe under its own name; it keeps overwriting the same file i.pkl (I think because my code is not correct).
It seems it can't name each file after its dataframe (e.g. inside the loop, orders is saved as i.pkl, and so on for the other dataframes).
What I expect is four files named after the objects in r (so: orders.pkl, adobe.pkl, mails.pkl, sells.pkl).
How can I do this?
You can't stringify the variable name (this is not something you generally do), but you can do something simple:
import os
out = 'mypath\\myfolder\\'
df_list = [df1, df2, df3, df4]
for i, df in enumerate(df_list, 1):
    # no leading backslash in the filename, or os.path.join discards out
    df.to_pickle(os.path.join(out, f'df{i}.pkl'))
If you want to provide custom names for your files, here is my suggestion: use a dictionary.
df_map = {'orders': df1, 'adobe': df2, 'mails': df3, 'sells': df4}
for name, df in df_map.items():
    df.to_pickle(os.path.join(out, f'{name}.pkl'))
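An equivalent sketch with pathlib, which sidesteps the backslash handling entirely (same hypothetical folder as above):
from pathlib import Path
out = Path('mypath') / 'myfolder'
df_map = {'orders': df1, 'adobe': df2, 'mails': df3, 'sells': df4}
for name, df in df_map.items():
    df.to_pickle(out / f'{name}.pkl')  # Path objects join with /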

Group by with where query on Pandas Python

I have a dataset consisting of categorical and numerical columns.
For instance: salary dataset
columns: ['job', 'country_origin', 'age', 'salary', 'degree','marital_status']
four categorical columns and two numerical columns and I want to use three aggregate functions:
cat_col = ['job', 'country_origin','degree','marital_status']
num_col = [ 'age', 'salary']
aggregate_function = ['avg','max','sum']
Currently my Python code uses raw SQL queries, while my objective is to get the group-by results for all combinations from the lists above:
my query: "SELECT cat_col[0], aggregate_function[0](num_col[0]) FROM DB WHERE marital_status = 'married' GROUP BY cat_col[0]"
So the queries are:
q1 = select job, avg(age) from DB where marital_status='married' group by job
q2 = select job, avg(salary) from DB where marital_status='married' group by job
etc.
I used a for loop to get the results from all combinations.
My problem is that I want to translate these queries to Pandas. I've spent a couple of hours but could not solve it.
Pandas has a different way of querying data.
Sample dataframe:
df2 = pd.DataFrame(np.array([['programmer', 'US', 28, 4000, 'master', 'unmarried'],
                             ['data scientist', 'UK', 30, 5000, 'PhD', 'unmarried'],
                             ['manager', 'US', 48, 9000, 'master', 'married']]),
                   columns=['job', 'country_origin', 'age', 'salary', 'degree', 'marital_status'])
First import the libraries
import pandas as pd
Build the sample dataframe
df = pd.DataFrame({
    "job": ["programmer", "data scientist", "manager"],
    "country_origin": ["US", "UK", "US"],
    "age": [28, 30, 48],
    "salary": [4000, 5000, 9000],
    "degree": ["master", "PhD", "master"],
    "marital_status": ["unmarried", "unmarried", "married"]})
Apply the where clause and save the result as a new dataframe (not necessary, but easier to read); you can of course use the filtered df inside the groupby:
married=df[df['marital_status']=='married']
q1 = select job, avg(age) from DB where marital_status='married' group by job
married.groupby('job').agg( {"age":"mean"} )
or
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} )
         age
job
manager   48
q2 = select job, avg(salary) from DB where marital_status='married' group by job
married.groupby('job').agg( {"salary":"mean"} )
         salary
job
manager    9000
You can flatten the table by resetting the index
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} ).reset_index()
       job  age
0  manager   48
output the two stats together:
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean","salary":"mean"} ).reset_index()
       job  age  salary
0  manager   48    9000
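To cover every combination from the question's lists in one pass (SQL avg maps to pandas 'mean'; marital_status is left out of the grouping columns since the where clause already fixes it), here is a sketch assuming the married frame from above:
for cat in ['job', 'country_origin', 'degree']:
    for num in ['age', 'salary']:
        for func in ['mean', 'max', 'sum']:
            result = married.groupby(cat).agg({num: func}).reset_index()
            print(f"-- {func}({num}) group by {cat}")
            print(result)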
After you create your dataframe (df), the following command builds your desired table.
df.groupby(['job', 'country_origin', 'degree'])[['age', 'salary']].agg(['mean', 'max', 'sum'])
Here is a complete example:
import pandas as pd
df = pd.DataFrame()
df['job'] = ['tech', 'coder', 'admin', 'admin', 'admin', 'tech']
df['country_origin'] = ['japan', 'japan', 'US', 'US', 'India', 'India']
df['degree'] = ['cert', 'bs', 'bs', 'ms', 'bs', 'cert']
df['age'] = [22, 23, 30, 35, 40, 28]
df['salary'] = [30, 50, 60, 90, 65, 40]
df.groupby(['job', 'country_origin', 'degree'])[['age', 'salary']].agg(['mean', 'max', 'sum'])
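Every (job, country_origin, degree) group in this sample holds a single row, so mean, max, and sum coincide; the first rows of the result look like:
                              age          salary
                             mean max sum    mean max sum
job   country_origin degree
admin India          bs     40.0  40  40    65.0  65  65
      US             bs     30.0  30  30    60.0  60  60
      US             ms     35.0  35  35    90.0  90  90
...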
