I have two tables, Table 1 and Table 2. Table 2 is sorted by date, from most recent to oldest. In Excel, when I do a lookup from Table 1 into Table 2, it picks only the first matching value in Table 2 and does not go on to look for further matches of the same value.
I tried replicating this in Python with the merge function, but found that it repeats each value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
TABLE1
TABLE2
Merger result
Expected Result(Excel vlookup)
Is there any way this could be achieved with the merge function or any other python function?
Typing this blind, since you included your data as images rather than text.
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned on the index of the two
# frames. This is the equivalent of VLOOKUP
result.loc[:, ['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
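Edit: Since you want a straight-up VLOOKUP, just use join. It appears to find the very first one. Note that join matches on the index by default, so with the default integer index the call below pairs rows by position rather than by Country.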
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a vlookup: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
I'd recommend approaching this more like a SQL join than a VLOOKUP. VLOOKUP finds the first matching row, from top to bottom, which could be completely arbitrary depending on how you sort your table/array in Excel. "True" database systems and their related functions are more detailed than this, for good reason.
In order to join only one row from the right table onto each row of the left table, you'll need some kind of aggregation or selection. In your case, that would be either MAX or MIN.
The question is, which column is more important? The date or age?
import pandas as pd
df1 = pd.DataFrame({
'Country':['GERM','LIB','ARG','BNG','LITH','GHAN'],
'Name':['Dave','Mike','Pete','Shirval','Kwasi','Delali']
})
df2 = pd.DataFrame({
'Country':['GERM','LIB','ARG','BNG','LITH','GHAN','LIB','ARG','BNG'],
'Age':[35,40,27,87,90,30,61,18,45],
'Date':['7/10/2020','7/9/2020','7/8/2020','7/7/2020','7/6/2020','7/5/2020','7/4/2020','7/3/2020','7/2/2020']
})
df1.set_index('Country')\
.join(
df2.groupby('Country')\
.agg({'Age':'max','Date':'max'}), how='left', lsuffix='l', rsuffix='r')
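If the goal is to reproduce VLOOKUP's "first match wins" behaviour with merge, one option is to reduce the right table to a single row per key before merging. The sketch below reuses df1 and df2 from above and assumes the dates are month/day/year strings (an assumption; the original data was posted as images). Unlike taking the max of each column independently, which can combine an Age and a Date from different rows, it keeps whole rows together.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Country": ["GERM", "LIB", "ARG", "BNG", "LITH", "GHAN"],
    "Name": ["Dave", "Mike", "Pete", "Shirval", "Kwasi", "Delali"],
})
df2 = pd.DataFrame({
    "Country": ["GERM", "LIB", "ARG", "BNG", "LITH", "GHAN", "LIB", "ARG", "BNG"],
    "Age": [35, 40, 27, 87, 90, 30, 61, 18, 45],
    "Date": ["7/10/2020", "7/9/2020", "7/8/2020", "7/7/2020", "7/6/2020",
             "7/5/2020", "7/4/2020", "7/3/2020", "7/2/2020"],
})

# Assumption: dates are month/day/year. Parsing them avoids comparing strings,
# where '7/9/2020' would sort after '7/10/2020'.
df2["Date"] = pd.to_datetime(df2["Date"], format="%m/%d/%Y")

# Keep only the most recent row per country (one row per lookup key).
latest = (
    df2.sort_values("Date", ascending=False)
       .drop_duplicates(subset="Country", keep="first")
)

# validate raises MergeError if the right side still has duplicate keys.
result = df1.merge(latest, on="Country", how="left", validate="many_to_one")
print(result)
```

Because the right table now has at most one row per Country, the left join returns exactly one row for each row of df1.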
I've been learning pandas for a couple of days. I am migrating a SQL database to Python and have come across this SQL statement (example):
select * from
table_A a
left join table_B b
on a.ide = b.ide
and a.credit_type = case when b.type > 0 then b.credit_type else a.credit_type end
I've only been able to migrate the first condition. My difficulty is with the last line; I don't know how to migrate it. The tables are actually SQL queries that I've stored in dataframes.
merge = pd.merge(df_query_a, df_query_b, on='ide', how='left')
Any suggestions, please?
The Case condition is like an if-then-else statement, which you can implement in Pandas using np.where() like below:
Based on left join resulting dataframe merge:
import numpy as np
merge['credit_type_x'] = np.where(merge['type_y'] > 0, merge['credit_type_y'], merge['credit_type_x'])
Here the column names credit_type_x and credit_type_y are the ones the pandas merge function creates when it renames columns that have the same name in both source dataframes. If the merged dataframe has no type_y column because type appears only in Table_B and not in Table_A, use the column name type instead.
Alternatively, since you only need to change credit_type_x where type_y > 0 and leave it untouched otherwise, you can do it more simply with:
merge.loc[merge['type_y'] > 0, 'credit_type_x'] = merge['credit_type_y']
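Note that in the SQL above the CASE expression is part of the join condition, so it decides which rows of table_B count as a match rather than overwriting a value afterwards. A rough sketch of that reading follows, using made-up sample data and hypothetical names (table_a, table_b, a_row): pair rows on ide, keep only the pairs that satisfy the CASE condition, then left-join the surviving pairs back onto the left table. It ignores the SQL edge case where a.credit_type is NULL.

```python
import pandas as pd

# Hypothetical sample data standing in for the two query results.
table_a = pd.DataFrame({"ide": [1, 2, 3], "credit_type": ["A", "B", "C"]})
table_b = pd.DataFrame({
    "ide": [1, 2, 2],
    "type": [0, 1, 1],
    "credit_type": ["Z", "B", "X"],
})

# Tag each left row so unmatched rows can be restored later.
a = table_a.reset_index().rename(columns={"index": "a_row"})

# First condition: a.ide = b.ide
pairs = a.merge(table_b, on="ide", how="inner", suffixes=("_a", "_b"))

# Second condition: when b.type > 0 the credit types must be equal;
# otherwise the CASE falls back to a.credit_type, so the pair always matches.
keep = ~(pairs["type"] > 0) | (pairs["credit_type_a"] == pairs["credit_type_b"])
matched = pairs.loc[keep, ["a_row", "type", "credit_type_b"]]

# Left join: every row of table_a survives, with NaN where nothing matched.
result = a.merge(matched, on="a_row", how="left").drop(columns="a_row")
print(result)
```

With this data, ide 1 matches because type is not positive, ide 2 matches only the table_b row whose credit_type is also "B", and ide 3 has no match and keeps missing values on the right-hand columns.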
Below are two options for tackling your problem:
You can add a column to df_query_a based on the condition you need, taking both dataframes into account, and then do the merge.
You can try a library such as pandasql3.
I'm working on how to cluster a PATSTAT (reference database) dataset.
With my own algorithm I came up with a dataframe that shows the author, beginpage, endpage, volume and publication_year of each reference.
running:
dfhead = df.head(10)
shows me
Now I want the following:
Show an inner join of the dataframe with ITSELF such that, for example, author, beginpage and endpage are the same (at least 3 matching columns between the rows).
I tried:
c = ['author', 'beginpage','endpage', 'volume','publication year']
df_merge = dfhead.merge(dfhead, how = 'inner',on = [c[0],c[1],c[2]])
where
The result then contains only the join of each row with itself, and I don't want those rows included.
In the example above df_merge should be empty, since no two rows share 3 columns.
Now suppose two different rows do match. Here is an example:
x = pd.DataFrame({'author':['lee','lee'], 'beginpage':[455,456],'endpage':[477,477],'volume':[300,300]})
Note that the two rows have (at least) 3 matching columns, so the merge/join should show them.
BUT note that I don't want to include the join of a row with itself!
You could do an inner join and apply filtering to exclude the same row, but maybe it would be more straightforward to use groupby instead?
df.groupby(by=['author', 'beginpage','endpage'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Applying aggregations/calculations/etc. to groups:
https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
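For the first suggestion (inner join plus a filter that drops self-matches), a minimal sketch could look like the following. It is only one possible reading of "at least 3 matching columns": it tries every combination of 3 columns as the join key and keeps each pair of distinct rows once. The sample frame, the extra "kim" row and the helper column name row are made up for illustration.

```python
from itertools import combinations

import pandas as pd

x = pd.DataFrame({
    "author": ["lee", "lee", "kim"],
    "beginpage": [455, 456, 10],
    "endpage": [477, 477, 20],
    "volume": [300, 300, 5],
})
cols = ["author", "beginpage", "endpage", "volume"]

# Carry the original row number along so a row can be told apart from itself.
rows = x.reset_index().rename(columns={"index": "row"})

found = []
for keys in combinations(cols, 3):
    pairs = rows.merge(rows, on=list(keys), suffixes=("_l", "_r"))
    # row_l < row_r drops self-matches and mirrored duplicates (a-b and b-a).
    pairs = pairs[pairs["row_l"] < pairs["row_r"]]
    found.append(pairs[["row_l", "row_r"]])

similar_pairs = pd.concat(found).drop_duplicates().reset_index(drop=True)
print(similar_pairs)
```

Here rows 0 and 1 agree on author, endpage and volume, so they form the only reported pair. On a large frame the self-merge can produce many pairs per key, so it may be worth filtering or grouping first.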
I have 2 Excel sheets, one with 63,000 rows and the other with 67,000 rows, which contain careers and their eligibility. Both have the same Title column, so I merged on the title, but the output has 44,00,000 rows. Why is that? Please help me with this problem, thank you.
import pandas as pd
df = pd.read_excel('c/downloads/knowledge.xlsx')
df1 = pd.read_excel('c/downloads/Abilities.xlsx')
df2 = pd.merge(df, df1, on='Title')
# Create a list of the files in the order you want to merge
all_df_list = [df, df1]
# Stack all the dataframes in all_df_list. pd.concat appends the rows and lines up columns that share a name, if that is what you meant by "same title".
appended_df = pd.concat(all_df_list)
# export as an excel file
appended_df.to_excel("data.xlsx", index=False)
Let me know if this helps. Works only if you have same labels in both of the files.
Make sure you're using the correct join type (left, right, inner, outer, etc.). It sounds like you need a left join, which matches data from the table on the right to the one on the left and returns values accordingly, similar to a VLOOKUP. Note that pd.merge defaults to an inner join, not an outer join, so the join type alone does not explain the extra rows: when a Title appears several times in both files, every left row with that title is paired with every right row with that title, which multiplies the record count.
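To check whether repeated titles are behind the row count, it may help to count the keys on each side before merging. The sketch below uses small made-up frames (the column names Knowledge and Ability are assumptions): the number of merged rows equals the sum, over shared titles, of the left count times the right count.

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets.
left = pd.DataFrame({"Title": ["A", "A", "B"], "Knowledge": ["k1", "k2", "k3"]})
right = pd.DataFrame({"Title": ["A", "A", "B"], "Ability": ["a1", "a2", "a3"]})

merged = left.merge(right, on="Title")  # default how="inner"

# Predicted size: per-title product of the counts, summed over shared titles.
counts = left["Title"].value_counts() * right["Title"].value_counts()
expected_rows = int(counts.dropna().sum())
print(len(merged), expected_rows)

# A VLOOKUP-style result needs at most one right row per title.
right_first = right.drop_duplicates(subset="Title")
lookup = left.merge(right_first, on="Title", how="left", validate="many_to_one")
print(len(lookup))
```

If the lookup-style result is what you want, validate="many_to_one" makes pandas raise an error instead of silently multiplying rows when the right side still contains duplicate titles.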
I am trying to merge those two dataframes in order to replace in the left one values that are present in the right one with the same ticker and datetime.
Here is a small example
Here's a way using update:
# update uses index matching
left_df = left_df.set_index('Timestamp')
right_df = right_df.set_index('Timestamp')
# update modifies left_df in place and returns None.
left_df.update(right_df)
print(left_df)
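Since the question asks for rows to match on both ticker and datetime, a variation is to put both columns in the index before calling update. The sample data and the column names Ticker, Timestamp and Price below are assumptions, as the original example was not posted as text.

```python
import pandas as pd

left_df = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB"],
    "Timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "Price": [1.0, 2.0, 3.0],
})
right_df = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "Timestamp": pd.to_datetime(["2021-01-02", "2021-01-01"]),
    "Price": [20.0, 30.0],
})

keys = ["Ticker", "Timestamp"]
left_indexed = left_df.set_index(keys)

# update aligns on the (Ticker, Timestamp) index and overwrites in place;
# rows of right_df with no counterpart in left_df are ignored.
left_indexed.update(right_df.set_index(keys))

result = left_indexed.reset_index()
print(result)
```

Only the two rows that exist in right_df are overwritten; the first AAA row keeps its original price.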
I have been trying to make a comparison of two dataframes, creating new dataframes for the ones which have the same entries in two columns. I thought I had cracked it but the code I have now just looks at the two columns of interest and if the string is found anywhere in that column it considers it a match. I need the two strings to be common on the same row across the columns. A sample of the code follows.
#produce table with common items
vto_in_jeff = df_vto[(df_vto['source'].isin(df_jeff['source']) & df_vto['target'].isin(df_jeff['target']))].dropna().reset_index(drop=True)
#vto_in_jeff.index = vto_in_jeff.index + 1
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)
vto_in_jeff.to_csv(output_path+'vto_in_'+f+'.csv', index=False)
So this code produces a table of the rows whose source and target strings both occur somewhere in the other dataframe, but not necessarily in the same row. Can anyone help me compare specifically row by row?
You can use the pandas merge function:
result = pd.merge(df1, df2, on='key')
Here are more details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra
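Applied to the question, the key would be both columns at once, so that source and target have to agree on the same row. A small sketch with made-up data for df_vto and df_jeff, contrasting it with the two isin checks:

```python
import pandas as pd

df_vto = pd.DataFrame({"source": ["a", "b", "c"], "target": ["x", "y", "z"]})
df_jeff = pd.DataFrame({"source": ["a", "b", "c"], "target": ["x", "z", "y"]})

# The isin version: each column is checked on its own, so all three rows pass.
loose = df_vto[
    df_vto["source"].isin(df_jeff["source"])
    & df_vto["target"].isin(df_jeff["target"])
]

# Inner merge on both columns: a row is kept only if the same
# (source, target) pair exists as a row in df_jeff.
pairs_in_jeff = df_jeff[["source", "target"]].drop_duplicates()
vto_in_jeff = df_vto.merge(pairs_in_jeff, on=["source", "target"], how="inner")
print(vto_in_jeff)
```

Dropping duplicate pairs from df_jeff first keeps each row of df_vto from being repeated when the same pair occurs more than once on the right.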