I'm working on clustering a PATSTAT (reference) database.
With my own algorithm I came up with a dataframe that shows the author, beginpage, endpage, volume and publication_year of each reference.
running:
dfhead = df.head(10)
shows me
Now I want the following:
Do an inner join of the dataframe with ITSELF such that, for example, author, beginpage and endpage are the same (at least 3 matching columns between the rows).
I tried:
c = ['author', 'beginpage', 'endpage', 'volume', 'publication_year']
df_merge = dfhead.merge(dfhead, how = 'inner',on = [c[0],c[1],c[2]])
where
The result then contains only the join of each row with itself, and I don't want to include those.
In the example above, df_merge should be empty, since no two rows share 3 columns.
What if there were another row similar to a given row? Here is an example:
x = pd.DataFrame({'author':['lee','lee'], 'beginpage':[455,456],'endpage':[477,477],'volume':[300,300]})
Note that the two rows have (at least) 3 matching columns, so the merge/join should show them.
BUT note that I don't want to include the join of a row with itself!
You could do an inner join and apply filtering to exclude the same row, but maybe it would be more straightforward to use groupby instead?
df.groupby(by=['author', 'beginpage','endpage'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Applying aggregations/calculations/etc. to groups:
https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
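For the "inner join plus filtering" route mentioned above, here is a minimal sketch. It assumes the rows can be identified by their index, tries every combination of 3 columns, and keeps only pairs of different rows. The toy data and the helper column name 'row' are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy data modelled on the example in the question (values are made up).
df = pd.DataFrame({
    'author': ['lee', 'lee', 'kim'],
    'beginpage': [455, 456, 10],
    'endpage': [477, 477, 20],
    'volume': [300, 300, 5],
})
cols = ['author', 'beginpage', 'endpage', 'volume']

# Keep the original row label as a column ('row' is a hypothetical name)
# so that a row can be told apart from itself after the self-join.
indexed = df.reset_index().rename(columns={'index': 'row'})

pairs = []
for keys in combinations(cols, 3):
    merged = indexed.merge(indexed, on=list(keys), suffixes=('_x', '_y'))
    # row_x < row_y drops self-matches and mirrored duplicates (a, b) / (b, a).
    merged = merged[merged['row_x'] < merged['row_y']]
    pairs.append(merged[['row_x', 'row_y']])

# One line per pair of distinct rows that agree on at least 3 columns.
matches = pd.concat(pairs).drop_duplicates().reset_index(drop=True)
print(matches)
```

Here only rows 0 and 1 are paired, because they agree on author, endpage and volume. On a large table the self-join can blow up for keys with many repeated values, so this is probably only practical when the groups are small.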
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR. The two dataframes are similar (not equal) only in this ID column; the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
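Since IDs can repeat, the two approaches behave differently. This is a small sketch with made-up stand-ins for MR and DT (only the 'ID' column name comes from the question; 'mr_value' and 'dt_value' are assumed) to show the row counts you can expect.

```python
import pandas as pd

# Hypothetical stand-ins for MR and DT.
MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'mr_value': list('abcde')})
DT = pd.DataFrame({'ID': [2, 2, 4], 'dt_value': [10, 20, 30]})

# isin keeps each MR row at most once, no matter how often the ID occurs in DT.
filtered = MR[MR['ID'].isin(DT['ID'])]

# merge pairs every matching MR row with every matching DT row,
# so repeated IDs multiply: 2 x 2 rows for ID 2, plus 1 x 1 for ID 4.
merged = pd.merge(MR, DT, on='ID')

print(len(filtered))  # 3
print(len(merged))    # 5
```

So if you only want the MR rows whose ID exists in DT, isin is likely the better fit; merge is for when you also need DT's columns.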
I have two dataframes. One has 4 days worth of data while other one has 2 days. Dataframe one looks like this
while df2 looks like this:
I need to join these. There are two options for joining these. First join based on dates that are existent in both. Second
I am merging them like this:
using this code:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"])
The issue is that if a date from df1 is not present in df2, it is simply dropped. For example, as you can see, the result doesn't have 2020-09-02. I want it to display NaN or 0 if that date and hour is not present in the second df. How do I do that?
Add how='outer' to your merge:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how='outer')
The pandas merge function uses the inner strategy by default, similar to INNER JOIN in SQL, meaning that if the data is not present in the second frame then it's simply dropped. You should use the left strategy to merge, similar to LEFT JOIN.
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how='left')
You can specify the how parameter.
Here outer is the equivalent of the SQL full outer join:
pd.merge(freq_df_two,freq_df_one, on=["date","hour"], how="outer")
More info here
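To see the difference between the suggested strategies side by side, here is a small sketch with made-up frames. The 'date' and 'hour' column names follow the question; 'count_one' and 'count_two' are assumed names.

```python
import pandas as pd

# Made-up data: each frame has one date the other lacks.
freq_df_one = pd.DataFrame({
    'date': ['2020-09-01', '2020-09-02'],
    'hour': [10, 10],
    'count_one': [5, 7],
})
freq_df_two = pd.DataFrame({
    'date': ['2020-09-01', '2020-09-03'],
    'hour': [10, 10],
    'count_two': [3, 9],
})

# outer keeps date/hour pairs from both frames, filling the gaps with NaN.
outer = pd.merge(freq_df_two, freq_df_one, on=['date', 'hour'], how='outer')

# left keeps only the date/hour pairs of the first argument (freq_df_two here).
left = pd.merge(freq_df_two, freq_df_one, on=['date', 'hour'], how='left')

# Replace the NaN placeholders with 0 if zeros are preferred.
outer_filled = outer.fillna(0)
```

Note that with how='left' it matters which frame is passed first: only its rows are guaranteed to survive.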
So I have 2 tables, Table 1 and Table 2. Table 2 is sorted by date, from recent dates to old dates. In Excel, when I do a lookup in Table 1 against Table 2, it only picks the first matching value from Table 2 and does not go on to search for the same value after the first.
So I tried replicating it in Python with the merge function, but found that it repeats the value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
TABLE1
TABLE2
Merger result
Expected Result(Excel vlookup)
Is there any way this could be achieved with the merge function or any other python function?
Typing this in the blind as you are including your data as images, not text.
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned based on the index of the two
# frames. This is the equivalent of VLOOKUP
result.loc[:, ['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
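Edit: Since you want a straight-up VLOOKUP, just use join. It appears to find the very first one, though note that join matches on the index by default, so this only works as a lookup when the index is the lookup key.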
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a vlookup: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
I'd recommend approaching this more like a SQL join than a Vlookup. Vlookup finds the first matching row, from top to bottom, which could be completely arbitrary depending on how you sort your table/array in excel. "True" database systems and their related functions are more detailed than this, for good reason.
In order to join only one row from the right table onto one row of the left table, you'll need some kind of aggregation or selection - So in your case, that'd be either MAX or MIN.
The question is, which column is more important? The date or age?
import pandas as pd
df1 = pd.DataFrame({
    'Country':['GERM','LIB','ARG','BNG','LITH','GHAN'],
    'Name':['Dave','Mike','Pete','Shirval','Kwasi','Delali']
})
df2 = pd.DataFrame({
    'Country':['GERM','LIB','ARG','BNG','LITH','GHAN','LIB','ARG','BNG'],
    'Age':[35,40,27,87,90,30,61,18,45],
    'Date':['7/10/2020','7/9/2020','7/8/2020','7/7/2020','7/6/2020','7/5/2020','7/4/2020','7/3/2020','7/2/2020']
})
df1.set_index('Country')\
    .join(
        df2.groupby('Country')\
            .agg({'Age':'max','Date':'max'}), how='left', lsuffix='l', rsuffix='r')
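One caveat with the aggregation above: taking max of Age and max of Date separately can combine values from different rows, and max on date strings compares them alphabetically. If the date is what matters, a possible alternative (a sketch on a shortened copy of the data, assuming month/day/year strings) is to parse the dates and keep the whole most recent row per country.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG'],
    'Name': ['Dave', 'Mike', 'Pete'],
})
df2 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'LIB', 'ARG'],
    'Age': [35, 40, 27, 61, 18],
    'Date': ['7/10/2020', '7/9/2020', '7/8/2020', '7/4/2020', '7/3/2020'],
})

# Parse the dates so that sorting is chronological rather than alphabetical.
# The month/day/year format is an assumption about the data.
df2['Date'] = pd.to_datetime(df2['Date'], format='%m/%d/%Y')

# Most recent row per country; Age and Date stay paired because whole rows are kept.
latest = df2.sort_values('Date', ascending=False).drop_duplicates(subset='Country')

# One row per country on the right, so the left merge cannot repeat rows.
result = df1.merge(latest, on='Country', how='left')
print(result)
```

For LIB this returns age 40 (the 7/9/2020 row), whereas the separate max of Age would give 61 from an older row.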
I have been trying to make a comparison of two dataframes, creating new dataframes for the ones which have the same entries in two columns. I thought I had cracked it but the code I have now just looks at the two columns of interest and if the string is found anywhere in that column it considers it a match. I need the two strings to be common on the same row across the columns. A sample of the code follows.
#produce table with common items
vto_in_jeff = df_vto[(df_vto['source'].isin(df_jeff['source']) & df_vto['target'].isin(df_jeff['target']))].dropna().reset_index(drop=True)
#vto_in_jeff.index = vto_in_jeff.index + 1
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)
vto_in_jeff.to_csv(output_path+'vto_in_'+f+'.csv', index=False)
So this code produces a table listing the rows where both the source and the target strings are found, but not necessarily in the same row of the other dataframe. Can anyone help me compare specifically row by row?
You can use the pandas merge method:
result = pd.merge(df1, df2, on='key')
Here are more details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra
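Applied to the question, merging on both columns at once should give the row-by-row comparison. A minimal sketch with made-up rows (only the 'source' and 'target' column names and the frame names come from the question):

```python
import pandas as pd

# Made-up rows for illustration.
df_vto = pd.DataFrame({
    'source': ['a', 'b', 'c'],
    'target': ['x', 'y', 'z'],
})
df_jeff = pd.DataFrame({
    'source': ['a', 'b', 'd'],
    'target': ['x', 'z', 'y'],
})

# Column-wise isin: ('b', 'y') passes because 'b' and 'y' each occur
# somewhere in df_jeff, even though never in the same row.
loose = df_vto[df_vto['source'].isin(df_jeff['source'])
               & df_vto['target'].isin(df_jeff['target'])]

# Merging on both columns only keeps pairs that occur together in one row.
# drop_duplicates avoids repeating a df_vto row when df_jeff has the pair twice.
pairs = df_jeff[['source', 'target']].drop_duplicates()
strict = df_vto.merge(pairs, on=['source', 'target'], how='inner')
```

Here loose has two rows while strict keeps only ('a', 'x'), which is the pair that really appears in both frames.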
I'm trying to merge two Pandas dataframes on two columns. One column has a unique identifier that could be used to simply .merge() the two dataframes. However, the second column merge would actually use .merge_asof() because it would need to find the closest date, not an exact date match.
There is a similar question here: Pandas Merge on Name and Closest Date, but it was asked and answered nearly three years ago, and merge_asof() is a much newer addition.
I asked a similar question here a couple of months ago, but the solution only needed merge_asof() without any exact matches required.
In the interest of including some code, it would look something like this:
df = pd.merge_asof(df1, df2, left_on=['ID','date_time'], right_on=['ID','date_time'])
where the ID's will match exactly, but the date_time's will be "near matches".
Any help is greatly appreciated.
Consider merging first on the ID and then run a DataFrame.apply to return highest date_time from first dataframe on matched IDs less than the current row date_time from second dataframe.
# INITIAL MERGE (CROSS-PRODUCT OF ALL ID PAIRINGS)
mdf = pd.merge(df1, df2, on=['ID'])
def f(row):
    col = mdf[(mdf['ID'] == row['ID']) &
              (mdf['date_time_x'] < row['date_time_y'])]['date_time_x'].max()
    return col
# FILTER BY MATCHED DATES TO CONDITIONAL MAX
mdf = mdf[mdf['date_time_x'] == mdf.apply(f, axis=1)].reset_index(drop=True)
This assumes you want to keep all rows of df2 (i.e., right join). Simply flip _x / _y suffixes for left join.
The current solution would work on a small dataset but if you have hundreds of rows... I'm afraid not.
So, what you want to do is as follows:
df = pd.merge_asof(df1, df2, on = 'date_time', by = 'ID', direction = 'nearest')
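A small runnable sketch of that call, with made-up data ('ID' and 'date_time' follow the question; 'left_val' and 'right_val' are assumed). Keep in mind that merge_asof expects both frames to be sorted by the on key, and that date_time has to be a real datetime (or numeric) column.

```python
import pandas as pd

# Made-up data for illustration.
df1 = pd.DataFrame({
    'ID': [1, 2, 1],
    'date_time': pd.to_datetime(['2020-01-01 10:00', '2020-01-01 09:00', '2020-01-02 10:00']),
    'left_val': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
    'ID': [1, 1, 2],
    'date_time': pd.to_datetime(['2020-01-01 10:05', '2020-01-02 09:30', '2020-01-03 00:00']),
    'right_val': [100, 200, 300],
})

# merge_asof requires both frames to be sorted by the 'on' key.
df1 = df1.sort_values('date_time')
df2 = df2.sort_values('date_time')

# 'by' matches ID exactly; 'on' picks the nearest date_time within each ID.
df = pd.merge_asof(df1, df2, on='date_time', by='ID', direction='nearest')
print(df)
```

If a "nearest" match that is days away should not count, merge_asof also accepts a tolerance argument such as pd.Timedelta('1h'); rows without a match inside that window get NaN.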