I'm trying to merge two pandas dataframes on two columns. One column holds a unique identifier that could be used to simply .merge() the two dataframes. The second column, however, would actually need .merge_asof(), because it has to find the closest date rather than an exact date match.
There is a similar question here: Pandas Merge on Name and Closest Date, but it was asked and answered nearly three years ago, and merge_asof() is a much newer addition.
I asked a similar question here a couple of months ago, but that solution only needed merge_asof() without any exact matches required.
In the interest of including some code, it would look something like this:
df = pd.merge_asof(df1, df2, left_on=['ID','date_time'], right_on=['ID','date_time'])
where the IDs will match exactly, but the date_time values will be "near matches".
Any help is greatly appreciated.
Consider merging first on ID, then running a DataFrame.apply to return the highest date_time from the first dataframe, among matched IDs, that is less than the current row's date_time from the second dataframe.
# INITIAL MERGE (CROSS-PRODUCT OF ALL ID PAIRINGS)
mdf = pd.merge(df1, df2, on=['ID'])

def f(row):
    col = mdf[(mdf['ID'] == row['ID']) &
              (mdf['date_time_x'] < row['date_time_y'])]['date_time_x'].max()
    return col

# FILTER BY MATCHED DATES TO CONDITIONAL MAX
mdf = mdf[mdf['date_time_x'] == mdf.apply(f, axis=1)].reset_index(drop=True)
This assumes you want to keep all rows of df2 (i.e., right join). Simply flip _x / _y suffixes for left join.
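For a quick sanity check, here is a minimal sketch with made-up data (the frame and column names follow the snippet above, but the values are assumptions, not from the question):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 2],
                    'date_time': pd.to_datetime(['2020-01-01', '2020-01-05', '2020-01-02'])})
df2 = pd.DataFrame({'ID': [1, 2],
                    'date_time': pd.to_datetime(['2020-01-06', '2020-01-03'])})

Running the snippet above on these frames keeps one row per df2 timestamp: ID 1 pairs 2020-01-05 (the latest earlier df1 time) with 2020-01-06, and ID 2 pairs 2020-01-02 with 2020-01-03.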
The current solution would work on a small dataset, but if you have hundreds of rows or more, the row-by-row apply will be slow... I'm afraid that won't do.
So, what you want to do is as follows:
df = pd.merge_asof(df1, df2, on='date_time', by='ID', direction='nearest')
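One caveat: merge_asof() requires both frames to be sorted on the on key, otherwise it raises a ValueError about unsorted keys. A minimal sketch:

df1 = df1.sort_values('date_time')  # merge_asof needs both sides sorted on the 'on' key
df2 = df2.sort_values('date_time')
df = pd.merge_asof(df1, df2, on='date_time', by='ID', direction='nearest')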
I'm attempting to merge two dataframes using two columns as keys: "Date" and "Instrument"
Here is my code:
merge_df = pd.merge(df1, df2, how='outer', on=['Date', 'Instrument'])
You'll notice that the row in each dataframe has the same instrument and date value: AEA000201011 & 2008-01-31.
The merged dataframe is stacking the two rows instead of combining them.
I have ensured that the key columns' dtypes match in both dataframes.
Any advice would be much appreciated!
I wish I could post this in the comments section.
Even though you've probably already tried, have you tried using "left" or "right" instead of "outer"?
Or check the values directly, like:
df1["Instrument"].iloc[0] == df2["Instrument"].iloc[0]
Maybe they contain some invisible characters. If that's the case, you can try stripping them.
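Something like this should clean both key columns first (a sketch; it assumes 'Instrument' holds strings):

df1['Instrument'] = df1['Instrument'].astype(str).str.strip()
df2['Instrument'] = df2['Instrument'].astype(str).str.strip()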
Nothing else comes to mind.
I have a DataFrame that is the result of a large SQL query. I am trying to split it into 2 separate DataFrames, NVI and Main. Both are lists of repairs to trucks. The split is based on a specific profile ID, 7055; rows with it go into the NVI DataFrame.
When that profile ID is encountered, I need to grab the values from the "RO", "Unit Number", and "Repair Date" columns. I then need to search the DataFrame again and grab any rows that have a matching RO and Unit Number, or a matching Unit Number and a Repair Date equal to or earlier than the date in the row where the 7055 was found. Those rows also go into the NVI df. Any remaining rows that do not match go into the Main df.
The only static value is the profile ID of 7055; the RO, Unit Number, and Repair Date will all be different.
class nvi_dict(dict):
    def __setitem__(self, key, value):
        key = key.profile()
        super().__setitem__(key, value)

nvisort = pd.DataFrame()

def sort_nvi_dict(row, component):
    if row['PROFILE_ID'] in cfg[component]['nvi']:
        nvi_ro = nvi_dict()
        nvi_ro['RO'] = row['RO']
        nvi_ro['UnitNum'] = row['VFUNIT']
        nvi_ro['date'] = row['REPAIR_DATE']

nvisort = nvidf.apply(lambda x: sort_nvi_dict(x, 'nvi_ro'), axis=1, result_type='expand')
I thought about using a class to create a temporary dict object to store the values from RO, UnitNum, and date, which I can then use while iterating over the df again looking for matching values.
I am using a .yml file to store dictionaries that further sort the NVI and Main dfs after the initial split, because each then needs to be sorted by truck manufacturer.
I think this might work; I'm unable to test it without the test data, though...
df1 = nvisort[nvisort['PROFILE_ID'] == 7055]
df2 = pd.merge(nvisort, df1[['RO', 'Unit Number']], on=['RO', 'Unit Number'], how='right')
df3 = pd.merge(nvisort, df1[['Unit Number', 'Repair Date']], on='Unit Number', how='right')
df3 = df3[df3['Repair Date_x'] <= df3['Repair Date_y']]
df3 = df3.drop(columns='Repair Date_y')
df3 = df3.rename(columns={'Repair Date_x': 'Repair Date'})
NVI = pd.concat([df1, df2, df3])
Main = pd.concat([NVI, nvisort]).drop_duplicates(keep=False)
I'm assuming that your original/starting dataframe here is nvisort; we then filter that down to profile ID 7055 and call the result df1.
Then we are going to get your two different pieces of criteria into df2 and df3.
df2 is just a filter on the original dataframe where RO and Unit Number match, so we can use pd.merge() to effectively get that filter.
df3 is a more complicated filter since it is the less than or equal, not the equal. So first we do the merge to filter on matching unit numbers, but we also bring over the Repair Date from both tables into df3, and these get appended _x and _y on the column names. So then we filter where the date on the _x is less than on _y and then clean it up.
Last, you get Main by finding everything from the original nvisort that is not in NVI. Since NVI is a subset of nvisort, you can just concat them and drop all duplicates, leaving only data that exists in one of the dataframes.
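One caveat worth adding (an assumption on my part, not in the original answer): a repair can satisfy more than one of the three criteria, so NVI may contain duplicate rows. A one-line guard:

NVI = NVI.drop_duplicates()  # df1/df2/df3 can overlap, so dedupe after the concat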
From what I understand of your question, you want to divide a dataframe into 2 based on certain conditions?
df1 = df[<condition>]
where the condition could be something like (df['PROFILE_ID'] == 7055) & (df['unit'].isin(Allunits))
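A minimal sketch of the two-way split (the column name and value are assumed from the question):

mask = df['PROFILE_ID'] == 7055  # assumed column name
df1 = df[mask]    # rows for NVI
df2 = df[~mask]   # everything else, for Main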
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns, as well as the number of rows, differ.
How can I get the rows from MR whose 'ID' is equal to an 'ID' in DT, knowing that values in 'ID' can appear several times in the same column?
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe, but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe of combined records for the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
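The difference matters because 'ID' repeats: isin() keeps MR's rows as they are, while merge() produces one row per matching MR/DT pair. A toy sketch (made-up values):

import pandas as pd

MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'mr_val': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 1, 3], 'dt_val': ['x', 'y', 'z']})

MR.loc[MR.ID.isin(DT.ID)]  # 3 rows: MR rows whose ID appears in DT
pd.merge(MR, DT, on='ID')  # 5 rows: every MR/DT pairing that shares an ID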
I'm working on how to cluster a PATSTAT (reference) database.
With my own algorithm I came up with a dataframe that shows the author, beginpage, endpage, volume and publication_year of each reference.
Running:
dfhead = df.head(10)
shows the first ten rows (the sample output was posted as an image).
Now I want the following:
Show the inner join of the dataframe with the SAME dataframe such that, for example, author, beginpage and endpage are the same (at least 3 similarities between the rows).
I tried:
c = ['author', 'beginpage', 'endpage', 'volume', 'publication_year']
df_merge = dfhead.merge(dfhead, how='inner', on=[c[0], c[1], c[2]])
The result will then include, for each row, the inner join with exactly the same row (itself), but I don't want those self-matches included.
In the example above, df_merge should be empty, since no two distinct rows share 3 columns.
To show what happens when two different rows do match, here is an example:
x = pd.DataFrame({'author': ['lee', 'lee'], 'beginpage': [455, 456], 'endpage': [477, 477], 'volume': [300, 300]})
Note that the two rows have (at least) 3 similar columns and therefore the merge/join should be visible.
BUT note that I don't want to include the join of a row with exactly itself!
You could do an inner join and apply filtering to exclude the same row, but maybe it would be more straightforward to use groupby instead?
df.groupby(by=['author', 'beginpage','endpage'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Applying aggregations/calculations/ect to groups:
https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
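For instance, a sketch using duplicated() with keep=False, which marks every member of a duplicated group and so sidesteps the self-join problem (column names taken from the question):

key_cols = ['author', 'beginpage', 'endpage']
# rows that share all three key columns with at least one *other* row
matches = dfhead[dfhead.duplicated(subset=key_cols, keep=False)]

# or inspect each group of size > 1 directly
for key, grp in dfhead.groupby(key_cols):
    if len(grp) > 1:
        print(key, grp.index.tolist())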
Given 2 data frames like the linked example, I need to add the "index income" from df2 to df1. I need to search for df1's combined key in df2 and, if there is a match, return the value into a new column in df1. The frames have unequal numbers of rows: about 700 in df1 and 1000 in df2.
I was able to do this in excel with a vlookup but I am trying to apply it to python code now.
This should solve your issue:
df1.merge(df2, how='left', on='combined_key')
This (left join) will give you all the records of df1 and matching records from df2.
https://www.geeksforgeeks.org/how-to-do-a-vlookup-in-python-using-pandas/
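In code, the vlookup equivalent might look like this (a sketch; 'combined_key' and 'index income' are column names assumed from the question):

# keep only the key plus the looked-up column from df2, then left-join
df1 = df1.merge(df2[['combined_key', 'index income']], how='left', on='combined_key')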
Here is an answer using joins. I modified my df2 to include only the useful columns, then used a pandas left join.
left_join = pd.merge(df, zip_df, on='State County', how='left')