I have two DataFrames that I want to merge. I have read about merging on multiple columns and about preserving the index when merging. My problem involves both, and I am having difficulty figuring out the best way to do this.
The first DataFrame looks like this
and the second looks like this
I want to merge these based on the Date and the ID. In the first DataFrame the Date is the index and the ID is a column; in the second DataFrame both Date and ID are part of a MultiIndex.
Essentially, as a result I want a DataFrame that looks like DataFrame 2 with an additional column for the Events from DataFrame 1.
I'd suggest resetting the index (reset_index) and then merging the DataFrames, as you've read. Then you can set the index (set_index) again to reproduce your desired MultiIndex.
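A minimal sketch of that approach, assuming the first frame is df1 (Date index, with ID and Events columns) and the second is df2 (MultiIndex of Date and ID) — the names and toy values are placeholders, since the original frames aren't shown:
import pandas as pd

# Assumed toy frames; substitute your own df1 and df2.
df1 = pd.DataFrame(
    {"ID": [1, 2], "Events": [5, 7]},
    index=pd.to_datetime(["2014-01-01", "2014-01-02"]).rename("Date"),
)
df2 = pd.DataFrame(
    {"Value": [10.0, 20.0]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp("2014-01-01"), 1), (pd.Timestamp("2014-01-02"), 2)],
        names=["Date", "ID"],
    ),
)

# Bring the index levels back out as columns, merge on both keys,
# then restore the MultiIndex of the second frame.
merged = (
    df2.reset_index()
       .merge(df1.reset_index(), on=["Date", "ID"], how="left")
       .set_index(["Date", "ID"])
)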
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, while the rest of the columns, as well as the number of rows, are different.
How can I get the rows of MR whose 'ID' values are equal to the values in DT['ID'], knowing that values in 'ID' can appear several times in the same column?
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed at https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe that combines the records from both frames for the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs, but only for the rows where the ID appears in both.
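A small self-contained sketch of both options on made-up data (the extra column names are assumptions):
import pandas as pd

MR = pd.DataFrame({"ID": [1, 2, 2, 3], "x": ["a", "b", "c", "d"]})
DT = pd.DataFrame({"ID": [2, 3], "y": [10, 20]})

# Option 1: keep only the MR rows whose ID also occurs in DT
filtered = MR.loc[MR["ID"].isin(DT["ID"])]

# Option 2: combine the columns of both frames for matching IDs
combined = pd.merge(MR, DT, on="ID")

print(filtered)
print(combined)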
I am struggling to merge two pandas dataframes to replicate a vlookup that uses two columns as the lookup value.
The first dataframe, df, has 6 columns, three of which are perf, ticker and date. The perf column is empty, and it is the one I would like to see populated. The second dataframe, u, includes the same three columns, with values in the perf column, but only for a specific date.
I have tried this:
df = pd.merge(df, u, how='left', on=['ticker_and_exch_code', 'date'])
But the result I get is a dataframe with new perf columns instead of populating the one existing perf column. Would really appreciate insights into what I am missing, thanks!
Vincent
If the 'perf' column is empty in the first DataFrame, may I suggest removing it before merging the two DataFrames?
df = pd.merge(
    df.drop(columns='perf'),
    u,
    how='left',
    on=['ticker_and_exch_code', 'date'],
)
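For context: when both frames carry a non-key column with the same name, merge keeps both copies and suffixes them (perf_x / perf_y by default), which is why you were seeing extra perf columns; dropping the empty one first avoids that. A toy sketch with made-up values (only the column names come from your question):
import pandas as pd

df = pd.DataFrame({
    "ticker_and_exch_code": ["AAPL US", "MSFT US"],
    "date": ["2020-01-31", "2020-01-31"],
    "perf": [None, None],          # empty column to be populated
})
u = pd.DataFrame({
    "ticker_and_exch_code": ["AAPL US", "MSFT US"],
    "date": ["2020-01-31", "2020-01-31"],
    "perf": [0.05, 0.03],
})

df = pd.merge(
    df.drop(columns="perf"),
    u,
    how="left",
    on=["ticker_and_exch_code", "date"],
)
# df now has a single perf column, filled from u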
I have three DataFrames that I am trying to merge and output the result. The common column in each DataFrame that I am trying to merge on is COUNTRY.
Case1:
Before merging the three DataFrames, I set the index of each DataFrame to COUNTRY and ran
pd.merge(leftdf, rightdf, left_index=True, right_index=True, how="inner")
I am getting the required answer. But when I do not set the indices and instead leave Country as a column in each DataFrame, performing the merge
pd.merge(leftdf, rightdf, on="Country", how="inner")
the resultant DataFrame is reduced in size. I am losing some rows. Why is this happening? I do not understand.
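One way to see which rows are being dropped (a diagnostic sketch only, assuming leftdf and rightdf both carry a Country column) is to rerun the column-based merge with indicator=True and inspect the keys that fail to match:
# Keep all rows from both sides and mark where each row came from.
check = pd.merge(leftdf, rightdf, on="Country", how="outer", indicator=True)

# Rows whose Country value does not appear in both frames:
print(check.loc[check["_merge"] != "both", ["Country", "_merge"]])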
I have a pivot table with a MultiIndex in the column names, like this:
I want to keep the same data (it is correct), but I want to give each column a single name that summarizes all the index levels, so that I end up with something like this:
You can flatten a multi-index by converting it to a dataframe with text columns and joining them:
df.columns = df.columns.to_frame().astype(str).apply(''.join, axis=1)
The result should not be far from what you want. But as you have not given any reproducible example, I could not test against your data...
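A quick illustration on made-up data (joining with "_" as a separator, which is usually easier to read; all names below are assumptions):
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA"],
    "year": [2020, 2021, 2020],
    "sales": [1, 2, 3],
})

# A pivot with two aggregations produces MultiIndex columns like ('sum', 2020).
pivot = df.pivot_table(index="city", columns="year", values="sales",
                       aggfunc=["sum", "mean"])

# Flatten the MultiIndex column labels into single strings.
pivot.columns = pivot.columns.to_frame().astype(str).apply("_".join, axis=1)
print(pivot.columns.tolist())  # e.g. ['sum_2020', 'sum_2021', 'mean_2020', 'mean_2021']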
I have multiple dataframes with a timeseries index in dfList (an example dataframe is shown below).
I tried to concatenate these dataframes into one dataframe with the following command:
db=pd.concat(dfList)
and I got the following dataframe.
The timeseries index contains duplicates (many index entries are 2012-10-12 20:00:00), since the timeseries in the underlying dataframes overlap each other.
I want to remove these duplicates. Does anyone know how to do this?
Some example dataframes in which the timeseries indexes overlap are shown below.
Thank you!!
You can simply drop the duplicates by a particular column value, as mentioned in the docs here. You may do something like this:
db = db.drop_duplicates(subset="Timestamp")
which will drop all rows that are duplicated in the "Timestamp" column, keeping the first occurrence.
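Note that in your example the timestamps look like the index of db rather than a regular column; in that case (a sketch, assuming the concatenated frame is db) you can drop duplicated index entries directly:
# Keep only the first row for each duplicated timestamp in the index.
db = db[~db.index.duplicated(keep="first")]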