My pandas dataframe has a column "timeStamp". I'm trying to obtain the difference between the values of two dataframes, using the following piece of code (see code). My question: how can I keep the date the same and only subtract the values?
merge is a nice approach, as SwaggaTing suggested. Alternatively, you can set your date as the index:
a.set_index('date')['values_TProducing'] - b.set_index('date')['values_AProducing']
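A minimal sketch of that approach, assuming two hypothetical frames a and b that share a date column (column names taken from the snippets above):
import pandas as pd
# Hypothetical frames; the column names follow the snippets above
a = pd.DataFrame({'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
                  'values_TProducing': [10.0, 12.0]})
b = pd.DataFrame({'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
                  'values_AProducing': [4.0, 5.0]})
# Indexing both frames by date makes the subtraction align on the date,
# not on the row position
diff = a.set_index('date')['values_TProducing'] - b.set_index('date')['values_AProducing']
print(diff)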
Assuming the dates are unique, you can join the dataframes on the date column and then subtract:
merged = a.merge(b, on='date')
merged['diff'] = merged['values_AProducing'] - merged['values_TProducing']
This assumes that the dates line up as they do in your example:
x = a.copy().drop(columns='values_TProducing')  # keyword form; the positional axis argument is deprecated
x['values'] = a['values_TProducing'] - b['values_AProducing']
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows of MR whose 'ID' values are equal to values in DT['ID'], knowing that a value can appear several times in the same column?
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe of combined records for the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
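A tiny sketch, using made-up miniature frames, to illustrate the difference between the two approaches:
import pandas as pd
# Hypothetical miniature versions of MR and DT
MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'col_mr': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 3], 'col_dt': ['x', 'y']})
# isin keeps the matching rows of MR, with MR's columns only
print(MR.loc[MR.ID.isin(DT.ID), :])
# merge keeps the same rows but appends DT's columns
print(pd.merge(MR, DT, on='ID'))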
I am faced with a problem that is above my level of pandas - but might well be simple once I know the steps.
I have a dataframe with column names as below, and I want to extract the period from each column's name and pivot the period into a row, as in the second example below.
I also want to format each column differently: currently each is just a number, but some should be percentages and some plain numbers, with a certain number of decimals. What I have now and what I want are outlined below.
I have tried a few things: creating a MultiIndex with a string-splitting method and then pivoting the MultiIndex. I feel I am on the right track but just cannot make it work at present. Any help appreciated.
What I have now in a dataframe:
client_return_12m,client_return_36m,client_return_60m,client_sharpe_12m,client_sharpe_36m,client_sharpe_60m
0.34116,0.56439,0.701156,0.74320,0.82349,0.76889
What I want after:
period,client_return,client_sharpe
12m,34.1%,0.74
36m,56.4%,0.82
60m,70.1%,0.77
Use str.rsplit on the last _ and then reshape with DataFrame.stack:
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = df.stack().reset_index(level=0, drop=True).rename_axis('period').reset_index()
print(df)
period client_return client_sharpe
0 12m 0.341160 0.74320
1 36m 0.564390 0.82349
2 60m 0.701156 0.76889
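For the formatting part of the question, one possible sketch (note this turns the return column into display strings, which is fine for presentation but not for further arithmetic):
# Assumes the reshaped df from above
df['client_return'] = (df['client_return'] * 100).round(1).astype(str) + '%'
df['client_sharpe'] = df['client_sharpe'].round(2)
print(df)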
Here is my dataframe:
import pandas as pd
dates = ('2020-09-24','2020-10-19','2020-12-17','2021-03-17','2021-06-17','2021-09-17','2022-03-17','2022-09-20','2023-09-19','2024-09-17','2025-09-17','2026-09-17','2027-09-17','2028-09-19','2029-09-18','2030-09-17','2031-09-17','2032-09-17','2035-09-18','2040-09-18','2045-09-19')
factors = ('1','0.999994','0.999875','1.000166','1.000303','1.000438','1.00056','1.000817','1.001046','1.001412','1.001525','1.001334','1.000685','0.999376','0.997456','0.994626','0.991244','0.986754','0.982072','0.962028','0.925136')
df = pd.DataFrame()
df['dates']=dates
df['factors']=factors
df['dates'] = pd.to_datetime(df['dates'])
df.set_index(['dates'],inplace=True)
df
Here is another dataframe with a time series at a fixed interval:
interpolated = pd.DataFrame(0, index=pd.date_range('2020-09-24', '2045-09-19', freq='3M'),columns=['result'])
The goal is to populate the second dataframe with the cubic spline interpolated values from the first table.
Thanks for all the ideas
Attempt
interpolated['result'] = df['factors'].interpolate(method='cubic')
However, it gives only NaN values in the interpolated dataframe. I am not sure how to correctly refer to the first table.
First things first, the shapes don't match. Since none of the dates in the index of df match the dates in interpolated, you just end up with NaN filled in on those dates. I think you want something more like merge or join, as described in this post: Merging time series data by timestamp using numpy/pandas
The merge and join documentation will also be helpful.
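A minimal sketch of that idea, assuming the df and interpolated frames from the question: reindex the factors onto the union of both indexes, interpolate, then keep only the grid dates. Note the factors in the question were built as strings and need a float conversion first.
import pandas as pd
df['factors'] = df['factors'].astype(float)  # the factors were built as strings
# Put the known points and the target dates on one index
union_idx = df.index.union(interpolated.index)
combined = df['factors'].reindex(union_idx)
# Cubic interpolation over the union (requires scipy), then keep just the 3-month grid
interpolated['result'] = combined.interpolate(method='cubic').reindex(interpolated.index)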
I'm trying to merge two Pandas dataframes on two columns. One column has a unique identifier that could be used to simply .merge() the two dataframes. However, the second column merge would actually use .merge_asof() because it would need to find the closest date, not an exact date match.
There is a similar question here: Pandas Merge on Name and Closest Date, but it was asked and answered nearly three years ago, and merge_asof() is a much newer addition.
I asked a similar question here a couple of months ago, but that solution only needed merge_asof() without any exact matches required.
In the interest of including some code, it would look something like this:
df = pd.merge_asof(df1, df2, left_on=['ID','date_time'], right_on=['ID','date_time'])
where the IDs will match exactly, but the date_times will be "near matches".
Any help is greatly appreciated.
Consider merging first on ID, then running a DataFrame.apply to return, for each row of the second dataframe, the highest date_time from the first dataframe among matched IDs that is less than that row's date_time.
# INITIAL MERGE (CROSS-PRODUCT OF ALL ID PAIRINGS)
mdf = pd.merge(df1, df2, on=['ID'])
def f(row):
    col = mdf[(mdf['ID'] == row['ID']) &
              (mdf['date_time_x'] < row['date_time_y'])]['date_time_x'].max()
    return col
# FILTER BY MATCHED DATES TO CONDITIONAL MAX
mdf = mdf[mdf['date_time_x'] == mdf.apply(f, axis=1)].reset_index(drop=True)
This assumes you want to keep all rows of df2 (i.e., right join). Simply flip _x / _y suffixes for left join.
The current solution would work on a small dataset, but if you have hundreds of rows... I'm afraid not.
So, what you want to do is as follows:
df = pd.merge_asof(df1, df2, on='date_time', by='ID', direction='nearest')
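One caveat worth noting: merge_asof requires both frames to be sorted by the on key, so sort first:
# Both frames must be sorted by the 'on' key before merge_asof
df1 = df1.sort_values('date_time')
df2 = df2.sort_values('date_time')
df = pd.merge_asof(df1, df2, on='date_time', by='ID', direction='nearest')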
I have the following dataframe:
I want to see which country has the biggest difference between the columns "Gold" and "Gold 1". The index currently is the countries.
As an example, with Afghanistan it would be 0 - 0 = 0. I do this with every country, and then the highest number in that list is my answer. That's how I figured I would do it.
Does anyone know how I can do that? Or is there a built-in function that can calculate that?
You can subtract these two columns using built-in vector subtraction (the question has a single dataframe, so both columns come from df):
df['Gold'] - df['Gold 1']
The country with the biggest difference is
df.Gold.sub(df['Gold 1']).idxmax()
The biggest absolute difference
df.Gold.sub(df['Gold 1']).abs().idxmax()
You can also sort this by the difference
df.loc[df.Gold.sub(df['Gold 1']).sort_values().index]
Or the absolute differences
df.loc[df.Gold.sub(df['Gold 1']).abs().sort_values().index]
You can try out the code below:
import pandas as pd
df = pd.DataFrame([['Afgh', 0, 0], ['Agnt', 18, 0]], columns=['Country', 'Gold', 'Gold1'])
df['GoldDiff'] = df['Gold'] - df['Gold1']
df.sort_values(by='GoldDiff', ascending=False)
df is just a test dataframe based on yours above, and df['GoldDiff'] creates a new column to store the differences.
Then you can simply sort the values using the sort_values function from pandas. You can also add the option inplace=True if you want to modify your dataframe into the sorted one, as in the sketch below.
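A quick usage note: sort_values returns a sorted copy by default, so either reassign the result or pass inplace=True:
# Either reassign the sorted result...
df = df.sort_values(by='GoldDiff', ascending=False)
# ...or sort in place
df.sort_values(by='GoldDiff', ascending=False, inplace=True)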