I have two DataFrames that hold the historical price series of two different stocks. Applying describe() I noticed that the first stock has 1291 rows while the second has 1275. The difference comes from the fact that the two securities are listed on different stock exchanges and therefore trade on slightly different dates. What I would like to do is keep the two DataFrames separate, but drop from the first one all the rows whose dates are not present in the second one, so that the two DataFrames match perfectly for the analyses. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are indeed the right functions). Thanks to anyone who takes some of their time to answer my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns
pd.options.display.min_rows = None
pd.options.display.max_rows = None

tickers = ['DISW.MI', 'IXJ', 'NRJ.PA', 'SGOL', 'VDC', 'VGT']
wts = [0.19, 0.18, 0.2, 0.08, 0.09, 0.26]

price_data = web.get_data_yahoo(tickers,
                                start='2016-01-01',
                                end='2021-01-01')
price_data = price_data['Adj Close']

ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis=1)

benchmark_price = web.get_data_yahoo('ACWE.PA',
                                     start='2016-01-01',
                                     end='2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()

# From here on I get the error
sns.regplot(benchmark_ret.values, port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()

(beta, alpha) = stats.linregress(benchmark_ret.values,
                                 port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
As I understand it, you want df1 to keep only the rows whose dates also appear in df2.
df1
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=6),
    'px': np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=5),
    'px': np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
To keep only the rows of df1 whose dates are present in df2:
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
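Since the question also mentions merge(): an inner merge keeps only the dates present in both frames. Here is a minimal sketch using the toy frames above (the commented lines show the equivalent index intersection for date-indexed Series like the original port_ret and benchmark_ret):
# Inner merge keeps only dates present in both frames; suffixes
# distinguish the two px columns.
both = df1.merge(df2, on='date', how='inner', suffixes=('_1', '_2'))

# Equivalent idea for date-indexed Series (e.g. port_ret / benchmark_ret):
# common = port_ret.index.intersection(benchmark_ret.index)
# port_ret = port_ret.loc[common]
# benchmark_ret = benchmark_ret.loc[common]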
I have a dataset that is indexed weekly, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = ['12/01/2021', '13/01/2021', ...]
I need to get the interpolated value for every date in list_dates, but only using a given window of the df (for example, using only 4 values from the df to calculate the interpolation, split between before and after: the 2 dates before the list date and the 2 dates after it).
To get the interpolated value for 12/01/2021, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example 12/01 and 13/01). I also can't concat the interpolated value before running the next one in the list, as that would use an interpolated date to calculate the next interpolated date (for example, using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your dataframe as shown below.
I slightly modified your input data to demonstrate interpolation with a DatetimeIndex (method='time'):
import pandas as pd

# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']

# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')

# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()

# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
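For intuition, this matches the standard linear-in-time formula (a worked check, not part of the original answer): between (2021-01-07, 10) and (2021-01-14, 17) the slope is (17 - 10) / 7 days = 1 per day.
# Worked check of the two interpolated values above:
v0, v1 = 10, 17          # known values on 2021-01-07 and 2021-01-14
span_days = 7
v0 + (v1 - v0) * 5 / span_days  # 15.0 -> value on 2021-01-12
v0 + (v1 - v0) * 6 / span_days  # 16.0 -> value on 2021-01-13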
I read csv files into dataframe using
from glob import glob
import pandas as pd

def read_file(f):
    df = pd.read_csv(f)
    df['ticker'] = f.split('.')[0]
    return df

df = pd.concat([read_file(f) for f in glob('*.csv')])
df = df.set_index(['Date', 'ticker'])[['Close']].unstack()
And got the following dataframe:
Close
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
...
Now I would like to use the 'ticker' labels to reindex another random dataframe created by
data = np.random.random((df.shape[1], 100))
df1 = pd.DataFrame(data)
which looks like:
0 1 2 3 4 5 6 \...
0 0.493036 0.114539 0.862388 0.156381 0.030477 0.094902 0.132268
1 0.486184 0.483585 0.090874 0.751288 0.042761 0.150361 0.781567
2 0.318586 0.078662 0.238091 0.963334 0.815566 0.274273 0.320380
3 0.708489 0.354177 0.285239 0.565553 0.212956 0.275228 0.597578
4 0.150210 0.423037 0.785664 0.956781 0.894701 0.707344 0.883821
5 0.005920 0.115123 0.334728 0.874415 0.537229 0.557406 0.338663
6 0.066458 0.189493 0.887536 0.915425 0.513706 0.628737 0.132074
7 0.729326 0.241142 0.574517 0.784602 0.287874 0.402234 0.926567
8 0.284867 0.996575 0.002095 0.325658 0.525330 0.493434 0.701801
9 0.355176 0.365045 0.270155 0.681947 0.153718 0.644909 0.952764
10 0.352828 0.557434 0.919820 0.952302 0.941161 0.246068 0.538714
11 0.465394 0.101752 0.746205 0.897994 0.528437 0.001023 0.979411
I tried
df1 = df1.set_index(df.columns.values)
but it seems my df only has one level of index, since the error says
IndexError: Too many levels: Index has only 1 level, not 2
But if I check the index via df.index it gives me the Date. Can someone help me solve this problem?
You can get the column labels of a particular level of the MultiIndex in df by MultiIndex.get_level_values, as follows:
df_ticker = df.columns.get_level_values('ticker')
Then, if df1 has the same number of columns, you can copy the labels extracted to df1 by:
df1.columns = df_ticker
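A minimal self-contained sketch of the idea (the tickers and shapes here are invented for illustration):
import numpy as np
import pandas as pd

# Frame with MultiIndex columns like ('Close', ticker)
cols = pd.MultiIndex.from_product([['Close'], ['AAPL', 'AMD', 'BIDU']],
                                  names=[None, 'ticker'])
df = pd.DataFrame(np.random.random((3, 3)), columns=cols)

# Extract the ticker level and apply it to another frame's columns
df1 = pd.DataFrame(np.random.random((5, df.shape[1])))
df1.columns = df.columns.get_level_values('ticker')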
I have a data set whose first column is the Date, second column is the Collaborator and third column is the price paid.
I want to get the mean price paid by every Collaborator for the previous month. I want to return a table that looks like this:
I used some solutions like rolling, but I could only get the past X days, not the past month.
Pandas has a built-in method .rolling:
x = 3  # This is where you define the number of previous entries
df.rolling(x).mean()  # Apply the mean
Hence:
df['LastMonthMean'] = df['Price'].rolling(x).mean()
I'm not sure exactly how you want to calculate your mean, but I hope this helps.
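As a possible extension (my suggestion, not part of the original answer): if the DataFrame has a DatetimeIndex, rolling also accepts an offset string, which tracks calendar days rather than a fixed number of rows:
# Rolling mean over the trailing 30 calendar days (requires a DatetimeIndex)
df['LastMonthMean'] = df['Price'].rolling('30D').mean()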
I would first add a month column and then use groupby to get the mean per collaborator and month:
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})
df.groupby(['collaborator', 'month']).mean()
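With this toy data each (collaborator, month) group holds a single row, so the resulting means are deterministic:
                    price
collaborator month
1            1      100.0
             2      400.0
2            1      200.0
             2      500.0
3            1      300.0
             2      600.0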
The rolling() method would have to be applied to the DataFrame grouped by Collaborator to obtain the mean sale price of every collaborator in the previous month.
Because the data would be grouped and summarised, the number of data points would not match the original dataset, so you could not easily append the result to it.
If you use a DatetimeIndex in your DataFrame it will be considered a time series and then you can resample() the data more easily.
I have produced a replicable solution below, based on your initial question in which I resample the data and append the last month's mean to it. Thanks to #akilat90 for the function to generate random dates within a range.
import pandas as pd
import numpy as np

def random_dates(start, end, n=10):
    # Function copied from #akilat90
    # Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)

data = pd.DataFrame({'Collaborator': collaborators,
                     'Price': prices}, index=index)

# Mean price per collaborator per calendar month
monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()

# Merge each row with the previous month's mean: the right-hand month key
# is shifted by +1 so that e.g. February rows pick up January's mean
data_final = pd.merge(data, monthly_mean, how='left',
                      left_on=['Collaborator', data.index.month],
                      right_on=[monthly_mean.index.get_level_values('Collaborator'),
                                monthly_mean.index.get_level_values(1).month + 1])
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']
This is the output:
Collaborator Price LastMonthMean
2021-01-31 04:26:16 2 21.838910 NaN
2021-01-31 05:33:04 2 19.164086 NaN
2021-01-31 12:32:44 2 24.949444 NaN
2021-01-31 12:58:02 2 8.907224 NaN
2021-01-31 14:43:07 1 7.446839 NaN
2021-01-31 18:38:11 3 6.565208 NaN
2021-02-01 00:08:25 2 24.520149 15.230642
2021-02-01 09:25:54 2 20.614261 15.230642
2021-02-01 09:59:48 2 10.879633 15.230642
2021-02-02 10:12:51 1 22.134549 14.180087
2021-02-02 17:22:18 2 24.469944 15.230642
As you can see, the records in January 2021, the first month in this time series, do not have a valid Last Month Mean, unlike the records in February.
I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
The following is the code to get the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but it returns a result with a two-level index.
I'd like the season to fill in each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is you want a dataframe without a multi-level index.
The index can't be reset when the name in the index and the column are the same.
Use pandas.Series.reset_index, and set name='normalized_bin', to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
This works with the following implementation, because a pandas.Series is created with .groupby.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random  # for test data
import numpy as np  # for test data

# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
        'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
Using the OP's code for a
As already noted above, use normalize=True to get normalized values
The solution in the OP, creates a DataFrame, because the .groupby is wrapped with the DataFrame constructor, pandas.DataFrame.
To reset the index, you must first pandas.DataFrame.rename the bin column, and then use pandas.DataFrame.reset_index
a = (pd.DataFrame(df.groupby('season')['bin'].value_counts() / df.groupby('season')['bin'].count())
       .rename(columns={'bin': 'normalized_bin'})
       .reset_index())
Other Resources
See Pandas unable to reset index because name exist to reset by a level.
Plotting
It is easier to plot from the multi-index Series by using pandas.Series.unstack(), and then pandas.DataFrame.plot.bar.
For side-by-side bars, set stacked=False.
The bars are all equal to 1, because this is normalized data.
import matplotlib.pyplot as plt

s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()

# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
You are looking for the normalize parameter:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
Read more about it in the pandas documentation for value_counts.
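A quick sketch with invented toy data, for illustration:
import pandas as pd

data = pd.DataFrame({'season': ['winter', 'winter', 'summer', 'summer', 'summer'],
                     'bin': [1, 1, 1, 2, 2]})

# Fraction of each bin within each season
print(data.groupby('season')['bin'].value_counts(normalize=True))
# season  bin
# summer  2      0.666667
#         1      0.333333
# winter  1      1.000000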
I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number - so for example the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt

N = 100
start = dt.datetime(2017, 1, 1)
df_daily = pd.DataFrame({"a": range(N)}, index=pd.date_range(start, start + dt.timedelta(N - 1)))
df_monthly = pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))

df_daily["a"] / df_monthly  # ???
I was hoping the time series data would align in a one-to-many fashion and do the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to duplicate values within one month.
You can extract the month information with to_period('M') and then use map:
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can map the converted index directly (mapping the PeriodIndex itself, so the result aligns positionally with df_daily):
df_daily['a'] / df_daily.index.to_period('M').map(df_monthly)
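For illustration, the result of either approach is deterministic with the question's data (a runs 0..99 and January's divisor is 1):
result = df_daily['a'] / df_daily['month'].map(df_monthly)
result.head(3)
# 2017-01-01    0.0
# 2017-01-02    1.0
# 2017-01-03    2.0
# February values are divided by 2 and March by 3; the trailing April
# days of the 100-day range map to NaN, since df_monthly stops at 2017-03.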
You can create a temporary key from the index's month, then merge the two dataframes on that key, i.e.
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop('key', axis=1)
            a    0
2017-01-01  0  1.0
2017-01-02  1  1.0
2017-01-03  2  1.0
2017-01-04  3  1.0
2017-01-05  4  1.0
For the division you can then simply do:
df_new['b'] = df_new['a'] / df_new[0]
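As a small readability tweak (my suggestion, not part of the original answer), you could rename the merged 0 column first:
df_new = df_new.rename(columns={0: 'monthly'})
df_new['b'] = df_new['a'] / df_new['monthly']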