I have a data set where the first column is the Date, the second column is the Collaborator and the third column is the price paid.
I want to get the mean price paid by every Collaborator for the previous month. I want to return a table that looks like this:
I used some solutions like rolling, but I could only get the past X days, not the past month.
Pandas has a built-in method .rolling
x = 3 # This is where you define the number of previous entries
df.rolling(x).mean() # Apply the mean
Hence:
df['LastMonthMean'] = df['Price'].rolling(x).mean()
I'm not sure how you want to calculate your mean, but I hope this helps.
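If your DataFrame has a DatetimeIndex, rolling also accepts a time-based offset string, which gets closer to "the previous month" than a fixed number of rows. A minimal sketch, assuming a sorted DatetimeIndex and a Price column; note that '30D' is a trailing 30-day window, not a true calendar month:
df['Rolling30DayMean'] = df['Price'].rolling('30D').mean()  # requires a sorted DatetimeIndex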
I would first add a month column and then use groupby to get the mean for each collaborator and month:
import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})
df.groupby(['collaborator', 'month']).mean()
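If you then want each row to carry the previous month's mean (as in the original question), one option is to shift the month key before mapping the group means back. A minimal sketch based on the toy frame above; the last_month_mean name is just for illustration, and a plain month number only works within a single year:
# mean price per collaborator and month
monthly = df.groupby(['collaborator', 'month'])['price'].mean().reset_index(name='last_month_mean')
# shift the key forward so each row looks up the previous month's mean
monthly['month'] += 1
df = df.merge(monthly, on=['collaborator', 'month'], how='left')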
The rolling() method would have to be applied to the DataFrame grouped by Collaborator to obtain the mean sale price of every collaborator in the previous month.
Because the data would be grouped and summarised, the number of data points would not match the original dataset, so you could not easily append the result to it.
If you use a DatetimeIndex in your DataFrame it will be considered a time series and then you can resample() the data more easily.
I have produced a replicable solution below, based on your initial question, in which I resample the data and append the last month's mean to it. Thanks to akilat90 for the function to generate random dates within a range.
import pandas as pd
import numpy as np
def random_dates(start, end, n=10):
    # Function copied from akilat90, available at
    # https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)
data = pd.DataFrame({'Collaborator': collaborators,
                     'Price': prices}, index=index)
monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()
data_final = pd.merge(data, monthly_mean, how='left',
                      left_on=['Collaborator', data.index.month],
                      right_on=[monthly_mean.index.get_level_values('Collaborator'),
                                monthly_mean.index.get_level_values(1).month + 1])
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']
This is the output:
Collaborator Price LastMonthMean
2021-01-31 04:26:16 2 21.838910 NaN
2021-01-31 05:33:04 2 19.164086 NaN
2021-01-31 12:32:44 2 24.949444 NaN
2021-01-31 12:58:02 2 8.907224 NaN
2021-01-31 14:43:07 1 7.446839 NaN
2021-01-31 18:38:11 3 6.565208 NaN
2021-02-01 00:08:25 2 24.520149 15.230642
2021-02-01 09:25:54 2 20.614261 15.230642
2021-02-01 09:59:48 2 10.879633 15.230642
2021-02-02 10:12:51 1 22.134549 14.180087
2021-02-02 17:22:18 2 24.469944 15.230642
As you can see, the records in January 2021, the first month in this time series, do not have a valid LastMonthMean, unlike the records in February.
My raw data is a dataframe with three columns that describe journeys: quantity, start date, end date. My goal is to create a new dataframe with a daily index and a single column that shows the sum of the quantities of the journeys that were "on the way" each day, i.e. sum the quantity if day > start date and day < end date.
I think I can achieve this by creating a daily index and then using a for loop that, on each day, uses a mask to filter the data and then sums. I haven't managed to make it work, but I think there might actually be a better approach? Below is my attempt with some dummy data...
data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data, index, columns)
df['startdate'] = pd.to_datetime(df['startdate'])
df['enddate'] = pd.to_datetime(df['enddate'])
index2 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df2 = pd.DataFrame(0, index=index2, columns=['quantities'])
for t in index2:
    mask = (df['startdate'] < t) & (df['enddate'] > t)
    df2.loc[t, 'quantities'] = df[mask]['quantity'].sum()
Maybe you could create a date range for each record, then explode and groupby:
data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
df['range'] = df.apply(lambda x: pd.date_range(x['startdate'],x['enddate'],freq='D'), axis=1)
df = df.explode('range')
df.groupby('range')['quantity'].sum()
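If you also want days on which no journey was "on the way" to appear with a sum of 0, you could reindex the grouped result against the full daily range from the question. This is just an optional extra on top of the answer above:
daily = df.groupby('range')['quantity'].sum()
# days with no active journeys get a quantity of 0
daily = daily.reindex(pd.date_range('2020-01-01', '2020-06-01', freq='D'), fill_value=0)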
Your data describes a step function, i.e. on the 2nd of March (midnight) it increases by a value of 10, and on the 27th of March (midnight) it decreases by 10.
This solution uses a package called staircase which is built on pandas and numpy for working with (mathematical) step functions.
setup
data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
dates = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df["startdate"] = pd.to_datetime(df["startdate"])
df["enddate"] = pd.to_datetime(df["enddate"])
solution
Create a staircase.Stairs object (which is to staircase as pandas.Series is to pandas), which represents a step function. It is as simple as passing the start times, end times, and values; since your data is in a pandas.DataFrame, this can be done by passing the column names:
import staircase as sc
sf = sc.Stairs(frame=df, start="startdate", end="enddate", value="quantity")
The step function will be composed of left-closed intervals by default.
There are lots of things you can do with step functions, including plotting:
sf.plot(style="hlines")
If you just want the value at the start of each day, you can sample the step function like this:
sf(dates, include_index=True)
The result will be a pandas.Series indexed by your date range
2020-01-01 0
2020-01-02 33
2020-01-03 33
2020-01-04 33
2020-01-05 33
..
2020-05-28 0
2020-05-29 0
2020-05-30 0
2020-05-31 0
2020-06-01 0
Freq: D, Length: 153, dtype: int64
A more general solution to your problem, covering start and end times at any datetime (not just midnight) and arbitrary bins, can be achieved with slicing and integrating.
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
I have a dataset that follows a weekly indexation, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data          value
1/01/2021     10
7/01/2021     10
14/01/2021    10
28/01/2021    10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = [12/01/2021, 13/01/2021 ...]
I need to get what the interpolated values would be for every date in list_dates, but within a given window (for example: using only 4 values in the df to calculate the interpolation, split between before and after, so the first 2 dates before the list date and the first 2 dates after it).
To get the interpolated value for the list date 12/01/2021, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example 12/01 and 13/01). I also can't concat each interpolated value before running the next one in the list, as that would use an interpolated date to calculate the next interpolated date (for example using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your dataframe as below.
I slightly modified your input data to show interpolation with a DatetimeIndex (method='time'):
# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']
# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')
# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()
# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
I have two data frames that collect the historical price series of two different stocks. Applying describe() I noticed that the first stock has 1291 rows while the second has 1275. This difference is due to the fact that the two securities are listed on different stock exchanges and therefore differ on some dates.
What I would like to do is keep the two dataframes separate, but delete from the first dataframe all the rows whose dates are not present in the second, so that the two dataframes match perfectly for the analyses. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are the correct functions). Thanks to anyone who takes some of their time to answer my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns
pd.options.display.min_rows= None
pd.options.display.max_rows= None
tickers = ['DISW.MI','IXJ','NRJ.PA','SGOL','VDC','VGT']
wts= [0.19,0.18,0.2,0.08,0.09,0.26]
price_data = web.get_data_yahoo(tickers,
                                start='2016-01-01',
                                end='2021-01-01')
price_data = price_data['Adj Close']
ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis = 1)
benchmark_price = web.get_data_yahoo('ACWE.PA',
                                     start='2016-01-01',
                                     end='2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()
# From here on I get the error
sns.regplot(benchmark_ret.values,
            port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()
(beta, alpha) = stats.linregress(benchmark_ret.values,
                                 port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
From what I understand, you want df1 to also have 5 days of data, with dates matching df2.
df1
df1 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=6),
    'px': np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
    'date': pd.date_range('2021-05-17', periods=5),
    'px': np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
To keep only the rows of df1 whose dates also appear in df2:
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
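Since the question mentions merge(), the same filtering can also be written as an inner merge on the date column; this is just an equivalent alternative to the isin approach above (assuming the dates in df2 are unique, otherwise the merge would duplicate rows):
# note: unlike the isin filter, merge resets the index of the result
df1 = df1.merge(df2[['date']], on='date', how='inner')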
I would like to get the mean time between timestamps per group. However, the groups are not ordered.
Code to create df:
d = {'ID': ['AI100', 'AI200', 'AI200', 'AI100', 'AI200', 'AI100'],
     'Date': ['2019-01-10', '2018-06-01', '2018-06-11', '2019-01-15', '2018-06-21', '2019-01-22']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
data
ID Date
0 AI100 2019-01-10
1 AI200 2018-06-01
2 AI200 2018-06-11
3 AI100 2019-01-15
4 AI200 2018-06-21
5 AI100 2019-01-22
I tried the following:
data = data.sort_values(['ID','Date'],ascending=True).groupby('ID').head(3)  # sort, then keep the first 3 rows per ID
data['diffs'] = data['Date'].diff()
data['diffs'] = data['diffs'].apply(lambda x: x.days)
data = data.groupby('ID')['diffs'].agg('mean')
However, this yields:
data.add_suffix('ID').reset_index()
ID diffs
0 AI100ID 6.000000
1 AI200ID -71.666667
The mean time for group AI100ID is correct, but not for group AI200ID.
What is going wrong?
I think the issue here is that you aren't calculating your diffs per group, so at each group boundary the diff is computed between the previous group's last value and the new group's first value.
Change your line to this and you should get the expected result:
data['diffs'] = data.groupby('ID')['Date'].diff()
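Putting it together, a sketch of the corrected pipeline on the sample data; the means follow from the dates shown (gaps of 5 and 7 days for AI100, 10 and 10 days for AI200):
data = data.sort_values(['ID', 'Date'], ascending=True)
data['diffs'] = data.groupby('ID')['Date'].diff().dt.days
data.groupby('ID')['diffs'].mean()
# ID
# AI100     6.0
# AI200    10.0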
Footnote:
Another tip, unrelated to the main problem, but in case you were unaware:
data['diffs'] = data['diffs'].apply(lambda x: x.days)
Can be written to use faster vectorised operations using the .dt accessor:
data['diffs'] = data['diffs'].dt.days
I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number - so for example the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt
N=100
start=dt.datetime(2017,1,1)
df_daily=pd.DataFrame({"a":range(N)}, index=pd.date_range(start, start+dt.timedelta(N-1)))
df_monthly=pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))
df_daily["a"] / df_monthly # ???
I was hoping the time series data would align in a one-to-many fashion and do the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to duplicate values within one month.
You can extract the month with to_period('M') and then use map:
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can use
df_daily['a'] / df_daily.index.to_series().dt.to_period('M').map(df_monthly)
You can create a temporary key from the index's month, then merge both dataframes on the key, i.e.
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop('key', axis=1)
            a    0
2017-01-01  0  1.0
2017-01-02  1  1.0
2017-01-03  2  1.0
2017-01-04  3  1.0
2017-01-05  4  1.0
For division you can then simply do:
df_new['b'] = df_new['a'] / df_new[0]