RE pandas resample - python

I've been trying to take the average of a month's worth of data, but I wanted to check that:
df = df.resample('M').mean()
does give the monthly mean and NOT the mean of the last calendar day of the month.
Also, I've seen W-Mon, which I understand gives a weekly average anchored on Monday. What would be the equivalent for comparing the monthly average of October over multiple years?
I thought it would be this, but pandas doesn't seem to recognise the frequency:
df = df.resample("M-OCT").mean()

Try this:
df.assign(y=df.index.year, m=df.index.month).query('m==10').groupby(['y', 'm']).mean()
PS: if you need a neat and tested answer, please post a sample data set and the desired output in your question.
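As a quick check of both points, here is a minimal sketch on synthetic daily data (the `value` column and the date range are made up for illustration). It groups by calendar month through `to_period("M")`, which averages every row that falls in the month, just as a monthly resample followed by `.mean()` does, and then keeps only October so the months can be compared across years.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: the value is simply the day counter 0, 1, 2, ...
idx = pd.date_range("2018-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Mean over every day of each calendar month (not just the last day)
monthly = df.groupby(df.index.to_period("M")).mean()

# October only, one mean per year
october = df[df.index.month == 10]
october_by_year = october.groupby(october.index.year)["value"].mean()

print(monthly.head(3))
print(october_by_year)
```

January 2018 holds the values 0 to 30, so its monthly mean is 15 rather than the last day's value of 30, which confirms that the whole month is averaged.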

Related

How to include year fixed effect (in a daily panel data)

I am working on a panel dataset that includes daily stock returns of 450 firms for 5 years and a daily (momentum-based) ESG score for the same 5 years. I want to regress stock returns on the daily ESG scores, keeping firm and year fixed effects. I used the linearmodels.panel functions in Python and set the index to ("Stock ticker", "Date") before running the regressions with entity and time effects. In the regression result, the number of entities shows 450, which is perfect, but the number of time periods shows 1800. How is Python capturing the time effects? Is it based on the year or on something else? What I want is year fixed effects, where all firms share the same indicator variable for a particular year. Can someone please help me do this the right way?
the image shows the format of the data, where panel is based on daily returns
Sounds like your model is capturing daily fixed effects instead of yearly fixed effects. This is happening because you set Date as an index, so you're telling Python that you want one fixed effect per date.
You have to create a new column that only contains the year. That is, convert the date column to datetime format (see pandas.to_datetime) and then:
# Extract year from Date
df['Year'] = pd.DatetimeIndex(df['Date']).year
# Set indices
df = df.set_index(['Ticker','Year'])
Then run your model.
I recommend using linearmodels.PanelOLS because that module is specifically made for fitting fixed effects models.
For future reference, post your code and a replicable example so we can help you out more easily.
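To illustrate why the time dimension matters, here is a small pandas-only sketch on a made-up panel (the tickers, the date range and the `year_` dummy prefix are assumptions, not the asker's data). It shows that a daily date gives one time effect per date, while an extracted year gives one indicator shared by all firms in that year. Fitting the panel model itself is left out.

```python
import pandas as pd

# Hypothetical toy panel: 3 tickers observed on business days over 2 years
dates = pd.bdate_range("2019-01-01", "2020-12-31")
tickers = ["AAA", "BBB", "CCC"]
df = pd.DataFrame(
    [(t, d) for t in tickers for d in dates], columns=["Ticker", "Date"]
)

# Extract the calendar year from the daily date
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year

# Number of distinct time effects under each choice of time variable
n_daily_effects = df["Date"].nunique()  # one per trading date
n_year_effects = df["Year"].nunique()   # one per calendar year

# Explicit year indicators (first year dropped as the baseline)
year_dummies = pd.get_dummies(df["Year"], prefix="year", drop_first=True)

print(n_daily_effects, n_year_effects)
print(year_dummies.columns.tolist())
```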

How to select specific weeks in a datetime indexed dataframe and plot their mean

I'm sorry, my title may be a little confusing.
I have a huge datetime-indexed dataframe with one entry per hour over 2 years.
I'm trying to study seasonality at different scales (year, month, week, day...).
I had no problem plotting yearly, monthly, weekly or daily graphs, but I'm stuck on the next step.
I want to select a specific range of data (for example, between May and September if I want to study the summer) and plot the hour-by-hour mean of the selected period over one week.
My first point would be the mean of all Mondays at 01:00 during this period, the second point the mean at 02:00, and so on until Sunday 23:00.
I just can't figure out how to do this. Can someone give me a clue? :/
Hope you all have a nice day.
Edit: I don't have much code, but I'll show it anyway.
I have managed to find one value; now I want to create a function that finds all the others and plots the graph.
season = df.loc['2019-04':'2019-09']
x = season[["column_name"]][(season["hour"] == 1) & (season["day"] == 1)]
x.mean()
This gives me the value I'm looking for: 1 AM on Monday.
Now I want to create a loop that generates all the values in the right order to plot the whole week.
Well, I finally got my solution by using this:
import matplotlib.pyplot as plt

def graph(a, b, c):
    # a: start date of the period, b: end date, c: name of the column to plot
    season = df.loc[a:b]
    test = []
    for i in range(7):       # day of the week
        for j in range(24):  # hour of the day
            x = season[[c]][(season["hour"] == j) & (season["day"] == i)]
            z = x.mean()
            test.append(z)
    plt.figure(figsize=[18, 10])
    plt.xticks([0, 24, 48, 72, 96, 120, 144], ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
    plt.plot(test)
I'm pretty sure it's not the sexiest way to do it, but it seems to work well enough for now :)
I finally found a way to do it and edited the first post; I hope it will help other people.
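For comparison, the same weekly profile can be computed without the double loop by grouping on the weekday and hour taken from the index. This is a minimal sketch under assumptions: the function name `weekly_hourly_profile`, the column name `load` and the synthetic data are hypothetical, and weekdays are numbered from Monday = 0 as `DatetimeIndex.dayofweek` does.

```python
import pandas as pd

def weekly_hourly_profile(df, start, end, column):
    """Mean of `column` for each (weekday, hour) pair between start and end."""
    season = df.loc[start:end]
    keys = [season.index.dayofweek, season.index.hour]  # Monday is 0
    profile = season.groupby(keys)[column].mean()
    profile.index.names = ["day", "hour"]
    return profile

# Synthetic hourly data over two years; the value encodes weekday and hour
idx = pd.date_range("2019-01-01", "2020-12-31 23:00", freq=pd.Timedelta(hours=1))
values = (idx.dayofweek * 100.0 + idx.hour).to_numpy()
df = pd.DataFrame({"load": values}, index=idx)

profile = weekly_hourly_profile(df, "2019-05", "2019-09", "load")
print(profile.head())
```

The result has 168 rows ordered from Monday 00:00 to Sunday 23:00, so `profile.to_numpy()` can be passed to `plt.plot` with the same tick positions as in the loop version.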

Prophet Parameters

I am currently using Prophet to forecast usage over a one-year period. This is my first time using this algorithm and I have some questions.
I am using the code below. Has anyone included holidays as a parameter before, and how can I do so while including holidays from other calendars (lunar, Islamic, etc.)? Also, since February has one more day in a leap year, it would be great to know whether the algorithm takes this into consideration.
m = Prophet(
    growth='logistic',
    seasonality_mode='multiplicative',
    seasonality_prior_scale=1.5,
    mcmc_samples=5,
    n_changepoints=25,
    changepoint_range=0.8,
    yearly_seasonality='auto',
    weekly_seasonality='auto',
    daily_seasonality='auto',
    holidays=None,
    holidays_prior_scale=10.0,
    changepoint_prior_scale=0.05,
    interval_width=0.8,
    stan_backend=None,
)
The holidays parameter takes a dataframe. At a minimum it needs two columns: ds (the date) and holiday (the holiday name).
The important thing to note is that this dataframe must contain both historical and future occurrences of each holiday.
Apart from those two columns, the following are optional:
lower_window, upper_window (int) - to extend the holiday effect to days around the holiday date.
prior_scale (float) - to set a different prior scale for each holiday.
To answer your second question, i.e.
Also since February may have 1 more day in a leap year, would be great
as well to know if the algorithm take this into consideration?
It depends on the modelling data. Since the data you provide already includes the leap years, Prophet will take that into consideration.
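A minimal sketch of such a holidays dataframe, built with pandas only, might look like the following. The holiday labels and window sizes are illustrative assumptions, and the Lunar New Year dates should be checked against an authoritative calendar for your region. Holidays from any other calendar can be added in the same way by concatenating more rows.

```python
import pandas as pd

# Holiday from a lunar calendar: the Gregorian date moves every year,
# so each occurrence (past and future) is listed explicitly.
lunar_new_year = pd.DataFrame({
    "holiday": "lunar_new_year",  # hypothetical label
    "ds": pd.to_datetime(["2019-02-05", "2020-01-25", "2021-02-12", "2022-02-01"]),
    "lower_window": -1,  # effect starts one day before
    "upper_window": 2,   # and lasts two days after
})

# Fixed-date holiday for comparison
new_years_day = pd.DataFrame({
    "holiday": "new_years_day",  # hypothetical label
    "ds": pd.to_datetime(["2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01"]),
    "lower_window": 0,
    "upper_window": 0,
})

holidays = pd.concat([lunar_new_year, new_years_day], ignore_index=True)
print(holidays)

# The frame would then be passed as Prophet(holidays=holidays, ...),
# which requires the prophet package and is not run here.
```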

How to visualize aggregate VADER sentiment score values over time in Python?

I have a Pandas dataframe containing tweets from the period July 24 2019 to 19 October 2019. I have applied the VADER sentiment analysis method to each tweet and added the sentiment scores in new columns.
Now, my hope is to visualize this in some kind of line chart in order to analyse how the averaged sentiment scores per day have changed over this three-month period. I therefore need the dates on the x-axis and the averaged negative, positive and compound scores (three different lines) on the y-axis.
I have an idea that I need to somehow group or resample the data in order to show the aggregated sentiment value per day, but since my Python skills are still limited, I have not yet succeeded in finding a solution that works.
If anyone has an idea as to how I can proceed, that would be much appreciated! I have attached a picture of my dataframe as well as an example of the type of plot I had in mind :)
Cheers,
Nicolai
You should have a look at the groupby() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Simply create a day column containing a value (timestamp, datetime object, string, ...) that represents the day of the tweet and not its exact time. Then use the groupby() method on this column.
If you don't know how to create this column, an easy way of doing it is using https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Keep in mind that the groupby method doesn't return a DataFrame but a DataFrameGroupBy object, so you'll have to choose a way of aggregating the data in your groups (you should probably use groupby().mean() in your case; see the groupby method documentation for more information).
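The grouping described above can be sketched as follows. The column names `created_at`, `neg`, `pos` and `compound` and the four sample rows are assumptions, since the original dataframe was only shown as an image. Here the day column is built with `dt.floor("D")` rather than `apply`, which gives the same per-day key.

```python
import pandas as pd

# Hypothetical tweets with VADER scores already computed
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-07-24 08:15", "2019-07-24 21:40",
        "2019-07-25 10:05", "2019-07-25 18:30",
    ]),
    "neg": [0.10, 0.30, 0.00, 0.20],
    "pos": [0.40, 0.20, 0.50, 0.10],
    "compound": [0.60, -0.20, 0.80, -0.40],
})

# Day column: the timestamp with its time of day removed
tweets["day"] = tweets["created_at"].dt.floor("D")

# One row per day holding the mean of each score
daily = tweets.groupby("day")[["neg", "pos", "compound"]].mean()
print(daily)

# daily.plot() would draw one line per score with the dates on the
# x-axis (needs matplotlib; not run here).
```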

Time series: EWMA pandas forecast

I have searched extensively on Google and here but cannot seem to find the answer I am looking for, or at least something I understand. Is it possible to use EWMA in pandas for forecasting? For example, if I had daily website click data for 2 months, 1st Feb to 31st Mar, and didn't see any trend or seasonality in the data, it seems like I should be able to use EWMA to "predict" the number of clicks at a later date, say 10th April. In Excel, I can imagine just filling approximately 10 dates or rows after 31st March and computing a moving average where the 5-day EWMA for 10th April is based on weighted forecasts of the prior days. Is there a way I can do this in Python?
Thanks !
It's a one-liner to implement, but you're going to be a little bored by EWMA's predictions of the future: the forecast is flat, so every future date simply gets the most recent smoothed value. If you'd like a Python package that lets you experiment with EWMA level, trend and seasonality, try my Holt-Winters implementation:
https://github.com/welch/seasonal
https://pypi.python.org/pypi/seasonal
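A minimal sketch of that one-liner with pandas, under assumptions: the click counts are random numbers, the year 2021 is arbitrary, and `span=5` stands in for the "5-day EWMA" from the question. The forecast for every future date, 10th April included, is the last smoothed value.

```python
import numpy as np
import pandas as pd

# Synthetic daily clicks from 1 Feb to 31 Mar (made-up numbers)
dates = pd.date_range("2021-02-01", "2021-03-31", freq="D")
rng = np.random.default_rng(0)
clicks = pd.Series(rng.integers(80, 120, len(dates)).astype(float), index=dates)

# Exponentially weighted moving average of the observed data
smoothed = clicks.ewm(span=5, adjust=False).mean()

# Flat forecast: the last smoothed level repeated for the next 10 days
future = pd.date_range(dates[-1] + pd.Timedelta(days=1), periods=10, freq="D")
forecast = pd.Series(smoothed.iloc[-1], index=future)
print(forecast)
```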
