I would like to calculate the "first of quarter" in a pandas dataframe. However I am running into some problems. My Pandas version is 0.17.1.
import pandas as pd
import datetime as dt
test=pd.Timestamp(dt.datetime(2011,1,20))
test=test.tz_localize('Europe/Rome')
previousquarter=test-pd.tseries.offsets.QuarterBegin()
nextquarter=test+pd.tseries.offsets.QuarterBegin()
My expected results would be previousquarter = (2011,1,1) and nextquarter = (2011,4,1). But what I get is previousquarter = (2010,12,1) and nextquarter = (2011,3,1).
I have also tried it without tz_localize. However, it did not change the result.
Am I doing something wrong here or is this a bug somewhere?
Thanks in advance!
P.S. I know I could correct it by shifting one month, but this seems to be a rather crude workaround.
Yup, looks like a bug: https://github.com/pydata/pandas/issues/8435
There is a better workaround than shifting a month though:
offsets.QuarterBegin(startingMonth=1)
The answer given by Marshall worked fine for me, except on the first day of each year, where it pointed to the first day of the last quarter of the previous year. For example, for 2018-01-01 I was getting 2017-10-01.
I had to do the following to handle that :
(date + pd.tseries.offsets.DateOffset(days=1)) - pd.tseries.offsets.QuarterBegin(startingMonth=1)
where date is a datetime.datetime object.
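A date that already falls on a quarter start is the awkward case for offset arithmetic. One way to sidestep it is to go through quarterly periods instead. This is a minimal sketch rather than anything from the thread: it assumes timezone-naive timestamps and calendar quarters (January, April, July, October), and `quarter_bounds` is a hypothetical helper name.

```python
import pandas as pd

def quarter_bounds(ts):
    """Return (start of ts's quarter, start of the following quarter)."""
    # Hypothetical helper: map the timestamp to its calendar quarter, e.g. 2011Q1
    quarter = pd.Timestamp(ts).to_period('Q')
    # start_time is the first instant of a period; adding 1 moves to the next quarter
    return quarter.start_time, (quarter + 1).start_time

previousquarter, nextquarter = quarter_bounds('2011-01-20')
new_year_start, _ = quarter_bounds('2018-01-01')
```

With this approach a date such as 2018-01-01 maps to its own quarter start, so no one-day shift should be needed.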
Reproduced @Abhi's bug in Pandas 1.1.3. In fact, I couldn't get consistent results from QuarterEnd or QuarterBegin for all starting dates within the quarter. Instead I resorted to going to the end of the prior quarter and adding a day, or to the beginning of the next quarter and subtracting a day. Note the QuarterEnd(startingMonth=12):
import pandas as pd
print(" date Quarter Quarter begin Quarter end ")
for yr in range(2020, 2021):
    for mo in range(1,13):
        for dy in range(1,4):
            date = pd.Timestamp(yr, mo, dy)
            if dy == 3:
                date = date + pd.tseries.offsets.MonthEnd()
            qbegin = date + pd.offsets.QuarterEnd(-1, startingMonth=12) + pd.offsets.Day(1)
            qend = date + pd.offsets.QuarterBegin(1, startingMonth=1) - pd.offsets.Day(1)
            print("{} {} {} {}".format(date, date.quarter, qbegin, qend))
Related
I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I understand, if you have a start date-time and an end date-time, you can use Python's datetime module.
For example:
from datetime import datetime
# datetime(year, month, day, hour, minute, second)
start = datetime(2017, 5, 8, 18, 56, 40)
end = datetime(2019, 6, 27, 12, 30, 58)
print(end - start)  # prints the difference between the two date-times as a timedelta
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
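The point about matching formats can also be handled in pandas directly, by passing the format to `pd.to_datetime` and subtracting the resulting columns. This is only a sketch on made-up data: the column names follow the question, the sample values are assumptions, and each MM-YYYY string is treated as the first day of that month.

```python
import pandas as pd

# Made-up sample in the MM-YYYY layout described in the question
df = pd.DataFrame({
    'air_start_date': ['01-2020', '04-2009'],
    'air_end_date': ['03-2020', '07-2010'],
})

# An explicit format keeps parsing consistent; the day defaults to the 1st
start = pd.to_datetime(df['air_start_date'], format='%m-%Y')
end = pd.to_datetime(df['air_end_date'], format='%m-%Y')

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days
df['time_difference_days'] = (end - start).dt.days
```

Because both columns are parsed with the same format string, there is no leftover text for `strptime` to complain about.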
Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
EX:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
That will give you for the dummy example
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
To calculate the difference in months, you can find some suggestions here. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
(df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64
I've written this function to get the last Thursday of the month
def last_thurs_date(date):
    month=date.dt.month
    year=date.dt.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
But it's not working with the lambda function.
datelist['Date'].map(lambda x: last_thurs_date(x))
Where datelist is
datelist = pd.DataFrame(pd.date_range(start = pd.to_datetime('01-01-2014',format='%d-%m-%Y')
, end = pd.to_datetime('06-03-2019',format='%d-%m-%Y'),freq='D').tolist()).rename(columns={0:'Date'})
datelist['Date']=pd.to_datetime(datelist['Date'])
Jpp already added the solution, but just to add a slightly more readable formatted string - see this awesome website.
import calendar
def last_thurs_date(date):
    year, month = date.year, date.month
    cal = calendar.monthcalendar(year, month)
    # the last (4th week -> row) thursday (4th day -> column) of the calendar
    # except when 0, then take the 3rd week (February exception)
    last_thurs_date = cal[4][4] if cal[4][4] > 0 else cal[3][4]
    return f'{year}-{month:02d}-{last_thurs_date}'
Also added a bit of logic - e.g. you got 2019-02-0 as February doesn't have 4 full weeks.
Scalar datetime objects don't have a dt accessor, series do: see pd.Series.dt. If you remove this, your function works fine. The key is understanding that pd.Series.apply passes scalars to your custom function via a loop, not an entire series.
def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
You can rewrite your logic more succinctly via f-strings (Python 3.6+) and a ternary statement:
def last_thurs_date(date):
    month = date.month
    year = date.year
    last_thurs_date = calendar.monthcalendar(year, month)[4][4]
    return f'{year}{"-0" if month < 10 else "-"}{month}-{last_thurs_date}'
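One caveat about fixed indices such as `[4][4]`: `calendar.monthcalendar` returns Monday-first weeks by default, so Thursday is column `calendar.THURSDAY` (index 3), and a month can span four, five or six rows. The sketch below avoids hard-coded positions by taking the largest day number in the Thursday column, then applies the function to a Series with `map`, which hands each scalar timestamp to the function. The name `last_thursday` and the sample dates are assumptions for illustration.

```python
import calendar
import pandas as pd

def last_thursday(date):
    """Return the last Thursday of date's month as 'YYYY-MM-DD'."""
    # Rows are weeks, columns are weekdays (Monday = 0); days outside the month are 0
    weeks = calendar.monthcalendar(date.year, date.month)
    # The last Thursday is the largest day number in the Thursday column
    day = max(week[calendar.THURSDAY] for week in weeks)
    return f'{date.year}-{date.month:02d}-{day:02d}'

# map() passes one scalar Timestamp at a time, so .year/.month work without .dt
dates = pd.Series(pd.to_datetime(['2014-01-01', '2019-02-10', '2021-02-03']))
last_thursdays = dates.map(last_thursday)
```

February 2021 is a useful check because it starts on a Monday and fills exactly four rows, a case where indexing a fifth row would raise an IndexError.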
I know that a lot of time has passed since the date of this post, but I think it would be worth adding another option if someone came across this thread
Even though I use pandas every day at work, in this case my suggestion would be to just use the dateutil library. The solution is a simple one-liner, without unnecessary combinations.
from dateutil.rrule import rrule, MONTHLY, FR, SA
from datetime import datetime as dt
import pandas as pd
# monthly options expiration dates calculated for 2022
monthly_options = list(rrule(MONTHLY, count=12, byweekday=FR, bysetpos=3, dtstart=dt(2022,1,1)))
# last Saturday of the month
last_saturdays = list(rrule(MONTHLY, count=12, byweekday=SA, bysetpos=-1, dtstart=dt(2022,1,1)))
and then of course:
pd.DataFrame({'LAST_ST':last_saturdays}) #or whatever you need
This question is answered in "Calculate Last Friday of Month in Pandas".
That answer can be adapted by selecting the appropriate day of the week, here freq='W-FRI'.
I think the easiest way is to create a pandas.DataFrame using pandas.date_range and specifying freq='W-FRI'.
W-FRI is Weekly Fridays
pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')
Creates all the Fridays in the date range between the min and max of the dates in df
Use a .groupby on year and month, and select .last(), to get the last Friday of every month for every year in the date range.
Because this method finds all the Fridays for every month in the range and then chooses .last() for each month, there's not an issue with trying to figure out which week of the month has the last Friday.
With this, use pandas: Boolean Indexing to find values in the Date column of the dataframe that are in last_fridays_in_daterange.
Use the .isin method to determine containment.
pandas: DateOffset objects
import pandas as pd
# test data: given a dataframe with a datetime column
df = pd.DataFrame({'Date': pd.date_range(start=pd.to_datetime('2014-01-01'), end=pd.to_datetime('2020-08-31'), freq='D')})
# create a dataframe with all Fridays in the date range between the min and max of df.Date
fridays = pd.DataFrame({'datetime': pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')})
# use groupby and last to get the last Friday of each month into a list
last_fridays_in_daterange = fridays.groupby([fridays.datetime.dt.year, fridays.datetime.dt.month]).last()['datetime'].tolist()
# find the data for the last Friday of the month
df[df.Date.isin(last_fridays_in_daterange)]
QUESTION
I have been looking around for an answer but most of the posts are related to difference in days, but not value difference between two dates.
Assuming the following code :
import pandas as pd
import numpy as np
import datetime
np.random.seed(15)
day = datetime.date.today()
day_1 = datetime.date.today() - datetime.timedelta(1)
day_2 = datetime.date.today() - datetime.timedelta(2)
day_3 = datetime.date.today() - datetime.timedelta(3)
ticker_date = [('fi', day), ('fi', day_1), ('fi', day_2), ('fi', day_3),
('di', day), ('di', day_1), ('di', day_2), ('di', day_3)]
index_df = pd.MultiIndex.from_tuples(ticker_date, names=['lvl_1', 'lvl_2'])
df = pd.DataFrame(np.random.rand(8), index_df, ['value'])
output:
                     value
lvl_1 lvl_2
fi    2018-02-15  0.848818
      2018-02-14  0.178896
      2018-02-13  0.054363
      2018-02-12  0.361538
di    2018-02-15  0.275401
      2018-02-14  0.530000
      2018-02-13  0.305919
      2018-02-12  0.304474
I am looking for a method to groupby 'lvl_1' then get the difference between two given dates.
For instance, the difference between February 14th and February 12th would be -0.182642 for 'fi' and 0.225526 for 'di'.
I was working on the following lines of code:
group_by = df.groupby(level='lvl_1')
nd = group_by.get_loc(day_3, method='nearest')
st = group_by.get_loc(day_1, method='nearest')
out = group_by.iloc[nd] - group_by.iloc[st]
But it looks like it is not a valid method...
AttributeError: 'DataFrameGroupBy' object has no attribute 'get_loc'
Anyone?
This is a bit different from yours in spirit, but it should give what you want (although if your database is very big it might waste memory):
expanded = df.reset_index().pivot_table(index='lvl_1',columns='lvl_2',values='value')
expanded[day_3] - expanded[day_1]
This returns a Series with the difference:
lvl_1
di -0.225526
fi 0.182643
dtype: float64
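If pivoting the whole frame is a memory concern, a cross-section per date may be enough. The sketch below rebuilds the sample frame with the values printed in the question and uses `Series.xs` on the 'lvl_2' level. It assumes that level holds pandas timestamps (the question's frame uses `datetime.date` objects), and `value_on` is a hypothetical helper name.

```python
import pandas as pd

# Rebuild the sample frame with the values shown in the question
dates = pd.to_datetime(['2018-02-15', '2018-02-14', '2018-02-13', '2018-02-12'])
index = pd.MultiIndex.from_product([['fi', 'di'], dates], names=['lvl_1', 'lvl_2'])
df = pd.DataFrame(
    {'value': [0.848818, 0.178896, 0.054363, 0.361538,
               0.275401, 0.530000, 0.305919, 0.304474]},
    index=index,
)

def value_on(frame, day):
    """Hypothetical helper: values for one date, indexed by lvl_1."""
    return frame['value'].xs(pd.Timestamp(day), level='lvl_2')

# Both cross-sections are indexed by lvl_1, so the subtraction aligns per group
diff = value_on(df, '2018-02-14') - value_on(df, '2018-02-12')
```

This only touches the rows for the two requested dates, at the cost of requiring exact date matches rather than nearest ones.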
ANSWER:
I found a way to answer my own question. Assuming I am looking for the location of one given day only (then extrapolate for my specific question):
group_by = df.groupby(level='lvl_1')
ans = group_by.nth(df.index.get_level_values('lvl_2').unique().get_loc(day_2, method='nearest'))
Ideally, I would work with the location of each groupid, considering that the datetime vector could be different. However, I am having a hard time to figure out the last step...:
group_by = df.groupby(level='lvl_1')
loc = group_by.apply(lambda x: x.index.get_level_values('lvl_2').unique().get_loc(day_2, method='nearest'))
ans = group_by.nth(loc.groupby(level='lvl_1'))
But it gives me an error for my last line:
TypeError: n needs to be an int or a list/set/tuple of ints
If someone finds a way to solve this slight issue, fire up! In the meantime, my temporary answer does the job. thxs
From daily stock price data, I want to sample and select the end-of-month price. I am doing this with the following code.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin=end-pd.DateOffset(365*2)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-2])).set_index(data.index)
The line above selects end of the month data and here is the output.
If I want to select penultimate value of the month, I can do it using the following code.
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-2]))
Here is the output.
However the index shows end of the month value. When I choose penultimate value of the month, I want index to be 2015-12-30 instead of 2015-12-31.
Please suggest the way forward. I hope my question is clear.
Thanking you in anticipation.
Regards,
Abhishek
I am not sure if there is a way to do it with resample. But, you can get what you want using groupby and TimeGrouper.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin = end - pd.DateOffset(365*2)
st = begin.strftime('%Y-%m-%d')
ed = end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
data['Date'] = data.index
mon_data = (
    data[['Date', 'Adj Close']]
    .groupby(pd.TimeGrouper(freq='M')).nth(-2)
    .set_index('Date')
)
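Note that `pd.TimeGrouper` was removed in later pandas releases, where `pd.Grouper(freq=...)` takes its place. As a version-tolerant variation on the same idea, the sketch below groups by calendar month and keeps the second-to-last row of each month together with its original date index. The price frame is synthetic (weekdays only, made-up values), so it stands in for the downloaded data rather than reproducing it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices: weekdays only, made-up values
days = pd.bdate_range('2015-11-02', '2015-12-31')
data = pd.DataFrame({'Adj Close': np.arange(len(days), dtype=float)}, index=days)

# Keep the last two rows of every calendar month (original dates are preserved)
last_two = data.groupby(data.index.to_period('M')).tail(2)
# The first of those two rows is the penultimate observation of the month
penultimate = last_two.groupby(last_two.index.to_period('M')).head(1)
```

Because `tail` and `head` return rows with their original index, the result is labelled with the actual penultimate date (for example 2015-12-30) rather than the month end.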
The simplest solution is to take the index of your newly created dataframe and subtract the number of days you want to go back:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-1-n]))
mon_data.index = mon_data.index - datetime.timedelta(days=n)
Also, looking at your data, I think you should resample not to 'month end frequency' but rather to 'business month end frequency':
.resample('BM')
But even that won't cover everything: for instance, December 29, 2017 is a business month end, but this date doesn't appear in your data (which ends on December 8, 2017). So you could add a small fix for that (assuming the original data is sorted by date):
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
so, the full code will look like:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('BM').apply(lambda x: x[-1-n]))
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
mon_data.index = mon_data.index - datetime.timedelta(days=n)
By the way, your .set_index(data.index) throws an error because data and mon_data have different lengths (mon_data is grouped by month).
I am working with a dataset that encodes dates as the integer number of months since December 1899, so month 1 is January 1900 and month 1165 is January 1997. I would like to convert it to a pandas DatetimeIndex. So far the best I've come up with is:
month0 = np.datetime64('1899-12-15')
one_month = np.timedelta64(30, 'D') + np.timedelta64(10.5, 'h')
birthdates = pandas.DatetimeIndex(month0 + one_month * resp.cmbirth)
The start date is the 15th of the month, and the timedelta is 30 days 10.5 hours, the average length of a calendar month. So the date within the month drifts by a day or two.
So this seems a little hacky and I wondered if there's a better way.
You can use built-in pandas date-time functionality.
import pandas as pd
import numpy as np
indexed_months = np.random.randint(0, 1166, size=100)  # random month numbers in [0, 1165]
month0 = pd.to_datetime('1899-12-01')
date_list = [month0 + pd.DateOffset(months=int(mnt)) for mnt in indexed_months]
birthdates = pd.DatetimeIndex(date_list)
I've made an assumption that your resp.cmbirth object looks like an array of integers between 0 and 1165.
I'm not quite clear on why you want the bin edges of the indices to be offset from the start or end of the month. This can be done:
shifted_birthdates = birthdates.shift(15, freq='D')
and similarly for hours if you want. There is also useful info in the answers to this SO question and the related pandas github issue.
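For larger arrays, the month numbers can also be converted without a Python loop by splitting them into year and month with integer arithmetic. This is a sketch under the question's convention that month 1 is January 1900; the array `cmbirth` stands in for `resp.cmbirth` and its values are made up.

```python
import numpy as np
import pandas as pd

# Stand-in for resp.cmbirth: months since December 1899 (1 = January 1900)
cmbirth = np.array([1, 12, 13, 1165])

# Split the running month count into calendar year and month (1..12)
years = 1900 + (cmbirth - 1) // 12
months = (cmbirth - 1) % 12 + 1

# to_datetime assembles dates from year/month/day columns; day 1 is used throughout
parts = pd.DataFrame({'year': years, 'month': months, 'day': 1})
birthdates = pd.DatetimeIndex(pd.to_datetime(parts))
```

Each result falls on the first of the month, so nothing drifts; a mid-month date can be obtained afterwards by shifting the index by a fixed number of days.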