In Python/Pandas how do I convert century-months to DateTimeIndex? - python

I am working with a dataset that encodes dates as the integer number of months since December 1899, so month 1 is January 1900 and month 1165 is January 1997. I would like to convert to a pandas DateTimeIndex. So far the best I've come up with is:
month0 = np.datetime64('1899-12-15')
one_month = np.timedelta64(30, 'D') + np.timedelta64(10.5, 'h')
birthdates = pandas.DatetimeIndex(month0 + one_month * resp.cmbirth)
The start date is the 15th of the month, and the timedelta is 30 days 10.5 hours, the average length of a calendar month. So the date within the month drifts by a day or two.
So this seems a little hacky and I wondered if there's a better way.

You can use built-in pandas date-time functionality.
import pandas as pd
import numpy as np
# np.random.random_integers was removed from NumPy; randint's upper bound is exclusive.
indexed_months = np.random.randint(0, high=1166, size=100)
month0 = pd.to_datetime('1899-12-01')
date_list = [month0 + pd.DateOffset(months=mnt) for mnt in indexed_months]
birthdates = pd.DatetimeIndex(date_list)
I've made an assumption that your resp.cmbirth object looks like an array of integers between 0 and 1165.
I'm not quite clear on why you want the bin edges of the indices to be offset from the start or end of the month. This can be done:
shifted_birthdates = birthdates.shift(15, freq='D')
and similarly for hours if you want. There is also useful info in the answers to this SO question and the related pandas github issue.
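The list comprehension above builds one DateOffset per element. A vectorized alternative is sketched below using NumPy's month-resolution datetime64. The sample values in `cmbirth` are made up for illustration, and `century_months_to_index` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

def century_months_to_index(century_months):
    """Convert century-month codes (1 = January 1900) to a DatetimeIndex."""
    months = np.asarray(century_months, dtype="int64")
    # Month 0 is December 1899; adding N calendar months lands on the
    # first day of the target month, so nothing drifts within the month.
    month0 = np.datetime64("1899-12", "M")
    firsts = month0 + months.astype("timedelta64[M]")
    return pd.DatetimeIndex(firsts.astype("datetime64[ns]"))

cmbirth = [1, 12, 1165]  # illustrative values only
birthdates = century_months_to_index(cmbirth)
```

Each result is the first day of the month; a fixed day offset can be added afterwards if mid-month dates are preferred.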

Related

How to use relativedelta to dynamically add dates to a list of dates in a dataframe

I am new to python and have a few questions regarding dates.
Here is an example - I have a list of dates going from 01/01/2012 - 01/01/2025 with a monthly frequency. They also will change based on the data frame. Say one column of dates will have 130 months in between, the other will have 140 months, and so on.
The end goal is: regardless of how many months each set has, I need each "group" to have 180 months. So, in the above example of 01/01/2012 - 1/1/2025, I would need to add enough months to reach 1/1/2027.
Please let me know if this makes sense.
So if I understand you correctly, you have some data like:
import pandas as pd, numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
from random import randint
starts = [dt for i in range(130) if (dt := date(2012, 1, 1) + relativedelta(months=i)) <= date(2020, 1, 1)]
ends = [dt + relativedelta(months=randint(1, 5)) for dt in starts]
df = pd.DataFrame({ 'start': starts, 'end': ends })
so the current duration in months is:
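# Count whole calendar months; dividing by np.timedelta64(1, 'M') is not reliable because a month has no fixed length.
start, end = pd.to_datetime(df['start']), pd.to_datetime(df['end'])
df['duration'] = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)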
and you want to know how many to add to make the duration 180 months?
df['need_to_add'] = 180 - df.duration
then you can calculate a new end by something like:
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
df['new_end'] = df.apply(lambda r: add_months(r['end'], r['need_to_add']), axis=1)
I'm sure I haven't quite understood, as you could just add 180 months to the start date, but hopefully this gets you close to where you need to be.
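If the goal is simply to pad each monthly series out to a fixed horizon, one possible approach is sketched below. It assumes the dates are sorted and fall on the first of each month; `extend_to_horizon` and `horizon_months` are hypothetical names chosen for this example.

```python
import pandas as pd

def extend_to_horizon(dates, horizon_months=180):
    """Append monthly dates until the series reaches first date + horizon_months."""
    dates = pd.DatetimeIndex(dates)
    target_end = dates[0] + pd.DateOffset(months=horizon_months)
    if dates[-1] >= target_end:
        return dates  # already long enough, nothing to add
    # Start one month after the last existing date to avoid a duplicate.
    extra = pd.date_range(dates[-1] + pd.DateOffset(months=1), target_end, freq="MS")
    return dates.append(extra)

existing = pd.date_range("2012-01-01", "2025-01-01", freq="MS")
full = extend_to_horizon(existing)
```

With the dates from the question, the extended series ends on 2027-01-01, which is 180 months after the first date.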

Python: slice yearly data between February and June with pandas

I have a dataset with 10 years of data, from 2000 to 2010. The initial datetime is 2000-01-01, and the data are resampled to daily. I also added a week counter so that, when I slice, I can ask only for week 5 to week 21 (February 1 to May 30).
I am a little stuck with how I can slice it every year, does it involve a loop or is there a timeseries function in python that will know to slice for a specific period in every year? Below is the code I have so far, I had a for loop that was supposed to slice(5, 21) but that didn't work.
Any suggestions how might I get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)
You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (e.g. the mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
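Week numbers do not line up exactly with calendar months, so week 5 does not always start on February 1. If the intent is "February through May of every year", a month-based mask may be closer to what is wanted. The sketch below uses synthetic daily data in place of the asker's CSV; the `value` column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the resampled file.
index = pd.date_range("2000-01-01", "2010-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(index), dtype=float)}, index=index)

# Keep February (2) through May (5) of every year.
in_season = (df.index.month >= 2) & (df.index.month <= 5)
subset = df.loc[in_season]

# One aggregate per year, as in the groupby answer above.
yearly_mean = subset.groupby(subset.index.year).mean()
```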

Why does pandas interpret Aug-30 as 1930-08, but not 2030-08? [duplicate]

I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.
Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.
E.g.:
dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
'2061-01-09', '2055-02-08'],
dtype='datetime64[ns]', freq=None)
The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?
This is the behavior of Python's datetime library. I ran a test to find the cutoff, which falls between 68 and 69:
>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)
Two digits year ambiguity
So any %y value below 69 is assigned to the 2000s, and values from 69 upwards are assigned to the 1900s.
A two-digit %y year can only range from 00 to 99, which becomes ambiguous as soon as the data span more than one century.
If there is no overlap, you could manually process it and annotate the century (kill the ambiguity)
I suggest you process your data manually and specify the century, e.g. you can decide that anything in your data that has the year between 17 and 68 is attributed to 1917 - 1968 (instead of 2017 - 2068).
If you have overlap then you can't process with insufficient year information, unless e.g. you have some ordered data and a reference
If you have overlap e.g. you have data from both 2016 and 1916 and both were logged as '16', that's ambiguous and there isn't sufficient information to parse this, unless the data is ordered by date in which case you can use heuristics to switch the century as you parse it.
From the docs:
Year 2000 (Y2K) issues: Python depends on the platform’s C library,
which generally doesn’t have year 2000 issues, since all dates and
times are represented internally as seconds since the epoch. Function
strptime() can parse 2-digit years when given %y format code. When
2-digit years are parsed, they are converted according to the POSIX
and ISO C standards: values 69–99 are mapped to 1969–1999, and values
0–68 are mapped to 2000–2068.
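The manual century assignment described above could look roughly like the following. This is a sketch that assumes no date in the data is later than `latest_year`, a hypothetical parameter you would set from what you know about the data; it also assumes an English locale for `%b`.

```python
from datetime import datetime

def parse_two_digit_year(text, fmt="%d-%b-%y", latest_year=2024):
    """Parse a date with a two-digit year, moving 'future' years back a century."""
    parsed = datetime.strptime(text, fmt)
    # strptime maps 00-68 to 2000-2068; anything beyond latest_year is
    # assumed to belong to the 1900s instead.
    if parsed.year > latest_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

dates = ['26-Sep-05', '15-Jun-70', '9-Jan-61', '8-Feb-55']
fixed = [parse_two_digit_year(d) for d in dates]
```

This only resolves the ambiguity when the two centuries do not overlap in the data, as noted above.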
For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:
import pandas as pd
col = 'date'
df[col] = pd.to_datetime(df[col])
# Anything parsed as later than the threshold is assumed to belong to the previous century.
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
df.loc[future, col] -= pd.DateOffset(years=100)
You may need to tune the threshold date closer to the present depending on the earliest dates in your data.
You can write a simple function to correct the wrongly parsed year:
import datetime
def fix_date(x):
    # Years after the cutoff (1989 here) are moved back one century; adjust the cutoff to your data.
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)
df['date_column'] = df['date_column'].apply(fix_date)
Hope this helps..
Another quick solution to the problem:
import pandas as pd
import numpy as np
dates = pd.DataFrame(['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55'])
for i in dates:
    tempyear = pd.to_numeric(dates[i].str[-2:])
    dates["temp_year"] = np.where((tempyear >= 44) & (tempyear <= 99), tempyear + 1900, tempyear + 2000).astype(str)
    dates["temp_month"] = dates[i].str[:-2]
    dates["temp_flyr"] = dates["temp_month"] + dates["temp_year"]
    dates["pddt"] = pd.to_datetime(dates.temp_flyr.str.upper(), format='%d-%b-%Y', yearfirst=False)
    tempdrops = ["temp_year", "temp_month", "temp_flyr", i]
    dates.drop(tempdrops, axis=1, inplace=True)
The output is shown below. The column has been converted from object to pandas datetime format using pd.to_datetime:
pddt
0 2005-09-26
1 2005-09-26
2 1970-06-15
3 1994-12-05
4 1961-01-09
5 1955-02-08
As mentioned in some other answers this works best if there is no overlap between the dates of the two centuries.
If you run into the same problem with a pandas DataFrame, compare each value against the current date, or against a cutoff year, and subtract a century from the values that fall after it:
import datetime as dt
import pandas as pd
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x > dt.datetime.now() else x)
or
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x.year > 2022 else x)

Python Find # of Months Between 2 Dates

I am trying to find the # of months between 2 dates. Some solutions are off by 1 month and others are off by several months. I found this solution on SO, but the answers there are either too complicated or incorrect.
For example, given the starting date of 04/30/12 and ending date of 03/31/16,
def diff_month(d1, d2):
    return (d1.year - d2.year)*12 + d1.month - d2.month
returns 47 months, not 48
and
dates = [dt for dt in rrule(MONTHLY, dtstart=strt_dt, until=end_dt)]
returns 44 (Reason being that February does not have a day # 30 so it does not see it as a valid date)
I can of course fix that by doing
dates = [dt for dt in rrule(MONTHLY, dtstart=strt_dt.replace(day=2), until=end_dt.replace(day=1))]
But this does not seem like a proper solution (I mean the answer is right but the method sucks).
Is there a proper way of calculating the # of months so that given my example dates, it would return 48?
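I realize this post doesn't have a pandas tag, but if you are willing to use it, you can take the difference between two monthly periods. In recent pandas versions the subtraction returns an offset object, so read the integer from its n attribute (older versions returned the integer directly):
>>> import pandas as pd
>>> (pd.Period('2016-3-31', 'M') - pd.Period('2012-4-30', 'M')).n
47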
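Both the year/month arithmetic and the Period difference give 47 for these dates, because they count month boundaries crossed. Assuming the asker's 48 comes from counting every calendar month the range touches (April 2012 and March 2016 included), an inclusive count is one more than that. The sketch below uses a hypothetical helper, `months_between`, to show both conventions.

```python
from datetime import date

def months_between(start, end, inclusive=False):
    """Count calendar months from start to end.

    inclusive=False counts month boundaries crossed;
    inclusive=True counts every calendar month the range touches.
    """
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return months + 1 if inclusive else months

start_dt = date(2012, 4, 30)
end_dt = date(2016, 3, 31)
exclusive_count = months_between(start_dt, end_dt)
inclusive_count = months_between(start_dt, end_dt, inclusive=True)
```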

Pandas QuarterBegin(): Possible Bug when calculating First of quarter

I would like to calculate the "first of quarter" in a pandas dataframe. However I am running into some problems. My Pandas version is 0.17.1.
import pandas as pd
import datetime as dt
test=pd.Timestamp(dt.datetime(2011,1,20))
test=test.tz_localize('Europe/Rome')
previousquarter=test-pd.tseries.offsets.QuarterBegin()
nextquarter=test+pd.tseries.offsets.QuarterBegin()
My expected results would be previousquarter = (2011,1,1) and nextquarter = (2011,4,1). But what I get is previousquarter = (2010,12,1) and nextquarter = (2011,3,1).
I have also tried it without tz_localize. However, it did not change the result.
Am I doing something wrong here or is this a bug somewhere?
Thanks in advance!
P.S. I know I could correct it by shifting one month, but this seems to be a rather crude workaround.
Yup, looks like a bug: https://github.com/pydata/pandas/issues/8435
There is a better workaround than shifting a month though:
offsets.QuarterBegin(startingMonth=1)
The answer given by Marshall worked fine for me, except on the first day of each year, where it pointed to the first day of the last quarter of the previous year. E.g., for 2018-01-01 I was getting 2017-10-01.
I had to do the following to handle that:
(date + pd.tseries.offsets.DateOffset(days=1)) - pd.tseries.offsets.QuarterBegin(startingMonth=1)
where date is a datetime.datetime object.
I reproduced @Abhi's bug in pandas 1.1.3. In fact, I couldn't get consistent results from QuarterEnd or QuarterBegin for all starting dates within the quarter. Instead, I resorted to going to the end of the prior quarter and adding a day, or to the beginning of the next quarter and subtracting a day. Note the use of QuarterEnd(startingMonth=12).
import pandas as pd
print(" date Quarter Quarter begin Quarter end ")
for yr in range(2020, 2021):
    for mo in range(1,13):
        for dy in range(1,4):
            date = pd.Timestamp(yr, mo, dy)
            if dy == 3:
                date = date + pd.tseries.offsets.MonthEnd()
            qbegin = date + pd.offsets.QuarterEnd(-1, startingMonth=12) + pd.offsets.Day(1)
            qend = date + pd.offsets.QuarterBegin(1, startingMonth=1) - pd.offsets.Day(1)
            print("{} {} {} {}".format(date, date.quarter, qbegin, qend))
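Another way to sidestep the offset edge cases is to go through a quarterly Period, which knows its own start and end. This is a sketch for timezone-naive timestamps and calendar quarters ending in December; `quarter_bounds` is a hypothetical helper name.

```python
import pandas as pd

def quarter_bounds(ts):
    """Return the first and last day of the calendar quarter containing ts."""
    quarter = pd.Timestamp(ts).to_period("Q")
    # end_time is the last nanosecond of the quarter; normalize() truncates it to midnight.
    return quarter.start_time, quarter.end_time.normalize()

start, end = quarter_bounds("2011-01-20")
new_year_start, new_year_end = quarter_bounds("2018-01-01")
```

Because the period is looked up rather than rolled, a date that already sits on the first day of a quarter maps to itself.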
