Python - Pandas - Groupby - Value (not days) difference between two dates - python

QUESTION
I have been looking around for an answer but most of the posts are related to difference in days, but not value difference between two dates.
Assuming the following code :
import pandas as pd
import numpy as np
import datetime
np.random.seed(15)
day = datetime.date.today()
day_1 = datetime.date.today() - datetime.timedelta(1)
day_2 = datetime.date.today() - datetime.timedelta(2)
day_3 = datetime.date.today() - datetime.timedelta(3)
ticker_date = [('fi', day), ('fi', day_1), ('fi', day_2), ('fi', day_3),
               ('di', day), ('di', day_1), ('di', day_2), ('di', day_3)]
index_df = pd.MultiIndex.from_tuples(ticker_date, names=['lvl_1', 'lvl_2'])
df = pd.DataFrame(np.random.rand(8), index_df, ['value'])
output:
                     value
lvl_1 lvl_2
fi    2018-02-15  0.848818
      2018-02-14  0.178896
      2018-02-13  0.054363
      2018-02-12  0.361538
di    2018-02-15  0.275401
      2018-02-14  0.530000
      2018-02-13  0.305919
      2018-02-12  0.304474
I am looking for a method to groupby 'lvl_1' then get the difference between two given dates.
For instance, the difference between February 14th and February 12th (the value on the 14th minus the value on the 12th) would be -0.182643 for 'fi' and 0.225526 for 'di'.
I was working on the following lines of codes:
group_by = df.groupby(level='lvl_1')
nd = group_by.get_loc(day_3, method='nearest')
st = group_by.get_loc(day_1, method='nearest')
out = group_by.iloc[nd] - group_by.iloc[st]
But it looks like it is not a valid method...
AttributeError: 'DataFrameGroupBy' object has no attribute 'get_loc'
Anyone?

This is a bit different in spirit from your approach, but it should give what you want (although it might waste memory if your DataFrame is very big):
expanded = df.reset_index().pivot_table(index='lvl_1',columns='lvl_2',values='value')
expanded[day_3] - expanded[day_1]
This returns a Series with the difference:
lvl_1
di -0.225526
fi 0.182643
dtype: float64

ANSWER:
I found a way to answer my own question. Assuming I am looking for the location of one given day only (then extrapolate for my specific question):
group_by = df.groupby(level='lvl_1')
ans = group_by.nth(df.index.get_level_values('lvl_2').unique().get_loc(day_2, method='nearest'))
Ideally, I would work with the location within each group, since the datetime vector could differ from group to group. However, I am having a hard time figuring out the last step:
group_by = df.groupby(level='lvl_1')
loc = group_by.apply(lambda x: x.index.get_level_values('lvl_2').unique().get_loc(day_2, method='nearest'))
ans = group_by.nth(loc.groupby(level='lvl_1'))
But it gives me an error for my last line:
TypeError: n needs to be an int or a list/set/tuple of ints
If someone finds a way to solve this slight issue, fire away! In the meantime, my temporary answer does the job. Thanks.
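One way to sidestep the missing last step (a different nearest position in every group) is to do the lookup inside each group instead of passing positions to nth. This is only a sketch under some assumptions: value_nearest and diff_between_dates are hypothetical helper names, the frame has the same lvl_1/lvl_2 index and value column as in the question, and fixed dates replace datetime.date.today() so the result is reproducible. It also avoids get_loc(..., method='nearest'), which, as far as I know, pandas 2.0 and later no longer accept.

```python
import numpy as np
import pandas as pd


def value_nearest(group, target):
    """Return the 'value' entry whose lvl_2 date is closest to `target`."""
    dates = pd.to_datetime(group.index.get_level_values("lvl_2"))
    # Absolute distance of every date in this group from the target date
    distance = np.abs((dates - pd.Timestamp(target)).to_numpy())
    return group["value"].iloc[distance.argmin()]


def diff_between_dates(df, date_a, date_b):
    """Per lvl_1 group: value nearest to date_a minus value nearest to date_b."""
    result = {
        key: value_nearest(group, date_a) - value_nearest(group, date_b)
        for key, group in df.groupby(level="lvl_1")
    }
    return pd.Series(result, name="value")


# Same values as the question's printed output, but with fixed dates
days = pd.to_datetime(["2018-02-15", "2018-02-14", "2018-02-13", "2018-02-12"])
index = pd.MultiIndex.from_product([["fi", "di"], days], names=["lvl_1", "lvl_2"])
df = pd.DataFrame(
    {"value": [0.848818, 0.178896, 0.054363, 0.361538,
               0.275401, 0.530000, 0.305919, 0.304474]},
    index=index,
)

diff = diff_between_dates(df, "2018-02-14", "2018-02-12")
print(diff)
```

Because each group is searched on its own, the groups do not need to share the same set of dates.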

Related

How do I calculate time difference in days or months in python3

I've been working on a scraping and EDA project on Python3 using Pandas, BeautifulSoup, and a few other libraries and wanted to do some analysis using the time differences between two dates. I want to determine the number of days (or months or even years if that'll make it easier) between the start dates and end dates, and am stuck. I have two columns (air start date, air end date), with dates in the following format: MM-YYYY (so like 01-2021). I basically wanted to make a third column with the time difference between the end and start dates (so I could use it in later analysis).
# split air_dates column into start and end date
dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)
# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)
df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)
def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)
The last four lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!
As far as I understand, if you have a start date and time and an end date and time, you can use the datetime module in Python.
For example:
from datetime import datetime
# datetime(year, month, day, hour, minute, second)
start = datetime(2017, 5, 8, 18, 56, 40)
end = datetime(2019, 6, 27, 12, 30, 58)
print(end - start)  # prints the difference between the two datetimes as a timedelta
Hope this answer helps you.
Ok so I figured it out. In my second to last line, I replaced the %d with %m-%Y and now it populates the new column with the number of days between the two dates. I think the format needed to be consistent when running strptime so that's what was causing that error.
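A minimal sketch of that fix, assuming the dates are plain 'MM-YYYY' strings as described in the question (parse_month_year is a hypothetical helper name, and the two sample dates are made up for illustration):

```python
from datetime import datetime


def parse_month_year(text):
    """Parse an 'MM-YYYY' string; the day defaults to the 1st of the month."""
    return datetime.strptime(text, "%m-%Y")


start = parse_month_year("01-2021")
end = parse_month_year("03-2021")
difference = end - start  # a datetime.timedelta
print(difference.days)  # 59 (31 days of January + 28 days of February 2021)
```

The format string has to describe the whole input, which is why '%d' alone left '-2021' unconverted.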
Here's a slightly cleaned up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.
Example:
import pandas as pd
df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})
df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]
df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")
df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days
For the dummy example, that gives:
df['timediff_days']
0 456.0
1 NaN
Name: timediff_days, dtype: float64
Regarding the difference in months, you can find some suggestions on how to calculate it here. I'd go with @piRSquared's approach:
df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
                         (df['air_end_date'].dt.month - df['air_start_date'].dt.month))
df['timediff_months']
0 15.0
1 NaN
Name: timediff_months, dtype: float64

Why does pandas interpret Aug-30 as 1930-08, but not 2030-08? [duplicate]

I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.
Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.
E.g.:
dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
'2061-01-09', '2055-02-08'],
dtype='datetime64[ns]', freq=None)
The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?
That seems to be the behavior of Python's datetime library. I ran a test to see where the cutoff is, and it falls between 68 and 69:
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)
Two-digit year ambiguity
So it seems that any %y year below 69 is assigned to the 2000s, and any year from 69 upwards to the 1900s.
The two %y digits can only go from 00 to 99, which becomes ambiguous once the data crosses centuries.
If there is no overlap, you can manually process the data and annotate the century (removing the ambiguity)
I suggest you process your data manually and specify the century, e.g. you can decide that anything in your data with a year between 17 and 68 is attributed to 1917 - 1968 (instead of 2017 - 2068).
If there is overlap, the year information is insufficient to process the data, unless e.g. you have ordered data and a reference
If there is overlap, e.g. you have data from both 2016 and 1916 and both were logged as '16', the data is ambiguous and there isn't sufficient information to parse it, unless the data is ordered by date, in which case you can use heuristics to switch the century as you parse it.
From the docs:
Year 2000 (Y2K) issues: Python depends on the platform’s C library,
which generally doesn’t have year 2000 issues, since all dates and
times are represented internally as seconds since the epoch. Function
strptime() can parse 2-digit years when given %y format code. When
2-digit years are parsed, they are converted according to the POSIX
and ISO C standards: values 69–99 are mapped to 1969–1999, and values
0–68 are mapped to 2000–2068.
For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:
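import pandas as pd
from datetime import timedelta
col = 'date'
df[col] = pd.to_datetime(df[col])
# compare against a Timestamp: recent pandas versions can raise a TypeError
# when a datetime64 column is compared with a plain datetime.date
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
df.loc[future, col] -= timedelta(days=365.25*100)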
You may need to tune the threshold date closer to the present depending on the earliest dates in your data.
You can write a simple function to correct the wrongly parsed year, as shown below:
import datetime
def fix_date(x):
    # the 1989 threshold is specific to my data; adjust it to the latest plausible year in yours
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)
df['date_column'] = df['date_column'].apply(fix_date)
Hope this helps..
Another quick solution to the problem:
import pandas as pd
import numpy as np
dates = pd.DataFrame(['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55'])
for i in dates:
    tempyear = pd.to_numeric(dates[i].str[-2:])
    dates["temp_year"] = np.where((tempyear >= 44) & (tempyear <= 99), tempyear + 1900, tempyear + 2000).astype(str)
    dates["temp_month"] = dates[i].str[:-2]
    dates["temp_flyr"] = dates["temp_month"] + dates["temp_year"]
    dates["pddt"] = pd.to_datetime(dates.temp_flyr.str.upper(), format='%d-%b-%Y', yearfirst=False)
    tempdrops = ["temp_year", "temp_month", "temp_flyr", i]
    dates.drop(tempdrops, axis=1, inplace=True)
The output is as follows; here I have converted the output from object to pandas datetime format using pd.to_datetime:
pddt
0 2005-09-26
1 2005-09-26
2 1970-06-15
3 1994-12-05
4 1961-01-09
5 1955-02-08
As mentioned in some other answers this works best if there is no overlap between the dates of the two centuries.
If you run into the same problem with a pandas DataFrame, compare each value against the current date, or against a particular cutoff year, and apply a lambda function similar to the ones below (assuming import datetime as dt and import pandas as pd; pd.DateOffset(years=100) shifts by exactly one century, whereas 365*100 days falls short by the leap days):
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x > dt.datetime.now() else x)
or
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x.year > 2022 else x)
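The century correction discussed in this thread can also be written as one vectorized step. The sketch below is an assumption-laden illustration, not the pandas-recommended way: parse_two_digit_years and its latest_year parameter are hypothetical names, and it assumes no date in the data can legitimately fall after latest_year.

```python
import pandas as pd


def parse_two_digit_years(strings, fmt="%d-%b-%y", latest_year=2025):
    """Parse two-digit-year strings, moving anything after latest_year back a century."""
    parsed = pd.Series(pd.to_datetime(strings, format=fmt))
    too_late = parsed.dt.year > latest_year
    # DateOffset(years=100) is calendar-aware, so leap days do not shift the result
    shifted = parsed - pd.DateOffset(years=100)
    return parsed.where(~too_late, shifted)


dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
fixed = parse_two_digit_years(dates)
print(fixed)
```

As with the other answers, this only works when the two centuries do not overlap in the data.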

Find the closest date from today in a list

My objective is to get the next closest date (in the future, not past) to today's date from a list. For simplicity's sake, the list (in the format of e.g. 2017-01-31; YYYY-MM-DD) is each football game in the season and I am trying to create a script that in part finds the "next" football game.
I have searched the internet and Stack Overflow for answers and found a promising post, however the solutions provided are using a different format and when I try to tailor it to mine, it trips exceptions.
My logic includes parsing an RSS feed, so I am just going to provide the raw list instead. With this in mind, my simplified code is as follows:
today = str(datetime.date.today())
print(today)
scheduledatelist = ['2017-09-01', '2017-09-09', '2017-09-16', '2017-09-23', '2017-09-30', '2017-10-07', '2017-10-14', '2017-10-21', '2017-10-27', '2017-11-11', '2017-11-18', '2017-11-25']
scheduledatelist = list(reversed(scheduledatelist)) #purpose: to have earliest dates first
This is my attempt at adapting the previous post's solution (I am not well versed in functional programming, so I may not be adapting it right):
get_datetime = lambda s: datetime.datetime.strptime(s, "%Y-%m-%d")
base = get_datetime(today)
later = filter(lambda d: today(d[0]) > today, scheduledatelist)
closest_date = min(later, key = lambda d: today(d[0]))
print(closest_date)
Regardless of my attempt (which may not be the best in my situation as it changes the format and I need the end value to still be YYYY-MM-DD), is there an easier way of doing this? I need that next game (closest to today) value as that will continue on to be used in my logic. So to recap, how can I find the closest date in my list, looking toward the future, from today. Thank you for your help!
You can do:
min(scheduledatelist, key=lambda s:
    abs(datetime.datetime.strptime(s, "%Y-%m-%d").date()-datetime.date.today()))
For the single closest date to today.
You can use the same function to sort by distance from today:
sorted(scheduledatelist, key=lambda s:
    abs(datetime.datetime.strptime(s, "%Y-%m-%d").date()-datetime.date.today()))
And the returned list will be in increasing distance in days from today. Taking abs() of the difference is what makes this work whether the dates are before or after today; without it, min() would simply return the earliest date.
If you want only dates in the future, filter out the dates in the past. Since the date strings are in ISO 8601 format, you can compare lexically:
min([d for d in scheduledatelist if d>str(datetime.date.today())], key=lambda s:
    datetime.datetime.strptime(s, "%Y-%m-%d").date()-datetime.date.today())
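Since ISO 8601 strings sort in date order, the future-only lookup can also stay entirely in string form, which keeps the YYYY-MM-DD format the question asks for. A minimal sketch, assuming every entry is a well-formed YYYY-MM-DD string; next_date_string and its today parameter are hypothetical names, and the parameter exists only so the example is reproducible:

```python
import datetime


def next_date_string(date_strings, today=None):
    """Return the earliest 'YYYY-MM-DD' string that is today or later, else None."""
    if today is None:
        today = datetime.date.today()
    today_str = today.isoformat()  # 'YYYY-MM-DD'
    # Plain string comparison is enough because ISO 8601 dates sort lexically
    return min((d for d in date_strings if d >= today_str), default=None)


scheduledatelist = ['2017-09-01', '2017-09-09', '2017-09-16', '2017-09-23',
                    '2017-09-30', '2017-10-07', '2017-10-14', '2017-10-21',
                    '2017-10-27', '2017-11-11', '2017-11-18', '2017-11-25']

print(next_date_string(scheduledatelist, today=datetime.date(2017, 10, 1)))
```

Use `d > today_str` instead of `>=` if a game played today should not count as the next one.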
First of all, let's create datetime.date objects from the strings using the datetime.datetime.strptime and datetime.datetime.date methods, since datetime.date objects are ordered and easier to work with:
date_format = '%Y-%m-%d'
dates = [datetime.datetime.strptime(date_string,
                                    date_format).date()
         for date_string in scheduledatelist]
Then let's keep only the dates that take place in the future (today or later):
today = datetime.date.today()
future_dates = [date
                for date in dates
                if date >= today]
Then we can simply find the next closest date using min:
next_closest_date = min(future_dates)
which gives us
>>> print(next_closest_date)
2017-09-01
for the given example.
WARNING
If there are no dates after today, this raises an error like
ValueError: min() arg is an empty sequence
If that's acceptable we can leave it as is, but if we don't want errors, we can specify a default value for min in case of an empty sequence:
next_closest_date = min(future_dates, default=None)
Finally, we can write a function as follows:
import datetime
# `default` is returned when no future date strings are found
def get_next_closest_date(date_strings, date_format, default=None):
    today = datetime.date.today()
    dates = [datetime.datetime.strptime(date_string,
                                        date_format).date()
             for date_string in date_strings]
    future_dates = [date
                    for date in dates
                    if date >= today]
    # `default` must be passed by keyword, otherwise min() treats it as a second value to compare
    return min(future_dates, default=default)
and use it like
scheduledatelist = ['2017-09-01', '2017-09-09', '2017-09-16', '2017-09-23',
                    '2017-09-30', '2017-10-07', '2017-10-14', '2017-10-21',
                    '2017-10-27', '2017-11-11', '2017-11-18', '2017-11-25']
next_closest_date = get_next_closest_date(date_strings=scheduledatelist,
                                          date_format='%Y-%m-%d')
print(next_closest_date)

Python Find # of Months Between 2 Dates

I am trying to find the # of months between 2 dates. Some solutions are off by 1 month and others are off by several months. I found This solution on SO but the solutions are either too complicated or incorrect.
For example, given the starting date of 04/30/12 and ending date of 03/31/16,
def diff_month(d1, d2):
    return (d1.year - d2.year)*12 + d1.month - d2.month
returns 47 months, not 48
and
dates = [dt for dt in rrule(MONTHLY, dtstart=strt_dt, until=end_dt)]
returns 44 (Reason being that February does not have a day # 30 so it does not see it as a valid date)
I can of course fix that by doing
dates = [dt for dt in rrule(MONTHLY, dtstart=strt_dt.replace(day=2), until=end_dt.replace(day=1))]
But this does not seem like a proper solution (I mean the answer is right but the method sucks).
Is there a proper way of calculating the # of months so that given my example dates, it would return 48?
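I realize this post doesn't have a Pandas tag, but if you are willing to use it you can simply do the following, which takes the difference between two monthly periods (in recent pandas versions the result is an offset object such as <47 * MonthEnds>, and its .n attribute gives the integer):
>>> import pandas as pd
>>> pd.Period('2016-3-31', 'M') - pd.Period('2012-4-30', 'M')
47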
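Both diff_month and the Period difference count month boundaries crossed, which is why they give 47. If the expected 48 comes from counting every calendar month touched, with the start and end months both included, one more month has to be added. That reading of the question is an assumption, and months_spanned is a hypothetical helper name:

```python
import datetime


def months_spanned(start, end):
    """Count calendar months touched from start to end, both ends included."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


print(months_spanned(datetime.date(2012, 4, 30), datetime.date(2016, 3, 31)))  # 48
```

Whether the inclusive or the exclusive count is "right" depends on what a month means in the application, e.g. billing periods versus elapsed time.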

Pandas QuarterBegin(): Possible Bug when calculating First of quarter

I would like to calculate the "first of quarter" in a pandas dataframe. However I am running into some problems. My Pandas version is 0.17.1.
import pandas as pd
import datetime as dt
test=pd.Timestamp(dt.datetime(2011,1,20))
test=test.tz_localize('Europe/Rome')
previousquarter=test-pd.tseries.offsets.QuarterBegin()
nextquarter=test+pd.tseries.offsets.QuarterBegin()
My expected results would be previousquarter = (2011,1,1) and nextquarter = (2011,4,1). But what I get is previousquarter = (2010,12,1) and nextquarter = (2011,3,1).
I have also tried it without tz_localize. However, it did not change the result.
Am I doing something wrong here or is this a bug somewhere?
Thanks in advance!
P.S. I know I could correct it by shifting one month, but this seems to be a rather crude workaround.
Yup, looks like a bug: https://github.com/pydata/pandas/issues/8435
There is a better workaround than shifting a month though:
offsets.QuarterBegin(startingMonth=1)
The answer given by Marshall worked fine for me, except for the first day of each year, where it pointed to the first date of the last quarter of the previous year, e.g. for 2018-01-01 I was getting 2017-10-01.
I had to do the following to handle that :
(date + pd.tseries.offsets.DateOffset(days=1)) - pd.tseries.offsets.QuarterBegin(startingMonth=1)
where date is a datetime.datetime object.
Reproduced @Abhi's bug in Pandas 1.1.3. In fact, I couldn't get consistent results from QuarterEnd or QuarterBegin for all starting dates within the quarter. Instead I resorted to going to the end of the prior quarter and adding a day, or to the beginning of the next quarter and subtracting a day. Note the use of QuarterEnd(startingMonth=12).
import pandas as pd
print(" date Quarter Quarter begin Quarter end ")
for yr in range(2020, 2021):
    for mo in range(1,13):
        for dy in range(1,4):
            date = pd.Timestamp(yr, mo, dy)
            if dy == 3:
                # use the last day of the month as the third test date
                date = date + pd.tseries.offsets.MonthEnd()
            qbegin = date + pd.offsets.QuarterEnd(-1, startingMonth=12) + pd.offsets.Day(1)
            qend = date + pd.offsets.QuarterBegin(1, startingMonth=1) - pd.offsets.Day(1)
            print("{} {} {} {}".format(date, date.quarter, qbegin, qend))
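A different technique that avoids the offset classes altogether is to convert the timestamp to a quarterly Period and read its start. This is a sketch rather than a fix for the offsets discussed above: quarter_bounds is a hypothetical helper name, and it assumes calendar quarters starting in January and a timezone-naive timestamp (a localized timestamp would need its timezone handled separately).

```python
import pandas as pd


def quarter_bounds(timestamp):
    """Return (start of the current quarter, start of the next quarter)."""
    quarter = pd.Timestamp(timestamp).to_period("Q")
    start = quarter.start_time
    next_start = (quarter + 1).start_time
    return start, next_start


start, next_start = quarter_bounds("2011-01-20")
print(start, next_start)
```

A date that already sits on the first day of a quarter, such as 2018-01-01, maps to itself as the quarter start.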
