Finding the closest date inside a Pandas dataframe given a condition - python

I have this S&P 500 historical data sample, and I want to compare the dates inside it.
>> df
High Low Open Close Volume Adj Close
Date
2011-01-03 127.599998 125.699997 126.709999 127.050003 138725200.0 104.119293
2011-01-04 127.370003 126.190002 127.330002 126.980003 137409700.0 104.061905
2011-01-05 127.720001 126.459999 126.580002 127.639999 133975300.0 104.602806
2011-01-06 127.830002 127.010002 127.690002 127.389999 122519000.0 104.397934
2011-01-07 127.769997 126.150002 127.559998 127.139999 156034600.0 104.193031
... ... ... ... ... ... ...
2020-12-14 369.799988 364.470001 368.640015 364.660004 69216200.0 363.112183
2020-12-15 369.589996 365.920013 367.399994 369.589996 64071100.0 368.021240
2020-12-16 371.160004 368.869995 369.820007 370.170013 58420500.0 368.598816
2020-12-17 372.459991 371.049988 371.940002 372.239990 64119500.0 370.660004
2020-12-18 371.149994 367.019989 370.970001 369.179993 135359900.0 369.179993
Let latest be the most recent S&P 500 OHLC prices:
latest = df.iloc[-1]
How can I find the date inside this dataframe index that is closest to latest lagged by 1 year (latest.replace(year=latest.year - 1))? Just using the pd.Timestamp.replace method sometimes doesn't work; it can generate a date that is not inside my index.

This approach only works if your index column ('Date') contains DateTime objects. If it contains strings, you first have to convert the index to DateTime format.
df.index = pd.to_datetime(df.index)
With that, you can get the latest time either with latest = df.index[-1] or df.index.max().
Then we offset the latest date by one year using pd.DateOffset and get the theoretical lagged date.
lagged_theoretical = latest - pd.DateOffset(years=1)
To obtain the closest date to the calculated date that is actually present in your DataFrame, we calculate the time delta between all the dates in your dataframe and the calculated date. From there, we select the minimum to get the closest date. We fetch the index of the minimum in the timedelta array and use that index to get the actual date from the DataFrame's index column. Here is the whole code:
import numpy as np
import pandas as pd

latest = df.index[-1]
lagged_theoretical = latest - pd.DateOffset(years=1)

# absolute time difference between every date in the index and the lagged date
td = (abs(df.index - lagged_theoretical)).values
# position of the smallest difference, i.e. the closest date actually present
idx = np.where(td == td.min())[0][0]
lagged_actual = df.index[idx]
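A shorter alternative sketch, assuming df.index is a sorted, unique DatetimeIndex: DatetimeIndex.get_indexer with method='nearest' does the same lookup in one call.
# position of the index value nearest to the lagged date
idx = df.index.get_indexer([lagged_theoretical], method='nearest')[0]
lagged_actual = df.index[idx]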

Related

How to aggregate irregularly sampled data for Time Series Analysis

I am trying to forecast daily profit using time series analysis, but the daily profit is not only recorded unevenly; some of the data is also missing.
Raw Data:
Date         Revenue
2020/1/19    10$
2020/1/20    7$
2020/1/25    14$
2020/1/29    18$
2020/2/1     12$
2020/2/2     17$
2020/2/9     28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. On top of that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onwards.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit from 2020/2/9 onwards.
This is my general idea, but as a beginner at Python using the pandas library, I have trouble executing it. Could you please show me how to aggregate the profit every 6 days so that the data looks like the table above?
The easiest way is to use the pandas resample function.
Provided you have an index of type DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum()
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
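A minimal sketch of that with the sample data from the question; the origin argument (pandas >= 1.1) is an assumption about where you want the 6-day bins anchored, and the answers below cover it in more detail.
import pandas as pd

df = pd.DataFrame({'Date': ['2020/1/19', '2020/1/20', '2020/1/25', '2020/1/29',
                            '2020/2/1', '2020/2/2', '2020/2/9'],
                   'Revenue': [10, 7, 14, 18, 12, 17, 28]})
df['Date'] = pd.to_datetime(df['Date'])

# sum revenue over consecutive 6-day bins anchored at 2020-01-16
six_day_totals = df.set_index('Date').resample('6D', origin='2020-01-16').sum()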
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
import pandas as pd

df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum for each 6-day period. We're not interested in every day's six-day revenue total, though, only in every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing to do with summary_df is to format it the way you'd like, so that it clearly states the date range to which each row refers.
# each 6-day window ends on the index date and starts 5 days earlier
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True, inplace=True)
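For reference, with the sample data above, summary_df should end up looking roughly like this:
print(summary_df)
#    Revenue Start Date   End Date
# 0     17.0 2020-01-16 2020-01-21
# 1     14.0 2020-01-22 2020-01-27
# 2     47.0 2020-01-28 2020-02-02
# 3      0.0 2020-02-03 2020-02-08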
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
            Revenue
Date
2020-01-16       17
2020-01-22       14
2020-01-28       47
2020-02-03        0
2020-02-09       28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
            Revenue
Date
2020-01-16       17
2020-01-22       14
2020-01-28       47
2020-02-03        0
2020-02-09       28

Is it possible to resample and sum values in a Pandas dataframe by specifying a date range?

I have a dataframe like the following (dates with an associated binary value (whether or not a flood occurs), spanning a total of 20 years):
...
2019-12-27 0.0
2019-12-28 1.0
2019-12-29 1.0
2019-12-30 0.0
2019-12-31 0.0
...
I need to produce a count (i.e. sum, considering the values are binary) over a series of custom date ranges, e.g. '24-05-2019 to 09-09-2019', or '15-10-2019 to 29-12-2019', etc.
My initial thought was to use the resample method; however, as I understand it, this will not allow me to select a custom date range, rather it will resample over a set time period, e.g. a month or a year.
Any ideas out there?
Thanks in advance
If the dates are a DatetimeIndex and the index of the dataframe or Series, you can directly select the relevant rows:
df.loc['24-05-2019':'09-09-2019', 'flood'].sum()
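A minimal self-contained sketch of that idea, using a small stand-in for the data above (the column name flood is taken from the snippet above, and the date range is the second one mentioned in the question):
import pandas as pd

# stand-in for the questioner's data: a binary 'flood' column indexed by date
df = pd.DataFrame({'flood': [0.0, 1.0, 1.0, 0.0, 0.0]},
                  index=pd.to_datetime(['2019-12-27', '2019-12-28', '2019-12-29',
                                        '2019-12-30', '2019-12-31']))

# label-based slicing on a DatetimeIndex includes both endpoints
flood_count = df.loc['2019-10-15':'2019-12-29', 'flood'].sum()  # -> 2.0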
Since it's a Pandas dataframe you should be able to do something like:
start_date = df[df.date == '24-05-2019'].index.values[0]
end_date = df[df.date == '09-09-2019'].index.values[0]
subset = df[start_date:end_date]
sum(subset.flood) # Or other maths
where 'date' and 'flood' are your column headers, and 'df' is your dataframe. This assumes your dates are strings and that each date only appears once. If not, you'll have to pick which of the index values returned for 'start_date' and 'end_date' you want.

KeyError when searching for a date PRIOR to a Date Time in a python pandas dataframe using get_loc ffill

I have the following data frame. I want to get the nearest date using get_loc's ffill parameter, as in the code snippet below.
DataFrame:
Date Time,Close,High,Low,Open,Volume
2020-01-02 22:45:00,326.75,329.3,326.5,329.3,0.0
2020-01-02 22:50:00,328.0606,330.0708,325.6666,326.7106,9178.0
2020-01-02 22:55:00,327.4,328.3,327.4,328.05,1035.0
...
2020-02-07 04:50:00,372.05,375.0,372.0,373.0,4936.0
2020-02-07 04:55:00,372.1156,373.3588,370.3559,372.3656,7548.0
Code Snippet
df_colname = 'Date Time'
pandas_datetime_colname = 'Pandas Date Time'
df[pandas_datetime_colname] = pd.to_datetime(df[df_colname])
df.set_index(pandas_datetime_colname, inplace=True)
dt = pd.to_datetime(inputdatetime)  # inputdatetime is the date being searched for
idx = df.index.get_loc(dt, method='ffill')
print("Date Time: " + str(inputdatetime) + " :idx " + str(idx))
df.reset_index(inplace=True)
This returns the correct date 2020-01-02 22:45:00 when I provide the date 2020-02-02 22:50:00, but when I give a date PRIOR to the first date, I get a KeyError:
KeyError: Timestamp('2019-12-20 22:45:00')
I also don't get an error when I give a date AFTER the last date in the dataframe.
I looked through the docs but could not find why I get an error only for dates PRIOR to the first date. I was hoping to get some kind of None object instead.
From the docs:
ffill: find the PREVIOUS index value if no exact match.
Giving a datetime with no prior datetimes in the index will result in an error, because ffill looks for the closest match at a PREVIOUS index value. Since there is no such index value, it throws an error.
I believe what you're looking for is the nearest argument, instead of ffill:
nearest: use the NEAREST index value if no exact match. Tied distances are broken by preferring the larger index value.
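A minimal sketch of that change, reusing the df with the datetime index from the question (note that the method argument was deprecated on get_loc in later pandas releases and lives on get_indexer instead):
import pandas as pd

dt = pd.to_datetime('2019-12-20 22:45:00')  # a date before the first bar in the frame

# 'nearest' never falls off either end of the index; in older pandas this was
# df.index.get_loc(dt, method='nearest'), in newer versions use get_indexer
idx = df.index.get_indexer([dt], method='nearest')[0]
print(df.index[idx])  # -> 2020-01-02 22:45:00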

Using GroupBy with DateTime in Pandas (Python)

I have data that looks like this, which I get from an API (of course in JSON form):
0,1500843600,8872
1,1500807600,18890
2,1500811200,2902
.
.
.
where the second column is the date/time in ticks, and the third column is some value. I basically have the data for every hour of the day, for every day, for a couple of months. What I want to achieve is to get the minimum value of the third column for every week. The code segment below correctly returns the minimum value, but apart from the minimum value I also want to return the specific Timestamp showing at which date/time the weekly low occurred. How can I modify my code below so that I also get the Timestamp along with the minimum value?
import pandas

df = pandas.DataFrame(columns=['Timestamp', 'Value'])
# dic holds the data that I get from my API.
for i in range(len(dic)):
    df.loc[i] = [dic[i][1], dic[i][2]]
df['Timestamp'] = pandas.to_datetime(df['Timestamp'], unit='s')
df.sort_values(by=['Timestamp'])
df.set_index(df['Timestamp'], inplace=True)
df.groupby([pandas.Grouper(key='Timestamp', freq='W-MON')])['Value'].min()
I think you need DataFrameGroupBy.idxmin to get the index of the row with the minimum Value in each group, and then select those rows by loc:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
df = df.loc[df.groupby([pd.Grouper(key='Timestamp', freq='W-MON')])['Value'].idxmin()]
print (df)
            Timestamp  Value
2 2017-07-23 12:00:00   2902
Detail:
print (df.groupby([pd.Grouper(key='Timestamp', freq='W-MON')])['Value'].idxmin())
Timestamp
2017-07-24 2
Freq: W-MON, Name: Value, dtype: int64
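As a side note, the row-by-row loop above can be replaced by building the DataFrame in one call. A sketch, assuming dic is a list of [row_number, unix_timestamp, value] records as in the sample (the 'Row' column name is made up here):
import pandas as pd

# hypothetical stand-in for the API response described in the question
dic = [[0, 1500843600, 8872],
       [1, 1500807600, 18890],
       [2, 1500811200, 2902]]

df = pd.DataFrame(dic, columns=['Row', 'Timestamp', 'Value']).drop(columns='Row')
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')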

Sum large pandas dataframe based on smaller date ranges

I have a large pandas dataframe that has hourly data associated with it. I want to parse that into "monthly" data that sums the hourly data. However, the months aren't necessarily calendar months; they typically start in the middle of one month and end in the middle of the next month.
I could build a list of the "months" that each of these date ranges falls into and loop through it, but I would think there is a much better way to do this via pandas.
Here's my current code, the last line throws an error and is the crux of the question:
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
month = pd.DataFrame({'start':['1/4/2015 00:00','1/24/2015 00:00'], 'end':['1/23/2015 23:00','2/23/2015 23:00']})
month['start'] = pd.to_datetime(month['start'])
month['end'] = pd.to_datetime(month['end'])
month['num'] = df['num'][(df['date'] >= month['start']) & (df['date'] <= month['end'])].sum()
I would want an output similar to:
start end num
0 2015-01-04 2015-01-23 23:00:00 33,251
1 2015-01-24 2015-02-23 23:00:00 39,652
but of course, I'm not getting that.
pd.merge_asof is only available from pandas 0.19 onwards.
combination of pd.merge_asof + query + groupby
pd.merge_asof(df, month, left_on='date', right_on='start') \
.query('date <= end').groupby(['start', 'end']).num.sum().reset_index()
explanation
pd.merge_asof
From docs
For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.
But this only takes into account the start date.
query
I take care of the end date with query, since I now conveniently have end in my dataframe after pd.merge_asof.
groupby
I trust this part is obvious.
Maybe you can convert to a period and add a number of days
# create data
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
# offset days and then create period
df['periods'] = (df.date + pd.tseries.offsets.Day(23)).dt.to_period('M')
# group and sum
df.groupby('periods')['num'].sum()
Output
periods
2015-01 10051
2015-02 34229
2015-03 37311
2015-04 26655
You can then shift the dates back and make new columns.
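A hypothetical sketch of that last step, continuing from the code above (the column names start and end are made up here):
out = df.groupby('periods')['num'].sum().reset_index()
# undo the 23-day shift to recover the boundaries of each custom "month"
out['start'] = out['periods'].dt.to_timestamp() - pd.tseries.offsets.Day(23)
out['end'] = out['periods'].dt.to_timestamp(how='end') - pd.tseries.offsets.Day(23)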
