Resampling DataFrame in days but retaining original datetime index format - python

I have tick data that I would like to refine by removing the first and last rows of each day. The original dataframe has a datetime64[ns] index with a format of '%Y-%m-%d %H:%M:%S'.
To do so I used
df.resample('D').first()
df.resample('D').last()
and successfully extracted the first and last rows of each day.
The problem is that when resampling by day, the original datetime index is replaced by a day-level index in '%Y-%m-%d' format.
How do I use resample so that it retains the original datetime index format?
Or is there a way I can reformat the datetime index in the new dataframe so that it displays down to seconds?

IIUC
Your problem is that you are resampling daily and getting the first value per day, but you also want the original timestamp associated with that first value.
You want to aggregate the date in your index as well:
df.assign(NewDate=df.index).resample('D').first().set_index('NewDate')
Or you can resample the index itself and grab the minimum timestamp per day:
df.loc[df.index.to_series().resample('D').min()]
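Both approaches can be sketched on a small made-up tick dataset (the timestamps and prices below are placeholders, not the OP's data):

```python
import pandas as pd

# Made-up tick data: two ticks per day
idx = pd.to_datetime([
    "2020-01-01 09:30:01", "2020-01-01 15:59:58",
    "2020-01-02 09:30:03", "2020-01-02 15:59:55",
])
df = pd.DataFrame({"price": [100.0, 101.5, 101.0, 102.2]}, index=idx)

# Approach 1: carry the original timestamp along as a column, then restore it
first_rows = df.assign(NewDate=df.index).resample('D').first().set_index('NewDate')

# Approach 2: find the earliest timestamp per day and look those rows up
first_rows2 = df.loc[df.index.to_series().resample('D').min()]

print(first_rows.index)  # full '%Y-%m-%d %H:%M:%S' timestamps, not bare dates
```

Either way the resulting index keeps the second-level resolution of the original ticks.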

Related

Convert Datetime formats and merge two OHLC timeseries on pandas

My Plan:
I have two different datasets with OHLC values, one representing the Weekly (1W) timeframes: weekly_df and the other representing the hourly (1H) timeframes hourly_df.
Here's what the two data frames look like:
My goal is to merge the weekly OHLC values into the hourly df using pandas merge followed by ffill. However, before I do that, I need to get the date columns into the same format and type, meaning I need to reformat the weekly dates with 00:00:00 after the date. Here's how I'm doing it:
The Problem:
Once this is done, everything is a string, and when I try to convert it back to datetime, the 00:00:00 in the date column disappears:
Next, I wanted to merge the data frames by date and forward-fill, so that all the hourly OHLC values on a given date also have columns displaying their weekly OHLC values.
As of right now, this is not working: the merge only keeps the dates common to both data frames and omits the rest:
Is there an easier way to do this? Most of the methods I have tried return an error.
The two data frame CSV files:
In case you need to test it, here are the two CSV files:
Hourly
Weekly
Any help would be appreciated. Thanks in advance!
For anyone who would face a similar issue in the future, here's how I fixed it:
Since the datetime format applied to the dataframe does not enforce 00:00:00, I offset the time by 1 second (to 00:00:01) in both dataframes, as follows:
hourly_df['date'] = hourly_df['date'] + pd.DateOffset(seconds=1)
weekly_df['date'] = weekly_df['date'] + pd.DateOffset(seconds=1)
This lets me enforce the same format on the weekly df by offsetting it by 1 second as well.
Finally as I now have the same date columns I can merge & ffill them as follows:
merged_df = hourly_df.merge(weekly_df, on=['date'], how='left').ffill()
which merges and displays the result as follows:
Do let me know if anyone else finds another way to solve this by keeping the original time. Cheers!
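For anyone reconstructing this, here is a minimal, self-contained sketch of the offset-and-merge approach with made-up data (the column names `close` and `weekly_close` are placeholders, not the OP's actual CSV columns):

```python
import pandas as pd

# Toy stand-ins for the OP's hourly and weekly CSVs
hourly_df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02 00:00:00", "2023-01-02 01:00:00",
                            "2023-01-09 00:00:00"]),
    "close": [10.0, 10.5, 11.0],
})
weekly_df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-09"]),
    "weekly_close": [10.8, 11.2],
})

# Offset both date columns by one second so week-start rows share exact timestamps
hourly_df["date"] = hourly_df["date"] + pd.DateOffset(seconds=1)
weekly_df["date"] = weekly_df["date"] + pd.DateOffset(seconds=1)

# Left-merge the weekly values onto the hourly rows, then forward-fill the gaps
merged_df = hourly_df.merge(weekly_df, on=["date"], how="left").ffill()
print(merged_df)
```

After the forward fill, every hourly row inside a week carries that week's OHLC values.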

Group by similar dates in pandas

I have a dataframe that has one column which is a datetime series object. It has some data associated with every date in another column. The years range from 2005-2014. I want to group similar dates in each year together, i.e., all the 1st of Januarys falling in 2005-14 must be grouped together irrespective of the year, and similarly for all 365 days of the year. So I should have 365 days as the output. How can I do that?
Assuming your DataFrame has a column Date, you can make it the index of the DataFrame, then use strftime to convert to a format with only day and month (like "%m-%d"), and finally groupby plus the appropriate aggregation function (I just used mean):
df = df.set_index('Date')
df.index = df.index.strftime("%m-%d")
dfAggregated = df.groupby(level=0).mean()
Please note that the output will have 366 days, due to leap years. You might want to filter out the data associated with Feb 29th or merge it into Feb 28th/March 1st (depending on the specific use case of your application).
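A minimal sketch of the whole recipe, using made-up daily values over the 2005-2014 range:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning 2005-2014
dates = pd.date_range("2005-01-01", "2014-12-31", freq="D")
df = pd.DataFrame({"Date": dates, "value": np.arange(len(dates), dtype=float)})

df = df.set_index("Date")
df.index = df.index.strftime("%m-%d")      # e.g. '01-01', ..., '12-31'
dfAggregated = df.groupby(level=0).mean()  # one row per calendar day

print(len(dfAggregated))  # 366 rows, because leap years contribute '02-29'
```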

Length between two dates in a time series in pandas data frame

I have a time series composed of weekdays with anomalous/unpredictable holidays. On any given day, I want to know the length/number of rows to a date specified under column 'date1'. See below.
len(df.loc['2019-10-18':'2019-11-15']) returns the correct answer
I am trying to create a column 'shift' that will calculate the above.
Both DatetimeIndex and the 'date1' are dtype 'datetime64[ns]'
df['shift'] = len(df.loc[df.index : df['date1']]) clearly doesn't work, but might there be a solution that does?
IIUC use:
df['len'] = (df.index - df['date1']).dt.days
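A minimal sketch with a few made-up weekday timestamps; note that this difference comes out negative when date1 lies in the future of the index, so you may want to swap the operands depending on the sign you expect:

```python
import pandas as pd

# Hypothetical weekday index, each row pointing at a target date in 'date1'
idx = pd.to_datetime(["2019-10-18", "2019-10-21", "2019-10-22"])
df = pd.DataFrame(
    {"date1": pd.to_datetime(["2019-10-25", "2019-10-25", "2019-10-25"])},
    index=idx,
)

# Vectorised difference between the index and the target-date column, in days
df["len"] = (df.index - df["date1"]).dt.days
print(df)
```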

Resampling Pandas Timeseries so that the Date indicates the 1st of each Month

I have a Pandas timeseries where the Date indicates the last day of each month. I would like to change it so that it contains the first day of each Month. E.g., instead of '2018-08-31' to become '2018-08-01' and so on for all the Dates.
To that end I tried to resample using the 'convention' argument with the value 'start' but the method returned the timeseries intact.
For a reproducible example:
toy_data.to_json()
'{"GDP_Quarterly_Growth_Rate":{"-710294400000":-0.266691,"-707616000000":-0.266691,"-704937600000":-0.266691,"-702345600000":-0.206496,"-699667200000":-0.206496,"-697075200000":-0.206496,"-694396800000":1.564208,"-691718400000":1.564208,"-689212800000":1.564208,"-686534400000":1.504256}}'
toy_data.resample('M', convention = 'start').mean()
Returns the toy_data intact.
Change 'M' to 'MS' (see the offset aliases in the pandas docs):
toy_data.resample('MS', convention = 'start').mean()
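A minimal sketch of the fix, using a small month-end indexed series standing in for toy_data:

```python
import pandas as pd

# Month-end indexed series, like the OP's data
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2018-08-31", "2018-09-30", "2018-10-31"]),
)

# 'MS' anchors each bin to month start, so labels land on the 1st of each month
out = s.resample("MS").mean()
print(out.index)  # 2018-08-01, 2018-09-01, 2018-10-01
```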

How can a DataFrame change from having two columns (a "from" datetime and a "to" datetime) to having a single column for a date?

I've got a DataFrame that looks like this:
It has two columns, one of them being a "from" datetime and one of them being a "to" datetime. I would like to change this DataFrame such that it has a single column or index for the date (e.g. 2015-07-06 00:00:00 in datetime form) with the variables of the other columns (like deep) split proportionately into each of the days. How might one approach this problem? I've meddled with groupby tricks and I'm not sure how to proceed.
So I don't have time to work through your specific problem at the moment, but the way to approach this is to use pandas.resample(). Here are the steps I would take: 1) Resample your "to" date column by minute. 2) Populate the other columns out over that resample. 3) Add the date column back in as an index.
If this doesn't work or is being tricky to work with I would create a date range from your earliest date to your latest date (at the smallest interval you want - so maybe hourly?) and then run some conditional statements over your other columns to fill in the data.
Here is roughly what your code may look like for the resample portion (replace day with hour or whatever):
drange = pd.date_range('01-01-1970', '01-20-2018', freq='D')
data = data.resample('D').ffill()
data.index.name = 'date'
Hope this helps!
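A minimal sketch of the resample-and-forward-fill step with made-up data (the `deep` column and dates are placeholders for the OP's frame):

```python
import pandas as pd

# Hypothetical frame indexed by the "to" datetime, with one measurement column
data = pd.DataFrame(
    {"deep": [5.0, 7.0]},
    index=pd.to_datetime(["2015-07-06", "2015-07-09"]),
)

# Resample to daily frequency and forward-fill, so each intermediate day
# carries the last known values
data = data.resample("D").ffill()
data.index.name = "date"
print(data)
```

The two original rows expand to one row per day, with the gap days filled from the earlier row.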
