My Plan:
I have two different datasets with OHLC values: one representing the weekly (1W) timeframe, weekly_df, and the other representing the hourly (1H) timeframe, hourly_df.
Here's what the two data frames look like:
My goal is to merge the weekly OHLC values into the hourly df using pandas merge followed by ffill. Before I do that, however, I need to get the date columns into the same format and type, meaning I need to reformat the weekly dates to carry 00:00:00 after the date. Here's how I'm doing it:
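The original snippet isn't shown here, but a plausible version of this step, formatting the weekly dates as strings with an explicit midnight time, would be:

import pandas as pd

# Render each weekly date as 'YYYY-MM-DD 00:00:00'. Note that strftime
# produces strings, which is where the trouble below starts.
weekly_df['date'] = pd.to_datetime(weekly_df['date']).dt.strftime('%Y-%m-%d %H:%M:%S')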
The Problem:
Once this is done, everything is a string, and when I try to convert it back to datetime, the 00:00:00 in the date column disappears:
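As it turns out, this is display behavior rather than data loss: pandas hides the time component in the repr when every value in a datetime64 column is midnight. A quick toy check:

import pandas as pd

s = pd.to_datetime(pd.Series(['2021-01-04 00:00:00', '2021-01-11 00:00:00']))
print(s)                                    # shown as 2021-01-04, 2021-01-11
print(s.dt.strftime('%Y-%m-%d %H:%M:%S'))   # the midnight times are still there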
With that done, I wanted to merge the data frames by the date and fill down, so that all the hourly OHLC rows on a given date also have columns displaying their weekly OHLC values.
As of right now, this is not working: the merge only keeps the dates common to both data frames and omits the rest:
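Presumably the merge was called with its default how='inner', which keeps only the keys present in both frames; a hedged guess at the failing call:

# Default how='inner': only dates appearing in BOTH frames survive.
merged_df = hourly_df.merge(weekly_df, on='date')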
Is there an easier way to do this? Most of the methods I have tried return an error.
The two data frame CSV files:
In case you need to test it, here are the two CSV files:
Hourly
Weekly
Any help would be appreciated. Thanks in advance!
For anyone who would face a similar issue in the future, here's how I fixed it:
Since the datetime format applied to the dataframe does not enforce 00:00:00, I offset the time by 1 second, to 00:00:01, in both dataframes as follows:
hourly_df['date'] = hourly_df['date'] + pd.DateOffset(seconds=1)
weekly_df['date'] = weekly_df['date'] + pd.DateOffset(seconds=1)
Because the times are no longer plain midnight, this enforces the same full-timestamp format on the weekly df as well.
Finally, now that the date columns match, I can merge and ffill them as follows:
merged_df = hourly_df.merge(weekly_df, on=['date'], how='left').ffill()
which merges and displays the result as follows:
Do let me know if anyone else finds another way to solve this by keeping the original time. Cheers!
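One such way, sketched here as a suggestion rather than taken from the thread: pd.merge_asof matches each hourly row to the most recent weekly row and leaves the timestamps untouched.

import pandas as pd

# merge_asof requires both frames to be sorted by the key column.
hourly_df = hourly_df.sort_values('date')
weekly_df = weekly_df.sort_values('date')

# Each hourly row picks up the last weekly bar at or before its timestamp.
merged_df = pd.merge_asof(hourly_df, weekly_df, on='date',
                          direction='backward', suffixes=('', '_weekly'))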
Related
I am trying to join two dataframes based on their dates. I have a column in each for date and time, formatted like the following:
'%y-%m-%d'
and I would like to join the two datasets if their dates are 3 or fewer days apart. Appreciate the help! Thanks!
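A hedged suggestion, since no answer is recorded here: pd.merge_asof with a tolerance can pair rows whose dates fall within a window. A minimal sketch, assuming both frames have a 'date' column in the question's format:

import pandas as pd

# Parse the date columns first ('%y-%m-%d' means a two-digit year).
df1['date'] = pd.to_datetime(df1['date'], format='%y-%m-%d')
df2['date'] = pd.to_datetime(df2['date'], format='%y-%m-%d')

# merge_asof requires both frames sorted by the key; rows more than
# 3 days from any partner simply get NaN for the right-hand columns.
joined = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', tolerance=pd.Timedelta(days=3),
                       direction='nearest')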
I have tick data that I would like to refine by removing the first and last rows of each day. The original dataframe has a datetime64[ns] index with a format of '%Y-%m-%d %H:%M:%S'
To do so I used
df.resample('D').first()
df.resample('D').last()
and successfully sampled out the first and last rows of each day
The problem is that when resampling by day, the original datetime index transforms into a '%Y-%m-%d' format.
How do I use resample so that it retains the original datetime index format?
Or is there a way to reformat the datetime index in the new dataframe so it displays down to seconds?
IIUC
Your problem is that you are resampling daily and taking the first value per day, but you also want to keep the full timestamp associated with that first value.
You want to aggregate the date in your index as well.
df.assign(NewDate=df.index).resample('D').first().set_index('NewDate')
Or you can resample the index and grab min values
df.loc[df.index.to_series().resample('D').min()]
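A toy end-to-end check of both approaches (hedged: it assumes every calendar day in the range has at least one row; otherwise the empty days produced by the daily resample need dropping first):

import pandas as pd

idx = pd.to_datetime(['2018-01-02 09:30:01', '2018-01-02 15:59:58',
                      '2018-01-03 09:30:00', '2018-01-03 15:59:59'])
df = pd.DataFrame({'price': [100.0, 101.5, 101.0, 102.2]}, index=idx)

# Approach 1: carry the timestamps along as a column, then restore them.
first = df.assign(NewDate=df.index).resample('D').first().set_index('NewDate')

# Approach 2: select the rows whose timestamps are the daily minima.
first_alt = df.loc[df.index.to_series().resample('D').min()]

print(first)   # the index keeps the full '%Y-%m-%d %H:%M:%S' timestamps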
I've got a DataFrame that looks like this:
It has two columns, one of them being a "from" datetime and one of them being a "to" datetime. I would like to change this DataFrame such that it has a single column or index for the date (e.g. 2015-07-06 00:00:00 in datetime form) with the variables of the other columns (like deep) split proportionately into each of the days. How might one approach this problem? I've meddled with groupby tricks and I'm not sure how to proceed.
So I don't have time to work through your specific problem at the moment, but the way to approach this is to use pandas.resample(). Here are the steps I would take: 1) resample on your "to" date column by minute; 2) populate the other columns out over that resample; 3) add the date column back in as an index.
If this doesn't work or is tricky to work with, I would create a date range from your earliest date to your latest date (at the smallest interval you want, so maybe hourly?) and then run some conditional statements over your other columns to fill in the data.
Here is roughly what your code might look like for the resample portion (replace day with hour or whatever):
drange = pd.date_range('01-01-1970', '01-20-2018', freq='D')
data = data.resample('D').ffill()   # Resampler.fillna(method='ffill') is deprecated
data.index.name = 'date'
# or reindex onto the explicit range: data = data.reindex(drange, method='ffill')
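For the proportional split the question actually asks about, here is a rough sketch, not from the original answer; the column names 'from', 'to', and 'deep' are assumed from the question, and each interval is assumed to have 'from' < 'to':

import pandas as pd

df = pd.DataFrame({
    'from': pd.to_datetime(['2015-07-06 18:00', '2015-07-07 06:00']),
    'to':   pd.to_datetime(['2015-07-07 06:00', '2015-07-07 18:00']),
    'deep': [24.0, 12.0],
})

rows = []
for _, r in df.iterrows():
    # Cut each interval at the midnight boundaries it crosses.
    days = pd.date_range(r['from'].normalize(),
                         r['to'].normalize() + pd.Timedelta(days=1), freq='D')
    edges = [r['from']] + [d for d in days if r['from'] < d < r['to']] + [r['to']]
    total = (r['to'] - r['from']).total_seconds()
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Each day receives a share of 'deep' proportional to its overlap.
        rows.append({'date': lo.normalize(),
                     'deep': r['deep'] * (hi - lo).total_seconds() / total})

daily = pd.DataFrame(rows).groupby('date').sum()
print(daily)   # 2015-07-06 gets 12.0, 2015-07-07 gets 24.0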
Hope this helps!
Beginner python (and therefore pandas) user. I am trying to import some data into a pandas dataframe. One of the columns is the date, but in the format "YYYYMM". I have attempted to do what most forum responses suggest:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
This doesn't work though (ValueError: unconverted data remains: 3). The column actually includes an additional value for each year, with MM=13. The source used this row as an average of the past year. I am guessing to_datetime is having an issue with that.
Could anyone offer a quick solution, either to strip out all of the annual averages (those with the last two digits "13"), or to have to_datetime ignore them?
Pass errors='coerce' and then dropna the NaT rows:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce')
df_cons = df_cons.dropna(subset=['YYYYMM'])
The duff month values will get converted to NaT values
In[36]:
pd.to_datetime('201613', format='%Y%m', errors='coerce')
Out[36]: NaT
Alternatively you could filter them out before the conversion
df_cons['YYYYMM'] = pd.to_datetime(df_cons.loc[df_cons['YYYYMM'].str[-2:] != '13', 'YYYYMM'], format='%Y%m', errors='coerce')
although this could lead to alignment issues, as the returned Series needs to be the same length, so just passing errors='coerce' is the simpler solution
Clean up the dataframe first.
df_cons = df_cons[~df_cons['YYYYMM'].str.endswith('13')]
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
May I suggest turning the column into a period index, if the YYYYMM column is unique in your dataset.
First turn YYYYMM into the index, then convert it to a monthly period.
df_cons = df_cons.reset_index().set_index('YYYYMM').to_period('M')   # assumes YYYYMM is already datetime
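A quick self-contained check of what that produces, with toy data:

import pandas as pd

df_cons = pd.DataFrame({'YYYYMM': ['201601', '201602'], 'value': [10, 20]})
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
df_cons = df_cons.set_index('YYYYMM').to_period('M')
print(df_cons.index)   # PeriodIndex(['2016-01', '2016-02'], dtype='period[M]', ...)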
This seems like it would be fairly straightforward, but after nearly an entire day I have not found the solution. I've loaded my dataframe with read_csv and easily parsed, combined, and indexed a date and a time column into one column, but now I want to be able to reshape and perform calculations based on hour and minute groupings, similar to what you can do in an Excel pivot.
I know how to resample to hour or minute, but it maintains the date portion associated with each hour/minute, whereas I want to aggregate the data set ONLY by hour and minute, similar to grouping in Excel pivots and selecting "hour" and "minute" but not selecting anything else.
Any help would be greatly appreciated.
Can't you do this, where df is your DataFrame:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.dt.hour, times.dt.minute]).value_col.sum()
Wes' code didn't work for me. But the DatetimeIndex function (docs) did:
times = pd.DatetimeIndex(data.datetime_col)
grouped = df.groupby([times.hour, times.minute])
The DatetimeIndex object is a representation of times in pandas. The first line creates an array of the datetimes. The second line uses this array to get the hour and minute data for all of the rows, allowing the data to be grouped (docs) by these values.
Came across this when I was searching for this type of groupby. Wes' code above didn't work for me; not sure if that's because of changes in pandas over time.
In pandas 0.16.2, what I did in the end was:
grp = data.groupby(by=[data.datetime_col.map(lambda x : (x.hour, x.minute))])
grp.count()
You'd have (hour, minute) tuples as the grouped index. If you want multi-index:
grp = data.groupby(by=[data.datetime_col.map(lambda x : x.hour),
data.datetime_col.map(lambda x : x.minute)])
I have an alternative to Wes' and Nix's answers above, with just one line of code. Assuming your column is already a datetime column, you don't need to get the hour and minute attributes separately:
df.groupby(df.timestamp_col.dt.time).value_col.sum()
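Note that dt.time carries seconds as well, so this groups at the full time-of-day resolution of your data rather than just hour and minute; a quick check with made-up data:

import pandas as pd

df = pd.DataFrame({
    'timestamp_col': pd.to_datetime(['2022-01-01 09:30:00',
                                     '2022-01-02 09:30:00',
                                     '2022-01-01 09:30:15']),
    'value_col': [1, 2, 4],
})
# The two 09:30:00 rows from different days collapse into one group;
# the 09:30:15 row stays separate.
print(df.groupby(df.timestamp_col.dt.time).value_col.sum())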
This might be a little late, but I found quite a good solution for anyone who has the same problem.
I have a df like this:
datetime value
2022-06-28 13:28:08 15
2022-06-28 13:28:09 30
... ...
2022-06-28 14:29:11 20
2022-06-28 14:29:12 10
I want to convert these timestamps, which are at one-second intervals, into timestamps at one-minute intervals, summing up the value column in the process.
There is a neat way of doing it:
df['datetime'] = pd.to_datetime(df['datetime']) #if not already as datetime object
grouped = df.groupby(pd.Grouper(key='datetime', axis=0, freq='T')).sum()
print(grouped.head())
Result:
datetime value
2022-06-28 13:28:00 45
... ...
2022-06-28 14:29:00 30
freq='T' stands for minutes. You could also group by hours or days; these frequency strings are called offset aliases.
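For example, to group the same frame by hour instead (hedged: newer pandas versions prefer the lowercase aliases 'min' and 'h' over 'T' and 'H'):

# Sum everything that falls within each hour.
hourly = df.groupby(pd.Grouper(key='datetime', freq='h')).sum()
print(hourly)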