Reindexing timeseries data - python

I have an issue similar to "ValueError: cannot reindex from a duplicate axis". The solution isn't provided there.
I have an Excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although that is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')  # raw string so \D and \A are not treated as escapes
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts = ts.reindex(idx)  # reindex returns a new frame; assign it back
I just want to have my index at 5-minute frequency so that I can interpolate the NaN values later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
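A note on the ValueError in the title: it usually means the index being reindexed contains duplicate labels. Parsing only the Time column can produce exactly that, because to_datetime attaches the same default date to every time-of-day value, so the same clock time from different days collapses onto one timestamp. A minimal demonstration (toy values, not the actual sheet):

```python
import pandas as pd

# '12:05' occurs on many different days in the sheet, but parsing the time
# alone gives every value the same default date, so the labels collide
times = pd.to_datetime(pd.Series(['12:05 AM', '12:05 AM']), format='%I:%M %p')
print(times.is_unique)  # False -- reindex refuses to work on such an index
```

Combining Date and Time into one full timestamp before setting the index avoids the collision.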

One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
Output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again afterwards.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
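One caveat worth checking: the concatenation above only works if both columns were read in as strings (cast with .astype(str) first if read_excel parsed them as datetimes). A quick sanity check of the no-separator format string on one sample row, assuming string inputs:

```python
import pandas as pd

# the format has no separator between %Y and %I, mirroring the concatenation
raw = '4/1/2018' + '12:05 AM'
parsed = pd.to_datetime(raw, format='%d/%m/%Y%I:%M %p')
print(parsed)  # Timestamp('2018-01-04 00:05:00') -- day-first, as in the output above
```

This also makes the day-first interpretation visible: '4/1/2018' becomes January 4th, which is why the DateTime column above reads 2018-01-04.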

You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5min').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
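Since the end goal is interpolation, the resample can be chained straight into interpolate. A minimal sketch with toy data standing in for the spreadsheet (note mean() averages rows within a bin; with one reading per bin it behaves like last()):

```python
import pandas as pd

# toy stand-in for the Excel data: readings at 00:05, 00:20 and 00:40
ts = pd.DataFrame(
    {'Temp': [30.6, 30.7, 30.9]},
    index=pd.to_datetime(['2018-04-01 00:05', '2018-04-01 00:20', '2018-04-01 00:40']))

# resample to a regular 5-minute grid, then fill the gaps linearly in time
filled = ts.resample('5min').mean().interpolate(method='time')
print(filled)
```

The intermediate bins (00:10, 00:15, ...) come out as NaN from the resample and are then filled proportionally to the elapsed time between the surrounding readings.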

Set the Time column as the index, making sure it is datetime dtype, then try
ts.asfreq('5T')
Use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
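Note that '5T' and '5min' denote the same frequency (the 'T' alias is deprecated in recent pandas), and that asfreq, like reindex, requires a unique, sorted DatetimeIndex. A small demonstration on toy data:

```python
import pandas as pd

ts = pd.DataFrame({'Temp': [30.6, 30.7]},
                  index=pd.to_datetime(['2018-04-01 00:05', '2018-04-01 00:15']))

gaps = ts.asfreq('5min')                    # inserts a NaN row at 00:10
filled = ts.asfreq('5min', method='ffill')  # 00:10 takes 30.6 from the row before
print(gaps)
print(filled)
```

Unlike resample, asfreq does no aggregation: it simply snaps the existing rows onto the new grid and leaves (or fills) everything else.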

I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example three observations are read in as NaN, plus the rows for 01:15 and 01:25 are missing entirely.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see the catch above: your date data was converted to datetime, but your time data is stored as object dtype (time-of-day values, not full timestamps). Below, a proper index is created by use of a lambda function.
rawidx = rawpd.apply(lambda r: pd.Timestamp.combine(r['Date'], r['Time']), axis=1)  # pd.datetime was removed in pandas 2.0; Timestamp.combine is the modern equivalent
print(rawidx)
This can be applied to the rawpd dataframe as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a frame ready for interpolation.
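The steps above can be condensed into one runnable sketch (toy data in place of raw.xlsx; pd.Timestamp.combine replaces the removed pd.datetime):

```python
import datetime as dt
import numpy as np
import pandas as pd

# toy stand-in for the Excel input: one NaN value and two missing rows
rawpd = pd.DataFrame({
    'Date': pd.to_datetime(['2018-04-01'] * 3),
    'Time': [dt.time(1, 0), dt.time(1, 5), dt.time(1, 20)],
    'Col1': [1.0, 2.0, np.nan]})

# combine Date and Time into a proper DatetimeIndex
rawidx = rawpd.apply(lambda r: pd.Timestamp.combine(r['Date'], r['Time']), axis=1)
rawpd2 = pd.DataFrame({'Col1': rawpd['Col1'].values}, index=rawidx).sort_index()

# blank target frame on the ideal 5-minute grid, then fill it from the raw data
targpd = pd.DataFrame(np.nan,
                      index=pd.date_range('2018-04-01 01:00', periods=5, freq='5min'),
                      columns=['Col1'])
targpd.update(rawpd2, overwrite=True)  # NaN values in rawpd2 never overwrite
print(targpd)
```

A detail worth noting: update skips NaN values in the incoming frame, so an observation that was read in as NaN leaves the target cell as NaN, exactly as in the printed table above.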

I have got it to work. Thank you, everyone, for your time. I am providing the working code below.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
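Since the stated goal was to interpolate the NaN rows afterwards, one possible follow-up (a sketch on toy data standing in for the sheet, not the actual file):

```python
import pandas as pd

# toy stand-in for the parsed Date_Time/Temp columns
df = pd.DataFrame({
    'Date_Time': pd.to_datetime(['2018-04-01 00:05', '2018-04-01 00:20']),
    'Temp': [30.6, 30.7]})

# 5-minute grid as in the accepted code, then time-weighted interpolation
df = df.set_index('Date_Time').resample('5min').last()
df['Temp'] = df['Temp'].interpolate(method='time')
print(df)
```

method='time' weights the fill by elapsed time rather than row position, which matters if any gaps are longer than one interval.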

Related

Resampling on a multi index

I have a DataFrame with a MultiIndex of muni and popDate.
For each muni index I want to do a resampling of the form .resample('A').mean() on the popDate index, so that Python fills in the missing years. NaN values shall then be replaced by linear interpolation. How do I do that?
Update: Some mock input DataFrame:
interData=pd.DataFrame({'muni':['Q1','Q1','Q1','Q2','Q2','Q2'],'popDate':['2015','2021','2022','2015','2017','2022'],'population':[5,11,22,15,17,22]})
interData['popDate']=pd.to_datetime(interData['popDate'])
interData=interData.set_index(['muni','popDate'])
It looks like you want a groupby.resample:
interData.groupby(level='muni').resample('A', level='popDate').mean()
Output:
population
muni popDate
Q1 2015-12-31 5.0
2016-12-31 NaN
2017-12-31 NaN
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 11.0
2022-12-31 22.0
Q2 2015-12-31 15.0
2016-12-31 NaN
2017-12-31 17.0
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 NaN
2022-12-31 22.0
If you also need interpolation, combine with interpolate:
out = (interData.groupby(level='muni')
.apply(lambda g: g.resample('A', level='popDate').mean()
.interpolate(method='time'))
)
Output:
population
muni popDate
Q1 2015-12-31 5.000000
2016-12-31 6.001825
2017-12-31 7.000912
2018-12-31 8.000000
2019-12-31 8.999088
2020-12-31 10.000912
2021-12-31 11.000000
2022-12-31 22.000000
Q2 2015-12-31 15.000000 # 366 days between 2015-12-31 and 2016-12-31
2016-12-31 16.001368 # 365 days between 2016-12-31 and 2017-12-31
2017-12-31 17.000000
2018-12-31 17.999452
2019-12-31 18.998905
2020-12-31 20.001095
2021-12-31 21.000548
2022-12-31 22.000000

How to impute missing value in time series data with the value of the same day and time from the previous week(day) in python

I have a dataframe with columns of timestamp and energy usage. The timestamp is taken for every minute of the day, i.e., a total of 1440 readings for each day. I have a few missing values in the data frame.
I want to impute those missing values with the mean of the same day, same time from the last two or three weeks. This way, if the previous week is also missing, I can use the value from two weeks ago.
Here's a example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
.groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
.transform(lambda x: x.fillna(x.mean()))
)
So what this does is it uses the average of the usage from the same hour of the day on the whole dataset. I want it to be more precise and use the average of the last two or three weeks.
You can concat together the Series with shift in a loop, as index alignment will ensure each row is matched with the same hour in the previous weeks. Then take the mean and use .fillna to update the original.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
data = np.random.choice([1, 2, 3, 4, np.nan], 10),
columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4): the unshifted series plus shifts of 1-3 weeks
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
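A small clarification of why range(4) works: the x=0 column is the unshifted series itself, but it never contributes to a fill, because fillna only acts where the current value is already NaN. A variant restricted explicitly to the previous three weeks with range(1, 4) gives the same result (toy data mirroring the sample above):

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, 1.0, 2.0, 1.0, np.nan],
              index=pd.date_range('2010-01-03 10:00:00', freq='W', periods=5))

# columns hold the value from 1, 2 and 3 weeks earlier at each timestamp
prev = pd.concat([s.shift(periods=x, freq='W') for x in range(1, 4)], axis=1)
filled = s.fillna(prev.mean(axis=1))
print(filled)
```

For the NaN at 2010-01-31 this averages 1.0, 2.0 and 1.0 from the three preceding Sundays, giving 1.3333 as in the output above.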

Continuous dates for products in Pandas

I started to work with Pandas and I have some issues that I don't really know how to solve.
I have a dataframe with date, product, stock and sales. Some dates and products are missing. I would like to get a timeseries for each product in a range of dates.
For example:
product udsStock udsSales
date
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-30 14 856 0
2019-12-25 4 3132 439
2019-12-27 4 3177 616
2020-01-01 4 500 883
It has to be the same range for all products even if one product doesn't appear in one date in the range.
If I want the range 2019-12-25 to 2020-01-01, the final dataframe should be like this one:
product udsStock udsSales
date
2019-12-25 14 NaN NaN
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-28 14 NaN NaN
2019-12-29 14 NaN NaN
2019-12-30 14 856 0
2019-12-31 14 NaN NaN
2020-01-01 14 NaN NaN
2019-12-25 4 3132 439
2019-12-26 4 NaN NaN
2019-12-27 4 3177 616
2019-12-28 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-31 4 NaN NaN
2020-01-01 4 500 883
I have tried to reindex by the range, but it doesn't work because there are duplicate index values.
idx = pd.date_range('25-12-2019', '01-01-2020')
df = df.reindex(idx)
I also have tried to index by date and product and then reindex, but I don't know how to put the product that is missing.
Any more ideas?
Thanks in advance
We can use pd.date_range and a groupby with reindex to achieve your result:
date_range = pd.date_range(start='2019-12-25', end='2020-01-01', freq='D')
df = df.groupby('product', sort=False).apply(lambda x: x.reindex(date_range))
df['product'] = df.groupby(level=0)['product'].ffill().bfill()
df = df.droplevel(0)
product udsStock udsSales
2019-12-25 14.0 NaN NaN
2019-12-26 14.0 161.0 848.0
2019-12-27 14.0 1340.0 914.0
2019-12-28 14.0 NaN NaN
2019-12-29 14.0 NaN NaN
2019-12-30 14.0 856.0 0.0
2019-12-31 14.0 NaN NaN
2020-01-01 14.0 NaN NaN
2019-12-25 4.0 3132.0 439.0
2019-12-26 4.0 NaN NaN
2019-12-27 4.0 3177.0 616.0
2019-12-28 4.0 NaN NaN
2019-12-29 4.0 NaN NaN
2019-12-30 4.0 NaN NaN
2019-12-31 4.0 NaN NaN
2020-01-01 4.0 500.0 883.0
Convert index to datetime object :
df2.index = pd.to_datetime(df2.index)
Create unique combinations of date and product :
import itertools
idx = pd.date_range("25-12-2019", "01-01-2020")
product = df2["product"].unique()
temp = itertools.product(idx, product)
temp = pd.MultiIndex.from_tuples(temp, names=["date", "product"])
temp
MultiIndex([('2019-12-25', 14),
('2019-12-25', 4),
('2019-12-26', 14),
('2019-12-26', 4),
('2019-12-27', 14),
('2019-12-27', 4),
('2019-12-28', 14),
('2019-12-28', 4),
('2019-12-29', 14),
('2019-12-29', 4),
('2019-12-30', 14),
('2019-12-30', 4),
('2019-12-31', 14),
('2019-12-31', 4),
('2020-01-01', 14),
('2020-01-01', 4)],
names=['date', 'product'])
Reindex dataframe :
df2.set_index("product", append=True).reindex(temp).sort_index(
level=1, ascending=False
).reset_index(level="product")
product udsStock udsSales
date
2020-01-01 14 NaN NaN
2019-12-31 14 NaN NaN
2019-12-30 14 856.0 0.0
2019-12-29 14 NaN NaN
2019-12-28 14 NaN NaN
2019-12-27 14 1340.0 914.0
2019-12-26 14 161.0 848.0
2019-12-25 14 NaN NaN
2020-01-01 4 500.0 883.0
2019-12-31 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-28 4 NaN NaN
2019-12-27 4 3177.0 616.0
2019-12-26 4 NaN NaN
2019-12-25 4 3132.0 439.0
In R, specifically the tidyverse, this can be achieved with the complete verb. In Python, the pyjanitor package has something similar, but a few kinks remain to be ironed out (a PR has already been submitted for this).
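A pandas-only analogue of complete() is to build the full date x product grid with MultiIndex.from_product and reindex onto it. A sketch on a cut-down toy frame (two rows standing in for the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-26', '2019-12-27']),
    'product': [14, 4],
    'udsStock': [161, 3177]})

# every combination of the wanted date range and the observed products
full = pd.MultiIndex.from_product(
    [pd.date_range('2019-12-25', '2019-12-27'), df['product'].unique()],
    names=['date', 'product'])

out = df.set_index(['date', 'product']).reindex(full).reset_index()
print(out)
```

from_product plays the same role as itertools.product in the answer above, but yields a named MultiIndex directly, so no from_tuples step is needed.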

Group by column and resampled date and get rolling sum of other column

I have the following data:
(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
UPDATE:
As per the suggestions of @user3483203 I have gotten a bit further, but not quite there. I've amended the example data above with a new row to illustrate better.
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
For my requirements, the row (SE0000191827, 2018-03-31) should have a calc value since it has four consecutive rows with a value. Currently the row is being removed with the dropna call and I can't figure out how to solve that problem.
What I need
Calculations: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows ranging between the first and last date of each id, and for each month calculate the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This led me to resample. See expected output below. I need the data to be grouped by both id and the monthly dates.
Performance: The data I'm testing on now is just for benchmarking, but I will need the solution to be performant. I'm expecting to run this on upwards of 100k unique ids, which may result in around 12 million rows (100k ids, dates ranging back up to 10 years, 10 years * 12 months = 120 months per id, 100k * 120 = 12 million rows).
What I've tried
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
This data looks very nice for my case, since it's nicely grouped by id and has the dates lined up by month. Here it seems like I could use something like df['val'].rolling(4), make sure it skips NaN values, and put that result in a new column.
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
Here the result in calc is 26 for the first date since it adds the value and the three that follow it (8+7+6+5). The rest for that id is NaN, since four subsequent values are not available.
The problems
While it may look like the data is grouped by id and date, it seems to actually be grouped only by date. I'm not sure how this works. I need the data to be grouped by id and date.
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The result of the resample above returns a DatetimeIndexResamplerGroupby which doesn't have rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What to do? My guess is that my approach is wrong but after scouring the documentation I'm not sure where to start.
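A hedged sketch of one way out, following the @user3483203 hint quoted above: compute the 4-row rolling sum on the raw quarterly rows per id first, and only then expand to the monthly grid, so the NaN grid rows cannot break the window (shown on a cut-down version of the sample data; performance at 12 million rows would need separate benchmarking):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['SE0000195570'] * 4,
    'val': [4.0, 3.0, 2.0, 1.0],
    'date': pd.to_datetime(['2014-01-31', '2014-04-29', '2014-07-16', '2014-10-23'])})
df = df.sort_values(['id', 'date'])

# each observation plus the three that follow it, within each id
df['calc'] = (df.groupby('id')['val']
                .transform(lambda s: s.rolling(4).sum().shift(-3)))

# only now expand to monthly rows; months without an observation become NaN
monthly = df.groupby('id').resample('M', on='date').first()
print(monthly)
```

For this id the January row gets calc = 4+3+2+1 = 10, matching the expected output, and the NaN months never enter the window because the sum was taken before resampling.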

Python Pandas resample, odd behaviour

I have two datasets (cex2.txt and cex3.txt) which I would like to resample in pandas. With one dataset I get the expected output, with the other I don't.
The datasets are tick data and are formatted identically. In fact, the two datasets are just from two different days.
import pandas as pd
import datetime as dt
pd.set_option ('display.mpl_style', 'default')
time_converter = lambda x: dt.datetime.fromtimestamp(float(x))
data_frame = pd.read_csv('cex2.txt', sep=';', converters={'time': time_converter})
data_frame.drop('Unnamed: 7', axis=1, inplace=True)
data_frame.drop('low', axis=1, inplace=True)
data_frame.drop('high', axis=1, inplace=True)
data_frame.drop('last', axis=1, inplace=True)
data_frame = data_frame.reindex_axis(['time', 'ask', 'bid', 'vol'], axis=1)
data_frame.set_index(pd.DatetimeIndex(data_frame['time']), inplace=True)
ask = data_frame['ask'].resample('15Min', how='ohlc')
bid = data_frame['bid'].resample('15Min', how='ohlc')
vol = data_frame['vol'].resample('15Min', how='sum')
print ask
from the cex2.txt dataset I get this wrong output:
open high low close
1970-01-01 01:00:00 NaN NaN NaN NaN
1970-01-01 01:15:00 NaN NaN NaN NaN
1970-01-01 01:30:00 NaN NaN NaN NaN
1970-01-01 01:45:00 NaN NaN NaN NaN
1970-01-01 02:00:00 NaN NaN NaN NaN
1970-01-01 02:15:00 NaN NaN NaN NaN
1970-01-01 02:30:00 NaN NaN NaN NaN
1970-01-01 02:45:00 NaN NaN NaN NaN
1970-01-01 03:00:00 NaN NaN NaN NaN
1970-01-01 03:15:00 NaN NaN NaN NaN
from the cex3.txt dataset I get correct values:
open high low close
2014-08-10 13:30:00 0.003483 0.003500 0.003483 0.003485
2014-08-10 13:45:00 0.003485 0.003570 0.003467 0.003471
2014-08-10 14:00:00 0.003471 0.003500 0.003470 0.003494
2014-08-10 14:15:00 0.003494 0.003500 0.003493 0.003498
2014-08-10 14:30:00 0.003498 0.003549 0.003498 0.003500
2014-08-10 14:45:00 0.003500 0.003533 0.003487 0.003533
2014-08-10 15:00:00 0.003533 0.003600 0.003520 0.003587
I'm really at my wits' end. Does anyone have an idea why this happens?
Edit:
Here are the data sources:
https://dl.dropboxusercontent.com/u/14055520/cex2.txt
https://dl.dropboxusercontent.com/u/14055520/cex3.txt
Thanks!
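One hedged guess at the cause: a 1970-01-01 index means the converter received values at or near zero, since Unix timestamps count seconds from that epoch (the 01:00 offset in the wrong output matches a UTC+1 local clock). That points at cex2.txt's time column not containing what the converter expects, e.g. a stray header row, a different column order, or a different separator. Printing data_frame['time'].head() right after read_csv should confirm it. The epoch connection itself:

```python
import datetime as dt

# a timestamp of zero decodes to the Unix epoch, which is exactly the
# 1970-01-01 index shown in the "wrong" resample output
print(dt.datetime.fromtimestamp(0.0, dt.timezone.utc))  # 1970-01-01 00:00:00+00:00
```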
