Pandas, Data Frame set_index & reindexing - python

I want read in weekday data and then reindex the data to fill the weekend with Friday's data. I have tried the following code but it will not reindex the data. Set_index produces a length error message.
import pandas as pd
def fill_dataframe(filename):
dataf = pd.read_csv(filename, header= None, index_col = [0])
return(dataf)
rng = pd.date_range('10/1/2010', periods=61)
date_rng = pd.DataFrame(rng,index = rng)
data_1.reindex(date_rng, method = 'ffill')
The data read in has 41 rows and the generated date values have 61 rows. Any suggestions?
data read in by csv (1st 7 rows)
X0 X1
10/1/2010 71.27
10/4/2010 70.33
10/5/2010 72.94
10/6/2010 74.15
10/7/2010 71.40
10/8/2010 72.58
10/11/2010 72.66
dates generated by rng in the second Data Frame (first 11 rows)
0
2010-10-01 2010-10-01 00:00:00
2010-10-02 2010-10-02 00:00:00
2010-10-03 2010-10-03 00:00:00
2010-10-04 2010-10-04 00:00:00
2010-10-05 2010-10-05 00:00:00
2010-10-06 2010-10-06 00:00:00
2010-10-07 2010-10-07 00:00:00
2010-10-08 2010-10-08 00:00:00
2010-10-09 2010-10-09 00:00:00
2010-10-10 2010-10-10 00:00:00
2010-10-11 2010-10-11 00:00:00

Reindexing just by the (1D) timeseries or as a Series this works (in 0.10.1):
data_1.reindex(rng, method = 'ffill')
data_1.reindex(Series(rng, index=rng), method = 'ffill')
.
With date_rng as the DataFrame I get TypeError: Cannot compare Timestamp with 0, I suspect this could be a bug, but I'm not entirely sure what the expected behaviour should be...

Related

Time Series Data Reformat

I am working on some code that will rearrange a time series. Currently I have a standard time series. I have a three columns with with the header being [Date, Time, Value]. I want to reformat the dataframe to index with the date and use a header with the time (i.e. 0:00, 1:00, ... , 23:00). The dataframe will be filled in with the value.
Here is the Dataframe currently have
essentially I'd like to mve the index toa single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
You could try this.
Split the time part to get only the hour. Add hr to it.
df = pd.DataFrame([['2019-01-01', '00:00:00',-127.57],['2019-01-01', '01:00:00',-137.57],['2019-01-02', '00:00:00',-147.57],], columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN

Extend datetimeindex to previous times in pandas

MRE:
idx = pd.date_range('2015-07-03 08:00:00', periods=30,
freq='H')
data = np.random.randint(1, 100, size=len(idx))
df = pd.DataFrame({'index':idx, 'col':data})
df.set_index("index", inplace=True)
which looks like:
col
index
2015-07-03 08:00:00 96
2015-07-03 09:00:00 79
2015-07-03 10:00:00 15
2015-07-03 11:00:00 2
2015-07-03 12:00:00 84
2015-07-03 13:00:00 86
2015-07-03 14:00:00 5
.
.
.
Note that dataframe contain multiple days. Since frequency is in hours, starting from 07/03 08:00:00 it will contain hourly date.
I want to get all data from 05:00:00 including day 07/03 even if it will contain value 0 in "col" column.
I want to extend it backwards so it starts from 05:00:00.
No I just can't start from 05:00:00 since I already have dataframe that starts from 08:00:00. I am trying to keep everything same but add 3 rows in the beginning to include 05:00:00, 06:00:00, and 07:00:00
The reindex method is handy for changing the index values:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
# use the index param to set index or you might lose the freq
df = pd.DataFrame({'col':data}, index=idx)
# reindex with a new index
start = df.tshift(-3).index[0]
end = df.index[-1]
new_index = pd.date_range(start, end, freq='H')
new_df = df.reindex(new_index)
resample is also very useful for date indices
Just change the time from 08:00:00 to 05:00:00 in your code and create 3 more rows and update this dataframe to the existing one.
idx1 = pd.date_range('2015-07-03 05:00:00', periods=3,freq='H')
df1 = pd.DataFrame({'index': idx1 ,'col':np.random.randint(1,100,size = 3)})
df1.set_index('index',inplace=True)
df = df1.append(df)
print(df)
Add this snippet to your code...

How I can group timestamp column to hourly and aggregate the rows in pandas dataframe

I have a dataframe with data for each minutes, it also contains a date column which is used to keep track of the date in timestamp format.
Here I'm trying to aggregate the data by hours instead of minute.
I tried the following code which is working but it needs to index based on date column which I don't want because then I cannot loop through the dataframe using df.loc function.
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='T')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index('date')
df.index = pd.to_datetime(df.index, unit='s')
df = df.resample('H').sum()
df.head(15)
I also tried groupby but it's not working, following is the code.
df.groupby([df.date.dt.hour]).data.sum()
print(df.head(15))
How I can groupby date without indexing it?
Thanks.
Try pd.Grouper and specify the freq parameter:
df.groupby([pd.Grouper(key='date', freq='1H')]).sum()
Full code:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='T')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
print(df.groupby([pd.Grouper(key='date', freq='1H')]).sum())
# data
# date
# 2018-01-01 00:00:00 2958
# 2018-01-01 01:00:00 3084
# 2018-01-01 02:00:00 2991
# 2018-01-01 03:00:00 3021
# 2018-01-01 04:00:00 2894
# ... ...
# 2018-01-07 20:00:00 2863
# 2018-01-07 21:00:00 2850
# 2018-01-07 22:00:00 2823
# 2018-01-07 23:00:00 2805
# 2018-01-08 00:00:00 25
# [169 rows x 1 columns]
Hope that helps !

Pandas: How to group the non-continuous date column?

I have a column in a dataframe which contains non-continuous dates. I need to group these date by a frequency of 2 days. Data Sample(after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried following but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is these a way to achieve the desired grouping using pandas or should I write a separate logic?
Once you have a l_date sorted dataframe. you can create a continuous dummy date (dum_date) column and groupby 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.datetime.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
OR
If you are fine with groupings other than dates. then a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))

How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap from 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with median gap duration of "x", etc. Also, I can see that 2016-09-05 and 2016-09-06 is a two day range. How I can detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas version 1.0.1 has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is you can use pandas series functions like mean() to quickly compute summary statistics on the gaps series object
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.iteritems():
gap_start = df['date'][i - 1]
print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
f'Duration: {str(g.to_pytimedelta())}')
here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()
The cumsum() creates a grouping variable on 'ones' partitioning your data at the points your provided. If you print df to say a spreadsheet it will make sense:
print df.head()
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print df.tail()
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby(['ones']).aggregate({'ones':{'gaps':'count'},'index':{'first_spotted':'min'}})
df.columns = df.columns.droplevel()
which gives:
first_time gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1

Categories