How to expand a DateTimeIndex with some months before? - python

I have a time-series corresponding to the end of the month for some dates of interest:
Date
31-01-2005 0.0
28-02-2006 0.0
30-06-2020 0.0
Name: Whatever, dtype: float64
I'd like to expand this series' index with the two month-end samples before each data point, resulting in the following:
Date
30-11-2004 NaN
31-12-2004 NaN
31-01-2005 0.0
31-12-2005 NaN
31-01-2006 NaN
28-02-2006 0.0
30-04-2020 NaN
31-05-2020 NaN
30-06-2020 0.0
Name: Whatever, dtype: float64
How can I do that? Note that I am only interested in the resulting index.
My naive attempt was to do:
df.index.apply(lambda x: [x - pd.DateOffset(months=2), x - pd.DateOffset(months=1), x])
but index doesn't have an apply function.

I think you need DataFrame.reindex with date_range:
idx = [y for x in df.index for y in pd.date_range(x - pd.DateOffset(months=2), x, freq='M')]
df = df.reindex(pd.to_datetime(idx))
print (df)
Whatever
2004-11-30 NaN
2004-12-31 NaN
2005-01-31 0.0
2005-12-31 NaN
2006-01-31 NaN
2006-02-28 0.0
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 0.0
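For reference, the whole approach runs end to end; this is a minimal sketch that rebuilds the question's series (values assumed from the post) and applies the reindex:

```python
import pandas as pd

# Hypothetical series mirroring the question's data.
s = pd.Series(
    [0.0, 0.0, 0.0],
    index=pd.to_datetime(["2005-01-31", "2006-02-28", "2020-06-30"]),
    name="Whatever",
)

# For each month-end in the index, generate it plus the two prior month-ends.
idx = [y for x in s.index
       for y in pd.date_range(x - pd.DateOffset(months=2), x, freq="M")]

expanded = s.reindex(pd.to_datetime(idx))
print(expanded)
```

Each original point contributes three index entries, and the two new ones pick up NaN, which matches the expected output above.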

Related

How to sort and delete columns in a multiindexed dataframe

I have a multiindexed dataframe (but with more columns)
2020-12-22 09:47:50 2020-12-23 16:43:45 2020-12-22 15:00
Lines VehicleNumber
102 9405 3 NaN 3
9415 NaN NaN NaN
9416 NaN NaN NaN
Now I want to sort the columns so that the earliest date comes first and the latest comes last. After that I want to delete the columns that are not between two dates, say 2020-12-22 10:00:00 < date < 2020-12-23 10:00:00. I tried transposing the dataframe, but that doesn't seem to work with a multiindex.
Expected output:
2020-12-22 15:00 2020-12-23 16:43:45
Lines VehicleNumber
102 9405 3 NaN
9415 NaN NaN
9416 NaN NaN
So first we sort the columns by date, then check whether they are between the two dates (2020-12-22 10:00:00 < date < 2020-12-23 10:00:00), hence one column gets deleted.
First convert str columns to date time columns:
In [2244]: df.columns = pd.to_datetime(df.columns)
Then, sort df based on datetimes:
In [2246]: df = df.reindex(sorted(df.columns), axis=1)
Suppose you want to keep only the columns that are greater than the following:
In [2251]: x = '2020-12-22 10:00:00'
Use a list comprehension:
In [2257]: m = [i for i in df.columns if i > pd.to_datetime(x)]
In [2258]: df[m]
Out[2258]:
2020-12-22 15:00:00 2020-12-23 16:43:45
Lines VehicleNumber
102 9405.0 3.0 NaN
9415 NaN NaN NaN
9416 NaN NaN NaN

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, time has been bucketed into 5-minute intervals, and each distinct value in the variables column becomes its own column across all buckets, with each bucket taking on the very first timestamp that fell into it.
To solve this I have tried a couple of different approaches, but everything I try keeps producing errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index level (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
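Put together on toy data shaped like the question's long frame (values taken from the post), the second approach runs as:

```python
import pandas as pd

# Toy data shaped like the question's long frame.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-05-26 19:46:41.289", "2017-05-26 20:40:41.243",
        "2017-05-02 20:54:41.574", "2017-04-29 11:17:25.126"]),
    "variables": ["inf", "tubavg", "caspre", "tvol"],
    "value": [0.0, 225.489639, 684.486450, 50.895],
})

# Group into 5-minute buckets and by variable, then pivot variables to columns.
out = (df.groupby([pd.Grouper(freq="5min", key="timestamp"), "variables"])
         .mean()
         .unstack(1))
out.columns = out.columns.droplevel()
out = out.reset_index()
print(out)
```

Because the grouper is combined with a second key, only observed (bucket, variable) pairs appear, giving the compact four-row result shown above.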

pandas: given a start and end date, add a column for each day in between, then add values?

This is my data:
df = pd.DataFrame([
    {'start_date': '2019/12/01', 'end_date': '2019/12/05', 'spend': 10000, 'campaign_id': 1},
    {'start_date': '2019/12/05', 'end_date': '2019/12/09', 'spend': 50000, 'campaign_id': 2},
    {'start_date': '2019/12/01', 'end_date': '', 'spend': 10000, 'campaign_id': 3},
    {'start_date': '2019/12/01', 'end_date': '2019/12/01', 'spend': 50, 'campaign_id': 4},
])
I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, the content of the five columns for 1 Dec to 5 Dec would be 2000, then for the six columns from 5 Dec to 10 Dec it would be zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
This doesn't seem like a straightforward task to me. First, convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, daterange):
    temp = (data.set_index('start_date').groupby('campaign_id')
                .apply(daterange)
                .drop("campaign_id", axis=1)
                .reset_index().rename(columns={"level_1": "start_date"}))
    return temp
Now it's a three-step process. First, resample your data according to the end_date of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample again up to the current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally, reshape your dataframe into wide format:
df1 = pd.concat([df1.drop("end_date", axis=1).set_index(["campaign_id", "start_date"]).unstack(),
                 df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates, "end_date"]
print(df1)
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01
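As an alternative sketch (not the answer above): each campaign row can be exploded into one row per active day and the spend split evenly across those days. The names daily and wide are illustrative, and a blank end_date is assumed to mean the campaign is still active, clipped to a fixed "today":

```python
import pandas as pd

# Hypothetical reconstruction of the question's data.
today = pd.Timestamp("2019-12-10")
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2019/12/01", "2019/12/05", "2019/12/01", "2019/12/01"]),
    "end_date":   pd.to_datetime(["2019/12/05", "2019/12/09", None, "2019/12/01"]),
    "spend": [10000, 50000, 10000, 50],
    "campaign_id": [1, 2, 3, 4],
})

# One list of active days per campaign; open-ended campaigns run until today.
df["day"] = [list(pd.date_range(s, e if pd.notna(e) else today, freq="D"))
             for s, e in zip(df["start_date"], df["end_date"])]

# One row per active day, spend split evenly over the active days.
daily = df.explode("day")
daily["daily_spend"] = daily["spend"] / daily.groupby("campaign_id")["day"].transform("size")

# Pivot to one column per day; inactive days stay NaN (fillna(0) if zeros are preferred).
wide = daily.pivot(index="campaign_id", columns="day", values="daily_spend")
print(wide)
```

This yields 2000 per day for campaign 1 across 1-5 Dec and NaN elsewhere, matching the per-day spend logic described in the question.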

How to flatten a pandas DataFrameGroupBy

I have a grouped object which is of type DataFrameGroupBy. I want to use this to aggregate some data like so:
aggregated = grouped.aggregate([np.sum, np.mean], axis=1)
This returns a DataFrame with the format:
aggregated[:3].to_dict()
"""
{('VALUE1', 'sum'): {
('US10adam034', 'PRCP'): 701,
('US10adam036', 'PRCP'): 1015,
('US10adam036', 'SNOW'): 46},
('VALUE1', 'mean'): {
('US10adam034', 'PRCP'): 100.14285714285714,
('US10adam036', 'PRCP'): 145.0,
('US10adam036', 'SNOW'): 46.0}}
"""
Printing out the head produces this:
VALUE1
sum mean
ID ELEMENT
US10adam034 PRCP 701 100.142857
US10adam036 PRCP 1015 145.000000
SNOW 46 46.000000
US10adam046 PRCP 790 131.666667
US10adam051 PRCP 5 0.555556
US10adam056 PRCP 540 31.764706
SNOW 25 1.923077
SNWD 165 15.000000
This works great. It easily computes sums and means for my sample where the grouped indices are (ID, ELEMENT). However, I'd really like to get this into a single row format where ID is unique and the columns are a combination of ELEMENT & (sum|mean). I can almost get there using apply like so:
def getNewSeries(t):
    # type(t) => Series
    element = t.name[1]  # t.name is a tuple ('ID', 'ELEMENT')
    sum_index = f'{element}sum'
    mean_index = f'{element}mean'
    return pd.Series(t['VALUE1'].values, index=[sum_index, mean_index])

aggregated.apply(getNewSeries, axis=1, result_type='expand')
Printing out the head again I get:
PRCPmean PRCPsum SNOWmean SNOWsum SNWDmean ...
ID ELEMENT
US10adam034 PRCP 100.142857 701.0 NaN NaN NaN
US10adam036 PRCP 145.000000 1015.0 NaN NaN NaN
SNOW NaN NaN 46.000000 46.0 NaN
US10adam046 PRCP 131.666667 790.0 NaN NaN NaN
US10adam051 PRCP 0.555556 5.0 NaN NaN NaN
US10adam056 PRCP 31.764706 540.0 NaN NaN NaN
SNOW NaN NaN 1.923077 25.0 NaN
SNWD NaN NaN NaN NaN 15.0
I would like my final DataFrame to look like this:
PRCPmean PRCPsum SNOWmean SNOWsum SNWDmean ...
ID
US10adam034 100.142857 701.0 NaN NaN NaN
US10adam036 145.000000 1015.0 46.000000 46.0 NaN
US10adam046 131.666667 790.0 NaN NaN NaN
US10adam051 0.555556 5.0 NaN NaN NaN
US10adam056 31.764706 540.0 1.923077 25.0 15.0
Is there a way, using apply, agg or transform, to aggregate this data into single rows? I've also tried creating my own iterator over unique IDs, but it was painfully slow. I like the ease of using agg to compute sum/mean.
I like using f-strings with list comprehensions. Python 3.6+ is required for f-string formatting.
df_out = df.unstack()['VALUE1']
# after unstack the columns are (stat, element) pairs, so swap them when joining
df_out.columns = [f'{j}{i}' for i, j in df_out.columns]
df_out
Output:
PRCPsum SNOWsum PRCPmean SNOWmean
US10adam034 701.0 NaN 100.142857 NaN
US10adam036 1015.0 46.0 145.000000 46.0
You can do:
new_df = agg_df.unstack(level=1)
new_df.columns = [c+b for _,b,c in new_df.columns.values]
Output:
PRCPsum SNOWsum PRCPmean SNOWmean
US10adam034 701.0 NaN 100.142857 NaN
US10adam036 1015.0 46.0 145.000000 46.0
IIUC
aggregated = grouped['VALUE1'].aggregate([np.sum, np.mean], axis=1)
aggregated=aggregated.unstack()
aggregated.columns=aggregated.columns.map('{0[1]}|{0[0]}'.format)
Please check whether reset_index works as per your need:
aggregated.apply(getNewSeries, axis=1, result_type='expand').reset_index()
I think you can try unstack() to move the innermost row index to become the innermost column index and reshape your data.
You can also use fill_value to change NaNs to 0.
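A self-contained sketch of this unstack-and-flatten pattern, rebuilding a small version of the aggregated frame from the post:

```python
import pandas as pd

# Toy aggregated frame shaped like the question's (ID, ELEMENT) result.
idx = pd.MultiIndex.from_tuples(
    [("US10adam034", "PRCP"), ("US10adam036", "PRCP"), ("US10adam036", "SNOW")],
    names=["ID", "ELEMENT"],
)
aggregated = pd.DataFrame(
    {("VALUE1", "sum"): [701, 1015, 46],
     ("VALUE1", "mean"): [100.142857, 145.0, 46.0]},
    index=idx,
)

# Move ELEMENT from the row index into the columns, then flatten the names.
flat = aggregated.unstack()["VALUE1"]
flat.columns = [f"{element}{stat}" for stat, element in flat.columns]
print(flat)
```

Missing (ID, ELEMENT) combinations, such as SNOW for US10adam034, naturally come out as NaN, exactly as in the desired final frame.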

How to create a timeseries from a dataframe of event durations?

I have a dataframe full of bookings for one room (rows: booking_id, check-in date and check-out date) that I want to transform into a timeseries indexed by all the days of the year (index: days of year; feature: booked or not).
I have calculated the duration of the bookings, and reindexed the dataframe daily.
Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.
Tried iterating through each row with ffill but it applies to the entire dataframe, not to selected rows.
Any idea how I can do that?
Here is my code:
import numpy as np
import pandas as pd
#create dataframe
data = [[1, '2019-01-01', '2019-01-02', 1],
        [2, '2019-01-03', '2019-01-07', 4],
        [3, '2019-01-10', '2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
#create timeseries indexed on check-in date
df2 = df.set_index('check-in')
#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)
I have this:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 NaN NaT NaN
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 NaN NaT NaN
2019-01-05 NaN NaT NaN
2019-01-06 NaN NaT NaN
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 NaN NaT NaN
2019-01-12 NaN NaT NaN
2019-01-13 NaN NaT NaN
I expect to have:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 1.0 2019-01-02 1.0
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 2.0 2019-01-07 4.0
2019-01-05 2.0 2019-01-07 4.0
2019-01-06 2.0 2019-01-07 4.0
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 3.0 2019-01-13 3.0
2019-01-12 3.0 2019-01-13 3.0
2019-01-13 NaN NaT NaN
filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values
#reshaping the mask to be the same shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)
ts = ts.ffill().where(m)
First we create a series where the dates are ffilled. Then we create a mask where the index is less than the filled values. Then we fill based on our mask.
If you want to include the row with the check out date, change m from < to <=
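The masked-ffill answer above can be run end to end against the question's own data; this sketch just stitches the two snippets together:

```python
import numpy as np
import pandas as pd

# Rebuild the question's daily-reindexed frame.
data = [[1, "2019-01-01", "2019-01-02", 1],
        [2, "2019-01-03", "2019-01-07", 4],
        [3, "2019-01-10", "2019-01-13", 3]]
df = pd.DataFrame(data, columns=["booking_id", "check-in", "check-out", "duration"])
df["check-in"] = pd.to_datetime(df["check-in"])
df["check-out"] = pd.to_datetime(df["check-out"])
ts = df.set_index("check-in").reindex(
    pd.date_range(df["check-in"].min(), df["check-out"].max(), freq="D"))

# Forward-fill each booking only up to (but excluding) its check-out date.
filluntil = ts["check-out"].ffill()
m = ts.index < filluntil.values
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)  # broadcast mask to frame shape
ts = ts.ffill().where(m)
print(ts)
```

With the strict < comparison, check-out days themselves stay NaN; switch to <= to include them, as the answer notes.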
I think that to "forward-fill the dataframe" you should use the pandas interpolate method. Documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
you can do something like this:
int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')
Look at the specific documentation for interpolate; there is a lot of custom functionality you can add via flags to the method.
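A toy series makes the limit behaviour concrete; note that, as an assumption worth verifying for your pandas version, linear interpolation also pads NaNs after the last valid value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan, np.nan, np.nan])

# limit=2 fills at most two consecutive NaNs in the forward direction,
# so the third NaN of each run is left untouched.
out = s.interpolate(limit=2, limit_direction='forward')
print(out)
```

The interior run is filled with the linear values 2.0 and 3.0 and then stops, while the trailing run is padded with 5.0 for two steps.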
EDIT:
To do this using the row value in the duration column for each interpolation: this is a bit messy, but I think it should work (there may be a cleaner, less hacky solution using some functionality in pandas or another library I am unaware of):
# get rows with NaNs in them:
nans_df = df2[df2.isnull().any(axis=1)]
# get rows without NaNs in them:
non_nans_df = df2[~df2.isnull().any(axis=1)]
# list of dfs we will concat vertically at the end to get the final dataframe.
dfs = []
# iterate through each row that contains NaNs.
for nan_index, nan_row in nans_df.iterrows():
    previous_day = nan_index - pd.DateOffset(1)
    # skip this row if the previous day is also a NaN day; this mostly
    # handles the case where the first row is a NaN one.
    if previous_day not in non_nans_df.index:
        continue
    date_offset = 0
    # count how many sequential all-NaN rows follow this one.
    while (nan_index + pd.DateOffset(date_offset)) in nans_df.index:
        date_offset += 1
    # last date in the continuous run of all-NaN days starting here.
    end_sequence_date = nan_index + pd.DateOffset(date_offset)
    # stack the confirmed non-NaN previous day on top of the run of NaN
    # rows that follows it.
    df_to_interpolate = pd.concat([non_nans_df.loc[[previous_day]],
                                   nans_df.loc[nan_index:end_sequence_date]])
    # pull the duration value from the first (non-NaN) row.
    limit_val = int(df_to_interpolate['duration'].iloc[0])
    # interpolate the dataframe using limit_val.
    df_to_interpolate = df_to_interpolate.interpolate(axis=0, limit=limit_val,
                                                      limit_direction='forward')
    # collect for the final concat.
    dfs.append(df_to_interpolate)
# final dataframe, interpolated forward using a dynamic limit value based on
# the most recent duration value.
final_df = pd.concat(dfs)
