I have a dataframe full of bookings for one room (rows: booking_id, check-in date and check-out date that I want to transform into a timeseries indexed by all year days (index: days of year, feature: booked or not).
I have calculated the duration of the bookings, and reindexed the dataframe daily.
Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.
Tried iterating through each row with ffill but it applies to the entire dataframe, not to selected rows.
Any idea how I can do that?
Here is my code:
import numpy as np
import pandas as pd
#create dataframe
data=[[1, '2019-01-01', '2019-01-02', 1],
[2, '2019-01-03', '2019-01-07', 4],
[3, '2019-01-10','2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
#create timeseries indexed on check-in date
df2 = df.set_index('check-in')
#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)
I have this:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 NaN NaT NaN
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 NaN NaT NaN
2019-01-05 NaN NaT NaN
2019-01-06 NaN NaT NaN
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 NaN NaT NaN
2019-01-12 NaN NaT NaN
2019-01-13 NaN NaT NaN
I expect to have:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 1.0 2019-01-02 1.0
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 2.0 2019-01-07 4.0
2019-01-05 2.0 2019-01-07 4.0
2019-01-06 2.0 2019-01-07 4.0
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 3.0 2019-01-13 3.0
2019-01-12 3.0 2019-01-13 3.0
2019-01-13 NaN NaT NaN
filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values
#reshaping the mask to be shame shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)
ts = ts.ffill().where(m)
First we create a series where the dates are ffilled. Then we create a mask where the index is less than the filled values. Then we fill based on our mask.
If you want to include the row with the check out date, change m from < to <=
I think to "forward-fill the dataframe" you should use pandas interpolate method. Documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
you can do something like this:
int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')
look at the specific documentation for interpolate, there is a lot of custom functionality you can add with flags to the method.
EDIT:
to do this using the row value in the duration column for each interpolation, this is a bit messy but I think it should work (there may be a less hacky, cleaner solution using some functionality in pandas or another library i am unaware of):
#get rows with nans in them:
nans_df = df2[df2.isnull()]
#get rows without nans in them:
non_nans_df = df2[~df2.isnull()]
#list of dfs we will concat vertically at the end to get final dataframe.
dfs = []
#iterate through each row that contains NaNs.
for nan_index, nan_row in nans_df.iterrows():
previous_day = nan_index - pd.DateOffset(1)
#this checks if the previous day to this NaN row is a day where we have non nan values, if the previous day is a nan day just skip this loop. This is mostly here to handle the case where the first row is a NaN one.
if previous_day not in non_nans_df.index:
continue
date_offset = 0
#here we are checking how many sequential rows there are after this one with all nan values in it, this will be stored in the date_offset variable.
while (nan_index + pd.DateOffset(date_offset)) in nans_df.index:
date_offset += 1
#this gets us the last date in the sequence of continuous days with all nan values after this current one.
end_sequence_date = nan_index + pd.DateOffset(date_offset)
#this gives us a dataframe where the first row in it is the previous day to this one(nan_index), confirmed to be non NaN by the first if statement in this for loop. It then combines this non NaN row with all the sequential nan rows after it into the variable df_to_interpolate.
df_to_interpolate = non_nans_df.iloc[previous_day].append(nans_df.iloc[nan_index:end_sequence_date])
# now we pull the duration value for the first row in our df_to_interpolate dataframe.
limit_val = int(df_to_interpolate['duration'][0])
#here we interpolate the dataframe using the limit_val
df_to_interpolate = df_to_interpolate.interpolate(axis=0, limit=limit_val, limit_direction='forward')
#append df_to_interpolate to our list that gets combined at the end.
dfs.append(df_to_interpolate)
#gives us our final dataframe, interpolated forward using a dynamic limit value based on the most recent duration value.
final_df = pd.concat(dfs)
Related
I have a multiindexed dataframe (but with more columns)
2020-12-22 09:47:50 2020-12-23 16:43:45 2020-12-22 15:00
Lines VehicleNumber
102 9405 3 NaN 3
9415 NaN NaN NaN
9416 NaN NaN NaN
Now I want to sort the columns such that I have the earliest date as a first column and the lastest as last. After that I want to delete columns, which are not in between two dates let's say 2020-12-22 10:00:00 < date < 2020-12-23 10:00:00. I tried transposing the dataframe, but it seems not to work when I have a multiindex.
Expected output:
2020-12-22 15:00 2020-12-23 16:43:45
Lines VehicleNumber
102 9405 3 NaN
9415 NaN NaN
9416 NaN NaN
So first we sort the columns by date and then check if they are between the two dates:
2020-12-22 10:00:00 < date < 2020-12-23 10:00:00 hence delete one column
First convert str columns to date time columns:
In [2244]: df.columns = pd.to_datetime(df.columns)
Then, sort df based on datetimes:
In [2246]: df = df.reindex(sorted(df.columns), axis=1)
Suppose you want to keep only column that are greater than following:
In [2251]: x = '2020-12-22 10:00:00'
Use List comprehension:
In [2257]: m = [i for i in df.columns if i > pd.to_datetime(x)]
In [2258]: df[m]
Out[2258]:
2020-12-22 15:00:00 2020-12-23 16:43:45
Lines VehicleNumber
102 9405.0 3.0 NaN
9415 NaN NaN NaN
9416 NaN NaN NaN
I'm trying to find ways of analysing log data from a git repo. I've dumped the data to a file that looks like this:
"hash","email","date","subject"
"65319af6e","jbrockmendel#gmail.com","2020-11-28","REF-IntervalIndex._assert_can_do_setop-38112"
"0bf58d8a9","simonjayhawkins#gmail.com","2020-11-28","DOC-add-contibutors-to-1.2.0-release-notes-38132"
"d16df293c","45562402+rhshadrach#users.noreply.github.com","2020-11-28","TYP-Add-cast-to-ABC-Index-like-types-38043"
"2d661a899","jbrockmendel#gmail.com","2020-11-28","CLN-simplify-mask_missing-38127"
"ba2ae2f73","jbrockmendel#gmail.com","2020-11-28","CLN-remove-unreachable-in-core.dtypes-38128"
I am able to get rows that have a count more than:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
dataframe = pd.read_csv("git-log-2020.csv")
dataframe['date'] = pd.to_datetime(dataframe['date'])
grouped_dataframe = dataframe.groupby([pd.Grouper(key='date', freq='M'), "email"])[["subject"]].count()
# Select all the users that have contributed more than 20 times.
print(grouped_dataframe[grouped_dataframe['subject'] > 20])
But I would like to be able to find the following:
What is the top three users per month?
What are the total commits for each month?
what is the average number of commits per user / per month? What is the monthly average activity of the users?
My code and data can be found here: https://github.com/mooperd/git-analysis-pandas
Ta, Andrew
All answers using your sample data that I call df
Top three :
df.groupby(pd.Grouper(key='date', freq='1M')).apply(lambda d: d['email'].value_counts().sort_values(ascending = False).head(3))
produces
date email
2020-11-30 jbrockmendel#gmail.com 3
simonjayhawkins#gmail.com 1
45562402+rhshadrach#users.noreply.github.com 1
Name: email, dtype: int64
total commits per month
df.groupby(pd.Grouper(key='date', freq='M'))['subject'].count()
output:
date
2020-11-30 5
Freq: M, Name: subject, dtype: int64
as for average number of commits per user/per month not entirely clear what you want? For each user, the ratio of the number of commits for that user by the number of months in your dataset? or the number of months they made actual commits?. Would be useful to get sample output here
But I think the following transformation is useful, it produces a table of # of commits by email and month so you can take averages every which way
df2 = df.groupby([pd.Grouper(key='date', freq='M')])['email'].value_counts().unstack(level=0)
df2
On your real data it produces a table that is too large to reproduce here but it starts with
date 2020-01-31 2020-02-29 2020-03-31 2020-04-30 2020-05-31 2020-06-30 2020-07-31 2020-08-31 2020-09-30 2020-10-31 2020-11-30
email
10430241+xh2#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
12420863+danchev#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
12769364+tnwei#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
12874561+jeet-parekh#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
as you can see it has a lot of NaNs that correspon to users not making any commits that month, which is useful when eg calculating averages over months with commits.For example
df2.mean(axis=1).sort_values(ascending = False)
produces average monthly commits for a user (sorted), and
df2.mean(axis=0).sort_values(ascending = False)
produces average monthly commits per month (sorted)
I have this df:
revenue pct_yoy pct_qoq
2020-06-30 99.721 0.479013 0.092833
2020-03-31 91.250 0.478283 0.087216
2019-12-31 83.930 0.676253 0.135094
2019-09-30 73.941 NaN 0.096657
2019-06-30 67.424 NaN 0.092293
2019-03-31 61.727 NaN 0.232814
2018-09-30 50.070 NaN NaN
However, if you look at last index value with 2018, I seem to be missing 2018-12-31 when looking at the index as a sequential quarterly time-series. The index jumps straight to 2018-9-30.
How to ensure that any missing quarterly dates are inserted with nan values for their respective columns?
I'm not quite sure how to approach this problem.
You'll need to generate a list of your own quarterly dates that includes the missing dates. Then you can use .reindex to re-align your dataframe to this new list of dates.
# Get the oldest and newest dates which will be the bounds
# for our new Index
first_date = df.index.min()
last_date = df.index.max()
# Generate dates for every 3 months (3M) from first_date up to last_date
quarterly = pd.date_range(first_date, last_date, freq="3M")
# realign our dataframe using our new quarterly date index
# this will fill NaN for dates that did not exist in the
# original index
out = df.reindex(quarterly)
# if you want to order this from most recent date to least recent date
# do: out.sort_index(ascending=False)
print(out)
revenue pct_yoy pct_qoq
2018-09-30 50.070 NaN NaN
2018-12-31 NaN NaN NaN
2019-03-31 61.727 NaN 0.232814
2019-06-30 67.424 NaN 0.092293
2019-09-30 73.941 NaN 0.096657
2019-12-31 83.930 0.676253 0.135094
2020-03-31 91.250 0.478283 0.087216
2020-06-30 99.721 0.479013 0.092833
If your data contains only quarter-enddates as in the sample, you may use resample and asfreq to fill missing quarter-ends
df_final = df.resample('Q').asfreq()[::-1]
Out[122]:
revenue pct_yoy pct_qoq
2020-06-30 99.721 0.479013 0.092833
2020-03-31 91.250 0.478283 0.087216
2019-12-31 83.930 0.676253 0.135094
2019-09-30 73.941 NaN 0.096657
2019-06-30 67.424 NaN 0.092293
2019-03-31 61.727 NaN 0.232814
2018-12-31 NaN NaN NaN
2018-09-30 50.070 NaN NaN
Pandas does not restrict DatetimeIndex keys to only Timestamps. Why it is so and is there any way to make such restriction?
df = pd.DataFrame({"A":{"2019-01-01":12.0,"2019-01-03":27.0,"2019-01-04":15.0},
"B":{"2019-01-01":25.0,"2019-01-03":27.0,"2019-01-04":27.0}}
)
df.index = pd.to_datetime(df.index)
df.loc['2010-05-05'] = 1 # string index
df.loc[150] = 1 # integer index
print(df)
I get the following dataframe:
A B
2019-01-01 00:00:00 12.0 25.0
2019-01-03 00:00:00 27.0 27.0
2019-01-04 00:00:00 15.0 27.0
2010-05-05 1.0 1.0
150 1.0 1.0
Of course I cannot do
df.index = pd.to_datetime(df.index)
once again because of last two rows.
However I'd like if 2 last rows could not be added throwing an error.
Is it possible?
You have a slight misconception about the type of your index. It is not a DateTimeIndex:
>>> df.index
Index([2019-01-01 00:00:00, 2019-01-03 00:00:00, 2019-01-04 00:00:00,
'2010-05-05', 150],
dtype='object')
The index becomes an Object dtype index as soon as you add a different type value. DateTimeIndex's can't have types of than timestamps, the type of the index is changed.
If you would like to remove all values that are not datetimes from your index, you can do that with pd.to_datetime and errors='coerce'
df.index = pd.to_datetime(df.index, errors='coerce')
A B
2019-01-01 12.0 25.0
2019-01-03 27.0 27.0
2019-01-04 15.0 27.0
2010-05-05 1.0 1.0
NaT 1.0 1.0
To access only elements that have a valid Timestamp as index, you can use notnull:
df[df.index.notnull()]
A B
2019-01-01 12.0 25.0
2019-01-03 27.0 27.0
2019-01-04 15.0 27.0
2010-05-05 1.0 1.0
You can check if each index is a pd._libs.tslibs.timestamps.Timestamp instance:
flags = [isinstance(idx, pd._libs.tslibs.timestamps.Timestamp) for idx in df.reset_index()['index']]
df = df[flags]
However, note that you can certainly do both pd.to_datetime('2010-05-05') and pd.to_datetime(150). At least, they still result in valid datetime stamp without throwing an exception/error/
I have a dictionary name date_dict keyed by datetime dates with values corresponding to integer counts of observations. I convert this to a sparse series/dataframe with censored observations that I would like to join or convert to a series/dataframe with continuous dates. The nasty list comprehension is my hack to get around the fact that pandas apparently won't automatically covert datetime date objects to an appropriate DateTime index.
df1 = pd.DataFrame(data=date_dict.values(),
index=[datetime.datetime.combine(i, datetime.time())
for i in date_dict.keys()],
columns=['Name'])
df1 = df1.sort(axis=0)
This example has 1258 observations and the DateTime index runs from 2003-06-24 to 2012-11-07.
df1.head()
Name
Date
2003-06-24 2
2003-08-13 1
2003-08-19 2
2003-08-22 1
2003-08-24 5
I can create an empty dataframe with a continuous DateTime index, but this introduces an unneeded column and seems clunky. I feel as though I'm missing a more elegant solution involving a join.
df2 = pd.DataFrame(data=None,columns=['Empty'],
index=pd.DateRange(min(date_dict.keys()),
max(date_dict.keys())))
df3 = df1.join(df2,how='right')
df3.head()
Name Empty
2003-06-24 2 NaN
2003-06-25 NaN NaN
2003-06-26 NaN NaN
2003-06-27 NaN NaN
2003-06-30 NaN NaN
Is there a simpler or more elegant way to fill a continuous dataframe from a sparse dataframe so that there is (1) a continuous index, (2) the NaNs are 0s, and (3) there is no left-over empty column in the dataframe?
Name
2003-06-24 2
2003-06-25 0
2003-06-26 0
2003-06-27 0
2003-06-30 0
You can just use reindex on a time series using your date range. Also it looks like you would be better off using a TimeSeries instead of a DataFrame (see documentation), although reindexing is also the correct method for adding missing index values to DataFrames as well.
For example, starting with:
date_index = pd.DatetimeIndex([pd.datetime(2003,6,24), pd.datetime(2003,8,13),
pd.datetime(2003,8,19), pd.datetime(2003,8,22), pd.datetime(2003,8,24)])
ts = pd.Series([2,1,2,1,5], index=date_index)
Gives you a time series like your example dataframe's head:
2003-06-24 2
2003-08-13 1
2003-08-19 2
2003-08-22 1
2003-08-24 5
Simply doing
ts.reindex(pd.date_range(min(date_index), max(date_index)))
then gives you a complete index, with NaNs for your missing values (you can use fillna if you want to fill the missing values with some other values - see here):
2003-06-24 2
2003-06-25 NaN
2003-06-26 NaN
2003-06-27 NaN
2003-06-28 NaN
2003-06-29 NaN
2003-06-30 NaN
2003-07-01 NaN
2003-07-02 NaN
2003-07-03 NaN
2003-07-04 NaN
2003-07-05 NaN
2003-07-06 NaN
2003-07-07 NaN
2003-07-08 NaN
2003-07-09 NaN
2003-07-10 NaN
2003-07-11 NaN
2003-07-12 NaN
2003-07-13 NaN
2003-07-14 NaN
2003-07-15 NaN
2003-07-16 NaN
2003-07-17 NaN
2003-07-18 NaN
2003-07-19 NaN
2003-07-20 NaN
2003-07-21 NaN
2003-07-22 NaN
2003-07-23 NaN
2003-07-24 NaN
2003-07-25 NaN
2003-07-26 NaN
2003-07-27 NaN
2003-07-28 NaN
2003-07-29 NaN
2003-07-30 NaN
2003-07-31 NaN
2003-08-01 NaN
2003-08-02 NaN
2003-08-03 NaN
2003-08-04 NaN
2003-08-05 NaN
2003-08-06 NaN
2003-08-07 NaN
2003-08-08 NaN
2003-08-09 NaN
2003-08-10 NaN
2003-08-11 NaN
2003-08-12 NaN
2003-08-13 1
2003-08-14 NaN
2003-08-15 NaN
2003-08-16 NaN
2003-08-17 NaN
2003-08-18 NaN
2003-08-19 2
2003-08-20 NaN
2003-08-21 NaN
2003-08-22 1
2003-08-23 NaN
2003-08-24 5
Freq: D, Length: 62