Fill in missing days in dataframe and add zero value in Python - python

I have a dataframe that looks like the following
Date A B
2014-12-20 00:00:00.000 3 2
2014-12-21 00:00:00.000 7 1
2014-12-22 00:00:00.000 2 9
2014-12-24 00:00:00.000 2 2
and I would like to add the missing day and fill the values for A and B with 0 so I get
Date A B
2014-12-20 00:00:00.000 3 2
2014-12-21 00:00:00.000 7 1
2014-12-22 00:00:00.000 2 9
2014-12-23 00:00:00.000 0 0
2014-12-24 00:00:00.000 2 2
How is this achieved best?

If Date is a column, convert it to datetimes, set it as a DatetimeIndex, and then use DataFrame.asfreq:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.set_index('Date').asfreq('d', fill_value=0)
print (df1)
A B
Date
2014-12-20 3 2
2014-12-21 7 1
2014-12-22 2 9
2014-12-23 0 0
2014-12-24 2 2
If the first column is already the index:
df.index = pd.to_datetime(df.index)
df1 = df.asfreq('d', fill_value=0)
print (df1)
A B
Date
2014-12-20 3 2
2014-12-21 7 1
2014-12-22 2 9
2014-12-23 0 0
2014-12-24 2 2
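For comparison, here is a minimal self-contained sketch of the same idea using the sample data from the question. It builds the full daily range explicitly with pd.date_range and reindexes against it, which should give the same result as asfreq here. The variable names (indexed, full_range, filled, via_asfreq) are only illustrative.

```python
import pandas as pd

# Sample data from the question, with 2014-12-23 missing.
df = pd.DataFrame({
    'Date': ['2014-12-20', '2014-12-21', '2014-12-22', '2014-12-24'],
    'A': [3, 7, 2, 2],
    'B': [2, 1, 9, 2],
})
df['Date'] = pd.to_datetime(df['Date'])
indexed = df.set_index('Date')

# Build every calendar day between the first and last date,
# then reindex so that missing days appear filled with 0.
full_range = pd.date_range(indexed.index.min(), indexed.index.max(), freq='D')
filled = indexed.reindex(full_range, fill_value=0).rename_axis('Date')

# The asfreq call from the answer, for comparison.
via_asfreq = indexed.asfreq('D', fill_value=0)
print(filled)
```

The explicit reindex is handy when the range should extend beyond the first and last dates present in the data, since the start and end of pd.date_range can be chosen freely.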

Related

Cumulative groupby with condition on datetime pandas

I need to calculate cumulative sums for different columns in a pandas dataframe based on a column playerId and a datetime column. My dataframe looks like this:
eventId playerId goal shot header dateutc
0 0 100 0 1 0 2020-11-08 17:00:00
1 1 100 0 0 1 2020-11-08 17:00:00
2 2 100 1 1 0 2020-11-08 17:00:00
3 3 200 0 1 0 2020-11-08 17:00:00
4 4 100 1 0 1 2020-11-15 17:00:00
5 5 100 1 1 0 2020-11-15 17:00:00
6 6 200 1 1 0 2020-11-15 17:00:00
So now I need to calculate cumulative sums for each player for the current date and all previous dates. So my final dataframe will look like this:
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
Hopefully someone can help me :)
First drop eventId so that it is not summed along with the other numeric columns, then aggregate with sum and apply cumsum per player:
df1 = (df.drop('eventId',axis=1)
.groupby(['playerId','dateutc'], sort=False)
.sum()
.groupby(level=0, sort=False)
.cumsum()
.reset_index())
print (df1)
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
If you need to specify which columns to process:
df1 = (df.groupby(['playerId','dateutc'], sort=False)[['goal', 'shot', 'header']]
.sum()
.groupby(level=0, sort=False)
.cumsum()
.reset_index())
Try:
out = df.groupby(['playerId', 'dateutc'], sort=False)[['goal', 'shot', 'header']].sum()
out = out.groupby(level='playerId').cumsum().reset_index()
Output:
>>> out
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
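The two-step approach above (sum per player and date, then a running total per player) can be checked end to end with a small sketch that rebuilds the question's data. It assumes the rows already appear in chronological order for each player, because cumsum accumulates in row order; the names stats and per_date are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'eventId': range(7),
    'playerId': [100, 100, 100, 200, 100, 100, 200],
    'goal': [0, 0, 1, 0, 1, 1, 1],
    'shot': [1, 0, 1, 1, 0, 1, 1],
    'header': [0, 1, 0, 0, 1, 0, 0],
    'dateutc': pd.to_datetime(['2020-11-08 17:00:00'] * 4
                              + ['2020-11-15 17:00:00'] * 3),
})

stats = ['goal', 'shot', 'header']

# Step 1: one row per (player, date) holding that date's totals.
per_date = df.groupby(['playerId', 'dateutc'], sort=False)[stats].sum()

# Step 2: running total within each player, in row order.
out = per_date.groupby(level='playerId', sort=False).cumsum().reset_index()
print(out)
```

If the input is not guaranteed to be sorted, sorting by dateutc before the first groupby keeps the running totals chronological.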

Pandas: time column addition and repeating all rows for a month

I'd like to change my dataframe adding time intervals for every hour during a month
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a new DataFrame that holds every hour of the month, created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
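On pandas 1.2 or later, the helper key column is not needed, because merge accepts how='cross' directly. A minimal sketch under that version assumption, using the question's sample data (the name hours is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'money': [1, 4, 5], 'food': [2, 5, 7]})

# One row per hour of January 2020.
hours = pd.DataFrame({
    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')
})

# Cross join: every original row is paired with every hour.
out = df.merge(hours, how='cross')
print(out)
```

The row count is the product of the two inputs: 3 original rows times 31 days times 24 hours.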

How to add missing dates within date interval?

I have a dataframe like as shown below
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00',
'2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00',
'2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00',
'2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above, there are a few missing dates in between. I would like to create new records for those dates and fill in values from the immediately preceding row.
def dt(df):
r = pd.date_range(start=df.date.min(), end=df.date.max())
df.set_index('date').reindex(r)
new_df = df.groupby(['subject_id','month']).apply(dt)
This generates all the dates. I only want to find the missing date within the input date interval for each subject for each month
I did try the code from this related post. It helped, but it doesn't give me the expected output for this updated requirement. Since we do a left join, it copies all records. I can't do an inner join either, because it would drop the non-matching columns. I want a mix of a left join and an inner join.
Currently it creates new records for all 365 days in a year, which I don't want. This is not the expected result.
I only wish to add the missing dates within the input date interval. For example, subject 1 has records in the 4th month for the 3rd and the 5th, but the 4th is missing, so we add a record for the 4th day alone. We don't need the 6th, 7th, etc., unlike the current output. Similarly, in the 7th month the record for the 7th day is missing, so we just add a new record for that.
I expect my output to look like the following.
The problem is that you need resample to add the new days, so that step is necessary.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
The idea is to remove the unnecessary missing rows: you can set a threshold for the minimum number of consecutive missing values (here 5) and remove those rows (a new column is created for easy testing):
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
Finally, use the previous solution:
df2 = df2.groupby(df['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
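The per-month idea above can also be sketched without groupby.apply: compute the first and last date for each (subject_id, month), build a daily calendar for each span, left-join the original rows onto it, and forward-fill within each subject. This is a simplified sketch on a shortened version of the data, not the exact code above. The names spans, calendar and the added flag column are hypothetical, and time_1 is left as NaT on the inserted rows rather than being shifted to the new day.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data.
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2],
    'time_1': pd.to_datetime([
        '2173-04-03 12:35:00', '2173-04-05 12:59:00', '2173-05-04 13:14:00',
        '2173-04-08 16:00:00', '2173-04-11 04:00:00',
    ]),
    'val': [5, 5, 1, 8, 3],
})
df['date'] = df['time_1'].dt.floor('D')
df['month'] = df['time_1'].dt.month

# First and last observed date for each subject and month.
spans = (df.groupby(['subject_id', 'month'])['date']
           .agg(['min', 'max'])
           .reset_index())

# One row per calendar day inside each span (and nothing outside it).
calendar = pd.concat(
    [pd.DataFrame({'subject_id': sid, 'month': month,
                   'date': pd.date_range(start, end, freq='D')})
     for sid, month, start, end in spans.itertuples(index=False, name=None)],
    ignore_index=True,
)

# Left join keeps every calendar day; days without data have NaN/NaT.
filled = calendar.merge(df, on=['subject_id', 'month', 'date'], how='left')
filled['added'] = filled['time_1'].isna()   # flag the inserted rows
filled['val'] = filled.groupby('subject_id')['val'].ffill().astype(int)
print(filled)
```

Because the calendar is built only between each group's own minimum and maximum, no rows are created for days outside the observed interval, so no threshold filtering is needed afterwards.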
Does this help?
from datetime import timedelta

def fill_dates(df):
    # Collect rows in a list; DataFrame.append was removed in pandas 2.0.
    rows = []
    for i, row in df.iterrows():
        if rows:
            start_date = rows[-1]['time_1']
            end_date = row['time_1']
            delta = (end_date - start_date).days
            # Only fill gaps that stay within the same month.
            if delta > 0 and start_date.month == end_date.month:
                last = rows[-1]
                for j in range(delta):
                    new_row = last.copy()
                    new_row['time_1'] = start_date + timedelta(days=j + 1)
                    new_row['remarks'] = 'added'
                    # Skip a filler that lands on the same day as the next real row.
                    if new_row['time_1'].date() != row['time_1'].date():
                        rows.append(new_row)
        rows.append(row)
    return pd.DataFrame(rows).reset_index()

How to define if-else function using dataframe columns as arguments in python?

I need to write a function and then apply it for a dataframe's column in pandas.
My dataframe looks like this. The data is sorted by id and then by period.
period id column1
0 2013-01-31 5 NaT
1 2013-02-28 5 28 days
2 2013-03-31 5 31 days
3 2013-04-30 5 30 days
4 2016-05-31 6 NaT
5 2016-06-30 6 30 days
6 2016-08-31 6 62 days
The new column's values should be defined according to the values in column1:
if column1 = NaT or column1 > 31
then the new column equals the value in the period column.
Else, the value of the new column should be copied from its previous row:
new column ith row = new column (i-1)th row.
I am very new to python and my code doesn't work:
def f(x):
if not x or x > 31
return x=df['period']
else
return x=x.shift()
df['newcolumn'] = df['column1'].apply(f)
The output should be this:
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
Any help would be much appreciated.
First, it might be necessary to convert period to datetime using pd.to_datetime:
df['period']=pd.to_datetime(df['period'])
Then you can use Series.where with Series.ffill:
df['newcolumn']=df['period'].where((df["column1"]>pd.Timedelta("31 days"))|(df["column1"].isnull())).ffill()
print(df)
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
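The where-then-ffill pattern can be reproduced on the question's data with a short self-contained sketch. The intermediate name starts is only illustrative; it marks the rows where a new value begins (column1 is NaT or longer than 31 days), and every other row inherits the most recent such value.

```python
import pandas as pd

df = pd.DataFrame({
    'period': pd.to_datetime([
        '2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30',
        '2016-05-31', '2016-06-30', '2016-08-31',
    ]),
    'id': [5, 5, 5, 5, 6, 6, 6],
})
# Day counts from the question; NaN becomes NaT after conversion.
nan = float('nan')
df['column1'] = pd.to_timedelta([nan, 28, 31, 30, nan, 30, 62], unit='D')

# Rows where the new column takes its value from period.
starts = df['column1'].isna() | (df['column1'] > pd.Timedelta(days=31))

# Keep period on those rows, blank out the rest, then carry values forward.
df['newcolumn'] = df['period'].where(starts).ffill()
print(df)
```

The forward fill does not leak between ids in this data because the first row of each id has NaT in column1 and therefore starts a new value.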
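You can use Series.where(cond, other), which keeps the original value where the condition holds and returns other elsewhere:
df["newcolumn"] = df["period"].where(df["column1"].isnull() | (df["column1"]>pd.Timedelta("31D")), df["column1"].shift())
Note that other here is the previous column1 value, not the previous newcolumn value, so the where(...).ffill() form above is the one that reproduces the expected output.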

How to make all non-date values null in Pandas

I have an excel doc where the users put dates and strings in the same column. I want to make every string object null and leave all the dates. How do I do this in pandas? Thanks.
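An easy way to convert dates in a DataFrame is with pandas.DataFrame.convert_objects, as mentioned by @Jeff, and it also handles numbers and timedeltas. Note that convert_objects was deprecated in pandas 0.21 and removed in 1.0, so this only runs on older versions. Here is an example of using it: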
# contents of Sheet1 of test.xlsx
x y date1 z date2 date3
1 fum 6/1/2016 7 9/1/2015 string3
2 fo 6/2/2016 alpha string0 10/1/2016
3 fi 6/3/2016 9 9/3/2015 10/2/2016
4 fee 6/4/2016 10 string1 string4
5 dumbledum 6/5/2016 beta string2 10/3/2015
6 dumbledee 6/6/2016 12 9/4/2015 string5
import pandas as pd
xl = pd.ExcelFile('test.xlsx')
df = xl.parse("Sheet1")
df1 = df.convert_objects(convert_dates='coerce')
# 'coerce' required for conversion to NaT on error
df1
Out[7]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
Individual columns in a DataFrame can be converted using pandas.to_datetime, as pointed out by @Jeff, and with pandas.Series.map; however, neither is done in place. For example, with pandas.to_datetime:
import pandas as pd
xl2 = pd.ExcelFile('test.xlsx')
df2 = xl2.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    # errors='coerce' replaces the old coerce=True flag: bad values become NaT
    df2[col] = pd.to_datetime(df2[col], errors='coerce')
df2
Out[8]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
And using pandas.Series.map:
import pandas as pd
import datetime
xl3 = pd.ExcelFile('test.xlsx')
df3 = xl3.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df3[col] = df3[col].map(lambda x: x if isinstance(x,(datetime.datetime)) else None)
df3
Out[9]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
An upfront way to convert dates in an Excel doc is to do it while parsing the sheets. pandas.ExcelFile.parse accepts a converters dict; use a function built on pandas.to_datetime as the converter for each date column, passing errors='coerce' so that unparseable values become NaT. For example:
def converter(x):
    return pd.to_datetime(x, errors='coerce')
    # the following also works for this example
    # return pd.to_datetime(x, format='%d/%m/%Y', errors='coerce')
converters={'date1': converter,'date2': converter, 'date3': converter}
xl4 = pd.ExcelFile('test.xlsx')
df4 = xl4.parse("Sheet1",converters=converters)
df4
Out[10]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
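To try the coercion without an Excel file, here is a minimal sketch on an in-memory DataFrame. It assumes the cells arrive as strings in month/day/year form, which may not match what the Excel reader returns for real date cells; the explicit format is an assumption chosen for this sample.

```python
import pandas as pd

# Stand-in for a few rows of the sheet: dates and free text mixed together.
df = pd.DataFrame({
    'date1': ['6/1/2016', '6/2/2016', '6/3/2016'],
    'date2': ['9/1/2015', 'string0', '9/3/2015'],
    'date3': ['string3', '10/1/2016', '10/2/2016'],
})

for col in ['date1', 'date2', 'date3']:
    # Anything that does not parse as a date becomes NaT.
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y', errors='coerce')

print(df)
print(df.dtypes)
```

After the loop each column has a datetime dtype, so the former string cells can be located with isna().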
