How to groupby two fields in pandas? - python

Given the following input, the goal is to group values by hour for each Date with Avg and Sum functions.
A solution for grouping by hour alone exists, but it does not take new days into account.
Date Time F1 F2 F3
21-01-16 8:11 5 2 4
21-01-16 9:25 9 8 2
21-01-16 9:39 7 3 2
21-01-16 9:53 6 5 1
21-01-16 10:07 4 6 7
21-01-16 10:21 7 3 1
21-01-16 10:35 5 6 7
21-01-16 11:49 1 2 1
21-01-16 12:03 3 3 1
22-01-16 9:45 6 5 1
22-01-16 9:20 4 6 7
22-01-16 12:10 7 3 1
Expected output:
Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1

You can do the parsing of dates during reading of the csv file:
from __future__ import print_function  # make it work with Python 2 and 3
import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
print(df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum']))
Output:
F1 F2 F3
mean sum mean sum mean sum
Date Time
2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
9 7.333333 22 5.333333 16 1.666667 5
10 5.333333 16 5.000000 15 5.000000 15
11 1.000000 1 2.000000 2 1.000000 1
12 3.000000 3 3.000000 3 1.000000 1
2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
12 7.000000 7 3.000000 3 1.000000 1
All the way into csv:
from __future__ import print_function
import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
df2 = df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum'])
df3 = df2.reset_index()
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))
Output:
Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1
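If you are on a newer pandas where delim_whitespace warns about deprecation, a hedged variant of the same idea is to read with a regex separator and build one timestamp from Date and Time; the file name, column names, and the day-first two-digit-year format are assumptions taken from the sample above:
import pandas as pd

# regex separator instead of the (now deprecated) delim_whitespace=True
df = pd.read_csv('f123_dates.csv', sep=r'\s+')
# combine Date and Time into a single timestamp (day-month-two-digit-year assumed)
ts = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d-%m-%y %H:%M')

out = (df[['F1', 'F2', 'F3']]
       .groupby([ts.dt.date.rename('Date'), ts.dt.hour.rename('Hour')])
       .agg(['mean', 'sum']))
print(out)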

You can convert the Time column to datetime with to_datetime and then use groupby with agg:
print(df)
Date Time F1 F2 F3
0 2016-01-21 8:11 5 2 4
1 2016-01-21 9:25 9 8 2
2 2016-01-21 9:39 7 3 2
3 2016-01-21 9:53 6 5 1
4 2016-01-21 10:07 4 6 7
5 2016-01-21 10:21 7 3 1
6 2016-01-21 10:35 5 6 7
7 2016-01-21 11:49 1 2 1
8 2016-01-21 12:03 3 3 1
9 2016-01-22 9:45 6 5 1
10 2016-01-22 9:20 4 6 7
11 2016-01-22 12:10 7 3 1
df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print(df)
Date Time F1 F2 F3
0 2016-01-21 1900-01-01 08:11:00 5 2 4
1 2016-01-21 1900-01-01 09:25:00 9 8 2
2 2016-01-21 1900-01-01 09:39:00 7 3 2
3 2016-01-21 1900-01-01 09:53:00 6 5 1
4 2016-01-21 1900-01-01 10:07:00 4 6 7
5 2016-01-21 1900-01-01 10:21:00 7 3 1
6 2016-01-21 1900-01-01 10:35:00 5 6 7
7 2016-01-21 1900-01-01 11:49:00 1 2 1
8 2016-01-21 1900-01-01 12:03:00 3 3 1
9 2016-01-22 1900-01-01 09:45:00 6 5 1
10 2016-01-22 1900-01-01 09:20:00 4 6 7
11 2016-01-22 1900-01-01 12:10:00 7 3 1
df = df.groupby([df['Date'], df['Time'].dt.hour]).agg(['mean','sum']).reset_index()
print(df)
Date Time F1 F2 F3
mean sum mean sum mean sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
And then you can set the column names with a list comprehension:
df.columns = [' '.join(col).strip() for col in df.columns]
print(df)
Date Time F1 mean F1 sum F2 mean F2 sum F3 mean F3 sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
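If you want the exact header from the expected output (SUM F1, ..., AVG F3), a hedged alternative is named aggregation (available since pandas 0.25). The sketch below assumes df is the original frame from the start of this answer, before it was overwritten by the grouped result:
specs = {f'{name} {col}': (col, func)
         for name, func in [('SUM', 'sum'), ('AVG', 'mean')]
         for col in ['F1', 'F2', 'F3']}

out = (df.groupby([df['Date'], df['Time'].dt.hour])
         .agg(**specs)
         .reset_index())
print(out.to_csv(index=False))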

Related

Sort columns based on min value of each column in pandas

I am working with a dataset that is of the form
stops stops name 0 1 2 3
1 A 21:09:00 17:24:00 17:54:00 17:29:00
2 B 21:10:00 17:25:00 17:55:00 17:27:00
3 C 17:28:00 17:58:00 17:26:00
4 D 21:16:00 18:01:00 17:23:00
5 E 21:17:00 17:32:00 18:02:00
6 F 21:20:00 17:35:00 18:05:00 17:20:00
I know how to sort columns [0-3] according to the times of a specific stop. For example,
to sort based on the times of the first stop (A), I do the following:
def time_to_seconds(time):
    try:
        h, m, s = time.split(':')
        return int(h) * 3600 + int(m) * 60 + int(s)
    except:
        return -1

def time_cmp(l):
    return [time_to_seconds(time) for time in l]

df[df.columns[2:]] = df[df.columns[2:]].sort_values(by=[0], axis=1, key=time_cmp)
It works, and values are stored:
stops stops name 0 1 2 3
1 A 17:24:00 17:29:00 17:54:00 21:09:00
2 B 17:25:00 17:27:00 17:55:00 21:10:00
3 C 17:28:00 17:26:00 17:58:00
4 D 17:23:00 18:01:00 21:16:00
5 E 17:32:00 18:02:00 21:17:00
6 F 17:35:00 17:20:00 18:05:00 21:20:00
However, I want to sort the columns based on the minimum value of each column, not the values of a specific row (or stop). What should I do?
Result of the desired sort:
stops stops name 0 1 2 3
1 A 17:29:00 17:24:00 17:54:00 21:09:00
2 B 17:27:00 17:25:00 17:55:00 21:10:00
3 C 17:26:00 17:28:00 17:58:00
4 D 17:23:00 18:01:00 21:16:00
5 E 17:32:00 18:02:00 21:17:00
6 F 17:20:00 17:35:00 18:05:00 21:20:00
As you can see, the min value of column 0 (17:20:00) is less than the min value of column 1 (17:24:00).
You can convert the columns with to_timedelta, then use numpy to get the sorted order:
import numpy as np
cols = df.columns[2:]
order = np.argsort(df[cols].apply(pd.to_timedelta).min())
df[cols].iloc[:, order]
# or as one-liner
# df[cols].iloc[:, np.argsort(df[cols].apply(pd.to_timedelta).min())]
output:
3 1 2 0
0 17:29:00 17:24:00 17:54:00 21:09:00
1 17:27:00 17:25:00 17:55:00 21:10:00
2 17:26:00 17:28:00 17:58:00 NaN
3 17:23:00 NaN 18:01:00 21:16:00
4 NaN 17:32:00 18:02:00 21:17:00
5 17:20:00 17:35:00 18:05:00 21:20:00
If you want to replace the values, keeping the column names intact:
cols = df.columns[2:]
order = np.argsort(df[cols].apply(pd.to_timedelta).min())
df[cols] = df[cols].iloc[:, order]
output:
stops stops name 0 1 2 3
0 1 A 17:29:00 17:24:00 17:54:00 21:09:00
1 2 B 17:27:00 17:25:00 17:55:00 21:10:00
2 3 C 17:26:00 17:28:00 17:58:00 NaN
3 4 D 17:23:00 NaN 18:01:00 21:16:00
4 5 E NaN 17:32:00 18:02:00 21:17:00
5 6 F 17:20:00 17:35:00 18:05:00 21:20:00
Here is how you can do it:
my_df = pd.DataFrame({'stops':[1,2,3,4,5,6],
'stops name':['A','B','C','D','E','F'],
'0':['17:29:00','17:27:00','17:26:00','17:23:00','','17:20:00'],
'1':['17:24:00','17:25:00','17:28:00','','17:32:00','17:35:00'],
'2':['17:54:00','17:55:00','17:58:00','18:01:00','18:02:00','18:05:00'],
'3':['21:09:00','21:10:00','','21:16:00','21:17:00','21:20:00']})
my_df = my_df.sort_values(by=['0','1','2','3'], ascending=[False, False, False, False])
my_df
Output-
stops stops name 0 1 2 3
0 1 A 17:29:00 17:24:00 17:54:00 21:09:00
1 2 B 17:27:00 17:25:00 17:55:00 21:10:00
2 3 C 17:26:00 17:28:00 17:58:00
3 4 D 17:23:00 18:01:00 21:16:00
5 6 F 17:20:00 17:35:00 18:05:00 21:20:00
4 5 E 17:32:00 18:02:00 21:17:00
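For completeness, here is a minimal end-to-end sketch of the first (argsort) approach; the sample values and the use of None for the empty cells are assumptions reconstructed from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'stops': [1, 2, 3, 4, 5, 6],
    'stops name': list('ABCDEF'),
    '0': ['21:09:00', '21:10:00', None, '21:16:00', '21:17:00', '21:20:00'],
    '1': ['17:24:00', '17:25:00', '17:28:00', None, '17:32:00', '17:35:00'],
    '2': ['17:54:00', '17:55:00', '17:58:00', '18:01:00', '18:02:00', '18:05:00'],
    '3': ['17:29:00', '17:27:00', '17:26:00', '17:23:00', None, '17:20:00'],
})

cols = df.columns[2:]
# per-column minimum as timedeltas (None becomes NaT and is ignored by min)
order = np.argsort(df[cols].apply(pd.to_timedelta).min())
# assign by position so the original column labels 0..3 are kept
df[cols] = df[cols].iloc[:, order].to_numpy()
print(df)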

Take row means of every other column in pandas (python)

I am trying to take row means of every few columns. Here is a sample dataset.
d = {'2000-01': range(0,10), '2000-02': range(10,20), '2000-03': range(10,20),
'2001-01': range(10,20), '2001-02':range(5,15), '2001-03':range(5,15)}
pd.DataFrame(data=d)
2000-01 2000-02 2000-03 2001-01 2001-02 2001-03
0 0 10 10 10 5 5
1 1 11 11 11 6 6
2 2 12 12 12 7 7
3 3 13 13 13 8 8
4 4 14 14 14 9 9
5 5 15 15 15 10 10
6 6 16 16 16 11 11
7 7 17 17 17 12 12
8 8 18 18 18 13 13
9 9 19 19 19 14 14
I need to take row means of the first three columns and then the next three and so on in the complete dataset. I don't need the original columns in the new dataset. Here is my code. It works but with caveats (discussed below). I am searching for a cleaner, more elegant solution if possible. (New to Python/Pandas)
#Create empty list to store row means
d1 = []
#Run loop to find row means for every three columns
for i in np.arange(0, 6, 3):
    data1 = d.iloc[:, i:i+3]
    d1.append(data1.mean(axis=1))
#Create empty list to concat DFs later
dlist1 = []
#Concat DFs
for j in range(0, len(d1)):
    dlist1.append(pd.Series(d1[j]).to_frame())
pd.concat(dlist1, axis=1)
I get this output, which is correct:
0 0
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
The columns names can easily be fixed, but the problem is that I need them in a specific format and I have 65 of these columns in the actual dataset. If you'll notice the column names in the original dataset, they are '2000-01'; '2000-02'; '2000-03'. The 1,2 and 3 are months of the year 2000, therefore column 1 of the new df should be '2000q1' , q1 being quarter 1. How do I loop over column names to create this for all my new columns? This seems significantly more challenging (at least to me!) than what's shown here. Thanks for your time!
EDIT: Ok this has been solved, quick shoutout to everyone who contributed!
We have groupby for axis=1; here the numpy array obtained by integer division by 3 is used as the grouper:
df=df.groupby(np.arange(df.shape[1])//3,axis=1).mean()
0 1
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
#np.arange(df.shape[1])//3
#array([0, 0, 0, 1, 1, 1])
A more common way:
df.columns=pd.to_datetime(df.columns,format='%Y-%m').to_period('Q')
df=df.groupby(level=0,axis=1).mean()
2000Q1 2001Q1
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
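Note: newer pandas versions warn that groupby(..., axis=1) is deprecated. A hedged equivalent of the two lines above is to group the transposed frame; the lowercase rename at the end is optional and matches the '2000q1' format asked for in the question:
df.columns = pd.to_datetime(df.columns, format='%Y-%m').to_period('Q')
df = df.T.groupby(level=0).mean().T                           # same result as groupby(level=0, axis=1)
df.columns = [f'{p.year}q{p.quarter}' for p in df.columns]    # optional: '2000q1', '2001q1'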
Iterate in steps of 3 and concat all the series:
df = (pd.concat([df.iloc[:, i:i+3].mean(1).rename(df.columns[i].split('-')[0]+'q1')
for i in range(0, df.shape[1], 3)], axis=1))
print(df)
2000q1 2001q1
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667

Get cumulative mean among groups in Python

I am trying to get a cumulative mean in python among different groups.
I have data as follows:
id date value
1 2019-01-01 2
1 2019-01-02 8
1 2019-01-04 3
1 2019-01-08 4
1 2019-01-10 12
1 2019-01-13 6
2 2019-01-01 4
2 2019-01-03 2
2 2019-01-04 3
2 2019-01-06 6
2 2019-01-11 1
The output I'm trying to get is something like this:
id date value cumulative_avg
1 2019-01-01 2 NaN
1 2019-01-02 8 2
1 2019-01-04 3 5
1 2019-01-08 4 4.33
1 2019-01-10 12 4.25
1 2019-01-13 6 5.8
2 2019-01-01 4 NaN
2 2019-01-03 2 4
2 2019-01-04 3 3
2 2019-01-06 6 3
2 2019-01-11 1 3.75
I need the cumulative average to restart with each new id.
I can get a variation of what I'm looking for with a single id; for example, if the data set only had the data where id = 1, then I could use:
df['cumulative_avg'] = df['value'].expanding().mean().shift(1)
I try to add a group by into it but I get an error:
df['cumulative_avg'] = df.groupby('id')['value'].expanding().mean().shift(1)
TypeError: incompatible index of inserted column with frame index
Also tried:
df.set_index(['account'])
ValueError: cannot handle a non-unique multi-index!
The actual data I have has millions of rows, and thousands of unique ids'. Any help with a speedy/efficient way to do this would be appreciated.
For many groups this will perform better because it ditches the apply. Take the cumsum divided by the cumcount, subtracting off the value to get the analog of expanding. Fortunately pandas interprets 0/0 as NaN.
gp = df.groupby('id')['value']
df['cum_avg'] = (gp.cumsum() - df['value'])/gp.cumcount()
id date value cum_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
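A minimal self-contained sketch of the cumsum/cumcount idea; the sample frame below is reconstructed from the question:
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date':  pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-04', '2019-01-08',
                             '2019-01-10', '2019-01-13', '2019-01-01', '2019-01-03',
                             '2019-01-04', '2019-01-06', '2019-01-11']),
    'value': [2, 8, 3, 4, 12, 6, 4, 2, 3, 6, 1],
})

gp = df.groupby('id')['value']
# cumulative sum of the *previous* values divided by the count of previous values;
# the first row of each group is 0/0, which pandas turns into NaN
df['cumulative_avg'] = (gp.cumsum() - df['value']) / gp.cumcount()
print(df)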
After a groupby you can't simply chain methods like this: in your example the shift is no longer applied per group, so you would not get the expected result, and there is also an index-alignment problem afterwards that prevents assigning the result back as a column directly. Instead you can do:
df['cumulative_avg'] = df.groupby('id')['value'].apply(lambda x: x.expanding().mean().shift(1))
print (df)
id date value cumulative_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000

How to add missing dates within date interval?

I have a dataframe as shown below
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-05 12:59:00',
               '2173-05-04 13:14:00', '2173-05-05 13:37:00', '2173-07-06 13:39:00',
               '2173-07-08 11:30:00', '2173-04-08 16:00:00', '2173-04-09 22:00:00',
               '2173-04-11 04:00:00', '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 5, 5, 5, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above, there are a few missing dates in between. I would like to create new records for those dates and fill in the values from the immediately preceding row.
def dt(df):
    r = pd.date_range(start=df.date.min(), end=df.date.max())
    return df.set_index('date').reindex(r)

new_df = df.groupby(['subject_id', 'month']).apply(dt)
This generates all the dates. I only want to find the missing dates within the input date interval for each subject for each month.
I did try the code from this related post. It helped, but it doesn't give me the expected output for this updated requirement: since it does a left join, it copies all records. I can't do an inner join either because that would drop the non-matching rows. I want a mix of a left join and an inner join.
Currently it creates new records for all 365 days in a year, which I don't want. This is not expected.
I only wish to add the missing dates within the input date interval. For example, subject = 1 in the 4th month has records for the 3rd and 5th, but the 4th is missing, so we add a record for the 4th day alone. We don't need the 6th, 7th, etc., unlike the current output. Similarly, in the 7th month the record for the 7th day is missing, so we just add a new record for that.
I expect my output to be as shown below.
The problem here is that appending the new days requires resample, so that step is necessary.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
The idea is to remove unnecessary missing rows - you can create a threshold for the minimum number of consecutive missing values (here 5) and remove those rows (a new count column is created for easy testing):
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
Finally, use the previous solution:
import numpy as np

df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
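As a hedged aside, the np.where adjustment can also be written by re-attaching the time of day to the filled date, which may read a little more clearly:
# equivalent to the np.where line above: keep the forward-filled time of day,
# but place it on the filled date
df2['time_1'] = df2['date'] + (df2['time_1'] - dates)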
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
Does this help?
def fill_dates(df):
    result = pd.DataFrame()
    for i, row in df.iterrows():
        if i == 0:
            result = result.append(row)
        else:
            start_date = result.iloc[-1]['time_1']
            end_date = row['time_1']
            # print(start_date, end_date)
            delta = (end_date - start_date).days
            # print(delta)
            if delta > 0 and start_date.month == end_date.month:
                for j in range(delta):
                    day = start_date + timedelta(days=j+1)
                    new_row = result.iloc[-1].copy()
                    new_row['time_1'] = day
                    new_row['remarks'] = 'added'
                    if new_row['time_1'].date() != row['time_1'].date():
                        result = result.append(new_row)
                result = result.append(row)
            else:
                result = result.append(row)
    result.reset_index(inplace=True)
    return result
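Note that DataFrame.append, used in the function above, was removed in pandas 2.0. A hedged rewrite of the same loop collects the rows in a plain list and builds the frame once at the end:
from datetime import timedelta
import pandas as pd

def fill_dates(df):
    # same logic as above, but accumulate rows in a list instead of
    # calling the removed DataFrame.append repeatedly
    rows = []
    for _, row in df.iterrows():
        if rows:
            start_date = rows[-1]['time_1']
            end_date = row['time_1']
            delta = (end_date - start_date).days
            if delta > 0 and start_date.month == end_date.month:
                for j in range(delta):
                    day = start_date + timedelta(days=j + 1)
                    if day.date() == end_date.date():
                        continue  # the next real row already covers this day
                    new_row = rows[-1].copy()
                    new_row['time_1'] = day
                    new_row['remarks'] = 'added'
                    rows.append(new_row)
        rows.append(row)
    return pd.DataFrame(rows).reset_index(drop=True)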

Duplicating rows based on a sequence of start date in Python

I would like to duplicate the rows in a data frame by creating a sequence of n dates from the start date.
My input file format.
col1 col2 date
1 5 2015-07-15
2 6 2015-07-20
3 7 2015-07-25
My expected output.
col1 col2 date
1 5 2015-07-15
1 5 2015-07-16
1 5 2015-07-17
1 5 2015-07-18
1 5 2015-07-19
2 6 2015-07-20
2 6 2015-07-21
2 6 2015-07-22
2 6 2015-07-23
2 6 2015-07-24
3 7 2015-07-25
3 7 2015-07-26
3 7 2015-07-27
3 7 2015-07-28
3 7 2015-07-29
I have to create a sequence of dates with a one-day difference.
Thanks in advance.
Use:
df['date'] = pd.to_datetime(df['date'])
n = 15
#create date range by periods
idx = pd.date_range(df['date'].iat[0], periods=n)
#create DatetimeIndex with reindex and forward filling values
df = (df.set_index('date')
.reindex(idx, method='ffill')
.reset_index()
.rename(columns={'index':'date'}))
print (df)
date col1 col2
0 2015-07-15 1 5
1 2015-07-16 1 5
2 2015-07-17 1 5
3 2015-07-18 1 5
4 2015-07-19 1 5
5 2015-07-20 2 6
6 2015-07-21 2 6
7 2015-07-22 2 6
8 2015-07-23 2 6
9 2015-07-24 2 6
10 2015-07-25 3 7
11 2015-07-26 3 7
12 2015-07-27 3 7
13 2015-07-28 3 7
14 2015-07-29 3 7
Import packages
from datetime import datetime as dt
from datetime import timedelta
import numpy as np
import pandas as pd
Then create the date range as a df:
base = dt(2015, 7, 15)
arr = np.array([base + timedelta(days=i) for i in range(15)])
df_d = pd.DataFrame({'date_r' : arr})
Change the datatype of the original df if you have not:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
and merge with the original df, and sort by date ascending:
df_merged = df.merge(df_d, how='right', left_on='date', right_on='date_r')
df_merged.sort_values('date_r', inplace=True)
you will get this df:
col1 col2 date date_r
0 1.0 5.0 2015-07-15 2015-07-15
3 NaN NaN NaT 2015-07-16
4 NaN NaN NaT 2015-07-17
5 NaN NaN NaT 2015-07-18
6 NaN NaN NaT 2015-07-19
1 2.0 6.0 2015-07-20 2015-07-20
7 NaN NaN NaT 2015-07-21
8 NaN NaN NaT 2015-07-22
9 NaN NaN NaT 2015-07-23
10 NaN NaN NaT 2015-07-24
2 3.0 7.0 2015-07-25 2015-07-25
11 NaN NaN NaT 2015-07-26
12 NaN NaN NaT 2015-07-27
13 NaN NaN NaT 2015-07-28
14 NaN NaN NaT 2015-07-29
Now, you will just need to forward fill using fillna(method='ffill').astype(int):
df_merged['col1'] = df_merged['col1'].fillna(method='ffill').astype(int)
df_merged['col2'] = df_merged['col2'].fillna(method='ffill').astype(int)
For completeness' sake, rename the columns to get the originally intended df back:
df_merged = df_merged[['col1', 'col2', 'date_r']]
df_merged.rename(columns={'date_r' : 'date'}, inplace=True)
for cosmetic purposes:
df_merged.reset_index(inplace=True, drop=True)
print(df_merged)
to yield finally:
col1 col2 date
0 1 5 2015-07-15
1 1 5 2015-07-16
2 1 5 2015-07-17
3 1 5 2015-07-18
4 1 5 2015-07-19
5 2 6 2015-07-20
6 2 6 2015-07-21
7 2 6 2015-07-22
8 2 6 2015-07-23
9 2 6 2015-07-24
10 3 7 2015-07-25
11 3 7 2015-07-26
12 3 7 2015-07-27
13 3 7 2015-07-28
14 3 7 2015-07-29
A more generic way would be to stretch out your time index and fill the NaNs with the previous values.
Try this:
df['date']=pd.to_datetime(df['date'])
print(df.set_index('date').asfreq('D').ffill().reset_index())
O/P:
date col1 col2
0 2015-07-15 1.0 5.0
1 2015-07-16 1.0 5.0
2 2015-07-17 1.0 5.0
3 2015-07-18 1.0 5.0
4 2015-07-19 1.0 5.0
5 2015-07-20 2.0 6.0
6 2015-07-21 2.0 6.0
7 2015-07-22 2.0 6.0
8 2015-07-23 2.0 6.0
9 2015-07-24 2.0 6.0
10 2015-07-25 3.0 7.0
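If every row should instead be expanded to a fixed number of consecutive days (here n=5, matching the expected output), a hedged option on pandas >= 0.25 is to build a per-row date range and explode it; the frame below is reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [5, 6, 7],
                   'date': ['2015-07-15', '2015-07-20', '2015-07-25']})
df['date'] = pd.to_datetime(df['date'])

n = 5  # number of consecutive days per row
out = (df.assign(date=df['date'].apply(lambda d: pd.date_range(d, periods=n)))
         .explode('date')
         .reset_index(drop=True))
print(out)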
