There are several steps involved for the data I have; I will show the steps taken so far and where I got stuck.
I have this df:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['Iza', '2020-12-01 10:34:00'], ['Iza', '2020-12-02 10:34:00'],
                            ['Iza', '2020-12-01 17:34:00'], ['Iza', '2020-12-01 17:34:00'],
                            ['Sara', '2020-12-04 17:34:00'], ['Sara', '2020-12-04 20:11:00'],
                            ['Sara', '2020-12-06 17:34:00'], ['Silvia', '2020-12-07 18:34:00'],
                            ['Silvia', '2020-12-09 11:22:00'], ['Paul', '2020-12-09 11:22:00'],
                            ['Paul', '2020-12-08 11:22:00'], ['Paul', '2020-12-07 11:22:00']]),
                  columns=['Name', 'Time'])
df:
Name Time
0 Iza 2020-12-01 10:34:00
1 Iza 2020-12-02 10:34:00
2 Iza 2020-12-01 17:34:00
3 Iza 2020-12-01 17:34:00
4 Sara 2020-12-04 17:34:00
5 Sara 2020-12-04 20:11:00
6 Sara 2020-12-06 17:34:00
7 Silvia 2020-12-07 18:34:00
8 Silvia 2020-12-09 11:22:00
9 Paul 2020-12-09 11:22:00
10 Paul 2020-12-08 11:22:00
11 Paul 2020-12-07 11:22:00
I converted the time column to datetime:
df['Time'] = pd.to_datetime(df['Time'])
Now I want to extract the day names and find the percentage of each day per name, with the days as columns:
df['Day'] = df['Time'].dt.day_name()
Result:
Name Time Day
0 Iza 2020-12-01 10:34:00 Tuesday
1 Iza 2020-12-02 10:34:00 Wednesday
2 Iza 2020-12-01 17:34:00 Tuesday
3 Iza 2020-12-01 17:34:00 Tuesday
4 Sara 2020-12-04 17:34:00 Friday
5 Sara 2020-12-04 20:11:00 Friday
6 Sara 2020-12-06 17:34:00 Sunday
7 Silvia 2020-12-07 18:34:00 Monday
8 Silvia 2020-12-09 11:22:00 Wednesday
9 Paul 2020-12-09 11:22:00 Wednesday
10 Paul 2020-12-08 11:22:00 Tuesday
11 Paul 2020-12-07 11:22:00 Monday
df2 = round(df.groupby(['Name'])['Day'].apply(lambda x: x.value_counts(normalize=True)) * 100)
Result:
Name
Iza Tuesday 75.0
Wednesday 25.0
Paul Monday 33.0
Tuesday 33.0
Wednesday 33.0
Sara Friday 67.0
Sunday 33.0
Silvia Wednesday 50.0
Monday 50.0
Name: Day, dtype: float64
I am stuck here. My desired output has the days as columns, with the percentage of each for every name:
Name Sunday Monday Tuesday Wednesday Friday
Iza NaN NaN 75 25 NaN
Paul NaN 33 33 33 NaN
Sara 33 NaN NaN NaN 67
Silvia NaN 50 NaN 50 NaN
Use Categorical to get the correct column order after Series.unstack; the solution can also be simplified to avoid apply:
df['Time'] = pd.to_datetime(df['Time'])
week = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
df['Day'] = pd.Categorical(df['Time'].dt.day_name(), ordered=True, categories=week)
df1 = df.groupby('Name')['Day'].value_counts(normalize=True).unstack().mul(100).round()
print (df1)
Day Sunday Monday Tuesday Wednesday Friday
Name
Iza NaN NaN 75.0 25.0 NaN
Paul NaN 33.0 33.0 33.0 NaN
Sara 33.0 NaN NaN NaN 67.0
Silvia NaN 50.0 NaN 50.0 NaN
An alternative that keeps the columns in weekday order is to group on the numeric day of week and rename the columns afterwards (note that pandas' dayofweek starts the week on Monday, i.e. Monday=0):
df['Time'] = pd.to_datetime(df['Time'])
df['Day'] = df['Time'].dt.dayofweek
d = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
     4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df1 = df.groupby('Name')['Day'].value_counts(normalize=True).unstack().mul(100).round().rename(columns=d)
print (df1)
Day     Monday  Tuesday  Wednesday  Friday  Sunday
Name
Iza        NaN     75.0       25.0     NaN     NaN
Paul      33.0     33.0       33.0     NaN     NaN
Sara       NaN      NaN        NaN    67.0    33.0
Silvia    50.0      NaN       50.0     NaN     NaN
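As a side note, a compact alternative sketch with pd.crosstab (not from the original answer) gives the same percentages; its columns come out in alphabetical order and weekdays a person never has show as 0 rather than NaN, so you may still want to reorder the columns with the week list:
df1 = (pd.crosstab(df['Name'], df['Time'].dt.day_name(), normalize='index')
         .mul(100).round())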
Just unstack. You are on the money
df2 = round(df.groupby(['Name'])['Day'].apply(lambda x: x.value_counts(normalize=True)) * 100).unstack(level=1)
df2 = df2[['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Friday']]
Sunday Monday Tuesday Wednesday Friday
Name
Iza NaN NaN 75.0 25.0 NaN
Paul NaN 33.0 33.0 33.0 NaN
Sara 33.0 NaN NaN NaN 67.0
Silvia NaN 50.0 NaN 50.0 NaN
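If you prefer not to hard-code the column list, a small sketch (reusing the week list from the Categorical answer above) that reorders whatever weekday columns are present:
week = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
# keep only the weekdays that actually occur, in calendar order
df2 = df2[[d for d in week if d in df2.columns]]
# or keep all seven columns, with NaN for weekdays that never occur
# df2 = df2.reindex(columns=week)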
Related
I have a large dataset containing a date column that starts in 2019. I want to generate the number of weeks in a separate column from those dates.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Now, starting from the first day this data was collected, I want to count 7 days using the date column and turn them into a week. For example, if the first week contains the first 7 dates, I create a column and call it week one. I want to repeat the same process until the last week the data was collected.
It might also be a good idea to sort the dates in order, from the first date to the current one.
I have tried this, but it is not generating the weeks in order; it actually produces repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is: starting from the first date the data was collected, count 7 days and store that as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
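If the dates are not guaranteed to start with the earliest one, a hedged variant of the same idea is to anchor on the minimum date rather than the first row:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# anchor on the earliest date, so unsorted input still starts at week 1
df['week'] = df['date'].sub(df['date'].min()).dt.days // 7 + 1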
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
I have a pandas dataframe called df_dummy with 3 columns: Days, Vacations_per_day and day_of_week. And I have a list called legal_days. I want to see if values from df_dummy['Days'] are found in the list legal_days and if found, change the value of Vacations_per_day column to 4 for that specific row.
This is my code so far:
legal_days = ['2022-08-15', '2022-11-30', '2022-12-01', '2022-12-26']
for index, i in enumerate(df_dummy['Days']):
    if i in legal_days:
        df_dummy['Vacations_per_day'][index] = 4
    else:
        pass
And the output is this:
Days Vacations_per_day day_of_week
2 2022-06-13 0.0 Monday
3 2022-06-14 0.0 Tuesday
4 2022-06-15 0.0 Wednesday
5 2022-06-16 1.0 Thursday
6 2022-06-17 1.0 Friday
7 2022-06-20 0.0 Monday
8 2022-06-21 0.0 Tuesday
9 2022-06-22 0.0 Wednesday
10 2022-06-23 0.0 Thursday
11 2022-06-24 0.0 Friday
12 2022-06-27 0.0 Monday
13 2022-06-28 0.0 Tuesday
14 2022-06-29 1.0 Wednesday
15 2022-06-30 1.0 Thursday
16 2022-07-01 1.0 Friday
17 2022-07-04 1.0 Monday
18 2022-07-05 1.0 Tuesday
19 2022-07-06 1.0 Wednesday
20 2022-07-07 0.0 Thursday
21 2022-07-08 1.0 Friday
22 2022-07-11 1.0 Monday
23 2022-07-12 1.0 Tuesday
24 2022-07-13 1.0 Wednesday
25 2022-07-14 1.0 Thursday
26 2022-07-15 1.0 Friday
27 2022-07-18 0.0 Monday
28 2022-07-19 0.0 Tuesday
29 2022-07-20 0.0 Wednesday
30 2022-07-21 0.0 Thursday
31 2022-07-22 0.0 Friday
32 2022-07-25 1.0 Monday
33 2022-07-26 1.0 Tuesday
34 2022-07-27 1.0 Wednesday
35 2022-07-28 1.0 Thursday
36 2022-07-29 1.0 Friday
37 2022-08-01 1.0 Monday
38 2022-08-02 1.0 Tuesday
39 2022-08-03 1.0 Wednesday
40 2022-08-04 1.0 Thursday
41 2022-08-05 1.0 Friday
42 2022-08-08 0.0 Monday
43 2022-08-09 0.0 Tuesday
44 2022-08-10 0.0 Wednesday
45 2022-08-11 4.0 Thursday
46 2022-08-12 0.0 Friday
47 2022-08-15 0.0 Monday
The problem is that, instead of changing the value of the row with the date 2022-08-15, it changes the row with a date of 2022-08-11. Could anyone help me with this?
Looking at your code, the likely cause is that enumerate yields positional indices starting at 0, while your dataframe's index starts at 2, so df_dummy['Vacations_per_day'][index] writes to the wrong label (here label 45, i.e. 2022-08-11, instead of the row for 2022-08-15). In any case, that loop is outside the normal use-case for modifying a pandas dataframe. You can accomplish all of this with loc and isin:
df_dummy.loc[df_dummy['Days'].isin(legal_days), 'Vacations_per_day'] = 4
This code works by looking up all the rows in your dataframe that have a value for Days in the legal_days list and then sets the associated value for the Vacations_per_day column to 4.
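A minimal, self-contained sketch of the same idea (the tiny df_dummy below is made up for illustration, with a non-zero-based index like the one in the question):
import pandas as pd

df_dummy = pd.DataFrame(
    {'Days': ['2022-08-11', '2022-08-12', '2022-08-15'],
     'Vacations_per_day': [0.0, 0.0, 0.0],
     'day_of_week': ['Thursday', 'Friday', 'Monday']},
    index=[45, 46, 47])

legal_days = ['2022-08-15', '2022-11-30', '2022-12-01', '2022-12-26']

# label-based assignment via .loc avoids the positional/label mix-up entirely
df_dummy.loc[df_dummy['Days'].isin(legal_days), 'Vacations_per_day'] = 4

print(df_dummy)
#           Days  Vacations_per_day day_of_week
# 45  2022-08-11                0.0    Thursday
# 46  2022-08-12                0.0      Friday
# 47  2022-08-15                4.0      Monday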
I have the following df:
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.array([['Iza', 'Tuesday'], ['Martin', 'Friday'], ['John', 'Monday'],
                             ['Iza', 'Tuesday'], ['Iza', 'Tuesday'], ['Iza', 'Wednesday'],
                             ['Sara', 'Friday'], ['Sara', 'Friday'], ['Sara', 'Sunday'],
                             ['Silvia', 'Monday'], ['Silvia', 'Wednesday'], ['Paul', 'Monday'],
                             ['Paul', 'Tuesday'], ['Paul', 'Wednesday']]),
                   columns=['Name', 'Day'])
df3:
Name Day
0 Iza Tuesday
1 Martin Friday
2 John Monday
3 Iza Tuesday
4 Iza Tuesday
5 Iza Wednesday
6 Sara Friday
7 Sara Friday
8 Sara Sunday
9 Silvia Monday
10 Silvia Wednesday
11 Paul Monday
12 Paul Tuesday
13 Paul Wednesday
I got the count of days for each user:
oo = df3.groupby(['Name','Day'])['Day'].size().reset_index(name='counts')
result:
Name Day counts
0 Iza Tuesday 3
1 Iza Wednesday 1
2 John Monday 1
3 Martin Friday 1
4 Paul Monday 1
5 Paul Tuesday 1
6 Paul Wednesday 1
7 Sara Friday 2
8 Sara Sunday 1
9 Silvia Monday 1
10 Silvia Wednesday 1
Then I dropped unwanted users with only one day record:
uniq_us = oo[oo.duplicated(['Name'], keep=False)]
result:
Name Day counts
0 Iza Tuesday 3
1 Iza Wednesday 1
4 Paul Monday 1
5 Paul Tuesday 1
6 Paul Wednesday 1
7 Sara Friday 2
8 Sara Sunday 1
9 Silvia Monday 1
10 Silvia Wednesday 1
Now I want to get the percentage of the counts for each day, grouped by name:
uniq_us.groupby(['Name','Day'])['counts'].apply(lambda x: x.value_counts(normalize=True)) * 100
I got:
Name Day
Iza Tuesday 3 100.0
Wednesday 1 100.0
Paul Monday 1 100.0
Tuesday 1 100.0
Wednesday 1 100.0
Sara Friday 2 100.0
Sunday 1 100.0
Silvia Monday 1 100.0
Wednesday 1 100.0
Name: counts, dtype: float64
I do not know how I can calculate it per Name group.
Desired output:
Name Day
Iza Tuesday 3 75.0
Wednesday 1 25.0
Paul Monday 1 33.33
Tuesday 1 33.33
Wednesday 1 33.33
Sara Friday 2 66.66
Sunday 1 33.34
Silvia Monday 1 50.0
Wednesday 1 50.0
Name: counts, dtype: float64
You can normalize the counts with their sum via transform:
uniq_us["pcnt"] = uniq_us.groupby("Name").counts.transform(lambda x: x / x.sum())
to get
>>> uniq_us
Name Day counts pcnt
0 Iza Tuesday 3 0.750000
1 Iza Wednesday 1 0.250000
4 Paul Monday 1 0.333333
5 Paul Tuesday 1 0.333333
6 Paul Wednesday 1 0.333333
7 Sara Friday 2 0.666667
8 Sara Sunday 1 0.333333
9 Silvia Monday 1 0.500000
10 Silvia Wednesday 1 0.500000
You can put 100 * and round(2) inside the lambda, and set Name and Day as the index to match the desired output:
...transform(lambda x: (100 * x / x.sum()).round(2))
uniq_us = uniq_us.set_index(["Name", "Day"])
to get
counts pcnt
Name Day
Iza Tuesday 3 75.00
Wednesday 1 25.00
Paul Monday 1 33.33
Tuesday 1 33.33
Wednesday 1 33.33
Sara Friday 2 66.67
Sunday 1 33.33
Silvia Monday 1 50.00
Wednesday 1 50.00
You're almost there. Try:
>>> uniq_us.groupby(["Name", "Day"]).sum()/uniq_us.groupby("Name").sum()
counts
Name Day
Iza Tuesday 0.750000
Wednesday 0.250000
Paul Monday 0.333333
Tuesday 0.333333
Wednesday 0.333333
Sara Friday 0.666667
Sunday 0.333333
Silvia Monday 0.500000
Wednesday 0.500000
Another option is to normalize the counts at an early stage:
(df3.groupby('Name')
.Day
.value_counts(normalize=True)
.mul(100)
.rename('Counts')
.reset_index()
.pipe(lambda x: x[x.duplicated(['Name'], keep=False)]))
# Name Day Counts
#0 Iza Tuesday 75.000000
#1 Iza Wednesday 25.000000
#4 Paul Monday 33.333333
#5 Paul Tuesday 33.333333
#6 Paul Wednesday 33.333333
#7 Sara Friday 66.666667
#8 Sara Sunday 33.333333
#9 Silvia Monday 50.000000
#10 Silvia Wednesday 50.000000
I have started working with Pandas and have some issues that I don't really know how to solve.
I have a dataframe with date, product, stock and sales. Some dates and products are missing. I would like to get a time series for each product over a range of dates.
For example:
product udsStock udsSales
date
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-30 14 856 0
2019-12-25 4 3132 439
2019-12-27 4 3177 616
2020-01-01 4 500 883
It has to be the same range for all products, even if a product doesn't appear on some dates in the range.
If I want the range 2019-12-25 to 2020-01-01, the final dataframe should be like this one:
product udsStock udsSales
date
2019-12-25 14 NaN NaN
2019-12-26 14 161 848
2019-12-27 14 1340 914
2019-12-28 14 NaN NaN
2019-12-29 14 NaN NaN
2019-12-30 14 856 0
2019-12-31 14 NaN NaN
2020-01-01 14 NaN NaN
2019-12-25 4 3132 439
2019-12-26 4 NaN NaN
2019-12-27 4 3177 616
2019-12-28 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-31 4 NaN NaN
2020-01-01 4 500 883
I have tried to reindex by the range, but it doesn't work because there are duplicate index values.
idx = pd.date_range('25-12-2019', '01-01-2020')
df = df.reindex(idx)
I have also tried indexing by date and product and then reindexing, but I don't know how to insert the products that are missing.
Any more ideas?
Thanks in advance
We can use pd.date_range and a groupby with reindex to achieve your result:
date_range = pd.date_range(start='2019-12-25', end='2020-01-01', freq='D')
df = df.groupby('product', sort=False).apply(lambda x: x.reindex(date_range))
df['product'] = df.groupby(level=0)['product'].ffill().bfill()
df = df.droplevel(0)
product udsStock udsSales
2019-12-25 14.0 NaN NaN
2019-12-26 14.0 161.0 848.0
2019-12-27 14.0 1340.0 914.0
2019-12-28 14.0 NaN NaN
2019-12-29 14.0 NaN NaN
2019-12-30 14.0 856.0 0.0
2019-12-31 14.0 NaN NaN
2020-01-01 14.0 NaN NaN
2019-12-25 4.0 3132.0 439.0
2019-12-26 4.0 NaN NaN
2019-12-27 4.0 3177.0 616.0
2019-12-28 4.0 NaN NaN
2019-12-29 4.0 NaN NaN
2019-12-30 4.0 NaN NaN
2019-12-31 4.0 NaN NaN
2020-01-01 4.0 500.0 883.0
Convert the index to a datetime object:
df2.index = pd.to_datetime(df2.index)
Create unique combinations of date and product:
import itertools
idx = pd.date_range("25-12-2019", "01-01-2020")
product = df2["product"].unique()
temp = itertools.product(idx, product)
temp = pd.MultiIndex.from_tuples(temp, names=["date", "product"])
temp
MultiIndex([('2019-12-25', 14),
('2019-12-25', 4),
('2019-12-26', 14),
('2019-12-26', 4),
('2019-12-27', 14),
('2019-12-27', 4),
('2019-12-28', 14),
('2019-12-28', 4),
('2019-12-29', 14),
('2019-12-29', 4),
('2019-12-30', 14),
('2019-12-30', 4),
('2019-12-31', 14),
('2019-12-31', 4),
('2020-01-01', 14),
('2020-01-01', 4)],
names=['date', 'product'])
Reindex the dataframe:
df2.set_index("product", append=True).reindex(temp).sort_index(
level=1, ascending=False
).reset_index(level="product")
product udsStock udsSales
date
2020-01-01 14 NaN NaN
2019-12-31 14 NaN NaN
2019-12-30 14 856.0 0.0
2019-12-29 14 NaN NaN
2019-12-28 14 NaN NaN
2019-12-27 14 1340.0 914.0
2019-12-26 14 161.0 848.0
2019-12-25 14 NaN NaN
2020-01-01 4 500.0 883.0
2019-12-31 4 NaN NaN
2019-12-30 4 NaN NaN
2019-12-29 4 NaN NaN
2019-12-28 4 NaN NaN
2019-12-27 4 3177.0 616.0
2019-12-26 4 NaN NaN
2019-12-25 4 3132.0 439.0
In R, specifically tidyverse, it can be achieved with the complete method. In Python, the pyjanitor package has something similar, but a few kinks remain to be ironed out (A PR has been submitted already for this).
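For reference, a rough pandas-only sketch of what a complete-style expansion looks like for this case (assuming, as above, that df2 has the dates as its index and a product column):
idx = pd.date_range('2019-12-25', '2020-01-01')
full_index = pd.MultiIndex.from_product(
    [idx, df2['product'].unique()], names=['date', 'product'])
# expand to every date/product combination, then bring product back as a column
out = (df2.set_index('product', append=True)
          .reindex(full_index)
          .reset_index(level='product'))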
I have data in the below format:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:55:00 18 1047
Expected Output:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:20:00 0 41
xyz 01-01-2020 00:25:00 0 41
xyz 01-01-2020 00:30:00 0 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:50:00 0 1029
xyz 01-01-2020 11:55:00 18 1047
So I want to fill in timestamps at 5-minute intervals, fill the flowers column with 0, and fill the total_flowers column with the previous value (ffill).
My efforts:
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00+05:30")
end_time = pd.to_datetime(f"{end_day} 23:55:00+05:30")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min')
df = df.set_index('timestamp').reindex(dates).reset_index(drop=False).reindex(columns=df.columns)
How do I fill the flowers column with zeros and the total_flowers column with ffill? I am also getting NaN values in the timestamp column.
Actual Output:
user timestamp flowers total_flowers
xyz Nan 15 15
xyz Nan 5 20
xyz Nan 21 41
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan 1 42
...
xyz Nan 57 1029
xyz Nan Nan Nan
xyz Nan 18 1047
Reindex and refill
If you construct the dates such that you can reindex your timestamps, you can then just do some fillna and ffill operations. I had to remove the timezone information, but you should be able to add that back if your data are timezone aware. Here's the full example using some of your data:
import pandas as pd

d = {'user': {0: 'xyz', 1: 'xyz', 2: 'xyz', 3: 'xyz'},
     'timestamp': {0: pd.Timestamp('2020-01-01 00:05:00'),
                   1: pd.Timestamp('2020-01-01 00:10:00'),
                   2: pd.Timestamp('2020-01-01 00:15:00'),
                   3: pd.Timestamp('2020-01-01 00:35:00')},
     'flowers': {0: 15, 1: 5, 2: 21, 3: 1},
     'total_flowers': {0: 15, 1: 20, 2: 41, 3: 42}}
df = pd.DataFrame(d)
# user timestamp flowers total_flowers
#0 xyz 2020-01-01 00:05:00 15 15
#1 xyz 2020-01-01 00:10:00 5 20
#2 xyz 2020-01-01 00:15:00 21 41
#3 xyz 2020-01-01 00:35:00 1 42
#as you did, but with no TZ
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00")
end_time = pd.to_datetime(f"{end_day} 00:55:00")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min', name="timestamp")
#filling the nas and reformatting
df = df.set_index('timestamp')
df = df.reindex(dates)
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.reset_index(inplace=True)
Output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
7 2020-01-01 00:40:00 xyz 0.0 42.0
8 2020-01-01 00:45:00 xyz 0.0 42.0
9 2020-01-01 00:50:00 xyz 0.0 42.0
10 2020-01-01 00:55:00 xyz 0.0 42.0
Resample and refill
You can also use resample here using asfreq(), then do the filling as before. This is convenient for finding the dates (and should get around the timezone stuff):
# resample and then fill the gaps
# same df as constructed above
df = df.set_index('timestamp')
df = df.resample('5T').asfreq()
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.index.name='timestamp'
df.reset_index(inplace=True)
Same output:
timestamp flowers total_flowers user
0 2020-01-01 00:05:00 15 15.0 xyz
1 2020-01-01 00:10:00 5 20.0 xyz
2 2020-01-01 00:15:00 21 41.0 xyz
3 2020-01-01 00:20:00 0 41.0 xyz
4 2020-01-01 00:25:00 0 41.0 xyz
5 2020-01-01 00:30:00 0 41.0 xyz
6 2020-01-01 00:35:00 1 42.0 xyz
I couldn't find a way to do the filling during the resampling itself. For instance, using
df = df.resample('5T').agg({'flowers': 'sum',
                            'total_flowers': 'ffill',
                            'user': 'ffill'})
does not work (it gets you to the same place as asfreq, but with more room to accidentally miss a column). That is odd, because applying ffill over the whole DataFrame does forward-fill the missing data (but we only want that for some columns, and the user column also gets dropped). Simply using asfreq and doing the filling after the fact seems fine to me with few columns.
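A compact after-the-fact alternative (a sketch, assuming df is the original small example frame built above, with timestamp still a column and before any of the filling) is to chain fillna with a per-column dict and then ffill the rest:
out = (df.set_index('timestamp')
         .resample('5T').asfreq()
         .fillna({'flowers': 0})   # zero-fill only the flowers column
         .ffill()                  # forward-fill user and total_flowers
         .reset_index())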
crossed with #Tom
You are almost there:
df = pd.DataFrame({'user': ['xyz', 'xyz', 'xyz', 'xyz'],
'timestamp': ['01-01-2020 00:05:00', '01-01-2020 00:10:00', '01-01-2020 00:15:00', '01-01-2020 00:35:00'],
'flowers':[15, 5, 21, 1],
'total_flowers':[15, 20, 41, 42]
})
df['timestamp'] = pd.to_datetime(df['timestamp'])
r = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='5Min')
df = df.set_index('timestamp').reindex(r).rename_axis('timestamp').reset_index()
df['user'].ffill(inplace=True)
df['total_flowers'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
leads to the following output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0