I have the following problem:
I have a table in which the customer number, the date and the sales are stored. The customer transactions are available on the first of each month. It may happen that a customer has not placed an order every month. The table looks like this:
ID  Date        Revenues
1   2021-05-01  100
1   2021-07-01  200
1   2021-08-01  100
1   2021-10-01  200
2   2021-12-01  300
2   2022-01-01  400
Now I want to append a fixed number of rows to each group, with dates running from today for a given number of months into the future. The ID should stay the same, the date should increase by one month per row, and the Revenues column should be filled using the moving-average method.
The table should look like this:
ID  Date        Revenues
1   2021-05-01  100
1   2021-07-01  200
1   2021-08-01  100
1   2021-10-01  200
1   2022-04-01  150
1   2022-05-01  150
2   2021-12-01  300
2   2022-01-01  400
2   2022-04-01  350
2   2022-05-01  350
How can I solve this problem?
Thank you for your help :)
If I understand you correctly:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])


def reindex(x):
    # Build a month-start range from the group's first date to three months later
    min_date = x["Date"].min()
    r = pd.date_range(min_date, min_date + pd.DateOffset(months=3), freq="MS")
    x = x.set_index("Date").reindex(r)
    # Rows created by reindex are empty: copy the ID, fill revenue with the group mean
    x["ID"] = x["ID"].ffill().bfill()
    x["Revenues"] = x["Revenues"].fillna(x["Revenues"].mean())
    return x


x = df.groupby("ID", as_index=False).apply(reindex).droplevel(0).reset_index()
x = x.rename(columns={"index": "Date"})
print(x.to_markdown())
Prints:
   Date                 ID  Revenues
0  2021-12-01 00:00:00   1       100
1  2022-01-01 00:00:00   1       200
2  2022-02-01 00:00:00   1       150
3  2022-03-01 00:00:00   1       150
4  2021-12-01 00:00:00   2       300
5  2022-01-01 00:00:00   2       400
6  2022-02-01 00:00:00   2       350
7  2022-03-01 00:00:00   2       350
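The answer above fills gaps starting from each group's first date. For the layout the question actually asks for (extra rows starting at a given month, each filled with the group's average), a minimal sketch could look like the following. The helper name `extend_with_mean` and its `start` and `months` parameters are made up for illustration, and it assumes that "moving average" means the plain mean of each ID's existing revenues, as the expected table suggests.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "Date": pd.to_datetime(["2021-05-01", "2021-07-01", "2021-08-01",
                            "2021-10-01", "2021-12-01", "2022-01-01"]),
    "Revenues": [100, 200, 100, 200, 300, 400],
})


def extend_with_mean(df, start, months):
    """Hypothetical helper: append `months` month-start rows per ID,
    beginning at `start`, filled with that ID's mean revenue."""
    # Month-start dates to add; a mid-month `start` rolls forward to the next month start
    future_dates = pd.date_range(start, periods=months, freq="MS")
    # Assumption: the forecast value is the plain mean of the group's history
    means = df.groupby("ID")["Revenues"].mean()
    future = pd.DataFrame(
        [(id_, date, mean) for id_, mean in means.items() for date in future_dates],
        columns=["ID", "Date", "Revenues"],
    )
    return (pd.concat([df, future], ignore_index=True)
              .sort_values(["ID", "Date"], ignore_index=True))


result = extend_with_mean(df, start="2022-04-01", months=2)
print(result)
```

To start from the current month, something like pd.Timestamp.today().normalize() could be passed as start; a fixed date is used here so the result is reproducible.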
I want to get the sum of a column's values over the next 7 days.
My dataframe:
date value
0 2021-04-29 1
1 2021-05-03 2
2 2021-05-06 1
3 2021-05-15 1
4 2021-05-17 2
5 2021-05-18 1
6 2021-05-21 2
7 2021-05-22 5
8 2021-05-24 4
I tried to make a new column that contains the date 7 days after the current date:
df['temp'] = df['date'] + timedelta(days=7)
then calculate value between date range :
df['next_7days'] = df[(df.date > df.date) & (df.date <= df.temp)].value.sum()
But this gives me 0 for every row.
intended result:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
The method I am currently using is quite tedious. Are there any better methods to get the intended result?
The attempt returns 0 everywhere because df.date > df.date is False for every row, so the mask selects nothing. With a list comprehension:
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
                    for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]
Here we first compute and store, for each row, tomorrow's date and the date one week ahead. We then zip the two series together and call between on the date column to get a boolean mask that is True where a date falls inside that range, use the mask to select the matching values, and sum them. This is repeated for each date pair.
This gives:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
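The list comprehension scans the whole column once per row. If the frame is large, a sketch along the following lines may scale better. It is a different technique from the answer above (a cumulative sum plus np.searchsorted) and assumes the dates can be sorted in ascending order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-29", "2021-05-03", "2021-05-06",
                            "2021-05-15", "2021-05-17", "2021-05-18",
                            "2021-05-21", "2021-05-22", "2021-05-24"]),
    "value": [1, 2, 1, 1, 2, 1, 2, 5, 4],
})

# searchsorted needs sorted input
df = df.sort_values("date").reset_index(drop=True)
dates = df["date"].to_numpy()

# csum[k] is the sum of the first k values, so csum[hi] - csum[lo] sums rows lo..hi-1
csum = np.concatenate(([0], df["value"].to_numpy().cumsum()))

# lo: first row strictly after the current date; hi: one past the last row within 7 days
lo = np.searchsorted(dates, dates, side="right")
hi = np.searchsorted(dates, dates + np.timedelta64(7, "D"), side="right")

df["next_7days"] = csum[hi] - csum[lo]
print(df)
```

As in the list comprehension, rows that share the current row's date are excluded from its window.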
I've got a dataframe that looks like this
date id
0 2019-01-15 c-15-Jan-2019-0
1 2019-01-26 c-26-Jan-2019-1
2 2019-02-02 c-02-Feb-2019-2
3 2019-02-15 c-15-Feb-2019-3
4 2019-02-23 c-23-Feb-2019-4
and I'd like to create a new column called 'days_since' that shows the number of days that have gone by since the last record. For instance, the new column would be
date id days_since
0 2019-01-15 c-15-Jan-2019-0 NaN
1 2019-01-26 c-26-Jan-2019-1 11
2 2019-02-02 c-02-Feb-2019-2 7
3 2019-02-15 c-15-Feb-2019-3 13
4 2019-02-23 c-23-Feb-2019-4 8
I tried to use
df_c['days_since'] = df_c.groupby('id')['date'].diff().apply(lambda x: x.days)
but that just returned a column full of null values. The date column is full of datetime objects. Any ideas?
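I think you are making it too complicated. Every id in the sample is unique, so groupby('id') puts each row in its own group and diff() has nothing to compare against, which is why every value is null. Given that the date column contains datetime data, you can use: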
>>> df['date'].diff()
0 NaT
1 11 days
2 7 days
3 13 days
4 8 days
Name: date, dtype: timedelta64[ns]
or if you want the number of days:
>>> df['date'].diff().dt.days
0 NaN
1 11.0
2 7.0
3 13.0
4 8.0
Name: date, dtype: float64
So you can assign the number of days with:
df['days_since'] = df['date'].diff().dt.days
This gives us:
>>> df
date days_since
0 2019-01-15 NaN
1 2019-01-26 11.0
2 2019-02-02 7.0
3 2019-02-15 13.0
4 2019-02-23 8.0
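For reference, a small self-contained sketch that reproduces both the all-null result from the question and the fix, using the sample data from the question (the variable name per_id is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-15", "2019-01-26", "2019-02-02",
                            "2019-02-15", "2019-02-23"]),
    "id": ["c-15-Jan-2019-0", "c-26-Jan-2019-1", "c-02-Feb-2019-2",
           "c-15-Feb-2019-3", "c-23-Feb-2019-4"],
})

# Grouping by a unique id leaves one row per group, so every diff is NaT
per_id = df.groupby("id")["date"].diff()
print(per_id.isna().all())

# Differencing the whole column compares each row with the previous one
df["days_since"] = df["date"].diff().dt.days
print(df)
```

If the ids did repeat and the gap should be measured per id, the groupby version would be the right one, followed by .dt.days instead of apply.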
I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W') (see df_final below), but it groups the weeks starting on Sundays.
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
        ("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
        ("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
        ("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
        ("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
        ("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
        ("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
        ("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
                              Forecast
Site Type Product Date
W1   G1   1234    2015-07-01         8
                  2015-07-30         2
                  2015-07-15         2
                  2015-07-02         4
     G2   2345    2015-07-05         5
                  2015-07-07         1
                  2015-07-09         1
                  2015-07-11         3
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W (see anchored offsets):
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print(df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem as described below. First, I should state that the ex-accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt':pd.date_range('2020-03-08', periods=9, freq='D'),
                  'counts':0})
> s
   dt                   counts
0  2020-03-08 00:00:00       0
1  2020-03-09 00:00:00       0
2  2020-03-10 00:00:00       0
3  2020-03-11 00:00:00       0
4  2020-03-12 00:00:00       0
5  2020-03-13 00:00:00       0
6  2020-03-14 00:00:00       0
7  2020-03-15 00:00:00       0
8  2020-03-16 00:00:00       0
These nine days span three Monday-to-Sunday weeks. The weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt',freq='W-Mon')).count()
dt                   counts
2020-03-09 00:00:00       2
2020-03-16 00:00:00       7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W'
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
dt                   counts
2020-03-08 00:00:00       1
2020-03-15 00:00:00       7
2020-03-22 00:00:00       1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
Alternatively, we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
A third solution, arguably the most readable one, is to convert dt to a period first, then group, and finally (if needed) convert back to a timestamp:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
All of these solutions return what the OP asked for:
dt                   counts
2020-03-02 00:00:00       1
2020-03-09 00:00:00       7
2020-03-16 00:00:00       1
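A quick way to convince yourself that the three approaches agree is to run them side by side on the same nine-day frame. This is only a sanity-check sketch; the names by_shift, by_left and by_period are introduced here for illustration.

```python
import pandas as pd

# Nine consecutive days; 2020-03-08 is a Sunday
s = pd.DataFrame({"dt": pd.date_range("2020-03-08", periods=9, freq="D"),
                  "counts": 0})

# 1) Group Monday to Sunday (label = Sunday), then shift labels back to Monday
by_shift = s.groupby(pd.Grouper(key="dt", freq="W")).count()
by_shift.index = by_shift.index - pd.Timedelta(days=6)

# 2) Anchor on Monday and label/close each bin on its left edge
by_left = s.groupby(pd.Grouper(key="dt", freq="W-MON",
                               label="left", closed="left")).count()

# 3) Convert to weekly periods, group, and convert back to the period start
by_period = s.groupby(s["dt"].dt.to_period("W"))["counts"].count().to_timestamp()

print(by_shift)
print(by_left)
print(by_period)
```

The third variant returns a Series rather than a DataFrame, because a single column is selected before counting.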
Explanation: when a weekly freq such as W is passed to pd.Grouper, both the closed and label kwargs default to right (most other frequencies default to left). Setting freq to W (short for W-SUN) works because we want each week to end on Sunday, with Sunday included, and g.closed == 'right' handles this. Unfortunately, the pd.Grouper docstring does not show the default values, but you can inspect them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right
I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
I would like to generate the interval column: the minutes between consecutive rows, but only for the same id and the same day, just like in the example. In SQL I would partition by id and date and use LAG to get the time interval from the previous row. How can I do it in pandas?
You can convert the datetime column with to_datetime, then use groupby with diff and convert the resulting timedelta to minutes with dt.total_seconds() / 60 (recent pandas versions reject astype('timedelta64[m]')). Grouping by dt.date rather than dt.day keeps the same day of different months apart:
print(df)
   id         datetime  interval
0   1  20160101 070000       NaN
1   1  20160101 080000        60
2   1  20160102 070000       NaN
3   1  20160102 073000        30
4   2  20160101 071500       NaN
5   2  20160101 071600         1
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df['new'] = df.groupby(['id', df['datetime'].dt.date])['datetime'].diff().dt.total_seconds() / 60
print(df)
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
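One detail worth checking when partitioning "by day" is that dt.day is only the day of the month, so the 1st of January and the 1st of February end up in the same group. The sketch below uses made-up data that spans two months to show the difference from grouping by the full calendar date; the column names by_day and by_date are only for illustration.

```python
import pandas as pd

# Made-up sample: one id, two rows on 1 January and one on 1 February
df = pd.DataFrame({
    "id": [1, 1, 1],
    "datetime": pd.to_datetime(["20160101 070000", "20160101 080000",
                                "20160201 073000"],
                               format="%Y%m%d %H%M%S"),
})

# Day of month only: the February row is diffed against the January row
df["by_day"] = (df.groupby(["id", df["datetime"].dt.day])["datetime"]
                  .diff().dt.total_seconds() / 60)

# Full calendar date: each day is its own partition, like PARTITION BY id, date
df["by_date"] = (df.groupby(["id", df["datetime"].dt.date])["datetime"]
                   .diff().dt.total_seconds() / 60)

print(df)
```

With data confined to a single month, as in the question, both groupings give the same result.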