How to add a day to a date column? - python

I want to add one day to every value in the date column of this dataframe:
value B N S date
date
2020-12-31 1 11 0 2020-12-31
2021-01-01 3 80 0 2021-01-01
2021-01-02 4 99 0 2021-01-02
2021-01-03 3 78 0 2021-01-03
2021-01-04 0 50 0 2021-01-04
to make it like this:
value B N S date
date
2020-12-31 1 11 0 2021-01-01
2021-01-01 3 80 0 2021-01-02
2021-01-02 4 99 0 2021-01-03
2021-01-03 3 78 0 2021-01-04
2021-01-04 0 50 0 2021-01-05
How can I do this?

df['date']=pd.to_datetime(df['date']).add(pd.offsets.Day(1))
df
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05

You can temporarily convert to datetime to add a DateOffset:
df['date'] = (pd.to_datetime(df['date'])
.add(pd.DateOffset(days=1))
.dt.strftime('%Y-%m-%d') # optional
)
Output:
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
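For reference, a minimal self-contained sketch of the same idea using pd.Timedelta instead of an offset object. The two-row frame is a made-up subset of the question's data, used only for illustration:

```python
import pandas as pd

# Hypothetical two-row subset of the question's frame; dates start as strings.
df = pd.DataFrame({
    "value": ["2020-12-31", "2021-01-01"],
    "B": [1, 3],
    "date": ["2020-12-31", "2021-01-01"],
})

# Convert to datetime, then shift every value forward by exactly one day.
df["date"] = pd.to_datetime(df["date"]) + pd.Timedelta(days=1)

# Optional: convert back to strings if the column should not stay datetime.
as_text = df["date"].dt.strftime("%Y-%m-%d")
print(as_text.tolist())
```

For whole-day shifts like this one, pd.Timedelta(days=1), pd.offsets.Day(1) and pd.DateOffset(days=1) should give the same result on timezone-naive data.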

Related

Pandas count number of online devices at a time

I have the following dataframe, a log that records each time a device connects to a new gateway:
device  gateway  time
222     1        2021-01-01 05:02:03
222     2        2021-01-02 06:02:04
222     1        2021-01-03 02:02:53
223     3        2021-01-01 01:22:08
...     ...      ...
222     1        2021-02-01 12:32:23
For each minute and each gateway, I want to know how many devices are currently connected to that gateway:
gateway  minute               count
1        2021-01-01 00:00:00  0
2        2021-01-01 00:00:00  0
3        2021-01-01 00:00:00  0
1        2021-01-01 00:01:00  0
...      ...                  ...
1        2021-01-01 05:02:00  1
1        2021-01-01 05:03:00  1
1        2021-01-01 05:04:00  1
1        2021-01-01 05:05:00  1
...      ...                  ...
1        2021-01-02 06:02:00  0
...      ...                  ...
How can I accomplish this using pandas?
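Try groupby with Grouper (this counts the connection events logged per gateway in each minute):
# 'min' is the minute alias; the older 'T' alias is deprecated in recent pandas versions
df.groupby(['gateway', pd.Grouper(freq='min', key='time')]).size().reset_index(name='count')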
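Grouping by minute with size() counts the log rows that fall in each minute. If the goal is the number of devices still attached to each gateway, one possible sketch is to treat every log row as +1 for the new gateway and -1 for the device's previous gateway, then take a running total per minute. The sample rows, the delta column, and the assumption that a device stays on a gateway until its next log row are illustrative and not part of the original post:

```python
import pandas as pd

# Hypothetical log: device 222 moves 1 -> 2 -> 1, device 223 joins gateway 3.
log = pd.DataFrame({
    "device": [222, 222, 222, 223],
    "gateway": [1, 2, 1, 3],
    "time": pd.to_datetime([
        "2021-01-01 05:02:03",
        "2021-01-01 05:04:10",
        "2021-01-01 05:06:30",
        "2021-01-01 05:03:00",
    ]),
}).sort_values(["device", "time"])

# Each row is an arrival (+1) on its gateway ...
joins = log.assign(delta=1)
# ... and a departure (-1) from the gateway the same device used before.
leaves = log.assign(gateway=log.groupby("device")["gateway"].shift(), delta=-1)
leaves = leaves.dropna(subset=["gateway"]).astype({"gateway": int})

events = pd.concat([joins, leaves])
events["minute"] = events["time"].dt.floor("min")

# Net change per minute and gateway, filled out to every minute, then accumulated.
net = events.groupby(["minute", "gateway"])["delta"].sum().unstack(fill_value=0)
minutes = pd.date_range(net.index.min(), net.index.max(), freq="min")
counts = net.reindex(minutes, fill_value=0).cumsum()
print(counts)
```

Here counts has one row per minute and one column per gateway, and each value is the number of devices attached once all events in that minute have been applied.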

Using groupby's aggregation to populate a new column

Given this dataframe df:
date type target
2021-01-01 0 5
2021-01-01 0 6
2021-01-01 1 4
2021-01-01 1 2
2021-01-02 0 5
2021-01-02 1 3
2021-01-02 1 7
2021-01-02 0 1
2021-01-03 0 2
2021-01-03 1 5
I want to create a new column that contains yesterday's target mean by type.
For example, for the 5th row (date=2021-01-02, type=0) the new column's value would be 5.5, as the mean of the target for the previous day, 2021-01-01 for type=0 is (5+6)/2.
I can easily obtain the mean of target grouping by date and type as:
means = df.groupby(['date', 'type'])['target'].mean()
But I don't know how to create a new column on the original dataframe with the desired data, which should look as follows:
date type target mean
2021-01-01 0 5 NaN (or null or whatever)
2021-01-01 0 6 NaN
2021-01-01 1 4 NaN
2021-01-01 1 2 NaN
2021-01-02 0 5 5.5
2021-01-02 1 3 3
2021-01-02 1 7 3
2021-01-02 0 1 5.5
2021-01-03 0 2 3.0
2021-01-03 1 5 5
Ensure your date column is datetime, and add another temporary column to df of the date the day before:
df['date'] = pd.to_datetime(df['date'])
df['yesterday'] = df['date'] - pd.Timedelta('1 day')
Then use your means groupby, with as_index=False, and left merge that onto the original df on yesterday/date and type columns, and select the desired columns:
means = df.groupby(['date', 'type'], as_index=False)['target'].mean()
df.merge(means, left_on=['yesterday', 'type'], right_on=['date', 'type'],
how='left', suffixes=[None, ' mean'])[['date', 'type', 'target', 'target mean']]
Output:
date type target target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
The idea is to add one day to the first level of the MultiIndex Series with a Timedelta, so the new column can be added with DataFrame.join:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby(['date', 'type'])['target'].mean()
s2 = s1.rename(index=lambda x: x + pd.Timedelta(days=1), level=0)
df = df.join(s2.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
Another solution:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby([df['date'] + pd.Timedelta(days=1), 'type'])['target'].mean()
df = df.join(s1.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
A small variation on @Emi OB's answer (note that shift(2) assumes every date has exactly two types and that no day is missing):
means = df.groupby(["date", "type"], as_index=False)["target"].mean()
means["mean"] = means.pop("target").shift(2)
df = df.merge(means, how="left", on=["date", "type"])
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
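The positional shift relies on the layout of the data, whereas keying the lookup on the date itself does not. A minimal sketch on made-up data with a missing day (2021-01-03 has no rows), where the grouped means are re-keyed to the following day and merged back; the frame and the names prev and out are illustrative only:

```python
import pandas as pd

# Hypothetical data: a single type and a gap on 2021-01-03.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-04"]),
    "type": [0, 0, 0, 0],
    "target": [5, 6, 1, 2],
})

# Mean per (date, type), then move each mean to the day it should apply to.
prev = df.groupby(["date", "type"], as_index=False)["target"].mean()
prev["date"] = prev["date"] + pd.Timedelta(days=1)
prev = prev.rename(columns={"target": "mean"})

# Left merge keeps every original row; dates without a previous day get NaN.
out = df.merge(prev, how="left", on=["date", "type"])
print(out)
```

The row for 2021-01-04 gets NaN because there is no data for 2021-01-03, which is the behavior the question's "yesterday" wording suggests.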

Efficiently counting records with date in between two columns

Say I have this DataFrame:
   user     sub_date             unsub_date           group
0  alice    2021-01-01 00:00:00  2021-02-09 00:00:00  A
1  bob      2021-02-03 00:00:00  2021-04-05 00:00:00  B
2  charlie  2021-02-03 00:00:00  NaT                  A
3  dave     2021-01-29 00:00:00  2021-09-01 00:00:00  B
What is the most efficient way to count the subbed users per date and per group? In other words, to get this DataFrame:
date        group  subbed
2021-01-01  A      1
2021-01-01  B      0
2021-01-02  A      1
2021-01-02  B      0
...         ...    ...
2021-02-03  A      2
2021-02-03  B      2
...         ...    ...
2021-02-10  A      1
2021-02-10  B      2
...         ...    ...
Here's a snippet to init the example df:
import pandas as pd
import datetime as dt
users = pd.DataFrame(
[
["alice", "2021-01-01", "2021-02-09", "A"],
["bob", "2021-02-03", "2021-04-05", "B"],
["charlie", "2021-02-03", None, "A"],
["dave", "2021-01-29", "2021-09-01", "B"],
],
columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
pd.to_datetime
)
I'm using a smaller date range for convenience.
Note: my users df is different from the OP's. I've changed a few dates to keep the outputs small.
In [26]: import pandas as pd
...: import datetime as dt
...: import numpy as np
...:
...: users = pd.DataFrame(
...: [
...: ["alice", "2021-01-01", "2021-01-05", "A"],
...: ["bob", "2021-01-03", "2021-01-07", "B"],
...: ["charlie", "2021-01-03", None, "A"],
...: ["dave", "2021-01-09", "2021-01-11", "B"],
...: ],
...: columns=["user", "sub_date", "unsub_date", "group"],
...: )
...:
...: users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
...: pd.to_datetime
...: )
In [81]: users
Out[81]:
user sub_date unsub_date group
0 alice 2021-01-01 2021-01-05 A
1 bob 2021-01-03 2021-01-07 B
2 charlie 2021-01-03 NaT A
3 dave 2021-01-09 2021-01-11 B
In [82]: users.melt(id_vars=['user', 'group'])
Out[82]:
user group variable value
0 alice A sub_date 2021-01-01
1 bob B sub_date 2021-01-03
2 charlie A sub_date 2021-01-03
3 dave B sub_date 2021-01-09
4 alice A unsub_date 2021-01-05
5 bob B unsub_date 2021-01-07
6 charlie A unsub_date NaT
7 dave B unsub_date 2021-01-11
# dropna to remove rows with no unsub_date
# sort_values to sort by date
# sub_date exists -> map to 1, else -1 then take cumsum to get # of subbed people at that date
In [85]: melted = users.melt(id_vars=['user', 'group']).dropna().sort_values('value')
...: melted['sub_value'] = np.where(melted['variable'] == 'sub_date', 1, -1) # or melted['variable'].map({'sub_date': 1, 'unsub_date': -1})
...: melted['sub_cumsum_group'] = melted.groupby('group')['sub_value'].cumsum()
...: melted
Out[85]:
user group variable value sub_value sub_cumsum_group
0 alice A sub_date 2021-01-01 1 1
1 bob B sub_date 2021-01-03 1 1
2 charlie A sub_date 2021-01-03 1 2
4 alice A unsub_date 2021-01-05 -1 1
5 bob B unsub_date 2021-01-07 -1 0
3 dave B sub_date 2021-01-09 1 1
7 dave B unsub_date 2021-01-11 -1 0
In [93]: idx = pd.date_range(melted['value'].min(), melted['value'].max(), freq='1D')
...: idx
Out[93]:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10', '2021-01-11'],
dtype='datetime64[ns]', freq='D')
In [94]: melted.set_index('value').groupby('group')['sub_cumsum_group'].apply(lambda x: x.reindex(idx).ffill().fillna(0))
Out[94]:
group
A 2021-01-01 1.0
2021-01-02 1.0
2021-01-03 2.0
2021-01-04 2.0
2021-01-05 1.0
2021-01-06 1.0
2021-01-07 1.0
2021-01-08 1.0
2021-01-09 1.0
2021-01-10 1.0
2021-01-11 1.0
B 2021-01-01 0.0
2021-01-02 0.0
2021-01-03 1.0
2021-01-04 1.0
2021-01-05 1.0
2021-01-06 1.0
2021-01-07 0.0
2021-01-08 0.0
2021-01-09 1.0
2021-01-10 1.0
2021-01-11 0.0
Name: sub_cumsum_group, dtype: float64
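The same +1/-1 idea can be sketched end to end on the question's original frame, producing the long date/group/subbed layout that was asked for. This version sums the events per day before taking the cumulative sum, so several users changing state on the same day in the same group do not create duplicate index labels. The names events, delta, daily and result are illustrative, and the unsubscribe day is counted as already unsubscribed, which is an assumption:

```python
import pandas as pd

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(pd.to_datetime)

# One +1 event per subscription, one -1 event per (known) unsubscription.
subs = users[["sub_date", "group"]].rename(columns={"sub_date": "date"}).assign(delta=1)
unsubs = (users[["unsub_date", "group"]]
          .rename(columns={"unsub_date": "date"})
          .dropna()
          .assign(delta=-1))
events = pd.concat([subs, unsubs])

# Net change per day and group, filled out to every calendar day, then accumulated.
net = events.groupby(["date", "group"])["delta"].sum().unstack(fill_value=0)
days = pd.date_range(net.index.min(), net.index.max(), freq="D")
daily = net.reindex(days, fill_value=0).cumsum()
daily.index.name = "date"

# Back to one row per (date, group).
result = (daily.reset_index()
          .melt(id_vars="date", var_name="group", value_name="subbed")
          .sort_values(["date", "group"], ignore_index=True))
print(result.head())
```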
The data is described by step functions, and staircase can be used for these applications
import staircase as sc
stepfunctions = users.groupby("group").apply(sc.Stairs, "sub_date", "unsub_date")
stepfunctions will be a pandas.Series, indexed by group, and the values are Stairs objects which represent step functions.
group
A <staircase.Stairs, id=2516834869320>
B <staircase.Stairs, id=2516112096072>
dtype: object
You could plot the step function for A if you wanted like so
stepfunctions["A"].plot()
The next step is to sample the step functions at whatever dates you want, e.g. for every day of January:
sc.sample(stepfunctions, pd.date_range("2021-01-01", "2021-02-01")).melt(ignore_index=False).reset_index()
The result is this
group variable value
0 A 2021-01-01 1
1 B 2021-01-01 0
2 A 2021-01-02 1
3 B 2021-01-02 0
4 A 2021-01-03 1
.. ... ... ...
59 B 2021-01-30 1
60 A 2021-01-31 1
61 B 2021-01-31 1
62 A 2021-02-01 1
63 B 2021-02-01 1
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Try this?
>>> users.groupby(['sub_date','group'])[['user']].count()

Creating a DataFrame with a row for each date from date range in other DataFrame

Below is a script for a simplified version of the df in question:
plan_dates=pd.DataFrame({'id':[1,2,3,4,5],
'start_date':['2021-01-01','2021-01-01','2021-01-03','2021-01-04','2021-01-05'],
'end_date': ['2021-01-04','2021-01-03','2021-01-03','2021-01-06','2021-01-08']})
plan_dates
id start_date end_date
0 1 2021-01-01 2021-01-04
1 2 2021-01-01 2021-01-03
2 3 2021-01-03 2021-01-03
3 4 2021-01-04 2021-01-06
4 5 2021-01-05 2021-01-08
I would like to create a new DataFrame with a row for each day where the plan is active, for each id.
INTENDED DF:
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
Any help would be greatly appreciated.
Use:
#first part is same like https://stackoverflow.com/a/66869805/2901002
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['start_date'] = df['start_date'].add(pd.to_timedelta(counter, unit='d'))
Then remove end_date column, rename and create default index:
df = (df.drop('end_date', axis=1)
.rename(columns={'start_date':'active_days'})
.reset_index(drop=True))
print (df)
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
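For small frames, a more literal (and likely slower) sketch is to build one pd.date_range per plan and collect the (id, day) pairs. The names rows and active are illustrative; the input is the plan_dates frame from the question:

```python
import pandas as pd

plan_dates = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "start_date": ["2021-01-01", "2021-01-01", "2021-01-03", "2021-01-04", "2021-01-05"],
    "end_date": ["2021-01-04", "2021-01-03", "2021-01-03", "2021-01-06", "2021-01-08"],
})

# date_range includes both endpoints, so no extra day needs to be added.
rows = [
    (plan.id, day)
    for plan in plan_dates.itertuples(index=False)
    for day in pd.date_range(plan.start_date, plan.end_date, freq="D")
]
active = pd.DataFrame(rows, columns=["id", "active_days"])
print(active)
```

The repeat/cumcount approach avoids the Python-level loop and should scale better on large inputs; this version trades speed for readability.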

Split date range rows into years (ungroup) - Python Pandas

I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see last row where end=2023 and start = 2020), keeping the same value for column A, while splitting proportionally the value in column B:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
import pandas as pd
# TEST DATA:
text=""" start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
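# NOTE: without dayfirst=True, dd.mm.yyyy strings are read month-first where possible
# (e.g. 01.04.2020 becomes 2020-01-04 in the output shown for this code); pass dayfirst=True to parse them day-first.
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])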
#print("\n----\n df:",df)
#----------------------------------------
# SOLUTION:
def split_years(r):
    """
    Split row 'r' where "end"-"start" greater than 0.
    The new rows have repeated values of 'A', and 'B' divided by the number of years.
    Return: a DataFrame with rows per year.
    """
    t1,t2 = r["start"], r["end"]
    ys= t2.year - t1.year
    kk= 0 if t1.is_year_end else 1
    if ys>0:
        l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
        l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
        return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
    print("year difference <= 0!")
    return None
# Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
print("\n---- grps:\n",grps)
# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)
# Extract the rows to be split:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)
# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr,"\n")
# Insert the "one year" data frame to the list, and concatenate them:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)
# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
Bit of a different approach, adding new columns instead of new rows. But I think this accomplishes what you want to do.
df["years_apart"] = (
    (df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
    df[f"{years}_end_date"] = pd.NaT
    df.loc[
        df["years_apart"] == years, f"{years}_end_date"
    ] = df.loc[
        df["years_apart"] == years, "start_date"
    ] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I solved it by creating a date difference and a counter that adds years to the repeated rows:
#calculate difference between start and end year
table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
table['diff'] = table['diff']+1
#replicate rows depending on number of years
table = table.reindex(table.index.repeat(table['diff']))
#counter that increase for diff>1, assign increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum()//table['diff']
table['start'] = np.where(table['diff']>1, table['start']+table['count']-1, table['start'])
table['end'] = table['start']
#split B among years
table['B'] = table['B']//table['diff']
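As a compact cross-check of the splitting described in the question, here is a minimal sketch that cuts each row at calendar-year boundaries and divides B equally among the resulting pieces. The helper split_by_year and the two sample rows are illustrative, and the equal split (rather than one weighted by days) is an assumption taken from the question's 2/4 example:

```python
import pandas as pd

def split_by_year(start, end):
    """Return (start, end) pieces of [start, end], one per calendar year touched."""
    pieces = []
    for year in range(start.year, end.year + 1):
        piece_start = max(start, pd.Timestamp(year=year, month=1, day=1))
        piece_end = min(end, pd.Timestamp(year=year, month=12, day=31))
        pieces.append((piece_start, piece_end))
    return pieces

# Two rows from the question, parsed explicitly as day.month.year.
df = pd.DataFrame({
    "start": pd.to_datetime(["01.04.2020", "01.01.2020"], format="%d.%m.%Y"),
    "end": pd.to_datetime(["30.04.2020", "31.12.2023"], format="%d.%m.%Y"),
    "A": [6, 1],
    "B": [2, 2],
})

rows = []
for r in df.itertuples(index=False):
    pieces = split_by_year(r.start, r.end)
    for s, e in pieces:
        # A is repeated unchanged; B is shared equally between the pieces.
        rows.append({"start": s, "end": e, "A": r.A, "B": r.B / len(pieces)})
result = pd.DataFrame(rows)
print(result)
```

Rows that already sit inside a single year pass through as one piece, so no separate grouping step is needed.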
