I have a simple dataframe with datetimes and their dates:
df = pd.DataFrame([['2021-01-01 10:10', '2021-01-01'],
                   ['2021-01-03 13:33', '2021-01-03'],
                   ['2021-01-03 14:44', '2021-01-03'],
                   ['2021-01-07 17:17', '2021-01-07'],
                   ['2021-01-07 07:07', '2021-01-07'],
                   ['2021-01-07 01:07', '2021-01-07'],
                   ['2021-01-09 09:09', '2021-01-09']],
                  columns=['datetime', 'date'])
I would like to create a new column containing the last datetime of each day.
I have something quite close, but the last datetime of the day is only filled in on the day's last row; every other cell gets NaT (Not a Time).
Can you suggest something better?
df['eod']=df.groupby('date')['datetime'].tail(1)
You are probably looking for transform, which broadcasts the result back to every row in the group.
df['eod'] = df.groupby('date')['datetime'].transform('last')
Output
datetime date eod
0 2021-01-01 10:10 2021-01-01 2021-01-01 10:10
1 2021-01-03 13:33 2021-01-03 2021-01-03 14:44
2 2021-01-03 14:44 2021-01-03 2021-01-03 14:44
3 2021-01-07 17:17 2021-01-07 2021-01-07 01:07
4 2021-01-07 07:07 2021-01-07 2021-01-07 01:07
5 2021-01-07 01:07 2021-01-07 2021-01-07 01:07
6 2021-01-09 09:09 2021-01-09 2021-01-09 09:09
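Note that 'last' depends on row order: in the output above, 2021-01-07 gets 01:07 simply because that row comes last in the frame. If "end of day" should mean the latest timestamp regardless of row order, a minimal sketch using 'max' instead (after converting the column to real datetimes):

```python
import pandas as pd

df = pd.DataFrame([['2021-01-01 10:10', '2021-01-01'],
                   ['2021-01-07 17:17', '2021-01-07'],
                   ['2021-01-07 01:07', '2021-01-07']],
                  columns=['datetime', 'date'])
df['datetime'] = pd.to_datetime(df['datetime'])

# 'max' picks the latest timestamp per day, independent of row order
df['eod'] = df.groupby('date')['datetime'].transform('max')
```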
You don't really need another date column if the date part comes from the datetime column. You can group by the dt.day of the datetime column, then call last to take the datetime value:
>>> df['datetime'] = pd.to_datetime(df['datetime'])
>>> df.groupby(df['datetime'].dt.day)['datetime'].last()
datetime
1 2021-01-01 10:10:00
3 2021-01-03 14:44:00
7 2021-01-07 01:07:00
9 2021-01-09 09:09:00
Name: datetime, dtype: datetime64[ns]
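One caveat: dt.day is the day of the month, so dates from different months that share a day number would land in the same group. A sketch of the same idea using dt.date, which stays safe across months:

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(
    ['2021-01-03 13:33', '2021-01-03 14:44', '2021-02-03 09:00'])})

# dt.date keeps January 3 and February 3 apart, where dt.day would merge them
last_per_day = df.groupby(df['datetime'].dt.date)['datetime'].last()
```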
Related
I have a dataframe with the following columns:
datetime: YYYY-MM-DD HH:MM (not continuous, there are some missing days)
date: df['datetime'].dt.date
X = various values
X_daily_cum = df.groupby(['date']).X.cumsum()
So X_daily_cum is the cumulative sum of X grouped per day; it resets every day.
Code to reproduce:
import pandas as pd
df = pd.DataFrame([['2021-01-01 10:10', 3],
                   ['2021-01-03 13:33', 7],
                   ['2021-01-03 14:44', 6],
                   ['2021-01-07 17:17', 2],
                   ['2021-01-07 07:07', 4],
                   ['2021-01-07 01:07', 9],
                   ['2021-01-09 09:09', 3]],
                  columns=['datetime', 'X'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
df['date'] = df['datetime'].dt.date
df['X_daily_cum'] = df.groupby(['date']).X.cumsum()
print(df)
Now I would like a new column whose value is the final cumulative sum of the previous available day, like this:
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 10:10:00 3 2021-01-01 3 NaN
1 2021-01-03 13:33:00 7 2021-01-03 7 3
2 2021-01-03 14:44:00 6 2021-01-03 13 3
3 2021-01-07 17:17:00 2 2021-01-07 2 13
4 2021-01-07 07:07:00 4 2021-01-07 6 13
5 2021-01-07 01:07:00 9 2021-01-07 15 13
6 2021-01-09 09:09:00 3 2021-01-09 3 15
Is there a clean way to do it with pandas, with an apply?
I have managed to do it in a clumsy way: copying the df, removing the datetime granularity, selecting the last record of each date, then joining this new df back onto the original. I would like a more elegant solution.
Thanks for the help
Use Series.mask with Series.duplicated to set missing values on every row except the last per date, then shift the values and forward fill the missing values:
df['last_day_cum_value'] = (df['X_daily_cum'].mask(df['date'].duplicated(keep='last'))
.shift()
.ffill())
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 10:10:00 3 2021-01-01 3 NaN
1 2021-01-03 13:33:00 7 2021-01-03 7 3.0
2 2021-01-03 14:44:00 6 2021-01-03 13 3.0
3 2021-01-07 17:17:00 2 2021-01-07 2 13.0
4 2021-01-07 07:07:00 4 2021-01-07 6 13.0
5 2021-01-07 01:07:00 9 2021-01-07 15 13.0
6 2021-01-09 09:09:00 3 2021-01-09 3 15.0
Old solution:
Use DataFrame.drop_duplicates by date keeping the last row, with Series.shift to get the previous day's value, then use Series.map to build the new column:
s = df.drop_duplicates('date', keep='last').set_index('date')['X_daily_cum'].shift()
print (s)
date
2021-01-01 NaN
2021-01-03 3.0
2021-01-07 13.0
2021-01-09 15.0
Name: X_daily_cum, dtype: float64
df['last_day_cum_value'] = df['date'].map(s)
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 10:10:00 3 2021-01-01 3 NaN
1 2021-01-03 13:33:00 7 2021-01-03 7 3.0
2 2021-01-03 14:44:00 6 2021-01-03 13 3.0
3 2021-01-07 17:17:00 2 2021-01-07 2 13.0
4 2021-01-07 07:07:00 4 2021-01-07 6 13.0
5 2021-01-07 01:07:00 9 2021-01-07 15 13.0
6 2021-01-09 09:09:00 3 2021-01-09 3 15.0
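Since the last cumulative value within a day is simply that day's total of X, an equivalent sketch (assuming rows are date-sorted, so the shift lines up with the previous available day) maps the shifted daily sums back:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-03', '2021-01-03',
                            '2021-01-07', '2021-01-07', '2021-01-07',
                            '2021-01-09'],
                   'X': [3, 7, 6, 2, 4, 9, 3]})

# each day's final cumulative value equals the day's total, so shift daily sums
prev_total = df.groupby('date')['X'].sum().shift()
df['last_day_cum_value'] = df['date'].map(prev_total)
```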
Suppose I have a Pandas dataframe with 'Date' column whose values have gaps like below:
>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])
>>> df
Date $
0 2021-01-02 1.0
1 2021-01-05 2.0
2 2021-02-05 3.0
I would like to fill the gaps in the 'Date' column for the period between Jan 01, 2021 and Feb 28, 2021 while copying (forward-filling) the values. From some reading up on StackOverflow posts like this one, I came up with the solution below to transform the dataframe:
# I need to first convert values in 'Date' column to datetime64 type
>>> df['Date'] = pd.to_datetime(df['Date'])
# Then I have to set 'Date' column as the dataframe's index
>>> df = df.set_index(['Date'])
# Without doing the above two steps, the call below returns an error
>>> df_new=df.asfreq(freq='D', how={'start':'2021-01-01', 'end':'2021-03-31'}, method='ffill')
>>> df_new
$
Date
2021-01-02 1.0
2021-01-03 1.0
2021-01-04 1.0
2021-01-05 2.0
2021-01-06 2.0
2021-01-07 2.0
2021-01-08 2.0
2021-01-09 2.0
2021-01-10 2.0
...
2021-01-31 2.0
2021-02-01 2.0
2021-02-02 2.0
2021-02-03 2.0
2021-02-04 2.0
2021-02-05 3.0
But as you can see above, the dates in df_new start only at '2021-01-02' instead of '2021-01-01' and end on '2021-02-05' instead of '2021-02-28'. I hope I'm entering the input for the how parameter correctly above.
Q1: What else do I need to do to make the resulting dataframe look like below:
>>> df_new
$
Date
2021-01-01 1.0
2021-01-02 1.0
2021-01-03 1.0
2021-01-04 1.0
2021-01-05 2.0
2021-01-06 2.0
2021-01-07 2.0
2021-01-08 2.0
2021-01-09 2.0
2021-01-10 2.0
...
2021-01-31 2.0
2021-02-01 2.0
2021-02-02 2.0
2021-02-03 2.0
2021-02-04 2.0
2021-02-05 3.0
2021-02-06 3.0
...
2021-02-28 3.0
Q2: Is there any way I can accomplish this simpler (i.e. without having to set the 'Date' column as the index of the dataframe for example)
Thanks in advance for your suggestions/answers!
You can find the min/max dates, create a new pd.date_range() using the MonthBegin/MonthEnd date offsets, and reindex (note that asfreq's how parameter applies only to a PeriodIndex, which is why your dict was silently ignored):
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(
mn - pd.tseries.offsets.MonthBegin(),
mx + pd.tseries.offsets.MonthEnd(),
name="Date",
)
df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
print(df)
Prints:
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
...
55 2021-02-25 3.0
56 2021-02-26 3.0
57 2021-02-27 3.0
58 2021-02-28 3.0
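For Q2, one way to avoid set_index entirely (a sketch, reusing the same MonthBegin/MonthEnd bounds) is to build the full calendar as its own frame, left-merge, and fill:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2021-01-02', '2021-01-05', '2021-02-05']),
                   '$': [1.0, 2.0, 3.0]})

# one row per calendar day from month start to month end
full = pd.DataFrame({'Date': pd.date_range(
    df['Date'].min() - pd.tseries.offsets.MonthBegin(),
    df['Date'].max() + pd.tseries.offsets.MonthEnd(), freq='D')})

# left-merge keeps every day; ffill/bfill copy the known values into the gaps
out = full.merge(df, on='Date', how='left')
out['$'] = out['$'].ffill().bfill()
```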
Hello guys, I want to aggregate (sum) revenue by week; note that the date column is per day. Does anyone know how to do this?
Thank you kindly,
df = pd.DataFrame({'date':['2021-01-01','2021-01-02',
'2021-01-03','2021-01-04','2021-01-05',
'2021-01-06','2021-01-07','2021-01-08',
'2021-01-09'],
'revenue':[5,3,2,
10,12,2,
1,0,6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with an aggregate sum, but it is necessary to change the default closed side to 'left', together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-MON', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
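The same result can also be had without resample, using pd.Grouper inside a groupby (a sketch under the same closed/label settings):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
     '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09']),
    'revenue': [5, 3, 2, 10, 12, 2, 1, 0, 6]})

# pd.Grouper mirrors resample's freq/closed/label arguments
weekly = df.groupby(pd.Grouper(key='date', freq='W-MON',
                               closed='left', label='left'))['revenue'].sum()
```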
I have a dataframe:
Date Price
2021-01-01 29344.67
2021-01-02 32072.08
2021-01-03 33048.03
2021-01-04 32084.61
2021-01-05 34105.46
2021-01-06 36910.18
2021-01-07 39505.51
2021-01-08 40809.93
2021-01-09 40397.52
2021-01-10 38505.49
Date object
Price float64
dtype: object
And my goal is to find the longest consecutive period of growth.
It should return:
Longest consecutive period was from 2021-01-04 to 2021-01-08 with increase of $8725.32
and honestly I have no idea where to start with it. These are my first steps in pandas and I don't know which tools I should use to get this information.
Could anyone help me / point me in the right direction?
Detect the increasing runs with a cumulative sum over the decreasing steps:
df['is_increasing'] = df['Price'].diff().lt(0).cumsum()
You would get:
Date Price is_increasing
0 2021-01-01 29344.67 0
1 2021-01-02 32072.08 0
2 2021-01-03 33048.03 0
3 2021-01-04 32084.61 1
4 2021-01-05 34105.46 1
5 2021-01-06 36910.18 1
6 2021-01-07 39505.51 1
7 2021-01-08 40809.93 1
8 2021-01-09 40397.52 2
9 2021-01-10 38505.49 3
Now, you can detect your longest sequence with
sizes=df.groupby('is_increasing')['Price'].transform('size')
df[sizes == sizes.max()]
And you get:
Date Price is_increasing
3 2021-01-04 32084.61 1
4 2021-01-05 34105.46 1
5 2021-01-06 36910.18 1
6 2021-01-07 39505.51 1
7 2021-01-08 40809.93 1
Similar to what Quang did to split the groups, then pick the group number with the most rows:
s = df.Price.diff().lt(0).cumsum()
out = df.loc[s==s.value_counts().sort_values().index[-1]]
Out[514]:
Date Price
3 2021-01-04 32084.61
4 2021-01-05 34105.46
5 2021-01-06 36910.18
6 2021-01-07 39505.51
7 2021-01-08 40809.93
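Either way, the requested sentence can then be built from the selected rows; a minimal sketch, assuming the frame is sorted by day as in the question:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
                            '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
                            '2021-01-09', '2021-01-10'],
                   'Price': [29344.67, 32072.08, 33048.03, 32084.61, 34105.46,
                             36910.18, 39505.51, 40809.93, 40397.52, 38505.49]})

# each drop in price starts a new run id; rows in one growth run share an id
run_id = df['Price'].diff().lt(0).cumsum()
sizes = df.groupby(run_id)['Price'].transform('size')
run = df[sizes == sizes.max()]

msg = (f"Longest consecutive period was from {run['Date'].iloc[0]} "
       f"to {run['Date'].iloc[-1]} with increase of "
       f"${run['Price'].iloc[-1] - run['Price'].iloc[0]:.2f}")
```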
I'm trying to batch some data into start_date/end_date ranges, conditional on the cumulative sum of num_books staying <= 500000.
Say I have a simple data frame with two columns:
index Date num_books
0 2021-01-01 200000
1 2021-01-02 240000
2 2021-01-03 55000
3 2021-01-04 400000
4 2021-01-05 80000
5 2021-01-06 100000
I need to do a cumulative sum of the values in num_books while it stays <= 500000, and record the start date, end date and cumsum value of each batch. This is an example of what I'm trying to achieve:
start_date end_date cumsum_books
2021-01-01 2021-01-03 495000
2021-01-04 2021-01-05 480000
2021-01-06 2021-01-06 100000
Is there an efficient way/function to achieve this? Thank you!
Here's one way:
from io import StringIO as sio
d = sio("""
index Date num_books
0 2021-01-01 200000
1 2021-01-02 240000
2 2021-01-03 55000
3 2021-01-04 400000
4 2021-01-05 80000
5 2021-01-06 100000
""")
import pandas as pd
df = pd.read_csv(d, sep=r'\s+')
batch_num = 5*10**5
df['batch_num'] = df['num_books'].cumsum()//batch_num
result = df.groupby('batch_num').agg(start_date=('Date', 'min'), end_date=('Date', 'max'), cumsum_books=('num_books','sum'))
print(result)
# start_date end_date cumsum_books
#batch_num
#0 2021-01-01 2021-01-03 495000
#1 2021-01-04 2021-01-05 480000
#2 2021-01-06 2021-01-06 100000
Note that with this approach a batch can still end up with more than 500_000 (the floor division only tracks the running total), but it's trivial to drop/filter such entries out.
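If batches must never exceed the cap (except when a single row alone is over it), the floor-division trick isn't enough, because the running total can jump past a boundary mid-batch. A plain-loop sketch that closes a batch before it would overflow:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-04', '2021-01-05', '2021-01-06'],
                   'num_books': [200000, 240000, 55000, 400000, 80000, 100000]})

cap = 500_000
batch, running, bid = [], 0, 0
for n in df['num_books']:
    if running + n > cap:   # close the current batch before it overflows
        bid += 1
        running = 0
    running += n
    batch.append(bid)
df['batch_num'] = batch

result = df.groupby('batch_num').agg(start_date=('Date', 'min'),
                                     end_date=('Date', 'max'),
                                     cumsum_books=('num_books', 'sum'))
```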