I have a dataframe with the following columns:
datetime: HH:MM:SS (not continuous, some days are missing)
date: df['datetime'].dt.date
X: various values
X_daily_cum = df.groupby(['date']).X.cumsum()
So X_daily_cum is the cumulative sum of X grouped per day; it is reset every day.
Code to reproduce:
import pandas as pd

df = pd.DataFrame([['2021-01-01 10:10', 3],
                   ['2021-01-03 13:33', 7],
                   ['2021-01-03 14:44', 6],
                   ['2021-01-07 17:17', 2],
                   ['2021-01-07 07:07', 4],
                   ['2021-01-07 01:07', 9],
                   ['2021-01-09 09:09', 3]],
                  columns=['datetime', 'X'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
df['date'] = df['datetime'].dt.date
df['X_daily_cum'] = df.groupby(['date']).X.cumsum()
print(df)
Now I would like a new column whose value is the final cumulative sum of the previous available day, like this:
              datetime  X        date  X_daily_cum  last_day_cum_value
0  2021-01-01 10:10:00  3  2021-01-01            3                 NaN
1  2021-01-03 13:33:00  7  2021-01-03            7                   3
2  2021-01-03 14:44:00  6  2021-01-03           13                   3
3  2021-01-07 17:17:00  2  2021-01-07            2                  13
4  2021-01-07 07:07:00  4  2021-01-07            6                  13
5  2021-01-07 01:07:00  9  2021-01-07           15                  13
6  2021-01-09 09:09:00  3  2021-01-09            3                  15
Is there a clean way to do it with pandas, perhaps with an apply?
I have managed to do it in an ugly way: copying the df, removing the time granularity, selecting the last record of each date, and joining this new df back onto the original one. I would like a more elegant solution.
Thanks for the help
Use Series.duplicated with Series.mask to set all values except the last one per date to missing, then shift the values and forward-fill the missing ones:
df['last_day_cum_value'] = (df['X_daily_cum'].mask(df['date'].duplicated(keep='last'))
                                             .shift()
                                             .ffill())
print (df)
              datetime  X        date  X_daily_cum  last_day_cum_value
0  2021-01-01 10:10:00  3  2021-01-01            3                 NaN
1  2021-01-03 13:33:00  7  2021-01-03            7                 3.0
2  2021-01-03 14:44:00  6  2021-01-03           13                 3.0
3  2021-01-07 17:17:00  2  2021-01-07            2                13.0
4  2021-01-07 07:07:00  4  2021-01-07            6                13.0
5  2021-01-07 01:07:00  9  2021-01-07           15                13.0
6  2021-01-09 09:09:00  3  2021-01-09            3                15.0
Old solution:
Use DataFrame.drop_duplicates keeping the last row per date, shift the resulting Series with Series.shift to get the previous day's value, then map it back to a new column with Series.map:
s = df.drop_duplicates('date', keep='last').set_index('date')['X_daily_cum'].shift()
print (s)
date
2021-01-01 NaN
2021-01-03 3.0
2021-01-07 13.0
2021-01-09 15.0
Name: X_daily_cum, dtype: float64
df['last_day_cum_value'] = df['date'].map(s)
print (df)
              datetime  X        date  X_daily_cum  last_day_cum_value
0  2021-01-01 10:10:00  3  2021-01-01            3                 NaN
1  2021-01-03 13:33:00  7  2021-01-03            7                 3.0
2  2021-01-03 14:44:00  6  2021-01-03           13                 3.0
3  2021-01-07 17:17:00  2  2021-01-07            2                13.0
4  2021-01-07 07:07:00  4  2021-01-07            6                13.0
5  2021-01-07 01:07:00  9  2021-01-07           15                13.0
6  2021-01-09 09:09:00  3  2021-01-09            3                15.0
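A further shortcut (my addition, not part of the answer above): the last cumulative value of a day is simply that day's total, so the new column can also be built from a shifted groupby sum. A minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame([['2021-01-01 10:10', 3],
                   ['2021-01-03 13:33', 7],
                   ['2021-01-03 14:44', 6],
                   ['2021-01-07 17:17', 2],
                   ['2021-01-07 07:07', 4],
                   ['2021-01-07 01:07', 9],
                   ['2021-01-09 09:09', 3]],
                  columns=['datetime', 'X'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
df['date'] = df['datetime'].dt.date
df['X_daily_cum'] = df.groupby(['date']).X.cumsum()

# the last cumulative value of each day equals that day's total,
# so shift the per-day totals by one day and map them back per row
daily_prev = df.groupby('date')['X'].sum().shift()
df['last_day_cum_value'] = df['date'].map(daily_prev)
```

This avoids touching the intraday ordering at all, since only daily totals are needed.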
Related
I have a simple dataframe with datetime and their date
df = pd.DataFrame([['2021-01-01 10:10', '2021-01-01'],
                   ['2021-01-03 13:33', '2021-01-03'],
                   ['2021-01-03 14:44', '2021-01-03'],
                   ['2021-01-07 17:17', '2021-01-07'],
                   ['2021-01-07 07:07', '2021-01-07'],
                   ['2021-01-07 01:07', '2021-01-07'],
                   ['2021-01-09 09:09', '2021-01-09']],
                  columns=['datetime', 'date'])
I would like to create a new column containing the last datetime of each day.
I have something quite close, but the last datetime of the day is only filled on the last row of each day...
A weird NaT (Not a Time) fills all the other cells.
Can you suggest something better?
df['eod'] = df.groupby('date')['datetime'].tail(1)
You are probably looking for transform which will return the result to every row in the group.
df['eod'] = df.groupby('date')['datetime'].transform('last')
Output
           datetime        date               eod
0  2021-01-01 10:10  2021-01-01  2021-01-01 10:10
1  2021-01-03 13:33  2021-01-03  2021-01-03 14:44
2  2021-01-03 14:44  2021-01-03  2021-01-03 14:44
3  2021-01-07 17:17  2021-01-07  2021-01-07 01:07
4  2021-01-07 07:07  2021-01-07  2021-01-07 01:07
5  2021-01-07 01:07  2021-01-07  2021-01-07 01:07
6  2021-01-09 09:09  2021-01-09  2021-01-09 09:09
You don't really need another date column if the date part comes from the datetime column. You can group by dt.day of the datetime column (all dates here fall in the same month, so the day number is unambiguous), then call last for the datetime value:
>>> df['datetime'] = pd.to_datetime(df['datetime'])
>>> df.groupby(df['datetime'].dt.day)['datetime'].last()
datetime
1 2021-01-01 10:10:00
3 2021-01-03 14:44:00
7 2021-01-07 01:07:00
9 2021-01-09 09:09:00
Name: datetime, dtype: datetime64[ns]
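A note on robustness (my addition, not from the answers above): dt.day only distinguishes days within one month, so over a longer span the dt.date accessor is the safer grouping key. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame([['2021-01-01 10:10', '2021-01-01'],
                   ['2021-01-03 13:33', '2021-01-03'],
                   ['2021-01-03 14:44', '2021-01-03'],
                   ['2021-01-07 17:17', '2021-01-07'],
                   ['2021-01-07 07:07', '2021-01-07'],
                   ['2021-01-07 01:07', '2021-01-07'],
                   ['2021-01-09 09:09', '2021-01-09']],
                  columns=['datetime', 'date'])
df['datetime'] = pd.to_datetime(df['datetime'])
# grouping by the full calendar date (not just the day number) stays correct
# even when the data spans several months or years
df['eod'] = df.groupby(df['datetime'].dt.date)['datetime'].transform('last')
```

This combines the transform('last') idea with a grouping key that never collides across months.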
For a school project I am predicting 'green' ETFs price movements with tweet sentiment and tweet volume related to climate change.
I predict with a lag of 1, so the predictions of Monday are made with the data of Sunday. The data of Sunday consists of the tweet data (volume & sentiment) of Sunday and the market data that is equal to the trading data of Friday, as there is no trading in the weekend. However for accurate predictions I need the twitter data of Sunday on the trading data of Friday.
My question: how do I get the tweet data (volume and sentiment) of a non-trading day onto the last available trading day, so I can then drop the weekend/holiday entries?
My novice thoughts went something like this: I need a formula that looks for NaNs in df['adjusted close']. While the next value is NaN, keep scanning forward; once a non-NaN value is found, take the 'sentiment' value of the last NaN row and use it to replace the 'sentiment' of the date just before the NaN run.
import pandas as pd

date = pd.date_range(start="2021-01-01", end="2021-01-20")
df = pd.DataFrame({'date': date,
                   'tweet_volume': range(20),
                   'sentiment': range(20),
                   'adjusted close': [0, 'NaN', 2, 3, 4, 5, 6, 7, 'NaN', 'NaN', 10,
                                      11, 12, 13, 'NaN', 'NaN', 'NaN', 17, 18, 19]},
                  columns=['date', 'tweet_volume', 'sentiment', 'adjusted close'])
df = df.set_index('date')
gives:
            tweet_volume  sentiment adjusted close
date
2021-01-01             0          0              0
2021-01-02             1          1            NaN
2021-01-03             2          2              2
2021-01-04             3          3              3
2021-01-05             4          4              4
2021-01-06             5          5              5
2021-01-07             6          6              6
2021-01-08             7          7              7
2021-01-09             8          8            NaN
2021-01-10             9          9            NaN
2021-01-11            10         10             10
2021-01-12            11         11             11
2021-01-13            12         12             12
2021-01-14            13         13             13
2021-01-15            14         14            NaN
2021-01-16            15         15            NaN
2021-01-17            16         16            NaN
2021-01-18            17         17             17
2021-01-19            18         18             18
2021-01-20            19         19             19
and I want:
            tweet_volume  sentiment adjusted close
date
2021-01-01           *1*        *1*              0
2021-01-02             1          1            NaN
2021-01-03             2          2              2
2021-01-04             3          3              3
2021-01-05             4          4              4
2021-01-06             5          5              5
2021-01-07             6          6              6
2021-01-08           *9*        *9*              7
2021-01-09             8          8            NaN
2021-01-10             9          9            NaN
2021-01-11            10         10             10
2021-01-12            11         11             11
2021-01-13            12         12             12
2021-01-14          *16*       *16*             13
2021-01-15            14         14            NaN
2021-01-16            15         15            NaN
2021-01-17            16         16            NaN
2021-01-18            17         17             17
2021-01-19            18         18             18
2021-01-20            19         19             19
So I can then drop the rows with NaNs.
This works:
import numpy as np
import pandas as pd

date = pd.date_range(start="2021-01-01", end="2021-01-20")
df = pd.DataFrame({'date': date,
                   'tweet_volume': range(20),
                   'sentiment': range(20),
                   'adjusted close': [0, 'NaN', 2, 3, 4, 5, 6, 7, 'NaN', 'NaN', 10,
                                      11, 12, 13, 'NaN', 'NaN', 'NaN', 17, 18, 19]},
                  columns=['date', 'tweet_volume', 'sentiment', 'adjusted close'])
df = df.replace('NaN', np.nan)
df = df.set_index('date')

# each non-NaN close starts a new group; the NaN rows that follow belong to
# the same group, so 'last' picks up the values of the final NaN row
groups = df['adjusted close'].notnull().cumsum()
df[['tweet_volume', 'sentiment']] = (
    df.groupby(groups)[['tweet_volume', 'sentiment']].transform('last')
)
df = df.dropna()
print(df)
output:
            tweet_volume  sentiment  adjusted close
date
2021-01-01             1          1             0.0
2021-01-03             2          2             2.0
2021-01-04             3          3             3.0
2021-01-05             4          4             4.0
2021-01-06             5          5             5.0
2021-01-07             6          6             6.0
2021-01-08             9          9             7.0
2021-01-11            10         10            10.0
2021-01-12            11         11            11.0
2021-01-13            12         12            12.0
2021-01-14            16         16            13.0
2021-01-18            17         17            17.0
2021-01-19            18         18            18.0
2021-01-20            19         19            19.0
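The same cleanup can also be expressed as a backward fill (my sketch, not from the question): mark the row that ends each run of non-trading days, keep the tweet data only there, and back-fill it onto the preceding rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range("2021-01-01", "2021-01-20"),
                   'tweet_volume': range(20),
                   'sentiment': range(20),
                   'adjusted close': [0, np.nan, 2, 3, 4, 5, 6, 7, np.nan, np.nan, 10,
                                      11, 12, 13, np.nan, np.nan, np.nan, 17, 18, 19]}
                  ).set_index('date')

# a row ends its run when the NEXT row is a trading day (non-NaN close)
run_end = df['adjusted close'].shift(-1).notna()
run_end.iloc[-1] = True  # the final row always ends its run

# keep tweet data only on run-end rows, then back-fill onto earlier rows
cols = ['tweet_volume', 'sentiment']
df[cols] = df[cols].mask(~run_end).bfill()
out = df.dropna()
```

The result matches the groupby/transform approach; which reads better is a matter of taste.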
I have data that looks like this: (assume start and end are date times)
id  start  end
 1  01-01  01-02
 1  01-03  01-05
 1  01-04  01-07
 1  01-06  NaT
 1  01-07  NaT
I want to get a data frame that includes all dates, with a 'cumulative sum' that only counts the ranges each date falls in.
dates  count
01-01      1
01-02      0
01-03      1
01-04      2
01-05      1
01-06      2
01-07      3
One idea I thought of was to use cumcount on the start dates and a 'reverse cumcount' that decreases the count at the end dates, but I am having trouble wrapping my head around doing this in pandas, and I'm wondering whether there's a more elegant solution.
Here are two options. First, consider this data with only one id; note that your start and end columns must be datetime.
d = {'id': [1, 1, 1, 1, 1],
'start': [pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-03'),
pd.Timestamp('2021-01-04'), pd.Timestamp('2021-01-06'),
pd.Timestamp('2021-01-07')],
'end': [pd.Timestamp('2021-01-02'), pd.Timestamp('2021-01-05'),
pd.Timestamp('2021-01-07'), pd.NaT, pd.NaT]}
df = pd.DataFrame(d)
To get your result, you can sub the get_dummies of end from the get_dummies of start, then sum in case several ranges start and/or end on the same date, cumsum along the dates, and reindex to get all the dates between the min and max available dates. Wrapped in a function:
def dates_cc(df_):
    return (
        pd.get_dummies(df_['start'], dtype=int)
        .sub(pd.get_dummies(df_['end'], dtype=int), fill_value=0)
        .sum()
        .cumsum()
        .to_frame(name='count')
        .reindex(pd.date_range(df_['start'].min(), df_['end'].max()), method='ffill')
        .rename_axis('dates')
    )
Now you can apply this function to your dataframe
res = dates_cc(df).reset_index()
print(res)
# dates count
# 0 2021-01-01 1.0
# 1 2021-01-02 0.0
# 2 2021-01-03 1.0
# 3 2021-01-04 2.0
# 4 2021-01-05 1.0
# 5 2021-01-06 2.0
# 6 2021-01-07 2.0
Now if you have several id, like
df1 = df.assign(id=[1,1,2,2,2])
print(df1)
# id start end
# 0 1 2021-01-01 2021-01-02
# 1 1 2021-01-03 2021-01-05
# 2 2 2021-01-04 2021-01-07
# 3 2 2021-01-06 NaT
# 4 2 2021-01-07 NaT
then you can use the above function like
res1 = df1.groupby('id').apply(dates_cc).reset_index()
print(res1)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 2 2021-01-04 1.0
# 6 2 2021-01-05 1.0
# 7 2 2021-01-06 2.0
# 8 2 2021-01-07 2.0
That said, a more straightforward possibility is crosstab, which creates a row per id; the rest is about the same manipulations.
res2 = (
pd.crosstab(index=df1['id'], columns=df1['start'])
.sub(pd.crosstab(index=df1['id'], columns=df1['end']), fill_value=0)
.reindex(columns=pd.date_range(df1['start'].min(), df1['end'].max()), fill_value=0)
.rename_axis(columns='dates')
.cumsum(axis=1)
.stack()
.reset_index(name='count')
)
print(res2)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 1 2021-01-06 0.0
# 6 1 2021-01-07 0.0
# 7 2 2021-01-01 0.0
# 8 2 2021-01-02 0.0
# 9 2 2021-01-03 0.0
# 10 2 2021-01-04 1.0
# 11 2 2021-01-05 1.0
# 12 2 2021-01-06 2.0
# 13 2 2021-01-07 2.0
The main difference between the two options is that the crosstab version creates the full set of dates for every id: for example, 2021-01-01 occurs only in id=1, but with this version you also get that date for id=2, whereas with the groupby version it is not taken into account.
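As a sanity check on the single-id result (my sketch, not part of the answer): the same counts can be brute-forced by expanding each interval into its active days and counting, with NaT ends treated as open through the last known date and the end date itself excluded, matching the get_dummies subtraction above:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1],
                   'start': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-04',
                                            '2021-01-06', '2021-01-07']),
                   'end': pd.to_datetime(['2021-01-02', '2021-01-05', '2021-01-07',
                                          pd.NaT, pd.NaT])})

# open intervals (NaT end) run through the last known date;
# each interval counts on [start, end) -- end date excluded
last_day = df[['start', 'end']].max().max()
end_filled = df['end'].fillna(last_day + pd.Timedelta(days=1))
days = [day for s, e in zip(df['start'], end_filled)
        for day in pd.date_range(s, e - pd.Timedelta(days=1))]
counts = (pd.Series(days).value_counts()
          .reindex(pd.date_range(df['start'].min(), last_day), fill_value=0)
          .rename_axis('dates'))
```

This is O(total interval length), so it only suits short spans, but it is easy to verify by eye.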
I have a dataframe:
      Date     Price
2021-01-01  29344.67
2021-01-02  32072.08
2021-01-03  33048.03
2021-01-04  32084.61
2021-01-05  34105.46
2021-01-06  36910.18
2021-01-07  39505.51
2021-01-08  40809.93
2021-01-09  40397.52
2021-01-10  38505.49
Date object
Price float64
dtype: object
And my goal is to find the longest consecutive period of growth.
It should return:
Longest consecutive period was from 2021-01-04 to 2021-01-08 with increase of $8725.32
and honestly I have no idea where to start with it. These are my first steps in pandas and I don't know which tools I should use to get this information.
Could anyone help me / point me in the right direction?
Detect your increasing sequences with a cumsum over the decreasing points:
df['is_increasing'] = df['Price'].diff().lt(0).cumsum()
You would get:
         Date     Price  is_increasing
0  2021-01-01  29344.67              0
1  2021-01-02  32072.08              0
2  2021-01-03  33048.03              0
3  2021-01-04  32084.61              1
4  2021-01-05  34105.46              1
5  2021-01-06  36910.18              1
6  2021-01-07  39505.51              1
7  2021-01-08  40809.93              1
8  2021-01-09  40397.52              2
9  2021-01-10  38505.49              3
Now, you can detect your longest sequence with
sizes = df.groupby('is_increasing')['Price'].transform('size')
df[sizes == sizes.max()]
And you get:
         Date     Price  is_increasing
3  2021-01-04  32084.61              1
4  2021-01-05  34105.46              1
5  2021-01-06  36910.18              1
6  2021-01-07  39505.51              1
7  2021-01-08  40809.93              1
Similar to what Quang did to split the groups, then pick the group with the largest count:
s = df.Price.diff().lt(0).cumsum()
out = df.loc[s==s.value_counts().sort_values().index[-1]]
Out[514]:
         Date     Price
3  2021-01-04  32084.61
4  2021-01-05  34105.46
5  2021-01-06  36910.18
6  2021-01-07  39505.51
7  2021-01-08  40809.93
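To produce the exact sentence the question asks for, the endpoints of the longest run can be formatted directly (a sketch building on the run-labeling approach above, with the question's column names):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
                            '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
                            '2021-01-09', '2021-01-10'],
                   'Price': [29344.67, 32072.08, 33048.03, 32084.61, 34105.46,
                             36910.18, 39505.51, 40809.93, 40397.52, 38505.49]})

# label each monotonic run: the counter increases whenever the price drops
runs = df['Price'].diff().lt(0).cumsum()
sizes = runs.map(runs.value_counts())
longest = df[sizes == sizes.max()]

msg = (f"Longest consecutive period was from {longest['Date'].iloc[0]} "
       f"to {longest['Date'].iloc[-1]} "
       f"with increase of ${longest['Price'].iloc[-1] - longest['Price'].iloc[0]:.2f}")
print(msg)
```

Note the increase is measured from the first to the last price of the run, which is what the expected sentence implies.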
Below is a script for a simplified version of the df in question:
plan_dates = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'start_date': ['2021-01-01', '2021-01-01', '2021-01-03',
                                          '2021-01-04', '2021-01-05'],
                           'end_date': ['2021-01-04', '2021-01-03', '2021-01-03',
                                        '2021-01-06', '2021-01-08']})
plan_dates
id start_date end_date
0 1 2021-01-01 2021-01-04
1 2 2021-01-01 2021-01-03
2 3 2021-01-03 2021-01-03
3 4 2021-01-04 2021-01-06
4 5 2021-01-05 2021-01-08
I would like to create a new DataFrame with a row for each day where the plan is active, for each id.
INTENDED DF:
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
Any help would be greatly appreciated.
Use:
#first part is same like https://stackoverflow.com/a/66869805/2901002
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['start_date'] = df['start_date'].add(pd.to_timedelta(counter, unit='d'))
Then remove end_date column, rename and create default index:
df = (df.drop('end_date', axis=1)
.rename(columns={'start_date':'active_days'})
.reset_index(drop=True))
print (df)
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
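A more compact alternative (my sketch, not part of the answer above): build a pd.date_range per row and let explode turn each range into one row per day:

```python
import pandas as pd

plan_dates = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'start_date': ['2021-01-01', '2021-01-01', '2021-01-03',
                                          '2021-01-04', '2021-01-05'],
                           'end_date': ['2021-01-04', '2021-01-03', '2021-01-03',
                                        '2021-01-06', '2021-01-08']})

# one pd.date_range per row covers both endpoints inclusively;
# explode then expands each range into one row per day
out = (plan_dates
       .assign(active_days=[pd.date_range(s, e) for s, e in
                            zip(plan_dates['start_date'], plan_dates['end_date'])])
       .explode('active_days')
       [['id', 'active_days']]
       .reset_index(drop=True))
```

The repeat/cumcount version avoids materializing Python-level lists, so it should scale better; explode wins on readability.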