Cumulative metric in continuous date range based on non-continuous date changes - python

On specific dates, a metric starting at 0 increases by a given value. Given a set of non-continuous dates and values, is it possible to produce a column with the metric over a continuous date range?
Input - metric changes per day
date value
02-03-2022 00:00:00 10
03-03-2022 00:00:00 0
06-03-2022 00:00:00 2
10-03-2022 00:00:00 18
Output - metric calculated for a continuous range of days (starting value = 0 unless a change already applies on the first day)
0 metric
0 2022-02-28 0
1 2022-03-01 0
2 2022-03-02 10
3 2022-03-03 10
4 2022-03-04 10
5 2022-03-05 10
6 2022-03-06 12
7 2022-03-07 12
8 2022-03-08 12
9 2022-03-09 12
10 2022-03-10 30
11 2022-03-11 30
12 2022-03-12 30
13 2022-03-13 30
Code example
import pandas as pd

df = pd.DataFrame({'date': ['02-03-2022 00:00:00',
                            '03-03-2022 00:00:00',
                            '06-03-2022 00:00:00',
                            '10-03-2022 00:00:00'],
                   'value': [10, 0, 2, 18]},
                  index=[0, 1, 2, 3])
df2 = pd.DataFrame(pd.date_range(start='28-02-2022', end='13-03-2022'))
df2['metric'] = 0  # TODO

Map the values from df onto df2 by date, fill the missing values with 0 and then take the cumulative sum:
df['date'] = pd.to_datetime(df.date, format='%d-%m-%Y %H:%M:%S')
df2['metric'] = df2[0].map(df.set_index('date')['value']).fillna(0).cumsum()
df2
0 metric
0 2022-02-28 0.0
1 2022-03-01 0.0
2 2022-03-02 10.0
3 2022-03-03 10.0
4 2022-03-04 10.0
5 2022-03-05 10.0
6 2022-03-06 12.0
7 2022-03-07 12.0
8 2022-03-08 12.0
9 2022-03-09 12.0
10 2022-03-10 30.0
11 2022-03-11 30.0
12 2022-03-12 30.0
13 2022-03-13 30.0
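Note that fillna(0) leaves the metric as floats. If whole numbers are preferred, the column can simply be cast back afterwards, e.g.:
df2['metric'] = df2['metric'].astype(int)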

df.reindex is useful for this: reindex onto the full date range, then apply df.fillna and df.cumsum.
import pandas as pd

df = pd.DataFrame({'date': ['02-03-2022 00:00:00',
                            '03-03-2022 00:00:00',
                            '06-03-2022 00:00:00',
                            '10-03-2022 00:00:00'],
                   'value': [10, 0, 2, 18]},
                  index=[0, 1, 2, 3])
df['date'] = pd.to_datetime(df.date, format='%d-%m-%Y %H:%M:%S')
res = (df.set_index('date')
         .reindex(pd.date_range(start='2022-02-28', end='2022-03-13'))
         .fillna(0)
         .cumsum()
         .reset_index(drop=False)
         .rename(columns={'index': 'date', 'value': 'metric'}))
print(res)
date metric
0 2022-02-28 0.0
1 2022-03-01 0.0
2 2022-03-02 10.0
3 2022-03-03 10.0
4 2022-03-04 10.0
5 2022-03-05 10.0
6 2022-03-06 12.0
7 2022-03-07 12.0
8 2022-03-08 12.0
9 2022-03-09 12.0
10 2022-03-10 30.0
11 2022-03-11 30.0
12 2022-03-12 30.0
13 2022-03-13 30.0

Related

Stick the dataframe rows and columns in one row + replace the NaN values with the day before or after

I have a df and I want to stick its values together into one row. First I want to select a specific time range, and replace the NaN values with the value from the day before. Here is a simple example: I only want to keep the values from 2020, stick them together based on the time, and also replace each NaN value with the value from the day before.
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['day'] = ['2020-01-01', '2019-01-01', '2020-01-02', '2020-01-03', '2018-01-01',
             '2020-01-15', '2020-03-01', '2020-02-01', '2017-01-01']
df['value_1'] = [1, np.nan, 32, 48, 5, -1, 5, 10, 2]
df['value_2'] = [np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
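For the "stick everything in one row" part the exact target layout is not entirely clear from the expected output, but as a rough sketch (the column names _1, _2, ... are purely illustrative), the forward-filled 2020 rows could be flattened into a single wide row like this:
vals = df.loc[filter_2020].sort_values('day')[val_cols].to_numpy().ravel()
one_row = pd.DataFrame([vals], columns=[f'_{i + 1}' for i in range(len(vals))])
print(one_row)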

Generate weeks from column with dates

I have a large dataset containing a date column that starts in the year 2019. I want to generate, in a separate column, the week numbers covered by those dates.
Here is what the date column looks like:
import pandas as pd

data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
                 '2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
                 '2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
                 '2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
                 '2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
                 '2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
                 '2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
                 '2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
                 '2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
                 '2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
                 '2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
                 '2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Starting from the first day the data was collected, I want to count 7 days using the date column and create a week out of it. For example, if the first week contains the first 7 dates, I create a column and call it week one, and I repeat the same process until the last week the data was collected.
It may be a good idea to sort the dates in order from the first date to the current one.
I have tried this, but it does not generate weeks in order; it actually has repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is, starting from the first date the data was collected, to count 7 days and store that as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
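The NaT row makes the week column float. If integer week numbers with the missing value preserved are preferred, the column can be converted to pandas' nullable integer dtype, e.g.:
df['week'] = df['week'].astype('Int64')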
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0

Fill monthly holes (time-series) in a pandas dataframe with several categories [duplicate]

This question already has answers here: Pandas filling missing dates and values within group (3 answers). Closed 4 months ago.
I have a time series in pandas with several products (ids: a, b, etc.), but with monthly holes. I have to fill those holes; it may be with np.nan or any other constant. I tried groupby but I wasn't able to make it work.
date id units
2022-01-01 a 10
2022-01-01 b 100
2022-02-01 a 15
2022-03-01 a 30
2022-03-01 b 70
2022-05-01 b 60
2022-06-01 a 8
2022-06-01 b 90
Should be:
date id units
2022-01-01 a 10
2022-01-01 b 100
2022-02-01 a 15
2022-02-01 b np.nan
2022-03-01 a 30
2022-03-01 b 70
2022-04-01 a np.nan
2022-04-01 b np.nan
2022-05-01 a np.nan
2022-05-01 b 60
2022-06-01 a 8
2022-06-01 b 90
You can do pivot then stack:
df = (df.pivot(index='date', columns='id', values='units')
        .stack(dropna=False)
        .reset_index(name='units'))
Out[126]:
date id units
0 2022-01-01 a 10.0
1 2022-01-01 b 100.0
2 2022-02-01 a 15.0
3 2022-02-01 b NaN
4 2022-03-01 a 30.0
5 2022-03-01 b 70.0
6 2022-05-01 a NaN
7 2022-05-01 b 60.0
8 2022-06-01 a 8.0
9 2022-06-01 b 90.0
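Note that pivot + stack only creates combinations of dates that already appear somewhere in the data, so a month missing for every id (2022-04-01 in the expected output) is still absent. A small sketch of one way to also restore such months, assuming a monthly start frequency and starting again from the original long-format df:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
full_idx = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max(), freq='MS'),
     df['id'].unique()],
    names=['date', 'id'])
out = df.set_index(['date', 'id']).reindex(full_idx).reset_index()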
Another option: resample each id group to a monthly start frequency, then forward-fill the id column for the rows that were inserted:
import numpy as np

df2 = (df.set_index('date')
         .groupby('id', group_keys=False)
         .apply(lambda x: x.resample('1MS').asfreq(fill_value=np.nan))
         .reset_index())
df2['id'].ffill(inplace=True)
df2
date id units
0 2022-01-01 a 10.0
1 2022-02-01 a 15.0
2 2022-03-01 a 30.0
3 2022-04-01 a NaN
4 2022-05-01 a NaN
5 2022-06-01 a 8.0
6 2022-01-01 b 100.0
7 2022-02-01 b NaN
8 2022-03-01 b 70.0
9 2022-04-01 b NaN
10 2022-05-01 b 60.0
11 2022-06-01 b 90.0

How do I transpose twitter data of a non trading day on the last available trading day? (python)

For a school project I am predicting 'green' ETFs price movements with tweet sentiment and tweet volume related to climate change.
I predict with a lag of 1, so the predictions for Monday are made with the data of Sunday. The data of Sunday consists of the tweet data (volume & sentiment) of Sunday and the market data that is equal to the trading data of Friday, as there is no trading on the weekend. However, for accurate predictions I need the Twitter data of Sunday on the trading data of Friday.
My question: how do I get the tweet data (volume and sentiment) of a non-trading day onto the last available trading day? So I can then drop the weekend/holiday entries.
My novice thoughts went something like this: I need a formula that looks for NaNs in the column df['adjusted close']. If the next value is NaN, keep looking at the following values; once a value that is not NaN is reached, take the 'sentiment' value corresponding to the last NaN date and use it to replace the 'sentiment' value on the date before the NaN run.
import datetime
import pandas as pd

date = pd.date_range(start="2021-01-01", end="2021-01-20")
df = pd.DataFrame({'date': date,
                   'tweet_volume': range(20),
                   'sentiment': range(20),
                   'adjusted close': [0, 'NaN', 2, 3, 4, 5, 6, 7, 'NaN', 'NaN', 10,
                                      11, 12, 13, 'NaN', 'NaN', 'NaN', 17, 18, 19]},
                  columns=['date', 'tweet_volume', 'sentiment', 'adjusted close'])
df = df.set_index('date')
gives:
tweet_volume sentiment adjusted close
date
2021-01-01 0 0 0
2021-01-02 1 1 NaN
2021-01-03 2 2 2
2021-01-04 3 3 3
2021-01-05 4 4 4
2021-01-06 5 5 5
2021-01-07 6 6 6
2021-01-08 7 7 7
2021-01-09 8 8 NaN
2021-01-10 9 9 NaN
2021-01-11 10 10 10
2021-01-12 11 11 11
2021-01-13 12 12 12
2021-01-14 13 13 13
2021-01-15 14 14 NaN
2021-01-16 15 15 NaN
2021-01-17 16 16 NaN
2021-01-18 17 17 17
2021-01-19 18 18 18
2021-01-20 19 19 19
and I want:
tweet_volume sentiment adjusted close
date
2021-01-01 *1* *1* 0
2021-01-02 1 1 NaN
2021-01-03 2 2 2
2021-01-04 3 3 3
2021-01-05 4 4 4
2021-01-06 5 5 5
2021-01-07 6 6 6
2021-01-08 *9* *9* 7
2021-01-09 8 8 NaN
2021-01-10 9 9 NaN
2021-01-11 10 10 10
2021-01-12 11 11 11
2021-01-13 12 12 12
2021-01-14 *16* *16* 13
2021-01-15 14 14 NaN
2021-01-16 15 15 NaN
2021-01-17 16 16 NaN
2021-01-18 17 17 17
2021-01-19 18 18 18
2021-01-20 19 19 19
So I can then drop the rows with NaNs.
This works:
import numpy as np
import pandas as pd

date = pd.date_range(start="2021-01-01", end="2021-01-20")
df = pd.DataFrame({'date': date,
                   'tweet_volume': range(20),
                   'sentiment': range(20),
                   'adjusted close': [0, 'NaN', 2, 3, 4, 5, 6, 7, 'NaN', 'NaN', 10,
                                      11, 12, 13, 'NaN', 'NaN', 'NaN', 17, 18, 19]},
                  columns=['date', 'tweet_volume', 'sentiment', 'adjusted close'])
df = df.replace('NaN', np.nan)
df = df.set_index('date')
# Group each trading day together with the non-trading days that follow it,
# then overwrite the tweet columns with the last row of each group.
grouper = df['adjusted close'].diff(0).notnull().astype('int').cumsum()
df[['tweet_volume', 'sentiment']] = (
    df.groupby(grouper).transform('last')[['tweet_volume', 'sentiment']])
df = df.dropna()
print(df)
output:
tweet_volume sentiment adjusted close
date
2021-01-01 1 1 0.0
2021-01-03 2 2 2.0
2021-01-04 3 3 3.0
2021-01-05 4 4 4.0
2021-01-06 5 5 5.0
2021-01-07 6 6 6.0
2021-01-08 9 9 7.0
2021-01-11 10 10 10.0
2021-01-12 11 11 11.0
2021-01-13 12 12 12.0
2021-01-14 16 16 13.0
2021-01-18 17 17 17.0
2021-01-19 18 18 18.0
2021-01-20 19 19 19.0
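A slightly more explicit variant of the same grouping trick, applied to the frame right after the 'NaN' strings have been replaced and before the final dropna (this assumes the goal is to carry the tweet columns of the last non-trading day in each run back onto the preceding trading day):
trading = df['adjusted close'].notna()
grp = trading.cumsum()  # one group per trading day plus the non-trading days after it
df[['tweet_volume', 'sentiment']] = (
    df.groupby(grp)[['tweet_volume', 'sentiment']].transform('last'))
df = df[trading]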

Split date range rows into years (ungroup) - Python Pandas

I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see last row where end=2023 and start = 2020), keeping the same value for column A, while splitting proportionally the value in column B:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
import pandas as pd
# TEST DATA:
text=""" start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])
#print("\n----\n df:",df)
#----------------------------------------
# SOLUTION:
def split_years(r):
    """
    Split row 'r' where the "end" year minus the "start" year is greater than 0.
    The new rows have repeated values of 'A', and 'B' divided by the number of years.
    Return: a DataFrame with one row per year.
    """
    t1, t2 = r["start"], r["end"]
    ys = t2.year - t1.year
    kk = 0 if t1.is_year_end else 1
    if ys > 0:
        l1 = [t1] + [t1 + pd.offsets.YearBegin(i) for i in range(1, ys + 1)]
        l2 = [t1 + pd.offsets.YearEnd(i) for i in range(kk, ys + kk)] + [t2]
        return pd.DataFrame({"start": l1, "end": l2, "A": r.A, "B": r.B / len(l1)})
    print("year difference <= 0!")
    return None
# Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
print("\n---- grps:\n",grps)
# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)
# Extract the rows to be splitted:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)
# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr, "\n")
# Insert the "one year" data frame to the list, and concatenate them:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)
# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
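One caveat about the test-data parsing above: read_csv without dayfirst parses ambiguous dates such as 10.05.2020 month-first (hence 2020-10-05 in df2 above). If the dates are meant to be day-first, as in the question, pass dayfirst=True:
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python",
                 parse_dates=[0, 1], dayfirst=True)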
Bit of a different approach, adding new columns instead of new rows. But I think this accomplishes what you want to do.
df["years_apart"] = (
(df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
df[f"{years}_end_date"] = pd.NaT
df.loc[
df["years_apart"] == years, f"{years}_end_date"
] = df.loc[
df["years_apart"] == years, "start_date"
] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I have solved it by creating a date difference and a counter that adds years to the repeated rows:
from datetime import timedelta
import numpy as np

# calculate difference between start and end year
table['diff'] = (table['end'] - table['start']) // timedelta(days=365)
table['diff'] = table['diff'] + 1
# replicate rows depending on number of years
table = table.reindex(table.index.repeat(table['diff']))
# counter that increases for diff > 1; assign increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum() // table['diff']
table['start'] = np.where(table['diff'] > 1, table['start'] + table['count'] - 1, table['start'])
table['end'] = table['start']
# split B among years
table['B'] = table['B'] // table['diff']
