How to fill NaNs with a specific row of data - python

I am a new python user and have a few questions regarding filling NA's of a data frame.
Currently, I have a data frame that has a series of dates from 2022-08-01 to 2037-08-01 with a frequency of monthly data.
However, after 2027-06-01 the pricing data stops, and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially, I would like to take the last 12 months of prices and fill those forward for the rest of the data frame. I am thinking of some type of groupby month combined with fillna(method='ffill'), but when I do this it just fills the last value in the df forward. Below is an example of my code.
In the example below the values stop at 2020-12-01; I wish to fill the previous 12 values forward for the rest of the maturity dates, so all prices from 2020-01-01 to 2020-12-01 would be filled forward for all later months.
import pandas as pd

# 25 monthly maturity dates but only 12 prices, so the tail is NaN
mat = pd.DataFrame(pd.date_range('01/01/2020', '01/01/2022', freq='MS'))
prices = pd.DataFrame(['179.06', '174.6', '182.3', '205.59', '204.78', '202.19',
                       '216.17', '218.69', '220.73', '223.28', '225.16', '226.31'])
example = pd.concat([mat, prices], axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN

Is this what you're looking for?
out = example.groupby(example.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06
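Applied to the frame described at the top of the question (monthly maturities from 2022-08-01 to 2037-08-01, prices stopping at 2027-06-01), the same idea would look like this; a minimal sketch in which the prices are dummy values standing in for the real data:
import numpy as np
import pandas as pd

# dummy prices (an assumption) through 2027-06-01, NaN afterwards
mat = pd.date_range('2022-08-01', '2037-08-01', freq='MS')
df = pd.DataFrame({'maturity': mat,
                   'price': np.where(mat <= '2027-06-01',
                                     np.arange(len(mat), dtype=float), np.nan)})
# each later month inherits the last observed value for its calendar month
df['price'] = df.groupby(df['maturity'].dt.month)['price'].ffill()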

Related

Fill missing dates in a pandas DataFrame

I have a lot of DataFrames with 2 columns, like this:
         Fecha       unidades
0        2020-01-01  2.0
84048    2020-09-01  4.0
149445   2020-10-01  11.0
532541   2020-11-01  4.0
660659   2020-12-01  2.0
1515682  2021-03-01  9.0
1563644  2021-04-01  2.0
1759823  2021-05-01  1.0
2226586  2021-07-01  1.0
As can be seen, some months are missing. Which data is missing depends on the DataFrame: I can have 2 missing months, 10, only one, or a 100% complete series. I need to complete the "Fecha" column with the missing months (from 2020-01-01 to 2021-12-01) and, for each date added to "Fecha", add the value 0 to the "unidades" column.
Each element in the Fecha column is a pandas._libs.tslibs.timestamps.Timestamp.
How could I fill the missing dates for each DataFrame?
You could create a date range and use the "Fecha" column with set_index + reindex to add the missing months. Then fillna + reset_index fetches the desired outcome:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = (df.set_index('Fecha')
        .reindex(pd.date_range('2020-01-01', '2021-12-01', freq='MS'))
        .rename_axis('Fecha')
        .fillna(0)
        .reset_index())
Output:
Fecha unidades
0 2020-01-01 2.0
1 2020-02-01 0.0
2 2020-03-01 0.0
3 2020-04-01 0.0
4 2020-05-01 0.0
5 2020-06-01 0.0
6 2020-07-01 0.0
7 2020-08-01 0.0
8 2020-09-01 4.0
9 2020-10-01 11.0
10 2020-11-01 4.0
11 2020-12-01 2.0
12 2021-01-01 0.0
13 2021-02-01 0.0
14 2021-03-01 9.0
15 2021-04-01 2.0
16 2021-05-01 1.0
17 2021-06-01 0.0
18 2021-07-01 1.0
19 2021-08-01 0.0
20 2021-09-01 0.0
21 2021-10-01 0.0
22 2021-11-01 0.0
23 2021-12-01 0.0
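Since the question mentions having a lot of DataFrames, the same recipe can be wrapped in a small helper and applied to each one; a sketch, where the function name fill_missing_months is hypothetical:
import pandas as pd

def fill_missing_months(df, start='2020-01-01', end='2021-12-01'):
    df = df.copy()
    df['Fecha'] = pd.to_datetime(df['Fecha'])
    return (df.set_index('Fecha')
              .reindex(pd.date_range(start, end, freq='MS'))
              .rename_axis('Fecha')
              .fillna(0)
              .reset_index())

# dfs = [fill_missing_months(d) for d in dfs]  # dfs: your list of DataFrames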

How to apply different conditions to different months in a pandas Dataframe?

So I'm trying to apply different conditions that depend on the date, months to be specific. For example, for January replace the data in TEMP that is above 45, but for February the data above 30, and so on. I already did that with the code below, but the problem is that the data from the previous month gets replaced with NaN.
This is my code:
meses = ["01", "02"]
for i in var_vars:
    if i in dataframes2.columns.values:
        for j in range(len(meses)):
            test_prueba_mes = dataframes2[i].loc[dataframes2['fecha'].dt.month == int(meses[j])]
            test_prueba = test_prueba_mes[dataframes2[i] < dataframes.loc[i]["X" + meses[j] + ".max"]]
            dataframes2["Prueba " + str(i)] = test_prueba
Output:
dataframes2.tail(5)
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 NaN NaN
23 2020-01-02 00:00:00 31.0 3 NaN NaN
24 2020-01-02 12:00:00 31.0 2 NaN NaN
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
My desired Output is:
Output:
fecha TEMP_C_Avg RH_Avg Prueba TEMP_C_Avg Prueba RH_Avg
21 2020-01-01 22:00:00 46.0 103 NaN NaN
22 2020-01-01 23:00:00 29.0 103 29.0 NaN
23 2020-01-02 00:00:00 31.0 3 31.0 3.0
24 2020-01-02 12:00:00 31.0 2 31.0 2.0
25 2020-02-01 10:00:00 29.0 5 29.0 5.0
I'd appreciate it if anyone can help me.
Update: the ruleset for 6 months is jan 45, feb 30, mar 45, apr 10, may 15, jun 30.
An example of the data:
fecha TEMP_C_Avg RH_Avg
25 2020-02-01 10:00:00 29.0 5
26 2020-02-01 11:00:00 32.0 105
27 2020-03-01 10:00:00 55.0 3
28 2020-03-01 11:00:00 40.0 5
29 2020-04-01 10:00:00 10.0 20
30 2020-04-01 11:00:00 5.0 15
31 2020-05-01 10:00:00 20.0 15
32 2020-05-01 11:00:00 5.0 106
33 2020-06-01 10:00:00 33.0 107
34 2020-06-01 11:00:00 20.0 20
With that clear understanding:
- the monthly limits have been encoded into a dict, limits
- numpy select() is used: when a condition matches, the value corresponding to that condition is taken from the second parameter, defaulting to the third parameter
- the conditions are built dynamically from the limits dict
- the second parameter needs to be the same length as the conditions list, so the list of np.nan is built with a list comprehension to guarantee the correct length
- to consider all columns, a dict comprehension builds the **kwarg params to assign()
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""fecha  TEMP_C_Avg  RH_Avg
25  2020-02-01 10:00:00  29.0  5
26  2020-02-01 11:00:00  32.0  105
27  2020-03-01 10:00:00  55.0  3
28  2020-03-01 11:00:00  40.0  5
29  2020-04-01 10:00:00  10.0  20
30  2020-04-01 11:00:00  5.0  15
31  2020-05-01 10:00:00  20.0  15
32  2020-05-01 11:00:00  5.0  106
33  2020-06-01 10:00:00  33.0  107
34  2020-06-01 11:00:00  20.0  20"""), sep=r"\s\s+", engine="python")
df.fecha = pd.to_datetime(df.fecha)

# The ruleset for 6 months is jan 45, feb 30, mar 45, apr 10, may 15, jun 30
limits = {1: 45, 2: 30, 3: 45, 4: 10, 5: 15, 6: 30}

df = df.assign(**{f"Prueba {c}": np.select(   # construct target column name
        # build a condition for each of the month limits
        [df.fecha.dt.month.eq(m) & df[c].gt(l) for m, l in limits.items()],
        [np.nan for m in limits.keys()],      # NaN if beyond limit
        df[c])                                # keep value if within limits
    for c in df.columns if "Avg" in c})       # do the calc for all columns with "Avg" in the name
    fecha                TEMP_C_Avg  RH_Avg  Prueba TEMP_C_Avg  Prueba RH_Avg
25  2020-02-01 10:00:00          29       5                 29               5
26  2020-02-01 11:00:00          32     105                nan             nan
27  2020-03-01 10:00:00          55       3                nan               3
28  2020-03-01 11:00:00          40       5                 40               5
29  2020-04-01 10:00:00          10      20                 10             nan
30  2020-04-01 11:00:00           5      15                  5             nan
31  2020-05-01 10:00:00          20      15                nan              15
32  2020-05-01 11:00:00           5     106                  5             nan
33  2020-06-01 10:00:00          33     107                nan             nan
34  2020-06-01 11:00:00          20      20                 20              20
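For what it's worth, the same ruleset can also be written without np.select: map each row's month to its limit, then mask the values that exceed it. A sketch equivalent to the assign() above, reusing the same df and limits:
# limit applicable to each row, looked up from its month
limit_per_row = df['fecha'].dt.month.map(limits)
for c in ['TEMP_C_Avg', 'RH_Avg']:
    # mask() turns values above the row's limit into NaN and keeps the rest
    df[f'Prueba {c}'] = df[c].mask(df[c] > limit_per_row)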

Fill missing timestamps and apply different operations on different columns

I have a data in below format
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:55:00 18 1047
Expected Output:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:20:00 0 41
xyz 01-01-2020 00:25:00 0 41
xyz 01-01-2020 00:30:00 0 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:50:00 0 1029
xyz 01-01-2020 11:55:00 18 1047
So I want to fill in timestamps at 5-minute intervals, fill the flowers column with 0, and fill the total_flowers column with the previous value (ffill).
My efforts:
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00+05:30")
end_time = pd.to_datetime(f"{end_day} 23:55:00+05:30")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min')
df = df.set_index('timestamp').reindex(dates).reset_index(drop=False).reindex(columns=df.columns)
How do I fill the flowers column with zeros and the total_flowers column with ffill? Also, the values in the timestamp column come out as NaN.
Actual Output:
user timestamp flowers total_flowers
xyz Nan 15 15
xyz Nan 5 20
xyz Nan 21 41
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan 1 42
...
xyz Nan 57 1029
xyz Nan Nan Nan
xyz Nan 18 1047
Reindex and refill
If you construct the dates such that you can reindex your timestamps, you can then just do some fillna and ffill operations. (Your timestamp column came back as NaN because the reindexed index is unnamed, so reset_index produces a column called index rather than timestamp; naming the date range fixes that.) I had to remove the timezone information, but you should be able to add that back if your data are timezone aware. Here's the full example using some of your data:
from pandas import Timestamp

d = {'user': {0: 'xyz', 1: 'xyz', 2: 'xyz', 3: 'xyz'},
     'timestamp': {0: Timestamp('2020-01-01 00:05:00'),
                   1: Timestamp('2020-01-01 00:10:00'),
                   2: Timestamp('2020-01-01 00:15:00'),
                   3: Timestamp('2020-01-01 00:35:00')},
     'flowers': {0: 15, 1: 5, 2: 21, 3: 1},
     'total_flowers': {0: 15, 1: 20, 2: 41, 3: 42}}
df = pd.DataFrame(d)
# user timestamp flowers total_flowers
#0 xyz 2020-01-01 00:05:00 15 15
#1 xyz 2020-01-01 00:10:00 5 20
#2 xyz 2020-01-01 00:15:00 21 41
#3 xyz 2020-01-01 00:35:00 1 42
#as you did, but with no TZ
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00")
end_time = pd.to_datetime(f"{end_day} 00:55:00")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min', name="timestamp")
#filling the nas and reformatting
df = df.set_index('timestamp')
df = df.reindex(dates)
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.reset_index(inplace=True)
Output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
7 2020-01-01 00:40:00 xyz 0.0 42.0
8 2020-01-01 00:45:00 xyz 0.0 42.0
9 2020-01-01 00:50:00 xyz 0.0 42.0
10 2020-01-01 00:55:00 xyz 0.0 42.0
Resample and refill
You can also use resample with asfreq(), then do the filling as before. This is convenient for finding the dates (and should get around the timezone issue):
# resample and then fill the gaps
# same df as constructed above
df = df.set_index('timestamp')
df = df.resample('5T').asfreq()
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.index.name='timestamp'
df.reset_index(inplace=True)
Same output:
timestamp flowers total_flowers user
0 2020-01-01 00:05:00 15 15.0 xyz
1 2020-01-01 00:10:00 5 20.0 xyz
2 2020-01-01 00:15:00 21 41.0 xyz
3 2020-01-01 00:20:00 0 41.0 xyz
4 2020-01-01 00:25:00 0 41.0 xyz
5 2020-01-01 00:30:00 0 41.0 xyz
6 2020-01-01 00:35:00 1 42.0 xyz
I couldn't find a way to do the filling during the resampling. For instance, using
df = df.resample('5T').agg({'flowers': 'sum',
                            'total_flowers': 'ffill',
                            'user': 'ffill'})
does not work (it gets you to the same place as asfreq, but with more room for accidentally missing out columns). That is odd, because applying ffill over the whole DataFrame does forward-fill the missing data (but we only want that for some columns, and the user column also gets dropped). Simply using asfreq and doing the filling after the fact seems fine to me with few columns.
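That said, the per-column intent can be written fairly compactly by passing a dict to fillna for the zero-filled column and forward-filling the rest afterwards; a sketch, starting again from the original four-row df (using the 'min' alias, since 'T' is deprecated on newer pandas):
out = (df.set_index('timestamp')
         .resample('5min').asfreq()
         .fillna({'flowers': 0})   # zeros only for flowers
         .ffill()                  # user and total_flowers carry forward
         .reset_index())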
(Crossed with Tom's answer above.)
You are almost there:
df = pd.DataFrame({'user': ['xyz', 'xyz', 'xyz', 'xyz'],
                   'timestamp': ['01-01-2020 00:05:00', '01-01-2020 00:10:00',
                                 '01-01-2020 00:15:00', '01-01-2020 00:35:00'],
                   'flowers': [15, 5, 21, 1],
                   'total_flowers': [15, 20, 41, 42]})
df['timestamp'] = pd.to_datetime(df['timestamp'])
r = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='5Min')
df = df.set_index('timestamp').reindex(r).rename_axis('timestamp').reset_index()
df['user'].ffill(inplace=True)
df['total_flowers'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
leads to the following output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
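If the real data contains several users (an assumption; the sample only has 'xyz'), both answers generalize by repeating the fill per user; a sketch with a hypothetical helper:
def fill_one_user(g):
    g = g.set_index('timestamp').resample('5min').asfreq()
    g['flowers'] = g['flowers'].fillna(0)
    g[['user', 'total_flowers']] = g[['user', 'total_flowers']].ffill()
    return g.reset_index()

out = df.groupby('user', group_keys=False).apply(fill_one_user)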

How to create a column with hour interval from two columns in Python?

I have the following dataframe structure:
exec_start_date exec_finish_date hour_start hour_finish session_qtd
2020-03-01 2020-03-02 22 0 1
2020-03-05 2020-03-05 22 23 3
2020-03-03 2020-03-04 18 7 4
2020-03-07 2020-03-07 18 18 2
As you can see above, I have three situations of session execution:
1) Start on one day and finish on another day, with different hours
2) Start and finish on the same day, with different hours
3) Start and finish on the same day, with the same hour
I need to create a column filling the interval between hour_start and hour_finish, and another column with the execution date. If a session starts on one day and finishes on another, the hours executed on the start date need to be associated with exec_start_date and the hours executed on the following day with exec_finish_date.
So, an intermediary dataset would look like this:
exec_start_date exec_finish_date hour_start hour_finish session_qtd hour_interval
2020-03-01 2020-03-02 22 0 1 [22,23,0]
2020-03-05 2020-03-05 22 23 3 [22,23]
2020-03-03 2020-03-04 18 7 4 [18,19,20,21,22,23,0,1,2,3,4,5,6,7]
2020-03-07 2020-03-07 18 18 2 [18]
And the final dataset would be like this:
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 18
2020-03-03 4 19
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-04 4 4
2020-03-04 4 5
2020-03-04 4 6
2020-03-04 4 7
2020-03-07 2 18
I have tried to create the interval with np.arange, but it didn't work properly for all cases, especially the ones that start on one day and finish on another.
Can you help me?
The way I would do this is to build the full datetimes, generate the range between them, and then pull the hours from that range. np.arange will not work because hours wrap around midnight.
# Create two new temp columns holding the full start and end datetimes
df['start'] = df['exec_start_date'].astype(str) + " " + df['hour_start'].astype(str) + ":00.000"
df['end'] = df['exec_finish_date'].astype(str) + " " + df['hour_finish'].astype(str) + ":00.000"
# Convert them both to datetimes
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# Apply a function to get your range
df['date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='H').tolist(), axis=1)
# Explode the new date column to get one row per date in the created intervals
df = df.explode(column='date')
# Make the two columns requested in your final table from the new column
df['exec_date'] = df['date'].dt.date
df['hour_interval'] = df['date'].dt.hour
# Copy the columns you wanted into a new df
df2 = df[['exec_date', 'session_qtd', 'hour_interval']].copy()
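As a quick check (an expectation based on the sample data, not verified output), df2 should now reproduce the final table from the question, one row per hour of each session. Note that on pandas 2.2+ the 'H' frequency alias is deprecated in favor of the lowercase 'h'.
print(df2.reset_index(drop=True).head(3))
# expected:
#     exec_date  session_qtd  hour_interval
# 0  2020-03-01            1             22
# 1  2020-03-01            1             23
# 2  2020-03-02            1              0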
Make df dtype string:
df = df.astype(str)
Concatenate the date with the hour and coerce to datetime:
df['exec_start_date'] = pd.to_datetime(df['exec_start_date'].str.cat(df['hour_start'], sep=' ') + ':00:00')
df['exec_finish_date'] = pd.to_datetime(df['exec_finish_date'].str.cat(df['hour_finish'], sep=' ') + ':00:00')
Derive hourly periods between the start and end datetimes:
df['date'] = df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'], freq='H'), end=pd.Period(x['exec_finish_date'], freq='H'), freq='H').hour.tolist(), axis=1)
Explode date to achieve the outcome:
df.explode('date')
You can skip the last step by exploding right after deriving the hourly periods, as follows:
df.assign(date=df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'], freq='H'), end=pd.Period(x['exec_finish_date'], freq='H'), freq='H').hour.tolist(), axis=1)).explode('date')
Outcome
exec_start_date exec_finish_date hour_start hour_finish session_qtd \
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
3 2020-03-07 18:00:00 2020-03-07 18:00:00 18 18 2
date
0 22
0 23
0 0
1 22
1 23
2 18
2 19
2 20
2 21
2 22
2 23
2 0
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 18
Another approach without converting the date fields:
import pandas as pd
import numpy as np
ss = '''
exec_start_date,exec_finish_date,hour_start,hour_finish,session_qtd
2020-03-01,2020-03-02,22,0,1
2020-03-05,2020-03-05,22,23,3
2020-03-03,2020-03-04,18,7,4
2020-03-07,2020-03-07,18,18,2
'''.strip()
with open('data.csv','w') as f: f.write(ss)
########## main script #############
df = pd.read_csv('data.csv')
df['hour_start'] = df['hour_start'].astype('int')
df['hour_finish'] = df['hour_finish'].astype('int')

# get the hour list, grouped by day; overnight sessions get two tuples
# tagged 101 (start day) and 102 (finish day)
df['hour_interval'] = df.apply(
    lambda x: list(range(x['hour_start'], x['hour_finish'] + 1))
    if x['exec_start_date'] == x['exec_finish_date']
    else [(101,) + tuple(range(x['hour_start'], 24))]
         + [(102,) + tuple(range(0, x['hour_finish'] + 1))],
    axis=1)
lst_col = 'hour_interval'

# split each day group into a separate row
hrdf = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)}
).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]

# choose the start or end day
hrdf['exec_date'] = hrdf.apply(
    lambda x: x['exec_start_date']
    if type(x[lst_col]) is int or x[lst_col][0] == 101
    else x['exec_finish_date'], axis=1)

# remove the day indicator
hrdf['hour_interval'] = hrdf['hour_interval'].apply(lambda x: [x] if type(x) is int else list(x[1:]))

# split the hours into separate rows
df = hrdf
hrdf = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)}
).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]

# columns for the final output
df_final = hrdf[['exec_date', 'session_qtd', 'hour_interval']]
print(df_final.to_string(index=False))
Output
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 18
2020-03-03 4 19
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-04 4 4
2020-03-04 4 5
2020-03-04 4 6
2020-03-04 4 7
2020-03-07 2 18

Split date range rows into years (ungroup) - Python Pandas

I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see last row where end=2023 and start = 2020), keeping the same value for column A, while splitting proportionally the value in column B:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
import pandas as pd

# TEST DATA:
text = """ start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0, 1])
#print("\n----\n df:", df)

#----------------------------------------
# SOLUTION:
def split_years(r):
    """
    Split row 'r' where "end" - "start" is greater than 0.
    The new rows have repeated values of 'A', and 'B' divided by the number of years.
    Return: a DataFrame with one row per year.
    """
    t1, t2 = r["start"], r["end"]
    ys = t2.year - t1.year
    kk = 0 if t1.is_year_end else 1
    if ys > 0:
        l1 = [t1] + [t1 + pd.offsets.YearBegin(i) for i in range(1, ys + 1)]
        l2 = [t1 + pd.offsets.YearEnd(i) for i in range(kk, ys + kk)] + [t2]
        return pd.DataFrame({"start": l1, "end": l2, "A": r.A, "B": r.B / len(l1)})
    print("year difference <= 0!")
    return None

# Create two groups: one for rows where 'start' and 'end' are in the same year, one for the others:
grps = df.groupby(lambda idx: (df.loc[idx, "start"].year - df.loc[idx, "end"].year) != 0).groups
print("\n---- grps:\n", grps)

# Extract the "one year" rows into a data frame:
df1 = df.loc[grps[False]]
#print("\n---- df1:\n", df1)

# Extract the rows to be split:
df2 = df.loc[grps[True]]
print("\n---- df2:\n", df2)

# Split the rows and put the resulting data frames into a list:
ldfs = [split_years(df2.loc[row]) for row in df2.index]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr, "\n")

# Insert the "one year" data frame into the list, and concatenate them:
ldfs.insert(0, df1)
df_rslt = pd.concat(ldfs, sort=False)
#print("\n---- df_rslt:\n", df_rslt)

# Housekeeping:
df_rslt = df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n", df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
Bit of a different approach, adding new columns instead of new rows. But I think this accomplishes what you want to do.
df["years_apart"] = (
(df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
df[f"{years}_end_date"] = pd.NaT
df.loc[
df["years_apart"] == years, f"{years}_end_date"
] = df.loc[
df["years_apart"] == years, "start_date"
] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I have solved it by creating a date difference and a counter that adds years to the repeated rows:
import numpy as np
from datetime import timedelta

# calculate the difference between start and end year
table['diff'] = (table['end'] - table['start']) // timedelta(days=365)
table['diff'] = table['diff'] + 1
# replicate rows depending on the number of years
table = table.reindex(table.index.repeat(table['diff']))
# counter that increases for diff > 1; assigns increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum() // table['diff']
table['start'] = np.where(table['diff'] > 1, table['start'] + table['count'] - 1, table['start'])
table['end'] = table['start']
# split B among the years
table['B'] = table['B'] // table['diff']
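For comparison, here is a vectorized sketch of the same split, assuming the start and end columns have already been parsed to datetimes (e.g. with pd.to_datetime and format='%d.%m.%Y'); the helper name split_by_year is hypothetical:
import numpy as np
import pandas as pd

def split_by_year(df):
    # number of calendar years each row touches
    n = df['end'].dt.year - df['start'].dt.year + 1
    out = df.loc[df.index.repeat(n)].copy()
    off = out.groupby(level=0).cumcount()   # 0..n-1 within each original row
    year = out['start'].dt.year + off
    # year boundaries for the intermediate chunks
    ystart = pd.to_datetime(year.astype(str) + '-01-01')
    yend = pd.to_datetime(year.astype(str) + '-12-31')
    out['start'] = np.where(off.eq(0), out['start'], ystart)              # keep the real start in the first year
    out['end'] = np.where(year.eq(out['end'].dt.year), out['end'], yend)  # keep the real end in the last year
    out['B'] = out['B'].to_numpy() / n.loc[out.index].to_numpy()          # split B evenly across the years
    return out.reset_index(drop=True)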
