Splitting DataFrame rows into multiple based on a condition - python

I have a DataFrame df1 that has a bunch of columns like so:
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
2     30     90   1/1/2020  6/4/2021     50     60
3     40    100  12/5/2020  7/4/2021     70     80
4     89    300   4/5/2020  6/8/2022     40     10
I need to go over the rows and split the cross-year periods into continuous same-year ones. The remaining values in each row need to stay the same and keep their data types, like so:
   val_1  val_2      start         end  val_3  val_4
0     10     70   1/1/2020    3/4/2020     10     20
1     20     80   1/1/2020  12/31/2020     30     40
2     20     80   1/1/2021    3/6/2021     30     40
3     30     90   1/1/2020  12/31/2020     50     60
4     30     90   1/1/2021    6/4/2021     50     60
5     40    100  12/5/2020  12/31/2020     70     80
6     40    100   1/1/2021    7/4/2021     70     80
7     89    300   4/5/2020  12/31/2020     40     10
8     89    300   1/1/2021  12/31/2021     40     10
9     89    300   1/1/2022    6/8/2022     40     10
Is there a fast and efficient way to do this? I tried iterating over the rows and doing it manually, but I'm having trouble with the indices and with appending rows after a given index. Also, people have said it's probably not a good idea to edit the thing I'm iterating over, so I was wondering if there is a better way to do it. Any suggestions will be appreciated. Thank you!
EDIT
If a row spans more than two calendar years, it should break into three or more rows accordingly. I've edited the tables to accurately reflect this. Thank you!

Here's a different approach. Note that I've already converted start and end to datetimes, and I didn't bother sorting the resultant DataFrame because I didn't want to assume a specific ordering for your use-case.
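For reference, if your start and end columns are still strings like 1/1/2020, that conversion is just pd.to_datetime (a minimal sketch; the format string is an assumption based on the month-first dates shown in the question):
# Assuming US-style month-first date strings, e.g. 12/5/2020
df["start"] = pd.to_datetime(df["start"], format="%m/%d/%Y")
df["end"] = pd.to_datetime(df["end"], format="%m/%d/%Y")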
import pandas as pd

def jump_to_new_year(df: pd.DataFrame) -> pd.DataFrame:
    # Advance each start date to January 1 of the following year.
    df["start"] = df["start"].map(lambda t: pd.Timestamp(t.year + 1, 1, 1))
    return df

def fill_to_year_end(df: pd.DataFrame) -> pd.DataFrame:
    # Clamp each end date to December 31 of its start date's year.
    df["end"] = df["start"].map(lambda t: pd.Timestamp(t.year, 12, 31))
    return df

def roll_over(df: pd.DataFrame) -> pd.DataFrame:
    # Rows whose period crosses a year boundary need to be split.
    mask = df.start.dt.year != df.end.dt.year
    if all(~mask):
        return df
    start_df = fill_to_year_end(df[mask].copy())
    end_df = roll_over(jump_to_new_year(df[mask].copy()))
    return pd.concat([df[~mask], start_df, end_df]).reset_index(drop=True)
This is a recursive function. It first checks whether any start-end pairs have mismatched years. If not, we simply return the DataFrame. If so, we clamp those rows to the end of their start year in start_df, jump their start dates to January 1 of the next year, and recurse on the result as end_df. Each recursive step consumes one year boundary, so the recursion terminates once no remaining row crosses a year.
Warning: this solution assumes that every start date falls on or before its end date. If you start in 2020 and end in 2019, the mask never clears, so you will recurse infinitely and blow the stack.
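A cheap guard against that degenerate case is to validate the input before calling roll_over (a minimal sketch, assuming start and end are already datetimes):
# Fail fast instead of recursing forever on malformed input
assert (df["start"] <= df["end"]).all(), "every start must be on or before its end"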
Demo:
>>> df
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2021-07-04 70 80
4 89 300 2020-04-05 2022-06-08 40 10
>>> roll_over(df)
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
2 30 90 2020-01-01 2020-12-31 50 60
3 40 100 2020-12-05 2020-12-31 70 80
4 89 300 2020-04-05 2020-12-31 40 10
5 20 80 2021-01-01 2021-03-06 30 40
6 30 90 2021-01-01 2021-06-04 50 60
7 40 100 2021-01-01 2021-07-04 70 80
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
# An example of reordering the DataFrame
>>> roll_over(df).sort_values(by=["val_1", "start"])
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
5 20 80 2021-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2020-12-31 50 60
6 30 90 2021-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2020-12-31 70 80
7 40 100 2021-01-01 2021-07-04 70 80
4 89 300 2020-04-05 2020-12-31 40 10
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10

Find the year ends with date_range, then explode:
df['end'] = [[y] + pd.date_range(x, y)[pd.date_range(x, y).is_year_end].strftime('%m/%d/%y').tolist()
             for x, y in zip(df['start'], df['end'])]
df = df.explode('end')
df
Out[29]:
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
1     20     80   1/1/2020  12/31/20     30     40
2     30     90   1/1/2020  6/4/2021     50     60
2     30     90   1/1/2020  12/31/20     50     60
3     40    100  12/5/2020  7/4/2021     70     80
3     40    100  12/5/2020  12/31/20     70     80
4     89    300   4/5/2020  6/8/2022     40     10
4     89    300   4/5/2020  12/31/20     40     10
4     89    300   4/5/2020  12/31/21     40     10
Update
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)
# Sort both lists so the n-th new start pairs with the n-th new end
# (an unordered set() here can misalign the two exploded columns).
df['Newstart'] = [sorted(set([x] + pd.date_range(x, y)[pd.date_range(x, y).is_year_start].tolist()))
                  for x, y in zip(df['start'], df['end'])]
df['Newend'] = [sorted(set(pd.date_range(x, y)[pd.date_range(x, y).is_year_end].tolist() + [y]))
                for x, y in zip(df['start'], df['end'])]
out = df.explode(['Newstart', 'Newend'])
   val_1  val_2      start        end  val_3  val_4   Newstart     Newend
0     10     70 2020-01-01 2020-03-04     10     20 2020-01-01 2020-03-04
1     20     80 2020-01-01 2021-03-06     30     40 2020-01-01 2020-12-31
1     20     80 2020-01-01 2021-03-06     30     40 2021-01-01 2021-03-06
2     30     90 2020-01-01 2021-06-04     50     60 2020-01-01 2020-12-31
2     30     90 2020-01-01 2021-06-04     50     60 2021-01-01 2021-06-04
3     40    100 2020-12-05 2021-07-04     70     80 2020-12-05 2020-12-31
3     40    100 2020-12-05 2021-07-04     70     80 2021-01-01 2021-07-04
4     89    300 2020-04-05 2022-06-08     40     10 2020-04-05 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2021-01-01 2021-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2022-01-01 2022-06-08
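One caveat worth knowing: passing a list of columns to explode requires pandas 1.3 or newer. On older versions, a common workaround (a sketch, assuming every exploded column holds equal-length lists per row) is to park the scalar columns in the index and explode the rest:
# Explode Newstart/Newend together on pandas < 1.3
scalar_cols = [c for c in df.columns if c not in ('Newstart', 'Newend')]
out = df.set_index(scalar_cols).apply(pd.Series.explode).reset_index()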

Related

How to Multi-Index an existing DataFrame

"Multi-Index" might be the incorrect term of what I'm looking to do, but below is an example of what I'm trying to accomplish.
Original DF:
HOD site1_units site1_orders site2_units site2_orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Desired DF
site1 site2
HOD units orders units orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Is there an efficient way to construct/format the dataframe like this? Thank you for the help!
Try this:
df = df.set_index('HOD')
df = df.set_axis(df.columns.map(lambda x: tuple(x.split('_'))), axis=1)
Output:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Here is one way you can do this:
df = df.set_index("HOD")
index = pd.MultiIndex.from_tuples(zip(["site1", "site1", "site2", "site2"],
                                      ["units", "orders", "units", "orders"]))
df.columns = index
Result:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
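For wider frames where typing the tuples by hand gets tedious, the same split can be done in one step with the string accessor (a sketch of the same idea, not taken from either answer):
df = df.set_index('HOD')
# str.split with expand=True returns a MultiIndex directly
df.columns = df.columns.str.split('_', expand=True)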

Pandas: How to calculate the average of one value after another (succeeding average)

Imagine a dataset like below:
result country start end
5 A 2/14/2022 2/21/2022
10 A 2/21/2022 2/28/2022
30 B 2/28/2022 3/7/2022
50 C 1/3/2022 1/10/2022
60 C 1/10/2022 1/17/2022
70 D 1/17/2022 1/24/2022
40 E 1/24/2022 1/31/2022
20 E 1/31/2022 2/7/2022
30 A 2/7/2022 2/14/2022
20 B 2/14/2022 2/21/2022
Expected output
I need to group by country, start, and end, and the average column should hold the mean of each result value and the one above it.
For example:
grouping by country, start, and end, the averages are 5, (5+10)/2, (10+30)/2, (30+50)/2, (50+60)/2, and so on:
result  average
5       5       (e.g. just 5)
10      7.5     ((5+10)/2: the current value plus the value above, divided by 2)
30      20      ((10+30)/2)
50      40      ((30+50)/2)
60      55      ((50+60)/2)
70      65      ...
40      55      ...
20      30      ...
30      25      ...
20      25      ...
Try this solution, grouping by country and date; min_periods=1 ensures that groups with fewer rows than the window size (2) still produce a value:
df_data['average'] = df_data.groupby(['country', 'date'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
In case you want to group by country only:
df_data['average'] = df_data.groupby(['country'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
df_data
country date result average
0 A 2/14/2022 5 5.0
1 A 2/21/2022 10 7.5
2 B 2/28/2022 30 30.0
3 C 1/3/2022 50 50.0
4 C 1/10/2022 60 55.0
5 D 1/17/2022 70 70.0
6 E 1/24/2022 40 40.0
7 E 1/31/2022 20 30.0
8 A 2/7/2022 30 20.0
9 B 2/14/2022 20 25.0
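Note that the expected output in the question runs over consecutive rows regardless of country, which is just an ungrouped rolling mean. If that is what's wanted, drop the groupby entirely (a minimal sketch):
# Gives 5, 7.5, 20, 40, 55, 65, 55, 30, 25, 25 for the sample data
df_data['average'] = df_data['result'].rolling(2, min_periods=1).mean()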

Pandas Dataframe Transformations

Consider a data frame which looks like:
A B C
0 2018-10-13 100 50
1 2018-10-13 200 25
2 2018-10-13 300 10
3 2018-10-13 400 5
4 2018-10-13 500 0
5 2018-10-14 100 100
6 2018-10-14 200 50
7 2018-10-14 300 25
8 2018-10-14 400 10
9 2018-10-14 500 5
10 2018-10-15 100 150
11 2018-10-15 200 100
12 2018-10-15 300 50
13 2018-10-15 400 25
14 2018-10-15 500 10
The transformation that I want to perform is:
GroupBy column A.
Then group column B into 3 intervals ([0, 100] say intval-1, [101, 200] say intval-2, [201, end] say intval-3; this can be n intervals in general).
Perform a sum aggregation on column C.
So my transformed/pivoted dataframe should look like:
            A  intval-1  intval-2  intval-3
0  2018-10-13        50        25        15
1  2018-10-14       100        50        40
2  2018-10-15       150       100        85
An easy way to implement this would be a great help.
Thank you.
You can cut, then pivot_table:
import numpy as np
import pandas as pd

bin_lst = [0, 100, 200, np.inf]
cut_b = pd.cut(df['B'], bins=bin_lst,
               labels=[f'intval-{i}' for i in range(1, len(bin_lst))])

res = df.assign(B=cut_b)\
        .pivot_table(index='A', columns='B', values='C', aggfunc='sum')
print(res)
B intval-1 intval-2 intval-3
A
2018-10-13 50 25 15
2018-10-14 100 50 40
2018-10-15 150 100 85
Using pd.cut with groupby + unstack:
df.B = pd.cut(df.B, bins=[0, 100, 200, np.inf], labels=['intval-1', 'intval-2', 'intval-3'])
df.groupby(['A', 'B']).C.sum().unstack()
Out[35]:
B intval-1 intval-2 intval-3
A
2018-10-13 50 25 15
2018-10-14 100 50 40
2018-10-15 150 100 85
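For what it's worth, pd.crosstab is a third spelling of the same pivot (a sketch reusing cut_b from the first answer):
# index = A, columns = interval label, cells = sum of C
pd.crosstab(df['A'], cut_b, values=df['C'], aggfunc='sum')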

Pandas: groupby and make a new column applying aggregate to two columns

I'm having difficulty applying agg to a grouped pandas dataframe.
I have a dataframe df like this:
order_id distance_theo bird_distance
10 100 80
10 80 80
10 70 80
11 90 70
11 70 70
11 60 70
12 200 180
12 150 180
12 100 180
12 60 180
I want to group by order_id and make a new column crow by dividing the distance_theo of the first row in each group by the bird_distance of the first row (or of any row, because there is only one bird_distance value per group).
order_id distance_theo bird_distance crow
10 100 80 1.25
10 80 80 1.25
10 70 80 1.25
11 90 70 1.29
11 70 70 1.29
11 60 70 1.29
12 200 180 1.11
12 150 180 1.11
12 100 180 1.11
12 60 180 1.11
My attempt:
df.groupby('order_id').agg({'crow', lambda x: x.distance_theo.head(1) / x.bird_distance.head(1)})
But I get an error:
'Series' object has no attribute 'distance_theo'
How can I solve this? Thanks for any kinds of advice!
Using groupby with first:
s = df.groupby('order_id').transform('first')
df.assign(crow=s.distance_theo.div(s.bird_distance))
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
You could do it without groupby, using drop_duplicates and join:
df.join(df.drop_duplicates('order_id')
          .eval('crow = distance_theo / bird_distance')[['crow']]).ffill()
or use assign instead of eval, per @jezrael's comments below:
df.join(df.drop_duplicates('order_id')
          .assign(crow=df.distance_theo / df.bird_distance)[['crow']]).ffill()
Output:
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
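For completeness, the OP's per-group idea can also be written with apply instead of agg, then mapped back onto the original rows (a sketch, not taken from the answers above):
# One ratio per order_id, computed from each group's first row
per_group = df.groupby('order_id').apply(lambda g: g['distance_theo'].iloc[0] / g['bird_distance'].iloc[0])
df['crow'] = df['order_id'].map(per_group)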

Pandas cumulative conditional sum by dates

Imagine a pandas DataFrame like this
date id initial_value part_value
2016-01-21 1 100 10
2016-05-18 1 100 20
2016-03-15 2 150 75
2016-07-28 2 150 50
2016-08-30 2 150 25
2015-07-21 3 75 75
Generated with the following:
df = pd.DataFrame({
    'id': (1, 1, 2, 2, 2, 3),
    'date': tuple(pd.to_datetime(date) for date in
                  ('2016-01-21', '2016-05-18', '2016-03-15', '2016-07-28', '2016-08-30', '2015-07-21')),
    'initial_value': (100, 100, 150, 150, 150, 75),
    'part_value': (10, 20, 75, 50, 25, 75)}).sort_values(['id', 'date'])
I wish to add a column with the remaining value, defined as initial_value minus the cumulative sum of part_value over earlier dates with the same id. Hence my goal is:
date id initial_value part_value goal
2016-01-21 1 100 10 100
2016-05-18 1 100 20 90
2016-03-15 2 150 75 150
2016-07-28 2 150 50 75
2016-08-30 2 150 25 25
2015-07-21 3 75 75 75
I'm thinking that a solution can be made by combining the solution from here and here, but I can't exactly figure it out.
If the dates don't need to be used, you can combine add, sub, and groupby with cumsum:
df['goal'] = df.initial_value.add(df.part_value).sub(df.groupby('id').part_value.cumsum())
print (df)
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
which is the same as:
df['goal'] = df.initial_value + df.part_value - df.groupby('id').part_value.cumsum()
print (df)
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
I actually came up with a solution myself as well. I guess it is essentially the same thing happening:
df['goal'] = df.initial_value - ((df.part_value).groupby(df.id).cumsum() - df.part_value)
df
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
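The "cumsum minus the current value" trick in both answers amounts to a shifted cumulative sum per group; transform keeps the original row order (a sketch, assuming pandas >= 0.24 for shift's fill_value):
# Sum of part_value over all *previous* rows with the same id
df['goal'] = df['initial_value'] - df.groupby('id')['part_value'].transform(lambda s: s.cumsum().shift(fill_value=0))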
