Consider a data frame which looks like:
A B C
0 2018-10-13 100 50
1 2018-10-13 200 25
2 2018-10-13 300 10
3 2018-10-13 400 5
4 2018-10-13 500 0
5 2018-10-14 100 100
6 2018-10-14 200 50
7 2018-10-14 300 25
8 2018-10-14 400 10
9 2018-10-14 500 5
10 2018-10-15 100 150
11 2018-10-15 200 100
12 2018-10-15 300 50
13 2018-10-15 400 25
14 2018-10-15 500 10
The transformation I want to perform is:
Group by column A.
Then bin column B into 3 intervals: [0, 100] as intval-1, [101, 200] as intval-2, and [201, end] as intval-3. (This could be n intervals in general.)
Perform a sum aggregation on column C.
So my transformed/pivoted dataframe should look like:
A intval-1 intval-2 intval-3
0 2018-10-13 50 25 15
1 2018-10-14 100 50 40
2 2018-10-15 150 100 85
An easy way to implement this would be a great help.
Thank you.
You can use cut, then pivot_table:
import numpy as np
import pandas as pd

# n + 1 edges define n intervals; np.inf makes the last interval open-ended
bin_lst = [0, 100, 200, np.inf]
cut_b = pd.cut(df['B'], bins=bin_lst,
               labels=[f'intval-{i}' for i in range(1, len(bin_lst))])
res = df.assign(B=cut_b)\
        .pivot_table(index='A', columns='B', values='C', aggfunc='sum')
print(res)
B intval-1 intval-2 intval-3
A
2018-10-13 50 25 15
2018-10-14 100 50 40
2018-10-15 150 100 85
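To generalize to n intervals, as the question asks, only the edge list changes. A minimal sketch assuming equal 100-wide bins with an open-ended last interval (the bin width and n = 4 here are illustrative, not from the original post):

n = 4
bin_lst = [i * 100 for i in range(n)] + [np.inf]   # n + 1 edges -> n intervals
labels = [f'intval-{i}' for i in range(1, len(bin_lst))]
res = df.assign(B=pd.cut(df['B'], bins=bin_lst, labels=labels))\
        .pivot_table(index='A', columns='B', values='C', aggfunc='sum')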
Using pd.cut with groupby + unstack
df.B = pd.cut(df.B, bins=[0, 100, 200, np.inf],
              labels=['intval-1', 'intval-2', 'intval-3'])
df.groupby(['A', 'B']).C.sum().unstack()
Out[35]:
B intval-1 intval-2 intval-3
A
2018-10-13 50 25 15
2018-10-14 100 50 40
2018-10-15 150 100 85
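Both approaches leave NaN where a date has no rows in an interval. If zeros are preferable, pivot_table accepts fill_value (unstack takes the same keyword); reusing cut_b from the first snippet:

res = df.assign(B=cut_b)\
        .pivot_table(index='A', columns='B', values='C',
                     aggfunc='sum', fill_value=0)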
I have a DataFrame df1 with a bunch of columns, like so:
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
2     30     90   1/1/2020  6/4/2021     50     60
3     40    100  12/5/2020  7/4/2021     70     80
4     89    300   4/5/2020  6/8/2022     40     10
I need to iterate over the rows and split the cross-year periods into continuous same-year ones. The remaining values in each row need to stay the same and keep their data types, like so:
   val_1  val_2      start         end  val_3  val_4
0     10     70   1/1/2020    3/4/2020     10     20
1     20     80   1/1/2020  12/31/2020     30     40
2     20     80   1/1/2021    3/6/2021     30     40
3     30     90   1/1/2020  12/31/2020     50     60
4     30     90   1/1/2021    6/4/2021     50     60
5     40    100  12/5/2020  12/31/2020     70     80
6     40    100   1/1/2021    7/4/2021     70     80
7     89    300   4/5/2020  12/31/2020     40     10
8     89    300   1/1/2021  12/31/2021     40     10
9     89    300   1/1/2022    6/8/2022     40     10
Is there a fast and efficient way to do this? I tried iterating over the rows and doing it, but I'm having trouble with the indices and with appending rows after an index. Also, people have said it's probably not a good idea to edit the thing you're iterating over, so I was wondering if there is a better way. Any suggestions will be appreciated. Thank you!
EDIT
If the row spans more than a year, that should break into 3 or more rows, accordingly. I've edited the tables to accurately reflect this. Thank you!
Here's a different approach. Note that I've already converted start and end to datetimes, and I didn't bother sorting the resultant DataFrame because I didn't want to assume a specific ordering for your use-case.
import pandas as pd

def jump_to_new_year(df: pd.DataFrame) -> pd.DataFrame:
    # Move each start date to Jan 1 of the following year.
    df["start"] = df["start"].map(lambda t: pd.Timestamp(t.year + 1, 1, 1))
    return df

def fill_to_year_end(df: pd.DataFrame) -> pd.DataFrame:
    # Clamp each end date to Dec 31 of its start date's year.
    df["end"] = df["start"].map(lambda t: pd.Timestamp(t.year, 12, 31))
    return df

def roll_over(df: pd.DataFrame) -> pd.DataFrame:
    # Rows whose start and end fall in different years need splitting.
    mask = df.start.dt.year != df.end.dt.year
    if all(~mask):
        return df
    start_df = fill_to_year_end(df[mask].copy())
    end_df = roll_over(jump_to_new_year(df[mask].copy()))
    return pd.concat([df[~mask], start_df, end_df]).reset_index(drop=True)
This is a recursive function. It first checks if any start-end date pairs have mismatched years. If not, we simply return the DataFrame. If so, we fill to the end of the year in the start_df DataFrame. Then we jump to the new year and fill that to the end date in the end_df DataFrame. Then we recurse on end_df, which will be a smaller subset of the original input.
Warning: this solution assumes that all start dates occur on or before the end date's year. If you start in 2020 and end in 2019, you will recurse infinitely and blow the stack.
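If that invariant is not guaranteed by the upstream data, a cheap check before the first call avoids the infinite recursion. This wrapper is a hypothetical addition, not part of the answer above:

def safe_roll_over(df: pd.DataFrame) -> pd.DataFrame:
    # Reject rows whose start falls after their end before recursing.
    bad = df["start"] > df["end"]
    if bad.any():
        raise ValueError(f"{bad.sum()} row(s) have start after end")
    return roll_over(df)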
Demo:
>>> df
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2021-07-04 70 80
4 89 300 2020-04-05 2022-06-08 40 10
>>> roll_over(df)
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
2 30 90 2020-01-01 2020-12-31 50 60
3 40 100 2020-12-05 2020-12-31 70 80
4 89 300 2020-04-05 2020-12-31 40 10
5 20 80 2021-01-01 2021-03-06 30 40
6 30 90 2021-01-01 2021-06-04 50 60
7 40 100 2021-01-01 2021-07-04 70 80
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
# An example of reordering the DataFrame
>>> roll_over(df).sort_values(by=["val_1", "start"])
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
5 20 80 2021-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2020-12-31 50 60
6 30 90 2021-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2020-12-31 70 80
7 40 100 2021-01-01 2021-07-04 70 80
4 89 300 2020-04-05 2020-12-31 40 10
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
Find the year ends with date_range, then explode:
df['end'] = [[y] + pd.date_range(x, y)[pd.date_range(x, y).is_year_end].strftime('%m/%d/%y').tolist()
             for x, y in zip(df['start'], df['end'])]
df = df.explode('end')
df
Out[29]:
val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 3/6/2021 30 40
1 20 80 1/1/2020 12/31/20 30 40
2 30 90 1/1/2020 6/4/2021 50 60
2 30 90 1/1/2020 12/31/20 50 60
3 40 100 12/5/2020 7/4/2021 70 80
3 40 100 12/5/2020 12/31/20 70 80
Update
df.end = pd.to_datetime(df.end)
df.start = pd.to_datetime(df.start)
df['Newstart'] = [list(set([x] + pd.date_range(x, y)[pd.date_range(x, y).is_year_start].tolist()))
                  for x, y in zip(df['start'], df['end'])]
df['Newend'] = [[y] + pd.date_range(x, y)[pd.date_range(x, y).is_year_end].tolist()
                for x, y in zip(df['start'], df['end'])]
out = df.explode(['Newend', 'Newstart'])
val_1 val_2 start end val_3 val_4 Newstart Newend
0 10 70 2020-01-01 2020-03-04 10 20 2020-01-01 2020-03-04
1 20 80 2020-01-01 2021-03-06 30 40 2021-01-01 2021-03-06
1 20 80 2020-01-01 2021-03-06 30 40 2020-01-01 2020-12-31
2 30 90 2020-01-01 2021-06-04 50 60 2021-01-01 2021-06-04
2 30 90 2020-01-01 2021-06-04 50 60 2020-01-01 2020-12-31
3 40 100 2020-12-05 2021-07-04 70 80 2021-01-01 2021-07-04
3 40 100 2020-12-05 2021-07-04 70 80 2020-12-05 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2020-04-05 2022-06-08
4 89 300 2020-04-05 2022-06-08 40 10 2022-01-01 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2021-01-01 2021-12-31
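Note that list(set(...)) has no guaranteed order, which is why the last group above pairs Newstart 2022-01-01 with Newend 2020-12-31. Sorting both lists (and deduplicating both, for symmetry) keeps each start aligned with its matching end; a sketch of that fix, assuming pandas 1.3+ for multi-column explode:

df['Newstart'] = [sorted(set([x] + pd.date_range(x, y)[pd.date_range(x, y).is_year_start].tolist()))
                  for x, y in zip(df['start'], df['end'])]
df['Newend'] = [sorted(set([y] + pd.date_range(x, y)[pd.date_range(x, y).is_year_end].tolist()))
                for x, y in zip(df['start'], df['end'])]
out = df.explode(['Newstart', 'Newend'])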
I would like to sum the last 4 numbers of a dataframe column, with the result appended as a new column. I did it with a for loop, but that is slow when there are many rows. Is there a way to use df.apply(lambda x: ...) to achieve this?
Sample Input:
values
0 10
1 20
2 30
3 40
4 50
5 60
Output:
values result
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
Use pandas.DataFrame.rolling:
>>> df.rolling(4, min_periods=1).sum()
values
0 10.0
1 30.0
2 60.0
3 100.0
4 140.0
5 180.0
6 220.0
7 260.0
8 300.0
Add it together:
>>> df.assign(results=df.rolling(4, min_periods=1).sum().astype(int))
values results
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
6 70 220
7 80 260
8 90 300
I'm having difficulty applying agg to a grouped pandas DataFrame.
I have a dataframe df like this:
order_id distance_theo bird_distance
10 100 80
10 80 80
10 70 80
11 90 70
11 70 70
11 60 70
12 200 180
12 150 180
12 100 180
12 60 180
I want to group by order_id and make a new column crow by dividing the distance_theo of the first row in each group by the bird_distance of the first row of that group (or of any row, since there is only one bird_distance value per group).
order_id distance_theo bird_distance crow
10 100 80 1.25
10 80 80 1.25
10 70 80 1.25
11 90 70 1.29
11 70 70 1.29
11 60 70 1.29
12 200 180 1.11
12 150 180 1.11
12 100 180 1.11
12 60 180 1.11
My attempt:
df.groupby('order_id').agg({'crow', lambda x: x.distance_theo.head(1) / x.bird_distance.head(1)})
But I get an error:
'Series' object has no attribute 'distance_theo'
How can I solve this? Thanks for any kinds of advice!
Using groupby with first:
s = df.groupby('order_id').transform('first')
df.assign(crow=s.distance_theo.div(s.bird_distance))
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
You could do it without groupby, using drop_duplicates and join:
df.join(df.drop_duplicates('order_id')\
.eval('crow = distance_theo / bird_distance')[['crow']]).ffill()
Or use assign instead of eval, per @jezrael's comments:
df1.join(df1.drop_duplicates('order_id')\
.assign(crow=df1.distance_theo / df1.bird_distance)[['crow']]).ffill()
Output:
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
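One caveat: the ffill works because each order_id's rows are contiguous; if groups could interleave, mapping the per-group ratio back by key is safer. A sketch of that variant (not from the original answer):

first = df.drop_duplicates('order_id').set_index('order_id')
df['crow'] = df['order_id'].map(first['distance_theo'] / first['bird_distance'])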
I'm wondering how to sum up the next 10 rows of a data frame from any point.
I tried rolling(10, min_periods=1).sum(), but the very first row should sum the 10 rows below it. cumsum() has a similar issue.
So if my data frame were just column A, I'd like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to writing this operation in Excel and copying the formula down the column.
You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0
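On recent pandas versions (1.1+), a forward-looking window can also be expressed directly, without the double reversal, via FixedForwardWindowIndexer; a sketch:

from pandas.api.indexers import FixedForwardWindowIndexer

# Each window starts at the current row and extends 10 rows ahead.
indexer = FixedForwardWindowIndexer(window_size=10)
df['B'] = df['A'].rolling(indexer, min_periods=1).sum()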
Imagine a pandas DataFrame like this
date id initial_value part_value
2016-01-21 1 100 10
2016-05-18 1 100 20
2016-03-15 2 150 75
2016-07-28 2 150 50
2016-08-30 2 150 25
2015-07-21 3 75 75
Generated with following
df = pd.DataFrame({
'id': (1, 1, 2, 2, 2, 3),
'date': tuple(pd.to_datetime(date) for date in
('2016-01-21', '2016-05-18', '2016-03-15', '2016-07-28', '2016-08-30', '2015-07-21')),
'initial_value': (100, 100, 150, 150, 150, 75),
'part_value': (10, 20, 75, 50, 25, 75)}).sort_values(['id', 'date'])
I wish to add a column with the remaining value, defined as initial_value minus the cumulative sum of part_value over earlier dates within the same id. Hence my goal is:
date id initial_value part_value goal
2016-01-21 1 100 10 100
2016-05-18 1 100 20 90
2016-03-15 2 150 75 150
2016-07-28 2 150 50 75
2016-08-30 2 150 25 25
2015-07-21 3 75 75 75
I'm thinking that a solution can be made by combining the solution from here and here, but I can't exactly figure it out.
If you don't need to use the dates, you can combine add, sub and groupby with cumsum:
df['goal'] = df.initial_value.add(df.part_value).sub(df.groupby('id').part_value.cumsum())
print (df)
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
Which is the same as:
df['goal'] = df.initial_value + df.part_value - df.groupby('id').part_value.cumsum()
print (df)
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
I actually came up with a solution myself as well. I guess roughly the same thing is happening under the hood.
df['goal'] = df.initial_value - ((df.part_value).groupby(df.id).cumsum() - df.part_value)
df
date id initial_value part_value goal
0 2016-01-21 1 100 10 100
1 2016-05-18 1 100 20 90
2 2016-03-15 2 150 75 150
3 2016-07-28 2 150 50 75
4 2016-08-30 2 150 25 25
5 2015-07-21 3 75 75 75
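An equivalent reading of both formulas: subtracting part_value from the running total is the same as shifting the cumulative sum down one row within each id, so each row only sees the part_value of earlier rows. A sketch of that formulation:

# shift(fill_value=0) gives the first row of each id a running total of 0.
df['goal'] = df['initial_value'] - df.groupby('id')['part_value'] \
    .transform(lambda s: s.cumsum().shift(fill_value=0))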