I would like to add the last 4 numbers of a dataframe column and append the output as a new column. I have used a for loop to do it, but it is slow when there are many rows. Is there a way to use df.apply(lambda x: ...) to achieve this?
Sample Input:
values
0 10
1 20
2 30
3 40
4 50
5 60
Output:
values result
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
Use pandas.DataFrame.rolling (the output below uses a slightly longer frame than the sample, with values 10 through 90):
>>> df.rolling(4, min_periods=1).sum()
values
0 10.0
1 30.0
2 60.0
3 100.0
4 140.0
5 180.0
6 220.0
7 260.0
8 300.0
Putting it together:
>>> df.assign(results=df.rolling(4, min_periods=1).sum().astype(int))
values results
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
6 70 220
7 80 260
8 90 300
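If you prefer not to assign a whole one-column frame into a new column, an equivalent sketch that works on the values Series directly (assuming the column is named values, as in the sample) is:
df['result'] = df['values'].rolling(4, min_periods=1).sum().astype(int)
which reproduces the result column from the sample output.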
Below, I have two dataframes. I need to update df_mapped using df_original.
For each x_time in df_mapped, I need to find the 3 closest rows (closest defined by the difference in x_price) and add those to the df_mapped dataframe.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
Intermediate output: from this, I need to select the 3 rows with the minimum x_price_delta for each x_time:
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
Final step: keeping x_time fixed, I need to flatten the dataframe so that the 3 closest rows end up in one row:
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and the last step. I can't think of a way out; what could be done to massage the dataframe?
This is not a complete solution, but it might help you get unstuck.
At the end we get the correct data.
In [1]: df = (df_int_out.groupby("x_time")
   ...:           .apply(lambda x: x.sort_values(ascending=False, by="x_price_delta"))
   ...:           .set_index(["x_time", "expiration_x"])
   ...:           .drop(["x_price_delta", "x_price_x"], axis=1))
In [2]: df1 = df.iloc[1:-1]
In [3]: df1.groupby(df1.index).apply(lambda x: pd.concat(
   ...:     [pd.DataFrame(d) for d in x.values], axis=1).unstack())
Out[3]:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
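For instance, one sketch that goes all the way to the flattened final frame (not the only approach; it assumes the merged df_mapped from above, with the column names produced by the merge: expiration_x, x_price_x, expiration_y, x_price_y, p_price, x_price_delta) ranks the 3 closest rows per x_time with cumcount and pivots them wide:
closest = (df_mapped.sort_values('x_price_delta')
                    .groupby('x_time').head(3)                     # 3 smallest deltas per x_time
                    .sort_values(['x_time', 'expiration_y']))
closest['rank'] = closest.groupby('x_time').cumcount() + 1         # 1, 2, 3 within each x_time
wide = closest.pivot(index='x_time', columns='rank',
                     values=['expiration_y', 'x_price_y', 'p_price'])
wide.columns = [f'{col}_{rank}' for col, rank in wide.columns]     # flatten the MultiIndex
out = (df_mapped[['x_time', 'expiration_x', 'x_price_x']]
       .drop_duplicates('x_time')
       .merge(wide, on='x_time', how='left'))
Unmatched x_time values (50 and 90 in the sample) simply come through with NaN in the flattened columns.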
Imagine a dataset like below:
result country start end
5 A 2/14/2022 2/21/2022
10 A 2/21/2022 2/28/2022
30 B 2/28/2022 3/7/2022
50 C 1/3/2022 1/10/2022
60 C 1/10/2022 1/17/2022
70 D 1/17/2022 1/24/2022
40 E 1/24/2022 1/31/2022
20 E 1/31/2022 2/7/2022
30 A 2/7/2022 2/14/2022
20 B 2/14/2022 2/21/2022
Expected output
I need to group by (country, start, and end), and each row's result should be averaged with the previous row's result to populate an average column.
For example:
the average column is nothing but 5, (5+10)/2, (10+30)/2, (30+50)/2, (50+60)/2, and so on.
result average
5 5 e.g. 5
10 7.5 ((5+10)/2) # current result + previous result, divided by 2
30 20 ((10+30)/2)
50 40 ((30+50)/2)
60 55 ((50+60)/2)
70 65 ...
40 55 ...
20 30 ...
30 25 ...
20 25 ...
Try this solution, grouping by country and date; thanks to min_periods=1 it does not fail when a subset has fewer than 2 rows, it simply averages over whatever is available:
df_data['average'] = df_data.groupby(['country', 'date'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
In case you want to group by country only:
df_data['average'] = df_data.groupby(['country'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
df_data
country date result average
0 A 2/14/2022 5 5.0
1 A 2/21/2022 10 7.5
2 B 2/28/2022 30 30.0
3 C 1/3/2022 50 50.0
4 C 1/10/2022 60 55.0
5 D 1/17/2022 70 70.0
6 E 1/24/2022 40 40.0
7 E 1/31/2022 20 30.0
8 A 2/7/2022 30 20.0
9 B 2/14/2022 20 25.0
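Note that the worked numbers in the question ((5+10)/2, (10+30)/2, ...) run straight down the result column regardless of group; if that is what is wanted, a sketch without any grouping reproduces exactly the expected average column:
df_data['average'] = df_data['result'].rolling(2, min_periods=1).mean()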
I have a dataframe like
df = pd.DataFrame({'i': [3, 4, 12, 25, 44, 45, 52, 53, 65, 66],
                   't': range(1, 11),
                   'v': range(0, 100, 10)})
i.e.
i t v
0 3 1 0
1 4 2 10
2 12 3 20
3 25 4 30
4 44 5 40
5 45 6 50
6 52 7 60
7 53 8 70
8 65 9 80
9 66 10 90
I would like to sum the value in column v with the next row's value whenever i increases by 1 from one row to the next; otherwise do nothing.
One can assume that there are at most two consecutive rows to sum; the last row might therefore be ambiguous, depending on whether it is summed or not.
The resulting dataframe should look like:
i t v
0 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
6 52 7 130
8 65 9 170
Obviously I could loop over the dataframe using .iterrows(), but there must be a smarter solution.
I tried various combinations of shift, diff and groupby, but I cannot see a way to do it.
It's a common technique to identify the blocks with cumsum on diff:
blocks = df['i'].diff().ne(1).cumsum()
df.groupby(blocks, as_index=False).agg({'i':'first','t':'first', 'v':'sum'})
Output:
i t v
0 3 1 10
1 12 3 20
2 25 4 30
3 44 5 90
4 52 7 130
5 65 9 170
Let us try
out = df.groupby(df['i'].diff().ne(1).cumsum()).agg({'i':'first','t':'first','v':'sum'})
Out[11]:
i t v
i
1 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
5 52 7 130
6 65 9 170
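Both answers produce a new index; if you also want to keep the original row labels shown in the expected output (0, 2, 3, 4, 6, 8), one sketch (assuming df and the same blocks series as above) aggregates the index as well:
blocks = df['i'].diff().ne(1).cumsum()
out = (df.reset_index()
         .groupby(blocks.values)
         .agg({'index': 'first', 'i': 'first', 't': 'first', 'v': 'sum'})
         .set_index('index')       # restore the first row label of each block
         .rename_axis(None))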
I'm having difficulty applying agg to a grouped pandas dataframe.
I have a dataframe df like this:
order_id distance_theo bird_distance
10 100 80
10 80 80
10 70 80
11 90 70
11 70 70
11 60 70
12 200 180
12 150 180
12 100 180
12 60 180
I want to group by order_id and make a new column crow by dividing distance_theo of the first row in each group by bird_distance of the first row in each group (or of any row, because there is only one value of bird_distance per group).
order_id distance_theo bird_distance crow
10 100 80 1.25
10 80 80 1.25
10 70 80 1.25
11 90 70 1.29
11 70 70 1.29
11 60 70 1.29
12 200 180 1.11
12 150 180 1.11
12 100 180 1.11
12 60 180 1.11
My attempt:
df.groupby('order_id').agg({'crow', lambda x: x.distance_theo.head(1) / x.bird_distance.head(1)})
But I get an error:
'Series' object has no attribute 'distance_theo'
How can I solve this? Thanks for any kind of advice!
Using groupby with first:
s = df.groupby('order_id').transform('first')
df.assign(crow=s.distance_theo.div(s.bird_distance))
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
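As an aside, the original agg call fails because agg hands each column to the function one Series at a time, so the lambda never sees the whole group. If you want to stay closer to that style, a sketch using apply (which receives the full group DataFrame) and mapping the result back would be:
crow = df.groupby('order_id').apply(
    lambda g: g['distance_theo'].iloc[0] / g['bird_distance'].iloc[0])
df['crow'] = df['order_id'].map(crow)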
You could do it without groupby, using drop_duplicates and join:
df.join(df.drop_duplicates('order_id')\
.eval('crow = distance_theo / bird_distance')[['crow']]).ffill()
or use assign instead of eval, per @jezraela's comments below:
df.join(df.drop_duplicates('order_id')\
        .assign(crow=df.distance_theo / df.bird_distance)[['crow']]).ffill()
Output:
order_id distance_theo bird_distance crow
0 10 100 80 1.250000
1 10 80 80 1.250000
2 10 70 80 1.250000
3 11 90 70 1.285714
4 11 70 70 1.285714
5 11 60 70 1.285714
6 12 200 180 1.111111
7 12 150 180 1.111111
8 12 100 180 1.111111
9 12 60 180 1.111111
I'm using a DataFrame in pandas, and I would like to calculate the delta between adjacent rows, using a partition.
For example, this is my initial set after sorting it by A and B:
A B
1 12 40
2 12 50
3 12 65
4 23 30
5 23 45
6 23 60
I want to calculate the delta between adjacent B values, partitioned by A. If we define C as the result, the final table should look like this:
A B C
1 12 40 NaN
2 12 50 10
3 12 65 15
4 23 30 NaN
5 23 45 15
6 23 60 15
The reason for the NaN is that we cannot calculate a delta for the first (smallest) B value in each partition.
You can group by column A and take the difference:
df['C'] = df.groupby('A')['B'].diff()
df
Out:
A B C
1 12 40 NaN
2 12 50 10.0
3 12 65 15.0
4 23 30 NaN
5 23 45 15.0
6 23 60 15.0
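An equivalent sketch, if you prefer shift over diff, subtracts the previous B within each A partition and gives the same NaN for the first row of each group:
df['C'] = df['B'] - df.groupby('A')['B'].shift()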