Imagine a dataset like below:
result country start end
5 A 2/14/2022 2/21/2022
10 A 2/21/2022 2/28/2022
30 B 2/28/2022 3/7/2022
50 C 1/3/2022 1/10/2022
60 C 1/10/2022 1/17/2022
70 D 1/17/2022 1/24/2022
40 E 1/24/2022 1/31/2022
20 E 1/31/2022 2/7/2022
30 A 2/7/2022 2/14/2022
20 B 2/14/2022 2/21/2022
Expected output
I need to group by country, start, and end; the average column should hold the mean of each row's result and the previous row's result.
For example:
grouping by country, start, and end, the average column is simply 5, (5+10)/2, (10+30)/2, (30+50)/2, (50+60)/2, ...
result average
5 5 eg: (5)
10 7.5 ((5+10)/2) # current result + previous result, divided by 2 = average
30 20 ((10+30)/2)
50 40 ((30+50)/2)
60 55 ((50+60)/2)
70 65 ...
40 55 ...
20 30 ...
30 25 ...
20 25 ...
Try this solution, grouping by country and date; min_periods=1 ensures that groups with fewer than 2 rows still get a value instead of NaN:
df_data['average'] = df_data.groupby(['country', 'date'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
In case you want to group by country only:
df_data['average'] = df_data.groupby(['country'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
df_data
country date result average
0 A 2/14/2022 5 5.0
1 A 2/21/2022 10 7.5
2 B 2/28/2022 30 30.0
3 C 1/3/2022 50 50.0
4 C 1/10/2022 60 55.0
5 D 1/17/2022 70 70.0
6 E 1/24/2022 40 40.0
7 E 1/31/2022 20 30.0
8 A 2/7/2022 30 20.0
9 B 2/14/2022 20 25.0
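Note that the expected average column shown in the question also matches a plain two-row rolling mean over the whole result column in the listed row order, without any grouping; a minimal sketch (the DataFrame construction here is an assumption, not taken from the question):

import pandas as pd

# result values in the order listed in the question
df = pd.DataFrame({'result': [5, 10, 30, 50, 60, 70, 40, 20, 30, 20]})
df['average'] = df['result'].rolling(2, min_periods=1).mean()
print(df['average'].tolist())  # [5.0, 7.5, 20.0, 40.0, 55.0, 65.0, 55.0, 30.0, 25.0, 25.0]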
Related
Below I have two dataframes. I need to update df_mapped using df_original.
For each x_time in df_mapped, I need to find the 3 closest rows in df_original (closest meaning the smallest difference in x_price) and add them to the df_mapped dataframe.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
**Intermediate output:** I need to select the 3 rows with the smallest x_price_delta for each x_time.
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
**Final step:** keeping x_time fixed, flatten the dataframe so the 3 closest rows end up in a single row.
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and the final step and can't think of a way out. What could be done to massage the dataframe?
This is not a complete solution, but it might help you get unstuck. At the end we get the correct data.
In [1]: df = (df_int_out.groupby("x_time")
   ...:       .apply(lambda x: x.sort_values(ascending=False, by="x_price_delta"))
   ...:       .set_index(["x_time", "expiration_x"])
   ...:       .drop(["x_price_delta", "x_price_x"], axis=1))
In [2]: df1 = df.iloc[1:-1]
In [3]: df1.groupby(df1.index).apply(lambda x: pd.concat(
   ...:     [pd.DataFrame(d) for d in x.values], axis=1).unstack())
Out[3]:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
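One possible way to go straight from the merged frame (with x_price_delta) to the flattened layout is to keep the 3 smallest deltas per x_time, number them, and pivot. A sketch, not taken from the answer above; the rank helper column and the resulting column names such as expiration_y_1 are introduced here for illustration and differ slightly from the names in the desired output:

# keep the 3 rows with the smallest x_price_delta per x_time
closest = (df_mapped.sort_values('x_price_delta')
                    .groupby('x_time', sort=False).head(3)
                    .sort_values(['x_time', 'expiration_y']))
# number the closest rows 1..3 within each x_time
closest['rank'] = closest.groupby('x_time').cumcount() + 1
# pivot so each x_time becomes a single row with _1/_2/_3 column groups
wide = closest.pivot(index='x_time', columns='rank',
                     values=['expiration_y', 'x_price_y', 'p_price'])
wide.columns = [f'{name}_{rank}' for name, rank in wide.columns]
out = (df_mapped[['x_time', 'expiration_x', 'x_price_x']].drop_duplicates()
       .merge(wide.reset_index(), on='x_time', how='left'))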
My dataframe df is:
Election Year Votes Party Region
0 2000 50 A a
1 2000 100 B a
2 2000 70 C a
3 2000 26 A b
4 2000 180 B b
5 2000 100 C b
6 2000 120 A c
7 2000 46 B c
8 2000 80 C c
9 2005 129 A a
10 2005 46 B a
11 2005 95 C a
12 2005 60 A b
13 2005 23 B b
14 2005 95 C b
15 2005 16 A c
16 2005 65 B c
17 2005 35 C c
I want to get the regions in which the two largest parties have a Vote difference of less than 50 every year. So the desired output is:
Region
a
c
These are the two regions where the top two parties had a Vote difference of less than 50 in every year.
I tried to group by "Election Year" and "Region" and then sort the Votes in descending order, but I am unable to check whether the difference between the top two vote counts of each region in every year is less than 50.
How can I get the desired output?
Starting from your idea of sorting (sort_values) plus grouping (GroupBy), then taking the difference with diff(), which gives you the vote difference between consecutive parties:
>>> df = df.sort_values(['Votes'])
>>> votes_diff = df.groupby(['Election Year', 'Region'])['Votes'].diff()
We can join with the original dataframe and re-sort on the index to see what happened with respect to the original data:
>>> df.join(votes_diff.rename('Δ votes')).sort_index()
Election Year Votes Party Region Δ votes
0 2000 50 A a NaN
1 2000 100 B a 30.0
2 2000 70 C a 20.0
3 2000 26 A b NaN
4 2000 180 B b 80.0
5 2000 100 C b 74.0
6 2000 120 A c 40.0
7 2000 46 B c NaN
8 2000 80 C c 34.0
9 2005 129 A a 34.0
10 2005 46 B a NaN
11 2005 95 C a 49.0
12 2005 60 A b 37.0
13 2005 23 B b NaN
14 2005 95 C b 35.0
15 2005 16 A c NaN
16 2005 65 B c 30.0
17 2005 35 C c 19.0
So in each election, each party now has a Δ votes value, which is how many more votes it has than the next-smaller party. The smallest party of each election has NaN as its Δ votes.
Now we want, in each election, the difference between the 2 largest parties, i.e. the Δ votes for the row with the maximum number of votes. We could use idxmax, but since the dataframe is already sorted on vote counts, we can also just take last.
>>> top_vote_diff = votes_diff.groupby([df['Election Year'], df['Region']]).last()
>>> top_vote_diff
Election Year Region
2000 a 30.0
b 80.0
c 40.0
2005 a 34.0
b 35.0
c 30.0
Name: Votes, dtype: float64
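For reference, the idxmax route mentioned above might look roughly like this (a sketch, relying on votes_diff sharing the index of df):

# take Δ votes at the row holding each group's maximum Votes
idx = df.groupby(['Election Year', 'Region'])['Votes'].idxmax()
top_vote_diff = votes_diff.loc[idx].set_axis(idx.index)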
Now check that it is less than 50 (lt(50)) for all elections in that region:
>>> criteria = top_vote_diff.lt(50).groupby('Region').all()
>>> criteria
Region
a True
b False
c True
Name: Votes, dtype: bool
>>> pd.Series(criteria.index[criteria])
0 a
1 c
Name: Region, dtype: object
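The same logic can also be written as one compact pipeline (a sketch, not part of the steps above):

top2_gap = (df.sort_values('Votes')
              .groupby(['Election Year', 'Region'])['Votes']
              .agg(lambda s: s.iloc[-1] - s.iloc[-2]))  # gap between the two largest parties
criteria = top2_gap.lt(50).groupby('Region').all()
print(criteria.index[criteria].tolist())  # ['a', 'c']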
I'd use .pivot:
df = df.pivot(
index=["Region", "Election Year"], columns="Party", values="Votes"
)
df["diff"] = df.apply(
lambda x: x.sort_values(ascending=False).head(2).diff()[-1] * -1, axis=1
)
x = df.groupby(level=0)["diff"].apply(lambda x: (x < 50).all())
print(pd.DataFrame(x.index[x]))
Prints:
Region
0 a
1 c
Steps:
df = df.pivot(
index=["Region", "Election Year"], columns="Party", values="Votes"
)
Creates:
Party A B C
Region Election Year
a 2000 50 100 70
2005 129 46 95
b 2000 26 180 100
2005 60 23 95
c 2000 120 46 80
2005 16 65 35
df["diff"] = df.apply(
lambda x: x.sort_values(ascending=False).head(2).diff()[-1] * -1, axis=1
)
Creates:
Party A B C diff
Region Election Year
a 2000 50 100 70 30.0
2005 129 46 95 34.0
b 2000 26 180 100 80.0
2005 60 23 95 35.0
c 2000 120 46 80 40.0
2005 16 65 35 30.0
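A vectorized alternative to the row-wise apply above is to sort each row's values with NumPy and take the gap between the two largest (a sketch, assuming the party columns A, B, C from the pivot):

import numpy as np

vals = np.sort(df[["A", "B", "C"]].to_numpy(), axis=1)  # ascending within each row
df["diff"] = vals[:, -1] - vals[:, -2]                  # largest minus second largest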
x = df.groupby(level=0)["diff"].apply(lambda x: (x < 50).all())
Creates:
Region
a True
b False
c True
For each row, I would like to add up the last 4 numbers of a dataframe column and append the output as a new column. I have used a for loop to do it, but it is slow when there are many rows. Is there a way to use df.apply(lambda x: ...) to achieve this?
Sample Input:
values
0 10
1 20
2 30
3 40
4 50
5 60
Output:
values result
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
Use pandas.DataFrame.rolling:
>>> df.rolling(4, min_periods=1).sum()
values
0 10.0
1 30.0
2 60.0
3 100.0
4 140.0
5 180.0
6 220.0
7 260.0
8 300.0
Add it together:
>>> df.assign(results=df.rolling(4, min_periods=1).sum().astype(int))
values results
0 10 10
1 20 30
2 30 60
3 40 100
4 50 140
5 60 180
6 70 220
7 80 260
8 90 300
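For reference, a frame matching the output above can be built like this (a sketch; values continuing up to 90 are an assumption based on the 9 rows shown, the question itself only lists 10 through 60):

import pandas as pd

df = pd.DataFrame({'values': range(10, 100, 10)})  # 10, 20, ..., 90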
My df looks like this:
country id x y
AT 11 50 100
AT 12 NaN 90
AT 13 NaN 104
AT 22 40 50
AT 23 30 23
AT 61 40 88
AT 62 NaN 78
UK 11 40 34
UK 12 NaN 22
UK 13 NaN 70
What I need is the sum of the y column placed in the first row that is not NaN in x, grouped by the first digit on the left of the id column, separately for each country. At the end I just need to drop the NaN rows.
The result should be something like this:
country id x y
AT 11 50 294
AT 22 40 50
AT 23 30 23
AT 61 40 166
UK 11 40 126
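For reference, the example frame can be reconstructed like this (a sketch, not part of the original question):

import io
import pandas as pd

data = """country id x y
AT 11 50 100
AT 12 NaN 90
AT 13 NaN 104
AT 22 40 50
AT 23 30 23
AT 61 40 88
AT 62 NaN 78
UK 11 40 34
UK 12 NaN 22
UK 13 NaN 70"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')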
You can aggregate with GroupBy.agg using the first and sum functions, together with a helper Series created by flagging non-missing values with Series.notna and taking the cumulative sum with Series.cumsum:
df1 = (df.groupby(['country', df['x'].notna().cumsum()])
.agg({'id':'first', 'x':'first', 'y':'sum'})
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
country id x y
0 AT 11 50.0 294
1 AT 22 40.0 50
2 AT 23 30.0 23
3 AT 61 40.0 166
4 UK 11 40.0 126
If the first value(s) of x can possibly be missing, add DataFrame.dropna:
print (df)
country id x y
0 AT 11 NaN 100
1 AT 11 50.0 100
2 AT 12 NaN 90
3 AT 13 NaN 104
4 AT 22 40.0 50
5 AT 23 30.0 23
6 AT 61 40.0 88
7 AT 62 NaN 78
8 UK 11 40.0 34
9 UK 12 NaN 22
10 UK 13 NaN 70
df1 = (df.groupby(['country', df['x'].notna().cumsum()])
.agg({'id':'first', 'x':'first', 'y':'sum'})
.reset_index(level=1, drop=True)
.reset_index()
.dropna(subset=['x']))
print (df1)
country id x y
1 AT 11 50.0 294
2 AT 22 40.0 50
3 AT 23 30.0 23
4 AT 61 40.0 166
5 UK 11 40.0 126
Use groupby, transform and dropna:
print (df.assign(y=df.groupby(df["x"].notnull().cumsum())["y"].transform('sum'))
.dropna(subset=["x"]))
country id x y
0 AT 11 50.0 294
3 AT 22 40.0 50
4 AT 23 30.0 23
5 AT 61 40.0 166
7 UK 11 40.0 126
I'm using a DataFrame in pandas, and I would like to calculate the delta between adjacent rows, using a partition.
For example, this is my initial set after sorting it by A and B:
A B
1 12 40
2 12 50
3 12 65
4 23 30
5 23 45
6 23 60
I want to calculate the delta between adjacent B values, partitioned by A. If we define C as the result column, the final table should look like this:
A B C
1 12 40 NaN
2 12 50 10
3 12 65 15
4 23 30 NaN
5 23 45 15
6 23 60 15
The reason for the NaN is that we cannot calculate delta for the minimum number in each partition.
You can group by column A and take the difference:
df['C'] = df.groupby('A')['B'].diff()
df
Out:
A B C
1 12 40 NaN
2 12 50 10.0
3 12 65 15.0
4 23 30 NaN
5 23 45 15.0
6 23 60 15.0
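For reference, the same delta can be written with an explicit per-group shift, which makes the "previous row" logic visible (a sketch, equivalent to diff):

df['C'] = df['B'] - df.groupby('A')['B'].shift()  # current B minus previous B within each A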