Calculate the delta between entries in Pandas using partitions - python

I'm using a DataFrame in Pandas, and I would like to calculate the delta between adjacent rows within each partition.
For example, this is my initial set after sorting it by A and B:
A B
1 12 40
2 12 50
3 12 65
4 23 30
5 23 45
6 23 60
I want to calculate the delta between adjacent B values, partitioned by A. If we define C as result, the final table should look like this:
A B C
1 12 40 NaN
2 12 50 10
3 12 65 15
4 23 30 NaN
5 23 45 15
6 23 60 15
The reason for the NaN is that the first (smallest) value in each partition has no previous value to subtract.

You can group by column A and take the difference:
df['C'] = df.groupby('A')['B'].diff()
df
Out:
A B C
1 12 40 NaN
2 12 50 10.0
3 12 65 15.0
4 23 30 NaN
5 23 45 15.0
6 23 60 15.0
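For a self-contained check, the sketch below rebuilds the sample frame from the question and applies the same groupby/diff call. The explicit sort_values step is an assumption based on the question's remark that the data is sorted by A and B first; diff works on row order within each group, so the order matters.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [12, 12, 12, 23, 23, 23], "B": [40, 50, 65, 30, 45, 60]},
    index=range(1, 7),
)

# diff() subtracts the previous row within each group, so sort first.
df = df.sort_values(["A", "B"])
df["C"] = df.groupby("A")["B"].diff()
print(df)
```

The first row of each A group has no predecessor, which is where the NaN values come from.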

Related

Pandas drop multiple in range value using isin

Given a df
a
0 1
1 2
2 1
3 7
4 10
5 11
6 21
7 22
8 26
9 51
10 56
11 83
12 82
13 85
14 90
I would like to drop rows if the value in column a is not within any of these ranges:
(10-15), (25-30), (50-55), (80-85). The ranges are built from lbot and ltop:
lbot =[10, 25, 50, 80]
ltop=[15, 30, 55, 85]
I think this can be achieved with pandas isin:
df[df['a'].isin(list(zip(lbot,ltop)))]
But it returns an empty df instead.
The expected output is
a
10
11
26
51
83
82
85
You can use numpy broadcasting to create a boolean mask that is True for each row whose value falls within any of the ranges, then filter df with it:
out = df[((df[['a']].to_numpy() >=lbot) & (df[['a']].to_numpy() <=ltop)).any(axis=1)]
Output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
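To make the broadcasting step easier to follow, here is a minimal self-contained sketch of the same idea with the intermediate shapes spelled out. The variable names values, in_range and mask are introduced only for illustration. It also hints at why the isin attempt came back empty: isin tests exact membership, so each number was compared against whole tuples such as (10, 15) rather than against the interval they describe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 7, 10, 11, 21, 22, 26, 51, 56, 83, 82, 85, 90]})
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

values = df[["a"]].to_numpy()                   # shape (15, 1): a single column
in_range = (values >= lbot) & (values <= ltop)  # broadcasts to shape (15, 4)
mask = in_range.any(axis=1)                     # True if the row is inside any range

out = df[mask]
print(in_range.shape)
print(out)
```

Each column of in_range corresponds to one (lbot, ltop) pair, so any(axis=1) collapses the four range tests into one flag per row.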
Build the allowed values with a flattened list comprehension over range:
df = df[df['a'].isin([z for x, y in zip(lbot,ltop) for z in range(x, y+1)])]
print (df)
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Or use np.concatenate to flatten the list of ranges:
df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot,ltop)]))]
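One caveat with enumerating the ranges is that only whole numbers end up in the lookup list. The sketch below is a self-contained version of the np.concatenate variant (using np.arange in place of range, and a variable named allowed that is not in the original) and shows that a fractional value inside a range would not match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 7, 10, 11, 21, 22, 26, 51, 56, 83, 82, 85, 90]})
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

# Every integer covered by the ranges, with both endpoints included.
allowed = np.concatenate([np.arange(x, y + 1) for x, y in zip(lbot, ltop)])
out = df[df["a"].isin(allowed)]
print(out)

# A non-integer value inside a range is not part of the enumerated list.
print(pd.Series([10.5]).isin(allowed).tolist())
```

For integer data this is fine; for float data a comparison-based approach is probably the safer choice.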
A method that uses between():
df[pd.concat([df['a'].between(x, y) for x,y in zip(lbot, ltop)], axis=1).any(axis=1)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
If your values in the two lists are sorted, a method that doesn't require any loop would be to use pandas.cut and checking that you obtain the same group cutting on the two lists:
# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot+[float('inf')], labels=range(len(lbot)),
right=False) # include lower bound
# group based on upper bound
id2 = pd.cut(df['a'], bins=[0]+ltop, labels=range(len(ltop)))
# ensure groups are identical
df[id1.eq(id2)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
intermediate groups:
a id1 id2
0 1 NaN 0
1 2 NaN 0
2 1 NaN 0
3 7 NaN 0
4 10 0 0
5 11 0 0
6 21 0 1
7 22 0 1
8 26 1 1
9 51 2 2
10 56 2 3
11 83 3 3
12 82 3 3
13 85 3 3
14 90 3 NaN

Pandas: How to calculate average one value to after another (succeeding average)

Imagine a dataset like below:
result country start end
5 A 2/14/2022 2/21/2022
10 A 2/21/2022 2/28/2022
30 B 2/28/2022 3/7/2022
50 C 1/3/2022 1/10/2022
60 C 1/10/2022 1/17/2022
70 D 1/17/2022 1/24/2022
40 E 1/24/2022 1/31/2022
20 E 1/31/2022 2/7/2022
30 A 2/7/2022 2/14/2022
20 B 2/14/2022 2/21/2022
Expected output
I need to group by country, start, and end, and populate an average column in which each result value is averaged with the result value in the row above.
For example:
After grouping by country, start, and end, the average column is 5, (5+10)/2, (10+30)/2, (30+50)/2, (50+60)/2, and so on:
result average
5 5 (5)
10 7.5 ((5+10)/2) # existing value + value above, divided by 2 = average
30 20 ((10+30)/2)
50 40 ((30+50)/2)
60 55 ((50+60)/2)
70 65 ...
40 55 ...
20 30 ...
30 25 ...
20 25 ...
Try this solution, which groups by country and date. It may raise an error if a subset does not contain enough rows (fewer than 2):
df_data['average'] = df_data.groupby(['country', 'date'])['result'].rolling(2, min_periods=1).mean().reset_index(level=['country', 'date'], drop=True)
In case you want to group by country only:
df_data['average'] = df_data.groupby(['country'])['result'].rolling(2, min_periods=1).mean().reset_index(0, drop=True)
df_data
country date result average
0 A 2/14/2022 5 5.0
1 A 2/21/2022 10 7.5
2 B 2/28/2022 30 30.0
3 C 1/3/2022 50 50.0
4 C 1/10/2022 60 55.0
5 D 1/17/2022 70 70.0
6 E 1/24/2022 40 40.0
7 E 1/31/2022 20 30.0
8 A 2/7/2022 30 20.0
9 B 2/14/2022 20 25.0
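Note that the grouped result above differs from the expected average column in the question. The numbers in the question appear to match a plain two-row rolling mean over the result column with no grouping at all; the sketch below is a reading of the expected output under that assumption, not a replacement for the grouped versions.

```python
import pandas as pd

df_data = pd.DataFrame({
    "result": [5, 10, 30, 50, 60, 70, 40, 20, 30, 20],
    "country": list("AABCCDEEAB"),
})

# Mean of each row and the row above it; the first row averages only itself.
df_data["average"] = df_data["result"].rolling(2, min_periods=1).mean()
print(df_data)
```

Which version is right depends on whether the row above should count when it belongs to a different country.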

For a column in pandas dataframe, calculate mean of column values in previous 4th, 8th and 12th row from the present row?

In a pandas dataframe, I want to create a new column that calculates the average of column values of 4th, 8th and 12th row before our present row.
As shown in the table below, for row number 13 :
Value in Existing column that is 4 rows before row 13 (row 9) = 4
Value in Existing column that is 8 rows before row 13 (row 5) = 6
Value in Existing column that is 12 rows before row 13 (row 1) = 2
Average of 4,6,2 is 4. Hence New Column = 4 at row number 13, for the remaining rows between 1-12, New Column = Nan
I have more rows in my df, but I added only first 13 rows here for illustration.
Row number  Existing column  New column
1           2                NaN
2           4                NaN
3           3                NaN
4           1                NaN
5           6                NaN
6           4                NaN
7           8                NaN
8           2                NaN
9           4                NaN
10          9                NaN
11          2                NaN
12          4                NaN
13          3                3
.shift() is your missing part. We can use it to access previous rows from the existing row in a Pandas dataframe.
Let's use .groupby(), .apply() and .shift() as follows:
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
Here, rows are partitioned into groups of 13 rows by grouping them under different group numbers set by (df['Row number'] - 1) // 13
Then within each group, we use .apply() on the column Existing column and use .shift() to get the previous 4th, 8th and 12th entries within the group.
Test Run
data = {'Row number' : np.arange(1, 40), 'Existing column': np.arange(11, 50) }
df = pd.DataFrame(data)
print(df)
Row number Existing column
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
10 11 21
11 12 22
12 13 23
13 14 24
14 15 25
15 16 26
16 17 27
17 18 28
18 19 29
19 20 30
20 21 31
21 22 32
22 23 33
23 24 34
24 25 35
25 26 36
26 27 37
27 28 38
28 29 39
29 30 40
30 31 41
31 32 42
32 33 43
33 34 44
34 35 45
35 36 46
36 37 47
37 38 48
38 39 49
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
print(df)
Row number Existing column New column
0 1 11 NaN
1 2 12 NaN
2 3 13 NaN
3 4 14 NaN
4 5 15 NaN
5 6 16 NaN
6 7 17 NaN
7 8 18 NaN
8 9 19 NaN
9 10 20 NaN
10 11 21 NaN
11 12 22 NaN
12 13 23 15.0
13 14 24 NaN
14 15 25 NaN
15 16 26 NaN
16 17 27 NaN
17 18 28 NaN
18 19 29 NaN
19 20 30 NaN
20 21 31 NaN
21 22 32 NaN
22 23 33 NaN
23 24 34 NaN
24 25 35 NaN
25 26 36 28.0
26 27 37 NaN
27 28 38 NaN
28 29 39 NaN
29 30 40 NaN
30 31 41 NaN
31 32 42 NaN
32 33 43 NaN
33 34 44 NaN
34 35 45 NaN
35 36 46 NaN
36 37 47 NaN
37 38 48 NaN
38 39 49 41.0
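Depending on the pandas version, .apply() on a grouped column may return a result whose index is prefixed with the group key, which can make the direct assignment awkward. As a hedged alternative, the sketch below calls .shift() on the grouped column itself, which keeps the original index. The names chunk and grouped are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Row number": np.arange(1, 40),
    "Existing column": np.arange(11, 50),
})

# Same chunking as above: rows 1-13 form group 0, rows 14-26 group 1, and so on.
chunk = (df["Row number"] - 1) // 13
grouped = df.groupby(chunk)["Existing column"]

# shift() on the grouped column never reaches across a chunk boundary.
df["New column"] = (grouped.shift(4) + grouped.shift(8) + grouped.shift(12)) / 3
print(df.dropna())
```

Only the last row of each 13-row chunk has all three shifted values available, so every other row stays NaN.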
You can use rolling with .apply to apply a custom aggregation function.
The average of (4,6,2) is 4, not 3
>>> (2 + 6 + 4) / 3
4.0
>>> df["New column"] = df["Existing column"].rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down:
df["Existing column"]: select "Existing column" from the dataframe
.rolling(13): starting with the first 13 rows, we're going to move a sliding window across all of the data. So first, we will encounter rows 0-12, then rows 1-13, then 2-14, so on and so forth.
.apply(...): For each of those rolling sections, we apply a function that works on the section (in this case, the lambda).
lambda x: x.iloc[[0, 4, 8]].mean(): from each rolling section, extract the 0th, 4th, and 8th entries (corresponding to rows 1, 5, and 9 in the first window) and return the mean of those values.
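As a cross-check on the window positions, positions 0, 4 and 8 of a 13-row window are the rows 12, 8 and 4 steps before the current row. The sketch below compares the rolling version against a plain .shift() formulation on the question's data; the names rolled and shifted are introduced only for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "Row number": range(1, 14),
    "Existing column": [2, 4, 3, 1, 6, 4, 8, 2, 4, 9, 2, 4, 3],
})
col = df["Existing column"]

# Window positions 0, 4 and 8 are the rows 12, 8 and 4 steps back.
rolled = col.rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
shifted = (col.shift(12) + col.shift(8) + col.shift(4)) / 3

print(rolled.iloc[-1], shifted.iloc[-1])
```

Both formulations leave the first 12 rows as NaN because a full 13-row history is not yet available there.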
In order to work on your dataframe in chunks (or groups) instead of a sliding window, you can apply the same logic with the .groupby method (instead of .rolling).
>>> groups = np.arange(len(df)) // 13 # defines groups as chunks of 13 rows
>>> averages = (
...     df.groupby(groups)["Existing column"]
...     .apply(lambda x: x.iloc[[0, 4, 8]].mean())
... )
>>> averages.index = (averages.index + 1) * 13 - 1
>>> df["New column"] = averages
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down now:
groups = np.arange(len(df)) // 13: creates an array that is used to chunk the dataframe into groups. The array is 13 0s, followed by 13 1s, followed by 13 2s, and so on, until it is the same length as the dataframe. For this single-chunk example it is just an array of 13 0s.
df.groupby(groups)["Existing column"]: groups the dataframe according to the groups defined above and selects the "Existing column".
.apply(lambda x: x.iloc[[0, 4, 8]].mean()): conceptually the same as before, except we're applying it to each group instead of a sliding window.
averages.index = (averages.index + 1) * 13 - 1: this part may seem a little odd, but it ensures that the averages line up with the original dataset correctly. We want the average from group 0 (index value 0 in the averages Series) to align to row 12. If we had another group (group 1), we would want it to align to row 25 in the original dataset. A little math does this transformation.
df["New column"] = averages: since we already matched up our indices, pandas takes care of the actual alignment of these new values under the hood for us.
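The group labels and the index arithmetic can be inspected on their own. The sketch below assumes a hypothetical frame of 39 rows (three full chunks), chosen only for illustration.

```python
import numpy as np

n_rows = 39  # hypothetical frame length: three full chunks of 13 rows
groups = np.arange(n_rows) // 13
print(groups)

# Group labels 0, 1, 2 map to the last row position of each 13-row chunk.
labels = np.array([0, 1, 2])
targets = (labels + 1) * 13 - 1
print(targets)
```

The mapped positions 12, 25 and 38 are the rows where the per-chunk averages land after the index is reassigned.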

Pandas Fill column values as previous

I have many columns that must hold their values from the previous row if a condition is met. The Y and Z columns decide the values of the other columns.
Y Z A B C D
100 10 20 Nan 22 40
100 11 Nan 15 Nan 41
100 10 23 Nan 24 42
100 11 Nan 16 Nan 42
100 10 25 Nan 26 45
100 11 Nan 17 Nan 45
101 17 Nan Nan Nan Nan
Expectation
Y Z A B C D
100 10 20 Nan 22 40
100 11 20 15 22 41
100 10 23 15 24 42
100 11 23 16 24 42
100 10 25 16 26 45
100 11 25 17 26 45
101 17 Nan Nan Nan Nan
So basically, if Y is 100 and Z is 10, the value of B should be copied from the previous value of B, and if Z is 11, the values of A and C should be copied from their previous values. I have around 20 columns like B and 20 columns like A and C. There are 50-60 columns like D, which should not be affected. If Y is anything other than 100, nothing needs to be done to columns A, B and C.
I was thinking of using
df['B'] = df['B'].shift().fillna(-1)
but I am not sure how to apply it conditionally and to many columns in one go.
Forward fill only the rows selected by a mask. The mask chains Series.eq (for ==) with Series.isin (for membership tests) using & (bitwise AND):
#if necessary replace strings Nan to missing values NaN
df = df.replace('Nan', np.nan)
mask = df.Y.eq(100) & df.Z.isin([10,11])
df[mask] = df[mask].ffill()
Another idea with DataFrame.mask:
df = df.mask(mask, df.ffill())
print (df)
Y Z A B C D
0 100 10 20 NaN 22 40
1 100 11 20 15 22 41
2 100 10 23 15 24 42
3 100 11 23 16 24 42
4 100 10 25 16 26 45
5 100 11 25 17 26 45
6 101 17 NaN NaN NaN NaN
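Since the question says the D-like columns should stay untouched, one possible refinement is to restrict the forward fill to an explicit column list. The sketch below assumes a hypothetical fill_cols list containing A, B and C; in a real frame it would hold the 40 or so affected column names.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "Y": [100, 100, 100, 100, 100, 100, 101],
    "Z": [10, 11, 10, 11, 10, 11, 17],
    "A": [20, nan, 23, nan, 25, nan, nan],
    "B": [nan, 15, nan, 16, nan, 17, nan],
    "C": [22, nan, 24, nan, 26, nan, nan],
    "D": [40, 41, 42, 42, 45, 45, nan],
})

mask = df["Y"].eq(100) & df["Z"].isin([10, 11])
fill_cols = ["A", "B", "C"]  # hypothetical list; extend with the real column names

# Forward fill only the selected columns, and only inside the masked rows.
df.loc[mask, fill_cols] = df.loc[mask, fill_cols].ffill()
print(df)
```

The last row (Y equal to 101) falls outside the mask, so its NaN values are left as they are.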

Python Pandas create column based on value of index

I have a dataframe that looks like the one below. What I'd like to do is create another column that is based on the VALUE of the index (so anything less than 10 would be labeled as "small" in the new column). I can do something like lengthDF[lengthDF.index < 10] to get the rows I want, but I'm not sure how to get the additional column. I've tried "Create Column with ELIF in Pandas" but can't get it to read the index...
LengthFirst LengthOthers
0 1 NaN
4 NaN 1
9 NaN 1
13 NaN 1
17 1 1
18 NaN 1
19 NaN 1
20 1 NaN
21 1 1
22 3 4
23 1 NaN
24 7 6
25 1 2
26 16 19
27 1 2
28 24 8
29 9 12
30 73 65
31 15 12
32 55 60
33 28 21
34 29 31
Something like this?
lengthDF['size'] = 'large'
lengthDF.loc[lengthDF.index < 10, 'size'] = 'small'
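The same labeling can also be written in one step with np.where, which is a swapped-in technique rather than the approach used above. The sketch rebuilds only the first few rows of the question's frame for illustration.

```python
import numpy as np
import pandas as pd

# A few rows from the question's frame, keeping the original index values.
lengthDF = pd.DataFrame(
    {
        "LengthFirst": [1, np.nan, np.nan, np.nan, 1],
        "LengthOthers": [np.nan, 1, 1, 1, 1],
    },
    index=[0, 4, 9, 13, 17],
)

# np.where picks 'small' where the index is below 10 and 'large' elsewhere.
lengthDF["size"] = np.where(lengthDF.index < 10, "small", "large")
print(lengthDF)
```

Because the comparison is made on lengthDF.index directly, no extra column is needed to hold the index values.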
