I have many columns that must take their values from the previous row if a condition is met. Columns Y and Z decide the values of the other columns.
Y Z A B C D
100 10 20 Nan 22 40
100 11 Nan 15 Nan 41
100 10 23 Nan 24 42
100 11 Nan 16 Nan 42
100 10 25 Nan 26 45
100 11 Nan 17 Nan 45
101 17 Nan Nan Nan Nan
Expectation
Y Z A B C D
100 10 20 Nan 22 40
100 11 20 15 22 41
100 10 23 15 24 42
100 11 23 16 24 42
100 10 25 16 26 45
100 11 25 17 26 45
101 17 Nan Nan Nan Nan
So basically, if the value of Y is 100 and Z is 10, the values of column B should be copied from the previous value of B, and if Z is 11, the values of A and C should be copied from their previous values. I have around 20 columns like B and 20 columns like A and C. There are 50-60 columns like D; they should not be affected. If the value of Y is anything other than 100, nothing needs to be done to columns A, B and C.
I was thinking of using
df[B] = df[B].shift().fillna(-1)
but I am not sure how to do it based on condition and for many columns in 1 go.
Forward fill only the rows that match a mask. The mask is built with Series.eq (for ==) and Series.isin (for membership tests), combined with & (bitwise AND):
# if necessary, replace the string 'Nan' with the missing value NaN
df = df.replace('Nan', np.nan)
mask = df.Y.eq(100) & df.Z.isin([10,11])
df[mask] = df[mask].ffill()
Another idea with DataFrame.mask:
df = df.mask(mask, df.ffill())
print (df)
Y Z A B C D
0 100 10 20 NaN 22 40
1 100 11 20 15 22 41
2 100 10 23 15 24 42
3 100 11 23 16 24 42
4 100 10 25 16 26 45
5 100 11 25 17 26 45
6 101 17 NaN NaN NaN NaN
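The question also asks for different column groups to be filled under different values of Z. A minimal sketch of that variant follows. The list names `cols_z11` (columns like A and C, filled on rows where Z is 11) and `cols_z10` (columns like B, filled on rows where Z is 10) are hypothetical and not from the original post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Y": [100, 100, 100, 100, 100, 100, 101],
    "Z": [10, 11, 10, 11, 10, 11, 17],
    "A": [20, np.nan, 23, np.nan, 25, np.nan, np.nan],
    "B": [np.nan, 15, np.nan, 16, np.nan, 17, np.nan],
    "C": [22, np.nan, 24, np.nan, 26, np.nan, np.nan],
    "D": [40, 41, 42, 42, 45, 45, np.nan],
})

# Hypothetical column groups (assumption): extend these lists to your ~20 columns each.
cols_z11 = ["A", "C"]   # copied from the previous row when Z == 11
cols_z10 = ["B"]        # copied from the previous row when Z == 10

filled = df.ffill()          # forward-filled copy of the whole frame
is_100 = df["Y"].eq(100)     # rows where Y != 100 are never touched

for cols, z in [(cols_z11, 11), (cols_z10, 10)]:
    rows = is_100 & df["Z"].eq(z)
    # Take the forward-filled values only for these rows and these columns.
    df.loc[rows, cols] = filled.loc[rows, cols]

print(df)
```

Columns that appear in neither list (like D) are left exactly as they were.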
Given a df
a
0 1
1 2
2 1
3 7
4 10
5 11
6 21
7 22
8 26
9 51
10 56
11 83
12 82
13 85
14 90
I would like to drop rows if the value in column a is not within any of these ranges:
(10-15), (25-30), (50-55), (80-85). The ranges are built from the lists lbot and ltop:
lbot =[10, 25, 50, 80]
ltop=[15, 30, 55, 85]
I am thinking this can be achieved via pandas isin:
df[df['a'].isin(list(zip(lbot,ltop)))]
But it returns an empty df instead.
The expected output is
a
10
11
26
51
83
82
85
You can use numpy broadcasting to create a boolean mask that is True for each row whose value falls within any of the ranges, then filter df with it:
out = df[((df[['a']].to_numpy() >=lbot) & (df[['a']].to_numpy() <=ltop)).any(axis=1)]
Output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
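To make the broadcasting step easier to follow, here is the same one-liner split into named steps. This is only a sketch; the intermediate names `values`, `in_range` and `keep` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 7, 10, 11, 21, 22, 26, 51, 56, 83, 82, 85, 90]})
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

# Double brackets keep column 'a' two-dimensional: shape (15, 1).
values = df[["a"]].to_numpy()

# Comparing (15, 1) against a list of 4 bounds broadcasts to shape (15, 4):
# one column per range, True where the value lies inside that range.
in_range = (values >= lbot) & (values <= ltop)

# A row is kept if it falls inside at least one of the ranges.
keep = in_range.any(axis=1)
out = df[keep]
print(out)
```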
Build the list of allowed values with a flattened list comprehension over range:
df = df[df['a'].isin([z for x, y in zip(lbot,ltop) for z in range(x, y+1)])]
print (df)
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Or use np.concatenate to flatten the list of ranges:
df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot,ltop)]))]
A method that uses between():
df[pd.concat([df['a'].between(x, y) for x,y in zip(lbot, ltop)], axis=1).any(axis=1)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
If the values in your two lists are sorted, a method that doesn't require any loop is to use pandas.cut and check that you obtain the same group when cutting on each of the two lists:
# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot+[float('inf')], labels=range(len(lbot)),
right=False) # include lower bound
# group based on upper bound
id2 = pd.cut(df['a'], bins=[0]+ltop, labels=range(len(ltop)))
# ensure groups are identical
df[id1.eq(id2)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
intermediate groups:
a id1 id2
0 1 NaN 0
1 2 NaN 0
2 1 NaN 0
3 7 NaN 0
4 10 0 0
5 11 0 0
6 21 0 1
7 22 0 1
8 26 1 1
9 51 2 2
10 56 2 3
11 83 3 3
12 82 3 3
13 85 3 3
14 90 3 NaN
In a pandas dataframe, I want to create a new column that holds the average of the column values found 4, 8 and 12 rows before the current row.
As shown in the table below, for row number 13 :
Value in Existing column that is 4 rows before row 13 (row 9) = 4
Value in Existing column that is 8 rows before row 13 (row 5) = 6
Value in Existing column that is 12 rows before row 13 (row 1) = 2
Average of 4,6,2 is 4. Hence New Column = 4 at row number 13, for the remaining rows between 1-12, New Column = Nan
I have more rows in my df, but I added only first 13 rows here for illustration.
Row number  Existing column  New column
1           2                NaN
2           4                NaN
3           3                NaN
4           1                NaN
5           6                NaN
6           4                NaN
7           8                NaN
8           2                NaN
9           4                NaN
10          9                NaN
11          2                NaN
12          4                NaN
13          3                3
.shift() is your missing part. We can use it to access previous rows from the existing row in a Pandas dataframe.
Let's use .groupby(), .apply() and .shift() as follows:
df['New column'] = df.groupby((df['Row number'] - 1) // 13, group_keys=False)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
Here, rows are partitioned into groups of 13 rows, with the group number of each row given by (df['Row number'] - 1) // 13.
Then within each group, we use .apply() on the column Existing column and use .shift() to get the previous 4th, 8th and 12th entries within the group.
Test Run
data = {'Row number' : np.arange(1, 40), 'Existing column': np.arange(11, 50) }
df = pd.DataFrame(data)
print(df)
Row number Existing column
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
10 11 21
11 12 22
12 13 23
13 14 24
14 15 25
15 16 26
16 17 27
17 18 28
18 19 29
19 20 30
20 21 31
21 22 32
22 23 33
23 24 34
24 25 35
25 26 36
26 27 37
27 28 38
28 29 39
29 30 40
30 31 41
31 32 42
32 33 43
33 34 44
34 35 45
35 36 46
36 37 47
37 38 48
38 39 49
df['New column'] = df.groupby((df['Row number'] - 1) // 13, group_keys=False)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
print(df)
Row number Existing column New column
0 1 11 NaN
1 2 12 NaN
2 3 13 NaN
3 4 14 NaN
4 5 15 NaN
5 6 16 NaN
6 7 17 NaN
7 8 18 NaN
8 9 19 NaN
9 10 20 NaN
10 11 21 NaN
11 12 22 NaN
12 13 23 15.0
13 14 24 NaN
14 15 25 NaN
15 16 26 NaN
16 17 27 NaN
17 18 28 NaN
18 19 29 NaN
19 20 30 NaN
20 21 31 NaN
21 22 32 NaN
22 23 33 NaN
23 24 34 NaN
24 25 35 NaN
25 26 36 28.0
26 27 37 NaN
27 28 38 NaN
28 29 39 NaN
29 30 40 NaN
30 31 41 NaN
31 32 42 NaN
32 33 43 NaN
33 34 44 NaN
34 35 45 NaN
35 36 46 NaN
36 37 47 NaN
37 38 48 NaN
38 39 49 41.0
You can use rolling with .apply to apply a custom aggregation function.
The average of (4,6,2) is 4, not 3
>>> (2 + 6 + 4) / 3
4.0
>>> df["New column"] = df["Existing column"].rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down:
df["Existing column"]: select "Existing column" from the dataframe
.rolling(13): starting with the first 13 rows, we're going to move a sliding window across all of the data. So first, we will encounter rows 0-12, then rows 1-13, then 2-14, so on and so forth.
.apply(...): For each of those aforementioned rolling sections, we're going to apply a function that works on each section (in this case the function we're applying is the lambda).
lambda x: x.iloc[[0, 4, 8]].mean(): from each of those rolling sections, extract the 0th, 4th, and 8th entries (corresponding to rows 1, 5, and 9 in the first window) and calculate and return the mean of those values.
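As a quick sanity check, the rolling version can be compared with a plain .shift() formulation of the same average. This is a sketch only: the first 13 values are the ones from the question, and the last two (5 and 7) are made-up extra rows so that more than one window exists.

```python
import numpy as np
import pandas as pd

# First 13 values come from the question; 5 and 7 are invented extra rows.
df = pd.DataFrame({"Existing column": [2, 4, 3, 1, 6, 4, 8, 2, 4, 9, 2, 4, 3, 5, 7]})
col = df["Existing column"]

# Window of 13 rows ending at the current row: positions 0, 4 and 8 inside
# the window are the values 12, 8 and 4 rows before the current row.
rolled = col.rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())

# The same three values fetched directly with shift.
shifted = (col.shift(4) + col.shift(8) + col.shift(12)) / 3

print(pd.DataFrame({"rolled": rolled, "shifted": shifted}))
```

Both columns are NaN for the first 12 rows and agree from row 12 onwards.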
In order to work on your dataframe in chunks (or groups) instead of a sliding window, you can apply the same logic with the .groupby method (instead of .rolling).
>>> groups = np.arange(len(df)) // 13 # defines groups as chunks of 13 rows
>>> averages = (
df.groupby(groups)["Existing column"]
.apply(lambda x: x.iloc[[0, 4, 8]].mean())
)
>>> averages.index = (averages.index + 1) * 13 - 1
>>> df["New column"] = averages
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down now:
groups = np.arange(len(df)) // 13: creates an array that will be used to chunk our dataframe into groups. This array will essentially be 13 0s, followed by 13 1s, followed by 13 2s... until the array is the same length as the dataframe. So in this case, for a single-chunk example, it will only be an array of 13 0s.
df.groupby(groups)["Existing column"]: group the dataframe according to the groups defined above and select the "Existing column".
.apply(lambda x: x.iloc[[0, 4, 8]].mean()): Conceptually the same as before, except we're applying to each grouping instead of a sliding window.
averages.index = (averages.index + 1) * 13 - 1: this part may seem a little odd, but we're essentially ensuring that our selected averages line up with the original dataset correctly. In this case, we want the average from group 0 (specified with an index value of 0 in the averages Series) to align to row 12. If we had another group (group 1), we would want it to align to row 25 in the original dataset. So we can use a little math to do this transformation.
df["New column"] = averages: since we already matched up our indices, pandas takes care of the actual alignment of these new values under the hood for us.
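To see the index realignment with more than one chunk, here is a small sketch on 26 rows of made-up data (the values 11 to 36 are an assumption chosen only to make the averages easy to check by hand):

```python
import numpy as np
import pandas as pd

# 26 rows of illustrative data: two full chunks of 13 rows each.
df = pd.DataFrame({"Existing column": np.arange(11, 37)})

groups = np.arange(len(df)) // 13          # 13 zeros followed by 13 ones
averages = (
    df.groupby(groups)["Existing column"]
      .apply(lambda x: x.iloc[[0, 4, 8]].mean())
)

# Group 0 -> row 12, group 1 -> row 25 (the last row of each chunk).
averages.index = (averages.index + 1) * 13 - 1

# Assignment aligns on the index, so every other row becomes NaN.
df["New column"] = averages
print(df.iloc[[11, 12, 13, 24, 25]])
```

Group 0 averages 11, 15 and 19, and group 1 averages 24, 28 and 32, so the two filled rows are easy to verify.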
I want to calculate the mean of columns a, b, c, d of the dataframe, BUT if any of the four values in a row differs by more than 20% from that row's mean, the mean has to be set to NaN.
Calculating the mean of 4 columns is easy, but I'm stuck at defining the condition: if any value in the row falls outside the interval mean*0.8 <= value <= mean*1.2, then the mean should be NaN.
In the example, one or more of the values in ID:5 and ID:87 don't fit in the interval and therefore the mean is set to NaN.
(NaN-values in the initial dataframe are ignored when calculating the mean and when applying the 20%-condition to the calculated mean)
So I'm trying to calculate the mean only for the data rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})
print(df)
a = df.loc[:, ['a','b','c','d']]
df['mean'] = (a.iloc[:,0:]).mean(1)
print(df)
b = df.mean.values[:,None]*0.8 < a.values[:,:] < df.mean.values[:,None]*1.2
print(b)
...
Try this:
# extract related information
s = df.iloc[:,1:]
# calculate mean
mean = s.mean(axis=1)
# where condition is violated
mask = s.lt(mean*.8, axis=0) | s.gt(mean*1.2, axis=0)
# set the mean to NaN on rows where any value violates the condition
df['mean'] = mean.mask(mask.any(axis=1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
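The same check can also be written in terms of relative deviation from the row mean, which may read closer to the "differs more than 20%" wording. This is an alternative sketch, not the code above; it assumes the row means are positive, and the names `cols` and `tol` are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [2, 5, 7, 87, 90, 102],
                   "a": [31, 33, 51, 30, 10, 41],
                   "b": [32, 52, np.nan, 52, 11, 42],
                   "c": [31, 159, 52, 421, 10, np.nan],
                   "d": [31, 2, 51, 2, 11, 42]})

cols = ["a", "b", "c", "d"]   # columns that take part in the mean
tol = 0.2                     # allowed relative deviation (20%)

mean = df[cols].mean(axis=1)  # NaN values are skipped by default

# |value - mean| / mean for every cell; NaN cells stay NaN and compare as False.
rel_dev = df[cols].sub(mean, axis=0).abs().div(mean, axis=0)

# Blank out the mean on rows where at least one value deviates too much.
df["mean"] = mean.mask(rel_dev.gt(tol).any(axis=1))
print(df)
```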
I'm using a DataFrame in pandas, and I would like to calculate the delta between adjacent rows, using a partition.
For example, this is my initial set after sorting it by A and B:
A B
1 12 40
2 12 50
3 12 65
4 23 30
5 23 45
6 23 60
I want to calculate the delta between adjacent B values, partitioned by A. If we define C as result, the final table should look like this:
A B C
1 12 40 NaN
2 12 50 10
3 12 65 15
4 23 30 NaN
5 23 45 15
6 23 60 15
The reason for the NaN is that we cannot calculate delta for the minimum number in each partition.
You can group by column A and take the difference:
df['C'] = df.groupby('A')['B'].diff()
df
Out:
A B C
1 12 40 NaN
2 12 50 10.0
3 12 65 15.0
4 23 30 NaN
5 23 45 15.0
6 23 60 15.0
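If it helps to see what diff does within each group, the sketch below rebuilds the example frame and compares it with an explicit "current value minus previous value in the same group" using shift. The names `via_diff` and `via_shift` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 12, 12, 23, 23, 23],
                   "B": [40, 50, 65, 30, 45, 60]},
                  index=range(1, 7))

# Difference to the previous row within each value of A.
via_diff = df.groupby("A")["B"].diff()

# Equivalent spelled out: subtract the previous B of the same group.
via_shift = df["B"] - df.groupby("A")["B"].shift()

df["C"] = via_diff
print(df)
```

The first row of each group has no previous row in that group, which is where the NaN comes from.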
I have a dataframe that looks like the one below. What I'd like to do is create another column that is based on the VALUE of the index (so anything less than 10 would be labeled as "small" in the new column). I can do something like lengthDF[lengthDF.index < 10] to get the values I want, but I'm not sure how to get the additional column I want. I've tried this Create Column with ELIF in Pandas but can't get it to read the index...
LengthFirst LengthOthers
0 1 NaN
4 NaN 1
9 NaN 1
13 NaN 1
17 1 1
18 NaN 1
19 NaN 1
20 1 NaN
21 1 1
22 3 4
23 1 NaN
24 7 6
25 1 2
26 16 19
27 1 2
28 24 8
29 9 12
30 73 65
31 15 12
32 55 60
33 28 21
34 29 31
Something like this?
lengthDF['size'] = 'large'
lengthDF.loc[lengthDF.index < 10, 'size'] = 'small'
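Another way to get the same column in a single step is numpy.where applied to the index. This is a sketch using only the first five rows of the frame from the question:

```python
import numpy as np
import pandas as pd

# First five rows of the example frame, index values included.
lengthDF = pd.DataFrame(
    {"LengthFirst": [1, np.nan, np.nan, np.nan, 1],
     "LengthOthers": [np.nan, 1, 1, 1, 1]},
    index=[0, 4, 9, 13, 17],
)

# Label each row from its index value: 'small' below 10, otherwise 'large'.
lengthDF["size"] = np.where(lengthDF.index < 10, "small", "large")
print(lengthDF)
```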