I have a data frame with 2 columns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
A B
0 11 10
1 61 30
2 24 54
3 47 52
4 72 42
... ... ...
95 61 2
96 67 41
97 95 30
98 29 66
99 49 22
100 rows × 2 columns
Now I want to create a third column, which is a rolling window max of col 'A' BUT
the max has to be lower than the corresponding value in col 'B'. In other words I want the value of the 4 (using a window size of 4) in column 'A' closest to the value in col 'B', yet smaller than B
So for example in row
3 47 52
the new value I am looking for, is not 61 but 47, because it is the highest value of the 4 that is not higher than 52
pseudo code
df['C'] = df['A'].rolling(window=4).max() where < df['B']
You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.
Sample Data
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
Code
N = 4
# End slice ensures same default min_periods behavior to `.rolling`
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]
# Remove values larger than B, then find the max of remaining.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(1)
print(df.head(15))
A B C
0 51 92 NaN # Missing b/c min_periods
1 14 71 NaN # Missing b/c min_periods
2 60 20 NaN # Missing b/c min_periods
3 82 86 82.0
4 74 74 60.0
5 87 99 87.0
6 23 2 NaN # Missing b/c 82, 74, 87, 23 all > 2
7 21 52 23.0 # Max of 21, 23, 87, 74 which is < 52
8 1 87 23.0
9 29 37 29.0
10 1 63 29.0
11 59 20 1.0
12 32 75 59.0
13 57 21 1.0
14 88 48 32.0
You can use a custom function to .apply to the rolling window. In this case, you can use a default argument to pass in the B column.
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=('AB'))
def rollup(a, B=df.B):
ix = a.index.max()
b = B[ix]
return a[a<b].max()
df['C'] = df.A.rolling(4).apply(rollup)
df
# returns:
A B C
0 8 17 NaN
1 23 84 NaN
2 75 84 NaN
3 86 24 23.0
4 52 83 75.0
.. .. .. ...
95 38 22 NaN
96 53 48 38.0
97 45 4 NaN
98 3 92 53.0
99 91 86 53.0
The NaN values occur when no number in the window of A is less than B or at the start of the series when the window is too big for the first few rows.
You can use where to replace values that don't fulfill the condition with np.nan and then use rolling(window=4, min_periods=1):
In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()
In [38]: df
Out[38]:
A B C
0 0 1 0.0
1 1 2 1.0
2 2 3 2.0
3 10 4 2.0
4 4 5 4.0
5 5 6 5.0
6 10 7 5.0
7 10 8 5.0
8 10 9 5.0
9 10 10 NaN
Related
Given a df
a
0 1
1 2
2 1
3 7
4 10
5 11
6 21
7 22
8 26
9 51
10 56
11 83
12 82
13 85
14 90
I would like to drop rows if the value in column a is not within these multiple range
(10-15),(25-30),(50-55), (80-85). Such that these range are made from the 'lbotandltop`
lbot =[10, 25, 50, 80]
ltop=[15, 30, 55, 85]
I am thinking this can be achieve via pandas isin
df[df['a'].isin(list(zip(lbot,ltop)))]
But, it return empty df instead.
The expected output is
a
10
11
26
51
83
82
85
You can use numpy broadcasting to create a boolean mask where for each row it returns True if the value is within any of the ranges and filter df with it.:
out = df[((df[['a']].to_numpy() >=lbot) & (df[['a']].to_numpy() <=ltop)).any(axis=1)]
Output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Create values in flatten list comprehension with range:
df = df[df['a'].isin([z for x, y in zip(lbot,ltop) for z in range(x, y+1)])]
print (df)
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Or use np.concatenate for flatten list of ranges:
df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot,ltop)]))]
A method that uses between():
df[pd.concat([df['a'].between(x, y) for x,y in zip(lbot, ltop)], axis=1).any(axis=1)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
If your values in the two lists are sorted, a method that doesn't require any loop would be to use pandas.cut and checking that you obtain the same group cutting on the two lists:
# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot+[float('inf')], labels=range(len(lbot)),
right=False) # include lower bound
# group based on upper bound
id2 = pd.cut(df['a'], bins=[0]+ltop, labels=range(len(ltop)))
# ensure groups are identical
df[id1.eq(id2)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
intermediate groups:
a id1 id2
0 1 NaN 0
1 2 NaN 0
2 1 NaN 0
3 7 NaN 0
4 10 0 0
5 11 0 0
6 21 0 1
7 22 0 1
8 26 1 1
9 51 2 2
10 56 2 3
11 83 3 3
12 82 3 3
13 85 3 3
14 90 3 NaN
I tested this two snippets,the df share the same structure other than the df.columns, so what causes the difference between them? And how should I change my second snippet, for example,should I always use the pandas.DataFrame.mul or use the other method to avoid this?
# test1
df = pd.DataFrame(np.random.randint(100, size=(10, 10))) \
.assign(Count=np.random.rand(10))
df.iloc[:, 0:3] *= df['Count']
df
Out[1]:
0 1 2 3 4 5 6 7 8 9 Count
0 26.484949 68.217006 4.902341 61 10 13 31 15 10 11 0.645974
1 56.845743 70.085965 28.106758 79 56 47 82 83 62 40 0.934480
2 33.590667 78.496281 1.634114 94 3 91 16 41 93 55 0.326823
3 51.031974 15.886152 26.145821 67 31 20 81 21 10 8 0.012706
4 47.156128 82.234199 10.458328 24 8 68 44 24 4 50 0.517130
5 18.733256 61.675649 23.531239 74 61 97 20 12 0 95 0.360815
6 4.521820 26.165427 26.145821 68 10 77 67 92 82 11 0.606739
7 24.547026 62.610129 23.531239 50 45 69 94 56 77 56 0.412445
8 52.969897 75.692843 9.804683 73 74 5 10 60 51 77 0.125309
9 21.963128 30.837825 19.609366 75 9 50 68 10 82 96 0.687966
#test2
df = pd.DataFrame(np.random.randint(100, size=(10, 10))) \
.assign(Count=np.random.rand(10))
df.columns = ['find', 'a', 'b', 3, 4, 5, 6, 7, 8, 9, 'Count']
df.iloc[:, 0:3] *= df['Count']
df
Out[2]:
find a b 3 4 5 6 7 8 9 Count
0 NaN NaN NaN 63 63 47 81 3 48 34 0.603953
1 NaN NaN NaN 70 48 41 27 78 75 23 0.839635
2 NaN NaN NaN 5 38 52 23 3 75 4 0.515159
3 NaN NaN NaN 40 49 31 25 63 48 25 0.483255
4 NaN NaN NaN 42 89 46 47 78 30 5 0.693555
5 NaN NaN NaN 68 83 81 87 7 54 3 0.108306
6 NaN NaN NaN 74 48 99 67 80 81 36 0.361500
7 NaN NaN NaN 10 19 26 41 11 24 33 0.705899
8 NaN NaN NaN 38 51 83 78 7 31 42 0.838703
9 NaN NaN NaN 2 7 63 14 28 38 10 0.277547
df.iloc[:,0:3] is a dataframe with three series, named find, a, and b. df['Count'] is a series named Count. When you multiply these, Pandas tries to match up same-named series, but since there are none, it ends up generating NaN values for all the slots. Then it assigns these NaN:s back to the dataframe.
I think that using .mul with an appropriate axis= is the way around this, but I may be wrong about that...
My overall goal is to remove outliers in a row that are higher than the 1.5xIQR of the that row. I have a large dataframe with thousands of features which mainly consists of numeric data. I have calculated the 1.5xIQR in a row-wise fashion and set it as a new column. I would like to replace any data within each row that is greater than its respective 1.5xIQR with either NaN (preferred) or 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
df
A B C D
0 46 99 38 11
1 43 49 3 95
2 64 39 33 49
3 41 60 49 7
4 38 95 70 13
5 11 45 57 73
6 8 62 57 22
7 9 83 89 91
8 47 82 61 40
9 34 21 21 41
I have tried numerous variations of this and beyond with no success.
df1 = df.iloc[:,:] > df.loc['D'] = 'NaN'
I think this should work:
def f(row):
Q1 = row.quantile(0.25)
Q3 = row.quantile(0.75)
IQR = Q3 - Q1
row[row > 1.5*IQR] = np.nan
return row
df1 = df.apply(f, axis=1)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
s = df.apply(lambda x: (x.quantile(.75)-x.quantile(.25))*1.5, axis=1)
df=df.where(df.lt(s, axis=0),np.nan)
print(df)
My understanding from the wording of question and your tried code is that you have already calculated the 1.5xIQR in column D. As such, you can use df.mask as follows:
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
Demo:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
print(df)
A B C D
0 45 71 15 22
1 56 68 62 91
2 21 90 44 15
3 60 87 2 68
4 48 21 22 25
5 60 68 67 60
6 74 97 94 27
7 69 26 56 85
8 39 42 74 73
9 23 99 91 72
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
print(df1)
A B C D
0 NaN NaN 15.0 22
1 56.0 68.0 62.0 91
2 NaN NaN NaN 15
3 60.0 NaN 2.0 68
4 NaN 21.0 22.0 25
5 60.0 NaN NaN 60
6 NaN NaN NaN 27
7 69.0 26.0 56.0 85
8 39.0 42.0 NaN 73
9 23.0 NaN NaN 72
Alternatively, you can also use the simplified code below to update df elements meeting the criteria in place:
df[df.gt(df['D'], axis=0)] = np.nan
Printing df after the code will give the same result.
I want to calculate the mean of columns a,b,c,d of the dataframe BUT if one of four values in each dataframe row differs more then 20% from this mean (of the four values), the mean has to be set to NaN.
Calculation of the mean of 4 columns is easy, but I'm stuck at defining the condition 'if mean*0.8 <= one of the values in the data row <= mean*1,2 then mean == NaN.
In the example, one or more of the values in ID:5 en ID:87 don't fit in the interval and therefore the mean is set to NaN.
(NaN-values in the initial dataframe are ignored when calculating the mean and when applying the 20%-condition to the calculated mean)
So I'm trying to calculate the mean only for the data rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})
print(df)
a = df.loc[:, ['a','b','c','d']]
df['mean'] = (a.iloc[:,0:]).mean(1)
print(df)
b = df.mean.values[:,None]*0.8 < a.values[:,:] < df.mean.values[:,None]*1.2
print(b)
...
Try this:
# extract related information
s = df.iloc[:,1:]
# calculate mean
mean = s.mean(1)
# where condition is violated
mask = s.lt(mean*.8, axis=0) | s.gt(mean*1.2, axis=0)
# mask where mask is True on any row
df['mean'] = mean.mask(mask.any(1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
If I have a dataframe that has columns that include the same name, is there a way to combine the columns that have the same name with some sort of function (i.e. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?
I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(numpy.max)
pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
<!_ >
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To perform aggregation across the upper levels, use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, if aggregating per upper level only, use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
Here is possible simplier solution for common aggregation functions like sum, mean, median, max, min, std - only use parameters axis=1 for working with columns and level:
#coldspeed samples
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Similar it working for index, then use axis=0 instead axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
If need use another functions like first, last, size, count is necessary use coldspeed answer