I have a dataframe with a multi-index. I want to change the value of the 2nd index when certain conditions on the first index are met.
I found a similar (but different) question here: Replace a value in MultiIndex (pandas)
which doesn't answer my question: that one was about changing a single row, and the solution passed the value of the first index (which didn't need changing) too. In my case I am dealing with multiple rows, and I haven't been able to adapt that solution.
A minimal example of my data is below. Thanks!
import pandas as pd
import numpy as np
consdf = pd.DataFrame()
for mylocation in ['North', 'South']:
    for scenario in np.arange(1, 4):
        df = pd.DataFrame()
        df['mylocation'] = [mylocation]
        df['scenario'] = [scenario]
        df['this'] = np.random.randint(10, 100)
        df['that'] = df['this'] * 2
        df['something else'] = df['this'] * 3
        consdf = pd.concat((consdf, df), axis=0, ignore_index=True)
mypiv = consdf.pivot(index='mylocation', columns='scenario').transpose()
level_list = ['this', 'that']
# if level 0 is in level_list --> set level 1 to np.nan
mypiv.iloc[mypiv.index.get_level_values(0).isin(level_list)].index.set_levels([np.nan], level =1, inplace=True)
The last line doesn't work: I get:
ValueError: On level 1, label max (2) >= length of level (1). NOTE: this index is in an inconsistent state
IIUC you could add a new value to the level values and then change the labels of your index, using advanced indexing together with the get_level_values, set_levels and set_labels methods:
len_ind = len(mypiv.loc[(level_list,)].index.get_level_values(1))
mypiv.index.set_levels([1, 2, 3, np.nan], level=1, inplace=True)
mypiv.index.set_labels([3]*len_ind + mypiv.index.labels[1][len_ind:].tolist(), level=1, inplace=True)
In [219]: mypiv
Out[219]:
mylocation North South
scenario
this NaN 26 46
NaN 32 67
NaN 75 30
that NaN 52 92
NaN 64 134
NaN 150 60
something else 1.0 78 138
2.0 96 201
3.0 225 90
Note: the values for the other scenarios are converted to float, because a level must hold a single dtype and np.nan is of float type.
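For anyone on recent pandas (an assumption on my part, not part of the original answer): MultiIndex.labels was renamed to codes, and the inplace argument of set_levels/set_labels was removed, so the rebuilt index has to be assigned back. A minimal sketch of the same two steps:
# Modern-pandas equivalent of the two inplace calls above (labels -> codes)
idx = mypiv.index.set_levels([1, 2, 3, np.nan], level=1)
new_codes = [3] * len_ind + list(idx.codes[1][len_ind:])
mypiv.index = idx.set_codes(new_codes, level=1)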
Note: ix has been deprecated since pandas 0.20; the loc accessor is used below instead.
Here is a solution, using reset_index() method:
In [95]: new = mypiv.reset_index()
In [96]: new
Out[96]:
mylocation level_0 scenario North South
0 this 1 32 64
1 this 2 18 40
2 this 3 76 56
3 that 1 64 128
4 that 2 36 80
5 that 3 152 112
6 something else 1 96 192
7 something else 2 54 120
8 something else 3 228 168
In [100]: new.loc[new.level_0.isin(level_list), 'scenario'] = np.nan
In [101]: new
Out[101]:
mylocation level_0 scenario North South
0 this NaN 32 64
1 this NaN 18 40
2 this NaN 76 56
3 that NaN 64 128
4 that NaN 36 80
5 that NaN 152 112
6 something else 1.0 96 192
7 something else 2.0 54 120
8 something else 3.0 228 168
In [103]: mypiv = new.set_index(['level_0', 'scenario'])
In [104]: mypiv
Out[104]:
mylocation North South
level_0 scenario
this NaN 32 64
NaN 18 40
NaN 76 56
that NaN 64 128
NaN 36 80
NaN 152 112
something else 1.0 96 192
2.0 54 120
3.0 228 168
But I suspect there is a more elegant solution.
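For what it's worth, here is one more compact possibility (a sketch along the same lines, not from the original answers): rebuild the whole index in one go with numpy.where instead of round-tripping through reset_index.
# Replace level 1 with NaN wherever level 0 is in level_list
lvl0 = mypiv.index.get_level_values(0)
lvl1 = mypiv.index.get_level_values(1)
mypiv.index = pd.MultiIndex.from_arrays(
    [lvl0, np.where(lvl0.isin(level_list), np.nan, lvl1)],
    names=mypiv.index.names)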
My overall goal is to remove outliers in a row that are higher than 1.5xIQR of that row. I have a large dataframe with thousands of features, consisting mainly of numeric data. I have calculated the 1.5xIQR row-wise and set it as a new column. I would like to replace any value within each row that is greater than its respective 1.5xIQR with either NaN (preferred) or 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
df
A B C D
0 46 99 38 11
1 43 49 3 95
2 64 39 33 49
3 41 60 49 7
4 38 95 70 13
5 11 45 57 73
6 8 62 57 22
7 9 83 89 91
8 47 82 61 40
9 34 21 21 41
I have tried numerous variations of this and beyond with no success.
df1 = df.iloc[:,:] > df.loc['D'] = 'NaN'
I think this should work:
def f(row):
    Q1 = row.quantile(0.25)        # 25th percentile of the row
    Q3 = row.quantile(0.75)        # 75th percentile of the row
    IQR = Q3 - Q1                  # interquartile range
    row[row > 1.5 * IQR] = np.nan  # blank out values above 1.5*IQR
    return row

df1 = df.apply(f, axis=1)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
s = df.apply(lambda x: (x.quantile(.75) - x.quantile(.25)) * 1.5, axis=1)
df = df.where(df.lt(s, axis=0), np.nan)
print(df)
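For what it's worth, the where line can equivalently be written with mask (a small stylistic variant, not in the original answer), stating the condition to blank out rather than the condition to keep:
# NaN wherever a value reaches or exceeds its row's 1.5*IQR threshold
df = df.mask(df.ge(s, axis=0))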
My understanding from the wording of the question and the code you tried is that you have already calculated the 1.5xIQR in column D. As such, you can use df.mask as follows:
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
Demo:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
print(df)
A B C D
0 45 71 15 22
1 56 68 62 91
2 21 90 44 15
3 60 87 2 68
4 48 21 22 25
5 60 68 67 60
6 74 97 94 27
7 69 26 56 85
8 39 42 74 73
9 23 99 91 72
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
print(df1)
A B C D
0 NaN NaN 15.0 22
1 56.0 68.0 62.0 91
2 NaN NaN NaN 15
3 60.0 NaN 2.0 68
4 NaN 21.0 22.0 25
5 60.0 NaN NaN 60
6 NaN NaN NaN 27
7 69.0 26.0 56.0 85
8 39.0 42.0 NaN 73
9 23.0 NaN NaN 72
Alternatively, the simplified code below updates the elements of df meeting the criteria in place:
df[df.gt(df['D'], axis=0)] = np.nan
Printing df after the code will give the same result.
I want to calculate the mean of columns a, b, c, d of the dataframe, BUT if one of the four values in a row differs by more than 20% from the mean of that row, the mean has to be set to NaN.
Calculating the mean of the 4 columns is easy, but I'm stuck at defining the condition: if a value in the row falls outside the interval mean*0.8 <= value <= mean*1.2, then the mean has to be set to NaN.
In the example, one or more of the values in rows ID 5 and ID 87 don't fit in the interval, and therefore the mean is set to NaN.
(NaN values in the initial dataframe are ignored both when calculating the mean and when applying the 20% condition to it.)
So I'm trying to calculate the mean only for the rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})
print(df)
a = df.loc[:, ['a','b','c','d']]
df['mean'] = (a.iloc[:,0:]).mean(1)
print(df)
b = df.mean.values[:,None]*0.8 < a.values[:,:] < df.mean.values[:,None]*1.2
print(b)
...
Try this (note that Python's chained comparisons like a < x < b don't work element-wise on arrays, hence the explicit lt/gt combined with |):
# extract the value columns
s = df.iloc[:, 1:]
# row-wise mean
mean = s.mean(axis=1)
# True wherever a value leaves the +/-20% band around its row's mean
mask = s.lt(mean * .8, axis=0) | s.gt(mean * 1.2, axis=0)
# blank out the mean for every row where the condition is violated
df['mean'] = mean.mask(mask.any(axis=1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
I have a data frame with 2 columns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
A B
0 11 10
1 61 30
2 24 54
3 47 52
4 72 42
... ... ...
95 61 2
96 67 41
97 95 30
98 29 66
99 49 22
100 rows × 2 columns
Now I want to create a third column, which is a rolling-window max of col 'A', BUT
the max has to be lower than the corresponding value in col 'B'. In other words, I want the largest of the 4 values in column 'A' (using a window size of 4) that is still smaller than the value in col 'B'.
So for example in row
3 47 52
the new value I am looking for is not 61 but 47, because it is the highest of the 4 values that is not higher than 52.
pseudo code
df['C'] = df['A'].rolling(window=4).max() where < df['B']
You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.
Sample Data
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
Code
N = 4
# The end slice ensures the same default min_periods behavior as .rolling
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]
# Remove values larger than B, then find the max of remaining.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(1)
print(df.head(15))
A B C
0 51 92 NaN # Missing b/c min_periods
1 14 71 NaN # Missing b/c min_periods
2 60 20 NaN # Missing b/c min_periods
3 82 86 82.0
4 74 74 60.0
5 87 99 87.0
6 23 2 NaN # Missing b/c 82, 74, 87, 23 all > 2
7 21 52 23.0 # Max of 21, 23, 87, 74 which is < 52
8 1 87 23.0
9 29 37 29.0
10 1 63 29.0
11 59 20 1.0
12 32 75 59.0
13 57 21 1.0
14 88 48 32.0
You can use a custom function to .apply to the rolling window. In this case, you can use a default argument to pass in the B column.
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))

def rollup(a, B=df.B):
    ix = a.index.max()     # label of the last row in the current window
    b = B[ix]              # the B value belonging to that row
    return a[a < b].max()  # max of the window values strictly below b
df['C'] = df.A.rolling(4).apply(rollup)
df
# returns:
A B C
0 8 17 NaN
1 23 84 NaN
2 75 84 NaN
3 86 24 23.0
4 52 83 75.0
.. .. .. ...
95 38 22 NaN
96 53 48 38.0
97 45 4 NaN
98 3 92 53.0
99 91 86 53.0
The NaN values occur when no number in the window of A is less than B, or at the start of the series where the window does not yet span 4 rows.
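One caveat (an assumption about current pandas defaults): this relies on rolling.apply handing the function a Series, so that a.index is available inside rollup. That is the default raw=False, but it can be made explicit:
# raw=False passes each window as a Series with its original index,
# which rollup needs in order to look up the matching value in B
df['C'] = df.A.rolling(4).apply(rollup, raw=False)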
You can use where to replace values that don't fulfill the condition with np.nan, and then use rolling(window=4, min_periods=1). Note that this compares each value of A with the B in its own row (rather than the B of the row where the window ends), and takes the rolling max of the values that survive:
In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()
In [38]: df
Out[38]:
A B C
0 0 1 0.0
1 1 2 1.0
2 2 3 2.0
3 10 4 2.0
4 4 5 4.0
5 5 6 5.0
6 10 7 5.0
7 10 8 5.0
8 10 9 5.0
9 10 10 NaN
I have a pandas DataFrame that reads from a file. The next step is to break the dataset into 2 datasets, df_LastYear and df_ThisYear.
Note: the index is not continuous; 2 and 6 are missing.
ID AdmissionAge
0 14 68
1 22 86
3 78 40
4 124 45
5 128 35
7 148 92
8 183 71
9 185 98
10 219 79
After applying some predictive models, I get the predicted values y_ThisYear:
Prediction
0 2.400000e+01
1 1.400000e+01
2 1.000000e+00
3 2.096032e+09
4 2.000000e+00
5 -7.395179e+11
6 6.159412e+06
7 5.592327e+07
8 5.303477e+08
9 5.500000e+00
10 6.500000e+00
I am trying to concat the datasets df_ThisYear and y_ThisYear into one dataset,
but I always get these results:
ID AdmissionAge Prediction
0 14.0 68.0 2.400000e+01
1 22.0 86.0 1.400000e+01
2 NaN NaN 1.000000e+00
3 78.0 40.0 2.096032e+09
4 124.0 45.0 2.000000e+00
5 128.0 35.0 -7.395179e+11
6 NaN NaN 6.159412e+06
7 148.0 92.0 5.592327e+07
8 183.0 71.0 5.303477e+08
9 185.0 98.0 5.500000e+00
10 219.0 79.0 6.500000e+00
There are NaNs which did not exist before. I found that these NaNs belong to the indices which were not included in df_ThisYear.
Therefore I tried resetting the index to get continuous indices. I used
df_ThisYear.reset_index(drop=True)
but I am still getting the same indices.
How do I fix this problem so I can concatenate df_ThisYear with y_ThisYear correctly?
Then you just need join, which aligns the two frames on their index (df and Y standing in for df_ThisYear and y_ThisYear):
df.join(Y)
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
If you are really set on using concat, you can pass 'inner' to the join argument, which keeps only the index labels present in both frames (so the missing 2 and 6 are dropped rather than filled with NaN):
pd.concat([df_ThisYear, y_ThisYear], axis=1, join='inner')
This returns
Out[6]:
ID AdmissionAge Prediction
0 14 68 2.400000e+01
1 22 86 1.400000e+01
3 78 40 2.096032e+09
4 124 45 2.000000e+00
5 128 35 -7.395179e+11
7 148 92 5.592327e+07
8 183 71 5.303477e+08
9 185 98 5.500000e+00
10 219 79 6.500000e+00
Because y_ThisYear has a different index than df_ThisYear.
When I joined both using
df_ThisYear.join(y_ThisYear)
it matched each number to its matching index.
That is right if the indices actually represent the same record, i.e. if index 7 in df_ThisYear corresponds to index 7 in y_ThisYear.
In my case I just want to match the first record in y_ThisYear to the first in df_ThisYear, regardless of their index numbers. It turns out reset_index returns a new DataFrame rather than modifying the frame in place, which is why my earlier call without assigning the result back appeared to change nothing.
I found this code that does that:
df_ThisYear = pd.concat([df_ThisYear.reset_index(drop=True), pd.DataFrame(y_ThisYear)], axis=1)
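An equivalent sketch (same assumption that the rows correspond one-to-one in order; if y_ThisYear is a plain array, wrap it in a DataFrame first):
# e.g. y_ThisYear = pd.DataFrame(y_ThisYear, columns=['Prediction'])
y_ThisYear.index = df_ThisYear.index        # make the indices line up positionally
df_ThisYear = df_ThisYear.join(y_ThisYear)  # join now matches row for row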
Thanks to everyone who helped with the answers.
I'd like to add a totals row at the top of my pivot table, but when I try to concat I get an error.
>>> table
Weight
Vcountry 1 2 3 4 5 6 7
V20001
1 86 NaN NaN NaN NaN NaN 92
2 41 NaN 71 40 50 51 49
3 NaN 61 60 61 60 25 62
4 51 NaN NaN NaN NaN NaN NaN
5 26 26 20 41 25 23 NaN
[5 rows x 7 columns]
That's the pivot table.
>>> totals_frame
Vcountry 1 2 3 4 5 6 7
totalCount 204 87 151 142 135 99 203
[1 rows x 7 columns]
This is the totals row I'd like to join to it.
>>> pc = [totals_frame, table]
>>> concat(pc)
Here is the traceback:
reindex_items
copy_if_needed=True)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2887, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2486, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "inference.pyx", line 915, in pandas.lib.tuples_to_object_array (pandas\lib.c:43656)
TypeError: object of type 'long' has no len()
Here's a possible way: instead of pd.concat, use pd.DataFrame.append. There's a bit of fiddling around with the index, but it's still quite neat, I think:
# Just setting up the dataframe:
df = pd.DataFrame({'country': ['A','A','A','B','B','B'],
                   'weight': [1,2,3,1,2,3],
                   'value': [10,20,30,15,25,35]})
df = df.set_index(['country','weight']).unstack('weight')
# A bit of messing about to get the index right:
index = df.index.values.tolist()
index.append('Totals')
# Here's where the magic happens:
df = df.append(df.sum(), ignore_index=True)
df.index = index
which gives:
value
weight 1 2 3
A 10 20 30
B 15 25 35
Totals 25 45 65
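A note for newer pandas (my addition, not part of the original answer): DataFrame.append was deprecated in 1.4 and removed in 2.0, so the same trick is now done with pd.concat. As a bonus, this variant puts the totals row on top, as the question asked:
# One-row frame whose columns match the pivot's MultiIndex columns
totals = df.sum().to_frame('Totals').T
# Stack it on top; the columns align, so no index fiddling is needed
df = pd.concat([totals, df])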