pandas concat fails with an error, why? - python

I would like to add a total row at the top of my pivot table, but when I try to concat I get an error.
>>> table
Weight
Vcountry 1 2 3 4 5 6 7
V20001
1 86 NaN NaN NaN NaN NaN 92
2 41 NaN 71 40 50 51 49
3 NaN 61 60 61 60 25 62
4 51 NaN NaN NaN NaN NaN NaN
5 26 26 20 41 25 23 NaN
[5 rows x 7 columns]
That's the pivot table.
>>> totals_frame
Vcountry 1 2 3 4 5 6 7
totalCount 204 87 151 142 135 99 203
[1 rows x 7 columns]
That's the totals row I would like to join.
>>> pc = [totals_frame, table]
>>> concat(pc)
Here is the output:
reindex_items
copy_if_needed=True)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2887, in reindex
target = MultiIndex.from_tuples(target)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 2486, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "inference.pyx", line 915, in pandas.lib.tuples_to_object_array (pandas\lib.c:43656)
TypeError: object of type 'long' has no len()

Here's a possible way: instead of pd.concat, use pd.DataFrame.append. There's a bit of fiddling around with the index to do, but it's still quite neat, I think:
# Just setting up the dataframe:
import pandas as pd
df = pd.DataFrame({'country':['A','A','A','B','B','B'],
                   'weight':[1,2,3,1,2,3],
                   'value':[10,20,30,15,25,35]})
df = df.set_index(['country','weight']).unstack('weight')
# A bit of messing about to get the index right:
index = df.index.values.tolist()
index.append('Totals')
# Here's where the magic happens
# (DataFrame.append only exists in pandas < 2.0):
df = df.append(df.sum(), ignore_index=True)
df.index = index
which gives:
value
weight 1 2 3
A 10 20 30
B 15 25 35
Totals 25 45 65
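DataFrame.append was removed in pandas 2.0, so on current versions the same idea has to go through pd.concat. Below is a minimal sketch of that, reusing the toy data from the answer above; the names table, totals and result are only illustrative. It turns the column sums into a one-row frame whose columns match the pivot exactly, which is what lets concat line the two up.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({'country': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'weight': [1, 2, 3, 1, 2, 3],
                   'value': [10, 20, 30, 15, 25, 35]})
table = df.set_index(['country', 'weight']).unstack('weight')

# Column sums as a one-row frame: the Series index (the pivot's columns)
# becomes the columns again after transposing, so both frames match.
totals = table.sum().to_frame('Totals').T

# Totals first puts the row on top; swap the order to put it at the bottom.
result = pd.concat([totals, table])
print(result)
```

Building totals from table.sum() keeps the column index (including its levels) identical in both frames, which avoids having concat reindex a MultiIndex against a flat one.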

Related

Python: compare many columns to one column and replace values greater than that column with NaN

My overall goal is to remove outliers in a row that are higher than 1.5xIQR of that row. I have a large dataframe with thousands of features, which mainly consists of numeric data. I have calculated the 1.5xIQR row-wise and set it as a new column. I would like to replace any value within each row that is greater than its respective 1.5xIQR with either NaN (preferred) or 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
df
A B C D
0 46 99 38 11
1 43 49 3 95
2 64 39 33 49
3 41 60 49 7
4 38 95 70 13
5 11 45 57 73
6 8 62 57 22
7 9 83 89 91
8 47 82 61 40
9 34 21 21 41
I have tried numerous variations of this and beyond with no success.
df1 = df.iloc[:,:] > df.loc['D'] = 'NaN'
I think this should work:
def f(row):
    Q1 = row.quantile(0.25)
    Q3 = row.quantile(0.75)
    IQR = Q3 - Q1
    row[row > 1.5*IQR] = np.nan
    return row
df1 = df.apply(f, axis=1)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
s = df.apply(lambda x: (x.quantile(.75)-x.quantile(.25))*1.5, axis=1)
df=df.where(df.lt(s, axis=0),np.nan)
print(df)
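The row-wise threshold can also be computed without apply. The following is a rough sketch under the same reading of the question (threshold = 1.5 x IQR of each row, values above it become NaN); the small frame and the names q, threshold and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hand-made frame so the result is easy to check by eye.
df = pd.DataFrame({'A': [1, 10], 'B': [2, 20], 'C': [3, 30], 'D': [100, 40]})

# Both quartiles for every row in one call: the result has the quantiles
# as its index and the original row labels as its columns.
q = df.quantile([0.25, 0.75], axis=1)
threshold = 1.5 * (q.loc[0.75] - q.loc[0.25])

# Replace every value that exceeds its own row's threshold with NaN.
result = df.mask(df.gt(threshold, axis=0))
print(result)
```

Passing axis=0 to gt aligns the threshold Series with the row labels, so each row is compared against its own value.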
My understanding from the wording of the question and the code you tried is that you have already calculated the 1.5xIQR in column D. If so, you can use df.mask as follows:
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
Demo:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
print(df)
A B C D
0 45 71 15 22
1 56 68 62 91
2 21 90 44 15
3 60 87 2 68
4 48 21 22 25
5 60 68 67 60
6 74 97 94 27
7 69 26 56 85
8 39 42 74 73
9 23 99 91 72
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
print(df1)
A B C D
0 NaN NaN 15.0 22
1 56.0 68.0 62.0 91
2 NaN NaN NaN 15
3 60.0 NaN 2.0 68
4 NaN 21.0 22.0 25
5 60.0 NaN NaN 60
6 NaN NaN NaN 27
7 69.0 26.0 56.0 85
8 39.0 42.0 NaN 73
9 23.0 NaN NaN 72
Alternatively, you can also use the simplified code below to update df elements meeting the criteria in place:
df[df.gt(df['D'], axis=0)] = np.nan
Printing df after the code will give the same result.

How to insert row to dataframe

I have an existing dataframe like this
>>> print(dataframe)
sid
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Name: name, Length: 87, dtype: int64
I want to add a row like {'sid': 2, '': 100} to it but when I try this
df = pandas.DataFrame({'sid': [2], '': [100]})
df = df.set_index('sid')
dataframe = dataframe.append(df)
print(dataframe)
I end up with
sid
30 11.0 NaN
56 5.0 NaN
73 25.0 NaN
78 2.0 NaN
132 1.0 NaN
... ... ...
8616 2.0 NaN
9049 1.0 NaN
9125 6.0 NaN
9316 11.0 NaN
2 NaN 100.0
I'm hoping for something more like
sid
2 100
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Any idea how I can achieve that?
The way to do this was
dataframe.loc[2] = 100
Thanks anky!
The reason for the problem is that, when you appended the two DataFrames, 'sid' was not set as the index on both of them, so the two DataFrames had different structures. Make sure both DataFrames have the same index before you append them.
data = [ [30,11], [56, 5], [73, 25]] #test dataframe
dataframe = pd.DataFrame(data, columns=['sid', ''])
dataframe = dataframe.set_index('sid')
print(dataframe)
You get,
sid
30 11
56 5
73 25
Create and set the index of df,
df = pd.DataFrame({'sid' : [2], '' : [100]})
df = df.set_index('sid')
You get,
sid
2 100
Then append them,
dataframe = df.append(dataframe)
print(dataframe)
You will get the desired outcome,
sid
2 100
30 11
56 5
73 25
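On pandas 2.0 and later, append no longer exists, so here is a minimal sketch of the same insert with pd.concat. It assumes the original object is really a Series indexed by sid, as the "Name: name, dtype: int64" footer in the question suggests; the sample values and the names s, new_row, result and alt are only illustrative.

```python
import pandas as pd

# Stand-in for the existing data: a Series named 'name', indexed by 'sid'.
s = pd.Series([11, 5, 25], index=pd.Index([30, 56, 73], name='sid'), name='name')

# The row to add, built with the same name and index name so nothing misaligns.
new_row = pd.Series([100], index=pd.Index([2], name='sid'), name='name')

# Option 1: concatenate with the new row first.
result = pd.concat([new_row, s])

# Option 2: assign by label (adds the row at the end), then sort the index.
alt = s.copy()
alt.loc[2] = 100
alt = alt.sort_index()

print(result)
```

Both options keep a single integer column. The extra NaN column in the question appears when the two objects do not share the same name or column label.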

Calculate mean of df, BUT if >=1 of the values differs >20% from this mean, the mean is set to NaN

I want to calculate the mean of columns a, b, c, d of the dataframe, BUT if any of the four values in a row differs by more than 20% from that row's mean (of the four values), the mean has to be set to NaN.
Calculating the mean of the 4 columns is easy, but I'm stuck on defining the condition: if any value in the row falls outside the interval [mean*0.8, mean*1.2], then the mean is set to NaN.
In the example, one or more of the values in ID:5 and ID:87 don't fit in the interval and therefore the mean is set to NaN.
(NaN-values in the initial dataframe are ignored when calculating the mean and when applying the 20%-condition to the calculated mean)
So I'm trying to calculate the mean only for the data rows with no 'outliers'.
Initial df:
ID a b c d
2 31 32 31 31
5 33 52 159 2
7 51 NaN 52 51
87 30 52 421 2
90 10 11 10 11
102 41 42 NaN 42
Desired df:
ID a b c d mean
2 31 32 31 31 31.25
5 33 52 159 2 NaN
7 51 NaN 52 51 51.33
87 30 52 421 2 NaN
90 10 11 10 11 10.50
102 41 42 NaN 42 41.67
Code:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [2,5,7,87,90,102],
"a": [31,33,51,30,10,41],
"b": [32,52,np.nan,52,11,42],
"c": [31,159,52,421,10,np.nan],
"d": [31,2,51,2,11,42]})

print(df)

a = df.loc[:, ['a','b','c','d']]

df['mean'] = (a.iloc[:,0:]).mean(1)


print(df)
b = df.mean.values[:,None]*0.8 < a.values[:,:] < df.mean.values[:,None]*1.2
print(b)
...
Try this:
# extract related information (every column except ID)
s = df.iloc[:,1:]
# calculate the row-wise mean (NaN values are skipped)
mean = s.mean(axis=1)
# where condition is violated
mask = s.lt(mean*.8, axis=0) | s.gt(mean*1.2, axis=0)
# mask where mask is True on any row
df['mean'] = mean.mask(mask.any(axis=1))
Output:
ID a b c d mean
0 2 31 32.0 31.0 31 31.250000
1 5 33 52.0 159.0 2 NaN
2 7 51 NaN 52.0 51 51.333333
3 87 30 52.0 421.0 2 NaN
4 90 10 11.0 10.0 11 10.500000
5 102 41 42.0 NaN 42 41.666667
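For reuse, the same check can be wrapped in a small helper. This is only a sketch: the function name mean_within_tolerance and its tol parameter are hypothetical, and it expresses the 20% band as an absolute deviation from the mean, which should be equivalent to the two-sided comparison above for positive means.

```python
import numpy as np
import pandas as pd

def mean_within_tolerance(values, tol=0.2):
    """Row mean, or NaN if any non-missing value deviates by more than tol * mean."""
    mean = values.mean(axis=1)                        # NaN entries are skipped
    deviation = values.sub(mean, axis=0).abs()        # distance of each value from its row mean
    outlier = deviation.gt(tol * mean.abs(), axis=0)  # comparisons with NaN are False
    return mean.mask(outlier.any(axis=1))

# Data from the question.
df = pd.DataFrame({"ID": [2, 5, 7, 87, 90, 102],
                   "a": [31, 33, 51, 30, 10, 41],
                   "b": [32, 52, np.nan, 52, 11, 42],
                   "c": [31, 159, 52, 421, 10, np.nan],
                   "d": [31, 2, 51, 2, 11, 42]})

df["mean"] = mean_within_tolerance(df[["a", "b", "c", "d"]])
print(df)
```

Selecting the value columns by name rather than by position keeps the helper independent of where the ID column sits.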

How to create a rolling window in pandas with another condition

I have a data frame with 2 columns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
A B
0 11 10
1 61 30
2 24 54
3 47 52
4 72 42
... ... ...
95 61 2
96 67 41
97 95 30
98 29 66
99 49 22
100 rows × 2 columns
Now I want to create a third column, which is a rolling window max of col 'A' BUT
the max has to be lower than the corresponding value in col 'B'. In other words I want the value of the 4 (using a window size of 4) in column 'A' closest to the value in col 'B', yet smaller than B
So for example in row
3 47 52
the new value I am looking for, is not 61 but 47, because it is the highest value of the 4 that is not higher than 52
pseudo code
df['C'] = df['A'].rolling(window=4).max() where < df['B']
You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.
Sample Data
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
Code
N = 4
# End slice ensures same default min_periods behavior to `.rolling`
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]
# Remove values larger than B, then find the max of remaining.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(1)
print(df.head(15))
A B C
0 51 92 NaN # Missing b/c min_periods
1 14 71 NaN # Missing b/c min_periods
2 60 20 NaN # Missing b/c min_periods
3 82 86 82.0
4 74 74 60.0
5 87 99 87.0
6 23 2 NaN # Missing b/c 82, 74, 87, 23 all > 2
7 21 52 23.0 # Max of 21, 23, 87, 74 which is < 52
8 1 87 23.0
9 29 37 29.0
10 1 63 29.0
11 59 20 1.0
12 32 75 59.0
13 57 21 1.0
14 88 48 32.0
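The same windowed comparison can be sketched in NumPy with sliding_window_view (available from NumPy 1.20). The helper name rolling_max_below is made up for this example, and the ten sample rows are a small hand-typed frame. Like the approach above, it leaves the first window-1 rows as NaN.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_max_below(a, b, window):
    """For each row, the largest of the last `window` values of a that is < b in that row."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.full(a.shape, np.nan)
    if len(a) < window:
        return out
    windows = sliding_window_view(a, window)   # shape: (len(a) - window + 1, window)
    limits = b[window - 1:, None]              # B value of the row that closes each window
    valid = windows < limits
    best = np.where(valid, windows, -np.inf).max(axis=1)
    # Rows where nothing in the window is below B stay NaN.
    out[window - 1:] = np.where(valid.any(axis=1), best, np.nan)
    return out

df = pd.DataFrame({'A': [51, 14, 60, 82, 74, 87, 23, 21, 1, 29],
                   'B': [92, 71, 20, 86, 74, 99, 2, 52, 87, 37]})
df['C'] = rolling_max_below(df['A'], df['B'], window=4)
print(df)
```

The window view does not copy the data, but the intermediate np.where array does, so memory use grows with len(A) * window.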
You can use a custom function to .apply to the rolling window. In this case, you can use a default argument to pass in the B column.
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
def rollup(a, B=df.B):
    # a is the current window of A; look up B at the window's last row
    ix = a.index.max()
    b = B[ix]
    return a[a<b].max()
df['C'] = df.A.rolling(4).apply(rollup)
df
# returns:
A B C
0 8 17 NaN
1 23 84 NaN
2 75 84 NaN
3 86 24 23.0
4 52 83 75.0
.. .. .. ...
95 38 22 NaN
96 53 48 38.0
97 45 4 NaN
98 3 92 53.0
99 91 86 53.0
The NaN values occur when no number in the window of A is less than B or at the start of the series when the window is too big for the first few rows.
You can use where to replace values that don't fulfill the condition with np.nan and then use rolling(window=4, min_periods=1):
In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()
In [38]: df
Out[38]:
A B C
0 0 1 0.0
1 1 2 1.0
2 2 3 2.0
3 10 4 2.0
4 4 5 4.0
5 5 6 5.0
6 10 7 5.0
7 10 8 5.0
8 10 9 5.0
9 10 10 NaN

Replacing values in a pandas multi-index

I have a dataframe with a multi-index. I want to change the value of the 2nd index when certain conditions on the first index are met.
I found a similar (but different) question here: Replace a value in MultiIndex (pandas)
which doesn't answer my point because that was about changing a single row, and the solution passed the value of the first index (which didn't need changing), too. In my case I am dealing with multiple rows and I haven't been able to adapt that solution to my case.
A minimal example of my data is below. Thanks!
import pandas as pd
import numpy as np
consdf=pd.DataFrame()
for mylocation in ['North','South']:
    for scenario in np.arange(1,4):
        df= pd.DataFrame()
        df['mylocation'] = [mylocation]
        df['scenario']= [scenario]
        df['this'] = np.random.randint(10,100)
        df['that'] = df['this'] * 2
        df['something else'] = df['this'] * 3
        consdf=pd.concat((consdf, df ), axis=0, ignore_index=True)
mypiv = consdf.pivot(index='mylocation', columns='scenario').transpose()
level_list =['this','that']
# if level 0 is in level_list --> set level 1 to np.nan
mypiv.iloc[mypiv.index.get_level_values(0).isin(level_list)].index.set_levels([np.nan], level =1, inplace=True)
The last line doesn't work: I get:
ValueError: On level 1, label max (2) >= length of level (1). NOTE: this index is in an inconsistent state
IIUC, you could add a new value to the level values and then change the labels of your index, using advanced indexing and the get_level_values, set_levels and set_labels methods:
len_ind = len(mypiv.loc[(level_list,)].index.get_level_values(1))
mypiv.index.set_levels([1, 2, 3, np.nan], level=1, inplace=True)
mypiv.index.set_labels([3]*len_ind + mypiv.index.labels[1][len_ind:].tolist(), level=1, inplace=True)
In [219]: mypiv
Out[219]:
mylocation North South
scenario
this NaN 26 46
NaN 32 67
NaN 75 30
that NaN 52 92
NaN 64 134
NaN 150 60
something else 1.0 78 138
2.0 96 201
3.0 225 90
Note: the values of the other scenarios are converted to float, because a level must have a single type and np.nan is a float.
Note: ix has been deprecated in Pandas 0.20+. Use loc accessor instead.
Here is a solution, using reset_index() method:
In [95]: new = mypiv.reset_index()
In [96]: new
Out[96]:
mylocation level_0 scenario North South
0 this 1 32 64
1 this 2 18 40
2 this 3 76 56
3 that 1 64 128
4 that 2 36 80
5 that 3 152 112
6 something else 1 96 192
7 something else 2 54 120
8 something else 3 228 168
In [100]: new.ix[new.level_0.isin(level_list), 'scenario'] = np.nan
In [101]: new
Out[101]:
mylocation level_0 scenario North South
0 this NaN 32 64
1 this NaN 18 40
2 this NaN 76 56
3 that NaN 64 128
4 that NaN 36 80
5 that NaN 152 112
6 something else 1.0 96 192
7 something else 2.0 54 120
8 something else 3.0 228 168
In [103]: mypiv = new.set_index(['level_0', 'scenario'])
In [104]: mypiv
Out[104]:
mylocation North South
level_0 scenario
this NaN 32 64
NaN 18 40
NaN 76 56
that NaN 64 128
NaN 36 80
NaN 152 112
something else 1.0 96 192
2.0 54 120
3.0 228 168
But I suspect there is a more elegant solution.
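On current pandas, set_labels and the inplace= argument of set_levels are no longer available, so one option is to rebuild the index from its level values. The following is a minimal sketch of that idea; the frame is a small made-up stand-in with the same index layout as mypiv, and the names lvl0, lvl1 and new_lvl1 are only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for mypiv: level 0 holds the measure, level 1 the scenario.
idx = pd.MultiIndex.from_product(
    [['this', 'that', 'something else'], [1, 2, 3]], names=[None, 'scenario'])
mypiv = pd.DataFrame({'North': range(9), 'South': range(9, 18)}, index=idx)

level_list = ['this', 'that']

lvl0 = mypiv.index.get_level_values(0)
lvl1 = mypiv.index.get_level_values(1)

# NaN wherever level 0 is in level_list, the original scenario otherwise.
# The result is a float array, because NaN cannot live in an integer level.
new_lvl1 = np.where(lvl0.isin(level_list), np.nan, lvl1)

# Assigning a freshly built index leaves the data untouched.
mypiv.index = pd.MultiIndex.from_arrays([lvl0, new_lvl1], names=mypiv.index.names)
print(mypiv)
```

Rebuilding the index avoids editing levels and codes separately, which is what left the index in an inconsistent state in the question.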
