I have a dataframe whose values are of type object. The dataframe also contains NaN values. I want to ignore the NaN values and, for all the remaining values in the column, calculate a mean.
Mean is calculated as follows:
Upperbound value = 30
Lowerbound value = 0
(The upper and lower bounds are fixed, and all values need to be calculated with respect to them.)
So,
for '>20', mean = (20+30)/2 = 25
for '>1', mean = (30+1)/2=15.5
for '<5', mean = (5+0)/2 = 2.5
for '<10', mean = (10+0)/2 = 5
Dataframe:
column1
>20
NaN
<5
12
>1
<10
NaN
8
Note: The values in the column above are strings, and I want to convert them to numerical values.
Final converted dataframe should be:
column1
25
NaN
2.5
12
15.5
5
NaN
8
Note: Values like 8 and 12 are not averaged. I only want to apply the calculation to values prefixed with either > or <; the remaining values just need to be converted from strings to numbers.
There's probably a better way, but this works too:
df['num'] = df.column1.str.extract(r'(\d+)', expand=False)
df['sign'] = df.column1.str.extract(r'([<>])', expand=False).fillna('=')
def get_avg(row):
    if pd.isna(row.num):
        return row.num                      # keep NaN as NaN
    elif row.sign == '>':
        return (int(row.num) + 30) / 2      # average with the upper bound
    elif row.sign == '<':
        return (int(row.num) + 0) / 2       # average with the lower bound
    else:
        return float(row.num)               # plain number: only convert the string
df['avg'] = df.apply(get_avg, axis=1)
Output:
  column1  num sign   avg
0     >20   20    >  25.0
1     NaN  NaN    =   NaN
2      <5    5    <   2.5
3      12   12    =  12.0
4      >1    1    >  15.5
5     <10   10    <   5.0
6     NaN  NaN    =   NaN
7       8    8    =   8.0
The below code applies a custom function that checks the first character of each element and calculates the average based on that.
import numpy as np
import pandas as pd
upper = 30
lower = 0
df = pd.DataFrame({'col1': ['>20', np.nan, '<5', '12', '>1', '<10', np.nan, '8']})
def avg(val):
    if pd.isna(val):
        return np.nan                       # leave missing values untouched
    char = val[0]
    if char == '>':
        res = (float(val[1:]) + upper) / 2
    elif char == '<':
        res = (float(val[1:]) + lower) / 2
    else:
        res = float(val)
    return res
print(df["col1"].apply(avg))
Output:
0 25.0
1 NaN
2 2.5
3 12.0
4 15.5
5 5.0
6 NaN
7 8.0
You can use np.select to assign the value you want to average with. And then you can average, after converting column1 to a number.
import pandas as pd
import numpy as np
lt = df[df.column1.notnull()].column1.str.contains('<')
gt = df[df.column1.notnull()].column1.str.contains('>')
conds = [lt, gt, ~(lt | gt)]
choice = [0, 30, pd.to_numeric(df[df.column1.notnull()].column1, errors='coerce')]
df.loc[df.column1.notnull(), 'column2'] = np.select(conds, choice)
df['column1'] = pd.to_numeric(df.column1.str.replace('<|>', '', regex=True))  # regex=True so '<|>' is treated as a pattern
df['Avg'] = df.mean(axis=1)
Output:
column1 column2 Avg
0 20.0 30.0 25.0
1 NaN NaN NaN
2 5.0 0.0 2.5
3 12.0 12.0 12.0
4 1.0 30.0 15.5
5 10.0 0.0 5.0
6 NaN NaN NaN
7 8.0 8.0 8.0
You could write a function to calculate your "custom average" then call apply on your column.
import numpy as np
import pandas as pd
x = np.array([['>20'], [np.nan], ['<5'], ['>1'], ['<10'], [np.nan]])
df = pd.DataFrame(x, columns=["column1"])
def myFunc(content, up, low):
    try:
        if content.isnumeric():
            return float(content)
        return {
            '>': (float(content[1:]) + up) / 2,
            '<': (float(content[1:]) + low) / 2
        }[content[0]]
    except (AttributeError, KeyError, ValueError):
        # NaN (or the string 'nan' that np.array produces) and unparsable text end up here
        return np.nan
df["avg"] = df.column1.apply(lambda x: myFunc(x, up=30, low=0))
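The same conversion can also be written without apply. The following is only a sketch, assuming every non-missing value is an optional < or > followed by a number; the column name converted and the constants UPPER and LOWER are made up for the illustration.

```python
import numpy as np
import pandas as pd

UPPER, LOWER = 30, 0  # fixed bounds from the question

df = pd.DataFrame({"column1": [">20", np.nan, "<5", "12", ">1", "<10", np.nan, "8"]})

# Split each string into an optional sign and the numeric part.
parts = df["column1"].str.extract(r"^(?P<sign>[<>]?)(?P<num>\d+(?:\.\d+)?)$")
num = pd.to_numeric(parts["num"])

# '>' pairs with the upper bound, '<' with the lower bound, anything else with nothing.
bound = parts["sign"].map({">": UPPER, "<": LOWER})

# Keep plain numbers as they are; average the prefixed ones with their bound.
df["converted"] = num.where(bound.isna(), (num + bound) / 2)
print(df)
```

NaN rows stay NaN because both the extracted number and the bound are missing for them.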
Related
I am using bnp-paribas-cardif-claims-management from Kaggle.
Dataset : https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
df=pd.read_csv('F:\\Data\\Paribas_Claim\\train.csv',nrows=5000)
df.info() gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 133 entries, ID to v131
dtypes: float64(108), int64(6), object(19)
memory usage: 5.1+ MB
My requirement is :
I am trying to fill null values for columns with datatypes as int and object. I am trying to fill the nulls based on the target column.
My code is
df_obj = df.select_dtypes(['object','int64']).columns.to_list()
for cols in df_obj:
    df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = df[df['target'] == 1][cols].mode()
    df[( df['target'] == 0 )&( df[cols].isnull() )][cols] = df[df['target'] == 0][cols].mode()
The expression below prints the rows I expect:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols]
I can also print the values of df[df['target'] == 0][cols].mode() if I substitute a column name for cols.
But I am unable to replace the null values with the mode values.
I tried df.loc and df.at instead of df[], and df[...] == np.nan instead of df[...].isnull(), but to no avail.
Please advise if I need to change anything in the code. Thanks.
The problem here is that you select the integer columns, which cannot contain missing values (because NaN is a float), so there is nothing to replace. A possible solution is to select all numeric columns and, in a loop, set the first value of the mode per condition, using DataFrame.loc to avoid chained indexing and Series.iat to return only the first value (mode can sometimes return more than one value):
df=pd.read_csv('train.csv',nrows=5000)
#only numeric columns
df_obj = df.select_dtypes(np.number).columns.to_list()
#all columns
#df_obj = df.columns.to_list()
#print (df_obj)
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1 & (df[cols].isnull()), cols] = df.loc[m1, cols].mode().iat[0]
    df.loc[m2 & (df[cols].isnull()), cols] = df.loc[m2, cols].mode().iat[0]
Another solution replaces the missing values with Series.fillna:
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1, cols] = df.loc[m1, cols].fillna(df.loc[m1, cols].mode().iat[0])
    df.loc[m2, cols] = df.loc[m2, cols].fillna(df.loc[m2, cols].mode().iat[0])
print (df.head())
ID target v1 v2 v3 v4 v5 v6 \
0 3 1 1.335739e+00 8.727474 C 3.921026 7.915266 2.599278e+00
1 4 1 -9.543625e-07 1.245405 C 0.586622 9.191265 2.126825e-07
2 5 1 9.438769e-01 5.310079 C 4.410969 5.326159 3.979592e+00
3 6 1 7.974146e-01 8.304757 C 4.225930 11.627438 2.097700e+00
4 8 1 -9.543625e-07 1.245405 C 0.586622 2.151983 2.126825e-07
v7 v8 ... v122 v123 v124 v125 \
0 3.176895e+00 1.294147e-02 ... 8.000000 1.989780 3.575369e-02 AU
1 -9.468765e-07 2.301630e+00 ... 1.499437 0.149135 5.988956e-01 AF
2 3.928571e+00 1.964513e-02 ... 9.333333 2.477596 1.345191e-02 AE
3 1.987549e+00 1.719467e-01 ... 7.018256 1.812795 2.267384e-03 CJ
4 -9.468765e-07 -7.783778e-07 ... 1.499437 0.149135 -9.962319e-07 Z
v126 v127 v128 v129 v130 v131
0 1.804126e+00 3.113719e+00 2.024285 0 0.636365 2.857144e+00
1 5.521558e-07 3.066310e-07 1.957825 0 0.173913 -9.932825e-07
2 1.773709e+00 3.922193e+00 1.120468 2 0.883118 1.176472e+00
3 1.415230e+00 2.954381e+00 1.990847 1 1.677108 1.034483e+00
4 5.521558e-07 3.066310e-07 0.100455 0 0.173913 -9.932825e-07
[5 rows x 133 columns]
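The two masks can also be folded into a groupby on target. This is a sketch on a small made-up frame, not the Kaggle data; the helper fill_with_group_mode and the columns v and n are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "target": [1, 1, 1, 0, 0, 0],
    "v": ["a", None, "a", "b", "b", None],      # object column
    "n": [1.0, np.nan, 1.0, 2.0, np.nan, 2.0],  # numeric column
})

def fill_with_group_mode(s):
    """Fill the NaNs of one group with the first mode of that group."""
    modes = s.mode()
    # A group that is entirely NaN has no mode; leave it untouched.
    return s.fillna(modes.iat[0]) if not modes.empty else s

filled = df.copy()
for col in ["v", "n"]:
    filled[col] = df.groupby("target")[col].transform(fill_with_group_mode)
print(filled)
```

This scales to more than two target values without adding a mask per value.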
You did not provide sample data, so I'll just describe the methods I think you can use to solve your problem.
Try reading your DataFrame with na_filter=False; that way missing values are read as empty strings instead of np.nan.
Then, in your loop, use '' as the marker for missing values. That is easier to test for than the type of the value you are parsing.
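A small sketch of what na_filter=False changes, using an in-memory CSV with made-up contents instead of train.csv:

```python
import io
import pandas as pd

csv_text = "id,v1\n1,a\n2,\n3,c\n"  # hypothetical data; row 2 has an empty v1

default = pd.read_csv(io.StringIO(csv_text))                # empty field becomes NaN
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)   # empty field stays ''

print(default["v1"].isna().tolist())
print(raw["v1"].tolist())
```

Keep in mind that a numeric column containing blanks would then be read as strings, so this approach suits object columns best.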
I think DataFrame.fillna should help.
# random dataset
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 2, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
print(df)
A B C D
0 NaN 2.0 NaN 0
1 3.0 2.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Assuming you want to replace missing values with the mode value of a given column, I'd just use:
df.fillna({'A':df.A.mode()[0],'B':df.B.mode()[0]})
A B C D
0 3.0 2.0 NaN 0
1 3.0 2.0 NaN 1
2 3.0 2.0 NaN 5
3 3.0 3.0 NaN 4
This would also work if you needed a mode value from a subset of values from given column to fill NaNs with.
# let's add 'type' column
     A    B   C  D  type
0  NaN  2.0 NaN  0     1
1  3.0  2.0 NaN  1     1
2  NaN  NaN NaN  5     2
3  NaN  3.0 NaN  4     2
For example, if you want to fill the NaNs in df['B'] with the mode of B taken only over the rows where df['type'] equals 2:
df.fillna({
'B': df.loc[df.type.eq(2)].B.mode()[0] # type 2
})
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN 3.0 NaN 5 2
3 NaN 3.0 NaN 4 2
# ↑ this would have been '2.0' hadn't we filtered the column with df.loc[]
Your problem is this
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = ...
Do NOT chain index, especially when assigning. See Why does assignment fail when using chained indexing? section in this doc.
Instead use loc:
df.loc[(df['target'] == 1) & (df[cols].isnull()),
       cols] = df.loc[df['target'] == 1,
                      cols].mode().iat[0]  # mode() returns a Series; take its first value
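To see the difference, here is a minimal sketch on a made-up three-row frame: the chained form writes to a temporary copy and leaves df unchanged, while the loc form fills the value. The warning filter only silences the chained-assignment warning pandas emits.

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({"target": [1, 1, 0], "v": [1.0, np.nan, 2.0]})
mask = (df["target"] == 1) & df["v"].isna()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask]["v"] = 1.0  # chained: assigns into a temporary copy of df[mask]

still_missing = int(df["v"].isna().sum())  # the NaN is still there

# Single loc call: row mask and column in one indexing operation.
df.loc[mask, "v"] = df.loc[df["target"] == 1, "v"].mode().iat[0]
print(still_missing, df["v"].tolist())
```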
I have this data frame
x = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
Update: I want a function that, if the slope is negative and the length of the group is more than 2, returns True together with the start and end index of the group. For this case it should return: result=True, index=5, index=8
1- I want to split the data frame based on the slope. This example should have 6 groups.
2- How can I check the length of the groups?
I tried to get the groups with the code below, but I don't know how to split the data frame or how to check the length of each part.
New update: Thanks to Matt W. for his code. I finally found the solution.
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().fillna(0)
df.loc[df['diff'] < 0, 'diff'] = -1
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1]+1)
def get_slope(df):
    x=np.array(df.iloc[:,0].index)
    y=np.array(df.iloc[:,0])
    X = x - x.mean()
    Y = y - y.mean()
    slope = (X.dot(Y)) / (X.dot(X))
    return slope
df['g'] = init[1:]
df.groupby('g').apply(get_slope)
Result
g
1    NaN
2    NaN
3    NaN
4    0.0
5    NaN
6   -1.5
7    NaN
Take the difference and bfill() the start so that the 0th element has the same number as the next one. Then set all negative differences to the same value so that they count as the same "slope". Then I shift the column to check whether each value equals the previous one and iterate through the result, which gives a list that increments whenever the slope changes; that list is assigned to g.
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1]+1)
df['g'] = init[1:]
df
entity diff g
0 5 2.0 1
1 7 2.0 1
2 5 -1.0 2
3 5 0.0 3
4 5 0.0 3
5 6 1.0 4
6 3 -1.0 5
7 2 -1.0 5
8 0 -1.0 5
9 5 5.0 6
Just wanted to present another solution that doesn't require a for-loop:
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
df['g'] = (~(df['diff'] == df['diff'].shift(1))).cumsum()
df
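Building on the cumsum grouping, the function asked for in the update could look roughly like this. It is a sketch under two assumptions: the index is an integer index, and the reported start is the row just before the first drop, which is what gives 5 for the example. The name find_falling_run and the parameter min_len are made up.

```python
import pandas as pd

def find_falling_run(entity, min_len=3):
    """Return (True, start, end) for the first run of at least min_len falling steps."""
    falling = entity.diff().bfill() < 0
    # A new run starts wherever the falling flag differs from the previous row.
    run_id = falling.ne(falling.shift()).cumsum()
    for _, run in falling.groupby(run_id):
        if run.iloc[0] and len(run) >= min_len:
            first = entity.index.get_loc(run.index[0])
            start = entity.index[max(first - 1, 0)]  # row before the first drop
            return True, int(start), int(run.index[-1])
    return False, None, None

x = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
result = find_falling_run(x['entity'])
print(result)
```

The single drop at index 2 is skipped because its run has length 1.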
I have a dataframe which looks like
City Crime_Rate
A 10
B 20
C inf
D 15
I want to replace the inf with the max value of the Crime_Rate column , so that my resulting dataframe should look like
City Crime_Rate
A 10
B 20
C 20
D 15
I tried
df['Crime_Rate'].replace([np.inf],max(df['Crime_Rate']),inplace=True)
But Python takes inf as the maximum value. Where am I going wrong?
Filter out inf values first and then get max of Series:
m = df.loc[df['Crime_Rate'] != np.inf, 'Crime_Rate'].max()
df['Crime_Rate'] = df['Crime_Rate'].replace(np.inf, m)  # assign back instead of inplace on a selected column
Another solution:
mask = df['Crime_Rate'] != np.inf
df.loc[~mask, 'Crime_Rate'] = df.loc[mask, 'Crime_Rate'].max()
print (df)
City Crime_Rate
0 A 10.0
1 B 20.0
2 C 20.0
3 D 15.0
Here is a solution for a whole matrix/data frame:
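# mask inf first; otherwise every column that contains inf reports inf as its maximum
highest_non_inf = df.replace(np.inf, np.nan).max(numeric_only=True).max()
df = df.replace(np.inf, highest_non_inf)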
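When every column contains inf, the per-column maxima are all inf, so there is nothing finite left to pick from them. Masking inf before taking the maximum avoids that. A sketch on a made-up two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [np.inf, 7.0, 2.0]})

column_max = df.max()  # inf for both columns

# Turn inf into NaN first, then take the overall maximum of what remains.
highest_non_inf = df.replace(np.inf, np.nan).max().max()
cleaned = df.replace(np.inf, highest_non_inf)
print(cleaned)
```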
Set the option mode.use_inf_as_na to True and then use fillna. (Use this if you want to treat both inf and NaN as missing values; note that recent pandas releases deprecate this option.) i.e.
pd.options.mode.use_inf_as_na = True
df['Crime_Rate'] = df['Crime_Rate'].fillna(df['Crime_Rate'].max())
City Crime_Rate
0 A 10.0
1 B 20.0
2 C 20.0
3 D 15.0
One way to do it is to call replace(np.inf, np.nan) on the column before taking its maximum.
This replaces inf with NaN only for the computation of the maximum, so max returns the expected finite value instead of inf.
Example below: the maximum is 100, and it replaces inf.
#Create dummy data frame
import pandas as pd
import numpy as np
a = float('Inf')
v = [1,2,5,a,10,5,a,5,100,2]
df = pd.DataFrame({'Col_A': v})
#Data frame looks like this
In [33]: df
Out[33]:
Col_A
0 1.000000
1 2.000000
2 5.000000
3 inf
4 10.000000
5 5.000000
6 inf
7 5.000000
8 100.000000
9 2.000000
# Replace inf
df['Col_A'] = df['Col_A'].replace(np.inf, df['Col_A'].replace(np.inf, np.nan).max())
In[35]: df
Out[35]:
Col_A
0 1.0
1 2.0
2 5.0
3 100.0
4 10.0
5 5.0
6 100.0
7 5.0
8 100.0
9 2.0
Hope that works !
Use numpy clip. It's elegant and blazingly fast:
import numpy as np
import pandas as pd
df = pd.DataFrame({"x": [-np.inf, +np.inf, np.nan, 4, 3]})
df["x"] = np.clip(df["x"], -np.inf, 100)
# Out:
# x
# 0 -inf
# 1 100.0
# 2 NaN
# 3 4.0
# 4 3.0
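To get rid of the negative infinity as well, replace -np.inf in the call with a finite lower bound. NaN is always unaffected. To get the largest finite value to clip to, use df["x"].replace([np.inf, -np.inf], np.nan).max(); the built-in max(df["x"]) is not reliable here because of the inf and NaN entries.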
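The bounds do not have to be hard-coded. As a sketch, they can be taken from the finite values of the column itself, here with Series.clip rather than np.clip:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [-np.inf, np.inf, np.nan, 4, 3]})

# Keep only the finite entries (np.isfinite is False for inf, -inf and NaN).
finite = df["x"][np.isfinite(df["x"])]

# Clip both infinities to the finite range; NaN passes through unchanged.
df["x"] = df["x"].clip(lower=finite.min(), upper=finite.max())
print(df)
```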
Say I have a time series data as below.
df
   priceA  priceB
0   25.67   30.56
1   34.12   28.43
2   37.14   29.08
3     NaN   34.23
4   32.00     NaN
5   18.75   41.10
6     NaN   45.12
7   23.00   39.67
8     NaN   36.45
9   36.00     NaN
Now I want to fill the NaNs in column priceA with the mean of the previous N values in that column. In this case, take N=3.
For column priceB, I have to fill each NaN with the value M rows above (current index - M).
I tried writing a for loop for this, which is not good practice because my data is too large. Is there a better way to do this?
N = 3
M = 2
def fillPriceA(df, indexval, n):
    temp = []
    for i in range(n):
        temp.append(df.loc[indexval - (i + 1), 'priceA'])
    return np.nanmean(np.array(temp, dtype=float))
def fillPriceB(df, indexval, m):
    return df.loc[indexval - m, 'priceB']
for idx, rows in df.iterrows():
    if idx < N:
        continue
    if pd.isnull(rows['priceA']):
        df.loc[idx, 'priceA'] = fillPriceA(df, idx, N)
    if pd.isnull(rows['priceB']):
        df.loc[idx, 'priceB'] = fillPriceB(df, idx, M)
Expected output:
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 32.31 34.23
4 32 29.08
5 18.75 41.1
6 27.68 45.12
7 23 39.67
8 23.14 36.45
9 36 39.67
A solution could be to work only with the indices of the NaN values (see boolean indexing of a DataFrame):
param = dict(priceA=3, priceB=2)  # number of previous values to consider
for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over the NaN indices only
        _window = df.iloc[max(0, i - param[col]):i][col]  # the previous n rows
        # priceA: mean of the window; priceB: the value n rows above
        df.loc[i, col] = _window.mean() if col == 'priceA' else _window.iloc[0]
print(df)
Result:
priceA priceB
0 25.670000 30.56
1 34.120000 28.43
2 37.140000 29.08
3 32.310000 34.23
4 32.000000 29.08
5 18.750000 41.10
6 27.686667 45.12
7 23.000000 39.67
8 23.145556 36.45
9 36.000000 39.67
Note
1. Using np.isnan() implies that your columns are numeric. If they are not, convert them first with pd.to_numeric():
...
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors = 'coerce')
...
Or use pd.isnull() instead (see example below). Be aware of the performance difference (numpy is faster):
from random import randint
#A sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
    arr[randint(0,9999)] = np.nan
#Performances
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop
%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
2. A more generic alternative could be to define the window size and the method to apply for each column in a dict:
import numpy as np
import pandas as pd
param = {}
param['priceA'] = {'n': 3,
                   'method': lambda x: np.mean(x)}  # mean of the window
param['priceB'] = {'n': 2,
                   'method': lambda x: x[0]}        # the value n rows above
param now contains n, the number of elements, and method, a lambda expression. Rewrite the loops accordingly:
for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over the NaN indices only
        _window = df.iloc[max(0, i - param[col]['n']):i][col]  # the previous n rows
        df.loc[i, col] = param[col]['method'](_window.values)  # apply the column's method
print(df)  # This leads to a similar result.
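If the loop over the NaN positions turns out to be the bottleneck, the same sequential fill can be run on plain NumPy arrays. This is only a sketch: fill_sequential and its how argument are hypothetical names, and filled values are reused for later gaps just as in the expected output of the question.

```python
import numpy as np

def fill_sequential(values, n, how):
    """Fill each NaN from the n entries before it, reusing earlier fills."""
    out = np.asarray(values, dtype=float).copy()
    for i in np.flatnonzero(np.isnan(out)):
        window = out[max(0, i - n):i]
        if window.size == 0:
            continue  # nothing before the first element to fill from
        # "mean": average of the window; anything else: the value n rows above
        out[i] = np.nanmean(window) if how == "mean" else window[0]
    return out

price_a = [25.67, 34.12, 37.14, np.nan, 32, 18.75, np.nan, 23, np.nan, 36]
price_b = [30.56, 28.43, 29.08, 34.23, np.nan, 41.1, 45.12, 39.67, 36.45, np.nan]

a = fill_sequential(price_a, n=3, how="mean")
b = fill_sequential(price_b, n=2, how="first")
print(a.round(2), b)
```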
You can use an NA mask to do what you need per column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3,4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df
# a b
# 0 1.0 1.0
# 1 2.0 NaN
# 2 3.0 2.0
# 3 4.0 3.0
# 4 NaN 4.0
# 5 5.0 NaN
# 6 6.0 7.0
for col in df.columns:
    s = df[col]
    na_indices = s[s.isnull()].index.tolist()
    prev = 0
    for k in na_indices:
        s[k] = np.mean(s[prev:k])
        prev = k
    df[col] = s
print(df)
#      a    b
# 0 1.0 1.0
# 1 2.0 1.0
# 2 3.0 2.0
# 3 4.0 3.0
# 4 2.5 4.0
# 5 5.0 2.5
# 6 6.0 7.0
While this is still a custom operation, it should be faster because it iterates only over the NA values rather than over every row, and I assume those are sparse compared to the actual data.
To fill priceA use rolling, then shift and use this result in fillna,
# make some data
df = pd.DataFrame({'priceA': range(10)})
#make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan
n = 3
df.priceA = df.priceA.fillna(df.priceA.rolling(n, min_periods=1).mean().shift(1))
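The only edge case here is when two NaNs are within n of one another: the rolling mean then averages only the non-missing values in the window rather than the previously filled ones, so the result can differ from the sequential fill in your question.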
For priceB just use shift,
df = pd.DataFrame({'priceB': range(10)})
df.loc[[4, 8], 'priceB'] = np.nan
m = 2
df.priceB = df.priceB.fillna(df.priceB.shift(m))
Like before, there is an edge case: a NaN exactly m rows before another NaN leaves the second one unfilled.
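Applied to the data from the question, the two one-liners look like this. It is a sketch; the names a_fill and filled are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "priceA": [25.67, 34.12, 37.14, np.nan, 32, 18.75, np.nan, 23, np.nan, 36],
    "priceB": [30.56, 28.43, 29.08, 34.23, np.nan, 41.1, 45.12, 39.67, 36.45, np.nan],
})
N, M = 3, 2

# Mean of the previous N rows, computed on the original (unfilled) column.
a_fill = df["priceA"].rolling(N, min_periods=1).mean().shift(1)

filled = pd.DataFrame({
    "priceA": df["priceA"].fillna(a_fill),
    "priceB": df["priceB"].fillna(df["priceB"].shift(M)),
})
print(filled)
```

Rows 6 and 8 of priceA come out as 25.375 and 20.875 instead of the 27.68 and 23.14 in the expected output, because their windows contain an earlier NaN that is skipped rather than replaced by its filled value.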
As my flowchart above illustrates, I want to take 2 rows from the original dataset and copy them into another one (because I don't want to modify the original data). In the new dataset, I check whether the 2 rows have the same number of non-NaN values (df.iloc[i,:].count()); if not, I fill the difference with zeros and then continue with the operation.
Example:
Original Data:
3 5 5 NaN NaN NaN
1 4 7 5 NaN NaN
NaN NaN 3 6 7 NaN
NaN 3 8 4 11 NaN
3 0 3 7 2 1
Take 2 row i and i+1 and import them to another dataset:
3 5 5 NaN NaN NaN
1 4 7 5 NaN NaN
Because df.iloc[i+1,:].count() != df.iloc[i,:].count() , then the row with more NaN value must be filled like this:
3 5 5 0 NaN NaN
1 4 7 5 NaN NaN
In case of row 3 and 4
NaN 0 3 6 7 NaN
NaN 3 8 4 11 NaN
And then perform the operation.
Here is my code:
for i in range():
    process[1,:] = df.iloc[i,:]
    process[2,:] = df.iloc[i+1,:]
    while True:
        if process[1,:].count() == process[2,:].count():
            break
        else:
            if process[1,:].count() > process[2,:].count():
                process[2,:] = process[2,:].fillna(value = 0, limit = process[1,:].count() - process[2,:].count())
            else:
                process[2,:] = process[2,:].fillna(value = 0, limit = process[2,:].count() - process[1,:].count())
    A[i,:] = stats.ttest_rel(process[1,:].values, process[2,:].values) #this line is just for the statistical test, you can ignore it
    i += 1
My algorithm didn't work, and it feels clumsy to check the rows over and over again.
Any suggestions and corrections are welcome, thank you very much.
P/s: I want to run a statistical test on every pair of consecutive rows, so beforehand I need them to have the same number of non-NaN values.
Finally, I came up with an answer, and I want to share it here:
for i in range(len(df) - 1):
    # copy rows i and i+1 so the original data stays untouched
    process = df.iloc[[i, i + 1]].reset_index(drop=True)
    diff = process.iloc[0].count() - process.iloc[1].count()
    if diff > 0:
        # second row has fewer values: turn its first `diff` NaNs into 0
        process.iloc[1] = process.iloc[1].fillna(value=0, limit=diff)
    elif diff < 0:
        # first row has fewer values: pad it instead
        process.iloc[0] = process.iloc[0].fillna(value=0, limit=-diff)
    # ... perform the operation on `process` here
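The padding step can also be pulled out into a small helper and applied to every pair of consecutive rows. This is a sketch on the example data from the question; equalize_counts is a made-up name, and like fillna with limit it pads the leftmost NaNs first.

```python
import numpy as np
import pandas as pd

def equalize_counts(a, b):
    """Pad the sparser of two rows with zeros until both hold the same number of values."""
    diff = a.count() - b.count()
    if diff > 0:
        b = b.fillna(0, limit=diff)
    elif diff < 0:
        a = a.fillna(0, limit=-diff)
    return a, b

nan = np.nan
df = pd.DataFrame([
    [3, 5, 5, nan, nan, nan],
    [1, 4, 7, 5, nan, nan],
    [nan, nan, 3, 6, 7, nan],
    [nan, 3, 8, 4, 11, nan],
    [3, 0, 3, 7, 2, 1],
])

# fillna returns new Series, so df itself is never modified.
pairs = [equalize_counts(df.iloc[i], df.iloc[i + 1]) for i in range(len(df) - 1)]
first_a, first_b = pairs[0]
print(first_a.tolist(), first_b.tolist())
```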