I want to calculate a new column in a pandas DataFrame, but some rows contain missing values. For those missing values, I want to use a different algorithm. Let's say:
If column B contains a value, then subtract A from B.
If column B does not contain a value, then subtract A from C.
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
df['calc'] = df['b']-df['a']
results in:
print(df)
   a    b  c  calc
0  1  1.0  2   0.0
1  2  1.0  2  -1.0
2  3  NaN  2   NaN
3  4  1.0  2  -3.0
Approach 1: fill the NaN rows using .where:
df['calc'].where(df['b'].isnull()) = df['c']-df['a']
which results in SyntaxError: cannot assign to function call.
Approach 2: fill the NaN rows using .iterrows():
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c'] - row['a']
        print(i)
    else:
        i = row['b'] - row['a']
        print(i)
is executed without errors and the calculation is correct; these i values are printed to the console:
0.0
-1.0
-1.0
-3.0
but the values are not written into df['calc']; the DataFrame remains as is:
print(df['calc'])
0    0.0
1   -1.0
2    NaN
3   -3.0
What is the correct way of overwriting the NaN values?
Finally, I stumbled upon .fillna:
df['calc'] = df['calc'].fillna(df['c'] - df['a'])
gets the job done! Can anyone explain what is wrong with the two approaches above?
Approach 2:
You are assigning the result to the local variable i, but that won't modify your original DataFrame; you have to write the value back:
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c'] - row['a']
        print(i)
    else:
        i = row['b'] - row['a']
        print(i)
    df.loc[index, 'calc'] = i  # <------------- write the value back here
Also, don't use iterrows(); it is too slow.
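A vectorized sketch of the same fix, using a boolean mask with .loc instead of any iteration:
# fill only the rows where b is NaN, all at once
mask = df['b'].isnull()
df.loc[mask, 'calc'] = df.loc[mask, 'c'] - df.loc[mask, 'a']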
Approach 1:
The pandas where() method checks a Series (or DataFrame) against a condition and keeps the values where the condition holds; by default, the rows not satisfying the condition are replaced with NaN, or with the second argument if one is given.
The assignment has to go on the left-hand side:
df['calc'] = df['calc'].where(df['b'].isnull(), df['c'] - df['a'])
but this keeps calc only where b is NaN (leaving those NaNs in place) and overwrites every other row with c - a, which is the opposite of what you want.
Use:
df['calc'] = df['calc'].where(~df['b'].isnull(), df['c'] - df['a'])
or, with NumPy:
import numpy as np
df['calc'] = np.where(df['b'].isnull(), df['c'] - df['a'], df['calc'])
Instead of subtracting a from b and then a from c, you can first fill the NaN values in column b with the values from column c, and then subtract column a:
df['calc'] = df['b'].fillna(df['c']) - df['a']
   a    b  c  calc
0  1  1.0  2   0.0
1  2  1.0  2  -1.0
2  3  NaN  2  -1.0
3  4  1.0  2  -3.0
Related
I have a dataset that I want to groupby("CustomerID") and fill NaNs with the nearest number within the group.
I can fill by nearest number regardless of group like this:
df['num'] = df['num'].interpolate(method="nearest")
When I tried:
df['num'] = df.groupby('CustomerID')['num'].transform(lambda x: x.interpolate(method="nearest"))
I got ValueError: x and y arrays must have at least 2 entries, which I assume is because
some customers only have one entry with NaN or only NaNs.
However, when I extracted a select few rows that should have worked and made a new dataframe, nothing happened.
Is there a way I can group by customerID and fill NaNs with nearest number within the group, and skip customers with only NaNs or just one observation?
I ran into the same "ValueError: x and y arrays must have at least 2 entries" in my own code. Adapted to your code (which I obviously could not reproduce), here is how I solved the problem:
import pandas as pd
import numpy as np
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
It does the following:
The first "groupby + apply" interpolates each group with the method 'nearest' ONLY if the group has at least two non NaNs values.
np.isnan(group) returns an array containing True where group has NaNs and False where it has values.
np.count_nonzero(np.isnan(group)) returns the number of True in the previous array (i.e. the number of NaNs in the group).
If the number of NaNs is strictly smaller than the length of the group minus 1 (i.e. there are at least two non NaNs in the group), the group is interpolated using 'nearest', otherwise it is left untouched.
The second "groupby + apply" finishes to interpolate each group, using method='linear' and argument limit_direction='both'.
If a group was fully interpolated in the previous step: nothing
happens.
If a group had only one non NaN value (therefore was left
untouched in the previous step): The non NaN value will be used to
fill the entire group.
If a group had only NaNs (therefore was left untouched in the previous step): the group remains full of NaNs.
Here's a dummy example using your notations:
df=pd.DataFrame({'CustomerID':['a']*3+['b']*3+['c']*3,'num':[1,np.nan,2,np.nan,1,np.nan,np.nan,np.nan,np.nan]})
df
  CustomerID  num
0          a  1.0
1          a  NaN
2          a  2.0
3          b  NaN
4          b  1.0
5          b  NaN
6          c  NaN
7          c  NaN
8          c  NaN
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df
  CustomerID  num
0          a  1.0
1          a  1.0
2          a  2.0
3          b  NaN
4          b  1.0
5          b  NaN
6          c  NaN
7          c  NaN
8          c  NaN
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
df
  CustomerID  num
0          a  1.0
1          a  1.0
2          a  2.0
3          b  1.0
4          b  1.0
5          b  1.0
6          c  NaN
7          c  NaN
8          c  NaN
EDIT: important note
The interpolation method 'nearest' uses the numerical values of the index (see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html). It works well in the dummy example above because the index is clean. If the index of your dataframe is messy (e.g. after concatenating dataframes), you may want to do df.reset_index(inplace=True) before you interpolate.
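A minimal sketch of that cleanup on toy data (the values here are assumed purely for illustration):
import pandas as pd
import numpy as np

# hypothetical: concatenation leaves duplicate index labels 0, 1, 0, 1
s = pd.concat([pd.Series([1.0, np.nan]), pd.Series([np.nan, 2.0])])
s = s.reset_index(drop=True)  # restore a clean RangeIndex first
print(s.interpolate(method='nearest'))  # each NaN is filled from its nearest neighbour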
I'm trying to compute the minimum for each row in a pandas DataFrame.
I would like to add a column that holds the row-wise minimum, ignoring NaN and "WD" values.
For example
A  B   C    D
1  3   2    WD
3  WD  NaN  2
should give me a new column like
Min
1
2
I tried df.where(df > 0).min(axis=1)
and df.where(df != "NaN").min(axis=1) without success
Convert the values to numeric, coercing anything non-numeric to NaN with errors='coerce' in to_numeric (applied column-wise via DataFrame.apply), so that min can be used:
df['Min'] = df.apply(pd.to_numeric, errors='coerce').min(axis=1)
print (df)
   A   B    C   D  Min
0  1   3  2.0  WD  1.0
1  3  WD  NaN   2  2.0
The intent here (a row-wise min that skips NaN and ignores 'WD') can be written as valid pandas like this:
df['min'] = df.min(axis=1, skipna=True, numeric_only=True)
Note that skipna takes a boolean, not a list of values to skip, and that numeric_only=True drops whole non-numeric columns rather than individual 'WD' cells, so the to_numeric coercion shown above is the more robust approach.
I have to fill the nan values of a column in a dataframe with the mean of the previous 3 instances.
Here is the following example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df
   col1
0   1.0
1   3.0
2   4.0
3   5.0
4   NaN
5   NaN
6   NaN
7   7.0
And here is the output I need:
   col1
0   1.0
1   3.0
2   4.0
3   5.0
4   4.0
5   4.3
6   4.4
7   7.0
I tried pd.rolling, but it does not work the way I want when the column has more than one NaN value in a row:
df.fillna(df.rolling(3, min_periods=1).mean().shift())
   col1
0   1.0
1   3.0
2   4.0
3   5.0
4   4.0  # np.nanmean([3, 4, 5])
5   4.5  # np.nanmean([np.nan, 4, 5])
6   5.0  # np.nanmean([np.nan, np.nan, 5])
7   7.0
Can someone help me with that? Thanks in advance!
Probably not the most efficient, but it's terse and gets the job done:
from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)
output
       col1
0  1.000000
1  3.000000
2  4.000000
3  5.000000
4  4.000000
5  4.333333
6  4.444444
7  7.000000
We basically use fillna but require min_periods=3, meaning it will only fill one NaN at a time (more precisely, only those NaNs that have three non-NaN numbers immediately preceding them). Then we use reduce to repeat this operation as many times as there are NaNs in col1.
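The reduce call is equivalent to this explicit loop, repeating the single-NaN fill once per NaN in col1:
# unrolled version of the reduce above: same data, same logic
for _ in range(df['col1'].isna().sum()):
    df = df.fillna(df.rolling(3, min_periods=3).mean().shift())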
I tried two approaches to this problem. One is a loop over the dataframe, and the second is essentially trying the approach you suggest multiple times, to converge on the right answer.
Loop approach
For each row in the dataframe, get the value from col1. Then, take the average of the values from the last rows (there can be fewer than 3 in this list if we're at the beginning of the dataframe). If the value is NaN, replace it with that average. Then, save the value back into the dataframe. If the list of recent values now has more than 3 entries, remove the oldest one.
def impute(df2, col_name):
    last_3 = []
    for index in df2.index:
        val = df2.loc[index, col_name]
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            imputed = None
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        if len(last_3) > 3:
            last_3.pop(0)
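The function mutates its argument in place; a hypothetical call on the example frame:
impute(df, 'col1')  # each NaN in col1 becomes the mean of the previous 3 values
print(df)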
Repeated column operation
The core idea here is to notice that in your pd.rolling example, the first NA replacement value is correct. So, you apply the rolling average, take the first NA value in each run of NA values, and use that number. Applying this repeatedly fills in the first missing value, then the second, then the third. You'll need to run this loop as many times as the longest run of consecutive NA values.
def impute(df2, col_name):
    while df2[col_name].isna().any():
        # If there are multiple NA values in a row, identify just
        # the first one of each run
        first_na = df2[col_name].isna().diff() & df2[col_name].isna()
        # Compute the mean of the previous 3 values
        imputed = df2.rolling(3, min_periods=1).mean().shift()[col_name]
        # Replace NA values with that mean, but only where they are the
        # very first NA value in a run of NA values
        df2.loc[first_na, col_name] = imputed
Performance comparison
Running both of these on an 80000 row dataframe, I get the following results:
Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds
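A hypothetical harness for reproducing the comparison (the 80000-row size matches the text; the NaN rate and placement are assumptions):
import time
import numpy as np
import pandas as pd

# assumed test data: 80000 rows with roughly 10% NaN, row 0 kept non-NaN
df_big = pd.DataFrame({'col1': np.random.rand(80000)})
na_rows = df_big.iloc[1:].sample(frac=0.1).index
df_big.loc[na_rows, 'col1'] = np.nan

start = time.time()
impute(df_big, 'col1')  # either impute() version defined above
print(f"{time.time() - start:.3f} seconds")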
I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series (data type class 'pandas.core.series.Series') to find out how many there are.
This is for counting null values in a pandas Series:
import pandas as pd
oc = pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but instead it shows 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the index level to count within).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
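A minimal sketch on a toy Series (the values here are assumed):
import pandas as pd
import numpy as np

oc = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan])
print(oc.isna().sum())       # 3
print(len(oc) - oc.count())  # 3: count() skips NaN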
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size returns the total element count of the dataframe, including NaN.
oc.count().sum() returns the total element count of the dataframe, excluding NaN.
Therefore, another way to count the number of NaNs in a dataframe is to subtract the two:
NaN_count = oc.size - oc.count().sum()
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
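The second expression works because stack() drops NaN cells by default, so the difference between the total cell count and the stacked length is the NaN count; a quick check on assumed toy data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 2.0]})
print(len(df) * len(df.columns) - len(df.stack()))  # 2 NaN cells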
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1, 2, np.nan], [3, np.nan, 5], [8, 7, 6],
                            [np.nan, np.nan, 0]]), columns=['a', 'b', 'c'])
     a    b    c
0  1.0  2.0  NaN
1  3.0  NaN  5.0
2  8.0  7.0  6.0
3  NaN  NaN  0.0
To count NaN by column, you can try this:
aa.isnull().sum()
a    1
b    2
c    1
For the total count of NaN:
aa.isnull().values.sum()
4
I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400, may or may not be repeated, and a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that finds the first 1-1400 set and records a '1' at the matching indices, and, if a second set of 1-1400 exists, marks that one down as a '2' in the new column.
I can think of some roundabout ways using temporary placeholders and other kinds of checks, but I was wondering if there is a 1-3 liner to do this operation.
Edit 1: I would prefer a single column to be returned:
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# leave np.nan rows as NaN
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment that this 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
You can use diff to check when the difference between two consecutive values is negative, which marks the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# create a dataframe with two columns; my ranges go up to 12, but 1400 works the same
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have NaNs, you need ffill to fill each NaN with the previous value, and you want the opposite (using ~) of the rows where the diff is greater than or equal to 0. (This is not quite the same as 'diff less than 0': the first row's diff is NaN, which fails both comparisons, so only the negated form captures the very first row.) For column 'a', for example:
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0    1.0
6    2.0
Name: a, dtype: float64
you get the two rows where a "new" range starts. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select the not-null rows in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several columns, you can do:
list_col = ['a', 'b']
for col in list_col:
    df.loc[~(df[col].ffill().diff() >= 0), col + '1'] = 1
    mask = df[col].notnull()
    df.loc[mask, col + '1'] = df.loc[mask, col + '1'].cumsum().ffill()
and with my input, you get:
       a     b   a1   b1
0    1.0   1.0  1.0  1.0
1    4.0   2.0  1.0  1.0
2    NaN   3.0  NaN  1.0
3    5.0   4.0  1.0  1.0
4   10.0   NaN  1.0  NaN
5   12.0   6.0  1.0  1.0
6    1.0   7.0  2.0  1.0
7    3.0   8.0  2.0  1.0
8    4.0   NaN  2.0  NaN
9    NaN  10.0  NaN  1.0
10   8.0  11.0  2.0  1.0
11  12.0  12.0  2.0  1.0
EDIT: you can even do it in one line per column, with the same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
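Read step by step, the 'a' one-liner expands to the following equivalent sketch:
# unpacking df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
mask = df.a.notnull()              # ignore NaN rows entirely
steps = df.loc[mask, 'a'].diff()   # difference between consecutive non-NaN values
starts = steps.fillna(-1).lt(0)    # True at the first value and at each restart
df['a1'] = starts.cumsum()         # running count of starts = range number
# rows where a is NaN receive NaN in a1 automatically via index alignment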