Background:
In pandas, if I use the following:
df.sum(axis=1)
It returns sum of each row.
In the same manner, I expect the following to drop any row that contains a missing value:
df.dropna(how='any', axis=1)
But the above code line actually drops any column that contains missing values rather than dropping rows with missing values.
The Question: I understand why the first line returns the sum of each row; but how come dropna(axis=1) drops columns?
=========
To clarify the question, I have provided the following example:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), columns=list('ABC'))
df.iloc[0, 0] = np.nan  # inject the missing values shown below; randint alone cannot produce NaN
df.iloc[3, 2] = np.nan
df
A B C
0 NaN 9 4.0
1 8.0 8 1.0
2 5.0 3 6.0
3 3.0 3 NaN
df.sum(axis=1)
0 13.0
1 17.0
2 14.0
3 6.0
df.dropna(how='any', axis=1)
B
0 9
1 8
2 3
3 3
df.sum(axis=1) sums across the columns: for each row, the values in all of its columns are added together, so the result has one entry per row. sum aggregates and therefore reduces; with axis=1 it is the column dimension that gets collapsed, so the columns' values are the ones being summed together, not separate rows.
df.sum(axis=0) is the mirror image: it collapses the row dimension, summing down each column.
axis=1 references the columns for dropna too: df.dropna(how='any', axis=1) looks for NaN values and, if a column contains one, that column is dropped. In both cases axis=1 names the axis the operation acts on: sum reduces it, and dropna removes labels from it.
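One way to make the convention stick: axis=1 can also be spelled axis='columns', and the alias names the axis the operation acts on. A minimal sketch using the df defined above:
df.sum(axis='columns')  # same as axis=1: reduces the column axis, one value per row
df.dropna(how='any', axis='columns')  # same as axis=1: drops any column that contains NaN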
Related
I have a dataset that I want to groupby("CustomerID") and fill NaNs with the nearest number within the group.
I can fill by the nearest number regardless of group like this:
df['num'] = df['num'].interpolate(method="nearest")
When I tried:
df['num'] = df.groupby('CustomerID')['num'].transform(lambda x: x.interpolate(method="nearest"))
I got ValueError: x and y arrays must have at least 2 entries, which I assume is because some customers have only one entry with NaN, or only NaNs.
However, when I extracted a select few rows that should have worked and made a new dataframe from them, nothing happened.
Is there a way I can group by customerID and fill NaNs with nearest number within the group, and skip customers with only NaNs or just one observation?
I ran into the same "ValueError: x and y arrays must have at least 2 entries" in my code. Adapted to your code (which I obviously could not reproduce), here is how I solved the problem:
import pandas as pd
import numpy as np
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
It does the following:
The first "groupby + apply" interpolates each group with the method 'nearest' ONLY if the group has at least two non NaNs values.
np.isnan(group) returns an array containing True where group has NaNs and False where it has values.
np.count_nonzero(np.isnan(group)) returns the number of True in the previous array (i.e. the number of NaNs in the group).
If the number of NaNs is strictly smaller than the length of the group minus 1 (i.e. there are at least two non NaNs in the group), the group is interpolated using 'nearest', otherwise it is left untouched.
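As a quick illustration of that guard on made-up toy arrays:
import numpy as np
np.count_nonzero(np.isnan(np.array([1.0, np.nan, 2.0])))  # 1, which is < 3 - 1, so this group would be interpolated
np.count_nonzero(np.isnan(np.array([np.nan, 1.0, np.nan])))  # 2, not < 3 - 1, so this group would be left untouched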
The second "groupby + apply" finishes to interpolate each group, using method='linear' and argument limit_direction='both'.
If a group was fully interpolated in the previous step: nothing
happens.
If a group had only one non NaN value (therefore was left
untouched in the previous step): The non NaN value will be used to
fill the entire group.
If a group had only NaNs (therefore was left untouched in the previous step): the group remains full of NaNs.
Here's a dummy example using your notations:
df=pd.DataFrame({'CustomerID':['a']*3+['b']*3+['c']*3,'num':[1,np.nan,2,np.nan,1,np.nan,np.nan,np.nan,np.nan]})
df
CustomerID num
0 a 1.0
1 a NaN
2 a 2.0
3 b NaN
4 b 1.0
5 b NaN
6 c NaN
7 c NaN
8 c NaN
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df
CustomerID num
0 a 1.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 1.0
5 b NaN
6 c NaN
7 c NaN
8 c NaN
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
df
CustomerID num
0 a 1.0
1 a 1.0
2 a 2.0
3 b 1.0
4 b 1.0
5 b 1.0
6 c NaN
7 c NaN
8 c NaN
EDIT: important note
The interpolate method 'nearest' uses the numerical values of the index (see documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html). It works well in my dummy example above because the index is clean. If the index of your dataframe is messy (e.g. after concatenating dataframes) you may want to do df.reset_index(inplace=True) before you interpolate.
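For example, a minimal sketch of that fix, assuming the same df and column names as above:
df.reset_index(drop=True, inplace=True)  # drop=True discards the messy old index instead of keeping it as a column
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)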
I have to fill the NaN values of a column in a dataframe with the mean of the previous 3 instances.
Here is the following example:
df = pd.DataFrame({'col1': [1, 3, 4, 5, np.NaN, np.NaN, np.NaN, 7]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 NaN
5 NaN
6 NaN
7 7.0
And here is the output I need:
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0
5 4.3
6 4.4
7 7.0
I tried pd.rolling, but it does not work the way I want when the column has more than one NaN value in a row:
df.fillna(df.rolling(3, min_periods=1).mean().shift())
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.NaN, 4, 5])
6 5.0 # np.nanmean([np.NaN, np.NaN, 5])
7 7.0
Can someone help me with that? Thanks in advance!
Probably not the most efficient, but it's terse and gets the job done:
from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)
output
col1
0 1.000000
1 3.000000
2 4.000000
3 5.000000
4 4.000000
5 4.333333
6 4.444444
7 7.000000
We basically use fillna but require min_periods=3, meaning it will only fill a single NaN at a time, or rather only those NaNs that have three non-NaN numbers immediately preceding them. Then we use reduce to repeat this operation as many times as there are NaNs in col1.
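The reduce call is just a compact way of writing the following explicit loop (an equivalent sketch, assuming the same df):
for _ in range(df['col1'].isna().sum()):
    df = df.fillna(df.rolling(3, min_periods=3).mean().shift())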
I tried two approaches to this problem. One is a loop over the dataframe, and the second is essentially trying the approach you suggest multiple times, to converge on the right answer.
Loop approach
For each row in the dataframe, get the value from col1. Then take the average of the last 3 values. (There can be fewer than 3 in this list if we're at the beginning of the dataframe.) If the value is NaN, replace it with that average. Then save the value back into the dataframe and append it to the running list. If the list of recent values has more than 3 entries, remove the oldest one.
def impute(df2, col_name):
    last_3 = []
    for index in df2.index:  # was df.index, which accidentally referenced a global
        val = df2.loc[index, col_name]
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            imputed = None
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        if len(last_3) > 3:
            last_3.pop(0)
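Called on the question's frame it fills the values in place, and the result should match the output of the reduce answer above:
impute(df, 'col1')
df
col1
0 1.000000
1 3.000000
2 4.000000
3 5.000000
4 4.000000
5 4.333333
6 4.444444
7 7.000000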
Repeated column operation
The core idea here is to notice that in your example of pd.rolling, the first NA replacement value is correct. So, you apply the rolling average, take the first NA value for each run of NA values, and use that number. If you apply this repeatedly, you fill in the first missing value, then the second missing value, then the third. You'll need to run this loop as many times as the length of the longest run of consecutive NA values.
def impute(df2, col_name):
    while df2[col_name].isna().any():
        # If there are multiple NA values in a row, identify just
        # the first one in each run
        first_na = df2[col_name].isna().diff() & df2[col_name].isna()
        # Compute the mean of the previous 3 values
        imputed = df2.rolling(3, min_periods=1).mean().shift()[col_name]
        # Replace NA values with the mean only if they are the very first
        # NA value in a run of NA values
        df2.loc[first_na, col_name] = imputed
Performance comparison
Running both of these on an 80,000-row dataframe, I get the following results:
Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds
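For reference, a rough sketch of how such a comparison can be set up (the 80,000-row frame and the NaN pattern here are made up for illustration):
import time
big = pd.DataFrame({'col1': np.random.rand(80000)})
big.loc[big.index[5::10], 'col1'] = np.nan  # blank out every 10th value
start = time.perf_counter()
impute(big, 'col1')  # either version of impute
print(time.perf_counter() - start)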
How do I combine values from two rows that have an identical index and no overlapping values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Use stack(), which drops all NaNs, and then unstack(). Because the duplicate 'b' rows hold values in disjoint columns, they collapse into a single row:
df.stack().unstack()
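which should reproduce the desired output:
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0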
For a simpler solution that takes the first non-missing value per index label, use GroupBy.first:
df1 = df.groupby(level=0).first()
If summing per label gives the same output for your sample data, use sum:
df1 = df.sum(level=0)
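Note: in recent pandas versions the level argument of sum has been removed, so the equivalent there is a groupby:
df1 = df.groupby(level=0).sum()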
If there are multiple non-missing values per group, it is necessary to specify the expected output; obviously that case is more complicated.
I'm trying to count the NaN elements (data type numpy.float64) in a pandas Series (data type pandas.core.series.Series) to know how many there are.
This is for counting null values in a pandas Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but instead it shows 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
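For example, on a tiny made-up frame:
df = pd.DataFrame({'x': [1.0, np.nan, np.nan]})
df.isna().values.sum()  # 2: NaNs across the entire frame
len(df) - df['x'].count()  # 2: NaNs in the 'x' column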
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size returns the total element count of the dataframe, including NaN.
oc.count().sum() returns the total element count of the dataframe, excluding NaN.
Therefore, another way to count the number of NaNs in a dataframe is to subtract one from the other:
NaN_count = oc.size - oc.count().sum()
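For example, on a hypothetical 2x2 frame with two NaNs:
aa = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})
aa.size - aa.count().sum()  # 4 - 2 = 2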
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
(stack() drops NaNs by default, so the difference is exactly the NaN count)
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1,2,np.nan],[3,np.nan,5],[8,7,6],
                            [np.nan,np.nan,0]]), columns=['a','b','c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaNs by column, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaNs:
aa.isnull().values.sum()
4
I have created a function that replaces the NaNs in a Pandas dataframe with the means of the respective columns. I tested the function with a small dataframe and it worked. When I applied it, though, to a much larger dataframe (30,000 rows, 9 columns), I got the error message: IndexError: index out of bounds
The function is the following:
# The 'update' function will replace all the NaNs in a dataframe with the mean of the respective columns
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean()[i]  # subsets the df using the isnull() method, extracting
                                                              # the positions in each column where the values are NaN
    return(df)
The small dataframe I used to test the function is the following:
0 1 2 3
0 NaN NaN 3 4
1 NaN NaN 7 8
2 9.0 10.0 11 12
Could you explain the error? Your advice will be appreciated.
I would use DataFrame.fillna() method in conjunction with DataFrame.mean() method:
In [130]: df.fillna(df.mean())
Out[130]:
0 1 2 3
0 9.0 10.0 3 4
1 9.0 10.0 7 8
2 9.0 10.0 11 12
Mean values:
In [138]: df.mean()
Out[138]:
0 9.0
1 10.0
2 7.0
3 8.0
dtype: float64
The reason you are getting "index out of bounds" is that you are assigning the value df.mean()[i], where i iterates over what are supposed to be ordinal positions. df.mean() is a Series whose index consists of the columns of df, so df.mean()[something] implies that something had better be a column name. Your integer positions aren't column names, and that's why you get your error.
Your code... fixed:
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean().iloc[i]  # .iloc looks up the mean by position, extracting
                                                                   # the positions in each column where the values are NaN
    return(df)
Also, your function is altering the df directly. You may want to be careful. I'm not sure that's what you intended.
All that said, I'd recommend another approach:
def update(df):
    return df.where(df.notnull(), df.mean(), axis=1)
You could use any number of methods to fill missing with the mean. I'd suggest using @MaxU's answer.
df.where
takes values from df where the first argument is True, otherwise from the second argument
df.where(df.notnull(), df.mean(), axis=1)
df.combine_first with awkward pandas broadcasting
df.combine_first(pd.DataFrame([df.mean()], df.index))
np.where
pd.DataFrame(
    np.where(
        df.notnull(), df.values,
        np.nanmean(df.values, 0, keepdims=1)),
    df.index, df.columns)
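On the question's small frame, the np.where version should give the following (the integer columns are upcast to float because everything passes through a single NumPy array):
0 1 2 3
0 9.0 10.0 3.0 4.0
1 9.0 10.0 7.0 8.0
2 9.0 10.0 11.0 12.0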