I have created a function that replaces the NaNs in a Pandas dataframe with the means of the respective columns. I tested the function with a small dataframe and it worked. However, when I applied it to a much larger dataframe (30,000 rows, 9 columns), I got the error message: IndexError: index out of bounds
The function is the following:
# The 'update' function will replace all the NaNs in a dataframe with the mean of the respective columns
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the positions
        # in each column where the value is NaN, and assigns the column mean there
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean()[i]
    return df
The small dataframe I used to test the function is the following:
0 1 2 3
0 NaN NaN 3 4
1 NaN NaN 7 8
2 9.0 10.0 11 12
Could you explain the error? Your advice will be appreciated.
I would use the DataFrame.fillna() method in conjunction with the DataFrame.mean() method:
In [130]: df.fillna(df.mean())
Out[130]:
0 1 2 3
0 9.0 10.0 3 4
1 9.0 10.0 7 8
2 9.0 10.0 11 12
Mean values:
In [138]: df.mean()
Out[138]:
0 9.0
1 10.0
2 7.0
3 8.0
dtype: float64
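Note that fillna() returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep it (or use the inplace=True flag):
df = df.fillna(df.mean())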
The reason you are getting "index out of bounds" is that you are assigning the value df.mean()[i], where i is meant to be an ordinal position. But df.mean() is a Series whose index consists of the column labels of df, so df.mean()[something] expects something to be a column label. Your positions are not valid labels, and that's why you get the error.
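You can see the mismatch directly (a minimal sketch; the exact behaviour depends on your pandas version and column labels):
df.mean().index      # df's column labels, not necessarily 0..ncol-1
df.mean().iloc[i]    # unambiguous positional access: the mean of the i-th numeric column
df.mean()[i]         # interpreted by label (with a positional fallback in older pandas); fragile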
Your code, fixed:
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the positions
        # in each column where the value is NaN, and assigns the column mean there
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean().iloc[i]
    return df
Also, your function is altering the df directly. You may want to be careful. I'm not sure that's what you intended.
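If you would rather not mutate the caller's DataFrame, one option is to work on a copy inside the function (a minimal sketch, assuming all columns are numeric):
def update(df):
    df = df.copy()                              # work on a copy so the caller's frame is unchanged
    for i in range(df.shape[1]):                # loop over the columns by position
        col = df.iloc[:, i]
        df.iloc[:, i] = col.fillna(col.mean())  # fill this column's NaNs with its own mean
    return df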
All that said, I'd recommend another approach:
def update(df):
    return df.where(df.notnull(), df.mean(), axis=1)
You could use any number of methods to fill missing values with the mean. I'd suggest using @MaxU's answer.
df.where
keeps the values of df where the first argument is True and takes the second argument otherwise
df.where(df.notnull(), df.mean(), axis=1)
df.combine_first with awkward pandas broadcasting
df.combine_first(pd.DataFrame([df.mean()], df.index))
np.where
pd.DataFrame(
    np.where(
        df.notnull(), df.values,
        np.nanmean(df.values, 0, keepdims=1)),
    df.index, df.columns)
Related
Background:
In pandas, if I use the following:
df.sum(axis=1)
It returns the sum of each row.
In the same manner, I expect the following to drop any row that contains a missing value:
df.dropna(how='any', axis=1)
But the above code line actually drops any column that contains missing values rather than dropping rows with missing values.
The Question: I understand why the first line returns sum of rows; but how come dropna(axis=1) drops columns?
=========
To clarify the question, I have provided the following example:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), columns=list('ABC'))
df.loc[0, 'A'] = np.nan  # NaNs inserted here so the frame matches the one shown below
df.loc[3, 'C'] = np.nan
A B C
0 NaN 9 4.0
1 8.0 8 1.0
2 5.0 3 6.0
3 3.0 3 NaN
df.sum(axis=1)
0 13.0
1 17.0
2 14.0
3 6.0
df.dropna(how='any', axis=1)
B
0 9
1 8
2 3
3 3
df.sum(axis=1) sums along the column axis: for each row, the values across the columns are collapsed into a single number, so the result has one entry per row. sum aggregates and therefore reduces the axis you pass to it. df.sum(axis=0) does the opposite and collapses the rows, giving one sum per column.
dropna uses axis the same way: axis=1 refers to the columns, so df.dropna(how='any', axis=1) inspects each column and drops it if it contains a NaN.
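Using the example frame above, the contrast is easy to see (the axis you pass is the one that gets collapsed or dropped):
df.sum(axis=0)                 # one sum per column: the row axis is collapsed
df.sum(axis=1)                 # one sum per row: the column axis is collapsed
df.dropna(how='any', axis=0)   # drops rows 0 and 3, the rows containing a NaN
df.dropna(how='any', axis=1)   # drops columns A and C, keeping only B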
When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we also need to pass :, as if selecting all the columns?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the pandas documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc does not include the last index of the slice).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of the slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code would get you a new column containing sums across columns
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Thus you explicitly tell pandas to apply sum across the columns. However, your original syntax is more succinct and preferable to explicit function application. The result is the same, as one would expect.
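Applied to the frame from this example, the pattern from your question would look like this (using "Max" only to mirror the question; the slice 1:3 picks columns b and c here):
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
df
   a  b  c  Max
0  0  2  4    6
1  1  3  5    8
2  2  4  6   10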
I have a DataFrame with a two-level column index.
When I try to add a list of values (of arbitrary length) to one of the columns I get an error:
mydf['a','curr(A)'] = [6,6,6,6,6]
or
mydf['a','curr(A)'] = [6,6]
gives the following error:
"ValueError: Length of values does not match length of index"
But this works:
mydf['a','curr(A)'] = [6,6,6]
How can I add an arbitrary number of entries to a column and pad the DataFrame with NaN's when necessary? Is there a parameter I can set when defining the DataFrame to do this padding automatically?
Thanks for your help.
I think the best way to do this would be something with concat
df2 = pd.DataFrame({
    0: [1, 2, 3],
    1: [1, 2, 3],
    2: [4, 5, 6]
})
row = pd.Series([6, 6, 6, 6])
pd.concat([df2, row], axis=0, ignore_index=True)
Results:
0 1 2
0 1 1.0 4.0
1 2 2.0 5.0
2 3 3.0 6.0
3 6 NaN NaN
4 6 NaN NaN
5 6 NaN NaN
6 6 NaN NaN
I don't think you are able to do this by just assigning the values to a column
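If you do want the assignment style from the question, one option (a sketch, assuming mydf has a default 0-based integer index) is to grow the index first and let alignment pad everything else with NaN:
values = [6, 6, 6, 6, 6]
mydf = mydf.reindex(range(max(len(mydf), len(values))))  # extend the index; existing columns are padded with NaN
mydf['a', 'curr(A)'] = pd.Series(values)                 # shorter lists are likewise padded with NaN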
Turn the sequence into another df (with the same column names) and then use .combine_first().
df_val = pd.DataFrame({('a', 'curr(A)'): [6, 6, 6]})
df_final = mydf.combine_first(df_val)
I found a workaround to solve my specific problem but it only works because I have all the columns I want in the dataframe ahead of time.
# 2 pairs of lists I want to use as column data.
mydf = pd.DataFrame([[1,2],[3,4],[5,6,7,8,9],[-3,4,-5,6,12]])
mydf = mydf.transpose() # Transpose to go from 4 rows to 4 columns.
# Create multilevel index with 4 indices
multi_idx = pd.MultiIndex.from_product([['a', 'b'], ['curr(A)', 'volt(V)']])
for col in mydf.columns:  # loop through to rename each column
    mydf = mydf.rename(columns={col: multi_idx[col]})
It works, but it seems like there must be a simpler way to do this.
Thanks for your help everyone!
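For what it's worth, the renaming loop can probably be replaced by assigning the MultiIndex in one step (a sketch, under the same setup as above):
mydf.columns = multi_idx  # the 4 transposed columns take the 4 MultiIndex labels directly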
I am trying to multiply dataframe 1 column a by dataframe 2 column b.
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'])
pnlValue column has many numbers and fx_rate column is just the one number.
The code executes but my end result ends up with tons of NaN.
Any help would be appreciated.
It is probably due to the index of your dataframe. You need to use df_fxrate['fx_rate'].values:
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'].values)
or better:
combineQueryandBookFiltered['pnlValue']=combineQueryandBookFiltered['pnlValue']*df_fxrate['fx_rate'].values
Here is an example:
df1=pd.DataFrame(index=[1, 2])
df2=pd.DataFrame(index=[0])
df1['col1']=[1,1]
print(df1)
col1
1 1
2 1
df2['col1']=[1]
print(df2)
col1
0 1
print(np.multiply(df1['col1'],df2['col1']))
0 NaN
1 NaN
2 NaN
As you can see, the multiplication is aligned on the index.
So you need something like this:
np.multiply(df1['col1'],df2['col1'].values)
or
df1['col1']*df2['col1'].values
Output:
1    1
2    1
Name: col1, dtype: int64
As you can see, now only the index of df1['col1'] is used.
Hi excelguy,
Is there a reason why you can't use the simple column multiplication?
df['C'] = df['A'] * df['B']
As was pointed out, multiplication of two Series is aligned on their indices, and it's likely that your fx_rate Series does not have the same index as the pnlValue Series.
But since your fx_rate is only one value, I suggest multiplying your dataframe with a scalar instead:
fx_rate = df_fxrate['fx_rate'].iloc[0]
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * fx_rate
I'm trying to count the NaN elements (data type: class 'numpy.float64') in a pandas Series (data type: class 'pandas.core.series.Series') to know how many there are.
This is for counting the null values in a pandas Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but instead it shows: 'Level NaN must be same as name (None)'
The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size: returns the total number of elements in the dataframe, including NaN
oc.count().sum(): returns the total number of elements in the dataframe, excluding NaN
Therefore, another way to count the number of NaN in a dataframe is to subtract the two:
NaN_count = oc.size - oc.count().sum()
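A quick check on a throwaway frame (not your csv) shows the two numbers and their difference:
import numpy as np
import pandas as pd
demo = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, np.nan, 6.0]})
demo.size                        # 6 cells in total
demo.count().sum()               # 3 non-NaN cells
demo.size - demo.count().sum()   # 3 NaNs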
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1, 2, np.nan], [3, np.nan, 5], [8, 7, 6],
                            [np.nan, np.nan, 0]]), columns=['a', 'b', 'c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaN by column, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaN:
aa.isnull().values.sum()
4