Here is my code:
df.head(20)
df = df.fillna(df.mean()).head(20)
Below is the result:
There are many NaNs.
I want to replace the NaNs with the column averages, so I used df.fillna(df.mean()), but it did nothing.
What's the problem?
I have got it!! Before replacing the NaNs, I needed to reset the index first.
Below is code:
df = df.reset_index()
df = df.fillna(df.mean())
Now everything is okay!
This worked for me:
for i in df.columns:
    df[i] = df[i].fillna(df[i].mean())  # fill each column with its own mean
Each column in your DataFrame has at least one non-numeric value (in the rows #0 and partially #1). When you apply .mean() to a DataFrame, it skips all non-numeric columns (in your case, all columns). Thus, the NaNs are not replaced. Solution: drop the non-numeric rows.
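A minimal sketch of that fix, assuming the offending non-numeric values sit only in the first two rows (adjust the slice to your actual data):
df = df.iloc[2:]                 # drop the non-numeric rows
df = df.astype(float)            # now the columns can become numeric
df = df.fillna(df.mean())        # and filling with the mean works as expected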
I think the problem may be that your columns are not of float or int type. Check with df.dtypes. If it returns object, then mean() won't work. Change the type using df.astype().
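For example, along those lines:
print(df.dtypes)        # any column listed as 'object' will be skipped by mean()
df = df.astype(float)   # succeeds only if every value is parseable as a number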
I have a pandas DataFrame with data that looks like this:
With the data extending beyond what you can see here. I can't tell if the blue cells hold numeric or string data, but it should be numeric, since I transformed them to those values with multiplication. But I don't know pandas well enough to be sure.
Anyway, I call .max(axis=1) on this dataframe, and it gives me this:
As far as I know, there are no empty cells or cells with weird data. So why am I getting all nan?
First convert all values to numeric by DataFrame.astype:
df = df.astype(float)
If that does not work, use to_numeric with errors='coerce', which converts non-numeric values to NaN:
df = df.apply(pd.to_numeric, errors='coerce')
And then compute the max:
print(df.max(axis=1))
I have a pivot table. Unfortunately I am unable to cast the column to an int value due to NaN values, even though it represents a year in the data. Is there a way to use a function (a lambda?) to manipulate the columns in the creation of the pivot table?
submissions_by_country = df_maa_lu.pivot_table(index=["COUNTRY_DISPLAY_LABEL"], columns=["APPROVAL_YEAR"], values='LU_NUMBER_NO_SUFFIX', aggfunc='nunique', fill_value=0)
@smackenzie,
Is it possible to replace the value and recast? For example, assuming your dataframe is called df (note that replace and astype return copies, so assign the result back):
df = df.replace(to_replace=np.nan, value=0.)
df = df.astype(float)
If retaining np.nan is important, you can replace it with a sentinel value like -999., and then, after changing the datatype, replace it again.
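A sketch of that round trip, assuming -999 never occurs in the real data:
df = df.replace(np.nan, -999.)   # park the NaNs in a sentinel value
df = df.astype(int)              # the cast now succeeds
# ... later, to get the NaNs back (note this re-floats the column):
df = df.replace(-999, np.nan)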
If you just want to update the values in a single column, i.e. a pd.Series, instead of the entire dataframe, you could try it like this; I'm not sure dictionaries allow NaN as a key, though, and note that map() sends any value missing from the dict to NaN, so fillna(0.0) is the safer idiom:
df['Afghanistan'] = df['Afghanistan'].map({np.nan: 0.0})
Can you post a sample dataset to work with?
I have started a data science course which requires me to handle missing data either by deleting the rows containing NaN in the "price" column or by replacing the NaN with some mean value. However, neither my dropna() nor my replace() seems to work. What could be the problem?
I went through a lot of solutions on Stack Overflow but my problem was not solved. I also tried going through pandas.pydata.org to look for a solution, where I learnt about different arguments for dropna() like thresh, how='any', etc., but nothing helped.
import pandas as pd
import numpy as np
url="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df=pd.read_csv(url,header=None)
'''
Our data comes without any header or column names, hence we assign each column a header name.
'''
headers=["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers
'''
Now we have to eliminate rows containing NaN or ? in the "price" column of our data
'''
df.dropna(subset=["price"], axis=0, inplace=True)
df.head(12)
#or
df.dropna(subset=["price"], how='any')
df.head(12)
#also to replace
mean=df["price"].mean()
df["price"].replace(np.nan,mean)
df.head(12)
It was expected that all the rows containing NaN or "?" in the "price" column would be deleted by dropna() or replaced by replace(). However, there seems to be no change in the data.
Please use the following code to drop the ? values:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna()
The to_numeric method converts its argument to a numeric type.
With errors='coerce', invalid values are set to NaN.
Then dropna can clear the records that include NaN.
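If you want the replace-with-mean variant from the question instead of dropping, a sketch following the same to_numeric approach:
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # '?' becomes NaN
df['price'] = df['price'].fillna(df['price'].mean())       # fill NaN with the column mean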
I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill with NaN but if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
There are many answers for your two questions.
Here is a solution for your first one:
If you wish to insert into the NaN entries of your DataFrame a value that won't alter your statistics, then I would suggest using the mean of that data (keep in mind, though, that adding points exactly at the mean leaves the mean unchanged but does shrink the standard deviation).
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need descriptive statistics from your dataframe that are not influenced by the NaN values, here are two solutions:
1)
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
2)
I would suggest using the NumPy NaN-aware functions, such as numpy.nansum, numpy.nanmean, and numpy.nanstd:
df.apply(numpy.nansum)
df.apply(numpy.nanstd) #...
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
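In fact, the pandas aggregations already skip NaN by default (skipna=True), which reproduces the Excel behaviour from the question; a quick check:
import pandas as pd
import numpy as np

s = pd.Series([5, np.nan])
print(s.mean())              # 5.0 -- the NaN is skipped, like Excel's AVERAGE
print(s.sum())               # 5.0
print(s.sum(skipna=False))   # nan -- only if you opt out of skipping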
You can use df.fillna(). Here is an example of how you can do that.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling with a value like 0 will affect the statistics you compute on your data.
So go for the mean of the data, which makes sure the mean itself is unaffected.
So use df.fillna(df.mean()) instead.
If you want to change the datatype of a specific column so that missing values become NaN for statistical operations, you can simply use the line of code below. It converts all the values of that column to a numeric type, automatically replacing anything unparseable with NaN, and it will not affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in dataframe you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
I have a dataframe with 15 columns. 5 of those columns hold numbers, but some of the entries are either blanks or words. I want to convert those to zero.
I am able to convert the entries in one of the columns to zero, but when I try to do that for multiple columns, I can't. I tried this for one column:
pd.to_numeric(Tracker_sample['Product1'],errors='coerce').fillna(0)
and it works, but when I try this for multiple columns:
pd.to_numeric(Tracker_sample[['product1','product2','product3','product4','Total']],errors='coerce').fillna(0)
I get the error: arg must be a list, tuple, 1-d array, or Series
I think it is the way I am calling the columns that needs to be fixed. I am new to pandas, so any help would be appreciated. Thank you.
You can use apply, since pd.to_numeric expects a 1-D input (a Series), not a whole DataFrame:
Tracker_sample[['product1','product2','product3','product4','Total']].apply(pd.to_numeric, errors='coerce').fillna(0)
With a for loop?
for col in ['product1','product2','product3','product4','Total']:
    Tracker_sample[col] = pd.to_numeric(Tracker_sample[col], errors='coerce').fillna(0)