I have a pandas DataFrame with data that looks like this:
The data extends beyond what you can see here. I can't tell whether the blue cells hold numeric or string data, but they should be numeric, since I produced those values by multiplication. I don't know pandas well enough to be sure, though.
Anyway, I call .max(axis=1) on this dataframe, and it gives me this:
As far as I know, there are no empty cells or cells with weird data. So why am I getting all NaN?
First, convert all values to numeric with DataFrame.astype:
df = df.astype(float)
If that fails, use to_numeric with errors='coerce', which turns non-numeric values into NaN:
df = df.apply(pd.to_numeric, errors='coerce')
Then compute the max:
print(df.max(axis=1))
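As a self-contained sketch (with made-up column names and values, not the asker's data), the coercion makes the row-wise max work again:

```python
import pandas as pd

# Invented data: numbers stored as strings, plus one value that can't be parsed
df = pd.DataFrame({"a": ["1.5", "2.0"], "b": ["3.0", "oops"]})

df = df.apply(pd.to_numeric, errors="coerce")  # non-numeric entries become NaN
row_max = df.max(axis=1)                       # NaN is skipped (skipna=True by default)
print(row_max.tolist())
```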
This is the original data and I need the mean of each year of all the variables.
But when I use the groupby('year') command, it drops all variables except 'lnmcap' and 'epu'.
Why is this happening, and what needs to be done?
Probably the other columns hold object (string) data instead of numeric data, which is why only 'lnmcap' and 'epu' got an average column. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they come out as object/string type, then use:
ds = ds.drop('company', axis=1)
column_names = ds.columns
for i in column_names:
    ds[i] = ds[i].astype(str).astype(float)
This could work
You might want to convert all numerical columns to float before getting their mean, for example
cols = list(ds.columns)
#remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
#convert remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
#after that you can apply the aggregation
ds.groupby('year').mean()
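A runnable sketch of the steps above, on an invented frame shaped like the question (all values are made up):

```python
import pandas as pd

# Invented data: one identifier, one group key, and numerics partly stored as strings
ds = pd.DataFrame({
    "company": ["A", "B", "A", "B"],
    "year": [2020, 2020, 2021, 2021],
    "lnmcap": [1.0, 2.0, 3.0, 4.0],
    "epu": ["10", "20", "30", "40"],  # object dtype: excluded from a plain mean
})

# Coerce everything except the identifier columns to numbers
for col in [c for c in ds.columns if c not in ("company", "year")]:
    ds[col] = pd.to_numeric(ds[col], errors="coerce")

# Drop the irrelevant column and aggregate per year
means = ds.drop(columns="company").groupby("year").mean()
print(means)
```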
You will need to convert the numeric columns to float types. Use df.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except ValueError:
        continue
After this, use df.info() to check again. Columns with object values like '1.604809' will have been converted to the float 1.604809.
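A minimal sketch of that loop, with two invented columns: one that parses cleanly and one that is genuinely textual and stays untouched:

```python
import pandas as pd

# Invented columns: "clean" holds numbers as strings, "dirty" holds real text
ds = pd.DataFrame({"clean": ["1.604809", "2.5"], "dirty": ["abc", "def"]})

for col in ds.select_dtypes(["object"]).columns:
    try:
        ds[col] = ds[col].astype("float")
    except ValueError:
        continue  # leave columns that can't be parsed as-is

print(ds.dtypes)
```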
Sometimes a column may contain "dirty" data that cannot be converted to float. In that case, you can use the code below with errors='coerce', which turns non-numeric data into NaN:
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')  # converts to numeric; non-numeric values become NaN
I have dataframe that looks like this:
And once I run the following code: DF = DF.groupby('CIF').mean() (and fill NaN with zeros)
I get following dataframe:
Why do the two columns 'CYCLE' and 'BALANCE.GEL' disappear?
Because the columns mix missing values, numbers, and string representations of numbers, they are dropped from the aggregation.
So try converting all columns other than CIF to numbers; since CIF becomes the index, you can aggregate by mean per index level:
DF = DF.set_index('CIF').astype(float).groupby(level=0).mean()
If the first solution fails, use to_numeric with errors='coerce' to convert non-numbers to NaN:
DF = DF.set_index('CIF').apply(pd.to_numeric, errors='coerce').groupby(level=0).mean()
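A self-contained sketch of this approach, on an invented frame shaped like the question (column values are made up):

```python
import pandas as pd

# Invented data: numbers stored as strings, plus a missing value
DF = pd.DataFrame({
    "CIF": [1, 1, 2],
    "CYCLE": ["10", "20", "30"],
    "BALANCE.GEL": ["5.5", None, "2.5"],
})

# Coerce everything except the key, then average per CIF (the index)
out = (DF.set_index("CIF")
         .apply(pd.to_numeric, errors="coerce")
         .groupby(level=0)
         .mean())
print(out)
```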
I have started a data science course which requires me to handle missing data, either by deleting the rows containing NaN in the "price" subset or by replacing the NaN with some mean value. However, neither dropna() nor replace() seems to work. What could be the problem?
I went through a lot of solutions on Stack Overflow, but my problem was not solved. I also went through pandas.pydata.org looking for a solution, where I learned about different arguments for dropna() like thresh and how='any', but nothing helped.
import pandas as pd
import numpy as np
url="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df=pd.read_csv(url,header=None)
'''
Our data comes without any header or column names, hence we assign each column a header name.
'''
headers=["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engnie-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers
'''
Now we have to eliminate rows containing NaN or "?" in the "price" column of our data
'''
df.dropna(subset=["price"], axis=0, inplace=True)
df.head(12)
#or
df.dropna(subset=["price"], how='any')
df.head(12)
#also to replace
mean=df["price"].mean()
df["price"].replace(np.nan,mean)
df.head(12)
I expected all the rows containing NaN or "?" in the "price" column to be deleted by dropna() or replaced by replace(). However, there seems to be no change in the data.
Use the following code to drop the "?" values:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna()
The to_numeric method converts its argument to a numeric type, and errors='coerce' sets invalid values to NaN.
dropna can then clear the records that include NaN.
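A minimal sketch with a made-up price column containing the dataset's "?" placeholder:

```python
import pandas as pd

# Invented price values; "?" is the placeholder used by the imports-85 dataset
df = pd.DataFrame({"price": ["13495", "?", "16500"]})

df["price"] = pd.to_numeric(df["price"], errors="coerce")  # "?" -> NaN
df = df.dropna(subset=["price"])
print(df["price"].tolist())
```

Alternatively, passing na_values="?" to pd.read_csv marks the placeholders as NaN at load time, so dropna works directly.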
I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill with NaN but if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
There are many answers to your two questions.
Here is a solution for your first one:
If you wish to fill the NaN entries in your DataFrame with a value that won't alter your statistics, I would suggest using the mean of that data.
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need descriptive statistics from your DataFrame that are not influenced by the NaN values, here are two solutions:
1)
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
2)
I would suggest using the numpy nan-aware functions, such as numpy.nansum, numpy.nanmean, and numpy.nanstd:
df.apply(numpy.nansum)
df.apply(numpy.nanstd) #...
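A tiny sketch of both behaviors, mirroring the Excel example from the question (the data is invented):

```python
import numpy as np
import pandas as pd

# One value and one gap: the Excel example says the mean should be 5
s = pd.Series([5.0, np.nan])

print(s.mean())       # pandas skips NaN by default
print(np.nanmean(s))  # the numpy nan-aware equivalent
```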
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
You can use df.fillna(). Here is an example of how you can do the same.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling with a value like 0 will affect the statistics you compute on your data.
Use the mean of the data instead, which leaves the column means unchanged.
So, use df.fillna(df.mean()) instead
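A short sketch on an invented frame showing that mean-filling preserves the column means:

```python
import numpy as np
import pandas as pd

# Invented frame with one gap per column
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 6.0]})

filled = df.fillna(df.mean())  # each NaN replaced by its own column's mean
print(df.mean().tolist(), filled.mean().tolist())
```

Note that while the means are preserved, the variance shrinks, so this fill is not neutral for every statistic.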
If you want to change the datatype of a specific column whose missing values should become NaN for statistical operations, you can simply use the line of code below. It converts all the values of that column to a numeric type, automatically replaces the missing values with NaN, and won't affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in dataframe you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
Here is my code:
df.head(20)
df = df.fillna(df.mean()).head(20)
Below is the result:
There are many NaN.
I want to replace the NaN values with column averages. I used df.fillna(df.mean()), but it had no effect.
What's the problem?
I've got it! Before replacing the NaN values, I need to reset the index first.
Below is code:
df = df.reset_index()
df = df.fillna(df.mean())
Now everything is okay!
This worked for me
for i in df.columns:
    df[i] = df[i].fillna(df[i].mean())
Each column in your DataFrame has at least one non-numeric value (in the rows #0 and partially #1). When you apply .mean() to a DataFrame, it skips all non-numeric columns (in your case, all columns). Thus, the NaNs are not replaced. Solution: drop the non-numeric rows.
I think the problem may be that your columns are not of float or int type. Check with df.dtypes; if it returns object, then mean() won't work. Change the type using df.astype().
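A quick sketch of that check-and-convert flow, on an invented single-column frame:

```python
import pandas as pd

# Invented object-dtype column: mean() ignores it until it's converted
df = pd.DataFrame({"x": ["1", "2", "3"]})
print(df.dtypes)  # x shows as object

df = df.astype({"x": float})
print(df["x"].mean())
```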