I have a pandas DataFrame with some missing values. I would like to fill them with something that won't influence the statistics I will compute on the data.
As an example, in Excel, if you average a cell containing 5 and an empty cell, the result is 5. I'd like the same behaviour in Python.
I tried filling with NaN, but then summing a certain column, for example, returns NaN. I also tried filling with None, but I get an error because I end up summing different datatypes.
Can somebody help? Thank you in advance.
There are many possible answers to your two questions.
Here is a solution for the first one:
If you want to fill the NaN entries in your DataFrame with a value that won't alter your statistics, I suggest using the mean of that data.
Example:
df  # your DataFrame with NaN values
df.fillna(df.mean(), inplace=True)  # fill each column's NaN with that column's mean
For the second question:
If you need descriptive statistics from your DataFrame that are not influenced by the NaN values, here are two options:
1)
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
2)
I suggest using the NaN-aware NumPy functions, such as numpy.nansum, numpy.nanmean, and numpy.nanstd:
import numpy

df.apply(numpy.nansum)  # NaN-aware sum of each column
df.apply(numpy.nanstd)  # NaN-aware standard deviation, etc.
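As a quick sketch with a hypothetical one-column frame: note that pandas' own reductions already skip NaN by default (skipna=True), while the NumPy nan-functions give the same NaN-aware behaviour when you apply NumPy directly:
import numpy as np
import pandas as pd

# Hypothetical frame: one value and one missing entry.
df = pd.DataFrame({'a': [5.0, np.nan]})

print(df['a'].mean())       # 5.0 -- pandas skips NaN by default (skipna=True)
print(df.apply(np.nansum))  # NaN-aware sum per column: a    5.0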
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
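For instance, a minimal sketch of the three options just mentioned, on a tiny Series:
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan])

s.fillna(0)   # fill the missing value: [5.0, 0.0]
s.dropna()    # drop it (permanently, or just for one calculation): [5.0]
np.nansum(s)  # work around it with a NaN-aware function: 5.0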
You can use df.fillna(). Here is an example of how to do it.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling with a value like 0 will affect the statistics you compute on your data.
Filling with the mean of the data instead keeps the mean itself unchanged (though note that it still shrinks spread statistics such as the standard deviation).
So use df.fillna(df.mean()) instead.
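A minimal sketch of why the mean works here:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, np.nan])
print(s.mean())                   # 2.0 (the NaN is skipped)
print(s.fillna(s.mean()).mean())  # still 2.0 -- the mean is unchanged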
If you want to change the datatype of a specific column so that its missing values are represented as NaN for statistical operations, you can simply use the line of code below. It converts all the values of that column to a numeric type, automatically replaces every missing or non-numeric value with NaN, and won't affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in the DataFrame, you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
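For example, a small sketch with a hypothetical mixed column: the '?' entry becomes NaN and is then skipped by the statistics:
import pandas as pd

df = pd.DataFrame({'column_name': ['1', '2', '?', '4']})
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
print(df['column_name'].mean())  # 2.333... -- the coerced NaN is skipped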
I have a data set where I want to match the index row and change the value of a column within that row.
I have looked at map and loc, and I have been able to locate the data using df.loc, but that filters the data down; all I want to do is change the value in a column on that row when that row is found.
What is the best approach? My original post can be found here:
Original post
It's simple to do in Excel but I'm struggling with pandas.
Edit:
I have this so far, which seems to work, but the result includes a long column of numbers after the total calculation, along with dtype: int64:
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
How do I get around this?
The pandas method mask is likely a good option here. Mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple dataframe columns, you'll also want to pass in an additional axis argument.
The condition: this would be something like, for instance:
df['Code'] == 2.1
The replacement value: this can be a single value, a Series/DataFrame, or a function/callable. For your purposes, a Series computed from the other columns works well. For example:
df['Rate'] * df['Quantity']
The axis: because the replacement values come from another Series, you need to tell mask() how to align them with the Series being masked. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
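As a sanity check, here is a sketch with a hypothetical two-row frame; note that rows where the condition is False keep their original Code value in Total:
import pandas as pd

df = pd.DataFrame({'Code': [2.1, 3.0], 'Rate': [10, 10], 'Quantity': [2, 2]})
df['Total'] = df['Code'].mask(df['Code'] == 2.1,
                              df['Rate'] * df['Quantity'], axis=0)
print(df['Total'].tolist())  # [20.0, 3.0] -- only the matching row is replaced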
I am working on a calculator to determine what to feed your fish as a fun project to learn python, pandas, and numpy.
My data is organized with my fishes as rows and the different foods as columns.
What I am hoping to do is have the user (me) input a food, and have the program output all the values for that food which are not NaN.
The reason why I would prefer to leave them as NaN rather than 0 is that I use different numbers in different spots to indicate preference: 1 is natural diet, 2 is OK but not ideal, 3 is live only.
Is there any way to do this using pandas? Everywhere I look online helps me filter rows based on column values, but it is quite difficult to find info on filtering columns based on row values.
Currently, my code looks like this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\Daniel\OneDrive\Documents\AquariumAiMVP.xlsx')
clownfish = df[0:1]
angelfish = df[1:2]
damselfish = df[2:3]
So, as you can see, I haven't really gotten anywhere yet. I tried filtering out the nulls using the following idea:
clownfish_wild_diet = pd.isnull(df.clownfish)
But it results in an error, saying:
AttributeError: 'DataFrame' object has no attribute 'clownfish'
Thanks for the help guys. I'm a total pandas noob so it is much appreciated.
You can use masks in pandas:
food = 'Amphipods'
mask = df[food].notnull()
result_set = df[mask]
df[food].notnull() returns a mask (a Series of boolean values indicating if the condition is met for each row), and you can use that mask to filter the real DF using df[mask].
Usually you can combine these two lines into more pythonic code, but that's up to you:
result_set = df[df[food].notnull()]
This returns a new DF with the subset of rows that meet the condition (including all columns from the original DF), so you can use other operations on this new DF (e.g. selecting a subset of columns, dropping other missing values, etc.).
See more about .notnull(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html
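If you also want the inverse direction (given one fish's row, the food columns that are not NaN), the same idea works on the row as a Series; a minimal sketch, assuming row 0 is the clownfish as in your slicing above:
clownfish = df.iloc[0]               # the clownfish row as a Series
clownfish_diet = clownfish.dropna()  # only the foods that have a preference value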
I have a dataframe with 100+ columns where all columns after col10 are of type float. What I would like to do is find the average of a certain range of columns within a loop. Here is what I tried so far:
for index, row in df.iterrows():
    a = row.iloc[col30:col35].mean(axis=0)
This unfortunately returns unexpected values and I'm not able to get the average of col30 through col35 for every row. Can someone please help?
Try:
df.iloc[:, 30:35].mean(axis=1)
You may need to adjust 30:35 to 29:35 (you can remove the .mean and play around to get an idea of how .iloc works). Generally, in pandas you want to avoid loops as much as possible. The .iloc method lets you select rows and columns by their positional index. Then you can use .mean() with axis=1 to average across the columns for each row.
You really should include a small reproducible example; see below, where the solution mentioned in the comments works.
import pandas as pd

# Each column i holds the constant value i, so the row means are easy to check.
df = pd.DataFrame({i: val for i, val in enumerate(range(100))}, index=list(range(100)))

for i, row in df.iterrows():
    a = row.iloc[29:35].mean()  # a should be 31.5 for each row
    print(a)
I have started a data science course which requires me to handle missing data, either by deleting the rows containing NaN in the "price" column or by replacing the NaN with some mean value. However, neither my dropna() nor my replace() seems to work. What could be the problem?
I went through a lot of solutions on Stack Overflow but my problem was not solved. I also tried going through pandas.pydata.org to look for a solution, where I learnt about different arguments for dropna() like thresh and how='any', but nothing helped.
import pandas as pd
import numpy as np
url="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df=pd.read_csv(url,header=None)
'''
Our data comes without any header or column names, hence we assign each column a header name.
'''
headers=["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]
df.columns=headers
'''
Now we have to eliminate rows containing NaN or "?" in the "price" column of our data
'''
df.dropna(subset=["price"], axis=0, inplace=True)
df.head(12)
#or
df.dropna(subset=["price"], how='any')
df.head(12)
#also to replace
mean=df["price"].mean()
df["price"].replace(np.nan,mean)
df.head(12)
It was expected that all the rows containing NaN or "?" in the "price" column would be deleted by dropna() or replaced by replace(). However, there seems to be no change in the data.
Please use the following code to drop the "?" values:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna()
The to_numeric method converts its argument to a numeric type, and errors='coerce' sets invalid values to NaN.
Then dropna can clear the records that include NaN.
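If you prefer replacing rather than dropping (the other approach from your question), remember that replace and fillna return a new object unless you assign the result back (or pass inplace=True); a minimal sketch:
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['price'] = df['price'].fillna(df['price'].mean())  # assign the result back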
I have a Pandas DataFrame with quite a few missing values that are being represented by np.nan. I would like to be able to return the rows in the DataFrame having more than 80% of their values missing.
So far I have tried the following:
data.loc[lambda x: (len(x.isna()) / len(x.columns)) > .8]
but this is apparently not how loc works when passed a lambda function. My interpretation of this was that Pandas was simply running a loop over each row and applying the function, expecting a True or False value in return indicating to keep or discard the row, respectively. Essentially a filter function.
Is there a Pandas way of achieving what I want or shall I resort to plain python?
Using dropna with thresh (thresh: require that many non-NA values to keep a row). Note this keeps the rows with at least 80% of their values present and drops the rest:
df.dropna(thresh=len(df.columns)*0.8)
Update: to directly select the rows with more than 80% of their values missing:
df[(df.isna().sum(1)/df.shape[1]).gt(0.8)]
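A small sketch of the difference, using a hypothetical 5-column frame: the thresh call keeps only the sufficiently complete rows, while the update returns the rows that are more than 80% missing:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, np.nan, np.nan, np.nan],  # 80% missing
                   [np.nan] * 5,                         # 100% missing
                   [1, 2, 3, 4, 5]])                     # nothing missing

print(df[(df.isna().sum(1) / df.shape[1]).gt(0.8)])  # only the all-NaN row (0.8 is not > 0.8)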