Pandas: Calculating column-wise mean yields nulls - python

I have a pandas DataFrame, df, and I'd like to get the mean for columns 180 through the end (not including the last column), only using the first 100K rows.
If I use the whole DataFrame:
df.mean().isnull().any()
I get False
If I use only the first 100K rows:
train_means = df.iloc[:100000, 180:-1].mean()
train_means.isnull().any()
I get: True
I'm not sure how this is possible, since the second approach is only getting the column means for a subset of the full DataFrame. So if no column in the full DataFrame has a mean of NaN, I don't see how a column in a subset of the full DataFrame can.
For what it's worth, I ran:
df.columns[df.isna().all()].tolist()
and I get: []. So I don't think I have any columns where every entry is NaN (which would cause a NaN in my train_means calculation).
Any idea what I'm doing incorrectly?
Thanks!

Try looking at:
(df.iloc[:100000, 180:-1].isnull().sum() == 100000).any()
If this returns True, it means at least one column is entirely NaN within the first 100,000 rows.
Now, as to why you get no nulls when taking the mean over the whole DataFrame: mean has skipna=True by default, so NaN values are dropped before the mean is computed. A column that is all NaN in the first 100,000 rows but has values further down therefore has a valid mean over the full DataFrame, while its mean over the subset is NaN.
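A minimal sketch of that behaviour with made-up toy data (a column that is all NaN in the head of the frame but has values later):
import numpy as np
import pandas as pd

# Toy frame: column 'b' is all NaN in the first 3 rows but has values afterwards,
# mimicking a column that is empty within the first 100K rows of the real data.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'b': [np.nan, np.nan, np.nan, 7.0, 8.0]})

print(df.mean().isnull().any())           # False: skipna=True drops NaN before averaging
print(df.iloc[:3].mean().isnull().any())  # True: 'b' is all NaN within the subset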

Related

How do I drop all rows in a DataFrame that have NAN in that row, in a specified column?

EDIT: (User error, I wasn't scanning the entire dataframe. Delete the question if needed.) A page I found had a solution that claimed to drop all rows with NaN in a selected column. In this case I am interested in the column with index 78 (an int, not a string, I checked).
The code fragment they provided turns out to look like this for me:
df4 = df_transposed.dropna(subset=[78])
That did exactly the opposite of what I wanted: df4 is a dataframe that has NaN in every element, and I'm not sure how to proceed.
I tried the dropna() method as suggested on half a dozen pages and I expected a dataframe with no NaN values in the column with index 78. Instead every element of the result was NaN.
df_transposed.dropna(subset=[78], inplace=True)  # should return the dataframe with rows that have missing values in column 78 removed
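For reference, a minimal sketch (toy data, integer column labels assumed) of what dropna(subset=[78]) does when the column genuinely contains NaN:
import numpy as np
import pandas as pd

# Toy frame with integer column labels; rows 1 and 3 have NaN in column 78.
df_transposed = pd.DataFrame({77: [1.0, 2.0, 3.0, 4.0],
                              78: [5.0, np.nan, 6.0, np.nan]})

df4 = df_transposed.dropna(subset=[78])
print(df4)  # keeps only the rows where column 78 is not NaN (rows 0 and 2)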

fillna() only fills the 1st value of the dataframe

I'm facing a strange issue in which I'm trying to replace all NaN values in a dataframe with values taken from another one (same length) that has the relevant values.
Here's a glimpse of the "target dataframe" in which I want to replace the values:
data_with_null
Here's the dataframe I want to take the data from: predicted_paticipant_groups
I've tried:
data_with_null.participant_groups.fillna(predicted_paticipant_groups.participant_groups, inplace=True)
but it just fills all NaN values with the first one (Infra).
Is it because the indexes of data_with_null are all zeros?
Reset the index and try again.
data_with_null.reset_index(drop=True, inplace=True)
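Putting it together as a minimal sketch (made-up data; the frame and column names are taken from the question): fillna aligns on the index, so with an all-zero index every NaN picks up the value at index 0, and resetting the index first makes the rows line up:
import numpy as np
import pandas as pd

# Toy stand-ins for the two frames from the question.
data_with_null = pd.DataFrame({'participant_groups': [np.nan, 'Infra', np.nan]},
                              index=[0, 0, 0])  # all-zero index, as described
predicted_paticipant_groups = pd.DataFrame({'participant_groups': ['Infra', 'Cloud', 'Apps']})

data_with_null.reset_index(drop=True, inplace=True)
# Plain assignment instead of inplace fillna, to avoid chained-assignment surprises.
data_with_null['participant_groups'] = data_with_null['participant_groups'].fillna(
    predicted_paticipant_groups['participant_groups'])
print(data_with_null)  # row 0 -> 'Infra', row 1 kept, row 2 -> 'Apps'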

How to insert a data point at a time to a pandas dataframe?

It might be non-Pythonic (if so, let me know as well), but I am running a function that produces only one data point at a time, and I would like to add those points to my dataframe. The reason is that for each row, I have 252 rows (in another dataframe) that I will feed into a function, which returns a single number.
I am using this method:
data.loc[row, 'ColumnA'] = some integer
but it appends the rows/values at the end of the dataframe, when I want to create a new column and populate it one data point at a time. So, for example, if I have this column in a dataframe:
Column A
NaN
NaN
NaN
and I run this:
data.loc[0, 'ColumnA'] = 10
I would like to see:
Column A
10
NaN
NaN
Thank you!
Have a look at the .at accessor on a DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html
It can be used to read and set a single value using a row/column pair:
df.at[row, column] = value
So your code would look like:
data.at[row, 'ColumnA'] = 10
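A quick sketch of that on the toy column from the question:
import numpy as np
import pandas as pd

data = pd.DataFrame({'ColumnA': [np.nan, np.nan, np.nan]})
data.at[0, 'ColumnA'] = 10
print(data)  # 10.0, NaN, NaN - the value lands in the existing row instead of being appended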

Getting nan when making a dataframe column equal to another

I am trying to make a subset of a dataframe
combo.iloc[:,orig_start_col:orig_start_col+2]
equal to the values another subset already has
combo.iloc[:,sm_col:sm_col+2]
where the columns will vary in a loop. The problem is that all I am getting is NaNs, despite the fact that the second subset's values are not NaN.
I tried to do this for just the first column and it worked; however, doing it with just the second column of the subset returns all NaNs, and doing it for the whole subset returns NaN values for everything.
My code is:
for node_col in ('leg2_node', 'leg4_node'):
    combo = orig_combos.merge(all, how='inner', left_on='leg6_node', right_on=node_col)
    combo.reset_index(drop=True, inplace=True)
    orig_start_col = combo.columns.get_loc('leg6_alpha_x')
    sm_col = combo.columns.get_loc(node_col + '_y')
    combo.iloc[:, orig_start_col+1:orig_start_col+2] = combo.iloc[:, sm_col+1:sm_col+2]
Since all rows of the sm_col:sm_col+2 subset have values, I would expect those values to end up in the orig_start_col:orig_start_col+2 subset, but instead everything comes out as NaN.
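For what it's worth, this symptom usually comes from label alignment: when a DataFrame slice is assigned to another DataFrame slice, pandas aligns on the column names before assigning, and non-matching names produce NaN. A minimal sketch with made-up column names, including the common workaround of assigning the underlying array (.values) to bypass alignment:
import pandas as pd

combo = pd.DataFrame({'leg6_alpha_x': [1.0, 2.0], 'other_x': [3.0, 4.0],
                      'leg2_node_y': [5.0, 6.0], 'other_y': [7.0, 8.0]})

# Column names don't match, so the aligned assignment produces NaN:
combo.iloc[:, 0:2] = combo.iloc[:, 2:4]
print(combo)

# Assigning the underlying NumPy array bypasses alignment and copies the values:
combo.iloc[:, 0:2] = combo.iloc[:, 2:4].values
print(combo)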

How to deal with missing values in Pandas DataFrame?

I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill with NaN but if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
There are many answers to your two questions.
Here is a solution to the first one:
If you wish to insert a value into the NaN entries of your DataFrame that won't alter your statistics, then I would suggest you use the mean value of that data.
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need descriptive statistics from your dataframe, and those statistics should not be influenced by the NaN values, here are two solutions:
1)
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
2)
I would suggest you use the NumPy NaN-aware functions (numpy.nansum, numpy.nanmean, numpy.nanstd, ...):
df.apply(numpy.nansum)
df.apply(numpy.nanstd) #...
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
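To make that concrete, here is a small sketch (toy series) of the usual options: compute with pandas' default NaN skipping, drop the missing values for one calculation, fill them, or use a NaN-aware NumPy function:
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan])

print(s.mean())             # 5.0 - pandas skips NaN by default, like Excel
print(s.dropna().sum())     # 5.0 - drop missing values for this one calculation
print(s.fillna(s.mean()))   # fill with the mean so later statistics aren't shifted
print(np.nansum(s.values))  # 5.0 - NaN-aware NumPy function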
You can use df.fillna(). Here is an example of how you can do the same.
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling missing values with something like 0 will affect the statistics you compute on your data.
So go for the mean of the data, which makes sure it won't skew your statistics.
So, use df.fillna(df.mean()) instead.
If you want to change the datatype of a specific column so that its missing values are represented as NaN for statistical operations, you can simply use the line of code below. It converts all the values of that column to a numeric type, replaces anything that can't be converted with NaN, and won't affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in the dataframe, you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
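As a quick illustration with a made-up column, values that can't be parsed become NaN and are then skipped by the usual statistics:
import pandas as pd

df = pd.DataFrame({'column_name': ['1', '2', 'oops', '4']})
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
print(df['column_name'])         # 1.0, 2.0, NaN, 4.0
print(df['column_name'].mean())  # 2.33... - the NaN is skipped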
