I'm facing a strange issue: I'm trying to replace all NaN values in a DataFrame with values taken from another one (same length) that holds the relevant values.
Here's a glimpse of the "target dataframe" in which I want to replace the values:
data_with_null
Here's the dataframe I want to take data from: predicted_paticipant_groups
I've tried:
data_with_null.participant_groups.fillna(predicted_paticipant_groups.participant_groups, inplace=True)
but it just fills all NaN values with the first one (Infra).
Is it because the indexes of data_with_null are all zeros?
Reset the index and try again. fillna aligns on the index, so if every row of data_with_null has index 0, each NaN picks up the value at index 0 of the other frame (Infra).
data_with_null.reset_index(drop=True, inplace=True)
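For illustration, here's a minimal self-contained sketch; the toy values are made up, and only the duplicated all-zero index mirrors the question:

import pandas as pd
import numpy as np

# toy stand-ins for the poster's frames; every row of the target shares index 0
data_with_null = pd.DataFrame({'participant_groups': ['Infra', np.nan, np.nan]},
                              index=[0, 0, 0])
predicted_paticipant_groups = pd.DataFrame({'participant_groups': ['Infra', 'Dev', 'Ops']})

# without the reset, fillna aligns every NaN row to index 0 -> all 'Infra'
data_with_null.reset_index(drop=True, inplace=True)
data_with_null['participant_groups'] = data_with_null['participant_groups'].fillna(
    predicted_paticipant_groups['participant_groups'])
print(data_with_null)  # row 1 -> 'Dev', row 2 -> 'Ops'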
I have a data frame that contains product sales for each day from 2018 to 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the gaps are filled, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python pandas.
A sample of my data set looks like this:
I would like the dataframe to look like this:
Use fillna with method='ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that contain NAs.
# forward-fill: propagate the last valid Date/Place into the gaps below
df['Date'].fillna(method='ffill', inplace=True)
df['Place'].fillna(method='ffill', inplace=True)
# then drop the rows that still contain NAs
df.dropna(inplace=True)
You can forward-fill both columns at once, replacing each null with the nearest value above it: df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill'). Next, drop the rows with missing values in ProductCategory: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
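Putting both answers' steps together, a runnable sketch (the sample rows here are hypothetical, since the original table isn't shown):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': ['2018-01-01', np.nan, np.nan, '2018-01-02'],
    'Place': ['Warsaw', np.nan, np.nan, 'Krakow'],
    'ProductCategory': ['Food', np.nan, 'Toys', 'Food'],
    'Sales': [10, 5, 7, 3],
})

# propagate the last valid Date/Place down into the gaps
df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill')
# keep only the rows that actually have a ProductCategory
df.dropna(subset=['ProductCategory'], inplace=True)
print(df)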
Compute the frequency of the categories in the column by plotting; from the plot you can see the bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Get the most frequent value using the index: index[0] gives the most repeated, index[1] the second most repeated, and you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with it:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define it as a function, like this:
def impute_nan(df, column):
    # mode()[0] is the most frequent value in the column
    most_frequent_category = df[column].mode()[0]
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)
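To see it end to end, a quick run with a hypothetical df, reusing impute_nan from above:

import pandas as pd
import numpy as np

df = pd.DataFrame({'column1': ['a', np.nan, 'a', 'b'],
                   'column2': [np.nan, 'x', 'x', 'y']})

for feature in ['column1', 'column2']:
    impute_nan(df, feature)  # fills the NaNs with 'a' and 'x'

print(df)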
I want to change the NaN values in a specific column in a list of DataFrames. I have applied the methods below, but I am unable to change the NaN values to zero. Is there any way to replace the values with zero?
data is the list of DataFrames and qobs is the specific column in each DataFrame:
for value in data:
    value['qobs'] = value['qobs'].replace(np.nan, 0)
for value in data:
    value['qobs'] = value['qobs'].fillna(0)
You can change the column like this (since data is a list, apply it to each DataFrame in turn):
for frame in data:
    frame['qobs'] = frame['qobs'].fillna(0)
print(data)
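A self-contained check with toy frames (the values are made up):

import pandas as pd
import numpy as np

data = [pd.DataFrame({'qobs': [1.0, np.nan, 3.0]}),
        pd.DataFrame({'qobs': [np.nan, 2.0]})]

# assign the filled column back into each DataFrame in the list
for frame in data:
    frame['qobs'] = frame['qobs'].fillna(0)

print(data[0])  # NaN -> 0.0
print(data[1])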
I am trying to set a subset of a dataframe,
combo.iloc[:,orig_start_col:orig_start_col+2]
equal to the values another subset already has,
combo.iloc[:,sm_col:sm_col+2]
where the columns vary in a loop. The problem is that all I get is NaNs, even though the second subset's values are not NaN.
I tried this for just the first column and it worked; however, doing it with just the second column of the subset returns all NaNs, and doing it for the whole subset returns NaN for everything.
My code is:
for node_col in ('leg2_node', 'leg4_node'):
    combo = orig_combos.merge(all, how='inner', left_on='leg6_node', right_on=node_col)
    combo.reset_index(drop=True, inplace=True)
    orig_start_col = combo.columns.get_loc('leg6_alpha_x')
    sm_col = combo.columns.get_loc(node_col + '_y')
    combo.iloc[:, orig_start_col+1:orig_start_col+2] = combo.iloc[:, sm_col+1:sm_col+2]
Since every row of the sm_col:sm_col+2 subset has values, I would expect those values to land in the orig_start_col:orig_start_col+2 subset, but instead everything comes out as NaN.
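For what it's worth, this looks like pandas' label alignment: assigning one DataFrame slice to another aligns on column names, and mismatched names produce NaN. A minimal sketch of the effect and the usual .to_numpy() workaround, with hypothetical column names:

import pandas as pd

combo = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'c': [0.0, 0.0], 'd': [0.0, 0.0]})

# DataFrame-to-DataFrame assignment aligns on column labels:
# 'a'/'b' never match 'c'/'d', so the target columns become NaN
combo.iloc[:, 2:4] = combo.iloc[:, 0:2]
print(combo)  # c and d are all NaN

# bypass alignment by assigning the raw values instead
combo = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'c': [0.0, 0.0], 'd': [0.0, 0.0]})
combo.iloc[:, 2:4] = combo.iloc[:, 0:2].to_numpy()
print(combo)  # c/d now hold the values from a/b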
I have a pandas DataFrame, df, and I'd like to get the mean for columns 180 through the end (not including the last column), only using the first 100K rows.
If I use the whole DataFrame:
df.mean().isnull().any()
I get False
If I use only the first 100K rows:
train_means = df.iloc[:100000, 180:-1].mean()
train_means.isnull().any()
I get: True
I'm not sure how this is possible, since the second approach is only getting the column means for a subset of the full DataFrame. So if no column in the full DataFrame has a mean of NaN, I don't see how a column in a subset of the full DataFrame can.
For what it's worth, I ran:
df.columns[df.isna().all()].tolist()
and I get: []. So I don't think I have any columns where every entry is NaN (which would cause a NaN in my train_means calculation).
Any idea what I'm doing incorrectly?
Thanks!
Try looking at:
(df.iloc[:100000, 180:-1].isnull().sum() == 100000).any()
If this returns True, you have at least one column whose values are all NaN within the first 100,000 rows.
As for why the mean over the whole DataFrame shows no nulls: mean has skipna=True by default, so it drops NaN before averaging. A column that is all NaN in the first 100K rows but has values further down therefore gets a valid mean over the full frame, while its mean over the subset is NaN.
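A tiny reproduction of that situation (toy single-column frame, made-up values):

import pandas as pd
import numpy as np

# NaN for the first rows, real values further down
df = pd.DataFrame({'x': [np.nan, np.nan, np.nan, 1.0, 2.0]})

print(df.mean().isnull().any())           # False: skipna drops the NaNs
print(df.iloc[:3].mean().isnull().any())  # True: the slice is all NaN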
I have this dataframe:
I want to create a ZIP column which will get the value of ZIP_y when ZIP_x is NaN and the value of ZIP_x when ZIP_x is not NaN.
I tried this code:
dm["ZIP"]=numpy.where(dm["ZIP_x"] is numpy.nan, dm["ZIP_y"],dm["ZIP_x"])
But that gave me this output:
As you can see, the ZIP column seems to be getting the values of ZIP_x in each of its cells.
Do you know how to achieve what I am after?
You want this:
dm["ZIP"]=numpy.where(dm["ZIP_x"].isnull(), dm["ZIP_y"],dm["ZIP_x"])
You can't use is (or ==, for that matter) to compare NaNs. dm["ZIP_x"] is numpy.nan asks whether the whole Series object is the nan object, which is always False, so numpy.where receives a scalar False and simply returns ZIP_x everywhere. NaN also compares unequal to itself, so use .isnull() instead.
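A self-contained check (made-up ZIP values):

import numpy as np
import pandas as pd

dm = pd.DataFrame({'ZIP_x': [10001.0, np.nan, 94105.0],
                   'ZIP_y': [10002.0, 60601.0, 94106.0]})

print(np.nan == np.nan)  # False: NaN never equals itself

dm['ZIP'] = np.where(dm['ZIP_x'].isnull(), dm['ZIP_y'], dm['ZIP_x'])
print(dm)  # row 1 takes ZIP_y (60601.0), the rest keep ZIP_x

An equivalent one-liner is dm['ZIP'] = dm['ZIP_x'].combine_first(dm['ZIP_y']).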