Suppose I have the following df
import pandas as pd
import numpy as np
test = pd.DataFrame(data={
    'a': [1, np.nan, np.nan, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1]
})
I would like to use fillna(method='ffill') (a forward fill), but only when two separate conditions are both met:
Both elements of a row are NaN
The element of the previous row of column 'b' is 10
Note: the first row can never be NaN
I am looking for a smart way - maybe a lambda expression or a vectorized form - avoiding for loops or .iterrows().
Required result:
result = pd.DataFrame(data={
    'a': [1, 1, 1, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, 10, 10, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1]
})
You can test whether the forward-filled value in b is 10 and whether all columns in the row are missing, then pass the combined mask to DataFrame.mask together with ffill:
mask = test['b'].ffill().eq(10) & test.isna().all(axis=1)
test = test.mask(mask, test.ffill())
print (test)
a b
0 1.0 10.0
1 1.0 10.0
2 1.0 10.0
3 4.0 1.0
4 NaN NaN
5 5.0 1.0
6 6.0 1.0
7 7.0 NaN
8 8.0 1.0
9 NaN 1.0
10 NaN NaN
11 6.0 1.0
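The same selection can also be written with .loc, assigning the forward-filled values only on the masked rows — a small sketch of the equivalent approach on the question's frame:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'a': [1, np.nan, np.nan, 4, np.nan, 5, 6, 7, 8, np.nan, np.nan, 6],
    'b': [10, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, 1, 1, np.nan, 1],
})

# rows where every column is NaN and the last seen 'b' value is 10
mask = test.isna().all(axis=1) & test['b'].ffill().eq(10)
# assign the forward-filled values only on those rows
test.loc[mask] = test.ffill()[mask]
```

Both spellings produce the same result; .mask keeps the operation as a single expression, while .loc makes the "assign only here" intent explicit.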
Related
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(Both are the same dataframe, except the imputation one has the NaNs filled in.)
I would like to reintroduce the NaN values into column A of the imputation df so it looks like this (columns B and C are filled in, but A keeps the NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
dfOrginal = pd.DataFrame({'A':[np.nan,2,np.nan],
                          'B':[4,5,np.nan],
                          'C':[7,np.nan,9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want because it obviously just fills in all the values. Is there a way to reintroduce the NaN values, or to fill NA only for specific columns? I'm not sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
Alternatively, check DataFrame.update, which fills the frame in place with the non-NaN values from the passed columns (column A is left untouched):
dfOrginal.update(dfImputation[['B','C']])
print(dfOrginal)
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
I want to overwrite the first N rows of a pandas dataframe that satisfy a condition, e.g. for:
data = {'id':[1,1,2,2,2,4,2,2], 'val':[9,9,9,9,9,9,9,9]}
df = pd.DataFrame(data)
If I do:
df.loc[df.id == 2] = None
I set all rows with id == 2, but I only want to set N of them.
I've tried
df.loc[df.id == 2][range(N)] = None
but to no avail.
Is there any way to do this, without using loops?
IIUC, use pandas.DataFrame.loc twice. Not so pretty, but works:
df.loc[df.loc[df["id"].eq(2)].index[:N]] = None  # here N = 2
print(df)
Output:
id val
0 1.0 9.0
1 1.0 9.0
2 NaN NaN
3 NaN NaN
4 2.0 9.0
5 4.0 9.0
6 2.0 9.0
7 2.0 9.0
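A slightly shorter variant indexes the DataFrame's own index with the boolean mask and slices the first N labels — a sketch using the question's N directly:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 4, 2, 2],
                   'val': [9, 9, 9, 9, 9, 9, 9, 9]})
N = 2

# take the index labels of the first N rows where id == 2,
# then clear those rows in a single .loc assignment
df.loc[df.index[df['id'] == 2][:N]] = None
```

Note that assigning None converts the affected columns to float, since integer columns cannot hold NaN.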
I'd like to replace NaNs in a dataframe as follows:
If a NaN sits between two columns with values, replace it with the mean of both neighbours ('prev' and 'next').
Otherwise, keep following the existing path of the series.
For instance:
In[1]:
df = pd.DataFrame([[1, 2,np.nan,np.nan], [np.nan, 4,6,8],[3,np.nan,6,np.nan]], columns=['A', 'B','C','D'])
Out[2]:
A B C D
0 1.0 2.0 NaN NaN
1 NaN 4.0 6.0 8.0
2 3.0 NaN 6.0 NaN
Desired output:
Out[2]:
A B C D
0 1.0 2.0 4.0 6.0
1 3.0 4.0 6.0 8.0
2 3.0 4.0 6.0 8.0
I have tried without much success:
for col in df.columns:
    for i in range(len(df.columns)-1):
        prev = df[df.columns[i-1]]
        nextval = df[df.columns[i+1]]
        df[col] = df[col].fillna((nextval+prev)/2)
You could use fillna() twice: once with method "bfill" and once with method "ffill", then average the two results.
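A minimal sketch of that suggestion across the columns (axis=1): where both a left and a right neighbour exist, the NaN gets their mean; at the edges, where only one side has a value, that side is used as a fallback. (Being a nearest-neighbour fill, this will not reproduce the extrapolated values in the hand-written desired output above.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan, np.nan],
                   [np.nan, 4, 6, 8],
                   [3, np.nan, 6, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

fwd = df.ffill(axis=1)   # nearest value to the left
bwd = df.bfill(axis=1)   # nearest value to the right
# mean where both neighbours exist, otherwise fall back to the side that does
result = df.fillna((fwd + bwd) / 2).fillna(fwd).fillna(bwd)
```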
I have a dataframe which contains NaN values in a few places. I am trying to perform data cleaning in which I fill the NaN values with the mean of their previous five instances. To do so, I have come up with the following.
input_data_frame[var_list].fillna(input_data_frame[var_list].rolling(5).mean(), inplace=True)
But this is not working: it isn't filling the NaN values, and there is no change in the dataframe's null count before and after the above operation. Assuming I have a dataframe with just an integer column, how can I fill NaN values with the mean of the previous five instances? Thanks in advance.
This should work:
input_data_frame[var_list]= input_data_frame[var_list].fillna(pd.rolling_mean(input_data_frame[var_list], 6, min_periods=1))
Note that the window is 6 because it includes the NaN value itself (which is not counted in the average). The other NaN values are not used in the averages either, so if fewer than 5 values are found in the window, the average is calculated on the values actually present.
Example:
df = {'a': [1, 1,2,3,4,5, np.nan, 1, 1, 2, 3, 4, 5, np.nan] }
df = pd.DataFrame(data=df)
print(df)
a
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 1.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 NaN
Output:
a
0 1.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 3.0
7 1.0
8 1.0
9 2.0
10 3.0
11 4.0
12 5.0
13 3.0
The rolling_mean function has since been removed from pandas; use the .rolling() method instead. To fill the entire dataset, you can use:
filled_dataset = dataset.fillna(dataset.rolling(6, min_periods=1).mean())
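Applying the modern spelling to the earlier example — a sketch assuming the same single-column frame; each NaN is replaced by the mean of up to five preceding values (the window is 6 because it also spans the NaN slot itself, and NaNs are skipped when averaging):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 4, 5, np.nan,
                         1, 1, 2, 3, 4, 5, np.nan]})

# rolling window of 6 with min_periods=1, so partial windows still average
filled = df.fillna(df.rolling(6, min_periods=1).mean())
```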
You can simply use interpolate():
df = {'a': [1,5, np.nan, np.nan, np.nan, 2, 5, np.nan] }
df = pd.DataFrame(data=df)
print(df)
df['a'] = df['a'].interpolate()
print(df)
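For that series, linear interpolation splits the interior gap between 5 and 2 into even steps, and by default trailing NaNs take the last valid value — a quick check, assuming the default method='linear':

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, np.nan, np.nan, np.nan, 2, 5, np.nan])
out = s.interpolate()
# the gap between 5 and 2 is filled with the even steps 4.25, 3.5, 2.75;
# the trailing NaN becomes 5.0 (carried from the last valid value)
```

Keep in mind this interpolates between neighbours, which is different from the rolling-mean fill above.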
I am trying to create a very large dataframe made up of one column from many smaller dataframes (each renamed to the dataframe's name). I am using concat() while looping through dictionary values, which represent dataframes, and looping over index values, to create the large dataframe. The concat() join_axes is the common index to all the dataframes. This works fine, but I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in a final dataframe with duplicate columns.
Is there any way I can use concat() exactly as I am, but merge the duplicate columns to produce a single column each?
I think you need:
df = pd.concat([df1, df2])
Or, if there are duplicate columns, use groupby, where overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
'B':[1,np.nan,np.nan,9],
'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
'B':[1,2,np.nan,np.nan],
'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2). Refer to the pandas documentation on combine_first.
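A sketch of combine_first on the same frames: unlike the groupby-sum above, it keeps df1's value wherever one exists and only falls back to df2 for the NaN cells, so overlapping values are not added together:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})

# df1's values win; df2 only patches df1's NaN cells
res = df1.combine_first(df2)
```

Choose groupby-sum when overlapping cells should accumulate, and combine_first when one frame should take priority.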