import pandas as pd
df1 = pd.DataFrame({
'value1': ["a","a","a","b","b","b","c","c"],
'value2': [1,2,3,4,4,4,5,5],
'value3': [1,2,3, None , None, None, None, None],
'value4': [1,2,3,None , None, None, None, None],
'value5': [1,2,3,None , None, None, None, None]})
df2 = pd.DataFrame({
'value1': ["k","j","l","m","x","y"],
'value2': [2, 2, 1, 3, 4, 5],
'value3': [2, 2, 2, 3, 4, 5],
'value4': [3, 2, 2, 3, 4, 5],
'value5': [2, 1, 2, 3, 4, 5]})
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
df2 =
value1 value2 value3 value4 value5
0 k 2 2 3 2
1 j 2 2 2 1
2 l 1 2 2 2
3 m 3 3 3 3
4 x 4 4 4 4
5 y 5 5 5 5
I would like to fill NaN in df1 from values in df2
So the results of df1 will look like
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2 2 1
4 b 4 2 2 2
5 b 4 3 3 3
6 c 5 4 4 4
7 c 5 5 5 5
I used following codes.
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
tmp2 = df2.iloc[1:, 2:]
tmp1 = tmp2 can update values in tmp1, but when I use following
df1[df1.value1 == 'b'].iloc[:, 2:]= tmp2
It doesn't update the values in df1 as shown below.
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
Why it happens and how can I solve this issue?
Thank you.
This line doesn't do what you think it's doing:
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
Methods are applied sequentially, so df1[df1.value1 == 'b'] keeps only rows 3, 4, 5 from df1. But this isn't what you want, you want to update all rows starting from the first instance your condition is satisfied.
Instead, first find the required index.
idx = df1['value1'].eq('b').values.argmax()
You then need to explicitly assign the last n rows from df2:
df1.iloc[idx:, 2:] = df2.iloc[-(len(df1.index)-idx):, 2:].values
print(df1)
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2.0 2.0 1.0
4 b 4 2.0 2.0 2.0
5 b 4 3.0 3.0 3.0
6 c 5 4.0 4.0 4.0
7 c 5 5.0 5.0 5.0
If you want to replace the nan values using index alignment use pandas fillna
df1.fillna(df2)
Add inplace if you want to update df1
df1.fillna(df2, inplace=True)
-
edit for case without aligned indexes:
If indexes of target and replacement values are not aligned, they can be aligned so that the dataframe fillna method can be used.
To align the indexes, get the indexes of rows containing nans in df1 to be replaced, filter df2 to include replacement values and then assign the replacement indexes from df1 as the index of df2. Then use fillna to transfer values from df2 to df1.
# in this case, find index values when df1.value1 is greater than or equal to 'b'
# (alternately could be indexes of rows containing nans)
idx = df1.index[df1.value1 >= 'b']
# get the section of df2 which will provide replacement values
# limit length to length of idx
align_df = df2[1:len(idx) + 1]
# set the index to match the nan rows from df1
align_df.index = idx
# use auto-alignment with fillna to transfer values from align_df(df2) to df1
df1.fillna(align_df)
# or can use df1.combine_first(align_df) because of the matching target and replacement indexes
Related
Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
This method...
propagate[s] last valid observation forward to next valid
To go the opposite way, there's also a bfill method.
This method doesn't modify the DataFrame inplace - you'll need to rebind the returned DataFrame to a variable or else specify inplace=True:
df.fillna(method='ffill', inplace=True)
The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
name number
0 a 0.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 4.0
5 b NaN
6 c 6.0
7 c 7.0
8 c 8.0
9 c 9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
5 4.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: number, dtype: float64
You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')
print(df)
# 0 1 2
#0 1 2 3
#1 4 2 3
#2 4 2 9
There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.
One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don't quite work. You need both.
In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])
In [225]: df.ffill()
Out[225]:
0
0 NaN
1 1.0
...
7 6.0
8 6.0
In [226]: df.bfill()
Out[226]:
0
0 1.0
1 1.0
...
7 6.0
8 NaN
In [227]: df.bfill().ffill()
Out[227]:
0
0 1.0
1 1.0
...
7 6.0
8 6.0
Only one column version
Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)
Just agreeing with ffill method, but one extra info is that you can limit the forward fill with keyword argument limit.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])
>>> df
0 1 2
0 1.0 2.0 3
1 NaN NaN 6
2 NaN NaN 9
>>> df[1].fillna(method='ffill', inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 NaN 2.0 6
2 NaN 2.0 9
Now with limit keyword argument
>>> df[0].fillna(method='ffill', limit=1, inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 1.0 2.0 6
2 NaN 2.0 9
ffill now has it's own method pd.DataFrame.ffill
df.ffill()
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
You can use fillna to remove or replace NaN values.
NaN Remove
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
NaN Replace
df.fillna(0) # 0 means What Value you want to replace
0 1 2
0 1.0 2.0 3.0
1 4.0 0.0 0.0
2 0.0 0.0 9.0
Reference pandas.DataFrame.fillna
There's also pandas.Interpolate, which I think gives one more control
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df=df.interpolate(method="pad",limit=None, downcast="infer") #downcast keeps dtype as int
print(df)
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.
df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')
Result:
0 1 value
0 device1 1 first val of device1
1 device1 2 first val of device1
2 device1 3 first val of device1
3 device2 1 None
4 device2 2 first val of device2
5 device2 3 first val of device2
6 device3 1 None
7 device3 2 None
8 device3 3 first val of device3
Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
This method...
propagate[s] last valid observation forward to next valid
To go the opposite way, there's also a bfill method.
This method doesn't modify the DataFrame inplace - you'll need to rebind the returned DataFrame to a variable or else specify inplace=True:
df.fillna(method='ffill', inplace=True)
The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
name number
0 a 0.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 4.0
5 b NaN
6 c 6.0
7 c 7.0
8 c 8.0
9 c 9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
5 4.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: number, dtype: float64
You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')
print(df)
# 0 1 2
#0 1 2 3
#1 4 2 3
#2 4 2 9
There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.
One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don't quite work. You need both.
In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])
In [225]: df.ffill()
Out[225]:
0
0 NaN
1 1.0
...
7 6.0
8 6.0
In [226]: df.bfill()
Out[226]:
0
0 1.0
1 1.0
...
7 6.0
8 NaN
In [227]: df.bfill().ffill()
Out[227]:
0
0 1.0
1 1.0
...
7 6.0
8 6.0
Only one column version
Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)
Just agreeing with ffill method, but one extra info is that you can limit the forward fill with keyword argument limit.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])
>>> df
0 1 2
0 1.0 2.0 3
1 NaN NaN 6
2 NaN NaN 9
>>> df[1].fillna(method='ffill', inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 NaN 2.0 6
2 NaN 2.0 9
Now with limit keyword argument
>>> df[0].fillna(method='ffill', limit=1, inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 1.0 2.0 6
2 NaN 2.0 9
ffill now has it's own method pd.DataFrame.ffill
df.ffill()
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
You can use fillna to remove or replace NaN values.
NaN Remove
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
NaN Replace
df.fillna(0) # 0 means What Value you want to replace
0 1 2
0 1.0 2.0 3.0
1 4.0 0.0 0.0
2 0.0 0.0 9.0
Reference pandas.DataFrame.fillna
There's also pandas.Interpolate, which I think gives one more control
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df=df.interpolate(method="pad",limit=None, downcast="infer") #downcast keeps dtype as int
print(df)
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.
df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')
Result:
0 1 value
0 device1 1 first val of device1
1 device1 2 first val of device1
2 device1 3 first val of device1
3 device2 1 None
4 device2 2 first val of device2
5 device2 3 first val of device2
6 device3 1 None
7 device3 2 None
8 device3 3 first val of device3
Suppose I have a DataFrame with some NaNs:
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
The result should be like this which is just +10 of the previous NaN value of the column.
0 1 2 3
1 4 12 13
2 14 12 9
Is there any way to do this using any methods?or do I have to iterate each column?
You can use ffill() to fill the NaNs with the previous non-NaN value, and then a simple mask to increment all by 10:
result = df.ffill()
result[df.isna()] += 10
Output
0 1 2
0 1.0 2.0 3.0
1 4.0 12.0 13.0
2 14.0 12.0 9.0
I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing for assign value per condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask which add by default NaNs by condition:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
in human language df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to any column (:) of the dataframe ( df ) where the
condition df.B > 5 is valid.
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
This method...
propagate[s] last valid observation forward to next valid
To go the opposite way, there's also a bfill method.
This method doesn't modify the DataFrame inplace - you'll need to rebind the returned DataFrame to a variable or else specify inplace=True:
df.fillna(method='ffill', inplace=True)
The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
name number
0 a 0.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 4.0
5 b NaN
6 c 6.0
7 c 7.0
8 c 8.0
9 c 9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
5 4.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: number, dtype: float64
You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')
print(df)
# 0 1 2
#0 1 2 3
#1 4 2 3
#2 4 2 9
There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.
One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don't quite work. You need both.
In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])
In [225]: df.ffill()
Out[225]:
0
0 NaN
1 1.0
...
7 6.0
8 6.0
In [226]: df.bfill()
Out[226]:
0
0 1.0
1 1.0
...
7 6.0
8 NaN
In [227]: df.bfill().ffill()
Out[227]:
0
0 1.0
1 1.0
...
7 6.0
8 6.0
Only one column version
Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)
Just agreeing with ffill method, but one extra info is that you can limit the forward fill with keyword argument limit.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])
>>> df
0 1 2
0 1.0 2.0 3
1 NaN NaN 6
2 NaN NaN 9
>>> df[1].fillna(method='ffill', inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 NaN 2.0 6
2 NaN 2.0 9
Now with limit keyword argument
>>> df[0].fillna(method='ffill', limit=1, inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 1.0 2.0 6
2 NaN 2.0 9
ffill now has it's own method pd.DataFrame.ffill
df.ffill()
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
You can use fillna to remove or replace NaN values.
NaN Remove
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
NaN Replace
df.fillna(0) # 0 means What Value you want to replace
0 1 2
0 1.0 2.0 3.0
1 4.0 0.0 0.0
2 0.0 0.0 9.0
Reference pandas.DataFrame.fillna
There's also pandas.Interpolate, which I think gives one more control
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df=df.interpolate(method="pad",limit=None, downcast="infer") #downcast keeps dtype as int
print(df)
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.
df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')
Result:
0 1 value
0 device1 1 first val of device1
1 device1 2 first val of device1
2 device1 3 first val of device1
3 device2 1 None
4 device2 2 first val of device2
5 device2 3 first val of device2
6 device3 1 None
7 device3 2 None
8 device3 3 first val of device3