Using fillna to replace missing data - python

When I am trying to use fillna to replace NaNs in the columns with means, the NaNs changed from float64 to object, showing:
bound method Series.mean of 0 NaN\n1
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)`

You cannot use mean = df['texture_mean'].mean. This is where the problem lies. The following code will work -
df=pd.DataFrame({'texture_mean':[2,4,None,6,1,None],'A':[1,2,3,4,5,None]}) # Example
df
A texture_mean
0 1.0 2.0
1 2.0 4.0
2 3.0 NaN
3 4.0 6.0
4 5.0 1.0
5 NaN NaN
df['texture_mean']=df['texture_mean'].fillna(df['texture_mean'].mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 NaN 3.25
In case you want to replace all the NaNs with the respective means of that column in all columns, then just do this -
df=df.fillna(df.mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 3.0 3.25
Let me know if this is what you want.

Related

Pandas dynamically replace nan values

I have a DataFrame that looks like this:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the nan values. I have tried doing (df.ffill()+df.bfill())/2 but that does not yield the desired output, as it casts the fill value to the whole column at once, rather then dynamically. I have tried with interpolate, but it doesn't work well for non linear data.
I have seen this answer but did not fully understand it and not sure if it would work.
Update on the computation of the values
I want every nan value to be the mean of the previous and next non nan value. In case there are more than 1 nan value in sequence, I want to replace one at a time and then compute the mean e.g., in case there is 1, np.nan, np.nan, 4, I first want the mean of 1 and 4 (2.5) for the first nan value - obtaining 1,2.5,np.nan,4 - and then the second nan will be the mean of 2.5 and 4, getting to 1,2.5,3.25,4
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
Inspired by the #ye olde noobe answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (times comparison below):
def custom_fillna(s:pd.Series):
for i in range(len(s)):
if pd.isna(s[i]):
last_valid_number = (s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0)
next_valid_numer = (s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0)
s[i] = (last_valid_number+next_valid_numer)/2
custom_fillna(df['a'])
df
Times comparison:
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like the last row on column a, 0 is used as a replacement):
import pandas as pd
def fill_dynamically(s: pd.Series):
for i in range(len(s)):
s[i] = (
(0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
(0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you would have other columns and don't want to apply that on the whole dataframe, you can of course use it on a single column, like that:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
fill_dynamically(df['a'])
In this case, df looks like that:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0

How to freeze first numbers in sequences between NaNs in Python pandas dataframe

Is there a Pythonic way to, in a timeseries dataframe, by column, go down and pick the first number in a sequence, and then push it forward until the next NaN, and then take the next non-NaN number and push that one down until the next NaN, and so on (retaining the indices and NaNs).
For example, I would like to convert this dataframe:
DF = pd.DataFrame(data={'A':[np.nan,1,3,5,7,np.nan,2,4,6,np.nan], 'B':[8,6,4,np.nan,np.nan,9,7,3,np.nan,3], 'C':[np.nan,np.nan,4,2,6,np.nan,1,5,2,8]})
A B C
0 NaN 8.0 NaN
1 1.0 6.0 NaN
2 3.0 4.0 4.0
3 5.0 NaN 2.0
4 7.0 NaN 6.0
5 NaN 9.0 NaN
6 2.0 7.0 1.0
7 4.0 3.0 5.0
8 6.0 NaN 2.0
9 NaN 3.0 8.0
To this dataframe:
Result = pd.DataFrame(data={'A':[np.nan,1,1,1,1,np.nan,2,2,2,np.nan], 'B':[8,8,8,np.nan,np.nan,9,9,9,np.nan,3], 'C':[np.nan,np.nan,4,4,4,np.nan,1,1,1,1]})
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
I know I can use a loop to iterate down the columns to do this, but would appreciate some help on how to do it in a more efficient Pythonic way on a very large dataframe. Thank you.
IIUC:
# where DF is not NaN
mask = DF.notna()
Result = (DF.shift(-1) # fill the original NaN's with their next value
.mask(mask) # replace all the original non-NaN with NaN
.ffill() # forward fill
.fillna(DF.iloc[0]) # starting of the the columns with a non-NaN
.where(mask) # replace the original NaN's back
)
Output:
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0

Create a column that has the same length of the longest column in the data at the same time

I have the following data:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
dataFrame = pandas.DataFrame(data).transpose()
Output:
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 NaN 4.0 4.0
4 NaN 5.0 5.0
5 NaN NaN 6.0
6 NaN NaN 7.0
Is it possible to create a 4th column AT THE SAME TIME the others columns are created in data, which has the same length as the longest column of this dataframe (3rd one)?
The data of this column doesn't matter. Assume it's 8. So this is the desired output can be:
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
In my script the dataframe keeps changing every time. This means the longest columns keeps changing with it.
Thanks for reading
This is quite similar to answers from #jpp, #Cleb, and maybe some other answers here, just slightly simpler:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]] + [[]]
This will automatically give you a column of NaNs that is the same length as the longest columnn, so you don't need the extra work of calculating the length of the longest column. Resulting dataframe:
0 1 2 3
0 1.0 1.0 1.0 NaN
1 2.0 2.0 2.0 NaN
2 3.0 3.0 3.0 NaN
3 NaN 4.0 4.0 NaN
4 NaN 5.0 5.0 NaN
5 NaN NaN 6.0 NaN
6 NaN NaN 7.0 NaN
Note that this answer is less general than some others here (such as by #jpp & #Cleb) in that it will only fill with NaNs. If you want some default fill values other than NaN, you should use one of their answers.
You can append to a list which then immediately feeds the pd.DataFrame constructor:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
df = pd.DataFrame(data + [[8]*max(map(len, data))]).transpose()
print(df)
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
But this is inefficient. Pandas uses NumPy to hold underlying series and setting a series to a constant value is trivial and efficient; you can simply use:
df[3] = 8
It is not entirely clear what you mean by at the same time, but the following would work:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
# get the longest list in data
data.append([8] * max(map(len, data)))
pd.DataFrame(data).transpose()
yielding
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If you'd like to do it as you create the DataFrame, simply chain a call to assign:
pd.DataFrame(data).T.assign(**{'3': 8})
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
You can do a def (read comments):
def f(df):
l=[8]*df[max(df,key=lambda x:df[x].count())].count()
df[3]=l+[np.nan]*(len(df)-len(l))
# the above two lines can be just `df[3] = another solution currently for this problem`
return df
dataFrame = f(pandas.DataFrame(data).transpose())
Then now:
print(dataFrame)
Returns:
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
If at you mean at the same time as running pd.DataFrame, the data has to be prepped before it is loaded to your frame.
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
longest = max(len(i) for i in data)
dummy = [8 for i in range(longest)] #dummy data filled with 8
data.append(dummy)
dataFrame = pd.DataFrame(data).transpose()
The example above gets the longest element in your list and creates a dummy to be added onto it before creating your dataframe.
One solution is to add an element to the list that is passed to the dataframe:
pd.DataFrame(data + [[np.hstack(data).max() + 1] * len(max(data))]).T
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If data is to be modified just:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
data = data + [[np.hstack(data).max() + 1] * len(max(data))]
pd.DataFrame(data).T

Pandas shift based on different values to calculate percentages

I am trying to calculate percentages of first down from a dataframe.
Here is the dataframe
down distance
1 1.0 10.0
2 2.0 13.0
3 3.0 15.0
4 3.0 20.0
5 4.0 1.0
6 1.0 10.0
7 2.0 9.0
8 3.0 3.0
9 1.0 10.0
I would like to calculate the percent from first down, meaning for second down, what is the percent of yards gained. For third down, perc of third based on first.
For example, I would like to have the following output.
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 (13-10)/13
3 3.0 15.0 (15-10)/15
4 3.0 20.0 (20-10)/20
5 4.0 1.0 (1-10)/20
6 1.0 10.0 NaN # New calculation
7 2.0 9.0 (9-10)/9
8 3.0 3.0 (3-10)/3
9 1.0 10.0 NaN
Thanks
Current solutions all work correctly for the first question.
Here's a vectorised solution:
# define condition
cond = df['down'] == 1
# calculate value to subtract
first = df['distance'].where(cond).ffill().mask(cond)
# perform calculation
df['percentage'] = (df['distance'] - first) / df['distance']
print(df)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
Using groupby and transform:
s = df.groupby(df.down.eq(1).cumsum()).distance.transform('first')
s = df.distance.sub(s).div(df.distance)
df['percentage'] = s.mask(s.eq(0))
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
With Numpy Bits
Should be pretty zippy!
m = df.down.values == 1 # mask where equal to 1
i = np.flatnonzero(m) # positions where equal to 1
d = df.distance.values # Numpy array of distances
j = np.diff(np.append(i, len(df))) # use diff to find distances between
# values equal to 1. Note that I append
# the length of the df as a terminal value
k = i.repeat(j) # I repeat the positions where equal to 1
# a number of times in order to fill in.
p = np.where(m, np.nan, 1 - d[k] / d) # reduction of % formula while masking
df.assign(percentage=p)
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN
use groupby to group by each time down is equal to 1, than transform with your desired calculation. Then you can find where down is 1 again, and convert to NaN (as the calculation is meaningless there, as per your example):
df['percentage'] = (df.groupby(df.down.eq(1).cumsum())['distance']
.transform(lambda x: (x-x.iloc[0])/x))
df.loc[df.down.eq(1),'percentage'] = np.nan
>>> df
down distance percentage
1 1.0 10.0 NaN
2 2.0 13.0 0.230769
3 3.0 15.0 0.333333
4 3.0 20.0 0.500000
5 4.0 1.0 -9.000000
6 1.0 10.0 NaN
7 2.0 9.0 -0.111111
8 3.0 3.0 -2.333333
9 1.0 10.0 NaN

Pandas fillna() not working as expected

I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A':[1.0,np.nan,5.0],
'B':[1.0,4.0,5.0],
'C':[1.0,1.0,4.0],
'D':[6.0,5.0,5.0],
'E':[1.0,1.0,4.0],
'F':[1.0,np.nan,4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
By using pandas
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
An if condition is True is in operation here: Where elements of pd.notnull(sample_df) are True use the corresponding elements from sample_df else use the elements from sample_df.mean(axis=1) and perform this logic along axis='rows'.

Categories