Pandas - Fill NaN using multiple values - python

I have a column (let's call it Column X) containing around 16000 NaN values. The column has two possible values, 1 or 0 (so it's binary).
I want to fill the NaN values in Column X, but I don't want to use a single value for ALL the NaN entries.
Say, for instance, that I want to fill 50% of the NaN values with '1' and the other 50% with '0'.
I have read the fillna() documentation but have not found anything that covers this functionality.
I have literally no idea how to move forward with this problem, so I haven't tried anything beyond:
df['Column_x'] = df['Column_x'].fillna(df['Column_x'].mode()[0])
but this would fill ALL the NaN values in Column X of my dataframe df with the mode of the column; I want to fill 50% with one value and the other 50% with a different value.
Since I haven't tried anything yet, I can't show or describe any actual results.
What I can tell is that the expected result would be something along the lines of 8000 NaN values of Column X replaced with '1' and the other 8000 with '0'.
A visual result would be something like:
Before Handling NaN
Index Column_x
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
After Handling NaN
Index Column_x
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 0.0
13 0.0
14 0.0
15 0.0
16 1.0
17 1.0
18 1.0
19 1.0

You can use random.choices with its weights parameter to keep the fill distribution where you want it. I've simulated an all-NaN column with numpy here and computed the exact number of replacements needed. This approach also works for columns with more than two classes and more complex distributions, as sketched after the example.
import pandas as pd
import numpy as np
import random

df = pd.DataFrame({'col1': range(16000)})
df['col2'] = np.nan

# locate the NaNs and count how many replacements are needed
nans = df['col2'].isna()
length = nans.sum()

# draw 0/1 replacements with the desired 50/50 weighting
replacement = random.choices([0, 1], weights=[.5, .5], k=length)
df.loc[nans, 'col2'] = replacement
print(df.describe())
'''
Out:
               col1          col2
count  16000.000000  16000.000000
mean    7999.500000      0.507625
std     4618.946489      0.499957
min        0.000000      0.000000
25%     3999.750000      0.000000
50%     7999.500000      1.000000
75%    11999.250000      1.000000
max    15999.000000      1.000000
'''
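For example, a minimal sketch of the multi-class case (the three classes and their weights here are purely illustrative):
# hypothetical: three numeric classes drawn with a 60/30/10 split
replacement = random.choices([0, 1, 2], weights=[.6, .3, .1], k=length)
df.loc[nans, 'col2'] = replacement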

Using pandas.Series.sample:
# select half of the NaN rows at random and set them to 1
mask = df['Column_x'].isna()
ind = df['Column_x'].loc[mask].sample(frac=0.5).index
df.loc[ind, 'Column_x'] = 1
# whatever is still NaN gets 0
df['Column_x'] = df['Column_x'].fillna(0)
print(df)
Output:
    Index  Column_x
0       0       0.0
1       1       0.0
2       2       0.0
3       3       0.0
4       4       0.0
5       5       0.0
6       6       1.0
7       7       1.0
8       8       1.0
9       9       1.0
10     10       1.0
11     11       1.0
12     12       1.0
13     13       0.0
14     14       1.0
15     15       0.0
16     16       0.0
17     17       1.0
18     18       1.0
19     19       0.0
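If you need the split to be reproducible, Series.sample accepts a random_state seed:
ind = df['Column_x'].loc[mask].sample(frac=0.5, random_state=0).index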

Use column slicing and fill the values directly. isnull() detects missing values in the given Series object.
Example:
import pandas as pd

df = pd.DataFrame({'Column_y': pd.Series(range(9), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
                   'Column_x': pd.Series(range(1), index=['a'])})
print(df)

# get the index labels where Column_x is NaN
idx = df['Column_x'].index[df['Column_x'].isnull()]
total_nan_len = len(idx)
first_nan = total_nan_len // 2

# fill the first 50% with 1
df.loc[idx[0:first_nan], 'Column_x'] = 1
# fill the last 50% with 0
df.loc[idx[first_nan:total_nan_len], 'Column_x'] = 0
print(df)
Output:
Before DataFrame
   Column_y  Column_x
a         0       0.0
b         1       NaN
c         2       NaN
d         3       NaN
e         4       NaN
f         5       NaN
g         6       NaN
h         7       NaN
i         8       NaN
After DataFrame
   Column_y  Column_x
a         0       0.0
b         1       1.0
c         2       1.0
d         3       1.0
e         4       1.0
f         5       0.0
g         6       0.0
h         7       0.0
i         8       0.0

Related

Backfill column values using real value divided by number of preceding NA values in Pandas

test_df = pd.DataFrame({'a':[np.nan,np.nan,np.nan,4,np.nan,np.nan,6]})
test_df
a
0 NaN
1 NaN
2 NaN
3 4.0
4 NaN
5 NaN
6 6.0
I'm trying to backfill with the real value divided by the number of preceding NA values plus one (the row itself). The following is what I'm trying to get:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
Try:
# identify the blocks by cumsum on the reversed non-nan series
groups = test_df['a'].notna()[::-1].cumsum()
# groupby and transform
test_df['a'] = test_df['a'].fillna(0).groupby(groups).transform('mean')
Output:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
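To see why the grouping works: the cumsum over the reversed notna mask gives every NaN run the same label as the non-NaN value that closes it. For this input the labels, realigned to the original index order, are:
print(groups.sort_index().tolist())  # [2, 2, 2, 2, 1, 1, 1]
so rows 0-3 average to (0+0+0+4)/4 = 1.0 and rows 4-6 to (0+0+6)/3 = 2.0.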
IIUC use:
# get reverse group
group = test_df.loc[::-1,'a'].notna().cumsum()
# get size and divide
test_df['a'] = (test_df['a']
                .bfill()
                .div(test_df.groupby(group)['a'].transform('size'))
                )
Or with rdiv:
test_df['a'] = (test_df
                .groupby(group)['a']
                .transform('size')
                .rdiv(test_df['a'].bfill())
                )
Output (as new column for clarity):
     a   a2
0  NaN  1.0
1  NaN  1.0
2  NaN  1.0
3  4.0  1.0
4  NaN  2.0
5  NaN  2.0
6  6.0  2.0

Forward fill only certain value

I have an array which represents object states, where 0 - object is off, and 1 - object is on.
import pandas as pd
import numpy as np
s = [np.nan, 0, np.nan, np.nan, 1, np.nan, np.nan, 0, np.nan, 1, np.nan]
df = pd.DataFrame(s, columns=["s"])
df
s
0 NaN
1 0.0
2 NaN
3 NaN
4 1.0
5 NaN
6 NaN
7 0.0
8 NaN
9 1.0
10 NaN
I need to forward fill only 0-values in it, like below.
>>> df_wanted
s
0 NaN
1 0.0
2 0.0
3 0.0
4 1.0
5 NaN
6 NaN
7 0.0
8 0.0
9 1.0
10 NaN
After browsing similar questions here, I just compare ffill-ed and bfill-ed values and assign back with a mask:
mask = (df.ffill() == 0) & (df.bfill() == 1)
df[mask] = 0
df
s
0 NaN
1 0.0
2 0.0
3 0.0
4 1.0
5 NaN
6 NaN
7 0.0
8 0.0
9 1.0
10 NaN
But it won't help if a 0 value is not followed by a 1. What would be a more elegant solution that takes such cases into account?
mask = (df.ffill() == 0) alone should suffice for your use case.
df.ffill() propagates the last valid observation forward, so NaN rows that come after a 0 are filled with 0s and NaN rows that come after a 1 are filled with 1s. Comparing the result to 0 selects only the 0 rows, and that serves as a mask to get your final df.
Example (added a 0 and a few NaNs to the end of your df):
>>> s = [np.nan, 0, np.nan, np.nan, 1, np.nan, np.nan, 0, np.nan, 1, np.nan, np.nan, 0, np.nan, np.nan, np.nan]
>>> df = pd.DataFrame(s, columns=["s"])
>>> df
s
0 NaN
1 0.0
2 NaN
3 NaN
4 1.0
5 NaN
6 NaN
7 0.0
8 NaN
9 1.0
10 NaN
11 NaN
12 0.0
13 NaN
14 NaN
15 NaN
>>> df[df.ffill() == 0] = 0
>>> df
s
0 NaN
1 0.0
2 0.0
3 0.0
4 1.0
5 NaN
6 NaN
7 0.0
8 0.0
9 1.0
10 NaN
11 NaN
12 0.0
13 0.0
14 0.0
15 0.0
One way, maybe not the most elegant but one that works, is to just ffill everything and then pick from it where your original series was NaN and the ffilled series is 0.
sf = df.ffill().values[:, 0]
desired = np.where(np.isnan(s) & (sf==0), sf, s)
pandas has where/mask functions too; I'm just more comfortable with numpy since it's more versatile. A pandas-only version is sketched below.
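For completeness, a minimal pandas-only sketch of the same idea, assuming the df from the question (Series.mask replaces values where the condition is True):
s0 = df['s']
df['s'] = s0.mask(s0.isna() & s0.ffill().eq(0), 0)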

Pandas dataframe insert missing row and fill with previous row

I have a dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})

   id  A    B
0   0  0  NaN
1   1  1  NaN
2   2  0  1.0
3   4  1  NaN
4   5  0  NaN
Notice that the vast majority of values in the B column are NaN.
The id column increments by 1, so one row between id 2 and 4 is missing.
The missing row that needs to be inserted is the same as the previous row, except for the id column.
So, for example, the result is:
   id    A    B
0   0  0.0  NaN
1   1  1.0  NaN
2   2  0.0  1.0
3   3  0.0  1.0  <- add row here
4   4  1.0  NaN
5   5  0.0  NaN
I can do this for the A column, but I don't know how to deal with the B column, as ffill would fill 1.0 at rows 4 and 5, which is incorrect:
step = 1
idx = np.arange(df['id'].min(), df['id'].max() + step, step)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
EDIT:
Sorry, I forgot one situation: the B column can have different values.
When DataFrame is as below:
   id  A    B
0   0  0  NaN
1   1  1  NaN
2   2  0  1.0
3   4  1  NaN
4   5  0  NaN
5   6  1  2.0
6   9  0  NaN
7  10  1  NaN
the result would be:
    id  A    B
0    0  0  NaN
1    1  1  NaN
2    2  0  1.0
3    3  0  1.0
4    4  1  NaN
5    5  0  NaN
6    6  1  2.0
7    7  1  2.0
8    8  1  2.0
9    9  0  NaN
10  10  1  NaN
Keep a copy of the original ids first, then use update together with isin:
s = df.id.copy()  # change 1: remember which ids existed originally
step = 1
idx = np.arange(df['id'].min(), df['id'].max() + step, step)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))  # change 2: only fill B in the newly inserted rows
df
   id    A    B
0   0  0.0  NaN
1   1  1.0  NaN
2   2  0.0  1.0
3   3  0.0  1.0
4   4  1.0  NaN
5   5  0.0  NaN
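The same snippet also handles the edited example with multiple B values; only the inserted rows pick up the forward-filled B. A quick sketch reusing the code above on the second input:
df = pd.DataFrame({'id': [0, 1, 2, 4, 5, 6, 9, 10],
                   'A': [0, 1, 0, 1, 0, 1, 0, 1],
                   'B': [None, None, 1, None, None, 2, None, None]})
s = df.id.copy()
idx = np.arange(df['id'].min(), df['id'].max() + 1)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))
# inserted rows 3, 7 and 8 get 1.0, 2.0 and 2.0 respectively; the original NaNs stay NaN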
If I understand correctly, here is some sample code.
new_df = pd.DataFrame({
    'new_id': list(range(df['id'].max() + 1)),
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
     A    B  new_id
0  0.0  NaN       0
1  1.0  NaN       1
2  0.0  1.0       2
5  0.0  1.0       3
3  1.0  1.0       4
4  0.0  1.0       5
Try this:
df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})
# ids missing from the sequence
missingid = list(set(range(df.id.min(), df.id.max())) - set(df.id.tolist()))
for i in sorted(missingid):  # sorted so consecutive gaps are filled in order
    df.loc[len(df)] = np.concatenate((np.array([i]), df[df.id == i - 1][["A", "B"]].values[0]))
df = df.sort_values("id").reset_index(drop=True)
Output:
    id    A    B
0  0.0  0.0  NaN
1  1.0  1.0  NaN
2  2.0  0.0  1.0
3  3.0  0.0  1.0
4  4.0  1.0  NaN
5  5.0  0.0  NaN

Compute rolling sum shifted for each group

My goal is to perform a groupby, then create rolling total stats, and then shift. I need it to shift at the first instance of each unique player. Right now it is shifting the entire dataframe once, rather than doing it for each grouped player.
Original Data -
  player        date  won
0      A  2016-01-11    0
1      A  2016-02-01    0
2      A  2016-02-01    1
3      A  2016-02-01    1
4      A  2016-10-24    0
5      A  2016-10-31    0
6      A  2018-10-22    0
7      B  2016-10-24    0
8      B  2016-10-24    1
9      B  2017-11-13    0
Things I've tried -
1
temp = temp_master.groupby('player', sort=False)[count_fields].rolling(10, min_periods=1).sum().shift(1).reset_index(drop=True)
temp = temp.add_suffix('_total')
temp['won_total'].head(10)
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 0.0
9 1.0
2
temp = temp_master.groupby('player', sort=False)[count_fields].shift(1).rolling(10, min_periods=1).sum().reset_index(drop=True)
temp = temp.add_suffix('_total')
temp['won_total'].head(10)
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 3.0
3
temp = temp_master.groupby('player', sort=False)[count_fields].rolling(10, min_periods=1).sum().reset_index(drop=True)
temp = temp.add_suffix('_total')
temp = temp.shift(1)
temp['won_total'].head(10)
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 0.0
9 1.0
This is what I need the results to be -
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 NaN
8 0.0
9 1.0
Index #7 should equal NaN: it is the first instance of player B, and I want the shift to happen at the first instance of every new player so that stats are summarized per player.
Index #8 should equal 0.
Index #9 should equal 1.
It looks like attempts #1 and #3 are close, but they're not assigning the NaN value on the new player. #3 isn't grouped by player anymore, though, so I know that won't really work.
Also, this will be done on a fair amount of data (around 100K-300K rows), and 'count_fields' contains around 3K-4K columns that I am calculating. Just something to be aware of.
Any ideas on how to create running stats by player and shift down for every player?
You need apply here. The two functions are not both chained under the groupby object: sum is applied per group, but shift is applied to the result of sum, which is the whole column across all groups.
temp = (temp_master.groupby('player', sort=False)['won']
        .apply(lambda x: x.rolling(10, min_periods=1).sum().shift(1))
        .reset_index(drop=True))
temp
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 NaN
8 0.0
9 1.0
Name: won, dtype: float64
Another option if you don't want to use apply is to layer a second groupby call and perform the shifting:
(df.groupby('player', sort=False)
   .won.rolling(10, min_periods=1)
   .sum()
   .groupby(level=0)
   .shift()
   .reset_index(drop=True))
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 NaN
8 0.0
9 1.0
Name: won, dtype: float64
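Since the question mentions thousands of stat columns, here is a sketch (not benchmarked) extending the second approach to the whole count_fields list at once, using the _total suffix from the question's own attempts; count_fields is the question's list of stat columns:
temp = (temp_master.groupby('player', sort=False)[count_fields]
        .rolling(10, min_periods=1)
        .sum()
        .groupby(level=0)  # level 0 is 'player' after the rolling aggregation
        .shift()
        .reset_index(drop=True)
        .add_suffix('_total'))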

transform on multiple columns to interpolate/copy missing values

I'm trying to fill out missing values in a pandas dataframe by interpolating or copying the last-known value within a group (identified by trip). My data looks like this:
    brake   speed  trip
0     0.0     NaN     1
1     1.0     NaN     1
2     NaN   1.264     1
3     NaN   0.000     1
4     0.0     NaN     1
5     NaN   1.264     1
6     NaN   6.704     1
7     1.0     NaN     1
8     0.0     NaN     1
9     NaN  11.746     2
10    1.0     NaN     2
11    0.0     NaN     2
12    NaN  16.961     3
13    1.0     NaN     3
14    NaN  11.832     3
15    0.0     NaN     3
16    NaN  17.082     3
17    NaN  22.435     3
18    NaN  28.707     3
19    NaN  34.216     3
I have found Pandas interpolate within a groupby, but I need brake to simply be copied from the last known value, while speed is interpolated (my actual dataset has 12 columns that each need such treatment).
You can apply separate methods to each column. For example:
# interpolate speed within each trip
df['speed'] = df.groupby('trip').speed.transform(lambda x: x.interpolate())
# fill brake with the last known value within each trip
df['brake'] = df.groupby('trip').brake.transform(lambda x: x.ffill())
>>> df
    brake    speed  trip
0     0.0      NaN     1
1     1.0      NaN     1
2     1.0   1.2640     1
3     1.0   0.0000     1
4     0.0   0.6320     1
5     0.0   1.2640     1
6     0.0   6.7040     1
7     1.0   6.7040     1
8     0.0   6.7040     1
9     NaN  11.7460     2
10    1.0  11.7460     2
11    0.0  11.7460     2
12    NaN  16.9610     3
13    1.0  14.3965     3
14    1.0  11.8320     3
15    0.0  14.4570     3
16    0.0  17.0820     3
17    0.0  22.4350     3
18    0.0  28.7070     3
19    0.0  34.2160     3
Note that this leaves some NaNs in brake, because there was no "last known value" for the first row of a trip, and some NaNs in speed where the first few rows of a trip were NaN. You can replace these as you see fit with fillna().
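Since the real dataset has 12 such columns, a minimal sketch that scales the same pattern, assuming a hypothetical mapping of column name to fill strategy:
# hypothetical: extend this dict with the remaining columns
strategies = {'speed': lambda s: s.interpolate(),
              'brake': lambda s: s.ffill()}
for col, fn in strategies.items():
    df[col] = df.groupby('trip')[col].transform(fn)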
