pandas fillna in subset of rows

I've got df as follows:
a b
0 1 NaN
1 2 NaN
2 1 1.0
3 4 NaN
4 9 1.0
5 6 NaN
6 5 2.0
7 8 NaN
8 9 2.0
I'd like to fill the NaNs only between matching numbers, to get a df like this:
a b
0 1 NaN
1 2 NaN
2 1 1.0
3 4 1.0
4 9 1.0
5 6 NaN
6 5 2.0
7 8 2.0
8 9 2.0
and then create two new dataframes:
a b
2 1 1.0
3 4 1.0
4 9 1.0
a b
6 5 2.0
7 8 2.0
8 9 2.0
that is, select all columns, but only the rows that belong to a filled-in run.
My idea for the first part (filling in the NaNs) is to create a separate dataframe with the row indexes, like:
2 1.0
4 1.0
6 2.0
8 2.0
and, based on that, to create the ranges of row indexes to fill.
My question: is there a more pythonic function for the NaN-replacement part?

How about
df[df.b.ffill()==df.b.bfill()].ffill()
results in
# a b
# 2 1 1.0
# 3 4 1.0
# 4 9 1.0
# 6 5 2.0
# 7 8 2.0
# 8 9 2.0
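Explanation: a row is kept only where the forward-filled and backward-filled values of b agree, i.e. where it lies between two equal numbers (columns c and d below):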
df['c'] = df.b.ffill()
df['d'] = df.b.bfill()
# a b c d
# 0 1 NaN NaN 1.0
# 1 2 NaN NaN 1.0
# 2 1 1.0 1.0 1.0
# 3 4 NaN 1.0 1.0
# 4 9 1.0 1.0 1.0
# 5 6 NaN 1.0 2.0
# 6 5 2.0 2.0 2.0
# 7 8 NaN 2.0 2.0
# 8 9 2.0 2.0 2.0
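To also get the two separate dataframes the question asks for, one possible sketch (not part of the answer above) is to split the filtered result wherever the row index jumps. The names `run_id` and `parts` are made up for illustration, and the split assumes that distinct runs are separated by at least one dropped row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 1, 4, 9, 6, 5, 8, 9],
    "b": [np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, 2.0, np.nan, 2.0],
})

# Keep rows where forward fill and backward fill agree, then fill the gaps.
mask = df["b"].ffill() == df["b"].bfill()
filled = df[mask].ffill()

# A new run starts wherever the surviving index is not consecutive.
run_id = (filled.index.to_series().diff() != 1).cumsum()

# One dataframe per run (hypothetical helper list, for illustration only).
parts = [group for _, group in filled.groupby(run_id)]

for part in parts:
    print(part)
```

Grouping on index gaps rather than on the value of b keeps two runs apart even if they happen to share the same value.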

Related

Pandas get rank on rolling with FixedForwardWindowIndexer

I am using pandas 1.5.1 and I'm trying to get the rank of each row in a dataframe within a forward-looking rolling window, using FixedForwardWindowIndexer. But I can't make sense of the results. My code:
df = pd.DataFrame({"X":[9,3,4,5,1,2,8,7,6,10,11]})
window_size = 5
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window_size)
df.rolling(window=indexer).rank(ascending=False)
results:
X
0 5.0
1 4.0
2 1.0
3 2.0
4 3.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
By my reckoning, it should look like:
X
0 1.0 # based on the window [9,3,4,5,1], 9 is ranked 1st w/ascending = False
1 3.0 # based on the window [3,4,5,1,2], 3 is ranked 3rd
2 3.0 # based on the window [4,5,1,2,8], 4 is ranked 3rd
3 3.0 # etc
4 5.0
5 5.0
6 3.0
7 NaN
8 NaN
9 NaN
10 NaN
I am basing this on a backward-looking window, which works fine:
>>> df.rolling(window_size).rank(ascending=False)
X
0 NaN
1 NaN
2 NaN
3 NaN
4 5.0
5 4.0
6 1.0
7 2.0
8 3.0
9 1.0
10 1.0
Any assistance is most welcome.

Here is another way to do it:
df["rank"] = [
    x.rank(ascending=False).iloc[0].values[0]
    for x in df.rolling(window_size)
    if len(x) == window_size
] + [pd.NA] * (window_size - 1)
Then:
print(df)
# Output
X rank
0 9 1.0
1 3 3.0
2 4 3.0
3 5 3.0
4 1 5.0
5 2 5.0
6 8 3.0
7 7 <NA>
8 6 <NA>
9 10 <NA>
10 11 <NA>
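The forward-looking output in the question is the backward-looking output shifted up by four rows, which is consistent with the rank being computed for the last row of each window rather than the first. As a hedged alternative to the list comprehension above, here is a minimal sketch that ranks the first element of each forward window directly with NumPy. It assumes NumPy 1.20 or later for `sliding_window_view`, and the names `windows`, `ranks` and `forward_rank` are illustrative only:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({"X": [9, 3, 4, 5, 1, 2, 8, 7, 6, 10, 11]})
window_size = 5

# Each row of `windows` is one forward-looking window: X[i], ..., X[i + 4].
windows = sliding_window_view(df["X"].to_numpy(), window_size)

# Descending rank of the first element = how many values in the window are >= it.
# (With ties, this gives the largest rank of the tied group.)
ranks = (windows >= windows[:, [0]]).sum(axis=1).astype(float)

# The last window_size - 1 rows have no complete forward window: pad with NaN.
forward_rank = np.concatenate([ranks, np.full(window_size - 1, np.nan)])
df["rank"] = forward_rank

print(df)
```

This avoids building one small DataFrame per window, which may matter on long columns.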

Pandas: Fill nan values in multiple columns with respective median values but accessing the columns using indices

I have a DataFrame with 15 columns and 5000 rows.
In the DataFrame there are 4 columns that contain NaN values. I would like to replace the values with the median.
As there are several columns, I would like to do this via a for-loop.
These are the column numbers: 1,5,8,9.
The NaN values per column get the corresponding median.
I tried:
for i in [1,5,8,9]:
    df[i] = df[i].fillna(df[i].transform('median'))

No need for a loop; use a vectorized approach:
out = df.fillna(df.median())
Or, to limit it to specific column names:
cols = [1, 5, 8, 9]
# or automatic selection of the columns that contain NaNs
# cols = df.columns[df.isna().any()]
out = df.fillna(df[cols].median())
or by positional indices:
col_pos = [1, 5, 8, 9]
out = df.fillna(df.iloc[:, col_pos].median())
output:
0 1 2 3 4 5 6 7 8 9
0 9 7.0 1 3.0 5.0 7 3 6.0 6.0 7
1 9 1.0 9 6.0 4.5 3 8 4.0 1.0 4
2 5 3.5 3 1.0 4.0 4 4 3.5 3.0 8
3 4 6.0 9 3.0 3.0 2 1 2.0 1.0 3
4 4 1.0 1 3.0 7.0 8 4 3.0 5.0 6
used example input:
0 1 2 3 4 5 6 7 8 9
0 9 7.0 1 3.0 5.0 7 3 6.0 6.0 7
1 9 1.0 9 6.0 NaN 3 8 4.0 1.0 4
2 5 NaN 3 1.0 4.0 4 4 NaN NaN 8
3 4 6.0 9 3.0 3.0 2 1 2.0 1.0 3
4 4 1.0 1 NaN 7.0 8 4 3.0 5.0 6

You can simply do:
df[[1,5,8,9]] = df[[1,5,8,9]].fillna(df[[1,5,8,9]].median())
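When the column labels are not integers, positions and labels differ, so it can help to translate positions into labels first. A small self-contained sketch with made-up data (the column names `w`, `x`, `y` and the positions in `col_pos` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "w": [1.0, np.nan, 3.0],
    "x": [np.nan, 4.0, 8.0],
    "y": [1, 2, 3],
})

# Positions of the columns to fill; translate them into labels once.
col_pos = [0, 1]
cols = df.columns[col_pos]

# fillna with a Series of medians fills each column with its own median.
df[cols] = df[cols].fillna(df[cols].median())

print(df)
```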

Drop nan of each column in Pandas DataFrame

I have a dataframe as example:
      A    B    C
0     1
1     1
2     1
3     1    2
4     1    2
5     1    2
6          2    3
7          2    3
8          2    3
9               3
10              3
11              3
And I would like to remove nan values of each column to get the result:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
Do I have an easy way to do that?

You can apply a custom sort key to each column that doesn't actually sort numerically; it just moves all the NaN values to the end of the column. Then call dropna:
df = df.apply(lambda x: sorted(x, key=lambda v: isinstance(v, float) and np.isnan(v))).dropna()
Output:
>>> df
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0

Given
>>> df
A B C
0 1.0 NaN NaN
1 1.0 NaN NaN
2 1.0 NaN NaN
3 1.0 2.0 NaN
4 1.0 2.0 NaN
5 1.0 2.0 NaN
6 NaN 2.0 3.0
7 NaN 2.0 3.0
8 NaN 2.0 3.0
9 NaN NaN 3.0
10 NaN NaN 3.0
11 NaN NaN 3.0
use
>>> df.apply(lambda s: s.dropna().to_numpy())
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
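Both approaches above rely on every column having the same number of non-NaN values. If that is not guaranteed, a possible variant (a sketch with made-up data, not taken from the answers) is to drop the NaNs per column, renumber each column from 0, and let `pd.concat` pad the shorter columns:

```python
import numpy as np
import pandas as pd

# Hypothetical input: A has three values, B only two.
df = pd.DataFrame({
    "A": [1.0, 1.0, 1.0, np.nan, np.nan],
    "B": [np.nan, np.nan, np.nan, 2.0, 2.0],
})

# Compact each column independently, then align the pieces on the new 0..n index.
out = pd.concat(
    [df[col].dropna().reset_index(drop=True) for col in df.columns],
    axis=1,
)

print(out)
```

Shorter columns end up with trailing NaN, which a final `dropna()` can remove if only complete rows are wanted.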

Fill Nan based on multiple column condition in Pandas

The objective is to fill NaN with respect to two columns (i.e., a and b).
a b c d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
The goal is that, for each fixed value in column b, column a holds a continuous range of values from 1 to 6; the rows added to complete the range get NaN in the other columns.
The following code snippet does the trick:
import numpy as np
import pandas as pd
maxval_col_a=6
lowval_col_a=1
maxval_col_b=2
lowval_col_b=0
r=list(range(lowval_col_b,maxval_col_b+1))
df=pd.DataFrame(np.column_stack([[2,5,6,1,4,5,6,1,6,],
    [0,0,0,1,1,1,1,2,2,], [1,5,1,1,5,5,1,2,5,],[4,6,1,4,6,6,1,3,6,]]),columns=['a','b','c','d'])
all_df=[]
for idx in r:
    k=df.loc[df['b']==idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a+1, 1)).reset_index()
    k['b']=idx
    all_df.append(k)
df=pd.concat(all_df)
But I am curious whether there is a more efficient and better way of doing this with pandas.
The expected output:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0

Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
                                names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
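The same grid of combinations can also be built as an ordinary dataframe and left-joined against the data. This is a different technique from the reindex above, shown only as a sketch; it assumes pandas 1.2 or later for `how="cross"`, and the names `b_values`, `a_values` and `grid` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 5, 6, 1, 4, 5, 6, 1, 6],
    "b": [0, 0, 0, 1, 1, 1, 1, 2, 2],
    "c": [1, 5, 1, 1, 5, 5, 1, 2, 5],
    "d": [4, 6, 1, 4, 6, 6, 1, 3, 6],
})

# Every value of b paired with every value of a in 1..6 (cartesian product).
b_values = pd.DataFrame({"b": sorted(df["b"].unique())})
a_values = pd.DataFrame({"a": range(1, 7)})
grid = b_values.merge(a_values, how="cross")

# A left join keeps every grid row; combinations missing from df get NaN in c and d.
out = grid.merge(df, on=["a", "b"], how="left")[["a", "b", "c", "d"]]

print(out)
```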

First set a MultiIndex on columns [a, b], then build a new MultiIndex with all the combinations, and finally reindex with it
(showing all steps). Note that a starts at 0 here, so the result has an extra a=0 row per group compared with the expected output.
# set both a and b as index (it's a MultiIndex)
df.set_index(['a','b'],drop=True,inplace=True)
# create the new MultiIndex
new_idx_a=np.tile(np.arange(0,6+1),3)
new_idx_b=np.repeat([0,1,2],6+1)
new_multidx=pd.MultiIndex.from_arrays([new_idx_a,
                                       new_idx_b])
# reindex
df=df.reindex(new_multidx)
# convert the MultiIndex back to columns
df.index.names=['a','b']
df.reset_index()
results:
a b c d
0 0 0 NaN NaN
1 1 0 NaN NaN
2 2 0 1.0 4.0
3 3 0 NaN NaN
4 4 0 NaN NaN
5 5 0 5.0 6.0
6 6 0 1.0 1.0
7 0 1 NaN NaN
8 1 1 1.0 4.0
9 2 1 NaN NaN
10 3 1 NaN NaN
11 4 1 5.0 6.0
12 5 1 5.0 6.0
13 6 1 1.0 1.0
14 0 2 NaN NaN
15 1 2 2.0 3.0
16 2 2 NaN NaN
17 3 2 NaN NaN
18 4 2 NaN NaN
19 5 2 NaN NaN
20 6 2 5.0 6.0

We can do it by grouping on column b, then setting a as the index and adding the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x : x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output :
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0

fill NA of a column with elements of another column

I'm in the following situation.
My df looks like this:
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect an intermediate step and a final output like these:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because, each time there is an NA, it tries to fill it with the whole Series rather than with the single element at the same index.
I receive the same error with syntax like this:
df['C'] = df['B'].fillna(df['A'])
so the problem does not seem to be that I first overwrite the values of A with those of B and then try to fill the NAs of B from a column that is effectively the same as B.
I'm in a Databricks environment and I'm working with Koalas DataFrames, but they work like the pandas ones.
Can you help me?

Another option
Suppose the following dataset
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10],
                        'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
                        'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
                        'Apr-21': [121.1, 118.3, 131.5, np.nan, 128.2, 128.2, 115.4, 115.1, np.nan, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3

IIUC:
try with max():
df['A']=df[['A','B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
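In this sample, `max(axis=1)` matches the expected output only because B is never smaller than A in the rows where both are present. For plain pandas, a sketch that states the intent directly ("take B, fall back to A where B is missing") could use `combine_first`. This is written against pandas only; whether the same call behaves identically on Koalas has not been checked here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.0, 3.0, np.nan, 2.0, np.nan, 4.8, np.nan],
    "B": [2.0, 4.0, 1.0, np.nan, 1.0, np.nan, 1.0],
})

# Prefer B; where B is NaN, keep the value from A at the same index.
df["A"] = df["B"].combine_first(df["A"])

print(df)
```

In pandas, `df["B"].fillna(df["A"])` gives the same result, since a Series passed to `fillna` is aligned on the index.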
