I am using pandas 1.5.1 and I'm trying to get the rank of each row in a dataframe within a rolling window that looks ahead, by employing FixedForwardWindowIndexer. But I can't make sense of the results. My code:
import pandas as pd

df = pd.DataFrame({"X":[9,3,4,5,1,2,8,7,6,10,11]})
window_size = 5
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window_size)
df.rolling(window=indexer).rank(ascending=False)
results:
X
0 5.0
1 4.0
2 1.0
3 2.0
4 3.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
By my reckoning, it should look like:
X
0 1.0 # based on the window [9,3,4,5,1], 9 is ranked 1st w/ascending = False
1 3.0 # based on the window [3,4,5,1,2], 3 is ranked 3rd
2 3.0 # based on the window [4,5,1,2,8], 4 is ranked 3rd
3 3.0 # etc
4 5.0
5 5.0
6 3.0
7 NaN
8 NaN
9 NaN
10 NaN
I am basing this on a backward-looking window, which works fine:
>>> df.rolling(window_size).rank(ascending=False)
X
0 NaN
1 NaN
2 NaN
3 NaN
4 5.0
5 4.0
6 1.0
7 2.0
8 3.0
9 1.0
10 1.0
Any assistance is most welcome.
Here is another way to do it:
df["rank"] = [
x.rank(ascending=False).iloc[0].values[0]
for x in df.rolling(window_size)
if len(x) == window_size
] + [pd.NA] * (window_size - 1)
Then:
print(df)
# Output
X rank
0 9 1.0
1 3 3.0
2 4 3.0
3 5 3.0
4 1 5.0
5 2 5.0
6 8 3.0
7 7 <NA>
8 6 <NA>
9 10 <NA>
10 11 <NA>
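From the outputs above, rolling(...).rank() appears to report the rank of the window's last element even with a forward-looking indexer (the row-0 result of 5.0 is the descending rank of 1, the last value in the window [9,3,4,5,1]). If so, a minimal sketch that exploits this by reversing the series, so that the last element of each backward window is the first element of the corresponding forward window:

import pandas as pd

df = pd.DataFrame({"X": [9, 3, 4, 5, 1, 2, 8, 7, 6, 10, 11]})
window_size = 5

# An ordinary backward-looking rolling rank ranks the last element of
# each window; on the reversed data that element is exactly the first
# element of the forward window on the original data.
df["rank"] = df["X"][::-1].rolling(window_size).rank(ascending=False)[::-1]

This reproduces the expected column above: 1.0, 3.0, 3.0, 3.0, 5.0, 5.0, 3.0, then four NaNs.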
I am looking for a method to create an array of numbers to label groups based on the value of the 'number' column, if that's possible.
With this abbreviated example DF:
import numpy as np
import pandas as pd

nan = np.nan
number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the int in the 'number' column, so there would effectively be runs of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB: if the gaps were zeros instead of NaN: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
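To illustrate the NB above (a sketch; the zeros-as-gaps layout is hypothetical, it does not appear in the question):

import pandas as pd

# hypothetical variant where gaps are encoded as 0 instead of NaN
s = pd.Series([0, 0, 1, 0, 0, 2, 0])
groups = s.ne(0).cumsum()  # -> 0, 0, 1, 1, 1, 2, 2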
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
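If integer labels are wanted (an assumption; the question does not pin down the dtype), a cast at the end does it:

# ffill propagates the last seen marker, fillna(0) covers the rows
# before the first marker, astype(int) turns 0.0, 1.0, ... into 0, 1, ...
df['group'] = df['number'].ffill().fillna(0).astype(int)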
I have a DataFrame with 15 columns and 5000 rows.
In the DataFrame there are 4 columns that contain NaN values. I would like to replace the values with the median.
As there are several columns, I would like to do this via a for-loop.
These are the column numbers: 1,5,8,9.
The NaN values in each column should be replaced with that column's median.
I tried:
for i in [1,5,8,9]:
    df[i] = df[i].fillna(df[i].transform('median'))
No need for a loop, use a vectorized approach:
out = df.fillna(df.median())
Or, to limit to specific columns names:
cols = [1, 5, 8, 9]
# or automatic selection of columns with NaNs
# cols = df.columns[df.isna().any()]
out = df.fillna(df[cols].median())
or positional indices:
col_pos = [1, 5, 8, 9]
out = df.fillna(df.iloc[:, col_pos].median())
output:
0 1 2 3 4 5 6 7 8 9
0 9 7.0 1 3.0 5.0 7 3 6.0 6.0 7
1 9 1.0 9 6.0 4.5 3 8 4.0 1.0 4
2 5 3.5 3 1.0 4.0 4 4 3.5 3.0 8
3 4 6.0 9 3.0 3.0 2 1 2.0 1.0 3
4 4 1.0 1 3.0 7.0 8 4 3.0 5.0 6
Example input used:
0 1 2 3 4 5 6 7 8 9
0 9 7.0 1 3.0 5.0 7 3 6.0 6.0 7
1 9 1.0 9 6.0 NaN 3 8 4.0 1.0 4
2 5 NaN 3 1.0 4.0 4 4 NaN NaN 8
3 4 6.0 9 3.0 3.0 2 1 2.0 1.0 3
4 4 1.0 1 NaN 7.0 8 4 3.0 5.0 6
You can simply do:
df[[1,5,8,9]] = df[[1,5,8,9]].fillna(df[[1,5,8,9]].median())
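For completeness, the original loop also works once transform is dropped, since Series.median() already returns a scalar (a sketch assuming the integer column labels 1, 5, 8, 9 from the question):

# fillna broadcasts each column's scalar median over that column's NaNs
for i in [1, 5, 8, 9]:
    df[i] = df[i].fillna(df[i].median())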
I'm in this situation.
My df is like this:
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect the intermediate step and the final output to look like this:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because each time there is an NA it tries to fill it with the whole series rather than with the single element at the same index.
I receive the same error with a syntax like this:
df['C'] = df['B'].fillna(df['A'])
So the problem does not seem to be that I'm first overwriting the values of A with those of B and then trying to fill the NAs of B from a column that is by then technically identical to B.
I'm in a Databricks environment and I'm working with Koalas DataFrames, but they behave like pandas ones.
Can you help me?
Another option
Suppose the following dataset
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'State': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
                        'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
                        'Apr-21': [121.1, 118.3, 131.5, np.NaN, 128.2, 128.2, 115.4, 115.1, np.NaN, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3
IIUC, try with max() (it works for this data because, in every row where both A and B are present, B is the larger value):
df['A'] = df[['A','B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
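As a side note, plain pandas also has combine_first, which expresses the intent directly; whether Koalas supports it is not verified here, so treat this as a sketch:

# combine_first fills the NaNs of B with the values of A at the same
# index, i.e. the row-wise fallback the question describes
df['A'] = df['B'].combine_first(df['A'])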
I've got df as follows:
a b
0 1 NaN
1 2 NaN
2 1 1.0
3 4 NaN
4 9 1.0
5 6 NaN
6 5 2.0
7 8 NaN
8 9 2.0
I'd like to fill NaNs only between numbers, to get a df like this:
a b
0 1 NaN
1 2 NaN
2 1 1.0
3 4 1.0
4 9 1.0
5 6 NaN
6 5 2.0
7 8 2.0
8 9 2.0
and then create two new dataframes:
a b
2 1 1.0
3 4 1.0
4 9 1.0
a b
6 5 2.0
7 8 2.0
8 9 2.0
meaning: select only the rows where the NaNs were filled.
My idea for the first part, the NaN filling, is to create a separate dataframe with row indexes like:
2 1.0
4 1.0
6 2.0
8 2.0
and based on that, create ranges of row indexes to fill.
My question is: is there a more pythonic way to do this NaN replacement?
How about
df[df.b.ffill()==df.b.bfill()].ffill()
results in
# a b
# 2 1 1.0
# 3 4 1.0
# 4 9 1.0
# 6 5 2.0
# 7 8 2.0
# 8 9 2.0
Explanation:
df['c'] = df.b.ffill()
df['d'] = df.b.bfill()
# a b c d
# 0 1 NaN NaN 1.0
# 1 2 NaN NaN 1.0
# 2 1 1.0 1.0 1.0
# 3 4 NaN 1.0 1.0
# 4 9 1.0 1.0 1.0
# 5 6 NaN 1.0 2.0
# 6 5 2.0 2.0 2.0
# 7 8 NaN 2.0 2.0
# 8 9 2.0 2.0 2.0
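For the second part of the question, splitting the filled frame into one DataFrame per group value, a groupby on b would do it (a sketch building on the expression above):

# keep only the rows where forward- and backward-fill agree, fill them,
# then split into one sub-frame per distinct value of b
filled = df[df.b.ffill() == df.b.bfill()].ffill()
df1, df2 = (g for _, g in filled.groupby('b'))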
I want to select rows with groupby conditions.
import pandas as pd
import numpy as np
dftest = pd.DataFrame({'A': ['Feb', np.nan, 'Air', 'Flow', 'Feb',
                             'Beta', 'Cat', 'Feb', 'Beta', 'Air'],
                       'B': ['s', 's', 't', 's', 't', 's', 't', 't', 't', 't'],
                       'C': [5, 4, 3, 2, 1, 7, 6, 5, 4, 3],
                       'D': [4, np.nan, 3, np.nan, 2,
                             np.nan, 2, 3, np.nan, 7]})
def filcols3(df, dd):
    if df.iloc[0]['D'] == dd:
        return df

dd = 4
grp = dftest.groupby('B').apply(filcols3, dd)
the result of grp is:
A B C D
B
s 0 Feb s 5 4.0
1 NaN s 4 NaN
3 Flow s 2 NaN
5 Beta s 7 NaN
this is what I want.
while if I use the following code (part 2):
def filcols3(df, dd):
    if df.iloc[0]['D'] <= dd:
        return df

dd = 3
the result is:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 Air t 3.0 3.0
3 NaN NaN NaN NaN
4 Feb t 1.0 2.0
5 NaN NaN NaN NaN
6 Cat t 6.0 2.0
7 Feb t 5.0 3.0
8 Beta t 4.0 NaN
9 Air t 3.0 7.0
I'm surprised by this result; what I meant to get is:
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
What's wrong with the code in part 2? How can I get the final result I want?
apply's behaviour is a little non-intuitive here, but if the idea is to filter out entire groups based on a specific condition per group, you can use GroupBy.transform to get a mask and filter dftest:
dftest[dftest.groupby('B')['D'].transform('first') <= 3]
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
Or, fixing your code (unlike transform('first'), which skips NaNs, this takes the positional first value, matching your iloc[0]):
dftest[dftest.groupby('B')['D'].transform(lambda x: x.values[0] <= 3)]
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
You may check with filter:
dftest.groupby('B').filter(lambda x: any(x['D'].head(1) <= 3))
Out[538]:
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0
Or without groupby, using drop_duplicates:
s = dftest.drop_duplicates('B').D <= 3
dftest[dftest.B.isin(dftest.loc[s.index, 'B'][s])]
Out[550]:
A B C D
2 Air t 3 3.0
4 Feb t 1 2.0
6 Cat t 6 2.0
7 Feb t 5 3.0
8 Beta t 4 NaN
9 Air t 3 7.0