Pandas: start a new group on every non-NA value - python

I am looking for a method to create an array of numbers to label groups, based on the value of the 'number' column, if that's possible.
With this abbreviated example DF:
import numpy as np
import pandas as pd

nan = np.nan
number = [nan, nan, 1, nan, nan, nan, 2, nan, nan, 3, nan, nan, nan, nan, nan, 4, nan, nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the int in column 'number', so there would effectively be groups of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!

You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB: if the markers were zeros instead of NaN, you could use df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
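To see why this works: notna produces a boolean mask that is True at each group start, and cumsum turns that mask into a running group counter (rows before the first number fall into group 0). A minimal sketch of the intermediate steps on a shortened version of the example data:
import numpy as np
import pandas as pd

number = [np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan]
df = pd.DataFrame({'number': number})
mask = df['number'].notna()        # True where a new group starts
print(mask.astype(int).tolist())   # [0, 0, 1, 0, 0, 0, 1, 0]
print(mask.cumsum().tolist())      # [0, 0, 1, 1, 1, 1, 2, 2]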

You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
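If you want this as an integer 'group' column like in the first answer, you could cast after filling (a small sketch building on the line above):
df['group'] = df['number'].ffill().fillna(0).astype(int)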

Related

Pandas get rank on rolling with FixedForwardWindowIndexer

I am using pandas 1.5.1 and I'm trying to get the rank of each row in a dataframe in a rolling window that looks ahead, by employing FixedForwardWindowIndexer. But I can't make sense of the results. My code:
df = pd.DataFrame({"X":[9,3,4,5,1,2,8,7,6,10,11]})
window_size = 5
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window_size)
df.rolling(window=indexer).rank(ascending=False)
results:
X
0 5.0
1 4.0
2 1.0
3 2.0
4 3.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
By my reckoning, it should look like:
X
0 1.0 # based on the window [9,3,4,5,1], 9 is ranked 1st w/ascending = False
1 3.0 # based on the window [3,4,5,1,2], 3 is ranked 3rd
2 3.0 # based on the window [4,5,1,2,8], 4 is ranked 3rd
3 3.0 # etc
4 5.0
5 5.0
6 3.0
7 NaN
8 NaN
9 NaN
10 NaN
I am basing this on a backward-looking window, which works fine:
>>> df.rolling(window_size).rank(ascending=False)
X
0 NaN
1 NaN
2 NaN
3 NaN
4 5.0
5 4.0
6 1.0
7 2.0
8 3.0
9 1.0
10 1.0
Any assistance is most welcome.
Here is another way to do it:
df["rank"] = [
x.rank(ascending=False).iloc[0].values[0]
for x in df.rolling(window_size)
if len(x) == window_size
] + [pd.NA] * (window_size - 1)
Then:
print(df)
# Output
X rank
0 9 1.0
1 3 3.0
2 4 3.0
3 5 3.0
4 1 5.0
5 2 5.0
6 8 3.0
7 7 <NA>
8 6 <NA>
9 10 <NA>
10 11 <NA>
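For clarity, the same "rank the first row of each forward-looking window" logic can be written as an explicit helper; forward_rank below is a hypothetical function written for this sketch, not a pandas API:
import numpy as np
import pandas as pd

def forward_rank(s: pd.Series, w: int) -> pd.Series:
    # Rank s[i] (descending) inside the window s[i:i+w];
    # positions without a full window ahead of them get NaN.
    out = [
        s.iloc[i:i + w].rank(ascending=False).iloc[0] if i + w <= len(s) else np.nan
        for i in range(len(s))
    ]
    return pd.Series(out, index=s.index)

df = pd.DataFrame({"X": [9, 3, 4, 5, 1, 2, 8, 7, 6, 10, 11]})
df["rank"] = forward_rank(df["X"], 5)  # matches the output above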

Fill Nan based on multiple column condition in Pandas

The objective is to fill NaN with respect to two columns (i.e., a and b).
a  b  c  d
2  0  1  4
5  0  5  6
6  0  1  1
1  1  1  4
4  1  5  6
5  1  5  6
6  1  1  1
1  2  2  3
6  2  5  6
For each fixed value in column b, column a should contain the continuous run of values from 1 to 6; the rows added to complete the run get NaN in the other columns.
The following code snippet does the trick:
import numpy as np
import pandas as pd

maxval_col_a = 6
lowval_col_a = 1
maxval_col_b = 2
lowval_col_b = 0
r = list(range(lowval_col_b, maxval_col_b + 1))
df = pd.DataFrame(np.column_stack([[2, 5, 6, 1, 4, 5, 6, 1, 6],
                                   [0, 0, 0, 1, 1, 1, 1, 2, 2],
                                   [1, 5, 1, 1, 5, 5, 1, 2, 5],
                                   [4, 6, 1, 4, 6, 6, 1, 3, 6]]),
                  columns=['a', 'b', 'c', 'd'])
all_df = []
for idx in r:
    k = df.loc[df['b'] == idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a + 1)).reset_index()
    k['b'] = idx
    all_df.append(k)
df = pd.concat(all_df)
But I am curious whether there is a more efficient or better way of doing this with pandas.
The expected output:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
                                names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
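For reference, the swaplevel call is needed because reindex must see the levels in the same (a, b) order that set_index produced; a quick check of the generated index (a small sketch, hard-coding the b values from the example):
mi = pd.MultiIndex.from_product([[0, 1, 2], range(1, 7)],
                                names=['b', 'a']).swaplevel()
print(mi[:3].tolist())  # [(1, 0), (2, 0), (3, 0)] -> (a, b) pairs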
First create a MultiIndex from cols [a, b], then a new MultiIndex with all the combinations, and then reindex with the new MultiIndex:
(showing all steps)
# set both a and b as index (it's a MultiIndex)
df.set_index(['a', 'b'], drop=True, inplace=True)
# create the new MultiIndex with all the combinations
new_idx_a = np.tile(np.arange(0, 6 + 1), 3)
new_idx_b = np.repeat([0, 1, 2], 6 + 1)
new_multidx = pd.MultiIndex.from_arrays([new_idx_a, new_idx_b])
# reindex
df = df.reindex(new_multidx)
# convert the MultiIndex back to columns
df.index.names = ['a', 'b']
df.reset_index()
results:
a b c d
0 0 0 NaN NaN
1 1 0 NaN NaN
2 2 0 1.0 4.0
3 3 0 NaN NaN
4 4 0 NaN NaN
5 5 0 5.0 6.0
6 6 0 1.0 1.0
7 0 1 NaN NaN
8 1 1 1.0 4.0
9 2 1 NaN NaN
10 3 1 NaN NaN
11 4 1 5.0 6.0
12 5 1 5.0 6.0
13 6 1 1.0 1.0
14 0 2 NaN NaN
15 1 2 2.0 3.0
16 2 2 NaN NaN
17 3 2 NaN NaN
18 4 2 NaN NaN
19 5 2 NaN NaN
20 6 2 5.0 6.0
We can do it by using a groupby on column b, then setting a as the index and adding the missing values of a with numpy.arange. To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output:
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0
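On recent pandas versions (2.2+), groupby.apply warns when the grouping column is passed through to the function; selecting only the value columns avoids both the warning and the later drop. A sketch, assuming the same df as in the question:
out = (df.set_index('a')
         .groupby('b')[['c', 'd']]
         .apply(lambda x: x.reindex(np.arange(1, 7)))
         .reset_index())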

Forward fill non na values with last observation carried forwards in Python

Suppose I had a column in a dataframe like:
colname
Na
Na
Na
1
2
3
4
Na
Na
Na
Na
2
8
5
44
Na
Na
Does anyone know of a function to forward fill the non-NA values with the first value in each non-NA run? To produce:
colname
Na
Na
Na
1
1
1
1
Na
Na
Na
Na
2
2
2
2
Na
Na
Use GroupBy.transform with GroupBy.first: flag the missing values with Series.isna, turn the flags into group ids with a cumulative sum via Series.cumsum, and finally restore the leading NaNs with Series.where and Series.duplicated:
s = df['colname'].isna().cumsum()
df['colname'] = df.groupby(s)['colname'].transform('first').where(s.duplicated())
print(df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
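To see how the grouping works here: isna().cumsum() increments on every NaN row, so all rows of a non-NA run share the group id created by the NaN immediately before the run, and s.duplicated() is False exactly on each group's first row. A short sketch:
import numpy as np
import pandas as pd

col = pd.Series([np.nan, np.nan, np.nan, 1, 2, 3, 4, np.nan])
s = col.isna().cumsum()
print(s.tolist())  # [1, 2, 3, 3, 3, 3, 3, 4] -> the run 1, 2, 3, 4 is group 3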
Or filter only the non-missing values with the inverted mask m and process only those groups:
m = df['colname'].isna()
df.loc[~m, 'colname'] = df[~m].groupby(m.cumsum())['colname'].transform('first')
print(df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
A solution without groupby:
m = df['colname'].isna()
m1 = m.cumsum().shift().bfill()
m2 = ~m1.duplicated() & m.duplicated(keep=False)
df['colname'] = df['colname'].where(m2).ffill().mask(m)
print(df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
You could try groupby and cumsum with shift and transform('first'):
>>> df.groupby(df['colname'].isna().ne(df['colname'].isna().shift()).cumsum()).transform('first')
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
Or try something like:
>>> x = df.groupby(df['colname'].isna().cumsum()).transform('first')
>>> x.loc[~x.duplicated()] = np.nan
>>> x
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN

concat result of groupby pandas

I am asking this question to learn a new method.
I have a dataframe like the one below:
ID Value
0 1 10
1 1 12
2 1 14
3 1 16
4 1 18
5 2 32
6 2 12
7 2 -8
8 2 -28
9 2 -48
10 2 -68
11 3 12
12 3 1
13 3 43
I want to convert this into:
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
One way to solve this:
print(pd.concat([df[df['ID'] == 1].reset_index(drop=True),
                 df[df['ID'] == 2].reset_index(drop=True),
                 df[df['ID'] == 3].reset_index(drop=True)], axis=1))
But I'm wondering: can I do the same concat operation for each groupby result instead of filtering by value?
Any better/new approaches are more appreciated.
Thanks in advance.
Yup, very possible and quite simple with pd.concat, in fact.
df = pd.concat({k : g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
df.columns = df.columns.droplevel(0)
Or, a minor variation on Dark's (now deleted) answer (which does not give you the opportunity to specify column suffixes automatically):
pd.concat([g.reset_index(drop=True) for _, g in df.groupby('ID')], axis=1)
df
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
Those column names are terrible, though. Rather than dropping the first level, you should consider concatenating them to form a pre/suf-fix for the second level. That should be a good exercise for you with df.columns.map.
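For instance, with the dict-based concat above, the first column level holds the group key and the second the original name; a sketch of flattening them (the 'Name_key' naming scheme is just one choice):
df = pd.concat({k: g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
df.columns = df.columns.map(lambda t: f'{t[1]}_{t[0]}')  # e.g. 'ID_1', 'Value_1', ...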

transform on multiple columns to interpolate/copy missing values

I'm trying to fill out missing values in a pandas dataframe by interpolating or copying the last-known value within a group (identified by trip). My data looks like this:
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 NaN 1.264 1
3 NaN 0.000 1
4 0.0 NaN 1
5 NaN 1.264 1
6 NaN 6.704 1
7 1.0 NaN 1
8 0.0 NaN 1
9 NaN 11.746 2
10 1.0 NaN 2
11 0.0 NaN 2
12 NaN 16.961 3
13 1.0 NaN 3
14 NaN 11.832 3
15 0.0 NaN 3
16 NaN 17.082 3
17 NaN 22.435 3
18 NaN 28.707 3
19 NaN 34.216 3
I have found Pandas interpolate within a groupby, but I need brake to simply be copied forward from the last known value, while speed is interpolated (my actual dataset has 12 columns that each need such treatment).
You can apply separate methods to each column. For example:
# interpolate speed
df['speed'] = df.groupby('trip').speed.transform(lambda x: x.interpolate())
# fill brake with last known value
df['brake'] = df.groupby('trip').brake.transform(lambda x: x.ffill())
>>> df
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 1.0 1.2640 1
3 1.0 0.0000 1
4 0.0 0.6320 1
5 0.0 1.2640 1
6 0.0 6.7040 1
7 1.0 6.7040 1
8 0.0 6.7040 1
9 NaN 11.7460 2
10 1.0 11.7460 2
11 0.0 11.7460 2
12 NaN 16.9610 3
13 1.0 14.3965 3
14 1.0 11.8320 3
15 0.0 14.4570 3
16 0.0 17.0820 3
17 0.0 22.4350 3
18 0.0 28.7070 3
19 0.0 34.2160 3
Note that this means you remain with some NaN in brake, because there was no "last known value" for the first row of a trip, and some NaNs in speed when the first few rows were NaN. You can replace these as you see fit with fillna()
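With 12 such columns, you could keep a per-column mapping of fill methods instead of writing one line per column; a sketch, where the methods dict and its entries are assumptions about your data:
methods = {'speed': 'interpolate', 'brake': 'ffill'}  # extend for the other columns
g = df.groupby('trip')
for col, how in methods.items():
    if how == 'interpolate':
        df[col] = g[col].transform(lambda x: x.interpolate())
    else:
        df[col] = g[col].transform(lambda x: x.ffill())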
