I am looking for a way to create a column of numbers that labels groups, based on the values in the 'number' column. Is this possible?
With this abbreviated example DF:
import pandas as pd
from numpy import nan
number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in column 'number', so that there would effectively be runs of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. If the gaps were filled with zeros instead of NaN: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
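To see why notna plus cumsum works, it can help to look at the intermediate boolean mask. Here is a minimal self-contained sketch; the shortened data and the name is_marker are only for illustration and are not part of the original question:

```python
import numpy as np
import pandas as pd

# Shortened version of the sample column (illustrative data only).
df = pd.DataFrame({'number': [np.nan, np.nan, 1, np.nan, 2, np.nan, np.nan]})

# True wherever a number is present, False for NaN.
is_marker = df['number'].notna()

# cumsum treats True as 1 and False as 0, so the running total goes up
# by one at every marker and stays flat on the NaN rows in between.
df['group'] = is_marker.cumsum()

print(df)
```

Rows before the first marker end up in group 0, which matches the output requested in the question.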
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
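Note that forward fill copies the values themselves, while notna().cumsum() counts the markers. The two only agree when the numbers are exactly 1, 2, 3, ... in order. A small sketch with made-up, non-consecutive values (an assumption, the question's data is consecutive) shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up markers that are neither consecutive nor unique.
s = pd.Series([np.nan, 10, np.nan, 7, np.nan, np.nan, 10])

# Forward fill propagates each marker's own value downwards.
by_value = s.ffill().fillna(0).astype(int)

# notna().cumsum() numbers the markers in order of appearance.
by_count = s.notna().cumsum()

print(pd.DataFrame({'number': s, 'ffill': by_value, 'cumsum': by_count}))
```

With this data the two rows holding 10 share a label under ffill but get separate labels under cumsum, so which one you want depends on whether the numbers are meaningful or just separators.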
The objective is to fill in the missing rows as NaN with respect to two columns (i.e., a and b).
a,b,c,d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
The goal is that, for each fixed value in column b, column a contains every value from 1 to 6. The rows added to fill the gaps get NaN in the other columns.
The following code snippet does the trick:
import numpy as np
import pandas as pd
maxval_col_a=6
lowval_col_a=1
maxval_col_b=2
lowval_col_b=0
r=list(range(lowval_col_b,maxval_col_b+1))
df=pd.DataFrame(np.column_stack([[2,5,6,1,4,5,6,1,6,],
                                 [0,0,0,1,1,1,1,2,2,], [1,5,1,1,5,5,1,2,5,],[4,6,1,4,6,6,1,3,6,]]),columns=['a','b','c','d'])
all_df=[]
for idx in r:
    # keep the rows for this value of b, then reindex a over the full range
    k=df.loc[df['b']==idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a+1, 1)).reset_index()
    k['b']=idx
    all_df.append(k)
df=pd.concat(all_df)
But I am curious whether there is a more efficient or better way of doing this with pandas.
The expected output:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
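The range 1 to 6 is hard-coded in the snippet above. As a sketch, assuming the limits should instead be taken from the data itself (my assumption, the question states them as constants), the same from_product idea can be written in a self-contained form; a_values is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 5, 6, 1, 4, 5, 6, 1, 6],
                   'b': [0, 0, 0, 1, 1, 1, 1, 2, 2],
                   'c': [1, 5, 1, 1, 5, 5, 1, 2, 5],
                   'd': [4, 6, 1, 4, 6, 6, 1, 3, 6]})

# Every value of a that each group of b should contain (1..6 for this data).
a_values = range(df['a'].min(), df['a'].max() + 1)

# All (b, a) combinations, with b varying slowest.
mi = pd.MultiIndex.from_product([df['b'].unique(), a_values], names=['b', 'a'])

# Reindexing on the full product inserts NaN rows for the missing pairs.
out = df.set_index(['b', 'a']).reindex(mi).reset_index()[['a', 'b', 'c', 'd']]

print(out)
```

This relies on the (a, b) pairs being unique in the input, since reindex refuses to work on an index with duplicates.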
First create a MultiIndex from columns [a, b], then build a new MultiIndex with all the combinations, and finally reindex with the new MultiIndex (showing all steps):
# set both a and b as index (it's a MultiIndex)
df.set_index(['a','b'],drop=True,inplace=True)
# create the new MultiIndex: a runs from 1 to 6 for each of the 3 values of b
new_idx_a=np.tile(np.arange(1,6+1),3)
new_idx_b=np.repeat([0,1,2],6)
new_multidx=pd.MultiIndex.from_arrays([new_idx_a,
                                       new_idx_b])
# reindex
df=df.reindex(new_multidx)
# convert the MultiIndex back to columns
df.index.names=['a','b']
df.reset_index()
results:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
We can do it by using a groupby on column b, then setting a as the index and adding the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b')[['a', 'c', 'd']].apply(lambda x : x.set_index('a').reindex(np.arange(1, 7))).reset_index()
Output :
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0
If I have a pandas data frame of ones like this:
NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1
How do I do a cumulative sum within each row, and then set every value in each group of consecutive ones to the maximum of that group's cumulative sum, so that I get a pandas data frame like this:
NaN 4 4 4 4 NaN 3 3 3 NaN 1
NaN NaN 4 4 4 4 NaN NaN 1 NaN 1
NaN NaN 9 9 9 9 9 9 9 9 9
First we stack the result of isnull, then create the sub-groups with cumsum and count the consecutive 1s with transform. As a last step, we unstack to convert the data back:
s=df.isnull().stack()
s=s.groupby(level=0).cumsum()[~s]
s=s.groupby([s.index.get_level_values(0),s]).transform('count').unstack().reindex_like(df)
1 2 3 4 5 6 7 8 9 10 11
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
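The stacked version is compact but a bit hard to follow, so here is the same idea on a single row as a rough sketch (run_id and run_len are names I made up for illustration). The running count of NaNs is constant within each run of ones, so it can serve as a group key:

```python
import numpy as np
import pandas as pd

# First row of the sample data.
row = pd.Series([np.nan, 1, 1, 1, 1, np.nan, 1, 1, 1, np.nan, 1])

# Running count of NaNs: every run of ones shares one value of this key.
run_id = row.isnull().cumsum()

# 'count' ignores NaN, so this is the number of ones in each run,
# broadcast back to every position of that run.
run_len = row.groupby(run_id).transform('count')

# Put the NaNs back where the original row had them.
result = run_len.where(row.notnull())

print(result.tolist())
```

Because all the non-NaN values are ones, counting them gives the same number as the maximum of the cumulative sum within each run.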
This takes many more steps than @YOBEN_S's answer, but we can make use of melt and groupby.
We use cumcount to create a conditional helper column to group with.
from io import StringIO
import numpy as np
import pandas as pd
d = """ NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
s = df.reset_index().melt(id_vars="index")
s.loc[s["value"].isnull(), "counter"] = s.groupby(
    [s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
    s.groupby(["index", "variable"])["val"]
    .first()
    .unstack()
    .rename_axis(None, axis=1)
    .rename_axis(None)
)
print(df2)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NaNs before them, so that I get a pandas data frame like this:
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution:
df.apply(lambda x: pd.concat([x[x.isna()], x[x.notna()]], ignore_index=True), axis=1)
I guess the best approach is to work row by row. Make a function to do the job and use apply or transform to use that function on each row.
import numpy as np
import pandas as pd
def movenan(x):
    fl = len(x)           # full length of the row
    nl = len(x.dropna())  # number of non-NaN values
    nanarr = np.full(fl - nl, np.nan)  # leading block of NaN
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)
ddf = df.apply(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of nan of the required length, drops the nan from the row, and concatenates the two resulting Series.
ignore_index=True is required because you don't want to preserve data position in their columns (values are moved to different columns), but doing this the column names are lost and replaced by integers. The last line simply copies back the column names into the new dataframe.
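If the frame is large, a row-by-row apply can be slow. As an alternative sketch (a different technique from the function above, not what the answer uses), NumPy's stable argsort on the "is not NaN" mask puts the NaNs first while keeping the remaining values in their original order. The three-row frame is a shortened copy of the sample data:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[0, 2, 3, 5, nan, nan, nan, nan],
                   [2, 7, 9, 1, 2, nan, nan, nan],
                   [6, 1, 3, 2, 1, 1, 5, 5]],
                  columns=list('ABCDEFGH'), dtype=float)

values = df.to_numpy()

# False (NaN) sorts before True (number); a stable sort keeps the
# original left-to-right order inside each of the two groups.
order = np.argsort(~np.isnan(values), axis=1, kind='stable')

# Reorder every row by its own permutation and restore the labels.
shifted = pd.DataFrame(np.take_along_axis(values, order, axis=1),
                       index=df.index, columns=df.columns)

print(shifted)
```

This keeps the column names, so no extra step is needed to copy them back.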
If I have a Pandas Data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 1 1
2 1 NaN NaN 1 NaN 1
3 NaN 1 1 NaN 1 1
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
How do I number each group of consecutive ones within a row, so that every value in the first group becomes 1, every value in the second group becomes 2, and so on? The result should be a data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 2 2
2 1 NaN NaN 2 NaN 3
3 NaN 1 1 NaN 2 2
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
It is a little hard to find a simple way:
s=df.isnull().cumsum(axis=1) # running count of NaN along each row
s=s[df.notnull()].apply(lambda x : pd.Series(pd.factorize(x)[0], index=x.index), axis=1)+1 # factorize the counts to get the group key (NaN gives -1, so 0 after +1)
df=s.mask(s==0) # and mask 0 as NaN
df
0 1 2 3 4 5
1 NaN NaN 1.0 NaN 2.0 2.0
2 1.0 NaN NaN 2.0 NaN 3.0
3 NaN 1.0 1.0 NaN 2.0 2.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN NaN NaN NaN NaN NaN
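To make the cumsum and factorize steps easier to follow, here is a sketch of the same logic on a single row (row 2 of the sample data). The intermediate names nan_count, keys and labels are only illustrative:

```python
import numpy as np
import pandas as pd

# Row 2 of the sample data.
row = pd.Series([1, np.nan, np.nan, 1, np.nan, 1])

# Running count of NaNs: constant inside each group of ones.
nan_count = row.isnull().cumsum()

# Keep the count only at the positions that hold a one.
keys = nan_count[row.notnull()]

# factorize maps the distinct counts to 0, 1, 2, ... in order of
# appearance; adding 1 gives group numbers starting from 1.
codes = pd.factorize(keys)[0] + 1

# Put the group numbers back; the other positions become NaN again.
labels = pd.Series(codes, index=keys.index).reindex(row.index)

print(labels.tolist())
```

The factorize step is needed because the raw NaN counts can skip values (here 0, 2, 3), while the group numbers should be consecutive.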