Append rows to groups in pandas - python

I'm trying to append a number of NaN rows to each group in a pandas dataframe. Essentially I want to pad each group to be 5 rows long. Ordering is important. I have:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 1 e
8 2 e
9 3 e
I want:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e

Using pd.crosstab:
df = pd.crosstab(df.Rank, df.ID).iloc[:5].unstack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
Output:
ID Rank
0 a 1.0
1 a 2.0
2 a 3.0
3 a 4.0
4 a 5.0
5 c 1.0
6 c 2.0
7 c NaN
8 c NaN
9 c NaN
10 e 1.0
11 e 2.0
12 e 3.0
13 e NaN
14 e NaN
Another approach, assuming the maximum group size in df is exactly 5.
In [251]: df.groupby('ID').Rank.apply(np.array).apply(pd.Series).stack(dropna=False)
Out[251]:
ID
a 0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
c 0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
e 0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
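To recover a flat two-column frame from that stacked Series, a rename plus reset_index will do it (a sketch building on the line above):
out = (df.groupby('ID').Rank.apply(np.array).apply(pd.Series)
         .stack(dropna=False)
         .rename('Rank')          # name the values column
         .reset_index(level=0)    # move ID out of the index
         .reset_index(drop=True)) # renumber rows 0..14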
Full explanation:
import io
import pandas as pd
import numpy as np
df = pd.read_csv(io.StringIO("""Rank ID
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
6 1 c
7 2 c
8 1 e
9 2 e
10 3 e"""), sep=r' +')
df = pd.crosstab(df.Rank, df.ID).iloc[:5].T.stack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
# pd.crosstab(df.Rank, df.ID) produces:
# ID a c e
# Rank
# 1.0 1 1 1
# 2.0 1 1 1
# 3.0 1 0 1
# 4.0 1 0 0
# 5.0 1 0 0
# applying .T.stack().reset_index() yields:
# ID Rank 0
# 0 a 1.0 1
# 1 a 2.0 1
# 2 a 3.0 1
# 3 a 4.0 1
# 4 a 5.0 1
# 5 c 1.0 1
# 6 c 2.0 1
# 7 c 3.0 0
# 8 c 4.0 0
# 9 c 5.0 0
# 10 e 1.0 1
# 11 e 2.0 1
# 12 e 3.0 1
# 13 e 4.0 0
# 14 e 5.0 0
# finally, use the counts in df[0] to blank out df['Rank'] wherever the count is 0

concat and reindex
This solution does not consider the values in the Rank column and only adds more rows if more are needed.
pd.concat([
    d.reset_index(drop=True).reindex(range(5)).assign(id=n)
    for n, d in df.groupby('id')
], ignore_index=True)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Same answer phrased a bit differently
f = lambda t: t[1].reset_index(drop=True).reindex(range(5)).assign(id=t[0])
pd.concat(map(f, df.groupby('id')), ignore_index=True)
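If 5 should not be hard-coded, the pad length can be taken from the largest group instead (a sketch under that assumption):
n = df.groupby('id').size().max()
pd.concat([
    d.reset_index(drop=True).reindex(range(n)).assign(id=k)
    for k, d in df.groupby('id')
], ignore_index=True)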
factorize
This solution produces the Cartesian product of unique values from id and Rank
i, r = df.id.factorize()
j, c = df.Rank.factorize()
b = np.empty((r.size, c.size))
b.fill(np.nan)
b[i, j] = df.Rank.values
pd.DataFrame(dict(Rank=b.ravel(), id=r.repeat(c.size)))
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
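Note that the width of b is the number of distinct Rank values, so this implicitly assumes the distinct Ranks are exactly 1 through 5; a quick sanity check (a sketch):
j, c = df.Rank.factorize()  # c holds the unique Rank values, in order of appearance
assert c.size == 5          # holds here because Ranks 1..5 all occur somewhere in df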

You can use the frequency of the ids with pd.concat to append the padding rows, i.e.:
di = (5 - df.groupby('id').size()).to_dict()
temp = pd.concat([pd.DataFrame({
    'Rank': np.nan,
    'id': pd.Series(np.repeat(i, di[i]))
}) for i in df['id'].unique()])
ndf = pd.concat([df, temp], ignore_index=True).sort_values('id')
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
10 NaN c
11 NaN c
12 NaN c
7 1.0 e
8 2.0 e
9 3.0 e
13 NaN e
14 NaN e
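The sort leaves the helper frame's index interleaved with the original one (10, 11, 12 between 6 and 7 above); if a clean RangeIndex matters, follow up with reset_index:
ndf = ndf.reset_index(drop=True)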

One possible solution is to create a helper DataFrame with numpy.repeat, append it to the original, and finally sort_values:
s = (5 - df['id'].value_counts())
df = (df.append(pd.DataFrame({'id': np.repeat(s.index, s.values), 'Rank': np.nan}))
        .sort_values('id')
        .reset_index(drop=True))
print (df)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
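Note that DataFrame.append was removed in pandas 2.0; the same idea with pd.concat, reusing the s computed above (a sketch):
pad = pd.DataFrame({'id': np.repeat(s.index, s.values), 'Rank': np.nan})
df = (pd.concat([df, pad], ignore_index=True)
        .sort_values('id')
        .reset_index(drop=True))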
Another solution, if no sorting is possible, is groupby with a custom function and append:
def f(x):
    return x.append(pd.DataFrame([[np.nan, x.name]] * (5 - len(x)), columns=['Rank', 'id']))
df = df.groupby('id', sort=False).apply(f).reset_index(drop=True)
print (df)
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e

Here's one way using a single pd.DataFrame.append followed by sort_values.
from itertools import chain
counts = df.groupby('id')['Rank'].count()
lst = list(chain.from_iterable([[np.nan, i]]*(5-c) for i, c in counts.items()))
res = df.append(pd.DataFrame(lst, columns=df.columns))\
        .sort_values(['id', 'Rank'])\
        .reset_index(drop=True)
print(res)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
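This works because sort_values places NaN last by default (na_position='last'), so the padded rows sink to the bottom of each id group; passing na_position='first' would float them to the top instead:
res.sort_values(['id', 'Rank'], na_position='first')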

Excellent answers so far. I had another idea, because it fits the problem I'm dealing with more closely, using an outer join with pd.merge.
In addition to the example from above, I have several metric columns (m1 and m2 in this example) that I want to set to zero for every group that does not contain those Rank values. In my case, Rank is simply a time dimension, and the df contains a time series over several IDs.
df = pd.read_csv(io.StringIO("""Rank ID m1 m2
0 1 a 1 3
1 2 a 2 3
2 3 a 1 2
3 4 a 1 3
4 5 a 2 3
6 1 c 2 2
7 2 c 2 4
8 1 e 1 3
9 2 e 1 4
10 3 e 1 2"""), sep=r' +')
I then define a df that contains all the Ranks, in this example from 1 to 10.
df_outer_right = pd.DataFrame({'Rank':np.arange(1,11,1)})
Finally, I group the initial df by ID and apply an outer join with pd.merge to each group.
df.groupby('ID').apply(lambda df: pd.merge(df, df_outer_right, how='outer', on='Rank'))
which yields:
ID Rank ID m1 m2
a 0 1 a 1.0 3.0
a 1 2 a 2.0 3.0
a 2 3 a 1.0 2.0
a 3 4 a 1.0 3.0
a 4 5 a 2.0 3.0
a 5 6 NaN NaN NaN
a 6 7 NaN NaN NaN
a 7 8 NaN NaN NaN
a 8 9 NaN NaN NaN
a 9 10 NaN NaN NaN
c 0 1 c 2.0 2.0
c 1 2 c 2.0 4.0
c 2 3 NaN NaN NaN
c 3 4 NaN NaN NaN
c 4 5 NaN NaN NaN
c 5 6 NaN NaN NaN
c 6 7 NaN NaN NaN
c 7 8 NaN NaN NaN
c 8 9 NaN NaN NaN
c 9 10 NaN NaN NaN
e 0 1 e 1.0 3.0
e 1 2 e 1.0 4.0
e 2 3 e 1.0 2.0
e 3 4 NaN NaN NaN
e 4 5 NaN NaN NaN
e 5 6 NaN NaN NaN
e 6 7 NaN NaN NaN
e 7 8 NaN NaN NaN
e 8 9 NaN NaN NaN
e 9 10 NaN NaN NaN
I'm pretty sure this is not the fastest solution :)
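To actually zero out the metric columns for the padded Ranks, as described above, a fillna after the merge would do it; a sketch (treating m1 and m2 as the metric columns and restoring the group key on the padded rows):
out = df.groupby('ID').apply(
    lambda g: pd.merge(g, df_outer_right, how='outer', on='Rank'))
out['ID'] = out.index.get_level_values(0)        # refill ID on the padded rows
out[['m1', 'm2']] = out[['m1', 'm2']].fillna(0)  # zero the metrics, per the goal above
out = out.reset_index(drop=True)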

Related

Add list to Pandas Dataframe, but keep NaNs at the top

I think this has probably been answered, but I can't find the answer anywhere. It is pretty trivial. How can I add a list to a pandas dataframe as a column, but keep the NaNs at the top?
This is the code I have:
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
print(df)
which returns this:
a b c d
0 1 2 2.0 1.0
1 2 3 3.0 2.0
2 3 5 5.0 3.0
3 4 6 6.0 4.0
4 5 4 4.0 NaN
5 6 3 3.0 NaN
6 7 2 NaN NaN
However, I would like it to return this instead:
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
Let us try
df = df.apply(lambda x: sorted(x, key=pd.notnull))
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can sort the dataframe rows using a key argument to keep NaNs first; since sorted is stable, keying only on null-ness preserves the original order of the non-NaN values:
l = df.apply(sorted, key=lambda v: ~np.isnan(v), axis=0)
If the problem is with assignment instead of transformation, you can also use iloc with get_loc after collecting the lists in a dictionary (d):
d = {'c': c, 'd': d}
df = df.reindex(columns=df.columns.union(d.keys(), sort=False))
for k, v in d.items():
    df.iloc[-len(v):, df.columns.get_loc(k)] = v
print(df)
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can find out how many NaN values a column has (using s.isna().sum()) and then shift() that column down by that amount.
Code example on d column:
import pandas as pd
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
df['d'] = df['d'].shift(df['d'].isna().sum()) # example on the 'd' column
print(df)
Output:
a b c d
0 1 2 2.0 NaN
1 2 3 3.0 NaN
2 3 5 5.0 NaN
3 4 6 6.0 1.0
4 5 4 4.0 2.0
5 6 3 3.0 3.0
6 7 2 NaN 4.0
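The same shift can be applied to every column at once, assuming each column should be packed to the bottom:
df = df.apply(lambda s: s.shift(s.isna().sum()))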
Another way to do it: reset the index and sort the values with NaNs first.
df.reset_index()
df2 = df.sort_values(by =['a','b','c','d'], ascending = False, na_position='first')
#Result
a b c d
6 7 2 NaN NaN
5 6 3 3.0 NaN
4 5 4 4.0 NaN
3 4 6 6.0 4.0
2 3 5 5.0 3.0
1 2 3 3.0 2.0
0 1 2 2.0 1.0

dataframe concatenation by column value (no outer merge)

This is a new question after this, with more information
I want to merge two dataframes like an outer join, but I do not want the Cartesian product, only the concatenation. For example:
df1:
A
0 2
1 2
2 2
3 2
4 2
5 3
df2:
B
0 1
1 2
2 2
3 3
4 4
With df3 = df1.merge(df2, left_on=['A'], right_on=['B'], how='outer') I get df3:
A B
0 2.0 2
1 2.0 2
2 2.0 2
3 2.0 2
4 2.0 2
5 2.0 2
6 2.0 2
7 2.0 2
8 2.0 2
9 2.0 2
10 3.0 3
11 NaN 1
12 NaN 4
But I want:
A B
0 2.0 2
1 2.0 2
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3
6 NaN 1
7 NaN 4
just concatenate the first m values of df1 with the m values of df2, and fill the remaining values of df1 with NaN
Get the cumulative counts of A and B, and use the combination of the counts with A and B as the merge conditions:
df1['checker'] = df1.groupby("A").cumcount()
df2['checker'] = df2.groupby("B").cumcount()
res = df1.merge(df2,left_on=['A','checker'],right_on=['B','checker'],how='outer').drop('checker',axis=1)
res
A B
0 2.0 2.0
1 2.0 2.0
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3.0
6 NaN 1.0
7 NaN 4.0
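For intuition, the cumcount "checker" turns repeated values into distinct (value, occurrence) keys, so each duplicate in df1 lines up with at most one duplicate in df2:
df1.groupby('A').cumcount().tolist()  # [0, 1, 2, 3, 4, 0]
df2.groupby('B').cumcount().tolist()  # [0, 0, 1, 0, 0]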
You might want to try the concat method, e.g.:
result = pd.concat([A, B], axis=1, sort=False)

How to turn columns into multiple rows in a dataframe?

I have a dataframe like below:
A B C D E F G H G H I J K
1 2 3 4 5 6 7 8 9 10 11 12 13
and I want a result like this:
A B C D E F G H
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 9
1 2 3 4 5 6 7 10
1 2 3 4 5 6 7 11
1 2 3 4 5 6 7 12
1 2 3 4 5 6 7 13
In the result, the values from columns 'G' through 'K' end up under column 'H'.
How can I do this?
You need to adjust your columns using cummax; then after melt, we create an additional key with cumcount and reshape. Here I am using unstack, but you can use pivot or pivot_table:
s=pd.Series(df.columns)
s[(s>='H').cummax()==1]='H'
df.columns=s
df=df.melt()
yourdf = df.set_index(['variable', df.groupby('variable').cumcount()])\
           .value.unstack(0).ffill()
yourdf
variable A B C D E F G H
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 9.0
2 1.0 2.0 3.0 4.0 5.0 6.0 7.0 10.0
3 1.0 2.0 3.0 4.0 5.0 6.0 7.0 11.0
4 1.0 2.0 3.0 4.0 5.0 6.0 7.0 12.0
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0 13.0
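To see what the cummax renaming does, inspect the adjusted labels on the original columns; everything from the first 'H' onward is literally renamed 'H', which is what lets melt pool those columns:
s = pd.Series(('A','B','C','D','E','F','G','H','G','H','I','J','K'))
s[(s >= 'H').cummax() == 1] = 'H'
print(s.tolist())
# ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'H', 'H', 'H', 'H', 'H']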
I hope this gives you some help:
import pandas as pd
df = pd.DataFrame([list(range(1,14))])
df.columns = ('A','B','C','D','E','F','G','H','G','H','I','J','K')
print('starting data frame:')
print(df)
df1 = df.iloc[:,0:7]
df1 = df1.append([df1]*(len(df.iloc[:,7:].T)-1))
df1.insert(df1.shape[1],'H',list(df.iloc[:,7:].values[0]))
print('result:')
print(df1)
letters = list("ABCDEFGHIJKLM")
df = pd.DataFrame([np.arange(1, len(letters) + 1)], columns=letters)
df = pd.concat([df.iloc[:, :7]] * (len(letters) - 7)).assign(H=df[letters[7:]].values[0])
df = df.reset_index(drop=True)
df
gives you
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13
Your data has duplicate column names, so melt will fail. However, you can rename the duplicated columns and then apply melt.
In [166]: df
Out[166]:
A B C D E F G H G H I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
The duplicates are in column names 'G' and 'H'. Just change the second occurrences to 'GG' and 'HH', then apply melt:
In [167]: df.columns = ('A','B','C','D','E','F','G','H','GG','HH','I','J','K')
In [168]: df
Out[168]:
A B C D E F G H GG HH I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
In [169]: df.melt(id_vars=df.columns.tolist()[0:7], value_name='H').drop(columns='variable')
Out[169]:
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13

Merging two dataframes with different lengths

How can I merge two pandas dataframes with different lengths like those:
df1 =
Index block_id Ut_rec_0
0 0 7
1 1 10
2 2 2
3 3 0
4 4 10
5 5 3
6 6 6
7 7 9
df2 =
Index block_id Ut_rec_1
0 0 3
2 2 5
3 3 5
5 5 9
7 7 4
result =
Index block_id Ut_rec_0 Ut_rec_1
0 0 7 3
1 1 10 NaN
2 2 2 5
3 3 0 5
4 4 10 NaN
5 5 3 9
6 6 6 NaN
7 7 9 4
I already tried something like this, but it did not work:
df_result = pd.concat([df1, df2], join_axes=[df1['block_id']])
I already tried:
df_result = pd.concat([df1, df2], axis=1)
But the result was:
Index block_id Ut_rec_0 Index block_id Ut_rec_1
0 0 7 0.0 0.0 3.0
1 1 10 1.0 2.0 5.0
2 2 2 2.0 3.0 5.0
3 3 0 3.0 5.0 9.0
4 4 10 4.0 7.0 4.0
5 5 3 NaN NaN NaN
6 6 6 NaN NaN NaN
7 7 9 NaN NaN NaN
pandas.DataFrame.join can "join" dataframes based on overlap in column data (or index). Something like this will likely work for you:
df1.join(df2.set_index('block_id'), on='block_id')
As @Wen said, the best option would be concat with axis=1, like the code below:
pd.concat([df1, df2],axis=1)
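Note that concat with axis=1 aligns on the index, so this only reproduces the desired result if block_id is made the index first (a sketch):
pd.concat(
    [df1.set_index('block_id'), df2.set_index('block_id')['Ut_rec_1']],
    axis=1,
).reset_index()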
You need pd.merge with an outer join:
pd.merge(df1,df2,on=['Index','block_id'],how='outer')
#[out]
#Index block_id Ut_rec_0 Ut_rec_1
#0 0 7 3.0
#1 1 10 NaN
#2 2 2 5.0
#3 3 0 5.0
#4 4 10 NaN
#5 5 3 9.0
#6 6 6 NaN
#7 7 9 4.0

How do I join two dataframes based on values in selected columns?

I am trying to join (merge) two dataframes based on values in each column.
For instance, to merge by values in columns in A and B.
So, having df1
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
And df2
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
I want to get a df3 with this structure:
A B C D E F L
0 4 3 1 5 4 5 1
1 5 7 0 3 3 3 2
2 3 2 1 6 NaN NaN 4
3 3 8 NaN NaN 5 5 5
Can you please help me? I've tried both the merge and join methods but haven't succeeded.
UPDATE: (for updated DFs and new desired DF)
In [286]: merged = pd.merge(df1, df2, on=['A','B'], how='outer', suffixes=('','_y'))
In [287]: merged.L.fillna(merged.pop('L_y'), inplace=True)
In [288]: merged
Out[288]:
A B C D L E F
0 4 3 1.0 5.0 1.0 4.0 5.0
1 5 7 0.0 3.0 2.0 3.0 3.0
2 3 2 1.0 6.0 4.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0 5.0
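The pop/fillna line collapses the two L columns into one; an equivalent spelling without inplace, starting from the merge result in In [286] (a sketch):
merged['L'] = merged['L'].fillna(merged['L_y'])
merged = merged.drop(columns='L_y')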
Data:
In [284]: df1
Out[284]:
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
In [285]: df2
Out[285]:
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
OLD answer:
you can use pd.merge(..., how='outer') method:
In [193]: pd.merge(a,b, on=['A','B'], how='outer')
Out[193]:
A B C D E F
0 4 3 1.0 5.0 4.0 5.0
1 5 7 0.0 3.0 3.0 3.0
2 3 2 1.0 6.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0
Data:
In [194]: a
Out[194]:
A B C D
0 4 3 1 5
1 5 7 0 3
2 3 2 1 6
In [195]: b
Out[195]:
A B E F
0 4 3 4 5
1 5 7 3 3
2 3 8 5 5
