Append rows to groups in pandas - python

I'm trying to append a number of NaN rows to each group in a pandas dataframe. Essentially I want to pad each group to be 5 rows long. Ordering is important. I have:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 1 e
8 2 e
9 3 e
I want:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e

Using pd.crosstab:
df = pd.crosstab(df.Rank, df.ID).iloc[:5].unstack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
Output:
ID Rank
0 a 1.0
1 a 2.0
2 a 3.0
3 a 4.0
4 a 5.0
5 c 1.0
6 c 2.0
7 c NaN
8 c NaN
9 c NaN
10 e 1.0
11 e 2.0
12 e 3.0
13 e NaN
14 e NaN
Another approach, assuming the maximum group size in df is exactly 5.
In [251]: df.groupby('ID').Rank.apply(np.array).apply(pd.Series).stack(dropna=False)
Out[251]:
ID
a 0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
c 0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
e 0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
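To recover a flat two-column frame from that stacked Series, a rename plus reset_index will do it (a sketch building on the line above):
out = (df.groupby('ID').Rank.apply(np.array).apply(pd.Series)
         .stack(dropna=False)
         .rename('Rank')          # name the values column
         .reset_index(level=0)    # move ID out of the index
         .reset_index(drop=True)) # renumber rows 0..14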
Full explanation:
import io
import pandas as pd
import numpy as np
df = pd.read_csv(io.StringIO("""Rank ID
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
6 1 c
7 2 c
8 1 e
9 2 e
10 3 e"""), sep=r' +')
df = pd.crosstab(df.Rank, df.ID).iloc[:5].T.stack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
# pd.crosstab(df.Rank, df.ID) produces:
# ID a c e
# Rank
# 1.0 1 1 1
# 2.0 1 1 1
# 3.0 1 0 1
# 4.0 1 0 0
# 5.0 1 0 0
# applying .T.stack().reset_index() yields:
# ID Rank 0
# 0 a 1.0 1
# 1 a 2.0 1
# 2 a 3.0 1
# 3 a 4.0 1
# 4 a 5.0 1
# 5 c 1.0 1
# 6 c 2.0 1
# 7 c 3.0 0
# 8 c 4.0 0
# 9 c 5.0 0
# 10 e 1.0 1
# 11 e 2.0 1
# 12 e 3.0 1
# 13 e 4.0 0
# 14 e 5.0 0
# finally, use the counts in df[0] to blank out df['Rank'] wherever the count is 0

concat and reindex
This solution does not consider the values in the Rank column and only adds more rows if more are needed.
pd.concat([
    d.reset_index(drop=True).reindex(range(5)).assign(id=n)
    for n, d in df.groupby('id')
], ignore_index=True)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Same answer phrased a bit differently
f = lambda t: t[1].reset_index(drop=True).reindex(range(5)).assign(id=t[0])
pd.concat(map(f, df.groupby('id')), ignore_index=True)
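If 5 should not be hard-coded, the pad length can be taken from the largest group instead (a sketch under that assumption):
n = df.groupby('id').size().max()
pd.concat([
    d.reset_index(drop=True).reindex(range(n)).assign(id=k)
    for k, d in df.groupby('id')
], ignore_index=True)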
factorize
This solution produces the Cartesian product of unique values from id and Rank
i, r = df.id.factorize()
j, c = df.Rank.factorize()
b = np.empty((r.size, c.size))
b.fill(np.nan)
b[i, j] = df.Rank.values
pd.DataFrame(dict(Rank=b.ravel(), id=r.repeat(c.size)))
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
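Note that the width of b is the number of distinct Rank values, so this implicitly assumes the distinct Ranks are exactly 1 through 5; a quick sanity check (a sketch):
j, c = df.Rank.factorize()  # c holds the unique Rank values, in order of appearance
assert c.size == 5          # holds here because Ranks 1..5 all occur somewhere in df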

You can use the frequency of the ids with pd.concat to append the padding rows, i.e.:
di = (5 - df.groupby('id').size()).to_dict()
temp = pd.concat([pd.DataFrame({
    'Rank': np.nan,
    'id': pd.Series(np.repeat(i, di[i]))
}) for i in df['id'].unique()])
ndf = pd.concat([df, temp], ignore_index=True).sort_values('id')
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
10 NaN c
11 NaN c
12 NaN c
7 1.0 e
8 2.0 e
9 3.0 e
13 NaN e
14 NaN e
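The sort leaves the helper frame's index interleaved with the original one (10, 11, 12 between 6 and 7 above); if a clean RangeIndex matters, follow up with reset_index:
ndf = ndf.reset_index(drop=True)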

One possible solution is to create a helper DataFrame with numpy.repeat, append it to the original, and finally sort_values:
s = (5 - df['id'].value_counts())
df = (df.append(pd.DataFrame({'id': np.repeat(s.index, s.values), 'Rank': np.nan}))
        .sort_values('id')
        .reset_index(drop=True))
print (df)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
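Note that DataFrame.append was removed in pandas 2.0; the same idea with pd.concat, reusing the s computed above (a sketch):
pad = pd.DataFrame({'id': np.repeat(s.index, s.values), 'Rank': np.nan})
df = (pd.concat([df, pad], ignore_index=True)
        .sort_values('id')
        .reset_index(drop=True))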
Another solution, if no sorting is possible, is groupby with a custom function and append:
def f(x):
    return x.append(pd.DataFrame([[np.nan, x.name]] * (5 - len(x)), columns=['Rank', 'id']))
df = df.groupby('id', sort=False).apply(f).reset_index(drop=True)
print (df)
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e

Here's one way using a single pd.DataFrame.append followed by sort_values.
from itertools import chain
counts = df.groupby('id')['Rank'].count()
lst = list(chain.from_iterable([[np.nan, i]]*(5-c) for i, c in counts.items()))
res = df.append(pd.DataFrame(lst, columns=df.columns))\
        .sort_values(['id', 'Rank'])\
        .reset_index(drop=True)
print(res)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
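This works because sort_values places NaN last by default (na_position='last'), so the padded rows sink to the bottom of each id group; passing na_position='first' would float them to the top instead:
res.sort_values(['id', 'Rank'], na_position='first')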

Excellent answers so far. I had another idea, because it fits the problem I'm dealing with more closely, using an outer join with pd.merge.
In addition to the example from above, I have several metric columns (m1 and m2 in this example) that I want to set to zero for every group that does not contain those Rank values. In my case, Rank is simply a time dimension, and the df contains a time series over several IDs.
df = pd.read_csv(io.StringIO("""Rank ID m1 m2
0 1 a 1 3
1 2 a 2 3
2 3 a 1 2
3 4 a 1 3
4 5 a 2 3
6 1 c 2 2
7 2 c 2 4
8 1 e 1 3
9 2 e 1 4
10 3 e 1 2"""), sep=r' +')
I then define a df that contains all the Ranks, in this example from 1 to 10.
df_outer_right = pd.DataFrame({'Rank':np.arange(1,11,1)})
Finally, I group the initial df by ID and apply an outer join with pd.merge to each group.
df.groupby('ID').apply(lambda df: pd.merge(df, df_outer_right, how='outer', on='Rank'))
which yields:
ID Rank ID m1 m2
a 0 1 a 1.0 3.0
a 1 2 a 2.0 3.0
a 2 3 a 1.0 2.0
a 3 4 a 1.0 3.0
a 4 5 a 2.0 3.0
a 5 6 NaN NaN NaN
a 6 7 NaN NaN NaN
a 7 8 NaN NaN NaN
a 8 9 NaN NaN NaN
a 9 10 NaN NaN NaN
c 0 1 c 2.0 2.0
c 1 2 c 2.0 4.0
c 2 3 NaN NaN NaN
c 3 4 NaN NaN NaN
c 4 5 NaN NaN NaN
c 5 6 NaN NaN NaN
c 6 7 NaN NaN NaN
c 7 8 NaN NaN NaN
c 8 9 NaN NaN NaN
c 9 10 NaN NaN NaN
e 0 1 e 1.0 3.0
e 1 2 e 1.0 4.0
e 2 3 e 1.0 2.0
e 3 4 NaN NaN NaN
e 4 5 NaN NaN NaN
e 5 6 NaN NaN NaN
e 6 7 NaN NaN NaN
e 7 8 NaN NaN NaN
e 8 9 NaN NaN NaN
e 9 10 NaN NaN NaN
I'm pretty sure this is not the fastest solution :)
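To actually zero out the metric columns for the padded Ranks, as described above, a fillna after the merge would do it; a sketch (treating m1 and m2 as the metric columns and restoring the group key on the padded rows):
out = df.groupby('ID').apply(
    lambda g: pd.merge(g, df_outer_right, how='outer', on='Rank'))
out['ID'] = out.index.get_level_values(0)        # refill ID on the padded rows
out[['m1', 'm2']] = out[['m1', 'm2']].fillna(0)  # zero the metrics, per the goal above
out = out.reset_index(drop=True)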

Related

Add list to Pandas Dataframe, but keep NaNs at the top

I think this has probably been answered, but I can't find the answer anywhere. It is pretty trivial. How can I add a list to a pandas dataframe as a column, but keep the NaNs at the top?
This is the code I have:
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
print(df)
which returns this:
a b c d
0 1 2 2.0 1.0
1 2 3 3.0 2.0
2 3 5 5.0 3.0
3 4 6 6.0 4.0
4 5 4 4.0 NaN
5 6 3 3.0 NaN
6 7 2 NaN NaN
However, I would like it to return this instead:
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
Let us try
df = df.apply(lambda x: sorted(x, key=pd.notnull))
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can sort the dataframe rows using a key argument to keep NaNs first; since sorted is stable, keying only on null-ness preserves the original order of the non-NaN values:
l = df.apply(sorted, key=lambda v: ~np.isnan(v), axis=0)
If the problem is with assignment instead of transformation, you can also use iloc with get_loc after collecting the lists in a dictionary (d):
d = {'c': c, 'd': d}
df = df.reindex(columns=df.columns.union(d.keys(), sort=False))
for k, v in d.items():
    df.iloc[-len(v):, df.columns.get_loc(k)] = v
print(df)
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
You can find out how many NaN values a column has (using s.isna().sum()) and then shift() that column down by that amount.
Code example on d column:
import pandas as pd
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
df['d'] = df['d'].shift(df['d'].isna().sum()) # example on the 'd' column
print(df)
Output:
a b c d
0 1 2 2.0 NaN
1 2 3 3.0 NaN
2 3 5 5.0 NaN
3 4 6 6.0 1.0
4 5 4 4.0 2.0
5 6 3 3.0 3.0
6 7 2 NaN 4.0
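The same shift can be applied to every column at once, assuming each column should be packed to the bottom:
df = df.apply(lambda s: s.shift(s.isna().sum()))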
Another way to do it: reset the index and sort the values with NaNs first.
df.reset_index()
df2 = df.sort_values(by =['a','b','c','d'], ascending = False, na_position='first')
#Result
a b c d
6 7 2 NaN NaN
5 6 3 3.0 NaN
4 5 4 4.0 NaN
3 4 6 6.0 4.0
2 3 5 5.0 3.0
1 2 3 3.0 2.0
0 1 2 2.0 1.0

dataframe concatenation by column value (no outer merge)

This is a new question after this, with more information
I want to merge two dataframes like an outer join, but I do not want the Cartesian product, only the concatenation. For example:
df1:
A
0 2
1 2
2 2
3 2
4 2
5 3
df2:
B
0 1
1 2
2 2
3 3
4 4
With df3 = df1.merge(df2, left_on=['A'], right_on=['B'], how='outer') I get df3:
A B
0 2.0 2
1 2.0 2
2 2.0 2
3 2.0 2
4 2.0 2
5 2.0 2
6 2.0 2
7 2.0 2
8 2.0 2
9 2.0 2
10 3.0 3
11 NaN 1
12 NaN 4
But I want:
A B
0 2.0 2
1 2.0 2
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3
6 NaN 1
7 NaN 4
just concatenate the first m values of df1 with the m values of df2, and fill the remaining values of df1 with NaN
Get the cumulative counts of A and B, and use the combination of the counts with A and B as the merge conditions:
df1['checker'] = df1.groupby("A").cumcount()
df2['checker'] = df2.groupby("B").cumcount()
res = df1.merge(df2,left_on=['A','checker'],right_on=['B','checker'],how='outer').drop('checker',axis=1)
res
A B
0 2.0 2.0
1 2.0 2.0
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3.0
6 NaN 1.0
7 NaN 4.0
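For intuition, the cumcount "checker" turns repeated values into distinct (value, occurrence) keys, so each duplicate in df1 lines up with at most one duplicate in df2:
df1.groupby('A').cumcount().tolist()  # [0, 1, 2, 3, 4, 0]
df2.groupby('B').cumcount().tolist()  # [0, 0, 1, 0, 0]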
You might want to try the concat method, e.g.:
result = pd.concat([A, B], axis=1, sort=False)

How to turn columns into multiple rows in a dataframe?

I have a dataframe like below:
A B C D E F G H G H I J K
1 2 3 4 5 6 7 8 9 10 11 12 13
and I want a result like this:
A B C D E F G H
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 9
1 2 3 4 5 6 7 10
1 2 3 4 5 6 7 11
1 2 3 4 5 6 7 12
1 2 3 4 5 6 7 13
In the result, the values from columns 'G' through 'K' end up under column 'H'.
How can I do this?
You need to adjust your columns using cummax; then after melt, we create an additional key with cumcount and reshape. Here I am using unstack, but you can use pivot or pivot_table:
s=pd.Series(df.columns)
s[(s>='H').cummax()==1]='H'
df.columns=s
df=df.melt()
yourdf = df.set_index(['variable', df.groupby('variable').cumcount()])\
           .value.unstack(0).ffill()
yourdf
variable A B C D E F G H
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 9.0
2 1.0 2.0 3.0 4.0 5.0 6.0 7.0 10.0
3 1.0 2.0 3.0 4.0 5.0 6.0 7.0 11.0
4 1.0 2.0 3.0 4.0 5.0 6.0 7.0 12.0
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0 13.0
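To see what the cummax renaming does, inspect the adjusted labels on the original columns; everything from the first 'H' onward is literally renamed 'H', which is what lets melt pool those columns:
s = pd.Series(('A','B','C','D','E','F','G','H','G','H','I','J','K'))
s[(s >= 'H').cummax() == 1] = 'H'
print(s.tolist())
# ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'H', 'H', 'H', 'H', 'H']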
I hope this gives you some help:
import pandas as pd
df = pd.DataFrame([list(range(1,14))])
df.columns = ('A','B','C','D','E','F','G','H','G','H','I','J','K')
print('starting data frame:')
print(df)
df1 = df.iloc[:,0:7]
df1 = df1.append([df1]*(len(df.iloc[:,7:].T)-1))
df1.insert(df1.shape[1],'H',list(df.iloc[:,7:].values[0]))
print('result:')
print(df1)
letters = list("ABCDEFGHIJKLM")
df = pd.DataFrame([np.arange(1, len(letters) + 1)], columns=letters)
df = pd.concat([df.iloc[:, :7]] * (len(letters) - 7)).assign(H=df[letters[7:]].values[0])
df = df.reset_index(drop=True)
df
gives you
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13
Your data has duplicate column names, so melt will fail. However, you can rename the duplicated columns and then apply melt.
In [166]: df
Out[166]:
A B C D E F G H G H I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
The duplicates are in column names 'G' and 'H'. Just change the second occurrences to 'GG' and 'HH', then apply melt:
In [167]: df.columns = ('A','B','C','D','E','F','G','H','GG','HH','I','J','K')
In [168]: df
Out[168]:
A B C D E F G H GG HH I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
In [169]: df.melt(id_vars=df.columns.tolist()[0:7], value_name='H').drop(columns='variable')
Out[169]:
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13

Merging two dataframes with different lengths

How can I merge two pandas dataframes with different lengths like those:
df1 =
Index block_id Ut_rec_0
0 0 7
1 1 10
2 2 2
3 3 0
4 4 10
5 5 3
6 6 6
7 7 9
df2 =
Index block_id Ut_rec_1
0 0 3
2 2 5
3 3 5
5 5 9
7 7 4
result =
Index block_id Ut_rec_0 Ut_rec_1
0 0 7 3
1 1 10 NaN
2 2 2 5
3 3 0 5
4 4 10 NaN
5 5 3 9
6 6 6 NaN
7 7 9 4
I already tried something like this, but it did not work:
df_result = pd.concat([df1, df2], join_axes=[df1['block_id']])
I already tried:
df_result = pd.concat([df1, df2], axis=1)
But the result was:
Index block_id Ut_rec_0 Index block_id Ut_rec_1
0 0 7 0.0 0.0 3.0
1 1 10 1.0 2.0 5.0
2 2 2 2.0 3.0 5.0
3 3 0 3.0 5.0 9.0
4 4 10 4.0 7.0 4.0
5 5 3 NaN NaN NaN
6 6 6 NaN NaN NaN
7 7 9 NaN NaN NaN
pandas.DataFrame.join can "join" dataframes based on overlap in column data (or index). Something like this will likely work for you:
df1.join(df2.set_index('block_id'), on='block_id')
As @Wen said, the best option would be concat with axis=1, like the code below:
pd.concat([df1, df2],axis=1)
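Note that concat with axis=1 aligns on the index, so this only reproduces the desired result if block_id is made the index first (a sketch):
pd.concat(
    [df1.set_index('block_id'), df2.set_index('block_id')['Ut_rec_1']],
    axis=1,
).reset_index()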
You need pd.merge with an outer join:
pd.merge(df1,df2,on=['Index','block_id'],how='outer')
#[out]
#Index block_id Ut_rec_0 Ut_rec_1
#0 0 7 3.0
#1 1 10 NaN
#2 2 2 5.0
#3 3 0 5.0
#4 4 10 NaN
#5 5 3 9.0
#6 6 6 NaN
#7 7 9 4.0

How do I join two dataframes based on values in selected columns?

I am trying to join (merge) two dataframes based on values in each column.
For instance, to merge by values in columns in A and B.
So, having df1
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
And df2
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
I want to get a df3 with this structure:
A B C D E F L
0 4 3 1 5 4 5 1
1 5 7 0 3 3 3 2
2 3 2 1 6 NaN NaN 4
3 3 8 NaN NaN 5 5 5
Can you please help me? I've tried both the merge and join methods but haven't succeeded.
UPDATE: (for updated DFs and new desired DF)
In [286]: merged = pd.merge(df1, df2, on=['A','B'], how='outer', suffixes=('','_y'))
In [287]: merged.L.fillna(merged.pop('L_y'), inplace=True)
In [288]: merged
Out[288]:
A B C D L E F
0 4 3 1.0 5.0 1.0 4.0 5.0
1 5 7 0.0 3.0 2.0 3.0 3.0
2 3 2 1.0 6.0 4.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0 5.0
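The pop/fillna line collapses the two L columns into one; an equivalent spelling without inplace, starting from the merge result in In [286] (a sketch):
merged['L'] = merged['L'].fillna(merged['L_y'])
merged = merged.drop(columns='L_y')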
Data:
In [284]: df1
Out[284]:
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
In [285]: df2
Out[285]:
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
OLD answer:
you can use pd.merge(..., how='outer') method:
In [193]: pd.merge(a,b, on=['A','B'], how='outer')
Out[193]:
A B C D E F
0 4 3 1.0 5.0 4.0 5.0
1 5 7 0.0 3.0 3.0 3.0
2 3 2 1.0 6.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0
Data:
In [194]: a
Out[194]:
A B C D
0 4 3 1 5
1 5 7 0 3
2 3 2 1 6
In [195]: b
Out[195]:
A B E F
0 4 3 4 5
1 5 7 3 3
2 3 8 5 5
