dataframe concatenation by column value (no outer merge) - python

This is a follow-up to an earlier question, with more information.
I want to merge two dataframes like an outer join, but I do not want the Cartesian product of matching keys, only a pairwise concatenation, for example:
df1:
A
0 2
1 2
2 2
3 2
4 2
5 3
df2:
B
0 1
1 2
2 2
3 3
4 4
With df3 = df1.merge(df2, left_on=['A'], right_on=['B'], how='outer') I get df3:
A B
0 2.0 2
1 2.0 2
2 2.0 2
3 2.0 2
4 2.0 2
5 2.0 2
6 2.0 2
7 2.0 2
8 2.0 2
9 2.0 2
10 3.0 3
11 NaN 1
12 NaN 4
But I want:
A B
0 2.0 2
1 2.0 2
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3
6 NaN 1
7 NaN 4
I just want to pair the first m occurrences of a value in df1 with the m occurrences in df2, and fill the remaining values of df1 with NaN.

Get the cumulative counts of A and B, and use the combination of the counts with A and B as the merge keys:
df1['checker'] = df1.groupby('A').cumcount()  # occurrence number of each value in A
df2['checker'] = df2.groupby('B').cumcount()  # occurrence number of each value in B
res = (df1.merge(df2, left_on=['A', 'checker'],
                 right_on=['B', 'checker'], how='outer')
          .drop('checker', axis=1))
res
A B
0 2.0 2.0
1 2.0 2.0
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3.0
6 NaN 1.0
7 NaN 4.0
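For reference, a self-contained sketch of this cumcount approach, assuming only pandas and the sample frames from the question:

import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 2, 2, 2, 3]})
df2 = pd.DataFrame({'B': [1, 2, 2, 3, 4]})

# cumcount numbers the n-th occurrence of each value, so the outer merge
# pairs occurrences one-to-one instead of forming a Cartesian product
df1['checker'] = df1.groupby('A').cumcount()
df2['checker'] = df2.groupby('B').cumcount()
res = (df1.merge(df2, left_on=['A', 'checker'],
                 right_on=['B', 'checker'], how='outer')
          .drop('checker', axis=1))
print(res)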

You might want to try the concat method, for example:
result = pd.concat([df1, df2], axis=1, sort=False)
Note that this pastes the frames side by side by index position rather than matching on values.

Related

Add list to Pandas Dataframe, but keep NaNs at the top

I think this has probably been answered, but I can't find the answer anywhere. It is pretty trivial. How can I add a list to a pandas dataframe as a column, but keep the NaNs at the top?
This is the code I have:
import pandas as pd

df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
print(df)
which returns this:
a b c d
0 1 2 2.0 1.0
1 2 3 3.0 2.0
2 3 5 5.0 3.0
3 4 6 6.0 4.0
4 5 4 4.0 NaN
5 6 3 3.0 NaN
6 7 2 NaN NaN
However, I would like it to return this instead:
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
Let us try:
df = df.apply(lambda x: sorted(x, key=pd.notnull))
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
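A sketch of an alternative (my addition, not from the answers above) that moves each column's NaNs to the top while preserving the order of the non-null values:

# push NaNs to the top of each column; dropna keeps the original value order
df = df.apply(lambda s: pd.concat([s[s.isna()], s.dropna()], ignore_index=True))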
You can sort each column with a key argument to keep NaNs first; note that the tuple key below also sorts the non-null values in ascending order:
l = df.apply(sorted, key=lambda s: (~np.isnan(s), s), axis=0)
If the problem is with assignment instead of transformation, you can also try iloc with get_loc after collecting the short lists in a dictionary (d):
d = {'c': c, 'd': d}
df = df.reindex(columns=df.columns.union(d.keys(), sort=False))
for k, v in d.items():
    df.iloc[-len(v):, df.columns.get_loc(k)] = v
print(df)
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
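A shorter sketch of the same bottom-aligned assignment, relying on index alignment (assuming df keeps its default RangeIndex):

# a Series aligned to the last len(c) index labels leaves NaN at the top
df['c'] = pd.Series(c, index=df.index[-len(c):])
df['d'] = pd.Series(d, index=df.index[-len(d):])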
You can find out how many NaNs a column contains (using s.isna().sum()) and then shift() that column down by that amount.
Code example on the d column:
import pandas as pd
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
df['d'] = df['d'].shift(df['d'].isna().sum())  # example on the 'd' column
print(df)
Output:
a b c d
0 1 2 2.0 NaN
1 2 3 3.0 NaN
2 3 5 5.0 NaN
3 4 6 6.0 1.0
4 5 4 4.0 2.0
5 6 3 3.0 3.0
6 7 2 NaN 4.0
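A sketch generalizing the shift trick to every column at once, assuming each column's NaNs all sit at the bottom as in this example:

# shift each column down by its own NaN count, moving the NaNs to the top
df = df.apply(lambda s: s.shift(s.isna().sum()))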
One more way to do it: reset the index and sort with NaN values first (note that this also reverses the row order, as the result below shows):
df = df.reset_index(drop=True)
df2 = df.sort_values(by=['a', 'b', 'c', 'd'], ascending=False, na_position='first')
#Result
a b c d
6 7 2 NaN NaN
5 6 3 3.0 NaN
4 5 4 4.0 NaN
3 4 6 6.0 4.0
2 3 5 5.0 3.0
1 2 3 3.0 2.0
0 1 2 2.0 1.0

Merging two dataframes with different lengths

How can I merge two pandas dataframes with different lengths like those:
df1:
Index block_id Ut_rec_0
0 0 7
1 1 10
2 2 2
3 3 0
4 4 10
5 5 3
6 6 6
7 7 9
df2:
Index block_id Ut_rec_1
0 0 3
2 2 5
3 3 5
5 5 9
7 7 4
result:
Index block_id Ut_rec_0 Ut_rec_1
0 0 7 3
1 1 10 NaN
2 2 2 5
3 3 0 5
4 4 10 NaN
5 5 3 9
6 6 6 NaN
7 7 9 4
I already tried something like this, but it did not work:
df_result = pd.concat([df1, df2], join_axes=[df1['block_id']])
I already tried:
df_result = pd.concat([df1, df2], axis=1)
But the result was:
Index block_id Ut_rec_0 Index block_id Ut_rec_1
0 0 7 0.0 0.0 3.0
1 1 10 1.0 2.0 5.0
2 2 2 2.0 3.0 5.0
3 3 0 3.0 5.0 9.0
4 4 10 4.0 7.0 4.0
5 5 3 NaN NaN NaN
6 6 6 NaN NaN NaN
7 7 9 NaN NaN NaN
pandas.DataFrame.join can "join" dataframes based on overlap in column data (or index). Something like this will likely work for you:
df1.join(df2.set_index('block_id'), on='block_id')
As @Wen said, the best option would be concat with axis=1, as in the code below:
pd.concat([df1, df2],axis=1)
You need pd.merge with an outer join:
pd.merge(df1,df2,on=['Index','block_id'],how='outer')
#[out]
#Index block_id Ut_rec_0 Ut_rec_1
#0 0 7 3.0
#1 1 10 NaN
#2 2 2 5.0
#3 3 0 5.0
#4 4 10 NaN
#5 5 3 9.0
#6 6 6 NaN
#7 7 9 4.0
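If you prefer concat, a sketch (my variant, assuming block_id is unique in each frame) that first aligns both frames on block_id so the gaps become NaN:

res = pd.concat(
    [df1.set_index('block_id'), df2.set_index('block_id')['Ut_rec_1']],
    axis=1,
).reset_index()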

How to restrict the area for operation in python pandas dataframes?

I'd like to restrict my dropna operation to the first 3 rows of the dataframe. The original dataframe is:
A C
0 0.0 0
1 NaN 1
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
And I would love to see:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
With only the row indexed 1 removed. Is it possible to make it within just one line of code?
Thanks!
You could use
In [594]: df[df.notnull().all(1) | (df.index > 3)]
Out[594]:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
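An equivalent sketch that applies dropna only to the first three rows and reattaches the rest unchanged:

df = pd.concat([df.iloc[:3].dropna(), df.iloc[3:]])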

Python pandas : groupby on two columns and create new variables

I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type computing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with the mean aggregate and reshape with unstack to build a helper DataFrame, which is then joined to the original df:
s = (df.groupby(['company', 'investor'])['pct']
       .mean()
       .unstack(fill_value=0)
       .add_prefix('pct_mean_'))
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table with default aggregate function mean:
s = df.pivot_table(index='company',
                   columns='investor',
                   values='pct',
                   fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
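For completeness, a sketch of the same helper table built with pd.crosstab (my variant, not from the answer above), where aggfunc='mean' plays the role of the default pivot_table aggregation:

# per-company mean pct per investor type; missing combinations become 0
m = (pd.crosstab(df['company'], df['investor'],
                 values=df['pct'], aggfunc='mean')
       .fillna(0)
       .add_prefix('pct_mean_'))
df = df.join(m, on='company')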

Append rows to groups in pandas

I'm trying to append a number of NaN rows to each group in a pandas dataframe. Essentially I want to pad each group to be 5 rows long. Ordering is important. I have:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 1 e
8 2 e
9 3 e
I want:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e
Using pd.crosstab:
df = pd.crosstab(df.Rank, df.ID).iloc[:5].unstack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
Output:
ID Rank
0 a 1.0
1 a 2.0
2 a 3.0
3 a 4.0
4 a 5.0
5 c 1.0
6 c 2.0
7 c NaN
8 c NaN
9 c NaN
10 e 1.0
11 e 2.0
12 e 3.0
13 e NaN
14 e NaN
Another approach, assuming the maximum group size in df is exactly 5.
In [251]: df.groupby('ID').Rank.apply(np.array).apply(pd.Series).stack(dropna=False)
Out[251]:
ID
a 0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
c 0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
e 0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
Full explanation:
import pandas as pd
import numpy as np
from io import StringIO  # assuming a recent pandas; pd.compat.StringIO was removed

df = pd.read_csv(StringIO("""Rank ID
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
6 1 c
7 2 c
8 1 e
9 2 e
10 3 e"""), sep=r' +')
df = pd.crosstab(df.Rank, df.ID).iloc[:5].T.stack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
# pd.crosstab(df.Rank, df.ID) produces:
# ID a c e
# Rank
# 1.0 1 1 1
# 2.0 1 1 1
# 3.0 1 0 1
# 4.0 1 0 0
# 5.0 1 0 0
# applying .T.stack().reset_index() yields:
# ID Rank 0
# 0 a 1.0 1
# 1 a 2.0 1
# 2 a 3.0 1
# 3 a 4.0 1
# 4 a 5.0 1
# 5 c 1.0 1
# 6 c 2.0 1
# 7 c 3.0 0
# 8 c 4.0 0
# 9 c 5.0 0
# 10 e 1.0 1
# 11 e 2.0 1
# 12 e 3.0 1
# 13 e 4.0 0
# 14 e 5.0 0
# finally, use df[0] to filter df['Rank']
concat and reindex
This solution does not consider the values in the Rank column and only adds more rows if more are needed.
pd.concat([
    d.reset_index(drop=True).reindex(range(5)).assign(id=n)
    for n, d in df.groupby('id')
], ignore_index=True)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Same answer phrased a bit differently
f = lambda t: t[1].reset_index(drop=True).reindex(range(5)).assign(id=t[0])
pd.concat(map(f, df.groupby('id')), ignore_index=True)
factorize
This solution produces the Cartesian product of unique values from id and Rank
i, r = df.id.factorize()
j, c = df.Rank.factorize()
b = np.empty((r.size, c.size))
b.fill(np.nan)
b[i, j] = df.Rank.values
pd.DataFrame(dict(Rank=b.ravel(), id=r.repeat(c.size)))
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
You can use the frequency of the ids and pd.concat to append the padding rows, i.e.
di = (5 - df.groupby('id').size()).to_dict()
temp = pd.concat([pd.DataFrame({
    'Rank': np.nan,
    'id': pd.Series(np.repeat(i, di[i]))
}) for i in df['id'].unique()])
ndf = pd.concat([df, temp], ignore_index=True).sort_values('id')
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
10 NaN c
11 NaN c
12 NaN c
7 1.0 e
8 2.0 e
9 3.0 e
13 NaN e
14 NaN e
One possible solution is to create a helper DataFrame with numpy.repeat, append it to the original, and sort_values last:
s = 5 - df['id'].value_counts()
# note: DataFrame.append was removed in pandas 2.0; use pd.concat there
df = (df.append(pd.DataFrame({'id': np.repeat(s.index, s.values), 'Rank': np.nan}))
        .sort_values('id')
        .reset_index(drop=True))
print(df)
print (df)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Another solution, if no sorting is possible, is groupby with a custom function and append:
def f(x):
    return x.append(pd.DataFrame([[np.nan, x.name]] * (5 - len(x)), columns=['Rank', 'id']))

df = df.groupby('id', sort=False).apply(f).reset_index(drop=True)
print (df)
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e
Here's one way using a single pd.DataFrame.append followed by sort_values.
from itertools import chain
counts = df.groupby('id')['Rank'].count()
lst = list(chain.from_iterable([[np.nan, i]]*(5-c) for i, c in counts.items()))
res = df.append(pd.DataFrame(lst, columns=df.columns))\
        .sort_values(['id', 'Rank'])\
        .reset_index(drop=True)
print(res)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Excellent answers so far. I had another idea, because it fits the problem I'm dealing with better: an outer join with pd.merge.
In addition to the example above, I have several metric columns (m1 and m2 in this example) that I want to set to zero for every group that does not contain those Rank values. In my case, Rank is simply a time dimension, and the df contains a time series over several IDs.
from io import StringIO  # assuming a recent pandas; pd.compat.StringIO was removed

df = pd.read_csv(StringIO("""Rank ID m1 m2
0 1 a 1 3
1 2 a 2 3
2 3 a 1 2
3 4 a 1 3
4 5 a 2 3
6 1 c 2 2
7 2 c 2 4
8 1 e 1 3
9 2 e 1 4
10 3 e 1 2"""), sep=r' +')
I then define a df that contains all the Ranks, in this example from 1 to 10.
df_outer_right = pd.DataFrame({'Rank':np.arange(1,11,1)})
Finally, I group the initial df by ID and apply an outer join using pd.merge to each group.
df.groupby('ID').apply(lambda df: pd.merge(df, df_outer_right, how='outer', on='Rank'))
which yields:
ID Rank ID m1 m2
a 0 1 a 1.0 3.0
a 1 2 a 2.0 3.0
a 2 3 a 1.0 2.0
a 3 4 a 1.0 3.0
a 4 5 a 2.0 3.0
a 5 6 NaN NaN NaN
a 6 7 NaN NaN NaN
a 7 8 NaN NaN NaN
a 8 9 NaN NaN NaN
a 9 10 NaN NaN NaN
c 0 1 c 2.0 2.0
c 1 2 c 2.0 4.0
c 2 3 NaN NaN NaN
c 3 4 NaN NaN NaN
c 4 5 NaN NaN NaN
c 5 6 NaN NaN NaN
c 6 7 NaN NaN NaN
c 7 8 NaN NaN NaN
c 8 9 NaN NaN NaN
c 9 10 NaN NaN NaN
e 0 1 e 1.0 3.0
e 1 2 e 1.0 4.0
e 2 3 e 1.0 2.0
e 3 4 NaN NaN NaN
e 4 5 NaN NaN NaN
e 5 6 NaN NaN NaN
e 6 7 NaN NaN NaN
e 7 8 NaN NaN NaN
e 8 9 NaN NaN NaN
e 9 10 NaN NaN NaN
I'm pretty sure this is not the fastest solution. :)
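A related sketch (my assumption of an equivalent, not part of the original answer) that does the same padding in one shot by reindexing onto the full ID x Rank grid:

import numpy as np
import pandas as pd

# df as defined in the answer above, with columns Rank, ID, m1, m2;
# missing (ID, Rank) combinations get NaN in the metric columns
full = pd.MultiIndex.from_product([df['ID'].unique(), np.arange(1, 11)],
                                  names=['ID', 'Rank'])
res = df.set_index(['ID', 'Rank']).reindex(full).reset_index()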
