dataframe concatenation by column value (no outer merge) - python

This is a follow-up to an earlier question, with more information.
I want to merge two dataframes like an outer join, but I do not want the Cartesian product of matching keys, only a pairwise concatenation, for example:
df1:
A
0 2
1 2
2 2
3 2
4 2
5 3
df2:
B
0 1
1 2
2 2
3 3
4 4
With df3 = df1.merge(df2, left_on=['A'], right_on=['B'], how='outer') I get df3:
A B
0 2.0 2
1 2.0 2
2 2.0 2
3 2.0 2
4 2.0 2
5 2.0 2
6 2.0 2
7 2.0 2
8 2.0 2
9 2.0 2
10 3.0 3
11 NaN 1
12 NaN 4
But I want:
A B
0 2.0 2
1 2.0 2
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3
6 NaN 1
7 NaN 4
I just want to pair the first m occurrences of a value in df1 with the m occurrences in df2, and fill the remaining values of df1 with NaN.

Get the cumulative counts of A and B, and use the combination of the counts with A and B as the merge keys:
df1['checker'] = df1.groupby('A').cumcount()  # occurrence number of each value in A
df2['checker'] = df2.groupby('B').cumcount()  # occurrence number of each value in B
res = (df1.merge(df2, left_on=['A', 'checker'],
                 right_on=['B', 'checker'], how='outer')
          .drop('checker', axis=1))
res
A B
0 2.0 2.0
1 2.0 2.0
2 2.0 NaN
3 2.0 NaN
4 2.0 NaN
5 3.0 3.0
6 NaN 1.0
7 NaN 4.0
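For reference, a self-contained sketch of this cumcount approach, assuming only pandas and the sample frames from the question:

import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 2, 2, 2, 3]})
df2 = pd.DataFrame({'B': [1, 2, 2, 3, 4]})

# cumcount numbers the n-th occurrence of each value, so the outer merge
# pairs occurrences one-to-one instead of forming a Cartesian product
df1['checker'] = df1.groupby('A').cumcount()
df2['checker'] = df2.groupby('B').cumcount()
res = (df1.merge(df2, left_on=['A', 'checker'],
                 right_on=['B', 'checker'], how='outer')
          .drop('checker', axis=1))
print(res)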

You might want to try the concat method, for example:
result = pd.concat([df1, df2], axis=1, sort=False)
Note that this pastes the frames side by side by index position rather than matching on values.

Related

Add list to Pandas Dataframe, but keep NaNs at the top

I think this has probably been answered, but I can't find the answer anywhere. It is pretty trivial. How can I add a list to a pandas dataframe as a column, but keep the NaNs at the top?
This is the code I have:
import pandas as pd

df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
print(df)
which returns this:
a b c d
0 1 2 2.0 1.0
1 2 3 3.0 2.0
2 3 5 5.0 3.0
3 4 6 6.0 4.0
4 5 4 4.0 NaN
5 6 3 3.0 NaN
6 7 2 NaN NaN
However, I would like it to return this instead:
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
Let us try:
df = df.apply(lambda x: sorted(x, key=pd.notnull))
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
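A sketch of an alternative (my addition, not from the answers above) that moves each column's NaNs to the top while preserving the order of the non-null values:

# push NaNs to the top of each column; dropna keeps the original value order
df = df.apply(lambda s: pd.concat([s[s.isna()], s.dropna()], ignore_index=True))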
You can sort each column with a key argument to keep NaNs first; note that the tuple key below also sorts the non-null values in ascending order:
l = df.apply(sorted, key=lambda s: (~np.isnan(s), s), axis=0)
If the problem is with assignment instead of transformation, you can also try iloc with get_loc after collecting the short lists in a dictionary (d):
d = {'c': c, 'd': d}
df = df.reindex(columns=df.columns.union(d.keys(), sort=False))
for k, v in d.items():
    df.iloc[-len(v):, df.columns.get_loc(k)] = v
print(df)
a b c d
0 1 2 NaN NaN
1 2 3 2.0 NaN
2 3 5 3.0 NaN
3 4 6 5.0 1.0
4 5 4 6.0 2.0
5 6 3 4.0 3.0
6 7 2 3.0 4.0
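A shorter sketch of the same bottom-aligned assignment, relying on index alignment (assuming df keeps its default RangeIndex):

# a Series aligned to the last len(c) index labels leaves NaN at the top
df['c'] = pd.Series(c, index=df.index[-len(c):])
df['d'] = pd.Series(d, index=df.index[-len(d):])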
You can find out how many NaNs a column contains (using s.isna().sum()) and then shift() that column down by that amount.
Code example on the d column:
import pandas as pd
df = pd.DataFrame()
a = [1,2,3,4,5,6,7]
b = [2,3,5,6,4,3,2]
c = [2,3,5,6,4,3]
d = [1,2,3,4]
df["a"] = a
df["b"] = b
df.loc[range(len(c)),'c'] = c
df.loc[range(len(d)),'d'] = d
df['d'] = df['d'].shift(df['d'].isna().sum())  # example on the 'd' column
print(df)
Output:
a b c d
0 1 2 2.0 NaN
1 2 3 3.0 NaN
2 3 5 5.0 NaN
3 4 6 6.0 1.0
4 5 4 4.0 2.0
5 6 3 3.0 3.0
6 7 2 NaN 4.0
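A sketch generalizing the shift trick to every column at once, assuming each column's NaNs all sit at the bottom as in this example:

# shift each column down by its own NaN count, moving the NaNs to the top
df = df.apply(lambda s: s.shift(s.isna().sum()))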
One more way to do it: reset the index and sort with NaN values first (note that this also reverses the row order, as the result below shows):
df = df.reset_index(drop=True)
df2 = df.sort_values(by=['a', 'b', 'c', 'd'], ascending=False, na_position='first')
#Result
a b c d
6 7 2 NaN NaN
5 6 3 3.0 NaN
4 5 4 4.0 NaN
3 4 6 6.0 4.0
2 3 5 5.0 3.0
1 2 3 3.0 2.0
0 1 2 2.0 1.0

Merging two dataframes with different lengths

How can I merge two pandas dataframes with different lengths like those:
df1:
Index block_id Ut_rec_0
0 0 7
1 1 10
2 2 2
3 3 0
4 4 10
5 5 3
6 6 6
7 7 9
df2:
Index block_id Ut_rec_1
0 0 3
2 2 5
3 3 5
5 5 9
7 7 4
result:
Index block_id Ut_rec_0 Ut_rec_1
0 0 7 3
1 1 10 NaN
2 2 2 5
3 3 0 5
4 4 10 NaN
5 5 3 9
6 6 6 NaN
7 7 9 4
I already tried something like this, but it did not work:
df_result = pd.concat([df1, df2], join_axes=[df1['block_id']])
I already tried:
df_result = pd.concat([df1, df2], axis=1)
But the result was:
Index block_id Ut_rec_0 Index block_id Ut_rec_1
0 0 7 0.0 0.0 3.0
1 1 10 1.0 2.0 5.0
2 2 2 2.0 3.0 5.0
3 3 0 3.0 5.0 9.0
4 4 10 4.0 7.0 4.0
5 5 3 NaN NaN NaN
6 6 6 NaN NaN NaN
7 7 9 NaN NaN NaN
pandas.DataFrame.join can "join" dataframes based on overlap in column data (or index). Something like this will likely work for you:
df1.join(df2.set_index('block_id'), on='block_id')
As @Wen said, the best option would be concat with axis=1, as in the code below:
pd.concat([df1, df2],axis=1)
You need pd.merge with an outer join:
pd.merge(df1,df2,on=['Index','block_id'],how='outer')
#[out]
#Index block_id Ut_rec_0 Ut_rec_1
#0 0 7 3.0
#1 1 10 NaN
#2 2 2 5.0
#3 3 0 5.0
#4 4 10 NaN
#5 5 3 9.0
#6 6 6 NaN
#7 7 9 4.0
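If you prefer concat, a sketch (my variant, assuming block_id is unique in each frame) that first aligns both frames on block_id so the gaps become NaN:

res = pd.concat(
    [df1.set_index('block_id'), df2.set_index('block_id')['Ut_rec_1']],
    axis=1,
).reset_index()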

How to restrict the area for operation in python pandas dataframes?

I'd like to restrict my dropna operation to the first 3 rows of the dataframe. The original dataframe is:
A C
0 0.0 0
1 NaN 1
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
And I would love to see:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
With only the row indexed 1 removed. Is it possible to make it within just one line of code?
Thanks!
You could use
In [594]: df[df.notnull().all(1) | (df.index > 3)]
Out[594]:
A C
0 0.0 0
2 2.0 2
3 3.0 3
4 NaN 4
5 5.0 5
6 6.0 6
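An equivalent sketch that applies dropna only to the first three rows and reattaches the rest unchanged:

df = pd.concat([df.iloc[:3].dropna(), df.iloc[3:]])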

Python pandas : groupby on two columns and create new variables

I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type computing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with the mean aggregate and reshape with unstack to build a helper DataFrame, which is then joined to the original df:
s = (df.groupby(['company', 'investor'])['pct']
       .mean()
       .unstack(fill_value=0)
       .add_prefix('pct_mean_'))
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table with default aggregate function mean:
s = df.pivot_table(index='company',
                   columns='investor',
                   values='pct',
                   fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
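For completeness, a sketch of the same helper table built with pd.crosstab (my variant, not from the answer above), where aggfunc='mean' plays the role of the default pivot_table aggregation:

# per-company mean pct per investor type; missing combinations become 0
m = (pd.crosstab(df['company'], df['investor'],
                 values=df['pct'], aggfunc='mean')
       .fillna(0)
       .add_prefix('pct_mean_'))
df = df.join(m, on='company')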

Append rows to groups in pandas

I'm trying to append a number of NaN rows to each group in a pandas dataframe. Essentially I want to pad each group to be 5 rows long. Ordering is important. I have:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 1 e
8 2 e
9 3 e
I want:
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e
Using pd.crosstab:
df = pd.crosstab(df.Rank, df.ID).iloc[:5].unstack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
Output:
ID Rank
0 a 1.0
1 a 2.0
2 a 3.0
3 a 4.0
4 a 5.0
5 c 1.0
6 c 2.0
7 c NaN
8 c NaN
9 c NaN
10 e 1.0
11 e 2.0
12 e 3.0
13 e NaN
14 e NaN
Another approach, assuming the maximum group size in df is exactly 5.
In [251]: df.groupby('ID').Rank.apply(np.array).apply(pd.Series).stack(dropna=False)
Out[251]:
ID
a 0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
c 0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
e 0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
dtype: float64
Full explanation:
import pandas as pd
import numpy as np
from io import StringIO  # assuming a recent pandas; pd.compat.StringIO was removed

df = pd.read_csv(StringIO("""Rank ID
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
6 1 c
7 2 c
8 1 e
9 2 e
10 3 e"""), sep=r' +')
df = pd.crosstab(df.Rank, df.ID).iloc[:5].T.stack().reset_index()
df.loc[(df[0]==0),'Rank'] = np.nan
del df[0]
# pd.crosstab(df.Rank, df.ID) produces:
# ID a c e
# Rank
# 1.0 1 1 1
# 2.0 1 1 1
# 3.0 1 0 1
# 4.0 1 0 0
# 5.0 1 0 0
# applying .T.stack().reset_index() yields:
# ID Rank 0
# 0 a 1.0 1
# 1 a 2.0 1
# 2 a 3.0 1
# 3 a 4.0 1
# 4 a 5.0 1
# 5 c 1.0 1
# 6 c 2.0 1
# 7 c 3.0 0
# 8 c 4.0 0
# 9 c 5.0 0
# 10 e 1.0 1
# 11 e 2.0 1
# 12 e 3.0 1
# 13 e 4.0 0
# 14 e 5.0 0
# finally, use df[0] to filter df['Rank']
concat and reindex
This solution does not consider the values in the Rank column and only adds more rows if more are needed.
pd.concat([
    d.reset_index(drop=True).reindex(range(5)).assign(id=n)
    for n, d in df.groupby('id')
], ignore_index=True)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Same answer phrased a bit differently
f = lambda t: t[1].reset_index(drop=True).reindex(range(5)).assign(id=t[0])
pd.concat(map(f, df.groupby('id')), ignore_index=True)
factorize
This solution produces the Cartesian product of unique values from id and Rank
i, r = df.id.factorize()
j, c = df.Rank.factorize()
b = np.empty((r.size, c.size))
b.fill(np.nan)
b[i, j] = df.Rank.values
pd.DataFrame(dict(Rank=b.ravel(), id=r.repeat(c.size)))
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
You can use the frequency of the ids and pd.concat to append the padding rows, i.e.
di = (5 - df.groupby('id').size()).to_dict()
temp = pd.concat([pd.DataFrame({
    'Rank': np.nan,
    'id': pd.Series(np.repeat(i, di[i]))
}) for i in df['id'].unique()])
ndf = pd.concat([df, temp], ignore_index=True).sort_values('id')
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
10 NaN c
11 NaN c
12 NaN c
7 1.0 e
8 2.0 e
9 3.0 e
13 NaN e
14 NaN e
One possible solution is to create a helper DataFrame with numpy.repeat, append it to the original, and sort_values last:
s = 5 - df['id'].value_counts()
# note: DataFrame.append was removed in pandas 2.0; use pd.concat there
df = (df.append(pd.DataFrame({'id': np.repeat(s.index, s.values), 'Rank': np.nan}))
        .sort_values('id')
        .reset_index(drop=True))
print(df)
print (df)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Another solution, if no sorting is possible, is groupby with a custom function and append:
def f(x):
    return x.append(pd.DataFrame([[np.nan, x.name]] * (5 - len(x)), columns=['Rank', 'id']))

df = df.groupby('id', sort=False).apply(f).reset_index(drop=True)
print (df)
Rank id
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 1 c
6 2 c
7 NaN c
8 NaN c
9 NaN c
10 1 e
11 2 e
12 3 e
13 NaN e
14 NaN e
Here's one way using a single pd.DataFrame.append followed by sort_values.
from itertools import chain
counts = df.groupby('id')['Rank'].count()
lst = list(chain.from_iterable([[np.nan, i]]*(5-c) for i, c in counts.items()))
res = df.append(pd.DataFrame(lst, columns=df.columns))\
        .sort_values(['id', 'Rank'])\
        .reset_index(drop=True)
print(res)
Rank id
0 1.0 a
1 2.0 a
2 3.0 a
3 4.0 a
4 5.0 a
5 1.0 c
6 2.0 c
7 NaN c
8 NaN c
9 NaN c
10 1.0 e
11 2.0 e
12 3.0 e
13 NaN e
14 NaN e
Excellent answers so far. I had another idea, because it fits the problem I'm dealing with better: an outer join with pd.merge.
In addition to the example above, I have several metric columns (m1 and m2 in this example) that I want to set to zero for every group that does not contain those Rank values. In my case, Rank is simply a time dimension, and the df contains a time series over several IDs.
from io import StringIO  # assuming a recent pandas; pd.compat.StringIO was removed

df = pd.read_csv(StringIO("""Rank ID m1 m2
0 1 a 1 3
1 2 a 2 3
2 3 a 1 2
3 4 a 1 3
4 5 a 2 3
6 1 c 2 2
7 2 c 2 4
8 1 e 1 3
9 2 e 1 4
10 3 e 1 2"""), sep=r' +')
I then define a df that contains all the Ranks, in this example from 1 to 10.
df_outer_right = pd.DataFrame({'Rank':np.arange(1,11,1)})
Finally, I group the initial df by ID and apply an outer join using pd.merge to each group.
df.groupby('ID').apply(lambda df: pd.merge(df, df_outer_right, how='outer', on='Rank'))
which yields:
ID Rank ID m1 m2
a 0 1 a 1.0 3.0
a 1 2 a 2.0 3.0
a 2 3 a 1.0 2.0
a 3 4 a 1.0 3.0
a 4 5 a 2.0 3.0
a 5 6 NaN NaN NaN
a 6 7 NaN NaN NaN
a 7 8 NaN NaN NaN
a 8 9 NaN NaN NaN
a 9 10 NaN NaN NaN
c 0 1 c 2.0 2.0
c 1 2 c 2.0 4.0
c 2 3 NaN NaN NaN
c 3 4 NaN NaN NaN
c 4 5 NaN NaN NaN
c 5 6 NaN NaN NaN
c 6 7 NaN NaN NaN
c 7 8 NaN NaN NaN
c 8 9 NaN NaN NaN
c 9 10 NaN NaN NaN
e 0 1 e 1.0 3.0
e 1 2 e 1.0 4.0
e 2 3 e 1.0 2.0
e 3 4 NaN NaN NaN
e 4 5 NaN NaN NaN
e 5 6 NaN NaN NaN
e 6 7 NaN NaN NaN
e 7 8 NaN NaN NaN
e 8 9 NaN NaN NaN
e 9 10 NaN NaN NaN
I'm pretty sure this is not the fastest solution. :)
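A related sketch (my assumption of an equivalent, not part of the original answer) that does the same padding in one shot by reindexing onto the full ID x Rank grid:

import numpy as np
import pandas as pd

# df as defined in the answer above, with columns Rank, ID, m1, m2;
# missing (ID, Rank) combinations get NaN in the metric columns
full = pd.MultiIndex.from_product([df['ID'].unique(), np.arange(1, 11)],
                                  names=['ID', 'Rank'])
res = df.set_index(['ID', 'Rank']).reindex(full).reset_index()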
