Join dataframes by key - repeated data as new columns - python

I'm facing the following situation: I have two dataframes, let's say df1 and df2, and I need to join them by a key ( ID_ed , ID ). The second dataframe may have more than one occurrence of the key. What I need is to join the two dataframes and add the repeated occurrences of the key as new columns (as shown in the image below).
I tried merge = df2.join( df1 , lsuffix='_ZID', rsuffix='_IID' , how = "left" ) and concat operations, but no luck so far. It seems that only the last occurrence is preserved (as if the data were being overwritten).
Any help with this is really appreciated, and thanks in advance.

Another approach is to create a serial counter for the ID_ed column, then set_index and unstack before calling pivot_table. The pivot_table aggregation would be 'first'. This approach is fairly similar to this SO answer.
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed','color'],[1,5],[2,8],[3,7]]
b = [['ID','code'],[1,1],[1,5],
     [2,np.nan],[2,20],[2,74],
     [3,10],[3,98],[3,85],
     [3,21],[3,45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
ID_ed color
0 1 5
1 2 8
2 3 7
print(df2)
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
5 3 10.0
6 3 98.0
7 3 85.0
8 3 21.0
9 3 45.0
First the merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
ID_ed color ID code counter
0 1 5 1 1.0 1
1 1 5 1 5.0 2
2 2 8 2 NaN 1
3 2 8 2 20.0 2
4 2 8 2 74.0 3
5 3 7 3 10.0 1
6 3 7 3 98.0 2
7 3 7 3 85.0 3
8 3 7 3 21.0 4
9 3 7 3 45.0 5
# Set index and unstack
df_wide = (df.set_index(['ID_ed','color','counter'])
             .unstack()
             .swaplevel(1, 0, axis=1)
             .sort_index(level=0, axis=1)
             .add_prefix('counter_'))
print(df_wide)
counter       counter_1               counter_2               counter_3               counter_4               counter_5
            counter_ID counter_code counter_ID counter_code counter_ID counter_code counter_ID counter_code counter_ID counter_code
ID_ed color
1     5            1.0          1.0        1.0          5.0        NaN          NaN        NaN          NaN        NaN          NaN
2     8            2.0          NaN        2.0         20.0        2.0         74.0        NaN          NaN        NaN          NaN
3     7            3.0         10.0        3.0         98.0        3.0         85.0        3.0         21.0        3.0         45.0
Next generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed','color'],
                     columns=['counter'],
                     values=['ID', 'code'],
                     aggfunc='first')
print(dfp)
                ID                          code
counter          1    2    3    4    5        1     2     3     4     5
ID_ed color
1     5        1.0  1.0  NaN  NaN  NaN      1.0   5.0   NaN   NaN   NaN
2     8        2.0  2.0  2.0  NaN  NaN      NaN  20.0  74.0   NaN   NaN
3     7        3.0  3.0  3.0  3.0  3.0     10.0  98.0  85.0  21.0  45.0
Finally rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
ID_ed color code_1 code_2 code_3 code_4 code_5
0 1 5 1.0 5.0 NaN NaN NaN
1 2 8 NaN 20.0 74.0 NaN NaN
2 3 7 10.0 98.0 85.0 21.0 45.0

I'd use cumcount and pivot_table:
In [11]: df1
Out[11]:
ID color
0 1 5
1 2 8
2 3 7
In [12]: df2
Out[12]:
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
In [13]: res = df1.merge(df2)  # merges on the shared 'ID' column
In [14]: res
Out[14]:
ID color code
0 1 5 1.0
1 1 5 5.0
2 2 8 NaN
3 2 8 20.0
4 2 8 74.0
In [15]: res['count'] = res.groupby('ID').cumcount()
In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count 0 1 2
ID color
1 5 1.0 5.0 NaN
2 8 NaN 20.0 74.0
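To get flat columns like the ones asked for in the question, the pivoted result can be renamed afterwards; a small sketch continuing from res above (the code_1/code_2 names are only illustrative):
out = res.pivot_table('code', ['ID', 'color'], 'count')
out.columns = [f'code_{c + 1}' for c in out.columns]  # count 0, 1, 2 -> code_1, code_2, code_3
print(out.reset_index())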

Related

Joining or merging multiple columns within one dataframe and keeping all data

I have this dataframe:
df = pd.DataFrame({'Position1':[1,2,3], 'Count1':[55,35,45],
                   'Position2':[4,2,7], 'Count2':[15,35,75],
                   'Position3':[3,5,6], 'Count3':[45,95,105]})
print(df)
Position1 Count1 Position2 Count2 Position3 Count3
0 1 55 4 15 3 45
1 2 35 2 35 5 95
2 3 45 7 75 6 105
I want to join the Position columns into one column named "Positions" while sorting the data in the Counts columns like so:
Positions Count1 Count2 Count3
0 1 55 NaN NaN
1 2 35 35 NaN
2 3 45 NaN 45
3 4 NaN 15 NaN
4 5 NaN NaN 95
5 6 NaN NaN 105
6 7 NaN 75 NaN
I've tried melting the dataframe, combining and merging columns but I am a bit stuck.
Note that the NaN types can easily be replaced by using df.fillna to get a dataframe like so:
df = df.fillna(0)
Positions Count1 Count2 Count3
0 1 55 0 0
1 2 35 35 0
2 3 45 0 45
3 4 0 15 0
4 5 0 0 95
5 6 0 0 105
6 7 0 75 0
Here is a way to do what you've asked:
df = df[['Position1', 'Count1']].rename(columns={'Position1':'Positions'}).join(
    df[['Position2', 'Count2']].set_index('Position2'), on='Positions', how='outer').join(
    df[['Position3', 'Count3']].set_index('Position3'), on='Positions', how='outer').sort_values(
    by=['Positions']).reset_index(drop=True)
Output:
Positions Count1 Count2 Count3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
Explanation:
Use join first on Position1, Count1 and Position2, Count2 (with Position1 renamed as Positions) then on that join result and Position3, Count3.
Sort by Positions and use reset_index to create a new integer range index (ascending with no gaps).
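If there were more Position/Count pairs, the same chain could be written as a loop; this is only a sketch, assuming df from the question and that the columns keep the PositionN/CountN naming pattern:
out = df[['Position1', 'Count1']].rename(columns={'Position1': 'Positions'})
for n in (2, 3):  # extend for additional Position/Count pairs
    out = out.join(df[[f'Position{n}', f'Count{n}']].set_index(f'Position{n}'),
                   on='Positions', how='outer')
out = out.sort_values(by=['Positions']).reset_index(drop=True)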
Does this achieve what you are after?
import pandas as pd
df = pd.DataFrame({'Position1':[1,2,3], 'Count1':[55,35,45],
                   'Position2':[4,2,7], 'Count2':[15,35,75],
                   'Position3':[3,5,6], 'Count3':[45,95,105]})
df1, df2, df3 = df.iloc[:,:2], df.iloc[:, 2:4], df.iloc[:, 4:6]
df1.columns, df2.columns, df3.columns = ['Positions', 'Count1'], ['Positions', 'Count2'], ['Positions', 'Count3']
df1.merge(df2, on='Positions', how='outer').merge(df3, on='Positions', how='outer').sort_values('Positions')
Output: the same Positions / Count1 / Count2 / Count3 table as the desired result shown above, sorted by Positions.
wide_to_long unpivots the DF from wide to long format, and that is what's used here.
Column names are also renamed here, with this edit:
df['id'] = df.index
df2 = pd.wide_to_long(df, stubnames=['Position','Count'], i='id', j='pos').reset_index()
df2 = df2.pivot(index=['id','Position'], columns='pos', values='Count').reset_index().fillna(0).add_prefix('count_')
df2.rename(columns={'count_id': 'id', 'count_Position': 'Position'}, inplace=True)
df2
RESULT:
pos  id  Position  count_1  count_2  count_3
0     0         1     55.0      0.0      0.0
1     0         3      0.0      0.0     45.0
2     0         4      0.0     15.0      0.0
3     1         2     35.0     35.0      0.0
4     1         5      0.0      0.0     95.0
5     2         3     45.0      0.0      0.0
6     2         6      0.0      0.0    105.0
7     2         7      0.0     75.0      0.0
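To collapse this to one row per Position (the layout asked for in the question), one option is to group on Position; a sketch, assuming the df2 frame above with its zero-filled count_1/count_2/count_3 columns:
out = df2.groupby('Position')[['count_1', 'count_2', 'count_3']].max().reset_index()
print(out)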
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'num'),
     names_pattern=r"(.+)(\d+)")
 .pivot_wider(index='Position', names_from='num')
)
Position Count_1 Count_2 Count_3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
In the pivot_longer section, the .value determines which part of the column names remain as column headers - in this case it is Position and Count.
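To see roughly what the .value mechanism captures, here is a plain-pandas sketch of the intermediate long form (the part/num column names are only illustrative):
long = df.melt()
# split e.g. 'Count2' into the stub ('Count', which '.value' keeps as a header) and the suffix ('2')
long[['part', 'num']] = long['variable'].str.extract(r'(.+?)(\d+)')
print(long.head())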

Find cumulative sums of each grouping in a row and then set the grouping equal to the maximum sum

If I have a pandas data frame of ones like this:
NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1
How do I do a cumulative sum in each row, but then set each grouping to the maximum value of its cumulative sum, so that I get a pandas data frame like this:
NaN 4 4 4 4 NaN 3 3 3 NaN 1
NaN NaN 4 4 4 4 NaN NaN 1 NaN 1
NaN NaN 9 9 9 9 9 9 9 9 9
First we do stack with isnull, then create the sub-groups with cumsum, count the consecutive 1s with transform, and as a last step use unstack to convert the data back.
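(df here is assumed to be the frame from the question; for example it can be built like this:)
from io import StringIO
import pandas as pd

d = """NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
df.columns = range(1, df.shape[1] + 1)  # 1-based column labels to match the printout below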
# cumsum of the NaN flags per row gives a run id that increments at every NaN; keep only non-NaN cells
s = df.isnull().stack()
s = s.groupby(level=0).cumsum()[~s]
# the size of each run ('count') fills every cell in that run; unstack/reindex_like restores the original shape
s = s.groupby([s.index.get_level_values(0), s]).transform('count').unstack().reindex_like(df)
1 2 3 4 5 6 7 8 9 10 11
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
Many more steps than @YOBEN_S's answer, but we can make use of melt and groupby.
We use cumcount to create a conditional helper column to group with.
from io import StringIO
import numpy as np
import pandas as pd
d = """ NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
s = df.reset_index().melt(id_vars="index")
s.loc[s["value"].isnull(), "counter"] = s.groupby(
[s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
    s.groupby(["index", "variable"])["val"]
    .first()
    .unstack()
    .rename_axis(None, axis=1)
    .rename_axis(None)
)
print(df2)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0

How to reset cumprod when NaNs are present in a pandas column

I have 2 columns in a dataframe for which I want to calculate the cumprod, but the cumprod needs to restart once it sees a NaN in a cell.
I have tried using cumprod straightforwardly, but it's not giving me the correct values, because the cumprod is continuous and does not restart when the NaN shows up.
Here is an example df:
index col1 col2
0 2 4
1 6 4
2 1 na
3 2 7
4 na 6
5 na 8
6 5 na
7 8 9
8 3 2
Here is my desired output:
index col1 col2
0 2 4
1 12 16
2 12 na
3 24 7
4 na 42
5 na 336
6 5 na
7 40 9
8 120 18
Here is a solution that operates on each column and concats back together, since the masks are different for each column.
pd.concat(
    [df[col].groupby(df[col].isnull().cumsum()).cumprod() for col in df.columns], axis=1)
col1 col2
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
A slightly more efficient approach is to calculate the grouper mask all at once and use zip
m = df.isnull().cumsum()
pd.concat(
    [df[col].groupby(mask).cumprod() for col, mask in zip(df.columns, m.values.T)], axis=1)
Here's a similar solution with dict comprehension and the default constructor
pd.DataFrame({c: df[c].groupby(df[c].isna().cumsum()).cumprod() for c in df.columns})
col1 col2
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
You can use groupby with isna and cumsum to get groups to cumprod over in each column, using apply:
df.apply(lambda x: x.groupby(x.isna().cumsum()).cumprod())
Output:
col1 col2
index
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0
Here is a solution without operating column by column:
import numpy as np
import pandas as pd

df = pd.DataFrame([[2,4], [6,4], [1,np.nan], [2,7], [np.nan,6], [np.nan,8], [5,np.nan], [8,9], [3,2]],
                  columns=['col1', 'col2'])
# cumprod skips NaNs, so divide out the running product accumulated before each NaN-delimited block
df_cumprod = df.cumprod()
adjust_factor = df_cumprod.fillna(method='ffill').where(df_cumprod.isnull()).fillna(method='ffill').fillna(1)
print(df_cumprod / adjust_factor)
col1 col2
0 2.0 4.0
1 12.0 16.0
2 12.0 NaN
3 24.0 7.0
4 NaN 42.0
5 NaN 336.0
6 5.0 NaN
7 40.0 9.0
8 120.0 18.0

pandas groupby rolling mean/median with dropping missing values

How can I get a groupby rolling mean/median in pandas that drops missing values? I.e. the output should drop missing values before calculating the mean/median instead of giving me NaN if a missing value is present.
import numpy as np
import pandas as pd

t = pd.DataFrame(data={'date':[0,0,0,0,1,1,1,1,2,2,2,2],
                       'i0':[0,1,2,3,0,1,2,3,0,1,2,3],
                       'i1':['A']*12,
                       'x':[10.,20.,30.,np.nan,np.nan,21.,np.nan,41.,np.nan,np.nan,32.,42.]})
t.set_index(['date','i0','i1'], inplace=True)
t.sort_index(inplace=True)
print(t)
print(t.groupby('date').apply(lambda x: x.rolling(window=2).mean()))
gives
x
date i0 i1
0 0 A 10.0
1 A 20.0
2 A 30.0
3 A NaN
1 0 A NaN
1 A 21.0
2 A NaN
3 A 41.0
2 0 A NaN
1 A NaN
2 A 32.0
3 A 42.0
x
date i0 i1
0 0 A NaN
1 A 15.0
2 A 25.0
3 A NaN
1 0 A NaN
1 A NaN
2 A NaN
3 A NaN
2 0 A NaN
1 A NaN
2 A NaN
3 A 37.0
while I want the following for this example:
x
date i0 i1
0 0 A 10.0
1 A 15.0
2 A 25.0
3 A 30.0
1 0 A NaN
1 A 21.0
2 A 21.0
3 A 41.0
2 0 A NaN
1 A NaN
2 A 32.0
3 A 37.0
What I tried:
t.groupby('date').apply(lambda x: x.rolling(window=2).dropna().median())
and
t.groupby('date').apply(lambda x: x.rolling(window=2).median(dropna=True))
(both raise exceptions, but maybe there exists something along the lines)
Thank you for your help!
You're looking for min_periods? Note that you don't need apply; call GroupBy.rolling directly:
t.groupby('date', group_keys=False).rolling(window=2, min_periods=1).mean()
x
date i0 i1
0 0 A 10.0
1 A 15.0
2 A 25.0
3 A 30.0
1 0 A NaN
1 A 21.0
2 A 21.0
3 A 41.0
2 0 A NaN
1 A NaN
2 A 32.0
3 A 37.0
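The same pattern works for the median the question also asks about, for example:
t.groupby('date', group_keys=False).rolling(window=2, min_periods=1).median()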

Fill null values with values from a column in another dataset

I have 2 datasets like this:
df1.head(6)
category cost
0 1 33.0
1 1 33.0
2 2 18.0
3 1 NaN
4 3 8.0
5 2 NaN
df2.head(2)
cost
3 33.0
5 55.0
df2 contains one column with values at the same indexes where df1 is null.
I would like to get this result:
df1.head(6)
category cost
0 1 33.0
1 1 33.0
2 2 18.0
3 1 33.0
4 3 8.0
5 2 55.0
So fill the cost column in df1 with the values from df2 at the same indexes.
fillna
Pandas aligns by index naturally:
df1['cost'] = df1['cost'].fillna(df2['cost'])
print(df1)
category cost
0 1 33.0
1 1 33.0
2 2 18.0
3 1 33.0
4 3 8.0
5 2 55.0
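An alternative that relies on the same index alignment is combine_first, which fills the NaNs in df1['cost'] from df2['cost']:
df1['cost'] = df1['cost'].combine_first(df2['cost'])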
