I have a dataframe that contains two different lists of IDs:
df
ID1 ID2
0 0 35
1 0 35
2 1 33
3 2 27
Then I have two dataframes df1 and df2 that contain the coordinates of such IDs.
df1
ID1 x y
0 0 1.3 2.3
1 1 2.5 7.2
3 2 4.5 4.5
df2
ID2 x y
0 27 3.6 4.5
1 33 3.3 2.3
2 35 2.3 2.5
I would like to assign to df the coordinates of ID1 when it appears more than once in df, and the coordinates of ID2 when ID1 appears only once.
In the end I would like something like this:
df
ID1 ID2 x y
0 0 35 1.3 2.3
1 0 35 1.3 2.3
2 1 33 3.3 2.3
3 2 27 3.6 4.5
I think this will do the trick:
df = df.merge(df1).merge(df2, on='ID2', suffixes=['_id1', '_id2'])
mask = df.groupby('ID1').transform('count')['ID2']
df['x'] = np.where(mask > 1, df['x_id1'], df['x_id2'])
df['y'] = np.where(mask > 1, df['y_id1'], df['y_id2'])
df[['ID1','ID2','x','y']]
ID1 ID2 x y
0 0 35 1.3 2.3
1 0 35 1.3 2.3
2 1 33 3.3 2.3
3 2 27 3.6 4.5
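For reference, here is a self-contained sketch of this approach with the sample frames and the imports it needs (np refers to numpy):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID1': [0, 0, 1, 2], 'ID2': [35, 35, 33, 27]})
df1 = pd.DataFrame({'ID1': [0, 1, 2], 'x': [1.3, 2.5, 4.5], 'y': [2.3, 7.2, 4.5]})
df2 = pd.DataFrame({'ID2': [27, 33, 35], 'x': [3.6, 3.3, 2.3], 'y': [4.5, 2.3, 2.5]})

# merge both coordinate frames, keeping the x/y pairs apart with suffixes
out = df.merge(df1).merge(df2, on='ID2', suffixes=['_id1', '_id2'])
# how many times each ID1 occurs in df
mask = out.groupby('ID1')['ID2'].transform('count')
# repeated ID1 -> take the df1 coordinates, otherwise take the df2 coordinates
out['x'] = np.where(mask > 1, out['x_id1'], out['x_id2'])
out['y'] = np.where(mask > 1, out['y_id1'], out['y_id2'])
print(out[['ID1', 'ID2', 'x', 'y']])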
Try this one out:
df3 = (df[df.duplicated(subset='ID1')]).merge(df1, how='left')
df4 = (df.drop_duplicates(subset='ID1')).merge(df2, on='ID2')
df5 = df3.merge(df4, how='outer').drop_duplicates(subset='ID1', keep='first')
df5.reindex(df.index, method='ffill')
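With the sample frames df, df1 and df2 from the sketch above, this second approach gives the same result:
# rows whose ID1 is repeated take their coordinates from df1
df3 = df[df.duplicated(subset='ID1')].merge(df1, how='left')
# one row per ID1 with the coordinates looked up in df2
df4 = df.drop_duplicates(subset='ID1').merge(df2, on='ID2')
# keep the df1-based row whenever both exist for the same ID1
df5 = df3.merge(df4, how='outer').drop_duplicates(subset='ID1', keep='first')
# forward-fill back onto the original index so repeated rows share the same coordinates
print(df5.reindex(df.index, method='ffill'))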
I have two dataframes with the same column id, and for each id I need to apply the following function:
def findConstant(df1, df2):
    c = df1.iloc[[0], df1.eq(df1.iloc[0]).all().to_numpy()].squeeze()
    return pd.concat([df1, df2]).assign(**c).reset_index(drop=True)
What I am doing is the following:
df3 = pd.DataFrame()
for idx in df1['id']:
    tmp1 = df1[df1['id'] == idx]
    tmp2 = df2[df2['id'] == idx]
    tmp3 = findConstant(tmp1, tmp2)
    df3 = pd.concat([df3, tmp3], ignore_index=True)
I would like to know how to avoid a loop like that.
Use:
print (df1)
A B C id val
0 ar 2 8 1 3.2
1 ar 3 7 1 5.6
3 ar1 0 3 2 7.8
4 ar1 4 3 2 9.2
5 ar1 5 3 2 3.4
print (df2)
id val
0 1 3.3
1 2 6.4
# get the number of unique values and the first value per column within each id
df3 = df1.groupby('id').agg(['nunique','first'])
# mask of columns that are constant within the group (nunique equals 1)
m = df3.xs('nunique', axis=1, level=1).eq(1)
# keep the constant values and fill the non-matching ones from the original df2
df = df3.xs('first', axis=1, level=1).where(m).combine_first(df2.set_index('id'))
print (df)
A B C val
id
1 ar NaN NaN 3.3
2 ar1 NaN 3.0 6.4
#join together
df = pd.concat([df1, df.reset_index()], ignore_index=True)
print (df)
A B C id val
0 ar 2.0 8.0 1 3.2
1 ar 3.0 7.0 1 5.6
2 ar1 0.0 3.0 2 7.8
3 ar1 4.0 3.0 2 9.2
4 ar1 5.0 3.0 2 3.4
5 ar NaN NaN 1 3.3
6 ar1 NaN 3.0 2 6.4
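For reference, a self-contained sketch that rebuilds the sample frames shown above and runs the same steps end to end:
import pandas as pd

df1 = pd.DataFrame({'A': ['ar', 'ar', 'ar1', 'ar1', 'ar1'],
                    'B': [2, 3, 0, 4, 5],
                    'C': [8, 7, 3, 3, 3],
                    'id': [1, 1, 2, 2, 2],
                    'val': [3.2, 5.6, 7.8, 9.2, 3.4]})
df2 = pd.DataFrame({'id': [1, 2], 'val': [3.3, 6.4]})

df3 = df1.groupby('id').agg(['nunique', 'first'])
m = df3.xs('nunique', axis=1, level=1).eq(1)           # True where a column is constant within the id
const = df3.xs('first', axis=1, level=1).where(m)      # keep only those constant values
filled = const.combine_first(df2.set_index('id'))      # everything else (here: val) comes from df2
out = pd.concat([df1, filled.reset_index()], ignore_index=True)
print(out)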
I currently have code which turns this:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 40 3 150 5
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
Into this:
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
And here you can check the code:
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 50 1 131 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep=r"\s+")
print(df1)
df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
])
.rename(columns={"A":"ID","B":"User"})
.set_index(["ID","User","key"])
.unstack(2)
.reset_index()
)
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
The program works correctly if the key columns have unique values, but it fails if there are duplicate values. The issue I have is that my actual dataframe has rows with 30 columns, others with 60, others with 63, etc. So the program is detecting empty values as duplicates and fails.
Please check this example:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 NaN NaN NaN NaN
1 2.2.2 erto 50 7 40 8 150.0 8.0 131.0 2.0
2 3.3.3 gema 131 2 150 5 40.0 1.0 50.0 3.0
And I would like to get something like this:
ID User 40 50 131 150
0 1.1.1 amba 1 4
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
If I try to unstack this, I get the error "Index contains duplicate entries, cannot reshape". I have been reading about this and df.drop_duplicates, pivot_table, etc. could help in this situation, but I cannot make any of them work with my current code. Any idea about how to fix this? Thanks.
The idea is to convert the first 2 columns to a MultiIndex, then use concat on the even- and odd-positioned columns selected with DataFrame.iloc, reshape with DataFrame.stack, and remove the unnecessary third MultiIndex level with DataFrame.reset_index:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
axis=1, keys=('key','val'))
Finally, add the key column to the MultiIndex with DataFrame.set_index, reshape with Series.unstack, convert the MultiIndex back to columns with reset_index, rename the columns, and remove the columns level name with DataFrame.rename_axis:
df = (df.set_index('key', append=True)['val']
.unstack()
.reset_index()
.rename(columns={"A":"ID","B":"User"})
.rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
It also works well for the second example, because the missing rows are removed by stack; rename is added to convert the column names to int where possible:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
axis=1, keys=('key','val'))
print (df)
key val
A B
1.1.1 amba 50.0 1.0
amba 131.0 4.0
2.2.2 erto 50.0 7.0
erto 40.0 8.0
erto 150.0 8.0
erto 131.0 2.0
3.3.3 gema 131.0 2.0
gema 150.0 5.0
gema 40.0 1.0
gema 50.0 3.0
df = (df.set_index('key', append=True)['val']
.unstack()
.rename(columns=int)
.reset_index()
.rename(columns={"A":"ID","B":"User"})
.rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba NaN 1.0 4.0 NaN
1 2.2.2 erto 8.0 7.0 2.0 8.0
2 3.3.3 gema 1.0 3.0 2.0 5.0
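For completeness, a self-contained version of that second example, rebuilt from the frame printed in the question:
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""A B C D E F G H I J
1.1.1 amba 50 1 131 4 NaN NaN NaN NaN
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep=r"\s+")

df = df.set_index(['A', 'B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key', 'val'))
df = (df.set_index('key', append=True)['val']
        .unstack()
        .rename(columns=int)
        .reset_index()
        .rename(columns={'A': 'ID', 'B': 'User'})
        .rename_axis(None, axis=1))
print(df)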
EDIT1: Added a helper column with a counter to avoid duplicates:
print (df)
A B C D E F G H I J
0 1.1.1 amba 50 1 50 4 40 3 150 5 <- E=50
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
axis=1, keys=('key','val'))
df['g'] = df.groupby(['A','B','key']).cumcount()
print (df)
key val g
A B
1.1.1 amba 50 1 0
amba 50 4 1
amba 40 3 0
amba 150 5 0
2.2.2 erto 50 7 0
erto 40 8 0
erto 150 8 0
erto 131 2 0
3.3.3 gema 131 2 0
gema 150 5 0
gema 40 1 0
gema 50 3 0
df = (df.set_index(['g','key'], append=True)['val']
.unstack()
.reset_index()
.rename(columns={"A":"ID","B":"User"})
.rename_axis(None, axis=1))
print (df)
ID User g 40 50 131 150
0 1.1.1 amba 0 3.0 1.0 NaN 5.0
1 1.1.1 amba 1 NaN 4.0 NaN NaN
2 2.2.2 erto 0 8.0 7.0 2.0 8.0
3 3.3.3 gema 0 1.0 3.0 2.0 5.0
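The question also mentions pivot_table; if duplicate keys should simply be collapsed to their first value instead of being kept on separate rows, one possible sketch (not part of the steps above) is:
import io
import pandas as pd

# sample with a duplicate key (E=50 in the first row), as in the EDIT1 example
df = pd.read_csv(io.StringIO("""A B C D E F G H I J
1.1.1 amba 50 1 50 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep=r"\s+")

df = df.set_index(['A', 'B'])
long = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                  df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
                 axis=1, keys=('key', 'val'))
# aggfunc='first' keeps the first value for each (ID, User, key) combination
out = (long.reset_index()
           .pivot_table(index=['A', 'B'], columns='key', values='val', aggfunc='first')
           .reset_index()
           .rename(columns={'A': 'ID', 'B': 'User'})
           .rename_axis(None, axis=1))
print(out)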
What you are trying to do seems too complex. May I suggest a simpler solution which just converts each row to a dictionary of the desired result and then binds them back together:
pd.DataFrame(list(map(lambda row: {'ID':row['A'], 'User':row['B'], row['C']:row['D'],
                                   row['E']:row['F'], row['G']:row['H'], row['I']:row['J']},
                      df1.to_dict('records'))))
I want all groups to have the same number of rows, i.e. either by removing the last rows or by adding rows of zeros if the group is too small.
d = {'ID':['a12', 'a12','a12','a12','a12','b33','b33','b33','b33','v55','v55','v55','v55','v55','v55'], 'Exp_A':[2.2,2.2,2.2,2.2,2.2,3.1,3.1,3.1,3.1,1.5,1.5,1.5,1.5,1.5,1.5],
'Exp_B':[2.4,2.4,2.4,2.4,2.4,1.2,1.2,1.2,1.2,1.5,1.5,1.5,1.5,1.5,1.5],
'A':[0,0,1,0,1,0,1,0,1,0,1,1,1,0,1], 'B':[0,0,1,1,1,0,0,1,1,1,0,0,1,0,1]}
df1 = pd.DataFrame(data=d)
I want every df1.ID group to have size df1.groupby('ID').size().mean().
So df1 should look like:
A B Exp_A Exp_B ID
0 0 0 2.2 2.4 a12
1 0 0 2.2 2.4 a12
2 1 1 2.2 2.4 a12
3 0 1 2.2 2.4 a12
4 1 1 2.2 2.4 a12
5 0 0 3.1 1.2 b33
6 1 0 3.1 1.2 b33
7 0 1 3.1 1.2 b33
8 1 1 3.1 1.2 b33
9 0 0 3.1 1.2 b33
10 0 1 1.5 1.5 v55
11 1 0 1.5 1.5 v55
12 1 0 1.5 1.5 v55
13 1 1 1.5 1.5 v55
14 0 0 1.5 1.5 v55
Here's one solution using GroupBy. The complication arises from your condition to add extra rows with certain columns set to 0 whenever a particular group is too small.
g = df1.groupby('ID')
n = int(g.size().mean())
res = []
for _, df in g:
    k = len(df.index)
    excess = n - k
    if excess > 0:
        df = df.append(pd.concat([df.iloc[[-1]].assign(A=0, B=0)]*excess))
    res.append(df.iloc[:n])
res = pd.concat(res, ignore_index=True)
print(res)
A B Exp_A Exp_B ID
0 0 0 2.2 2.4 a12
1 0 0 2.2 2.4 a12
2 1 1 2.2 2.4 a12
3 0 1 2.2 2.4 a12
4 1 1 2.2 2.4 a12
5 0 0 3.1 1.2 b33
6 1 0 3.1 1.2 b33
7 0 1 3.1 1.2 b33
8 1 1 3.1 1.2 b33
9 0 0 3.1 1.2 b33
10 0 1 1.5 1.5 v55
11 1 0 1.5 1.5 v55
12 1 0 1.5 1.5 v55
13 1 1 1.5 1.5 v55
14 0 0 1.5 1.5 v55
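Note that DataFrame.append was removed in pandas 2.0; a sketch of the same loop using pd.concat instead, assuming the df1 built in the question:
import pandas as pd

g = df1.groupby('ID')
n = int(g.size().mean())

pieces = []
for _, grp in g:
    excess = n - len(grp)
    if excess > 0:
        # pad short groups by repeating the last row with A and B zeroed out
        pad = pd.concat([grp.iloc[[-1]].assign(A=0, B=0)] * excess)
        grp = pd.concat([grp, pad])
    pieces.append(grp.iloc[:n])    # trim groups that are too long

res = pd.concat(pieces, ignore_index=True)
print(res)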
Here is a solution without looping. You can first determine the required number of rows for each ID and then adjust each group accordingly.
# Getting the minimum required number of rows for each ID
min_req = df.groupby('ID').size().mean()
# Adding auto-increment column with respect to ID column
df['row_count'] = df.groupby(['ID']).cumcount()+1
# Adding excess rows equal to required rows
# we will delete unneeded ones later
df2 = df.groupby('ID', as_index=False).max()
df2 = df2.loc[df2['row_count']<int(min_req)]
df2 = df2.assign(A=0, B=0)
df = df.append([df2]*int(min_req), ignore_index=True)
# recalculating the count
df = df.drop('row_count', axis=1)
df = df.sort_values(by=['ID', 'A', 'B'], ascending=[True, False, False])
df['row_count'] = df.groupby(['ID']).cumcount()+1
# Dropping excess rows
df = df.drop((df.loc[df['row_count'] > int(min_req)]).index)
df = df.drop('row_count', axis=1)
df
A B Exp_A Exp_B ID
0 0 0 2.2 2.4 a12
1 0 0 2.2 2.4 a12
2 1 1 2.2 2.4 a12
3 0 1 2.2 2.4 a12
4 1 1 2.2 2.4 a12
17 0 0 3.1 1.2 b33
16 0 0 3.1 1.2 b33
15 0 0 3.1 1.2 b33
18 0 0 3.1 1.2 b33
19 0 0 3.1 1.2 b33
10 1 0 1.5 1.5 v55
11 1 0 1.5 1.5 v55
12 1 1 1.5 1.5 v55
13 0 0 1.5 1.5 v55
14 1 1 1.5 1.5 v55
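As above, on pandas 2.0 and later the df.append(...) step can be replaced with pd.concat, for example:
# pandas >= 2.0 replacement for the append step above
df = pd.concat([df] + [df2] * int(min_req), ignore_index=True)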
Let's say I have a data frame like this:
ID,Time1,Value1,Time2,Value2,Time3,Value3
1,2,1.1,3,1.2,4,1.3
1,5,2.1,6,2.2,7,2.3
And the expected dataframe is this
ID,Time,Value
1,2,1.1
1,3,1.2
1,4,1.3
1,5,2.1
1,6,2.2
1,7,2.3
If each row had a unique ID, pd.wide_to_long would work perfectly in such a case:
df = pd.wide_to_long(df, ['Time', 'Value'], 'ID', 'value', sep='', suffix='.+')\
    .reset_index()\
    .sort_values(['ID', 'Time'])\
    .drop('value', axis=1)\
    .dropna(how='any')
But how can I fix it in a situation where the row IDs are not unique?
The trick is to use reset_index to create a column of unique values:
df = (pd.wide_to_long(df.reset_index(), ['Time','Value'],i='index',j='value')
.reset_index(drop=True)
.sort_values(['ID', 'Time'])
.dropna(how='any')
)
print (df)
ID Time Value
0 1 2 1.1
2 1 3 1.2
4 1 4 1.3
1 1 5 2.1
3 1 6 2.2
5 1 7 2.3
Detail:
print (pd.wide_to_long(df.reset_index(), ['Time','Value'],i='index',j='value'))
ID Time Value
index value
0 1 1 2 1.1
1 1 1 5 2.1
0 2 1 3 1.2
1 2 1 6 2.2
0 3 1 4 1.3
1 3 1 7 2.3
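For reference, a self-contained sketch with the sample frame from the question and the needed imports:
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""ID,Time1,Value1,Time2,Value2,Time3,Value3
1,2,1.1,3,1.2,4,1.3
1,5,2.1,6,2.2,7,2.3"""))

out = (pd.wide_to_long(df.reset_index(), ['Time', 'Value'], i='index', j='value')
         .reset_index(drop=True)
         .sort_values(['ID', 'Time'])
         .dropna(how='any'))
print(out)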
I have two data frames, df1 and df2.
df1 has the following data (N rows):
Time(s) sv-01 sv-02 sv-03 Val1 val2 val3
1339.4 1 4 12 1.6 0.6 1.3
1340.4 1 12 4 -0.5 0.5 1.4
1341.4 1 6 8 0.4 5 1.6
1342.4 2 5 14 1.2 3.9 11
...... ..... .... ... ..
df2 has the following data, and it has more rows than df1:
Time(msec) channel svid value-1 value-2 valu-03
1000 1 2 0 5 1
1000 2 5 1 4 2
1000 3 2 3 4 7
..... .....................................
1339400 1 1 1.6 0.4 5.3
1339400 2 12 0.5 1.8 -4.4
1339400 3 4 -0.20 1.6 -7.9
1340400 1 1 0.3 0.3 1.5
1340400 2 6 2.3 -4.3 1.0
1340400 3 4 2.0 1.1 -0.45
1341400 1 1 2 2.1 0
1341400 2 8 3.4 -0.3 1
1341400 3 6 0 4.1 2.3
.... .... .. ... ... ...
What I am trying to achieve is:
1. First, multiply the Time(s) column by 1000 so that it matches the df2 millisecond column.
2. In df1, sv-01, sv-02 and sv-03 are in separate columns, but in df2 those svs appear in a single column under svid.
So the goal is: when a (converted) time of df1 matches a time of df2, copy the next three consecutive lines, i.e. copy all matched lines for that time instant.
Basically I want to look up each df1 time in the df2 time column and, if there is a match, copy the next three rows into a new dataframe.
I have seen examples using the pandas merge function, but in my case the two frames have different headers.
Thanks.
I think you need double boolean indexing: first filter df2 with isin, using mul for the multiplication by 1000.
Then count the values per group with cumcount and keep the first 3:
df = df2[df2['Time(msec)'].isin(df1['Time(s)'].mul(1000))]
df = df[df.groupby('Time(msec)').cumcount() < 3]
print (df)
Time(msec) channel svid value-1 value-2 valu-03
3 1339400 1 1 1.6 0.4 5.30
4 1339400 2 12 0.5 1.8 -4.40
5 1339400 3 4 -0.2 1.6 -7.90
6 1340400 1 1 0.3 0.3 1.50
7 1340400 2 6 2.3 -4.3 1.00
8 1340400 3 4 2.0 1.1 -0.45
9 1341400 1 1 2.0 2.1 0.00
10 1341400 2 8 3.4 -0.3 1.00
11 1341400 3 6 0.0 4.1 2.30
Detail:
print (df.groupby('Time(msec)').cumcount())
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
dtype: int64
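The question also mentions merge; a possible sketch of that route, where renaming the key column bridges the different header names and round() guards against floating-point error when converting seconds to milliseconds (assumes df1 and df2 as described in the question):
import pandas as pd

keys = (df1['Time(s)'].mul(1000).round().astype('int64')
           .to_frame('Time(msec)')
           .drop_duplicates())
matched = keys.merge(df2, on='Time(msec)', how='inner')
# keep the first three rows per matched time instant
matched = matched[matched.groupby('Time(msec)').cumcount() < 3]
print(matched)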