How do I join two dataframes based on values in selected columns? - python

I am trying to join (merge) two dataframes based on values in selected columns.
For instance, to merge by values in columns A and B.
So, having df1
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
And df2
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
I want to get a df3 with the following structure:
A B C D E F L
0 4 3 1 5 4 5 1
1 5 7 0 3 3 3 2
2 3 2 1 6 NaN NaN 4
3 3 8 NaN NaN 5 5 5
Can you please help me? I've tried both the merge and join methods but haven't succeeded.

UPDATE: (for updated DFs and new desired DF)
In [286]: merged = pd.merge(df1, df2, on=['A','B'], how='outer', suffixes=('','_y'))
In [287]: merged.L.fillna(merged.pop('L_y'), inplace=True)
In [288]: merged
Out[288]:
A B C D L E F
0 4 3 1.0 5.0 1.0 4.0 5.0
1 5 7 0.0 3.0 2.0 3.0 3.0
2 3 2 1.0 6.0 4.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0 5.0
Data:
In [284]: df1
Out[284]:
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
In [285]: df2
Out[285]:
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
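Note: on recent pandas versions, the chained inplace fillna in In [287] can raise a chained-assignment warning, and under pandas 2.x copy-on-write it may not modify merged at all; assigning the result back is safer. A self-contained sketch of the updated answer, assuming the sample frames above:
import pandas as pd

df1 = pd.DataFrame({'A': [4, 5, 3], 'B': [3, 7, 2],
                    'C': [1, 0, 1], 'D': [5, 3, 6], 'L': [1, 2, 4]})
df2 = pd.DataFrame({'A': [4, 5, 3], 'B': [3, 7, 8],
                    'E': [4, 3, 5], 'F': [5, 3, 5], 'L': [1, 2, 5]})

# An outer merge keeps rows that appear in only one frame (filled with NaN).
merged = pd.merge(df1, df2, on=['A', 'B'], how='outer', suffixes=('', '_y'))
# Coalesce the duplicated L column without inplace=True.
merged['L'] = merged['L'].fillna(merged.pop('L_y'))
print(merged)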
OLD answer:
You can use the pd.merge(..., how='outer') method:
In [193]: pd.merge(a,b, on=['A','B'], how='outer')
Out[193]:
A B C D E F
0 4 3 1.0 5.0 4.0 5.0
1 5 7 0.0 3.0 3.0 3.0
2 3 2 1.0 6.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0
Data:
In [194]: a
Out[194]:
A B C D
0 4 3 1 5
1 5 7 0 3
2 3 2 1 6
In [195]: b
Out[195]:
A B E F
0 4 3 4 5
1 5 7 3 3
2 3 8 5 5
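Side note: DataFrame.join aligns on the index by default, which is likely why a plain join did not work here; the key columns have to be moved into the index first. A minimal sketch of that route, using the frames a and b above (it gives the same outer result, though row order may differ):
# join matches on the (A, B) index levels instead of on columns
a.set_index(['A', 'B']).join(b.set_index(['A', 'B']), how='outer').reset_index()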

Related

Nested loop - alternative solution - Python pandas

I have started learning Python, and I wanted to ask if there is an alternative, faster solution to the nested loop below:
for y in range(total_rows2):
    for x in range(total_rows1):
        if df2.iloc[y, 0] == df1.iloc[x, 0]:
            df1.iloc[x, 1] = df1.iloc[x, 1] + df2.iloc[y, 17]
Basically, I have found the number of rows (total_rows1 and total_rows2) of the two dataframes (df1 and df2). The first column of both dataframes (index=0) corresponds to some IDs.
If the IDs of the two dataframes match, then I want to add the value of column 18 of df2 (index 17, column name='Profit') to the second column of df1 (index=1, column name='Profit'). An ID may appear more than once (please see below for ID 0108), in which case I want the sum to appear. So the resulting 'Profit' column will be the sum of Profit per ID.
df2:
     ID  Profit
0  0104       0
1  0106       0
2  0107       0
3  0108       0
df1:
     ID  Loss  Profit
0  0104   100     230
1  0106   200     150
2  0108   150     120
3  0107   120     230
4  0109   100     400
5  0108   150     400
So I want df2 to look as follows:
     ID  Profit
0  0104     230
1  0106     150
2  0107     230
3  0108     520
Thanks!
I think just merging the two dfs on that first column and then doing the addition would be fine.
frames:
>>> df1
ID B C D
0 e 3 8 1
1 d 5 1 1
2 g 6 5 1
3 e 8 8 7
4 j 9 3 6
5 i 4 0 5
6 g 0 4 1
7 a 3 7 2
8 e 0 6 9
9 b 2 9 6
>>> df2
ID col_17
0 j 9
1 c 3
2 d 6
3 g 4
4 h 4
5 g 5
6 e 1
7 d 8
8 b 0
9 i 6
Merge:
>>> df3 = df1.merge(df2,how='left',on='ID')
>>> df3
ID B C D col_17
0 e 3 8 1 1.0
1 d 5 1 1 6.0
2 d 5 1 1 8.0
3 g 6 5 1 4.0
4 g 6 5 1 5.0
5 e 8 8 7 1.0
6 j 9 3 6 9.0
7 i 4 0 5 6.0
8 g 0 4 1 4.0
9 g 0 4 1 5.0
10 a 3 7 2 NaN
11 e 0 6 9 1.0
12 b 2 9 6 0.0
Add:
>>> df3['B'] = np.where(df3['col_17'].notna(), df3['B'] + df3['col_17'], df3['B'])
>>> df3
ID B C D col_17
0 e 4.0 8 1 1.0
1 d 11.0 1 1 6.0
2 d 13.0 1 1 8.0
3 g 10.0 5 1 4.0
4 g 11.0 5 1 5.0
5 e 9.0 8 7 1.0
6 j 18.0 3 6 9.0
7 i 10.0 0 5 6.0
8 g 4.0 4 1 4.0
9 g 5.0 4 1 5.0
10 a 3.0 7 2 NaN
11 e 1.0 6 9 1.0
12 b 2.0 9 6 0.0
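For the original question (summing df1's Profit per ID), a groupby plus map avoids both the nested loop and the duplicate rows a merge can produce. A minimal sketch, assuming the sample data from the question:
import pandas as pd

df1 = pd.DataFrame({'ID': ['0104', '0106', '0108', '0107', '0109', '0108'],
                    'Loss': [100, 200, 150, 120, 100, 150],
                    'Profit': [230, 150, 120, 230, 400, 400]})
df2 = pd.DataFrame({'ID': ['0104', '0106', '0107', '0108'],
                    'Profit': [0, 0, 0, 0]})

# Sum df1's Profit once per ID, then map those sums onto df2's IDs.
profit_per_id = df1.groupby('ID')['Profit'].sum()
df2['Profit'] = df2['Profit'] + df2['ID'].map(profit_per_id).fillna(0)
print(df2)  # 0104 -> 230, 0106 -> 150, 0107 -> 230, 0108 -> 520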

Python Pandas shift by given value in cell within groupby

Given the following dataframe
df = pd.DataFrame(data={'name': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
                        'lag': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                        'value': range(10)})
print(df)
lag name value
0 1 a 0
1 1 a 1
2 1 a 2
3 2 b 3
4 2 b 4
5 2 b 5
6 2 b 6
7 2 c 7
8 2 c 8
9 2 c 9
I am trying to shift values contained in column value to obtain the column expected_value, which is the shifted values grouped by column name and shifted by lag rows. I was thinking of using something like df['expected_value'] = df.groupby(['name', 'lag']).shift(), but I am not sure how to pass lag to the shift() function.
print(df)
lag name value expected_value
0 1 a 0 nan
1 1 a 1 0.0000
2 1 a 2 1.0000
3 2 b 3 nan
4 2 b 4 nan
5 2 b 5 3.0000
6 2 b 6 4.0000
7 2 c 7 nan
8 2 c 8 nan
9 2 c 9 7.0000
You can use GroupBy.transform here. Because the groups are keyed by the ('name', 'lag') tuple, each group's x.name is that tuple, so x.name[1] is the lag to shift by:
df.assign(expected_value=df.groupby(['name', 'lag'])['value']
                           .transform(lambda x: x.shift(x.name[1])))
name lag value expected_value
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
You can do it with an apply:
df['new_val'] = (df.groupby('name')
                   .apply(lambda x: x['value'].shift(x['lag'].iloc[0]))
                   .reset_index('name', drop=True)
                 )
Output:
name lag value new_val
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0

Add Missing Values To Pandas Groups

Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
"Type": ["a","a","a","a","b","b","c","c","d"],
"Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
For each Type, there needs to be a total of 4 rows that represent one of four quarters as indicated by the Quarter column. So, it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where there are missing values in the Value column, fill the missing values using the next closest available value with the same Type (if it's the last quarter that is missing then fill with the third quarter):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
    df['Type'].unique(),
    range(1, 5)
], names=['Type', 'Quarter'])

df.set_index(['Type', 'Quarter']).reindex(idx) \
  .groupby('Type') \
  .transform(lambda v: v.ffill().bfill()) \
  .reset_index()
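Equivalently, the fill step can be spelled with GroupBy.ffill and GroupBy.bfill instead of a lambda; a sketch assuming the same idx as above:
# ffill then bfill within each Type group, never across groups
out = (df.set_index(['Type', 'Quarter'])
         .reindex(idx)
         .groupby(level='Type').ffill()
         .groupby(level='Type').bfill()
         .reset_index())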
You can use set_index and unstack to create the missing rows you want (assuming each quarter is available in at least one type), then ffill and bfill over the columns, and finally stack and reset_index to go back to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack() \
       .ffill(axis=1).bfill(axis=1) \
       .stack().reset_index()
print (df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0

How to reshape columns into multiple rows in a dataframe?

I have a dataframe like the one below:
A B C D E F G H G H I J K
1 2 3 4 5 6 7 8 9 10 11 12 13
and I want a result like this:
A B C D E F G H
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 9
1 2 3 4 5 6 7 10
1 2 3 4 5 6 7 11
1 2 3 4 5 6 7 12
1 2 3 4 5 6 7 13
In the result, the values of columns 'G'~'K' (the duplicated trailing columns) end up stacked under column 'H'.
How can I do this?
You need to adjust your columns using cummax so that every label from the first 'H' onward becomes 'H'. Then, after melt, we create an additional key with cumcount and reshape. Here I am using unstack; you could use pivot or pivot_table instead:
s = pd.Series(df.columns)
s[(s >= 'H').cummax() == 1] = 'H'
df.columns = s
df = df.melt()
yourdf = df.set_index(['variable', df.groupby('variable').cumcount()]) \
           .value.unstack(0).ffill()
yourdf
variable A B C D E F G H
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 9.0
2 1.0 2.0 3.0 4.0 5.0 6.0 7.0 10.0
3 1.0 2.0 3.0 4.0 5.0 6.0 7.0 11.0
4 1.0 2.0 3.0 4.0 5.0 6.0 7.0 12.0
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0 13.0
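Why the renaming works: string comparison flags 'H' and every later letter, and cummax keeps the flag on from the first 'H' onward, so the trailing duplicate 'G' is caught too. A quick check, assuming the original 13 column labels:
import pandas as pd

s = pd.Series(list('ABCDEFGH') + list('GHIJK'))
print(((s >= 'H').cummax() == 1).tolist())
# -> [False] * 7 + [True] * 6, i.e. the last six labels all become 'H'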
I hope this gives you some help:
import pandas as pd

df = pd.DataFrame([list(range(1, 14))])
df.columns = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'G', 'H', 'I', 'J', 'K')
print('starting data frame:')
print(df)

df1 = df.iloc[:, 0:7]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here.
df1 = pd.concat([df1] * len(df.iloc[:, 7:].columns))
df1.insert(df1.shape[1], 'H', list(df.iloc[:, 7:].values[0]))
print('result:')
print(df1)
import numpy as np
import pandas as pd

letters = list("ABCDEFGHIJKLM")
df = pd.DataFrame([np.arange(1, len(letters) + 1)], columns=letters)
df = pd.concat([df.iloc[:, :7]] * (len(letters) - 7)).assign(H=df[letters[7:]].values[0])
df = df.reset_index(drop=True)
df
gives you
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13
Your data has duplicate column names, so melt will fail. However, you can rename the columns and then apply melt.
In [166]: df
Out[166]:
A B C D E F G H G H I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
The duplicates are in column names 'G' and 'H'. Just change those to 'GG' and 'HH'. Finally, apply melt:
In [167]: df.columns = ('A','B','C','D','E','F','G','H','GG','HH','I','J','K')
In [168]: df
Out[168]:
A B C D E F G H GG HH I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
In [169]: df.melt(id_vars=df.columns.tolist()[0:7], value_name='H').drop(columns='variable')
Out[169]:
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13

Mapping data from one dataframe to another based on grouby

Probably a similar question has been asked before, but I could not find one that solves my problem. Maybe I am not using the proper search words!
I have two pandas Dataframes as below:
import pandas as pd
import numpy as np
df1
a = np.array([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3])
b = np.array([1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3])
df1 = pd.DataFrame({'a':a, 'b':b})
print(df1)
a b
0 1 1
1 1 1
2 1 2
3 1 2
4 1 3
5 1 3
6 2 1
7 2 1
8 2 2
9 2 2
10 2 3
11 2 3
12 3 1
13 3 1
14 3 2
15 3 2
16 3 3
17 3 3
df2 is as below:
a2 = np.array([1,1,1,2,2,2,3,3,3])
b2 = np.array([1,2,3,1,2,3,1,2,3])
c = np.array([4,8,3,np.nan, 2, 5,6, np.nan, 1])
df2 = pd.DataFrame({'a':a2, 'b':b2, 'c': c})
a b c
0 1 1 4.0
1 1 2 8.0
2 1 3 3.0
3 2 1 NaN
4 2 2 2.0
5 2 3 5.0
6 3 1 6.0
7 3 2 NaN
8 3 3 1.0
Now I want to map column c from df2 onto df1, matching on columns a and b (i.e. a with a2 and b with b2). Therefore, df1 is modified as shown below:
a b c
0 1 1 4
1 1 1 4
2 1 2 8
3 1 2 8
4 1 3 3
5 1 3 3
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
How can I achieve this in a simple and intuitive way using pandas?
Quite simple using merge:
df1.merge(df2)
a b c
0 1 1 4.0
1 1 1 4.0
2 1 2 8.0
3 1 2 8.0
4 1 3 3.0
5 1 3 3.0
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
If you have more columns and you want to specifically only merge on a and b, use:
df1.merge(df2, on=['a','b'])
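One caveat worth knowing: with no arguments, merge performs an inner join on all shared columns, so any (a, b) pair of df1 that is missing from df2 would be dropped. To keep every row of df1 regardless (with NaN in c), use a left join; a hedged variant with the same frames:
df1.merge(df2, on=['a', 'b'], how='left')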
