How to merge DataFrames based on on column while adding another - python

I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
which would result in the 'Quantity' of DF1 will result in 13, 4 and 10.
Thx in advance for any help provided!

Question is not super clear but if I get what you're trying to do here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0

Related

Extract corresponding df value with reference from another df

There are 2 dataframes with 1 to 1 correspondence. I can retrieve an idxmax from all columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df=df1.loc[df1.iloc[:,1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref max value corresponding value
12 78 180
10 76 90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
.merge(df2.rename(columns={'value1_pair':'value1',
'value2_pair':'value2'})
.melt('ref'),
on=['ref', 'variable'])
.sort_values('value_x')
.groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframe df1 and df2. Where i need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step where am having issues is to map the df3 to df1 and conduct a conditional check to see if df1['amount'] is greater then df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
Solution should be simplify with remove .reset_index() for df3 and pass Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
Alternative with casting boolean mask to integer for True, False to 1,0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forget add quote for sum_column
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)

How change values in dataframe based on another in pandas?

Hi,
i have two dataframes and i want to change values in first dataframe where have same IDs in both dataframes,
suppose i have:
df1 = [ID price
1 200
4 300
5 120
7 230
8 110
9 90
12 180]
and
df2 = [ID price count
3 340 27
4 60 10
5 290 2]
after replace:
df1 = [ID price
1 200
4 60
5 290
7 230
8 110
9 90
12 180]
my first try:
df1.loc[df1.ID.isin(df2.ID),['price']] = df2.loc[df2.ID.isin(df1.ID),['price']].values
but it isn't correct.
Assuming ID is the index (or can be set as the index), then you can just update:
In []:
df1.update(df2)
df1
Out[]:
price
ID
1 200.0
4 60.0
5 290.0
7 230.0
8 110.0
9 90.0
12 180.0
If you need to set_index():
df = df1.set_index('ID')
df.update(df2.set_index('ID'))
df1 = df.reset_index()

Python dataframe: pivot on same column

I have two columns "ID" and "division" as shown below.
df = pd.DataFrame(np.array([['111', 'AAA'],['222','AAA'],['333','BBB'],['444','CCC'],['444','AAA'],['222','BBB'],['111','BBB']]),columns=['ID','division'])
ID division
0 111 AAA
1 222 AAA
2 333 BBB
3 444 CCC
4 444 AAA
5 222 BBB
6 111 BBB
The expected output is as shown below where I need to pivot on the same column but the count is dependent on "division". This should be presented in a heatmap.
df = pd.DataFrame(np.array([['0','2','1','1'],['2','0','1','1'],['1','1','0','0'],['1','1','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 2 1 1
222 2 0 1 1
333 1 1 0 0
444 1 1 0 0
So, technically I am doing an overlap between ID's with respect to division.
Example:
The highlighted box in red where the overlap between 111 and 222 ID's is 2(AAA and BBB). where as the overlap between 111 and 444 is 1 (AAA highlighted in the black box).
I could do this in excel in 2 steps.Not sure if below one helps.
Step1:=SUM(COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,$G2),COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,H$1))-1
Step2:=IF($G12=H$1,0,SUMIFS(H$2:H$8,$G$2:$G$8,$G12))
But is there any way that we can do it in Python using dataframes.
Appreciate your help
Case-2
if df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
ID division count
0 111 AAA 4
1 222 AAA 5
2 333 BBB 6
3 444 CCC 3
4 444 AAA 2
5 222 BBB 2
6 111 BBB 7
Expected output would be
df_result = pd.DataFrame(np.array([['0','18','13','6'],['18','0','8','7'],['13','8','0','0'],['6','7','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 18 13 6
222 18 0 8 7
333 13 8 0 0
444 6 7 0 0
Calculation: Here there is an overlap between 111 and 222 with respect to divisions AAA and BBB hence the sum would be 4+5+2+7=18
Another way to do this is to use a self join with merge and pd.crosstab:
df_out = df.merge(df, on='division')
results = pd.crosstab(df_out.ID_x, df_out.ID_y)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 2.0 1.0 1.0
222 2.0 0.0 1.0 1.0
333 1.0 1.0 0.0 0.0
444 1.0 1.0 0.0 0.0
Case 2
df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
df['count'] = df['count'].astype(int)
df_out = df.merge(df, on='division')
df_out = df_out.assign(count = df_out.count_x + df_out.count_y)
results = pd.crosstab(df_out.ID_x, df_out.ID_y, df_out['count'], aggfunc='sum').fillna(0)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 18.0 13.0 6.0
222 18.0 0.0 8.0 7.0
333 13.0 8.0 0.0 0.0
444 6.0 7.0 0.0 0.0

Pandas: group some data

I have dataframe
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456
And I want to get
id date count
0 123 10-12-2015 0
1 123 11-12-2015 0
2 123 12-12-2015 1
3 123 13-12-2015 1
4 123 14-12-2015 0
5 123 15-12-2015 1
6 123 16-12-2015 1
7 123 17-12-2015 0
8 123 18-12-2015 1
9 456 10-12-2015 1
10 456 11-12-2015 0
11 456 12-12-2015 0
12 456 13-12-2015 1
13 456 14-12-2015 0
14 456 15-12-2015 1
I try before
df = df.groupby('id').resample('D').size().reset_index(name='val')
But it search date between existing to every id. How can I do it to some period?
You can achieve what you want by reindexing in the aggregation of each group and filling NaNs with 0.
import io
import pandas as pd
data = io.StringIO("""\
date id
0 12-12-2015 123
1 13-12-2015 123
2 15-12-2015 123
3 16-12-2015 123
4 18-12-2015 123
5 10-12-2015 456
6 13-12-2015 456
7 15-12-2015 456""")
df = pd.read_csv(data, delim_whitespace=True)
df['date'] = pd.to_datetime(df['date'], format="%d-%m-%Y")
startdate = df['date'].min()
enddate = df['date'].max()
alldates = pd.date_range(startdate, enddate, freq='D', name='date')
def process_id(g):
return g.resample('D').size().reindex(alldates).fillna(0)
output = (df.set_index('date')
.groupby('id')
.apply(process_id)
.stack()
.rename('val')
.reset_index('id'))
print(output)
# id val
# date
# 2015-12-10 123 0.0
# 2015-12-11 123 0.0
# 2015-12-12 123 1.0
# 2015-12-13 123 1.0
# 2015-12-14 123 0.0
# 2015-12-15 123 1.0
# 2015-12-16 123 1.0
# 2015-12-17 123 0.0
# 2015-12-18 123 1.0
# 2015-12-10 456 1.0
# 2015-12-11 456 0.0
# 2015-12-12 456 0.0
# 2015-12-13 456 1.0
# 2015-12-14 456 0.0
# 2015-12-15 456 1.0
# 2015-12-16 456 0.0
# 2015-12-17 456 0.0
# 2015-12-18 456 0.0

Categories