Pandas: Adding two dataframes on the common columns - python

I have two tables with the same columns. I want to add the numbers where the key matches, and where it doesn't, carry the row over as is into the output df. I tried combine_first, merge, concat and join. They all create two separate columns for t1 and t2, but it's the same key, so the values should end up together. I know this is probably something very basic. Could someone please help? Thanks very much!
df1:
t1 a b
0 USD 2,877 -2,418
1 CNH 600 -593
2 AUD 756 -106
3 JPY 113 -173
4 XAG 8 0
df2:
t2 a b
0 CNH 64 -44
1 USD 756 -774
2 JPY 1,127 -2,574
3 TWO 56 -58
4 TWD 38 -231
Output:
t a b
USD 3,633 -3,192
CNH 664 -637
AUD 756 -106
JPY 1,240 -2,747
XAG 8 0
TWO 56 -58
TWD 38 -231

First call set_index on the first column of each DataFrame, then use add with the parameter fill_value=0:
print (df1.set_index('t1').add(df2.set_index('t2'), fill_value=0)
.reset_index()
.rename(columns={'index':'t'}))
t a b
0 AUD 756.0 -106.0
1 CNH 664.0 -637.0
2 JPY 1240.0 -2747.0
3 TWD 38.0 -231.0
4 TWO 56.0 -58.0
5 USD 3633.0 -3192.0
6 XAG 8.0 0.0
If you need to convert the output to int:
print (df1.set_index('t1').add(df2.set_index('t2'), fill_value=0)
.astype(int)
.reset_index()
.rename(columns={'index':'t'}))
t a b
0 AUD 756 -106
1 CNH 664 -637
2 JPY 1240 -2747
3 TWD 38 -231
4 TWO 56 -58
5 USD 3633 -3192
6 XAG 8 0
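The values in the question are printed with thousands separators ("2,877"), so the columns may have been read as strings rather than numbers. That is an assumption, as the question does not show the dtypes. A minimal sketch under that assumption, with the frames rebuilt by hand from the question and a hypothetical helper to_int:

```python
import pandas as pd

# Frames rebuilt from the question, assuming the numbers arrived as strings.
df1 = pd.DataFrame({'t1': ['USD', 'CNH', 'AUD', 'JPY', 'XAG'],
                    'a': ['2,877', '600', '756', '113', '8'],
                    'b': ['-2,418', '-593', '-106', '-173', '0']})
df2 = pd.DataFrame({'t2': ['CNH', 'USD', 'JPY', 'TWO', 'TWD'],
                    'a': ['64', '756', '1,127', '56', '38'],
                    'b': ['-44', '-774', '-2,574', '-58', '-231']})

def to_int(frame, key):
    """Index by `key` and turn '2,877'-style strings into integers."""
    out = frame.set_index(key)
    return out.apply(lambda col: col.str.replace(',', '', regex=False).astype(int))

total = (to_int(df1, 't1')
         .add(to_int(df2, 't2'), fill_value=0)  # keys missing on one side count as 0
         .astype(int)                           # add() returns floats after filling
         .rename_axis('t')
         .reset_index())
print(total)
```

If the columns are already numeric, the to_int step is unnecessary and the answer's code can be used as is.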

Related

TypeError: 'Series' objects are mutable

I am trying to compute a 756-day beta for stocks: the covariance of each stock's returns with an index, divided by the variance of the index.
When I run it on a DataFrame with a single stock, without the groupby, it runs and creates the Beta column. I need to do it for every stock in my DataFrame, though, and groupby was a solution I read about here on Stack Overflow, but I have not been able to get it to work.
This is the line that causes the error
df2['Beta 756d'] = df2.groupby('CODNEG').apply(df2['Retorno_Ação'].rolling(window=756,center=False).cov(df2['Retorno_Ibov']) / df2['Retorno_Ibov'].rolling(window=756,center=False).var())
This error comes up when the code gets to the line above
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Here is an example of df2
TIPREG DATE CODBDI CODNEG TPMERC NOMRES ESPECI PRAZOT MODREF PREABE PREMAX PREMIN PREMED PREULT PREOFC PREOFV TOTNEG QUATOT VOLTOT PREEXE INDOPC DATVEN FATCOT PTOEXE CODISI DISMES Retorno_Ação Retorno_Ibov
0 1 1995-01-02 2 ACE 3 10 ACESITA ON *INT R$ 6300 6300 6300 6300 6300 6300 6500 1 200000 1260000 0 0 99991231 1000 0 ACESACON 119 NaN NaN
105 1 1995-01-02 2 PET 3 10 PETROBRAS ON * R$ 6400 6400 6250 6287 6250 6250 6750 2 40000 251500 0 0 99991231 1000 0 PETRACON 132 NaN NaN
106 1 1995-01-02 2 PET 4 10 PETROBRAS PN * R$ 10700 11000 10400 10599 10500 10500 10650 234 31210000 330805170 0 0 99991231 1000 0 PETRACPN 133 NaN NaN
107 1 1995-01-02 2 BRD 4 10 PETROBRAS BR PN * R$ 4600 4600 4540 4591 4540 4333 4500 13 18200000 83566000 0 0 99991231 1000 0 BRDTACPN 102 NaN NaN
108 1 1995-01-02 2 PTN 4 10 PETTENATI PN * R$ 5189 5189 5189 5189 5189 4700 5280 1 5000000 25945000 0 0 99991231 1000 0 BRPTNTACNPR3 21 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1826575 1 2020-08-19 2 HGTX3 10 CIA HERING ON NM R$ 1576 1598 1547 1575 1575 1574 1575 14582 4640200 7309217700 0 0 99991231 1 0 BRHGTXACNOR9 151 0.005105 0.000788
1826576 1 2020-08-19 2 HOME34 10 HOME DEPOT DRN R$ 77798 78612 77798 78413 78612 55000 0 2 1720 134870760 0 0 99991231 1 0 BRHOMEBDR002 133 0.005719 0.000788
1826577 1 2020-08-19 2 HONB34 10 HONEYWELL DRN ED R$ 86347 86347 86347 86347 86347 0 0 1 100 8634700 0 0 99991231 1 0 BRHONBBDR006 127 0.085375 0.000788
1826579 1 2020-08-19 2 HYPE3 10 HYPERA ON NM R$ 3248 3265 3167 3199 3175 3175 3177 15332 3082700 9864018200 0 0 99991231 1 0 BRHYPEACNOR0 122 -0.034367 0.000788
1826517 1 2020-08-19 2 EMAE4 10 EMAE PN R$ 2869 2954 2831 2888 2952 2832 2953 10 1100 3177600 0 0 99991231 1 0 BREMAEACNPR1 114 0.040903 0.000788
I am using the last two columns (Retorno_Ação, Retorno_Ibov) to calculate the Cov and the Var to generate the Beta.
Can anyone tell me what is causing the error?
This line works fine when the DF has only one stock:
df2['Beta 756d'] = df2['Retorno_Ação'].rolling(window=756,center=False).cov(df2['Retorno_Ibov']) / df2['Retorno_Ibov'].rolling(window=756,center=False).var()
The error happens when I use df2['Beta 756d'] = df2.groupby('CODNEG').apply(
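A likely cause, offered as an assumption since the full traceback is not shown: groupby(...).apply expects a callable that is run once per group, whereas the failing line passes an already computed Series, which pandas then tries to hash. The sketch below computes the rolling beta separately for each CODNEG on made-up toy data. The helper rolling_beta, the toy frame and the 3-row window are hypothetical, and the rows are assumed to be in chronological order within each stock.

```python
import numpy as np
import pandas as pd

def rolling_beta(group, window):
    """Rolling cov(stock, index) / var(index) for the rows of one stock."""
    cov = group['Retorno_Ação'].rolling(window=window).cov(group['Retorno_Ibov'])
    var = group['Retorno_Ibov'].rolling(window=window).var()
    return cov / var

# Toy data (hypothetical): each stock's return is an exact multiple of the
# index return, so its beta should come out as that multiple.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'CODNEG': ['AAA'] * 6 + ['BBB'] * 6,
                    'Retorno_Ibov': rng.normal(size=12)})
multiple = toy['CODNEG'].map({'AAA': 2.0, 'BBB': 0.5})
toy['Retorno_Ação'] = multiple * toy['Retorno_Ibov']

window = 3  # the question uses 756; 3 keeps the toy example small
parts = [rolling_beta(group, window) for _, group in toy.groupby('CODNEG')]
toy['Beta'] = pd.concat(parts)  # assignment aligns on the original row index
print(toy)
```

Looping over the groups and concatenating avoids depending on how a particular pandas version labels the result of groupby.apply. The first window - 1 rows of every stock are NaN because the window is not yet full.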

How to subtract values in one dataframe from the other based on multiple columns?

I have two pandas dataframes.
import pandas as pd
data1 = {'id1': [625,625,725,625,725,1130,625,1130,725,1130],
'id2': ['AF','AG','AF','AP','AP','BM','BA','BC','BM','AF'],
'Total': [75,68,33,77,42,25,113,80,72,36]}
df1 = pd.DataFrame(data1, columns = ['id1','id2','Total'])
data2 = {'id1': [625,725,625,625,1130,1130,625,725,1130,625],
'id2': ['AF','AF','AG','AP','AF','AG','BA','BA','BM','BM'],
'Part1': [5,8,3,4,2,6,1,2,6,3]}
df2 = pd.DataFrame(data2, columns = ['id1','id2','Part1'])
And I get these two data frames.
df1 id1 id2 Total
0 625 AF 75
1 625 AG 68
2 725 AF 33
3 625 AP 77
4 725 AP 42
5 1130 BM 25
6 625 BA 113
7 1130 BC 80
8 725 BM 72
9 1130 AF 36
df2 id1 id2 Part1
0 625 AF 5
1 725 AF 8
2 625 AG 3
3 625 AP 4
4 1130 AF 2
5 1130 AG 6
6 625 BA 1
7 725 BA 2
8 1130 BM 6
9 625 BM 3
What I want is to create a third dataframe that preserves each unique combination of id1 and id2 while subtracting the values in column 'Part1' of df2 from 'Total' in df1, given that each id1 and id2 combination appears only once in either of the dataframes.
For example:
The combination of '625' and 'AF' gives a value of 75 in df1, and 5 in df2. What I want is to create a third dataframe where a row would have '625', 'AF', and '70' in three columns.
If one combination appears in df1 but not the df2, we treat it as if it exists in df2 but the value is 0, and vice versa.
Not sure if I described it sufficiently.
Convert the id1 and id2 columns to a MultiIndex with set_index, then use Series.sub with fill_value=0, so that the subtraction is aligned on those two columns:
df = (df1.set_index(['id1','id2'])['Total']
.sub(df2.set_index(['id1','id2'])['Part1'], fill_value=0)
.reset_index(name='new'))
print (df)
id1 id2 new
0 625 AF 70.0
1 625 AG 65.0
2 625 AP 73.0
3 625 BA 112.0
4 625 BM -3.0
5 725 AF 25.0
6 725 AP 42.0
7 725 BA -2.0
8 725 BM 72.0
9 1130 AF 34.0
10 1130 AG -6.0
11 1130 BC 80.0
12 1130 BM 19.0
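The subtraction relies on the condition stated in the question, that each (id1, id2) pair appears only once per frame. A minimal sketch that checks this before subtracting and casts the result back to integers; the explicit check and the names keys and diff are additions for illustration, not part of the answer above:

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [625, 625, 725, 625, 725, 1130, 625, 1130, 725, 1130],
                    'id2': ['AF', 'AG', 'AF', 'AP', 'AP', 'BM', 'BA', 'BC', 'BM', 'AF'],
                    'Total': [75, 68, 33, 77, 42, 25, 113, 80, 72, 36]})
df2 = pd.DataFrame({'id1': [625, 725, 625, 625, 1130, 1130, 625, 725, 1130, 625],
                    'id2': ['AF', 'AF', 'AG', 'AP', 'AF', 'AG', 'BA', 'BA', 'BM', 'BM'],
                    'Part1': [5, 8, 3, 4, 2, 6, 1, 2, 6, 3]})

keys = ['id1', 'id2']

# Fail early if a key pair repeats, since index alignment assumes unique keys.
for name, frame in (('df1', df1), ('df2', df2)):
    if frame.duplicated(subset=keys).any():
        raise ValueError(f'{name} has repeated (id1, id2) pairs')

diff = (df1.set_index(keys)['Total']
        .sub(df2.set_index(keys)['Part1'], fill_value=0)
        .astype(int)  # safe here: fill_value=0 leaves no NaN behind
        .reset_index(name='new'))
print(diff)
```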

Subtract/Add existing values if contents of one dataframe is present in another using pandas

Here are 2 dataframes
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack 53
I have tried using pandas isin, but it only tells me, as a boolean, whether the Number value was present or not. Is there any way to do this efficiently?
Use merge with an outer join and then apply Series.add (or Series.sub if necessary):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print (df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
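The output above uses the column name Amount_curr and floats, while the question asks for Curr_amount with whole numbers. A minimal sketch of one way to get closer to the requested layout, assuming the frames are rebuilt as below; the '_new' suffix and the nullable Int64 dtype are choices made here, not part of the answer:

```python
import pandas as pd

df1 = pd.DataFrame({'Number': [123, 124, 312, 314],
                    'Name': ['John', 'Alle', 'Amy', 'Holly'],
                    'Amount': [31, 33, 33, 35]})
df2 = pd.DataFrame({'Number': [312, 124, 317],
                    'Name': ['Amy', 'Alle', 'Jack'],
                    'Amount': [13, 35, 53]})

merged = df1.merge(df2, on=['Number', 'Name'], how='outer', suffixes=('', '_new'))

# Sum both amounts, treating a value missing on one side as 0.
merged['Curr_amount'] = (merged['Amount']
                         .add(merged['Amount_new'], fill_value=0)
                         .astype(int))

result = (merged.drop(columns='Amount_new')
          .astype({'Amount': 'Int64'})  # nullable integer keeps Jack's Amount missing
          .sort_values('Number', ignore_index=True))
print(result)
```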

New dataframe from grouping together two columns

I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, with the mean average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
for example:
new_df = df.groupby(['Resion_Name','Date']).mean()
import pandas as pd
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:
new_df = df3.groupby(['Region_Name','Date'])
new1=new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
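transform('mean') returns one value per original row, which is why the result above still has 11 entries. The question asks for a new dataframe with one row per region and quarter, which an aggregating mean would give. A minimal sketch on the same sample data follows; the name grouped is made up here. The traceback mentioned in the question may simply be a KeyError from the misspelled 'Resion_Name', though that is an assumption since the traceback is not shown.

```python
import pandas as pd

df3 = pd.DataFrame({
    'Region_Name': ['London', 'Newyork', 'London', 'Newyork', 'London', 'London',
                    'Newyork', 'Newyork', 'Newyork', 'Newyork', 'London'],
    'Date': ['1990Q1', '1990Q1', '1990Q2', '1990Q2', '1991Q1', '1991Q1',
             '1991Q2', '1992Q2', '1993Q1', '1993Q1', '1994Q1'],
    'Average': [34, 56, 45, 67, 23, 89, 12, 45, 67, 34, 67]})

# One row per (Region_Name, Date) pair, holding the mean of Average.
grouped = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()
print(grouped)
```

as_index=False keeps Region_Name and Date as ordinary columns instead of moving them into the index.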

How do I reshape this DataFrame in Python?

I have a DataFrame df_sale in Python that I want to reshape: sum the price column per b_no and add it as a new column total. Below is df_sale:
b_no a_id price c_id
120 24 50 2
120 56 100 2
120 90 25 2
120 45 20 2
231 89 55 3
231 45 20 3
231 10 250 3
Expected output after reshaping:
b_no a_id_1 a_id_2 a_id_3 a_id_4 total c_id
120 24 56 90 45 195 2
231 89 45 10 0 325 3
What I have tried so far is to use sum() on df_sale['price'] separately for 120 and 231. I do not understand how I should reshape the data, add new column headers and get the total without being computationally inefficient. Thanks.
This might not be the cleanest method (at all), but it gets the outcome you want:
reshaped_df = (df.groupby('b_no')[['price', 'c_id']]
               .first()
               .join(df.groupby('b_no')['a_id']
                     .apply(list)
                     .apply(pd.Series)
                     .add_prefix('a_id_'))
               .drop(columns='price')
               .join(df.groupby('b_no')['price'].sum().to_frame('total'))
               .fillna(0))
>>> reshaped_df
c_id a_id_0 a_id_1 a_id_2 a_id_3 total
b_no
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
You can achieve this by grouping by b_no and c_id, summing price into total, and spreading a_id across columns:
import pandas as pd
d = {"b_no": [120,120,120,120,231,231, 231],
"a_id": [24,56,90,45,89,45,10],
"price": [50,100,25,20,55,20,250],
"c_id": [2,2,2,2,3,3,3]}
df = pd.DataFrame(data=d)
df2 = df.groupby(['b_no', 'c_id'])['a_id'].apply(list).apply(pd.Series).add_prefix('a_id_').fillna(0)
df2["total"] = df.groupby(['b_no', 'c_id'])['price'].sum()
print(df2)
a_id_0 a_id_1 a_id_2 a_id_3 total
b_no c_id
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
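Both answers number the columns from a_id_0 and return floats, whereas the expected output starts at a_id_1 and shows integers. A minimal alternative sketch that numbers the rows within each b_no using cumcount and then pivots; the intermediate names pos, wide and totals are made up for illustration:

```python
import pandas as pd

df_sale = pd.DataFrame({'b_no': [120, 120, 120, 120, 231, 231, 231],
                        'a_id': [24, 56, 90, 45, 89, 45, 10],
                        'price': [50, 100, 25, 20, 55, 20, 250],
                        'c_id': [2, 2, 2, 2, 3, 3, 3]})

# Position of each row within its b_no group: 1, 2, 3, ...
pos = df_sale.groupby('b_no').cumcount() + 1

wide = (df_sale.assign(pos=pos)
        .pivot(index='b_no', columns='pos', values='a_id')
        .add_prefix('a_id_')
        .rename_axis(columns=None)
        .fillna(0)       # shorter groups are padded with 0, as in the expected output
        .astype(int))

totals = df_sale.groupby('b_no').agg(total=('price', 'sum'), c_id=('c_id', 'first'))

result = wide.join(totals).reset_index()
print(result)
```

Taking the first c_id per group assumes c_id is constant within each b_no, as it is in the sample data.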
