I have the following raw data, in a dataframe:
BROKER VENUE QUANTITY
0 BrokerA Venue_1 300
1 BrokerA Venue_2 400
2 BrokerA Venue_2 1400
3 BrokerA Venue_3 800
4 BrokerB Venue_2 500
5 BrokerB Venue_3 1100
6 BrokerC Venue_1 1000
7 BrokerC Venue_1 1200
8 BrokerC Venue_2 17000
I want to do some summarization of the data to see how much each broker sent to each venue, so I created a pivot_table:
pt = df.pivot_table(index=['BROKER', 'VENUE'], values=['QUANTITY'], aggfunc=np.sum)
Result, as expected:
QUANTITY
BROKER VENUE
BrokerA Venue_1 300.0
Venue_2 1800.0
Venue_3 800.0
BrokerB Venue_2 500.0
Venue_3 1100.0
BrokerC Venue_1 2200.0
Venue_2 17000.0
I also want to see how much was sent by each broker overall, and to show it in this same table. I can get that information with df.groupby('BROKER').sum(), but how can I add it to my pivot table as a column named, say, BROKER_TOTAL?
Note: This question is similar but seems to be on an older version, and my best guess at adapting it to my situation didn't work: Pandas Pivot tables row subtotals
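For reference, a minimal sketch that reconstructs the raw frame above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'BROKER': ['BrokerA', 'BrokerA', 'BrokerA', 'BrokerA',
               'BrokerB', 'BrokerB', 'BrokerC', 'BrokerC', 'BrokerC'],
    'VENUE': ['Venue_1', 'Venue_2', 'Venue_2', 'Venue_3',
              'Venue_2', 'Venue_3', 'Venue_1', 'Venue_1', 'Venue_2'],
    'QUANTITY': [300, 400, 1400, 800, 500, 1100, 1000, 1200, 17000],
})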
You can create a MultiIndex with MultiIndex.from_arrays for df1, concat it to pt, and finally sort_index:
df1 = df.groupby('BROKER').sum()
df1.index = pd.MultiIndex.from_arrays([df1.index + '_total', len(df1.index) * ['']])
print (df1)
QUANTITY
BrokerA_total 2900
BrokerB_total 1600
BrokerC_total 19200
print (pd.concat([pt, df1]).sort_index())
QUANTITY
BROKER VENUE
BrokerA Venue_1 300
Venue_2 1800
Venue_3 800
BrokerA_total 2900
BrokerB Venue_2 500
Venue_3 1100
BrokerB_total 1600
BrokerC Venue_1 2200
Venue_2 17000
BrokerC_total 19200
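A hedged alternative sketch that builds the same result without constructing the MultiIndex by hand (same df and pt as above): tag the per-broker totals with a blank VENUE and concat.
totals = df.groupby('BROKER', as_index=False)['QUANTITY'].sum()
totals['BROKER'] = totals['BROKER'] + '_total'
totals['VENUE'] = ''
out = (pd.concat([pt.reset_index(), totals])
         .set_index(['BROKER', 'VENUE'])
         .sort_index())
print(out)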
I'm looking to stack the values of some columns on top of one another. This is what I currently have:
Buy Buy Currency Sell Sell Currency
Date
2013-12-31 100 CAD 100 USD
2014-01-02 200 USD 200 CAD
2014-01-03 300 CAD 300 USD
2014-01-06 400 USD 400 CAD
This is what I'm looking to achieve:
Buy/Sell Buy/Sell Currency
100 USD
100 CAD
200 CAD
200 USD
300 USD
300 CAD
And so on. Basically, I want to take the values in "Buy" and "Buy Currency" and stack them on top of the values in the "Sell" and "Sell Currency" columns, one after the other. I should mention that my data frame has 10 columns in total, so using
df_pl.stack(level=0)
doesn't seem to work.
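For reference, a minimal sketch reconstructing the frame above:
import pandas as pd

df = pd.DataFrame(
    {'Buy': [100, 200, 300, 400],
     'Buy Currency': ['CAD', 'USD', 'CAD', 'USD'],
     'Sell': [100, 200, 300, 400],
     'Sell Currency': ['USD', 'CAD', 'USD', 'CAD']},
    index=pd.Index(pd.to_datetime(['2013-12-31', '2014-01-02',
                                   '2014-01-03', '2014-01-06']), name='Date'),
)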
One option is pivot_longer from pyjanitor; for this particular use case, you pass a list of regular expressions (to names_pattern) to aggregate the desired column labels into new groups (in names_to):
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index=None,
    names_to=['Buy/Sell', 'Buy/Sell Currency'],
    names_pattern=[r"Buy$|Sell$", r".+Currency$"],
    ignore_index=False,
    sort_by_appearance=True,
)
Buy/Sell Buy/Sell Currency
Date
2013-12-31 100 CAD
2013-12-31 100 USD
2014-01-02 200 USD
2014-01-02 200 CAD
2014-01-03 300 CAD
2014-01-03 300 USD
2014-01-06 400 USD
2014-01-06 400 CAD
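For comparison, a plain-pandas sketch of the same reshape using pd.lreshape, which gathers each listed group of columns into one long column (column names as above; row order may differ from the pyjanitor output):
out = pd.lreshape(
    df.reset_index(),
    {'Buy/Sell': ['Buy', 'Sell'],
     'Buy/Sell Currency': ['Buy Currency', 'Sell Currency']},
)
print(out)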
Using concat:
import pandas as pd

print(pd.concat(
    [df['Buy'], df['Sell']], axis=1
).stack().reset_index(1, drop=True).rename('buy/sell')
)
output:
0 100
0 100
1 200
1 200
2 300
2 300
3 400
3 400
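A sketch extending the same idea to carry the currency columns along (column names as in the question):
buy = df[['Buy', 'Buy Currency']].set_axis(['Buy/Sell', 'Buy/Sell Currency'], axis=1)
sell = df[['Sell', 'Sell Currency']].set_axis(['Buy/Sell', 'Buy/Sell Currency'], axis=1)
print(pd.concat([buy, sell]).sort_index())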
# assuming 'Date' is still a regular column, move it into the index
df.set_index('Date', inplace=True)

# create a mapping to the new column names
d = {'Buy': 'Buy/Sell',
     'Sell': 'Buy/Sell',
     'Buy Currency': 'Buy/Sell Currency',
     'Sell Currency': 'Buy/Sell Currency'}
df.columns = df.columns.map(d)

# stack the first two columns on top of the next two columns
out = pd.concat([df.iloc[:, :2], df.iloc[:, 2:]], ignore_index=True)
out
Buy/Sell Buy/Sell Currency
0 100 CAD
1 200 USD
2 300 CAD
3 400 USD
4 100 USD
5 200 CAD
6 300 USD
7 400 CAD
I'm trying to multiply data from 2 different dataframes; my code is below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'v_contract_number': ['VN120001438','VN120001439',
'VN120001440','VN120001438',
'VN120001439','VN120001440'],
'Currency': ['VND','USD','KRW','USD','KRW','USD'],
'Amount': [10000,5000,6000,200,150,175]})
df2 = pd.DataFrame({'Currency': ['VND','USD','KRW'],'Rate': [1,23000,1200]})
print(df1)
# df1
v_contract_number Currency Amount
0 VN120001438 VND 10000
1 VN120001439 USD 5000
2 VN120001440 KRW 6000
3 VN120001438 USD 200
4 VN120001439 KRW 150
5 VN120001440 USD 175
print(df2)
Currency Rate
0 VND 1
1 USD 23000
2 KRW 1200
df1 = df1.merge(df2)
df1['VND Amount'] = df1['Amount'].mul(df1['Rate'])
df1.drop('Rate', axis=1, inplace=True)
print(df1)
# result
v_contract_number Currency Amount VND Amount
0 VN120001438 VND 10000 10000
1 VN120001439 USD 5000 115000000
2 VN120001438 USD 200 4600000
3 VN120001440 USD 175 4025000
4 VN120001440 KRW 6000 7200000
5 VN120001439 KRW 150 180000
This is exactly what I want, but is there another way to do this without the merge and drop?
The reason I drop 'Rate' is that I don't want it to appear in my report.
Thanks and best regards
You can use pandas' map for this:
df2 = df2.set_index('Currency').squeeze() # squeeze converts to a Series
df1.assign(VND_Amount = df1.Amount.mul(df1.Currency.map(df2)))
v_contract_number Currency Amount VND_Amount
0 VN120001438 VND 10000 10000
1 VN120001439 USD 5000 115000000
2 VN120001440 KRW 6000 7200000
3 VN120001438 USD 200 4600000
4 VN120001439 KRW 150 180000
5 VN120001440 USD 175 4025000
You can avoid the drop by not overwriting df1 on the merge operation:
df1["VND Amount"] = df1.merge(df2, on="Currency").eval("Amount * Rate")
Alternatively, you can use .reindex to align df2 to df1 based on the currency column:
df1["VND Amount"] = (
    df1["Amount"]
    * df2.set_index("Currency")["Rate"]  # set the index and return the Rate column
         .reindex(df1["Currency"])       # align "Rate" values to df1 "Currency"
         .to_numpy()                     # get a numpy array to avoid pandas
                                         # auto-alignment on math ops
)
It's been a long night searching for a solution, I appreciate your help.
Having the following df
   proposal1_amount  proposal2_amount  proposal3_amount  accepted_proposal
0              1000              2000              3000                  3
1              5000              5200              4000                  2
2              3000              2400              1120                  1
I need to build a new column with the amount taken from the corresponding accepted proposal column, like this:
   proposal1_amount  proposal2_amount  proposal3_amount  accepted_proposal  accepted_amount
0              1000              2000              3000                  3             3000
1              5000              5200              4000                  2             5200
2              1450              2400              1120                  1             1450
I've found some examples that work fine when the new column has a fixed value, but in this case the value comes from another column in the same df.
thanks,
vv
Quickest solution I could think of:
df['accepted_amount'] = df.apply(lambda row: row.iloc[row['accepted_proposal'] - 1], axis=1)
Edit: because I feel uneasy about the solution being contingent on the ordering of the columns, here is a slightly wordier yet more robust version that selects the proposal columns by name:
df['accepted_amount'] = df.apply(
    lambda row: row[['proposal1_amount', 'proposal2_amount', 'proposal3_amount']]
                .iloc[row['accepted_proposal'] - 1],
    axis=1)
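A vectorized sketch of the same name-based lookup, avoiding the row-wise apply (assuming the three proposal columns above):
import numpy as np

cols = ['proposal1_amount', 'proposal2_amount', 'proposal3_amount']
vals = df[cols].to_numpy()
# pick, for each row i, the column at position accepted_proposal[i] - 1
df['accepted_amount'] = vals[np.arange(len(df)),
                             df['accepted_proposal'].to_numpy() - 1]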
You can use numpy.choose to do this pretty easily.
print(df)
proposal1_amount proposal2_amount proposal3_amount accepted_proposal
0 1000 2000 3000 3
1 5000 5200 4000 2
2 3000 2400 1120 1
import numpy as np

# create a 2d array of our choices (which corresponds to our amounts)
choices = df.filter(regex=r"proposal\d_amount").to_numpy()
# subtract 1 from "accepted_proposal" so the values line up with positions
# in the choices array (we want these 0-indexed, not 1-indexed)
a = df["accepted_proposal"].to_numpy() - 1
# np.choose does all the heavy lifting; transpose choices so that, for each
# row i, it picks the a[i]-th proposal amount, and assign to a new column
df["accepted_amount"] = np.choose(a, choices.T)
print(df)
proposal1_amount proposal2_amount proposal3_amount accepted_proposal accepted_amount
0 1000 2000 3000 3 3000
1 5000 5200 4000 2 5200
2 3000 2400 1120 1 3000
np.choose functionally iterates over the first axis of its choices argument (here, after the transpose, each "proposalN_amount" column) and, for each row, takes the amount at the index given by accepted_proposal - 1. See the docs for np.choose.
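A tiny sketch of the mechanics with hypothetical toy values:
import numpy as np

options = np.array([[10, 20],   # option 0, one value per position
                    [30, 40],   # option 1
                    [50, 60]])  # option 2
a = np.array([2, 0])            # which option to take at each position
print(np.choose(a, options))    # [50 20]: options[2][0] and options[0][1]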
import pandas as pd

proposal1_amount = [1000, 5000, 1450]
proposal2_amount = [2000, 5200, 2400]
proposal3_amount = [3000, 4000, 1120]
accepted_proposal = [3, 2, 1]
df = pd.DataFrame({'proposal1_amount': proposal1_amount,
                   'proposal2_amount': proposal2_amount,
                   'proposal3_amount': proposal3_amount,
                   'accepted_proposal': accepted_proposal})
df['accepted_proposal'] = df['accepted_proposal'].astype(int)
df = df.assign(accepted_amount=df.apply(
    lambda row: row.iloc[row['accepted_proposal'] - 1], axis=1))
print(df)
output:
proposal1_amount proposal2_amount proposal3_amount accepted_proposal
0 1000 2000 3000 3
1 5000 5200 4000 2
2 1450 2400 1120 1
accepted_amount
0 3000
1 5200
2 1450
I have a database with transfer orders between two cities. Each record has a departure date, the amount to be delivered, a return date, and the amount to be returned.
The database is something like this:
df = pd.DataFrame({"dep_date":[201701,201701,201702,201703], "del_amount":[100,200,300,400],"ret_date":[201703,201702,201703,201705], "ret_amount":[50,75,150,175]})
df
dep_date del_amount ret_date ret_amount
0 201701 100 201703 50
1 201701 200 201702 75
2 201702 300 201703 150
3 201703 400 201705 175
I want to get a pivot table with dep_date as the index, showing the sum of del_amount in that month and the returned amount scheduled for the same month as the departure date.
It's an odd construction, because it seems to need two indexes. The result that I need is:
del_amount ret_amount
dep_date
201701 300 0
201702 300 75
201703 400 200
Note that some return dates do not match any departure month. Does anyone know if it is possible to build a proper aggfunc in a pivot_table environment to achieve this? If not, can anyone tell me the best approach?
Thanks in advance
You'll need two groupby + sum operations, followed by a reindex and concatenation:
i = df.groupby(df.dep_date % 100)['del_amount'].sum()
j = df.groupby(df.ret_date % 100)['ret_amount'].sum()
pd.concat([i, j.reindex(i.index, fill_value=0)], axis=1)
del_amount ret_amount
dep_date
1 300 0
2 300 75
3 400 200
If you want to group on the entire date (and not just the month number), change df.groupby(df.dep_date % 100) to df.groupby('dep_date').
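A sketch of that full-date variant (same df as in the question):
i = df.groupby('dep_date')['del_amount'].sum()
j = df.groupby('ret_date')['ret_amount'].sum()
print(pd.concat([i, j.reindex(i.index, fill_value=0)], axis=1))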
Use
In [97]: s1 = df.groupby('dep_date')['del_amount'].sum()
In [98]: s2 = df.groupby('ret_date')['ret_amount'].sum()
In [99]: s1.to_frame().join(s2.rename_axis('dep_date')).fillna(0)
Out[99]:
del_amount ret_amount
dep_date
201701 300 0.0
201702 300 75.0
201703 400 200.0
Split the frame into two parts, do the calculation for each, then join:
s = df.loc[:, df.columns.str.startswith('de')]
v = df.loc[:, df.columns.str.startswith('ret')]
s.set_index('dep_date').groupby(level=0).sum().join(
    v.set_index('ret_date').groupby(level=0).sum()).fillna(0)
Out[449]:
del_amount ret_amount
dep_date
201701 300 0.0
201702 300 75.0
201703 400 200.0
I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I group subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried with a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                    rows=['State', 'City'], aggfunc=np.sum, margins=True)
I can achieve this on excel, with a pivot table.
If you put State in the rows and City in the columns (rather than both in the rows), you'll get separate margins. Reshape, and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
You can get the summarized values by using groupby() on the State column.
Lets make some sample data first:
import pandas as pd
import StringIO
incsv = StringIO.StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
dfsum = df.groupby('State', as_index=False).sum()
dfsum['City'] = 'All'
print dfsum
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using append:
dfout = dfsum.append(df).set_index(['State','City']).sort_index()
print dfout
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; they're not strictly necessary to get the results.
I think this subtotal example code is what you want (similar to Excel's subtotal).
I assume that you want to group by columns A, B and C, then pivot on D and count the values of E.
main_df.groupby(['A', 'B', 'C']).apply(lambda sub_df:
sub_df.pivot_table(index=['D'], values=['E'], aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one?
table = pd.pivot_table(data, index=['State'], columns=['City'],
                       values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       aggfunc=np.sum, margins=True)
If you are interested, I have just created a little function to make this easier, since you might want to apply 'subtotal' to many tables. It works for tables created via both pivot_table() and groupby(). An example of a table to use it on is provided in this Stack Overflow question: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : DataFrame
        Table with a two-level MultiIndex resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Whether to also add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.

    Returns
    -------
    The table with the total and subtotals added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    # move the first index level into the columns
    pvt = table.unstack(0)
    # ignore a pre-existing 'All' margin column when summing
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    # restore the original index layout, with subtotal rows interleaved
    pvt = pvt.stack().swaplevel(0, 1).sort_index()
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, ''), :] = pvt.loc[mask].sum()
    return pvt
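A hypothetical usage sketch, reusing the State/City frame from the question (the pivot_table call below is an assumption, not part of the original answer):
table = pd.pivot_table(df, values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       index=['State', 'City'], aggfunc='sum')
print(get_subtotal(table, sub_total='All', get_total=True))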
You can also simply rename the built-in margins:
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'],
                       fill_value=0, aggfunc='sum',  # or 'count', np.sum, etc.
                       margins=True, margins_name='Total')
print(table)