I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I get subtotals per state, like this?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State','City'], aggfunc=np.sum, margins=True)
I can achieve this in Excel with a pivot table.
If you don't put both State and City in the rows, you'll get separate margins. Reshape and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
You can get the summarized values by using groupby() on the State column.
Let's make some sample data first:
import pandas as pd
import StringIO
incsv = StringIO.StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
dfsum = df.groupby('State', as_index=False).sum()
dfsum['City'] = 'All'
print dfsum
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using append:
dfsum = dfsum.append(df).set_index(['State','City']).sort_index()
print dfsum
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; it's not strictly necessary to get the results.
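On newer pandas versions DataFrame.append is no longer available (it was removed in pandas 2.0). A minimal sketch of the same idea with pd.concat, assuming Python 3 and the sample data from the question; the names value_cols, subtotals and result are only illustrative:

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question.
csv_text = """Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800"""
df = pd.read_csv(StringIO(csv_text))

value_cols = ['SalesToday', 'SalesMTD', 'SalesYTD']

# One subtotal row per state, labelled 'All' in the City column.
subtotals = df.groupby('State', as_index=False)[value_cols].sum()
subtotals['City'] = 'All'

# Stack subtotals and detail rows, then index and sort so that
# each 'All' row sits above the cities of its state.
result = (
    pd.concat([subtotals, df[['State', 'City'] + value_cols]], ignore_index=True)
    .set_index(['State', 'City'])
    .sort_index()
)
print(result)
```

The 'All' rows end up first within each state only because 'All' happens to sort before the lowercase city names; with other labels you may need an explicit ordering.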
I think this subtotal example is what you want (similar to Excel's subtotal).
I assume that you want to group by columns A, B, C and D, then count the values in column E.
main_df.groupby(['A', 'B', 'C']).apply(lambda sub_df:
sub_df.pivot_table(index=['D'], values=['E'], aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
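A self-contained sketch of this approach, assuming some made-up rows since the original main_df is not shown (the data below is hypothetical). Selecting only D and E before apply is my own choice, to keep the group keys out of the sub-frames; also note that pandas labels the margin row 'All' by default.

```python
import pandas as pd

# Hypothetical data: A, B, C are the outer grouping keys, D the inner key,
# and E the column whose values get counted.
main_df = pd.DataFrame(
    [
        ('a', 'a', 'a', 'a', 1),
        ('a', 'a', 'a', 'b', 2),
        ('a', 'a', 'a', 'b', 3),
        ('b', 'b', 'a', 'a', 4),
        ('b', 'b', 'a', 'a', 5),
        ('b', 'b', 'a', 'a', 6),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)

# For every (A, B, C) group, count E per D and add a margin row
# that holds the count for the whole group.
out = main_df.groupby(['A', 'B', 'C'])[['D', 'E']].apply(
    lambda sub_df: sub_df.pivot_table(
        index=['D'], values=['E'], aggfunc='count', margins=True
    )
)
print(out)
```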
How about this one ?
table = pd.pivot_table(data, index=['State'],columns = ['City'],values=['SalesToday', 'SalesMTD','SalesYTD'],\
aggfunc=np.sum, margins=True)
If you are interested, I have just written a small function to make this easier, since you may want to apply a 'subtotal' operation to many tables. It works for tables created with either pivot_table() or groupby(). An example table to use it on is provided on this Stack Overflow page: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : dataframe, table with multi-index resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Specify whether to add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.
    Returns
    -------
    A table with the total and subtotal added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    pvt = table.unstack(0)
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    #print (mask)
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    pvt = pvt.stack().swaplevel(0,1).sort_index()
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, '' ),: ] = pvt.loc[mask].sum()
    print (pvt)
    return(pvt)
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'], fill_value=0,
                       aggfunc=np.sum,  # or 'count', etc.
                       margins=True, margins_name='Total')
print(table)
Related
Hi everyone, how can I sum multiple rows of data using pandas? The data is in Excel format, and I want to sum only the edwin and maria rows. Please help me out; thanks in advance.
excel data
name salary incentive
0 john 2000 400
1 edwin 3000 600
2 maria 1000 200
expected output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
Judging by the Total row, you need the sums for 'john' and 'edwin', not edwin and maria. I used the isin function, which returns a boolean mask that is then used to select the desired rows (the ind variable). A one-row dataframe holding the Total row is built from those sums, and pd.concat then places it above the original rows. As for the sum in Excel, I don't understand what you want.
import pandas as pd
df = pd.DataFrame({'name':['john', 'edwin', 'maria'], 'salary':[2000, 3000, 1000], 'incentive':[400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name':['Total'], 'salary':[df.loc[ind, 'salary'].sum()], 'incentive':[df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df])
df1 = df1.reset_index().drop(columns='index')
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
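To make the role of the mask more visible, here is a small sketch of the same steps with the intermediate values exposed. The data is the sample from the question; the variable names mask, totals and total_row are my own.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'edwin', 'maria'],
    'salary': [2000, 3000, 1000],
    'incentive': [400, 600, 200],
})

# isin returns one boolean per row: True where the name is in the list.
mask = df['name'].isin(['john', 'edwin'])
print(mask.tolist())

# Sum only the selected rows, for both numeric columns at once.
totals = df.loc[mask, ['salary', 'incentive']].sum()

# Build the one-row 'Total' frame and put it on top of the original rows.
total_row = pd.DataFrame([{
    'name': 'Total',
    'salary': totals['salary'],
    'incentive': totals['incentive'],
}])
result = pd.concat([total_row, df], ignore_index=True)
print(result)
```

Passing ignore_index=True to pd.concat renumbers the rows directly, so no separate reset_index step is needed.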
I have a dataset about a set of bands who play at the same two-city festival. I want to create a pandas dataframe with two columns, Band and total attendance, but I can't use df.groupby directly because the same bands appear in two different columns. Any tips?
Below is an example of what the dataset looks like:
Time Band_in_London Band_in_Reading Attendance_London Attendance_Reading
12 Kasabian Queen 3000 4000
13 Foo Fighters Beatles 1000 5000
14 U2 A-ha 2500 2500
15 Queen Kasabian 2200 1000
16 Beatles Foo Fighters 1300 4700
17 A-Ha U2 4000 3100
18
Using pd.wide_to_long:
import pandas as pd
# Note the inconsistent spelling (A-ha vs. A-Ha) in the data; normalize it first.
df = df.replace({'A-ha':'A-Ha'})
attendance = (pd.wide_to_long(df, ['Attendance', 'Band_in'], 'Time', 'City', sep='_', suffix=r'\w+').
groupby('Band_in').sum()
)
print(attendance)
Attendance
Band_in
A-Ha 6500
Beatles 6300
Foo Fighters 5700
Kasabian 4000
Queen 6200
U2 5600
df = original_dataframe[['Band_in_London', 'Attendance_London']]
df = df.rename({'Band_in_London': 'Band', 'Attendance_London': 'Attendance'}, axis=1)
df = df.append(
original_dataframe[['Band_in_Reading', 'Attendance_Reading']].rename(
{'Band_in_Reading': 'Band', 'Attendance_Reading': 'Attendance'}, axis=1))
df = df.groupby(['Band']).sum()
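The snippet above relies on DataFrame.append, which was removed in pandas 2.0. A minimal sketch of the same stacking idea with pd.concat, assuming the sample data from the question (typed in by hand here) and normalizing the A-ha / A-Ha spelling; the loop and the names parts, stacked and totals are my own:

```python
import pandas as pd

# Sample data from the question.
original_dataframe = pd.DataFrame({
    'Time': [12, 13, 14, 15, 16, 17],
    'Band_in_London': ['Kasabian', 'Foo Fighters', 'U2', 'Queen', 'Beatles', 'A-Ha'],
    'Band_in_Reading': ['Queen', 'Beatles', 'A-ha', 'Kasabian', 'Foo Fighters', 'U2'],
    'Attendance_London': [3000, 1000, 2500, 2200, 1300, 4000],
    'Attendance_Reading': [4000, 5000, 2500, 1000, 4700, 3100],
})

# Take the (band, attendance) pair for each city and give both pairs
# the same column names so they can be stacked on top of each other.
parts = []
for city in ['London', 'Reading']:
    band_col = 'Band_in_' + city
    att_col = 'Attendance_' + city
    part = original_dataframe[[band_col, att_col]].rename(
        columns={band_col: 'Band', att_col: 'Attendance'}
    )
    parts.append(part)

stacked = pd.concat(parts, ignore_index=True)

# The same band is spelled two ways in the data; unify before grouping.
stacked['Band'] = stacked['Band'].replace({'A-ha': 'A-Ha'})

totals = stacked.groupby('Band')['Attendance'].sum()
print(totals)
```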
I have a text document with spending information. I want to use pandas and Python 3 to convert the text to a dataframe with two columns, without repeating row names by combining same names into one row with the respective amounts added to produce a single total.
Original "spending.txt":
shaving 150
shaving 200
coffee 100
food 350
transport 60
transport 40
desired output dataframe:
CATEGORY TOTAL
shaving 350
coffee 100
food 350
transport 100
This should do it:
df = pd.read_csv('spending.txt', header=None, sep=r'\s+')
df.columns = ['category', 'total']
df.groupby('category', as_index=False).sum()
category total
0 coffee 100
1 food 350
2 shaving 350
3 transport 100
Reading in data
from io import StringIO
import pandas as pd
temp = StringIO("""
shaving 150
shaving 200
coffee 100
food 350
transport 60
transport 40
""")
df = pd.read_csv(temp, sep=r'\s+', engine='python', header=None)
df.groupby(0).sum().reset_index().rename({0:'category',1:'total'}, axis=1)
Output
category total
0 coffee 100
1 food 350
2 shaving 350
3 transport 100
Read the file using read_csv, then apply groupby:
df = pd.read_csv('test.txt', sep=" ", header=None)
df.rename(columns={0:'category',1:'Total'},inplace=True)
final_df = df.groupby(['category'],as_index=False)['Total'].sum()
print(final_df)
category Total
0 coffee 100
1 food 350
2 shaving 350
3 transport 100
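Note that the answers above return the categories in sorted order, while the desired output in the question keeps the order in which they first appear in the file. A small sketch, assuming that order matters: passing sort=False to groupby keeps the groups in order of first appearance. The file contents are inlined with StringIO here only to keep the example self-contained.

```python
from io import StringIO

import pandas as pd

# Contents of spending.txt from the question, inlined for the example.
text = """shaving 150
shaving 200
coffee 100
food 350
transport 60
transport 40
"""

df = pd.read_csv(StringIO(text), sep=r'\s+', header=None,
                 names=['category', 'total'])

# sort=False keeps the groups in the order they first appear in the data.
result = df.groupby('category', as_index=False, sort=False)['total'].sum()
print(result)
```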
I have a database of transfer orders between two cities. Each record has a departure date, the amount to be delivered, a return date and the amount to be returned.
The database is something like this:
df = pd.DataFrame({"dep_date":[201701,201701,201702,201703], "del_amount":[100,200,300,400],"ret_date":[201703,201702,201703,201705], "ret_amount":[50,75,150,175]})
df
dep_date del_amount ret_date ret_amount
0 201701 100 201703 50
1 201701 200 201702 75
2 201702 300 201703 150
3 201703 400 201705 175
I want to get a pivot table with dep_date as index, showing the sum of del_amount for that month and the amount scheduled to be returned in that same month.
It's an odd construction, because it seems to have two indexes. The result that I need is:
del_amount ret_amount
dep_date
201701 300 0
201702 300 75
201703 400 200
Note that some return dates do not match any departure month. Does anyone know whether it is possible to build a suitable aggfunc for pivot_table to achieve this? If it is not possible, can anyone tell me the best approach?
Thanks in advance
You'll need two groupby + sum operations, followed by a reindex and concatenation -
i = df.groupby(df.dep_date % 100)['del_amount'].sum()
j = df.groupby(df.ret_date % 100)['ret_amount'].sum()
pd.concat([i, j.reindex(i.index, fill_value=0)], axis=1)
del_amount ret_amount
dep_date
1 300 0
2 300 75
3 400 200
If you want to group on the entire date (and not just the month number), change df.groupby(df.dep_date % 100) to df.groupby('dep_date').
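Put together as a runnable sketch that groups on the full date, using the dataframe from the question (the names delivered, returned and result are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "dep_date": [201701, 201701, 201702, 201703],
    "del_amount": [100, 200, 300, 400],
    "ret_date": [201703, 201702, 201703, 201705],
    "ret_amount": [50, 75, 150, 175],
})

# Delivered amount per departure month.
delivered = df.groupby('dep_date')['del_amount'].sum()

# Returned amount per return month.
returned = df.groupby('ret_date')['ret_amount'].sum()

# Align the returns on the departure months: months without any return
# get 0, and return months that never occur as a departure month
# (201705 here) are dropped.
result = pd.concat(
    [delivered, returned.reindex(delivered.index, fill_value=0)],
    axis=1,
)
print(result)
```

Because the missing months are filled inside reindex rather than with a later fillna, ret_amount stays an integer column.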
Use
In [97]: s1 = df.groupby('dep_date')['del_amount'].sum()
In [98]: s2 = df.groupby('ret_date')['ret_amount'].sum()
In [99]: s1.to_frame().join(s2.rename_axis('dep_date')).fillna(0)
Out[99]:
del_amount ret_amount
dep_date
201701 300 0.0
201702 300 75.0
201703 400 200.0
Split it into two dataframes, do the calculation for each of them, then join:
s=df.loc[:,df.columns.str.startswith('de')]
v=df.loc[:,df.columns.str.startswith('ret')]
s.set_index('dep_date').groupby(level=0).sum().join(v.set_index('ret_date').groupby(level=0).sum()).fillna(0)
Out[449]:
del_amount ret_amount
dep_date
201701 300 0.0
201702 300 75.0
201703 400 200.0
I have a pandas dataframe that contains budget data but my sales data is located in another dataframe that is not the same size. How can I get my sales data updated in my budget data? How can I write conditions so that it makes these updates?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming that the 'cust' column is unique in your other df, you can set the index of the sales df to the 'cust' column and pass the resulting 'sales' series to map (here df is the budget dataframe and df1 the sales dataframe). This maps each 'cust' in the budget df to its sales value. You will get NaN where there is no matching customer, so call fillna(0) to fill those values:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
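A self-contained sketch of this, with the two dataframes from the question typed in by hand (the names budget, sales and lookup are my own). Since the NaN introduced by map turns the column into floats, I cast back to int at the end; that cast is an addition of mine and assumes sales are whole numbers.

```python
import pandas as pd

budget = pd.DataFrame({
    'cust': ['abc', 'def', 'hij'],
    'type': ['new', 'new', 'old'],
    'loc': ['north', 'south', 'south'],
    'rev': [500, 700, 700],
    'sales': [0, 0, 0],
    'spend': [250, 150, 150],
})
sales = pd.DataFrame({
    'cust': ['abc', 'hij'],
    'type': ['new', 'old'],
    'loc': ['north', 'south'],
    'sales': [15, 18],
})

# Series indexed by customer: cust -> sales. Requires unique 'cust' values.
lookup = sales.set_index('cust')['sales']

# Look up each budget customer; customers without sales become NaN, then 0.
budget['sales'] = budget['cust'].map(lookup).fillna(0).astype(int)
print(budget)
```

If a customer can appear more than once per type or loc, matching on 'cust' alone is not enough, and a merge on all three key columns would be the safer route.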