I have a dataset covering a set of bands that play at the same two-city festival. I want to create a pandas DataFrame with two columns, Band and total attendance, but I can't use df.groupby directly because the same bands appear in two different columns. Any tips?
Below is an example of what the dataset looks like:
Time Band_in_London Band_in_Reading Attendance_London Attendance_Reading
12 Kasabian Queen 3000 4000
13 Foo Fighters Beatles 1000 5000
14 U2 A-ha 2500 2500
15 Queen Kasabian 2200 1000
16 Beatles Foo Fighters 1300 4700
17 A-Ha U2 4000 3100
18
Using pd.wide_to_long:
import pandas as pd
# Note the inconsistent spelling of A-ha / A-Ha in the data; normalize it first.
df = df.replace({'A-ha':'A-Ha'})
attendance = (pd.wide_to_long(df, ['Attendance', 'Band_in'], 'Time', 'City', sep='_', suffix=r'\w+').
groupby('Band_in').sum()
)
print(attendance)
Attendance
Band_in
A-Ha 6500
Beatles 6300
Foo Fighters 5700
Kasabian 4000
Queen 6200
U2 5600
df = original_dataframe[['Band_in_London', 'Attendance_London']]
df = df.rename({'Band_in_London': 'Band', 'Attendance_London': 'Attendance'}, axis=1)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way.
df = pd.concat([
    df,
    original_dataframe[['Band_in_Reading', 'Attendance_Reading']].rename(
        {'Band_in_Reading': 'Band', 'Attendance_Reading': 'Attendance'}, axis=1)])
df = df.groupby(['Band']).sum()
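For reference, the stacking approach can be run end to end on the sample table from the question. This is only a minimal sketch: the DataFrame is retyped by hand from the question (the empty 18:00 row is left out), and the names `london`, `reading` and `totals` are made up for illustration.

```python
import pandas as pd

# Sample data retyped from the question (assumption: the empty last row is dropped).
df = pd.DataFrame({
    "Time": [12, 13, 14, 15, 16, 17],
    "Band_in_London": ["Kasabian", "Foo Fighters", "U2", "Queen", "Beatles", "A-Ha"],
    "Band_in_Reading": ["Queen", "Beatles", "A-ha", "Kasabian", "Foo Fighters", "U2"],
    "Attendance_London": [3000, 1000, 2500, 2200, 1300, 4000],
    "Attendance_Reading": [4000, 5000, 2500, 1000, 4700, 3100],
})

# Normalize the inconsistent spelling before grouping.
df = df.replace({"A-ha": "A-Ha"})

# Give both city slices the same column names so they can be stacked.
london = df[["Band_in_London", "Attendance_London"]].rename(
    columns={"Band_in_London": "Band", "Attendance_London": "Attendance"})
reading = df[["Band_in_Reading", "Attendance_Reading"]].rename(
    columns={"Band_in_Reading": "Band", "Attendance_Reading": "Attendance"})

# Stack the two slices and sum attendance per band.
totals = (pd.concat([london, reading], ignore_index=True)
            .groupby("Band", as_index=False)["Attendance"].sum())
print(totals)
```

Passing as_index=False keeps Band as a regular column, which gives the two-column frame the question asks for.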
Related
Hi everyone, how can I sum data across multiple rows using pandas? The data is in Excel format, and I want to sum only the edwin and maria rows. Please help me out; thanks in advance.
excel data
    name  salary  incentive
0   john    2000        400
1  edwin    3000        600
2  maria    1000        200
expected output
    name  salary  incentive
0  Total    5000       1000
1   john    2000        400
2  edwin    3000        600
3  maria    1000        200
Judging by the Total row in your expected output, you need the sums for john and edwin, not edwin and maria. I used isin, which returns a boolean mask that selects the desired rows (the ind variable). A one-row DataFrame holding the Total line is filled with those sums, and pd.concat then places it above the original rows. As for summing in Excel itself, I don't understand what you are asking for.
import pandas as pd
df = pd.DataFrame({'name':['john', 'edwin', 'maria'], 'salary':[2000, 3000, 1000], 'incentive':[400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name':['Total'], 'salary':[df.loc[ind, 'salary'].sum()], 'incentive':[df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df])
df1 = df1.reset_index(drop=True)
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
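If the same Total row is needed for different sets of names, the steps above can be wrapped in a small helper. This is a sketch under assumptions: `add_total_row` and its parameters are hypothetical names, not part of pandas, and it assumes every column other than `name` is numeric.

```python
import pandas as pd

def add_total_row(df, names, label="Total"):
    """Return a copy of df with a summary row on top (hypothetical helper)."""
    # Boolean mask selecting only the rows whose name is in `names`.
    mask = df["name"].isin(names)
    # Sum the numeric columns of the selected rows.
    totals = df.loc[mask].sum(numeric_only=True)
    # Build the one-row frame and put it above the original rows.
    total_row = pd.DataFrame([{"name": label, **totals.to_dict()}])
    return pd.concat([total_row, df], ignore_index=True)

df = pd.DataFrame({"name": ["john", "edwin", "maria"],
                   "salary": [2000, 3000, 1000],
                   "incentive": [400, 600, 200]})
result = add_total_row(df, ["john", "edwin"])
print(result)
```

Passing ["edwin", "maria"] instead gives the sum the question text literally asks for.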
I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (like the one in Excel) that adds up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each salesperson.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There are several ways to solve this. One option is to use pd.concat to stack the required columns and then use groupby:
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'}), df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3':'Salesperson','Amount3':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
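With many Salesperson/Amount pairs, the concat call gets long. A minimal sketch of the same idea driven by a list of column pairs follows; `pairs`, `frames` and `totals` are illustrative names, and the list has to be kept in sync with the actual column names by hand.

```python
import pandas as pd

d = {'Client': [1, 2, 3, 4], 'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500], 'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500], 'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

# (person column, amount column) pairs; extend this list when columns are added.
pairs = [('Salesperson', 'Amount'),
         ('Salesperson 2', 'Amount2'),
         ('Salesperson 3', 'Amount3')]

# Slice out each pair and rename it to the common column names.
frames = [
    df[[person, amount]].rename(columns={person: 'Salesperson', amount: 'Amount'})
    for person, amount in pairs
]

# Stack all slices and sum per salesperson.
totals = (pd.concat(frames, ignore_index=True)
            .groupby('Salesperson', as_index=False)['Amount'].sum())
print(totals)
```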
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson','Amount'], i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you required output
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index="Client",
names_to=".value",
names_pattern=r"([a-zA-Z]+).*",
)
.groupby("Salesperson", as_index = False)
.Amount
.sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value placeholder tells pivot_longer to keep the matched part of each column name as a header in the result. The column names follow a pattern: they start with text (either Salesperson or Amount) and may end with a number. names_pattern captures this pattern; .value is paired with the capture group inside the parentheses, and whatever falls outside the group is discarded in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
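To see what the capture group in names_pattern picks out of each column name, the regex can be tried on its own with the standard re module. This only illustrates the pattern; it is not a description of how pyjanitor applies it internally, and the `headers` name is made up.

```python
import re

names_pattern = r"([a-zA-Z]+).*"
columns = ["Salesperson", "Amount", "Salesperson 2", "Amount2"]

# Group 1 is the leading run of letters; the trailing space/digits fall outside it.
headers = {col: re.match(names_pattern, col).group(1) for col in columns}
print(headers)
```

Both "Salesperson" and "Salesperson 2" map to the header Salesperson, and both "Amount" and "Amount2" map to Amount, which is why the long form ends up with exactly those two value columns.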
I have a dataframe which involves Vendor, Product, Price of various listings on a market among other column values.
I need a dataframe which has the unique vendors, number of products, sum of their product listings, average price/product and (average * no. of sales) as different columns.
Something like this -
What's the best way to make this new dataframe?
Thanks!
First multiply the Number of Sales column by Price, then use DataFrameGroupBy.agg with a dictionary mapping column names to aggregate functions, then flatten the resulting MultiIndex columns with map and rename them:
df['Number of Sales'] *= df['Price']
d1 = {'Product':'size', 'Price':['sum', 'mean'], 'Number of Sales':'mean'}
df = df.groupby('Vendor').agg(d1)
df.columns = df.columns.map('_'.join)
d = {'Product_size':'No. of Product',
'Price_sum':'Sum of Prices',
'Price_mean':'Mean of Prices',
'Number of Sales_mean':'H Factor'
}
df = df.rename(columns=d).reset_index()
print (df)
Vendor No. of Product Sum of Prices Mean of Prices H Factor
0 A 4 121 30.25 6050.0
1 B 1 12 12.00 1440.0
2 C 2 47 23.50 587.5
3 H 1 45 45.00 9000.0
You can do it using groupby(), like this:
df.groupby('Vendor').agg({'Product': 'count', 'Price': ['sum', 'mean']})
That's just three columns, but you can work out the rest.
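The same three columns can also be produced with named aggregation, which avoids the MultiIndex in the result. The sketch below runs on made-up listings, since the question's data is not shown; `listings`, `summary` and the output column names are all hypothetical.

```python
import pandas as pd

# Made-up listings for illustration only; not the asker's data.
listings = pd.DataFrame({
    "Vendor": ["A", "A", "B"],
    "Product": ["p1", "p2", "p3"],
    "Price": [10, 20, 12],
})

# Named aggregation: output_name=(input column, aggregate function).
summary = (listings.groupby("Vendor")
           .agg(n_products=("Product", "count"),
                price_sum=("Price", "sum"),
                price_mean=("Price", "mean"))
           .reset_index())
print(summary)
```

Further columns, such as the average multiplied by the number of sales, can then be added as ordinary column arithmetic on summary.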
You can do this by using pandas pivot_table. Here is an example based on your data.
import pandas as pd
import numpy as np
>>> f = pd.pivot_table(d, index=['Vendor', 'Sales'], values=['Price', 'Product'], aggfunc={'Price': np.sum, 'Product':np.ma.count}).reset_index()
>>> f['Avg Price/Product'] = f['Price']/f['Product']
>>> f['H Factor'] = f['Sales']*f['Avg Price/Product']
>>> f.drop('Sales', axis=1)
Vendor Price Product Avg Price/Product H Factor
0 A 121 4 30.25 6050.0
1 B 12 1 12.00 1440.0
2 C 47 2 23.50 587.5
3 H 45 1 45.00 9000.0
I'm working with census data (using the Census package). When I select variables with the census API, they come through in their raw format (e.g. B01001_007) and I'd like to replace each column name with its label (e.g. male 18 to 19 years).
I know this can be done through df.columns = ['male 18 to 19 years',
'male 20 years',
'male 21 years']
but this is tedious.
Is there a way to do some type of mapping that will auto-query into the header in my df below?
Sample data:
import pandas as pd
from pandas import DataFrame
variables_table = pd.DataFrame({'variable': ['B01001_007E','B01001_008E','B01001_009E'],
'label': ['male 18 to 19 years','male 20 years','male 21 years']
})
variables_table
label variable
male 18 to 19 years B01001_007E
male 20 years B01001_008E
male 21 years B01001_009E
Unclean output:
df = pd.DataFrame({'B01001_007E': ['100','200','300'],
'B01001_008E': ['300','200','100'],
'B01001_009E': ['500','100','200']})
df
B01001_007E B01001_008E B01001_009E
100 300 500
200 200 100
300 100 200
df.rename(columns=variables_table.set_index('variable')['label'])
Out:
male 18 to 19 years male 20 years male 21 years
0 100 300 500
1 200 200 100
2 300 100 200
Note that variables_table.set_index('variable')['label'] is a Series whose index is 'variable'. rename uses that index to look up each column name and replaces it with the corresponding label.
This is not an inplace operation. If you want to change the actual dataframe, assign it back to df: df = df.rename(columns=variables_table.set_index('variable')['label']) or use the inplace parameter: df.rename(columns=variables_table.set_index('variable')['label'], inplace=True)
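An equivalent way to express the lookup is a plain dictionary built from the two columns of variables_table. This is a small sketch of that variant; the names `mapping` and `renamed` are made up for illustration.

```python
import pandas as pd

variables_table = pd.DataFrame({
    'variable': ['B01001_007E', 'B01001_008E', 'B01001_009E'],
    'label': ['male 18 to 19 years', 'male 20 years', 'male 21 years'],
})
df = pd.DataFrame({'B01001_007E': ['100', '200', '300'],
                   'B01001_008E': ['300', '200', '100'],
                   'B01001_009E': ['500', '100', '200']})

# Raw variable code -> human-readable label.
mapping = dict(zip(variables_table['variable'], variables_table['label']))

# Columns missing from the mapping are left unchanged by rename.
renamed = df.rename(columns=mapping)
print(renamed.columns.tolist())
```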
I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I get subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State','City'], aggfunc=np.sum, margins=True)
I can achieve this in Excel with a pivot table.
If you don't put both State and City in the rows, you'll get separate margins. Reshape and you get the table you're after (in current pandas, rows= and cols= are spelled index= and columns=):
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
You can get the summarized values by using groupby() on the State column.
Let's make some sample data first:
import pandas as pd
import io
incsv = io.StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
# numeric_only=True keeps the text column City out of the sum
dfsum = df.groupby('State', as_index=False).sum(numeric_only=True)
dfsum['City'] = 'All'
print(dfsum)
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can add the original data to the summed df with pd.concat (DataFrame.append was removed in pandas 2.0). The result has to be assigned back, otherwise the print shows only the sums:
dfsum = pd.concat([dfsum, df]).set_index(['State','City']).sort_index()
print(dfsum)
SalesToday SalesMTD SalesYTD
State City
stA All 50 900 2100
ctA 20 400 1000
ctB 30 500 1100
stB All 50 700 2200
ctC 10 500 900
ctD 40 200 1300
stC All 30 300 800
ctF 30 300 800
I added set_index and sort_index to make it look more like your example output; they are not strictly necessary to get the results.
I think this subtotal example code is what you want (similar to Excel's subtotal).
I assume you want to group by columns A, B, C and D, then count the values in column E.
main_df.groupby(['A', 'B', 'C']).apply(lambda sub_df:
sub_df.pivot_table(index=['D'], values=['E'], aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one ?
table = pd.pivot_table(data, index=['State'],columns = ['City'],values=['SalesToday', 'SalesMTD','SalesYTD'],\
aggfunc=np.sum, margins=True)
If you are interested, I have written a small function to make this easier, since you may want to apply the same 'subtotal' logic to many tables. It works for tables created with either pivot_table() or groupby(). An example table to use it with is given on this Stack Overflow page: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : dataframe, table with multi-index resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Specify whether to add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.
    Returns
    -------
    A table with the total and subtotal added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    # Move the first index level into the columns
    pvt = table.unstack(0)
    # Skip the 'All' margin (if any) when summing, to avoid double counting
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    # Restore the original (level 0, level 1) row layout
    pvt = pvt.stack().swaplevel(0, 1).sort_index()
    # Move the first column to the end
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        # Grand total over everything except the subtotal rows
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, ''), :] = pvt.loc[mask].sum()
    print(pvt)
    return pvt
# aggfunc can be 'sum', 'count', 'mean', etc.
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'], fill_value=0, aggfunc='sum', margins=True, margins_name='Total')
print(table)