How can I convert rows to columns (with custom names) after grouping? - python

I'm trying to get some row data as columns with pandas.
My original dataframe is something like the following (with a lot more columns). Most data repeats for the same employee, but some info changes, like salary in this example. Employees have different numbers of entries (in this case employee 1 has two entries, employee 2 has four, and so on).
employee_id  salary  other1      other2      other3
1            50000   somedata1   somedata2   somedata3
1            48000   somedata1   somedata2   somedata3
2            80000   somedata20  somedata21  somedata22
2            77000   somedata20  somedata21  somedata22
2            75000   somedata20  somedata21  somedata22
2            74000   somedata20  somedata21  somedata22
3            60000   somedata30  somedata31  somedata32
I'm trying to get something like the following. Salary data should span a few columns and use the last available salary for employees with fewer entries (the repeated salary values in this example).
employee_id  salary  prevsalary1  prevsalary2  prevsalary3  other1      other2      other3
1            50000   48000        48000        48000        somedata1   somedata2   somedata3
2            80000   77000        75000        74000        somedata20  somedata21  somedata22
3            60000   60000        60000        60000        somedata30  somedata31  somedata32
I tried grouping:
df.groupby(["employee_id"])['salary'].nlargest(3).reset_index()
But I don't get all the columns. I can't find a way to preserve the rest of the columns. Do I need to merge, concatenate, or something like that with the original dataframe?
Also, I get a column named "level_1". I think I could get rid of it by using reset_index(level=1, drop=True), but I believe this doesn't return a dataframe.
And finally, I guess if I get this grouping right, there's one more step to get the columns... maybe using pivot or unstack?
I'm starting my journey into machine learning and I keep scratching my head with this one. I hope you can help me :)
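A quick note on the level_1 point: the nlargest chain returns a Series, and reset_index(level=1, drop=True) keeps it a Series; a further plain reset_index() turns it into a dataframe. A minimal sketch of the idea (assuming the employee_id/salary frame above):
# Series indexed by (employee_id, original row); drop the inner level
s = df.groupby('employee_id')['salary'].nlargest(3).reset_index(level=1, drop=True)
# a plain reset_index() then yields a DataFrame with employee_id and salary columns
tidy = s.reset_index()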
Creating dataset:
df = pd.DataFrame({'employee_id':[1,1,2,2,2,2,3],'salary':[50000,48000,80000,77000,75000,74000,60000]})
df['other1'] =['somedata1','somedata1','somedata20','somedata20','somedata20','somedata20','somedata30']
df['other2'] = df['other1'].apply(lambda x: x+'1')
df['other3'] = df['other1'].apply(lambda x: x+'2')
df
Out[59]:
   employee_id  salary      other1       other2       other3
0            1   50000   somedata1   somedata11   somedata12
1            1   48000   somedata1   somedata11   somedata12
2            2   80000  somedata20  somedata201  somedata202
3            2   77000  somedata20  somedata201  somedata202
4            2   75000  somedata20  somedata201  somedata202
5            2   74000  somedata20  somedata201  somedata202
6            3   60000  somedata30  somedata301  somedata302
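For what it's worth, the unstack route the question hints at can also work. A minimal sketch (an alternative added for illustration, not one of the answers below), using cumcount to rank each employee's rows:
# number each row within its employee group: 0, 1, 2, ...
pos = df.groupby('employee_id').cumcount()
# move those positions into columns, then forward-fill shorter salary histories
wide = df.set_index(['employee_id', pos])['salary'].unstack().ffill(axis=1)
wide.columns = ['salary'] + [f'prevsalary{i}' for i in range(1, wide.shape[1])]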

One way is using pd.pivot_table with ffill:
g = df.groupby('employee_id')
cols = g.salary.cumcount()
out = df.pivot_table(index='employee_id', values='salary', columns=cols).ffill(axis=1)
# Create list of column names matching the expected output
out.columns = ['salary'] + [f'prevsalary{i}' for i in range(1, len(out.columns))]
print(out)
salary prevsalary1 prevsalary2 prevsalary3
employee_id
1 50000.0 48000.0 48000.0 48000.0
2 80000.0 77000.0 75000.0 74000.0
3 60000.0 60000.0 60000.0 60000.0
Now we just need to join with the unique other columns from the original dataframe:
out = out.join(df.filter(like='other').groupby(df.employee_id).first())
print(out)
salary prevsalary1 prevsalary2 prevsalary3 other1 \
employee_id
1 50000.0 48000.0 48000.0 48000.0 somedata1
2 80000.0 77000.0 75000.0 74000.0 somedata20
3 60000.0 60000.0 60000.0 60000.0 somedata30
                 other2       other3
employee_id
1            somedata11   somedata12
2           somedata201  somedata202
3           somedata301  somedata302

Pivot the table of salaries first, then merge with the non-salary data:
# first create a copy of the dataset without the salary column
dataset_without_salaries = df.drop('salary', axis=1).drop_duplicates()
# pivot only the salary column, collecting each employee's salaries into a list
temp = pd.pivot_table(data=df[['salary']], index=df['employee_id'], aggfunc=list)
# expand the list into one column per salary, forward-filling shorter histories
temp2 = temp.apply(lambda x: pd.Series(x['salary']), axis=1).ffill(axis=1)
# merge the two together on employee_id
final = pd.merge(temp2.reset_index(), dataset_without_salaries, on='employee_id')
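One hedged finishing touch, not part of the original answer: temp2's columns come out as the integers 0 through 3, so a rename brings final in line with the expected output:
# rename the positional columns produced when the salary lists were expanded
final = final.rename(columns={0: 'salary', 1: 'prevsalary1',
                              2: 'prevsalary2', 3: 'prevsalary3'})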

Related

How to sum multiple rows of data using pandas when the data is in Excel format

Hi everyone, how do I sum multiple rows of data using pandas when the data is in Excel format, summing only edwin's and maria's data? Please help me out, thanks in advance.
excel data:
    name  salary  incentive
0   john    2000        400
1  edwin    3000        600
2  maria    1000        200
expected output:
    name  salary  incentive
0  Total    5000       1000
1   john    2000        400
2  edwin    3000        600
3  maria    1000        200
Judging by the Total line, you need the sums for 'john' and 'edwin', not edwin and maria. I used the isin function, which returns a boolean mask; that mask then selects the desired rows (the ind variable). A one-row dataframe holding the Total sums is built, and pd.concat prepends it to the remaining rows. As for the sum in Excel itself, I don't understand what you want.
import pandas as pd
df = pd.DataFrame({'name':['john', 'edwin', 'maria'], 'salary':[2000, 3000, 1000], 'incentive':[400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name':['Total'], 'salary':[df.loc[ind, 'salary'].sum()], 'incentive':[df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df])
df1 = df1.reset_index(drop=True)
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
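As a sketch of a slightly more compact variant (my own addition, not from the answer above), the Total row can also be built directly from the masked sum:
# sum only the selected rows, then shape the result into a one-row frame
total_row = df.loc[ind, ['salary', 'incentive']].sum().to_frame().T
total_row.insert(0, 'name', 'Total')
out = pd.concat([total_row, df], ignore_index=True)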

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500]}
df = pd.DataFrame(data=d)
   Client Salesperson  Amount Salesperson 2  Amount2
0       1        John    1000           Bob      400
1       2        John    1000       Richard      200
2       3         Bob       0          John      300
3       4     Richard     500           Tom      500
And I just need to create some sort of "sumif" statement (like the one in Excel) that will add up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John                  2300
Bob                    400
Richard                700
Tom                    500
There can be multiple ways of solving this. One option is to use pd.concat to join the required columns and then groupby:
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(
        columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(
        columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
    df[['Salesperson 3', 'Amount3']].rename(
        columns={'Salesperson 3': 'Salesperson', 'Amount3': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson': 'Salesperson 1', 'Amount': 'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'],
                              i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df:
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor

(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the parts of the column names that match it as headers. The column names share a pattern: they start with text (either Salesperson or Amount) and may or may not end in a number. This pattern is captured in names_pattern; .value is paired with the regex group in brackets, and whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
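If you want to see that intermediate step on its own, printing the long form before the groupby should give one row per (Client, Salesperson) pair; a sketch (row order may differ depending on the pivot_longer version):
long_df = df.pivot_longer(
    index="Client",
    names_to=".value",
    names_pattern=r"([a-zA-Z]+).*",
)
print(long_df)
#    Client Salesperson  Amount
# 0       1        John    1000
# 1       2        John    1000
# ...
# 7       4         Tom     500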

Aggregating column values of dataframe to a new dataframe

I have a dataframe which involves Vendor, Product, Price of various listings on a market among other column values.
I need a dataframe which has the unique vendors, number of products, sum of their product listings, average price per product and (average * no. of sales) as different columns.
Something like this:
  Vendor  No. of Product  Sum of Prices  Mean of Prices  H Factor
0      A               4            121           30.25    6050.0
1      B               1             12           12.00    1440.0
2      C               2             47           23.50     587.5
3      H               1             45           45.00    9000.0
What's the best way to make this new dataframe?
Thanks!
First multiply the Number of Sales column by Price, then use DataFrameGroupBy.agg with a dictionary of column names and aggregate functions, and finally flatten the MultiIndex columns with map and rename them:
df['Number of Sales'] *= df['Price']
d1 = {'Product': 'size', 'Price': ['sum', 'mean'], 'Number of Sales': 'mean'}
df = df.groupby('Vendor').agg(d1)
df.columns = df.columns.map('_'.join)
d = {'Product_size': 'No. of Product',
     'Price_sum': 'Sum of Prices',
     'Price_mean': 'Mean of Prices',
     'Number of Sales_mean': 'H Factor'}
df = df.rename(columns=d).reset_index()
print(df)
Vendor No. of Product Sum of Prices Mean of Prices H Factor
0 A 4 121 30.25 6050.0
1 B 1 12 12.00 1440.0
2 C 2 47 23.50 587.5
3 H 1 45 45.00 9000.0
You can do it using groupby(), like this:
df.groupby('Vendor').agg({'Products': 'count', 'Price': ['sum', 'mean']})
That's just three columns, but you can work out the rest.
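If you want the custom column names in one pass, named aggregation (available since pandas 0.25) is another option; a sketch, with the output names taken from the first answer and the input columns assumed to match your data:
# named aggregation: output column name = (input column, aggregate function)
out = df.groupby('Vendor').agg(
    **{'No. of Product': ('Product', 'size'),
       'Sum of Prices': ('Price', 'sum'),
       'Mean of Prices': ('Price', 'mean')}
).reset_index()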
You can do this by using pandas pivot_table. Here is an example based on your data.
import pandas as pd
import numpy as np
>>> f = pd.pivot_table(d, index=['Vendor', 'Sales'], values=['Price', 'Product'], aggfunc={'Price': np.sum, 'Product':np.ma.count}).reset_index()
>>> f['Avg Price/Product'] = f['Price']/f['Product']
>>> f['H Factor'] = f['Sales']*f['Avg Price/Product']
>>> f.drop('Sales', axis=1)
Vendor Price Product Avg Price/Product H Factor
0 A 121 4 30.25 6050.0
1 B 12 1 12.00 1440.0
2 C 47 2 23.50 587.5
3 H 45 1 45.00 9000.0

Matching and adding column to data frame

I am going crazy about this one. I am trying to add a new column to a data frame DF1, based on values found in another data frame DF2. This is how they look,
DF1=
Date Amount Currency
0 2014-08-20 -20000000 EUR
1 2014-08-20 -12000000 CAD
2 2014-08-21 10000 EUR
3 2014-08-21 20000 USD
4 2014-08-22 25000 USD
DF2=
NAME OPEN
0 EUR 10
1 CAD 20
2 USD 30
Now, I would like to create a new column in DF1, named 'Amount (Local)', where each amount in 'Amount' is multiplied by the matching value found in DF2, yielding this result:
DF1=
Date Amount Currency Amount (Local)
0 2014-08-20 -20000000 EUR -200000000
1 2014-08-20 -12000000 CAD -240000000
2 2014-08-21 10000 EUR 100000
3 2014-08-21 20000 USD 600000
4 2014-08-22 25000 USD 750000
If there exists a method for adding a column to DF1 based on a function, instead of just multiplication as the above problem, that would be very much appreciated also.
Thanks,
You can use map with a dict built from your second df (in my case it is called df1; yours is DF2), and then multiply the result of this by the amount:
In [65]:
df['Amount (Local)'] = df['Currency'].map(dict(df1[['NAME','OPEN']].values)) * df['Amount']
df
Out[65]:
Date Amount Currency Amount (Local)
index
0 2014-08-20 -20000000 EUR -200000000
1 2014-08-20 -12000000 CAD -240000000
2 2014-08-21 10000 EUR 100000
3 2014-08-21 20000 USD 600000
4 2014-08-22 25000 USD 750000
So breaking this down: map matches each value against the dict keys. In this case we are matching Currency against the NAME key, and the dict values are the OPEN values, so the result of this is:
In [66]:
df['Currency'].map(dict(df1[['NAME','OPEN']].values))
Out[66]:
index
0 10
1 20
2 10
3 30
4 30
Name: Currency, dtype: int64
We then simply multiply this series against the Amount column from df (DF1 in your case) to get the desired result.
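On the question's closing ask (applying a general function rather than a plain multiplication), one hedged sketch building on the same lookup idea; convert is a hypothetical stand-in for whatever per-row logic you need:
# build the currency -> rate lookup once
rates = dict(df1[['NAME', 'OPEN']].values)

def convert(amount, rate):
    # hypothetical example function; replace with any logic you need
    return amount * rate

df['Amount (Local)'] = [convert(a, rates[c])
                        for a, c in zip(df['Amount'], df['Currency'])]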
Use fancy-indexing to create a currency array aligned with your data in df1, then use it in multiplication, and assign the result to a new column in df1:
import pandas as pd
ccy_series = pd.Series([10,20,30], index=['EUR', 'CAD', 'USD'])
df1 = pd.DataFrame({'amount': [-200, -120, 1, 2, 2.5], 'ccy': ['EUR', 'CAD', 'EUR', 'USD', 'USD']})
aligned_ccy = ccy_series[df1.ccy].reset_index(drop=True)
aligned_ccy
=>
0 10
1 20
2 10
3 30
4 30
dtype: int64
df1['amount_local'] = df1.amount * aligned_ccy
df1
=>
amount ccy amount_local
0 -200.0 EUR -2000
1 -120.0 CAD -2400
2 1.0 EUR 10
3 2.0 USD 60
4 2.5 USD 75
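Another common route, sketched here as an illustration rather than part of the answers above, is to merge the rate table in and then multiply:
# rate table equivalent to DF2
rates = pd.DataFrame({'NAME': ['EUR', 'CAD', 'USD'], 'OPEN': [10, 20, 30]})
# a left merge keeps df1's row order, so the result aligns positionally
merged = df1.merge(rates, left_on='ccy', right_on='NAME', how='left')
df1['amount_local'] = merged['amount'] * merged['OPEN']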

Pandas Pivot tables row subtotals

I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I group subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried with a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State','City'], aggfunc=np.sum, margins=True)
I can achieve this in Excel with a pivot table.
If you put State in the rows and City in the columns, instead of both in the rows, you'll get separate margins. Reshape, and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
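A version note, since this answer targets pandas 0.10: the rows= and cols= keywords were later renamed to index= and columns=, so on modern pandas the equivalent call would look something like this (a sketch against the current API):
import numpy as np
import pandas as pd

table = pd.pivot_table(df, values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       index=['State'], columns=['City'],
                       aggfunc=np.sum, margins=True)
print(table.stack('City'))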
You can get the summarized values by using groupby() on the State column.
Lets make some sample data first:
import pandas as pd
from io import StringIO

incsv = StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a City column:
dfsum = df.groupby('State', as_index=False).sum()
dfsum['City'] = 'All'
print(dfsum)
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using append:
dfsum = dfsum.append(df).set_index(['State', 'City']).sort_index()
print(dfsum)
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; it's not strictly necessary to get the results.
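One caveat worth flagging, as this answer predates it: DataFrame.append was removed in pandas 2.0, so on current versions the same step would use pd.concat:
# pandas >= 2.0 equivalent of dfsum.append(df)
dfsum = pd.concat([dfsum, df]).set_index(['State', 'City']).sort_index()
print(dfsum)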
I think this subtotal example code is what you want (similar to Excel's subtotal).
I assume that you want to group by columns A, B, C, D, then count the values of column E.
main_df.groupby(['A', 'B', 'C']).apply(
    lambda sub_df: sub_df.pivot_table(index=['D'], values=['E'],
                                      aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one?
table = pd.pivot_table(data, index=['State'], columns=['City'],
                       values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                       aggfunc=np.sum, margins=True)
If you are interested, I have created a little function to make this easier, since you might want to apply this 'subtotal' function to many tables. It works for tables created via pivot_table() and groupby(). An example of a table to use it with is provided on this Stack Overflow page: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : dataframe
        Table with a two-level index resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Whether to add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.

    Returns
    -------
    A table with the subtotals (and optionally the total) added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    pvt = table.unstack(0)
    # ignore any pre-existing 'All' margin column when summing
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    pvt = pvt.stack().swaplevel(0, 1).sort_index()
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, ''), :] = pvt.loc[mask].sum()
    print(pvt)
    return pvt
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'],
                       fill_value='0', aggfunc=np.sum,  # or 'count', etc.
                       margins=True, margins_name='Total')
print(table)
