I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
I need something like Excel's SUMIF that adds up the amount each salesperson is due. I don't know how to iterate over each row, but I want to add the values in "Amount" and "Amount2" for each salesperson.
Then I need to be able to see the total amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There are several ways to solve this. One option is to use pd.concat to stack the required column pairs on top of each other and then use groupby:
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
This gives:
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'}), df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3':'Salesperson','Amount3':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
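If more salesperson/amount pairs keep being added, the concat call can be built in a loop instead of being written out by hand. The following is only a sketch of that idea: the `pairs` list is a hypothetical helper that you would maintain yourself, and it assumes every pair should be renamed to the common names Salesperson and Amount.

```python
import pandas as pd

d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

# Hypothetical list of (salesperson column, amount column) pairs.
pairs = [('Salesperson', 'Amount'),
         ('Salesperson 2', 'Amount2'),
         ('Salesperson 3', 'Amount3')]

# Give every pair the same two column names so the frames stack cleanly.
frames = [
    df[[person_col, amount_col]].rename(
        columns={person_col: 'Salesperson', amount_col: 'Amount'})
    for person_col, amount_col in pairs
]

totals = (pd.concat(frames, ignore_index=True)
            .groupby('Salesperson', as_index=False)['Amount']
            .sum())
print(totals)
```

Adding a fourth pair then only means appending one tuple to `pairs`.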
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson','Amount'], i='Client',j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above reshapes df to:
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to use the matched part of each column name as the new header. The column names follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern. .value is paired with the capture group in the parentheses; whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
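To see what the capture group in names_pattern extracts, the same regular expression can be applied to the column names with Python's re module. This is only an illustration of the regex itself; how pyjanitor applies it internally is not shown here and may differ in detail.

```python
import re

# Same pattern as passed to names_pattern above.
pattern = re.compile(r"([a-zA-Z]+).*")

columns = ["Salesperson", "Amount", "Salesperson 2", "Amount2"]

# group(1) is the part inside the parentheses, i.e. the new header.
headers = {col: pattern.match(col).group(1) for col in columns}
print(headers)
```

Both "Salesperson" and "Salesperson 2" map to the header Salesperson, and both "Amount" and "Amount2" map to Amount, which is why the long form ends up with exactly those two value columns.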
Related
Hi everyone, how can I sum data across multiple rows using pandas? The data is in Excel format, and I want to sum only the edwin and maria data. Please help me out, thanks in advance.
excel data
   name   salary  incentive
0  john   2000    400
1  edwin  3000    600
2  maria  1000    200
expected output
   name   salary  incentive
0  Total  5000    1000
1  john   2000    400
2  edwin  3000    600
3  maria  1000    200
Judging by the Total line, you need the sums for 'john' and 'edwin', not edwin and maria. I used the isin function, which returns a boolean mask that is then used to select the desired rows (the ind variable). A one-row dataframe holding the Total line is filled with the sums, and pd.concat then joins it with the original rows. As for the sum in Excel, I do not understand what you want.
import pandas as pd
df = pd.DataFrame({'name':['john', 'edwin', 'maria'], 'salary':[2000, 3000, 1000], 'incentive':[400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name':['Total'], 'salary':[df.loc[ind, 'salary'].sum()], 'incentive':[df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df], ignore_index=True)
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
I am trying to aggregate a dataset of purchases; I have shortened the example in this post to keep it simple. The purchases are distinguished by two different columns used to identify both customer and transaction. Rows with the same reference belong to the same transaction, while the ID refers to the type of transaction.
I want to sum these records by ID while taking the reference into account, so that the size is not double-counted. The example below should make this clear.
What I tried so far is:
df_new = df.groupby(by = ['id'], as_index=False).agg(aggregate)
df_new = df.groupby(by = ['id','ref'], as_index=False).agg(aggregate)
Let me know if you have any idea what I can do in pandas, or otherwise in Python.
This is basically what I have,
Name  Reference  Side  Size  ID
Alex  0          BUY   2400  0
Alex  0          BUY   2400  0
Alex  0          BUY   2400  0
Alex  1          BUY   3000  0
Alex  1          BUY   3000  0
Alex  1          BUY   3000  0
Alex  2          SELL  4500  1
Alex  2          SELL  4500  1
Sam   3          BUY   1500  2
Sam   3          BUY   1500  2
Sam   3          BUY   1500  2
What I am trying to achieve is the following,
Name  Side  Size  ID
Alex  BUY   5400  0
Alex  SELL  4500  1
Sam   BUY   1500  2
P.S. the records are not duplicates of each other, what I provide is a simplified version, but in reality 'Name' is 20 more columns identifying each row.
P.S. P.S. My solution was to first aggregate by Reference then by ID.
Use drop_duplicates, groupby, and agg:
new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()
Output:
>>> new_df
Name Side Size ID
0 Alex BUY 5400 0
1 Alex SELL 4500 1
2 Sam BUY 1500 2
Edit: richardec's solution is better as this will also sum the ID column.
This double groupby should achieve the output you want, as long as names are unique.
df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()
Explanation: First we group by Name and Reference to get the following dataframe. The ".max()" could just as well be ".min()" or ".mean()" as it seems your data will have the same size per unique transaction:
Name  Reference  Side  Size  ID
Alex  0          BUY   2400  0
      1          BUY   3000  0
      2          SELL  4500  1
Sam   3          BUY   1500  2
Then we group this data by Name and Side with a ".sum()" operation to get the final result.
Name  Side  Size  ID
Alex  BUY   5400  0
      SELL  4500  1
Sam   BUY   1500  2
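The two-step idea can also be written with explicit aggregations, so that ID is carried along rather than summed. This is a sketch of a variation, not the exact code above: it uses "first" instead of max() in step one and assumes that Side, Size and ID are constant within each Name/Reference pair, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex"] * 8 + ["Sam"] * 3,
    "Reference": [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
    "Side": ["BUY"] * 6 + ["SELL"] * 2 + ["BUY"] * 3,
    "Size": [2400] * 3 + [3000] * 3 + [4500] * 2 + [1500] * 3,
    "ID": [0] * 6 + [1] * 2 + [2] * 3,
})

# Step 1: collapse each transaction (Name + Reference) to a single row.
per_reference = df.groupby(["Name", "Reference"], as_index=False).agg(
    Side=("Side", "first"),
    Size=("Size", "first"),
    ID=("ID", "first"),
)

# Step 2: sum the de-duplicated sizes per Name and Side, keeping the ID.
result = per_reference.groupby(["Name", "Side"], as_index=False).agg(
    Size=("Size", "sum"),
    ID=("ID", "first"),
)
print(result)
```

Taking "first" for ID in the second step avoids adding IDs together when one Name/Side group spans several references.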
Just drop duplicates first and then aggregate with a list of grouping columns.
Something like this should do (not tested).
I always like to reset the index afterwards,
i.e.
df.drop_duplicates().groupby(["Name","Side","ID"]).sum()["Size"].reset_index()
or
# stops the double counts
df_dropped = df.drop_duplicates()
# groups by all the fields in your example
df_grouped = df_dropped.groupby(["Name","Side","ID"]).sum()["Size"]
# resets the 3 indexes created with above
df_reset = df_grouped.reset_index()
I am trying to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers and the day on which they visited. I want to identify new customers based on 2 logical conditions:
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    newby = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing "dummy columns", for example by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates - given there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
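The lookup idea behind this alternative can be spelled out with a plain Python set, which may make the logic easier to follow. This is a sketch of the same membership test, not a faster replacement: it builds the set of (Day, Guest ID) visits and flags a row as returning when the same guest appears on the previous day.

```python
import pandas as pd

df = pd.DataFrame({
    "Day": [3230, 3230, 3231, 3232, 3232],
    "Guest ID": ["Tom", "Peter", "Tom", "Peter", "Peter"],
})

# Every (day, guest) combination that occurs at least once.
visits = set(zip(df["Day"], df["Guest ID"]))

# A guest is "new" (1) unless they also visited on the previous day (0).
df["newby"] = [
    0 if (day - 1, guest) in visits else 1
    for day, guest in zip(df["Day"], df["Guest ID"])
]
print(df)
```

Because the test only asks whether a visit exists on the previous day, repeated visits on the same day (rows 3 and 4) get the same flag without any de-duplication step.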
I am new to using Python and Pandas, but have been trying to automate some of the data cleaning/merging for reports of mine.
So far I've had success in building up the combined file of all information I need to feed into my reporting summary but have
gotten stuck with grouping and merging data with matching prefixes.
I have a data set that is structured similar to this in a pandas dataframe:
Company_Num Company_Name 2019_Amt 2020_Amt Code Flag Manager
1 ABC Company Ltd 2000 400 A Y John
1 ABC Company Ltd 2000 400 A Y John
2 DEFGHIJ Company (London) 480 100 B N James
3 DEFGHIJ Company (Bristol) 600 700 B N James
4 DEFGHIJ Company (York) 1500 1000 B N James
5 KLM Services 9000 7000 A Y Jane
6 NOPQ Industries 300 400 C Y Jen
7 NOPQ Industries - London 7000 8000 C Y Jen
I'm wanting to get a summary set of data where there are no duplicates in my data and
instead of having rows for each office I have one summarised value for each company. Ultimately
with a dataframe like:
Company_Name 2019_Amt 2020_Amt Code Flag
ABC Company Ltd 2000 400 A Y
DEFGHIJ Company 2580 1800 B N
KLM Services 9000 7000 A Y
NOPQ Industries 7300 8400 C Y
So far I have managed to drop the duplicates using:
df.drop_duplicates(subset=['Company_Num', 'Company_Name', 'Code', '2019_Amt', '2020_Amt'])
With the resulting table:
Company_Num Company_Name 2019_Amt 2020_Amt Code Flag Manager
1 ABC Company Ltd 2000 400 A Y John
2 DEFGHIJ Company (London) 480 100 B N James
3 DEFGHIJ Company (Bristol) 600 700 B N James
4 DEFGHIJ Company (York) 1500 1000 B N James
5 KLM Services 9000 7000 A Y Jane
6 NOPQ Industries 300 400 C Y Jen
7 NOPQ Industries - London 7000 8000 C Y Jen
The solution that I have tried is to substring the first 9 characters of each company name and use a groupby
and sum on those, but that leaves me with the column being saved as the substring. This has also dropped the
columns Code and Flag from my dataframe, leaving me with table like this:
df['SubString_Company_Name'] = df['Company_Name'].str.slice(0,9)
df.groupby([df.SubString_Company_Name]).sum().reset_index()
SubString_Company_Name 2019_Amt 2020_Amt
ABC Compa 2000 400
DEFGHIJ C 2580 1800
KLM Servi 9000 7000
NOPQ Indu 7300 8400
I have tried to use the os.path.commonprefix function to get the company names, but can't find a way to use it in a dataframe,
and for multiple values. My understanding is it will look at the list as a whole and return the longest common prefix of the
whole list which wouldn't work. I have also considered extracting all duplicate substrings into new dataframes and summing
and renaming there before merging back into one data set, but I'm not sure if that would work. The solutions I've found online
have been centred around uniform data where lambda can be used with a delimiter or the prefix is always the same size, whereas
my data is not uniform and the prefixes are varying sizes.
My data is changed every month and so I want to design a dynamic solution that isn't relying on substrings since I could run into
issues with only taking 9 characters. My final consideration is to extract the SubString_Company_Name
into a list, convert that to the os.path.commonprefix of the Company_Name and then save the unique commonprefix value of each
Company_Name into a new list and for each item in that list create a new summary table. But I don't know if this would work and
I want to know if there's a better or more efficient way of doing this before trying.
You can use groupby.agg after dropping duplicates, with series.str.split and the first string from the split (.str[0]) as the grouper:
d= {'Company_Name':'first','2019_Amt':'sum',
'2020_Amt':'sum','Code':'first','Flag':'first'}
grouper = df['Company_Name'].str.split().str[0]
out = df.drop_duplicates().groupby(grouper).agg(d).reset_index(drop=True)
print(out)
Company_Name 2019_Amt 2020_Amt Code Flag
0 ABC Company Ltd 2000 400 A Y
1 DEFGHIJ Company (London) 2580 1800 B N
2 KLM Services 9000 7000 A Y
3 NOPQ Industries 7300 8400 C Y
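Note that grouping on the first word keeps the first full name in each group, so the DEFGHIJ row still carries "(London)". If the office suffix should be removed from the name as well, one possible variation is to strip it with a regular expression and group on the cleaned name. This is a sketch under a loud assumption: office names always appear either in trailing parentheses or after " - ", which holds for the sample data but may not hold for real data.

```python
import pandas as pd

df = pd.DataFrame({
    "Company_Num": [1, 1, 2, 3, 4, 5, 6, 7],
    "Company_Name": [
        "ABC Company Ltd", "ABC Company Ltd",
        "DEFGHIJ Company (London)", "DEFGHIJ Company (Bristol)",
        "DEFGHIJ Company (York)", "KLM Services",
        "NOPQ Industries", "NOPQ Industries - London",
    ],
    "2019_Amt": [2000, 2000, 480, 600, 1500, 9000, 300, 7000],
    "2020_Amt": [400, 400, 100, 700, 1000, 7000, 400, 8000],
    "Code": ["A", "A", "B", "B", "B", "A", "C", "C"],
    "Flag": ["Y", "Y", "N", "N", "N", "Y", "Y", "Y"],
    "Manager": ["John", "John", "James", "James", "James", "Jane", "Jen", "Jen"],
})

# Assumed convention: drop a trailing "(Office)" or " - Office" suffix.
base_name = df["Company_Name"].str.replace(
    r"\s*(?:\(.*\)|-\s.*)$", "", regex=True)

out = (
    df.drop_duplicates()                 # remove the repeated ABC row
      .assign(Company_Name=base_name)    # aligned on the original index
      .groupby("Company_Name", as_index=False)
      .agg({"2019_Amt": "sum", "2020_Amt": "sum",
            "Code": "first", "Flag": "first"})
)
print(out)
```

Company names that legitimately contain a hyphen or parentheses would be truncated by this pattern, so it should be checked against the real list of names before use.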
I am relatively new to Python. Suppose I have the following two dataframes, let's say df1 and df2 respectively.
df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support
df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if the name of the person (in df2) does not exist in df1, then that row in df2 is output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column in df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
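For comparison, the same rows can be selected with a left merge and the indicator flag, which is sometimes called an anti-join. This is an alternative sketch, not the method above; the sample frames are rebuilt from the question, and "_merge" is the column name pandas adds when indicator=True.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["Jim", "Bob", "Sam"],
    "Job": ["Tester", "Developer", "Support"],
})
df2 = pd.DataFrame({
    "Name": ["Jim", "Bob", "Si", "Sue"],
    "Salary": [100, 200, 300, 400],
    "Location": ["Japan", "US", "UK", "France"],
})

# Left-merge df2 against the unique names of df1 and record the match status.
merged = df2.merge(df1[["Name"]].drop_duplicates(),
                   on="Name", how="left", indicator=True)

# "left_only" marks rows of df2 whose Name has no partner in df1.
res = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(res)
```

The isin version is shorter for a single key column; the merge form tends to be more convenient when the comparison involves several columns at once.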