sum the values of a group by object

I'm having trouble with a pandas groupby problem.
I have this dataframe:
Letter name num_exercises
A carl 1
A Lenna 2
A Harry 3
A Joe 4
B Carl 5
B Lenna 3
B Harry 3
B Joe 6
C Carl 6
C Lenna 3
C Harry 4
C Joe 7
I want to add a column to it, called num_exercises_total, which contains the sum of num_exercises for each letter. Note that this value must be repeated on every row of the letter group.
The output would be as follows:
Letter name num_exercises num_exercises_total
A carl 1 10
A Lenna 2 10
A Harry 3 10
A Joe 4 10
B Carl 5 17
B Lenna 3 17
B Harry 3 17
B Joe 6 17
C Carl 6 20
C Lenna 3 20
C Harry 4 20
C Joe 7 20
I've tried adding the new column like this:
df['num_exercises_total'] = df.groupby(['Letter'])['num_exercises'].sum()
But it returns NaN for all the rows.
Any help would be highly appreciated.
Thank you very much in advance!

You may want to use transform, which returns one value per original row instead of one value per group:
df.groupby(['Letter'])['num_exercises'].transform('sum')
0 10
1 10
2 10
3 10
4 17
5 17
6 17
7 17
8 20
9 20
10 20
11 20
Name: num_exercises, dtype: int64
df['num_of_total']=df.groupby(['Letter'])['num_exercises'].transform('sum')
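To illustrate why the plain sum() assignment produced NaN while transform works, here is a minimal sketch. It rebuilds the question's data by hand; the variable names per_group, misaligned and via_map are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Letter": list("AAAABBBBCCCC"),
    "name": ["Carl", "Lenna", "Harry", "Joe"] * 3,
    "num_exercises": [1, 2, 3, 4, 5, 3, 3, 6, 6, 3, 4, 7],
})

# sum() returns one row per group, indexed by the group label (A, B, C).
per_group = df.groupby("Letter")["num_exercises"].sum()

# Column assignment aligns on the index. The labels A/B/C never match the
# row labels 0..11, so every row ends up as NaN (reindex mimics that here).
misaligned = per_group.reindex(df.index)

# transform() keeps the original row index, so the result lines up with df.
df["num_exercises_total"] = df.groupby("Letter")["num_exercises"].transform("sum")

# Equivalent alternative: look up each row's Letter in the per-group sums.
via_map = df["Letter"].map(per_group)
```

Either transform or map gives 10 for every A row, 17 for every B row and 20 for every C row.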

transform works perfectly for this question; WenYoBen is right. Here is a slightly different version:
df['num_of_total']=df['num_exercises'].groupby(df['Letter']).transform('sum')
>>> df
Letter name num_exercises num_of_total
0 A carl 1 10
1 A Lenna 2 10
2 A Harry 3 10
3 A Joe 4 10
4 B Carl 5 17
5 B Lenna 3 17
6 B Harry 3 17
7 B Joe 6 17
8 C Carl 6 20
9 C Lenna 3 20
10 C Harry 4 20
11 C Joe 7 20

Related

How to pivot a DataFrame creating new columns, considering the max item repeated

I have the following pd.DataFrame:
Index ID Name Date Days
1 1 Josh 5-1-20 10
2 1 Josh 9-1-20 10
3 1 Josh 19-1-20 6
4 2 Mike 1-1-20 10
5 3 George 1-4-20 10
6 4 Rose 1-2-20 10
7 4 Rose 11-5-20 5
8 5 Mark 1-9-20 10
9 6 Joe 1-4-21 10
10 7 Jill 1-1-21 10
I need to make a DataFrame where the ID is not repeated. To do that, I want to create new columns (Date and Days), sized for the ID with the most repetitions (3 in this case).
The desired output is the following DataFrame:
Index ID Name Date 1 Date 2 Date 3 Days1 Days2 Days3
1 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
2 2 Mike 1-1-20 10
3 3 George 1-4-20 10
4 4 Rose 1-2-20 11-5-20 10 5
5 5 Mark 1-9-20 10
6 6 Joe 1-4-21 10
7 7 Jill 1-1-21 10

Try:
df_out = df.set_index(['ID','Name',df.groupby('ID').cumcount()+1]).unstack()
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()
Output:
ID Name Index 1 Index 2 Index 3 Date 1 Date 2 Date 3 Days 1 Days 2 Days 3
0 1 Josh 1.0 2.0 3.0 5-1-20 9-1-20 19-1-20 10.0 10.0 6.0
1 2 Mike 4.0 1-1-20 10.0
2 3 George 5.0 1-4-20 10.0
3 4 Rose 6.0 7.0 1-2-20 11-5-20 10.0 5.0
4 5 Mark 8.0 1-9-20 10.0
5 6 Joe 9.0 1-4-21 10.0
6 7 Jill 10.0 1-1-21 10.0

Here is a solution using pivot with a helper column:
df2 = (df
.assign(col=df.groupby('ID').cumcount().add(1).astype(str))
.pivot(index=['ID','Name'], columns='col', values=['Date', 'Days'])
.fillna('')
)
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
Output:
ID Name Date_1 Date_2 Date_3 Days_1 Days_2 Days_3
0 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
1 2 Mike 1-1-20 10
2 3 George 1-4-20 10
3 4 Rose 1-2-20 11-5-20 10 5
4 5 Mark 1-9-20 10
5 6 Joe 1-4-21 10
6 7 Jill 1-1-21 10
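Both answers rely on the same idea: number the rows within each ID with cumcount, then move that number into the columns. A minimal sketch on a subset of the question's rows follows; the helper column name occurrence is an assumption made for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 4, 4],
    "Name": ["Josh", "Josh", "Josh", "Mike", "Rose", "Rose"],
    "Date": ["5-1-20", "9-1-20", "19-1-20", "1-1-20", "1-2-20", "11-5-20"],
    "Days": [10, 10, 6, 10, 10, 5],
})

# Number each row within its ID: 1, 2, 3, ... restarting for every ID.
df["occurrence"] = df.groupby("ID").cumcount() + 1

# Move the occurrence number from the row index into the columns.
wide = df.set_index(["ID", "Name", "occurrence"]).unstack("occurrence")

# Flatten the (column, occurrence) pairs into names like Date1, Days2.
wide.columns = [f"{col}{n}" for col, n in wide.columns]
wide = wide.reset_index()
```

IDs with fewer than three rows get NaN in the unused columns; call fillna('') afterwards if blanks are preferred, as the answers above do.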

How do I give an error message if condition is not met?

I'm doing an assignment for a basic programming course.
I have a dataframe (csv-file) containing the columns:
StudentID Name Assignment1 Assignment2 Assignment3
0 s123456 Michael Andersen 7 7 4
1 s123789 Bettina Petersen 12 10 10
2 s123468 Thomas Nielsen -3 7 2
3 s123579 Marie Hansen 10 12 12
4 s123579 Marie Hansen 10 12 12
5 s127848 Andreas Nielsen 2 2 2
6 s120799 Mads Westergaard 12 12 10
7 s123456 Michael Andersen 7 7 4
8 S184507 Andreas Døssing Mortensen 2 2 4
9 S129834 Jonas Jonassen 0 -3 4
10 S123481 Milad Mohammed 12 10 7
11 S128310 Abdul Jihad 10 4 7
12 S125493 Søren Sørensen 0 7 7
13 S128363 123 4 7 10
14 S127463 Jensen Jensen 5 2 10
15 S120987 Jeff Bezos 12 12 12
I need my program to print an error message if a condition is not met: in this case, if a student appears in the dataframe more than once, or if a grade given for an assignment is not on the grade scale (-3, 0, 2, 4, 7, 10, 12).
The assignment is as follows:
If the user chooses to check for data errors, you must display a report of errors (if any) in the loaded data file. Your program must at least detect and display information about the following possible errors:
1. If two students in the data have the same student id.
2. If a grade in the data set is not one of the possible grades on the 7-step-scale.
How can I get around this?
I have tried to solve it like this, but no luck:
doubles = dataDuplicate["Name"].duplicated()
print(doubles)
grades = np.array([-3,0,2,4,7,10,12])
dataSortGrades = dataSortGrades.iloc[:,2:] #this gives
gradesNotInList = np.isin(dataSortGrades,grades)
if dataDuplicate["Name"] in doubles == True:
print("Error")
else:
print(#list of false values")

The standard approach is:
if condition:
print(message)
where condition (or not condition) and message should be adjusted to your specific needs.

You don't need to create three data frames. You can create one dataframe and then select rows based on your conditions.
import pandas as pd
import re
data = """0 s123456 Michael Andersen 7 7 4
1 s123789 Bettina Petersen 12 10 10
2 s123468 Thomas Nielsen -3 7 2
3 s123579 Marie Hansen 10 12 12
4 s123579 Marie Hansen 10 12 12
5 s127848 Andreas Nielsen 2 2 2
6 s120799 Mads Westergaard 12 12 10
7 s123456 Michael Andersen 7 7 4
8 S184507 Andreas Døssing Mortensen 2 2 4
9 S129834 Jonas Jonassen 0 -3 4
10 S123481 Milad Mohammed 12 10 7
11 S128310 Abdul Jihad 10 4 7
12 S125493 Søren Sørensen 0 7 7
13 S128363 123 4 7 10
14 S127463 Jensen Jensen 5 2 10
15 S120987 Jeff Bezos 12 12 12"""
# Build the data frame (split on runs of two or more spaces, drop the row number)
data = [re.split(r"\s{2,}", line)[1:] for line in data.splitlines()]
df = pd.DataFrame(data, columns=['StudentID', 'Name', 'Assignment1', 'Assignment2', 'Assignment3'])
# Print rows whose StudentID already appeared earlier
print('###Duplicate studentIDs###')
print(df[df['StudentID'].duplicated()])
# Print rows where any grade is off the 7-step scale (grades are strings here)
valid_grades = ('-3', '0', '2', '4', '7', '10', '12')
print('###Invalid grades###')
print(df[
~df['Assignment1'].isin(valid_grades) |
~df['Assignment2'].isin(valid_grades) |
~df['Assignment3'].isin(valid_grades)
])
OUTPUT
###Duplicate studentIDs###
StudentID Name Assignment1 Assignment2 Assignment3
4 s123579 Marie Hansen 10 12 12
7 s123456 Michael Andersen 7 7 4
###Invalid grades###
StudentID Name Assignment1 Assignment2 Assignment3
14 S127463 Jensen Jensen 5 2 10
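If the grades are loaded as numbers (as they would be from a CSV file), the same two checks can be collected into a list of messages and reported in one place. This is only a sketch: the function name report_errors, the message wording and the small sample frame are assumptions, not part of the assignment.

```python
import pandas as pd

VALID_GRADES = [-3, 0, 2, 4, 7, 10, 12]

def report_errors(df, grade_cols):
    """Return a list of error messages (empty if the data is clean)."""
    errors = []
    # keep=False marks every row of a duplicated ID, not only the repeats.
    dup_mask = df["StudentID"].duplicated(keep=False)
    for student_id in df.loc[dup_mask, "StudentID"].unique():
        rows = df.index[df["StudentID"] == student_id].tolist()
        errors.append(f"Duplicate StudentID {student_id} in rows {rows}")
    # Collect every grade that is not on the 7-step scale.
    for col in grade_cols:
        bad = df.loc[~df[col].isin(VALID_GRADES), col]
        for row, grade in bad.items():
            errors.append(f"Invalid grade {grade} in {col}, row {row}")
    return errors

# Hypothetical sample: a few rows taken from the question's data.
df = pd.DataFrame({
    "StudentID": ["s123456", "s123579", "s123579", "S127463"],
    "Name": ["Michael Andersen", "Marie Hansen", "Marie Hansen", "Jensen Jensen"],
    "Assignment1": [7, 10, 10, 5],
    "Assignment2": [7, 12, 12, 2],
})

errors = report_errors(df, ["Assignment1", "Assignment2"])
if errors:
    print("\n".join(errors))
else:
    print("No errors found")
```

Returning the messages instead of printing inside the function keeps the "if condition: print(message)" step at the call site, where the menu logic of the assignment would live.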

Pandas groupby on one column without losing other columns?

I have a problem with groupby in pandas. I start with this data:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nick B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now, with groupby:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 14
3 3 Krishx C 12
4 4 Jack kr D 12
5 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing that matters to me, together with the sum of T.
Thanks

There are two ways. In the first, give every column you want to keep an aggregation function: first, last or ', '.join for string columns, and functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are the same within each group (like the Cat values here), add those columns to the groupby. Only the column order in the output changes:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23

If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23

Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer.
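The dictionary form used in the answers above can also be written with named aggregation (available since pandas 0.25), which states the output column names explicitly. A minimal sketch on the question's data, offered as an alternative spelling rather than a different technique:

```python
import pandas as pd

data = {
    "Code_Name": [1, 2, 3, 4, 1, 2, 3, 4],
    "Name": ["Tom", "Nicko", "Krish", "Jack kr", "Tom", "Nick", "Krishx", "Jacks"],
    "Cat": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "T": [9, 7, 14, 12, 4, 3, 12, 11],
}
df = pd.DataFrame(data)

# One output row per Code_Name; each keyword is output_column=(input_column, function).
result = df.groupby("Code_Name", as_index=False).agg(
    Name=("Name", "first"),  # keep the first spelling seen for each code
    Cat=("Cat", "first"),
    T=("T", "sum"),
)
```

Swapping "first" for "last" on Name reproduces the variant that keeps the last spelling instead.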

How to strip the string and replace the existing elements in DataFrame

I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?

Using pd.Series.str.extract
This produces a copy with an updated column
df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the results, reassign to the data frame name
df = df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', n=1).str[1])
Alternative
Update in place instead of producing a copy
df.update(df.Site.str.extract(r'\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', n=1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob

Make an array consisting of the names you want. Then call
yourarray = pd.DataFrame(yourpd, columns=yournamearray)

Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')

Use .apply() to apply a function to each element in a series:
df['Site'] = df['Site'].apply(lambda x: x.split('_')[-1])

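You can use exactly what you wanted (the strip method), with one caveat: str.strip("Site_") removes any of the characters S, i, t, e and _ from both ends of the string, not the literal prefix. It works for this data because the part you want to keep consists only of digits.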
>>> df["Site"] = df.Site.str.strip("Site_")
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
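One thing to keep in mind with strip is that its argument is a set of characters, not a prefix. The sketch below contrasts it with str.removeprefix (which assumes pandas 1.4 or newer) and with a digit extraction; the value "Site_tie5" is hypothetical and only there to expose the difference.

```python
import pandas as pd

# The last value is hypothetical; the question's data only has "Site_<number>".
sites = pd.Series(["Site_1", "Site_21", "Site_tie5"])

# strip removes any of the characters S, i, t, e, _ from both ends.
stripped = sites.str.strip("Site_")

# removeprefix removes the literal prefix "Site_" and nothing else.
prefix_removed = sites.str.removeprefix("Site_")

# Extract the trailing digits and convert them to integers.
numbers = sites.str.extract(r"(\d+)$", expand=False).astype(int)
```

For the question's data all three give the same digits; they only diverge when the text after the prefix starts with one of the stripped characters.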

Pandas intersection of groups

Hi, I'm trying to find the players who show up in every Team.
df =
Team Player Number
A Joe 8
A Mike 10
A Steve 11
B Henry 9
B Steve 19
B Joe 4
C Mike 18
C Joe 6
C Steve 18
C Dan 1
C Henry 3
and the result should be:
result =
Team Player Number
A Joe 8
A Steve 11
B Joe 4
B Steve 19
C Joe 6
C Steve 18
since Joe and Steve are the only players who appear in every Team

You can use a GroupBy.transform to get a count of unique teams that each player is a member of, and compare this to the overall count of unique teams. This will give you a Boolean array, which you can use to filter your DataFrame:
df = df[df.groupby('Player')['Team'].transform('nunique') == df['Team'].nunique()]
The resulting output:
Team Player Number
0 A Joe 8
2 A Steve 11
4 B Steve 19
5 B Joe 4
7 C Joe 6
8 C Steve 18
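As a cross-check of the transform-based filter, the same players can be found with a plain set intersection over the teams. A minimal sketch on the question's data; the names mask and common are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": list("AAABBBCCCCC"),
    "Player": ["Joe", "Mike", "Steve", "Henry", "Steve", "Joe",
               "Mike", "Joe", "Steve", "Dan", "Henry"],
    "Number": [8, 10, 11, 9, 19, 4, 18, 6, 18, 1, 3],
})

# For every row, count how many distinct teams that row's player belongs to.
teams_per_player = df.groupby("Player")["Team"].transform("nunique")

# Keep rows whose player appears in all teams.
mask = teams_per_player == df["Team"].nunique()
result = df[mask]

# Cross-check: intersect the per-team sets of players.
common = set.intersection(*df.groupby("Team")["Player"].apply(set))
```

Filtering with df[df["Player"].isin(common)] selects the same rows as the mask.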

We Keep Coding
