I'm having trouble with some pandas groupby object issue, which is the following:
so I have this dataframe:
Letter name num_exercises
A carl 1
A Lenna 2
A Harry 3
A Joe 4
B Carl 5
B Lenna 3
B Harry 3
B Joe 6
C Carl 6
C Lenna 3
C Harry 4
C Joe 7
And I want to add a column on it, called num_exercises_total , which contains the total sum of num_exercises for each letter. Please note that this value must be repeated for each row in the letter group.
The output would be as follows:
Letter name num_exercises num_exercises_total
A carl 1 15
A Lenna 2 15
A Harry 3 15
A Joe 4 15
B Carl 5 18
B Lenna 3 18
B Harry 3 18
B Joe 6 18
C Carl 6 20
C Lenna 3 20
C Harry 4 20
C Joe 7 20
I've tried adding the new column like this:
df['num_exercises_total'] = df.groupby(['letter'])['num_exercises'].sum()
But it returns the value NaN for all the rows.
Any help would be highly appreciated.
Thank you very much in advance!
You may want to check transform
df.groupby(['Letter'])['num_exercises'].transform('sum')
0 10
1 10
2 10
3 10
4 17
5 17
6 17
7 17
8 20
9 20
10 20
11 20
Name: num_exercises, dtype: int64
df['num_of_total']=df.groupby(['Letter'])['num_exercises'].transform('sum')
Transform works perfectly for this question. WenYoBen is right. I am just putting slightly different version here.
df['num_of_total']=df['num_excercises'].groupby(df['Letter']).transform('sum')
>>> df
Letter name num_excercises num_of_total
0 A carl 1 10
1 A Lenna 2 10
2 A Harry 3 10
3 A Joe 4 10
4 B Carl 5 17
5 B Lenna 3 17
6 B Harry 3 17
7 B Joe 6 17
8 C Carl 6 20
9 C Lenna 3 20
10 C Harry 4 20
11 C Joe 7 20
Related
I have the next pd.DataFrame:
Index ID Name Date Days
1 1 Josh 5-1-20 10
2 1 Josh 9-1-20 10
3 1 Josh 19-1-20 6
4 2 Mike 1-1-20 10
5 3 George 1-4-20 10
6 4 Rose 1-2-20 10
7 4 Rose 11-5-20 5
8 5 Mark 1-9-20 10
9 6 Joe 1-4-21 10
10 7 Jill 1-1-21 10
I'm needing to make a DataFrame where the ID is not repeated, for that, I want to creat new columns (Date y Days), considering the case with most repeatitions (3 in this case).
The desired output is the next DataFrame:
Index ID Name Date 1 Date 2 Date 3 Days1 Days2 Days3
1 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
2 2 Mike 1-1-20 10
3 3 George 1-4-20 10
4 4 Rose 1-2-20 11-5-20 10 5
5 5 Mark 1-9-20 10
6 6 Joe 1-4-21 10
7 7 Jill 1-1-21 10
Try:
df_out = df.set_index(['ID','Name',df.groupby('ID').cumcount()+1]).unstack()
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()
Output:
ID Name Index 1 Index 2 Index 3 Date 1 Date 2 Date 3 Days 1 Days 2 Days 3
0 1 Josh 1.0 2.0 3.0 5-1-20 9-1-20 19-1-20 10.0 10.0 6.0
1 2 Mike 4.0 1-1-20 10.0
2 3 George 5.0 1-4-20 10.0
3 4 Rose 6.0 7.0 1-2-20 11-5-20 10.0 5.0
4 5 Mark 8.0 1-9-20 10.0
5 6 Joe 9.0 1-4-21 10.0
6 7 Jill 10.0 1-1-21 10.0
Here is a solution using pivot with a helper column:
df2 = (df
.assign(col=df.groupby('ID').cumcount().add(1).astype(str))
.pivot(index=['ID','Name'], columns='col', values=['Date', 'Days'])
.fillna('')
)
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
Output:
ID Name Date_1 Date_2 Date_3 Days_1 Days_2 Days_3
0 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
1 2 Mike 1-1-20 10
2 3 George 1-4-20 10
3 4 Rose 1-2-20 11-5-20 10 5
4 5 Mark 1-9-20 10
5 6 Joe 1-4-21 10
6 7 Jill 1-1-21 10
Im doing an assignment for a basic programming course.
I have a dataframe (csv-file) containing the columns:
StudentID Name Assignment1 Assignment2 Assignment3
0 s123456 Michael Andersen 7 7 4
1 s123789 Bettina Petersen 12 10 10
2 s123468 Thomas Nielsen -3 7 2
3 s123579 Marie Hansen 10 12 12
4 s123579 Marie Hansen 10 12 12
5 s127848 Andreas Nielsen 2 2 2
6 s120799 Mads Westergaard 12 12 10
7 s123456 Michael Andersen 7 7 4
8 S184507 Andreas Døssing Mortensen 2 2 4
9 S129834 Jonas Jonassen 0 -3 4
10 S123481 Milad Mohammed 12 10 7
11 S128310 Abdul Jihad 10 4 7
12 S125493 Søren Sørensen 0 7 7
13 S128363 123 4 7 10
14 S127463 Jensen Jensen 5 2 10
15 S120987 Jeff Bezos 12 12 12
I need to make my program give an error message if a condition is not meet. In this instance if a student is in the dataframe more than once and if the grade given for an assignment is not on the scale of grades (-3, 0, 2, 4, 7, 10, 12):
The assignment is as follows:
If the user chooses to check for data errors, you must display a report of errors (if any) in the loaded data file. Your program must at least detect and display information about the following possible errors:
1. If two students in the data have the same student id.
2. If a grade in the data set is not one of the possible grades on the 7-step-scale.
How can I get around this?
I have tried to solve it like this, but no luck:
doubles = dataDuplicate["Name"].duplicated()
print(doubles)
grades = np.array([-3,0,2,4,7,10,12])
dataSortGrades = dataSortGrades.iloc[:,2:] #this gives
gradesNotInList = np.isin(dataSortGrades,grades)
if dataDuplicate["Name"] in doubles == True:
print("Error")
else:
print(#list of false values")
The standard approach is:
if condition:
print(message)
where condition (or not condition) and message should be adjusted to your specific needs.
You dont need to create 3 data frames. You can just create the dataframe and then perform you selections based on your conditions.
import pandas as pd
import re
data = """0 s123456 Michael Andersen 7 7 4
1 s123789 Bettina Petersen 12 10 10
2 s123468 Thomas Nielsen -3 7 2
3 s123579 Marie Hansen 10 12 12
4 s123579 Marie Hansen 10 12 12
5 s127848 Andreas Nielsen 2 2 2
6 s120799 Mads Westergaard 12 12 10
7 s123456 Michael Andersen 7 7 4
8 S184507 Andreas Døssing Mortensen 2 2 4
9 S129834 Jonas Jonassen 0 -3 4
10 S123481 Milad Mohammed 12 10 7
11 S128310 Abdul Jihad 10 4 7
12 S125493 Søren Sørensen 0 7 7
13 S128363 123 4 7 10
14 S127463 Jensen Jensen 5 2 10
15 S120987 Jeff Bezos 12 12 12"""
#Make the data frame
data = [re.split(r"\s{2,}", line)[1:] for line in data.splitlines()]
df = pd.DataFrame(data, columns=['StudentID', 'Name', 'Assignment1', 'Assignment2', 'Assignment3'])
#print the duplicates
print(f'###Duplicate studentIDs###')
print(df[df['StudentID'].duplicated()])
#print invalid grades
valid_grades = ('-3', '0', '2', '4', '7', '10', '12')
print(f'###Invalid grades###')
print(df[
(df['Assignment1'].isin(valid_grades) == False) |
(df['Assignment2'].isin(valid_grades) == False) |
(df['Assignment3'].isin(valid_grades) == False)
])
OUTPUT
###Duplicate studentIDs###
StudentID Name Assignment1 Assignment2 Assignment3
4 s123579 Marie Hansen 10 12 12
7 s123456 Michael Andersen 7 7 4
###Invalid grades###
StudentID Name Assignment1 Assignment2 Assignment3
14 S127463 Jensen Jensen 5 2 10
I have a problem with the groupby and pandas, at the beginning I have this chart :
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
i have this :
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nick B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now i with groupby :
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
i got this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 14
3 3 Krishx C 12
4 4 Jack kr D 12
5 4 Jacks D 11
But for me , i need this result :
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
i don't care about Name the Code_name is only thing important for me with sum of T
Thank's
There is 2 ways - for each column with avoid losts add aggreation function - first, last or ', '.join obviuosly for strings columns and aggregation dunctions like sum, mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or if some values are duplicated per groups like here Cat values add this columns to groupby - only order should be changed in output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
see link: https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer
I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?
Using pd.Series.str.extract
This produces a copy with an updated columns
df.assign(Site=df.Site.str.extract('\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the results, reassign to the data frame name
df = df.assign(Site=df.Site.str.extract('\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', 1).str[1])
Alternative
Update instead of producing a copy
df.update(df.Site.str.extract('\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', 1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
Make a array consist of the names you want. Then call
yourarray = pd.DataFrame(yourpd, columns=yournamearray)
Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')
Use .apply() to apply a function to each element in a series:
df['Site Name'] = df['Site Name'].apply(lambda x: x.split('_')[-1])
You can use exactly what you wanted (the strip method)
>>> df["Site"] = df.Site.str.strip("Site_")
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
Hi I'm trying to find the unique Player which show up in every Team.
df =
Team Player Number
A Joe 8
A Mike 10
A Steve 11
B Henry 9
B Steve 19
B Joe 4
C Mike 18
C Joe 6
C Steve 18
C Dan 1
C Henry 3
and the result should be:
result =
Team Player Number
A Joe 8
A Steve 11
B Joe 4
B Steve 19
C Joe 6
C Steve 18
since Joe and Steve are the only Player in each Team
You can use a GroupBy.transform to get a count of unique teams that each player is a member of, and compare this to the overall count of unique teams. This will give you a Boolean array, which you can use to filter your DataFrame:
df = df[df.groupby('Player')['Team'].transform('nunique') == df['Team'].nunique()]
The resulting output:
Team Player Number
0 A Joe 8
2 A Steve 11
4 B Steve 19
5 B Joe 4
7 C Joe 6
8 C Steve 18