I am working with a Pandas DataFrame, built as follows:
import pandas as pd
import numpy as np

data = pd.DataFrame()
data['HomeTeam'] = ['A','B','C','D','E']
data['AwayTeam'] = ['E','D','A','B','C']
data['HomePoint'] = [1,3,0,1,3]
data['AwayPoint'] = [1,0,3,1,0]
data['Match'] = data['HomeTeam'].astype(str) + ' Vs ' + data['AwayTeam'].astype(str)
# I want to duplicate each match
Nsims = 2
data_Dub = pd.DataFrame(np.tile(data, (Nsims, 1)))  # np.tile directly; the pd.np alias was removed from pandas
data_Dub.columns = data.columns
# Then I assign the stage of the match
data_Dub['SimStage'] = data_Dub.groupby('Match').cumcount()
What I want to do is sum the home and away points obtained by each team and save the result to a new dataframe: home points and away points added together per team, for each of the 5 teams, per simulation stage. Can anyone advise how to do it? I used the following code and it does not work:
Point = data_Dub.groupby(['SimStage','HomeTeam','AwayTeam'])[['HomePoint','AwayPoint']].sum()
Thanks.
You can aggregate the sums separately for HomeTeam and AwayTeam, combine them with add, then sort_index and reset_index to move the MultiIndex into columns, rename the team column, and reorder the columns if necessary:
# points per team and simulation, away and at home
a = data_Dub.groupby(['AwayTeam', 'SimStage'])['AwayPoint'].sum()
b = data_Dub.groupby(['HomeTeam', 'SimStage'])['HomePoint'].sum()
# the (team, stage) index labels align, so add sums both totals per team
s = a.add(b).rename('Point')
df = s.sort_index(level=[1, 0]).reset_index().rename(columns={'AwayTeam': 'Team'})
df = df[['Team', 'Point', 'SimStage']]
print(df)
Team Point SimStage
0 A 4 0
1 B 4 0
2 C 0 0
3 D 1 0
4 E 4 0
5 A 4 1
6 B 4 1
7 C 0 1
8 D 1 1
9 E 4 1
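An equivalent sketch (using the same data_Dub) reshapes the home and away halves into one long frame, so a single groupby does the summing:

# Stack home and away results into long form, then total the points per
# simulation stage and team (same data_Dub as above).
home = data_Dub[['SimStage', 'HomeTeam', 'HomePoint']].rename(
    columns={'HomeTeam': 'Team', 'HomePoint': 'Point'})
away = data_Dub[['SimStage', 'AwayTeam', 'AwayPoint']].rename(
    columns={'AwayTeam': 'Team', 'AwayPoint': 'Point'})
points = (pd.concat([home, away])
            .groupby(['SimStage', 'Team'], as_index=False)['Point'].sum())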
Sorry if the title is not clear enough. Let me explain what I want to achieve.
I have this DataFrame, let's call it df:
id | Area
A one
A two
A one
B one
B one
C one
C two
D one
D one
D two
D three
I would like to create a new DataFrame based on the values in the existing one. First, I would like to find the total number of entries for each distinct id in df, e.g. id A has 3 entries, B has 2 entries, etc. Let's call the new DataFrame df_new:
id | count
A 3
B 2
C 2
D 4
Next, I would like to create new columns based on the values in df['Area']. In this example, df['Area'] contains 3 distinct values (one, two, three), and I would like to count how many times each id has been in each Area. For example, id A has been in area one twice, in area two once, and in area three zero times. I will then append those values as new columns called one, two and three.
df_new:
id | count | one | two | three
A 3 2 1 0
B 2 2 0 0
C 2 1 1 0
D 4 2 1 1
I have developed my own code which produces df_new, but I believe Pandas has a better way to perform this sort of data extraction. Here is my code:
#Read the data
df = pd.read_csv('test_data.csv', sep = ',')
df.columns = ['id', 'Area'] #Rename
# Count the total number of Area entries per id
df_new = pd.DataFrame({'count' : df.groupby("id")["Area"].count()})
# Reset index
df_new = df_new.reset_index()
# For-loop for counting and creating a new column for each area in df['Area']
for i in range(len(df)):
    # Get the id
    idx = df['id'][i]
    # Get the area name
    area_name = str(df["Area"][i])
    # Retrieve the df_new index of this particular id
    current_index = df_new.loc[df_new['id'] == idx].index[0]
    # If the area name already exists as a column
    if area_name in df_new.columns:
        # Then +1 at the location of the idx (index)
        df_new[area_name][current_index] += 1
    # If it does not exist in the columns
    else:
        # Create a new column filled with zeros
        df_new[area_name] = 0
        # Then +1 at the location of the idx (index)
        df_new[area_name][current_index] += 1
The code is long and hard to read, and it triggers the warning "A value is trying to be set on a copy of a slice from a DataFrame". I would like to learn how to write this more effectively.
Thank you
You can use df.groupby.count for the first part and pd.crosstab for the second, then use pd.concat to join them:
In [1246]: pd.concat([df.groupby('id').count().rename(columns={'Area' : 'count'}),
      ...:            pd.crosstab(df.id, df.Area)], axis=1)
Out[1246]:
count one three two
id
A 3 2 0 1
B 2 2 0 0
C 2 1 0 1
D 4 2 1 1
Here's the first part using df.groupby:
df.groupby('id').count().rename(columns={'Area' : 'count'})
count
id
A 3
B 2
C 2
D 4
Here's the second part with pd.crosstab:
pd.crosstab(df.id, df.Area)
Area one three two
id
A 2 0 1
B 2 0 0
C 1 0 1
D 2 1 1
For the second part, you can also use pd.get_dummies and take a dot product (the astype(int) matters on newer pandas, where get_dummies returns boolean columns):
pd.get_dummies(df.id).astype(int).T.dot(pd.get_dummies(df.Area).astype(int))
one three two
A 2 0 1
B 2 0 0
C 1 0 1
D 2 1 1
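If you want the Area columns in order of first appearance (one, two, three, as in the desired df_new) rather than crosstab's alphabetical order, here is a small follow-up sketch using the same df:

# Reorder the Area columns by first appearance and prepend the count,
# which is just the row sum of the per-area counts.
order = df['Area'].unique()  # array(['one', 'two', 'three'])
df_new = pd.crosstab(df.id, df.Area)[order]
df_new.insert(0, 'count', df_new.sum(axis=1))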
I have 2 columns, User_ID and Item_ID. Now I want to make a new column 'Reordered' containing either 0 or 1: 0 when a particular user has ordered an item only once, and 1 when a particular user has ordered an item more than once.
I think this can be done by grouping on User_ID and then using apply to map duplicated items to 1 and non-duplicated ones to 0, but I'm not able to figure out the correct Python code for that.
Could someone please help me with this?
You can use Series.duplicated with the parameter keep=False to mark all duplicates; the output is a boolean Series, which you can convert to integers with astype:
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
Sample:
df = pd.DataFrame({'User_ID': list('aaabaccd'),
                   'Item_ID': list('eetyutyu')})
df['Reordered'] = df['User_ID'].duplicated(keep=False).astype(int)
print (df)
Item_ID User_ID Reordered
0 e a 1
1 e a 1
2 t a 1
3 y b 0
4 u a 1
5 t c 1
6 y c 1
7 u d 0
Or, if you need to check for duplicates per user and item pair, use DataFrame.duplicated with both columns:
df['Reordered'] = df.duplicated(['User_ID','Item_ID'], keep=False).astype(int)
print (df)
Item_ID User_ID Reordered
0 e a 1
1 e a 1
2 t a 0
3 y b 0
4 u a 0
5 t c 0
6 y c 0
7 u d 0
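An equivalent sketch (same sample df) counts the size of each (User_ID, Item_ID) group and flags groups larger than one:

# transform('size') broadcasts each pair's row count back onto its rows,
# so pairs occurring more than once become 1 after the comparison.
sizes = df.groupby(['User_ID', 'Item_ID'])['Item_ID'].transform('size')
df['Reordered'] = (sizes > 1).astype(int)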
I have two categories, A and B, that can take on 5 different states (values, names or categories) defined by the list abcde. Counting the occurrence of each state and storing it in a data frame is fairly easy. However, I would also like the resulting data frame to include zeros for the possible values that have not occurred in Category A or B.
First, here's a dataframe that matches the description:
In[1]:
import pandas as pd
possibleValues = list('abcde')
df = pd.DataFrame({'Category A':list('abbc'), 'Category B':list('abcc')})
print(df)
Out[1]:
Category A Category B
0 a a
1 b b
2 b c
3 c c
I've tried different approaches with df.groupby(...).size() and .count(), combined with the list of possible values and the names of the categories, but with no success.
Here's the desired output:
Category A Category B
a 1 1
b 2 1
c 1 2
d 0 0
e 0 0
To go one step further, I'd also like to include a column with the totals for each possible state across all categories:
Category A Category B Total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
SO has got many related questions and answers, but to my knowledge none that suggest a solution to this particular problem. Thank you for any suggestions!
P.S. I'd like the solution to be adjustable to the number of categories, possible values and rows.
Use apply with value_counts, then reindex to add the missing states with zero counts, and sum for the total:
cols = ['Category A','Category B']
df1 = df[cols].apply(lambda x: x.value_counts()).reindex(possibleValues, fill_value=0)
df1['total'] = df1.sum(axis=1)
print (df1)
Category A Category B total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
Another solution is to convert the columns to categorical, so the zero counts appear without reindex. The astype('category', categories=...) signature was removed from pandas, so this uses CategoricalDtype:
cols = ['Category A','Category B']
cat_type = pd.CategoricalDtype(categories=possibleValues)
df1 = df[cols].apply(lambda x: x.astype(cat_type).value_counts())
df1['total'] = df1.sum(axis=1)
print (df1)
Category A Category B total
a 1 1 2
b 2 1 3
c 1 2 3
d 0 0 0
e 0 0 0
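A further sketch along the same lines (same df and possibleValues): reshape both category columns into one long column first, so a single crosstab does the counting:

# Melt the two category columns into one long column of states, then
# cross-tabulate and reindex to include states that never occur.
long_df = df.melt(var_name='Category', value_name='state')
df1 = (pd.crosstab(long_df['state'], long_df['Category'])
         .reindex(possibleValues, fill_value=0))
df1['total'] = df1.sum(axis=1)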
Cumsum until the value exceeds a certain number:
Say we have two data frames, A and B, that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
type value
0 a 100
1 b 50
2 c 30
>>> B
type value
0 a 10
1 a 50
2 a 45
3 a 10
4 b 45
5 b 10
6 b 5
7 c 6
8 c 6
9 c 8
10 c 12
11 c 10
For each group in "type" in data frame A, I would like to add up the column value in B until it reaches the number specified in the column value in A, and count the number of rows in B that were needed. I've been trying to use cumsum(), but I don't know exactly how to stop the sum when the value is reached.
The output should be:
type value
0 a 3
1 b 2
2 c 4
Thank you,
Merging the two data frames beforehand should help:
import pandas as pd

# Bring each B row together with its group's target value
df = pd.merge(B, A, on = 'type')
# Running total of B's values within each type group
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
# Keep rows where the running total before this row is still below the
# target, then count the kept rows per type
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
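The shift moves each running total down one row within its group, so a row is kept while the total before it is still below the target; fillna(0) makes the first row of each group compare against zero, so it is always counted.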
Assuming B['type'] is sorted, as in the sample case, here's a NumPy-based solution:
import numpy as np

# Map each row of B to the index of its group in A
IDs = np.searchsorted(A['type'], B['type'])
# Cumulative grand total of each group's values, laid out on the global scale
count_cumsum = np.bincount(IDs, B['value']).cumsum()
# Threshold each group must reach, expressed on that global cumsum scale
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])
# Global running total of B's values
Bv_cumsum = np.cumsum(B['value'])
# Index at which each group starts in B
grp_start = np.unique(IDs, return_index=True)[1]
# Position where each threshold is first reached, relative to the group start
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
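With the sample frames above, this appends the expected counts to A:
print(A)
#   type  value  output
# 0    a    100       3
# 1    b     50       2
# 2    c     30       4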
I have a DataFrame which I want to pass to a function, derive some information from, and then return that information. Originally I set up my code like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
def test_function(df):
    df['D'] = 0
    df.D = np.random.rand(len(df))
    grouped = df.groupby('A')
    df = grouped.first()
    df = df['D']
    return df
Ds = test_function(df)
print(df)
print(Ds)
Which returns:
A B C D
0 1 5 1 0.582319
1 1 5 1 0.269779
2 1 6 1 0.421593
3 1 7 1 0.797121
4 2 5 1 0.366410
5 2 6 1 0.486445
6 2 6 1 0.001217
7 3 7 1 0.262586
8 3 7 1 0.146543
9 4 6 1 0.985894
10 4 7 1 0.312070
11 4 7 1 0.498103
A
1 0.582319
2 0.366410
3 0.262586
4 0.985894
Name: D, dtype: float64
My thinking was: I don't want to copy my large dataframe, so I will add a working column to it and then just return the information I want, without affecting the original dataframe. This of course doesn't work, because I didn't copy the dataframe, so adding a column is adding a column. Currently I'm doing something like:
add column
results = Derive information
delete column
return results
which feels a bit kludgy to me, but I can't think of a better way to do it without copying the dataframe. Any suggestions?
If you do not want to add a column to your original DataFrame, you could create an independent Series and apply the groupby method to the Series instead:
def test_function(df):
    ser = pd.Series(np.random.rand(len(df)))
    grouped = ser.groupby(df['A'])
    return grouped.first()
Ds = test_function(df)
yields
A
1 0.017537
2 0.392849
3 0.451406
4 0.234016
dtype: float64
Thus, test_function does not modify df at all. Notice that ser.groupby can be passed a sequence of values (such as df['A']) by which to group, instead of just the name of a column.
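A related sketch, if you prefer to keep the DataFrame-style code: DataFrame.assign returns a new frame, so the original df stays untouched even though the code reads as if it adds a column:

def test_function(df):
    # assign returns a copy carrying the extra column; df itself is unchanged
    return (df.assign(D=np.random.rand(len(df)))
              .groupby('A')['D'].first())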