For this question, I am using a FIFA dataset. I filtered df to view only players with 4+ skill moves and assigned the result to a variable. I then took a quick snapshot using value_counts() to see which teams held the most players with 4+ skill moves. Ideally, I would like to preserve this view, because the ranking is easy to understand.
My question is: what if I wanted to add a new column that gives the count of 4-skillers for each row/club_name and, similarly, another column that gives the count of 5-skillers? For example, let's say Real Madrid had three 5-skillers and nine 4-skillers. The new columns would each show the counts accordingly. What would be the best way to do this?
*edit: df.skill_moves is an int column ranging 1-5.
You can have multiple named aggregates like so:
fourfive_skillers.groupby('club_name')['skill_moves'].agg(
    total='count',
    four_skills=lambda x: sum(x == 4),
    five_plus_skills=lambda x: sum(x >= 5))
I have a different dataset than you, but the output would be similar to:
Out[52]:
total four_skills five_plus_skills
club_name
1. FC Kaiserslautern 1 1 0
1. FC Köln 1 1 0
1. FC Nürnberg 4 4 0
1. FC Union Berlin 1 1 0
1. FSV Mainz 05 2 1 1
... ... ... ...
Wolverhampton Wanderers 5 5 0
Yeni Malatyaspor 1 1 0
Yokohama F. Marinos 1 1 0
Çaykur Rizespor 1 1 0
Śląsk Wrocław 1 1 0
Another commonly done thing is to have percentages of the total for each additional column. You can do that like this:
fourfive_skillers.groupby('club_name')['skill_moves'].agg(
    total='count',
    four_skills=lambda x: sum(x == 4),
    four_skills_pct=lambda x: sum(x == 4) / len(x),
    five_plus_skills=lambda x: sum(x >= 5),
    five_plus_skills_pct=lambda x: sum(x >= 5) / len(x))
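To keep the ranked view that value_counts() gave, one option is to sort the aggregated table by the total column. The following is a minimal sketch on made-up data; the toy rows and the name summary are assumptions for illustration, not taken from the FIFA dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the FIFA dataset
df = pd.DataFrame({
    "club_name": ["Real Madrid", "Real Madrid", "Real Madrid",
                  "FC Barcelona", "FC Barcelona", "Ajax"],
    "skill_moves": [5, 4, 4, 4, 3, 5],
})

# Keep only players with 4+ skill moves, as in the question
fourfive_skillers = df[df["skill_moves"] >= 4]

# Named aggregation, then sort so the clubs with the most 4+ skillers come first
summary = (
    fourfive_skillers.groupby("club_name")["skill_moves"]
    .agg(
        total="count",
        four_skills=lambda x: (x == 4).sum(),
        five_skills=lambda x: (x == 5).sum(),
    )
    .sort_values("total", ascending=False)
)
print(summary)
```

Sorting by total reproduces the ordering of value_counts(), while the extra columns break the count down by skill level.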
I am trying to create a variable that displays how many days a bulb was functional, based on several score variables (Score_Day_0, Score_Day_10, and so on).
The dataset I am using looks like the one below, where the scores on different days range from 1 (working very well) to 10 (stopped working).
What I want is to understand how to create the variable Days, which shows the number of days the bulb was working. For example, for sample 2 the score on day 10 is 8 and on day 20 it is 10 (stopped working), so the number of days the bulb was working is 20.
Any suggestions?
Thank you so much for your help, hope you have a terrific day!!
sample    Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1  1            3             5             8             10            40
sample 2  3            8             10            10            10            20
I've tried to solve this myself with a conditional loop, but the number of observations in Days is much higher than the number of observations in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10....,'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append('10')
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append('20')
for n in cols['Score_Day_30']:
    if n == 10:
        Days.append('30')
for m in cols['Score_Day_40']:
    if m == 10:
        Days.append('40')
You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a given function on each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives each row as a Series object. To find the first occurrence of the maximum, we filter the row with a boolean condition and take the first entry of the resulting index, which is what we are looking for: the label of the column where the row first reaches its maximal value.
lambda x: x[x == x.max()].index[0]
Example:
df = pd.DataFrame(dict(d0=[1,1,1],d10=[1,5,10],d20=[5,10,10],d30=[8,10,10]))
# d0 d10 d20 d30
# 0 1 1 5 8
# 1 1 5 10 10
# 2 1 10 10 10
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
# d0 d10 d20 d30 days
# 0 1 1 5 8 d30
# 1 1 5 10 10 d20
# 2 1 10 10 10 d10
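The same idea can be sketched without apply by using DataFrame.idxmax(axis=1), which returns the first column label holding each row's maximum. The example below rebuilds the two sample rows from the question and turns the label into a number of days; it assumes that every score column is named Score_Day_<n> and that the row maximum is the "stopped working" score.

```python
import pandas as pd

# The two sample rows from the question
df = pd.DataFrame(
    {
        "Score_Day_0": [1, 3],
        "Score_Day_10": [3, 8],
        "Score_Day_20": [5, 10],
        "Score_Day_30": [8, 10],
        "Score_Day_40": [10, 10],
    },
    index=["sample 1", "sample 2"],
)

# Assumption: all score columns share the "Score_Day_" prefix
score_cols = [c for c in df.columns if c.startswith("Score_Day_")]

# idxmax(axis=1) gives the first column (left to right) with the row maximum
first_max_col = df[score_cols].idxmax(axis=1)

# Strip the prefix to keep only the day number
df["Days"] = first_max_col.str.replace("Score_Day_", "", regex=False).astype(int)
print(df)
```

If a bulb never reaches a score of 10, this still reports the day of its highest score, so an extra check on the maximum may be needed for such rows.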
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that wouldn't be a problem, because a NULL is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get variance for each different X value and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include a NULL (zero) value of Y for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every A should have X = 1, 2 and 3, and when there is no value of Y for a certain X it should be treated as 0. Could you help me with this? How should I change my TableA dataframe to be able to do this, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np
# three in your example. adjust as needed
a_choices = len(TableA['A'].unique())
def variance_with_missing(vals):
    # mean over all possible A values, counting the missing ones as 0
    mean_with_missing = np.sum(vals) / a_choices
    ss_present = np.sum((vals - mean_with_missing)**2)
    # each missing value is 0, so it contributes (0 - mean)**2
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)
TableA.groupby(['X']).agg({'Y': variance_with_missing})
The approach below appends every missing (A, X) combination with Y=0. It is a little messy, but I hope it helps.
import pandas as pd
TableA = pd.DataFrame({'A':[1,1,2,2,2,3,3],
                       'X':[1,2,1,2,3,1,2],
                       'Y':[30,20,15,20,20,30,35]})
TableA['A'] = TableA['A'].astype(int)
#### Create a row for each missing combination and fill Y with 0 ####
for i in range(1,TableA.X.max()+1):
    for j in TableA.A.unique():
        # .empty avoids taking the truth value of an array
        if TableA[(TableA.X==i) & (TableA.A==j)].empty:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            TableA = pd.concat([TableA, pd.DataFrame({'A':[j],'X':[i],'Y':[0]})], ignore_index=True)
TableA.groupby('X').agg({'Y':'var'})
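A loop-free variant of the same fill-with-zero idea can be sketched with reindex over the full set of (A, X) combinations. This assumes each (A, X) pair appears at most once in TableA; the names full_index, filled and result are introduced here for illustration only.

```python
import pandas as pd

TableA = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3, 3],
    "X": [1, 2, 1, 2, 3, 1, 2],
    "Y": [30, 20, 15, 20, 20, 30, 35],
})

# Every combination of the observed A and X values
full_index = pd.MultiIndex.from_product(
    [sorted(TableA["A"].unique()), sorted(TableA["X"].unique())],
    names=["A", "X"],
)

# Missing combinations are inserted with Y = 0
filled = (
    TableA.set_index(["A", "X"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)

# Sample variance (ddof=1) per X, now including the zero rows
result = filled.groupby("X")["Y"].var()
print(result)
```

On the sample data this should give 75, 75 and about 133.33 for X = 1, 2 and 3, matching the desired output in the question.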
I have a df named value of size 567 and it has a column index as follows:
index
96.875
96.6796875
96.58203125
96.38671875
95.80078125
94.7265625
94.62890625
94.3359375
58.88671875
58.7890625
58.69140625
58.59375
58.49609375
58.3984375
58.30078125
58.203125
I also have 2 additional variables:
mu = 56.80877955613938
sigma= 17.78935620293665
What I want is to check the values in the index column. If the value is greater than, say, mu+3*sigma, a new column named alarm must be added to the value df and a value of 4 must be added.
I tried:
for i in value['index']:
    if (i >= mu+3*sigma):
        value['alarm'] = 4
    elif ((i < mu+3*sigma) and (i >= mu+2*sigma)):
        value['alarm'] = 3
    elif((i < mu+2*sigma) and (i >= mu+sigma)):
        value['alarm'] = 2
    elif ((i < mu+sigma) and (i >= mu)):
        value['alarm'] = 1
But it creates an alarm column and fills it completely with 1.
What mistake am I making here?
Expected output:
index alarm
96.875 3
96.6796875 3
96.58203125 3
96.38671875 3
95.80078125 3
94.7265625 3
94.62890625 3
94.3359375 3
58.88671875 1
58.7890625 1
58.69140625 1
58.59375 1
58.49609375 1
58.3984375 1
58.30078125 1
58.203125 1
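The mistake in your loop is that value['alarm'] = 4 assigns the scalar to the whole column rather than to the current row, so after the loop the column holds whatever the last row's branch assigned (1 here). If you have multiple conditions, you don't want to loop through your dataframe and use if, elif, else. A better solution is to use np.select, where we define a list of conditions and a matching list of choices: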
conditions=[
value['index'] >= mu+3*sigma,
(value['index'] < mu+3*sigma) & (value['index'] >= mu+2*sigma),
(value['index'] < mu+2*sigma) & (value['index'] >= mu+sigma),
]
choices = [4, 3, 2]
value['alarm'] = np.select(conditions, choices, default=1)
value
alarm
index
96.875000 3
96.679688 3
96.582031 3
96.386719 3
95.800781 3
94.726562 3
94.628906 3
94.335938 3
58.886719 1
58.789062 1
58.691406 1
58.593750 1
58.496094 1
58.398438 1
58.300781 1
58.203125 1
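Because the conditions are consecutive ranges, the same binning can also be sketched with pd.cut. This is an alternative to np.select, not the method used above. It assumes, like default=1 in the np.select version, that everything below mu+sigma (including values below mu) gets alarm 1; the extra test values 120.0 and 80.0 are made up to exercise the other bins.

```python
import numpy as np
import pandas as pd

mu = 56.80877955613938
sigma = 17.78935620293665

# A few values from the question plus two made-up ones (120.0 and 80.0)
value = pd.DataFrame({"index": [96.875, 58.203125, 120.0, 80.0]})

# Bin edges; right=False makes each bin closed on the left: [low, high)
bins = [-np.inf, mu + sigma, mu + 2 * sigma, mu + 3 * sigma, np.inf]

value["alarm"] = pd.cut(
    value["index"], bins=bins, labels=[1, 2, 3, 4], right=False
).astype(int)
print(value)
```

With right=False a value exactly equal to mu+2*sigma lands in the higher bin, which matches the >= comparisons in the original conditions.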
If you have 10 minutes to spare, here's a good post by CS95 explaining why looping over a dataframe is bad practice.
I have the following data:
df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
'score': [10, 5, 6, 7, 11, 1]})
print(df)
sound score
0 A 10
1 B 5
2 B 6
3 A 7
4 B 11
5 A 1
If I standardize (i.e. Z score) the score variable, I get the following values. The mean of the new z column is basically 0, with SD of 1, both of which are expected for a standardized variable:
df['z'] = (df['score'] - df['score'].mean())/df['score'].std()
print(df)
print('Mean: {}'.format(df['z'].mean()))
print('SD: {}'.format(df['z'].std()))
sound score z
0 A 10 0.922139
1 B 5 -0.461069
2 B 6 -0.184428
3 A 7 0.092214
4 B 11 1.198781
5 A 1 -1.567636
Mean: -7.401486830834377e-17
SD: 1.0
However, what I'm actually interested in is calculating Z scores based on group membership (sound). For example, if a score is from sound A, then convert that value to a Z score using the mean and SD of sound A values only. Likewise, sound B Z scores will only use mean and SD from sound B. This will obviously produce different values compared to regular Z score calculation:
df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
print('Mean: {}'.format(df['zg'].mean()))
print('SD: {}'.format(df['zg'].std()))
sound score z zg
0 A 10 0.922139 0.872872
1 B 5 -0.461069 -0.725866
2 B 6 -0.184428 -0.414781
3 A 7 0.092214 0.218218
4 B 11 1.198781 1.140647
5 A 1 -1.567636 -1.091089
Mean: 3.700743415417188e-17
SD: 0.894427190999916
My question is: why is the mean of the group-based standardized values (zg) also basically equal to 0? Is this expected behaviour or is there an error in my calculation somewhere?
The z scores make sense because standardizing within a variable essentially forces the mean to 0. But the zg values are calculated using different means and SDs for each sound group, so I'm not sure why the mean of that new variable has also been set to 0.
The only situation where I can see this happening is if the sum of values > 0 is equal to sum of values < 0, which when averaged would cancel out to 0. This happens in a regular Z score calculation but I'm surprised that this also happens when operating across multiple groups like this...
I think it makes perfect sense. If E[abc | def] is the expectation of abc given def, then in df['zg']:
m1 = E['zg' | sound = 'A'] = (0.872872 + 0.218218 -1.091089)/3 ~ 0
m2 = E['zg' | sound = 'B'] = (-0.725866 - 0.414781 + 1.140647)/3 ~ 0
and
E['zg'] = (m1+m2)/2 = (0.872872 + 0.218218 -1.091089 -0.725866 - 0.414781 + 1.140647)/6 ~ 0
Yes, this is expected behavior.
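In fancy words, using the Law of Iterated Expectations,
$$E[X] = E\big[E[X \mid Y]\big]$$
And specifically, if the groups Y are finite and thus countable,
$$E[X] = \sum_{j \in G} E[X \mid Y_j]\, P(Y_j)$$
where P(Y_j) is the share of observations that fall in group Y_j.
However, by construction, E[X|Y_j] is 0 for every group Y_j in your set G of possible groups.
Thus, the total average will also be zero.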
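As a quick numerical check of this argument, the sketch below recomputes the group-wise z-scores from the question and confirms that each group mean, and therefore their size-weighted average, is zero up to floating-point error. The names group_means, group_sizes and weighted are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sound": ["A", "B", "B", "A", "B", "A"],
    "score": [10, 5, 6, 7, 11, 1],
})

# Group-wise z-scores, as in the question
df["zg"] = df.groupby("sound")["score"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Mean of zg inside each group: zero by construction
group_means = df.groupby("sound")["zg"].mean()
group_sizes = df.groupby("sound")["zg"].size()

# Overall mean as the size-weighted average of the group means
weighted = (group_means * group_sizes).sum() / group_sizes.sum()

print(group_means)
print(weighted, df["zg"].mean())
```

The weighted average of the group means equals the overall mean of zg, which is why the latter is also zero.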
I have a dataset wherein I am trying to determine the number of risk factors per person. So I have the following data:
Person_ID Age Smoker Diabetes
001 30 Y N
002 45 N N
003 27 N Y
004 18 Y Y
005 55 Y Y
Each attribute (Age, Smoker, Diabetes) has its own condition to determine whether it is a risk factor. So if Age >= 45, it's a risk factor. Smoker and Diabetes are risk factors if they are "Y". What I would like is to add a column that adds up the number of risk factors for each person based on those conditions. So the data would look like this:
Person_ID Age Smoker Diabetes Risk_Factors
001 30 Y N 1
002 25 N N 0
003 27 N Y 1
004 18 Y Y 2
005 55 Y Y 3
I have a sample dataset that I was fooling around with in Excel, and the way I did it there was to use the COUNTIF formula like so:
=COUNTIF(B2,">45") + COUNTIF(C2,"=Y") + COUNTIF(D2,"=Y")
However, the actual dataset that I will be using is way too large for Excel, so I'm learning pandas for python. I wish I could provide examples of what I've already tried, but frankly I don't even know where to start. I looked at this question, but it doesn't really address what to do about applying it to an entire new column using different conditions from multiple columns. Any suggestions?
I would do this the following way.
1. For each column, create a new boolean series using the column's condition
2. Add those series row-wise
(Note that this is simpler if your Smoker and Diabetes columns are already boolean (True/False) instead of strings.)
It might look like this:
df = pd.DataFrame({'Age': [30,45,27,18,55],
'Smoker':['Y','N','N','Y','Y'],
'Diabetes': ['N','N','Y','Y','Y']})
Age Diabetes Smoker
0 30 N Y
1 45 N N
2 27 Y N
3 18 Y Y
4 55 Y Y
#Step 1
risk1 = df.Age > 45
risk2 = df.Smoker == "Y"
risk3 = df.Diabetes == "Y"
risk_df = pd.concat([risk1,risk2,risk3],axis=1)
Age Smoker Diabetes
0 False True False
1 False False False
2 False False True
3 False True True
4 True True True
df['Risk_Factors'] = risk_df.sum(axis=1)
Age Diabetes Smoker Risk_Factors
0 30 N Y 1
1 45 N N 0
2 27 Y N 1
3 18 Y Y 2
4 55 Y Y 3
If you want to stick with pandas, you can use the following.
Solution
isY = lambda x:int(x=='Y')
countRiskFactors = lambda row: isY(row['Smoker']) + isY(row['Diabetes']) + int(row["Age"]>45)
df['Risk_Factors'] = df.apply(countRiskFactors,axis=1)
How it works
isY - a stored lambda function that checks whether the value of a cell is Y; it returns 1 if it is, otherwise 0
countRiskFactors - adds up the risk factors for one row
the final line uses the apply method with the axis parameter set to 1, which applies the function (the first argument) row-wise along the DataFrame and returns a Series that is assigned to the new Risk_Factors column.
output of print df
Person_ID Age Smoker Diabetes Risk_Factors
0 1 30 Y N 1
1 2 45 N N 0
2 3 27 N Y 1
3 4 18 Y Y 2
4 5 55 Y Y 3
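If more risk factors are added later, the per-column conditions can be kept in one place. The following is only a sketch of that idea: the dictionary risk_rules and the name flags are hypothetical, and the Age > 45 threshold follows the Excel formula in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Person_ID": ["001", "002", "003", "004", "005"],
    "Age": [30, 45, 27, 18, 55],
    "Smoker": ["Y", "N", "N", "Y", "Y"],
    "Diabetes": ["N", "N", "Y", "Y", "Y"],
})

# Hypothetical mapping: column name -> condition that marks a risk factor
risk_rules = {
    "Age": lambda s: s > 45,
    "Smoker": lambda s: s == "Y",
    "Diabetes": lambda s: s == "Y",
}

# One boolean column per rule, evaluated on the whole column at once
flags = pd.DataFrame({col: rule(df[col]) for col, rule in risk_rules.items()})

# True counts as 1, so the row-wise sum is the number of risk factors
df["Risk_Factors"] = flags.sum(axis=1)
print(df)
```

Each rule operates on a whole column, so this stays vectorized and avoids a row-wise apply on large datasets.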
If you are starting from Excel and want to take the next step, then I would recommend MS Access. It will be a lot easier than learning pandas for Python. You should just replace the COUNTIF() with:
Risk Factor: IIF(Age>45, 1, 0) + IIF(Smoker="Y", 1, 0) + IIF(Diabetes="Y", 1, 0)