How can I implement R's case_when function in Python?
Here is the case_when function of R:
https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when
As a minimal working example, suppose we have the following dataframe (Python code follows):
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
df
Suppose that we want to create a new column called 'elderly' that looks at the 'age' column and does the following:
if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >=20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa
Can someone help with this?
You want to use np.select:
conditions = [
    df["age"].lt(10),
    df["age"].ge(10) & df["age"].lt(20),
    df["age"].ge(20) & df["age"].lt(30),
    df["age"].ge(30) & df["age"].lt(50),
    df["age"].ge(50),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]
df["elderly"] = np.select(conditions, choices)
# Results in:
# name age preTestScore postTestScore elderly
# 0 Jason 42 4 25 mature
# 1 Molly 52 24 94 grandpa
# 2 Tina 36 31 57 mature
# 3 Jake 24 2 62 young
# 4 Amy 73 3 70 grandpa
The conditions and choices lists must be the same length.
There is also a default parameter whose value is used when all conditions evaluate to False (if omitted, the default is 0).
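For example, a minimal sketch passing an explicit fallback (the "unknown" label is just illustrative):

df["elderly"] = np.select(conditions, choices, default="unknown")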
np.select is great because it's a general way to assign values to elements in choicelist depending on conditions.
However, for the particular problem the OP is trying to solve, there is a more succinct way to achieve the same result with pandas' cut method.
bin_cond = [-np.inf, 10, 20, 30, 50, np.inf]  # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"]  # the length needs to be len(bin_cond) - 1
# right=False makes each bin closed on the left, matching the >= conditions above
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab, right=False)
# name age preTestScore postTestScore elderly elderly2
# 0 Jason 42 4 25 mature mature
# 1 Molly 52 24 94 grandpa grandpa
# 2 Tina 36 31 57 mature mature
# 3 Jake 24 2 62 young young
# 4 Amy 73 3 70 grandpa grandpa
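Note that pd.cut returns a Categorical column; if you need plain strings later, you can convert it (a small sketch):

df["elderly2"] = df["elderly2"].astype(str)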
pyjanitor has a case_when implementation in dev that could be helpful here. The implementation idea is inspired by if_else in pydatatable and fcase in R's data.table; under the hood, it uses pd.Series.mask:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
df.case_when(
    df.age.lt(10), 'baby',                    # 1st condition, result
    df.age.between(10, 20, 'left'), 'kid',    # 2nd condition, result
    df.age.between(20, 30, 'left'), 'young',  # 3rd condition, result
    df.age.between(30, 50, 'left'), 'mature', # 4th condition, result
    'grandpa',                                # default if none of the conditions match
    column_name='elderly')                    # column name to assign to
name age preTestScore postTestScore elderly
0 Jason 42 4 25 mature
1 Molly 52 24 94 grandpa
2 Tina 36 31 57 mature
3 Jake 24 2 62 young
4 Amy 73 3 70 grandpa
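As of pandas 2.2 there is also a built-in Series.case_when; a minimal sketch, where the base Series supplies the default value:

# requires pandas >= 2.2
default = pd.Series('grandpa', index=df.index)  # base values act as the default
df['elderly'] = default.case_when(
    caselist=[
        (df.age.lt(10), 'baby'),
        (df.age.between(10, 20, 'left'), 'kid'),
        (df.age.between(20, 30, 'left'), 'young'),
        (df.age.between(30, 50, 'left'), 'mature'),
    ]
)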
Alby's solution is more efficient for this use case than an if/else construct.
Instead of numpy, you can create a function and use map or apply with a lambda:
def elderly_function(age):
    if age < 10:
        return 'baby'
    if age < 20:
        return 'kid'
    if age < 30:
        return 'young'
    if age < 50:
        return 'mature'
    if age >= 50:
        return 'grandpa'
df["elderly"] = df["age"].map(lambda x: elderly_function(x))
# Works with apply as well:
df["elderly"] = df["age"].apply(lambda x: elderly_function(x))
The numpy solution is likely faster and may be preferable if your DataFrame is considerably large.
Just for future reference: nowadays you can use pandas' cut or map with moderate to good speed. If you need something faster this might not suit your needs, but it is good enough for daily use and batches.
import pandas as pd
If you choose map or apply, define your ranges and return a label when the value falls within the range:
def calc_grade(age):
    if 50 <= age < 200:
        return 'Grandpa'
    elif 30 <= age < 50:
        return 'Mature'
    elif 20 <= age < 30:
        return 'Young'
    elif 10 <= age < 20:
        return 'Kid'
    elif age < 10:
        return 'Baby'
%timeit df['elderly'] = df['age'].map(calc_grade)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa
393 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you choose cut, there are several options.
One approach: include on the left, exclude on the right. Each bin gets one label.
bins = [0, 10, 20, 30, 50, 200]  # 200-year vampires are people too, I guess... change this to an upper bound you believe plausible
labels = ['Baby','Kid','Young', 'Mature','Grandpa']
%timeit df['elderly'] = pd.cut(x=df.age, bins=bins, labels=labels , include_lowest=True, right=False, ordered=False)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa
Related
I have a dataframe of people with Age as a column. I would like to match this age to a group, i.e. Baby=0-2 years old, Child=3-12 years old, Young=13-18 years old, Young Adult=19-30 years old, Adult=31-50 years old, Senior Adult=51-65 years old.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
Small input: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!
IIUC I would go with np.select:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0, 2),
            df.Age.between(3, 12),
            df.Age.between(13, 18),
            df.Age.between(19, 30),
            df.Age.between(31, 50),
            df.Age.between(51, 65)]
choicelist = ['Baby', 'Child', 'Young',
              'Young Adult', 'Adult', 'Senior Adult']
df['Adult'] = np.select(condlist, choicelist)
Output:
Age Adult
0 3 Child
1 20 Young Adult
2 40 Adult
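If you'd rather reuse the lists you already created (e.g. Adult = list(range(31, 51))), the conditions can be built with isin instead of between; a sketch assuming lists like these exist (the variable names are illustrative):

# illustrative group lists mirroring the OP's ranges
Baby = list(range(0, 3))
Child = list(range(3, 13))
Young = list(range(13, 19))
Young_Adult = list(range(19, 31))
Adult = list(range(31, 51))
Senior_Adult = list(range(51, 66))

condlist = [df.Age.isin(g) for g in
            [Baby, Child, Young, Young_Adult, Adult, Senior_Adult]]
choicelist = ['Baby', 'Child', 'Young',
              'Young Adult', 'Adult', 'Senior Adult']
df['Group'] = np.select(condlist, choicelist)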
Here's a way to do that using pd.cut:
df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below) and would like to use them for 'binning', you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins:
df["group"] = pd.cut(df.age, bins, labels=labels)
To make things clearer for beginners, you can define a function that returns the age group of each row accordingly, then use pandas.apply() row-wise to populate the 'Group' column:
import pandas as pd
def age(row):
    a = row['Age']
    if 0 < a <= 2:
        return 'Baby'
    elif 2 < a <= 12:
        return 'Child'
    elif 12 < a <= 18:
        return 'Young'
    elif 18 < a <= 30:
        return 'Young Adult'
    elif 30 < a <= 50:
        return 'Adult'
    elif 50 < a <= 65:
        return 'Senior Adult'
df = pd.DataFrame({'Name': ['Anthony', 'Albert', 'Zahra'],
                   'Country': ['France', 'Belgium', 'Tunisia'],
                   'Age': [15, 54, 14]})
df['Group'] = df.apply(age, axis=1)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young
Background
I want to determine the global cumulative value of a variable for different decades from 1990 to 2014, i.e. 1990, 2000, 2010 (3 decades separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
Uses R: 1
Following questions look at date formatting issues: 2, 3
Answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []
# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding a large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []
for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]
# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names
# computing global values
global_values = data_binned.sum(axis=0)
This is a non-optimal method, owing to my limited experience with Pandas. Kindly suggest a better method that uses Pandas' features. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decade statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea behind this is to iterate over all possible decades provided by column names in df, filtering those columns, which are part of the decade and aggregating them.
Finally, I could merge these data frames together, so my data frame df would be enriched with the decade statistics from the second data frame df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
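For completeness, the same decade aggregation can be written without the explicit loop by grouping the transposed year-only frame; a sketch assuming df still has only the integer year columns:

>>> years = df[[1990, 1999, 2000, 2009, 2010, 2015]]
>>> df_decades2 = years.T.groupby((years.columns // 10) * 10).sum().T
>>> df_decades2.columns = [f'{d}-{d + 9}' for d in df_decades2.columns]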
I am a beginner. I've looked all over and read a bunch of related questions but can't quite figure this out. I know I am the problem and that I'm missing something, but I'm hoping someone will be kind and help me out. I am attempting to convert data from one video game (a college basketball simulation) into data consistent with another video game's (pro basketball simulation) format.
I have a DF that has columns:
Name, Pos, Height, Weight, Shot, Points
With values such as:
Jon Smith, C, 84, 235, Exc, 19.4
Greg Jones, PG, 72, 187, Fair, 12.0
I want to create a new column for "InsideScoring". What I'd like to do is assign each player a randomly generated number within a certain range based on what position they played, their height, weight, shot rating, and points scored.
I tried a bunch of attempts like:
df1['InsideScoring'] = 0
df1.loc[(df1.Pos == "C") &
        (df1.Height > 82) &
        (df1.Points > 19.0) &
        (df1.Weight > 229), 'InsideScoring'] = np.random.randint(85, 100)
When I do this, all the players that meet the criteria get assigned the same value between 85 and 100 in "InsideScoring", rather than a random mix of numbers between 85 and 100.
Eventually, what I want to do is go through the list of players and based on those four criteria, assign values from different ranges. Any ideas appreciated.
Pandas: Create a new column with random values based on conditional
Numpy "where" with multiple conditions
My recommendation would be to use np.select here. You set up your conditions, your outputs, and you're good to go. However, to avoid iteration, but also to avoid assigning the same random value to every column that meets the condition, create random values equal to the length of your DataFrame, and select from those:
Setup
df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})
Name Height Pos Weight Shot Points
0 Chris 72 PG 165 Amazing 999
1 John 84 C 235 Fair 25
Now set up your ranges and your conditions (Create as many of these as you like):
cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)
range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 72
1 John 84 C 235 Fair 25 89
Now to verify this doesn't assign values more than once:
df = pd.concat([df]*5)
... # Setup the ranges and conditions again
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 56
1 John 84 C 235 Fair 25 96
0 Chris 72 PG 165 Amazing 999 74
1 John 84 C 235 Fair 25 93
0 Chris 72 PG 165 Amazing 999 63
1 John 84 C 235 Fair 25 97
0 Chris 72 PG 165 Amazing 999 55
1 John 84 C 235 Fair 25 95
0 Chris 72 PG 165 Amazing 999 60
1 John 84 C 235 Fair 25 90
And we can see that random values are assigned, even though they all match one of two conditions. While this is less memory efficient than iterating and picking a random value, since we are creating a lot of unused numbers, it will still be faster as these are vectorized operations.
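One more note: rows matching neither condition receive np.select's default of 0. A catch-all third condition with its own random range avoids that (the 25-50 fallback range here is purely illustrative):

cond3 = ~(cond1 | cond2)                     # everyone not caught above
range3 = np.random.randint(25, 50, len(df))  # illustrative fallback range
df = df.assign(InsideScoring=np.select([cond1, cond2, cond3], [range1, range2, range3]))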
I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, that condition needs to take in input both from the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
Edit 2: fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions with a custom function:
def f(x):
    # check the boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print(df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against all rows; the count is simply the sum of the resulting True values.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons both taller and heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a |, or switch the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
I have a question regarding numpy's where condition. I am able to use a where condition with the == operator, but not with an "is one string a substring of another string?" test.
CODE:
import pandas as pd
import datetime as dt
import numpy as np

data = {'name': ['Smith, Jason', 'Bush, Molly', 'Smith, Tina',
                 'Clinton, Jake', 'Hamilton, Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore',
                                 'postTestScore'])
print("BEFORE---- ")
print(df)
print("AFTER----- ")
df["Smith Family"] = np.where("Smith" in df['name'], 'Y', 'N')
print(df)
OUTPUT:
BEFORE-----
name age preTestScore postTestScore
0 Smith, Jason 42 4 25
1 Bush, Molly 52 24 94
2 Smith, Tina 36 31 57
3 Clinton, Jake 24 2 62
4 Hamilton, Amy 73 3 70
AFTER-----
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 N
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 N
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Why does the numpy.where condition not work in the above case?
I had expected the Smith Family column to have the values
Y
N
Y
N
N
But I did not get that output; as seen above, the output is all N.
Instead of the condition "Smith" in df['name'] I also tried str(df['name']).find("Smith") > -1, but that did not work either.
Any idea what is wrong or what could I have done differently?
I think you need str.contains for boolean mask:
print (df['name'].str.contains("Smith"))
0 True
1 False
2 True
3 False
4 False
Name: name, dtype: bool
df["Smith Family"]=np.where(df['name'].str.contains("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Or str.startswith:
df["Smith Family"]=np.where(df['name'].str.startswith("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
If you want to use in, which works on scalars, you need apply. This solution is faster, but doesn't work if there are NaN values in the name column:
df["Smith Family"]=np.where(df['name'].apply(lambda x: "Smith" in x),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
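As a side note, str.contains can handle missing values directly via its na parameter, so the mask-based version stays usable even with NaN in the name column:

df["Smith Family"] = np.where(df['name'].str.contains("Smith", na=False), 'Y', 'N')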
The behavior of np.where("Smith" in df['name'], 'Y', 'N') depends on what df['name'] produces - I assume some sort of numpy array. The rest is numpy:
In [733]: x=np.array(['one','two','three'])
In [734]: 'th' in x
Out[734]: False
In [744]: 'two' in np.array(['one','two','three'])
Out[744]: True
in is a whole string test, both for a list and an array of strings. It's not a substring test.
np.char has a bunch of functions that apply string functions to elements of an array. These are roughly the equivalent of np.array([x.fn() for x in arr]).
In [754]: x=np.array(['one','two','three'])
In [755]: np.char.startswith(x,'t')
Out[755]: array([False, True, True], dtype=bool)
In [756]: np.where(np.char.startswith(x,'t'),'Y','N')
Out[756]:
array(['N', 'Y', 'Y'],
dtype='<U1')
Or with find:
In [760]: np.char.find(x,'wo')
Out[760]: array([-1, 1, -1])
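Combining that with np.where yields the same Y/N mapping (a sketch continuing the session above):

In [761]: np.where(np.char.find(x, 'wo') >= 0, 'Y', 'N')
Out[761]:
array(['N', 'Y', 'N'],
      dtype='<U1')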
The pandas .str accessor does something similar, applying string methods to the elements of a Series.
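For instance, a small sketch of the .str equivalent of the find test, applied to the df from the question:

df["Smith Family"] = np.where(df['name'].str.find("Smith") >= 0, 'Y', 'N')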