I have a dataframe df with an Age column and I am working on categorizing it into age groups flagged with 0s and 1s.
df:
User_ID | Age
35435 22
45345 36
63456 18
63523 55
I tried the following
df['Age_GroupA'] = 0
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
but get this warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
To avoid it, I am going for .loc
df['Age_GroupA'] = 0
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
However, this marks all ages as 1
This is what I get
User_ID | Age | Age_GroupA
35435 22 1
45345 36 1
63456 18 1
63523 55 1
while this is the goal
User_ID | Age | Age_GroupA
35435 22 1
45345 36 0
63456 18 1
63523 55 0
Thank you
Due to peer pressure (#DSM), I feel compelled to break down your error:
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
This is chained indexing/assignment, which is what triggers the SettingWithCopyWarning.
so what you tried next:
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
is the incorrect form. When using .loc you want:
df.loc[<boolean mask>, cols of interest] = some scalar or calculated value
like this:
df.loc[(df['Age_MDB_S'] >= 1) & (df['Age_MDB_S'] <= 25), 'Age_GroupA'] = 1
You could also have done this using np.where:
df['Age_GroupA'] = np.where( (df['Age_MDB_S'] >= 1) & (df['Age_MDB_S'] <= 25), 1, 0)
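For example, here is a quick check of both forms on the four sample rows from the question (note the answer's code above uses the poster's actual Age_MDB_S column; the sketch below uses the plain Age name from the sample data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'User_ID': [35435, 45345, 63456, 63523],
                   'Age': [22, 36, 18, 55]})

# .loc form: initialise to 0, then set the matching rows to 1
df['Age_GroupA'] = 0
df.loc[(df['Age'] >= 1) & (df['Age'] <= 25), 'Age_GroupA'] = 1

# np.where form builds the same 0/1 column in one step
# (Age 22 and 18 get 1, Age 36 and 55 get 0)
df['Age_GroupA'] = np.where((df['Age'] >= 1) & (df['Age'] <= 25), 1, 0)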
There are many ways to do this in one line.
You can convert the boolean mask to int - True becomes 1 and False becomes 0:
df['Age_GroupA'] = ((df['Age'] >= 1) & (df['Age'] <= 25)).astype(int)
print (df)
User ID Age Age_GroupA
0 35435 22 1
1 45345 36 0
2 63456 18 1
3 63523 55 0
This worked for me. Jezrael already explained it.
dataframe['Age_GroupA'] = ((dataframe['Age'] >= 1) & (dataframe['Age'] <= 25)).astype(int)
Related
I have a dataframe called "df" with a column called "Year_Birth". Instead of that column I want to create multiple columns for specific age categories: for each row I calculate the age from the Year_Birth column and then put "True" (or 1) in the column of the age category that row belongs to.
I am doing this manually as you can see:
#Splitting the Year and Income Attribute to categories
from datetime import date
import pandas as pd

df_year = pd.DataFrame(columns=['18_29', '30_39', '40_49', '50_59', '60_plus'])
temp = df.Year_Birth
current_year = date.today().year
for x in temp:
    l = [0, 0, 0, 0, 0]
    age = current_year - x
    if (age <= 29): l[0] = 1
    elif (age <= 39): l[1] = 1
    elif (age <= 49): l[2] = 1
    elif (age <= 59): l[3] = 1
    else: l[4] = 1
    df_length = len(df_year)
    df_year.loc[df_length] = l
If there's an automatic or simpler way to do this, please tell me. Anyway, now I want to replace the "Year_Birth" column with the whole "df_year" dataframe. Can you help me with that?
You can definitely do this using vectorized operations on each column. You can start by creating an age column from the year of birth:
In [15]: age = date.today().year - df.year_birth
now, this can be used with boolean operators to create arrays of True/False values, which can be coerced to 0/1 with .astype(int):
In [20]: df_year = pd.DataFrame({
...: '18_29': (age >= 18) & (age <= 29),
...: '30_39': (age >= 30) & (age <= 39),
...: '40_49': (age >= 40) & (age <= 49),
...: '50_59': (age >= 50) & (age <= 59),
...: '60_plus': (age >= 60),
...: }).astype(int)
In [21]: df_year
Out[21]:
18_29 30_39 40_49 50_59 60_plus
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
.. ... ... ... ... ...
77 0 0 0 0 1
78 0 0 0 0 1
79 0 0 0 0 1
80 0 0 0 0 1
81 0 0 0 0 1
[82 rows x 5 columns]
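As for the second part of the question, replacing the Year_Birth column with these new indicator columns: one option is to drop Year_Birth and concatenate df_year alongside the remaining columns. A minimal sketch, assuming df and df_year have the same number of rows in the same order (and using the question's Year_Birth spelling; the answer above referred to it as year_birth):
import pandas as pd

# drop the original column and glue the indicator columns on by position
df = pd.concat(
    [df.drop(columns=['Year_Birth']).reset_index(drop=True),
     df_year.reset_index(drop=True)],
    axis=1,
)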
I need to divide the range of my passengers' ages into 5 parts and create a new column holding values from 0 to 4, one for each part (value 0 for the first range, value 1 for the second, and so on).
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset['Age_bin'] = a1.apply(0 for a in range(a))
Here is what I tried to do, but it does not work. I also pinned a picture of the dataset:
DATASET
I expect to get a result where I'll see a new column named 'Age_bin' with value 0 for Age from 0 to 16 inclusive, value 1 for Age from 17 to 33, and so on for the other three ranges.
Binning with pandas.cut is appropriate here; try:
titset['Age_bin'] = pd.cut(titset['Age'], bins=[0, 17, 34, 51, 68, 81], include_lowest=True, labels=False)
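One caveat (my note, not part of the original answer): pd.cut closes intervals on the right by default, so with these bins an Age of exactly 17 lands in bin 0 rather than bin 1. If you want the bins to match the strict upper bounds of your queries (Age < 17, Age < 34, ...), pass right=False:
import pandas as pd

# left-closed bins: [0, 17), [17, 34), [34, 51), [51, 68), [68, 81)
titset['Age_bin'] = pd.cut(titset['Age'],
                           bins=[0, 17, 34, 51, 68, 81],
                           right=False, labels=False)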
First of all, the variable a is already a range object, and you are calling range(a) on it, which is equivalent to range(range(0, 17)), hence the error.
Secondly, even if you fixed that, you would run into an error again since .apply takes a callable (i.e., a function, whether defined with def or as a lambda).
If your goal is to assign a new column that represents the age group each row is in, you can just filter with your queries and assign to those rows:
titset = pd.DataFrame({'Age': range(1, 81)})
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset.loc[a1.index, 'Age_bin'] = 0
titset.loc[a2.index, 'Age_bin'] = 1
titset.loc[a3.index, 'Age_bin'] = 2
titset.loc[a4.index, 'Age_bin'] = 3
titset.loc[a5.index, 'Age_bin'] = 4
Or better yet, use a for loop:
age_groups = [0, 17, 34, 51, 68, 81]
for i in range(len(age_groups) - 1):
    subset = titset.query(f'Age >= {age_groups[i]} & Age < {age_groups[i+1]}')
    titset.loc[subset.index, 'Age_bin'] = i
I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy
df["Total"] = df["HxFPos"].copy
cnt = -1
for trn in df["HxRun"]:
cnt = cnt + 1
if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
df.loc[cnt,"Mask"] = 0
elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
df.loc[cnt,"Mask"] = 1
else:
df.loc[cnt,"Mask"] = 0
df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or if you insist on doing it by iteration (you shouldn't), you never need an explicit for-loop with .loc[] as you did; you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) will cause less grief when you have long column names (as you do) which get used repeatedly in a complex expression.
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
Mask = 1
else:
Mask = 0
pandas doesn't have a native case-statement/method.
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
Mask = 0
(f < 2) and (r < 4) and (btn < 10):
Mask = 1
(f < 4) and (r < 9) and (btn < 10):
Mask = 1
(f < 5) and (r < 20) and (btn < 20):
Mask = 1
else:
Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30). (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools); you can use that directly in df.loc[logical_expression_for_rows, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by #Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1
I have this code to convert the numeric Age values into categorical codes. I'm trying to do it that way, but it's not working. Can anybody help me?
for df in treino_teste:
    df.loc[df['Age'] <= 13, 'Age'] = 0,
    df.loc[(df['Age'] > 13) & (df['Age'] <= 18), 'Age'] = 1,
    df.loc[(df['Age'] > 18) & (df['Age'] <= 25), 'Age'] = 2,
    df.loc[(df['Age'] > 25) & (df['Age'] <= 35), 'Age'] = 3,
    df.loc[(df['Age'] > 35) & (df['Age'] <= 60), 'Age'] = 4,
    df.loc[df['Age'] > 60, 'Age'] = 5
Error:
pandas has built-in capability for categorising continuous data: pd.cut.
For the purpose of the example I've assigned the bin to a new column; I could have assigned it back to Age.
For ease of reading the results I have sorted the frame; this is not needed.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": np.random.randint(1, 65, 10)}).sort_values(["Age"])
bins = [0, 13, 18, 25, 35, 60, 100]
df.assign(AgeB=pd.cut(df.Age, bins=bins, labels=[i for i, v in enumerate(bins[:-1])]))
   Age  AgeB
5   12     0
3   13     0
8   18     1
7   25     2
9   25     2
1   27     3
2   30     3
4   57     4
0   59     4
6   64     5
You can use numpy.digitize()
import numpy

bins = [0, 13, 18, 25, 35, 60, 100]
df['AgeC'] = numpy.digitize(df['Age'], bins)
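Note (my addition, not part of the original answer): numpy.digitize returns 1-based bin numbers, so with these bins the codes run from 1 to 6. If you want them to line up with the 0-based labels pd.cut produced above, subtract 1; boundary handling also differs slightly (digitize uses left-closed bins by default, while pd.cut closes them on the right).
import numpy
import pandas as pd

df = pd.DataFrame({"Age": [12, 13, 18, 30, 64]})   # hypothetical ages
bins = [0, 13, 18, 25, 35, 60, 100]

# subtract 1 to get 0-based codes comparable to pd.cut's labels
df['AgeC'] = numpy.digitize(df['Age'], bins) - 1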
Here is the head of my dataframe
df_s['makes'] = df_s['result']
df_s['misses'] = df_s['result']
df_s.loc[(df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23) &(df_s['result'] == 'made'), 'makes'] = 1
df_s.loc[(df_s['team'] != 'BOS') | (df_s['shot_distance'] < 23) | (df_s['result'] == 'missed') | (df_s['makes'] == 'made'), 'makes'] = 0
df_s.fillna(0, inplace=True)
df_s.loc[(df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23) & (df_s['result'] == 'missed'), 'misses'] = 1
df_s.loc[(df_s['team'] != 'BOS') | (df_s['shot_distance'] < 23) | (df_s['result'] == 'made'), 'misses'] = 0
df_s.fillna(0, inplace=True)
Is the following a better way to do this, or is there an easier solution?:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
A more readable way is to create masks
mask1 = df_s['team'] == 'BOS'
mask2 = df_s['shot_distance'] >= 23
mask3 = df_s['result'] == 'made'
df_s.loc[(mask1 & mask2 & mask3), 'makes'] = 1
df_s.loc[(~mask1 | ~mask2 | ~mask3), 'makes'] = 0
df_s.fillna(0, inplace=True)
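And since 'makes' and 'misses' only ever hold 0 or 1, the boolean masks can also be cast straight to int, in the spirit of the astype(int) answers earlier on this page. A small sketch, with the column names taken from the question:
# one mask for "a BOS shot from 23+ feet", then split by result
bos_three = (df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23)

df_s['makes'] = (bos_three & (df_s['result'] == 'made')).astype(int)
df_s['misses'] = (bos_three & (df_s['result'] == 'missed')).astype(int)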