Creating a new column in dataframe with range of values - python

I need to divide range of my passengers age onto 5 parts and create a new column where will be values from 0 to 4 respectively for every part(For 1 range value 0 for 2 range value 1 etc)
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset['Age_bin'] = a1.apply(0 for a in range(a))
Here what i tried to do but it does not work. I also pin dataset picture
DATASET
I expect to get result where i'll see a new column named 'Age_bin' and values 0 in it for Age from 0 to 16 inclusively, values 1 for age from 17 to 33 and other 3 rangers

Binning with pandas cut is appropriate here, try:
titset['Age_bin'] = titset['Age'].cut(bins=[0,17,34,51,68,81], include_lowest=True, labels=False)

First of all, the variable a is a range object, which you are calling range(a) again, which is equivalent to range(range(0, 17)), hence the error.
Secondly, even if you fixed the above problem, you will run into an error again since .apply takes in a callable (i.e., a function be it defined with def or a lambda function).
If your goal is to assign a new column that represents the age group that each row is in, you can just filter with your result and assign them:
titset = pd.DataFrame({'Age': range(1, 81)})
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset.loc[a1.index, 'Age_bin'] = 0
titset.loc[a2.index, 'Age_bin'] = 1
titset.loc[a3.index, 'Age_bin'] = 2
titset.loc[a4.index, 'Age_bin'] = 3
titset.loc[a5.index, 'Age_bin'] = 4
Or better yet, use a for loop:
age_groups = [0, 17, 34, 51, 68, 81]
for i in range(len(age_groups) - 1):
subset = titset.query(f'Age >= {age_groups[i]} & Age < {age_groups[i+1]}')
titset.loc[subset.index, 'Age_bin'] = i

Related

How to add columns and data of a new dataframe to an already existing one in python

I have a dataframe called "df" and in that dataframe there is a column called "Year_Birth", instead of that column I want to create multiple columns of specific age categories and on each element in that dataframe I calculate the age using the previous Year_Birth column and then put value of "True" or "1" on the age category that the element belongs to.
I am doing this manually as you can see:
#Splitting the Year and Income Attribute to categories
from datetime import date
df_year = pd.DataFrame(columns=['18_29','30_39','40_49','50_59','60_plus'])
temp = df.Year_Birth
current_year = current_year = date.today().year
for x in temp:
l = [0,0,0,0,0]
age = current_year - x
if (age<=29): l[0] = 1
elif (age<=39): l[1] = 1
elif (age<=49): l[2] = 1
elif (age<=59): l[3] = 1
else: l[4] = 1
df_length = len(df_year)
df_year.loc[df_length] = l
if there's an automatic or simpler way to do this please tell me, anyway, Now I want to replace the "Year_Birth" column with the whole "df_year" dataframe ! Can you help me with that ?
You can definitely do this using vectorized operations on each column. You can start by creating an age column from the year of birth:
In [15]: age = date.today().year - df.year_birth
now, this can be used with boolean operators to create arrays of True/False values, which can be coerced to 0/1 with .astype(int):
In [20]: df_year = pd.DataFrame({
...: '18_29': (age >= 18) & (age <= 29),
...: '30_39': (age >= 30) & (age <= 39),
...: '40_49': (age >= 40) & (age <= 49),
...: '50_59': (age >= 50) & (age <= 59),
...: '60_plus': (age >= 60),
...: }).astype(int)
In [21]: df_year
Out[21]:
18_29 30_39 40_49 50_59 60_plus
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
.. ... ... ... ... ...
77 0 0 0 0 1
78 0 0 0 0 1
79 0 0 0 0 1
80 0 0 0 0 1
81 0 0 0 0 1
[82 rows x 5 columns]

How do I normalise a Pandas data column with multiple conditionals?

I am trying to create a new pandas column which is normalised data from another column.
I created three separate series and then merged them into one.
While this approache has provided me with the desired result, I was wondering whether there's a better way to do this.
x = df["Data Col"].copy()
#if the value is between 70 and 30 find the difference of the previous value.
#Positive difference = 1 & Negative difference = -1
btw = pd.Series(np.where(x.between(30, 70, inclusive=False), x.diff(), 0))
btw[btw < 0] = -1
btw[btw > 0] = 1
#All values above 70 are -1
up = pd.Series(np.where(x.gt(70), -1, 0))
#All values below 30 are 1
dw = pd.Series(np.where(x.lt(30), 1, 0))
combined = up + dw + btw
df["Normalised Col"] = np.array(combined)
I tried to use functions and loops directly on the Pandas Data Column but I couldn't figure out how to get the .diff()
Use numpy.select with chain masks by & for bitwise AND and | for bitwise OR:
np.random.seed(2019)
df = pd.DataFrame({'Data Col':np.random.randint(10, 100, size=10)})
#print (df)
d = df["Data Col"].diff()
m1 = df["Data Col"].between(30, 70, inclusive=False)
m2 = d < 0
m3 = d > 0
m4 = df["Data Col"].gt(70)
m5 = df["Data Col"].lt(30)
df["Normalised Col1"] = np.select([(m1 & m2) | m4, (m1 & m3) | m5], [-1, 1], default=0)
print (df)
Data Col Normalised Col1
0 82 -1
1 41 -1
2 47 1
3 98 -1
4 72 -1
5 34 -1
6 39 1
7 25 1
8 22 1
9 26 1

multiple if else conditions in pandas dataframe and derive multiple columns

I have a dataframe like below.
import pandas as pd
import numpy as np
raw_data = {'student':['A','B','C','D','E'],
'score': [100, 96, 80, 105,156],
'height': [7, 4,9,5,3],
'trigger1' : [84,95,15,78,16],
'trigger2' : [99,110,30,93,31],
'trigger3' : [114,125,45,108,46]}
df2 = pd.DataFrame(raw_data, columns = ['student','score', 'height','trigger1','trigger2','trigger3'])
print(df2)
I need to derive Flag column based on multiple conditions.
i need to compare score and height columns with trigger 1 -3 columns.
Flag Column:
if Score greater than equal trigger 1 and height less than 8 then Red --
if Score greater than equal trigger 2 and height less than 8 then Yellow --
if Score greater than equal trigger 3 and height less than 8 then Orange --
if height greater than 8 then leave it as blank
How to write if else conditions in pandas dataframe and derive columns?
Expected Output
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
For other column Text1 in my original question I have tried this one but the integer columns not converting the string when concatenation using astype(str) any other approach?
def text_df(df):
if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger1'].astype(str) + " and less than height 5"
elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger2'].astype(str) + " and less than height 5"
elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger3'].astype(str) + " and less than height 5"
elif (df['height'] > 8):
return np.nan
You need chained comparison using upper and lower bound
def flag_df(df):
if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
return 'Red'
elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
return 'Yellow'
elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
return 'Orange'
elif (df['height'] > 8):
return np.nan
df2['Flag'] = df2.apply(flag_df, axis = 1)
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
Note: You can do this with a very nested np.where but I prefer to apply a function for multiple if-else
Edit: answering #Cecilia's questions
what is the returned object is not strings but some calculations, for example, for the first condition, we want to return df['height']*2
Not sure what you tried but you can return a derived value instead of string using
def flag_df(df):
if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
return df['height']*2
elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
return df['height']*3
elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
return df['height']*4
elif (df['height'] > 8):
return np.nan
what if there are 'NaN' values in osome columns and I want to use df['xxx'] is None as a condition, the code seems like not working
Again not sure what code did you try but using pandas isnull would do the trick
def flag_df(df):
if pd.isnull(df['height']):
return df['height']
elif (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
return df['height']*2
elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
return df['height']*3
elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
return df['height']*4
elif (df['height'] > 8):
return np.nan
Here is a way to use numpy.select() for doing this with neat code, scalable and faster:
conditions = [
(df2['trigger1'] <= df2['score']) & (df2['score'] < df2['trigger2']) & (df2['height'] < 8),
(df2['trigger2'] <= df2['score']) & (df2['score'] < df2['trigger3']) & (df2['height'] < 8),
(df2['trigger3'] <= df2['score']) & (df2['height'] < 8),
(df2['height'] > 8)
]
choices = ['Red','Yellow','Orange', np.nan]
df['Flag1'] = np.select(conditions, choices, default=np.nan)
you can use also apply with a custom function on axis 1 like this :
def color_selector(x):
if (x['trigger1'] <= x['score'] < x['trigger2']) and (x['height'] < 8):
return 'Red'
elif (x['trigger2'] <= x['score'] < x['trigger3']) and (x['height'] < 8):
return 'Yellow'
elif (x['trigger3'] <= x['score']) and (x['height'] < 8):
return 'Orange'
elif (x['height'] > 8):
return ''
df2 = df2.assign(flag=df2.apply(color_selector, axis=1))
you will get something like this :

Increasing column value pandas

I have a dataframe of 143999 rows which contains position and time data.
I already made a column "dt" which calulates the time difference between rows.
Now I want to create a new column which gives the dt values a group number.
So it starts with group = 0 and when dt > 60 the group number should increase by 1.
I tried the following:
def group(x):
c = 0 #
if densdata["dt"] < 60:
densdata["group"] = c
elif densdata["dt"] >= 60:
c += 1
densdata["group"] = c
densdata["group"] = densdata.apply(group, axis=1)'
The error that I get is: The truth value of a Series is ambiguous.
Any ideas how to fix this problem?
This is what I want:
dt group
0.01 0
2 0
0.05 0
300 1
2 1
60 2
You can take advantage of the fact that True evaluates to 1 and use .cumsum().
densdata = pd.DataFrame({'dt': np.random.randint(low=50,high=70,size=20),
'group' : np.zeros(20, dtype=np.int32)})
print(densdata.head())
dt group
0 52 0
1 59 0
2 69 0
3 55 0
4 63 0
densdata['group'] = (densdata.dt >= 60).cumsum()
print(densdata.head())
dt group
0 52 0
1 59 0
2 69 1
3 55 1
4 63 2
If you want to guarantee that the first value of group will be 0, even if the first value of dt is >= 60, then use
densdata['group'] = (densdata.dt.replace(densdata.dt[0],np.nan) >= 60).cumsum()

Pandas categorizing age variable into groups

I have a dataframe df with age and I am working on categorizing the file into age groups with 0s and 1s.
df:
User_ID | Age
35435 22
45345 36
63456 18
63523 55
I tried the following
df['Age_GroupA'] = 0
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
but get this error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
To avoid it, I am going for .loc
df['Age_GroupA'] = 0
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
However, this marks all ages as 1
This is what I get
User_ID | Age | Age_GroupA
35435 22 1
45345 36 1
63456 18 1
63523 55 1
while this is the goal
User_ID | Age | Age_GroupA
35435 22 1
45345 36 0
63456 18 1
63523 55 0
Thank you
Due to peer pressure (#DSM), I feel compelled to breakdown your error:
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
this is chained indexing/assignment
so what you tried next:
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
is incorrect form, when using loc you want:
df.loc[<boolean mask>, cols of interest] = some scalar or calculated value
like this:
df.loc[(df['Age_MDB_S'] >= 1) & (df['Age_MDB_S'] <= 25), 'Age_GroupA'] = 1
You could also have done this using np.where:
df['Age_GroupA'] = np.where( (df['Age_MDB_S'] >= 1) & (df['Age_MDB_S'] <= 25), 1, 0)
To do this in 1 line, there are many ways to do this
You can convert boolean mask to int - True are 1 and False are 0:
df['Age_GroupA'] = ((df['Age'] >= 1) & (df['Age'] <= 25)).astype(int)
print (df)
User ID Age Age_GroupA
0 35435 22 1
1 45345 36 0
2 63456 18 1
3 63523 55 0
This worked for me. Jezrael already explained it.
dataframe['Age_GroupA'] = ((dataframe['Age'] >= 1) & (dataframe['Age'] <= 25)).astype(int)

Categories