I have this code to convert ages from numeric to categorical data. I'm trying to do it this way, but it's not working. Can anybody help me?
for df in treino_teste:
    df.loc[df['Age'] <= 13, 'Age'] = 0,
    df.loc[(df['Age'] > 13) & (df['Age'] <= 18), 'Age'] = 1,
    df.loc[(df['Age'] > 18) & (df['Age'] <= 25), 'Age'] = 2,
    df.loc[(df['Age'] > 25) & (df['Age'] <= 35), 'Age'] = 3,
    df.loc[(df['Age'] > 35) & (df['Age'] <= 60), 'Age'] = 4,
    df.loc[df['Age'] > 60, 'Age'] = 5
It throws an error.
pandas has built-in capability for categorising continuous data: pd.cut.
For the purpose of the example I've assigned the bins to a new column; I could have assigned them back to Age.
For ease of reading the results I have sorted; this is not needed.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": np.random.randint(1, 65, 10)}).sort_values(["Age"])
bins = [0, 13, 18, 25, 35, 60, 100]
df.assign(AgeB=pd.cut(df.Age, bins=bins, labels=[i for i, v in enumerate(bins[:-1])]))
   Age AgeB
5   12    0
3   13    0
8   18    1
7   25    2
9   25    2
1   27    3
2   30    3
4   57    4
0   59    4
6   64    5
You can use numpy.digitize():

import numpy as np

bins = [0, 13, 18, 25, 35, 60, 100]
df['AgeC'] = np.digitize(df['Age'], bins)
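Note that np.digitize returns 1-based bin indices, and its bins are left-closed by default, unlike pd.cut's right-closed default. A minimal sketch (reusing the bins above) that reproduces the 0-5 labels from the pd.cut answer:

```python
import numpy as np

ages = np.array([12, 13, 18, 25, 27, 57, 64])
bins = [0, 13, 18, 25, 35, 60, 100]

# right=True makes each bin right-closed, matching pd.cut's default;
# digitize returns 1-based indices, so subtract 1 to get labels 0..5
codes = np.digitize(ages, bins, right=True) - 1
print(codes.tolist())  # [0, 0, 1, 2, 3, 4, 5]
```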
Related
I have a column with numbers (some of them are infinite).
What I must do is the following:
All numbers greater than 15 or under -15 must be assigned the value 1, otherwise (between -15 and 15) it is assigned the value 0.
I have tried with:
df['B'] = df['B'].mask((df['B'] > 15, 1) | (df['B'] < -15, 1))
df['B'] = df['B'].where(df['B'] == 1, 0)
But got:
TypeError: unsupported operand type(s) for |: 'tuple' and 'tuple'
You could do this with the .between() method:
>>> df # example DF
B
0 1
1 9
2 -27
3 15
4 45
5 -6
>>> df.loc[df["B"].between(-15, 15), "B"] = 0
>>> df.loc[~df["B"].between(-15, 15), "B"] = 1
>>> df
B
0 0
1 0
2 1
3 0
4 1
5 0
You can also use the .apply() method:

df = pd.DataFrame({'B': [-20, -12, 8, 11, 24]})
df['B'] = df['B'].apply(lambda x: 1 if x > 15 or x < -15 else 0)
I felt bad spamming sj95126, so I'll just provide some extra solutions here.
If you actually need 0s and 1s:
(~df["B"].between(-15, 15)).astype(int)
If you're already using numpy, but need more generic replacement (not 0s and 1s):
np.where(df["B"].between(-15, 15), val_if_between, val_if_not_between)
If you're not using numpy but still need more generic replacement:
df["B"].between(-15, 15).replace({True: val_if_between, False: val_if_not_between})
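For instance, a quick runnable check of the np.where variant on the example column, with 0 for in-range and 1 for out-of-range:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [1, 9, -27, 15, 45, -6]})

# 1 where |B| > 15, 0 otherwise; .between() is inclusive on both ends
out = np.where(df["B"].between(-15, 15), 0, 1)
print(out.tolist())  # [0, 0, 1, 0, 1, 0]
```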
To closely follow your thinking process:
df.loc[
(df['B'] > 15) | (df['B'] < -15),
'B',
] = 1
df.loc[...] is essential syntax to master in pandas.
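A complete sketch of that pattern: compute the mask once before overwriting B, then assign both branches:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 9, -27, 15, 45, -6]})

# compute the mask before mutating B, so both assignments use the original values
mask = (df["B"] > 15) | (df["B"] < -15)
df.loc[mask, "B"] = 1
df.loc[~mask, "B"] = 0
print(df["B"].tolist())  # [0, 0, 1, 0, 1, 0]
```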
I would like to replace certain value thresholds in a df with another value.
For example, all values >= 1 and < 3.3 should be summarized as 1.
After that, all values >= 3.3 and < 10 should be summarized as 2, and so on.
I tried it like this (tndf is my df and tnn the column):
tndf.loc[(tndf.tnn < 1), 'tnn'] = 0
tndf.loc[((tndf.tnn >= 1) | (tndf.tnn < 3.3)), 'tnn'] = 1
tndf.loc[((tndf.tnn >=3.3) | (tndf.tnn < 10)), 'tnn'] = 2
tndf.loc[((tndf.tnn >=10) | (tndf.tnn < 20)), 'tnn'] = 3
tndf.loc[((tndf.tnn >=20) | (tndf.tnn < 33.3)), 'tnn'] = 4
tndf.loc[((tndf.tnn >=33.3) | (tndf.tnn < 50)), 'tnn'] = 5
tndf.loc[((tndf.tnn >=50) | (tndf.tnn < 100)), 'tnn'] = 6
tndf.loc[(tndf.tnn == 100), 'tnn'] = 7
But every value at the end will be summarized as a 6. I think that's because of the second part of each condition, but I don't know how to tell the program to only look in a specific range (for example from >= 3.3 and < 10).
I would use np.where(); here is the documentation:
np.where()

import numpy as np

# the third argument keeps the original values where the condition is False
tnddf0 = np.where(tndf.tnn < 1, 0, tndf.tnn)
tnddf1 = np.where((tndf.tnn >= 1) & (tndf.tnn < 3.3), 1, tndf.tnn)
# and so on....
To form categories like these, use pd.cut:
pd.cut(df.tnn, [0, 1, 3.3, 10, 20, 33.3, 50, 100], right=False, labels=range(0, 7))
Sample output of pd.cut
tnn cat
0 76.518227 6
1 44.808386 5
2 46.798994 5
3 70.798699 6
4 67.301112 6
5 13.701745 3
6 47.310570 5
7 74.048936 6
8 37.904632 5
9 38.617358 5
OR
Use np.select. It is meant exactly for your use-case.
conditions = [tndf.tnn < 1, (tndf.tnn >= 1) & (tndf.tnn < 3.3)]
values = [0, 1]
np.select(conditions, values, default="unknown")
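Spelled out for all of the ranges in the question (a sketch; the example data is mine), using & so the conditions don't overlap:

```python
import numpy as np
import pandas as pd

# example values, one per target category; not from the question
tndf = pd.DataFrame({"tnn": [0.5, 2.0, 7.0, 15.0, 25.0, 40.0, 75.0, 100.0]})

edges = [1, 3.3, 10, 20, 33.3, 50, 100]
conditions = (
    [tndf.tnn < edges[0]]
    + [(tndf.tnn >= lo) & (tndf.tnn < hi) for lo, hi in zip(edges, edges[1:])]
    + [tndf.tnn == 100]
)
values = list(range(8))  # categories 0..7
tndf["cat"] = np.select(conditions, values)
print(tndf["cat"].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```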
I have the following DataFrame:
import pandas as pd
data = {"hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
"values": [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
I have been trying to add an extra column to df, grouping the values according to the following list:
[2, 4, 6, 8, 10, 16, 18, 21, 23]
This list represents the hours at which the grouping boundaries occur. E.g. in the new column category, rows with hours between 2 and 4 where values is 1 get 1, rows with hours between 6 and 8 where values is 1 get 2, and so on; everywhere else the category is 0.
I tried the following:
df.groupby(["values", "hours"])
but I could not get anywhere with it.
The expected result looks like:
Updated to answer the question. You'd have to create individual queries, as below. This should work for the specific ranges:
df['category'] = 0
df.loc[(df['hours'] >= 2) & (df['hours'] <= 4), 'category'] = df['values']
df.loc[(df['hours'] >= 6) & (df['hours'] <= 8), 'category'] = df['values'] * 2
df.loc[df['hours'] == 10, 'category'] = df['values'] * 3
df.loc[(df['hours'] >= 16) & (df['hours'] <= 18), 'category'] = df['values'] * 4
df.loc[(df['hours'] >= 21) & (df['hours'] <= 23), 'category'] = df['values'] * 5
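The same idea can be generalized with an explicit list of (start, end, category) ranges; the `ranges` list here is an assumption derived from the boundaries in the question:

```python
import pandas as pd

data = {"hours": list(range(1, 24)),
        "values": [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
                   0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# hypothetical helper: (start, end, category) triples built from the question's list
ranges = [(2, 4, 1), (6, 8, 2), (10, 10, 3), (16, 18, 4), (21, 23, 5)]

df["category"] = 0
for lo, hi, cat in ranges:
    # .between() is inclusive on both ends; only rows with values == 1 get a category
    in_range = df["hours"].between(lo, hi) & df["values"].eq(1)
    df.loc[in_range, "category"] = cat
```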
There is something wrong with your question, so I will assume what Epsi95 has commented. You can try something like this:
This will work when your list has an even number of elements. You can modify it for your case as well.
df['category'] = 0
x = list(zip(bins[::2], bins[1::2]))  # pair up consecutive boundaries
rng = {range(lo, hi + 1): idx + 1 for idx, (lo, hi) in enumerate(x)}
df.loc[df['values'].eq(1), 'category'] = df.loc[df['values'].eq(1), 'hours'].apply(
    lambda h: next((v for k, v in rng.items() if h in k), 0))
Edit:

df['category'] = 0
bins = [(2, 4), (6, 8), 10, (16, 18), (21, 23)]
rng = {}
for idx, i in enumerate(bins, start=1):
    if not isinstance(i, int):
        rng[range(i[0], i[1] + 1)] = idx
    else:
        rng[i] = idx

def func(val):
    for k, v in rng.items():
        if isinstance(k, int):
            if val == k:
                return v
        else:
            if val in k:
                return v

df.loc[df['values'].eq(1), 'category'] = df.loc[df['values'].eq(1), 'hours'].apply(func)
df:
hours values category
0 1 0 0
1 2 1 1
2 3 1 1
3 4 1 1
4 5 0 0
5 6 1 2
6 7 0 0
7 8 1 2
8 9 0 0
9 10 1 3
10 11 0 0
11 12 0 0
12 13 0 0
13 14 0 0
14 15 0 0
15 16 1 4
16 17 1 4
17 18 1 4
18 19 0 0
19 20 0 0
20 21 1 5
21 22 0 0
22 23 1 5
I am trying to create another label column which is based on multiple conditions in my existing data
df
ind group people value value_50 val_minmax
1 1 5 100 1 10
1 2 2 90 1 na
2 1 10 80 1 80
2 2 20 40 0 na
3 1 7 10 0 10
3 2 23 30 0 na
import pandas as pd
import numpy as np
df = pd.read_clipboard()
Then I try to label the rows as per the conditions below:
df['label'] = np.where(np.logical_and(df.group == 2, df.value_50 == 1, df.value > 50), 1, 0)
but it is giving me an error
TypeError: return arrays must be of ArrayType
How can I do this in Python?
Use & between masks:
df['label'] = np.where((df.group == 2) & (df.value_50 == 1) & (df.value > 50), 1, 0)
Alternative:
df['label'] = ((df.group == 2) & (df.value_50 == 1) & (df.value > 50)).astype(int)
Your solution would work if you used reduce with a list of boolean masks:
mask = np.logical_and.reduce([df.group == 2, df.value_50 == 1, df.value > 50])
df['label'] = np.where(mask, 1, 0)
#alternative
#df['label'] = mask.astype(int)
I have a dataframe df with age and I am working on categorizing the file into age groups with 0s and 1s.
df:
User_ID | Age
35435 22
45345 36
63456 18
63523 55
I tried the following
df['Age_GroupA'] = 0
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
but get this error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
To avoid it, I am going for .loc
df['Age_GroupA'] = 0
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
However, this marks all ages as 1
This is what I get
User_ID | Age | Age_GroupA
35435 22 1
45345 36 1
63456 18 1
63523 55 1
while this is the goal
User_ID | Age | Age_GroupA
35435 22 1
45345 36 0
63456 18 1
63523 55 0
Thank you
Due to peer pressure (#DSM), I feel compelled to break down your error:
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
this is chained indexing/assignment
so what you tried next:
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
is incorrect form, when using loc you want:
df.loc[<boolean mask>, cols of interest] = some scalar or calculated value
like this:
df.loc[(df['Age'] >= 1) & (df['Age'] <= 25), 'Age_GroupA'] = 1
You could also have done this using np.where:
df['Age_GroupA'] = np.where((df['Age'] >= 1) & (df['Age'] <= 25), 1, 0)
There are many ways to do this in one line.
You can convert boolean mask to int - True are 1 and False are 0:
df['Age_GroupA'] = ((df['Age'] >= 1) & (df['Age'] <= 25)).astype(int)
print (df)
User ID Age Age_GroupA
0 35435 22 1
1 45345 36 0
2 63456 18 1
3 63523 55 0
This worked for me. Jezrael already explained it.
dataframe['Age_GroupA'] = ((dataframe['Age'] >= 1) & (dataframe['Age'] <= 25)).astype(int)