Change column value with arithmetic sequences using df.loc in pandas - python

Suppose I have the following dataframe:
data = {"age":[2,3,2,5,9,12,20,43,55,60],'alpha' : [0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
I want to change the value of column alpha based on column age using df.loc and an arithmetic sequence, but I get a syntax error:
df.loc[((df.age <=4)) , "alpha"] = ".4"
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df$age - 4)/(20 - 4))
df.loc[((df.age > 20)) , "alpha"] = "1"
Thank you in advance.

Reference the age column using a ., not a $ (the $ accessor is R syntax, not pandas):
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df.age - 4)/(20 - 4))
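With that change, the full set of assignments could look like this (a sketch; it keeps alpha numeric throughout instead of mixing strings such as ".4" with floats):
df.loc[df.age <= 4, "alpha"] = 0.4
df.loc[(df.age >= 5) & (df.age <= 20), "alpha"] = 0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4))
df.loc[df.age > 20, "alpha"] = 1.0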

Instead of multiple .loc assignments, you can combine all of the conditions at once using chained np.where calls:
import numpy as np

df['alpha'] = np.where(df.age <= 4, ".4",
                       np.where((df.age >= 5) & (df.age <= 20),
                                0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4)),
                                np.where(df.age > 20, "1", df.alpha)))
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1

Besides the syntax error (due to the $), to reduce visible noise I would go for numpy.select:
import numpy as np

conditions = [df["age"].le(4),
              df["age"].gt(4) & df["age"].le(20),
              df["age"].gt(20)]

values = [".4", 0.4 + (1 - 0.4) * ((df["age"] - 4) / (20 - 4)), 1]

df["alpha"] = np.select(condlist=conditions, choicelist=values)

Output:
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1

Related

Lambda with if elif else [duplicate]

I want to apply a lambda function to a DataFrame column using if...elif...else within the lambda function.
The df and the code are something like:
df=pd.DataFrame({"one":[1,2,3,4,5],"two":[6,7,8,9,10]})
df["one"].apply(lambda x: x*10 if x<2 elif x<4 x**2 else x+10)
Obviously, this doesn't work. Is there a way to apply if...elif...else to a lambda?
How can I get the same result with a list comprehension?
Nest the if/else expressions:
lambda x: x*10 if x<2 else (x**2 if x<4 else x+10)
I do not recommend the use of apply here: it should be avoided if there are better alternatives.
For example, if you are performing the following operation on a Series:
if cond1:
    exp1
elif cond2:
    exp2
else:
    exp3
This is usually a good use case for np.where or np.select.
numpy.where
The if else chain above can be written using
np.where(cond1, exp1, np.where(cond2, exp2, ...))
np.where allows nesting. With one level of nesting, your problem can be solved with,
df['three'] = (
    np.where(
        df['one'] < 2,
        df['one'] * 10,
        np.where(df['one'] < 4, df['one'] ** 2, df['one'] + 10))
)
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
numpy.select
Allows for flexible syntax and is easily extensible. It follows the form,
np.select([cond1, cond2, ...], [exp1, exp2, ...])
Or, in this case,
np.select([cond1, cond2], [exp1, exp2], default=exp3)
df['three'] = (
    np.select(
        condlist=[df['one'] < 2, df['one'] < 4],
        choicelist=[df['one'] * 10, df['one'] ** 2],
        default=df['one'] + 10)
)
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
and/or (similar to the if/else)
Similar to the if/else chain, but it still requires the lambda:
df['three'] = df["one"].apply(
    lambda x: (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10)
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
List Comprehension
A loopy solution that is still faster than apply:
df['three'] = [x*10 if x<2 else (x**2 if x<4 else x+10) for x in df['one']]
# df['three'] = [
#     (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10 for x in df['one']
# ]
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
For readability I prefer to write a function, especially if you are dealing with many conditions. For the original question:
def parse_values(x):
    if x < 2:
        return x * 10
    elif x < 4:
        return x ** 2
    else:
        return x + 10

df['one'].apply(parse_values)
You can also do it with multiple .loc assignments. Here a new column labelled 'new' is created with the conditional calculation applied (note that the conditions must not overlap, otherwise later assignments overwrite earlier ones):
df.loc[df['one'] < 2, 'new'] = df['one'] * 10
df.loc[(df['one'] >= 2) & (df['one'] < 4), 'new'] = df['one'] ** 2
df.loc[df['one'] >= 4, 'new'] = df['one'] + 10

Replacing values in a df between certain values (replace >1 to 4 with 1)

I would like to replace certain value thresholds in a df with another value.
For example, all values between 1 and <3.3 should be summarized as 1.
After that, all values between >=3.3 and <10 should be summarized as 2, and so on.
I tried it like this (tndf is my df and tnn is the column):
tndf.loc[(tndf.tnn < 1), 'tnn'] = 0
tndf.loc[((tndf.tnn >= 1) | (tndf.tnn < 3.3)), 'tnn'] = 1
tndf.loc[((tndf.tnn >=3.3) | (tndf.tnn < 10)), 'tnn'] = 2
tndf.loc[((tndf.tnn >=10) | (tndf.tnn < 20)), 'tnn'] = 3
tndf.loc[((tndf.tnn >=20) | (tndf.tnn < 33.3)), 'tnn'] = 4
tndf.loc[((tndf.tnn >=33.3) | (tndf.tnn < 50)), 'tnn'] = 5
tndf.loc[((tndf.tnn >=50) | (tndf.tnn < 100)), 'tnn'] = 6
tndf.loc[(tndf.tnn == 100), 'tnn'] = 7
But in the end every value gets summarized as a 6. I think that's because of the second part of each condition, but I don't know how to tell the program to only look in a specific range (for example from >=3.3 to <10).
I would use np.where() here; see the documentation for np.where().
import numpy as np
tnddf0 = np.where(tndf.tnn < 1, 0, tndf.tnn)
tnddf1 = np.where((tndf.tnn >= 1) & (tndf.tnn < 3.3), 1, tndf.tnn)
# and so on....
To form categories like these, use pd.cut:
pd.cut(df.tnn, [0, 1, 3.3, 10, 20, 33.3, 50, 100], right=False, labels=range(0, 7))
Sample output of pd.cut
tnn cat
0 76.518227 6
1 44.808386 5
2 46.798994 5
3 70.798699 6
4 67.301112 6
5 13.701745 3
6 47.310570 5
7 74.048936 6
8 37.904632 5
9 38.617358 5
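To keep the result in the frame, the categories can be assigned back, roughly like this (a sketch; note that with right=False the value 100 itself falls outside the last bin and becomes NaN, so you may want to add an extra edge for the tnn == 100 case):
tndf['cat'] = pd.cut(tndf.tnn, [0, 1, 3.3, 10, 20, 33.3, 50, 100],
                     right=False, labels=range(7))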
OR
Use np.select. It is meant exactly for your use-case.
conditions = [tndf.tnn < 1, (tndf.tnn >= 1) & (tndf.tnn < 3.3)]
values = [0, 1]
np.select(conditions, values, default="unknown")
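Extended to all of the ranges from the question, this might look like the following sketch (note & rather than |, so each condition covers exactly one range):
conditions = [tndf.tnn < 1,
              (tndf.tnn >= 1) & (tndf.tnn < 3.3),
              (tndf.tnn >= 3.3) & (tndf.tnn < 10),
              (tndf.tnn >= 10) & (tndf.tnn < 20),
              (tndf.tnn >= 20) & (tndf.tnn < 33.3),
              (tndf.tnn >= 33.3) & (tndf.tnn < 50),
              (tndf.tnn >= 50) & (tndf.tnn < 100),
              tndf.tnn == 100]
values = [0, 1, 2, 3, 4, 5, 6, 7]
# default=-1 flags anything that falls outside all of the listed ranges
tndf['tnn'] = np.select(conditions, values, default=-1)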

Return df containing points within radius - python

There are a few questions on this, but I'm getting stuck. I have a df that contains coordinates for various scatter points. I want to generate a radius around one of these points and return the points that fall within this radius at each point in time. Using the df below, I want to return a df containing all the points within the radius around A at each point in time.
import pandas as pd
df = pd.DataFrame({
'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],
'Label' : ['A','B','C','D','E','A','B','C','D','E'],
'X' : [8,4,3,8,7,7,3,3,4,6],
'Y' : [3,3,3,4,3,2,1,5,4,2],
})
x_data = (df.groupby(['Time'])['X'].apply(list))
y_data = (df.groupby(['Time'])['Y'].apply(list))
AX_data = (df.loc[df['Label'] == 'A']['X'])
AY_data = (df.loc[df['Label'] == 'A']['Y'])
def countPoints(df, center_x, center_y, x, y, radius):
    '''
    Count number of points within radius for label A
    '''
    # Determine square distance
    square_dist = (center_x - x) ** 2 + (center_y - y) ** 2
    # Return df of rows within radius
    df = df[square_dist <= radius ** 2].copy()
    return df
df = countPoints(df, AX_data, AY_data, x_data, y_data, radius = 1)
Intended Output:
Time Label X Y
0 09:00:00.1 A 8 3
1 09:00:00.1 D 8 4
2 09:00:00.1 E 7 3
3 09:00:00.2 A 7 2
4 09:00:00.2 E 6 2
Here's my take on it, using np.linalg.norm:
import numpy as np

def calc_dist(gp, a_label, r=1):
    dist_df = gp[['X', 'Y']] - gp.loc[gp.Label.eq(a_label), ['X', 'Y']].values
    dist_arr = np.linalg.norm(dist_df, axis=1)
    return gp[dist_arr <= r]
df_A = df.groupby('Time').apply(calc_dist, a_label='A', r=1).reset_index(drop=True)
Out[2159]:
Time Label X Y
0 09:00:00.1 A 8 3
1 09:00:00.1 D 8 4
2 09:00:00.1 E 7 3
3 09:00:00.2 A 7 2
4 09:00:00.2 E 6 2
Method 2:
df1 = df.where(df.Label.eq('A')).groupby(df.Time).apply(lambda x: x.ffill().bfill())
m = np.linalg.norm(df[['X', 'Y']] - df1[['X', 'Y']], axis=1) <= 1
df_A = df[m]
Out[2262]:
Time Label X Y
0 09:00:00.1 A 8 3
3 09:00:00.1 D 8 4
4 09:00:00.1 E 7 3
5 09:00:00.2 A 7 2
9 09:00:00.2 E 6 2

How do I normalise a Pandas data column with multiple conditionals?

I am trying to create a new pandas column which is normalised data from another column.
I created three separate series and then merged them into one.
While this approach has provided me with the desired result, I was wondering whether there's a better way to do this.
x = df["Data Col"].copy()
#if the value is between 70 and 30 find the difference of the previous value.
#Positive difference = 1 & Negative difference = -1
btw = pd.Series(np.where(x.between(30, 70, inclusive=False), x.diff(), 0))
btw[btw < 0] = -1
btw[btw > 0] = 1
#All values above 70 are -1
up = pd.Series(np.where(x.gt(70), -1, 0))
#All values below 30 are 1
dw = pd.Series(np.where(x.lt(30), 1, 0))
combined = up + dw + btw
df["Normalised Col"] = np.array(combined)
I tried using functions and loops directly on the pandas data column, but I couldn't figure out how to incorporate the .diff().
Use numpy.select, chaining the masks with & for bitwise AND and | for bitwise OR:
np.random.seed(2019)
df = pd.DataFrame({'Data Col':np.random.randint(10, 100, size=10)})
#print (df)
d = df["Data Col"].diff()
m1 = df["Data Col"].between(30, 70, inclusive=False)
m2 = d < 0
m3 = d > 0
m4 = df["Data Col"].gt(70)
m5 = df["Data Col"].lt(30)
df["Normalised Col1"] = np.select([(m1 & m2) | m4, (m1 & m3) | m5], [-1, 1], default=0)
print (df)
Data Col Normalised Col1
0 82 -1
1 41 -1
2 47 1
3 98 -1
4 72 -1
5 34 -1
6 39 1
7 25 1
8 22 1
9 26 1

Python dataframe groupby binning statistics

For each unique "acat" value, I want to count how many occurrences there are of each "data" category (call these counts "bins"), and then calculate the mean and skew of "bins".
Possible values of data: 1, 2, 3, 4, 5.
df = pd.DataFrame({'acat':[1,1,2,3,1,3],
'data':[1,1,2,1,3,1]})
df
Out[45]:
acat data
0 1 1
1 1 1
2 2 2
3 3 1
4 1 3
5 3 1
for acat = 1:
bins = (2 + 0 + 1 + 0 + 0)
average = bins / 5 = 0.6
for acat = 2:
bins = (0 + 1 + 0 + 0 + 0)
average = bins / 5 = 0.2
for acat = 3:
bins = (2 + 0 + 0 + 0 + 0)
average = bins / 5 = 0.4
bin_average_col
0.6
0.6
0.2
0.4
0.6
0.4
Also I would like a bin_skew_col.
I have a solution that uses crosstab, but it blows up my PC's memory when the number of acat values is large.
I have tried extensively with groupby and transform but this is beyond me!
Many thanks in advance.
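One possible way to get the per-acat bin counts, their mean, and their skew without crosstab (a sketch, not from the original thread; it relies on groupby/value_counts and joins the group statistics back onto the rows):
import pandas as pd

df = pd.DataFrame({'acat': [1, 1, 2, 3, 1, 3],
                   'data': [1, 1, 2, 1, 3, 1]})

# Per-acat counts of each data value 1-5, keeping empty bins as 0
bins = (df.groupby('acat')['data']
          .value_counts()
          .unstack(fill_value=0)
          .reindex(columns=range(1, 6), fill_value=0))

# Mean and skew of the five bin counts for each acat
stats = pd.DataFrame({'bin_average_col': bins.mean(axis=1),
                      'bin_skew_col': bins.skew(axis=1)})

# Map the group-level statistics back to the original rows
df = df.join(stats, on='acat')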
