Vectorize step-wise function for column in pandas dataframe - python

I have a slightly complex function that assigns a quality level to given data using pre-defined step-wise logic (dependent on fixed borders and also on relative borders based on the real value). The function get_quality() below does this for each row, and applying it with pandas DataFrame.apply is quite slow for huge datasets, so I'd like to vectorize this calculation. Obviously I could do something like df.groupby(pd.cut(df.ground_truth, [-np.inf, 10.0, 20.0, 50.0, np.inf])) for the if-logic and then apply a similar sub-grouping within each group (based on the borders of each group), but how would I do that for the last bisect, which depends on the given real/ground_truth value in each row?
Using df['quality'] = np.vectorize(get_quality)(df['measured'], df['ground_truth']) is a lot faster already, but is there a real vectorized way to calculate the same 'quality' column?
import pandas as pd
import numpy as np
from bisect import bisect

quality_levels = ['WayTooLow', 'TooLow', 'OK', 'TooHigh', 'WayTooHigh']
# Note: to make the vertical borders always lead towards the 'better' score we use a small epsilon around them
eps = 0.000001

def get_quality(measured_value, real_value):
    diff = measured_value - real_value
    if real_value <= 10.0:
        i = bisect([-4.0-eps, -2.0-eps, 2.0+eps, 4.0+eps], diff)
        return quality_levels[i]
    elif real_value <= 20.0:
        i = bisect([-14.0-eps, -6.0-eps, 6.0+eps, 14.0+eps], diff)
        return quality_levels[i]
    elif real_value <= 50.0:
        i = bisect([-45.0-eps, -20.0-eps, 20.0+eps, 45.0+eps], diff)
        return quality_levels[i]
    else:
        i = bisect([-0.5*real_value-eps, -0.25*real_value-eps,
                    0.25*real_value+eps, 0.5*real_value+eps], diff)
        return quality_levels[i]

N = 100000
df = pd.DataFrame({'ground_truth': np.random.randint(0, 100, N),
                   'measured': np.random.randint(0, 100, N)})
df['quality'] = df.apply(lambda row: get_quality(row['measured'], row['ground_truth']), axis=1)
print(df.head())
print(df.quality.value_counts())
#    ground_truth  measured     quality
# 0            51         1   WayTooLow
# 1             7        25  WayTooHigh
# 2            38        95  WayTooHigh
# 3            76        32   WayTooLow
# 4             0        18  WayTooHigh
#
# OK            30035
# WayTooHigh    24257
# WayTooLow     18998
# TooLow        14593
# TooHigh       12117

This is possible with np.select.
import numpy as np

eps = 0.000001  # same epsilon as in the question
quality_levels = ['WayTooLow', 'TooLow', 'OK', 'TooHigh', 'WayTooHigh']

def get_quality_vectorized(df):
    # Prepare the first 4 conditions, to match the 4 sets of boundaries.
    gt = df['ground_truth']
    conds = [gt <= 10, gt <= 20, gt <= 50, True]
    lo = np.select(conds, [2, 6, 20, 0.25 * gt])
    hi = np.select(conds, [4, 14, 45, 0.5 * gt])
    # Prepare inner 5 conditions, to match the 5 quality levels.
    diff = df['measured'] - df['ground_truth']
    quality_conds = [diff < -hi-eps, diff < -lo-eps, diff < lo+eps, diff < hi+eps, True]
    df['quality'] = np.select(quality_conds, quality_levels)
    return df
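A quick sanity check on the sample frame from the question (this assumes the df and the row-wise get_quality defined above are still in scope):
df = get_quality_vectorized(df)
print(df.head())
print(df['quality'].value_counts())

# Cross-check against the row-wise function from the question.
row_wise = df.apply(lambda row: get_quality(row['measured'], row['ground_truth']), axis=1)
print((df['quality'] == row_wise).all())  # expected to print True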

Related

Replace outlier values with NaN in numpy? (preserve length of array)

I have an array of magnetometer data with artifacts every two hours due to power cycling.
I'd like to replace those indices with NaN so that the length of the array is preserved.
Here's a code example, adapted from https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html.
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime

# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']

def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = [x for x in y if (x > mean - 2 * sd)]
    final_list = [x for x in final_list if (x < mean + 2 * sd)]
    return final_list

px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
px.line(y=y, x=x)
# px.scatter(y) # It looks like the outliers are successfully dropped.
# px.line(y=reject_outliers(y), x=x) # This is the line I'd like to see work.
When I run 'px.scatter(reject_outliers(y))', it looks like the outliers are successfully getting dropped:
...but that's looking at the culled y vector relative to the index, rather than the datetime vector x as in the above plot. As the debugging text indicates, the vector is shortened because the outlier values are dropped rather than replaced.
How can I edit my reject_outliers() function to assign those values to NaN, or to adjacent values, in order to keep the length of the array the same so that I can plot my data?
Use else in the list comprehension along the lines of:
[x if x_condition else other_value for x in y]
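Applied to the function from the question, a minimal sketch (using n standard deviations consistently, and the numpy already imported above) might look like:
def reject_outliers(y, n=5):  # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    # Keep a value if it lies within n standard deviations of the mean,
    # otherwise replace it with NaN so the array length is preserved.
    return np.array([v if (mean - n * sd) < v < (mean + n * sd) else np.nan for v in y])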
Got a less compact version to work. Full code:
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime

# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']

def reject_outliers(y):  # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    final_list = np.copy(y)
    for n in range(len(y)):
        final_list[n] = y[n] if y[n] > mean - 5 * sd else np.nan
        final_list[n] = final_list[n] if final_list[n] < mean + 5 * sd else np.nan
    return final_list

px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
# px.line(y=y, x=x)
px.line(y=reject_outliers(y), x=x)  # This is the line I wanted to get working - check!
More compact answer, sent via email by a friend:
In numpy you can select/index based on a Boolean array, and then make assignment with it:
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = y.copy()
    final_list[np.abs(y - mean) > n * sd] = np.nan
    return final_list
I also noticed that you didn’t use the value of n in your example code.
Alternatively, you can use the where method (https://numpy.org/doc/stable/reference/generated/numpy.where.html)
np.where(np.abs(y - mean) > n * sd, np.nan, y)
You don’t need the .copy() if you don’t mind modifying the input array.
Replace np.mean and np.std with np.nanmean and np.nanstd if you want the function to work on arrays that already contain nans, i.e. if you want to use this function recursively.
The answer about using if else in a list comprehension would work, but avoiding the list comprehension makes the function much faster if the arrays are large.
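For reference, a sketch of that NaN-tolerant variant (same logic, just nanmean/nanstd plus np.where):
def reject_outliers_nan_aware(y, n=5):  # y is the data in a 1D numpy array
    # nanmean/nanstd ignore NaNs already present, so the function can be
    # applied repeatedly to its own output.
    mean = np.nanmean(y)
    sd = np.nanstd(y)
    return np.where(np.abs(y - mean) > n * sd, np.nan, y)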

Need to use apply or broadcasting and masking to iterate over a DataFrame

I have a data frame that I need to iterate over. I want to use either apply or broadcasting and masking. This is the pseudocode I am trying to improve upon.
Algorithm 1: The algorithm
    initialize the population (of size n) uniformly randomly, obeying the bounds;
    while a pre-determined number of iterations is not complete do
        set the random parameters (two independent parameters for each of the d variables);
        find the best and the worst vectors in the population;
        for each vector in the population do
            create a new vector using the current vector, the best vector,
            the worst vector, and the random parameters;
            if the new vector is at least as good as the current vector then
                current vector = new vector;
This is the code I have so far.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size=(20, 5)), columns=list('ABCDE'))
pd.set_option('display.max_columns', 500)
df

# while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best = final_func.idxmin()
xti_worst = final_func.idxmax()
print(final_func)
print(df.head())
print(df.tail())

# for loop of pseudocode
# for row in df.iterrows():
#     implement equation from assignment
#     define in array math
#     xi_new = row.to_numpy() + np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_best].values - np.absolute(row.to_numpy())) - np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_worst].values - np.absolute(row.to_numpy()))
#     print(xi_new)

df2 = df.apply(lambda row: 0 if row == 0 else row + np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_best].values - np.absolute(axis=1)))
print(df2)
The formula I am trying to use for xi_new is:
xi_new = xi_current + (random value between 0 and 1) * (xti_best - abs(xi_current)) - (random value between 0 and 1) * (xti_worst - abs(xi_current))
I'm not sure I'm implementing your formula correctly, but hopefully this helps
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size=(20, 5)), columns=list('ABCDE'))

# while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best_idx = final_func.idxmin()
xti_worst_idx = final_func.idxmax()
xti_best = df.loc[xti_best_idx]
xti_worst = df.loc[xti_worst_idx]

# Calculate random values for the whole df for the two different areas where you need randomness
nrows, ncols = df.shape
r1 = np.random.uniform(0, 1, size=(nrows, ncols))
r2 = np.random.uniform(0, 1, size=(nrows, ncols))

# xi_new = xi_current + r1 * (xti_best - abs(xi_current)) - r2 * (xti_worst - abs(xi_current))
df = df + r1 * xti_best.sub(df.abs()) - r2 * xti_worst.sub(df.abs())
df
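To cover the rest of the pseudocode (the outer while loop and the "keep the new vector only if it is at least as good" check), a sketch along these lines could work; the iteration count is an arbitrary assumption, and "better" is taken to mean a lower final_func value, consistent with using idxmin() for the best vector:
n_iterations = 50  # assumed; the pseudocode only says "a pre-determined number of iterations"
for _ in range(n_iterations):
    # find the best and the worst vectors in the current population
    fitness = np.square(np.square(df).sum(axis=1))
    xti_best = df.loc[fitness.idxmin()]
    xti_worst = df.loc[fitness.idxmax()]
    # two independent random parameters per variable, for every vector
    nrows, ncols = df.shape
    r1 = np.random.uniform(0, 1, size=(nrows, ncols))
    r2 = np.random.uniform(0, 1, size=(nrows, ncols))
    # candidate vectors from the formula in the question
    candidate = df + r1 * (xti_best - df.abs()) - r2 * (xti_worst - df.abs())
    # accept a candidate row only if it is at least as good as the current one
    cand_fitness = np.square(np.square(candidate).sum(axis=1))
    improved = cand_fitness <= fitness
    df.loc[improved] = candidate.loc[improved]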

Calculate weight using normal distribution in Python

I have to add a weight column in the titanic dataset to calculate adult passengers' weight using a normal distribution with std = 20 and mean = 70 kg. I have tried this code:
df['Weight'] = np.random.normal(20, 70, size=891)
df['Weight'].fillna(df['Weight'].iloc[0], inplace=True)
but I am concerned about two things:
It generates negative values, not just positive ones; these cannot be sensible weights. Is there anything I can change in the code to generate only positive values?
Since I am targeting the adult age group, what about children? Some rows also end up with abnormal weight values, such as 7 kg for an adult or 30 kg for a child; how can this be solved?
I appreciate any help you can provide.
Edit:
This code worked for me
Weight = np.random.normal(80, 20, 718)
adults['Weight'] = Weight
Now I have to calculate the probability of people weighing less than 70,
and of those between 70 and 100.
I have tried the following code, but it raises an error: TypeError: unsupported operand type(s) for -: 'str' and 'int'.
import pandas as pd
import numpy as np
import scipy.stats
adults = df[(df['Age'] >= 20) & (df['Age'] <= 70)]
Weight = np.random.normal(80, 20, 718)
adults['Weight'] = Weight
p1 = adults['Weight'] < 70
p2 = adults[(adults['Weight'] > 70) & (adults['Weight'] < 100)]
scipy.stats.norm.pdf(p1)
scipy.stats.norm.pdf(p2)
The range of a normal distribution is not restricted; it spans all real numbers. If you want to restrict it, you have to do so manually or use a different distribution.
df['Weight'] = np.random.normal(70, 20, size=891)  # np.random.normal(mean, std, size)
df.loc[df['Weight'] < min_value, 'Weight'] = min_value  # min_value / max_value are whatever hard bounds you choose
df.loc[df['Weight'] > max_value, 'Weight'] = max_value
Since the weights of children and adults are not identically distributed, you should sample them from different distributions:
# use different distributions
df.loc[df['person_type'] == 'child', 'Weight'] = np.random.normal(x1, y1, size=children_size)
df.loc[df['person_type'] == 'adult', 'Weight'] = np.random.normal(x2, y2, size=adult_size)
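For instance, using an age cut-off instead of a person_type column (the cut-off and the means/standard deviations below are made-up placeholders, not values from the Titanic data):
is_child = df['Age'] < 20   # assumed cut-off, matching the adult filter used above
df.loc[is_child, 'Weight'] = np.random.normal(30, 10, size=is_child.sum())
df.loc[~is_child, 'Weight'] = np.random.normal(70, 20, size=(~is_child).sum())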
You can use a truncated normal distribution if you want to avoid negative values. For example, to get a vector with mean 70 and sd 20, you can do:
import numpy as np
from scipy.stats import norm, truncnorm

myclip_a = 0
myclip_b = np.inf
my_mean = 70
my_std = 20
a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
We set the seed, and you can see that the truncated normal has all its values above zero, unlike the plain normal you have used:
np.random.seed(100)
x1 = truncnorm.rvs(a=a, b=b, size=50000, loc=70, scale=20)
np.sum(x1 < 0)
# 0
x2 = norm.rvs(loc=70, scale=20, size=50000)
np.sum(x2 < 0)
# 10
I'm not sure how you are filling in the NAs; I would need the data frame to address that, but I suspect it's another question altogether.

Identifying and removing outliers based on more than one condition in a dataset using Python

I am preparing a dataset for regression modelling and would like to remove all outliers first. The dataset has 7 variables, all continuous in nature. Five of the variables can be addressed universally. However, two variables, height and weight, need to be split between male and female participants first. These two measurements clearly differ between males and females, so to find the outliers I need to split the data by sex, assess/remove the outliers for height and weight within each group, and then merge this data back with the data I have already prepared. Is there a simple way of doing this? So far I have been using the interquartile range on the other 5 variables, which do not need to be split by sex, using this code for each variable...
Q1 = df["Variable"].quantile(0.25)
Q3 = df["Variable"].quantile(0.75)
IQR = Q3-Q1
Lower_Fence = Q1 - (1.5*IQR)
Upper_Fence = Q3 + (1.5*IQR)
print(Lower_Fence)
print(Upper_Fence)
df[((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))] # Detection of outliers
df[~((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))]` # Removal of outliers
I am relatively new to python.
You can define a function for your "outlier" logic, then apply that repeatedly for all columns, with or without groupby:
import numpy as np
import pandas as pd

def is_outlier(s, quantiles=[.25, .75], thresholds=[-.5, .5]):
    # change the thresholds to [-1.5, 1.5] to reflect IQR as per your question
    a, b = s.quantile(quantiles)
    iqr = b - a
    lo, hi = np.array(thresholds) * iqr + [a, b]
    return (s < lo) | (s > hi)
Simple test:
n = 20
np.random.seed(0)
df = pd.DataFrame(dict(
    status=np.random.choice(['dead', 'alive'], n),
    gender=np.random.choice(['M', 'F'], n),
    weight=np.random.normal(150, 40, n),
    diastolic=np.random.normal(80, 10, n),
    cholesterol=np.random.normal(200, 20, n),
))
Example usage:
mask = is_outlier(df['diastolic']) # overall outliers
# or
mask = df.groupby('gender')['weight'].apply(is_outlier) # per gender group
Usage to filter out data:
mask = False
# overall outliers
for k in ['diastolic', 'cholesterol']:  # etc.
    mask |= is_outlier(df[k])
# per-gender outliers
gb = df.groupby('gender')
for k in ['weight']:  # and any other columns needed for per-gender
    mask |= gb[k].apply(is_outlier)
# finally, select the non-outliers
df_filtered = df.loc[~mask]
BTW, note how per-gender outliers are different than overall, e.g. for 'weight':
df.groupby('gender')['weight'].apply(is_outlier) == is_outlier(df['weight'])

Create and pass random values to Pandas dataframes with hard bounds

I am trying to simulate a pandas dataframe with random values, combined with hard upper/lower bounds. I am using np.random.normal, as the original data is fairly normally distributed.
The code I am using to create the dataframe is:
df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),
    "Sun": np.random.normal(1.615054, 2.053996, 93),
    "Rel Hum": np.random.normal(87.153118, 5.529958, 93)
})
In the above example, I would like there to be hard lower and upper bounds for all three values. For example, Rel Hum could not go below 0 or above 100. (Edit: the three columns would not all have the same bounds, either upper or lower; Temp can go negative, while Sun would be bounded at 0 and 24.)
How can I force these bounds while keeping a relatively normal distribution, and pass the values to the dataframe at the same time?
Edit: note that this samples from a truncated normal for the given parameters and so will most likely not be truly normally distributed; sorry for the confusion.
Use scipy truncated normal defined as :
"The standard form of this distribution is a standard normal truncated to the range [a, b]"
from scipy.stats import truncnorm

low_bound = 0
upper_bound = 100
mean = 8
std = 2
a, b = (low_bound - mean) / std, (upper_bound - mean) / std

n_samples = 1000
samples = truncnorm.rvs(a=a, b=b,
                        loc=mean, scale=std,
                        size=n_samples)
Thanks to ALollz for the corrections!
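Putting that together for a whole frame with per-column bounds (for example Sun in [0, 24] and Rel Hum in [0, 100], with Temp left unbounded, following the edit in the question) might look like this sketch:
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def bounded_normal(mean, std, low, high, size):
    # Convert the absolute bounds to the standardized a/b that truncnorm expects.
    a, b = (low - mean) / std, (high - mean) / std
    return truncnorm.rvs(a=a, b=b, loc=mean, scale=std, size=size)

df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),        # no hard bounds
    "Sun": bounded_normal(1.615054, 2.053996, 0, 24, 93),
    "Rel Hum": bounded_normal(87.153118, 5.529958, 0, 100, 93),
})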
Try the clip() function to bound the values. Example:
>>> df[df['Rel Hum'] > 100].head()
        Temp       Sun     Rel Hum
32  4.734005  4.102939  100.064077
>>> df['Rel Hum'].clip(0, 100, inplace=True)  # assigns values outside the boundary to 0 and 100
>>> df.head()
       Temp       Sun    Rel Hum
0  9.714943  6.255931  93.105135
1  0.551001  3.063972  85.923184
2  7.780588  3.580514  79.124139
3  3.766066  3.684801  84.543149
4  8.541507 -3.066196  83.598925
>>> df[df['Rel Hum'] > 100].head()
Empty DataFrame
Columns: [Temp, Sun, Rel Hum]
Index: []
Just do a clip:
df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),
    "Sun": np.random.normal(1.615054, 2.053996, 93),
    "Rel Hum": np.random.normal(87.153118, 5.529958, 93)
}).clip(0, 100)
And plot:
df.plot.density(subplots=True);
gives one density curve per column (plot not shown here).
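If the three columns need different bounds (per the edit in the question), clip also accepts per-column lower/upper values; a sketch under those assumed bounds (Sun in [0, 24], Rel Hum in [0, 100], Temp unbounded):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827, 93),
    "Sun": np.random.normal(1.615054, 2.053996, 93),
    "Rel Hum": np.random.normal(87.153118, 5.529958, 93)
})
lower = pd.Series({"Temp": -np.inf, "Sun": 0, "Rel Hum": 0})
upper = pd.Series({"Temp": np.inf, "Sun": 24, "Rel Hum": 100})
df = df.clip(lower=lower, upper=upper, axis=1)  # aligns the bounds with the columns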
You can clip, though this leaves you with a spike at the edges:
import pandas as pd
import numpy as np
N = 10**5
df = pd.DataFrame({"Rel Hum": np.random.normal(87.153118,5.529958, N)})
df['Rel Hum'].clip(lower=0, upper=100).plot(kind='hist', bins=np.arange(60,101,1))
If you want to avoid that spike redraw out of bounds points until everything is within bounds:
while not df['Rel Hum'].between(0, 100).all():
    m = ~df['Rel Hum'].between(0, 100)
    df.loc[m, 'Rel Hum'] = np.random.normal(87.153118, 5.529958, m.sum())

df['Rel Hum'].plot(kind='hist', bins=np.arange(60, 101, 1))
