How to calculate multi-variable nonlinear regression in Python?

I am trying to write a program that performs non-linear regression. I have three parameters [R, G, B], and I want to obtain the temperature of any pixel in an image with respect to my reference color codes. For example:
Reference file (R, G, B, Temperature) = [(157,158,19,300), (146,55,18,320), (136,57,22,340), (133,88,25,460), (141,105,27,500), (210,195,3,580), (203,186,10,580), (214,195,4,600), (193,176,10,580)]
As you can see above, the RGB values change non-linearly. Currently I use a "minimum error" algorithm to obtain the temperature for a given RGB code, but I also want to handle values that do not exist in the reference file (e.g., if I have (155, 200, 40), which is not in the reference file, I still need to determine which temperature those three codes correspond to).
Here is the code that selects the closest reference temperature given an RGB value:
from math import sqrt

referenceColoursRGB = [(157, 158, 19),
                       (146, 55, 18),
                       (136, 57, 22),
                       (133, 88, 25),
                       (141, 105, 27),
                       (203, 186, 10),
                       (214, 195, 4)]
referenceTemperatures = [300, 320, 340, 460, 500, 580, 600]

def closest_color(rgb):
    r, g, b = rgb
    color_diffs = []
    for color in referenceColoursRGB:
        cr, cg, cb = color
        # Euclidean distance in RGB space
        color_diff = sqrt((r - cr)**2 + (g - cg)**2 + (b - cb)**2)
        color_diffs.append((color_diff, color))
    minErrorIndex = color_diffs.index(min(color_diffs))
    return minErrorIndex

temperatureLocation = closest_color((149, 60, 25))
print("Temperature : ", referenceTemperatures[temperatureLocation])
# => Temperature :  320

temperatureLocation = closest_color((220, 145, 4))
print("Temperature : ", referenceTemperatures[temperatureLocation])
# => Temperature :  580
I really want to calculate temperature values that don't appear in the reference list, but I am having problems using all RGB values and calculating/predicting reasonable/accurate temperatures from them.
I tried collapsing the three values into a single parameter and then using polyfit, but there is a problem: every variable has the same kind of effect on that one parameter, so I can't tell which color channel is dominant. For example, with oneParameter = 1000*R + 100*G + 10*B, the color codes (2, 20, 50) and (2, 5, 200) both evaluate to the same oneParameter value of 4500.
I hope I have explained my problem clearly. Any help would be appreciated. Thank you.

N.B.: I can't vouch for the physical accuracy of this prediction, but this might be along the lines of what you're looking for. I.e., this makes the predictions match your reference data exactly, but I have no idea how accurate the temperature predictions might be for non-reference RGB colors. If I knew the exact physics of the mapping from RGB to temperature, I'd use that.
Bad Model 1
One simple way to do nonlinear regression is to preprocess your data so that you have nonlinear terms for your regression. sklearn has a built-in preprocessing function to do this by generating powers and interactions of the original input data.
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

referenceColoursRGB = [(157, 158, 19),
                       (146, 55, 18),
                       (136, 57, 22),
                       (133, 88, 25),
                       (141, 105, 27),
                       (203, 186, 10),
                       (214, 195, 4)]
referenceTemperatures = [300, 320, 340, 460, 500, 580, 600]

poly = PolynomialFeatures(degree=2)
poly_RGB = poly.fit_transform(referenceColoursRGB)
ols = linear_model.LinearRegression()
ols.fit(poly_RGB, referenceTemperatures)
ols.predict(poly_RGB)
# array([300., 320., 340., 460., 500., 580., 600.])
To make non-reference RGB predictions, you would do something like:
ols.predict(poly.transform([(149, 60, 25)]))
# array([369.68980598])
ols.predict(poly.transform([(220, 145, 4)]))
# array([949.34548347])
EDIT: Bad Model 2
So, previously I picked something simple to implement a nonlinear fit, using PolynomialFeatures, without regard to any real physics that might be going on at the RGB sensor. You can decide if that fits your needs. Well, here's another model that uses RGB ratios, again without any regard to whatever physics is happening. Once more, you can decide if this model is appropriate.
rat_RGB = [(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in referenceColoursRGB]
rat_ols = linear_model.LinearRegression()
rat_ols.fit(rat_RGB, referenceTemperatures)
rat_ols.predict(rat_RGB)
# array([300., 320., 340., 460., 500., 580., 600.])
You can see that this model can also be fit perfectly to the reference data. It's interesting, and probably important, to note that the earlier example predictions produce different temperatures with this model.
rat_ols.predict([(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in [(149, 60, 25)]])
# array([481.79424789])
rat_ols.predict([(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in [(220, 145, 4)]])
# array([653.06116368])
I hope you can find/develop an RGB/temperature model that is physics-based. I wonder whether the manufacturer of your RGB sensor has specifications and/or engineering notes that might help.
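If a physics-based functional form does turn up, scipy.optimize.curve_fit can fit its parameters to the reference data directly. Here is a minimal sketch reusing the reference lists above; the log-linear form below is purely an illustrative assumption, not sensor physics:
import numpy as np
from scipy.optimize import curve_fit

rgb = np.array(referenceColoursRGB, dtype=float)
temps = np.array(referenceTemperatures, dtype=float)

def temp_model(rgb, a, b, c, d):
    # hypothetical form: temperature as a log-linear function of the channels
    r, g, b_ = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return a*np.log1p(r) + b*np.log1p(g) + c*np.log1p(b_) + d

params, _ = curve_fit(temp_model, rgb, temps)
print(temp_model(np.array([[149., 60., 25.]]), *params))
With only seven reference points, anything beyond a handful of parameters will overfit, so the choice of functional form matters far more than the fitting machinery.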

Related

Given an existing distribution, how can I draw samples of size N with std of X?

I have an existing distribution of values and I want to draw samples of size 5, but those 5 samples need to have a std of X within some tolerance. For example, I need 5 samples that have a std of 10 (even though the overall distribution has std ≈ 32).
The example code below somewhat works, but it is quite slow for large datasets. It randomly samples the distribution until it finds something close to the target std, then removes those elements so they can't be drawn again.
Is there a smarter, faster way to do this properly? It works OK for some values of target_std (above 6), but it isn't accurate below 6.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(23)

# Create a distribution
d1 = np.random.normal(95, 5, 200)
d2 = np.random.normal(125, 5, 200)
d3 = np.random.normal(115, 10, 200)
d4 = np.random.normal(70, 10, 100)
d5 = np.random.normal(160, 5, 200)
d6 = np.random.normal(170, 20, 100)
dist = np.concatenate((d1, d2, d3, d4, d5, d6))
print(f"Full distribution: len={len(dist)}, mean={np.mean(dist)}, std={np.std(dist)}")

plt.hist(dist, bins=100)
plt.title("Full Distribution")
plt.show()

batch_size = 5
num_batches = math.ceil(len(dist) / batch_size)
target_std = 10
tolerance = 1
# how many candidate samples to search per batch
num_samples = 100

result = []
# Find samples of batch_size that are closest to target_std
for i in range(num_batches):
    samples = []
    idxs = np.arange(len(dist))
    for j in range(num_samples):
        indices = np.random.choice(idxs, size=batch_size, replace=False)
        sample = dist[indices]
        std = sample.std()
        err = abs(std - target_std)
        samples.append((sample, indices, std, err, np.mean(sample), max(sample), min(sample)))
        if err <= tolerance:
            # close enough, stop sampling
            break
    # sort by smallest err first, then take the first/best result
    samples = sorted(samples, key=lambda x: x[3])
    best = samples[0]
    if i % 100 == 0:
        print(f"{i}, std={best[2]}, err={best[3]}, nsamples={num_samples}")
    result.append(best)
    # remove the data from our source
    dist = np.delete(dist, best[1])

df_samples = pd.DataFrame(result, columns=["sample", "indices", "std", "err", "mean", "max", "min"])
df_samples["err"].plot(title="Errors (target_std - batch_std)")
batch_std = df_samples["std"].mean()
batch_err = df_samples["err"].mean()
print(f"RESULT: Target std: {target_std}, Mean batch std: {batch_std}, Mean batch err: {batch_err}")
Since your problem is not restricted to a particular distribution, I use a normal random distribution here, but this should work for any distribution. However, the run time will depend on the population size.
import numpy as np

population = np.random.randn(1000) * 32
std = 10.
tol = 1.
n_samples = 5

samples = list(np.random.choice(population, n_samples))
while True:
    center = np.mean(samples)
    dis = [abs(i - center) for i in samples]
    if np.std(samples) > (std + tol):
        # std too high: kick out the sample farthest from the mean
        samples.pop(dis.index(max(dis)))
    elif np.std(samples) < (std - tol):
        # std too low: kick out the sample closest to the mean
        samples.pop(dis.index(min(dis)))
    else:
        break
    # draw a replacement for the kicked sample
    samples.append(np.random.choice(population, 1)[0])
Here is how the code works. First, we draw n_samples; most likely the std is not yet in the range you want, so we calculate the mean and each sample's absolute distance from the mean. Then, if the std is larger than the desired value plus the tolerance, we kick out the farthest sample and draw a new one, and vice versa.
Note that if this takes too much time for your data, then after kicking the outlier out you can calculate the range of admissible values for the next element and draw from that part of the population, instead of drawing at random; a sketch of this idea follows. Hopefully this works for you.
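A vectorized sketch of that idea, assuming NumPy and the variables from the snippet above (candidate_values is a hypothetical helper, not part of the original code): it computes, for every population value at once, what the sample std would become if that value were appended, and keeps only the values that land within tolerance.
import numpy as np

def candidate_values(samples, population, target_std, tol):
    # std of `samples` plus each candidate appended, computed for all candidates at once
    trial = np.concatenate(
        [np.tile(samples, (len(population), 1)), population[:, None]], axis=1)
    stds = trial.std(axis=1)
    return population[np.abs(stds - target_std) <= tol]

# after kicking a sample out, draw the replacement from the admissible pool
pool = candidate_values(samples, population, std, tol)
if len(pool) > 0:
    samples.append(np.random.choice(pool))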
DISCLAIMER: This is not a random draw anymore, and you should be aware that the draw is biased and is not representative of the population.

Special function definition issue in Python vs Mathematica

I have Mathematica code that calculates the 95% confidence intervals of a Cumulative Distribution Function (CDF) obtained from a specific Probability Density Function (PDF). The PDF is ugly, as it contains a Hypergeometric 2F1 function, and I need to calculate the 2-sigma error bars of a data set of 15 values.
I want to translate this code to Python, but I get a very significant divergence in the second half of the values.
Mathematica code
results are the lower and upper 2-sigma confidence levels for the values in xdata. That is, xdata should always fall between the two corresponding results values.
navs = {10, 10, 18, 30, 52, 87, 147, 245, 410, 684, 1141, 1903, 3173, 5290, 8816};
freqs = {0.00002, 0.00004, 0.0000666667, 0.000111111, 0.000185185, 0.000308642, 0.000514403, 0.000857339, 0.00142893, 0.00238166, 0.00396944, 0.00661594, 0.0165426, 0.0220568, 0.027571}
xdata = {0.578064980346793, 0.030812200935204, 0.316777979844816,
0.353718150091612, 0.287659600326548, 0.269254388840293,
0.16545714457921, 0.138759871084825, 0.0602382519940077,
0.10120771961, 0.065311134782518, 0.105235790998594,
0.124642033979457, 0.0271909963701794, 0.0686653810421847};
data = MapThread[{#1, #2, #3} &, {navs, freqs, xdata}]
post[x_, n_, y_] =
  (n - 1) (1 - x)^n (1 - y)^(n - 2) Hypergeometric2F1[n, n, 1, x*y]

integral = Map[
   (values = #; mesh = Subdivide[0, 1, 1000];
    Interpolation[
      DeleteDuplicates[
        {Map[SetPrecision[post[#, values[[1]], values[[3]]^2], 100] &, mesh]
           // (Accumulate[#] - #/2 - #[[1]]/2) & // #/#[[-1]] &,
         mesh}\[Transpose],
        (#1[[1]] == #2[[1]] &)],
      InterpolationOrder -> 1]) &,
   data];

results =
  MapThread[{Sqrt[#1[.025]], Sqrt[#1[0.975]]} &, {integral, data}]
{{0.207919, 0.776508}, {0.0481485, 0.535278}, {0.0834002, 0.574447},
{0.137742, 0.551035}, {0.121376, 0.455097}, {0.136889, 0.403306},
{0.0674029, 0.279408}, {0.0612534, 0.228762}, {0.0158357, 0.134521},
{0.0525374, 0.156055}, {0.0270589, 0.108861}, {0.0740978, 0.137691},
{0.100498, 0.149646}, {0.00741129, 0.0525161}, {0.0507748, 0.0850961}}
Python code
Here's my translation: results are the same quantity as before, truncated to the 7th digit to increase readability.
The results values I get start diverging from the 7th pair of values on, and the last four points of xdata do not fall between the two corresponding results values.
import numpy as np
from scipy.integrate import cumtrapz
from scipy.interpolate import interp1d
from mpmath import mp, hyp2f1

mesh = list(np.linspace(0, 1, 1000))
navs = [10, 10, 18, 30, 52, 87, 147, 245, 410, 684, 1141, 1903, 3173, 5290, 8816]
freqs = [0.00002, 0.00004, 0.0000666667, 0.000111111, 0.000185185, 0.000308642, 0.000514403, 0.000857339, 0.00142893, 0.00238166, 0.00396944, 0.00661594, 0.0165426, 0.0220568, 0.027571]
xdata = [0.578064980346793, 0.030812200935204, 0.316777979844816,
         0.353718150091612, 0.287659600326548, 0.269254388840293,
         0.16545714457921, 0.138759871084825, 0.0602382519940077,
         0.10120771961, 0.065311134782518, 0.105235790998594,
         0.124642033979457, 0.0271909963701794, 0.0686653810421847]

def post(x, n, y):
    return (n - 1) * ((1 - x)**n) * ((1 - y)**(n - 2)) * hyp2f1(n, n, 1, x*y)

# setting the numeric precision to 100 as in Mathematica,
# trying to get the most precise hypergeometric function values
mp.dps = 100
mp.pretty = True

results = []
for i in range(len(navs)):
    postprob = []
    for j in range(len(mesh)):
        posterior = post(mesh[j], navs[i], xdata[i]**2)
        postprob.append(posterior)
    # calculate the norm of the pdf for integration
    norm = np.trapz(np.array(postprob), mesh)
    # integrate pdf/norm to obtain the cdf
    integrate = list(np.unique(cumtrapz(np.array(postprob)/norm, mesh, initial=0)))
    mesh2 = list(np.linspace(0, 1, len(integrate)))
    # interpolate the inverse cdf to obtain the 2-sigma quantiles
    icdf = interp1d(integrate, mesh2, bounds_error=False, fill_value='extrapolate')
    results.append(list(np.sqrt(icdf([0.025, 0.975]))))
results
[[0.2079198, 0.7765088], [0.0481485, 0.5352773], [0.0834, 0.5744489],
[0.1377413, 0.5510352], [0.1218029, 0.4566994], [0.1399324, 0.4122767],
[0.0733743, 0.3041607], [0.0739691, 0.2762597], [0.0230135, 0.1954886],
[0.0871462, 0.2588804], [0.05637, 0.2268962], [0.1731199, 0.3217401],
[0.2665897, 0.3969059], [0.0315915, 0.2238736], [0.2224567, 0.3728803]]
Thanks to the comments on this question, I found out that the hypergeometric function gives different results in the two languages. With the same input values, Mathematica's Hypergeometric2F1 gives 1.0588267, while Python's mpmath.hyp2f1 gives 1.0588866. This is at the very second point of the mesh, and the difference is in the fifth decimal place.
Is there a better definition of this special function somewhere that I was not able to find?
I still don't know if this is due only to the hypergeometric function or also to the integration method, but it is definitely a starting point.
(I am fairly new to Python, maybe the code is a bit naive)
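One way to isolate the special function from the integration is to evaluate both implementations at the same argument: mpmath works at arbitrary precision, while scipy.special.hyp2f1 is limited to double precision. A minimal sketch (the sample argument below is illustrative):
from mpmath import mp, hyp2f1 as mp_hyp2f1
from scipy.special import hyp2f1 as sp_hyp2f1

mp.dps = 50  # high-precision reference value
n = 10
z = (1/999) * 0.578064980346793**2  # second mesh point times y for the first data row
print(mp_hyp2f1(n, n, 1, z))  # arbitrary-precision mpmath value
print(sp_hyp2f1(n, n, 1, z))  # double-precision scipy value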

pymc with observations on multiple variables

I'm using an example of linear regression from Bayesian Methods for Hackers but having trouble expanding it to my usage.
I have observations on a random variable, an assumed distribution on that random variable, and finally another assumed distribution on that random variable for which I have observations. I have tried to model it with intermediate distributions on a and b, but it complains: Wrong number of dimensions: expected 0, got 1 with shape (788,).
To describe the actual model: I am predicting the conversion rate for a certain amount (n) of cultivating emails. My prior is that the conversion rate (described by a Beta function on alpha and beta) will be updated by having alpha and beta scaled by some factors a and b in (0, inf], which start at 1 for n=0 and increase to their maximum value at some threshold.
import numpy as np
import pymc3 as pm

# Generate predictor data, X, and target data, Y
data = [
    {'n': 0,  'trials': 120, 'successes': 1},
    {'n': 5,  'trials': 111, 'successes': 2},
    {'n': 10, 'trials': 78,  'successes': 1},
    {'n': 15, 'trials': 144, 'successes': 3},
    {'n': 20, 'trials': 280, 'successes': 7},
    {'n': 25, 'trials': 55,  'successes': 1}]
X = np.empty(0)
Y = np.empty(0)
for dat in data:
    X = np.insert(X, 0, np.ones(dat['trials']) * dat['n'])
    target = np.zeros(dat['trials'])
    target[:dat['successes']] = 1
    Y = np.insert(Y, 0, target)

with pm.Model() as model:
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    a_slope = pm.Deterministic('a_slope', 1 + (X/n_sat)*(a_gamma-1))
    b_slope = pm.Deterministic('b_slope', 1 + (X/n_sat)*(b_gamma-1))
    a = pm.math.switch(X >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(X >= n_sat, b_gamma, b_slope)
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b)
    observed = pm.Bernoulli("observed", p, observed=Y)
Is there a way to get this to work?
Data
First, note that the total likelihood of repeated Bernoulli trials is exactly a binomial likelihood, so there is no need to expand to individual trials in your data. I'd also suggest using a Pandas DataFrame to manage your data; it helps to keep things tidy:
import pandas as pd

df = pd.DataFrame({
    'n': [0, 5, 10, 15, 20, 25],
    'trials': [120, 111, 78, 144, 280, 55],
    'successes': [1, 2, 1, 3, 7, 1]
})
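As an aside, if the expanded X/Y arrays from the question are already in scope, the same DataFrame can be rebuilt with a groupby rather than by hand; a sketch:
# collapse the per-trial arrays back to one row per condition
df = (pd.DataFrame({'n': X.astype(int), 'success': Y})
        .groupby('n')['success']
        .agg(trials='count', successes='sum')
        .reset_index())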
Solution
This will help simplify the model, but the solution really is to add a shape argument to the p random variable, so that PyMC3 knows how to interpret the one-dimensional parameters. The fact is that you do want a different p distribution for each n case you have, so there is nothing conceptually wrong here.
with pm.Model() as model:
    # conversion rate hyperparameters
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)
    # switchpoint prior
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    # NB: I removed pm.Deterministic b/c (a|b)_slope[0] is constant
    # and this causes issues when using ArviZ
    a_slope = 1 + (df.n.values/n_sat)*(a_gamma-1)
    b_slope = 1 + (df.n.values/n_sat)*(b_gamma-1)
    a = pm.math.switch(df.n.values >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(df.n.values >= n_sat, b_gamma, b_slope)
    # conversion rates
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b, shape=len(df.n))
    # observations
    pm.Binomial("observed", n=df.trials, p=p, observed=df.successes)
    trace = pm.sample(5000, tune=10000)
This samples nicely and yields reasonable intervals on the conversion rates, but the fact that the posteriors for alpha_n and beta_n run right up against your prior boundaries is a bit concerning.
I think the reason for this is that you only run 55-280 trials per condition, and if the conditions were independent (the worst case), conjugacy would tell us that your Beta hyperparameters should be in that range. Since you are doing a regression, the best-case scenario for information sharing across conditions would put your hyperparameters in the range of the total number of trials (788), but that's an upper limit. Because you're outside this range, the concern is that you're forcing the model to be more precise in its estimates than the evidence can support. However, one can justify this if the prior is based on strong independent evidence.
Otherwise, I'd suggest expanding the ranges on the priors that affect the final alpha*a and beta*b numbers (the sums of those should be close to your trial counts in the posterior); one possible widening is sketched below.
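A sketch of one such widening, with Gamma hyperpriors in place of the hard Uniform bounds (the particular shape and rate values are illustrative assumptions, not calibrated choices):
with pm.Model() as model_wider:
    # broad, weakly-informative hyperpriors instead of Uniform(5, 13) / Uniform(1000, 1400)
    alpha = pm.Gamma("alpha_n", alpha=2.0, beta=0.2)    # mean 10, long right tail
    beta = pm.Gamma("beta_n", alpha=2.0, beta=0.002)    # mean 1000, long right tail
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    a_slope = 1 + (df.n.values/n_sat)*(a_gamma-1)
    b_slope = 1 + (df.n.values/n_sat)*(b_gamma-1)
    a = pm.math.switch(df.n.values >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(df.n.values >= n_sat, b_gamma, b_slope)
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b, shape=len(df.n))
    pm.Binomial("observed", n=df.trials, p=p, observed=df.successes)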
Alternative Model
I'd probably do something along the following lines, which I think has a more transparent parameterization, though it's not completely identical to your model:
with pm.Model() as model_br_sp:
    # regression coefficients
    alpha = pm.Normal("alpha", mu=0, sd=1)
    beta = pm.Normal("beta", mu=0, sd=1)
    # saturation parameters
    saturation_point = pm.Gamma("saturation_point", alpha=20, beta=2)
    max_success_rate = pm.Beta("max_success_rate", 1, 9)
    # probability of conversion
    success_rate = pm.Deterministic(
        "success_rate",
        pm.math.switch(df.n.values > saturation_point,
                       max_success_rate,
                       max_success_rate*pm.math.sigmoid(alpha + beta*df.n)))
    # observations
    pm.Binomial("successes", n=df.trials, p=success_rate, observed=df.successes)
    trace_br_sp = pm.sample(draws=5000, tune=10000)
Here we map the predictor space to probability space through a sigmoid that maxes out at the maximum success rate. The prior on the saturation point is identical to yours, while the prior on the maximum success rate is weakly informative (Beta[1, 9], though I will say it runs on a flat prior nearly as well). This also samples well and gives similar intervals (though the switchpoint seems to dominate more).
We can compare the two models and see that there isn't a significant difference in their explanatory power:
import arviz as az

model_compare = az.compare({'Binomial Regression w/ Switchpoint': trace_br_sp,
                            'Original Model': trace})
az.plot_compare(model_compare)

Finding anomalous values from sinusoidal data

How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. While I can plot the data and spot any anomalies or noise visually, how can I do it without plotting the data? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only some values in the time vector are anomalous, the rest of the values follow a regular progression. This means that if we gather the data points of the vector into clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represent the real deal), then we can use that average to run a triad detection, within a given threshold, over the vector and detect which elements are anomalous.
For this we need two functions: calculate_average_step, which will calculate that average for the biggest cluster of close values, and detect_anomalous_values, which will yield the indexes of the anomalous values in our vector, based on the average calculated earlier.
After we detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and by using the adjacent points in the vector.
import random
import numpy as np
import matplotlib.pyplot as plt

def calculate_average_step(array, threshold=5):
    """
    Determine the average step by doing a weighted average based on clustering of averages.
    array: our array
    threshold: the +/- offset for grouping clusters. Applicable to all elements in the array.
    """
    # determine all the steps
    steps = []
    for i in range(0, len(array) - 1):
        steps.append(abs(array[i] - array[i+1]))
    # determine the step clusters
    clusters = []
    skip_indexes = []
    cluster_index = 0
    for i in range(len(steps)):
        if i in skip_indexes:
            continue
        # determine the cluster band (based on threshold)
        cluster_lower = steps[i] - (steps[i]/100) * threshold
        cluster_upper = steps[i] + (steps[i]/100) * threshold
        # create the new cluster
        clusters.append([])
        clusters[cluster_index].append(steps[i])
        # try to match elements from the rest of the array
        for j in range(i + 1, len(steps)):
            if not (cluster_lower <= steps[j] <= cluster_upper):
                continue
            clusters[cluster_index].append(steps[j])
            skip_indexes.append(j)
        cluster_index += 1  # increment the cluster id
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    biggest_cluster = clusters[0] if len(clusters) > 0 else None
    if biggest_cluster is None:
        return None
    return sum(biggest_cluster) / len(biggest_cluster)  # return our most common average

def detect_anomalous_values(array, regular_step, threshold=5):
    """
    Will scan every triad (3 points) in the array to detect anomalies.
    array: the array to iterate over.
    regular_step: the step around which we form the upper/lower band for filtering
    threshold: +/- variation between the steps of the first and median element and median and third element.
    """
    assert len(array) >= 3  # must have at least 3 elements
    anomalous_indexes = []
    step_lower = regular_step - (regular_step / 100) * threshold
    step_upper = regular_step + (regular_step / 100) * threshold
    # detection will be forward from i (hence 3 elements must be available for the triad)
    for i in range(0, len(array) - 2):
        a = array[i]
        b = array[i+1]
        c = array[i+2]
        first_step = abs(a - b)
        second_step = abs(b - c)
        first_belonging = step_lower <= first_step <= step_upper
        second_belonging = step_lower <= second_step <= step_upper
        # detect that both steps are alright
        if first_belonging and second_belonging:
            continue  # all is good here, nothing to do
        # detect if the first point in the triad is bad
        if not first_belonging and second_belonging:
            anomalous_indexes.append(i)
        # detect if the last point in the triad is bad
        if first_belonging and not second_belonging:
            anomalous_indexes.append(i+2)
        # detect if the mid point in the triad is bad (or everything is bad)
        if not first_belonging and not second_belonging:
            anomalous_indexes.append(i+1)
            # we won't add the others here because they will be detected by
            # the rest of the triad scans
    return sorted(set(anomalous_indexes))  # return unique indexes

if __name__ == "__main__":
    N = 10  # Set signal sample length
    t1 = -np.pi  # Simulation begins at t1
    t2 = np.pi  # Simulation ends at t2
    in_array = np.linspace(t1, t2, N)
    # add some noise
    noise_input = random.uniform(-.5, .5)
    in_array[random.randint(0, len(in_array)-1)] = noise_input
    noisy_out_array = np.sin(in_array)
    # display noisy sin
    plt.figure()
    plt.plot(in_array, noisy_out_array, color='red', marker="o")
    plt.title("noisy numpy.sin()")
    # detect anomalous values
    average_step = calculate_average_step(in_array)
    anomalous_indexes = detect_anomalous_values(in_array, average_step)
    # replace anomalous points with an estimated value based on our calculated average
    for anomalous in anomalous_indexes:
        # try forward extrapolation
        try:
            in_array[anomalous] = in_array[anomalous-1] + average_step
        # else try backward extrapolation
        except IndexError:
            in_array[anomalous] = in_array[anomalous+1] - average_step
    # generate sine wave
    out_array = np.sin(in_array)
    plt.figure()
    plt.plot(in_array, out_array, color='green', marker="o")
    plt.title("cleaned numpy.sin()")
    plt.show()
Noisy sine:
Cleaned sine:
Your problem lies in the time vector (which is one-dimensional). You will need to apply some sort of filter to that vector.
The first thing that came to mind was medfilt (median filter) from scipy, and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
The problem with this filter, though, is that if we apply some noise values to the edges of the vector, like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], then the output will be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.], and obviously this is not OK for the sine plot, since it will produce the same artifacts in the sine values array.
A better approach to this problem is to treat the time vector as a y output and its index values as the x input, and do a linear regression on this "time linear function" (note the quotes: it just means we're faking a two-dimensional model by applying a fake X vector). The code uses scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
the output will be:
the resulting signal will be an approximation of the original, as it is with any form of extrapolation/interpolation/regression filtering.
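If replacing the whole vector with the fitted line is too aggressive, a residual-based variant can patch only the flagged points and leave the regular samples untouched. A sketch (fix_anomalies and the 3-sigma threshold are assumptions, not part of the answer above):
import numpy as np
from scipy.stats import linregress

def fix_anomalies(t, n_sigma=3.0):
    """Replace only the outlier entries of a nearly-linear time vector t (1-D numpy array)."""
    x = np.arange(len(t))
    slope, intercept, *_ = linregress(x, t)
    fit = intercept + slope * x
    residuals = t - fit
    # flag points whose residual is far outside the typical spread
    bad = np.abs(residuals) > n_sigma * np.std(residuals)
    cleaned = t.copy()
    cleaned[bad] = fit[bad]  # substitute the fitted value only where flagged
    return cleaned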

Convert RGB color to the nearest color in palette (web safe color)?

I want to convert a color in RGB or hex format to its nearest web-safe color.
Details about web-safe colors can be found here: http://en.wikipedia.org/wiki/Web_safe_color
This website (http://www.colortools.net/color_make_web-safe.html) does exactly what I want, but I am not sure how to go about it in Python. Can anyone help me out here?
Despite being somewhat of a misnomer, the web safe color palette is indeed quite useful for color quantization. It's simple, fast, flexible, and ubiquitous. It also allows for RGB hex shorthand such as #369 instead of #336699. Here's a walkthrough:
Web safe colors are RGB triplets, with each value being one of the following six: 00, 33, 66, 99, CC, FF. So we can divide the max RGB value, 255, by five (one less than the total number of possible values) to get a multiplier value, 51. Then, for each channel:
1. Normalize the channel value by dividing by 255 (this makes it a value from 0-1 instead of 0-255).
2. Multiply by 5, and round the result to snap to the nearest web-safe step.
3. Multiply by 51 to get the final web-safe value.
All together, this looks something like:
def getNearestWebSafeColor(r, g, b):
    r = int(round((r / 255.0) * 5) * 51)
    g = int(round((g / 255.0) * 5) * 51)
    b = int(round((b / 255.0) * 5) * 51)
    return (r, g, b)

print(getNearestWebSafeColor(65, 135, 211))
No need to go crazy comparing colors or creating huge lookup tables, as others have suggested. :-)
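Since every web-safe channel value (00, 33, 66, 99, CC, FF) is a doubled hex digit, you can also derive the shorthand form mentioned above. A small sketch (to_websafe_hex is a hypothetical helper, not part of the answer above):
def to_websafe_hex(r, g, b, shorthand=True):
    # each web-safe channel is one of 00, 33, 66, 99, CC, FF,
    # i.e. a repeated hex digit, so the shorthand keeps one digit per channel
    full = [format(c, '02x') for c in (r, g, b)]
    if shorthand:
        return '#' + ''.join(h[0] for h in full)
    return '#' + ''.join(full)

print(to_websafe_hex(*getNearestWebSafeColor(65, 135, 211)))  # -> '#39c'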
import scipy.spatial as sp

input_color = (100, 50, 25)
websafe_colors = [(200, 100, 50), ...]  # list of web-safe colors
tree = sp.KDTree(websafe_colors)  # build a k-d tree from the web-safe colors
distance, result = tree.query(input_color)  # Euclidean distance and index of the nearest web-safe color
nearest_color = websafe_colors[result]
Or add a loop for several input_colors.
About k-dimensional tree
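For reference, the full 216-color web-safe palette can be generated rather than typed out; a sketch, assuming the standard six channel values:
from itertools import product

# the six web-safe channel values: 0x00, 0x33, 0x66, 0x99, 0xCC, 0xFF
websafe_colors = list(product((0, 51, 102, 153, 204, 255), repeat=3))  # 216 colors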
You can use colorir for this:
>>> from colorir import Palette
>>> css = Palette.load("css")
>>> web_safe = css.most_similar("fe0000") # input can be in any format
>>> web_safe
HexColor(#ff0000)
If you need its name:
>>> css.get_names(web_safe)
['red']
If you need it as RGB:
>>> web_safe.rgb()
sRGB(255, 0, 0)
