I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some Pythonic vectorization that could help, but right now my code is very C-like (and yes, I know that is not the right way to write Python). My 2D code is below:
import numpy
import math
feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9]);
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0]);
#===Concatenate All Features===#
numFrames = len(feature1);
allFeatures = numpy.zeros((2,numFrames));
allFeatures[0,:] = feature1;
allFeatures[1,:] = feature2;
#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames);
allBins = numpy.zeros((allFeatures.shape[0],numBins+1));
#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0],2));
for f in range(allFeatures.shape[0]):
    allRanges[f,0] = numpy.amin(allFeatures[f,:]);
    allRanges[f,1] = numpy.amax(allFeatures[f,:]);
#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0],numBins));
#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
    freqhist, binedges = numpy.histogram(allFeatures[f,:],bins=numBins,range=[allRanges[f,0],allRanges[f,1]],density=False);
    allBins[f,:] = binedges;
    allIndividualProbs[f,:] = freqhist;
#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins,numBins));
#===Compute the joint probability distribution===#
numElements = 0;
for b1 in range(numBins):
    for b2 in range(numBins):
        for f1 in range(numFrames):
            for f2 in range(numFrames):
                if ( ( (feature1[f1] >= allBins[0,b1]) and (feature1[f1] <= allBins[0,b1+1]) ) and ((feature2[f2] >= allBins[1,b2]) and (feature2[f2] <= allBins[1,b2+1])) ):
                    jointProbs[b1,b2] += 1;
                    numElements += 1;
jointProbs /= numElements;
#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0]);
How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?
Check out the function numpy.histogramdd. This function can compute histograms in arbitrary numbers of dimensions. If you set the parameter density=True (the older normed parameter has been removed from recent NumPy versions), it returns the probability density in each bin: the bin count divided by the total number of samples and by the bin hypervolume. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize it yourself. All together, you'll have something like:
import numpy as np
numBins = 10 # number of bins in each dimension
data = np.random.randn(100000, 3) # generate 100000 3-d random data points
jointProbs, edges = np.histogramdd(data, bins=numBins)
jointProbs /= jointProbs.sum()
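Applied to the arrays in the question, a minimal sketch might look like this (assuming feature1, feature2 and feature3 are the arrays defined above, and reusing the numBins = int(0.25*numFrames) rule from the 2D code):
import numpy as np

# stack the features so each row is one frame and each column is one feature
allFeatures = np.column_stack((feature1, feature2, feature3))   # shape (numFrames, 3)
numBins = int(0.25 * allFeatures.shape[0])
jointProbs, edges = np.histogramdd(allFeatures, bins=numBins)
jointProbs /= jointProbs.sum()   # normalize so the joint PMF sums to 1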
I am trying to generate random data with Pandas.
The data needs to be stored in two columns. The first column needs to contain categorical variables (Stratum_1 through Stratum_19), and each of these strata can contain a random number of values.
The second column needs to hold values in the range between 1 and 180000000, with a standard deviation of 453210, a mean of 170000, and 100000 rows.
Here is what I tried:
import numpy as np
import pandas as pd

categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame(np.random.randint(0,180000000,size=(100000, 1)),columns=list('1'))
I tried the code above, but I don't know how to combine categorical and numerical values while getting the desired mean and standard deviation. Can anybody help me solve this problem and generate the data?
I decided to use the gamma distribution to generate your desired sample, since the given parameters are not suitable for the normal distribution: a normal with mean 170000 and standard deviation 453210 would put a large share of its mass below your lower bound of 1.
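For context, the shape and scale used in the code below come from matching the mean and variance of a gamma distribution shifted by delta: for a gamma distribution, mean = shape * scale and variance = shape * scale**2, so requiring shape * scale = mu - delta and shape * scale**2 = sigma**2 gives the formulas used in the code. A quick sanity check with the question's numbers:
# moment matching for the shifted gamma: mean = shape*scale + delta, variance = shape*scale**2
mu, sigma, delta = 170000, 453210, 1
shape = ((mu - delta) / sigma) ** 2
scale = sigma**2 / (mu - delta)
print(shape * scale + delta)        # ~170000, matches the target mean
print((shape * scale**2) ** 0.5)    # ~453210, matches the target std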
Code
import numpy as np
import pandas as pd
# desired parameters
n_rows = 100000
lower, upper = 1, 180000000
mu, sigma = 170000, 453210
# amount of shift
delta = lower
# parameters for the gamma distribution
shape = ((mu - delta) / sigma) ** 2
scale = sigma**2 / (mu - delta)
# Create a dataframe
categories = {'name': [f'Stratum_{i}' for i in range(1, 19 + 1)]}
df = pd.DataFrame(categories).sample(n=n_rows, replace=True).reset_index(drop=True)
# Generate samples along with your desired parameters
generator = np.random.default_rng()
while True:
    df['value'] = generator.gamma(shape=shape, scale=scale, size=n_rows) + delta
    if df.value.max() <= upper:
        break
# Show statistics
print(df.describe())
Output
                 value
count          100,000
mean           169,403 (Target: 170,000)
std            449,668 (Target: 453,210)
min                  1
25%            39.4267
50%            5529.28
75%            105,748
max        9.45114e+06
Try:
import numpy as np
import pandas as pd
categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame({'num':np.random.normal(170000, 453210,size=(300000, 1)).reshape(-1), 'cat':np.random.choice(categorical['name'], 300000)})
df[(0<df['num'])&(df['num']<180000000)].sample(100000)
result:
I have an existing distribution of values and I want to draw samples of size 5, but those 5 samples need to have a std of X within some tolerance. For example, I need 5 samples that have a std of 10 (even though the overall distribution has std ≈ 32).
The example code below somewhat works, but is quite slow for large datasets. It randomly samples the distribution until it finds something close to the target std, then removes those elements so they can't be drawn again.
Is there a smarter way to do this properly and faster? It works ok for some target_std (above 6), but it isn't accurate below 6.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(23)
# Create a distribution
d1 = np.random.normal(95, 5, 200)
d2 = np.random.normal(125, 5, 200)
d3 = np.random.normal(115, 10, 200)
d4 = np.random.normal(70, 10, 100)
d5 = np.random.normal(160, 5, 200)
d6 = np.random.normal(170, 20, 100)
dist = np.concatenate((d1, d2, d3, d4, d5, d6))
print(f"Full distribution: len={len(dist)}, mean={np.mean(dist)}, std={np.std(dist)}")
plt.hist(dist, bins=100)
plt.title("Full Distribution")
plt.show();
batch_size = 5
num_batches = math.ceil(len(dist)/batch_size)
target_std = 10
tolerance = 1
# how many samples to search
num_samples = 100
result = []
# Find samples of batch_size that are closest to target_std
for i in range(num_batches):
    samples = []
    idxs = np.arange(len(dist))
    for j in range(num_samples):
        indices = np.random.choice(idxs, size=batch_size, replace=False)
        sample = dist[indices]
        std = sample.std()
        err = abs(std - target_std)
        samples.append((sample, indices, std, err, np.mean(sample), max(sample), min(sample)))
        if err <= tolerance:
            # close enough, stop sampling
            break
    # sort by smallest err first, then take the first/best result
    samples = sorted(samples, key=lambda x: x[3])
    best = samples[0]
    if i % 100 == 0:
        print(f"{i}, std={best[2]}, err={best[3]}, nsamples={num_samples}")
    result.append(best)
    # remove the data from our source
    dist = np.delete(dist, best[1])
df_samples = pd.DataFrame(result, columns=["sample", "indices", "std", "err", "mean", "max", "min"])
df_samples["err"].plot(title="Errors (target_std - batch_std)")
batch_std = df_samples["std"].mean()
batch_err = df_samples["err"].mean()
print(f"RESULT: Target std: {target_std}, Mean batch std: {batch_std}, Mean batch err: {batch_err}")
Since your problem is not restricted to a particular distribution, I use a normally distributed random population, but this should work for any distribution. However, the run time will depend on the population size.
import numpy as np

population = np.random.randn(1000)*32
std = 10.
tol = 1.
n_samples = 5
samples = list(np.random.choice(population, n_samples))
while True:
    center = np.mean(samples)
    dis = [abs(i-center) for i in samples]
    if np.std(samples)>(std+tol):
        samples.pop(dis.index(max(dis)))
    elif np.std(samples)<(std-tol):
        samples.pop(dis.index(min(dis)))
    else:
        break
    samples.append(np.random.choice(population, 1)[0])
Here is how the code works.
First, draw n_samples; the std is probably not yet in the range you want, so we calculate the mean and the absolute distance of each sample from the mean. Then, if the std is larger than the desired value plus the tolerance, we kick out the furthest sample and draw a new one, and vice versa (if the std is too small, we kick out the sample closest to the mean and draw a new one).
Note that if this takes too long for your data, then after kicking out a sample you can calculate the range in which the next element should lie and draw from that part of the population, instead of drawing one at random. Hopefully this works for you.
DISCLAIMER: This is not a random draw anymore, and you should be aware that the draw is biased and is not representative of the population.
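If the draw-and-check loop above is still too slow, one possible vectorized variant (my sketch, not part of the approach above) is to generate many candidate batches at once with NumPy and keep the one whose std is closest to the target:
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(0, 32, 1000)
target_std, batch_size, n_candidates = 10.0, 5, 5000

# each row of idx is one candidate batch of distinct indices into the population
idx = np.argsort(rng.random((n_candidates, len(population))), axis=1)[:, :batch_size]
stds = population[idx].std(axis=1)
best = idx[np.argmin(np.abs(stds - target_std))]
print(population[best].std())   # typically very close to target_std
The same disclaimer applies here: picking the candidate batch with the closest std is not a simple random draw.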
How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. I can plot the data and spot any anomalies or noise visually, but how can I do it without plotting the data? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only some values in the time vector are anomalous, the rest of the values follow a regular progression. That means that if we gather the data points of the vector into clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represent the real deal), we can use that average to do a triad detection, within a given threshold, over the vector and detect which of the elements are anomalous.
For this we need two functions: calculate_average_step, which computes that average step for the biggest cluster of close values, and detect_anomalous_values, which yields the indexes of the anomalous values in the vector, based on the average calculated earlier.
After we detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and by using the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def calculate_average_step(array, threshold=5):
    """
    Determine the average step by doing a weighted average based on clustering of averages.
    array: our array
    threshold: the +/- offset for grouping clusters. Applicable to all elements in the array.
    """
    # determine all the steps
    steps = []
    for i in range(0, len(array) - 1):
        steps.append(abs(array[i] - array[i+1]))
    # determine the steps clusters
    clusters = []
    skip_indexes = []
    cluster_index = 0
    for i in range(len(steps)):
        if i in skip_indexes:
            continue
        # determine the cluster band (based on threshold)
        cluster_lower = steps[i] - (steps[i]/100) * threshold
        cluster_upper = steps[i] + (steps[i]/100) * threshold
        # create the new cluster
        clusters.append([])
        clusters[cluster_index].append(steps[i])
        # try to match elements from the rest of the array
        for j in range(i + 1, len(steps)):
            if not (cluster_lower <= steps[j] <= cluster_upper):
                continue
            clusters[cluster_index].append(steps[j])
            skip_indexes.append(j)
        cluster_index += 1 # increment the cluster id
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    biggest_cluster = clusters[0] if len(clusters) > 0 else None
    if biggest_cluster is None:
        return None
    return sum(biggest_cluster) / len(biggest_cluster) # return our most common average
def detect_anomalous_values(array, regular_step, threshold=5):
    """
    Will scan every triad (3 points) in the array to detect anomalies.
    array: the array to iterate over.
    regular_step: the step around which we form the upper/lower band for filtering
    threshold: +/- variation between the steps of the first and median element and median and third element.
    """
    assert(len(array) >= 3) # must have at least 3 elements
    anomalous_indexes = []
    step_lower = regular_step - (regular_step / 100) * threshold
    step_upper = regular_step + (regular_step / 100) * threshold
    # detection will be forward from i (hence 3 elements must be available for the detection)
    for i in range(0, len(array) - 2):
        a = array[i]
        b = array[i+1]
        c = array[i+2]
        first_step = abs(a-b)
        second_step = abs(b-c)
        first_belonging = step_lower <= first_step <= step_upper
        second_belonging = step_lower <= second_step <= step_upper
        # detect that both steps are alright
        if first_belonging and second_belonging:
            continue # all is good here, nothing to do
        # detect if the first point in the triad is bad
        if not first_belonging and second_belonging:
            anomalous_indexes.append(i)
        # detect if the last point in the triad is bad
        if first_belonging and not second_belonging:
            anomalous_indexes.append(i+2)
        # detect if the mid point in the triad is bad (or everything is bad)
        if not first_belonging and not second_belonging:
            anomalous_indexes.append(i+1)
            # we won't add the others here because they will be detected by
            # the rest of the triad scans
    return sorted(set(anomalous_indexes)) # return unique indexes
if __name__ == "__main__":
    N = 10 # Set signal sample length
    t1 = -np.pi # Simulation begins at t1
    t2 = np.pi # Simulation ends at t2
    in_array = np.linspace(t1, t2, N)
    # add some noise
    noise_input = random.uniform(-.5, .5)
    in_array[random.randint(0, len(in_array)-1)] = noise_input
    noisy_out_array = np.sin(in_array)
    # display noisy sin
    plt.figure()
    plt.plot(in_array, noisy_out_array, color='red', marker="o")
    plt.title("noisy numpy.sin()")
    # detect anomalous values
    average_step = calculate_average_step(in_array)
    anomalous_indexes = detect_anomalous_values(in_array, average_step)
    # replace anomalous points with an estimated value based on our calculated average
    for anomalous in anomalous_indexes:
        # try forward extrapolation
        try:
            in_array[anomalous] = in_array[anomalous-1] + average_step
        # else try backward extrapolation
        except IndexError:
            in_array[anomalous] = in_array[anomalous+1] - average_step
    # generate sine wave
    out_array = np.sin(in_array)
    plt.figure()
    plt.plot(in_array, out_array, color='green', marker="o")
    plt.title("cleaned numpy.sin()")
    plt.show()
Noisy sine:
Cleaned sine:
Your problem lies in the time vector (which is 1-dimensional). You will need to apply some sort of filter to that vector.
The first thing that came to mind was medfilt (a median filter) from scipy, and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
The problem with this filter, though, is that if we put noise values at the edges of the vector, like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], then the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.]. Obviously this is not OK for the sine plot, since it would produce the same artifacts in the sine values array.
A better approach is to treat the time vector as a y output and its index values as the x input, and do a linear regression on this "time linear function" (note the quotes: it just means we are faking the 2-dimensional model by applying a fake x vector). The code uses scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
the output will be:
The resulting signal will be an approximation of the original, as it is with any form of extrapolation/interpolation/regression filtering.
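If you also want to know which points were anomalous, rather than just replacing the whole vector, a possible follow-up (my sketch, not part of the answer above) is to look at the regression residuals and flag the points that deviate most from the fitted line; the threshold factor is a judgment call:
import numpy as np
from scipy.stats import linregress

noisy = np.array([5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100], dtype=float)
x = np.arange(len(noisy))
slope, intercept, r_val, p_val, std_err = linregress(x, noisy)

# points whose residual is much larger than typical are likely anomalous
residuals = noisy - (intercept + slope * x)
anomalous_indexes = np.where(np.abs(residuals) > 1.5 * np.std(residuals))[0]
print(anomalous_indexes)   # [5 8] -> the positions of -20 and 15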
I am learning how to build a simple linear model to predict a flat's price based on its square meters and the number of rooms. I have a .csv data set with several features, and 'Price' is of course one of them, but it contains several suspicious values like '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove outliers:
import numpy as np
import pandas as pd
def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I construct a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered)) # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set my features:
features = data[['SqrMeters', 'Rooms']]
target = data_filtered
X = features
Y = target
And here is my question: how can I get the same set of observations for my X and Y? Right now I have inconsistent numbers of samples (5000 for my X and 4995 for my Y after removing outliers). Thank you for any help on this topic.
The features and labels should have the same length, and you should pass the whole data object to reject_outliers:
def reject_outliers(data):
u = np.mean(data["Price"])
s = np.std(data["Price"])
data_filtered = data[(data["Price"]>(u-2*s)) & (data["Price"]<(u+2*s))]
return data_filtered
You can use it in this way:
data_filtered=reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X=features
y=target
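From there, fitting the model is straightforward; a minimal sketch, assuming scikit-learn (the question does not name a library) and the X and y defined above:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)   # one coefficient each for SqrMeters and Rooms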
The following works for Pandas DataFrames (data):
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u-2*s) & (data.Price < u+2*s)]
    return data_filtered
I have a project where I'm sampling analog data and attempting to analyze it with matplotlib. Currently, my analog data source is a potentiometer hooked up to a microcontroller, but that's not really relevant to the issue. Here's my code:
import scipy.signal
from pylab import array, convolve, ones, linspace, plot, xlim, ylim, legend, show  # assuming pylab-style imports, since the snippet uses bare names

arrayFront = RunningMean(array(dataFront), 15)
arrayRear = RunningMean(array(dataRear), 15)
x = linspace(0, len(arrayFront), len(arrayFront)) # Generate x axis
y = linspace(0, len(arrayRear), len(arrayRear)) # Generate x axis
min_vals_front = scipy.signal.argrelmin(arrayFront, order=2)[0] # Min
min_vals_rear = scipy.signal.argrelmin(arrayRear, order=2)[0] # Min
max_vals_front = scipy.signal.argrelmax(arrayFront, order=2)[0] # Max
max_vals_rear = scipy.signal.argrelmax(arrayRear, order=2)[0] # Max
maxvalfront = max(arrayFront[max_vals_front])
maxvalrear = max(arrayRear[max_vals_rear])
minvalfront = min(arrayFront[min_vals_front])
minvalrear = min(arrayRear[min_vals_rear])
plot(x, arrayFront, label="Front Pressures")
plot(y, arrayRear, label="Rear Pressures")
plot(x[min_vals_front], arrayFront[min_vals_front], "x")
plot(x[max_vals_front], arrayFront[max_vals_front], "o")
plot(y[min_vals_rear], arrayRear[min_vals_rear], "x")
plot(y[max_vals_rear], arrayRear[max_vals_rear], "o")
xlim(-25, len(arrayFront) + 25)
ylim(-1000, 7000)
legend(loc='upper left')
show()
dataFront and dataRear are python lists that hold the sampled data from 2 potentiometers. RunningMean is a function that calls:
convolve(x, ones((N,)) / N, mode='valid')
The problem is that the argrelmax (and argrelmin) functions don't always find all the maxes and mins. Sometimes they don't find ANY maxes or mins, and that causes me problems in this block of code:
maxvalfront = max(arrayFront[max_vals_front])
maxvalrear = max(arrayRear[max_vals_rear])
minvalfront = min(arrayFront[min_vals_front])
minvalrear = min(arrayRear[min_vals_rear])
because the [min_vals_(blank)] variables are empty. Does anyone have any idea what is happening here, and what I can do to fix the problem? Thanks in advance.
Here's one of the graphs of data where not all the maxes and mins are found:
signal.argrelmin is a thin wrapper around signal.argrelextrema with comparator=np.less. np.less(a, b) returns the truth value of a < b element-wise. Notice that np.less requires a to be strictly less than b for it to be True.
Your data has the same minimum value at a lot of neighboring locations. At those local minima, the relationship between the local minimum and its neighbors is not strictly less than; it is only less than or equal to.
Therefore, to find these extrema, use signal.argrelextrema with the comparator np.less_equal. For example, using a snippet from your data:
import numpy as np
from scipy import signal
arrayRear = np.array([-624.59309896, -624.59309896, -624.59309896,
-625., -625., -625.,])
print(signal.argrelmin(arrayRear, order=2)[0])
# []
print(signal.argrelextrema(arrayRear, np.less_equal)[0])
# [0 1 3 4 5]
print(signal.argrelextrema(arrayRear, np.less_equal, order=2)[0])
# [0 3 4 5]
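Applied back to the code from the question, that looks something like the sketch below; the stand-in array and the emptiness guard are my additions (the guard matters because empty index arrays are exactly what made the max()/min() calls fail):
import numpy as np
from scipy import signal

arrayFront = np.array([3., 3., 1., 1., 2., 5., 5., 4.])   # stand-in for the smoothed data

# use <=/>= comparators so plateaus of equal neighboring values still count as extrema
min_vals_front = signal.argrelextrema(arrayFront, np.less_equal, order=2)[0]
max_vals_front = signal.argrelextrema(arrayFront, np.greater_equal, order=2)[0]

# guard against empty index arrays before calling max()/min()
if min_vals_front.size and max_vals_front.size:
    minvalfront = arrayFront[min_vals_front].min()
    maxvalfront = arrayFront[max_vals_front].max()
    print(minvalfront, maxvalfront)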