Python: define individual binning

I am trying to define my own binning and calculate the mean value of some other columns of my dataframe over these bins. Unfortunately, it only works with integer inputs, as you can see below. In this particular case "step_size" defines the width of one bin, and I would like to use float values like 0.109, which corresponds to 0.109 seconds. Do you have any idea how I can do this? I think the problem is in the definition of "create_bins", but I cannot fix it...
The goal should be to get this: [(0, 0.109), (0.109, 0.218), (0.218, 0.327), ...]
Greets
# =============================================================================
# Define parameters
# =============================================================================
seconds_min = 0
seconds_max = 9000
step_size = 1
bin_number = int((seconds_max-seconds_min)/step_size)
# =============================================================================
# Define function to create your own individual binning
# lower_bound defines the lowest value of the binning interval
# width defines the width of the binning interval
# quantity defines the number of bins
# =============================================================================
def create_bins(lower_bound, width, quantity):
    bins = []
    for low in range(lower_bound,
                     lower_bound + quantity * width + 1, width):
        bins.append((low, low + width))
    return bins
# =============================================================================
# Create binning list
# =============================================================================
bin_list = create_bins(lower_bound=seconds_min,
                       width=step_size,
                       quantity=bin_number)
print(bin_list)

The problem lies in the fact that the range function does not allow for float ranges.
You can use the numeric_range function in more_itertools for this:
from more_itertools import numeric_range
seconds_min = 0
seconds_max = 9
step_size = 0.109
bin_number = int((seconds_max-seconds_min)/step_size)
def create_bins(lower_bound, width, quantity):
    bins = []
    for low in numeric_range(lower_bound,
                             lower_bound + quantity * width + 1, width):
        bins.append((low, low + width))
    return bins

bin_list = create_bins(lower_bound=seconds_min,
                       width=step_size,
                       quantity=bin_number)
print(bin_list)
# [(0.0, 0.109), (0.109, 0.218), (0.218, 0.327), ... ]

Here's a simple way to do it using zip and numpy's arange. I've put the upper limit at 5, but you can, of course, choose other numbers.
import numpy as np

list(zip(np.arange(0, 5, .109), np.arange(.109, 5, .109)))
The result is:
[(0.0, 0.109),
(0.109, 0.218),
(0.218, 0.327),
(0.327, 0.436),
(0.436, 0.545),
(0.545, 0.654),
(0.654, 0.763),
...

no numpy:
max_bin = 100
min_bin = 0
step_size = 0.109
number_of_bins = int(1 + (max_bin - min_bin) / step_size)  # +1 to cover the whole interval
bins = []
for a in range(number_of_bins):
    bins.append((a * step_size, (a + 1) * step_size))
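Since the original goal was to average other dataframe columns over these time bins, a pandas-based sketch may be closer to what is actually needed. This is only an illustration: the dataframe and its column names ("seconds", "value") are made up here, and the bin edges simply reuse the step size from the question.
import numpy as np
import pandas as pd

step_size = 0.109
edges = np.arange(0, 9000 + step_size, step_size)   # bin edges covering 0..seconds_max

# hypothetical dataframe: a time column and a measurement column
df = pd.DataFrame({"seconds": np.random.uniform(0, 9000, 1000),
                   "value": np.random.randn(1000)})

df["bin"] = pd.cut(df["seconds"], bins=edges)        # assign each row to a bin
binned_means = df.groupby("bin", observed=True)["value"].mean()
print(binned_means.head())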

Related

Python weighted quantile as R wtd.quantile()

I want to convert the R function Hmisc::wtd.quantile() into Python.
Here is the example in R:
I took this as a reference, but its logic seems to be different from R's:
import numpy as np

# First function
def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), 'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    # weighted_quantiles = np.cumsum(sample_weight)
    # weighted_quantiles /= np.sum(sample_weight)
    weighted_quantiles = np.cumsum(sample_weight) / np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

weighted_quantile(values=[0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                  quantiles=np.arange(0, 1 + 1 / 5, 1 / 5),
                  sample_weight=[1, 1, 1, 1, 1])
>> array([0.2136325, 0.2136325, 0.4079128, 0.4890342, 0.5083345, 0.6197319])
# Second function
def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    data = np.array(data)
    weights = np.array(weights)
    ix = np.argsort(data)
    data = data[ix]        # sort data
    weights = weights[ix]  # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)  # 'like' a CDF function
    return np.interp(perc, cdf, data)

weighted_percentile([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                    [1, 1, 1, 1, 1],
                    np.arange(0, 1 + 1 / 5, 1 / 5))
>> array([0.2136325 , 0.31077265, 0.4484735 , 0.49868435, 0.5640332 , 0.6197319 ])
Both give results that differ from R. Any idea?
I am Python-illiterate, but from what I see and after some quick checks I can tell you the following.
Here you use uniform (sampling) weights, so you could also directly use the quantile() function. Not surprisingly, it gives the same results as wtd.quantile() with uniform weights:
x <- c(0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319)
n <- length(x)
x <- sort(x)
quantile(x, probs = seq(0,1,0.2))
# 0% 20% 40% 60% 80% 100%
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
The R quantile() function gets the quantiles in a 'textbook' way, i.e. by determining the index i of the obs to use with i = q(n+1).
In your case:
seq(0,1,0.2)*(n+1)
# 0.0 1.2 2.4 3.6 4.8 6.0
Of course since you have 5 values/obs and you want quintiles, the indices are not integers. But you know for example that the first quintile (i = 1.2) lies between obs 1 and obs 2. More precisely, it is a linear combination of the two observations (the 'weights' are derived from the value of the index):
0.2*x[1] + 0.8*x[2]
# 0.3690567
You can do the same for all the quintiles, on the basis of the indices:
q <- c(min(x),              ## 0: actually, the first obs
       0.2*x[1] + 0.8*x[2], ## 1.2: quintile lies between obs 1 and 2
       0.4*x[2] + 0.6*x[3], ## 2.4: quintile lies between obs 2 and 3
       0.6*x[3] + 0.4*x[4], ## 3.6: quintile lies between obs 3 and 4
       0.8*x[4] + 0.2*x[5], ## 4.8: quintile lies between obs 4 and 5
       max(x))              ## 6: actually, the last obs
q
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
You can see that you get exactly the output of quantile() and wtd.quantile().
If instead of 0.2*x[1] + 0.8*x[2] we consider the following:
0.5*x[1] + 0.5*x[2]
# 0.3107726
We get the output of your second Python function. It appears that your second function considers uniform 'weights' (obviously I am not talking about the sampling weights here) when combining the two observations. The issue (at least for the second Python function) seems to come from this. I know these are just insights, but I hope they will help.
EDIT: note that the difference between the two is not necessarily an 'issue' with the Python code. There are different quantile estimators (and their weighted versions), and the Python functions could simply rely on a different estimator than Hmisc::wtd.quantile(). I think that the latter uses the weighted version of the Harrell-Davis quantile estimator. If you really want to implement this one, you should check the source code of Hmisc::wtd.quantile() and try to 'directly' translate it into Python.
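For the unweighted case, one more cross-check can be done directly from Python (an added sketch, not part of the original answer): NumPy's default 'linear' interpolation corresponds to R's default type-7 estimator, so np.quantile reproduces the quantile() output shown above.
import numpy as np

x = [0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319]
probs = np.arange(0, 1 + 1 / 5, 1 / 5)
# NumPy's default 'linear' method matches R's default quantile() (type 7)
print(np.quantile(x, probs))
# [0.2136325 0.3690567 0.4565856 0.4967543 0.530614  0.6197319]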

Values of each bin

I've got the following problem:
hist, edges = np.histogram(data, bins=50)
How can I access the values of each bin? I want to calculate the average of each bin.
Thanks
I think this function does what you want:
import numpy as np

def binned_mean(values, edges):
    values = np.asarray(values)
    # Classify values into bins
    dig = np.digitize(values, edges)
    # Mask values out of bins
    m = (dig > 0) & (dig < len(edges))
    values = values[m]
    dig = dig[m] - 1
    # Binned sum of values
    nbins = len(edges) - 1
    s = np.zeros(nbins, dtype=values.dtype)
    np.add.at(s, dig, values)
    # Binned count of values
    count = np.zeros(nbins, dtype=np.int32)
    np.add.at(count, dig, 1)
    # Means
    return s / count.clip(min=1)
Example:
print(binned_mean([1.2, 1.8, 2.1, 2.4, 2.7], [1, 2, 3]))
# [1.5 2.4]
There is a slight difference with the histogram in this function though, as np.digitize considers all bins to be half-closed (either right or left), unlike np.histogram which considers the last edge to be closed.
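For reference, here is a shorter sketch that stays within np.histogram itself: pass the data values as weights to get per-bin sums, then divide by the per-bin counts. The random data below is just a stand-in for the question's data.
import numpy as np

data = np.random.randn(1000)                    # stand-in for the question's data
counts, edges = np.histogram(data, bins=50)     # as in the question
sums, _ = np.histogram(data, bins=edges, weights=data)
# mean of each bin; empty bins become NaN instead of raising a division error
bin_means = np.divide(sums, counts,
                      out=np.full_like(sums, np.nan),
                      where=counts > 0)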

Finding anomalous values from sinusoidal data

How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. I can plot the data and spot any anomalies or noise visually, but how can I do it without plotting the data? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only some values in the time vector are anomalous, the rest of the values follow a regular progression. That means that if we gather all the data points in the vector into clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represent the real deal), then we can use that average to do a triad detection, within a given threshold, over the vector and detect which of the elements are anomalous.
For this we need two functions: calculate_average_step, which will calculate that average for the biggest cluster of close values, and detect_anomalous_values, which will yield the indexes of the anomalous values in our vector, based on the average calculated earlier.
After we have detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def calculate_average_step(array, threshold=5):
    """
    Determine the average step by doing a weighted average based on clustering of averages.
    array: our array
    threshold: the +/- offset for grouping clusters. Applicable on all elements in the array.
    """
    # determine all the steps
    steps = []
    for i in range(0, len(array) - 1):
        steps.append(abs(array[i] - array[i+1]))

    # determine the steps clusters
    clusters = []
    skip_indexes = []
    cluster_index = 0

    for i in range(len(steps)):
        if i in skip_indexes:
            continue

        # determine the cluster band (based on threshold)
        cluster_lower = steps[i] - (steps[i]/100) * threshold
        cluster_upper = steps[i] + (steps[i]/100) * threshold

        # create the new cluster
        clusters.append([])
        clusters[cluster_index].append(steps[i])

        # try to match elements from the rest of the array
        for j in range(i + 1, len(steps)):
            if not (cluster_lower <= steps[j] <= cluster_upper):
                continue
            clusters[cluster_index].append(steps[j])
            skip_indexes.append(j)

        cluster_index += 1  # increment the cluster id

    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    biggest_cluster = clusters[0] if len(clusters) > 0 else None

    if biggest_cluster is None:
        return None

    return sum(biggest_cluster) / len(biggest_cluster)  # return our most common average

def detect_anomalous_values(array, regular_step, threshold=5):
    """
    Will scan every triad (3 points) in the array to detect anomalies.
    array: the array to iterate over.
    regular_step: the step around which we form the upper/lower band for filtering
    threshold: +/- variation between the steps of the first and median element and median and third element.
    """
    assert len(array) >= 3  # must have at least 3 elements

    anomalous_indexes = []
    step_lower = regular_step - (regular_step / 100) * threshold
    step_upper = regular_step + (regular_step / 100) * threshold

    # detection will be forward from i (hence 3 elements must be available for the detection)
    for i in range(0, len(array) - 2):
        a = array[i]
        b = array[i+1]
        c = array[i+2]

        first_step = abs(a - b)
        second_step = abs(b - c)

        first_belonging = step_lower <= first_step <= step_upper
        second_belonging = step_lower <= second_step <= step_upper

        # detect that both steps are alright
        if first_belonging and second_belonging:
            continue  # all is good here, nothing to do

        # detect if the first point in the triad is bad
        if not first_belonging and second_belonging:
            anomalous_indexes.append(i)

        # detect if the last point in the triad is bad
        if first_belonging and not second_belonging:
            anomalous_indexes.append(i+2)

        # detect if the mid point in the triad is bad (or everything is bad)
        if not first_belonging and not second_belonging:
            anomalous_indexes.append(i+1)
            # we won't add here the others because they will be detected by
            # the rest of the triad scans

    return sorted(set(anomalous_indexes))  # return unique indexes

if __name__ == "__main__":
    N = 10       # Set signal sample length
    t1 = -np.pi  # Simulation begins at t1
    t2 = np.pi   # Simulation ends at t2
    in_array = np.linspace(t1, t2, N)

    # add some noise
    noise_input = random.uniform(-.5, .5)
    in_array[random.randint(0, len(in_array) - 1)] = noise_input

    noisy_out_array = np.sin(in_array)

    # display noisy sin
    plt.figure()
    plt.plot(in_array, noisy_out_array, color='red', marker="o")
    plt.title("noisy numpy.sin()")

    # detect anomalous values
    average_step = calculate_average_step(in_array)
    anomalous_indexes = detect_anomalous_values(in_array, average_step)

    # replace anomalous points with an estimated value based on our calculated average
    for anomalous in anomalous_indexes:
        # try forward extrapolation
        try:
            in_array[anomalous] = in_array[anomalous - 1] + average_step
        # else try backward extrapolation
        except IndexError:
            in_array[anomalous] = in_array[anomalous + 1] - average_step

    # generate sine wave
    out_array = np.sin(in_array)

    plt.figure()
    plt.plot(in_array, out_array, color='green', marker="o")
    plt.title("cleaned numpy.sin()")
    plt.show()
Noisy sine:
Cleaned sine:
Your problem lies in the time vector (which is one-dimensional). You will need to apply some sort of filter to that vector.
First thing that came to mind was medfilt (median filter) from scipy and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
The problem with this filter, though, is that if we apply some noise values to the edges of the vector, like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], then the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.]. Obviously this is not OK for the sine plot, since it will produce the same artifacts in the sine values array.
A better approach to this problem is to treat the time vector as a y output and its index values as the x input, and do a linear regression on this "time linear function" (note the quotes: it just means we're faking the two-dimensional model by applying a fake x vector). The code uses scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
The output will be an approximation of the original signal, as it is with any form of extrapolation/interpolation/regression filtering.

Random list of ones and zeros with minimum distance between ones

I would like to have a random list where the occurrence of ones is 10% and the rest of the items are zeros. The length of this list is 1000. I would like the values to be in a random order, with an adjustable minimum distance between ones. So for example if I choose a value of 3, the list would look something like this:
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...]
What is the most elegant way to achieve this?
Edit. I was asked for more information and to show some effort.
This is for a study where 0 signifies one type of stimulus and 1 another kind of stimulus, and we want to have a minimum distance between stimuli of type 1.
So far I have achieved this with:
import random

trials = [0]*400
trials.extend([1]*100)
random.shuffle(trials)

# Make sure a fixed minimum number of standard runs follow each deviant
i = 0
while i < len(trials):
    if trials[i] == 1:
        trials[i+1:i+1] = 5*[0]
        i = i + 6
    else:
        i = i + 1
This gives me a list of length 1000 but to me seems a little clumsy so out of curiosity I was wondering if there is a better way to do this.
You have essentially a binomial random variable. The waiting time between successes for a binomial random variable is given by the negative binomial distribution. Using this distribution, we can get a random sequence of intervals between successes for a binomial variable with the specified success rate. Then we simply add your "refractory period" to all intervals and create a binary representation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import nbinom
min_failures = 3 # refractory period
total_successes = 100
total_time = 1000
# create a negative binomial distribution to model the waiting times to the next success for a Bernoulli RV;
rv = nbinom(1, total_successes / float(total_time))
# get interval lengths between successes;
intervals = rv.rvs(size=total_successes)
# get event times
events = np.cumsum(intervals)
# rescale event times to fit into the total time - refractory time
total_refractory = total_successes * min_failures
remaining_time = total_time - total_refractory
events = events.astype(float) / np.max(events) * remaining_time
# add refractory periods
intervals = np.diff(np.r_[0, events])
intervals += min_failures
events = np.r_[0, np.cumsum(intervals[:-1])] # series starts with success
# create binary representation
binary = np.zeros((total_time), dtype=np.uint8)
binary[events.astype(int)] = 1
To check that the inter-event intervals match your expectations, plot a histogram:
# check that intervals match our expectations
fig, ax = plt.subplots(1,1)
ax.hist(intervals, bins=20, density=True)
ax.set_xlabel('Interval length')
ax.set_ylabel('Normalised frequency')
xticks = ax.get_xticks()
ax.set_xticks(np.r_[xticks, min_failures])
plt.show()
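As a quick sanity check on the binary sequence produced above (a small added snippet reusing binary and min_failures from the code), you can look at the index gaps between consecutive ones:
# positions of the ones and the gaps between consecutive ones
positions = np.flatnonzero(binary)
gaps = np.diff(positions)
print("number of ones:", len(positions))
print("smallest index gap between ones:", gaps.min())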
My approach to this problem is to maintain a list of candidate positions from which the next position is chosen randomly. Then, the surrounding range of positions is checked to be empty. If so, this position is chosen and the whole range around it in which no future position is allowed is removed from the list of available candidates. This ensures a minimum number of loops.
It may happen (if mindist is large compared to the number of positions) that fewer than the required number of positions are returned. In this case, the function needs to be called again, as shown.
import random

def array_ones(mindist, length_array, numones):
    result = [0]*length_array
    candidates = list(range(length_array))
    while sum(result) < numones and len(candidates) > 0:
        # choose one position randomly from candidates
        pos = candidates[random.randint(0, len(candidates)-1)]
        L = pos-mindist if pos >= mindist else 0
        U = pos+mindist if pos <= length_array-1-mindist else length_array-1
        if sum(result[L:U+1]) == 0:  # no taken positions around
            result[pos] = 1
            # remove all candidates around this position
            no_candidates = set(range(L, U+1))
            candidates = list(set(candidates).difference(no_candidates))
    return result, sum(result)

def main():
    numones = 5
    numtests = 50
    mindist = 4
    while True:
        arr, ones = array_ones(mindist, numtests, numones)
        if ones == numones:
            break
    print(arr)

if __name__ == '__main__':
    main()
The function returns the array of ones and its number of ones. Set difference is used to remove a range of candidate positions non-iteratively.
Seems that there wasn't a very simple one-line answer to this problem. I finally came up with this:
import numpy as np

def construct_list(n_zeros, n_ones, min_distance):
    if min_distance > (n_zeros + n_ones) / n_ones:
        raise ValueError("Minimum distance too high.")
    initial_zeros = n_zeros - min_distance * n_ones
    block = np.random.permutation(np.array([0]*initial_zeros + [1]*n_ones))
    ones = np.where(block == 1)[0].repeat(min_distance)
    # Insert min_distance number of 0s after each 1
    block = np.insert(block, ones + 1, 0)
    return block.tolist()
This seems simpler than the other answers, although Paul's answer was just a little faster with the values n_zeros=900, n_ones=100, min_distance=3.
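For the numbers in the original question (length 1000, 10% ones, minimum distance 3), usage would look roughly like this (an added example, not from the original post):
trials = construct_list(n_zeros=900, n_ones=100, min_distance=3)
print(len(trials), sum(trials))            # 1000 100
ones = np.flatnonzero(np.array(trials))
print(np.diff(ones).min())                 # smallest index gap between consecutive ones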

Python Joint Distribution of N Variables

So I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some sort of Pythonic vectorization that could be helpful, but right now my code is very C-like (and yes, I know that is not the right way to write Python). My 2D code is below:
import numpy
import math

feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9])
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0])

#===Concatenate All Features===#
numFrames = len(feature1)
allFeatures = numpy.zeros((2, numFrames))
allFeatures[0,:] = feature1
allFeatures[1,:] = feature2

#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames)
allBins = numpy.zeros((allFeatures.shape[0], numBins+1))

#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0], 2))
for f in range(allFeatures.shape[0]):
    allRanges[f,0] = numpy.amin(allFeatures[f,:])
    allRanges[f,1] = numpy.amax(allFeatures[f,:])

#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0], numBins))

#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
    freqhist, binedges = numpy.histogram(allFeatures[f,:], bins=numBins,
                                         range=[allRanges[f,0], allRanges[f,1]], density=False)
    allBins[f,:] = binedges
    allIndividualProbs[f,:] = freqhist

#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins, numBins))

#===Compute the joint probability distribution===#
numElements = 0
for b1 in range(numBins):
    for b2 in range(numBins):
        for f1 in range(numFrames):
            for f2 in range(numFrames):
                if ((feature1[f1] >= allBins[0,b1]) and (feature1[f1] <= allBins[0,b1+1]) and
                        (feature2[f2] >= allBins[1,b2]) and (feature2[f2] <= allBins[1,b2+1])):
                    jointProbs[b1,b2] += 1
                    numElements += 1
jointProbs /= numElements

#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0])
How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?
Check out the function numpy.histogramdd. This function can compute histograms in arbitrary numbers of dimensions. If you set the parameter normed=True, it returns the bin count divided by the bin hypervolume. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize it yourself. All together, you'll have something like:
import numpy as np
numBins = 10 # number of bins in each dimension
data = np.random.randn(100000, 3) # generate 100000 3-d random data points
jointProbs, edges = np.histogramdd(data, bins=numBins)
jointProbs /= jointProbs.sum()
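Applied to the question's setup with three features, the call might look something like this (a sketch; random stand-ins replace the actual feature1/feature2/feature3 arrays):
import numpy as np

# stand-ins for the question's 30-frame feature vectors
feature1, feature2, feature3 = np.random.rand(3, 30) * 10

samples = np.stack([feature1, feature2, feature3], axis=-1)   # shape (numFrames, 3)
numBins = int(0.25 * samples.shape[0])                        # 7 bins per dimension, as in the question
jointProbs, edges = np.histogramdd(samples, bins=numBins)
jointProbs /= jointProbs.sum()                                # normalise so all mass sums to 1
print(jointProbs.shape)                                       # (7, 7, 7)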
