How to group/round/quantize similar values in a numpy array - python

Is there a numpy method that lets me recover a numpy array's quantized structure if I don't know in advance what the quantized values/levels are, but do know, for example, that they are spaced more than 1.0 apart?
For example:
import numpy as np
x = np.array([0.5, 0.5, 1.75, 1.75, 1.75, 6.45, 6.45, 0.5, 11.1, 0.5, 6.45])
x_noise = x + np.random.randn(len(x))/100
Is there a way to solve for x given just x_noise?

If you don't know anything else about the original quantized values, I think the best you can do is to average over the noise:
sort_idx = np.argsort(x_noise)
# group values that are less than 1.0 apart
splits = np.where(np.diff(x_noise[sort_idx]) > 1.0)[0] + 1
groups = np.split(x_noise[sort_idx], splits)
# reconstruct x with the average values
x_approx = np.empty_like(x_noise)
for idx, group in zip(np.split(sort_idx, splits), groups):
    x_approx[idx] = np.mean(group)
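As a quick sanity check (a sketch using the x and x_noise defined above; the exact printed values depend on the random noise):
print(np.unique(np.round(x_approx, 2)))  # roughly [ 0.5   1.75  6.45  11.1 ]
print(np.allclose(x_approx, x, atol=0.1))  # True for noise this small, though not guaranteed in general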

Related

Replace outlier values with NaN in numpy? (preserve length of array)

I have an array of magnetometer data with artifacts every two hours due to power cycling.
I'd like to replace those indices with NaN so that the length of the array is preserved.
Here's a code example, adapted from https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html.
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = [x for x in y if (x > mean - 2 * sd)]
    final_list = [x for x in final_list if (x < mean + 2 * sd)]
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
px.line(y=y, x=x)
# px.scatter(y) # It looks like the outliers are successfully dropped.
# px.line(y=reject_outliers(y), x=x) # This is the line I'd like to see work.
When I run 'px.scatter(reject_outliers(y))', it looks like the outliers are successfully getting dropped:
...but that's looking at the culled y vector against its index, rather than against the datetime vector x as in the plot above. As the debugging output shows, the vector is shortened because the outlier values are dropped rather than replaced.
How can I edit my reject_outliers() function to assign those values to NaN, or to adjacent values, so that the length of the array stays the same and I can plot my data?
Use else in the list comprehension along the lines of:
[x if x_condition else other_value for x in y]
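Applied to your case, that pattern would look something like this (a sketch, with mean, sd, and n computed as in your function):
final_list = [v if abs(v - mean) <= n * sd else np.nan for v in y]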
Got a less compact version to work. Full code:
import numpy as np
import plotly.express as px
# For pulling data from CDAweb:
from ai import cdas
import datetime
# Import data:
start = datetime.datetime(2016, 1, 24, 0, 0, 0)
end = datetime.datetime(2016, 1, 25, 0, 0, 0)
data = cdas.get_data(
    'sp_phys',
    'THG_L2_MAG_' + 'PG2',
    start,
    end,
    ['thg_mag_' + 'pg2']
)
x = data['UT']
y = data['VERTICAL_DOWN_-_Z']
def reject_outliers(y):  # y is the data in a 1D numpy array
    mean = np.mean(y)
    sd = np.std(y)
    final_list = np.copy(y)
    for n in range(len(y)):
        final_list[n] = y[n] if y[n] > mean - 5 * sd else np.nan
        final_list[n] = final_list[n] if final_list[n] < mean + 5 * sd else np.nan
    return final_list
px.scatter(reject_outliers(y))
print('Length of y: ')
print(len(y))
print('Length of y with outliers removed (should be the same): ')
print(len(reject_outliers(y)))
# px.line(y=y, x=x)
px.line(y=reject_outliers(y), x=x) # This is the line I wanted to get working - check!
More compact answer, sent via email by a friend:
In numpy you can select/index based on a Boolean array, and then make assignment with it:
def reject_outliers(y):  # y is the data in a 1D numpy array
    n = 5  # 5 std deviations
    mean = np.mean(y)
    sd = np.std(y)
    final_list = y.copy()
    final_list[np.abs(y - mean) > n * sd] = np.nan
    return final_list
I also noticed that you didn’t use the value of n in your example code.
Alternatively, you can use the where method (https://numpy.org/doc/stable/reference/generated/numpy.where.html)
np.where(np.abs(y - mean) > n * sd, np.nan, y)
You don’t need the .copy() if you don’t mind modifying the input array.
Replace np.mean and np.std with np.nanmean and np.nanstd if you want the function to work on arrays that already contain nans, i.e. if you want to use this function recursively.
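A nan-aware variant might look like this (a sketch, not tested on your data):
def reject_outliers_nan(y, n=5):  # y is a 1D numpy array that may already contain NaNs
    mean = np.nanmean(y)
    sd = np.nanstd(y)
    out = y.copy()
    out[np.abs(y - mean) > n * sd] = np.nan  # existing NaNs simply stay NaN
    return out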
The answer about using if else in a list comprehension would work, but avoiding the list comprehension makes the function much faster if the arrays are large.

How can binned events be identified based on a common condition for the bins? (scipy.binned_statistic)

I am using scipy.binned_statistic to get the frequency of points within a bin, such that:
h, xedge, yedge, binindex = scipy.stats.binned_statistic_2d(X, Y, Y, statistic='mean', bins=160)
I am able to filter out certain bins using the following:
filter = list(np.argwhere(h > 5).flatten())
From this I can get the bin edges/centers from xedge and yedge for the data I am interested in.
What is the most pythonic way to get the original data from these bins of interest? For example, how do I get the original data that is contained within the bins which have more than 5 points?
Yes, that's possible with some indexing magic. I am not sure if it is the most Pythonic way, but it should be close.
The solution for 1d using stats.binned_statistic:
from scipy import stats
import numpy as np
values = np.array([1.0, 1.0, 2.0, 1.5, 3.0]) # not used with 'count'
x = np.array([1, 1, 1, 4, 7, 7, 7])
statistic, bin_edges, binnumber = stats.binned_statistic(x, values, 'count', bins=3)
print(statistic)
print(bin_edges)
print(binnumber)
# find the bins with equal or more than three events
# if you are using custom bins where events can be lower or
# higher than your specified bins -> handle this
# get the bin numbers according to some condition
idx_bin = np.where(statistic >= 3)[0]
print(idx_bin)
# A binnumber of i means the corresponding value is
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one
idx_bin += 1
print(idx_bin)
# the rest is easy, get the boolean mask and apply it
is_event = np.in1d(binnumber, idx_bin)
events = x[is_event]
print(events)
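With the inputs above, statistic should come out as [3. 1. 3.] with bin_edges [1. 3. 5. 7.], so the final print should show the events from the first and last bins: [1 1 1 7 7 7] (worked out here for illustration, not part of the original answer).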
For 2d or nd you could use the solution above multiple times and combine the is_event masks for each dimension using np.logical_and (2d) or np.logical_and.reduce((x, y, z)) (nd).
The solution for 2d using stats.binned_statistic_2d is basically the same:
from scipy import stats
import numpy as np
x = np.array([1, 1.5, 2.0, 4, 5.5, 1.5, 7, 1])
y = np.array([1.0, 7.0, 1.0, 3, 7, 7, 7, 1])
values = np.ones_like(x) # not used with 'count'
# check keyword expand_binnumbers, use non-linearized
# as they can be used as indices without flattening
ret = stats.binned_statistic_2d(x,
                                y,
                                values,
                                'count',
                                bins=2,
                                expand_binnumbers=True)
print(ret.statistic)
print('binnumber', ret.binnumber)
binnumber = ret.binnumber
statistic = ret.statistic
# find the bins with equal or more than three events
# if you are using custom bins where events can be lower or
# higher than your specified bins -> handle this
# get the bin numbers according to some condition
idx_bin_x, idx_bin_y = np.where(statistic >= 3)
print(idx_bin_x)
print(idx_bin_y)
# A binnumber of i means the corresponding value is
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one
idx_bin_x += 1
idx_bin_y += 1
print(idx_bin_x)
print(idx_bin_y)
# the rest is easy, get the boolean mask and apply it
is_event_x = np.in1d(binnumber[0], idx_bin_x)
is_event_y = np.in1d(binnumber[1], idx_bin_y)
is_event_xy = np.logical_and(is_event_x, is_event_y)
events_x = x[is_event_xy]
events_y = y[is_event_xy]
print('x', events_x)
print('y', events_y)
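With these inputs only the lower-left bin has three or more events (statistic[0, 0] == 3), so the final prints should show x [1. 2. 1.] and y [1. 1. 1.] (again worked out here for illustration, not part of the original answer).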

Values of each bin

I've got the following problem:
hist, edges = np.histogram(data, bins=50)
How can I access the values of each bin? I want to calculate the average of each bin.
Thanks
I think this function does what you want:
import numpy as np
def binned_mean(values, edges):
    values = np.asarray(values)
    # Classify values into bins
    dig = np.digitize(values, edges)
    # Mask values out of bins
    m = (dig > 0) & (dig < len(edges))
    values = values[m]
    dig = dig[m] - 1
    # Binned sum of values
    nbins = len(edges) - 1
    s = np.zeros(nbins, dtype=values.dtype)
    np.add.at(s, dig, values)
    # Binned count of values
    count = np.zeros(nbins, dtype=np.int32)
    np.add.at(count, dig, 1)
    # Means (clip avoids division by zero for empty bins)
    return s / count.clip(min=1)
Example:
print(binned_mean([1.2, 1.8, 2.1, 2.4, 2.7], [1, 2, 3]))
# [1.5 2.4]
This function differs slightly from np.histogram, though: np.digitize treats all bins as half-open (either on the right or on the left), whereas np.histogram treats the last edge as closed.
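For comparison, scipy can give the same per-bin means directly via scipy.stats.binned_statistic; a minimal sketch with the same toy data (not part of the original answer):
from scipy import stats
data = [1.2, 1.8, 2.1, 2.4, 2.7]
means, edges, _ = stats.binned_statistic(data, data, statistic='mean', bins=[1, 2, 3])
print(means)  # [1.5 2.4]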

Scatter plot with logical indexing

I have a 100x2 array D and a 100x1 array c (with entries +/- 1). I'm trying to make a scatter plot of the columns of D corresponding to c = 1.
I tried something like this: plt.scatter(D[0][c==1], D[1][c==1]) but it raises IndexError: too many indices for array.
I'm aware that I could use a list comprehension or something of that sort. I'm fairly new to Python and hence struggling with the format.
Thanks a lot.
Concept
You can use np.where to select only the rows of D where the corresponding entry of C is 1:
D = np.array([[0.25, 0.25], [0.75, 0.75]])
C = np.array([1, 0])
Using np.where, we can select only rows that are 1 in C:
>>> D[np.where(C==1)]
array([[0.25, 0.25]])
Example, on your actual data:
D = np.random.randn(100, 2)
C = np.random.randint(0, 2, (100, 1))
valid = D[np.where(C.ravel()==1)]
import matplotlib.pyplot as plt
plt.scatter(valid[:, 0], valid[:, 1])
Output: a scatter plot of the rows of D where C == 1.
You can use numpy for this (assuming you have two numpy arrays, otherwise you can convert them into numpy arrays):
import numpy as np
c_ones = np.where(c.ravel() == 1)  # Finds all indices where c == 1 (ravel in case c is 100x1)
d_0 = D[:, 0][c_ones]  # first column of D at those indices
d_1 = D[:, 1][c_ones]  # second column of D at those indices
Then you can plot d_0, d_1 as normal.
For converting your lists if needed,
C_np = np.asarray(c)
D_np = np.asarray(D)
And then perform np.where on C_np as shown above.
Would this solve your issue?
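For completeness, a plain boolean mask also does this in one step (a sketch, assuming D has shape (100, 2), c can be flattened to shape (100,), and matplotlib.pyplot is imported as plt):
mask = np.asarray(c).ravel() == 1
plt.scatter(D[mask, 0], D[mask, 1])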

Theano - Sum by group

I'm working on a custom likelihood function for Theano (attempting to fit a conditional logistic regression).
The likelihood requires summing values by group. In R we have the "ave()" function, in Python Pandas we have "groupby()". How would I do something similar in Theano?
Edited for more detail
I want to create a Cox proportional hazards model (the same as conditional logistic regression). The log-likelihood requires the sum of values by group:
In Pandas, this would be:
temp = df.groupby('groupid')['eta'].aggregate(np.sum)
denominator = np.log(temp).sum()
In the data, we have a column with group ID, and the values to be summed
group    eta
1        2.1
1        1.8
1        0.9
2        1.2
2        0.75
2        1.42
The output for the group sums would then be:
group    sum
1        4.8
2        3.37
Then, the sum of the log of the sums:
log(4.8) + log(3.37) = 2.7835
This is quick and easy to do in Pandas. How can I do something similar in Theano? Sure, I could write a nested loop, but I try to avoid manually coded loops when possible, as they are slow.
Thanks!
Let's say you have X (a vector of all your etas) with dimension Nx1, and a matrix H. H is an NxG matrix holding a one-hot encoding of the groups.
Then you write something like:
import numpy as np
from numpy import newaxis as na
import theano.tensor as T
X = T.vector()
H = T.matrix()
tmp = T.sum(X[:, na] * H, axis=0)
O = T.sum(T.log(tmp))
x = np.array([5, 10, 10, 0.5, 5, 0.5])
g = np.array([1, 2, 2, 0, 1, 0])  # group labels
# build a one-hot encoding of the groups
h = np.zeros(shape=(len(x), 3))
for i, j in enumerate(g):
    h[i, j] = 1.0
O.eval({X: x, H: h})
This should work as long as there is at least one eta per group (otherwise the log gives -inf).
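As a plain NumPy cross-check of that graph (a sketch using the same x and g as above, not part of the original answer), np.bincount gives the per-group sums directly:
sums = np.bincount(g, weights=x)  # array([ 1., 10., 20.])
print(np.log(sums).sum())  # ~5.2983, should match O.eval({X: x, H: h})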
