I have two measurements, position and temperature, which are sampled at a fixed sampling rate. Some positions might occur multiple times in the data. Now I want to plot the temperature over the position rather than over time. Instead of displaying two points at the same position, I want to replace the temperature measurements with the mean value for the given location. How can this be done nicely in Python with numpy?
My solution so far looks like this:
import matplotlib.pyplot as plt
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Get correct order
idx = np.argsort(x)
x, y = x[idx], y[idx]
plt.plot(x, y) # Plot with multiple points at same location
# Calculate means for duplicates
new_x = []
new_y = []
skip_next = False
for idx in range(len(x)):
    if skip_next:
        skip_next = False
        continue
    if idx < len(x) - 1 and x[idx] == x[idx + 1]:
        new_x.append(x[idx])
        new_y.append((y[idx] + y[idx + 1]) / 2)
        skip_next = True
    else:
        new_x.append(x[idx])
        new_y.append(y[idx])
        skip_next = False
x, y = np.array(new_x), np.array(new_y)
plt.plot(x, y) # Plots desired output
This solution does not take into account that some positions might occur more than twice in the data. To replace all values, the loop would have to be run multiple times. I know there must be a better solution to this!
One approach using np.bincount -
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Find unique sorted values for x
x_new = np.unique(x)
# Use bincount to get the accumulated summation for each unique x, and
# divide each summation by the respective count of each unique value in x
y_new_mean = np.bincount(x, weights=y) / np.bincount(x)
Sample run -
In [16]: x
Out[16]: array([7, 0, 2, 8, 5, 4, 1, 9, 6, 8, 1, 3, 5])
In [17]: y
Out[17]:
array([ 6.7 , 0.12, 2.33, 8.19, 5.19, 3.68, 0.62, 9.46, 6.01,
8. , 1.07, 3.07, 5.01])
In [18]: x_new
Out[18]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [19]: y_new_mean
Out[19]:
array([ 0.12 , 0.845, 2.33 , 3.07 , 3.68 , 5.1 , 6.01 , 6.7 ,
8.095, 9.46 ])
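Note that np.bincount only accepts non-negative integer bins, which the sample x above happens to satisfy. If the positions are floats, a minimal sketch of the same idea using the inverse indices from np.unique (the sample data here is made up just for illustration):
import numpy as np
x = np.array([0.5, 1.25, 1.25, 2.0, 2.0, 2.0])
y = np.array([3.0, 4.0, 6.0, 1.0, 2.0, 3.0])
# return_inverse maps every sample to the index of its unique position,
# which then serves as the integer bin for np.bincount
x_new, inverse = np.unique(x, return_inverse=True)
y_new_mean = np.bincount(inverse, weights=y) / np.bincount(inverse)
# x_new      -> [0.5  1.25 2.  ]
# y_new_mean -> [3. 5. 2.]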
If I understand what you're asking, here's one way to do it that is a lot simpler.
Given some dataset that is randomly arranged, but where each position is paired with its temperature:
data = np.random.permutation([(1, 5.6), (1, 3.4), (1, 4.5), (2, 5.3), (3, 2.2), (3, 6.8)])
>> array([[ 3. , 2.2],
[ 3. , 6.8],
[ 1. , 3.4],
[ 1. , 5.6],
[ 2. , 5.3],
[ 1. , 4.5]])
We can sort the data and use each position as a dictionary key while collecting the temperatures for that position in a list in the dictionary. We use some error handling here: if the key (position) is not yet in our dictionary, Python will complain with a KeyError, so we add it then.
results = {}
for entry in sorted(data, key=lambda t: t[0]):
    try:
        results[entry[0]] = results[entry[0]] + [entry[1]]
    except KeyError:
        results[entry[0]] = [entry[1]]
print(results)
>> {1.0: [3.3999999999999999, 5.5999999999999996, 4.5],
2.0: [5.2999999999999998],
3.0: [2.2000000000000002, 6.7999999999999998]}
And with a final list comprehension we can flatten this and get the resulting array.
np.array([[key, np.mean(results[key])] for key in results.keys()])
>> array([[ 1. , 4.5],
[ 2. , 5.3],
[ 3. , 4.5]])
This can be put in a function:
def flatten_by_position(data):
    results = {}
    for entry in sorted(data, key=lambda t: t[0]):
        try:
            results[entry[0]] = results[entry[0]] + [entry[1]]
        except KeyError:
            results[entry[0]] = [entry[1]]
    return np.array([[key, np.mean(results[key])] for key in results.keys()])
Tested with a variety of inputs, this solution should be fast enough for datasets under 1,000,000 entries.
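A quick usage sketch with the already-shuffled data shown earlier (since the entries are sorted before insertion, the rows come out ordered by position on Python 3.7+ where dicts preserve insertion order):
data = np.array([[3., 2.2], [3., 6.8], [1., 3.4], [1., 5.6], [2., 5.3], [1., 4.5]])
print(flatten_by_position(data))
# [[1.  4.5]
#  [2.  5.3]
#  [3.  4.5]]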
Related
I need to average the Y values corresponding to the values in the X array...
X=np.array([ 1, 1, 2, 2, 2, 2, 3, 3 ... ])
Y=np.array([ 10, 30, 15, 10, 16, 10, 15, 20 ... ])
In other words, the Y values corresponding to the 1s in the X array are 10 and 30, and their average is 20; the Y values corresponding to the 2s are 15, 10, 16, and 10, and their average is 12.75; and so on...
How can I calculate these average values?
One option is to use a property of linear regression with categorical variables: a least-squares fit against a one-hot dummy matrix yields each group's mean as its coefficient.
import numpy as np
x = np.array([ 1, 1, 2, 2, 2, 2, 3, 3 ])
y = np.array([ 10, 30, 15, 10, 16, 10, 15, 20 ])
x_dummies = x[:, None] == np.unique(x)
means = np.linalg.lstsq(x_dummies, y, rcond=None)[0]
print(means) # [20. 12.75 17.5 ]
You can try using pandas
import pandas as pd
import numpy as np
N = pd.DataFrame(np.transpose([X, Y]),
                 columns=['X', 'Y']).groupby('X')['Y'].mean().to_numpy()
# array([20. , 12.75, 17.5 ])
import numpy as np
X = np.array([ 1, 1, 2, 2, 2, 2, 3, 3])
Y = np.array([ 10, 30, 15, 10, 16, 10, 15, 20])
# Only unique values
unique_vals = np.unique(X)
# Loop over every unique value
for val in unique_vals:
    # Search for the matching indexes in X
    idx = np.where(X == val)
    # Mean of Y at the found indexes
    aver = np.mean(Y[idx])
    print(f"Average for {val}: {aver}")
Result:
Average for 1: 20.0
Average for 2: 12.75
Average for 3: 17.5
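If you need the averages collected into an array rather than printed, a compact sketch of the same idea:
averages = np.array([np.mean(Y[X == v]) for v in np.unique(X)])
# averages -> 20.0, 12.75, 17.5 for the X values 1, 2, 3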
You can use something like the code below:
import numpy as np
X=np.array([ 1, 1, 2, 2, 2, 2, 3, 3])
Y=np.array([ 10, 30, 15, 10, 16, 10, 15, 20])
def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Split input array with those start, stop indices
    out = [a_sorted[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out
group_by_array=groupby(Y,X)
for item in group_by_array:
    print(np.average(item))
I used the information in the link below to answer the question:
Group numpy into multiple sub-arrays using an array of values
I think this solution should work:
avg_arr = []
i = 1
while i <= np.max(x):
    inds = np.where(x == i)
    my_val = np.average(y[inds])
    avg_arr.append(my_val)
    i += 1
Definitely not the cleanest, but I was able to test it quickly and it does indeed work.
The function numpy.where() can be used to obtain an array of indices into a numpy array where a logical condition is true. How do we generate a list of index arrays, one for each contiguous region where the condition is true?
For example,
import numpy as np
a = np.array( [0.,.1,.2,.3,.4,.5,.4,.3,.2,.1,0.] )
idx = np.where( (np.abs(a-.2) <= .1) )
print( 'idx =', idx)
print( 'a[idx] =', a[idx] )
produces the following output,
idx = (array([1, 2, 3, 7, 8, 9]),)
a[idx] = [0.1 0.2 0.3 0.3 0.2 0.1]
The question is: in a simple way, how do we obtain a list of arrays of indices, one such array for each contiguous section? For example, like this:
idx = (array([1, 2, 3]),), (array([7, 8, 9]),)
a[idx[0]] = [0.1 0.2 0.3]
a[idx[1]] = [0.3 0.2 0.1]
You can simply use np.split() to split your idx[0] into contiguous runs:
ia = idx[0]
out = np.split(ia, np.where(ia[1:] != ia[:-1] + 1)[0] + 1)
>>> out
[array([1, 2, 3]), array([7, 8, 9])]
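Here ia[1:] != ia[:-1] + 1 flags every element that does not follow its predecessor by exactly 1, and adding 1 to np.where's result converts those flags into the indices where new runs begin, which are exactly the split points np.split expects.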
This should work, using np.diff to find where consecutive indices jump by more than 1:
a = np.array([0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
idx = np.nonzero(np.abs(a - 0.2) <= 0.1)[0]
splits = np.split(idx, np.nonzero(np.diff(idx) > 1)[0] + 1)
print(splits)
It gives:
[array([1, 2, 3]), array([7, 8, 9])]
You can check if the difference between the shifted idx arrays is 1, then split the array at the corresponding indices.
import numpy as np
a = np.array( [0.,.1,.2,.3,.4,.5,.4,.3,.2,.1,0.] )
idx = np.where( np.abs(a-.2) <= .1 )[0]
# Get the indices where the increment of values is larger than 1.
split_idcs = np.argwhere( idx[1:]-idx[:-1] > 1 )[:, 0] + 1
# Split the array at the corresponding indices.
result = np.split(idx, split_idcs)
print(result)
# [array([1, 2, 3], dtype=int64), array([7, 8, 9], dtype=int64)]
It works for your example; taking the first column of np.argwhere's output like this should also handle sequences with more than one gap.
You can achieve the goal with:
diff = np.diff(idx[0], prepend=idx[0][0]-1)
result = np.split(idx[0], np.where(diff != 1)[0])
or
idx = np.where((np.abs(a - .2) <= .1))[0]
diff = np.diff(idx, prepend=idx[0]-1)
result = np.split(idx, np.where(diff != 1)[0])
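Both variants produce [array([1, 2, 3]), array([7, 8, 9])] for the example array a above.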
I have the following problem: I have two vectors containing time moments:
a = np.array((0.23, 1.70))
a_ = np.array((0, 0.5, 1, 1.5, 2))
and two vectors corresponding to the values of the function at these points of time
b = np.array((3, -1.2))
b_ = np.array((0, 3, 3, 3, -1.2))
I want to combine the vectors a, a_ and b, b_ into one each and sort the times in ascending order. The final result should look like this:
A = np.array((0, 0.23, 0.5, 1, 1.5, 1.70, 2))
B = np.array((0, 3, 3, 3, 3, -1.2, -1.2))
How can this be done? I gave a simple example here, but in general I will be working with longer vectors. I thought about concatenating the vectors a, a_ and b, b_, stacking them into a matrix and sorting by time (i.e. by the first row), but if I sort by the first row, the values in the second row don't change their position :( I then also want to access the results and compute the differences between successive elements (time and value increments).
Here's what I tried: I first convert them into key-value pairs and then sort them based on the keys.
Here's the Code:
import numpy as np
a = np.array((0.23, 1.70))
a_ = np.array((0, 0.5, 1, 1.5, 2))
b = np.array((3, -1.2))
b_ = np.array((0, 3, 3, 3, -1.2))
sol = {}
for i, j in zip(list(a), list(b)):
    sol[i] = j
for i, j in zip(list(a_), list(b_)):
    sol[i] = j
sol = dict(sorted(sol.items(), key = lambda kv:(kv[0], kv[1])))
A = np.array(list(sol.keys()))
B = np.array(list(sol.values()))
print(f'{A}\n{B}')
Result:
[0. 0.23 0.5 1. 1.5 1.7 2. ]
[ 0. 3. 3. 3. 3. -1.2 -1.2]
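For reference, the matrix idea from the question can also be done directly in numpy: concatenate the time vectors, compute one argsort permutation, and apply it to both the times and the values so the pairing is preserved. A minimal sketch (np.diff added for the increments mentioned at the end of the question):
import numpy as np
a  = np.array((0.23, 1.70))
a_ = np.array((0, 0.5, 1, 1.5, 2))
b  = np.array((3, -1.2))
b_ = np.array((0, 3, 3, 3, -1.2))
A = np.concatenate((a, a_))
B = np.concatenate((b, b_))
order = np.argsort(A, kind='stable')  # one permutation, applied to both vectors
A, B = A[order], B[order]
# A -> [0.   0.23 0.5  1.   1.5  1.7  2.  ]
# B -> [0.   3.   3.   3.   3.  -1.2 -1.2]
dt = np.diff(A)  # time increments between successive samples
dv = np.diff(B)  # value increments between successive samples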
Is there a more efficient way of determining the averages of a certain area in a given numpy array? For simplicity, let's say I have a 5x5 array:
values = np.array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]])
I would like to get the average around each coordinate, with a specified area size, assuming the array wraps around. Let's say the area size is 2, so anything within distance 2 of a given point will be considered. For example, to get the average of the area around coordinate (2,2), we need to consider
2,
2, 3, 4,
2, 3, 4, 5, 6
4, 5, 6,
6,
Thus, the average will be 4.
For coordinate (4, 4) we need to consider:
6,
6, 7, 3,
6, 7, 8, 4, 5
3, 4, 0,
5,
Thus the average will be 4.92.
Currently, I have the code below, but since it uses for loops I feel it could be improved. Is there a way to use just numpy built-in functions? For example, is there a way to use np.vectorize to gather the subarrays (areas), place them all in an array, and then use np.einsum or something similar?
def get_average(matrix, loc, dist):
    sum = 0
    num = 0
    size, size = matrix.shape
    for y in range(-dist, dist + 1):
        for x in range(-dist + abs(y), dist - abs(y) + 1):
            y_ = (y + loc.y) % size
            x_ = (x + loc.x) % size
            sum += matrix[y_, x_]
            num += 1
    return sum / num
class Coord():
    def __init__(self, x, y):
        self.x = x
        self.y = y
values = np.array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]])
height, width = values.shape
averages = np.zeros((height, width), dtype=np.float16)
for r in range(height):
    for c in range(width):
        loc = Coord(c, r)
        averages[r][c] = get_average(values, loc, 2)
print(averages)
Output:
[[ 3.07617188 2.92382812 3.5390625 4.15234375 4. ]
[ 2.92382812 2.76953125 3.38476562 4. 3.84570312]
[ 3.5390625 3.38476562 4. 4.6171875 4.4609375 ]
[ 4.15234375 4. 4.6171875 5.23046875 5.078125 ]
[ 4. 3.84570312 4.4609375 5.078125 4.921875 ]]
This solution is less efficient (slower) than yours, but it is an example of using the numpy.ma module.
Required libraries:
import numpy as np
import numpy.ma as ma
Define methods to do the job:
# build the shape of the area as a rhomboid
def rhomboid2(dim):
    size = 2*dim + 1
    matrix = np.ones((size, size))
    for y in range(-dim, dim + 1):
        for x in range(-dim + abs(y), dim - abs(y) + 1):
            matrix[(y + dim) % size, (x + dim) % size] = 0
    return matrix
# build a mask using the shaped area
def mask(matrix_shape, rhom_dim):
    mask = np.zeros(matrix_shape)
    bound = 2*rhom_dim + 1
    rhom = rhomboid2(rhom_dim)
    mask[0:bound, 0:bound] = rhom
    # roll to set the position of the rhomboid to 0,0
    mask = np.roll(mask, -rhom_dim, axis=0)
    mask = np.roll(mask, -rhom_dim, axis=1)
    return mask
Then, iterate to build the result:
mask_ = mask((5, 5), 2)  # mask sized like the values array, with a rhomboid area of size 2
averages = np.zeros_like(values, dtype=np.float16)  # initialize the recipient
# iterate over the mask to calculate the averages
for y in range(len(mask_)):
    for x in range(len(mask_)):
        masked = ma.array(values, mask=mask_)
        averages[y, x] = np.mean(masked)
        mask_ = np.roll(mask_, 1, axis=1)
    mask_ = np.roll(mask_, 1, axis=0)
Which returns
# [[3.076 2.924 3.54 4.152 4. ]
# [2.924 2.77 3.385 4. 3.846]
# [3.54 3.385 4. 4.617 4.46 ]
# [4.152 4. 4.617 5.23 5.08 ]
# [4. 3.846 4.46 5.08 4.92 ]]
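For completeness, the per-point loops can be avoided entirely. A minimal sketch (assuming SciPy is available; the helper name diamond_kernel is just for illustration) that reproduces the same averages with a wrap-around convolution against a diamond-shaped kernel:
import numpy as np
from scipy import ndimage
def diamond_kernel(dist):
    # Boolean diamond of radius dist, normalized so the weights sum to 1
    yy, xx = np.mgrid[-dist:dist + 1, -dist:dist + 1]
    footprint = (np.abs(yy) + np.abs(xx)) <= dist
    return footprint / footprint.sum()
values = np.array([[0, 1, 2, 3, 4],
                   [1, 2, 3, 4, 5],
                   [2, 3, 4, 5, 6],
                   [3, 4, 5, 6, 7],
                   [4, 5, 6, 7, 8]])
# mode='wrap' provides the periodic boundary condition
averages = ndimage.convolve(values.astype(float), diamond_kernel(2), mode='wrap')
print(averages.round(3))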
I have a vector like this:
intervals = [6, 7, 8, 9, 10, 11] #always regular
I want to check which interval a value falls into, by index. For example: the index of the interval containing 8.5 is 3.
#Interval : index
6 -> 7 : 1
7 -> 8 : 2
8 -> 9 : 3
9 -> 10 : 4
10 -> 11 : 5
So I made this code:
from numpy import *
N = 8000
data = random.random(N)
step_number = 50
max_value = max(data)
min_value = min(data)
step_length = (max_value - min_value)/step_number
intervals = arange(min_value + step_length, max_value + step_length, step_length )
for x in data:
    for index in range(len(intervals)):
        if x < intervals[index]:
            print("That's the index", index)
            break
This code works, but it's too slow; I think I'm wasting time in these loops. Is there a way to check this faster? Maybe using some special numpy function that checks this for me...
Depending on how you want to handle the endpoints, there is bisect.bisect_left and bisect.bisect_right:
>>> import bisect
>>> intervals = [6, 7, 8, 9, 10, 11]
>>> for n in (6, 6.1, 6.2, 6.5, 6.8, 7):
...     print(bisect.bisect_left(intervals, n))
...
0
1
1
1
1
1
>>> for n in (6, 6.1, 6.2, 6.5, 6.8, 7):
...     print(bisect.bisect_right(intervals, n))
...
1
1
1
1
1
2
Numpy implements the same thing using the searchsorted method.
>>> import numpy as np
>>> np.searchsorted(intervals, (6, 6.1, 6.2, 6.5, 6.8, 7), side='left')
array([0, 1, 1, 1, 1, 1])
>>> np.searchsorted(intervals, (6, 6.1, 6.2, 6.5, 6.8, 7), side='right')
array([1, 1, 1, 1, 1, 2])
And, of course, if your intervals are equally spaced, you can do:
>>> for n in (6, 6.1, 6.2, 6.5, 6.8, 7):
... iwidth = intervals[1] - intervals[0]
...     print(np.ceil((n - intervals[0]) / iwidth))
...
0.0
1.0
1.0
1.0
1.0
1.0
As others have mentioned, if you have irregular intervals, use a bisecting search (e.g. np.searchsorted and/or np.digitize).
However, in your specific case where you've stated you'll always have regular intervals, you can also do something similar to:
import numpy as np
intervals = [6, 7, 8, 9, 10, 11]
vals = np.array([8.5, 6.2, 9.8])
dx = intervals[1] - intervals[0]
x0 = intervals[0]
i = np.ceil((vals - x0) / dx).astype(int)
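With the sample vals above this gives i = array([3, 1, 4]), so 8.5 lands in interval index 3, matching the question.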
Or, building on your example code:
import numpy as np
N = 8000
num_intervals = 50
data = np.random.random(N)
intervals = np.linspace(data.min(), data.max(), num_intervals)
x0 = intervals[0]
dx = intervals[1] - intervals[0]
i = np.ceil((data - x0) / dx).astype(int)
This will be much faster than a binary search for large arrays.
As long as your list is sorted, you can use the bisect library to get the insertion index.
index = bisect.bisect_left(intervals, 8.5)
Using numpy.digitize:
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.digitize.html#numpy-digitize
>>> import numpy as np
>>> intervals = [6, 7, 8, 9, 10, 11]
>>> data = [3.5, 6.3, 9.4, 11.5, 8.5]
>>> np.digitize(data, bins=intervals)
array([0, 1, 4, 6, 3])
0 is underflow, len(intervals) is overflow
Just using numpy:
import numpy as np
intervals = np.array([6, 7, 8, 9, 10, 11])
val = (intervals > 8.5)
print(val.argmax())
I'd go for a function:
def f_idx(f_list, number):
    for idx, item in enumerate(f_list):
        if item > number:
            return idx
    return len(f_list)
As a one-liner:
result = [idx for idx,value in enumerate(intervals) if value>number][0] if intervals[-1]>number else len(intervals)