Fastest way to check which interval index a value is in - Python

I have a vector like this:
intervals = [6, 7, 8, 9, 10, 11] #always regular
I want to find which interval a value falls into. For example, the index of the interval containing 8.5 is 3:
#Interval : index
6 -> 7 : 1
7 -> 8 : 2
8 -> 9 : 3
9 -> 10 : 4
10 -> 11 : 5
So I made this code:
from numpy import *

N = 8000
data = random.random(N)
step_number = 50
max_value = max(data)
min_value = min(data)
step_length = (max_value - min_value) / step_number
intervals = arange(min_value + step_length, max_value + step_length, step_length)

for x in data:
    for index in range(len(intervals)):
        if x < intervals[index]:
            print("That's the index", index)
            break
This code works, but it's too slow; I think I'm wasting time in these loops. Is there a way to check this faster, maybe using some special numpy function that does the check for me?

Depending on how you want to handle the endpoints, there are bisect.bisect_left and bisect.bisect_right:
>>> import bisect
>>> intervals = [6, 7, 8, 9, 10, 11]
>>> for n in (6, 6.1, 6.2, 6.5, 6.8, 7):
...     print(bisect.bisect_left(intervals, n))
...
0
1
1
1
1
1
>>> for n in (6, 6.1, 6.2, 6.5, 6.8, 7):
...     print(bisect.bisect_right(intervals, n))
...
1
1
1
1
1
2
Numpy implements the same thing using the searchsorted method.
>>> import numpy as np
>>> np.searchsorted(intervals, (6, 6.1, 6.2, 6.5, 6.8, 7), side='left')
array([0, 1, 1, 1, 1, 1])
>>> np.searchsorted(intervals, (6, 6.1, 6.2, 6.5, 6.8, 7), side='right')
array([1, 1, 1, 1, 1, 2])
And, of course, if your intervals are equally spaced, you can do:
>>> for n in (6, 6.1, 6.2, 6.5, 6.8, 7):
...     iwidth = intervals[1] - intervals[0]
...     print(np.ceil((n - intervals[0]) / iwidth))
...
0.0
1.0
1.0
1.0
1.0
1.0

As others have mentioned, if you have irregular intervals, use a bisecting search (e.g. np.searchsorted and/or np.digitize).
However, in your specific case where you've stated you'll always have regular intervals, you can also do something similar to:
import numpy as np
intervals = [6, 7, 8, 9, 10, 11]
vals = np.array([8.5, 6.2, 9.8])
dx = intervals[1] - intervals[0]
x0 = intervals[0]
i = np.ceil((vals - x0) / dx).astype(int)
Or, building on your example code:
import numpy as np
N = 8000
num_intervals = 50
data = np.random.random(N)
intervals = np.linspace(data.min(), data.max(), num_intervals)
x0 = intervals[0]
dx = intervals[1] - intervals[0]
i = np.ceil((data - x0) / dx).astype(int)
This will be much faster than a binary search for large arrays.
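If you want to verify that claim on your own data, a rough timeit sketch along these lines could be used (illustrative only; absolute numbers depend on your machine):
import numpy as np
import timeit

N = 8000
data = np.random.random(N)
intervals = np.linspace(data.min(), data.max(), 50)
x0, dx = intervals[0], intervals[1] - intervals[0]

# Compare the binary search against the direct arithmetic approach.
t_search = timeit.timeit(lambda: np.searchsorted(intervals, data), number=1000)
t_arith = timeit.timeit(lambda: np.ceil((data - x0) / dx).astype(int), number=1000)
print("searchsorted: %.3fs  arithmetic: %.3fs" % (t_search, t_arith))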

As long as your list is sorted, you can use the bisect library to get the insertion index.
import bisect

index = bisect.bisect_left(intervals, 8.5)

Using numpy.digitize:
http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.digitize.html#numpy-digitize
>>> import numpy as np
>>> intervals = [6, 7, 8, 9, 10, 11]
>>> data = [3.5, 6.3, 9.4, 11.5, 8.5]
>>> np.digitize(data, bins=intervals)
array([0, 1, 4, 6, 3])
An index of 0 means underflow (below the first edge); an index of len(intervals) means overflow (above the last edge).
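If the out-of-range values need special treatment, one option (my addition, not part of the original answer) is to mask them out before using the indices:
import numpy as np

intervals = np.array([6, 7, 8, 9, 10, 11])
data = np.array([3.5, 6.3, 9.4, 11.5, 8.5])

idx = np.digitize(data, bins=intervals)
# 0 and len(intervals) mark values outside the covered range.
inside = (idx > 0) & (idx < len(intervals))
print(idx[inside])    # [1 4 3]
print(data[~inside])  # [ 3.5 11.5]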

Just using numpy:
import numpy as np
intervals = np.array([6, 7, 8, 9, 10, 11])
val = (intervals > 8.5)
print(val.argmax())
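One caveat with this approach (my note, not part of the original answer): if the value is larger than every interval edge, the mask is all False and argmax() returns 0, which looks like a valid index. A small guard, as a sketch:
import numpy as np

intervals = np.array([6, 7, 8, 9, 10, 11])

def interval_index(value):
    # argmax() of an all-False mask is 0, so check explicitly
    # for values beyond the last edge.
    mask = intervals > value
    return mask.argmax() if mask.any() else len(intervals)

print(interval_index(8.5))   # 3
print(interval_index(12.0))  # 6 (past the last edge)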

I'd go for a function:
def f_idx(f_list, number):
    for idx, item in enumerate(f_list):
        if item > number:
            return idx
    return len(f_list)
As a one-liner:
result = [idx for idx, value in enumerate(intervals) if value > number][0] if intervals[-1] > number else len(intervals)

Related

How to replace list/tuple with dictionaries, in working code, to improve its performance?

I have this code which works fine and is minimal and reproducible. It uses lists and tuples. Given the slowness of lists and tuples on large amounts of data, I would like to change the whole setup and use dictionaries to speed up performance.
So I'd like to convert this block of queues into something similar that uses dictionaries.
The purpose of the code is to create the variables x and y (calculation of mathematical data) and add them to a list, using an append and tuples. I then mine the numbers for certain purposes.
How can I replace the list/append code with dictionaries where needed? Thank you!
VERSION WITH TUPLE AND LIST
mylist = {('Jack', 'Grace', 8, 9, '15:00'): [0, 1, 1, 5],
          ('William', 'Dawson', 8, 9, '18:00'): [1, 2, 3, 4],
          ('Natasha', 'Jonson', 8, 9, '20:45'): [0, 1, 1, 2]}

new = []
for key, value in mylist.items():
    # create variables and perform calculations
    calc_x = sum(value) / len(value)
    calc_y = (calc_x * 100) / 2
    # create list with 3 tuples inside
    if calc_x > 0.1:
        new.append([[key], [calc_x], [calc_y]])

print(new)
print(" ")

# example for calling calc_x
print_x = [tuple(i[1]) for i in new]
print(print_x)
I was trying to write something like this, but I don't think it fits, so don't even look at it. I have two requests, if possible:
I would like sum(value) / len(value) and (calc_x * 100) / 2 to keep their own variables calc_x and calc_y, so that they can be invoked individually in the append, as you can see.
In the new variable, I would like to be able to call the variables when they are needed, as I do for example with print_x = [tuple(i[1]) for i in new]. Thank you.
If you really want to improve performance, you can use Pandas (or Numpy) to vectorize math operations:
import pandas as pd
# Transform your dataset to DataFrame
df = pd.DataFrame.from_dict(mylist, orient='index')
# Compute some operations
df['x'] = df.mean(axis=1)
df['y'] = df['x'] * 50
# Filter out and export
out = df.loc[df['x'] > 0.1, ['x', 'y']].to_dict('split')
new = dict(zip(out['index'], out['data']))
Output:
>>> new
{('Jack', 'Grace', 8, 9, '15:00'): [1.75, 87.5],
('William', 'Dawson', 8, 9, '18:00'): [2.5, 125.0],
('Natasha', 'Jonson', 8, 9, '20:45'): [1.0, 50.0]}
A numpy version:
import numpy as np
# transform keys to numpy array (special hack to keep tuples)
keys = np.empty(len(mylist), dtype=object)
keys[:] = tuple(mylist.keys())
# transform values to numpy array
vals = np.array(tuple(mylist.values()))
x = np.mean(vals, axis=1)
y = x * 50
# boolean mask to exclude some values
m = x > 0.1
out = np.vstack([x, y]).T
new = dict(zip(keys[m].tolist(), out[m].tolist()))
print(new)
# Output
{('Jack', 'Grace', 8, 9, '15:00'): [1.75, 87.5],
('William', 'Dawson', 8, 9, '18:00'): [2.5, 125.0],
('Natasha', 'Jonson', 8, 9, '20:45'): [1.0, 50.0]}
A python version:
new = {}
for k, v in mylist.items():
    x = sum(v) / len(v)
    y = x * 50
    if x > 0.1:
        new[k] = [x, y]
print(new)
# Output
{('Jack', 'Grace', 8, 9, '15:00'): [1.75, 87.5],
('William', 'Dawson', 8, 9, '18:00'): [2.5, 125.0],
('Natasha', 'Jonson', 8, 9, '20:45'): [1.0, 50.0]}
Update: How to extract x:
# Pandas
>>> df['x'].tolist() # or simply df['x'] to extract the column
[1.75, 2.5, 1.0]
# Python
>>> [v[0] for v in new.values()]
[1.75, 2.5, 1.0]
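The numpy version isn't covered above; assuming its x array and mask m are still in scope, the equivalent would be:
# Numpy
>>> x[m].tolist()
[1.75, 2.5, 1.0]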

How to cut the lower values when calculating a percentile?

If I calculate the 90th percentile using numpy:
import numpy as np
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p = np.percentile(a, 90)
print(p)
It cuts off the highest values, so the result is:
9.1
How can I instead cut off the lower values, so the output would be:
2
Thank you!
You want the 10th percentile, not the 90th.
p = np.percentile(a, 10)
print(p)
# 1.9
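np.percentile interpolates linearly by default, which is why you get 1.9 rather than a value from the data. If you want the exact data value 2, a hedged sketch using the 'higher' interpolation rule (the method argument in recent numpy, interpolation in older versions):
import numpy as np

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# 'higher' rounds the interpolated position up to an actual data point,
# so the 10th percentile becomes 2 instead of 1.9.
p = np.percentile(a, 10, method='higher')
print(p)
# 2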

Numpy digitize included bin edge by absolute value

For the np.digitize function, I have a distribution of data about zero (includes negative and positive values). I would like the bin edge to be right=False for the positive values, but right=True for negative ones (i.e. were I to take the absolute value, the lower bound is inclusive in the bin).
>>> x = np.array([-10, -4, -1.2, -0.3, 3, 4, 7])
>>> bins = np.array([-8, -4, 0, 4, 8])
>>> np.digitize(x,bins,right=????)
array([0, 1, 2, 2, 3, 4, 4])
Is there an alternative method to handle this other than a conditional set:
if x <= -8:
    return 0
elif -8 < x <= -4:
    return 1
elif -4 < x <= 0:
    return 2
elif 0 < x < 4:
    return 3
elif 4 <= x < 8:
    return 4
elif 8 <= x:
    return 5
You can shift some of the boundaries by the smallest possible amount using numpy.nextafter:
>>> bins = bins.astype(x.dtype)
>>> bins = np.nextafter(bins, bins + (bins <= 0))
# apply
>>> np.digitize(x, bins)
array([0, 1, 2, 2, 3, 4, 4])
# zero also goes to the right bin
>>> np.digitize(0, bins)
array(2)
Upon inspection
>>> bins
array([-8.e+000, -4.e+000, 5.e-324, 4.e+000, 8.e+000])
# ndarray.__str__ rounds, but casting to list reveals
>>> bins.tolist()
[-7.999999999999999, -3.9999999999999996, 5e-324, 4.0, 8.0]
we see that zero was shifted to something looking suspiciously like a denormal, which may or may not cause problems on some platforms.
Just to be sure, we can avoid this issue by going the other way:
>>> bins = np.array([-8, -4, 0, 4, 8])
>>> bins = bins.astype(x.dtype)
>>> bins = np.nextafter(bins, np.minimum(bins, 0))
>>> np.digitize(x, bins, True)
array([0, 1, 2, 2, 3, 4, 4])
>>> np.digitize(0, bins, True)
array(2)
>>> bins.tolist()
[-8.0, -4.0, 0.0, 3.9999999999999996, 7.999999999999999]

Python Numpy: Replace duplicate values with mean value

I have two measurements, position and temperature, which are sampled at a fixed sampling rate. Some positions might occur multiple times in the data. Now I want to plot the temperature over the position and not over the time. Instead of displaying two points at the same position, I want to replace the temperature measurements with the mean value for the given location. How can this be done nicely in Python with numpy?
My solution so far looks like this:
import matplotlib.pyplot as plt
import numpy as np

# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)

# Get correct order
idx = np.argsort(x)
x, y = x[idx], y[idx]
plt.plot(x, y)  # Plot with multiple points at same location

# Calculate means for duplicates
new_x = []
new_y = []
skip_next = False
for idx in range(len(x)):
    if skip_next:
        skip_next = False
        continue
    if idx < len(x) - 1 and x[idx] == x[idx + 1]:
        new_x.append(x[idx])
        new_y.append((y[idx] + y[idx + 1]) / 2)
        skip_next = True
    else:
        new_x.append(x[idx])
        new_y.append(y[idx])
        skip_next = False

x, y = np.array(new_x), np.array(new_y)
plt.plot(x, y)  # Plots desired output
This solution does not take into account that some positions might occur more than twice in the data. To replace all values, the loop must be run multiple times. I know there must be a better solution to this!
One approach using np.bincount -
import numpy as np
# x = Position Data
# y = Temperature Data
x = np.random.permutation([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9])
y = (x + np.random.rand(len(x)) * 1 - 0.5).round(2)
# Find unique sorted values for x
x_new = np.unique(x)
# Use bincount to get the accumulated summation for each unique x, and
# divide each summation by the respective count of each unique value in x
y_new_mean = np.bincount(x, weights=y) / np.bincount(x)
Sample run -
In [16]: x
Out[16]: array([7, 0, 2, 8, 5, 4, 1, 9, 6, 8, 1, 3, 5])
In [17]: y
Out[17]:
array([ 6.7 ,  0.12,  2.33,  8.19,  5.19,  3.68,  0.62,  9.46,  6.01,
        8.  ,  1.07,  3.07,  5.01])
In [18]: x_new
Out[18]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [19]: y_new_mean
Out[19]:
array([ 0.12 , 0.845, 2.33 , 3.07 , 3.68 , 5.1 , 6.01 , 6.7 ,
8.095, 9.46 ])
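One caveat (my note, not from the original answer): np.bincount only accepts non-negative integer positions. If x holds floats or arbitrary values, mapping them to dense integer labels with np.unique first is a reasonable workaround:
import numpy as np

x = np.array([0.5, 1.2, 1.2, 3.7, 0.5])
y = np.array([10., 20., 30., 40., 50.])

# return_inverse yields an integer label per element, which bincount
# can consume even though the positions themselves are floats.
x_new, inv = np.unique(x, return_inverse=True)
y_new_mean = np.bincount(inv, weights=y) / np.bincount(inv)
print(x_new)       # [0.5 1.2 3.7]
print(y_new_mean)  # [30. 25. 40.]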
If I understand what you're asking, here's one way to do it that is a lot simpler.
Given some dataset that is randomly arranged, but each position is connected with each temperature:
data = np.random.permutation([(1, 5.6), (1, 3.4), (1, 4.5), (2, 5.3), (3, 2.2), (3, 6.8)])
>> array([[ 3. ,  2.2],
          [ 3. ,  6.8],
          [ 1. ,  3.4],
          [ 1. ,  5.6],
          [ 2. ,  5.3],
          [ 1. ,  4.5]])
We can sort and put each position in a dictionary as its key while keeping track of the temperatures for that position in an array in the dictionary. We use some error handling here: if the key (position) is not yet in our dictionary, Python will raise a KeyError, so we add it.
results = {}
for entry in sorted(data, key=lambda t: t[0]):
    try:
        results[entry[0]] = results[entry[0]] + [entry[1]]
    except KeyError:
        results[entry[0]] = [entry[1]]
print(results)
>> {1.0: [3.3999999999999999, 5.5999999999999996, 4.5],
    2.0: [5.2999999999999998],
    3.0: [2.2000000000000002, 6.7999999999999998]}
And with a final list comprehension we can flatten this and get the resulting array.
np.array([[key, np.mean(results[key])] for key in results.keys()])
>> array([[ 1. ,  4.5],
          [ 2. ,  5.3],
          [ 3. ,  4.5]])
This can be put in a function:
def flatten_by_position(data):
    results = {}
    for entry in sorted(data, key=lambda t: t[0]):
        try:
            results[entry[0]] = results[entry[0]] + [entry[1]]
        except KeyError:
            results[entry[0]] = [entry[1]]
    return np.array([[key, np.mean(results[key])] for key in results.keys()])
Tested with a variety of inputs, this solution should be fast enough for datasets under 1,000,000 entries.
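Called with the sample data from above, it reproduces the earlier result:
>>> flatten_by_position(data)
array([[ 1. ,  4.5],
       [ 2. ,  5.3],
       [ 3. ,  4.5]])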

averaging matrix efficiently

In Python, given an n x p matrix, e.g. 4 x 4, how can I return a 4 x 2 matrix that averages the first two columns and the last two columns across all 4 rows?
e.g. given:
a = array([[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]])
return a matrix that has the average of a[:, 0] and a[:, 1] and the average of a[:, 2] and a[:, 3].
I want this to work for an arbitrary n x p matrix, assuming that p is evenly divisible by the number of columns being averaged in each group.
let me clarify: for each row, I want to take the average of the first two columns, then the average of the last two columns. So it would be:
(1 + 2)/2, (3 + 4)/2 <- row 1 of new matrix
(5 + 6)/2, (7 + 8)/2 <- row 2 of new matrix, etc.
which should yield a 4 by 2 matrix rather than 4 x 4.
thanks.
How about using some math? You can define a matrix M = [[0.5,0],[0.5,0],[0,0.5],[0,0.5]] so that A*M is what you want.
from numpy import array, matrix

A = array([[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]])
M = matrix([[0.5, 0],
            [0.5, 0],
            [0, 0.5],
            [0, 0.5]])
print(A * M)
Generating M is pretty simple too; its entries are 1/n or zero.
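For instance, one way to build M for an arbitrary group size is a Kronecker product (my construction, not the answerer's; it assumes the column count divides evenly into groups):
import numpy as np

p = 4      # number of columns in A
group = 2  # columns averaged together

# Kronecker product of an identity block with a column of 1/group
# places the averaging weights on the block diagonal.
M = np.kron(np.eye(p // group), np.ones((group, 1)) / group)
print(M)
# [[0.5 0. ]
#  [0.5 0. ]
#  [0.  0.5]
#  [0.  0.5]]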
reshape - get mean - reshape
>>> a.reshape(-1, a.shape[1]//2).mean(1).reshape(a.shape[0],-1)
array([[  1.5,   3.5],
       [  5.5,   7.5],
       [  9.5,  11.5],
       [ 13.5,  15.5]])
This is supposed to work for any array with an even number of columns, and reshape doesn't make a copy.
It's a bit unclear what should happen for matrices with n > 4, but this code will do what you want:
import numpy as N

a = N.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], dtype=float)
avg = N.vstack((N.average(a[:, 0:2], axis=1), N.average(a[:, 2:4], axis=1))).T
This yields avg =
array([[  1.5,   3.5],
       [  5.5,   7.5],
       [  9.5,  11.5],
       [ 13.5,  15.5]])
Here's a way to do it. You only need to change groupsize to make it work with other sizes like you said, though I'm not fully sure what you want.
groupsize = 2
# keepdims=True keeps each group mean as a column vector so that
# hstack produces a 2-D result instead of a flat array.
out = np.hstack([x.mean(axis=1, keepdims=True) for x in np.hsplit(a, groupsize)])
yields
array([[  1.5,   3.5],
       [  5.5,   7.5],
       [  9.5,  11.5],
       [ 13.5,  15.5]])
for out. Hopefully it gives you some ideas on how to do exactly what you want. You can make groupsize depend on the dimensions of a, for instance.
