I have this simple loop which processes a big dataset.
for i in range(len(nbi.LONG_017)):
    StoredOCC = []
    for j in range(len(Final.X)):
        r = haversine(nbi.LONG_017[i], nbi.LAT_016[i], Final.X[j], Final.Y[j])
        if (r < 0.03048):
            SWw = Final.CUM_OCC[j]
            StoredOCC.append(SWw)
    if len(StoredOCC) != 0:
        nbi.loc[i, 'ADTT_02_2019'] = np.max(StoredOCC)
len(nbi.LONG_017) is 3,000 and len(Final.X) is 6 million data points.
Is there a more efficient way to implement this, or a way to use parallel computing to make it faster?
For my haversine function I used the code from "Haversine Formula in Python (Bearing and Distance between two GPS points)":
from math import radians, sin, cos, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6372.8  # Radius of earth in kilometers. Use 3956 for miles
    return r * c
Before talking about parallelization, you can work on optimizing your loop. The first step is to iterate over the data directly instead of iterating over index values and looking the data up element by element:
import numpy as np
import pandas as pd

# toy sample
np.random.seed(1)
size_nbi = 20
size_Final = 100

nbi = pd.DataFrame({'LONG_017': np.random.random(size=size_nbi)/100 + 73,
                    'LAT_016': np.random.random(size=size_nbi)/100 + 73})
Final = pd.DataFrame({'X': np.random.random(size=size_Final)/100 + 73,
                      'Y': np.random.random(size=size_Final)/100 + 73,
                      'CUM_OCC': np.random.randint(size_Final, size=size_Final)})
Using your method, with these dataframe sizes, you get about 75 ms:
%%timeit
for i in range(len(nbi.LONG_017)):
    StoredOCC = []
    for j in range(len(Final.X)):
        r = haversine(nbi.LONG_017[i], nbi.LAT_016[i], Final.X[j], Final.Y[j])
        if (r < 0.03048):
            SWw = Final.CUM_OCC[j]
            StoredOCC.append(SWw)
    if len(StoredOCC) != 0:
        nbi.loc[i, 'ADTT_02_2019'] = np.max(StoredOCC)

# 75.6 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now if you change the loops slightly, iterating over the data themselves, collecting the results in a list, and assigning the result column once outside the loops, you get down to about 5 ms, so roughly 15 times faster just from this optimization (and you might find an even better one):
%%timeit
res_ADIT = []
for lon1, lat1 in zip(nbi.LONG_017.to_numpy(),
                      nbi.LAT_016.to_numpy()):
    StoredOCC = []
    for lon2, lat2, SWw in zip(Final.X.to_numpy(),
                               Final.Y.to_numpy(),
                               Final.CUM_OCC.to_numpy()):
        r = haversine(lon1, lat1, lon2, lat2)
        if (r < 0.03048):
            StoredOCC.append(SWw)
    if len(StoredOCC) != 0:
        res_ADIT.append(np.max(StoredOCC))
    else:
        res_ADIT.append(np.nan)
nbi['ADIT_v2'] = res_ADIT

# 5.23 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now you can go even further and use numpy to vectorize the second loop, passing values already converted to radians instead of mapping them on every iteration:
# empty list for results
res_ADIT = []

# work with arrays in radians
arr_lon2 = np.radians(Final.X.to_numpy())
arr_lat2 = np.radians(Final.Y.to_numpy())
arr_OCC = Final.CUM_OCC.to_numpy()

for lon1, lat1 in zip(np.radians(nbi.LONG_017),  # pass the radians directly
                      np.radians(nbi.LAT_016)):
    # do all the subtractions in a vectorized way
    arr_dlon = arr_lon2 - lon1
    arr_dlat = arr_lat2 - lat1

    # same here, using numpy functions
    arr_dist = np.sin(arr_dlat/2)**2 + np.cos(lat1) * np.cos(arr_lat2) * np.sin(arr_dlon/2)**2
    arr_dist = 2 * np.arcsin(np.sqrt(arr_dist))
    arr_dist *= 6372.8

    # extract the values of CUM_OCC that meet the criterion
    r = arr_OCC[arr_dist < 0.03048]

    # check that there is at least one element
    if r.size > 0:
        res_ADIT.append(max(r))
    else:
        res_ADIT.append(np.nan)

nbi['AUDIT_np'] = res_ADIT
If you time this, you get down to 819 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each), so about 90 times faster than your original solution for these small dataframes, and all the results are the same:
print(nbi)
LONG_017 LAT_016 ADTT_02_2019 ADIT_v2 AUDIT_np
0 73.004170 73.008007 NaN NaN NaN
1 73.007203 73.009683 30.0 30.0 30.0
2 73.000001 73.003134 14.0 14.0 14.0
3 73.003023 73.006923 82.0 82.0 82.0
4 73.001468 73.008764 NaN NaN NaN
5 73.000923 73.008946 NaN NaN NaN
6 73.001863 73.000850 NaN NaN NaN
7 73.003456 73.000391 NaN NaN NaN
8 73.003968 73.001698 21.0 21.0 21.0
9 73.005388 73.008781 NaN NaN NaN
10 73.004192 73.000983 93.0 93.0 93.0
11 73.006852 73.004211 NaN NaN NaN
You can play a bit with the code and increase the size of each toy dataframe; you will see that as the sizes grow, especially the size of Final, the gain becomes significant. For example, with size_nbi = 20 and size_Final = 1000, the vectorized solution is about 400 times faster. For your full data size (3K * 6M) you still need some time: on my computer I estimate about 25 minutes, while your original solution would take hundreds of hours. If that is still not enough, you can think about parallelization or about using numba.
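For instance, here is a minimal numba sketch of the same per-point search (an illustration, not part of the original answer: it assumes numba is installed, the function name max_occ_within_radius and the column ADIT_numba are made up, and prange gives you the parallel loop almost for free):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def max_occ_within_radius(lon1, lat1, lon2, lat2, occ, radius_km):
    # lon/lat arrays are expected in radians; occ holds CUM_OCC as float64
    out = np.full(lon1.shape[0], np.nan)
    for i in prange(lon1.shape[0]):
        best = np.nan
        for j in range(lon2.shape[0]):
            dlon = lon2[j] - lon1[i]
            dlat = lat2[j] - lat1[i]
            a = np.sin(dlat / 2) ** 2 + np.cos(lat1[i]) * np.cos(lat2[j]) * np.sin(dlon / 2) ** 2
            d = 2.0 * 6372.8 * np.arcsin(np.sqrt(a))
            if d < radius_km and (np.isnan(best) or occ[j] > best):
                best = occ[j]
        out[i] = best
    return out

nbi['ADIT_numba'] = max_occ_within_radius(
    np.radians(nbi.LONG_017.to_numpy()), np.radians(nbi.LAT_016.to_numpy()),
    np.radians(Final.X.to_numpy()), np.radians(Final.Y.to_numpy()),
    Final.CUM_OCC.to_numpy().astype(np.float64), 0.03048)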
This method using a BallTree takes about a minute on my machine (depending on the radius; the smaller it is, the faster the query) with sizes similar to the ones you mention (7,000 and 6M).
import numpy as np
import sklearn
import pandas as pd
I generate the data (thanks Ben, I used your code):
# toy sample
np.random.seed(1)
size_nbi = 7000
size_Final = 6000000

nbi = pd.DataFrame({'LONG_017': np.random.random(size=size_nbi)/10 + 73,
                    'LAT_016': np.random.random(size=size_nbi)/10 + 73})
Final = pd.DataFrame({'X': np.random.random(size=size_Final)/10 + 73,
                      'Y': np.random.random(size=size_Final)/10 + 73,
                      'CUM_OCC': np.random.randint(size_Final, size=size_Final)})
nbi_gps = nbi[["LAT_016", "LONG_017"]].values
final_gps = Final[["Y", "X"]].values
Create a BallTree:
%%time
from sklearn.neighbors import BallTree
import numpy as np
nbi = np.radians(nbi_gps)
final = np.radians(final_gps)
tree = BallTree(final, leaf_size=12, metric='haversine')
Which took Wall time: 23.8 s
Query with
%%time
radius = 0.0000003
StoredOCC_indici = tree.query_radius(nbi, r=radius, return_distance=False, count_only=False)
(under a second)
And get the maximum of the values you are interested in:
StoredOCC = [np.max(Final.CUM_OCC[i]) for i in StoredOCC_indici]
Which automatically gives NaNs for empty index lists, which is nice. This took Wall time: 3.64 s
For this radius the whole calculation on 6 million took under a minute on my machine.
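If np.max complains about the empty selections on your pandas/numpy versions, an explicit guard keeps the NaN behaviour (a small sketch reusing StoredOCC_indici from the query above):
# Take the max CUM_OCC per query point, with an explicit NaN for empty neighbourhoods
cum_occ = Final.CUM_OCC.to_numpy()
StoredOCC = [cum_occ[idx].max() if idx.size > 0 else np.nan
             for idx in StoredOCC_indici]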
Related
I have a dataframe with >2.7MM coordinates, and a separate list of ~2,000 coordinates. I'm trying to return the minimum distance between the coordinates in each individual row compared to every coordinate in the list. The following code works on a small scale (dataframe with 200 rows), but when calculating over 2.7MM rows, it seemingly runs forever.
from haversine import haversine
df
Latitude Longitude
39.989 -89.980
39.923 -89.901
39.990 -89.987
39.884 -89.943
39.030 -89.931
end_coords_list = [(41.342,-90.423),(40.349,-91.394),(38.928,-89.323)]
for row in df.itertuples():
    def min_distance(row):
        beg_coord = (row.Latitude, row.Longitude)
        return min(haversine(beg_coord, end_coord) for end_coord in end_coords_list)
    df['Min_Distance'] = df.apply(min_distance, axis=1)
I know the issue lies in the sheer number of calculations that are happening (5.7MM * 2,000 = ~11.4BN), and the fact that running this many loops is incredibly inefficient.
Based on my research, it seems like a vectorized NumPy function might be a better approach, but I'm new to Python and NumPy so I'm not quite sure how to implement this in this particular situation.
Ideal Output:
df
Latitude Longitude Min_Distance
39.989 -89.980 3.7
39.923 -89.901 4.1
39.990 -89.987 4.2
39.884 -89.943 5.9
39.030 -89.931 3.1
Thanks in advance!
The haversine func in essence is:
# convert all latitudes/longitudes from decimal degrees to radians
lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
# calculate haversine
lat = lat2 - lat1
lng = lng2 - lng1
d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
Here's a vectorized method leveraging the powerful NumPy broadcasting and NumPy ufuncs to replace those math-module funcs so that we would operate on entire arrays in one go -
# Get array data; convert to radians to simulate 'map(radians,...)' part
coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df.values)
# Get the differentiations
lat = coords_arr[:,0] - a[:,0,None]
lng = coords_arr[:,1] - a[:,1,None]
# Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
# Add into "sin(lat * 0.5) ** 2" part.
add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
d = np.sin(lat * 0.5) ** 2 + add0
# Get h and assign into dataframe
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
df['Min_Distance'] = h.min(1)
For further performance boost, we can make use of numexpr module to replace the transcendental funcs.
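For instance, the transcendental part could be routed through numexpr roughly like this (a sketch, not the exact code used for the timings below: it assumes numexpr is installed, that coords_list holds the target (lat, lng) pairs as above, that the dataframe has Latitude/Longitude columns, and that the Earth radius constant matches the one your haversine uses; Min_Distance_ne is just an illustrative column name):
import numexpr as ne
import numpy as np

AVG_EARTH_RADIUS = 6371.0  # km, assumed; use the same constant as your haversine
R = AVG_EARTH_RADIUS

coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df[['Latitude', 'Longitude']].values)

lat = coords_arr[:, 0] - a[:, 0, None]   # same broadcasted differences as above
lng = coords_arr[:, 1] - a[:, 1, None]
lat1 = a[:, 0, None]
lat2 = coords_arr[:, 0]

# numexpr evaluates each whole expression in one multithreaded pass over the arrays
d = ne.evaluate('sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2')
h = ne.evaluate('2 * R * arcsin(sqrt(d))')
df['Min_Distance_ne'] = h.min(1)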
Runtime test and verification
Approaches -
def loopy_app(df, coords_list):
    for row in df.itertuples():
        df['Min_Distance1'] = df.apply(min_distance, axis=1)

def vectorized_app(df, coords_list):
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df.values)

    lat = coords_arr[:,0] - a[:,0,None]
    lng = coords_arr[:,1] - a[:,1,None]

    add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
    d = np.sin(lat * 0.5) ** 2 + add0

    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    df['Min_Distance2'] = h.min(1)
Verification -
In [158]: df
Out[158]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [159]: loopy_app(df, coords_list)
In [160]: vectorized_app(df, coords_list)
In [161]: df
Out[161]:
Latitude Longitude Min_Distance1 Min_Distance2
0 39.989 -89.980 126.637607 126.637607
1 39.923 -89.901 121.266241 121.266241
2 39.990 -89.987 126.037388 126.037388
3 39.884 -89.943 118.901195 118.901195
4 39.030 -89.931 53.765506 53.765506
Timings -
In [163]: df
Out[163]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [164]: %timeit loopy_app(df, coords_list)
100 loops, best of 3: 2.41 ms per loop
In [165]: %timeit vectorized_app(df, coords_list)
10000 loops, best of 3: 96.8 µs per loop
I have data like this:
ID 8-Jan 15-Jan 22-Jan 29-Jan 5-Feb 12-Feb LowerBound UpperBound
001 618 720 645 573 503 447 - -
002 62 80 67 94 81 65 - -
003 32 10 23 26 26 31 - -
004 22 13 1 28 19 25 - -
005 9 7 9 6 8 4 - -
I want to create two columns with lower and upper bounds for each product, using 95% confidence intervals. I know the manual way of writing a function that loops through each product ID:
import numpy as np
import scipy as sp
import scipy.stats
# Method copied from http://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0*np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * sp.stats.t._ppf((1+confidence)/2., n-1)
    return m-h, m+h
Is there an efficient way to do this in pandas (a one-liner kind of thing)?
Of course, you want df.apply. Note you need to modify mean_confidence_interval to return pd.Series([m-h, m+h]).
df[['LowerBound','UpperBound']] = df.apply(mean_confidence_interval, axis=1)
Standard error of the mean is pretty straightforward to calculate so you can easily vectorize this:
import scipy.stats as ss
df.mean(axis=1) + ss.t.ppf(0.975, df.shape[1]-1) * df.std(axis=1)/np.sqrt(df.shape[1])
will give you the upper bound. Use - ss.t.ppf for the lower bound.
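Putting both bounds into the table in one go could look like this (a sketch; it assumes the six weekly columns '8-Jan' through '12-Feb' are sliced out first so that the ID and bound columns don't end up in the mean):
import numpy as np
import scipy.stats as ss

samples = df.loc[:, '8-Jan':'12-Feb']  # only the numeric weekly columns
dist = ss.t.ppf(0.975, samples.shape[1] - 1) * samples.std(axis=1) / np.sqrt(samples.shape[1])
mean = samples.mean(axis=1)
df['LowerBound'] = mean - dist
df['UpperBound'] = mean + dist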
Also, pandas seems to have a sem method. If you have a large dataset, I don't suggest using apply over rows. It is pretty slow. Here are some timings:
df = pd.DataFrame(np.random.randn(100, 10))
%timeit df.apply(mean_confidence_interval, axis=1)
100 loops, best of 3: 18.2 ms per loop
%%timeit
dist = ss.t.ppf(0.975, df.shape[1]-1) * df.sem(axis=1)
mean = df.mean(axis=1)
mean - dist, mean + dist
1000 loops, best of 3: 598 µs per loop
Since you already created a function for calculating the confidence interval, simply apply it to each row of your data:
def mean_confidence_interval(data):
    confidence = 0.95
    m = data.mean()
    se = scipy.stats.sem(data)
    h = se * sp.stats.t._ppf((1 + confidence) / 2, data.shape[0] - 1)
    return pd.Series((m - h, m + h))

interval = df.apply(mean_confidence_interval, axis=1)
interval.columns = ("LowerBound", "UpperBound")
pd.concat([df, interval], axis=1)
There is a Pandas DataFrame object with some stock data. The SMAs are moving averages calculated from the previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates, when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row, when:
the long SMA(45) value was bigger than the short SMA(15) value for longer than the short SMA period (15) and then became smaller;
the long SMA(45) value was smaller than the short SMA(15) value for longer than the short SMA period (15) and then became bigger.
I'm taking a crossover to mean when the SMA lines -- as functions of time --
intersect, as depicted on this investopedia
page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
If we change your data to
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
so that there are crossings,
then
import pandas as pd
df = pd.read_table('data', sep='\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results but takes less time than the previous methods:
df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True) # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
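The crossover rows can then be read straight off the boolean column, for example:
crossing_dates = df.loc[df['crossover'], 'Date']
print(crossing_dates)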
As an alternative to unutbu's answer, something like the following can also be done to find the indices where SMA_15 crosses SMA_45:
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_15 SMA_45
1 20150128 103.05 100 106
2 20150129 105.10 112 105
Suppose house sale figures are presented for a town in ranges:
< $100,000 204
$100,000 - $199,999 1651
$200,000 - $299,999 2405
$300,000 - $399,999 1972
$400,000 - $500,000 872
> $500,000 1455
I want to know which house-price bin a given percentile falls in. Is there a way of using numpy's percentile function to do this? I can do it by hand:
import numpy as np
a = np.array([204., 1651., 2405., 1972., 872., 1455.])
b = np.cumsum(a)/np.sum(a) * 100
q = 75
len(b[b <= q])
4 # ie bin $300,000 - $399,999
But is there a way to use np.percentile instead?
You were almost there:
cs = np.cumsum(a)
bin_idx = np.searchsorted(cs, np.percentile(cs, 75))
At least for this case (and a couple others with larger a arrays), it's not any faster, though:
In [9]: %%timeit
...: b = np.cumsum(a)/np.sum(a) * 100
...: len(b[b <= 75])
...:
10000 loops, best of 3: 38.6 µs per loop
In [10]: %%timeit
....: cs = np.cumsum(a)
....: np.searchsorted(cs, np.percentile(cs, 75))
....:
10000 loops, best of 3: 125 µs per loop
So unless you want to check for multiple percentiles, I'd stick with what you have.
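If you do end up needing several percentiles at once, the same cumsum-plus-searchsorted approach vectorizes over them, for example (qs here is just an illustrative list of percentiles):
cs = np.cumsum(a)
qs = [25, 50, 75, 90]
bin_idx = np.searchsorted(cs, np.percentile(cs, qs))
# bin_idx[k] is the index of the price bin that percentile qs[k] falls in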
I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from lineprofiler is below, and this shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way in which the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
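In other words, for masked reads the coordinate arrays are often unnecessary; here is a quick sketch of the two equivalent lookups on toy arrays (the names are made up for the illustration):
import numpy as np

seg = np.random.randint(100, size=(1000, 1000))
hot = np.random.rand(1000, 1000)

# boolean mask: cheap to build and usable directly for indexing
mask = seg == 42
mean_via_mask = hot[mask].mean()

# np.where: pays extra to materialise the coordinate arrays first
ind_y, ind_x = np.where(seg == 42)
mean_via_where = hot[ind_y, ind_x].mean()

assert np.isclose(mean_via_mask, mean_via_where)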
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))

    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim+6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop+6,
                                None) for dim in slice_)
        indices = seg_image_view == j+1

        obj_avg = hot_image_view[indices].mean()

        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated

        border_avg = hot_image_padded[new_slice][border == 1].mean()

        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))

    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 & 47; you could add seg_mask = seg_image == i and then reuse that result (it might also be good to separate out that piece for profiling purposes).
While there are some other minor things you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm to accomplish the same thing, but that's the first place I'd look for a speedup.
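A toy sketch of that reuse pattern (random arrays stand in for the real images and the loop body is trimmed down; only the mask handling is the point):
import numpy as np

seg_image = np.random.randint(0, 6, size=(1000, 1000))
hot_image = np.random.rand(1000, 1000)
new_hot = hot_image.copy()

for i in range(1, 6):
    seg_mask = seg_image == i            # computed once ...
    ind_y, ind_x = np.where(seg_mask)    # ... reused here instead of a second comparison (line 15)
    obj_avg = hot_image[ind_y, ind_x].mean()
    new_hot[seg_mask] += obj_avg         # ... and reused again for the write-back (line 47)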