I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).
The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However, I want a list of the coordinates of each blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image, but it is very slow, far slower than the initial labelling.
Minimal Reproducible example:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # of each distinct blob.
    blobs = []
    label = 1
    while True:
        indices_of_label = np.where(labelled_array == label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob = list(zip(*indices_of_label))
            label += 1
            blobs.append(blob)
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
[0 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 1 1 0 1 1 0 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]]
2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0 1 0 0 2 2 0 3 3 0 0 4]
[ 0 1 0 2 2 2 0 3 3 3 0 4]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 5 5 5 5 0 0 0 0 3 0 0]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 0 6 0 0 0 0 0 0 0 0 0]
[ 0 6 0 0 7 7 0 8 8 0 0 9]
[ 0 0 0 0 0 0 0 8 8 8 0 0]
[ 0 10 10 10 10 0 0 0 0 8 0 0]]
Beginning extract_blobs_from_labelled_array timing
Time taken:
9.346099977847189e-05
9e-05 is small but so is this image for the example. In reality I am working with very high resolution images for which the function takes approximately 10 minutes.
Is there a faster way to do this?
Side note: I'm only using list(zip()) to try to get the numpy coordinates into something I'm used to (I don't use numpy much, just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?
The part of the code that is slow is here:
while True:
    indices_of_label = np.where(labelled_array == label)
    if not indices_of_label[0].size > 0:
        break
    else:
        blob = list(zip(*indices_of_label))
        label += 1
        blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.
Instead, you should use:
for label in range(1, np.max(labels) + 1):  # labels run from 1 to the maximum; 0 is background
and then you can ignore the if ...: break.
A second issue is indeed that you are using list(zip(*)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will give you a 2D array of shape (n_coords, n_dim), i.e. (n_coords, 2).
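As a small illustration of that point, here is a minimal sketch on a made-up 2x3 labelled array (the array and the variable names as_tuples and as_array are only for this example). It shows that np.transpose gives the same coordinates as list(zip(*...)), but as a single array:

```python
import numpy as np

# Tiny made-up labelled array, only for illustration.
labelled = np.array([[1, 0, 2],
                     [1, 0, 2]])

indices_of_label = np.where(labelled == 2)   # tuple of (row_indices, col_indices)

as_tuples = list(zip(*indices_of_label))     # list of (row, col) tuples
as_array = np.transpose(indices_of_label)    # array of shape (n_coords, 2)

print(as_tuples)
print(as_array)
```

Each row of as_array is one (row, col) pair, so it can be used directly for further NumPy indexing without converting back.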
But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProperties object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    """Return a list containing coordinates of pixels in each blob."""
    props = measure.regionprops(labelled_array)
    blobs = [p.coords for p in props]
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
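If you would rather stay in plain NumPy, the same idea of avoiding one full scan per label can be sketched with a single sort. This is not what regionprops does internally; it is only an illustration, and the function name blobs_by_sorting and the small test array are made up for this example:

```python
import numpy as np

def blobs_by_sorting(labelled_array):
    """Group foreground pixel coordinates by label using one sort."""
    mask = labelled_array > 0
    coords = np.argwhere(mask)            # (n_pixels, n_dim), row-major order
    pixel_labels = labelled_array[mask]   # label of each row in coords, same order
    if pixel_labels.size == 0:
        return []
    # A stable sort keeps the row-major pixel order inside each blob.
    order = np.argsort(pixel_labels, kind="stable")
    coords = coords[order]
    sorted_labels = pixel_labels[order]
    # Positions where the label value changes mark the start of a new blob.
    boundaries = np.flatnonzero(np.diff(sorted_labels)) + 1
    return np.split(coords, boundaries)

# Made-up labelled array with three blobs.
example = np.array([[1, 0, 2],
                    [1, 0, 2],
                    [0, 3, 0]])
for blob in blobs_by_sorting(example):
    print(blob.tolist())
```

The list is ordered by increasing label, and each entry is an (n_coords, 2) array like the .coords attribute above.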
I have a CSV file with 0s and 1s and need to determine the sum total of the entire file. The file looks like this when opened in Excel:
0 1 1 1 0 0 0 1 0 1
1 0 1 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1
0 1 1 1 1 1 1 0 1 1
0 0 1 0 1 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 1 1 0 1 1
0 0 1 1 0 0 1 1 0 1
1 0 1 0 1 0 1 1 1 0
0 1 0 0 1 0 0 0 1 1
Using this script I can sum the values of each row and they print out in a single column:
import csv
import numpy as np
path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
with infile as file_in:
    fin = csv.reader(file_in, delimiter = ',')
    for line in fin:
        print line.count('1')
I need to be able to sum up the resulting column, but my experience with this is mild. Looking for suggestions. Thanks.
If you have more than just 1's and 0's, map to int and sum all rows:
with open( r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter = ',')
    count = sum(sum(map(int,row)) for row in r)
Or just count the 1's:
with open( r'E:\myPy\one_zero.csv' ) as f:
    r = csv.reader(f, delimiter = ',')
    count = sum(row.count("1") for row in r)
Just use with open(r'E:\myPy\one_zero.csv'), you don't need to and should not open and then pass the file handle to with.
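Since the question already imports numpy, another option is to load the whole grid and sum it. The sketch below uses a small in-memory sample (made up for this example) in place of the real file so that it runs on its own; with the real file you would pass the path or an open file object instead of the StringIO:

```python
import csv
import io

import numpy as np

# Made-up sample standing in for the contents of the CSV file.
sample = "0,1,1\n1,0,1\n"

# Counting the string '1' in each row, as in the csv-based approach.
total_csv = sum(row.count("1") for row in csv.reader(io.StringIO(sample)))

# Loading everything as numbers and summing; this also works for values other than 0 and 1.
total_np = int(np.loadtxt(io.StringIO(sample), delimiter=",").sum())

print(total_csv, total_np)
```

Both totals agree here; the NumPy version reads the whole file into memory, which is fine for small grids.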
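path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
answer = 0
with infile as file_in:
    fin = csv.reader(file_in, delimiter = ',')
    for line in fin:
        a = line.count('1')  # csv.reader yields strings, so count the string '1'
        answer += a
print answer
Example:
answer = 0
lines = [[1, 0, 0, 1],[1,1,1,1],[0,0,0,1]]
for line in lines:
    a = line.count(1)
    answer += a
print answer
7
One possible source of error is mixing up
line.count('1')
and
line.count(1)
The first looks for a string and the second for a number. Rows produced by csv.reader contain strings, so count('1') is the right call there; count(1) only matches once the values are integers, as in the example lists above.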
Why use the CSV module at all? You have a file full of 0s, 1s, commas and newlines. Just open the file, read() it and count the 1s:
>>> with open(filename, 'r') as fin: print fin.read().count('1')
That should get you what you want, no?
I have a text file of this format:
EFF 3500. GRAVITY 0.00000 SDSC GRID [+0.0] VTURB 2.0 KM/S L/H 1.25
wl(nm) Inu(ergs/cm**2/s/hz/ster) for 17 mu in 1221 frequency intervals
1.000 .900 .800 .700 .600 .500 .400 .300 .250 .200 .150 .125 .100 .075 .050 .025 .010
9.09 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.35 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.61 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.77 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.96 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10.20 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10.38 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...more numbers
I'm trying to make it so File[0][0] will print the word "EFF" and so on.
import sys
import numpy as np
from math import *
import matplotlib.pyplot as plt
print 'Number of arguments:', len(sys.argv), 'arguments.'
print 'Argument List:', str(sys.argv)
z = np.array(sys.argv) #store all of the file names into array
i = len(sys.argv) #the length of the filenames array
File = open(str(z[1])).readlines() #load spectrum file
for n in range(0, len(File)):
    File[n].split()
for n in range(0, len(File[1])):
    print File[1][n]
However, it keeps outputting individual characters, as if each list index were a single character. This includes whitespace too. I have split() in a loop because if I put readlines().split() it gives an error.
Output:
E
F
F
3
5
0
0
.
G
R
A
V
I
...etc.
What am I doing wrong?
>>> text = """some
... multiline
... text
... """
>>> lines = text.splitlines()
>>> for i in range(len(lines)):
...     lines[i].split()  # split *returns* the list of tokens
...                       # it does *not* modify the string in place
...
['some']
['multiline']
['text']
>>> lines #strings unchanged
['some', 'multiline', 'text']
>>> for i in range(len(lines)):
...     lines[i] = lines[i].split()  # you have to modify the list
...
>>> lines
[['some'], ['multiline'], ['text']]
If you want a one-liner do:
>>> words = [line.split() for line in text.splitlines()]
>>> words
[['some'], ['multiline'], ['text']]
Using a file object it should be:
with open(z[1]) as f:
    File = [line.split() for line in f]
By the way, you are using an anti-idiom when looping. If you want to loop over an iterable simply do:
for element in iterable:
    #...
If you need also the index of the element use enumerate:
for index, element in enumerate(iterable):
    #...
In your case:
for i, line in enumerate(File):
    File[i] = line.split()
for word in File[1]:
    print word
You want something like this:
for line in File:
    fields = line.split()
    #fields[0] is "EFF", fields[1] is "3500.", etc.
The split() method returns a list of strings; it does not modify the object that it is called on.
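Putting this together for the file in the question, a minimal sketch might look like the following. The sample text is the header line plus two data rows shortened to a few columns (abbreviated here only to keep the example small), and the names rows and data are made up for this illustration:

```python
# Abbreviated sample: the real file has more header lines and more columns.
sample = """EFF 3500. GRAVITY 0.00000 SDSC GRID [+0.0] VTURB 2.0 KM/S L/H 1.25
9.09 0.000E+00 0 0
9.35 0.000E+00 0 0
"""

# Keep the result of split(): one list of tokens per line.
rows = [line.split() for line in sample.splitlines()]
print(rows[0][0])   # first token of the first line

# Data rows (everything after the header here) converted to floats.
data = [[float(token) for token in row] for row in rows[1:]]
print(data[0])
```

With a real file, replacing sample.splitlines() by an open file object gives the same structure.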
I have a DataFrame (produced via pandas) and an input text file of activity values. I want to find the regression coefficient of each term using the following formula:
Y = C1a*X1a + C1b*X1b + ... + C2a*X2a + C2b*X2b + ... + C0,
where Y is the activity, Cna is the regression coefficient for residue choice a at position n, Xna is the dummy variable (1 or 0) coding the presence or absence of residue choice a at position n, and C0 is the mean value of the activity.
My dataframe looks like
2u 2s 4r 4n 4m 7h 7v
0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0
Here 1 and 0 represents the presence and absence of residues respectively.
Using MLR (multiple linear regression), how can I find the regression coefficient of each residue, i.e. 2u, 2s, 4r, 4n, 4m, 7h, 7v?
C1a represents the regression coefficient of residue a at the 1st position (here 1a is 2u, 1b is 2s, 2a is 4r, ...), and X1a represents the dummy value, i.e. 0 or 1, corresponding to 1a.
Activity file contain following data
6.5
5.9
5.7
6.4
5.2
So first equation will look like
6.5=C1a*0+C1b*1+C2a*1+C2b*0+C2c*0+C3a*0+C3b*1+C0
…
Can I get the regression coefficients using numpy? Please help me; all suggestions will be appreciated.
Let A be your dataframe (you can get it as a plain numpy array; read it in using np.loadtxt if it's CSV), and let y be your activity file (again, a numpy array). Then use np.linalg.lstsq:
import numpy as np
DF = """0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0"""
res = """6.5, 5.9, 5.7, 6.4, 5.2"""
A = np.fromstring(DF, sep=" ").reshape((5, 7))
y = np.fromstring(res, sep=",")  # the activity values are comma-separated
(x, residuals, rank, svals) = np.linalg.lstsq(A, y)
print(x)
# 2.115625, 2.490625, 1.24375 , 1.19375 , 2.16875 , 2.115625, 2.490625
print(np.sum((y - A.dot(x))**2))  # Sum of squared residuals:
# approximately 0.3025
print(A.dot(x))  # Print prediction
# 6.225, 6.175, 5.425, 6.4 , 5.475
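The formula in the question also has a constant term C0, which the fit above does not include. One common way to add it, sketched here as an assumption rather than as the only approach, is to append a column of ones to A so that the last coefficient plays the role of C0. The names A1, coef and intercept are made up for this example:

```python
import numpy as np

# Dummy-coded design matrix and activities from the question.
A = np.array([[0, 1, 1, 0, 0, 0, 1],
              [0, 1, 0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1, 0],
              [1, 0, 1, 0, 0, 1, 0]], dtype=float)
y = np.array([6.5, 5.9, 5.7, 6.4, 5.2])

# Append a column of ones; its coefficient is the constant term.
A1 = np.column_stack([A, np.ones(len(A))])

coef, _, rank, _ = np.linalg.lstsq(A1, y, rcond=None)
intercept = coef[-1]
predictions = A1 @ coef

print(rank)          # rank of the design matrix
print(coef[:-1])     # one coefficient per residue column
print(intercept)     # constant term
print(predictions)
```

Note that the dummy columns for each position sum to one, so the design matrix is rank-deficient and the individual coefficients are not unique; lstsq returns the minimum-norm solution. The predictions are unaffected, but to make C0 and the per-residue coefficients interpretable you would need an extra constraint, for example dropping one reference residue per position.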