How do I load both strings and floats into a numpy array?

I need to somehow make numpy load in both text and numbers.
I am getting this error:
Traceback (most recent call last):
  File "ip00ktest.py", line 13, in <module>
    File = np.loadtxt(str(z[1])) #load spectrum file
  File "/usr/lib64/python2.6/site-packages/numpy/lib/npyio.py", line 805, in loadtxt
    items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: invalid literal for float(): EFF
because the file I'm loading has text in it. I need each word to be stored at an array index, as well as the data below it. How do I do that?
Edit: Sorry for not giving an example. Here is what my file looks like.
EFF 3500. GRAVITY 0.00000 SDSC GRID [+0.0] VTURB 2.0 KM/S L/H 1.25
wl(nm) Inu(ergs/cm**2/s/hz/ster) for 17 mu in 1221 frequency intervals
1.000 .900 .800 .700 .600 .500 .400 .300 .250 .200 .150 .125 .100 .075 .050 .025 .010
9.09 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.35 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.61 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.77 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.96 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There are thousands of numbers below the ones shown here. Also, there are different datasets within the file: the header you see on top repeats, followed by a new set of numbers.
Code that Fails:
import sys
import numpy as np
from math import *
print 'Number of arguments:', len(sys.argv), 'arguments.'
print 'Argument List:', str(sys.argv)
z = np.array(sys.argv) #store all of the file names into array
i = len(sys.argv) #the length of the filenames array
File = np.loadtxt(str(z[1])) #load spectrum file

If the line that messes it up always begins with EFF, then you can ignore that line quite easily:
np.loadtxt(str(z[1]), comments='EFF')
This will treat any line beginning with 'EFF' as a comment, so it is ignored.

To read the numbers, use the skiprows parameter of numpy.loadtxt to skip the header. Write custom code to read the header, because it seems to have an irregular format.
NumPy is most useful with homogeneous numerical data -- don't try to put the strings in there.
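A minimal sketch of that approach, assuming the three header lines shown in the question and a hypothetical filename standing in for z[1]:

import numpy as np

fname = "spectrum.dat"  # hypothetical filename standing in for z[1]
with open(fname) as f:
    header = [next(f) for _ in range(3)]  # EFF line, description line, mu values line
mu = np.array(header[2].split(), dtype=float)  # the 17 mu values
data = np.loadtxt(fname, skiprows=3)  # the purely numeric rows

Since the header repeats for each dataset, you would first need to split the file into blocks (for example at every line starting with 'EFF') and apply this per block.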

Efficient way to find coordinates of connected blobs in binary image

I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).
The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However, I want a list of the coordinates of each blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image, but it is very slow, far slower than the initial labelling.
Minimal Reproducible example:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,1,0,1,1,1,0,1,1,1,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,0,1,0,0,0,0,0,0,0,0,0],
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # of each distinct blob.
    blobs = []
    label = 1
    while True:
        indices_of_label = np.where(labelled_array == label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob = list(zip(*indices_of_label))
            label += 1
            blobs.append(blob)
    return blobs
if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
[0 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 1 1 0 1 1 0 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]]
2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0 1 0 0 2 2 0 3 3 0 0 4]
[ 0 1 0 2 2 2 0 3 3 3 0 4]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 5 5 5 5 0 0 0 0 3 0 0]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 0 6 0 0 0 0 0 0 0 0 0]
[ 0 6 0 0 7 7 0 8 8 0 0 9]
[ 0 0 0 0 0 0 0 8 8 8 0 0]
[ 0 10 10 10 10 0 0 0 0 8 0 0]]
Beginning extract_blobs_from_labelled_array timing
Time taken:
9.346099977847189e-05
9e-05 s is small, but so is this example image. In reality I am working with very high resolution images, for which the function takes approximately 10 minutes.
Is there a faster way to do this?
Side note: I'm only using list(zip()) to try to get the numpy coordinates into something I'm used to (I don't use numpy much, just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?
The part of the code that is slow is here:
while True:
    indices_of_label = np.where(labelled_array == label)
    if not indices_of_label[0].size > 0:
        break
    else:
        blob = list(zip(*indices_of_label))
        label += 1
        blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.
Instead, you should use:
for label in range(1, np.max(labels) + 1):
and then you can drop the if ...: break (the labels produced by measure.label start at 1, hence the shifted range).
A second issue is indeed that you are using list(zip(*...)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will give you a 2D array of shape (n_coords, n_dim), i.e. (n_coords, 2).
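To make that concrete, a tiny sketch of the two forms (the 2x2 array is made up for illustration):

import numpy as np

labelled_array = np.array([[0, 1],
                           [1, 0]])
indices = np.where(labelled_array == 1)  # tuple of (row indices, column indices)
coords = np.transpose(indices)           # 2D array of shape (n_coords, 2)
# coords is [[0, 1], [1, 0]]; list(zip(*indices)) yields the same pairs as tuples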
But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,1,0,1,1,1,0,1,1,1,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,0,1,0,0,0,0,0,0,0,0,0],
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    """Return a list containing coordinates of pixels in each blob."""
    props = measure.regionprops(labelled_array)
    blobs = [p.coords for p in props]
    return blobs
if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")

Why does python find the line 'ens' but not 'tun' when using `for line in open()`

I have the following function I'm using to check whether an interface exists or not:
def status(interface):
    print("Checking VPN Status...")
    for line in open('/proc/net/dev'):
        if interface in line:
            proof = line.split(" ")[1].split()
            print(proof)
            return proof
Here is a copy of my /proc/net/dev file:
$ cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
ens33: 1480355767 23538625 13 1 0 0 0 0 752613935 44655037 0 0 0 0 0 0
lo: 548140418 2567168 0 0 0 0 0 0 548140418 2567168 0 0 0 0 0 0
tun0: 17067 85 0 0 0 0 0 0 10819 114 0 0 0 0 0 0
If I call status('ens'), I get the correct output:
['ens33:']
But if I call status('tun'), which I fully expect to work, I get:
[]
Any idea what's going on here?
The tun0 line begins with two spaces, so line.split(" ") returns ['', '', 'tun0:', ...], meaning element 1 is an empty string.
If you use plain split() instead, it won't produce empty strings for the leading whitespace, so the first element is 'tun0:'. (In that case you'd need to access element zero instead of element one.)
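A quick way to see the difference (the line is abbreviated from your /proc/net/dev):

line = "  tun0: 17067 85"
print(line.split(" ")[:3])  # ['', '', 'tun0:'] -- one empty string per leading space
print(line.split()[:2])     # ['tun0:', '17067'] -- runs of whitespace collapsed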

Sum all values in CSV

I have a CSV file of 0s and 1s and need to determine the sum total of the entire file. The file looks like this when opened in Excel:
0 1 1 1 0 0 0 1 0 1
1 0 1 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1
0 1 1 1 1 1 1 0 1 1
0 0 1 0 1 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 1 1 0 1 1
0 0 1 1 0 0 1 1 0 1
1 0 1 0 1 0 1 1 1 0
0 1 0 0 1 0 0 0 1 1
Using this script I can sum the values of each row and they print out in a single column:
import csv
import numpy as np
path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
with infile as file_in:
    fin = csv.reader(file_in, delimiter=',')
    for line in fin:
        print line.count('1')
I need to be able to sum up the resulting column, but my experience with this is mild. Looking for suggestions. Thanks.
If you have more than just 1s and 0s, map to int and sum all rows:
with open(r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter=',')
    count = sum(sum(map(int, row)) for row in r)
Or just count the 1's:
with open(r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter=',')
    count = sum(row.count("1") for row in r)
Just use with open(r'E:\myPy\one_zero.csv'); you don't need to (and shouldn't) open the file first and then pass the handle to with.
path = r'E:\myPy\one_zero.csv'
answer = 0
with open(path, 'r') as file_in:
    fin = csv.reader(file_in, delimiter=',')
    for line in fin:
        a = line.count('1')
        answer += a
print answer
Example:
answer = 0
lines = [[1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
for line in lines:
    a = line.count(1)
    answer += a
print answer
7
One thing to watch out for is the difference between:
line.count('1')
vs
line.count(1)
The first counts the string '1', the second the integer 1. Rows from csv.reader are lists of strings, so you need count('1') there; count(1) only matches on lists of numbers, as in the example above.
Why use the CSV module at all? You have a file full of 0s, 1s, commas and newlines. Just open the file, read() it and count the 1s:
>>> with open(filename, 'r') as fin: print fin.read().count('1')
That should get you what you want, no?
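Alternatively, since your script already imports numpy, here is a one-line sketch using the path from your code:

import numpy as np

total = np.loadtxt(r'E:\myPy\one_zero.csv', delimiter=',').sum()  # load the grid, sum every value
print(total)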

Why is .readlines() making a list of individual characters?

I have a text file of this format:
EFF 3500. GRAVITY 0.00000 SDSC GRID [+0.0] VTURB 2.0 KM/S L/H 1.25
wl(nm) Inu(ergs/cm**2/s/hz/ster) for 17 mu in 1221 frequency intervals
1.000 .900 .800 .700 .600 .500 .400 .300 .250 .200 .150 .125 .100 .075 .050 .025 .010
9.09 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.35 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.61 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.77 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.96 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10.20 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10.38 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...more numbers
I'm trying to make it so File[0][0] will print the word "EFF" and so on.
import sys
import numpy as np
from math import *
import matplotlib.pyplot as plt

print 'Number of arguments:', len(sys.argv), 'arguments.'
print 'Argument List:', str(sys.argv)
z = np.array(sys.argv) #store all of the file names into array
i = len(sys.argv) #the length of the filenames array
File = open(str(z[1])).readlines() #load spectrum file
for n in range(0, len(File)):
    File[n].split()
for n in range(0, len(File[1])):
    print File[1][n]
However, it keeps outputting individual characters, as if each list index were a single character, whitespace included. I have split() in a loop because if I put readlines().split() it gives an error.
Output:
E
F
F
3
5
0
0
.
G
R
A
V
I
...ect
What am I doing wrong?
>>> text = """some
... multiline
... text
... """
>>> lines = text.splitlines()
>>> for i in range(len(lines)):
...     lines[i].split()  # split *returns* the list of tokens
...                       # it does *not* modify the string in place
...
['some']
['multiline']
['text']
>>> lines #strings unchanged
['some', 'multiline', 'text']
>>> for i in range(len(lines)):
...     lines[i] = lines[i].split()  # you have to modify the list
...
>>> lines
[['some'], ['multiline'], ['text']]
If you want a one-liner, do:
>>> words = [line.split() for line in text.splitlines()]
>>> words
[['some'], ['multiline'], ['text']]
Using a file object, it would be:
with open(z[1]) as f:
    File = [line.split() for line in f]
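With that, File[0][0] is the string 'EFF', which is what you were after.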
By the way, you are using an anti-idiom when looping. If you want to loop over an iterable, simply do:
for element in iterable:
    # ...
If you also need the index of the element, use enumerate:
for index, element in enumerate(iterable):
    # ...
In your case:
for i, line in enumerate(File):
    File[i] = line.split()
for word in File[1]:
    print word
You want something like this:
for line in File:
    fields = line.split()
    # fields[0] is "EFF", fields[1] is "3500.", etc.
The split() method returns a list of strings; it does not modify the object that it is called on.

regression coefficient calculation in python

I have a DataFrame (produced via pandas) and an input text file of activity values. I want to find the regression coefficient of each term using the following formula:
Y = C1a*X1a + C1b*X1b + ... + C2a*X2a + C2b*X2b + ... + C0,
where Y is the activity, Cna is the regression coefficient for residue choice a at position n, Xna is the dummy variable (xna = 1 or 0) coding the presence or absence of residue choice a at position n, and C0 is the mean value of the activity.
My dataframe looks like this:
2u 2s 4r 4n 4m 7h 7v
0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0
Here 1 and 0 represents the presence and absence of residues respectively.
Using MLR (multiple linear regression), how can I find the regression coefficient of each residue, i.e. 2u, 2s, 4r, 4n, 4m, 7h, 7v?
C1a represents the regression coefficient of residue a at the 1st position (here 1a is 2u, 1b is 2s, 2a is 4r, ...), and X1a represents the dummy value, i.e. 0 or 1, corresponding to 1a.
The activity file contains the following data:
6.5
5.9
5.7
6.4
5.2
So the first equation will look like:
6.5 = C1a*0 + C1b*1 + C2a*1 + C2b*0 + C2c*0 + C3a*0 + C3b*1 + C0
…
Can I get the regression coefficients using numpy? All suggestions will be appreciated.
Let A be your dataframe (you can get it as a plain numpy array; read it in with np.loadtxt if it's CSV) and y your activity values (again, a numpy array), then use np.linalg.lstsq:
DF = """0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0"""
res = """6.5, 5.9, 5.7, 6.4, 5.2"""
A = np.fromstring ( DF, sep=" " ).reshape((5,7))
y = np.fromstring(res, sep=" ")
(x, res, rango, svals ) = np.linalg.lstsq(A, y )
print x
# 2.115625, 2.490625, 1.24375 , 1.19375 , 2.16875 , 2.115625, 2.490625
print np.sum(A.dot(x)**2) # Sum of squared residuals:
# 177.24750000000003
print A.dot(x) # Print predicition
# 6.225, 6.175, 5.425, 6.4 , 5.475
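The question's formula also has a constant term C0, which the fit above leaves out. A minimal sketch of one common way to include it, appending a column of ones to A (continuing from the arrays defined above):

A1 = np.hstack([A, np.ones((A.shape[0], 1))])  # extra all-ones column fits the intercept
coef, res1, rango1, svals1 = np.linalg.lstsq(A1, y)
C0 = coef[-1]  # fitted intercept; coef[:-1] are the residue coefficients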
