Python: reading in files for a specific column range

I'm writing a script that reads in one file containing a list of files and performs Gaussian fits on each of those files. Each of these files is made up of two columns ("wv" and "flux" in the script below). My small issue is how to limit the fit range based on the "wv" values. I tried using a "for" loop for this, but I get errors related to the fit (which I don't get if I don't limit the "wv" range).
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

fits = []
wvi_b = []
wvi_r = []

p = open("file_input.txt", "r")
for line in p:
    fits.append(str(line.split()[0]))
    wvi_b.append(float(line.split()[1]))
    wvi_r.append(float(line.split()[2]))
p.close()

for j in range(len(fits)):
    wv = []
    flux = []
    f = open("%s" % (fits[j]), "r")
    for line in f:
        wv.append(float(line.split()[0]))
        flux.append(float(line.split()[1]))
    f.close()

    def gauss(x, a, b, c, a1, b1, c1, d):
        func = a*np.exp(-((x-b)**2)/(2.0*(c)**2)) + a1*np.exp(-((x-b1)**2)/(2.0*(c1)**2)) + d
        return func

    for wv in range(6450, 6575):
        guess = (0.8, wvi_b[j], 3.0, 1.0, wvi_r[j], 3.0, 1.0)
        popt, pconv = curve_fit(gauss, wv, flux, guess)
        print popt[1], popt[4]
        ymod = gauss(wv, *popt)
        plt.plot(wv, ymod)
        plt.plot(wv, flux, marker='.')
        plt.show()

When you call for wv in range(6450, 6575), wv is just an integer in that range, not a member of the list. Take a look at how you're using that variable. If you want to access data from the list wv, you would have to index into it, which is confusing when the loop variable shares the list's name - it's best to rename the loop variable to something else, or to avoid the loop entirely (see the sketch below).
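A minimal sketch of one way to restrict the fit to the 6450-6575 window with a boolean mask instead of a for loop (this is an assumption about what is wanted, not code from the question; wv, flux, gauss, wvi_b and wvi_r are the names already defined above):
wv = np.asarray(wv)
flux = np.asarray(flux)
mask = (wv >= 6450) & (wv <= 6575)  # keep only the points inside the wavelength window
wv_cut, flux_cut = wv[mask], flux[mask]
guess = (0.8, wvi_b[j], 3.0, 1.0, wvi_r[j], 3.0, 1.0)
popt, pconv = curve_fit(gauss, wv_cut, flux_cut, guess)
print popt[1], popt[4]
plt.plot(wv_cut, gauss(wv_cut, *popt))
plt.plot(wv_cut, flux_cut, marker='.')
plt.show()
This way curve_fit only ever sees the data inside the window, and no loop variable shadows the wv list.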

Related

Load many files into one array - Python

So, I have to load many .mat files with some features in order to plot them.
Each array to be plotted is loaded into a dictionary:
import numpy as np
import scipy.io as io
dict1 = io.loadmat('file1.MAT')
dict2 = io.loadmat('file2.MAT') # type = dict
dict3 = io.loadmat('file3.MAT')
...
Then I take the element I need from each dictionary, to plot afterwards:
array1 = dict1['data']
array2 = dict2['data']
array3 = dict3['data']
...
After this, I can plot the data. It works, but it looks dumb to me (if I have 100 vectors, it will take some time...). Is there a better way to do this task?
Given that you are talking about dealing with many matrices, you should manage them as a collection. First, let's define your set of files. It could be a tuple, or a list:
Matrix_files = [ 'fileA.MAT', 'file1.MAT', 'no pattern to these names.MAT' ]
If they happen to have a pattern, you might try generating the names:
Matrix_files = [ 'file{}.MAT'.format(num) for num in range(1,4) ]
If they share a common location, you might consider using one of the various directory scanning approaches (os.listdir or glob, to name two).
Once you have a list of filenames, you can read the dictionaries in:
def read_matrix(filespec):
    from scipy.io import loadmat
    md = loadmat(filespec)
    # process md here if needed
    return md
With that, you can either get all the data, or get some of the data:
All_data = [read_matrix(f) for f in Matrix_files]
Some_data = [read_matrix(f)['data'] for f in Matrix_files]
If you only care about the data, you could skip the function definition:
from scipy.io import loadmat
Just_data = [loadmat(f)['data'] for f in Matrix_files]
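If the goal really is one array (as the title says), here is a short sketch under the assumption that every 'data' entry has the same shape; the glob pattern is hypothetical:
from glob import glob
import numpy as np
from scipy.io import loadmat

Matrix_files = sorted(glob('file*.MAT'))             # hypothetical name pattern
Some_data = [loadmat(f)['data'] for f in Matrix_files]
All_in_one = np.stack(Some_data)                     # one array, first axis indexes the files
With that, All_in_one[0] is the 'data' array from the first file, All_in_one[1] from the second, and so on.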

How to write .csv file in Python?

I am running the following: output.to_csv("hi.csv") where output is a pandas dataframe.
My variables all have values, but when I run this in IPython, no file is created. What should I do?
Better to give the complete path for your output CSV file. It may be that you are checking in the wrong folder.
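A quick sanity check along those lines (a sketch, assuming output is the DataFrame from the question; the home-directory path is only an example):
import os
print os.getcwd()                                               # the folder "hi.csv" actually lands in
output.to_csv(os.path.join(os.path.expanduser('~'), 'hi.csv'))  # or give an explicit full path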
You also have to make sure that the 'to_csv' call on your 'output' object actually runs and writes the file.
And there is a lib for CSV manipulation in Python, so you don't need to handle all the work yourself:
https://docs.python.org/2/library/csv.html
I'm not sure if this will be useful to you, but I write to CSV files frequently in Python. Here is an example that generates random vectors (X, Y, Z values) and writes them to a CSV file, using the csv module. (The paths are OS X paths, but you should get the idea even on a different OS.)
Working Writing Python to CSV example
import os, csv, random

# Generates random vectors and writes them to a CSV file
WriteFile = True  # Write CSV file if true - useful for testing
CSVFileName = "DataOutput.csv"
CSVfile = open(os.path.join('/Users/Si/Desktop/', CSVFileName), 'w')

def genlist():
    # Generates a list of random vectors
    global v, ListLength
    ListLength = 25  # Amount of vectors to be produced
    Max = 100  # Maximum range value
    x = []  # Empty x vector list
    y = []  # Empty y vector list
    z = []  # Empty z vector list
    v = []  # Empty xyz vector list
    for i in xrange(ListLength):
        rnd = random.randrange(0, Max)  # Generate random number
        x.append(rnd)  # Add it to x list
    for i in xrange(ListLength):
        rnd = random.randrange(0, Max)
        y.append(rnd)  # Add it to y list
    for i in xrange(ListLength):
        rnd = random.randrange(0, Max)  # Generate random number
        z.append(rnd)  # Add it to z list
    for i in xrange(ListLength):
        merge = x[i], y[i], z[i]  # Merge x[i], y[i], z[i] into one tuple
        v.append(merge)  # Add merged tuple to v list

def writeCSV():
    # Write vectors to CSV file
    wr = csv.writer(CSVfile, quoting=csv.QUOTE_MINIMAL, dialect='excel')
    wr.writerow(('Point Number', 'X Vector', 'Y Vector', 'Z Vector'))
    for i in xrange(ListLength):
        wr.writerow((i + 1, v[i][0], v[i][1], v[i][2]))
    print "Data written to", CSVFileName

genlist()
if WriteFile is True:
    writeCSV()
CSVfile.close()  # close (and flush) the file so the data actually lands on disk
Hopefully there is something useful in here for you!

How do I get a text output from a string created from an array to remain unshortened?

Python/Numpy problem. Final year Physics undergrad... I have a small piece of code that creates an array (essentially an n×n matrix) from a formula. I reshape the array to a single column of values, create a string from that, format it to remove extraneous brackets etc., then output the result to a text file saved in the user's Documents directory, which is then used by another piece of software. The trouble is that above a certain value of "n" the output gives me only the first and last three values, with "...," in between. I think that Python is automatically abridging the final result to save time and resources, but I need all those values in the final text file, regardless of how long it takes to process, and I can't for the life of me find how to stop it doing this. Relevant code copied beneath...
import numpy as np; import os.path ; import os
'''
Create a single column matrix in text format from Gaussian Eqn.
'''
save_path = os.path.join(os.path.expandvars("%userprofile%"), "Documents")
name_of_file = 'outputfile'  # <---- change this as required.
completeName = os.path.join(save_path, name_of_file + ".txt")
matsize = 32

def gaussf(x, y):  # defining gaussian but can be any f(x,y)
    pisig = 1/(np.sqrt(2*np.pi) * matsize)  # first term
    sumxy = (-(x**2 + y**2))  # sum of squares term
    expden = (2 * (matsize/1.0)**2)  # 2 sigma squared
    expn = pisig * np.exp(sumxy/expden)  # and put it all together
    return expn

matrix = [[gaussf(x, y)]
          for x in range(-matsize/2, matsize/2)
          for y in range(-matsize/2, matsize/2)]

zmatrix = np.reshape(matrix, (matsize*matsize, 1))  # single column
string2 = (str(zmatrix).replace('[', '').replace(']', '').replace(' ', ''))
zbfile = open(completeName, "w")
zbfile.write(string2)
zbfile.close()
print completeName
num_lines = sum(1 for line in open(completeName))
print num_lines
Any help would be greatly appreciated!
Generally you should iterate over the array/list if you just want to write the contents.
zmatrix = np.reshape(matrix, (matsize*matsize, 1))

with open(completeName, "w") as zbfile:  # with closes your file automatically
    for row in zmatrix:
        zbfile.writelines(map(str, row))
        zbfile.write("\n")
Output:
0.00970926751178
0.00985735189176
0.00999792646484
0.0101306077521
0.0102550302672
0.0103708481917
0.010477736974
0.010575394844
0.0106635442315
.........................
But using numpy we simply need to use tofile:
zmatrix = np.reshape(matrix, (matsize*matsize, 1))
# pass sep or you will get binary output
zmatrix.tofile(completeName,sep="\n")
Output is in the same format as above.
Calling str on the matrix gives you output formatted the same way it is printed, so what you are writing to the file is that formatted, truncated representation.
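As a side note not made in the original answer: if you really did want to keep building the string yourself, numpy lets you raise the summarization threshold so str() no longer inserts the "..."; a minimal sketch:
import sys
np.set_printoptions(threshold=sys.maxsize)  # stop numpy from abridging large arrays
string2 = str(zmatrix).replace('[', '').replace(']', '').replace(' ', '')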
Considering you are using Python 2, using xrange would be more efficient than using range, which creates a list. Also, having multiple imports separated by semicolons is not recommended; you can simply write:
import numpy as np, os.path, os
Also, variable and function names should use underscores: z_matrix, zb_file, complete_name, etc.
You shouldn't need to fiddle with the string representations of numpy arrays. One way is to use tofile:
zmatrix.tofile('output.txt', sep='\n')
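Another option, mentioned here only as an aside and not part of the answers above: np.savetxt also writes every element and lets you control the number format, e.g.:
np.savetxt('output.txt', zmatrix, fmt='%.12g')  # one value per line, full precision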

Python: Using multiprocessing module as possible solution to increase the speed of my function

I wrote a function in Python 2.7 (on Windows OS, 64-bit) to calculate the mean value of the intersection area between a reference polygon (Ref) and one or more segmented (Seg) polygons in ESRI shapefile format. The code is quite slow because I have more than 2000 reference polygons, and for each Ref polygon the function runs over all Seg polygons (more than 7000). I am sorry, but the function is a prototype.
I would like to know whether multiprocessing can help me increase the speed of my loop, or whether there are better-performing solutions. If multiprocessing is a possible solution, I would like to know the best way to apply it to my function below:
import os
import numpy as np
import osgeo.ogr
import osgeo.gdal
from shapely.geometry import Polygon

def AreaInter(reference, segmented, outFile):
    # open shapefiles
    ref = osgeo.ogr.Open(reference)
    if ref is None:
        raise SystemExit('Unable to open %s' % reference)
    seg = osgeo.ogr.Open(segmented)
    if seg is None:
        raise SystemExit('Unable to open %s' % segmented)
    ref_layer = ref.GetLayer()
    seg_layer = seg.GetLayer()
    # create outfile
    if not os.path.split(outFile)[0]:
        file_path, file_name_ext = os.path.split(os.path.abspath(reference))
        outFile_filename = os.path.splitext(os.path.basename(outFile))[0]
        file_out = open(os.path.abspath("{0}\\{1}.txt".format(file_path, outFile_filename)), "w")
    else:
        file_path_name, file_ext = os.path.splitext(outFile)
        file_out = open(os.path.abspath("{0}.txt".format(file_path_name)), "w")
    # for each reference object i
    for index in xrange(ref_layer.GetFeatureCount()):
        ref_feature = ref_layer.GetFeature(index)
        # get FID (= Feature ID)
        FID = str(ref_feature.GetFID())
        ref_geometry = ref_feature.GetGeometryRef()
        pts = ref_geometry.GetGeometryRef(0)
        points = []
        for p in xrange(pts.GetPointCount()):
            points.append((pts.GetX(p), pts.GetY(p)))
        # convert to a shapely polygon
        ref_polygon = Polygon(points)
        # get the area
        ref_Area = ref_polygon.area
        # create empty lists
        Area_seg, Area_intersect = ([] for _ in range(2))
        # for each segmented object j
        for segment in xrange(seg_layer.GetFeatureCount()):
            seg_feature = seg_layer.GetFeature(segment)
            seg_geometry = seg_feature.GetGeometryRef()
            pts = seg_geometry.GetGeometryRef(0)
            points = []
            for p in xrange(pts.GetPointCount()):
                points.append((pts.GetX(p), pts.GetY(p)))
            seg_polygon = Polygon(points)
            Area_seg.append(seg_polygon.area)
            # intersection (overlap) of reference object with the segmented object
            intersect_polygon = ref_polygon.intersection(seg_polygon)
            # area of intersection (= 0 means no intersection)
            Area_intersect.append(intersect_polygon.area)
        # average over all segmented objects (1 or more segmented polygons can intersect the reference polygon)
        seg_Area_average = np.average(Area_seg)
        intersect_Area_average = np.average(Area_intersect)
        file_out.write(" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
    file_out.close()
You can use the multiprocessing package, and especially the Pool class. First create a function that does all the stuff you want to do within the for loop, and that takes as an argument only the index:
def process_reference_object(index):
    ref_feature = ref_layer.GetFeature(index)
    # all your code goes here
    return (" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
Note that this doesn't write to a file itself - that would be messy, because you'd have multiple processes writing to the same file at the same time. Instead, it returns the string that needs to be written. Also note that there are objects in this function, like ref_layer or ref_geometry, that will need to reach it somehow - that's up to you (you could make process_reference_object a method of a class initialized with them, or it could be as ugly as just defining them globally).
Then, you create a pool of process resources, and run all of your indices using Pool.imap_unordered (which will itself allocate each index to a different process as necessary):
from multiprocessing import Pool

p = Pool()  # run multiple processes
for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
    file_out.write(l)
This will parallelize the independent processing of your reference objects across multiple processes, and write them to the file (in an arbitrary order, note).
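One extra detail worth noting, since the question mentions Windows (this is general multiprocessing behaviour, not something from the answer above): on Windows the Pool setup must live under an if __name__ == '__main__' guard, because child processes re-import the module. A rough sketch, reusing outFile and ref_layer from the question:
from multiprocessing import Pool

if __name__ == '__main__':  # required on Windows for multiprocessing
    p = Pool()
    with open(outFile, 'w') as file_out:
        for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
            file_out.write(l)
    p.close()
    p.join()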
Threading can help to a degree, but first you should make sure you can't simplify the algorithm. If you're checking each of 2000 reference polygons against 7000 segmented polygons (perhaps I misunderstood), then you should start there. Stuff that runs at O(n²) is going to be slow, so maybe you can prune away things that will definitely not intersect, or find some other way to speed things up (see the sketch below). Otherwise, running multiple processes or threads will only improve things linearly while your data grows geometrically.
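A minimal sketch of such pruning, assuming ref_polygon and seg_polygon are the shapely polygons already built in the question's loops: a cheap bounding-box comparison rejects most non-overlapping pairs before the expensive intersection() call.
ref_minx, ref_miny, ref_maxx, ref_maxy = ref_polygon.bounds
minx, miny, maxx, maxy = seg_polygon.bounds
if maxx < ref_minx or minx > ref_maxx or maxy < ref_miny or miny > ref_maxy:
    pass  # bounding boxes cannot overlap, so skip this pair entirely
else:
    intersect_polygon = ref_polygon.intersection(seg_polygon)
In the question's inner loop this test would replace the unconditional intersection() call, so only plausibly overlapping pairs pay the full geometric cost.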

Python: suggestions to improve a chunk-by-chunk code to read several millions of points

I wrote some code to read *.las files in Python. *.las files are special ASCII files where each line holds the x, y, z values of a point.
My function reads N points at a time and checks whether they are inside a polygon with points_inside_poly.
I have the following questions:
1. When I arrive at the end of the file I get this message: LASException: LASError in "LASReader_GetPointAt": point subscript out of range, because the number of remaining points is smaller than the chunk size. I cannot figure out how to resolve this problem.
2. In a = [file_out.write(c[m]) for m in xrange(len(c))] I use a = in order to avoid printing the result to the screen. Is that correct?
3. In c = [chunk[l] for l in index] I create a new list c because I am not sure that reassigning chunk is the smart solution (e.g. chunk = [chunk[l] for l in index]).
4. In an if...else statement I use pass in the else branch. Is this the right choice?
Thanks a lot for the help. It's important to improve by listening to suggestions from experts!
import shapefile
import numpy as np
from numpy import nonzero
from liblas import file as lasfile
from shapely.geometry import Polygon
from matplotlib.nxutils import points_inside_poly

# open shapefile (polygon)
sf = shapefile.Reader(poly)
shapes = sf.shapes()
# extract vertices
verts = np.array(shapes[0].points, float)
# open las file
f = lasfile.File(inFile, None, 'r')  # open LAS
# read "header"
h = f.header
# create a file where the points are stored
file_out = lasfile.File(outFile, mode='w', header=h)
chunkSize = 100000
for i in xrange(0, len(f), chunkSize):
    chunk = f[i:i+chunkSize]
    x, y = [], []
    # extract the x and y value of each point
    for p in xrange(len(chunk)):
        x.append(chunk[p].x)
        y.append(chunk[p].y)
    # zip all points
    points = np.array(zip(x, y))
    # create an index of the points inside the polygon
    index = nonzero(points_inside_poly(points, verts))[0]
    # if index is not empty do this, otherwise "pass"
    if len(index) != 0:
        c = [chunk[l] for l in index]  # is it correct to create a new list, or can I replace chunk?
        # save points
        a = [file_out.write(c[m]) for m in xrange(len(c))]  # 'a =' used to avoid printing to the screen - is that correct?
    else:
        pass  # is it correct to use pass?
f.close()
file_out.close()
Code proposed by @Roland Smith and changed by Gianni:
f = lasfile.File(inFile, None, 'r')  # open LAS
h = f.header
# change the software id to libLAS
h.software_id = "Gianni"
file_out = lasfile.File(outFile, mode='w', header=h)
f.close()
sf = shapefile.Reader(poly)  # open shapefile
shapes = sf.shapes()
for i in xrange(len(shapes)):
    verts = np.array(shapes[i].points, float)
    inside_points = [p for p in lasfile.File(inFile, None, 'r') if pnpoly(p.x, p.y, verts)]
    for p in inside_points:
        file_out.write(p)
file_out.close()
I used these solutions:
1) I read f = lasfile.File(inFile, None, 'r') and then read the header, because I need it for the *.las output file.
2) I close the file.
3) I used inside_points = [p for p in lasfile.File(inFile, None, 'r') if pnpoly(p.x, p.y, verts)] instead of
with lasfile.File(inFile, None, 'r') as f:
    inside_points = [p for p in f if pnpoly(p.x, p.y, verts)]
because I always get this error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: __exit__
Regarding (1):
First, why are you using chunks? Just use the lasfile as an iterator (as shown in the tutorial) and process the points one at a time. The following should write all the points inside the polygon to the output file, by using the pnpoly function in a generator expression instead of points_inside_poly.
from liblas import file as lasfile
import numpy as np
from matplotlib.nxutils import pnpoly

with lasfile.File(inFile, None, 'r') as f:
    inside_points = (p for p in f if pnpoly(p.x, p.y, verts))
    with lasfile.File(outFile, mode='w', header=h) as file_out:
        for p in inside_points:
            file_out.write(p)
The five lines directly above should replace the whole big for-loop. Let's go over them one-by-one:
with lasfile.File(inFile...: Using this construction means that the file will be closed automatically when the with block finishes.
Now comes the good part, the generator expression that does all the work (the part between ()). It iterates over the input file (for p in f). Every point that is inside the polygon (if pnpoly(p.x, p.y, verts)) is added to the generator.
We use another with block for the output file
and all the points (for p in inside_points, this is where the generator is used)
are written to the output file (file_out.write(p))
Because this method only keeps the points that are inside the polygon, you don't waste memory on points that you don't need!
You should only use chunks if the method shown above doesn't work.
When using chunks you should handle the exception properly. E.g:
from liblas import LASException

chunkSize = 100000
for i in xrange(0, len(f), chunkSize):
    try:
        chunk = f[i:i+chunkSize]
    except LASException:
        rem = len(f) - i
        chunk = f[i:i+rem]
Regarding (2): Sorry, but I fail to understand what you are trying to accomplish here. What do you mean by "video print"?
Regarding (3): since you are not using the original chunk anymore, you can re-use the name. Realize that in python a "variable" is just a nametag.
Regarding (4): you aren't using the else, so leave it out completely.
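A small sketch pulling (2), (3) and (4) together, assuming the chunk-based loop is kept (this is an illustration, not code from the answer above):
if len(index) != 0:
    chunk = [chunk[l] for l in index]  # re-using the name is fine (3)
    for point in chunk:                # a plain loop instead of 'a = [...]' (2)
        file_out.write(point)
# no 'else: pass' branch is needed at all (4)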
