List index out of range - python

So I am pulling some data through SQL queries and I am running into an IndexError (list index out of range) when attempting to loop through it, normalize it, and plot it.
Here are my SQL queries:
s1 = db.dict.execute("SELECT sp.wavelength, sp.flux, spt.spectral_type, s.designation, s.shortname, s.names FROM spectra AS sp JOIN sources AS s ON sp.source_id=s.id JOIN spectral_types as spt ON spt.source_id=s.id WHERE sp.instrument_id=6 AND sp.mode_id=2 AND 10<spt.spectral_type<19").fetchall()
s2 = db.dict.execute("SELECT sp.wavelength, sp.flux, spt.spectral_type, s.designation, s.shortname, s.names FROM spectra AS sp JOIN sources AS s ON sp.source_id=s.id JOIN spectral_types as spt ON spt.source_id=s.id WHERE sp.instrument_id=9 AND sp.wavelength_order='n3' AND 10<spt.spectral_type<19").fetchall()
s3 = db.dict.execute("SELECT sp.wavelength, sp.flux, spt.spectral_type, s.designation, s.shortname, s.names FROM spectra AS sp JOIN sources AS s ON sp.source_id=s.id JOIN spectral_types as spt ON spt.source_id=s.id WHERE sp.instrument_id=16 AND 10<spt.spectral_type<19").fetchall()
Then I am combining them into S:
S = s1+s2+s3
Finally, I want to loop through them to normalize, trim and plot all the entries in the dictionary I called.
for n,i in enumerate(S):
    W,F = S[n]['wavelength'], S[n]['flux'] # here I define wavelength and flux from SQL query "S"
    band = [1.15,1.325] #The range at which I want to normalize, see next line
    S1 = at.norm_spec([W,F], band) # here I normalize W and F and define S1 as the normalized W,F
    W3 = S1[n][0] #here I define a new W and F but from the normalized S1 spectrum file
    F3 = S1[n][1]
    W2,F2 = [p[np.where(np.logical_and(W3>1.15, W3<1.325))] for p in [W3,F3]] #here I trim the noisy ends of data and narrow to a small range
    z = np.polyfit(W2, F2, 3) #from here on it's just fitting polynomials and plotting
    f = np.poly1d(z)
    yvalue = 6*(max(F2)/9)
    xvalue = 6*(max(W2)/9)
    W_new = np.linspace(W2[0], W2[-1], 5000)
    F_new = f(W_new)
    plt.ylabel('Flux F$_{\lambda}$')
    plt.xlabel('Wavelength ($\mu$m)')
    name = S[n]['designation']
    name2 = S[n]['shortname']
    name3 = S[n]['names']
    plt.annotate('{0} \n Spectral type: {1}'.format(S[n][('designation' or 'shortname' or 'names')], S[n]['spectral_type']), xy=(xvalue, yvalue), xytext=(xvalue, yvalue), color='black')
    #plt.figure()
    plt.plot(W2,F2, 'k-', W_new, F_new, 'g-')
Now, it goes through the first iteration, meaning it plots S1[0][0] and S1[0][1], but then it breaks and says S1[1][0] and S1[1][1] are out of range:
61 print len(S1)
62 #print S1[n]['wavelength']
--->63 W3 = S1[n][0] #here I define a new W and F but from the normalized S1 spectrum file
64 F3 = S1[n][1]
65 W2,F2 =[p[np.where(np.logical_and(W3>1.15, W3<1.325))] for p in [W3,F3]] #here I trim the noisy ends and narrow to potassium lines
IndexError: list index out of range
I really don't see where my error is in this; any help will be appreciated!
Sara

If I understood your code correctly: you are iterating over the n entries of S, but S1 is a whole new object whose length isn't n but something else (the length of that normalized vector, or similar). Maybe just W3 = S1[0] would be correct? Though I don't know what at.norm_spec does or what type of object it returns, so I'm only guessing.
Suggestion: use (more) meaningful variable names in your code. It's really, really hard to read something like that, and it's very, very easy to make a mistake writing that kind of code. And it's not Pythonic.
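For illustration, here is a minimal sketch of the loop body with that change, assuming at.norm_spec returns a single [wavelength, flux] pair for the spectrum you pass in (I can't verify that without the library):
for n, i in enumerate(S):
    W, F = S[n]['wavelength'], S[n]['flux']
    band = [1.15, 1.325]
    S1 = at.norm_spec([W, F], band)   # assumption: returns [normalized_wavelength, normalized_flux]
    W3, F3 = S1[0], S1[1]             # index the normalized pair directly instead of S1[n]
    # ... trimming, fitting and plotting as before ...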

Related

Saving continuously generated simulation data with Python3

So my question is how I should save a large amount of simulation data to a file using Python (or append new data rows to an existing file).
Let's say I have NN=1000 particles, and I want to save the position and velocity data of each particle (x y z, vx vy vz). The data is in the format [x1,y1,z1,vx1,vy1,vz1, x2,y2,z2,vx2,vy2,vz2, ...] and so on.
The simulation itself works well, but I believe the method I use for saving and keeping this information is not really optimal.
Pseudo code similar to my code
T_max = 1000 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
ZZ = np.zeros( (iterations, 2+NN*6 ) ) # Here I generate whole data matrix at the beginning.
# ^ might not be the best idea as the system needs to keep everything in memory for the whole time
# So I guess saving could be done in chunks?
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
i = 0
while i < iterations - 1:   # rows 1..iterations-1 get filled; row 0 holds the initial state
    T += dt
    ZZ[i+1][0], ZZ[i+1][1] = T, dt
    #ZZ[i+1][2:] = rk4(EOM_function, posvel=ZZ[i][2:])
    # ^ Using this I would calculate new positions based on previous ones.
    ZZ[i+1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
    i += 1
# Now the simulation data is basically done, so one would need to save
# This one feels slow, as it takes 181s to save and is size of 1046246KB
np.savetxt('test1.txt', ZZ)
#other method with a bit less accuracy as I don't need to have all decimals saved
np.savetxt('test2.txt', ZZ, fmt='%1.6f') # Takes 125s and size is 426698KB
# Both of the above are kinda slow so I also tried to save to npy format
np.save('test.npy', ZZ) # It took 8.9s and size 164118KB
So this np.save() method seems to be fast, but I read somewhere that I cannot append data to it, so it would not work if I keep saving the data in parts while calculating new positions.
So back to my question. How should/could I save the data efficiently (fast and memory friendly)? I keep running into memory issues when NN and T_max get larger, because with this method the whole ZZ stays in memory the entire time.
So I guess I should calculate ZZ in parts, e.g. in iterations/10 chunks, and then append each part to an existing file, but the tests I have made felt slow. Any suggestions?
EDIT: feel free to ask more specifying questions as I feel like I forgot to explain something.
That highly depends on what you intend to use the output for. If it's stored for further calculations, .npy or some other binary format is always the way to go: it is faster, takes less space, and doesn't lose precision between loads and saves, unlike serializing into a human-readable format. If you need it to be readable, you might as well just output row by row to a csv file or something.
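For example, a minimal sketch of that row-by-row CSV approach (the file name and format string here are just placeholders):
import numpy as np

row = np.random.uniform(-100, 100, 2 + 6)   # one time step: T, dt, then the per-particle data
with open("test_rows.csv", "a") as f:       # "a" appends, so each step can be written as soon as it is produced
    np.savetxt(f, row.reshape(1, -1), fmt="%1.6f", delimiter=",")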
If you want to do it with binary, h5py allows you to extend a dataset after saving and append more stuff to it.
import numpy as np
import h5py
T_max = 10**4 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
chunk_size = 10**3
ZZ = np.zeros( (chunk_size, 2+NN*6 ) )
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
with h5py.File("test.h5", "a") as f:
dset = f.create_dataset('ZZ', (0,2+NN*6), maxshape=(None,2+NN*6), dtype='float64', chunks=(chunk_size,2+NN+6))
for chunk in range(0, iterations, chunk_size):
for i in range(0, chunk_size - 1):
T += dt
ZZ[i + 1][0], ZZ[i + 1][1] = T, dt
#Z[i+1][2:] = rk4(EOM_function, posvel=Z[i][2:])
# ^ Using this I would calculate new positions based on previous ones.
ZZ[i + 1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
# Expand the file here to allow for more data.
dset.resize(dset.shape[0] + chunk_size, axis=0)
dset[chunk: chunk + chunk_size ] = ZZ
# update and initialize next chunk. the next chunk's first row should be the last row of the previous chunk + iteration
T += dt
ZZ[0][0], ZZ[0][1] = T, dt
#Z[0][2:] = rk4(EOM_function, posvel=Z[-1][2:])
# ^ Using this I would calculate new positions based on previous ones.
ZZ[0][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
print(dset.shape)
This takes 70 seconds for the save step on my computer, generating a 45GB file, for a dataset 100 times larger than the one in your original code.
The above code is more general in case you are streaming your data and don't know your final size. If you know it from the start, you can replace the initial create_dataset with
dset = f.create_dataset('ZZ', (iterations,2+NN*6), dtype='float64')
and remove the dset.resize(dset.shape[0] + chunk_size, axis=0)
You'll probably also want to read it back in chunks afterwards for other processing, in which case you can follow the docs here: https://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data
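For instance, a minimal sketch of chunked reading (the slice size is arbitrary here, and 'ZZ' is the dataset name used above):
with h5py.File("test.h5", "r") as f:
    dset = f["ZZ"]
    for start in range(0, dset.shape[0], 10**3):
        block = dset[start:start + 10**3]   # only this slice is read from disk
        # ... process block here ...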
Okay, so I'm continuing my question / providing a possible answer to it, based on the answer by EricChen1248. EDIT: the answer provided by EricChen1248 works now and is much better than my code below. See his code.
I still do not completely understand how f.create_dataset() truly works (i.e. when it actually writes data to the file inside the loop, etc.).
Using the code provided by Eric, it created and saved the data files quickly, but when I read the file as follows
hf = h5py.File('temp/test.h5', 'r')
ZZ = np.array(hf['ZZ'])
hf.close()
and plotted the first column (time T column, which should increase by timestep dt after each iteration) I get the following figure
plt.plot(ZZ[:,0])
[figure: time T column plotted]
and as can be seen, it grows to a time of 100 and then drops to zero. This happens after the first chunk_size rows have been passed. I started to read the docs provided by Eric, and using his code as a reference I managed to write something like this
import numpy as np
import h5py
T_max = 10**4
dt = 0.1
T = 0
NN = 1000
iterations = int(T_max/dt)
chunk_size = 10**3
with h5py.File('temp/data12.h5', 'a') as hf:
    dset = hf.create_dataset("ZZ", (chunk_size, 2+NN*6), maxshape=(None, 2+NN*6), chunks=(chunk_size, 2+NN*6), dtype='f8')
    # ^ first I create a data set equal to one chunk_size
    # Here I initialize the system. Columns: 0=T, 1=dt, 2=arbitrary data point, 3=sin(column2)
    # all the rest of the columns are random numbers just to fill some numbers in
    dset[0,0], dset[0,1] = T, dt
    #dset[0,2:] = np.random.uniform(0,1,NN*6)
    dset[0,2] = 1
    dset[0,3] = np.sin(dset[0,2])
    dset[0,4:] = np.random.uniform(0,1,NN*6 - 2)
    print('starts')
    # Main difference down here is that I use the dataset (dset)
    # as the data matrix to be filled instead of the matrix ZZ as in my question.
    i = 0
    #for j, s in enumerate(dset.iter_chunks()):
    for j, s in enumerate(range(0, iterations, chunk_size)):
        print(j, s)
        while i < iterations and i < chunk_size*(j+1) - 1:
            #for i in range(chunk_size*j, chunk_size*(j+1)-1):
            T += dt
            dset[i+1,0], dset[i+1,1] = T, dt
            #dset[i+1,2:] = np.sin(dset[i,2:]+dt)
            dset[i+1,2] = dset[i,2] + dt
            dset[i+1,3] = np.sin(dset[i,2]+dt)
            dset[i+1,4:] = dset[i,4:] + np.random.uniform(-1,1,NN*6-2)
            i += 1
        print(dset.shape)
        dset.resize(dset.shape[0] + chunk_size, axis=0)
This code runs in 1 min 50 s and saves a file of size 4.47GB, so I am happy with the speed, and what I'm really happy about is that it does not use much memory while iterating (I used to run into problems with huge RAM usage).
When I read the data file produced by my code (similarly to above), the time T column grows nicely to T = 10^4 as it should ([figure: time T column plotted, my code version]). The code still generates one extra chunk_size block at the end of the dataset which is full of zeros; that I need to get rid of. One more piece of proof that the code works and saves the data without weird problems is this sinusoidal plot: plt.plot(ZZ[500:1500,0], ZZ[500:1500,3]) ([figure: sinusoidal proof]). Note that the plot is limited to T ~ [50,150] so one can still see something there (if the whole thing is plotted, the lines are not visible).
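A minimal way to drop that trailing all-zero block would be to shrink the dataset to the number of rows actually filled before the file is closed (a sketch; this assumes it is placed inside the with block, after the main loop):
dset.resize(iterations, axis=0)   # keep only the rows that were actually written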
I believe this is not the best way to write this code, but it is the way I got this working. So if someone sees improvements, please let me know. Also, I am curious to know why the code provided by Eric did not work, at least for me.
EDIT : fixed typos

Using awkward-array with zip/unzip with two different physics objects

I'm trying to reproduce parts of the Higgs discovery in the Higgs --> 4 leptons channel with open data, making use of awkward-array. I can do it when the leptons are all the same flavor (e.g. 4 muons) with zip/unzip, but is there a way to do it in the 2-muon/2-electron channel? I started with the example in the HSF tutorial
https://hsf-training.github.io/hsf-training-scikit-hep-webpage/04-awkward/index.html
So I now have the following. First get the input file
curl http://opendata.cern.ch/record/12361/files/SMHiggsToZZTo4L.root --output SMHiggsToZZTo4L.root
Then I do the following
import numpy as np
import matplotlib.pylab as plt
import uproot
import awkward as ak
# Helper functions
def energy(m, px, py, pz):
    E = np.sqrt((m**2) + (px**2 + py**2 + pz**2))
    return E

def invmass(E, px, py, pz):
    m2 = (E**2) - (px**2 + py**2 + pz**2)
    if m2 < 0:
        m = -np.sqrt(-m2)
    else:
        m = np.sqrt(m2)
    return m

def convert(pt, eta, phi):
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz
####
# Read in the file
infile_name = 'SMHiggsToZZTo4L.root'
infile = uproot.open(infile_name)
# Convert to Cartesian
muon_pt = infile['Events/Muon_pt'].array()
muon_eta = infile['Events/Muon_eta'].array()
muon_phi = infile['Events/Muon_phi'].array()
muon_q = infile['Events/Muon_charge'].array()
muon_mass = infile['Events/Muon_mass'].array()
muon_px,muon_py,muon_pz = convert(muon_pt, muon_eta, muon_phi)
muon_energy = energy(muon_mass, muon_px, muon_py, muon_pz)
# Do the magic
nevents = len(infile['Events/Muon_pt'].array())
# nevents = 1000 # For testing
max_entries = nevents
muons = ak.zip({
    "px": muon_px[0:max_entries],
    "py": muon_py[0:max_entries],
    "pz": muon_pz[0:max_entries],
    "e": muon_energy[0:max_entries],
    "q": muon_q[0:max_entries],
})
quads = ak.combinations(muons, 4)
mu1, mu2, mu3, mu4 = ak.unzip(quads)
mass_try = (mu1.e + mu2.e + mu3.e + mu4.e)**2 - ((mu1.px + mu2.px + mu3.px + mu4.px)**2 + (mu1.py + mu2.py + mu3.py + mu4.py)**2 + (mu1.pz + mu2.pz + mu3.pz + mu4.pz)**2)
mass_try = np.sqrt(mass_try)
qtot = mu1.q + mu2.q + mu3.q + mu4.q
plt.hist(ak.flatten(mass_try[qtot==0]), bins=100,range=(0,300));
And the histogram looks good!
So how would I do this for 2-electron + 2-muon combinations? I would guess there's a way to make lepton_xxx arrays? But I'm not sure how to do this elegantly (and quickly) such that I could also create a flag to keep track of which lepton combination each entry comes from.
Thanks!
Matt
This could be answered in a variety of ways:
make a union array (mixed data types) of electrons and muons
make an array of electrons and muons that are the same type, but have a flag to indicate flavor (electron vs muon)
use ak.combinations with n=2 for the muons, ak.combinations with n=2 again for the electrons, and then combine them with ak.cartesian (and deal with tuples of tuples, rather than one level of tuples, which would mean two calls to ak.unzip)
break the electron and muon collections down into single-charge collections. You'll want exactly 1 positive muon, 1 negative muon, 1 positive electron, and 1 negative electron, so that would be an ak.cartesian of the four collections.
I'll go with the last method because I've decided that it's easiest.
Another thing you probably want to know about is the Vector library. After
import vector
vector.register_awkward()
you don't have to do explicit coordinate transformations or mass calculations. I'll be using that. Here's how I read in the data:
infile = uproot.open("/tmp/SMHiggsToZZTo4L.root")
muon_branch_arrays = infile["Events"].arrays(filter_name="Muon_*")
electron_branch_arrays = infile["Events"].arrays(filter_name="Electron_*")
muons = ak.zip({
    "pt": muon_branch_arrays["Muon_pt"],
    "phi": muon_branch_arrays["Muon_phi"],
    "eta": muon_branch_arrays["Muon_eta"],
    "mass": muon_branch_arrays["Muon_mass"],
    "charge": muon_branch_arrays["Muon_charge"],
}, with_name="Momentum4D")

electrons = ak.zip({
    "pt": electron_branch_arrays["Electron_pt"],
    "phi": electron_branch_arrays["Electron_phi"],
    "eta": electron_branch_arrays["Electron_eta"],
    "mass": electron_branch_arrays["Electron_mass"],
    "charge": electron_branch_arrays["Electron_charge"],
}, with_name="Momentum4D")
And this reproduces your plot:
quads = ak.combinations(muons, 4)
quad_charge = quads["0"].charge + quads["1"].charge + quads["2"].charge + quads["3"].charge
mu1, mu2, mu3, mu4 = ak.unzip(quads[quad_charge == 0])
plt.hist(ak.flatten((mu1 + mu2 + mu3 + mu4).mass), bins=100, range=(0, 200));
The quoted number slices (e.g. "0" and "1") are picking tuple fields, rather than array entries; it's a manual ak.unzip. (The fields could have had real names if I had given a fields argument to ak.combinations.)
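For example, a small sketch of that fields argument (the field names here are just illustrative; ak.combinations accepts a fields parameter):
quads = ak.combinations(muons, 4, fields=["mu1", "mu2", "mu3", "mu4"])
quad_charge = quads.mu1.charge + quads.mu2.charge + quads.mu3.charge + quads.mu4.charge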
For the two muons, two electrons case, let's make four distinct collections.
muons_plus = muons[muons.charge > 0]
muons_minus = muons[muons.charge < 0]
electrons_plus = electrons[electrons.charge > 0]
electrons_minus = electrons[electrons.charge < 0]
The ak.combinations function (with default axis=1) returns the Cartesian product of each list in the array of lists with itself, excluding an item with itself, (x, x) (unless you want that, specify replacement=True), and excluding one of each symmetric pair (x, y)/(y, x).
If you want just a plain Cartesian product of lists from one array with lists from another array, that's ak.cartesian. The four collections muons_plus, muons_minus, electrons_plus, electrons_minus are non-overlapping and we want each four-lepton group to have exactly one from each, so that's a plain Cartesian product.
mu1, mu2, e1, e2 = ak.unzip(ak.cartesian([muons_plus, muons_minus, electrons_plus, electrons_minus]))
plt.hist(ak.flatten((mu1 + mu2 + e1 + e2).mass), bins=100, range=(0, 200));
Separating particles by flavor (electron vs muon) but not by charge is an artifact of the typing constraints imposed by C++. Electrons are measured in different ways from muons, so they have different attributes in the dataset. In a statically typed language like C++, we couldn't put them in the same collection because they differ by type (have different attributes). But charges only differ by value (an integer), so there was no reason they had to be in different collections.
But now, the only thing distinguishing a type is (a) what attributes the objects have and (b) what objects with that set of attributes are named. Here, I named them "Momentum4D" because that allowed Vector to recognize them as Lorentz vectors and give them Lorentz vector methods. But they could have had other attributes (e.g. e_over_p for electrons, but not for muons). charge is a non-Momentum4D attribute.
So if we're going to the trouble to put different-flavor leptons in different arrays, why not put different-charge leptons in different arrays, too? Physically, flavor and charge are both particle properties at the same level of distinction, so we probably want to do the analysis that way. No reason to let a C++ type distinction get in the way of the analysis logic!
Also, you probably want
nevents = infile["Events"].num_entries
to get the number of entries from a TTree without reading the array data from it.

Iterations of equations using sequences

To start off with, I have two files; let's call them fileA and fileB.
In fileB I have two sequences; let's call them initial and final. Each sequence has exactly 32 values, most of which are simple equations that slightly differ from each other, hence the 32 unique values. For simplicity's sake, let's scope them down to 5 each. So for initial it would look something like:
~fileB
T1 = 60
initial = [0.112, 0.233, 0.322*T1, 0.55*T1, 0.665*T1]
Variable T1 does not change at any point, so initial is permanently constant. The second variable is called final.
For final I have:
T2 = 120
k_0 = T2**2 - T1**2
final = [x * k_0 for x in initial]
This gives me the values I want for final, and it gives me a sequence of the same length. In fileA I want to evaluate an iterator at multiple T2 values and get an "answer" for each respective T2 value. However, since I am new, I'm limiting myself to doing this only for the very first final value.
So now on to fileA:
~fileA
import math
import numpy as np
from sympy import symbols, Integral, solve
from fileB import k_0, final
answer = []
T2 = np.arange(120, 400, 10)
x = symbols('x')
int1 = Integral(x**2 - 1, x)
eq1 = int1.doit()
for i in T2:
    k = k_0*final[0]
    answer.append(solve(eq1 - k, x))
This is where things get tricky, as I want it to evaluate this ONLY for the first final value,
final[0]
but I want it to re-evaluate the two variables
k_0 = T2**2 - T1**2
and
answer = []
at each and every T2 value. How can I do this so that I can build an array/table that looks like the following?
T2         Answer
value_1    Value_1
value_2    Value_2
value_3    Value_3
value_4    Value_4
...        ...
If you need me to explain it better or have questions, feel free to ask.
If it matters, I'm using Python 3.6/3.7 in the Anaconda distribution.
Okay, my question was a bit confusing, but I figured out how to solve it.
The first step was to write a list comprehension for T2 instead of an np array, like so:
T2 = [20 + x*1.01 +273.15 for x in range(40)]
Then I assigned the integral I wanted solved to a variable; let's call it int1.
int1 = Integral((1-x)**-2,x)
Then I have to create two new and separate empty sequences.
answer = []
X = []
For the first sequence (answer) I did the following:
for i,temp in enumerate(t2):
    k_0 = math.exp((ee/r)*((1/t1)-(1/t2[i])))
    k = initial_k[0]*k_0
    v_0 = 2
    Fa_0 = 5
    Ca_0 = Fa_0/v_0
    Q = (k*Ca_0*V)/v_0
    eq1 = int1.doit()
    answer.append(solve(eq1 - Q,x))
While this gives the numerical answers I was looking for, each individual answer is returned as a sequence nested inside the outer sequence:
answer = [[value1],[value2],[value3],[value4]]
This causes problems when trying to use the numerical answers in other equations or operations.
To fix this I used the same technique as above and collected all of the numerical float values into a single flat sequence:
for i,val in enumerate(answer):
    X.append(float(answer[i][0]))
This finally allowed me to use the numerical float values originally stored in answer[] by simply transferring them to a new sequence where they are not nested.
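As a side note, SymPy's solve() returns a list of solutions for each equation, which is why every entry in answer is itself a list; an equivalent one-line flattening (a sketch, assuming each equation has exactly one relevant root) would be:
X = [float(sols[0]) for sols in answer]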
Ca = [(1-x)*Ca_0 for x in X]
Fa = [(1-x)*Fa_0 for x in X]
Finally, I was able to get the table I originally wanted!
Temperature   Conversion            Ca                    Fa
293.15        0.44358038322375287   1.3910490419406178    2.7820980838812357
303.15        0.44398389128120275   1.390040271796993     2.780080543593986
313.15        0.44436136324002395   1.38909659189994      2.77819318379988
323.15        0.4447152402068642    1.3882118994828396    2.7764237989656793
I'm not completely sure why I had to do this, but it worked; any help or insight into this would still be appreciated.

Vectorizing a monte carlo simulation in python

I've recently been working on some code in Python to simulate a 2-dimensional U(1) gauge theory using Monte Carlo methods. Essentially I have an n by n by 2 array (call it Link) of unitary complex numbers (their magnitude is one). I randomly select an element of my Link array and propose a random change to the number at that site. I then compute the resulting change in the action that would occur due to that change. I then accept the change with a probability equal to min(1, exp(-dS)), where dS is the change in the action. The code for the iterator is as follows:
def iteration(j1,B0):
    global Link
    Staple = np.zeros((2),dtype=complex)
    for i0 in range(0,j1):
        x1 = np.random.randint(0,n)
        y1 = np.random.randint(0,n)
        u1 = np.random.randint(0,1)
        Linkrxp1 = np.roll(Link,-1, axis = 0)
        Linkrxn1 = np.roll(Link, 1, axis = 0)
        Linkrtp1 = np.roll(Link, -1, axis = 1)
        Linkrtn1 = np.roll(Link, 1, axis = 1)
        Linkrxp1tn1 = np.roll(np.roll(Link, -1, axis = 0),1, axis = 1)
        Linkrxn1tp1 = np.roll(np.roll(Link, 1, axis = 0),-1, axis = 1)
        Staple[0] = Linkrxp1[x1,y1,1]*Linkrtp1[x1,y1,0].conj()*Link[x1,y1,1].conj() + Linkrxp1tn1[x1,y1,1].conj()*Linkrtn1[x1,y1,0].conj()*Linkrtn1[x1,y1,1]
        Staple[1] = Linkrtp1[x1,y1,0]*Linkrxp1[x1,y1,1].conj()*Link[x1,y1,0].conj() + Linkrxn1tp1[x1,y1,0].conj()*Linkrxn1[x1,y1,1].conj()*Linkrxn1[x1,y1,0]
        uni = unitary()
        Linkprop = uni*Link[x1,y1,u1]
        dE3 = (Linkprop - Link[x1,y1,u1])*Staple[u1]
        dE1 = B0*np.real(dE3)
        d1 = np.random.binomial(1, np.minimum(np.exp(dE1),1))
        d = np.random.uniform(low=0,high=1)
        if d1 >= d:
            Link[x1,y1,u1] = Linkprop
        else:
            Link[x1,y1,u1] = Link[x1,y1,u1]
At the beginning of the program I call a routine called randommatrix to generate K random unitary complex numbers which have small imaginary parts and store them (along with their conjugates) in an array called Cnum of length 2K. In the same routine I also go through my Link array and set each element to a random unitary complex number. The code is listed below.
def randommatrix():
    global Cnum
    global Link
    for i1 in range(0,K):
        C1 = np.random.normal(0,1)
        Cnum[i1] = np.cos(C1) + 1j*np.sin(C1)
        Cnum[i1+K] = np.cos(C1) - 1j*np.sin(C1)
    for i3,i4 in itertools.product(range(0,n),range(0,n)):
        C2 = np.random.uniform(low=0, high = 2*np.pi)
        Link[i3,i4,0] = np.cos(C2) + 1j*np.sin(C2)
        C2 = np.random.uniform(low=0, high = 2*np.pi)
        Link[i3,i4,1] = np.cos(C2) + 1j*np.sin(C2)
The following routine is used during the iteration routine to get a random complex number with a small imaginary part (by retrieving a random element of the Cnum array we generated earlier).
def unitary():
    I1 = np.random.randint((0),(2*K-1))
    mat = Cnum[I1]
    return mat
Here is an example of what the iteration routine would be used for. I've written a routine called Plq, which calculates the mean plaquette (the real part of a 1 by 1 closed loop of link variables) for a given B0. The iteration routine is used to generate new field configurations which are independent of the previous configurations. After we get a new field configuration we calculate the plaquette for that configuration. We then repeat this process j1 times using a while loop, and at the end we end up with the mean plaquette.
def Plq(j1,B0):
    i5 = 0
    Lboot = np.zeros(j1)
    while i5<j1:
        iteration(25000,B0)
        Linkrxp1 = np.roll(Link,-1, axis = 0)
        Linkrtp1 = np.roll(Link, -1, axis = 1)
        c0 = np.real(Link[:,:,0]*Linkrxp1[:,:,1]*Linkrtp1[:,:,0].conj()*Link[:,:,1].conj())
        Lboot[i5] = np.mean(c0) # store the mean plaquette of this configuration
        i5 = i5 + 1
    return np.mean(Lboot) # mean plaquette over all configurations, as described above
We need to define some variables before we run anything, so here's the initial variables which I define before defining any routines
K = 20000
n = 50
a = 1.0
Link = np.zeros((n,n,2),dtype = complex)
Cnum = np.zeros((2*K), dtype = complex)
This code works, but it is painfully slow. Is there a way that I can use multiprocessing or something to speed this up?
You should use Cython and C data types; Cython is built for fast computation.
You could potentially use multiprocessing in one of two cases.
If you have one object that multiple processes need to share, you would need to use a Manager (see the multiprocessing docs), a Lock, and an Array to share the object between processes. However, there is no guarantee this will result in increased speed, since each process needs to lock the Link array to guarantee a correct prediction, assuming the predictions are affected by all elements in the array (if a process modifies an element at the same time another process is making a prediction for an element, the prediction wouldn't be based on the most current information).
If your predictions do not take the state of the other elements into account, i.e. each one only cares about its own element, then you could break your Link array into segments, divvy the chunks out to several processes in a process pool, and when done combine the segments back into one array. This would certainly save time, and you wouldn't have to use any additional multiprocessing mechanisms.
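For illustration only, here is a minimal sketch of that second, segmented approach, assuming each segment can be updated independently of the others (which is not true for the staple calculation above, so treat this purely as the pattern, not a drop-in replacement):
import numpy as np
from multiprocessing import Pool

def update_segment(segment):
    # Placeholder update that only looks at the segment itself;
    # the real per-element proposal/accept step would go here.
    return segment * np.exp(1j * np.random.normal(0, 0.01, segment.shape))

if __name__ == "__main__":
    n = 50
    Link = np.exp(1j * np.random.uniform(0, 2*np.pi, (n, n, 2)))
    segments = np.array_split(Link, 4, axis=0)   # divide the lattice into row blocks
    with Pool(4) as pool:
        updated = pool.map(update_segment, segments)
    Link = np.concatenate(updated, axis=0)       # reassemble into one array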

Python input/output more efficiently

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file yields a roughly 1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, it is not duplicated work, because each time I extract a different segment of that spectrum.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import fitsio
x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(parts1[0])
        n2.append(parts1[1])
        n3.append(int(parts1[2]))   # spectrum index within the file, used for slicing below
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))
def data_analysis(idx_start, idx_end): #### loop over 10 million entries
    data_stru = np.zeros((idx_end-idx_start), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
    for i in xrange(idx_start, idx_end):
        filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0][n3[i]-1:n3[i], 0:4000] #### returns a list of lists
        flux = fluxx[0]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(4000)
        wavegrid = np.power(10, logwave)
        ##### After I read the flux and the wavegrid, I can do my following analysis.
        ##### save data to data_stru
        ##### Reading is the most time-consuming part of this code; my later analysis is not.
The problem is that the files are too big and there is not enough memory to load them all at once, and my catalog is not structured so that all entries which open the same file are grouped together. I wonder if anyone can offer some thoughts on splitting the large loop into two loops: 1) first loop over the files, so that we avoid repeatedly opening/reading the same files again and again, and 2) loop over the entries which use the same file.
If I understand your code correctly, n1 and n2 determine which file to open. So why not just lexsort them? You can then use itertools.groupby to group records with the same (n1, n2). Here is a down-scaled proof of concept:
import itertools
import numpy as np

n1 = np.random.randint(0, 3, (10,))
n2 = np.random.randint(0, 3, (10,))
mockdata = np.arange(10) + 100
s = np.lexsort((n2, n1))
for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
    # groupby groups the iterations i of its first argument
    # (zip(...) in this case) by the result of applying the
    # optional second argument (here lambda) to i.
    # Here we use the lambda expression to remove si from the
    # tuple (si, n1si, n2si) that zip produces because otherwise
    # equal (n1si, n2si) pairs would still be treated as different
    # because of the distinct si's. Hence no grouping would occur.
    # Putting si in there in the first place is necessary, so
    # we can reference the other records of the corresponding row
    # in the inner loop.
    print(k)
    for si, n1s, n2s in g:
        # si can be used to access the corresponding other records
        print(si, mockdata[si])
Prints something like:
(0, 1)
4 104
(0, 2)
0 100
2 102
6 106
(1, 0)
1 101
(2, 0)
8 108
9 109
(2, 1)
3 103
5 105
7 107
You may want to include n3 in the lexsort, but not the grouping so you can process the files' content in order.
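Applied to the original loop, that pattern might look roughly like this (a sketch only; the path template and the fitsio calls are copied from the question and not re-tested):
order = np.lexsort((n3, n2, n1))
for (f1, f2), group in itertools.groupby(order, lambda idx: (n1[idx], n2[idx])):
    filename = "../../../data/" + str(f1) + "/spPlate-" + str(f1) + "-" + str(f2) + ".fits"
    fits_spectra = fitsio.FITS(filename)          # open each file only once
    hdu = fits_spectra[0].read_header()
    for i in group:                               # all catalog entries that use this file
        flux = fits_spectra[0][n3[i]-1:n3[i], 0:4000][0]
        # ... analysis for entry i using x[i], y[i], r[i], s[i] ...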
