I have been trying to optimize this code:
import numpy as np
import scipy.integrate
import scipy.special
from scipy.interpolate import interp1d

def spectra(mE, a):
    # pdf of a gamma distribution in E (shape a+1, scale mE/(a+1))
    pdf = lambda E: np.exp(-(a+1)*E/mE)*(((a+1)*E/mE)**(a+1))/(scipy.special.gamma(a+1)*E)
    u = []
    n = np.random.uniform()
    E = np.arange(0.01, 150, 0.1)
    for i in E:
        u.append(scipy.integrate.quad(pdf, 0, i)[0])  # cumulative integral up to each grid point
    f = interp1d(u, E)  # inverse CDF: maps a uniform number in [0, 1) back to an energy
    return f(n)
I was trying to create a lookup table out of f, but it appears that every time I call the function it does the integration all over again. Is there a way to put in something like an if statement, which will let me create f once for values of mE and a and then just call that afterwards?
Thanks for the help.
Cheers.
It sounds like what you want to do is return a known value if the function is re-called with the same (mE, a) values, and perform the calculation if the input is new.
That is called memoization. See for example "What is memoization and how can I use it in Python?". Note that in modern versions of Python you can apply the memoization with a decorator (e.g. functools.lru_cache), which lets you be a little neater.
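For instance, a minimal sketch along these lines, assuming functools.lru_cache (Python 3) and that the function really is re-called with exactly the same (mE, a) pair (build_sampler is just an illustrative name; see the next answer for the caveat about floating-point keys):

import numpy as np
import scipy.integrate
import scipy.special
from functools import lru_cache
from scipy.interpolate import interp1d

@lru_cache(maxsize=None)
def build_sampler(mE, a):
    # Expensive part: runs once per distinct (mE, a) pair, then the result is cached
    pdf = lambda E: np.exp(-(a+1)*E/mE)*(((a+1)*E/mE)**(a+1))/(scipy.special.gamma(a+1)*E)
    E = np.arange(0.01, 150, 0.1)
    u = [scipy.integrate.quad(pdf, 0, e)[0] for e in E]
    return interp1d(u, E)   # inverse CDF

def spectra(mE, a):
    f = build_sampler(mE, a)   # cheap after the first call with these arguments
    return f(np.random.uniform())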
Most probably you won't be able to store values of spectra(x, y) and reasonably retrieve them by exact values of floating-point x and y. You rarely encounter the exact same floating-point values in real life.
Note that I don't think you can cache f directly, because it depends on a long list of floats. Its possible input space is so large that finding a close match seems very improbable to me.
If you cache values of spectra() you could retrieve the value for a close enough pair of arguments with a reasonable probability.
The problem is searching for such close pairs. A hash table cannot work (we need imprecise matches), an ordered list and binary search cannot work either (we have 2 dimensions). I'd use a quad tree or some other form of spatial index. You can build it dynamically and efficiently search for closest known points near your given point.
If you found cached a point really close to the one you need, you can just return the cached value. If no point is close enough, you add it to the index, in hope that it will be reused in the future. Maybe you could even interpolate if your point is between two known points close by.
The big prerequisite, of course, is that a sufficient number of points in the cache has a chance to be reused. To estimate this, run some of your calculations and store the (mE, a) pairs somewhere (e.g. in a file), then plot them. You'll instantly see whether you have groups of points close to one another. You can also look for tight clusters without plotting, of course. If you have enough tight clusters (where you could reuse one point's value for another), your cache will work. If not, don't bother implementing it.
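If the clustering check pans out, here is a rough sketch of the near-match cache, using scipy's cKDTree as a stand-in for the quad tree (NearestPointCache is an illustrative name; rebuilding the tree on every insert is crude, and you would probably want to rescale mE and a so the tolerance means the same thing in both dimensions):

import numpy as np
from scipy.spatial import cKDTree

class NearestPointCache:
    """Return a cached value if a previously seen (mE, a) point lies within tol."""
    def __init__(self, tol=1e-3):
        self.tol = tol
        self.points = []   # list of (mE, a) pairs seen so far
        self.values = []
        self.tree = None

    def lookup(self, mE, a):
        if self.tree is not None:
            dist, idx = self.tree.query((mE, a), k=1)
            if dist <= self.tol:
                return self.values[idx]
        return None

    def store(self, mE, a, value):
        self.points.append((mE, a))
        self.values.append(value)
        self.tree = cKDTree(self.points)   # a real spatial index would update in place

cache = NearestPointCache()
val = cache.lookup(mE, a)
if val is None:
    val = spectra(mE, a)
    cache.store(mE, a, val)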
Related
I was wondering: if I had 1 million users, how long would it take to loop over every account to check whether a username and email are already in use on the registration page, or whether a username and password are correct on the login page? Won't it take ages if I do it with a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
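In Python that procedure might look roughly like this (find_user and the (username, password) tuples are just illustrative; the standard library's bisect module does something similar for sorted lists):

def find_user(sorted_users, target_name):
    # sorted_users: list of (username, password) pairs, sorted by username
    lo, hi = 0, len(sorted_users) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name = sorted_users[mid][0]
        if name == target_name:
            return mid                 # found a matching username
        elif target_name < name:
            hi = mid - 1               # keep searching the left half
        else:
            lo = mid + 1               # keep searching the right half
    return -1                          # ran out of list: no such user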
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken grows by only about ten extra steps (from roughly ten comparisons to roughly twenty), rather than by a factor of one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutiae of implementing these ideas in a production system. However, real-world systems exist that implement these ideas (and many more) for precisely the kind of situation presented.
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature that causes it to produce different hashes for different executions of the program. Otherwise, you can import hashlib and use some hash function there and convert the output to an integer using e.g. int.from_bytes. Once you get your number, you can take the modulus (or remainder after division, using the % operator) w.r.t. the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend studying published guidance, but intuitively, creating it larger than necessary by some constant multiplicative factor is generally the best bet. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions - a costly operation, but if you increase capacity multiplicatively then I imagine this will not be too frequent an issue).
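To make this concrete, here is a toy sketch using hashlib with separate chaining (the first collision strategy above); all names are illustrative, and in real Python code a plain dict already gives you this behaviour:

import hashlib

class ChainedHashTable:
    """Toy hash table using separate chaining for collisions."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        digest = hashlib.sha256(key.encode('utf-8')).digest()
        return int.from_bytes(digest, 'big') % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # replace an existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None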
I am just learning to use dask and read many threads on this forum related to Dask and for loops. But I am still unclear how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location parameter 'nxy', as my calculations can proceed simultaneously at different locations.
for loc in range(0, nxy):   # nxy = total no. of locations
    for it in range(0, ntimes):
        out1 = expression1 involving ( var1(loc), var2(it,loc) )
        out2 = expression2 involving ( var1(loc), var2(it,loc) )
        # <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this but might not be doing it correctly as I did not get any speed-up compared to without using dask. Here's what I did:
def test_dask(n):
    for loc in range(0, n):
        # same code as before
    return var1   # just returning one variable for now

var1 = delayed(test_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple arrays and operating element by element. Please be warned that Python is very slow at this. You might want to not use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions
Use Numba
Also, given the terms you're using, like lat/lon/depth, it may be that Xarray is a good project for you.
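As a rough sketch of how wrapping the per-location work in a function might look, with one delayed task per location so each task does enough work to amortise the per-task overhead (process_location is an illustrative name and the expressions below are just placeholders for your real calculations):

import numpy as np
from dask import compute, delayed

def process_location(v1_loc, v2_loc):
    # v2_loc holds every time step for one location
    out1 = np.cumsum(v1_loc + v2_loc)        # stands in for "expression1"
    out2 = np.maximum.accumulate(v2_loc)     # stands in for "expression2"
    return out1, out2

nxy, ntimes = 100, 50                        # dummy sizes
var1 = np.random.rand(nxy)
var2 = np.random.rand(ntimes, nxy)

# One delayed task per location, not per (location, time) pair
tasks = [delayed(process_location)(var1[loc], var2[:, loc]) for loc in range(nxy)]
results = compute(*tasks)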
I'm writing a program to do various calculations involving nuclides. Some of these involve binding energies, magnetic moments, etc. Part of the program will need to store some dictionary, list, or other structure I'm unaware of as a novice Python programmer. I'd like to (by hand) create a set of data that contains Z, N, masses, etc. Specifically, I'd like a structure where each entry has multiple traits (maybe accessing an attribute as nuclides['C14'][attribute]). I've thought of making a nested dictionary, but don't think this is intuitive. Here's the trickiest part: I'd like nuclides to be referable either by Z and N or by a string (e.g. nuclides['14C'] or nuclides[6,8]). As far as I know, dictionary entries are only referenced by their key, so I'm not sure whether dictionaries are ideal.
TL;DR
What's the best format for storing numerous sets of integers/floats and a unique string, where each set can be referenced either by its string or by its pair of numbers?
An example of an application: given, say, 238Pu, finding the daughter nuclide and its mass (both of which are in this table/data) from alpha decay.
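One simple way to get the dual-key lookup described above is to index the same record under both kinds of key; a sketch (the masses below are approximate and only for illustration):

# One record per nuclide; index it under both key styles.
nuclide_data = [
    {'symbol': '14C',   'Z': 6,  'N': 8,   'mass': 14.003242},
    {'symbol': '238Pu', 'Z': 94, 'N': 144, 'mass': 238.049560},
]

nuclides = {}
for rec in nuclide_data:
    nuclides[rec['symbol']] = rec          # lookup by string: nuclides['14C']
    nuclides[rec['Z'], rec['N']] = rec     # lookup by numbers: nuclides[6, 8]

assert nuclides['14C'] is nuclides[6, 8]   # both keys refer to the same record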
I'm quite new to Python (2.7) and have a question about what's the most Pythonic way to do something; my code (part of a class) looks like this (a somewhat naive version):
def calc_pump_height(self):
    for i in range(len(self.primary_)):
        for j in range(len(self.primary_)):
            if self.connections_[i][j].sub_kind_ in [1,4]:
                self.calc_spec_pump_height(i,j)

def calc_spec_pump_height(self,i,j):
    pass
(obviously pass will be replaced by something else, manipulating attributes of the object of this class, without generating a return value)
I'd like to ask how I should do this: I could avoid the second function and write the extra code directly into the first function, getting rid of one function (Simple is better than complex), but creating a heavily nested function at the same time (Flat is better than nested).
I could also create some sort of list comprehension to avoid using a double loop, e.g.:
def calc_pump_height(self):
    ra = range(len(self.primary_))
    [self.calc_spec_pump_height(i,j) for i,j in zip(ra, ra)]
(I'd have to move the if condition into the 2nd function; this would also create a list full of None values, but I don't care about that, since calc_spec_pump_height is supposed to manipulate the object, not return something useful)
In essence: I'm iterating over a 2D list, testing each object for a certain characteristic and then do something with that object.
Which of the above methods is 'the best'? Or is there another way that I'm missing?
The key thing about functions/methods is that they should do one thing.
calc_pump_height implements two things: it finds elements in a 2D list that match some criteria, and then it calculates a value for each of those elements. It's OK for its purpose to be combining those two operations, if that makes sense for the object's public API, but it's not OK for it to implement either or both of them itself.
Finding the elements that match the criteria is a discrete step; that should be a function.
Calculating your value is clearly a discrete step; that should be a function.
I would implement the element matcher as a (private) generator, that takes the test condition as an argument, and yields all matching elements. It's just an iterator over your data structure, masked by the logical test. You can wrap that in a named public method called get_1_4_subkinds() or something that makes more sense in your domain. That generalises the code and gives you the flexibility to implement other conditions in the future. Also, your i and j are tightly coupled, so it makes sense to pass them around as a single concept. Then your code becomes:
def calc_pump_height(self):
    for subkind_indices in self.get_1_4_subkinds():
        self.calc_spec_pump_height(subkind_indices)
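The generator itself could be as simple as this sketch (_matching_indices is just an illustrative name; both methods would live on the same class):

def _matching_indices(self, kinds):
    # Private generator: yield every (i, j) whose connection sub_kind_ is in kinds
    n = len(self.primary_)
    for i in range(n):
        for j in range(n):
            if self.connections_[i][j].sub_kind_ in kinds:
                yield (i, j)

def get_1_4_subkinds(self):
    return self._matching_indices((1, 4))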
You have misunderstood “simplicity”:
write the extra code directly into the first function, getting rid of one function (Simple is better than complex)
That's not simple. Breaking complex sequences into discrete, focussed functions increases simplicity.
In that light, I would say that yes, you should definitely prefer calc_spec_pump_height as a separate function.
You can eliminate one level of nesting in your first function by using itertools.product to generate your i and j values at the same time: itertools.product(range(len(self.primary_)), repeat=2). The zip you use in your second version won't work correctly; it will only yield identical pairs: (0, 0), (1, 1), (2, 2), etc.
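For example, the first version flattens to a single loop level (an untested sketch with the same behaviour):

from itertools import product

def calc_pump_height(self):
    n = len(self.primary_)
    for i, j in product(range(n), repeat=2):
        if self.connections_[i][j].sub_kind_ in (1, 4):
            self.calc_spec_pump_height(i, j)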
As for the overall design, you should not use a list comprehension if you don't care about the return value from the function you're calling. Use an explicit loop when it's the looping you want (rather than a list of computed values).
If there's a non-trivial amount of code that will go in calc_spec_pump_height, it makes perfect sense to make it as a separate method. If it's a one or two liner, then it might be OK to inline within calc_pump_height, but that method's loops and condition testing may be complicated enough already to justify factoring out the inner part of the algorithm.
You should usually think about splitting a big function up when it is too long to fit onto a single screen in your editor. That is about the limit of how many details (variable names, etc.) we can keep in our mind simultaneously. On the other hand, you shouldn't waste time (either your own programming time or function call overhead at run time) by factoring out every little piece of every problem. Factor part of a function out if you're using it from more than one place, or if you can't keep the details of the whole function in your head at once otherwise.
So, other than the (marginal) improvement of itertools.product and given the limited information you've provided about what calc_spec_pump_height will do, I think your code is already about as good as it can get!
I have a problem with my code running on Google App Engine. I don't know how to modify my code to suit GAE. The following is my problem:
for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                # block which runs in O(n^2)
Effectively the entire code block is O(n^6) and it will run for more than 10 minutes, depending on n. Thus I am using task queues. I will also need a 4-dimensional array of size n x n x n x n, stored as a nested list (e.g. A[j][d][d1][d2]), i.e. it needs O(n^4) memory.
Since the limit on put() is 10 MB, I can't store the entire array. So I tried chopping it into smaller chunks, storing those, and recombining them on retrieval. I used the json module for this, but it doesn't work for larger n (> 40).
Then I stored the whole matrix as individual list entities in the datastore, i.e. one entity per A[j][d][d1], so there is no large local variable. When I access A[j][d][d1][d2] in my code I call my own functions getitem and putitem to get and put data from the datastore (I used caching as well). As a result, my code takes more time for computation. After a few iterations, GAE raises error 203 and the task fails with code 500.
I know that my code may not be best suited for GAE. But what is the best way to implement it on GAE ?
There may be even more efficient ways to store your data and to iterate over it.
Questions:
What data type are you storing: a list of lists ... of ints?
What range of the nested list does your innermost O(n^2) loop portion typically operate over?
When you do putitem / getitem, how many values are you retrieving in a single put or get?
Ideas:
You could try compressing your JSON (and base64-encoding it for cutting and pasting): 'myjson'.encode('zlib').encode('base64')
Use a divide and conquer (map reduce) approach as #Robert suggested. You may be able to use a dictionary with tuples for keys; this may mean fewer lookups than A[j][d][d1][d2] in your inner loop. It would also allow you to populate your structure sparsely. You would need to track and know the bounds of what data you loaded in another way. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2].
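A sketch of that tuple-key idea, reusing the getitem/putitem names from the question (purely illustrative):

# Sparse storage: only the (j, d, d1, d2) combinations you actually touch.
D = {}

def putitem(j, d, d1, d2, value):
    D[j, d, d1, d2] = value

def getitem(j, d, d1, d2, default=0.0):
    return D.get((j, d, d1, d2), default)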
You've omitted important details like the expected size of n from your question. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?
Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide and conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors using the index values as key-names; again, this will depend on your problem and the strategy you use.
An alternative, if for some reason you cannot figure out how to parallelize your algorithm, is to use a continuation strategy of some type. In other words, figure out roughly how many iterations you can typically do within the time constraints (leaving a safety margin), then once you hit that limit save your data and insert a new task to continue the processing. You'll just need to pass in the starting position and resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.
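A very rough sketch of that continuation pattern with the task queue API (do_iteration, the '/worker' URL, and the batch size are placeholders, and saving the intermediate state is elided):

from google.appengine.api import taskqueue

BATCH = 1000   # iterations per task; tune so one batch finishes well inside the deadline

def run_chunk(start, n):
    end = min(start + BATCH, n)
    for j in range(start, end):
        do_iteration(j)          # placeholder for the real work on index j
    if end < n:
        # Save intermediate results to the datastore here, then continue in a new task
        taskqueue.add(url='/worker', params={'start': end, 'n': n})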
Sam, just to give you an idea and a pointer on where to start:
If what you need is somewhere between storing the whole matrix and storing the numbers one by one, you may be interested in using pickle to serialize your lists and store them in the datastore for later retrieval.
A list is a Python object, and you should be able to serialize it.
http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore
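For example, a rough sketch with the old db API (MatrixChunk, the key names, and the chunking scheme are illustrative; each pickled chunk still has to fit under the entity size limit):

import pickle
from google.appengine.ext import db

class MatrixChunk(db.Model):
    data = db.BlobProperty()   # pickled sub-list

def save_chunk(key_name, sublist):
    MatrixChunk(key_name=key_name,
                data=db.Blob(pickle.dumps(sublist, pickle.HIGHEST_PROTOCOL))).put()

def load_chunk(key_name):
    entity = MatrixChunk.get_by_key_name(key_name)
    return pickle.loads(entity.data) if entity else None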