I'd like to perform some array calculations using NumPy for a view callable in Pyramid. The array I'm using is quite large (3500x3500), so I'm wondering where the best place to load it is for repeated use.
Right now my application is a single page and I am using a single view callable.
The array will be loaded from disk and will not change.
If the array is something that can be shared between threads, then you can store it in the registry at application startup (config.registry['my_big_array'] = ??). If it cannot be shared, then I'd suggest using a queuing system with workers that always have the data loaded, probably in another process. You can hack around this by making the value in the registry a threadlocal and storing a new array in the variable if one is not there already, but then you will have one copy of the array per thread, and that's really not a great idea for something that large.
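A minimal sketch of that registry approach, assuming the array is only ever read after loading (concurrent reads of a NumPy array are safe); the file name and route are placeholders:

import numpy as np
from pyramid.config import Configurator
from pyramid.response import Response

def my_view(request):
    # Every thread reads the same shared array; no copies are made.
    big = request.registry['my_big_array']
    return Response(str(big.sum()))

def main(global_config, **settings):
    config = Configurator(settings=settings)
    # Load once at startup; Pyramid's registry is a dict subclass, so
    # dict-style assignment works for stashing app-wide objects.
    config.registry['my_big_array'] = np.load('data.npy')
    config.add_route('home', '/')
    config.add_view(my_view, route_name='home')
    return config.make_wsgi_app()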
I would just load it in the obvious place in the code, where you need to use it (in your view, I guess?) and see if you have performance problems. It's better to work with actual numbers than try to guess what's going to be a problem. You'll usually be surprised by the reality.
If you do see performance problems, assuming you don't need a copy for each of multiple threads, try just loading it in the global scope after your imports. If that doesn't work, try moving it into its own module and importing that. If that still doesn't help... I don't know what then.
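To make the module-scope suggestion concrete, here is a sketch (file and module names are hypothetical); the module body runs once per process at import time, so every view in that process shares one copy:

# big_data.py
import numpy as np

BIG_ARRAY = np.load('my_big_array.npy')  # loaded once, at import time

# views.py would then do:
# from .big_data import BIG_ARRAY
# and use BIG_ARRAY inside the view callable.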
I did a quick search and the only relevant questions I found talk about the old numpy.random interface. I am trying to understand how to use the new interface. I would like to be able to run some simulation for a given amount of time. Then I want to store the random number generator state information to a file so that I can continue the simulation at a later time.
I have found one way to accomplish this, but it seems to me to be a bad idea since it isn't documented in the API anywhere. I'm wondering if there is a simple way that I have somehow overlooked.
Let's say that I start a simulation with the following code.
from numpy.random import Generator, PCG64
rg = Generator(PCG64(12345))
rg.standard_normal(1024)
save_to_file('state.txt', rg.bit_generator.state)
print(rg.standard_normal(8))
Here, save_to_file saves the dictionary returned by rg.bit_generator.state to state.txt. Now, if I want to continue the simulation later from the point where I saved it, I can do so by using the following.
from numpy.random import Generator, PCG64
rg = Generator(PCG64())
rg.bit_generator.state = load_from_file('state.txt')
print(rg.standard_normal(8))
This works; the same 8 numbers are printed for me. I figured out how to do this by inspecting the bit_generator object in the Python console. I am using Python 3.6.8 and NumPy 1.18.4. The documentation here and here on the bit_generator object is extremely sparse and doesn't have any suggestions for this common (at least in my work) scenario.
This answer to a similar question about the older interface seems to suggest that it is quite difficult to obtain this for the Mersenne Twister (MT19937), but I am using the PCG64 algorithm, which seems not to have as much internal state, at least judging by the success of the code I have provided. Is there a better way to accomplish this? One that is either documented or condoned by the community at large? Something that won't silently break if I one day decide to update NumPy.
Accessing the bit generator through rg.bit_generator is the same as declaring pg = PCG64() and accessing pg.state directly, so there's nothing wrong with going through rg.bit_generator. The docs are a bit scarce, but the docs for BitGenerator state that accessing BitGenerator.state lets you get and set the state of whichever bit generator you chose.
https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.PCG64.state.html?highlight=pcg64
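For completeness, here is one way the asker's save_to_file and load_from_file helpers could be written, assuming JSON serialization; PCG64's state dict contains only strings and integers, which JSON round-trips exactly:

import json
from numpy.random import Generator, PCG64

def save_to_file(path, state):
    with open(path, 'w') as f:
        json.dump(state, f)

def load_from_file(path):
    with open(path) as f:
        return json.load(f)

rg = Generator(PCG64(12345))
rg.standard_normal(1024)
save_to_file('state.txt', rg.bit_generator.state)
first = rg.standard_normal(8)

# Later, resume from the saved state with a fresh generator.
rg2 = Generator(PCG64())
rg2.bit_generator.state = load_from_file('state.txt')
assert (first == rg2.standard_normal(8)).all()  # the stream resumes identically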
I am working on a small webpage that uses geospatial data. I did my initial analysis in Python using GeoPandas and Shapely, and I am attempting to build a webpage from this. The problem is, when using Django, I can't seem to find a way to keep the shapefile stored as a constant object. Each time a request needs to do operations on the shapefile, I have to load the data from source. This takes something like 6 seconds, while a standard dataframe deep copy df.copy() would take fractions of a second. Is there a way I can store a dataframe in Django that can be accessed and deep copied by the views without re-reading the shapefile?
Due to the nature of Django, global variables do not really work that well. I solved this problem in two different ways. The first was to just use Django sessions; this way, the object you want to store globally only needs to be loaded once per session on your website. The second and more efficient option is to use a cache server, either Redis or memcached. This will allow you to store and fetch your objects very quickly across all of your sessions and will improve performance the most.
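A sketch of the cache option, assuming a cache backend (Redis or memcached) is already configured in settings.py and a hypothetical shapefile path; GeoDataFrames are picklable, so they can go straight into the cache:

import geopandas as gpd
from django.core.cache import cache

SHAPEFILE = 'data/regions.shp'  # hypothetical path

def get_shapes():
    gdf = cache.get('shapes')
    if gdf is None:
        gdf = gpd.read_file(SHAPEFILE)           # the slow ~6 s read, done once
        cache.set('shapes', gdf, timeout=None)   # timeout=None caches forever
    return gdf.copy()                            # each view gets its own deep copy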
I'm transitioning from Matlab/Octave to NumPy/SciPy. When I use Matlab in interactive mode, I run clear or clear [some_variable] from time to time to remove that variable from memory. For example, before reading in some new data to start a new set of experiments, I used to clear data in Matlab.
How could I do the same thing with NumPy/SciPy?
I did some research, and I found there is a command called del, but I heard that del actually doesn't clear memory, but the variable disappears from the namespace instead. Am I right?
That being said, what would be the best way to mimic "clear" of Matlab in NumPy/SciPy?
del obj will work (del is a statement, not a function), according to the SciPy mailing list
If you're working in IPython, then you can also use %xdel obj
...but I heard that "del" actually doesn't clear memory, but the variable disappears from the namespace. Am I right?
Yes, that's correct. That's what garbage collection is: Python handles clearing the memory when it makes sense to, so you don't need to worry about it; from your end, the variable no longer exists. Your code will behave the same whether or not garbage collection has occurred yet, so there's no need for an alternative to del.
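A small demonstration of that point, using a made-up array:

import numpy as np

data = np.zeros((3500, 3500))  # ~98 MB of float64
alias = data                   # a second reference to the same array
del data                       # the name 'data' leaves the namespace
# print(data)                  # would now raise NameError
del alias                      # last reference gone; Python can free the memory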
If you are curious about the differences between Matlab's and Python's garbage collection / memory allocation, you can read this SO thread on it.
Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch; however, the docs say that fetch should almost never be used: https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it in list(), thereby creating the list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch “should almost never be used”. The documentation says that fetch will get all the results, therefore having to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem—list will force all the elements into memory. So, this solution is actually discouraged on the same basis as using fetch.
Now, you could say: so what should I do? The answer is: in most cases you can deal with elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to put the result data into a django template, and you know that it will be used at most once in your template, then the django template will happily take any iterator—so you can pass the run call result directly without wrapping it into list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
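For example, a sketch with the old db API (the Product model and its price property are made up for illustration):

from google.appengine.ext import db

class Product(db.Model):
    price = db.FloatProperty()

def most_expensive():
    best = None
    # run() streams entities in batches instead of materializing them all
    for product in Product.all().run(batch_size=100):
        if best is None or product.price > best.price:
            best = product
    return best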
But if your usage requires having all the elements in memory (e.g. your django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually makes sense. In the end, this is just the typical trade-off: if your application needs an algorithm that requires all the data in memory, you have to pay for it with memory. So, you can either redesign your algorithms and usage to work with an iterator, or use fetch and pay for it with higher memory usage. Google of course encourages you to do the first thing. And this is what "should almost never be used" actually means.
I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.
I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.
Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?
The third-party shove module is also worth taking a look at. It's very similar to shelve in that it is a simple dict-like object; however, it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even threadsafe. It's a very handy module:
from shove import Shove

mem_store = Shove()                     # defaults to an in-memory store
file_store = Shove('file://mystore')    # persisted under the mystore directory
file_store['key'] = 'some value'        # any picklable object works
Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
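For instance, the standard library's dbm module (called anydbm in Python 2) gives you a hash-on-disk store with a dict-like interface; keys and values must be strings or bytes, so the chain lengths are converted here. A minimal sketch:

import dbm

db = dbm.open('lengths.db', 'c')  # 'c' creates the file if it doesn't exist
db[str(27)] = str(112)            # store the chain length for 27
length = int(db[str(27)])         # read it back
db.close()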
The shelve module may do it; at any rate, it should be simple to test. Instead of:
self.lengths = {}
do:
import shelve
self.lengths = shelve.open('lengths.shelf')
The only catch is that keys to shelves must be strings, so you'll have to replace
self.lengths[indx]
with
self.lengths[str(indx)]
(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)
There's no built-in caching in memory, but your operating system may do that for you anyway.
[actually, that's not quite true: you can pass the argument 'writeback=True' on creation. The intent of this is to make sure storing lists and other mutable things in the shelf works correctly. But a side-effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea :-) ]
Last time I was facing a problem like this, I rewrote to use SQLite rather than a dict, and had a massive performance increase. That performance increase was at least partially on account of the database's indexing capabilities; depending on your algorithms, YMMV.
A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
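A sketch of what such a wrapper might look like, assuming integer keys and values (chain lengths), with sqlite3 in autocommit mode:

import sqlite3

class SQLiteDict(object):
    def __init__(self, path):
        # isolation_level=None puts the connection in autocommit mode
        self.conn = sqlite3.connect(path, isolation_level=None)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)')

    def __getitem__(self, key):
        row = self.conn.execute('SELECT v FROM kv WHERE k = ?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        # REPLACE handles both insert and update in one statement
        self.conn.execute('REPLACE INTO kv (k, v) VALUES (?, ?)', (key, value))

    def __contains__(self, key):
        return self.conn.execute(
            'SELECT 1 FROM kv WHERE k = ?', (key,)).fetchone() is not None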
With a little bit of thought it seems like you could get the shelve module to do what you want.
I've read you think shelve is too slow and you tried to hack your own dict using sqlite.
Someone else did this too:
http://sebsauvage.net/python/snyppets/index.html#dbdict
It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?
You should fetch more than one item at a time if there's some heuristic for knowing which items are most likely to be retrieved next, and don't forget the indexes, as Charles mentions.
For simple use cases, sqlitedict can help. However, when you have much more complex databases, you might want to try one of the more upvoted answers.
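A minimal example (third-party; pip install sqlitedict), with a made-up key and value:

from sqlitedict import SqliteDict

with SqliteDict('lengths.sqlite', autocommit=True) as lengths:
    lengths['27'] = 112    # values are pickled; string keys are safest
    print(lengths['27'])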
It isn't exactly a dictionary, but the vaex module provides incredibly fast dataframe loading and lookup with lazy loading: it keeps everything on disk until it is needed and only loads the required slices into memory.
https://vaex.io/docs/tutorial.html#Getting-your-data-in