I am developing an AI to solve an MDP. I am getting states (just integers in this case) and assigning each a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for delete) and has a very fast get/update. Is there something faster than the regular dictionary? I am open to anything really, native Python or open source; I just need fast gets.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None] * totalitems  # pre-allocate one slot per possible key; [None for _ in range(totalitems)] also works
Then, when you need to "add" an item, just set the corresponding value.
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
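A rough sketch of how such a comparison might be run with timeit (the item count is arbitrary and the absolute numbers will depend entirely on your machine):

import timeit

total_items = 10000

setup_list = "data = [None] * {}".format(total_items)
setup_dict = "data = {{i: None for i in range({})}}".format(total_items)
stmt = "for i in range({}):\n    x = data[i]".format(total_items)

print("list:", timeit.timeit(stmt, setup=setup_list, number=100))
print("dict:", timeit.timeit(stmt, setup=setup_dict, number=100))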
If you want a rapid prototype, use python. And don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear algebra), write it in C or C++ (maybe only to call from Python). If fast instead of ultra-fast is enough, you can also use Java or Scala.
I am using ROS and I am currently trying to manipulate a costmap, so essentially I want to change individual values in a tuple whose length, in this situation, is approaching seven digits.
See http://docs.ros.org/api/nav_msgs/html/msg/OccupancyGrid.html
Originally I tried just turning this long tuple into a list, changing the values, and then turning it back into a tuple; as you can imagine, this is extremely inefficient. I need this to run quickly because it has to update the costmap often for dynamic object avoidance.
Is there a way that I can change individual values in a tuple efficiently?
Sadly, this is simply a limitation of ROS's python message data model. Array-like structures are always deserialized as tuple for performance reasons, except for lists of bool for some reason. And tuple is immutable.
However, if you were in C++ space, you would be receiving a const OccupancyGridConstPtr& anyway, so it would still be just as immutable. Or you could have registered the callback to take an OccupancyGrid message and get pass-by-value, but you're just moving the copy to method-call time. There's no avoiding the copy if you intend to modify the grid, whether you're in Python or C++.
There is no need to convert back to a tuple, however; ROS's Python message serialization accepts either a list or a tuple.
You can gain quite a bit of efficiency as well if you can do some of your processing work during that copy (saves an iteration over the grid). Though I don't know exactly what you're trying to do, so I don't know the feasibility of that.
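For illustration, a callback along these lines does the tuple-to-list copy once, folds the per-cell work into that pass, and publishes the list directly. This is only a sketch: it assumes a working rospy environment, and the node name, topic names, and clamp_cost helper are made up.

import rospy
from nav_msgs.msg import OccupancyGrid

pub = None  # assigned in __main__ below

def clamp_cost(value):
    # Hypothetical per-cell update: clamp costs to the valid occupancy range.
    return max(-1, min(100, value))

def costmap_callback(msg):
    # One unavoidable copy: the incoming tuple becomes a mutable list.
    data = list(msg.data)
    for i in range(len(data)):
        data[i] = clamp_cost(data[i])
    # No conversion back to tuple is needed; a list serializes fine.
    msg.data = data
    pub.publish(msg)

if __name__ == "__main__":
    rospy.init_node("costmap_editor")
    pub = rospy.Publisher("costmap_out", OccupancyGrid, queue_size=1)
    rospy.Subscriber("costmap_in", OccupancyGrid, costmap_callback)
    rospy.spin()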
The fact that tuples are immutable is supposed to be a feature, not a bug.
So if you need to make changes, the data probably ought not to be in tuple form in the first place.
Can't you have a list from beginning to end? (I do not know ROS.)
If that is impossible, I'd say you either have a serious problem with the software design or you are trying to solve the wrong problem.
If I instantiate/update a few lists very, very few times, in most cases only once, but I check for the existence of an object in that list a bunch of times, is it worth it to convert the lists into dictionaries and then check by key existence?
Or in other words is it worth it for me to convert lists into dictionaries to achieve possible faster object existence checks?
Dictionary lookups are faster than list searches. A set would also be an option. That said:
If "a bunch of times" means "it would be a 50% performance increase" then go for it. If it doesn't but makes the code better to read then go for it. If you would have fun doing it and it does no harm then go for it. Otherwise it's most likely not worth it.
You should be using a set, since from your description I am guessing you wouldn't have a value to associate. See Python: List vs Dict for look up table for more info.
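As a quick illustration of the membership-check difference (sizes are arbitrary; the list case deliberately searches for the last element, its worst case):

import timeit

setup = "items = list(range(10000)); s = set(items)"
print("list:", timeit.timeit("9999 in items", setup=setup, number=10000))
print("set: ", timeit.timeit("9999 in s", setup=setup, number=10000))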
Usually it's not important to tune every line of code for utmost performance.
As a rule of thumb, if you need to look up more than a few times, creating a set is usually worthwhile.
However consider that pypy might do the linear search 100 times faster than CPython, then a "few" times might be "dozens". In other words, sometimes the constant part of the complexity matters.
It's probably safest to go ahead and use a set there. You're less likely to find that a bottleneck as the system scales than the other way around.
If you really need to micro-tune everything, keep in mind that the implementation, CPU cache, etc. can affect it, so you may need to re-tune differently for different platforms. And if you need performance that badly, Python was probably a bad choice, although maybe you can pull the hotspots out into C. :)
Random access (lookup) in a dictionary is faster, but building the hash table consumes more memory.
More performance = more memory usage.
It depends on how many items are in your list.
I have a Python dictionary that uses integers as keys
d[7] = ...
to reference custom objects:
c = Car()
d[7] = c
However, each of these custom objects also has a string identifier (from a third party). I want to be able to access the objects using both an integer or a string. Is the preferred way to use both keys in the same dictionary?
d[7] = c
d["uAbdskmakclsa"] = c
Or should I split it up into two dictionaries? Or is there a better way?
It really depends on what you're doing.
If you get the different kinds of keys from different sources, so you always know which kind you're looking up, it makes more sense, conceptually, to use separate dictionaries.
On the other hand, if you need to be able to handle keys that could be either kind, it's probably simpler to use a single dictionary. Otherwise, you need to write code that uses type-switching, or tries one dict and then the other on KeyError, or something else ugly.
(If you're worried about efficiency, it really won't make much difference either way. It's only a very, very tiny bit faster to look things up in a 5000-key dictionary than in a 10000-key dictionary, and it only costs a very small amount of extra memory to keep two 5000-key dictionaries rather than one 10000-key dictionary. So, don't worry about efficiency; do whichever makes sense. I don't have any reason to believe you are worried about efficiency, but a lot of people who ask questions like this seem to be, so I'm just covering the bases.)
I would use two dicts. One mapping 3rd party keys to integer keys and another one mapping integer keys to objects. Depending on which one you use more frequently you can switch that of course.
It's a fairly specific situation, so I doubt there is any 'official' preference for what to do here. I do, however, feel that having keys of multiple types is 'dirty', although I can't really pin down a reason why.
But since you state that the string keys come from a third party, that alone might be a good reason to split them off into another dictionary. I would split as well. You never know what the future might bring, and this method is easier to maintain. It is also less error-prone if you think about type safety.
For setting values in your dictionaries you can then use helper methods. That makes adding easier and prevents you from forgetting to update one of the dictionaries.
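One possible shape for that, as a sketch (the CarRegistry name and its attributes are made up; Car stands in for the question's class):

class Car:
    # Stand-in for the question's Car class.
    pass

class CarRegistry:
    def __init__(self):
        self.by_id = {}        # integer key -> object
        self.id_by_name = {}   # third-party string key -> integer key

    def add(self, int_key, string_key, obj):
        # The single place that keeps both mappings in sync.
        self.by_id[int_key] = obj
        self.id_by_name[string_key] = int_key

    def get(self, key):
        # Accept either kind of key.
        if isinstance(key, str):
            key = self.id_by_name[key]
        return self.by_id[key]

registry = CarRegistry()
registry.add(7, "uAbdskmakclsa", Car())
assert registry.get(7) is registry.get("uAbdskmakclsa")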
So I currently have a 2d list of objects that define the map of a game where each object represents a tile on that map. As I was repurposing the code for something else, I wondered if it would make more sense to use a dictionary to store the map data or to continue using a list. With a list, the indices represent the x and y of the map, whereas in a dictionary, a (x,y) tuple would be the keys for the dictionary.
The reason I ask is that the map changing is a rare event, so the data is fairly static, and as far as I know the fairly constant lookups will be faster with a dictionary. It should also simplify looping through the map to draw it. Mostly I think using a dictionary will simplify accessing the data, though I'm not sure that will hold in every case.
Are these benefits worth the additional memory that I assume the dictionary will take up? Or am I even right about the benefits being benefits?
EDIT
I know that the current method works; the question was more about whether it would make sense to switch in order to have cleaner-looking code, and to find any potential drawbacks.
Stuff like looping through the array would go from something like
for i in range(size[0]):
    for e in range(size[1]):
        thing.blit(....using i and e)
to
for i, e in dict.items():
    thing.blit(....using i and e)
or looking up a dict item would be
def get(self, x, y):
    if x in range(size[0]) and y in range(size[1]):
        return self.map[x][y].tile
to
def get(self, item):
    return self.dict.get(item)
It's not much, but it's somewhat cleaner, and if it's not any slower and there are no other drawbacks, I see no reason not to.
I would be wary of premature optimization.
Does your current approach have unacceptable performance? Does the data structure you're using make it harder to reason about or write your code?
If there isn't a specific problem you need to solve that can't be addressed with your current architecture, I would be wary about changing it.
This is a good answer to reference about the speed and memory usage of Python lists vs dictionaries: https://stackoverflow.com/a/513906/1506904
Until you get to an incredibly large data set it is most likely that your current method, if it is working well for you, will be perfectly suitable.
I'm not sure that you'll get "the right" answer for this, but when I created the *Game of Life* in Python I used a dict. Realistically there should not be a substantial difference between the lookup cost in a multi-dimensional list and the lookup cost in a dict (both are O(1)), but if you're using a dict then you don't need to worry about instantiating the entire game board. In chess, this means that you are only creating 32 pieces instead of 64 squares and 32 pieces. In the game of Go, on the other hand, you can create only 1 object instead of 361 list-cells.
That said, in the case of the dict you will need to instantiate the tuples. If you can cache those (or only iterate the dict's keys) then maybe you will get the best of all worlds.
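For example, a sparse, tuple-keyed board looks roughly like this (the Tile class and coordinates are made up for illustration):

class Tile:
    def __init__(self, kind):
        self.kind = kind

board = {}                      # (x, y) -> Tile, only for occupied cells
board[(3, 4)] = Tile("wall")
board[(5, 1)] = Tile("water")

def get_tile(x, y):
    # A missing key simply means "empty cell"; no full grid is allocated.
    return board.get((x, y))

for (x, y), tile in board.items():
    print(x, y, tile.kind)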
I'm using python to set up a computationally intense simulation, then running it in a custom built C-extension and finally processing the results in python. During the simulation, I want to store a fixed-length number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass back the results to python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) continuous array, piece-wise continuous array or column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is a computationally heavy task already. I expect that a combination of A. buying lots of memory and B. setting up your experiment wisely will allow the entire log to fit in the RAM. However, with my current dict-of-lists solution keeping every variable's log in a continuous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This lets you pass in objects such as lists or array.arrays or anything else with an append method. It adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the append method instead of PyList_Append or similar. Using arrays lets you reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
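In Python terms, the setup described in the update looks roughly like this; fake_step is a pure-Python stand-in for the C extension, which would append through PyObject_CallMethodObjArgs on each sequence's append method.

from array import array

# One sequence per variable; array("d") stores plain C doubles.
log = {
    "position": array("d"),
    "velocity": array("d"),
}

def fake_step(log, t):
    # Stand-in for the C extension's per-timestep logging.
    log["position"].append(t * 0.001)
    log["velocity"].append(0.001)

for t in range(1000):
    fake_step(log, t)

print(len(log["position"]), log["position"][0], log["position"][-1])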
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, even though it at the same time lets you use C datatypes and python datatypes.
Have you checked out the array module? It lets you store lots of values of a homogeneous scalar type compactly in a single collection.
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve on storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping windows of it in as needed.
To make your C module do the work, instead of actually returning a list, return a custom object that meets the sequence protocol, so when someone calls, say, foo.__getitem__(i), you convert array[i] to a PyFloatObject on the fly.
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
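A minimal sketch of that stream-then-mmap pattern in pure Python (the file name is arbitrary, and in the real project the writing side would live in the C extension):

import mmap
import struct

# Simulation phase: stream doubles to disk instead of holding them in lists.
with open("log.bin", "wb") as f:
    for step in range(100000):
        f.write(struct.pack("d", step * 0.001))

# Post-processing phase: map the file back and read the doubles in place.
with open("log.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast("d")   # zero-copy view of the doubles
    print(len(view), view[0], view[-1])
    view.release()
    mm.close()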
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help with some of this. It won't help much with the "exposing lots of floats to Python" part, but you can move pieces of the Python code into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).