Unable to store huge strings in-memory

Unable to store huge strings in-memory - python

I have data in the following form:
## De
A B C.
## dabc
xyz def ghi.
## <MyName_1>
Here is example.
## Df
A B C.
## <MyName_2>
De another one.
## <MyName_3>
Df next one.
## dabc1
xyz def ghi.
## <MyName_4>
dabc this one.
Convert it into the following form:
A B#1 C. //step 1 -- 1 assigned to the first occurrence of A B C.
xyz def#1 ghi. //1 assigned to first occurrence of xyz def ghi
Here is example
A B#2 C. //step 1 -- 2 assigned in increasing order
B#1 another one. //step 2
B#2 next one.
xyz def ghi.
def#1 this one.
// Here stand for comments and are not are part of the output.
The algorithm is the following.
If the second line following ## gets repeated. Then, append to the
middle word#number, where number is a numeric identifier and is
assigned in increasing order of repetition of the second line.
Replace ##... with word#number where ever it occurs.
Remove all ## where the second line is not getting repeated.
In order to achieve this I am storing all the triples and then finding their occurrences in order to assign numbers in increasing order. Is there some other way to achieve the same in python. Actually my file is 500GB and it is not possible to store all the triples in-memory in order to find their occurrences.

If you need something that's like a dict, but too big to hold in memory, what you need is a key-value database.
The simplest way to do this is with a dbm-type library, which is a very simple key-value database with almost exactly the same interface as a dict, except that it only allows strings for keys and values, and has some extra methods to control persistence and caching and the like. Depending on your platform and how your Python 2.7 was built, you may have any of:
dbm
gdbm
dumbdbm
dbhash
bsddb
bsddb185
bsddb3
PyBSDDB
The last three are all available on PyPI if your Python install doesn't include them, as long as you have the relevant version of libbsddb itself and don't have any problems with its license.
The problem is that, depending on your platform, the various underlying database libraries may not exist (although of course you can download the C library, install it, then build and install the Python wrapper), or may not support databases this big, or may do so but only in a horribly inefficient way (or, in a few cases, in a buggy way…).
Hopefully one of them will work for you, but the only way you'll really know is to test all of the ones you have.
Of course, if I understand properly, you're mapping strings to integers, not to strings. You could use the shelve module, which wraps any dbm-like library to allow you to use string keys but anything picklable as values… but that's huge overkill (and may kill your performance) for a case like this; you just need to change code like this:
counts.setdefault(key, 0)
counts[key] += 1
… into this:
counts.setdefault(key, '0')
counts[key] = str(int(counts[key]) + 1)
And of course you can easily write a wrapper class that does this for you (maybe even one that supports the Counter interface instead of the dict interface).
If that doesn't work, you need a more powerful database.
Most builds of Python come with sqlite3 in the stdlib, but using it will require learning a pretty low-level API, and learning SQL, which is a whole different language that's very unlike Python. (There are also a variety of different relational databases out there, but you shouldn't need any of them.)
There are also a variety of query expression libraries and even full object-relational mappers, like SQLAlchemy (which can be used either way) that let you write your queries in a much more Pythonic way, but it's still not going to be as simple as using a dict or dbm. (That being said, it's not that hard to wrap a dbm-like interface around SQLAlchemy.)
There are also a wide variety of non-relational or semi-relational databases that are generally lumped under the term NoSQL, the simplest of which are basically dbm on steroids. Again, they'll usually require learning a pretty low-level API, and sometimes a query language as well—but some of them will have nice Python libraries that make them easier to use.

Related

How to dynamically store and execute equations and functions

During my current project, I have been receiving data from a set of long-range sensors, which are sending data as a series of bytes. Generally, due to having multiple types of sensors, the bytes structures and data contained are different, hence the need to make the functionality more dynamic as to avoid having to hard-code every single setup in the future (which is not practical).
The server will be using Django, which I believe is irrelevant to the issue at hand but I have mentioned just in case it might have something that can be used.
The bytes data I am receiving looks like this:
b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l'
And my current process looks like this:
Take the first bytes to get the deviceID (deviceID = val[0:6].hex())
Look up the format to be used in the struct.unpack() (here: >BBHBBhBHhHL after removing the first bytes for the id.
Now, the issue is the next step. Many of the datas I have have different forms of per-processing that needs to be done. F.e. some values need to be ran with a join statement (e.g. ".".join(str(values[2]) ) while others need some simple mathematical changes (-113 + 2 * values[4]) and finally, others needs a simple logic check (values[7]==0x80) to return a boolean value.
My question is, what's the best way to code those methods? I would really like to avoid hardcoding them, but it almost seems like the best idea. another idea I saw was to store the functionalities as a string and execute them such as seen here, but I've been reading that its a very bad idea, and that it also slows down execution. The last idea I had was to hardcode some general functions only and use something similar to here, but this doesn't solve the issue of having to hard-code every new sensor-type, which is not realistic in a live-installation. Are there any better methods to achieve the same thing?
I have also looked at here, with the idea that some functionality can be somehow optimized as an equation, but I didn't see that a possibility for every occurrence, especially when any string manipulation is needed at all.
Additionally, is there a possibility of using some maths to apply some basic string manipulation? I can hard-code one string manipulation maybe, but to be honest this whole thing has been bugging me...
Finally, I am considering if I go with the function storing as string then executing, is there a way to set some "security" to avoid any malicious exploitation? Since such a method is... awful insecure to say the least.
However, after almost a week total of searching I am so far unable to find a better solution than storing functions as a string and running eval on them, despite not liking that option. If anyone finds a better option before then, I would be extremely grateful to any tips or ideas.
Appendum: Minimum code that can be used to show-case and test different methods:
import struct
def decode(input):
val = bytearray(input)
deviceID = val[0:6].hex()
del(val[0:6])
print(deviceID)
values = list(struct.unpack('>BBHBBhBHhHL', val))
print(values)
# Now what?
decode(b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l')

Is using strings as an object identifier bad practice?

I am developing a small app for managing my favourite recipes. I have two classes - Ingredient and Recipe. A Recipe consists of Ingredients and some additional data (preparation, etc). The reason i have an Ingredient class is, that i want to save some additional info in it (proper technique, etc). Ingredients are unique, so there can not be two with the same name.
Currently i am holding all ingredients in a "big" dictionary, using the name of the ingredient as the key. This is useful, as i can ask my model, if an ingredient is already registered and use it (including all it's other data) for a newly created recipe.
But thinking back to when i started programming (Java/C++), i always read, that using strings as an identifier is bad practice. "The Magic String" was a keyword that i often read (But i think that describes another problem). I really like the string approach as it is right now. I don't have problems with encoding either, because all string generation/comparison is done within my program (Python3 uses UTF-8 everywhere if i am not mistaken), but i am not sure if what i am doing is the right way to do it.
Is using strings as an object identifier bad practice? Are there differences between different languages? Can strings prove to be an performance issue, if the amount of data increases? What are the alternatives?

No -
actually identifiers in Python are always strings. Whether you keep then in a dictionary yourself (you say you are using a "big dictionary") or the object is used programmaticaly, with a name hard-coded into the source code. In this later case, Python creates the name in one of its automaticaly handled internal dictionary (that can be inspected as the return of globals() or locals()).
Moreover, Python does not use "utf-8" internally, it does use "unicode" - which means it is simply text, and you should not worry how that text is represented in actual bytes.

Python relies on dictionaries for many of its core features. For that reason the pythonic default dict already comes with a quite effective, fast implementation "from factory", decent hash, etc.
Considering that, the dictionary performance itself should not be a concern for what you need (eventual calls to read and write on it), although the way you handle it / store it (in a python file, json, pickle, gzip, etc.) could impact load/access time, etc.
Maybe if you provide a few lines of code showing us how you deal with the dictionary we could provide specific details.
About the string identifier, check jsbueno's answer, he gave a much better explanation then I could do.

How do I compare two nested data structures for unittesting?

For those who know perl, I'm looking for something similar to Test::Deep::is_deeply() in Python.
In Python's unittest I can conveniently compare nested data structures already, if I expect them to be equal:
self.assertEqual(os.walk('some_path'),
my.walk('some_path'),
"compare os.walk with my own implementation")
However, in the wanted test, the order of files in the respective sublist of the os.walk tuple shall be of no concern.
If it was just this one test it would be ok to code an easy solution. But I envision several tests on differently structured nested data. And I am hoping for a general solution.
I checked Python's own unittest documentation, looked at pyUnit, and at nose and it's plugins. Active maintenance would also be an important aspect for usage.
The ultimate goal for me would be to have a set of descriptive types like UnorderedIterable, SubsetOf, SupersetOf, etc which can be called to describe a nested data structure, and then use that description to compare two actual sets of data.
In the os.walk example I'd like something like:
comparison = OrderedIterable(
OrderedIterable(
str,
UnorderedIterable(),
UnorderedIterable()
)
)
The above describes the kind of data structure that list(os.walk()) would return. For comparison of data A and data B in a unit test, the current path names would be cast into a str(), and the dir and file lists would be compared ignoring the order with:
self.assertDeep(A, B, comparison, msg)
Is there anything out there? Or is it such a trivial task that people write their own? I feel comfortable doing it, but I don't want to reinvent, and especially would not want to code the full orthogonal set of types, tests for those, etc. In short, I wouldn't publish it and thus the next one has to rewrite again...

Python Deep seems to be a project to reimplement perl's Test::Deep. It is written by the author of Test::Deep himself. Last development happened in early 2016.
Update (2018/Aug): Latest release (2016/Feb) is located on PyPi/Deep
I have done some P3k porting work on github

Not a solution, but the currently implemented workaround to solve the particular example listed in the question:
os_walk = list(os.walk('some_path'))
dt_walk = list(my.walk('some_path'))
self.assertEqual(len(dt_walk), len(os_walk), "walk() same length")
for ((osw, osw_dirs, osw_files), (dt, dt_dirs, dt_files)) in zip(os_walk, dt_walk):
self.assertEqual(dt, osw, "walk() currentdir")
self.assertSameElements(dt_dirs, osw_dirs, "walk() dirlist")
self.assertSameElements(dt_files, osw_files, "walk() fileList")
As we can see from this example implementation that's quite a bit of code. As we can also see, Python's unittest has most of the ingredients required.

Python ordered garbage collectible dictionary?

I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.
So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one line function like this.
Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.
Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.

If the presence of the __del__ method is problematic for you, just remove it:
>>> import collections
>>> del collections.OrderedDict.__del__
You will gain the ability to use OrderedDicts in a reference cycle. You will lose having the OrderedDict free all its resources immediately upon deletion.

It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.
I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…
But two more points:
rewriting everything to avoid cycles is a huge amount of effort.
The fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.
You still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.
But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.
Meanwhile, the iteration order of dictionaries isn't specified, but in practice, CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the value or the key, and whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.
For example, take this simple program:
d={}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
print(k, d[k], id(k), id(d[k]), hash(k))
If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.
The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)
But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.

Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.

Are there reasons to use get/put methods instead of item access?

I find that I have recently been implementing Mapping interfaces on classes which on the surface fit the model (they are essentially just key-value stores with no more meta-data), but underneath they are sometimes quite complex.
Here are a couple examples of increasing severity:
An object which wraps another mapping converting all objects to strings when set.
An object which uses a local database as a back-end to store the key-value pairs.
An object which makes HTTP requests to remote servers to get/set data.
Lets suppose all of these examples seamlessly implement the Mapping interface, and the only indication that there is something fishy going on is that item access potentially could take a few seconds, and an item may not be retrievable in the same form as stored (if at all). I'm perfectly content with something like the first example, pretty okay with the second, but I'm getting kinda uncomfortable with the last.
The question is, is there a line at which the API for these models should not use item access, even though the underlying structure may feel like it fits on the surface?

It sounds like you are describing the standard anydbm module semantics. And just as anydbm can raise exception anydbm.error, so too could your subclass raise derivatives like MyDbmTimeoutError as needed. Whether you implement it as dictionary operations or function calls, the caller will still have to contend with exceptions anyway (e.g. KeyError, NameError).
I think that arbitrary "tied" hashes exist in Python 2 and 3.x is decent justification for saying it is a reasonable approach. Indeed, I've been looking for (and designing in my head) more complex bindings than simple key ⇒ value mappings without a heavy ORM SQL layer in between.
added: The more I think about it, the more Pythonic tied dictionaries seem to be. A key ⇒ value collection is a dictionary. Whether it lives in core or on disk or across the network is an implementation detail that is best abstracted away. The only substantive differences are increased latency and possible unavailability; however, on a virtual memory based OS, "core" can have higher latency than RAM and in a multiprocessing OS, "core" can become unavailable too. So these are differences in degree only, not kind.

From a strictly philosophical point of view, I don't think that there is a line you can cross with this. If some tool provides the needed functionality, but its API is different, adapt away. The only time you shouldn't do this is if the adapted to API simply is not expressive enough to manipulate the adapted component in a way that is needed.
I wouldn't hesitate to adapt a database into a dict, because that's a great way to manipulate collections, and it's already compatible with a heck of a lot of other code. If I find that my particular application must make calls to the database connections begin(), commit(), and rollback() methods to work right, then a dict won't do, since dict's don't have transaction semantics.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.