I have been looking into YAML and the Python parsing options with PyYAML. I kind of understand how it works but still have a question regarding the process:
Is it possible to directly update an item inside the YAML file without parsing the whole file, creating a dictionary for everything, operating on that dictionary and then dumping it back?
HOUSE:
- white
APPLE:
- red
BANANA:
- yellow
Let's say I want to make the APPLE "green". Is that possible by operating only on the APPLE entry, without working on the whole dictionary?
Thanks.
OK, I think Roman is right. I was asking this because I was worried about the overhead of complex YAML objects. But I guess if things become complex, one should switch to a database solution, like MongoDB or the like. As long as the YAML is kept simple, the serialisation and de-serialisation overhead should not be a huge issue.
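For reference, the full parse-modify-dump round trip the thread settles on looks roughly like this with PyYAML (the file name is made up):

import yaml  # PyYAML

with open("fruits.yaml") as f:            # hypothetical file name
    data = yaml.safe_load(f)              # the whole document becomes a dict

data["APPLE"] = ["green"]                 # change only the APPLE entry in memory

with open("fruits.yaml", "w") as f:
    yaml.safe_dump(data, f, default_flow_style=False)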
Related
I am developing a small app for managing my favourite recipes. I have two classes - Ingredient and Recipe. A Recipe consists of Ingredients and some additional data (preparation, etc.). The reason I have an Ingredient class is that I want to save some additional info in it (proper technique, etc.). Ingredients are unique, so there cannot be two with the same name.
Currently I am holding all ingredients in a "big" dictionary, using the name of the ingredient as the key. This is useful, as I can ask my model whether an ingredient is already registered and use it (including all its other data) for a newly created recipe.
But thinking back to when I started programming (Java/C++), I always read that using strings as identifiers is bad practice. "The magic string" was a phrase I read often (though I think that describes a different problem). I really like the string approach as it is right now. I don't have problems with encoding either, because all string generation/comparison is done within my program (Python 3 uses UTF-8 everywhere, if I am not mistaken), but I am not sure whether what I am doing is the right way to do it.
Is using strings as object identifiers bad practice? Are there differences between languages? Can strings become a performance issue as the amount of data increases? What are the alternatives?
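For context, here is a minimal sketch of the registry pattern described in the question; all class, attribute, and function names are made up:

class Ingredient:
    def __init__(self, name, technique=""):
        self.name = name
        self.technique = technique        # extra info such as the proper technique

class Recipe:
    def __init__(self, title, ingredients, preparation=""):
        self.title = title
        self.ingredients = ingredients    # list of Ingredient objects
        self.preparation = preparation

ingredients = {}                          # the "big" dictionary, keyed by ingredient name

def get_or_create(name):
    # Reuse the registered Ingredient if the name already exists.
    if name not in ingredients:
        ingredients[name] = Ingredient(name)
    return ingredients[name]

pancakes = Recipe("Pancakes", [get_or_create("flour"), get_or_create("milk")])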
No -
actually, identifiers in Python are always strings, whether you keep them in a dictionary yourself (you say you are using a "big dictionary") or the object is used programmatically, with a name hard-coded into the source code. In the latter case, Python creates the name in one of its automatically handled internal dictionaries (which can be inspected as the return value of globals() or locals()).
Moreover, Python does not use "utf-8" internally; it uses "unicode" - which means it is simply text, and you should not worry about how that text is represented in actual bytes.
Python relies on dictionaries for many of its core features. For that reason the plain Python dict already comes with a quite effective, fast implementation "from the factory", a decent hash, etc.
Considering that, the performance of the dictionary itself should not be a concern for what you need (occasional calls to read and write to it), although the way you handle/store it (in a Python file, JSON, pickle, gzip, etc.) could impact load/access time.
Maybe if you provide a few lines of code showing how you deal with the dictionary, we could give more specific details.
About the string identifier, check jsbueno's answer; he gave a much better explanation than I could.
I'm looking for a solution to store a huge (too big for RAM) array-like/list-like object on the HDD. So, basically I'm looking for a key-value database with:
- integer (not string!) keys
- the ability to store Python objects (lists of tuples). Appended objects will never be changed. There is no relation between objects in the array.
- low memory usage (no caching). If I need to load the 35235235th object, I want to load only it.
So, I could use SQLite and blobs, but I'm looking for something more elegant and very fast.
Sorry for my bad English. I'm using Python 3.
You could have a look at ZODB (http://www.zodb.org), which is a mature Python DB project. It has the BTrees package, which provides the IOBTree - that might be what you are looking for.
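A minimal sketch of what that could look like (the file name and sample data are made up); the IOBTree maps integer keys to pickleable values and loads data in small buckets rather than all at once:

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.IOBTree import IOBTree
import transaction

db = DB(FileStorage("data.fs"))           # on-disk storage
root = db.open().root()

if "items" not in root:
    root["items"] = IOBTree()             # integer-keyed BTree

root["items"][35235235] = [(1, 2), (3, 4)]  # a list of tuples, pickled transparently
transaction.commit()

value = root["items"][35235235]           # fetches only the relevant bucket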
I need to save a dictionary to a file. The dictionary contains strings, integers, and other dictionaries.
I did it on my own, but the result is not pretty or user friendly.
I know about pickle, but as far as I know it is not safe to use, because if someone replaces the file, then when I (or someone else) run the program that loads it, the replaced file could make it do arbitrary things. It's just not safe.
Is there another function or module that does this?
Pickle is not safe when transferred by an untrusted third party. Local files are just fine, and if something can replace files on your filesystem then you have a different problem.
That said, if your dictionary contains nothing but string keys and the values are nothing but Python lists, numbers, strings or other dictionaries, then use JSON, via the json module.
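A minimal example of that round trip (the file name and sample data are made up):

import json

data = {"name": "tomato", "count": 3, "extra": {"colour": "red"}}

with open("data.json", "w") as f:         # plain text, nothing executable inside
    json.dump(data, f)

with open("data.json") as f:
    restored = json.load(f)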
Presuming your dictionary contains only basic data types, the normal answer is JSON; it's a popular, well-defined format for this kind of thing.
If your dictionary contains more complex data, you will have to manually serialise at least part of it.
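For instance, json.dumps accepts a default callback for types it does not handle natively; a hedged sketch, with a made-up handler and sample data:

import json

def encode_extra(obj):
    # Represent sets as sorted lists so json can emit them; extend as needed.
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError("Cannot serialise " + type(obj).__name__)

payload = {"ingredients": {"salt", "flour"}}
text = json.dumps(payload, default=encode_extra)

Note that loading goes the other way: you would need a matching convention (for example via an object_hook) if you want sets back instead of lists.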
JSON is not quite the Python way, for several reasons:
It can't wrap/unwrap all Python data types: there's no support for sets or tuples.
Not fast enough because it needs to deal with textual data and encodings.
Try to use sPickle instead.
I've seen a lot of similar questions to this, but nothing that really matched. Most other questions seemed to relate to speed. What I'm experiencing is a single JSON dictionary that sits in a 1.1 GB file on my local box taking up all of my 16 gigabytes of memory when I try to load it using anything along the lines of:
import json

f = open(some_file, "rb")
new_dictionary = json.load(f)
This happens regardless of what json library I use (I've tried ujson, json, yajl), and regardless of whether I read things in as a byte stream or not. This makes absolutely no sense to me. What's with the crazy memory usage, and how do I get around it?
In case it helps, the dictionary is just a bunch of nested dictionaries all having ints point to other ints. A sample looks like:
{"0":{"3":82,"4":503,"15":456},"956":{"56":823,"678":50673,"35":1232}...}
UPDATE: When I run this with simplejson, it actually only takes up 8 gigs. No idea why that one takes up so much less than all the others.
UPDATE 2: So I did some more investigation. I loaded up my dictionary with simplejson, and tried converting all the keys to ints (per Liori's suggestion that strings might take up more space). Space stayed the same at 8 gigs. Then I tried Winston Ewert's suggestion of running a gc.collect(). Space still remained at 8 gigs. Finally, annoyed and curious, I pickled my new data structure, exited Python, and reloaded. Lo and behold, it still takes up 8 gigs. I guess Python just wants that much space for a big 2d dictionary. Frustrating, for sure, but at least now I know it's not a JSON problem so long as I use simplejson to load it.
You could try with a streaming API:
http://lloyd.github.com/yajl/
of which there are a couple of python wrappers.
https://github.com/rtyler/py-yajl/
https://github.com/pykler/yajl-py
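As an illustration of the streaming idea (using the ijson package here rather than those wrappers, since their exact APIs differ), events are handed to you one at a time, so the whole dictionary never has to exist in memory:

import ijson                               # one streaming JSON parser for Python

total = 0
with open("big.json", "rb") as f:          # hypothetical file name
    for prefix, event, value in ijson.parse(f):
        # Events arrive one by one ("start_map", "map_key", "number", ...),
        # so only a small window of the document is in memory at any moment.
        if event == "number":
            total += value                 # consume values instead of storing them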
A little experimentation on my part suggests that calling gc.collect() after the json object has been parsed drops memory usage to where it was when the object was originally constructed.
Here are the results I get for memory usage on a smaller scale:
Build. No GC                           762912
Build. GC                              763000
Standard Json. Unicode Keys. No GC     885216
Standard Json. Unicode Keys. GC        744552
Standard Json. Int Keys. No GC         885216
Standard Json. Int Keys. GC            744724
Simple Json. Unicode Keys. No GC       894352
Simple Json. Unicode Keys. GC          745520
Simple Json. Int Keys. No GC           894352
Simple Json. Int Keys. GC              744884
Basically, running gc.collect() appears to clean up some sort of garbage produced during the JSON parsing process.
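In the question's terms, the pattern is simply an explicit collection right after the load; a minimal sketch (file name made up):

import gc
import json

with open("big.json", "rb") as f:
    new_dictionary = json.load(f)

# The parse leaves behind temporary objects that plain reference counting has
# not reclaimed; an explicit collection appears to release them.
gc.collect()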
I can't believe I'm about to say this, but JSON is actually a very simple format; it wouldn't be too difficult to build your own parser.
That said, it would only make sense if:
You don't need the full dictionary at the end (i.e., you can consume the data as you read it)
You have a good idea what sort of structure the data is in (an arbitrarily deep dictionary would make this much more difficult)
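If both conditions hold, a consume-as-you-read sketch for the specific shape shown in the question (an object whose values are themselves flat objects) could look like the following; the helper name is made up and it assumes the raw text at least fits in memory:

import json
import re

decoder = json.JSONDecoder()
outer_key = re.compile(r'"(-?\d+)"\s*:\s*')

def iter_outer_items(text):
    # Assumes the top level is an object whose values are objects, as in
    # {"0": {"3": 82, ...}, "956": {...}, ...}; yields one inner dict at a time.
    idx = text.index("{") + 1                        # step inside the outer brace
    while True:
        m = outer_key.search(text, idx)              # find the next outer key
        if not m:
            break
        inner, idx = decoder.raw_decode(text, m.end())   # parse just one inner object
        yield int(m.group(1)), inner                 # caller consumes it, then it can be freed

Each yielded inner dict can be processed and dropped, so peak memory stays close to the size of the raw text plus a single inner dict.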
Gabe really figured this out in a comment, but since it's been a few months and he hasn't posted it as an answer, I figured I should just answer my own question, so posterity sees that there is an answer.
Anyway, the answer is that a 2D dictionary just takes up that much space in Python. Each one of those dictionaries winds up with some space overhead, and since there are a lot of them, it balloons up from 1.1 GB to 8 GB, and there's nothing you can do about it except try a different data structure or get more RAM.
Say you have some metadata for a custom file format that your Python app reads - something like a CSV with variables that can change as the file is manipulated:
var1,data1
var2,data2
var3,data3
So if the user can manipulate this metadata, do you have to worry about someone crafting a malformed metadata file that will allow arbitrary code execution? The only thing I can imagine is if you made the poor choice to let var1 be a shell command that you execute with os.system(data1) somewhere in your own code. Also, if this were C you would have to worry about buffer overflows, but I don't think you have to worry about that with Python. If you're reading that data in as a string, is it possible to somehow escape the string with something like "\n os.system('rm -r /')"? This SQL-injection-like example totally won't work, but is something similar possible?
If you are doing what you say there (plain text, just reading and parsing a simple format), you will be safe. As you indicate, Python is generally safe from the more mundane memory-corruption errors that C developers can create if they are not careful. The SQL-injection-like scenario you note is not a concern when simply reading in files in Python.
However, if you are concerned about security, which it seems you are (interjection: good for you! A good programmer should be lazy and paranoid), here are some things to consider:
Validate all input. Make sure that each piece of data you read is of the expected size, type, range, etc. Error early, and don't propagate tainted variables elsewhere in your code.
Do you know the expected names of the vars, or at least their format? Make sure to validate that each one is the kind of thing you expect before you use it. If it should be just letters, confirm that with a regex or similar (see the sketch after this list).
Do you know the expected range or format of the data? If you're expecting a number, make sure it's a number before you use it. If it's supposed to be a short string, verify the length; you get the idea.
What if you get characters or bytes you don't expect? What if someone throws unicode at you?
If any of these are paths, make sure you canonicalize and know that the path points to an acceptable location before you read or write.
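A hedged sketch of the kind of checks described above; the patterns, limits, and directory are all made up:

import os
import re

VAR_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,31}$")   # letters/underscores only
ALLOWED_DIR = os.path.realpath("/var/lib/myapp/data")      # made-up base directory

def validate_entry(name, value):
    if not VAR_NAME.match(name):
        raise ValueError("unexpected variable name: %r" % name)
    number = int(value)                    # raises ValueError if it is not a number
    if not 0 <= number <= 10000:
        raise ValueError("value out of range: %d" % number)
    return name, number

def validate_path(user_path):
    real = os.path.realpath(user_path)     # canonicalize before comparing
    if not real.startswith(ALLOWED_DIR + os.sep):
        raise ValueError("path escapes the allowed directory: %r" % user_path)
    return real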
Some specific things not to do:
os.system(attackerControlledString)
eval(attackerControlledString)
__import__(attackerControlledString)
pickle/unpickle attacker controlled content (here's why)
Also, rather than rolling your own config file format, consider ConfigParser or something like JSON. A well understood format (and libraries) helps you get a leg up on proper validation.
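For instance, with the standard library's configparser (the section and file names here are made up), the file is only parsed, never executed:

import configparser

config = configparser.ConfigParser()
config.read("metadata.ini")
var1 = config.get("variables", "var1", fallback="")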
OWASP would be my normal go-to for providing a "further reading" link, but their Input Validation page needs help. In lieu, this looks like a reasonably pragmatic read: "Secure Programmer: Validating Input". A slightly dated but more python specific one is "Dealing with User Input in Python"
Depends entirely on the way the file is processed, but generally this should be safe. In Python, you have to put in some effort if you want to treat text as code and execute it.