Combining -= and += modifiers in buildout scripts - python

This doesn't seem to work:
[buildout]
extends = buildout.cfg
eggs -= python-ldap
eggs += psycopg2
The behaviour always seems to be as though the eggs += psycopg2 line was not present. It doesn't matter which order the two lines are in.
Is this a bug? Is there a way to achieve this result?

Unfortunately, zc.buildout up to version 1.5.2 doesn't support this use-case. Either the addition or the subtraction will succeed.
What happens internally is this:
For each key, value pair defined in the inheriting section:
If the key is using +=, take the inherited value, add things, and store it as the new value.
If the key is using -=, take the inherited value, remove things, and store it as the new value.
After these updates, the inherited section is copied, updated with the new values, and the result is used as the final section.
The ordering is determined by the usual Python mapping semantics and is therefore undefined; either the addition or the subtraction runs last. Because both operations take their input from the inherited section, modify it, and then store the result as the new value, the operation that runs last overwrites the result of the one that ran before.
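The overwrite is easier to see in a rough Python sketch (illustrative names and data only, not buildout's actual code):
# Illustrative sketch only -- not buildout's real implementation.
# Both annotated keys take the *inherited* value as their starting point,
# so whichever one happens to run last silently overwrites the other's result.
inherited = {"eggs": "python-ldap\nsome.other.egg"}
overrides = {"eggs +=": "psycopg2", "eggs -=": "python-ldap"}

result = dict(inherited)
for key, value in overrides.items():       # iteration order is not guaranteed here
    base = inherited["eggs"].split()       # always starts from the inherited value
    if key.endswith("+="):
        base += value.split()
    else:
        base = [egg for egg in base if egg not in value.split()]
    result["eggs"] = "\n".join(base)       # last write wins

print(result["eggs"])   # keeps either the addition or the subtraction, never both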
I've committed a fix for this; I don't have rights to release a new version of buildout to pypi though, I'll have to poke those who do.
Edit: zc.buildout version 1.6 contains this fix.


How do I navigate the Python documentation to find out the syntax for os.environ.get()?

I saw this line in our code, our python version is 3.7.
example.py
PORT = os.environ.get("PORT", 5000)
I need to add an environment variable, but I need a default, and my background isn't Python. My guess is that 5000 above is a default value if there is no PORT in environment variables, but I want to double check this.
So I googled "os.environ.get python docs". This brought me to the os documentation on the Python website, and I had to text-search from there for environ until I found a paragraph dedicated to os.environ.
A mapping object where keys and values are strings that represent the process environment. For example, environ['HOME'] is the pathname of your home directory (on some platforms), and is equivalent to getenv("HOME") in C.
I had hoped for explicit documentation on environ.get but I figured that environ is itself some sort of Python data structure for which I'd need to look up documentation for on how to use. So I clicked the mapping object link. That brought me to another paragraph:
A container object that supports arbitrary key lookups and implements the methods specified in the Mapping or MutableMapping abstract base classes. Examples include dict, collections.defaultdict, collections.OrderedDict and collections.Counter.
At this point I'm a bit at a loss, because I don't think the docs ever said which kind of mapping object os.environ specifically is. So, since it says "implements the methods specified in the Mapping... abstract base classes", I clicked the abstract base classes link.
On that page I didn't see any reference to get(), so now I was confused because I was expecting a list of methods.
An alternative google search of course demonstrated that the second argument to .get() is the default should no such environment variable exist, but I'm curious what I did wrong there. I tried to do it "right" by looking up the information on my own (rtfm), but I failed. What was the actual proper way to use the Python documentation here?
If the docs mention something returning a mapping type, it means it behaves very similarly to the ubiquitous dict type.
https://docs.python.org/3/library/stdtypes.html#dict.get
The os module doc mentions:
This mapping is captured the first time the os module is imported, typically during Python startup as part of processing site.py. Changes to the environment made after this time are not reflected in os.environ, except for changes made by modifying os.environ directly
So, reading the os.environ docs, it seems that, for example, writes to os.environ are reflected in the system environment. A regular dict would of course not do this, so that is one important difference between this custom mapping type and a plain dict.
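Since os.environ is a mapping, the dict.get documentation linked above is exactly what applies; a minimal sketch of that behavior (variable names are just examples):
import os

# A plain dict shows the same .get semantics described in the dict docs:
d = {"HOME": "/home/me"}
print(d.get("HOME", "/tmp"))    # "/home/me" -- key present, default ignored
print(d.get("PORT", 5000))      # 5000       -- key missing, default returned

# os.environ.get works the same way. Note that real environment values are
# always strings, so PORT is the int 5000 only when the variable is unset.
PORT = os.environ.get("PORT", 5000)
print(PORT)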

Equivalent to python's -R option that affects the hash of ints

We have a large collection of python code that takes some input and produces some output.
We would like to guarantee that, given the identical input, we produce identical output regardless of python version or local environment. (e.g. whether the code is run on Windows, Mac, or Linux, in 32-bit or 64-bit)
We have been enforcing this in an automated test suite by running our program both with and without the -R option to python and comparing the output, assuming that would shake out any spots where our output accidentally wound up dependent on iteration over a dict (the most common source of non-determinism in our code).
However, as we recently adjusted our code to also support python 3, we discovered a place where our output depended in part on iteration over a dict that used ints as keys. This iteration order changed in python 3 as compared to python 2, and was making our output different. Our existing tests (all on python 2.7) didn't notice this, because -R doesn't affect the hash of ints. Once found, it was easy to fix, but we would like to have found it earlier.
Is there any way to further stress-test our code and give us confidence that we've ferreted out all places where we end up implicitly depending on something that will possibly be different across python versions/environments? I think that something like -R or PYTHONHASHSEED that applied to numbers as well as to str, bytes, and datetime objects could work, but I'm open to other approaches. I would however like our automated test machine to need only a single python version installed, if possible.
Another acceptable alternative would be some way to run our code with pypy tweaked so as to use a different order when iterating items out of a dict; I think our code runs on pypy, though it's not something we've ever explicitly supported. However, if some pypy expert gives us a way to tweak dictionary iteration order on different runs, it's something we'll work towards.
Using PyPy is not the best choice here, given that it always retains the insertion order in its dicts (using a strategy that also makes dicts use less memory). We could of course make it change the order in which dicts are enumerated, but that defeats the point.
Instead, I'd suggest hacking at the CPython source code to change the way the hash is used inside dictobject.c. For example, after each hash = PyObject_Hash(key); if (hash == -1) { ..error.. }; you could add hash ^= HASH_TWEAK; and compile different versions of CPython with different values for HASH_TWEAK. (I did such a thing at one point, but I can't find it any more. You need to be a bit careful about which hash values are the original ones and which are the modified ones.)
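If rebuilding CPython is too heavy for your test machine, a rough Python-level illustration of the same HASH_TWEAK idea (my own sketch, not a substitute for patching dictobject.c, and only effective on hash-ordered dict implementations such as CPython 2.x and pre-3.7) is to wrap keys in a type whose hash mixes in a per-run tweak:
import os

# Read the tweak from the environment so each test run can use a different value.
HASH_TWEAK = int(os.environ.get("HASH_TWEAK", "0"))

class TweakedKey(object):
    """Key wrapper whose hash (and hence dict bucket) depends on HASH_TWEAK,
    while equality still compares the wrapped value."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, TweakedKey) and self.value == other.value
    def __ne__(self, other):   # needed on Python 2
        return not self.__eq__(other)
    def __hash__(self):
        return hash(self.value) ^ HASH_TWEAK

d = {TweakedKey(i): i for i in range(16)}
# On hash-ordered dicts, iteration order shifts as HASH_TWEAK changes; running
# the suite with several tweak values and diffing the output flushes out code
# that depends on that order.
print([k.value for k in d])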

Maintaining two versions of an ipython notebook

I often need to create two versions of an ipython notebook: One contains tasks to be carried out (usually including some python code and output), the other contains the same text plus solutions. Let's call them the assignment and the solution.
It is easy to generate the solution document first, then strip the answers to generate the assignment (or vice versa). But if I subsequently need to make changes (and I always do), I need to repeat the stripping process. Is there a reasonable workflow that will allow changes in the assignment to be propagated to the solutions document?
Partial self-answer: I have experimented with leveraging mercurial's hg copy, which will let two files with different names share history. But I can only get this to work if assignment and solution are in different directories, in two linked hg repositories. I would much prefer a simpler set-up. I've also noticed that diff gets very confused when one JSON file has more sections than another, making a VCS-based solution even less attractive. (To be clear: Ordinary use of a VCS with notebooks is fine; it's the parallel versions that stumble).
This question covers similar ground, but does not solve my problem. In fact an answer to my question would solve the OP's second remaining problem, "pulling changes" (see the Update section).
It sounds like you are maintaining an assignment and an answer key of some kind and want to be able to distribute the assignments (without solutions) to students, and still have the answers for yourself or a TA.
For something like this, I would create two branches "unsolved" and "solved". First write the questions on the "unsolved" branch. Then create the "solved" branch from there and add the solutions. If you ever need to update a question, update back to the "unsolved" branch, make the update and merge the change into "solved" and fix the solution.
You could try going the other way, but my hunch is that going "backwards" from solved to unsolved might be strange to maintain.
After some experimentation I concluded that it is best to tackle this by processing the notebook's JSON code. Version control systems are not the right approach, for the following reasons:
JSON doesn't diff very well when adding or deleting cells. A minimal change leads to mis-matched braces and a very messy diff.
In my use case, the superset version of the file (containing both the assignments and their solutions) must be the source document. This is because the assignment includes example code and output that depends on earlier parts, to be written by the students. This model does not play well with version control, as pointed out by @ChrisPhillips in his answer.
I ended up filtering the JSON structure for the notebook and stripping out the solution cells; they may be recognized via special metadata (which can be set interactively using the metadata button in the interface), or by pattern-matching on the cell contents. The following snippet shows how to filter out cells whose first line starts with # SOLUTION:
import json
import re

def stripcell(cell, pattern):
    """Check if the first line of the cell's content matches `pattern`"""
    if cell["cell_type"] == "code":
        content = cell["input"]
    else:
        content = cell["source"]
    return len(content) > 0 and re.search(pattern, content[0])

pattern = r"^# SOLUTION:"
struct = json.load(open("input.ipynb"))
cells = struct["worksheets"][0]["cells"]
struct["worksheets"][0]["cells"] = [c for c in cells if not stripcell(c, pattern)]
json.dump(struct, open("output.ipynb", "w"), indent=1)
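If you prefer the metadata route mentioned above, a small variant of the filter works too (building on struct and cells from the snippet; "solution" is just a hypothetical key name that has to match whatever you set through the metadata editor):
def is_solution_cell(cell):
    # "solution" is an example key name, set via the notebook's metadata editor.
    return bool(cell.get("metadata", {}).get("solution", False))

struct["worksheets"][0]["cells"] = [c for c in cells if not is_solution_cell(c)]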
I used the generic json library rather than the notebook API. If there's a better way to go about it, please let me know.

Appengine Search API - Globally Consistent

I've been using the appengine python experimental search API. It works great. With release 1.7.3 I updated all of the deprecated methods. However, I am now getting this warning:
DeprecationWarning: consistency is deprecated. GLOBALLY_CONSIST
However, I'm not sure how to address it in my code. Can anyone point me in the right direction?
This depends on whether or not you have any globally consistent indexes. If you do, then you should migrate all of your data from those indexes to new, per-document-consistent (which is the default) indexes. To do this:
Loop through the documents you have stored in the global index and reindex them in the new index.
Change references from the global index to the new per-document index.
Ensure everything works, then delete the documents from your global index (not necessary to complete the migration, but still a good idea).
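A rough sketch of steps 1 and 3 with the Python search API (the index names are placeholders and the exact calls are an approximation to adapt, not something tested against your setup):
from google.appengine.api import search

old_index = search.Index(name="my-global-index")      # placeholder names
new_index = search.Index(name="my-per-document-index")

# Step 1: copy everything across in batches.
start_id = None
while True:
    batch = list(old_index.get_range(start_id=start_id,
                                     include_start_object=(start_id is None),
                                     limit=100))
    if not batch:
        break
    new_index.put(batch)
    start_id = batch[-1].doc_id

# Step 3, only once everything is verified against the new index:
# old_index.delete([doc.doc_id for doc in some_batch_of_documents])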
You then should remove any mention of consistency from your code; the default is per-document consistent, and eventually we will remove the ability to specify a consistency at all.
If you don't have any data in a globally consistent index, you're probably getting the warning because you're specifying a consistency. If you stop specifying the consistency it should go away.
Note that there is a known issue with the Python API that causes a lot of erroneous deprecation warnings about consistency, so you could be seeing that as well. That issue will be fixed in the next release.

Python ordered garbage collectible dictionary?

I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.
So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one line function like this.
Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.
Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.
If the presence of the __del__ method is problematic for you, just remove it:
>>> import collections
>>> del collections.OrderedDict.__del__
You will gain the ability to use OrderedDicts in a reference cycle. You will lose the guarantee that an OrderedDict frees all its resources immediately upon deletion.
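On an affected 2.7 build, a small sketch like this (illustrative only) shows the difference: with __del__ present the cycle lands in gc.garbage, and once the method is removed the same cycle is collected normally:
import collections
import gc

def make_cycle():
    d = collections.OrderedDict()
    d["self"] = d          # reference cycle through the OrderedDict
    # d goes out of scope here; only the garbage collector can reclaim it

gc.collect()               # start from a clean slate
make_cycle()
gc.collect()
print("uncollectable: %d" % len(gc.garbage))   # non-zero on a broken build

del gc.garbage[:]
del collections.OrderedDict.__del__            # the workaround from above
make_cycle()
gc.collect()
print("uncollectable: %d" % len(gc.garbage))   # 0 -- the cycle is now collectable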
It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.
I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…
But two more points:
Rewriting everything to avoid cycles is a huge amount of effort.
The fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.
You still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.
But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.
Meanwhile, the iteration order of dictionaries isn't specified, but in practice CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the value or the key. Whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.
For example, take this simple program:
d = {}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
    print(k, d[k], id(k), id(d[k]), hash(k))
If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.
The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)
But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.
Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.
