Maintaining two versions of an IPython notebook

I often need to create two versions of an IPython notebook: one contains tasks to be carried out (usually including some Python code and output), the other contains the same text plus the solutions. Let's call them the assignment and the solution.
It is easy to generate the solution document first, then strip the answers to generate the assignment (or vice versa). But if I subsequently need to make changes (and I always do), I need to repeat the stripping process. Is there a reasonable workflow that will allow changes in the assignment to be propagated to the solutions document?
Partial self-answer: I have experimented with leveraging mercurial's hg copy, which will let two files with different names share history. But I can only get this to work if assignment and solution are in different directories, in two linked hg repositories. I would much prefer a simpler set-up. I've also noticed that diff gets very confused when one JSON file has more sections than another, making a VCS-based solution even less attractive. (To be clear: Ordinary use of a VCS with notebooks is fine; it's the parallel versions that stumble).
This question covers similar ground, but does not solve my problem. In fact an answer to my question would solve the OP's second remaining problem, "pulling changes" (see the Update section).

It sounds like you are maintaining an assignment and an answer key of some kind and want to be able to distribute the assignments (without solutions) to students, and still have the answers for yourself or a TA.
For something like this, I would create two branches, "unsolved" and "solved". First write the questions on the "unsolved" branch, then create the "solved" branch from there and add the solutions. If you ever need to update a question, switch back to the "unsolved" branch, make the update, merge the change into "solved", and fix the solution there.
You could try going the other way, but my hunch is that going "backwards" from solved to unsolved might be strange to maintain.
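A sketch of that workflow as Mercurial commands (branch names from this answer; commit messages are placeholders):

hg branch unsolved            # create and switch to the "unsolved" branch
# ... write the questions ...
hg commit -m "add questions"
hg branch solved              # branch off for the solutions
# ... add the solutions ...
hg commit -m "add solutions"

# Later, to change a question:
hg update unsolved
# ... edit the question ...
hg commit -m "update question"
hg update solved
hg merge unsolved             # propagate the change, then fix the solution
hg commit -m "merge updated question"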

After some experimentation I concluded that it is best to tackle this by processing the notebook's JSON code. Version control systems are not the right approach, for the following reasons:
JSON doesn't diff very well when adding or deleting cells. A minimal change leads to mismatched braces and a very messy diff.
In my use case, the superset version of the file (containing both the assignments and their solutions) must be the source document. This is because the assignment includes example code and output that depends on earlier parts, which are to be written by the students. This model does not play well with version control, as pointed out by @ChrisPhillips in his answer.
I ended up filtering the JSON structure for the notebook and stripping out the solution cells; they may be recognized via special metadata (which can be set interactively using the metadata button in the interface), or by pattern-matching on the cell contents. The following snippet shows how to filter out cells whose first line starts with # SOLUTION:
import json
import re

def stripcell(cell, pattern):
    """Check if the first line of the cell's content matches `pattern`."""
    if cell["cell_type"] == "code":
        content = cell["input"]   # the v3 notebook schema stores code under "input"
    else:
        content = cell["source"]
    return len(content) > 0 and re.search(pattern, content[0])

pattern = r"^# SOLUTION:"
struct = json.load(open("input.ipynb"))
cells = struct["worksheets"][0]["cells"]
struct["worksheets"][0]["cells"] = [c for c in cells if not stripcell(c, pattern)]
json.dump(struct, open("output.ipynb", "w"), indent=1)
I used the generic json library rather than the notebook API. If there's a better way to go about it, please let me know.
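For reference, the snippet above targets the old v3 notebook schema, where cells live under worksheets. A hedged sketch of the same filtering with the nbformat package, assuming the notebook converts cleanly to v4 (cells are top-level and a cell's source is a single string):

import re
import nbformat

nb = nbformat.read("input.ipynb", as_version=4)
# Drop cells whose source starts with "# SOLUTION:".
nb.cells = [c for c in nb.cells if not re.match(r"# SOLUTION:", c.source)]
nbformat.write(nb, "output.ipynb")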

Related

What is the best or proper way to allow debugging of generated code?

For various reasons, in one project I generate executable code by building an AST from various source files and then compiling that to bytecode (though the question could also apply where the bytecode is generated directly, I guess).
From some experimentation, it looks like the debugger more or less just uses the lineno information embedded in the AST, alongside the filename passed to compile, to provide a representation for the debugger's purposes; however, this assumes the code being executed comes from a single on-disk file.
That is not necessarily the case for my project: the executable code can be pieced together from multiple sources, and some or all of these sources may have been fetched over the network, or retrieved from non-disk storage (e.g. a database).
And so my Y questions, which may be the wrong ones (hence the background):
is it possible to provide a memory buffer of some sort, or is it necessary to generate a singular on-disk representation of the "virtual source"?
how well would the debugger deal with jumping around between the different bits and pieces if the virtual source can't or should not be linearised?[0]
and just in case, is the assumption of Python only supporting a single contiguous source file correct or can it actually be fed multiple sources somehow?
[0] for instance a web-style literate program would be debugged in its original form, jumping between the code sections, not in the so-called "tangled" form
Some of this can be handled by the trepan3k debugger; for other things, various hooks are in place.
First of all, it can debug from bytecode alone. But of course stepping by line won't be possible if the line number table doesn't exist, and for that reason if no other, I would add a "line number" for each logical stopping point, such as the beginning of each statement. The numbers don't have to be real line numbers; they could just count from 1 or be indexes into some other table. This is more or less how Go's Pos type works.
The debugger will let you set a breakpoint on a function, but that function has to exist, and when you start a Python program most of the functions you define don't exist yet. So the typical way to do this is to modify the source to call the debugger at some point. In trepan3k the lingo for this is:
from trepan.api import debug; debug()
Do that at a point where the functions you want to break on have already been defined.
The functions can also be specified as methods on existing variables, e.g. self.my_function().
One of the advanced features of this debugger is that it will decompile the bytecode to produce source code. There is a command called deparse which will show you the context around where you are currently stopped.
Deparsing bytecode is a bit difficult, though, so depending on which kind of bytecode you have, the results may vary.
As for the virtual source problem, that situation is somewhat tolerated in the debugger already, since the same thing has to happen when there is no source at all. To facilitate this, and remote debugging (where the local and remote file locations can differ), we allow for filename remapping.
Another library, pyficache, is used for this remapping; I believe it can remap contiguous lines of one file onto lines in another file, and I think you could apply that repeatedly. However, so far there hasn't been a need for this, and that code is pretty old, so someone would have to beef up trepan3k here.
Lastly, related to trepan3k is trepan-xpy, a CPython bytecode debugger that can step through bytecode instructions even when the line number table is empty.
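On the "memory buffer" question, one standard-library trick (independent of trepan3k) is worth noting: compile the AST under an invented filename and seed linecache, which pdb and traceback consult when displaying source. A minimal sketch, with a hypothetical virtual filename:

import ast
import linecache

source = "def f(x):\n    return x * 2\n\nresult = f(21)\n"
filename = "<generated:example>"   # hypothetical, need not exist on disk

tree = ast.parse(source, filename=filename)
code = compile(tree, filename, "exec")

# Cache entry format is (size, mtime, lines, fullname); mtime=None
# stops checkcache() from discarding the entry.
linecache.cache[filename] = (len(source), None, source.splitlines(True), filename)

namespace = {}
exec(code, namespace)              # tracebacks/pdb can now show the source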

Splitting a Python script in multiple files sharing the same namespace

Similar questions have been asked for other languages (C++, Clojure, TypeScript, maybe others) but I am still looking for an answer for Python.
There are a lot of similar questions related to the use of import and global in Python but the associated answers don't fit my needs. I just want to split a big file into smaller ones to easily modify/reuse parts of the code in different versions of the same program, without having to deal with different namespaces or managing global variables.
A simple hack to do what I want would be to merge selected Python files at runtime with a script but I hope there is a pythonic way of doing that.
With an illustration, what I am trying to do is to go from several big files which are almost identical:
big_file_v1.py
## First part
# Hundreds of code lines to define things, make computations...
## Second part
# More code to do a few additional things
big_file_v2.py
## First part
# Exactly the same first part as in big_file_v1.py
## Second part
# A lot of small differences compared to big_file_v1.py
...to several smaller files with most of them not needing any modification or only needing modifications that I want to share across all of the different versions:
myprog_v1.py
include first_part_common_to_both_versions.py
include second_part_v1.py
myprog_v2.py
include first_part_common_to_both_versions.py
include second_part_v2.py
In this second case, using include-like commands, I can instantly share the modifications made in first_part_common_to_both_versions.py across version 1 and 2 of my program and I just need to modify/copy the smaller files second_part_v2.py if I want to make new modifications/create another new version.
The question is: how to do this include in Python?
Just to avoid debates on good software development practices, I use Python as a tool to solve scientific questions and as such, I care more about editing comfort than coding practice.
Have a look at exec (or, in Python 2, execfile):

"Python's exec statement is similar to the import statement, with an important difference: The exec statement executes a file in the current namespace. The exec statement doesn't create a new namespace. We'll look at this in the section called “The exec Statement”."
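Note that execfile is gone in Python 3, where exec is a builtin function rather than a statement. A small include-like helper built on exec, using the hypothetical file names from the question:

def include(filename, namespace):
    """Execute `filename` in `namespace` (pass globals() for file scope)."""
    with open(filename) as f:
        exec(compile(f.read(), filename, "exec"), namespace)

# myprog_v1.py
include("first_part_common_to_both_versions.py", globals())
include("second_part_v1.py", globals())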

Where to Store Borrowed Python Code?

Recently, I have been working on a Python project with usual directory structure, and have received help from someone else who has given me a code snippet (a single function definition, about 30 lines long) which I would like to import into my code. What is the most proper directory/location in a Python project to store borrowed code of this size? Is it best to store the snippet into an entirely different module and import it from there?
I generally find it easiest to put such code in a separate file, because for clarity you don't want more than one different copyright/licensing term to apply within a single file. So in Python this does indeed mean a separate module. Then the file can contain whatever attribution and other legal boilerplate you need.
As long as your file headers don't accidentally claim copyright on something to which you do not own the copyright, I don't think it's actually a legal problem to mix externally-licensed or public domain code into files you mostly own. I may be wrong, though, which is why I normally avoid giving myself reason to think about it. A comment saying "this is external code from the following source with the following license:" may well be clearer than dividing code into different files that naturally wouldn't be. So I do occasionally do that.
I don't see any definite need for a separate directory (or package) per separate external source. If that's already part of your project structure (that is, it already uses external libraries by incorporating their source) then I suppose you might as well continue the trend.
I usually place scripts I copy off the internet in a folder/package called borrowed, so I know all of the code in there is stuff that I didn't write myself.
That is, if it's something more substantial than a one- or two-liner demonstrating how something works.
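For illustration, a layout along those lines (all names hypothetical):

myproject/
    mypackage/
        __init__.py
        core.py
    borrowed/
        __init__.py
        snippet_from_colleague.py   # attribution + license notice at the top

The borrowed code is then imported like any other module, e.g. from borrowed.snippet_from_colleague import some_function.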

How to delete almost duplicate files

Edit 2:
Solved, see my answer waaaaaaay below.
Edit:
After banging my head a few times, I almost did it.
Here's my (not cleaned up, you can tell I was troubleshooting a bunch of stuff) code:
http://pastebin.com/ve4Qkj2K
And here's the problem: It works sometimes and other times not so much. For example, it will work perfectly with some files, then leave one of the longest codes instead of the shortest one, and for others it will delete maybe 2 out of 5 duplicates, leaving 3 behind. If it just performed reliably, I might be able to fix it, but I don't understand the seemingly random behavior. Any ideas?
Original Post:
Just so you know, I'm just beginning with python, and I'm using python 3.3
So here's my problem:
Let's say I have a folder with about 5,000 files in it. Some of these files have very similar names, but different contents and possibly different extensions. After a readable name, there is a code, always with a "(" or a "[" (no quotes) before it. The name and code are separated by a space. For example:
something (TZA).blah
something [TZZ].another
hello (YTYRRFEW).extension
something (YJTR).another_ext
I'm trying to keep only one of the something files and delete the others. Another fact which may be important is that there is usually more than one code, such as "something (THTG) (FTGRR) [GTGEES!#!].yet_another_random_extension", all separated by spaces. Although it doesn't matter 100%, it would be best to keep the one with the fewest codes.
I made some (very very short) code to get a list of all files:
import glob

files = glob.glob("*")
but after this I'm pretty much lost. Any help would be appreciated, even if it's just pointing me in the right direction!
I would suggest creating a separate list of the bare file names (with the codes stripped off) and, for each entry, checking whether it occurs anywhere else in the list, i.e. at any index other than the one currently being checked in the loop iteration.
The condition
if str_fragment in name
simply tests whether a string fragment occurs anywhere in a string-type name. It can be useful here as well.
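For what it's worth, a minimal sketch of that approach (hypothetical and untested against the real file set; it prints candidates instead of deleting, so it can be checked first):

import glob
import os
import re

# Group files by the base name before the first " (" or " [".
groups = {}
for path in glob.glob("*"):
    name = os.path.basename(path)
    base = re.split(r" [(\[]", name, maxsplit=1)[0]
    groups.setdefault(base, []).append(path)

for base, paths in groups.items():
    if len(paths) < 2:
        continue
    # Keep the variant with the fewest codes; report the rest.
    paths.sort(key=lambda p: len(re.findall(r"[(\[]", p)))
    for extra in paths[1:]:
        print("would delete:", extra)   # replace with os.remove(extra)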
I got it! The version I ended up with works (99%) perfectly. Although it needs to make multiple passes, reading, analyzing, and deleting over 2 thousand files took about 2 seconds on my pitiful, slow notebook. My final version is here:
http://pastebin.com/i7SE1mh6
The only tiny bug is that if the final item in the list has a duplicate, it will be left behind (though no more than 2 copies remain). That's very simple to correct manually, so I didn't bother to fix it (ain't nobody got time fo that and all).
Hope sometime in the future this could actually help somebody other than me.
I didn't get too many answers here, but it WAS a pretty unusual problem, so thanks anyway. See ya.

How do I compare two nested data structures for unittesting?

For those who know perl, I'm looking for something similar to Test::Deep::is_deeply() in Python.
In Python's unittest I can conveniently compare nested data structures already, if I expect them to be equal:
self.assertEqual(os.walk('some_path'),
                 my.walk('some_path'),
                 "compare os.walk with my own implementation")
However, in the wanted test, the order of files in the respective sublist of the os.walk tuple shall be of no concern.
If it was just this one test it would be ok to code an easy solution. But I envision several tests on differently structured nested data. And I am hoping for a general solution.
I checked Python's own unittest documentation, looked at pyUnit, and at nose and its plugins. Active maintenance would also be an important aspect for usage.
The ultimate goal for me would be to have a set of descriptive types like UnorderedIterable, SubsetOf, SupersetOf, etc which can be called to describe a nested data structure, and then use that description to compare two actual sets of data.
In the os.walk example I'd like something like:
comparison = OrderedIterable(
    OrderedIterable(
        str,
        UnorderedIterable(),
        UnorderedIterable()
    )
)
The above describes the kind of data structure that list(os.walk()) would return. For comparison of data A and data B in a unit test, the current path names would be cast into a str(), and the dir and file lists would be compared ignoring the order with:
self.assertDeep(A, B, comparison, msg)
Is there anything out there? Or is it such a trivial task that people write their own? I feel comfortable doing it, but I don't want to reinvent, and especially would not want to code the full orthogonal set of types, tests for those, etc. In short, I wouldn't publish it and thus the next one has to rewrite again...
Python Deep seems to be a project to reimplement Perl's Test::Deep; it is written by the author of Test::Deep himself. The last development happened in early 2016.
Update (2018/Aug): the latest release (2016/Feb) is located on PyPI (Deep).
I have done some Py3k porting work on GitHub.
Not a solution, but the currently implemented workaround to solve the particular example listed in the question:

os_walk = list(os.walk('some_path'))
dt_walk = list(my.walk('some_path'))

self.assertEqual(len(dt_walk), len(os_walk), "walk() same length")
for (osw, osw_dirs, osw_files), (dt, dt_dirs, dt_files) in zip(os_walk, dt_walk):
    self.assertEqual(dt, osw, "walk() currentdir")
    # assertSameElements was removed in Python 3.3; assertCountEqual
    # is the modern equivalent.
    self.assertSameElements(dt_dirs, osw_dirs, "walk() dirlist")
    self.assertSameElements(dt_files, osw_files, "walk() fileList")
As this workaround shows, it takes quite a bit of code. As we can also see, Python's unittest has most of the ingredients required.
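For completeness, a minimal sketch of the descriptive-comparator idea (hypothetical classes, nowhere near Test::Deep's coverage): a descriptor with a single child applies it to every element, while multiple children are matched positionally.

class UnorderedIterable:
    """Match two iterables ignoring order (elements must be sortable)."""
    def matches(self, a, b):
        return sorted(a) == sorted(b)

class OrderedIterable:
    """Match element-wise; children may be types or other descriptors."""
    def __init__(self, *children):
        self.children = children

    def matches(self, a, b):
        a, b = list(a), list(b)
        if len(a) != len(b):
            return False
        # One child describes every element; otherwise match positionally.
        kids = self.children * len(a) if len(self.children) == 1 else self.children
        return all(match(k, x, y) for k, x, y in zip(kids, a, b))

def match(descriptor, a, b):
    if isinstance(descriptor, type):   # e.g. str: plain equality
        return a == b
    return descriptor.matches(a, b)

# Usage, mirroring the os.walk example:
comparison = OrderedIterable(
    OrderedIterable(str, UnorderedIterable(), UnorderedIterable())
)
# self.assertTrue(match(comparison, list(os.walk(p)), list(my.walk(p))), msg)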
