Get method names and lines from git commit (Python)

I am wondering if there is a way to get the list of method names that were modified, along with the modified lines and file paths, from a git commit, or if there is a better solution for Python files.
I have been going through a few answers on Stack Overflow suggesting edits to the gitconfig file and .gitattributes, but I couldn't get the desired output.

Git has no notion of the semantics of the files it tracks. It can just give you the lines added or removed from a file or the bytes changed in binary files.
How these map to your methods is beyond Git's understanding. However, there might be tools and/or scripts that try to heuristically determine which methods are affected by a commit (or a range of commits).
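For example, here is a rough sketch of such a heuristic in Python (not an existing tool): parse the hunk headers of a zero-context git diff for the last commit to find the changed line numbers, then use the ast module to see which function or method definitions those lines fall into. The commit range, and the assumption that the working tree matches the newer revision, are mine.
import ast
import re
import subprocess

def changed_lines(rev_range="HEAD~1..HEAD"):
    # Map each changed Python file to the set of line numbers added or
    # changed in the newer revision, using the hunk headers of `git diff -U0`.
    out = subprocess.run(
        ["git", "diff", "-U0", rev_range, "--", "*.py"],
        capture_output=True, text=True, check=True).stdout
    changes, path = {}, None
    for line in out.splitlines():
        if line.startswith("+++ b/"):
            path = line[6:]
            changes.setdefault(path, set())
        elif line.startswith("+++ "):
            path = None  # deleted file, no new version to inspect
        elif line.startswith("@@") and path:
            m = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@", line)
            start, count = int(m.group(1)), int(m.group(2) or 1)
            changes[path].update(range(start, start + count))
    return changes

def touched_methods(path, lines):
    # Return (name, first line, last line) of every def whose span overlaps
    # the changed lines; end_lineno needs Python 3.8+.
    tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            end = getattr(node, "end_lineno", node.lineno)
            if any(node.lineno <= n <= end for n in lines):
                hits.append((node.name, node.lineno, end))
    return hits

for path, lines in changed_lines().items():
    print(path, sorted(lines), touched_methods(path, lines))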


How to make Snakemake ignore an updated file when determining which input files have changed? [duplicate]

Even if the output files of a Snakemake build already exist, Snakemake wants to rerun my entire pipeline only because I have modified one of the first input or intermediary output files.
I figured this out by doing a Snakemake dry run with -n, which gave the following report for the updated input file:
Reason: Updated input files: input-data.csv
and this message for updated intermediary files:
reason: Input files updated by another job: intermediary-output.csv
How can I force Snakemake to ignore the file update?
You can use the option --touch to mark them up to date:
--touch, -t
Touch output files (mark them up to date without
really changing them) instead of running their
commands. This is used to pretend that the rules were
executed, in order to fool future invocations of
snakemake. Fails if a file does not yet exist.
Beware that this will touch all your files and thus modify the timestamps to put them back in order.
In addition to Eric's answer, see also the ancient flag to ignore timestamps on input files.
Also note that the Unix command touch can be used to modify the timestamp of an existing file and make it appear older than it actually is:
touch --date='2004-12-31 12:00:00' foo.txt
ls -l foo.txt
-rw-rw-r-- 1 db291g db291g 0 Dec 31 2004 foo.txt
In case --touch didn't work out as expected (the official documentation says it needs to be combined with --force, --forceall or --forcerun in order to force the "touch" if it doesn't work by itself), ancient is not an option or would require changing too much of the workflow file, or you ran into https://github.com/snakemake/snakemake/issues/823 (that's what happened to me when I tried --force and --force*), here is what I did to solve the problem:
I noticed that there were jobs that shouldn't be running, since I had already put files in the expected paths.
I identified the input and output files of the rules that I didn't want to run.
Following the order in which the rules would be executed, I ran touch on each rule's input files and then on its output files (taking the order of the rules into account!).
That's it. Since the timestamps are now consistent with the rule order and with the input and output files, snakemake will not detect any "updated" files.
This is the manual method, and I think it is the last resort if the methods mentioned in the other answers don't work or are not an option; a small sketch follows below.
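For illustration, a minimal Python sketch of that manual procedure; the rule order and the file names are made-up assumptions, not part of any real workflow:
import time
from pathlib import Path

# Hypothetical rules listed in execution order, each with its inputs and outputs.
rules_in_order = [
    {"inputs": ["input-data.csv"], "outputs": ["intermediary-output.csv"]},
    {"inputs": ["intermediary-output.csv"], "outputs": ["final-report.csv"]},
]

for rule in rules_in_order:
    for path in rule["inputs"]:
        Path(path).touch()   # inputs first...
    for path in rule["outputs"]:
        Path(path).touch()   # ...then outputs, so outputs end up newer
    time.sleep(1)            # keep downstream timestamps strictly later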

Using black with git clean filter

I'm trying to set up black to run on any files checked into git.
I've set it up as follows:
git config filter.black.clean 'black -'
echo '*.py filter=black' >> .git/info/attributes
As far as I understand, this should work: black with - as the source path reads from STDIN and writes to STDOUT, which is what I think a git filter needs it to do.
However, this does not work. When I add an un-black file with git add I see the following output:
reformatted -
All done! ✨ 🍰 ✨
1 file reformatted.
And the file is not changed on disk. What am I doing wrong?
The Black documentation recommends using a pre-commit hook, rather than smudge and clean filters. Note that filter.black.clean defines a clean filter and you have not set up any smudge filters.
The reason you see no change to the work-tree version of the file is that a clean filter is used when turning the work-tree version of a file into the index (to-be-committed) version of the file. This has no effect on the work-tree version of the file!
A smudge filter is used in the opposite direction: Git has a file that is in the index—for whatever reason, such as because it was just copied into the index as part of the git checkout operation to switch to a specific commit—and desires to convert that in-the-index, compressed, Git-ized file to one that you can actually see and edit in your editor, or run with python. Git will, at this time, run the (de-compressed) file contents through your smudge filter.
Note that if you convert some file contents, pre-compression, in a clean filter, and then later extract that file from the repository, into the index, and on into your work-tree, you will at that time be able to see what happened in the clean filter (assuming you do not have a countervailing smudge filter that undoes the effect).
In a code-reformatting world, one could conceivably use a clean filter to turn all source files into some sort of canonical (perhaps four-space-indentation) form, and a smudge filter to turn all source files into one's preferred format (two-space or eight-space indentation). If these transformations are all fully reversible, what you would see, in your work-tree, would be in your preferred format, and what others would see in their work-trees would be their preferred format; but what the version control system itself would see would be the canonical, standardized format.
That's not how Black is actually intended to be used, though it probably can be used that way.
It's not obvious how to set this up manually, but using the pre-commit framework seems to work well.
This approach does two things:
checks prior to committing that the files pass the black test
runs black to fix the files if they don't
So if files don't pass, the commit fails, and black fixes the files, which you then need to git add before trying to commit again.
In a separate test, I managed to set up the pre-commit hook manually using black . --check in .git/hooks/pre-commit (this just does the check - doesn't fix anything if it fails), but never figured out how to configure black to work with the clean and smudge filters.
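As an illustration of the manual route, here is a rough sketch of a .git/hooks/pre-commit script written in Python (it must be made executable). It only runs black --check on the staged .py files and aborts the commit if any would be reformatted; this is my own assumption of how such a hook could look, not Black's documented setup:
#!/usr/bin/env python3
import subprocess
import sys

# Collect the staged Python files (added, copied or modified).
staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
    capture_output=True, text=True, check=True).stdout.split()
py_files = [f for f in staged if f.endswith(".py")]

if py_files and subprocess.run(["black", "--check", *py_files]).returncode != 0:
    print("black would reformat the files above; run black, git add, and retry.")
    sys.exit(1)  # non-zero exit aborts the commit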

File update: multiple versions stored inside the ZIP archive

Let's say we have a test.zip file and we update a file:
import zipfile

zfh = zipfile.ZipFile("test.zip", mode="a")
zfh.write("/home/msala/test.txt")
zfh.close()
Repeating this "update" a few times and using the built-in method printdir(),
I see that the archive stores not only the latest "test.txt" but also all the previous copies of the file.
Ok, I understand the zipfile library doesn't have a delete method.
Questions:
if I call the builtin method extract("/home/msala/test.txt"),
which copy of the file is extracted and written to the file system ?
inside the zip archive, is there any flag telling that old copies .. are old copies, superseded by the last one ?
At the moment I list all the stored files and sort them by filename, last modification time...
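For reference, that manual listing looks roughly like this (a sketch, using the test.zip archive from above):
import zipfile

with zipfile.ZipFile("test.zip") as zfh:
    for info in sorted(zfh.infolist(), key=lambda i: (i.filename, i.date_time)):
        print(info.filename, info.date_time)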
The tl;dr is no, you can't do this without building a bit of extra info—but that can be done without sorting, and, even if you did have to sort, the performance cost would be irrelevant.
First, let me explain how zipfiles work. (Even if you understand this, later readers with the same problem may not.)
Unfortunately, the specification is a copyrighted and paywalled ISO document, so I can't link to it or quote it. The original PKZip APPNOTE.TXT, which serves as the de facto standard, is available, however. And numerous sites like Wikipedia have nice summaries.
A zipfile is 0 or more fragments, followed by a central directory.
Fragments are just treated as if they were all concatenated into one big file.
The body of the file can contain zip entries, in arbitrary order, along with anything you want. (This is how DOS/Windows self-extracting archives work—the unzip executable comes at the start of the first fragment.) Anything that looks like a zip entry, but isn't referenced by the central directory, is not treated as a zip entry (except when repairing a corrupted zipfile.)
Each zip entry starts with a header that gives you the filename, compression format, etc. of the following data.
The directory is a list of directory entries that contain most of the same information, plus a pointer to where to find the zip entry.
It's the order of directory entries that determines the order of the files in the archive.
if I call the builtin method extract("/home/msala/test.txt"), which copy of the file is extracted and written to the file system ?
The behavior isn't really specified anywhere.
Extracting the whole archive should extract both files, in the order present in the zip directory (the same order given by infolist), with the second one overwriting the first.
But extracting by name doesn't have to give you both—it could give you the last one, or the first, or pick one at random.
Python gives you the last. The way this works is that, when reading the directory, it builds a dict mapping filenames to ZipInfos, just adding them as encountered, so the last one will overwrite the previous ones. (Here's the 3.7 code.) Whenever you try to access something by filename, it just looks up the filename in that dict to get the ZipInfo.
But is that something you want to rely on? I'm not sure. On the one hand, this behavior has been the same from Python 1.6 to 3.7, which is usually a good sign that it's not going to change, even if it's never been documented. On the other hand, there are open issues—including #6818, which is intended to add deletion support to the library one way or another—that could change it.
And it's really not that hard to do the same thing yourself. With the added benefit that you can use a different rule—always keep the first, always keep the one with the latest mod time, etc.
You seem to be worried about the performance cost of sorting the infolist, which is probably not worth worrying about. The time it takes to read and parse the zip directory is going to make the cost of your sort virtually invisible.
But you don't really need to sort here. After all, you don't want to be able to get all of the entries with a given name in some order, you just want to get one particular entry for each name. So, you can just do what ZipFile does internally, which takes only linear time to build, and constant time each time you search it. And you can use any rule you want here.
entries = {}
for entry in zfh.infolist():
    if entry.filename not in entries:
        entries[entry.filename] = entry
This keeps the first entry for any name. If you want to keep the last, just remove the if. If you want to keep the latest by modtime, change the condition to if entry.filename not in entries or entry.date_time > entries[entry.filename].date_time:. And so on.
Now, instead of relying on what happens when you call extract("home/msala/test.txt"), you can call extract(entries["home/msala/test.txt"]) and know that you're getting the first/last/latest/whatever file of that name.
inside the zip archive, is there any flag telling that old copies .. are old copies, superseded by the last one ?
No, not really.
The way to delete a file is to remove it from the central directory. Which you do just by rewriting the central directory. Since it comes at the end of the zipfile, and is almost always more than small enough to fit on even the smallest floppy, this was generally considered fine even back in the DOS days.
(But notice that if you unplug the computer in the middle of it, you've got a zipfile without a central directory, which has to be rebuilt by scanning all of the file entries. So, many newer tools will instead, at least for smaller files, rewrite the whole file to a tempfile then rename it over the original, to guarantee a safe, atomic write.)
At least some early tools would sometimes, especially for gigantic archives, rewrite the entry's pathname's first byte with a NUL. But this doesn't really mark the entry as deleted, it just renames it to "\0ome/msala/test.txt". And many modern tools will in fact treat it as meaning exactly that and give you weird errors telling you they can't find a directory named 'ome' or '' or something else fun. Plus, this means the filename in the directory entry no longer matches the filename in the file entry header, which will cause many modern tools to flag the zipfile as corrupted.
At any rate, Python's zipfile module doesn't do either of these, so you'd need to subclass ZipFile to add the support yourself.
I solved it this way, similar to how database records are managed.
When adding a file to the archive, I look for previously stored copies (same filename).
For each of them, I set their "comment" field to a specific marker, for example "deleted".
Then I add the new file, with an empty comment.
Whenever we like, we can "vacuum": shrink the zip archive using the usual tools (under the hood a new archive is created, discarding the files whose comment is set to "deleted").
This way, we also get a simple "versioning".
We keep all the previous copies of the files, until the vacuum.
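A minimal sketch of the "vacuum" step, assuming the superseded copies have already had their comment set to "deleted" (the file names here are illustrative):
import zipfile

def vacuum(src_path, dst_path):
    # Copy every entry not marked "deleted" into a fresh archive,
    # preserving each entry's metadata (ZipInfo.comment is bytes).
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            if info.comment == b"deleted":
                continue
            dst.writestr(info, src.read(info))

vacuum("test.zip", "test-vacuumed.zip")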

How to ignore hidden files when using os.stat() results in Python?

I'm trying to get the time of last modification (os.stat.st_mtime) of a particular directory. My issue is I have added a few metadata files that are hidden (they start with .). If I use os.stat(directory).st_mtime I get the date at which I updated the metadata file, not the date that a non-hidden file was modified in the directory. I would like to get the most recent time of modification for all of the other files in the directory other than the hidden metadata files.
I figure it's possible to write my own function, something along the lines of:
modified_times = []
for name in os.listdir(folder):
    if not name.startswith('.'):
        modified_times.append(os.path.getmtime(os.path.join(folder, name)))
last_time = max(modified_times)
However, is it possible to do this natively in python? Or do I need to write my own function like the pseudocode above (or something like this question)?
Your desired outcome is impossible. The most recent modification time of all non-hidden files doesn't necessarily correspond to the virtual "last modified time of a directory ignoring hidden files". The problem is that directories are modified when files are moved in and out of them, but the file timestamps aren't changed (the file was moved, but not modified). So your proposed solution is at best a heuristic; you can hope it's correct, but there is no way to be sure.
In any event, no, there is no built-in that provides this heuristic. The concept of hidden vs. non-hidden files is OS and file system dependent, and Python provides no built-in API that cares about the distinction. If you want to make a "last_modified_guess" function, you'll have to write it yourself (I recommend basing it on os.scandir for efficiency).
Something as simple as:
last_time = max(entry.stat().st_mtime for entry in os.scandir(somedir) if not entry.name.startswith('.'))
would get you the most recent last modified time (in seconds since the epoch) of your non-hidden directory entries.
Update: On further reflection, the glob module does treat a . prefix as meaning "hidden", so you could use glob.glob/glob.iglob on os.path.join(somedir, '*') to have it filter out the "hidden" files for you. That said, by doing so you give up some of the potential benefits of os.scandir (free or cached stat results, free type checks, etc.), so if all you need is "hidden" filtering, it's not worth giving that up when a simple .startswith('.') check will do.
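For completeness, the glob-based version could look roughly like this (somedir as above; glob's * skips dot-prefixed names by default):
import glob
import os

last_time = max(os.path.getmtime(p) for p in glob.iglob(os.path.join(somedir, '*')))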

Is it possible to generate a single .pot file from Sphinx documentation?

I'm undertaking the task of i18n / l10n of the documentation of a largish project. The documentation is done with Sphinx, which has off-the-shelf basic support for i18n.
My problem is similar to that of this other question: namely, the fact that a large chunk of the strings in each pot file is the same, and I would like my translators not to re-enter the same translation over and over again. I would rather have a single template file.
My problem is not really merging the files (that is just a msgcat *.pot > all.pot away), but rather the fact that - for the domains to work when building the documentation in a particular language - I have to copy and rename all.pot back to the original file names. So my workaroundish way of working is:
Generate fileA.pot, fileB.pot
Merge the two into all.pot
cp all.pot fileA.pot + cp all.pot fileB.pot
Is there a cleaner way to do the same? gettext_compact brings me only half-way through my goal...
After over 7 months, extensive research and a couple of attempted answers, it seems safe to say that no - at least with the current version 1.1.3 - it is not possible to generate a single .pot file from Sphinx documentation.
The workaround described in the original question is also the most straightforward for automatically removing duplicate strings when merging the different .pot files into a single one.
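For concreteness, the workaround can be scripted; here is a rough Python sketch, where the _build/gettext path and the file names are assumptions about a typical Sphinx layout:
import glob
import shutil
import subprocess

pot_files = glob.glob("_build/gettext/*.pot")                        # e.g. fileA.pot, fileB.pot
subprocess.run(["msgcat", "-o", "all.pot", *pot_files], check=True)  # merge and de-duplicate
for pot in pot_files:
    shutil.copy("all.pot", pot)                                      # copy all.pot back to each original name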
I see two possibilities here:
One is to hack Sphinx to not use domains at all… which I guess you won’t want to do for several reasons.
The other is: since you split by domains, the -M option of msggrep looks like it does what you need to do. Otherwise, the -N option can still be used to filter by source file(s).
That is, if the obvious solution doesn't work:
generate fileA.pot, fileB.pot (also look at msguniq here)
merge them into all.pot and give that to translators
for x in file*.pot; do msgmerge -o ${x%t} all-translated.po $x; done
According to TFM, msgmerge only takes the (supposedly newer) translations from its first argument (the translated po file) that match the up-to-date source locations of the second argument (po file with old translations, or pot file = template with only empty msgstrings).
Add the following to your conf.py:
gettext_compact = "docs"
