Is there an easy way to match files against .gitignore rules? - python

I'm writing a git pre-commit hook in Python, and I'd like to define a blacklist like a .gitignore file to check files against before processing them. Is there an easy way to check whether a file is matched by a set of .gitignore rules? The rules are kind of arcane, and I'd rather not have to reimplement them.

Assuming you're in the directory containing the .gitignore file, one shell command will list all the files that are not ignored:
git ls-files
From Python you can simply call:
import os
os.system("git ls-files")
and you can extract the list of files like so:
import subprocess
list_of_files = subprocess.check_output("git ls-files", shell=True).splitlines()
If you want to list the untracked files instead, add the option --others (note that untracked is not the same as ignored; listing only the ignored files takes a couple of extra flags, as sketched below):
git ls-files --other
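From Python, a minimal sketch of both variants (the --ignored/--exclude-standard combination restricts the output to files that the ignore rules actually match):
import subprocess

# untracked files (everything not in the index)
untracked = subprocess.check_output(
    ['git', 'ls-files', '--others']).splitlines()

# only the untracked files that are matched by the ignore rules
ignored = subprocess.check_output(
    ['git', 'ls-files', '--others', '--ignored', '--exclude-standard']
).splitlines()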

This is rather clunky, but should work:
1. create a temporary git repository
2. populate it with your proposed .gitignore
3. also populate it with one file per pathname
4. use git status --porcelain on the resulting temporary repository
5. empty it out (remove it entirely, or preserve it as empty for the next pass, whichever seems more appropriate)
This does, however, smell like an XY problem. The clunky solution to Y is probably a poor solution to the real problem X.
Post-comment answer with details (and side notes)
So, you have some set of files to lint, probably from inspecting the commit. The following code may be more generic than you need (we don't really need the status part in most cases) but I include it for illustration:
import subprocess

proc = subprocess.Popen(
    ['git',
     'diff-index',           # use plumbing command, not user-facing diff
     '--cached',             # compare index vs HEAD
     '-r',                   # recurse into subdirectories
     '--name-status',        # show status & pathname
     # '--diff-filter=AM',   # optional: only A and M files
     '-z',                   # use machine-readable output
     'HEAD'],                # the commit to compare against
    stdout=subprocess.PIPE)
text = proc.stdout.read()
status = proc.wait()
# and check for failure as usual: Git returns 0 on success
Now we need something like pairwise from Iterating over every two elements in a list:
import sys

if sys.version_info[0] >= 3:
    izip = zip
else:
    from itertools import izip

def pairwise(it):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(it)
    return izip(a, a)
and we can break up the NUL-separated diff-index output with:
for state, path in pairwise(text.split(b'\0')):
    ...
We now have a state (b'A' = added, b'M' = modified, and so on) for each file. (Be sure to check for state T if you allow symlinks, in case a file changes from ordinary file to symlink, or vice versa. Note that we're depending on pairwise to discard the unpaired empty b'' string at the end of text.split(b'\0'), which is there because Git produces a NUL-terminated list rather than a NUL-separated list.)
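For illustration, one way to build the candidates list used below, as a sketch that keeps only added and modified files (the same effect as the commented-out --diff-filter=AM option above):
candidates = [path for state, path in pairwise(text.split(b'\0'))
              if state in (b'A', b'M')]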
Let's assume that at some point we collect up the files-to-maybe-lint into a list (or iterable) called candidates:
>>> candidates
[b'a.py', b'dir/b.py', b'z.py']
I will assume that you have avoided putting .gitignore into this list-or-iterable, since we plan to take it over for our own purposes.
Now we have two big problems: ignoring some files, and getting the version of those files that will actually be linted.
Just because a file is listed as modified, doesn't mean that the version in the work-tree is the version that will be committed. For instance:
$ git status
$ echo foo >> README
$ git add README
$ echo bar >> README
$ git status --short
MM README
The first M here means that the index version differs from HEAD (this is what we got from git diff-index above) while the second M here means that the index version also differs from the work-tree version.
The version that will be committed is the index version, not the work-tree version. What we need to lint is not the work-tree version but rather the index version.
So, now we need a temporary directory. The thing to use here is tempfile.mkdtemp if your Python is old, or the fancified context manager version, tempfile.TemporaryDirectory, if not. Note that we have byte-string pathnames above when working with Python 3, and ordinary (string) pathnames when working with Python 2, so this also is version dependent.
Since this is ordinary Python, not tricky Git interaction, I leave this part as an exercise -- and I'll just gloss right over all the bytes-vs-strings pathname stuff. :-) However, for the --stdin -z bit below, note that Git will need the list of file names as b'\0'-separated bytes.
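A minimal sketch of that version-dependent setup (glossing over the bytes-vs-strings issue just mentioned):
import sys
import tempfile

if sys.version_info[0] >= 3:
    _tmp = tempfile.TemporaryDirectory()  # removed when _tmp is cleaned up
    tmpdir = _tmp.name
else:
    tmpdir = tempfile.mkdtemp()  # remember to shutil.rmtree(tmpdir) later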
Once we have the (empty) temporary directory, in a format suitable for passing to cwd= in subprocess.Popen, we now need to run git checkout-index. There are a few options but let's go this way:
import os

proc = subprocess.Popen(['git', 'rev-parse', '--git-dir'],
                        stdout=subprocess.PIPE)
git_dir = proc.stdout.read().rstrip(b'\n')
status = proc.wait()
if status:
    raise ...
if sys.version_info[0] >= 3:  # XXX ugh, but don't want to getcwdb etc
    git_dir = git_dir.decode('utf8')
git_dir = os.path.join(os.getcwd(), git_dir)

proc = subprocess.Popen(['git',
                         '--git-dir={}'.format(git_dir),
                         'checkout-index', '-z', '--stdin'],
                        stdin=subprocess.PIPE, cwd=tmpdir)
proc.stdin.write(b'\0'.join(candidates))
proc.stdin.close()
status = proc.wait()
if status:
    raise ...
Now we want to write our special ignore file to os.path.join(tmpdir, '.gitignore'). Of course we also need tmpdir to act like its own Git repository now. These three things will do the trick:
import shutil

subprocess.check_call(['git', 'init'], cwd=tmpdir)
shutil.copy(os.path.join(git_dir, '.pylintignore'),
            os.path.join(tmpdir, '.gitignore'))
subprocess.check_call(['git', 'add', '-A'], cwd=tmpdir)
as we will now be using Git's ignore rules with the .pylintignore file we copied to .gitignore.
Now we would need just one more git status pass (with -z for b'\0'-style output, like git diff-index) to deal with ignored files; but there's a simpler method. We can get Git to remove all the ignored files (everything we just added with git add -A is tracked and therefore survives the clean):
subprocess.check_call(['git', 'clean', '-fqx'], cwd=tmpdir)
shutil.rmtree(os.path.join(tmpdir, '.git'))
os.remove(os.path.join(tmpdir, '.gitignore'))
and now everything in tmpdir is precisely what we should lint.
Caveat: if your Python linter needs to see imported code, you won't want to remove files. Instead, you'll want to use git status or git diff-index to compute the ignored files. Then you'll want to repeat the git checkout-index, but with the -a option, to extract all files into the temporary directory.
Once done, just remove the temp directory as usual (always clean up after yourself!).
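For example, a sketch of that cleanup, with a hypothetical run_linters step standing in for whatever linting you do:
try:
    run_linters(tmpdir)  # hypothetical: lint everything left in tmpdir
finally:
    shutil.rmtree(tmpdir)  # always clean up, even if linting raises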
Note that some parts of the above are tested piecewise, but assembling it all into full working Python2 or Python3 code remains an exercise.

Related

Always regenerate Sphinx documents containing a specific directive

Sphinx usually builds documentation incrementally, which means that only files that have been changed will be regenerated. I am wondering if there is a way to tell Sphinx to always regenerate certain files which may not have been changed directly but are influenced by changes in other files. More specifically: is there a way to tell Sphinx to always regenerate files that contain a certain directive?

The documentation I am working on relies quite frequently on the possibility to collect and reformat information from other pages with the help of directives. A clean (make clean && make [html]) and/or full (sphinx-build -a) build takes significantly longer than an incremental build. Additionally, manually keeping track of files which contain the directive might be complicated: the documentation is written by 10+ authors with limited experience in writing Sphinx documentation.
But even in less complex scenarios you might face this 'issue':
For instance sphinx.ext.todo contains a directive called todolist which collects todos from the whole documentation. If I create a file containing all the todos from my documentation (basically an empty document just containing the todolist directive) the list is not updated until I make a clean build or alter the file.
If you want to test it yourself: create a documentation project with sphinx-quickstart and stick to the default values, except for:
todo: write "todo" entries that can be shown or hidden on build (y/n) [n]: y
Add a file in source called todos.rst and reference this file from index.rst.
Content of the index.rst:
Welcome to sphinx-todo's documentation!
=======================================

.. toctree::
   :maxdepth: 2

   todos

.. todo::

   I have to do this

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Content of todos.rst:
.. _`todos`:

List of ToDos
=============

.. todolist::
Assuming you use the html output, you will notice that todos.html does not change when you add todos to index.rst.
tl;dr: How -- if possible -- do I include files containing a specific directive (e.g. todolist) into an incremental build of Sphinx without the need of manually keeping track of them?
By default Sphinx only updates output for new or changed files. The option that forces it to write all output files is sphinx-build -a.
At the end of the documentation of command options for sphinx-build:
You can also give one or more filenames on the command line after the source and build directories. Sphinx will then try to build only these output files (and their dependencies).
You could either invoke sphinx-build directly or through your makefile, depending on the makefile that shipped with your version of Sphinx (you can customize the makefile, too).
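For example, assuming the separate source/build layout from sphinx-quickstart, something like the following should rebuild just todos.rst and its dependencies:
sphinx-build -b html source build source/todos.rst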
Just for the record: I benchmarked several solutions.
I created a function called touch_files in my conf.py. It searches for strings in files and -- if found -- touches the file to trigger a rebuild:
import fnmatch
import mmap
import os

def touch_files(*args):
    # recursively search the 'source' directory
    for root, dirnames, filenames in os.walk('.'):
        # check all rst files
        for filename in fnmatch.filter(filenames, '*.rst'):
            cur = os.path.join(root, filename)
            f = open(cur)
            # access the file content directly from disk
            s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            if any(s.find(d) != -1 for d in args):
                # if one of the string patterns has been found, alter the
                # modification time of the file
                os.utime(cur, None)
            f.close()

# actually call the function
touch_files('.. todolist::')
touch_files can be called with a variable number of arguments and will touch a file when ONE of the arguments is found in it. I tried to optimize the function with regular expressions, but this did not achieve much. Reading the file content directly from disk with mmap seemed to have only a minor impact.
These are the results for 78 files in total, of which 36 contain one of two directives.
Command                                  Time    Comment
time make html                            2.3 s  no changes
time sh -c 'make clean && make html'     13.3 s
time make htmlfull                        9.4 s  sphinx-build -a
time make html                            8.4 s  with 'touch_files'
'touch_files'                             0.2 s  measured with timeit
Result: every command has been run just a few times (except 'touch_files') and therefore lacks statistical reliability. Sphinx requires roughly 2.3 seconds to check the documentation for changes without doing anything. A clean build requires 13.3 seconds, which is much longer than a build with sphinx-build -a. If we rebuild just 36 out of 78 files, the build process is slightly faster than sphinx-build -a, although I doubt a significant difference could be found here. The overhead of 'touch_files' is rather low: finding the strings is quite cheap compared to editing the timestamps.
Conclusion: as Steve Piercy pointed out, using sphinx-build -a seems to be the most reasonable approach, at least for my use case. But if rebuilding files that do not contain the directive in question drives build times up, touch_files might still be useful.

Get the age of a file in the terms of commits in git

I want to get some metrics on which files in my repository are the most recently active, using a measurement that does not require any recalculation after it is stored: the number of commits ago that the file was last modified.
So the idea that I have is thus:
import subprocess

file_list = subprocess.Popen(['git', 'ls-files'], stdout=subprocess.PIPE)
(files, _) = file_list.communicate()
missing_ages = files.splitlines()
ages = {f: -1 for f in missing_ages}
commits_proc = subprocess.Popen(['git', 'rev-list', '--all', '--pretty=format:""'],
                                stdout=subprocess.PIPE)
(commits, _) = commits_proc.communicate()
age = 0
for commit_sha in [s.split(' ')[1] for s in commits.splitlines()]:
    commit_list = subprocess.Popen('some', 'git', 'command')  # the missing piece
    commit_files = commit_list.communicate()
    for f in commit_files:
        if f in missing_ages:
            ages[f] = age
            missing_ages.remove(f)
    age += 1
What I need is a non-porcelain git command to get the list of files in a commit given its sha.
I know about git show --stat <commitish>, which is able to list the files changed in a commit, but its output format is not stable.
It can be stable, with the --porcelain option:
Use a special line-based format intended for script consumption.
Added/removed/unchanged runs are printed in the usual unified diff format, starting with a +/-/ character at the beginning of the line and extending to the end of the line.
Newlines in the input are represented by a tilde ~ on a line of its own.
You would still need to do some parsing though.
The final solution was to use git diff-tree -z --name-only --no-commit-id -r <sha>.
Source: the asker, EdJoJob, edited the question to include this.
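From Python, calling that command might look like this sketch (the -z output is NUL-terminated, hence dropping the trailing empty entry after the split):
import subprocess

sha = 'HEAD'  # or any commit hash
out = subprocess.check_output(
    ['git', 'diff-tree', '-z', '--name-only', '--no-commit-id', '-r', sha])
files_in_commit = out.split(b'\0')[:-1]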

Using actual branch head revision number in Hudson

We have multiple branches in SVN and use Hudson CI jobs to maintain our builds. We use the SVN revision number as part of our application version number. The issue is that when a Hudson job checks out the HEAD of a branch, it gets the repository's global HEAD revision number, not the last committed revision of that branch. I know SVN maintains revision numbers globally, but we want to reflect the last committed number of the particular branch in our version.
Is there a way to get the last committed revision number of a branch using a Python script, so that I can check out that branch using that revision number?
Or better, is there a way to do it in Hudson itself?
Thanks.
Getting the last committed revision of a path using Python:
from subprocess import check_output as run  # >= 2.7

path = './'
cmd = ['svn', '--username', 'XXXX', '--password', 'XXXX',
       '--non-interactive', 'info', path]
out = run(cmd).splitlines()
out = (i.split(':', 1) for i in out if i)
info = {k: v.strip() for k, v in out}

# you can access the other svn info fields in a similar manner
rev = info['Last Changed Rev']
with open('.last-svn-commit', 'w') as fh:
    fh.write(rev)
I don't think the subversion scm plugin can give you the information you need (it exports SVN_URL and SVN_REVISION only). Keep in mind that there is no difference between checking out the 'Last changed Rev' and the HEAD revision - they both refer to the same content in your branch.
You might want to consider using a new job for every branch you have. This way, the commit that triggers a build will be the 'Last changed Rev' (unless you trigger it yourself). You can do this manually by cloning the trunk job and changing the repository url, or you could use a tool like jenkins-autojobs to do it automatically.
Besides svn info, you can also use svn log -q -l 1 URL or svn ls -v --depth empty URL.
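For instance, a sketch of extracting the revision via svn log (its -q output starts with a separator line, then a line like 'r375 | author | date'; the URL here is made up):
import subprocess

url = 'svn://localhost/B/branches/mybranch'  # hypothetical branch URL
out = subprocess.check_output(['svn', 'log', '-q', '-l', '1', url]).decode()
rev = out.splitlines()[1].split()[0].lstrip('r')  # e.g. '375'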

Inject the revision number in sourcecode (TortoiseSvn or SVN Shell)

I would like to inject the revision number in source code on commit.
I found out that I could do it through svn shell by doing something like:
find . -name '*.php' -exec svn propset svn:keywords "Rev" {} \;
However, someone else said that would not work as there are no files in the repository (as the files are encrypted), and that I should be able to do it in TortoiseSVN. I found the "Hook Scripts" section, but I have completely no experience with this stuff.
Could you give me some indication how the command should look like, if I would like to have the first lines of code look like:
/*
* Version: 154
* Last modified on revision: 150
*/
I know that you can inject it by using $ver$, but how do I do it so that only files in certain directories with certain extensions get this change?
Don't write your own method for injecting version numbers. Instead:
- only introduce the replaced tags ($Revision$, etc.) in the files you want the replacement to happen for
- only enable replacement (using svn propset svn:keywords Revision or some such) for those files
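As a sketch of the second step from Python (assuming, for illustration, that the .php files under the current directory are the ones you care about):
import os
import subprocess

for root, dirs, files in os.walk('.'):
    if '.svn' in dirs:
        dirs.remove('.svn')  # don't descend into SVN metadata
    for name in files:
        if name.endswith('.php'):
            subprocess.check_call(['svn', 'propset', 'svn:keywords',
                                   'Revision', os.path.join(root, name)])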

How does one add a svn repository build number to Python code?

EDIT: This question duplicates How to access the current Subversion build number? (Thanks for the heads up, Charles!)
Hi there,
This question is similar to Getting the subversion repository number into code
The differences being:
- I would like to add the revision number to Python code
- I want the revision of the repository (not of the checked-out file)
I.e. I would like to extract the revision number from the output of 'svn info', like so:
$ svn info
Path: .
URL: svn://localhost/B/trunk
Repository Root: svn://localhost/B
Revision: 375
Node Kind: directory
Schedule: normal
Last Changed Author: bmh
Last Changed Rev: 375
Last Changed Date: 2008-10-27 12:09:00 -0400 (Mon, 27 Oct 2008)
I want a variable with 375 (the Revision). It's easy enough to put $Rev$ into a variable to keep track of changes on a file. However, I would like to keep track of the repository's version, and I understand (and it seems based on my tests) that $Rev$ only updates when the file changes.
My initial thoughts turn to using the svn/libsvn module built in to Python, though I can't find any documentation on or examples of how to use them.
Alternatively, I've thought calling 'svn info' and regex'ing the code out, though that seems rather brutal. :)
Help would be most appreciated.
Thanks & Cheers.
There is a command called svnversion which comes with subversion and is meant to solve exactly that kind of problem.
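For example, typical output looks like this (a sketch: the trailing M marks local modifications, and a colon would mark a mixed-revision working copy):
$ svnversion .
375M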
Stolen directly from django:
import os
import re

def get_svn_revision(path=None):
    rev = None
    if path is None:
        path = MODULE.__path__[0]
    entries_path = '%s/.svn/entries' % path
    if os.path.exists(entries_path):
        entries = open(entries_path, 'r').read()
        # Versions >= 7 of the entries file are flat text. The first line is
        # the version number. The next set of digits after 'dir' is the revision.
        if re.match(r'(\d+)', entries):
            rev_match = re.search(r'\d+\s+dir\s+(\d+)', entries)
            if rev_match:
                rev = rev_match.groups()[0]
        # Older XML versions of the file specify revision as an attribute of
        # the first entries node.
        else:
            from xml.dom import minidom
            dom = minidom.parse(entries_path)
            rev = dom.getElementsByTagName('entry')[0].getAttribute('revision')
    if rev:
        return u'SVN-%s' % rev
    return u'SVN-unknown'
Adapt as appropriate. You might want to change MODULE to the name of one of your code modules.
This code has the advantage of working even if the destination system does not have subversion installed.
Python has direct bindings to libsvn, so you don't need to invoke the command line client at all. See this blog post for more details.
EDIT: You can basically do something like this:
from svn import fs, repos, core
repository = repos.open(root_path)
fs_ptr = repos.fs(repository)
youngest_revision_number = fs.youngest_rev(fs_ptr)
I use a technique very similar to this in order to show the current subversion revision number in my shell:
svnRev=$(echo "$(svn info)" | grep "^Revision" | awk -F": " '{print $2};')
echo $svnRev
It works very well for me.
Why do you want the Python files to change every time the version number of the entire repository is incremented? This will make things like diffing two files annoying if one is from the repo and the other is from a tarball.
If you want to have a variable in one source file that can be set to the current working copy revision, and does not rely on Subversion and a working copy actually being available at the time you run your program, then SubWCRev may be your solution.
There also seems to be a Linux port called SVNWCRev.
Both perform substitution of $WCREV$ with the highest commit level of the working copy. Other information may also be provided.
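As a sketch (the file names here are made up): put a template line such as
REVISION = "$WCREV$"
into version.py.in, then run SubWCRev . version.py.in version.py as a build step to produce a version.py with the actual revision substituted in.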
Based on CesarB's response and the link Charles provided, I've done the following:
try:
    from subprocess import Popen, PIPE
    _p = Popen(["svnversion", "."], stdout=PIPE)
    REVISION = _p.communicate()[0]
    _p = None  # otherwise we get a wild exception when Django auto-reloads
except Exception, e:
    print "Could not get revision number: ", e
    REVISION = "Unknown"
Golly Python is cool. :)
