I have written a simple server-side git pre-receive hook in Python. The goal is to analyze diffs and reject pushes that contain text we consider invalid. I wrote the hook using the following commands:
git ls-tree
git diff --name-only
git cat-file
However, I just noticed that I am scanning the entire files that are pushed as part of the commit, whereas I only want to scan the diff, i.e. the lines changed in this push.
The reason is that some invalid text can be a false positive and is okay. It can be force pushed. However, if the same file is edited again and valid text is added, the push will be rejected just because that file previously had invalid text. This will happen each time the file is edited, which is kind of annoying.
So basically the question is: how do I get just the changed lines (the diff) of the current push in server-side hook code, instead of scanning complete files?
Thanks
... how to get just the changed lines
This question is incomplete. Suppose I tell you that there are some people, including Alice, Bob, Carol, and so on. Now I tell you that Bob is different. Different from who or what?
In a pre-receive hook, you must read lines from your standard input. Each line has the form:
old-hash new-hash reference-name
What do these mean? (That's an exercise for you to answer before you go on to the next sections, though the answer is embedded in the last section below.)
Obtaining a diff requires that you select two items
A commit is a snapshot of files—complete copies of every file that was frozen into that commit. There are no differences involved; there are just complete files.
You, however, want differences. To get a difference for some file file.ext, you must pick some other version of file.ext and compare the two. What is the correct "other version"?
For some commits, you are in luck: there's a very clear correct "other version" of file.ext, which is: the copy of file.ext in that commit's parent commit. In fact, this repeats for every file in the commit: we would like to compare that commit's version of that file, to the parent's version of that file, to see what changed.
There's a handy script-able ("plumbing") command for this, which is git diff-tree: given the hash ID of an ordinary non-merge commit, git diff-tree compares the commit's parent to the commit. Add -p or --patch to get a textual difference (this automatically implies the -r option). Consider using -U0 to drop context lines. You will, of course, still need to parse the output lines, to detect hunk headers and the added/deleted markers.
A simple git diff-tree <hash> does not, however, work for two cases of commits:
A root commit has no parent. Fortunately, the empty tree comes to the rescue: git diff-tree -p $(git hash-object -t tree /dev/null) $hash does the trick.
A merge commit has two or more parents. Here git diff-tree produces a combined diff by default. If that's OK, you can ignore this case. If not, you might consider using --first-parent -m or just -m to split the merge and get multiple diffs, against each parent (default) or the first parent (--first-parent).
That gets you the diff for one commit, so now we move on to the last part.
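As a rough illustration of the parsing step mentioned above, here is a minimal sketch that pulls the added lines out of git diff-tree -p -U0 output. The function name added_lines and the regular expression are my own (hypothetical) helpers, not part of Git, and the sketch assumes ordinary file names (no quoting, no trailing tabs in the +++ line).

```python
import re

# Matches a unified-diff hunk header such as "@@ -3,0 +4,2 @@ optional text".
# Group 1 is the starting line number on the "new" side.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(patch_text):
    """Yield (path, line_number, text) for every line the patch adds."""
    path = None
    lineno = None  # None while we are in a per-file header, before any hunk
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            path, lineno = None, None  # a new file section begins
            continue
        match = HUNK_RE.match(line)
        if match:
            lineno = int(match.group(1))
            continue
        if lineno is None:
            # Still in the file header: remember the post-image path.
            if line.startswith("+++ "):
                target = line[4:]
                path = target[2:] if target.startswith("b/") else None
            continue
        if line.startswith("+"):
            yield path, lineno, line[1:]
            lineno += 1
        elif line.startswith(" "):
            lineno += 1  # context line (absent with -U0, handled anyway)
        # "-" lines and "\ No newline at end of file" do not advance lineno
```

Feeding it the text printed by git diff-tree -p -U0 for one commit gives you exactly the lines that commit introduced, which is what the hook would then scan for invalid text.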
Now it's time to deal with the hook's stdin input lines
As you read each line, it's your job to:
Check the old and new hashes for the special all-zero-digits null hash. In Python, there are multiple ways to express this; one is:
def is_null(hash):
    # True when every character is '0', i.e. the all-zeros null hash
    return all(i == '0' for i in hash)
If the old hash is null, the reference is being created at the new hash. If the new hash is null, the reference used to have the given old hash, and is being deleted. Otherwise—neither hash is null—the reference is being updated: it had the old hash, and will have the new hash.
Figure out what to do, if anything, with the change to the particular reference. Is deletion allowed? Is creation allowed? Does it matter if this is a branch name (starts with refs/heads/) vs a tag name (starts with refs/tags/) vs something else entirely?
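The two steps above could be sketched roughly like this. The names classify and read_updates, and the tuple layout they return, are my own illustrative choices; what you actually allow or reject for each case is a policy decision the sketch does not make for you.

```python
import sys


def is_null(hash_id):
    """True for the special all-zeros hash."""
    return all(ch == "0" for ch in hash_id)


def classify(line):
    """Split one stdin line into (action, kind, old, new, ref)."""
    old, new, ref = line.split()
    if is_null(old):
        action = "create"
    elif is_null(new):
        action = "delete"
    else:
        action = "update"
    if ref.startswith("refs/heads/"):
        kind = "branch"
    elif ref.startswith("refs/tags/"):
        kind = "tag"
    else:
        kind = "other"
    return action, kind, old, new, ref


def read_updates(stream=sys.stdin):
    """Classify every non-blank line the hook receives on stdin."""
    return [classify(line) for line in stream if line.strip()]
```

A hook would call read_updates() once, then decide per entry whether to inspect commits, skip the reference, or reject the push.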
Creations are especially difficult. The newly introduced name makes the given object reachable by that name. If the object is a tag or commit, that makes additional objects reachable by that name as well. Some or all of these objects may be new. Some or all of these objects may already exist. The classic case is when someone creates a new branch name: it may point to an existing commit, already on some other branch, or it may point to a new commit, the new tip of the new branch, which may have many additional new commits before joining up with some existing branch(es).
Updates are the most common, and usually the simplest to handle. You know that the existing reference name made the old object reachable, and the proposed update is to make the new object reachable. If the reference is a branch name, both objects are in fact commit objects, and it is easy to find which commits, if any, are newly reachable from the proposed new hash, and which commits, if any, are being removed from reachability via the proposed new hash:
git rev-list $old..$new
produces the set of hash IDs that are newly reachable, and:
git rev-list $new..$old
produces the set that are no longer reachable. (Use git rev-list --left-right $old...$new, with three dots, to get both sets of hash IDs at once, with distinguishing markers. You can use $new...$old: the symmetric difference that this produces is itself symmetric, except of course that the left and right sides are reversed.)
Assuming you have handled creation somehow, if your goal is to examine newly-reachable commits—whether or not they are new to the repository overall—you can simply walk through all the new commits, testing each one to see if it is a root commit, an ordinary (single-parent) commit, or a merge commit. (Hint: add --parents to the git rev-list command to get the parent IDs included, so that you can easily tell how many parents each commit has. Also, consider the graph structure of the commit graph fragment you are walking: $old..$new may include merges, which may make many commits reachable that may or may not be new to the repository.)
You now have all the commit hashes, and their parent counts. You also know how to use git diff-tree to compare each commit against its parent(s) or against the empty tree as needed. So now you are ready to write your fancy pre-receive hook.
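To make that last step concrete, here is a hedged sketch of how the pieces might fit together. The helper names are hypothetical. For merges I chose to diff against the first parent by passing both hashes explicitly, which is only one of the policies discussed above; the empty_tree argument is the hash printed by git hash-object -t tree /dev/null.

```python
import subprocess


def parse_rev_list_parents(output):
    """Map each commit hash to its list of parent hashes.

    Expects the output of: git rev-list --parents <old>..<new>
    """
    commits = {}
    for line in output.splitlines():
        fields = line.split()
        if fields:
            commits[fields[0]] = fields[1:]
    return commits


def diff_tree_command(commit, parents, empty_tree):
    """Pick the git diff-tree invocation for one commit."""
    base = ["git", "diff-tree", "-p", "-U0"]
    if not parents:
        # Root commit: compare the empty tree against the commit.
        return base + [empty_tree, commit]
    if len(parents) == 1:
        # Ordinary commit: diff-tree compares it with its parent.
        return base + [commit]
    # Merge commit: assumption made here, compare with the first parent only.
    return base + [parents[0], commit]


def new_commits(old, new):
    """Run git rev-list for an update of a reference from old to new."""
    result = subprocess.run(
        ["git", "rev-list", "--parents", f"{old}..{new}"],
        check=True, capture_output=True, text=True,
    )
    return parse_rev_list_parents(result.stdout)
```

Each command list can be handed to subprocess.run, and the resulting patch text scanned for added lines.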
Related
I am wondering if there is a way to get the list of method names that are modified along with modified lines and file path from git commit or if there are any better solution for python files
I have been going through few answers in stackoverflow where the suggestion was to edit the gitconfig file and .gitattributes but I couldn't get the desired output
Git has no notion of the semantics of the files it tracks. It can just give you the lines added or removed from a file or the bytes changed in binary files.
How these map to your methods is beyond Git's understanding. However, there might be tools and/or scripts that will try to heuristically determine which methods are affected by a commit or a range of commits.
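For Python files specifically, one possible heuristic is to take the changed line numbers from the diff and look them up in the syntax tree of the new version of the file, using the standard ast module. This is only a sketch of that idea; the function name functions_touched is made up, it needs Python 3.8+ for end_lineno, and it reports both the outer and the inner function when a nested function changes.

```python
import ast


def functions_touched(source, changed_lines):
    """Return sorted names of functions/methods containing any changed line.

    source        -- full text of the new version of the file
    changed_lines -- iterable of 1-based line numbers taken from the diff
    """
    changed = list(changed_lines)
    touched = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno is the "def" line, end_lineno the last line of the body
            if any(node.lineno <= n <= node.end_lineno for n in changed):
                touched.add(node.name)
    return sorted(touched)
```

Lines that were only deleted have no line number in the new version, so they would need to be looked up in the old version of the file instead.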
As a personal project, I'd like to check different Python libraries and projects (be they proprietary or open source) and analyze how the code changed over time in different releases, to gather some information about technical debt (mainly through static code analysis). I'm doing this using the gitpython library. However, I'm struggling to filter the merge commits to master.
I filter the merge commits using git.log("--merges", "--first-parent", "master"), from which I extract the commit hashes, and then filter these particular commits from all repository commits.
As the second part, I'd like to get all changed files in each merge commit. I'm able to access the blobs via the git tree, but I don't know how to get only the changed files.
Is there an efficient way to accomplish this? Thanks!
... I'd like to get all changed files in each merge commit. ... but I don't know how to get only changed files.
Once you have your commit list as you described above, loop over the commits and run git diff with the --name-only flag on each one.
From the git diff documentation:
--name-only
    Show only names of changed files.
--name-status
    Show only the names and status of changed files. See the description of the --diff-filter option on what the status letters mean.
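A minimal sketch of that loop body, using subprocess rather than gitpython (that substitution is my assumption, as are the function names). It compares each merge with its first parent, written as sha^1, which matches the --first-parent view used to select the merges.

```python
import subprocess


def parse_name_status(output):
    """Turn git diff --name-status output into (status, [paths]) tuples.

    Renames and copies carry two paths, e.g. ("R100", ["old.py", "new.py"]).
    """
    changes = []
    for line in output.splitlines():
        if not line:
            continue
        status, *paths = line.split("\t")
        changes.append((status, paths))
    return changes


def merge_changes(sha, repo_dir="."):
    """List files changed by a merge commit relative to its first parent."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-status", f"{sha}^1", sha],
        check=True, capture_output=True, text=True,
    )
    return parse_name_status(result.stdout)
```

Calling merge_changes(sha) for every hash in the merge list gives the per-merge file list without reading any blobs.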
I'm trying to set up black to run on any files checked into git.
I've set it up as follows:
git config filter.black.clean 'black -'
echo '*.py filter=black' >> .git/info/attributes
As far as I understand, this should work ok as black with - as the source path will read from STDIN and output to STDOUT, which is what I think the git filter needs it to do.
However this does not work. When I add an un-black file with git add I see the following output:
reformatted -
All done! ✨ 🍰 ✨
1 file reformatted.
And the file is not changed on disk. What am I doing wrong?
The Black documentation recommends using a pre-commit hook, rather than smudge and clean filters. Note that filter.black.clean defines a clean filter and you have not set up any smudge filters.
The reason you see no change to the work-tree version of the file is that a clean filter is used when turning the work-tree version of a file into the index (to-be-committed) version of the file. This has no effect on the work-tree version of the file!
A smudge filter is used in the opposite direction: Git has a file that is in the index—for whatever reason, such as because it was just copied into the index as part of the git checkout operation to switch to a specific commit—and desires to convert that in-the-index, compressed, Git-ized file to one that you can actually see and edit in your editor, or run with python. Git will, at this time, run the (de-compressed) file contents through your smudge filter.
Note that if you convert some file contents, pre-compression, in a clean filter, and then later extract that file from the repository, into the index, and on into your work-tree, you will at that time be able to see what happened in the clean filter (assuming you do not have a countervailing smudge filter that undoes the effect).
In a code-reformatting world, one could conceivably use a clean filter to turn all source files into some sort of canonical (perhaps four-space-indentation) form, and a smudge filter to turn all source files into one's preferred format (two-space or eight-space indentation). If these transformations are all fully reversible, what you would see, in your work-tree, would be in your preferred format, and what others would see in their work-trees would be their preferred format; but what the version control system itself would see would be the canonical, standardized format.
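To illustrate the indentation scenario, here is a toy pair of transformations. It is not how Black works and the names reindent, clean and smudge are just labels for this sketch. It only handles leading spaces, and the round trip is reversible only when the work-tree file is indented in exact multiples of two spaces. A real filter script would read the file from stdin and write the result to stdout, which is the interface Git expects from filter.<name>.clean and filter.<name>.smudge.

```python
def reindent(text, old_width, new_width):
    """Rescale leading-space indentation from old_width to new_width."""
    out = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip(" ")
        spaces = len(line) - len(stripped)
        levels, extra = divmod(spaces, old_width)
        out.append(" " * (levels * new_width + extra) + stripped)
    return "".join(out)


def clean(text):
    """Work-tree (two-space) form -> canonical (four-space) form."""
    return reindent(text, 2, 4)


def smudge(text):
    """Canonical (four-space) form -> work-tree (two-space) form."""
    return reindent(text, 4, 2)
```

With such a pair configured, the repository would store the four-space form while this particular work-tree shows the two-space form.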
That's not how Black is actually intended to be used, though it probably can be used that way.
It's not obvious how to set this up manually, but using the pre-commit framework seems to work well.
This approach does two things:
checks prior to committing that the files pass the black test
runs black to fix the files if they don't
So if files don't pass, the commit fails, and black fixes the files, which you then need to git add before trying to commit again.
In a separate test, I managed to set up the pre-commit hook manually using black . --check in .git/hooks/pre-commit (this just does the check - doesn't fix anything if it fails), but never figured out how to configure black to work with the clean and smudge filters.
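For reference, a manual check-only hook along those lines could also be written in Python. This is a sketch under a few assumptions: black is on PATH, the helper names are mine, and only staged files are checked. Note that black reads the work-tree copy of each file, which can differ from the staged copy if a file was edited after git add.

```python
import subprocess


def python_files(names):
    """Keep only paths that look like Python source files."""
    return [name for name in names if name.endswith(".py")]


def staged_files():
    """Names of added, copied or modified files in the index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def main():
    files = python_files(staged_files())
    if not files:
        return 0  # nothing to check, allow the commit
    # black --check exits non-zero if any file would be reformatted
    return subprocess.run(["black", "--check", *files]).returncode
```

Saved as .git/hooks/pre-commit (executable, with a Python shebang line and a final sys.exit(main()) after importing sys), a non-zero return aborts the commit.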
I have a file foo.py. I have made some changes in the working directory, but have not staged or committed any changes yet. I know I can use git checkout foo.py to get rid of these changes. I also read about using git reset --hard HEAD, which essentially resets your working directory, staging area and commit history to match the latest commit.
Is there any reason to prefer using one over the other in my case, where my changes are still in working directory?
Is there any reason to prefer using one over the other in my case, where my changes are still in working directory?
No, since they will accomplish the same thing.
Is there any reason to prefer using [git checkout -- path/to/file] over [git reset --hard] in [general but not in my specific case]?
Yes: this will affect only the one file. If your habit is git reset --hard to undo changes to one file, and you have work-tree and/or staged changes to other files too, and you git reset --hard, you may be out of luck getting those changes back without reconstructing them by hand.
Note that there is a third construct: git checkout HEAD path/to/file. The difference between this and the one without HEAD (with -- instead; see footnote 1) is that the one with HEAD means copy the version of the file that is in a permanent, unchangeable commit into the index / staging-area first, and then into the work-tree. The one with -- means copy the version of the file that is in the index / staging-area into the work-tree.
[1] The reason to use -- is to make sure Git never confuses a file name with anything else, like a branch name. For instance, suppose you name a file master, just to be obstinate. What, then, does git checkout master mean? Is it supposed to check out branch master, or extract file master? The -- in git checkout -- master makes it clear to Git (and to humans) that this means "extract file master".
Summary, or, things to keep in mind
There are, at all times, three active copies of each file:
one in HEAD;
one in the index / staging-area;
one in the work-tree.
The git status command looks at all three, comparing HEAD-vs-index first—this gives Git the list of "changes to be committed"—and then index-vs-work-tree second. The second one gives Git the list of "changes not staged for commit".
The git add command copies from the work-tree, into the index.
The git checkout command copies either from HEAD to the index and then the work-tree, or just from the index to the work-tree. So it's a bit complicated, with multiple modes of operation:
git checkout -- path/to/file: copies from index to work-tree. What's in HEAD does not matter here. The -- is usually optional, unless the file name looks like a branch name (e.g., a file named master) or an option (e.g., a file named -f).
git checkout HEAD -- path/to/file: copies from HEAD commit, to index, then to work-tree. What's in HEAD overwrites what's in the index, which then overwrites what's in the work-tree. The -- is usually optional, unless the file name looks like an option (e.g., -f).
It's wise to use the -- always, just as a good habit.
The git reset command is complicated (it has many modes of operation).
(This is not to say that git checkout is simple: it, too, has many modes of operation, probably too many. But I think git reset is at least a little bit worse.)
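The copy directions in the summary above can be mimicked with a toy model, which may make the difference between the two checkout forms easier to see. This is purely illustrative: a dict stands in for the three copies of one file, and the function names are mine, not Git commands.

```python
def git_add(state):
    """git add path: copy work-tree -> index."""
    return {**state, "index": state["worktree"]}


def checkout_from_index(state):
    """git checkout -- path: copy index -> work-tree (HEAD is ignored)."""
    return {**state, "worktree": state["index"]}


def checkout_from_head(state):
    """git checkout HEAD -- path: copy HEAD -> index -> work-tree."""
    return {**state, "index": state["head"], "worktree": state["head"]}


# One file with a committed version, a staged edit, and a further unstaged edit.
state = {"head": "v1", "index": "v2", "worktree": "v3"}
only_unstaged_dropped = checkout_from_index(state)   # staged edit survives
everything_dropped = checkout_from_head(state)       # back to the commit
```

When nothing is staged, the index already equals HEAD, so both forms leave the file in the same state, which is the situation in the question.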
Maybe this can help :
Source :
https://www.patrickzahnd.ch
Under your constraints, there's no difference. If there ARE staged changes, there is, though:
reset reverts staged changes.
checkout does not ...
I'm writing a simple parser of .git/* files. I covered almost everything, like objects, refs, pack files etc. But I have a problem. Let's say I have a big 300M repository (in a pack file) and I want to find out all the commits which changed /some/deep/inside/file file. What I'm doing now is:
fetching last commit
finding a file in it by:
fetching parent tree
finding out a tree inside
recursively repeat until I get into the file
additionally I'm checking the hashes of each subfolder on my way to the file. If one of them is the same as in the commit before, I assume that the file was not changed (because its parent dir didn't change)
then I store the hash of a file and fetch parent commit
finding file again and check if hash change occurs
if yes then original commit (i.e. one before parent) was changing a file
And I repeat it over and over until I reach very first commit.
This solution works, but it sucks. In the worst-case scenario, the first search can take as long as 3 minutes (for a 300M pack).
Is there any way to speed it up? I tried to avoid putting such large objects in memory, but right now I don't see any other way. And even then, the initial memory load will take forever :(
Greets and thanks for any help!
That's the basic algorithm that git uses to track changes to a particular file. That's why "git log -- some/path/to/file.txt" is a comparatively slow operation, compared to many other SCM systems where it would be simple (e.g. in CVS, P4 et al each repo file is a server file with the file's history).
It shouldn't take so long to evaluate though: the amount you ever have to keep in memory is quite small. You already mentioned the main point: remember the tree IDs going down to the path to quickly eliminate commits that didn't even touch that subtree. It's rare for tree objects to be very big, just like directories on a filesystem (unsurprisingly).
Are you using the pack index? If you're not, then you essentially have to unpack the entire pack to find this out, since trees could be at the end of a long delta chain. If you have an index, you'll still have to apply deltas to get your tree objects, but at least you should be able to find them quickly. Keep a cache of applied deltas, since it's very common for trees to reuse the same or similar bases: most tree object changes are just changing 20 bytes from a previous tree object. So if, in order to get tree T1, you have to start with object T8 and apply Td7 to get T7, T6.... etc., it's entirely likely that these other trees T2-8 will be referenced again.
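The tree-ID pruning mentioned above can be sketched against a toy in-memory object store. Everything here is an assumption for illustration: trees is a dict mapping a tree ID to {entry name: object ID}, history is a linear list of (commit, root tree) pairs with the newest first, and the function names are made up. Real history has merges and real trees come out of the pack.

```python
def path_changed(trees, new_root, old_root, path):
    """Compare one path in two root trees, stopping at the first equal ID."""
    new_id, old_id = new_root, old_root
    for part in path.strip("/").split("/"):
        if new_id == old_id:
            return False  # identical (sub)tree: nothing below it changed
        # Descend one level on each side; a missing entry becomes None.
        new_id = trees.get(new_id, {}).get(part) if new_id else None
        old_id = trees.get(old_id, {}).get(part) if old_id else None
    return new_id != old_id


def commits_changing(path, history, trees):
    """Return the commits whose version of path differs from the parent's."""
    changed = []
    for i, (commit, root) in enumerate(history):
        # The oldest commit has no parent: compare against "nothing".
        parent_root = history[i + 1][1] if i + 1 < len(history) else None
        if path_changed(trees, root, parent_root, path):
            changed.append(commit)
    return changed
```

Only the trees along the path are ever loaded, and a commit that did not touch the enclosing directory is dismissed after comparing a single pair of IDs.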