Git: get changes released to master over time

As a personal project, I'd like to check different Python libraries and projects (proprietary or open source) and analyze how the code changed over time across releases, to gather information about technical debt (mainly through static code analysis). I'm doing this with the GitPython library. However, I'm struggling to filter the merge commits to master.
I filter the merge commits using git.log("--merges", "--first-parent", "master"), extract the commit hashes from the output, and select those particular commits from all repository commits.
As the second part, I'd like to get all changed files in each merge commit. I'm able to access the blobs via the git tree, but I don't know how to get only the changed files.
Is there an efficient way to accomplish this? Thanks!

... I'd like to get all changed files in each merge commit. ... but I don't know how to get only changed files.
Once you have your commit list as described above, loop over the commits and run git diff with the --name-only flag on each one.
git diff
--name-only
Show only names of changed files.
--name-status
Show only the names and status of changed files. See the description of the --diff-filter option on what the status letters mean.
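A minimal sketch of that loop body in Python, assuming the commit hashes come from the --merges --first-parent log described in the question. The helper names (parse_name_status, changed_files) are made up for illustration, and the comparison against the first parent (commit^1) is an assumption about what "changed in the merge" should mean here.

```python
import subprocess


def parse_name_status(output):
    """Turn `git diff --name-status` output into (status, paths) tuples."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Fields are tab-separated; renames and copies carry two paths.
        status, *paths = line.split("\t")
        changes.append((status, paths))
    return changes


def changed_files(repo_dir, commit):
    """List what a merge commit changed relative to its first parent."""
    cmd = ["git", "-C", repo_dir, "diff", "--name-status", f"{commit}^1", commit]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_name_status(result.stdout)
```

With GitPython, the same arguments can presumably be passed through repo.git.diff("--name-status", ...) and the returned text handed to parse_name_status.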

Related

diff in pre-receive hook

I have written a simple server-side Git pre-receive hook in Python. The goal is to analyze diffs and reject pushes that contain certain text that we consider invalid. I wrote the hook using the following set of commands:
git ls-tree
git diff --name-only
git cat-file
However, I just noticed that I am scanning the entire files that are pushed as part of the commit. I only want to scan the diff, i.e. the changed lines in this push.
The reason is that some invalid text can be a false positive and is okay; it can be force pushed. However, if the same file is edited again and valid text is added, the push will be rejected just because that file previously had invalid text. This will happen each time the file is edited, which is kind of annoying.
So basically the question is: how do I get just the changed lines (the diff) of the current push in the server-side hook code, instead of scanning complete files?
Thanks

... how to get just the changed lines
This question is incomplete. Suppose I tell you that there are some people, including Alice, Bob, Carol, and so on. Now I tell you that Bob is different. Different from who or what?
In a pre-receive hook, you must read lines from your standard input. Each line has the form:
old-hash new-hash reference-name
What do these mean? (That's an exercise for you to answer before you go on to the next sections, though the answer is embedded in the last section below.)
Obtaining a diff requires that you select two items
A commit is a snapshot of files—complete copies of every file that was frozen into that commit. There are no differences involved; there are just complete files.
You, however, want differences. To get a difference for some file file.ext, you must pick some other version of file.ext and compare the two. What is the correct "other version"?
For some commits, you are in luck: there's a very clear correct "other version" of file.ext, which is: the copy of file.ext in that commit's parent commit. In fact, this repeats for every file in the commit: we would like to compare that commit's version of that file, to the parent's version of that file, to see what changed.
There's a handy script-able ("plumbing") command for this, which is git diff-tree: given the hash ID of an ordinary non-merge commit, git diff-tree compares the commit's parent to the commit. Add -p or --patch to get a textual difference (this automatically implies the -r option). Consider using -U0 to drop context lines. You will, of course, still need to parse the output lines, to detect hunk headers and the added/deleted markers.
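Parsing that patch output could look roughly like the following sketch, which collects only added lines. The function name added_lines is hypothetical, and the code assumes ordinary unquoted paths with Git's default a/ and b/ prefixes. It counts lines per hunk header, so an added line whose own text starts with "++" is not mistaken for a file header.

```python
import re

# Hunk header, e.g. "@@ -10,2 +12,3 @@"; a missing count means 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def added_lines(patch_text):
    """Yield (path, new_line_number, text) for each line the patch adds."""
    path = None
    old_left = new_left = 0  # lines still expected in the current hunk
    new_lineno = 0
    for line in patch_text.split("\n"):
        if old_left > 0 or new_left > 0:
            if line.startswith("+"):
                yield path, new_lineno, line[1:]
                new_lineno += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif line.startswith(" "):  # context line (absent with -U0)
                new_lineno += 1
                old_left -= 1
                new_left -= 1
            # "\ No newline at end of file" markers are skipped
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1) or "1")
            new_lineno = int(match.group(2))
            new_left = int(match.group(3) or "1")
        elif line.startswith("+++ "):
            name = line[4:]
            path = name[2:] if name.startswith("b/") else name
```

A hook could then run its "invalid text" check on the third element of each tuple only.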
A simple git diff-tree <hash> does not, however, work for two cases of commits:
A root commit has no parent. Fortunately, the empty tree comes to the rescue: git diff-tree -p $(git hash-object -t tree /dev/null) $hash does the trick.
A merge commit has two or more parents. Here git diff-tree produces a combined diff by default. If that's OK, you can ignore this case. If not, you might consider using --first-parent -m or just -m to split the merge and get multiple diffs, against each parent (default) or the first parent (--first-parent).
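The three cases (root, ordinary, merge) can be sketched as one small function that picks the git diff-tree arguments from the parent list. The function name and the first_parent_only switch are made up for illustration. The hard-coded empty-tree ID is the well-known value for SHA-1 repositories; computing it with git hash-object as shown above is the safer route if the object format might differ. For merges this sketch compares against the first parent explicitly rather than relying on any default.

```python
# Empty-tree object ID in SHA-1 repositories (assumption: not a SHA-256 repo).
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def diff_tree_args(commit, parents, first_parent_only=True):
    """Build a git diff-tree command line suited to the commit's parent count."""
    base = ["git", "diff-tree", "-p", "-U0"]
    if not parents:
        # Root commit: compare the empty tree against the commit.
        return base + [EMPTY_TREE, commit]
    if len(parents) == 1 or first_parent_only:
        # Ordinary commit, or merge viewed against its first parent only.
        return base + [parents[0], commit]
    # Merge commit: -m yields one diff per parent.
    return base + ["-m", commit]
```

Passing two tree-ish arguments explicitly also avoids the leading commit-ID line that the single-commit form prints.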
That gets you the diff for one commit, so now we move on to the last part.
Now it's time to deal with the hook's stdin input lines
As you read each line, it's your job to:
Check the old and new hashes for the special all-zero-digits null hash. In Python, there are multiple ways to express this; one is:
def is_null(hash):
    # True when every digit of the hash is '0' (the null hash)
    return all(i == '0' for i in hash)
If the old hash is null, the reference is being created at the new hash. If the new hash is null, the reference used to have the given old hash, and is being deleted. Otherwise—neither hash is null—the reference is being updated: it had the old hash, and will have the new hash.
Figure out what to do, if anything, with the change to the particular reference. Is deletion allowed? Is creation allowed? Does it matter if this is a branch name (starts with refs/heads/) vs a tag name (starts with refs/tags/) vs something else entirely?
Creations are especially difficult. The newly introduced name makes the given object reachable by that name. If the object is a tag or commit, that makes additional objects reachable by that name as well. Some or all of these objects may be new. Some or all of these objects may already exist. The classic case is when someone creates a new branch name: it may point to an existing commit, already on some other branch, or it may point to a new commit, the new tip of the new branch, which may have many additional new commits before joining up with some existing branch(es).
Updates are the most common, and usually the simplest to handle. You know that the existing reference name made the old object reachable, and the proposed update is to make the new object reachable. If the reference is a branch name, both objects are in fact commit objects, and it is easy to find which commits, if any, are newly reachable from the proposed new hash, and which commits, if any, are being removed from reachability via the proposed new hash:
git rev-list $old..$new
produces the set of hash IDs that are newly reachable, and:
git rev-list $new..$old
produces the set that are no longer reachable. (Use git rev-list --left-right $old...$new, with three dots, to get both sets of hash IDs at once, with distinguishing markers. You can use $new...$old: the symmetric difference that this produces is itself symmetric, except of course that the left and right sides are reversed.)
Assuming you have handled creation somehow, if your goal is to examine newly-reachable commits—whether or not they are new to the repository overall—you can simply walk through all the new commits, testing each one to see if it is a root commit, an ordinary (single-parent) commit, or a merge commit. (Hint: add --parents to the git rev-list command to get the parent IDs included, so that you can easily tell how many parents each commit has. Also, consider the graph structure of the commit graph fragment you are walking: $old..$new may include merges, which may make many commits reachable that may or may not be new to the repository.)
You now have all the commit hashes, and their parent counts. You also know how to use git diff-tree to compare each commit against its parent(s) or against the empty tree as needed. So now you are ready to write your fancy pre-receive hook.
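Put together, the stdin handling might be sketched as follows. All function names here are hypothetical, and the sketch deliberately skips creations and deletions, since the policy for those is left open above. Only the two parsing helpers are pure; newly_reachable and handle_push shell out to git and assume they run inside the hook's environment.

```python
import subprocess


def is_null(hash_id):
    """True for the all-zero null hash."""
    return set(hash_id) == {"0"}


def classify_update(old, new):
    """Name what a pre-receive input line does to its reference."""
    if is_null(old):
        return "create"
    if is_null(new):
        return "delete"
    return "update"


def parse_parents(rev_list_output):
    """Map each commit printed by `git rev-list --parents` to its parents."""
    parents = {}
    for line in rev_list_output.splitlines():
        if not line.strip():
            continue
        commit, *rest = line.split()
        parents[commit] = rest
    return parents


def newly_reachable(old, new):
    """Commits reachable from new but not from old, with their parents."""
    cmd = ["git", "rev-list", "--parents", f"{old}..{new}"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_parents(result.stdout)


def handle_push(lines):
    """Yield (ref, commit, parents) for every commit an update makes reachable."""
    for line in lines:
        old, new, ref = line.split()
        if classify_update(old, new) != "update":
            continue  # creation and deletion policy is up to you
        for commit, parents in newly_reachable(old, new).items():
            yield ref, commit, parents
```

In the hook itself this would be driven by something like handle_push(sys.stdin), with each yielded commit then diffed according to its parent count.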

git reset --hard HEAD vs git checkout <file>

I have a file foo.py. I have made some changes in the working directory, but have not staged or committed any of them yet. I know I can use git checkout foo.py to get rid of these changes. I also read about using git reset --hard HEAD, which essentially resets your working directory, staging area and commit history to match the latest commit.
Is there any reason to prefer using one over the other in my case, where my changes are still in working directory?

Is there any reason to prefer using one over the other in my case, where my changes are still in working directory?
No, since they will accomplish the same thing.
Is there any reason to prefer using [git checkout -- path/to/file] over [git reset --hard] in [general but not in my specific case]?
Yes: this will affect only the one file. If your habit is git reset --hard to undo changes to one file, and you have work-tree and/or staged changes to other files too, and you git reset --hard, you may be out of luck getting those changes back without reconstructing them by hand.
Note that there is a third construct: git checkout HEAD path/to/file. The difference between this and the one without HEAD (with -- instead [1]) is that the one with HEAD means copy the version of the file that is in a permanent, unchangeable commit into the index / staging-area first, and then into the work-tree. The one with -- means copy the version of the file that is in the index / staging-area into the work-tree.
[1] The reason to use -- is to make sure Git never confuses a file name with anything else, like a branch name. For instance, suppose you name a file master, just to be obstinate. What, then, does git checkout master mean? Is it supposed to check out branch master, or extract file master? The -- in git checkout -- master makes it clear to Git, and to humans, that this means "extract file master".
Summary, or, things to keep in mind
There are, at all times, three active copies of each file:
one in HEAD;
one in the index / staging-area;
one in the work-tree.
The git status command looks at all three, comparing HEAD-vs-index first—this gives Git the list of "changes to be committed"—and then index-vs-work-tree second. The second one gives Git the list of "changes not staged for commit".
The git add command copies from the work-tree, into the index.
The git checkout command copies either from HEAD to the index and then the work-tree, or just from the index to the work-tree. So it's a bit complicated, with multiple modes of operation:
git checkout -- path/to/file: copies from index to work-tree. What's in HEAD does not matter here. The -- is usually optional, unless the file name looks like a branch name (e.g., a file named master) or an option (e.g., a file named -f).
git checkout HEAD -- path/to/file: copies from HEAD commit, to index, then to work-tree. What's in HEAD overwrites what's in the index, which then overwrites what's in the work-tree. The -- is usually optional, unless the file name looks like an option (e.g., -f).
It's wise to use the -- always, just as a good habit.
The git reset command is complicated (it has many modes of operation).
(This is not to say that git checkout is simple: it, too, has many modes of operation, probably too many. But I think git reset is at least a little bit worse.)
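To make the three copies concrete, here is a toy Python model of the copy directions described in this summary. It is purely illustrative: ToyRepo and its methods are invented names, each area is just a dictionary from path to content, and nothing here reflects how Git actually stores data or treats untracked files.

```python
class ToyRepo:
    """Toy model of HEAD, index and work-tree as path -> content maps."""

    def __init__(self, head):
        self.head = dict(head)       # frozen snapshot of the current commit
        self.index = dict(head)      # staging area
        self.work_tree = dict(head)  # files you can edit

    def add(self, path):
        # git add path: work-tree -> index
        self.index[path] = self.work_tree[path]

    def checkout_index(self, path):
        # git checkout -- path: index -> work-tree
        self.work_tree[path] = self.index[path]

    def checkout_head(self, path):
        # git checkout HEAD -- path: HEAD -> index -> work-tree
        self.index[path] = self.head[path]
        self.work_tree[path] = self.index[path]

    def reset_hard(self):
        # git reset --hard HEAD: every tracked file, HEAD -> index and work-tree
        self.index = dict(self.head)
        self.work_tree = dict(self.head)
```

In this model, checkout_index on a file with staged edits brings back the staged version rather than the committed one, while reset_hard also discards edits to every other file.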

Maybe this can help (the image itself is not preserved). Source: https://www.patrickzahnd.ch

Under your constraints, there's no difference. If there ARE staged changes, there is, though:
reset reverts staged changes.
checkout does not ...

How can I use GitPython to get diffs

I am trying to use GitPython to get me diffs in the format I want.
For every file I would like 3 lists. A list of lines that were changed, added, and removed.
I've been looking online and at the documentation and can't seem to figure out how to do this.

As far as I can tell, there is actually no way to do this.
You'd think they'd implement this functionality, but as far as I can tell file-level diffs are not implemented directly in GitPython.
The only way to retrieve changed lines within GitPython is using the git command directly, like so:
from git import Repo
repo = Repo('path/to/repo')
filename = 'path/to/repo/subdir/myfile'
diff_output = repo.git.diff(filename)
With diff_output now containing the same content you'd see if you did:
$ cd path/to/repo
$ git diff subdir/myfile
So it's not very useful. You're going to probably have to write your own code to parse the diff file to get the information you need.
Frustrating, because I'd like a similar functionality and it looks like my only option really is manually parsing git diffs.
I have to assume that GitPython is primarily used for managing changes to branches, etc and not inspecting or dealing with source code changes on the file level.
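For what it's worth, a hand-rolled parser along those lines might look like the sketch below. It takes the text returned by repo.git.diff(...) and builds the three lists per file. The function names are made up, and the "changed" list is a heuristic of my own: within each block of consecutive -/+ lines, removed and added lines are paired off in order, and any surplus counts as purely removed or added.

```python
def _flush(entry, minus, plus):
    """Pair removed/added lines of one change block; surplus stays unpaired."""
    paired = min(len(minus), len(plus))
    entry["changed"] += list(zip(minus[:paired], plus[:paired]))
    entry["removed"] += minus[paired:]
    entry["added"] += plus[paired:]
    minus.clear()
    plus.clear()


def classify_diff(diff_text):
    """Map each path in a git diff to its added, removed and changed lines."""
    files = {}
    entry = None
    path = None
    in_header = False
    minus, plus = [], []
    for line in diff_text.split("\n"):
        if line.startswith("diff --git "):
            if entry is not None:
                _flush(entry, minus, plus)
            entry, in_header = None, True
        elif in_header:
            if line.startswith(("--- ", "+++ ")) and line[4:] != "/dev/null":
                name = line[4:]
                path = name[2:] if name[:2] in ("a/", "b/") else name
            elif line.startswith("@@"):
                entry = files.setdefault(
                    path, {"added": [], "removed": [], "changed": []})
                in_header = False
        elif entry is not None:
            if line.startswith("\\"):
                continue  # "\ No newline at end of file"
            if line.startswith("-"):
                if plus:  # a new change block starts
                    _flush(entry, minus, plus)
                minus.append(line[1:])
            elif line.startswith("+"):
                plus.append(line[1:])
            else:  # context line or next hunk header ends the block
                _flush(entry, minus, plus)
    if entry is not None:
        _flush(entry, minus, plus)
    return files
```

Usage would be something like classify_diff(repo.git.diff(filename)); quoted or unusual paths are not handled.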

See this documentation. It could be useful.

Can I implement my own Python-based git merge strategy?

I developed a Python library for merging large numbers of XML files in a very specific way. These XML files are split up and altered by multiple users in my group and it would be much easier to put everything into a Git repo and have git-merge manage everything via my Python code.
It seems that implementing my code for git-mergetool is possible, but I would have to write my own code to manage the conflict returns for the internal git-merge (i.e. parse the >>>>>>> <<<<<<< ======= identifiers), which would be more time consuming.
So, is there a way to have Git's merge command automatically use my Python code instead of its internal git-merge?

You can implement a custom merge driver that's used for certain filetypes instead of Git's default merge driver.
Relevant documentation in gitattributes(5)
Some related StackOverflow questions:
Git - how to force merge conflict and manual merge on selected file
How do I tell git to always select my local version for conflicted merges on a specific file?
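As a rough sketch of what such a driver could look like in Python: per gitattributes(5), the driver command receives the ancestor (%O), current (%A) and other (%B) versions as temporary files, must leave the result in the %A file, and signals conflicts with a non-zero exit status. The driver name xmlmerge, the script path, and the merge_xml function are all hypothetical; merge_xml is only a trivial stand-in for the asker's own library.

```python
# Hypothetical wiring (names and paths are placeholders):
#   .gitattributes:  *.xml merge=xmlmerge
#   git config merge.xmlmerge.driver "python3 /path/to/xml_merge_driver.py %O %A %B"


def merge_xml(base_text, ours_text, theirs_text):
    """Stand-in for the real XML merge; returns (merged_text, had_conflicts)."""
    if ours_text == theirs_text or theirs_text == base_text:
        return ours_text, False
    if ours_text == base_text:
        return theirs_text, False
    return ours_text, True  # both sides changed differently


def run_driver(argv, merge=merge_xml):
    """Merge the three files named in argv; write the result to the %A file."""
    base_path, ours_path, theirs_path = argv[:3]
    texts = []
    for file_path in (base_path, ours_path, theirs_path):
        with open(file_path, encoding="utf-8") as handle:
            texts.append(handle.read())
    merged, had_conflicts = merge(*texts)
    with open(ours_path, "w", encoding="utf-8") as handle:
        handle.write(merged)
    return 1 if had_conflicts else 0  # non-zero tells Git the merge conflicted
```

The script's entry point would then be something like sys.exit(run_driver(sys.argv[1:])).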

Python and git, best way to know when a file has changed from a sync

I am using chef to fetch changes on my production servers once a minute and in the github repo I have 6 spawn-fcgi scripts under control of runit. If only one of the files changes I need to restart runit to pick up the changes. My pseudo code looks like this.
git fetch
for each file:
    if file changed from last sync:
        sv start myrunit
    else:
        pass
I am open to any best-practice method.
Thanks

git diff --name-only HEAD@{1}..HEAD
will list files changed between the previous and current checked-out commit.
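In Python, the check from the question could be sketched like this. It assumes the sync step has already moved HEAD (a plain git fetch does not, so a merge or pull would have to follow), and it uses the reflog form HEAD@{1} for the previously checked-out commit. The function names and the watched-file list are placeholders.

```python
import subprocess


def files_changed_since_last_head(repo_dir):
    """Names of files that differ between the previous and current HEAD."""
    cmd = ["git", "-C", repo_dir, "diff", "--name-only", "HEAD@{1}..HEAD"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return [line for line in result.stdout.splitlines() if line]


def needs_restart(changed, watched):
    """True when at least one watched file is among the changed files."""
    return bool(set(changed) & set(watched))
```

When needs_restart returns True for the six spawn-fcgi scripts, the caller would run the runit command from the question once, rather than once per file.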
