Extract commits related to code changes from commit tree - python

Right now I am able to traverse the commit tree of a GitHub repository using the pygit2 library. I am getting all the commits for each file change in the repository, which means I am also getting changes to text documents with the .rtf extension. How do I filter out the commits that are related to code changes only? I don't want the changes related to text documents.
Appreciate any help or pointers. Thanks.
last = repo[repo.head.target]
t0 = last
f = open(outputFile, 'w')
print t0.hex
for commit in repo.walk(last.id):
    if t0.hex == commit.hex:
        continue
    print commit.hex
    out = repo.diff(t0, commit)
    f.write(out.patch)
    t0 = commit
As part of the output, I get the difference in rtf files as well as below:
diff --git a/archived-output/NEW/action-core[best].rtf b/archived-output/NEW/action-core[best].rtf
deleted file mode 100644
index 56cdec6..0000000
--- a/archived-output/NEW/action-core[best].rtf
+++ /dev/null
@@ -1,8935 +0,0 @@
-{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f0\fbidi \froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fbidi \fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
-{\f2\fbidi \fmodern\fcharset0\fprq1{\*\panose 02070309020205020404}Courier New;}{\f3\fbidi \froman\fcharset2\fprq2{\*\panose 05050102010706020507}Symbol;}
Either I have to filter the commits from the tree or I have to filter the output. I was wondering whether I could remove the changes related to rtf files by skipping the corresponding commits while walking through the tree.

If that is possible, how do we get the list of modified files?
Ah, now you're asking the right questions! Git, of course, does not store a list of modified files in each commit. Rather, each commit represents the state of the entire repository at a certain point in time. In order to find the modified files, you need to compare the files contained in one commit with the previous commit.
For each commit returned by repo.walk(), the tree attribute refers to the associated Tree object (which is itself a list of TreeEntry objects representing files and directories contained in that particular Tree).
A Tree object has a diff_to_tree() method that can be used to compare it against another Tree object. This returns a Diff object, which acts as an iterator over a list of Patch objects. Each Patch object refers to the changes in a single file between the two Trees that are being compared.
The Patch object is really the key to all this, because this is how
we determine which files have been modified.
The following code demonstrates this. For each commit, it will print
a list of new, modified, or deleted files:
import pygit2
repo = pygit2.Repository('.')
prev = None
for cur in repo.walk(repo.head.target):
    if prev is not None:
        print prev.id
        diff = cur.tree.diff_to_tree(prev.tree)
        for patch in diff:
            print patch.status, ':', patch.new_file_path,
            if patch.new_file_path != patch.old_file_path:
                print '(was %s)' % patch.old_file_path,
            print
    if cur.parents:
        prev = cur
        cur = cur.parents[0]
If we run this against a sample repository, we can look at the
output for the first few commits:
c285a21e013892ee7601a53df16942cdcbd39fe6
D : fragments/configure-flannel.sh
A : fragments/flannel-config.service.yaml
A : fragments/write-flannel-config.sh
M : kubecluster.yaml
b06de8f2f366204aa1327491fff91574e68cd4ec
M : fragments/enable-services-master.sh
M : fragments/enable-services-minion.sh
c265ddedac7162c103672022633a574ea03edf6f
M : fragments/configure-flannel.sh
88a8bd0eefd45880451f4daffd47f0e592f5a62b
A : fragments/configure-docker-storage.sh
M : fragments/write-heat-params.yaml
M : kubenode.yaml
And compare that to the output of git log --oneline --name-status:
c285a21 configure flannel via systemd unit
D fragments/configure-flannel.sh
A fragments/flannel-config.service.yaml
A fragments/write-flannel-config.sh
M kubecluster.yaml
b06de8f call daemon-reload before starting services
M fragments/enable-services-master.sh
M fragments/enable-services-minion.sh
c265dde fix json syntax problem
M fragments/configure-flannel.sh
88a8bd0 configure cinder volume for docker storage
A fragments/configure-docker-storage.sh
M fragments/write-heat-params.yaml
M kubenode.yaml
...aaaand, that looks just about identical. Hopefully this is enough
to get you started.
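To tie this back to the original question (dropping .rtf and other document changes), the changed paths from each diff can be filtered by extension before anything is written out. The following is a minimal sketch using only the standard library. The CODE_EXTENSIONS set and the is_code_path / code_paths_only helper names are assumptions made up for illustration; they are not part of pygit2.

```python
import os

# Hypothetical allow-list; adjust to whatever counts as "code" in your repository.
CODE_EXTENSIONS = {".py", ".sh", ".yaml", ".c", ".h", ".java", ".js"}


def is_code_path(path, extensions=CODE_EXTENSIONS):
    """Return True if the path's extension is in the allow-list."""
    _, ext = os.path.splitext(path)
    return ext.lower() in extensions


def code_paths_only(paths, extensions=CODE_EXTENSIONS):
    """Keep only the paths that look like code, preserving order."""
    return [p for p in paths if is_code_path(p, extensions)]


changed = [
    "archived-output/NEW/action-core[best].rtf",
    "fragments/configure-flannel.sh",
    "kubecluster.yaml",
]
print(code_paths_only(changed))
```

Inside the commit loop, the same check could be applied to each patch's new file path, skipping a commit entirely when none of its paths pass.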

This is mainly a rewrite of larsks's excellent answer, ported to the current pygit2 API and Python 3.
It also fixes a flaw in the iteration logic: the original code fails to diff the last revision against its parent when a revision range (a..b) is walked.
The following approximates the command
git log --name-status --pretty="format:Files changed in %h" origin/devel..master
on the sample repository given by larsks.
I was unable to detect file renames, though: a rename is printed as a deletion plus an addition, and the line that prints a rename is never reached.
import pygit2
repo = pygit2.Repository('.')
# Show files changed between origin/devel and current HEAD
devel = repo.revparse_single('origin/devel')
walker = repo.walk(repo.head.target)
walker.hide(devel.id)
for cur in walker:
    if cur.parents:
        print(f'Files changed in {cur.short_id}')
        prev = cur.parents[0]
        diff = prev.tree.diff_to_tree(cur.tree)
        for patch in diff:
            print(patch.delta.status_char(), ':', patch.delta.new_file.path)
            if patch.delta.new_file.path != patch.delta.old_file.path:
                print(f'(was {patch.delta.old_file.path})')
        print()

Related

How to check that a script has not been modified - tried with git attribute ident $Id$

I am maintaining a collection of python scripts that are distributed on several computers. Users might have the fancy idea to modify the scripts so I am looking for an automatic solution to check the script integrity.
I wanted to use git attribute ident so that the file contains its own sha1 and then use git hash-object to compare.
It looks like this (.gitattributes contains *.py ident):
import subprocess
gitId= '$Id: 98a648abdf1cd8d563c72886a601857c20670013 $' #this sha will be updated automatically at each commit on the file.
gitId=gitId[5:-2]
shaCheck=subprocess.check_output(['git', 'hash-object', __file__]).strip().decode('UTF-8')
if shaCheck != gitId:
    print('file has been corrupted \n {} <> {}'.format(shaCheck, gitId))
# below the actual purpose of the script
This works fine when my script lies inside the git repository, but git hash-object returns a different sha when the script is outside of my git repository. I guess there is some git filter issue, but I do not know how to get around it.
Any other painless way to check my file integrity is also welcome.
You could check the file's hash with the Python module hashlib:
import hashlib
filename_1 = "./folder1/test_script.py"
with open(filename_1, "rb") as f:
    data = f.read()  # read entire file as bytes
    readable_hash = hashlib.sha256(data).hexdigest()
    print(readable_hash)
filename_2 = "./folder2/test_script.py"
with open(filename_2, "rb") as f:
    data = f.read()  # read entire file as bytes
    readable_hash = hashlib.sha256(data).hexdigest()
    print(readable_hash)
Output:
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
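On the git hash-object mismatch from the question: one likely explanation (an assumption, not verified against the asker's setup) is that with the ident attribute Git stores the blob with a bare $Id$ and only expands it on checkout. Inside the repository, git hash-object applies the attribute conversions before hashing; outside it there is no .gitattributes, so the expanded file is hashed as-is. The sketch below reproduces the blob hash without calling Git by collapsing the keyword first. The helper names are hypothetical, and line-ending conversion or other filters are ignored.

```python
import hashlib
import re

# Matches an expanded ident keyword such as "$Id: 98a6...0013 $".
_IDENT_RE = re.compile(rb"\$Id:[^$\n]*\$")


def collapse_ident(data):
    """Turn every expanded "$Id: ... $" back into the stored form "$Id$"."""
    return _IDENT_RE.sub(b"$Id$", data)


def git_blob_sha1(data):
    """SHA-1 that Git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def expected_ident_sha1(path):
    """Blob hash of a checked-out file, ignoring its expanded $Id$ keyword."""
    with open(path, "rb") as f:
        return git_blob_sha1(collapse_ident(f.read()))


print(git_blob_sha1(b""))  # hash of the empty blob
```

In the script from the question, expected_ident_sha1(__file__) could then be compared with the id embedded in the file, with no dependency on being inside the repository.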

How to get the last commit of a specific file using python?

I tried with GitPython, but I am only getting the hash of the current commit.
import git
repo = git.Repo(search_parent_directories=True)
repo.head.commit.hexsha
But, for traceability, I want to store the git commit hash of a specific file, i.e. the equivalent of this command (using git):
git log -n 1 --pretty=format:%h -- experiments/test.yaml
Is it possible to achieve this with GitPython?
An issue titled "how do I get sha key for any repository' file" points to the Tree object, which provides recursive traversal of git trees with access to all metadata, including the SHA1 hash.
self.assertEqual(tree['smmap'], tree / 'smmap') # access by index and by sub-path
for entry in tree: # intuitive iteration of tree members
    print(entry)
blob = tree.trees[0].blobs[0] # let's get a blob in a sub-tree
assert blob.name
blob.hexsha would be the SHA1 of the blob.
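For the original goal (the equivalent of git log -n 1 -- <path>), GitPython's Repo.iter_commits accepts a paths argument and passes keyword arguments such as max_count on to git rev-list. A minimal sketch, assuming repo is a git.Repo; the helper names are hypothetical, and slicing hexsha to seven characters only approximates %h, whose length Git may vary.

```python
def last_commit_for_path(repo, path):
    """Return the most recent commit touching `path`, or None if there is none.

    `repo` is expected to be a git.Repo (GitPython); only its
    iter_commits() method is used here.
    """
    return next(repo.iter_commits(paths=path, max_count=1), None)


def short_hash(commit, length=7):
    """Abbreviate a commit hash; a fixed length only approximates git's %h."""
    return commit.hexsha[:length]


# Usage, assuming GitPython is installed and the current directory
# is inside a repository:
#   import git
#   repo = git.Repo(search_parent_directories=True)
#   commit = last_commit_for_path(repo, "experiments/test.yaml")
#   if commit is not None:
#       print(short_hash(commit))
```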

Get changed files using gitpython

I want to get a list of changed files of the current git repo: the files that are normally listed under Changes not staged for commit: when calling git status.
So far I have managed to connect to the repository, pull it, and show all untracked files:
from git import Repo
repo = Repo(pk_repo_path)
o = repo.remotes.origin
o.pull()[0]
print(repo.untracked_files)
But now I want to show all files that have uncommitted changes. Can anybody push me in the right direction? I looked at the names of the methods of repo and experimented for a while, but I can't find the correct solution.
Obviously I could call repo.git.status and parse the files, but that isn't elegant at all. There must be something better.
Edit: Now that I think about it, a function that tells me the status of a single file would be more useful. Like:
print(repo.get_status(path_to_file))
>>untracked
print(repo.get_status(path_to_another_file))
>>not staged
for item in repo.index.diff(None):
    print(item.a_path)
or to get just the list:
changedFiles = [ item.a_path for item in repo.index.diff(None) ]
repo.index.diff() returns a git.diff.DiffIndex (the diff() method itself comes from git.diff.Diffable), described in http://gitpython.readthedocs.io/en/stable/reference.html#module-git.diff
So function can look like this:
def get_status(repo, path):
    changed = [item.a_path for item in repo.index.diff(None)]
    if path in repo.untracked_files:
        return 'untracked'
    elif path in changed:
        return 'modified'
    else:
        return "don't care"
Just to follow up on @ciasto piekarz's question, depending on what you want to show:
repo.index.diff(None)
lists only files that have not been staged, while
repo.index.diff('HEAD')
lists only files that have been staged.
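Putting the two diffs and the untracked list together gives something close to the per-file status asked for in the question. This is a minimal sketch, assuming repo is a GitPython Repo with at least one commit (diffing against HEAD fails on an empty repository). The function name and the returned labels are arbitrary choices for illustration.

```python
def get_status(repo, path):
    """Classify `path` (relative to the repository root), roughly like `git status`."""
    if path in repo.untracked_files:
        return "untracked"
    # Index vs. working tree: changes that are not staged yet.
    unstaged = {p for d in repo.index.diff(None) for p in (d.a_path, d.b_path)}
    if path in unstaged:
        return "not staged"
    # Index vs. HEAD: changes staged for the next commit.
    staged = {p for d in repo.index.diff("HEAD") for p in (d.a_path, d.b_path)}
    if path in staged:
        return "staged"
    return "unchanged"


# Usage, assuming GitPython is installed:
#   from git import Repo
#   repo = Repo(".")
#   print(get_status(repo, "setup.py"))
```

A file with both staged and unstaged changes is reported as "not staged" here, because that check comes first; reorder the checks if the other answer is more useful.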

gitpython and git diff

I am looking to get only the diff of a file changed from a git repo. Right now, I am using gitpython to actually get the commit objects and the files of git changes, but I want to do a dependency analysis on only the parts of the file changed. Is there any way to get the git diff from git python? Or am I going to have to compare each of the files by reading line by line?
If you want to access the contents of the diff, try this:
repo = git.Repo(repo_root.as_posix())
commit_dev = repo.commit("dev")
commit_origin_dev = repo.commit("origin/dev")
diff_index = commit_origin_dev.diff(commit_dev)
for diff_item in diff_index.iter_change_type('M'):
    print("A blob:\n{}".format(diff_item.a_blob.data_stream.read().decode('utf-8')))
    print("B blob:\n{}".format(diff_item.b_blob.data_stream.read().decode('utf-8')))
This will print the contents of each file.
You can use GitPython with the git command "diff"; you just need to pass the "tree" object of the commit or branch whose diffs you want to see, for example:
repo = Repo('/git/repository')
t = repo.head.commit.tree
repo.git.diff(t)
This returns "all" the diffs for all files included in this commit as one string, so if you want each one you must iterate over them.
With the current branch it's:
repo.git.diff('HEAD~1')
Hope this helps, regards.
Git does not store the diffs, as you have noticed. Given two blobs (before and after a change), you can use Python's difflib module to compare the data.
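As a rough illustration of the difflib suggestion, the sketch below builds a unified diff from two byte strings, such as the blob contents read via a_blob.data_stream.read() and b_blob.data_stream.read() in the earlier answer. The helper name and the sample contents are made up for the example, and the blobs are assumed to be UTF-8 text.

```python
import difflib


def unified_text_diff(old_bytes, new_bytes, name="file"):
    """Return a unified diff between two blob contents, decoded as UTF-8."""
    old_lines = old_bytes.decode("utf-8", errors="replace").splitlines(keepends=True)
    new_lines = new_bytes.decode("utf-8", errors="replace").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="a/" + name, tofile="b/" + name
        )
    )


before = b"def f():\n    return 1\n"
after = b"def f():\n    return 2\n"
print(unified_text_diff(before, after, name="example.py"))
```

Only the changed hunks appear in the result, which is convenient when the dependency analysis should look at the modified lines instead of the whole file.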
I'd suggest using PyDriller instead (it uses GitPython internally). It is much easier to use:
from pydriller import Repository
for commit in Repository("path_to_repo").traverse_commits():
    for modified_file in commit.modified_files:  # here you have the list of modified files
        print(modified_file.diff)
        # etc...
You can also analyze a single commit by doing:
for commit in Repository("path_to_repo", single="123213").traverse_commits():
    print(commit.hash)
If you're looking to recreate something close to what a standard git diff would show, try:
# cloned_repo = git.Repo.clone_from(
#     url=ssh_url,
#     to_path=repo_dir,
#     env={"GIT_SSH_COMMAND": "ssh -i " + SSH_KEY},
# )
repo_diff = ""
for diff_item in cloned_repo.index.diff(None, create_patch=True):
    repo_diff += (
        f"--- a/{diff_item.a_blob.name}\n+++ b/{diff_item.b_blob.name}\n"
        f"{diff_item.diff.decode('utf-8')}\n\n"
    )
If you want to do git diff on a file between two commits this is the way to do it:
import git
repo = git.Repo()
path_to_a_file = "diff_this_file_across_commits.txt"
commits_touching_path = list(repo.iter_commits(paths=path_to_a_file))
print(repo.git.diff(commits_touching_path[0], commits_touching_path[1], path_to_a_file))
This will show you the differences between two latest commits that were done to the file you specify.
repo.git.diff("main", "head~5")
PyDriller +1
pip install pydriller
But with the new API (a breaking change):
from pydriller import Repository
for commit in Repository('https://github.com/ishepard/pydriller').traverse_commits():
    print(commit.hash)
    print(commit.msg)
    print(commit.author.name)
    for file in commit.modified_files:
        print(file.filename, ' has changed')
Here is how you do it:
import git
repo = git.Repo("path/of/repo/")
# list all commits reachable from HEAD, newest first
# (very old GitPython releases exposed this as repo.commits())
commits = list(repo.iter_commits())
# take the two most recent commits
a_commit = commits[0]
b_commit = commits[1]
# now get the diff
a_commit.diff(b_commit)

Turn subversion path into walkable directory

I have a subversion repo, i.e. "http://crsvn/trunk/foo". I want to walk this directory or, for starters, simply get a directory listing.
The idea is to create a script that will do mergeinfo on all the branches in "http://crsvn/branches/bar" and compare them to trunk to see if the branch has been merged.
So the first problem I have is that I cannot walk or do
os.listdir('http://crsvn/branches/bar')
I get an error saying the value label syntax is incorrect (mentioning the URL).
You can use PySVN. In particular, the pysvn.Client.list method should do what you want:
import pysvn
svncl = pysvn.Client()
entries = svncl.list("http://rabbitvcs.googlecode.com/svn/trunk/")
# Gives you a list of directories:
dirs = (entry[0].repos_path for entry in entries if entry[0].kind == pysvn.node_kind.dir)
list(dirs)
No checkout needed. You could even specify a revision to work on, to ensure your script can ignore other people working on the repository while it runs.
listdir takes a path and not a URL. It would be nice if Python could be aware of the structure on a remote server, but I don't think that is the case.
If you were to check out your repository locally first, you could easily walk the directories using Python's functions.
