I am looking to get only the diff of a file changed from a git repo. Right now, I am using gitpython to actually get the commit objects and the files of git changes, but I want to do a dependency analysis on only the parts of the file changed. Is there any way to get the git diff from git python? Or am I going to have to compare each of the files by reading line by line?
If you want to access the contents of the diff, try this:
import git
repo = git.Repo(repo_root.as_posix())  # repo_root is assumed to be a pathlib.Path
commit_dev = repo.commit("dev")
commit_origin_dev = repo.commit("origin/dev")
diff_index = commit_origin_dev.diff(commit_dev)
# Iterate over modified ('M') files only
for diff_item in diff_index.iter_change_type('M'):
    print("A blob:\n{}".format(diff_item.a_blob.data_stream.read().decode('utf-8')))
    print("B blob:\n{}".format(diff_item.b_blob.data_stream.read().decode('utf-8')))
This will print the full contents of each modified file before (A) and after (B) the change, not a line-by-line diff.
You can also run the git "diff" command through GitPython. You just need to pass it the "tree" object of the commit or branch you want to compare against, for example:
from git import Repo
repo = Repo('/git/repository')
t = repo.head.commit.tree
print(repo.git.diff(t))  # repo.git.diff() returns the diff text as a string
This gives you "all" the diffs between that tree and your working directory as a single string, so if you want each file's diff separately you must split it up or iterate over the files yourself.
To compare against the previous commit on the current branch, it's:
repo.git.diff('HEAD~1')
Hope this helps, regards.
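The "iterate over them" step could be sketched roughly as follows. This is only an illustration, not part of GitPython: split_diff_by_file is a hypothetical helper, and it assumes plain "diff --git a/<path> b/<path>" headers (quoted paths or paths containing " b/" would need more care). The sample text stands in for the string returned by repo.git.diff(...).

```python
def split_diff_by_file(diff_text):
    """Split unified diff text into a {path: chunk} dict, one entry per file."""
    chunks = {}
    current_path = None
    current_lines = []
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # A new file header closes the previous file's chunk
            if current_path is not None:
                chunks[current_path] = "\n".join(current_lines)
            # Header looks like: diff --git a/<path> b/<path>
            current_path = line.split(" b/", 1)[-1]
            current_lines = [line]
        elif current_path is not None:
            current_lines.append(line)
    if current_path is not None:
        chunks[current_path] = "\n".join(current_lines)
    return chunks


# Made-up sample standing in for the output of repo.git.diff('HEAD~1')
sample = "\n".join([
    "diff --git a/foo.py b/foo.py",
    "--- a/foo.py",
    "+++ b/foo.py",
    "@@ -1 +1 @@",
    "-x = 1",
    "+x = 2",
    "diff --git a/bar.txt b/bar.txt",
    "--- a/bar.txt",
    "+++ b/bar.txt",
    "@@ -1 +1 @@",
    "-old",
    "+new",
])
per_file = split_diff_by_file(sample)
print(sorted(per_file))
```

Each value in per_file is then the diff for one file only, which is easier to feed into a per-file analysis.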
Git does not store the diffs, as you have noticed. Given two blobs (before and after a change), you can use Python's difflib module to compare the data.
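A minimal sketch of the difflib approach, assuming you already have the before and after text of one file (for example from a_blob.data_stream.read().decode('utf-8') and the matching b_blob call). The helper name blob_diff and the sample strings are made up for illustration.

```python
import difflib


def blob_diff(old_text, new_text, path):
    """Return a unified diff between two versions of one file's text."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


# Made-up file contents standing in for the two blobs
old = "import os\nprint('hello')\n"
new = "import os\nimport sys\nprint('hello')\n"
patch = blob_diff(old, new, "example.py")
print(patch)

# Keep only the added lines, e.g. as input for a dependency analysis.
# The "+++" check skips the file header (a sketch: real content lines
# that themselves start with "++" would be skipped too).
added = [
    line[1:]
    for line in patch.splitlines()
    if line.startswith("+") and not line.startswith("+++")
]
print(added)
```

Working on the added lines only keeps the analysis limited to the part of the file that actually changed.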
I'd suggest using PyDriller instead (it uses GitPython internally). It is much easier to use:
from pydriller import Repository
for commit in Repository("path_to_repo").traverse_commits():
    for modified_file in commit.modified_files:  # here you have the list of modified files
        print(modified_file.diff)
        # etc...
You can also analyze a single commit by doing:
for commit in Repository("path_to_repo", single="123213").traverse_commits():
    ...  # process the commit as above
If you're looking to recreate something close to what a standard git diff would show, try:
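import git
# cloned_repo = git.Repo.clone_from(
#     url=ssh_url,
#     to_path=repo_dir,
#     env={"GIT_SSH_COMMAND": "ssh -i " + SSH_KEY},
# )
repo_diff = ""
# index.diff(None) compares the index against the working tree.
# a_path/b_path are used because a_blob or b_blob can be None (e.g. for deleted files).
for diff_item in cloned_repo.index.diff(None, create_patch=True):
    repo_diff += (
        f"--- a/{diff_item.a_path}\n+++ b/{diff_item.b_path}\n"
        f"{diff_item.diff.decode('utf-8')}\n\n"
    )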
If you want to do git diff on a file between two commits this is the way to do it:
import git
repo = git.Repo()
path_to_a_file = "diff_this_file_across_commits.txt"
commits_touching_path = list(repo.iter_commits(paths=path_to_a_file))
print(repo.git.diff(commits_touching_path[0], commits_touching_path[1], path_to_a_file))
This will show you the differences between the two most recent commits that touched the file you specify.
repo.git.diff("main", "HEAD~5")
PyDriller +1
pip install pydriller
But with the new API (a breaking change):
from pydriller import Repository
for commit in Repository('https://github.com/ishepard/pydriller').traverse_commits():
    print(commit.hash)
    print(commit.msg)
    print(commit.author.name)
    for file in commit.modified_files:
        print(file.filename, ' has changed')
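Here is how you do it:
import git
repo = git.Repo("path/of/repo/")
# repo.commits() and repo.diff() come from very old GitPython releases;
# current releases use iter_commits() and repo.git.diff() instead
# the below gives us all commits, newest first
commits = list(repo.iter_commits())
# take the two most recent commits
a_commit = commits[0]
b_commit = commits[1]
# now get the diff as text
print(repo.git.diff(a_commit, b_commit))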
Related
I am trying to have 2 simultaneous versions of a single package on a server: a production one and a testing one.
I want these 2 to be in the same git repository on 2 different branches (testing would merge into production). However, I would love to keep them in the same directory so there is no need to change any imports or paths.
Is it possible to dynamically change the package name in setup.py, depending on the git branch?
Or is it possible to deploy them with different names using pip?
EDIT: I may have found a proper solution for my problem here: Git: ignore some files during a merge (keep some files restricted to one branch)
Gitattributes can be set up to skip merging my setup.py. I'll close this question after I test it.
This could be done with a setup script that looks like this:
#!/usr/bin/env python3
import pathlib
import setuptools

def _get_git_ref():
    ref = None
    git_head_path = pathlib.Path(__file__).parent.joinpath('.git', 'HEAD')
    with git_head_path.open('r') as git_head:
        ref = git_head.readline().split()[-1]
    return ref

def _get_project_name():
    name_map = {
        'refs/heads/master': 'ThingProd',
        'refs/heads/develop': 'ThingTest',
    }
    git_ref = _get_git_ref()
    name = name_map.get(git_ref, 'ThingUnknown')
    return name

setuptools.setup(
    # see 'setup.cfg'
    name=_get_project_name(),
)
It reads the current git ref directly from the .git/HEAD file and looks up the corresponding name in a table.
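One case worth keeping in mind is a detached HEAD, where .git/HEAD holds a bare commit hash rather than a "ref: ..." line. A small sketch of the lookup that makes this case explicit and can be tried without a repository; parse_git_head and project_name_for are hypothetical names, and the name table is the one from the script above.

```python
def parse_git_head(head_text):
    """Return the ref named in .git/HEAD, or None for a detached HEAD."""
    line = head_text.strip()
    if line.startswith("ref:"):
        return line[len("ref:"):].strip()
    return None  # detached HEAD: the file holds a bare commit hash


NAME_MAP = {
    "refs/heads/master": "ThingProd",
    "refs/heads/develop": "ThingTest",
}


def project_name_for(head_text, default="ThingUnknown"):
    """Map the contents of .git/HEAD to a package name."""
    return NAME_MAP.get(parse_git_head(head_text), default)


# Sample HEAD contents: a branch checkout and a detached HEAD
print(project_name_for("ref: refs/heads/develop\n"))
print(project_name_for("0123456789abcdef0123456789abcdef01234567\n"))
```

Any branch that is not in the table, and any detached HEAD, falls back to the default name.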
Inspired from: https://stackoverflow.com/a/56245722/11138259.
Using a .gitattributes file with the content "setup.py merge=ours", together with git config --global merge.ours.driver true, makes the merge "omit" the setup.py file (it keeps our file instead). It seems this only works if both master and the child branch have changed the file since they first parted ways.
I want to get a list of changed files of the current git-repo. The files, that are normally listed under Changes not staged for commit: when calling git status.
So far I have managed to connect to the repository, pull it and show all untracked files:
from git import Repo
repo = Repo(pk_repo_path)
o = repo.remotes.origin
o.pull()
print(repo.untracked_files)
But now I want to show all files that have uncommitted changes. Can anybody push me in the right direction? I looked at the names of the methods of repo and experimented for a while, but I can't find the correct solution.
Obviously I could call repo.git.status and parse the files, but that isn't elegant at all. There must be something better.
Edit: Now that I think about it, a function that tells me the status of a single file would be more useful. Like:
print(repo.get_status(path_to_file))
>>untracked
print(repo.get_status(path_to_another_file))
>>not staged
for item in repo.index.diff(None):
    print(item.a_path)
or to get just the list:
changedFiles = [item.a_path for item in repo.index.diff(None)]
repo.index.diff() returns a git.diff.DiffIndex (a list of Diff objects), described in http://gitpython.readthedocs.io/en/stable/reference.html#module-git.diff
So the function can look like this:
def get_status(repo, path):
    changed = [item.a_path for item in repo.index.diff(None)]
    if path in repo.untracked_files:
        return 'untracked'
    elif path in changed:
        return 'modified'
    else:
        return "don't care"
Just to follow up on @ciasto piekarz's question: depending on what you want to show:
repo.index.diff(None)
lists only files that have not been staged
repo.index.diff('HEAD')
lists only files that have been staged
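Putting the three lists together, a per-file status lookup could be sketched like this. classify_status is a hypothetical helper and the sample sets are made up; with GitPython they would be built from repo.untracked_files, repo.index.diff(None) and repo.index.diff('HEAD') as shown in the comments.

```python
def classify_status(path, untracked, unstaged, staged):
    """Return a list of status labels for one repository-relative path."""
    labels = []
    if path in untracked:
        labels.append("untracked")
    if path in staged:
        labels.append("staged")
    if path in unstaged:
        labels.append("not staged")
    return labels or ["unchanged"]


# Made-up inputs standing in for what GitPython would provide:
#   untracked = set(repo.untracked_files)
#   unstaged  = {d.a_path for d in repo.index.diff(None)}
#   staged    = {d.a_path for d in repo.index.diff('HEAD')}
untracked = {"notes.txt"}
unstaged = {"app.py", "lib.py"}
staged = {"lib.py"}

for p in ("notes.txt", "app.py", "lib.py", "README.md"):
    print(p, classify_status(p, untracked, unstaged, staged))
```

A list is returned rather than a single label because a file can be staged and still have further unstaged edits at the same time.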
I am using gitpython library for performing git operations, retrieve git info from python code. I want to retrieve all revisions for a specific file. But couldn't find a specific reference for this on the docs.
Can anybody give some clue on which function will help in this regard? Thanks.
A follow-on, to read each file:
import git
repo = git.Repo()
path = "file_you_are_looking_for"
revlist = (
    (commit, (commit.tree / path).data_stream.read())
    for commit in repo.iter_commits(paths=path)
)
for commit, filecontents in revlist:
    ...
There is no such function, but it is easily implemented:
import git
repo = git.Repo()
path = "dir/file_you_are_looking_for"
commits_touching_path = list(repo.iter_commits(paths=path))
Performance will be moderate even if multiple paths are involved. Benchmarks and more code about that can be found in an issue on github.
Right now I am able to traverse through the commit tree for a github repository using pygit2 library. I am getting all the commits for each file change in the repository. This means that I am getting changes for text files with extensions .rtf as well in the repository. How do I filter out the commits which are related to code changes only? I don't want the changes related to text documents.
Appreciate any help or pointers. Thanks.
last = repo[repo.head.target]
t0 = last
with open(outputFile, 'w') as f:
    print(t0.id)
    for commit in repo.walk(last.id):
        if t0.id == commit.id:
            continue
        print(commit.id)
        # diff between the previously visited commit and this one
        out = repo.diff(t0, commit)
        f.write(out.patch)
        t0 = commit
As part of the output, I get the difference in rtf files as well as below:
diff --git a/archived-output/NEW/action-core[best].rtf b/archived-output/NEW/action-core[best].rtf
deleted file mode 100644
index 56cdec6..0000000
--- a/archived-output/NEW/action-core[best].rtf
+++ /dev/null
@@ -1,8935 +0,0 @@
-{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f0\fbidi \froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fbidi \fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
-{\f2\fbidi \fmodern\fcharset0\fprq1{\*\panose 02070309020205020404}Courier New;}{\f3\fbidi \froman\fcharset2\fprq2{\*\panose 05050102010706020507}Symbol;}
Either I have to filter the commits from the tree or I have to filter the output. I was wondering whether I could remove the changes related to rtf files by skipping the corresponding commits while walking through the tree.
If that is possible, how do we get the list of modified files?
Ah, now you're asking the right questions! Git, of course, does not store a list of modified files in each commit. Rather, each commit represents the state of the entire repository at a certain point in time. In order to find the modified files, you need to compare the files contained in one commit with the previous commit.
For each commit returned by repo.walk(), the tree attribute refers to the associated Tree object (which is itself a list of TreeEntry objects representing files and directories contained in that particular Tree).
A Tree object has a diff_to_tree() method that can be used to compare it against another Tree object. This returns a Diff object, which acts as an iterator over a list of Patch objects. Each Patch object refers to the changes in a single file between the two Trees that are being compared.
The Patch object is really the key to all this, because this is how
we determine which files have been modified.
The following code demonstrates this. For each commit, it will print
a list of new, modified, or deleted files:
# Python 2 syntax and an older pygit2 API; a Python 3 port follows further down
import stat
import pygit2

repo = pygit2.Repository('.')

prev = None
for cur in repo.walk(repo.head.target):

    if prev is not None:
        print prev.id
        diff = cur.tree.diff_to_tree(prev.tree)
        for patch in diff:
            print patch.status, ':', patch.new_file_path,
            if patch.new_file_path != patch.old_file_path:
                print '(was %s)' % patch.old_file_path,
            print

    if cur.parents:
        prev = cur
        cur = cur.parents[0]
If we run this against a sample repository, we can look at the
output for the first few commits:
c285a21e013892ee7601a53df16942cdcbd39fe6
D : fragments/configure-flannel.sh
A : fragments/flannel-config.service.yaml
A : fragments/write-flannel-config.sh
M : kubecluster.yaml
b06de8f2f366204aa1327491fff91574e68cd4ec
M : fragments/enable-services-master.sh
M : fragments/enable-services-minion.sh
c265ddedac7162c103672022633a574ea03edf6f
M : fragments/configure-flannel.sh
88a8bd0eefd45880451f4daffd47f0e592f5a62b
A : fragments/configure-docker-storage.sh
M : fragments/write-heat-params.yaml
M : kubenode.yaml
And compare that to the output of git log --oneline --name-status:
c285a21 configure flannel via systemd unit
D fragments/configure-flannel.sh
A fragments/flannel-config.service.yaml
A fragments/write-flannel-config.sh
M kubecluster.yaml
b06de8f call daemon-reload before starting services
M fragments/enable-services-master.sh
M fragments/enable-services-minion.sh
c265dde fix json syntax problem
M fragments/configure-flannel.sh
88a8bd0 configure cinder volume for docker storage
A fragments/configure-docker-storage.sh
M fragments/write-heat-params.yaml
M kubenode.yaml
...aaaand, that looks just about identical. Hopefully this is enough
to get you started.
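Once you have the changed paths for a commit, the filtering asked about in the question could be sketched as follows. This is just one possible approach: the set of extensions treated as documents is an assumption you would adapt to your project, and is_code_path and code_changes are hypothetical helper names. The sample paths are taken from the output shown above.

```python
import posixpath

# Assumption: which extensions count as "documents" is project-specific
DOCUMENT_EXTENSIONS = {".rtf", ".doc", ".docx", ".pdf", ".txt"}


def is_code_path(path):
    """True if the path does not look like a text document."""
    ext = posixpath.splitext(path)[1].lower()
    return ext not in DOCUMENT_EXTENSIONS


def code_changes(changed_paths):
    """Keep only the changed paths that look like code."""
    return [p for p in changed_paths if is_code_path(p)]


# Paths as they would be collected from the patches of one commit
changed = [
    "archived-output/NEW/action-core[best].rtf",
    "fragments/configure-flannel.sh",
    "kubecluster.yaml",
]
print(code_changes(changed))
```

A commit for which code_changes() returns an empty list only touched documents and can be skipped while walking.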
This is mainly a port of larsks's excellent answer to the current pygit2 API and Python 3.
It also fixes a flaw in the iteration logic: the original code would fail to diff the last revision against its parent when a revision range (a..b) is walked.
The following approximates the command
git log --name-status --pretty="format:Files changed in %h" origin/devel..master
on the sample repository given by larsks.
I was unable to trace file renames, though. A rename is printed as a deletion plus an addition, so the code line that prints a rename is never reached.
import pygit2

repo = pygit2.Repository('.')

# Show files changed between origin/devel and current HEAD
devel = repo.revparse_single('origin/devel')
walker = repo.walk(repo.head.target)
walker.hide(devel.id)
for cur in walker:
    if cur.parents:
        print(f'Files changed in {cur.short_id}')
        prev = cur.parents[0]
        diff = prev.tree.diff_to_tree(cur.tree)
        for patch in diff:
            print(patch.delta.status_char(), ':', patch.delta.new_file.path)
            if patch.delta.new_file.path != patch.delta.old_file.path:
                print(f'(was {patch.delta.old_file.path})')
            print()
I have a Subversion repo, e.g. "http://crsvn/trunk/foo", and I want to walk this directory or, for starters, simply get a directory listing.
The idea is to create a script that will do mergeinfo on all the branches in "http://crsvn/branches/bar" and compare them to trunk to see if the branch has been merged.
So the first problem I have is that I cannot walk or do
os.listdir('http://crsvn/branches/bar')
I get an error saying the volume label syntax is incorrect (mentioning the URL).
You can use PySVN. In particular, the pysvn.Client.list method should do what you want:
import pysvn
svncl = pysvn.Client()
entries = svncl.list("http://rabbitvcs.googlecode.com/svn/trunk/")
# Gives you a list of directories:
dirs = (entry[0].repos_path for entry in entries if entry[0].kind == pysvn.node_kind.dir)
list(dirs)
No checkout needed. You could even specify a revision to work on, to ensure your script can ignore other people working on the repository while it runs.
listdir takes a path and not a URL. It would be nice if Python could be aware of the structure on a remote server, but I don't think that is the case.
If you were to check out your repository locally first, you could easily walk the directories using Python's functions.