How to get the last commit of a specific file using Python? - python

I tried with GitPython, but I am only getting the hash of the current commit.
import git
repo = git.Repo(search_parent_directories=True)
repo.head.commit.hexsha
But, for traceability, I want to store the git commit hash of a specific file, i.e. the equivalent of this command (using git):
git log -n 1 --pretty=format:%h -- experiments/test.yaml
Is it possible to achieve this with GitPython?

An issue like "How do I get the SHA key for any repository file?" points to the Tree object, which provides recursive traversal of git trees with access to all metadata, including the SHA1 hash.
self.assertEqual(tree['smmap'], tree / 'smmap')  # access by index and by sub-path

for entry in tree:  # intuitive iteration of tree members
    print(entry)

blob = tree.trees[0].blobs[0]  # let's get a blob in a sub-tree
assert blob.name
blob.hexsha would be the SHA1 of the blob.
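To answer the question at the top directly: the same repo object can give you the last commit that touched a given path via iter_commits, the approach also used in the answers further down. A minimal sketch, assuming the experiments/test.yaml path from the question:
import git

repo = git.Repo(search_parent_directories=True)

# GitPython equivalent of: git log -n 1 --pretty=format:%h -- experiments/test.yaml
commit = next(repo.iter_commits(paths='experiments/test.yaml', max_count=1))
print(commit.hexsha)      # full commit hash
print(commit.hexsha[:7])  # abbreviated hash; note git's %h may use more digits to stay unique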

Related

How to check that a script has not been modified - tried with git attribute ident $Id$

I am maintaining a collection of python scripts that are distributed on several computers. Users might have the fancy idea to modify the scripts so I am looking for an automatic solution to check the script integrity.
I wanted to use git attribute ident so that the file contains its own sha1 and then use git hash-object to compare.
It looks like this (.gitattributes contains *.py ident):
import subprocess

gitId = '$Id: 98a648abdf1cd8d563c72886a601857c20670013 $'  # this sha will be updated automatically at each commit on the file
gitId = gitId[5:-2]

shaCheck = subprocess.check_output(['git', 'hash-object', __file__]).strip().decode('UTF-8')
if shaCheck != gitId:
    print('file has been corrupted \n {} <> {}'.format(shaCheck, gitId))

# below, the actual purpose of the script
This works fine when my script lies inside the git repository, but git hash-object returns a different sha when the script is outside of my git repository. I guess there is some git filters issue, but I do not know how to get around it.
Any other painless way to check my file integrity is also welcome.
You could check the file's hash with the Python module hashlib:
import hashlib

filename_1 = "./folder1/test_script.py"
with open(filename_1, "rb") as f:
    bytes = f.read()  # read entire file as bytes
    readable_hash = hashlib.sha256(bytes).hexdigest()
    print(readable_hash)

filename_2 = "./folder2/test_script.py"
with open(filename_2, "rb") as f:
    bytes = f.read()  # read entire file as bytes
    readable_hash = hashlib.sha256(bytes).hexdigest()
    print(readable_hash)
Output:
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
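Note that this hashes the raw file bytes with SHA-256, so it will not match the SHA-1 that git stores for the blob. If the goal is to reproduce git hash-object outside of a repository, the blob hash can be computed by hand; a minimal sketch, assuming the ident filter is the only filter in play (the expanded $Id: ... $ line is collapsed back to $Id$ before hashing, which is what the clean filter would do on commit):
import hashlib
import re

def git_blob_sha1(path):
    with open(path, 'rb') as f:
        data = f.read()
    # undo the ident smudge filter: collapse '$Id: <sha1> $' back to '$Id$'
    data = re.sub(rb'\$Id: [0-9a-f]{40} \$', rb'$Id$', data)
    # git hashes blobs as: 'blob <size>\0<contents>'
    header = b'blob %d\0' % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_sha1(__file__))  # should match the sha stored in $Id$ after a commit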

How to grab specific object metadata info from Google Cloud Storage?

I want to:
Access all GCP projects linked to my google account.
Get all buckets that contain the word foobar in their name.
Retrieve some of the metadata fields provided by Google (Creation time, Update time, Storage class, Content-Length, Content-Type, Hash (crc32c), Hash, ETag, Generation, Metageneration, ACL, TOTAL), for example Creation time, Content-Type and TOTAL.
Save the results in a .csv / dataframe format with fields like: foobar, Creation time, Content-Type, TOTAL
I don't want to:
Although I think only files have metadata, in case sub-directories have metadata too, I don't want to grab sub-directories' metadata.
Overdo it with the parsing through folders. Some of the buckets have tons of subdirectories. I want the cheapest way possible to get to the objects of interest.
What I have so far:
I use gcloud projects list to get all projects linked to my account.
I manually create a .csv file with the fields: project_id, recursive, selected. recursive TRUE is for those projects I know don't have that many folders, so I can afford to look through all sub-directories. selected TRUE just helps me go through some of the projects and not all.
For all the projects where the selected field is TRUE I collect the data and save it in a file with the following command:
gsutil ls -L -p "${project}" gs://*foobar* >> non_recursive.csv
For all the projects where the selected and the recursive fields are TRUE, I collect the data and save it in a file with the following command:
gsutil ls -r -L -p "${project}" gs://*secret* >> recursive.csv
So my questions:
How can I modify this: gsutil ls -L -p "${project}" gs://*foobar* >> non_recursive.csv to collect only some of the metadata fields and to output it in the dataframe format mentioned above?
Is there a better way to do the above? (Python or Bash solutions only please)
You can generate a list of the files for which you want to fetch metadata, and then generate a gsutil ls command for each, e.g.,
sed 's/\(.*\)/gsutil ls -L \1/' objects_to_list | sh
If there are a large number of such objects you could do the listings in parallel, e.g.,
sed 's/\(.*\)/gsutil ls -L \1/' objects_to_list | split -l 100 - LISTING_PART
for f in LISTING_PART*; do
    sh $f > $f.out &
done
wait
This gets filename and mimeType:
blobs = storage_client.list_blobs(BUCKET)
for blob in blobs:
    item = {'content': "gs://{}/{}".format(blob.bucket.name, blob.name), 'mimeType': "{}".format(blob.content_type)}
    print(item)
You can get other metadata as well.
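If you also need the other fields from the question (Creation time, Content-Type, size, etc.) in CSV form, the Blob objects returned by list_blobs expose most of them as properties. A minimal sketch, assuming the google-cloud-storage client and a hypothetical bucket name and output file:
import csv
from google.cloud import storage

storage_client = storage.Client()
BUCKET = 'my-foobar-bucket'  # hypothetical bucket name

with open('metadata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'creation_time', 'content_type', 'size_bytes'])
    for blob in storage_client.list_blobs(BUCKET):
        writer.writerow([
            "gs://{}/{}".format(blob.bucket.name, blob.name),
            blob.time_created,   # Creation time
            blob.content_type,   # Content-Type
            blob.size,           # object size in bytes, roughly the TOTAL column
        ])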

extracting git time recursivley for subfolders and files

I am trying to create a dictionary with elements in the format filename: timestamp in yy-mm-dd hh:mm:ss. This should recursively include all subfolders and files in the repo. I came across this piece of code:
import git
repo = git.Repo("./repo")
tree = repo.tree()
for blob in tree:
    commit = repo.iter_commits(paths=blob.path, max_count=1).next()
    print(blob.path, commit.committed_date)
However, this includes only the top-level folders. How do I include subfolders and files recursively?
Note: the following solution by Roland here does not include subfolders, only files. Also, I need to be in the path where the git repo is downloaded and then run the script by giving its absolute path:
Get time of last commit for Git repository files via Python?
This works for me
http://gitpython.readthedocs.io/en/stable/tutorial.html#the-tree-object
As per the doc: "As trees allow direct access to their intermediate child entries only, use the traverse method to obtain an iterator to retrieve entries recursively."
It creates a generator object which does the work:
print tree.traverse()
<generator object traverse at 0x0000000004129DC8>
d = dict()
for blob in tree.traverse():
    commit = repo.iter_commits(paths=blob.path).next()
    d[blob.path] = commit.committed_date
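Since the question asks for the timestamps as yy-mm-dd hh:mm:ss strings rather than raw Unix times, committed_date can be converted with datetime. A minimal sketch building on the same repo and tree:
from datetime import datetime
import git

repo = git.Repo("./repo")
tree = repo.tree()

d = dict()
for blob in tree.traverse():
    commit = next(repo.iter_commits(paths=blob.path, max_count=1))
    # committed_date is a Unix timestamp; format it in local time
    d[blob.path] = datetime.fromtimestamp(commit.committed_date).strftime('%y-%m-%d %H:%M:%S')
print(d)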

Extract commits related to code changes from commit tree

Right now I am able to traverse the commit tree for a GitHub repository using the pygit2 library. I am getting all the commits for each file change in the repository, which means I am also getting changes for text files with the .rtf extension. How do I keep only the commits that are related to code changes? I don't want the changes related to text documents.
Appreciate any help or pointers. Thanks.
last = repo[repo.head.target]
t0 = last
f = open(outputFile, 'w')
print t0.hex
for commit in repo.walk(last.id):
    if t0.hex == commit.hex:
        continue
    print commit.hex
    out = repo.diff(t0, commit)
    f.write(out.patch)
    t0 = commit
As part of the output, I get the difference in rtf files as well as below:
diff --git a/archived-output/NEW/action-core[best].rtf b/archived-output/NEW/action-core[best].rtf
deleted file mode 100644
index 56cdec6..0000000
--- a/archived-output/NEW/action-core[best].rtf
+++ /dev/null
@@ -1,8935 +0,0 @@
-{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f0\fbidi \froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fbidi \fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
-{\f2\fbidi \fmodern\fcharset0\fprq1{\*\panose 02070309020205020404}Courier New;}{\f3\fbidi \froman\fcharset2\fprq2{\*\panose 05050102010706020507}Symbol;}
Either I have to filter the commits from the tree or I have to filter the output. I was wondering if I could remove the changes related to rtf files by removing the corresponding commits while walking through the tree.
If that is possible, how do we get the list of modified files?
Ah, now you're asking the right questions! Git, of course, does not store a list of modified files in each commit. Rather, each commit represents the state of the entire repository at a certain point in time. In order to find the modified files, you need to compare the files contained in one commit with the previous commit.
For each commit returned by repo.walk(), the tree attribute refers to the associated Tree object (which is itself a list of TreeEntry objects representing files and directories contained in that particular Tree).
A Tree object has a diff_to_tree() method that can be used to compare it against another Tree object. This returns a Diff object, which acts as an iterator over a list of Patch objects. Each Patch object refers to the changes in a single file between the two Trees that are being compared.
The Patch object is really the key to all this, because this is how we determine which files have been modified.
The following code demonstrates this. For each commit, it will print a list of new, modified, or deleted files:
import stat
import pygit2

repo = pygit2.Repository('.')

prev = None
for cur in repo.walk(repo.head.target):
    if prev is not None:
        print prev.id
        diff = cur.tree.diff_to_tree(prev.tree)
        for patch in diff:
            print patch.status, ':', patch.new_file_path,
            if patch.new_file_path != patch.old_file_path:
                print '(was %s)' % patch.old_file_path,
            print

    if cur.parents:
        prev = cur
        cur = cur.parents[0]
If we run this against a sample repository, we can look at the output for the first few commits:
c285a21e013892ee7601a53df16942cdcbd39fe6
D : fragments/configure-flannel.sh
A : fragments/flannel-config.service.yaml
A : fragments/write-flannel-config.sh
M : kubecluster.yaml
b06de8f2f366204aa1327491fff91574e68cd4ec
M : fragments/enable-services-master.sh
M : fragments/enable-services-minion.sh
c265ddedac7162c103672022633a574ea03edf6f
M : fragments/configure-flannel.sh
88a8bd0eefd45880451f4daffd47f0e592f5a62b
A : fragments/configure-docker-storage.sh
M : fragments/write-heat-params.yaml
M : kubenode.yaml
And compare that to the output of git log --oneline --name-status:
c285a21 configure flannel via systemd unit
D fragments/configure-flannel.sh
A fragments/flannel-config.service.yaml
A fragments/write-flannel-config.sh
M kubecluster.yaml
b06de8f call daemon-reload before starting services
M fragments/enable-services-master.sh
M fragments/enable-services-minion.sh
c265dde fix json syntax problem
M fragments/configure-flannel.sh
88a8bd0 configure cinder volume for docker storage
A fragments/configure-docker-storage.sh
M fragments/write-heat-params.yaml
M kubenode.yaml
...aaaand, that looks just about identical. Hopefully this is enough to get you started.
This is mainly a rewrite of larsks's excellent answer for the current pygit2 API and Python 3.
It also fixes a flaw in the iteration logic: the original code would fail to diff the last revision against its parent when a revision range (a..b) is walked.
The following approximates the command
git log --name-status --pretty="format:Files changed in %h" origin/devel..master
on the sample repository given by larsks.
I was unable to trace file renames, though. This is printed as a deletion and an addition. The code line printing a rename is never reached.
import pygit2

repo = pygit2.Repository('.')

# Show files changed between origin/devel and current HEAD
devel = repo.revparse_single('origin/devel')
walker = repo.walk(repo.head.target)
walker.hide(devel.id)

for cur in walker:
    if cur.parents:
        print(f'Files changed in {cur.short_id}')
        prev = cur.parents[0]
        diff = prev.tree.diff_to_tree(cur.tree)
        for patch in diff:
            print(patch.delta.status_char(), ':', patch.delta.new_file.path)
            if patch.delta.new_file.path != patch.delta.old_file.path:
                print(f'(was {patch.delta.old_file.path})')
        print()
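To address the original question of ignoring documentation changes, the per-file paths exposed on patch.delta make the filtering straightforward. A sketch of the inner loop above, assuming you simply want to skip anything ending in .rtf:
SKIP_EXTENSIONS = ('.rtf',)  # extend as needed, e.g. ('.rtf', '.md', '.txt')

for patch in diff:
    new_path = patch.delta.new_file.path
    old_path = patch.delta.old_file.path
    if new_path.endswith(SKIP_EXTENSIONS) or old_path.endswith(SKIP_EXTENSIONS):
        continue  # ignore changes to text documents
    print(patch.delta.status_char(), ':', new_path)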

Turn subversion path into walkable directory

I have a subversion repo, i.e. "http://crsvn/trunk/foo" ... I want to walk this directory or, for starters, simply get a directory listing.
The idea is to create a script that will run mergeinfo on all the branches in "http://crsvn/branches/bar" and compare them to trunk to see whether each branch has been merged.
So the first problem I have is that I cannot walk or do
os.listdir('http://crsvn/branches/bar')
I get an error saying the label syntax is incorrect (mentioning the URL).
You can use PySVN. In particular, the pysvn.Client.list method should do what you want:
import pysvn
svncl = pysvn.Client()
entries = svncl.list("http://rabbitvcs.googlecode.com/svn/trunk/")
# Gives you a list of directories:
dirs = (entry[0].repos_path for entry in entries if entry[0].kind == pysvn.node_kind.dir)
list(dirs)
No checkout needed. You could even specify a revision to work on, to ensure your script can ignore other people working on the repository while it runs.
listdir takes a path, not a URL. It would be nice if Python could be aware of the structure on a remote server, but I don't think that is the case.
If you were to check out your repository locally first, you could easily walk the directories using Python's functions.
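For completeness, once the repository is checked out locally (e.g. with svn checkout), walking it is just standard library code. A minimal sketch, assuming a hypothetical working-copy path and skipping the .svn metadata directories:
import os

for root, dirs, files in os.walk('local-checkout'):   # hypothetical working copy path
    dirs[:] = [d for d in dirs if d != '.svn']        # don't descend into svn metadata
    for name in files:
        print(os.path.join(root, name))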
