I am maintaining a collection of Python scripts that are distributed on several computers. Users might get the fancy idea of modifying the scripts, so I am looking for an automatic way to check script integrity.
I wanted to use the git ident attribute so that the file contains its own SHA-1, and then use git hash-object to compare.
It looks like this (.gitattributes contains *.py ident):
import subprocess

gitId = '$Id: 98a648abdf1cd8d563c72886a601857c20670013 $'  # this sha will be updated automatically at each commit on the file
gitId = gitId[5:-2]
shaCheck = subprocess.check_output(['git', 'hash-object', __file__]).strip().decode('UTF-8')
if shaCheck != gitId:
    print('file has been corrupted \n {} <> {}'.format(shaCheck, gitId))
# below the actual purpose of the script
This works fine when my script lies inside the git repository, but git hash-object returns a different SHA when the script is outside of my git repository. I guess it is a git filters issue, but I do not know how to get around it.
Any other painless way to check my file integrity is also welcome.
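For reference, the id that git hash-object reports for a committed file can also be reproduced without git. A minimal sketch, assuming ident is the only content filter in effect, so the expanded $Id: ... $ keyword has to be collapsed back to $Id$ before hashing (which is what the clean filter does on commit):
import hashlib
import re

def git_blob_sha1(path):
    # git stores a blob as sha1(b"blob <size>\0" + content)
    with open(path, 'rb') as f:
        data = f.read()
    data = re.sub(rb'\$Id:[^$]*\$', b'$Id$', data)  # undo the ident expansion
    header = b'blob %d\0' % len(data)
    return hashlib.sha1(header + data).hexdigest()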
You could check the file's hash with the Python module hashlib:
import hashlib

filename_1 = "./folder1/test_script.py"
with open(filename_1, "rb") as f:
    data = f.read()  # read entire file as bytes
readable_hash = hashlib.sha256(data).hexdigest()
print(readable_hash)

filename_2 = "./folder2/test_script.py"
with open(filename_2, "rb") as f:
    data = f.read()  # read entire file as bytes
readable_hash = hashlib.sha256(data).hexdigest()
print(readable_hash)
Output:
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117
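To turn this into the integrity check itself, you could compare against a known-good hash recorded outside the script; a small sketch (the path and expected value below are placeholders):
import hashlib

EXPECTED_SHA256 = "a0c22dc5d16db10ca0e3d99859ffccb2b4d536b21d6788cfbe2d2cfac60e8117"  # placeholder
filename = "./folder1/test_script.py"  # placeholder

with open(filename, "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
if actual != EXPECTED_SHA256:
    print('file has been corrupted \n {} <> {}'.format(actual, EXPECTED_SHA256))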
I'm working with Databricks and am trying to find a good setup for my project. Basically, I have to process many files, and therefore I wrote quite a bit of Python code. I'm now faced with the problem of running this code on a Spark cluster in such a way that my source code files are on the respective nodes as well. My approach so far (it even works) is to call map on an RDD, which then runs the parse_file function for each element of the RDD. Is there a better approach, so that I don't have to clone the whole git repository every time?
import os
import subprocess
import sys

def parse_file(filename, string):
    os.system("rm -rf source_code_folder")
    os.system("git clone https://user:password@dev.azure.com/company/project/_git/repo source_code_folder")
    sys.path.append(os.path.join(subprocess.getoutput("pwd"), "source_code_folder/"))
    from my_module1.whatever import my_function
    from my_module2.whatever import some_other_function
    result = (my_function(string), some_other_function(string))
    return result

my_rdd = sc.wholeTextFiles("abfss://raw@storage_account.dfs.core.windows.net/files/*.txt")
processed_rdd = my_rdd.map(lambda x: (x[0], parse_file(x[0], x[1])))
Thanks!
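One alternative sketch (not from the original setup; the archive name and module names are assumptions): package the already-cloned source once on the driver and ship it to the executors with SparkContext.addPyFile, so each task only imports instead of cloning. It reuses the existing sc from the question.
import shutil

# zip the already-cloned source once on the driver (paths are assumptions)
shutil.make_archive("source_code", "zip", "source_code_folder")
sc.addPyFile("source_code.zip")  # shipped to every executor and put on sys.path

def parse_file(filename, string):
    # imports now resolve against the shipped archive on the executor
    from my_module1.whatever import my_function
    from my_module2.whatever import some_other_function
    return (my_function(string), some_other_function(string))

my_rdd = sc.wholeTextFiles("abfss://raw@storage_account.dfs.core.windows.net/files/*.txt")
processed_rdd = my_rdd.map(lambda x: (x[0], parse_file(x[0], x[1])))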
I am trying to have 2 simultaneous versions of a single package on a server: a production one and a testing one.
I want these 2 to be in the same git repository on 2 different branches (the testing branch would merge into production); however, I would love to keep them in the same directory so that it is not necessary to change any imports or paths.
Is it possible to dynamically change the package name in setup.py, depending on the git branch?
Or is it possible to deploy them with different names using pip?
EDIT: I may have found a proper solution for my problem here: Git: ignore some files during a merge (keep some files restricted to one branch)
Gitattributes can be set up to ignore merging of my setup.py; I'll close this question after I test it.
This could be done with a setup script that looks like this:
#!/usr/bin/env python3
import pathlib
import setuptools


def _get_git_ref():
    ref = None
    git_head_path = pathlib.Path(__file__).parent.joinpath('.git', 'HEAD')
    with git_head_path.open('r') as git_head:
        ref = git_head.readline().split()[-1]
    return ref


def _get_project_name():
    name_map = {
        'refs/heads/master': 'ThingProd',
        'refs/heads/develop': 'ThingTest',
    }
    git_ref = _get_git_ref()
    name = name_map.get(git_ref, 'ThingUnknown')
    return name


setuptools.setup(
    # see 'setup.cfg'
    name=_get_project_name(),
)
It reads the current git ref directly from the .git/HEAD file and looks up the corresponding name in a table.
Inspired by: https://stackoverflow.com/a/56245722/11138259.
Using a .gitattributes file with the content "setup.py merge=ours", together with setting up git config --global merge.ours.driver true, makes the merge "omit" the setup.py file (it keeps our version instead). This seems to work only if both the master and the child branch have changed the file since they first diverged.
I am looking to get only the diff of a file changed in a git repo. Right now, I am using GitPython to get the commit objects and the files of the git changes, but I want to do a dependency analysis on only the changed parts of the file. Is there any way to get the git diff from GitPython, or am I going to have to compare each of the files by reading them line by line?
If you want to access the contents of the diff, try this:
import git

repo = git.Repo(repo_root.as_posix())
commit_dev = repo.commit("dev")
commit_origin_dev = repo.commit("origin/dev")
diff_index = commit_origin_dev.diff(commit_dev)
for diff_item in diff_index.iter_change_type('M'):
    print("A blob:\n{}".format(diff_item.a_blob.data_stream.read().decode('utf-8')))
    print("B blob:\n{}".format(diff_item.b_blob.data_stream.read().decode('utf-8')))
This will print the contents of each file.
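If you want the unified diff text itself rather than the full blob contents, the same DiffIndex can be created with create_patch=True; a sketch building on the snippet above:
diff_index = commit_origin_dev.diff(commit_dev, create_patch=True)
for diff_item in diff_index.iter_change_type('M'):
    print(diff_item.diff.decode('utf-8'))  # the patch text for this file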
You can use GitPython with the git "diff" command; you just need to use the "tree" object of each commit, or of the branch for which you want to see the diffs, for example:
from git import Repo

repo = Repo('/git/repository')
t = repo.head.commit.tree
print(repo.git.diff(t))
This will print "all" the diffs for all files included in this commit, so if you want each one you must iterate over them.
For the current branch it's:
repo.git.diff('HEAD~1')
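To limit the output to a single file, extra arguments are passed straight through to git diff (the path below is just an example), reusing the tree t from above:
print(repo.git.diff(t, '--', 'my_module/some_file.py'))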
Hope this helps, regards.
Git does not store the diffs, as you have noticed. Given two blobs (before and after a change), you can use Python's difflib module to compare the data.
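A minimal sketch of that approach, using GitPython to read the two blobs and difflib to compare them (the file path and revisions are placeholders):
import difflib
import git

repo = git.Repo('.')
old_blob = repo.commit('HEAD~1').tree / 'some_file.py'  # blob before the change
new_blob = repo.commit('HEAD').tree / 'some_file.py'    # blob after the change

old_lines = old_blob.data_stream.read().decode('utf-8').splitlines(keepends=True)
new_lines = new_blob.data_stream.read().decode('utf-8').splitlines(keepends=True)

print(''.join(difflib.unified_diff(old_lines, new_lines,
                                   fromfile='a/some_file.py',
                                   tofile='b/some_file.py')))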
I'd suggest you use PyDriller instead (it uses GitPython internally). It is much easier to use:
from pydriller import Repository

for commit in Repository("path_to_repo").traverse_commits():
    for modified_file in commit.modified_files:  # here you have the list of modified files
        print(modified_file.diff)
        # etc...
You can also analyze a single commit by doing:
for commit in Repository("path_to_repo", single="123213").traverse_commits():
    ...
If you're looking to recreate something close to what a standard git diff would show, try:
# cloned_repo = git.Repo.clone_from(
#     url=ssh_url,
#     to_path=repo_dir,
#     env={"GIT_SSH_COMMAND": "ssh -i " + SSH_KEY},
# )
repo_diff = ""
for diff_item in cloned_repo.index.diff(None, create_patch=True):
    repo_diff += (
        f"--- a/{diff_item.a_blob.name}\n+++ b/{diff_item.b_blob.name}\n"
        f"{diff_item.diff.decode('utf-8')}\n\n"
    )
If you want to do a git diff on a file between two commits, this is the way to do it:
import git

repo = git.Repo()
path_to_a_file = "diff_this_file_across_commits.txt"
commits_touching_path = list(repo.iter_commits(paths=path_to_a_file))
print(repo.git.diff(commits_touching_path[0], commits_touching_path[1], path_to_a_file))
This will show you the differences between the two latest commits that touched the file you specify.
repo.git.diff("main", "HEAD~5")
PyDriller +1
pip install pydriller
But note that newer versions introduced breaking API changes; with the new API:
from pydriller import Repository

for commit in Repository('https://github.com/ishepard/pydriller').traverse_commits():
    print(commit.hash)
    print(commit.msg)
    print(commit.author.name)
    for file in commit.modified_files:
        print(file.filename, ' has changed')
Here is how you do it:
import git

repo = git.Repo("path/of/repo/")
# the below gives us all commits (newest first)
commits = list(repo.iter_commits())
# take the two most recent commits
a_commit = commits[0]
b_commit = commits[1]
# now get the diff between them
print(repo.git.diff(b_commit, a_commit))
I have a subversion repo, i.e. "http://crsvn/trunk/foo" ... I want to walk this directory or, for starters, simply get a directory listing.
The idea is to create a script that will do mergeinfo on all the branches in "http://crsvn/branches/bar" and compare them to trunk to see if the branch has been merged.
So the first problem I have is that I cannot walk or do
os.listdir('http://crsvn/branches/bar')
I get "the volume label syntax is incorrect" (mentioning the URL).
You can use PySVN. In particular, the pysvn.Client.list method should do what you want:
import pysvn
svncl = pysvn.Client()
entries = svncl.list("http://rabbitvcs.googlecode.com/svn/trunk/")
# Gives you a list of directories:
dirs = (entry[0].repos_path for entry in entries if entry[0].kind == pysvn.node_kind.dir)
list(dirs)
No checkout needed. You could even specify a revision to work on, to ensure your script can ignore other people working on the repository while it runs.
listdir takes a path, not a URL. It would be nice if Python could be aware of the structure on a remote server, but I don't think that is the case.
If you were to check out your repository locally first, you could easily walk the directories using Python's functions.
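A sketch of that checkout-then-walk approach with pysvn (the URL and working-copy path are just examples):
import os
import pysvn

client = pysvn.Client()
client.checkout('http://crsvn/branches/bar', './bar_wc')  # create a local working copy
for dirpath, dirnames, filenames in os.walk('./bar_wc'):
    dirnames[:] = [d for d in dirnames if d != '.svn']  # skip svn metadata
    print(dirpath, filenames)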