I would like to run the following command from a Python script using dulwich:
$ git branch --contains <myCommitSha> | wc -l
What I intend is to check whether a particular commit (sha) is contained in more than one branch.
Of course I could execute the above command from Python and parse the output (the number of branches), but that's a last-resort solution.
Any other ideas/comments? Thanks in advance.
Just in case someone was wondering how to do this now using gitpython:
repo.git.branch('--contains', YOURSHA)
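To mirror the | wc -l part of the original command, you can simply count the output lines; a minimal sketch (the repository path is a placeholder, and YOURSHA is the commit sha as a string):

import git

repo = git.Repo("<path to repo>")
# one output line per branch that contains the commit
num_branches = len(repo.git.branch('--contains', YOURSHA).splitlines())
print(num_branches)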
Since branches are just pointers to commits and don't "describe" trees in any way, there is nothing linking an arbitrary commit TO a branch.
The only sensible way to check whether a given commit is an ancestor of the commit a branch points to is to traverse the ancestor chain downwards from the branch-tip commit.
In other words, in dulwich I would iterate over branches and traverse backwards to see if a sha is on the chain.
I am rather certain that's exactly what git branch --contains <myCommitSha> does as I am not aware of any other shortcut.
Since your choice is (a) make python do the iteration or (b) make C do same iteration, I'd just go with C. :)
There is no built-in function for this, but you can of course implement this yourself.
You can also just do something like this (untested):
branches = [
    ref for ref in repo.refs.keys()
    if ref.startswith(b"refs/heads/")
    and any(entry.commit.id == YOURSHA  # YOURSHA must be a bytes sha
            for entry in repo.get_walker(include=[repo.refs[ref]]))
]
This will give you a list of all the branch heads that contain the given commit, but it has a runtime of O(n*m), n being the number of commits in your repo and m being the number of branches. The git implementation probably has a runtime of O(n).
In case anyone uses GitPython and wants all branches
import git
gLocal = git.Git("<LocalRepoLocation>")
gLocal.branch('-a','--contains', '<CommitSHA>').split('\n')
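The raw output still carries git's decorations (a leading "* " on the current branch, leading whitespace elsewhere); a small cleanup sketch under those assumptions:

branches = [line.lstrip('* ') for line in
            gLocal.branch('-a', '--contains', '<CommitSHA>').split('\n')]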
Similar questions have been raised many times, but I was not able to find a solution for my specific problem.
I was playing around with setuptools_scm recently and at first thought it was exactly what I need. I have it configured like this:
pyproject.toml
[build-system]
requires = ["setuptools_scm"]
build-backend = "setuptools.build_meta"
[project]
...
dynamic = ["version"]
[tool.setuptools_scm]
write_to = "src/hello_python/_version.py"
version_scheme = "python-simplified-semver"
and my __init__.py
from ._version import __version__
from ._version import __version_tuple__
Relevant features it covers for me:
I can use semantic versioning
it is able to use *.*.*.devN version strings
it increments minor version in case of feature-branches
it increments patch/micro version in case of fix-branches
This is all cool. As long as I am on my feature-branch I am able to get the correct version strings.
What I like particularly is, that the dev version string contains the commit hash and is thus unique across multiple branches.
My workflow now looks like this:
create feature or fix branch
commit, (push, ) publish
merge PR to develop-branch
As soon as I am on my feature-branch I am able to run python -m build, which generates a new _version.py with the correct version string according to the latest git tag found. If I add new commits, it is fine, as the devN part of the version string changes due to the commit hash. I would even be able to run python -m twine upload dist/* now. My package is built with the correct version, so I simply publish it. This works perfectly fine locally and on CI, for both fix and feature branches alike.
The problem I am facing now is that I need a slightly different behavior for my merged pull requests.
As soon as I merge, e.g. 0.0.1.dev####, I want to run my Jenkins job not on the feature-branch anymore, but on the develop-branch instead. And the important part now is, I want to:
get develop-branch (done by CI)
update the version string to the same as on the branch, but without devN, so: 0.0.1
build and publish
In fact, setuptools_scm is changing the version to 0.0.2.dev### now, and I would like to have 0.0.1.
I was tinkering a bit with creating git tags before running setuptools_scm or build, but I was not able to get the correct version string to put into the tag. At this point I am struggling.
Is anyone aware of a solution to my issue of having:
minor increment on feature-branches + add .devN
patch/micro increment on fix-branches + add .devN
no increment on develop-branch and version string only containing major.minor.patch of merged branch
TLDR: turning off writing the version number to a file every time setuptools_scm runs could solve your problem; alternatively, add the version file to .gitignore.
Explanation:
I also just started using setuptools_scm, so I am not very confident in using it yet.
But, as far as I understand, the version number is derived from and incremented according to the state of your repository (the detailed logic is documented here: https://github.com/pypa/setuptools_scm/#default-versioning-scheme).
If I am not mistaken, the tool does exactly what it is expected to: it does NOT derive the version from the tag alone, but also adds a devSomething suffix, because in your case the tag does not reference the most recent commit on the develop branch head.
I also had the problem that letting setuptools_scm generate a version and also write it to a file would itself change the worktree state since the last commit, again producing a dev version number.
To get a "clean" version number (e.g. v0.0.1) I had to do the tagging after merging (with a merge commit), since the merge commit is also taken into account by the version numbering logic.
Still, my setup is currently less complex than yours: just feature and fix branches, and a main branch without develop, so fewer merge commits (I chose to do merge commits, so no linear history). Now, after merging with a commit, I create a tag manually and choose its name myself.
And this also only works for me if I opt out of writing the version number into a file. I have done this by inserting the following into pyproject.toml:
[tool.setuptools_scm]
# intentionally empty/commented out
# write_to option leads to an unclean workspace during build
# which again is leading setuptools_scm to interpret this during build and producing wheels with unclean version numbers
# write_to = "version.txt"
Since setuptools_scm runs during the build, a new version file is generated, which pollutes your worktree. Since your worktree is never clean this way, you always get a dev version number. To still have a version file but have it ignored during the build, add the file to your .gitignore.
My approach is not perfect, some manual steps, but for now it works for me.
Certainly not 100% applicable in your CI scenario, but maybe you could change the order of doing merges and tags. I hope this helps somehow.
I would like to get the behavior of git show -s --format=%H in Dulwich; i.e. getting the full commit hash pointed to by HEAD. However, as it turns out, the porcelain.show() function behaves pretty much like git show but doesn't seem to accept additional options the way the Git CLI does.
I am not surprised, given porcelain.describe() behaves similarly. But what alternative means do I have with Dulwich to see the full commit hash of HEAD?
For the abbreviated hash - albeit hardcoded to 7 characters (!) - I can use the aforementioned porcelain.describe().
By consulting the code for porcelain.describe() we can pull the pieces together.
open_repo_closing offers a nice context manager for the dulwich.repo.BaseRepo class, with contextlib.closing behavior
BaseRepo.head() contains the information as bytes
A minimal implementation could look like this:
def get_latest_hash(repo):
    from dulwich.porcelain import open_repo_closing
    with open_repo_closing(repo) as r:
        return r.head().decode("ascii")
Simpler than I initially expected.
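For reference, usage is then just (assuming the current directory is a repository; open_repo_closing accepts both paths and repo objects):

print(get_latest_hash("."))  # full 40-character sha of HEAD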
I cloned the 'Apache/tomcat' git repo to use some information about commits.
However, when I use git.Repo('<repo local address>').iter_commits(), I can't get some commits.
Besides, I can't search these in github search engine.
For example, commit 69c56080fb3355507e1b55d014ec0ee6767a6150 is in the 'Apache tomcat' repo, however, search '69c56080fb3355507e1b55d014ec0ee6767a6150' in 'in this repository' get nothing.
This is surprising to me.
It seems like that the commit isn't in the master branch, so can't be searched?
I want to know the theory behind this and how to get info about these 'missing' commits in Python.
Thanks.
repo.iter_commits(), with no arguments, gives you the commits which can be reached by tracing back through the parent(s) of the current commit. In other words, if you are in the master branch, it will only give you commits that are part of the master branch.
You can give it a rev argument which, among other things, can be a branch name. For example, iter_commits(rev='8.5.x') ought to give you all commits on the 8.5.x branch, which will include 69c5608. You can use repo.branches if you need to get the list of branches.
Alternatively, if you already know the hash of a single commit that you want to look up, you can use repo.commit(), again with a rev parameter which in this case is the full or abbreviated commit hash: commit(rev='69c5608').
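Put together, a minimal sketch (the repository path is a placeholder):

import git

repo = git.Repo('<repo local address>')
# walk a specific branch instead of the current HEAD
for commit in repo.iter_commits(rev='8.5.x'):
    print(commit.hexsha)
# or look up a single commit directly by its (abbreviated) hash
print(repo.commit(rev='69c5608').summary)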
I believe the issue here is that this commit is on the 8.5.x branch and not on master. The commit's page on GitHub will show which branches include it. The GitHub search engine only searches the master/main/trunk branch.
To find it via git python library, try changing to that branch. See these instructions on how to switch branches: https://gitpython.readthedocs.io/en/stable/tutorial.html#switching-branches
Following up on my question: Unique Linux filename, sortable by time
I need to generate a UUID that is itself alpha-numerically sequential over time. I assume I'll need to prepend the system date: seconds since epoch plus nanoseconds. This means I really just need a UUID algorithm that is alpha-numerically sequential within a given nanosecond.
So for example, I'm thinking of UUIDs something like:
SECONDS_SINCE_EPOCH.NANOSECONDS.UID
The following bash:
for i in `seq 1 10`;
do
echo `date '+%s.%N'`.`uuidgen -t`
done
Results in:
1424718695.481439000.c8fef5d4-bb8f-11e4-92c7-00215e673861
1424718695.484130000.c8ff5eb6-bb8f-11e4-ae12-00215e673861
1424718695.486718000.c8ffc2ca-bb8f-11e4-ae15-00215e673861
1424718695.489267000.c90025bc-bb8f-11e4-a624-00215e673861
1424718695.491803000.c90089f8-bb8f-11e4-95ac-00215e673861
1424718695.494381000.c900ed76-bb8f-11e4-9058-00215e673861
1424718695.496899000.c901513a-bb8f-11e4-8018-00215e673861
1424718695.499460000.c901b440-bb8f-11e4-b382-00215e673861
1424718695.502007000.c90217a0-bb8f-11e4-89cd-00215e673861
1424718695.504532000.c90279d4-bb8f-11e4-b515-00215e673861
These file names appear as though they would suffice... but my fear is that I can't promise the names will be alpha-numerically sequential IF two files are created within the same nanosecond (think a large-scale enterprise system with tens of cores running many concurrent users). Because at that point I'm relying solely on the UUID algorithm for my unique name, and all the UUID algorithm promises is uniqueness, not "alpha-numeric-sequential-ness".
Any ideas for a method that can guarantee uniqueness AND alpha-numeric sequential order? Because we're dealing with large enterprise systems, I need to keep my requirements as old-school as possible but I can probably swing some older versions of Python and whatnot if a solution in pure bash isn't readily available.
Based on another answer, you could reorder the time portions of the UUID so that the most significant value comes first, down to the least significant. This is the more "natural" way that, say, UNIX time is presented, and it produces the sort order that you are looking for.
So the following bash should do the trick in your case (note that cut cannot reorder fields, it always emits them in input order, so awk does the reordering):
for i in `seq 1 10`; do
    echo $(date '+%s.%N').$(uuidgen -t | awk -F- '{print $3"-"$2"-"$1"-"$4"-"$5}')
done
Bear in mind that there are no guarantees. Given enough tries and enough time, a collision will occur. If at all possible, you may want to do some sanity checking further down the process chain to correct any such mistakes before the data gets entered into a permanent record.
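If an older Python is available, the same reordering is easy to express there too; a hedged sketch (uuid.uuid1() is the time-based variant, the field order mirrors the shell trick above, and the timestamp is only as precise as time.time() allows):

import time
import uuid

def sortable_id():
    # version-1 UUID fields: time_low-time_mid-time_hi-clock_seq-node;
    # put the most significant time field first so strings sort by time
    f = str(uuid.uuid1()).split('-')
    return '%.9f.%s' % (time.time(), '-'.join([f[2], f[1], f[0], f[3], f[4]]))

print(sortable_id())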
I often need to create two versions of an ipython notebook: One contains tasks to be carried out (usually including some python code and output), the other contains the same text plus solutions. Let's call them the assignment and the solution.
It is easy to generate the solution document first, then strip the answers to generate the assignment (or vice versa). But if I subsequently need to make changes (and I always do), I need to repeat the stripping process. Is there a reasonable workflow that will allow changes in the assignment to be propagated to the solutions document?
Partial self-answer: I have experimented with leveraging mercurial's hg copy, which will let two files with different names share history. But I can only get this to work if assignment and solution are in different directories, in two linked hg repositories. I would much prefer a simpler set-up. I've also noticed that diff gets very confused when one JSON file has more sections than another, making a VCS-based solution even less attractive. (To be clear: Ordinary use of a VCS with notebooks is fine; it's the parallel versions that stumble).
This question covers similar ground, but does not solve my problem. In fact an answer to my question would solve the OP's second remaining problem, "pulling changes" (see the Update section).
It sounds like you are maintaining an assignment and an answer key of some kind and want to be able to distribute the assignments (without solutions) to students, and still have the answers for yourself or a TA.
For something like this, I would create two branches "unsolved" and "solved". First write the questions on the "unsolved" branch. Then create the "solved" branch from there and add the solutions. If you ever need to update a question, update back to the "unsolved" branch, make the update and merge the change into "solved" and fix the solution.
You could try going the other way, but my hunch is that going "backwards" from solved to unsolved might be strange to maintain.
After some experimentation I concluded that it is best to tackle this by processing the notebook's JSON code. Version control systems are not the right approach, for the following reasons:
JSON doesn't diff very well when adding or deleting cells. A minimal change leads to mis-matched braces and a very messy diff.
In my use case, the superset version of the file (containing both the assignments and their solutions) must be the source document. This is because the assignment includes example code and output that depends on earlier parts to be written by the students. This model does not play well with version control, as pointed out by @ChrisPhillips in his answer.
I ended up filtering the JSON structure for the notebook and stripping out the solution cells; they may be recognized via special metadata (which can be set interactively using the metadata button in the interface), or by pattern-matching on the cell contents. The following snippet shows how to filter out cells whose first line starts with # SOLUTION:
import json
import re

def stripcell(cell, pattern):
    """Check if the first line of the cell's content matches `pattern`."""
    if cell["cell_type"] == "code":
        content = cell["input"]   # code cells store their text under "input"
    else:
        content = cell["source"]
    return len(content) > 0 and re.search(pattern, content[0])

pattern = r"^# SOLUTION:"
with open("input.ipynb") as f:
    struct = json.load(f)
# "worksheets" is the old nbformat v3 layout; v4 notebooks keep the cell
# list at the top level under struct["cells"]
cells = struct["worksheets"][0]["cells"]
struct["worksheets"][0]["cells"] = [c for c in cells if not stripcell(c, pattern)]
with open("output.ipynb", "w") as f:
    json.dump(struct, f, indent=1)
I used the generic json library rather than the notebook API. If there's a better way to go about it, please let me know.
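One possible alternative is the nbformat package, which reads and writes notebooks directly; a sketch, assuming nbformat is installed and converting to the v4 schema (where each cell's text lives in a single source string):

import re
import nbformat

nb = nbformat.read("input.ipynb", as_version=4)
nb.cells = [c for c in nb.cells if not re.match(r"# SOLUTION:", c.source)]
nbformat.write(nb, "output.ipynb")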