Our software is modular and I have about 20 git repos in one project.
If a test fails, it is sometimes hard to find the commit that caused it, since several developers work on these 20 repos.
I know the test worked yesterday and fails reproducibly today.
Sometimes I use git bisect, but that works only within a single git repo.
Often a test fails because of changes in two repos combined.
I could write a dirty script which loops over my N git repos myself (roughly like the sketch below), but before doing so, I would like to know how experts would solve this.
I use Python, Django and pytest, but AFAIK this does not matter for this question.
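For illustration only, here is roughly what that dirty script might look like: a per-repo HEAD snapshot taken after every green run, replayed when a failure appears. The repo list, snapshot file names and test path are all invented.

    # Rough sketch of the "dirty script" idea, not an expert answer.
    REPOS="repo1 repo2 repo3"                      # ...list all ~20 repos here

    # snapshot every repo's HEAD after each green test run (e.g. from CI):
    for r in $REPOS; do
        ( cd "$r" && echo "$r $(git rev-parse HEAD)" )
    done > "heads-$(date +%F).txt"

    # to investigate today's failure, roll every repo back to yesterday's snapshot:
    while read -r r sha; do
        ( cd "$r" && git checkout --detach "$sha" )
    done < heads-2024-06-01.txt                    # hypothetical known-good snapshot
    pytest tests/test_that_fails.py                # hypothetical failing test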
I personally prefer to use the repo tool to manage complex multi-repo projects. Put those 20 repos into a manifest.xml, and each time a build starts, create a pinned snapshot manifest; if the build fails, diff the manifests to see what was changed and where.
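A minimal sketch of that idea, assuming a reasonably recent version of repo (which provides manifest -r for pinned snapshots and a diffmanifests subcommand); the file names are examples:

    repo manifest -r -o "snapshot-$(date +%F).xml"   # pin every project at its current revision
    # after a failed build, compare the last good snapshot against the current one:
    repo diffmanifests snapshot-2024-06-01.xml snapshot-2024-06-02.xml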
There is a category of QA tooling for "reverse dependency" CI builds, where your higher-level projects get rebuilt every time a lower-level component changes. At scale it can be resource-intensive.
The entire class of problem goes away if you stop dealing with repo-to-repo relationships and start following a versioned-release methodology for the subcomponents. Then you track the versions of your lower-level dependencies and, when you upgrade one, you know exactly which bump broke the test. Your CI could even build against several versions of a dependency if you wanted to systematize it.
Git submodules accomplish that tracking for individual commits, so you again get to decide when to incorporate changes from lower levels. (Notably, that can also be used like released versions if you only ever update to tagged release commits.)
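As a sketch of the submodule variant, with invented URLs and tags, pinning and later bumping a subcomponent looks roughly like this:

    git submodule add https://example.org/lib-a.git libs/lib-a
    git -C libs/lib-a checkout v1.4.0
    git add .gitmodules libs/lib-a
    git commit -m "Pin lib-a at v1.4.0"
    # later, upgrading is an explicit, bisect-able commit in the top-level repo:
    git -C libs/lib-a fetch --tags
    git -C libs/lib-a checkout v1.5.0
    git add libs/lib-a
    git commit -m "Bump lib-a to v1.5.0"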
I have yet to come across an answer that makes me WANT to start using virtual environments. I understand how they work, but what I don't understand is this: someone like me can have hundreds of Python projects on their drive, almost all of which use the same packages (like Pandas and NumPy), yet if every project had its own venv, you'd have to pip install those same packages over and over and over again, wasting a lot of space for no reason. Not to mention if any of them also require a package like tensorflow.
The only real benefit I can see to using venvs in my case is to mitigate version issues, but for me that's really not as big an issue as it's portrayed. If a project of mine becomes out of date, I update its packages.
Why install the same dependency for every project when you can just do it once, globally, for all of them? I know you can also specify --global-dependencies or whatever the flag is when creating a new venv, but since ALL of my Python packages are installed globally (hundreds of dependencies are pip installed already), I don't want the new venv to make use of ALL of them. Can I specify only specific global packages to use in a venv? That would make more sense.
What else am I missing?
UPDATE
I’m going to elaborate and clarify my question a bit as there seems to be some confusion.
I’m not so much interested in understanding HOW venv’s work, and I understand the benefits that can come with using them. What I’m asking is:
Why would someone who has (for example) 100 different projects that all require tensorflow install it into each project's own venv? That would mean installing tensorflow 100 separate times. That's not just a "little" extra space being wasted, that's a lot.
I understand they mitigate dependency versioning issues: you can "freeze" packages at their current working versions and forget about them, great. And maybe I'm just unique in this respect, but the versioning issue (besides the obvious difference between Python 2 and 3) really hasn't been THAT big of a problem for me. Yes, I've run into it, but isn't it better practice to keep your projects up to date with the current working/stable versions than to freeze them with old, possibly no longer supported versions? Sure it works, but that doesn't seem to be the "best" option to me either.
To reiterate the second part of my question: if I have (for example) tensorflow installed globally, and I create a venv for each of my 100 tensorflow projects, is there a way to make use of the already globally installed tensorflow inside a venv without having to install it again? I know that in PyCharm, and probably on the command line, you can use a --system-site-packages argument (or whatever it is) to make that happen, but I don't want to include ALL of the globally installed dependencies, because I have hundreds of those too. Is --system-site-packages=tensorflow, for example, a thing?
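(For reference, --system-site-packages is the real venv flag, but as far as I know it is all-or-nothing; there is no per-package variant. A small sketch, assuming a POSIX shell:)

    python -m venv --system-site-packages env
    source env/bin/activate
    python -c "import tensorflow as tf; print(tf.__version__)"   # resolves to the global install
    # note: pip keeps a local wheel cache, so installing tensorflow into a second,
    # fully isolated venv typically reuses the downloaded wheel rather than re-downloading it
    # (a hypothetical --system-site-packages=tensorflow does not exist)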
Hope that helps to clarify what I'm looking for out of this discussion, because so far I have no use for venvs other than everyone else claiming how great they are. I guess I see it a bit differently :P
(FINAL?) UPDATE
From the great discussions I've had with the contributors below, here is a summation of where I think venv's are of benefit and where they're not:
USE a venv:
You're working on one BIG project with multiple people, to mitigate versioning issues across the team
You don't plan on updating your dependencies very often for all projects
To have a clearer separation of your projects
To containerize your project (again, for distribution)
Your portfolio of projects is fairly small (especially in the data-science world, where packages like Tensorflow are large and used across most projects, since you'd have to pip install the same package into each venv)
DO NOT use a venv:
Your portfolio of projects is large AND requires a lot of heavy dependencies (like tensorflow) to mitigate installing the same package in every venv you create
You're not distributing your projects across a team of people
You're actively maintaining your projects and keeping global dependency versions up to date across all of them (maybe I'm the only one who's actually doing this, but whatever)
As was recently mentioned, I guess it depends on your use case. Working on a website that requires contributions from many people at once, it makes sense for everyone to work out of one shared environment. But for someone like me, with a massive portfolio of Tensorflow projects that have no versioning issues and no other team members, it doesn't. Maybe if you plan on containerizing or distributing a project it makes sense to do so on an individual basis, but (going back to this example) having 100 different venvs for 100 Tensorflow projects means installing tensorflow 100 times, which is no different from having to pip install tensorflow==2.2.0 for a specific old project you want to run; in that case, just keep your projects up to date.
Maybe I'm missing something else major here, but that's the best I've come up with so far. Hope it helps someone else who's had a similar thought.
I'm a data scientist and sometimes I run into these things called "virtual environments" and I don't get what the use case is? I already have all of these packages and modules and widgets downloaded! Why should I set up a separate place where I manage all of the stuff I'm already managing globally?
Python is a very powerful tool. In this answer, consider two ways to swing the metaphorical hammer:
Data Science
Software Engineering
For a data scientist (working alone) using Python to write a proof of concept for a research paper, build an LSTM neural network, or predict the price of TSLA from the frequency of Elon Musk's tweets, all that really matters is being able to use the best library (tensorflow, pytorch, sklearn, ...) for whatever task they're trying to get done, in whatever directory they happen to be working in. It is very tempting to use one global Python installation and just use the same stuff everywhere. Frankly, this is probably fine: it's just one person managing their own space. The configuration of their machine would be one single Python environment that everything, everywhere uses. Or, if the data scientist wanted to, they could have a single directory containing one virtual environment and some subdirectories containing all the scripts (projects) they work on.
Now consider a software engineer who has multiple git repos with complete CI/CD pipelines, each building into a separate artifact that then gets deployed to some cloud environment. They and the nine other people on their team need to be sure that they are all making changes that won't break any piece of the code. For example, in Python 3.7 the function dict.popitem changed subtly from returning an arbitrary element of the dict to guaranteed LIFO order. It's pretty easy to see how that could cause issues if Jerry had implemented a function that relies on the original arbitrary behavior and Bob implemented a function that assumes the LIFO guarantee. This team of engineers would have git repos that each contain a single virtual environment (a single isolated Python environment) that lets them manage dependencies for that "project".
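A quick way to see the guaranteed behavior (Python 3.7 and later), just for illustration:

    python3 -c "d = {'a': 1, 'b': 2, 'c': 3}; print(d.popitem())"   # prints ('c', 3), the most recently inserted item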
The data scientist has one Python installation/environment that allows them to do whatever.
The engineer has a Python installation and a bunch of environments so that they can work across multiple repos with multiple people and (hopefully) nothing breaks.
I can see where you're coming from with your question. It can seem like a lot of work to set up and maintain multiple virtual environments (venvs), especially when many of your projects might use similar or even the same packages.
However, there are some good reasons for using venvs even when you are tempted to just use a single global environment. One is that a clear separation between your projects helps with organization, and it matters as soon as different projects need different versions of the same package.
If you share a single venv among all of your projects, its packages are shared by every project that uses it. If one project needs a different version of a package, you have to change the version in the shared venv, which then affects every other project using it. That quickly makes it hard to keep track of which versions are actually in use where.
Sharing a single environment also makes it harder to share your code with others, because they would need access to the same environment, which contains lots of things unrelated to the single project you are trying to share. That is confusing and inconvenient for them.
So, while it might seem like a lot of work to set up and maintain multiple virtual environments, there are some good reasons for doing so. In most cases, it is worth the effort in order to have a clear separation between your different projects and to avoid confusion when sharing your code with others.
It's the same principle as single-user vs. multi-user, virtualization vs. no virtualization, containers vs. no containers, monolithic apps vs. microservices, and so on: you isolate to avoid conflicts, maintain order, and make failure states easy to identify, with scalability and portability among the other benefits. Apply it where necessary, always keeping the KISS philosophy in mind: manage complexity, don't create more of it.
And, as you have already mentioned, resources are finite.
Besides, a set of projects that all share the same base of dependencies is of course not the best showcase of why separation is needed.
On top of that, the tooling keeps evolving to avoid redundant copies of commonly used resources (pip's wheel cache, for instance).
Well, there are a few advantages:
with virtual environments, you actually know your project's dependencies; without them, your global environment becomes a yarn ball of old and new libraries, and when you want to deploy a project somewhere else (which may just mean running it on the new computer you just bought), you can't easily reproduce the environment it was working in (a sketch follows below)
you're eventually going to run into something like the following issue: project alpha needs version 7 of library A, but project beta needs library B, which runs on version 3 of library A. If you install version 3, project alpha will probably break, but you really need to get B working.
it's really not that complicated, and will save you a lot of grief in the long term.
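As a small sketch of that second scenario (library names and version numbers are invented), separate venvs let both projects coexist:

    python -m venv alpha-env
    alpha-env/bin/pip install "libA==7.0"            # project alpha
    python -m venv beta-env
    beta-env/bin/pip install "libB==2.0" "libA==3.0" # project beta, needing the older libA
    # each project now runs against the libA version it actually needs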
There are several motivations for venvs, or for their moral equivalent: conda environments.
1. author a package
You create a cool "scrape my favorite site" package which graphs a timeseries of some widget product. Naturally it depends on BeautifulSoup. You happened to have html5lib 1.1 lying around due to some previous project, so you tested with that. A user downloads your scrape-widget package from pypi, happens to have lxml 4.7.1 available, and finds that scraping crashes when using that library. Wouldn't it have been better for your package to specify that the user shall run against the same deps that you tested with?
2. use a package
Same scenario, but now you're using someone's scrape-widget package. The author tested with lxml 4.7.1 but you have lxml 4.9.1, which behaves differently, and this makes the app behave differently, crashing in ways the author never saw.
3. use two packages
You want to run both scrape-frobozz-magic-widgets and scrape-acme-widget. Their authors tested using different versions of requests, and of lxml. Changing a dep changes the app behavior. You can only use one or the other, unless you're willing to re-run pip quite frequently.
4. collaborate on a team
You write code that has deps. So does your colleague. You have to coordinate things, so testing on one laptop instills confidence that the test would succeed on other laptops.
5. use CI
You have a teammate named Jenkins, and want to communicate to him that you used a specific version of a dep when you saw the test succeed.
6. get a new laptop
Things were working. Then your laptop exploded, you got a new one, and you (quickly) want to see things work again. Some of your deps were downrev, due to recently released bugs and breaking changes. Reading a file full of dep versions from your github repo lets you immediately reproduce the state of the world back when things were working.
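A minimal sketch of point 6, assuming the usual pip workflow (the file name is just the conventional requirements.txt):

    pip freeze > requirements.txt        # commit this file to the repo while things work
    # on the new machine:
    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt      # exact versions from back when things were working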
While it is possible to simply use pip freeze to capture the current environment, that is not suitable here: it would require an environment as bleeding-edge as the one I am used to.
Moreover, some developer tooling is only available with recent versions of packages (think type annotations), but is not needed by users.
My target users may want to use my package on slowly upgrading machines, and I want to get my requirements as low as possible.
For example, I cannot require better than Python 3.6 (and even then I think some users may be unable to use the package).
Similarly, I want to avoid requiring the last Numpy or Matplotlib versions.
Is there a (semi-)automatic way of determining the oldest compatible version of each dependency?
Alternatively, I can manually try to build a conda environment with old packages, but I would have to try pretty randomly.
Unfortunately, I inherited a medium-sized codebase (~10 kLOC) with no automated tests yet (I plan on writing some, but it takes time, and it sadly cannot be my priority).
The requirements were not properly defined either, so I don't know what the code was run with two years ago.
Because semantic versioning is not always honored (and because it may be difficult, from a developer's standpoint, to determine exactly what counts as a minor or major change for each possible user), and because only a human can parse release notes to understand what has changed, there is no simple solution.
My technical approach would be to create a virtual environment with a known working combination of Python and library versions. From there, downgrade one library at a time, one version at a time, verifying that everything still works (which may be difficult if the check is manual and/or slow).
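Something like the following loop, assuming you have at least a quick smoke test to run; the package name, version list and test command are only examples:

    for v in 1.24 1.23 1.22 1.21; do
        pip install "numpy==$v" || break
        pytest -q tests/smoke || { echo "first failing numpy version: $v"; break; }
    done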
My social solution would be to timebox the technical approach to take no more than a few hours. Then settle for what you have reached. Indicate in the README that lib requirements may be overblown and that help is welcome.
Without fast automated tests you are confident in, there is no way to automate the exploration of the N-dimensional space (each library is a dimension) to find the minimums.
Up to now we have kept the version number of our Python source code in setup.py.
This version gets increased after every successful ci run.
This means the version of central libraries get increased several times per day.
Since the version number is stored in a file in the git repo, every increase of the version number is a new commit.
This means roughly 50% of all commits are not made by humans, but by CI.
I have the feeling that we are on the wrong track. Maybe it is not a good solution to have CI maintain the version number.
How could we avoid the "useless" CI commits which just increase the version number?
How to avoid keeping version number in source code?
Update
We have lived without manual releases for several years. We do not have a versioning scheme like MAJOR.MINOR, and we have not missed one so far. I know that this does not work for all environments, but it works for my current environment.
We have a version number which looks like this: YEAR.MONTH.X
This means every commit which passes CI is a new release.
After reading the answers I realize I need to ask myself: do I have a version number at all? I think not. I have a build number. More is not needed in this context.
(thank you for the up-votes. Before asking this question I was sure that this question will get closed because people will think it is "unclear" or "too broad")
It is a common practice to keep a version number in the source code, there is nothing wrong in that.
You need to separate your CI procedures into regular builds, release publishing and release deployment.
Regular builds: run daily or even after each commit, can include static code analysis and automatic tests, check if the code can be built at all. Regular builds should not change the version number.
Release publishing: can only be triggered by explicit manual action by release manager.
The trigger action could be tagging a commit with a new version number, new merge into the release branch, or just a commit that changes version number kept in a special file (e.g. pom.xml). Refer to git flow for example.
Release publishing assigns a new version number (either automatically or manually), commits it into the source code if necessary, builds a binary package with a new version and uploads it to the binary package repository (e.g. Nexus, devpi, local APT repository, Docker registry and so on).
Release deployment: another manually triggered action that takes a ready binary package from a package repository and installs it to the target environment (dev, QA / UAT / staging, part of production for canary deployments or to the whole production environment).
Premises:
I assume these are the premises under which the solution is discussed.
Currently the version number is kept in a git-tracked source file, but you are OK with getting rid of that.
No one manually manages the version number or triggers a release procedure, which includes: (a) increasing the version number, (b) building from source and (c) storing the built result somewhere. These are taken care of by CI, and SHOULD remain that way.
Solution:
Instead of writing to a source file and creating a new commit, have CI simply tag the specific commit that passed the CI check, then push the tag to the remote repo.
The build script should read the tag of the current HEAD commit and use it as the version number when publishing a release.
Optionally, you might want to use git filter-branch to rewrite your existing repo history: tag previous release commits for consistency, remove and stop tracking the version-number source file, and get rid of those CI commits.
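A rough sketch of that flow, using the YEAR.MONTH.X scheme from the question; $BUILD_NUMBER is assumed to come from the CI server, and the build command is just one possibility:

    # CI, after a green run:
    git tag "2024.06.$BUILD_NUMBER"
    git push origin "2024.06.$BUILD_NUMBER"
    # the build step derives the version from git instead of a tracked file:
    VERSION="$(git describe --tags --abbrev=0)"
    echo "building version $VERSION"
    python -m build    # packaging tools such as setuptools_scm can also derive the version from the tag automatically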
I think you should use git flow: create a master branch and a develop branch. Every time CI checks the develop branch, the version number remains the same; every time you create a release, e.g. merge develop into master, CI increases the version number.
Or am I missing something? In my opinion there is no reason for the version number to be increased every time CI runs.
So, all in all, you should think about when to actually "release" changes as a new version.
If the project's kept in a git repo for production use, just use whichever variant of git describe floats your boat, no need to store it in a tracked file because the result identifies the particular history, and you've got that history right there.
If the source is shipped separately, you can use git archive and the export-subst attribute to embed pretty much anything you want in the exported source.
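A small sketch of the git archive route; the file name is an example, while export-subst and the $Format:...$ placeholder are standard git features:

    echo '_version.py export-subst' >> .gitattributes
    printf 'GIT_DESCRIBE = "$Format:%%h %%d$"\n' > _version.py
    git add .gitattributes _version.py
    git commit -m "Embed commit info in exported sources"
    git archive --output=project-src.tar HEAD   # the archived _version.py has the placeholder expanded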
PS: being a new user, I cannot add a comment, so I will support and expand on the answer by @VibrantVivek here.
For continuous integration, tagging the repository is very important: whether you keep the version in your code or track it in some other git way, after every successful CI run there must be a corresponding tag/version.
And if you have CI tags/versions that are not tied to a commit, then something is really wrong.
And +1 for Martin Fowler; here is another link to a more detailed article (by more or less the same people): https://www.thoughtworks.com/continuous-integration (recommended reading).
On your first question:
How could we avoid the "useless" CI commits which just increase the
version number?
Please note that Continuous Integration (CI) is a development practice in which every check-in is verified by an automated build, allowing teams to detect problems early.
Having said that, I would like to highlight two points from that practice:
From the practice: every commit should build on an integration machine.
On how to do it: the CI server monitors the repository and checks out changes when they occur.
In simple words, the CI server should bump the version if and only if there is a new commit, thus making sure every code commit is releasable.
From the OP it sounds like in your setup there are additional (as you said) "useless" commits from the CI server.
Whatever your CI mechanism is, you should be able to control this; almost every tool we use has a way to handle it (e.g. webhooks in Bitbucket, version plugins, etc.).
So, make sure a new version is created only after a new commit.
Now, if you're thinking about those regular nightly integration builds, read on:
Many organizations do regular builds on a timed schedule, such as every night. This is not the same thing as a continuous build and isn't enough for continuous integration. The whole point of continuous integration is to find problems as soon as you can. Nightly builds mean that bugs lie undetected for a whole day before anyone discovers them. Once they are in the system that long, it takes a long time to find and remove them.
You also mentioned that every commit which passes CI is a new release, so in a way you're already doing true CI.
Despite this, if you're still unable to figure out how to avoid the "useless" version-number commits, I would suggest asking another question with details on how your CI mechanism works and why it is difficult under the given conditions.
I bet there must be a solution. Also have a look at GitHub Flow vs Git Flow.
Source: Martin Fowler's article on CI.
How to avoid keeping version number in source code?
On this, I would like to expand on @void's answer, where it is said:
It is a common practice to keep a version number in the source code, there is nothing wrong in that.
There are projects which have to know the exact version deployed (for some important reason); in such scenarios they keep the version in the source and expose, for example, an HTTP GET endpoint that returns it, so you always know which version is currently deployed on a given server.
However, it depends on the requirements: if another project has no such need, then the recommended way to track the version is to use the commit hash / tag each successful CI build.
You can find more details here:
Hope this helps.
I'm working on deploying a Python module composed of several dozen files and folders; I use Mercurial for managing the software changes.
I want to keep the same module in three branches: the official one (which the team uses), the development one (there may be more than one development branch), and the testing branch (not testing of the official branch, but a collection of tests related to a third-party module used by my module - regression testing for when the third-party module makes new releases).
How can I accomplish this in Mercurial? Simply name three branches in the same folder, or clone one version into three places and maintain them separately?
Any insight on how to manage this in general would be appreciated.
Thank you.
The "official" way would be to clone your repo once per branch you need.
But named branches within a single repo are also acceptable, especially if you don't need to work simultaneously on different development efforts (each associated with its respective branch); see the sketch below.
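A minimal sketch of both options, with invented repo and branch names:

    # option 1: one clone per line of development
    hg clone project project-dev
    hg clone project project-testing
    # option 2: named branches inside a single repo
    cd project
    hg branch develop                  # the next commit starts the "develop" branch
    hg commit -m "Start develop branch"
    hg update default                  # switch back to the official branch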
I find the "Guide to Branching Model in Mercurial" very instructive on this kind of choice.
Other information on Mercurial branches in this SO question as well.
I've been using zc.buildout more and more and I'm encountering problems with some recipes that I have solutions to.
These packages generally fall into several categories:
Package with no obvious links to a project site
Package with links to a free hosting service like GitHub or Google Code
Setup #2 is better than #1, but not by much, because in both situations I would have to wait for the developer to apply my changes before I can use the updated package in a buildout.
What I've been doing up to this point is basically forking the package, giving it a different name and uploading it to PyPI, but this creates redundancy and I think it only aggravates the problem.
One possible solution is to use a personal package index server where I would upload updated versions of the code until the developer updates his/her package. This is doable, but it adds extra work that I would prefer to avoid.
Is there a better way to do this?
Thank you
Your "upload my personalized fork" solution sounds like a terrible idea. You should try http://pypi.python.org/pypi/collective.recipe.patch which lets you automatically patch eggs. Try setting up a local PyPi-compatible index. I think you can also point find-links = at a directory (not just a http:// url) containing your personal versions of those "almost good enough" packages. You can also try monkey patching the defective package, or take advantage of the Zope component model to override the necessary bits in a new package. Often the real authors are listed somewhere in the source code of a package, even if they decided not to put their names up on PyPi.
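As a sketch of the find-links-at-a-directory suggestion above (paths and package names are invented):

    mkdir -p local-packages
    cp ~/patched/some.package-1.2.1.tar.gz local-packages/   # your locally fixed sdist
    # then in buildout.cfg:
    #   [buildout]
    #   find-links = ${buildout:directory}/local-packages
    bin/buildout -N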
I've been trying to cut down on the number of custom versions of packages I use. Usually I work with customized packages as develop eggs by linking src/some.project to my checkout of that project's code. I don't have to build a new egg or reinstall every time I edit those packages.
A lot of Python packages used in buildouts are hosted in Plone's svn collective. It's relatively easy to get commit access to that repository.
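A sketch of that develop-egg workflow, with an invented project name and repository URL:

    mkdir -p src
    svn checkout https://svn.example.org/some.project/trunk src/some.project
    # then in buildout.cfg:
    #   [buildout]
    #   develop = src/some.project
    bin/buildout -N    # edits under src/some.project take effect without rebuilding an egg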