Azure databricks toggle environment vars contain quotes in python


I know there are a lot of questions here about how to handle quotes in environment variables. This question has a different focus so please read on:
Before last week we had set our environment variables on our databricks cluster (7.3 LTS, includes Apache Spark 3.0.1, Scala 2.12) like this:
EXAMPLE_FOO="gaga"
For whatever reason (don't remember) we needed the quotes to get this result in python:
print(os.environ["EXAMPLE_FOO"]) => gaga
Since last week the behavior changed, now we get:
print(os.environ["EXAMPLE_FOO"]) => "gaga"
with the quotes. We have no clue why this suddenly changed. There was no software update or the like from our side on this production system. We would like to understand the root cause. Has some library on Databricks changed, or is there a setup flag in the Databricks configuration where you can toggle this behavior?
Note: We know how to handle both cases in Python, so there is no need to tell me how to handle the variables. We need to know what may have suddenly caused the issue.

It looks like your workspace was already upgraded to incorporate this breaking change, which was highlighted in the release notes. You should also have received communication from Databricks support about this change. Basically, you don't need to use escaping anymore, so you can remove the quotes.
But it's really better to raise a support ticket with Microsoft to understand the impact of this issue and define the next steps.
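While the quotes are being removed from the cluster configuration, code may see either form of the value. A minimal sketch of a reader that tolerates both, assuming the only difference is one pair of surrounding quotes (the helper name env_unquoted is hypothetical and not part of any Databricks API):

```python
import os


def env_unquoted(name, default=None):
    """Return an environment variable with one pair of surrounding quotes removed.

    Assumption: the value is either plain (gaga) or wrapped in one matching
    pair of single or double quotes ("gaga").
    """
    value = os.environ.get(name, default)
    if (
        value is not None
        and len(value) >= 2
        and value[0] == value[-1]
        and value[0] in "\"'"
    ):
        return value[1:-1]
    return value
```

Once the quotes are gone from the configuration, the helper simply returns the value unchanged.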

Related

How to avoid keeping version number in source code?

Up to now we keep the version number of our python source code in setup.py.
This version gets increased after every successful ci run.
This means the version of central libraries get increased several times per day.
Since the version number is stored in a file in the git repo, every increase of the version number is a new commit.
This means roughly 50% of all commits are not made by humans, but by CI.
I have the feeling that we are on the wrong track. Maybe it is not a good solution to keep the version number in CI.
How could we avoid the "useless" CI commits which just increase the version number?
Update
We have lived without manual releases for several years. We do not have a versioning scheme like MAJOR.MINOR, and we have not missed it in the past. I know that this does not work for all environments, but it works for my current environment.
We have a version number which looks like this: YEAR.MONTH.X
This means every commit which passes CI is a new release.
After reading the answers I realize that I need to ask myself: Do I have a version number at all? I think not. I have a build number. More is not needed in this context.
(thank you for the up-votes. Before asking this question I was sure that this question will get closed because people will think it is "unclear" or "too broad")

It is a common practice to keep a version number in the source code, there is nothing wrong in that.
You need to separate CI procedures into regular builds, release publishing and release deployment.
Regular builds: run daily or even after each commit, can include static code analysis and automatic tests, check if the code can be built at all. Regular builds should not change the version number.
Release publishing: can only be triggered by explicit manual action by release manager.
The trigger action could be tagging a commit with a new version number, new merge into the release branch, or just a commit that changes version number kept in a special file (e.g. pom.xml). Refer to git flow for example.
Release publishing assigns a new version number (either automatically or manually), commits it into the source code if necessary, builds a binary package with a new version and uploads it to the binary package repository (e.g. Nexus, devpi, local APT repository, Docker registry and so on).
Release deployment: another manually triggered action that takes a ready binary package from a package repository and installs it to the target environment (dev, QA / UAT / staging, part of production for canary deployments or to the whole production environment).

Premises:
I assume these are the premises under which the solution is discussed.
Currently version number is kept in a git-tracked source file, but you are OK to get rid of it.
No one manually manages the version number or triggers a release procedure, which includes: (a) increasing the version number, (b) building from source and (c) storing the built result somewhere. These are taken care of by CI, and SHOULD remain that way.
Solution:
Instead of writing to a source file and creating a new commit, CI simply tags the specific commit that passed the CI check, then pushes the tag to the remote repo.
The build script should read the tag of the current HEAD commit and use it as the version number for publishing a release.
Optionally, you might want to use git filter-branch to rewrite your existing git repo history, tag previous release commits for consistency, remove and stop tracking the version number source file, then get rid of those CI commits.
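The tagging approach above can be sketched roughly as follows. This is only an illustration, assuming the YEAR.MONTH.X scheme from the question with X starting at 0, lightweight tags, and a remote called origin; the function names are hypothetical.

```python
import datetime
import subprocess


def next_tag(existing_tags, today):
    """Compute the next YEAR.MONTH.X tag (assumption: X counts up from 0 each month)."""
    prefix = f"{today.year}.{today.month}."
    counters = [
        int(tag[len(prefix):])
        for tag in existing_tags
        if tag.startswith(prefix) and tag[len(prefix):].isdigit()
    ]
    return f"{prefix}{max(counters, default=-1) + 1}"


def tag_head_and_push(remote="origin"):
    """CI step: tag the commit that just passed CI and push only the tag."""
    listed = subprocess.run(
        ["git", "tag", "--list"], check=True, capture_output=True, text=True
    ).stdout.split()
    tag = next_tag(listed, datetime.date.today())
    subprocess.run(["git", "tag", tag], check=True)
    subprocess.run(["git", "push", remote, tag], check=True)
    return tag


def version_of_head():
    """Build step: read the tag that points exactly at HEAD (fails if there is none)."""
    result = subprocess.run(
        ["git", "describe", "--tags", "--exact-match", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Because only a tag is pushed, no new commit appears in the history and the version number never has to live in a tracked file.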

I think you should use git flow: create a master branch and a develop branch. Every time CI checks develop, the version number remains the same. Every time you create a release, e.g. by merging develop into master, CI can increase the version number.
Maybe I am missing something, but in my opinion there is no reason for the version number to increase every time CI runs.
So, all in all, you should think about when to "release" changes as a new version.

If the project's kept in a git repo for production use, just use whichever variant of git describe floats your boat, no need to store it in a tracked file because the result identifies the particular history, and you've got that history right there.
If the source is shipped separately, you can use git archive and the export-subst attribute to embed pretty much anything you want in the exported source.
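As an illustration of the git describe suggestion, here is a small sketch that splits its default output into parts a build script can use. It assumes the usual TAG-N-gHASH form (for example v1.0.4-14-g2414721), or the bare tag when HEAD sits exactly on a tag; the function names are hypothetical.

```python
import re
import subprocess

# Default `git describe` output when HEAD is N commits past TAG: TAG-N-gHASH
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<distance>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text):
    """Return (tag, commits_since_tag, abbreviated_hash_or_None)."""
    text = text.strip()
    match = _DESCRIBE_RE.match(text)
    if match:
        return match.group("tag"), int(match.group("distance")), match.group("sha")
    # HEAD is exactly on a tag, so git prints the tag name alone.
    return text, 0, None


def describe_head():
    """Run `git describe --tags` in the current repository and parse the result."""
    result = subprocess.run(
        ["git", "describe", "--tags"], check=True, capture_output=True, text=True
    )
    return parse_describe(result.stdout)
```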

PS: Being a new user, I cannot add a comment.
I support and would like to expand on the answer by @VibrantVivek.
For Continuous Integration, tagging the repository is very important. Whether you keep the version in your code or track it in some other git way, after every successful CI run there must be a corresponding tag/version.
And if you have CI tags/versions that do not correspond to a commit, then something is really wrong.
And +1 for Martin Fowler. Here is another link to a more detailed article (more or less by the same person): https://www.thoughtworks.com/continuous-integration (recommended reading)

On your first question:
How could we avoid the "useless" CI commits which just increase the version number?
Please see: Continuous Integration (CI) is a development practice which verifies each check-in by an automated build, allowing teams to detect problems early.
Having said that, I would like to point out:
From practice: Every commit should build on an integration machine
Under how to do it: The CI server monitors the repository and checks out changes when they occur
In simple words, the CI server should increase the version if and only if there is a new commit, thus making sure every code commit is releasable.
From the question, it looks like in your setup there are more (as you said) "useless" commits from the CI server.
Depending on your CI mechanism, you should be able to control this; almost every tool we use has a way to handle it (e.g. webhooks in Bitbucket, a version plugin, etc.).
That way you make sure there is a new version only after a new commit.
Now if you're thinking about those regular nightly integration builds , then read below :
Many organizations do regular builds on a timed schedule, such as every night. This is not the same thing as a continuous build and isn't enough for continuous integration. The whole point of continuous integration is to find problems as soon as you can. Nightly builds mean that bugs lie undetected for a whole day before anyone discovers them. Once they are in the system that long, it takes a long time to find and remove them.
Also, you have mentioned: Every commit which passes CI is a new release. Thus, in a way, you are already doing true CI.
Despite this, if you are still unable to figure out how to avoid the "useless" version-number commits, then I would suggest asking another question with details on how your CI mechanism works and why it is difficult under the given conditions.
I bet there must be a solution. Also have a look at GithubFlowVsGitFlow.
Source: Martin Fowler's white paper on CI
How to avoid keeping version number in source code?
On this, I would like to expand on @void's answer, as it is said there:
It is a common practice to keep a version number in the source code, there is nothing wrong in that.
There are projects that have to know the exact version deployed (for some important reasons). In such scenarios they keep the version in the source and expose an HTTP GET API on the deployed code (one way of doing it) to find out which version is currently deployed on server X.
However, this depends on the requirements. If another project has no such need, then the recommended way to keep the version is to use the commit hash or to tag each successful CI build.
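For the first scenario above (asking the deployed code for its version over HTTP GET), a minimal sketch using only the standard library could look like this. The /version path, the DEPLOYED_VERSION constant and its value are assumptions for illustration; in practice the build would inject the tag or commit hash.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical value; a real build would write the tag or commit hash here.
DEPLOYED_VERSION = "2024.5.17"


def version_payload(version=DEPLOYED_VERSION):
    """Serialize the deployed version as a small JSON document."""
    return json.dumps({"version": version}).encode("utf-8")


class VersionHandler(BaseHTTPRequestHandler):
    """Answers GET /version with the version of the running code."""

    def do_GET(self):
        if self.path != "/version":
            self.send_error(404)
            return
        body = version_payload()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve it locally (blocks until interrupted):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8000), VersionHandler).serve_forever()
```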
You can have more details here :
Hope this helps.

Case sensitivity with names of modules and files in python 2.7.15

I have encountered a rather funny situation: I work in a big scientific collaboration whose major software package is based on C++ and python (2.7.15 still). This collaboration also has multiple servers (SL6) to run the framework on. Since I joined the collaboration recently, I received instructions on how to set up the software and run it. All works perfectly on the server. Now, there are reasons not to connect to the server to do simple tasks or code development; instead it is preferable to do these kinds of things on your local laptop. Thus, I set up a virtual machine (docker) according to a recipe I received, installed a couple of things (fuse, cvmfs, docker images, etc.) and in this way managed to connect my MacBook (OSX 10.14.2) to the server where some of the libraries need to be sourced in order for the software to be compiled and run. And after 2h it does compile! So far so good..
Now comes the fun part: you run the software by executing a specific python script which is fed as argument another python script. Not funny yet. But somewhere in this big list of python scripts sourcing one another, there is a very simple task:
import logging
variable = logging.DEBUG
This is written inside a script that is called Logging.py. So the script and the library differ only in the first letter: l or L. On the server, this runs perfectly smoothly. On my local VM setup, I get the error
AttributeError: 'module' object has no attribute 'DEBUG'
I checked the python versions (which python) and the location of the logging library (print logging.__file__), and in both set ups I get the same result for both commands. So the same python version is run, and the same logging library is sourced but in one case there is a mix up with the name of the file that sources the library.
So I am wondering, if there is some "convention file" (like a .vimrc for vi) sourced somewhere where this issue could be resolved by setting some tolerance parameter to some other value...?
Thanks a lot for the help!
conni

As others have said, OSX treats names as case-insensitive by default, so the Python bundled logging module will be confused with your Logging.py file. I'd suggest the better fix would be to get the Logging.py file renamed, as this would improve compatibility of the code base. Otherwise, you could create a "Case-sensitive" APFS file system using "Disk Utility".
If you go with creating a file system, I'd suggest not changing the root/system partition to case-sensitive as this will break various programs in subtle ways. You could either repartition your disk and create a case-sensitive filesystem, or create an "Image" (this might be slower, not sure how much) and work in there. Just make sure you pick the "APFS (Case-sensitive)" format when creating the filesystem!
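To confirm that case-insensitivity is what differs between the server and the local setup, one can probe the directory that holds the code. This is a rough sketch, assuming the directory is writable; the function name is hypothetical.

```python
import os
import tempfile


def is_case_insensitive(directory):
    """Return True if `directory` is on a filesystem that ignores case in names."""
    # The fixed prefix guarantees the temporary name contains letters to swap.
    with tempfile.NamedTemporaryFile(prefix="casecheck_", dir=directory) as tmp:
        name = os.path.basename(tmp.name)
        swapped = os.path.join(directory, name.swapcase())
        # On a case-sensitive filesystem the swapped-case name does not exist.
        return os.path.exists(swapped)
```

Running it once on the server and once inside the local mount shows whether the two environments resolve Logging.py and logging differently.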

pycharm will not allow me to use my python on an Ubuntu operating system

I have been trying to get PyCharm to use Python. Easy, right? Wrong. When I try to add Python 3.5, the GUI keeps going back to "No SDK" and won't let me add one. To better explain the problem, I made the video below. What I am doing should work, but it does not. Note that this is on an Ubuntu operating system.
https://www.youtube.com/watch?v=g5dy1jlIHCs

It seems weird, because JetBrains products normally detect a local SDK without any trouble, but you really do have a problem with that.
Anyway, I think we can solve the problem.
Look at the PyCharm log. You may see something interesting there.
Delete the virtualenv and use /usr/bin/python3.5 (this may solve it).
Check the owner (chown) and the read and write permissions (chmod) of the JetBrains files. Also, check the .idea directory.
NOTE:
This invalid VCS doesn't affect your usage of the SDK, because JetBrains has to read .idea to find your VCS settings (or other things).
Thank you

You have two interpreters with the same name. I'm not sure how it happened, but PyCharm doesn't allow it.
Remove one. Also, you are selecting an interpreter for the wrong project (you have multiple projects open).

Programmatically determine if running in DSX

How can I programmatically determine if the python code in my notebook is running under DSX?
I'd like to be able to do different things under a local Jupyter notebook vs. DSX.

While the method presented in another answer (look for specific environment variables) works today, it may stop working in the future. This is not an official API that DSX exposes. It will obviously also not work if somebody decides to set these environment variables on their non-DSX system.
My take on this is that "No, there is no way to reliably determine whether the notebook is running on DSX".
In general, (in my opinion) notebooks are not really designed as artifacts that you can arbitrarily deploy anywhere; there will always need to be someone wearing the "application developer" hat and transform them - how to do that, you could put into a markdown cell inside the notebook.

You can print your environment or look for some specific environment variable. I am sure you will find some differences.
For example:
import os

# SERVICE_CALLER is an environment variable observed on DSX, not an official API
if os.environ.get('SERVICE_CALLER'):
    print('In DSX')
else:
    print('Not in DSX')

Self-updating python Scripts

I wrote 2-3 plugins for pyLoad.
Sometimes they change, and I let users know on the forum that there's a new version.
To avoid that, I'd like to give my scripts an automatic self-update function.
https://github.com/Gutz-Pilz/pyLoad-stuff/blob/master/FileBot.py
Is something like that easy to set up?
Or can someone point me in a direction?
Thanks in advance!

It is possible, with some caveats. But it can easily become very complicated. Before you know it, your auto-update "feature" will be bigger than the original code!
First you need to have an URL that always contains the latest version. Since you are using github, using raw.githubusercontent might do very well.
Have your code download the latest version from that URL (e.g. using requests), and compare the version with that in the current code. For this purpose I would recommend a simple integer version number, so you don't need any complicated parsing logic.
However, you might want to consider only running that check once per day, or once per week. If you do it every time your file is run, the server might get hammered! So now you have to save a file with the date when the check was last done, and read that to see if it is time to run the check again. This file will need to be saved in a location that you can access on every platform your code is liable to run on. That in itself can be a challenge.
If it is just a single python file, which is installed as the user that is running it, updating is relatively easy. But if the original was installed as root in the global Python directory and your script is running as a nonprivileged user it will be difficult. Especially if it is running as a plugin and cannot ask the user for (temporary) root credentials to install the file.
And what are you going to do if a newer version has more dependencies outside the standard library?
Last but not least, as a sysadmin I don't really like auto-updating software. Especially for critical system infrastructure, I like to be able to estimate the consequences before an update.
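The version check described in this answer could be sketched as below. This is not a complete updater: it uses urllib from the standard library instead of requests, the URL is a placeholder, the __version__ convention and function names are assumptions, and storing the time of the last check and replacing the file on disk are left to the caller.

```python
import re
import time
import urllib.request

__version__ = 3  # simple integer version, so no parsing logic is needed

# Placeholder URL (assumption); point it at the raw file in your own repository.
LATEST_URL = "https://raw.githubusercontent.com/<user>/<repo>/master/plugin.py"
CHECK_INTERVAL = 24 * 60 * 60  # check at most once per day, in seconds

_VERSION_RE = re.compile(r"^__version__\s*=\s*(\d+)", re.MULTILINE)


def extract_version(source):
    """Find the integer assigned to __version__ in a piece of source text."""
    match = _VERSION_RE.search(source)
    return int(match.group(1)) if match else None


def check_is_due(last_check, now, interval=CHECK_INTERVAL):
    """True if no check was recorded yet or the interval has elapsed."""
    return last_check is None or now - last_check >= interval


def fetch_latest_source(url=LATEST_URL, timeout=10):
    """Download the newest published version of this file as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def newer_source_available(last_check, now=None):
    """Return the newer source text, or None if no update is due or available."""
    now = time.time() if now is None else now
    if not check_is_due(last_check, now):
        return None
    source = fetch_latest_source()
    remote = extract_version(source)
    if remote is not None and remote > __version__:
        return source
    return None
```

The caller would load the timestamp of the last check from wherever it can write on the given platform, call newer_source_available, and decide whether it has the permissions to overwrite the installed file.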
