I would like to force Python to use pre-compiled bytecode files that I provide, and avoid any other behavior that 'runs the script' but wastes time trying to regenerate the bytecode. Python however is designed to make this behavior generally transparent to the user, so I can't find any obvious ways to control (or even inspect) what it's doing in this regard.
My question(s):
Is my only option to remove the .py source, leaving only .pyc files? (I don't want to do that because keeping the source around has proven very useful for debugging in this scenario.)
How can I check up on this behavior to see what's really happening?
Background & Motivation
I am creating a package of Python modules to be distributed over a read-only network file system, and I include pre-compiled bytecode as .pyc files along with the Python packages. I can guarantee that the end users will always be using it with the same version of Python, on the same Linux OS, with the same kernel, in the same environment, etc., and therefore should never have to regenerate the bytecode. (This is a special setup for a mid-sized scientific experiment where we enforce these things in the environment we provide for our software developers. The package manager in this environment refuses to set up packages that don't match each other in the ways we specify.)
In this context, I want to ensure that Python does not try to regenerate bytecode or write .pyc's because we want to avoid slow startup times importing some large modules (matplotlib, scipy, and some others). Python's default behavior, as I understand it, is to
1. always check whether the .pyc files are stale,
2. always regenerate the bytecode if they are stale,
3. always try to overwrite the .pyc file with the new bytecode, and
4. never tell the user if overwriting the .pyc failed.
Step 1 should not be necessary, but I'm not too worried about it because it only requires checking a few bytes. I think it doesn't take much time (even on our network file system, which is optimized for serving up small or partial binary files). I suppose I could be wrong about this for packages that have many .pyc files?
Step 2 should never be necessary after the package is built. If this happens then there is a bug in the package management.
Step 3 will always fail here because this is a read-only filesystem. If this happens then we need to know about it.
Step 4 is frustrating because I can't tell what's going on. And I have seen modules (e.g. matplotlib) which load more quickly on subsequent imports in this scenario, which I don't understand.
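For reference, the kind of inspection I have in mind is something like this minimal sketch (assuming Python 3.4+; matplotlib stands in for any of the large modules, and the /shared path is just a placeholder):

```python
# Minimal inspection sketch: show where the interpreter found a module's cached
# bytecode and whether it is currently allowed to write new .pyc files.
import importlib.util
import sys

import matplotlib  # one of the large modules in question

print("source file:             ", matplotlib.__file__)
print("cached bytecode:         ", getattr(matplotlib, "__cached__", None))
print("bytecode writes disabled:", sys.dont_write_bytecode)

# Where Python would look for the cache of an arbitrary source file:
print(importlib.util.cache_from_source("/shared/pkg/somemodule.py"))
```

Running the interpreter with -v also makes the import machinery print what it does for each module, which helps with the "what's really happening" part.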
Related
I have written a Python program which takes an input data file, performs some processing on the data, and writes another data file as output.
I now need to distribute my code, but the users should not see the source code; they should be able to just give the input and get the output.
I have never done this before.
I would appreciate any advice on how to achieve this in the easiest way.
Thanks a lot in advance
Python is an interpreted language by design, and it compiles code to bytecode, which doesn't help you conceal it, since bytecode is relatively easy to reverse. There is no truly secure way to hide your source code so that it cannot be recovered, and that is true for any programming language, really.
If you had wanted a language that can't be so easily reversed, you should have gone for a native language that compiles directly to the underlying architecture's machine code, which is significantly harder to reconstruct in the original language, let alone read, due to compiler optimizations, the complexity of CISC instruction sets, et cetera.
However, there are tools that convert your source code into an executable format (by packing the Python interpreter and the bytecode alongside it), such as:
cx_Freeze - for freezing code on Python 2.7 and later, for any platform, allegedly.
PyInstaller - for freezing general-purpose code; it additionally states that it works with third-party libraries.
py2exe - for freezing code into a Windows-only executable format.
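For illustration, PyInstaller can also be driven from a short Python script rather than the command line; this is only a sketch, and your_script.py is a placeholder for the real entry point:

```python
# Minimal PyInstaller sketch (assumes PyInstaller is installed).
import PyInstaller.__main__

PyInstaller.__main__.run([
    "your_script.py",  # placeholder entry point
    "--onefile",       # bundle interpreter, bytecode and dependencies into one file
    "--noconfirm",     # overwrite previous build output without prompting
])
```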
Or you might consider an alternative: code obfuscation, which still lets the user obtain the source code but makes it close to impossible to read.
One issue with this is that it makes the code harder to extend, since poor obfuscation techniques can leave it effectively unmaintainable. The obfuscated code may also carry overhead from redundant statements meant to trick the reader into thinking it does something it does not.
It also runs counter to the open-source practice that Python and its community generally embrace.
To conclude, if you don't want to read everything above: the first misstep was choosing Python for this, a language that is open source and built around open-source practice. To mitigate the issue, either reconsider the language or use the tools referenced above, which can help with basic source-code concealment.
Firstly, as Python is an interpreted language, I think you cannot completely protect your Python code; .pyc files can be decompiled to recover .py files (using uncompyle6, for example).
So the only thing you can do is make it very hard to read.
I recommend having a look at code obfuscation, which consists of making your code unreadable by renaming variables and functions, removing comments and docstrings, stripping unnecessary whitespace, and so on. Pyminifier does that kind of thing.
You can also write your own obfuscation script.
Then you can also turn your program into a single executable (using pyinstaller for example). I am pretty sure there is a way to get .py files back from the executable, but it just makes it harder. Also beware of cross-platform compatibility when making an executable.
Going through the above responses, my understanding is that some of the strategies mentioned may not work if your client wants to execute your protected script along with other unprotected scripts.
One other option is to encrypt your script and then use an interpreter that can decrypt and execute it. It too has some limitations.
ipepycrypter is a suite that helps protect Python scripts. It does this by hiding the script's implementation through encryption. The encrypted script is executed by a modified Python interpreter. ipepycrypter consists of the encryption tool ipepycrypt and the Python interpreter ipepython.
More information is available at https://ipencrypter.com/user-guides/ipepycrypter/
One other option, of course, is to expose the functionality over the web, so that the user can interact through the browser without ever having access to the actual code.
There are several tools which compile Python code into either (a) compiled modules usable with CPython, or (b) a self-contained executable.
https://cython.org/ is the best known, and probably the oldest; it takes only a small amount of effort to prepare a traditional Python package so that it can be compiled with Cython.
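For example, a minimal setup.py along these lines (a sketch only; mypkg is a placeholder package name) is usually enough to have Cython compile an existing pure-Python package into extension modules:

```python
# Minimal Cython build sketch: compile every module in the hypothetical
# package "mypkg" into C extension modules, so the .py sources need not ship.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="mypkg",
    ext_modules=cythonize(
        "mypkg/*.py",
        compiler_directives={"language_level": "3"},
    ),
)
```

Running python setup.py build_ext then produces the compiled modules, and only those need to be distributed.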
http://numba.pydata.org/ and https://pythran.readthedocs.io/ can also be used in this way to produce compiled Python modules, so that the source doesn't need to be distributed and it will be very difficult to decompile the distributable back into usable source code.
https://mypyc.readthedocs.io is a newer player, an offshoot of the mypy toolkit.
Nuitka is the most advanced at creating a self-contained executable. https://github.com/Nuitka/Nuitka/issues/392#issuecomment-833396517 shows that it is very hard to de-compile code once it has passed through Nuitka.
https://github.com/indygreg/PyOxidizer is another tool worth considering, as it creates a self-contained executable of all the needed packages. By default, only basic IP protection is provided, in that the packages inside it are not trivial to inspect; for someone with a bit of knowledge of the tool, however, it is easy to see the packages enclosed within the binary. It is possible to add custom module loaders, though, so that the "modules" in the binary can be stored in unintelligible formats.
Finally, there are many Python to C/go/rust/etc transpilers, however these will very likely not be usable except for small subsets of the language (e.g. will 3/0 throw the appropriate exception in the target language?), and likely will only support a very limited subset of the standard library, and are unlikely to support any imports of packages beyond the standard library. One example is https://github.com/py2many/py2many , but a search for "Python transpiler" will give you many to consider.
I have been developing a fairly extensive library of Python modules that automate the more time-consuming parts of "3D character development" for games/film/TV.
Up until a few months ago, all of my code ran within Maya's dedicated Python interpreter; however, my GUIs are built in PySide/PyQt and so run just fine on Mac/Windows/Linux or in a few other graphics programs such as Nuke, XSI, and Max.
What I would really like to figure out is a "simple" way to distribute my code to various people ---> using various operating systems ---> potentially using various applications (Nuke, XSI, Max), which, in turn, have their own dedicated Python interpreters.
The obvious options would be pip and easy_install. These tools are clearly the "right" way to go, but it's not really clear how a user would install/run them under the dedicated Python installs that ship with Maya/Nuke/etc., though it does seem possible (as explained here). Still, it's going to be a pretty big barrier for a less technical user.
Any help or points in the right direction would be immensely appreciated..
I would not say that pip/easy_install are the 'right' way for this problem. They are pretty good (not quite 'great') tools for motivated, technically inclined users -- but even in that context they have issues (such as unintended upgrades or deletions). Most importantly, they are opt-in methods: nobody can make you pip unless you want to. This means users can accidentally or deliberately get themselves into very different positions from each other, which makes support and maintenance a nightmare.
I've had very good luck in Maya distributing a zipped file containing a complete environment - all the modules, etc. A userSetup.py adds that zip to the path, and Python's native zipimport functionality handles the rest. This makes sure there is only one file to maintain and distribute. It also fixes the common problem of leftover .pyc files creating havoc after .py files get moved or renamed. Since this is all standard Python, I'd assume it will work for any app-specific Python that uses version 2.6+, though I've never tried it in Nuke or Max.
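A sketch of the userSetup.py side of this (the zip path is hypothetical; zipimport kicks in automatically once the zip is on sys.path):

```python
# userSetup.py sketch: put the shared tools zip on sys.path so Python's
# built-in zipimport machinery can load pure-Python modules straight from it.
import os
import sys

TOOLS_ZIP = "/shared/tools/studio_tools.zip"  # hypothetical shared location

if os.path.isfile(TOOLS_ZIP) and TOOLS_ZIP not in sys.path:
    sys.path.insert(0, TOOLS_ZIP)
```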
The main wrinkle will be modules with .pyd or other binary components; typically these don't work inside zip files. I include a bootstrap routine that unpacks those to a (disposable) location on the user's disk and adds that location to the path.
There's a detailed discussion of the method here and some background here
I searched around for a while and have found a number of reasonable claims that CPython's compilation allows faster execution of Python code. I was wondering, though, if anyone knows of any benchmarks demonstrating the degree of the speedup.
Alternatively, perhaps there's an easy way for me to benchmark it. Is there a Python flag that can be given at runtime to turn off compilation?
All code run by CPython must be compiled to bytecode before it can be run. That's just how the interpreter works, and you probably can't reasonably change this (without writing your own interpreter).
However, by default the compiled bytecode for modules that are loaded will be cached in .pyc files. This means it won't need to be compiled again the next time you load it. Bytecode caching is probably what you heard about, as it can speed up the importing of previously used modules by a fair amount. It doesn't change performance after the modules are loaded though.
You can disable bytecode caching with the -B command line option or the PYTHONDONTWRITEBYTECODE environment variable. If you want to test the speed difference, you may also need to delete any existing cache. In Python 2, the compiled bytecode was written to a .pyc file right next to the .py source file. In Python 3, this was changed to use a __pycache__ folder which can hold several .pyc files from different versions of Python (so you could have several cached versions at once; see PEP 3147 for more details).
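If you want to put numbers on it, here is a self-contained sketch (the module and directory names are made up for the test) that times a cold import, where the source has to be compiled and the .pyc written, against a warm one that reuses the cache:

```python
# Benchmark sketch: generate a throwaway module, import it in a fresh
# interpreter with no cached bytecode, then again with the .pyc in place,
# and compare the times printed by the two runs.
import pathlib
import shutil
import subprocess
import sys
import textwrap

work = pathlib.Path("bench_mod")
work.mkdir(exist_ok=True)
# A module big enough that compiling it takes a measurable amount of time.
(work / "big.py").write_text(
    "\n".join("def f{0}():\n    return {0}".format(i) for i in range(5000))
)

timer = textwrap.dedent("""
    import sys, time
    sys.path.insert(0, "bench_mod")
    t = time.perf_counter()
    import big
    print("import took %.1f ms" % ((time.perf_counter() - t) * 1000))
""")

shutil.rmtree(work / "__pycache__", ignore_errors=True)  # start with a cold cache
subprocess.run([sys.executable, "-c", timer])  # cold run: compiles and writes the .pyc
subprocess.run([sys.executable, "-c", timer])  # warm run: reuses the cached .pyc
```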
I'm working on an Inno Setup installer for a Python application for Windows 7, and I have these requirements:
The app shouldn't write anything to the installation directory
It should be able to use .pyc files
The app shouldn't require a specific Python version, so I can't just add a set of .pyc files to the installer
Is there a recommended way of handling this? Like give the user a way to (re)generate the .pyc files? Or is the shorter startup time benefit from the .pyc files usually not worth worrying about?
.pyc files aren't guaranteed to be compatible across Python versions. If you don't know that all your customers are running the same Python version, you really don't want to distribute .pyc files directly. So you have to choose between distributing .pyc files and supporting multiple Python versions.
You could create a build process that compiles all your files using py_compile and zips them up into a version-specific package. You can do this with setuptools; however, it will be awkward because you'll have to run py_compile under every Python version you need to support.
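A sketch of such a build step (package and archive names are placeholders), using compileall, which drives py_compile over a whole tree; it has to be run once under each Python version you support:

```python
# Build-step sketch: byte-compile everything under ./mypkg, then collect the
# tree (sources plus bytecode) into a zip named after the interpreter version.
import compileall
import shutil
import sys

if not compileall.compile_dir("mypkg", force=True, quiet=1):
    sys.exit("byte-compilation failed")

tag = "py{0}.{1}".format(*sys.version_info[:2])
shutil.make_archive("mypkg-" + tag, "zip", root_dir=".", base_dir="mypkg")
```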
If you are basically distributing a closed application and don't want people to have trivial access to your source code, then py2exe is probably a simpler alternative. If your code is supposed to be integrated into the user's Python install, then it's probably simpler to just create a zip of your .py files and add a one-line .py stub that imports the zipped package(s) via zipimport (by putting the zip on sys.path).
If it makes you feel better, .pyc files don't provide much extra security, and they don't really boost performance much either :)
If you haven't read PEP 3147, that will probably answer your questions.
I don't mean the solution described in that PEP and implemented as of Python 3.2. That's great if your "multiple Python versions" just means "3.2, 3.3, and probably future 3.x". Or even if it means "2.6+ and 3.1+, but I only really care about 3.2 and 3.3, so if I don't get the pyc speedups for other ones that's OK".
But when I asked your supported versions, you said, "2.7", which means you can't rely on PEP 3147 to solve your problems.
Fortunately, the PEP is full of discussion of earlier attempts to solve the problem, and the pitfalls of each, and there should be more than enough there to figure out what the options are and how to implement them.
The one problem is that the PEP is very Linux-centric, mainly because it's primarily Linux distros that tried to solve the problem in the past. (Apple also did so, but their solution was (a) pretty much working, and (b) tightly coupled with the whole Mac-specific "framework" thing, so they were mostly ignored…)
So, it largely leaves open the question of "Where should I put the .pyc files on Windows?"
The best choice is probably an app-specific directory under the user's local application data directory. See Known Folders if you can require Vista or later, CSIDL if you can't. Either way, you're looking for the FOLDERID_LocalAppData or CSIDL_LOCAL_APPDATA, which is:
The file system directory that serves as a data repository for local (nonroaming) applications. A typical path is C:\Documents and Settings\username\Local Settings\Application Data.
The point is that it's a place for applications to store data that is separate for each user (inside that user's profile directory) and for each machine the user's roaming profile might end up on. That means you can safely put stuff there, knowing that the user has permission to write there without UAC getting involved, and knowing (as well as you ever can) that no other user or machine will interfere with what's there.
Within that directory, you create a directory for your program, and put whatever you want there, and as long as you picked a unique name (e.g., My Unique App Name or My Company Name\My App Name or a UUID), you're safe from accidental collision with other programs. (There used to be specific guidelines on this in MSDN, but I can no longer find them.)
So, how do you get to that directory?
The easiest way is to just use the env variable %LOCALAPPDATA%. If you need to deal with older Windows, you can use %USERPROFILE% and tack \Local Settings\Application Data onto the end, which is guaranteed to either be the same, or end up in the same place via junctions.
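In Python, that fallback logic might look like this sketch (the application folder name is made up):

```python
# Sketch: resolve the per-user local application data directory, falling back
# to the pre-Vista layout described above, and create an app-specific
# directory for the .pyc cache inside it.
import os

local_appdata = os.environ.get("LOCALAPPDATA") or os.path.join(
    os.environ["USERPROFILE"], "Local Settings", "Application Data"
)
cache_dir = os.path.join(local_appdata, "My Unique App Name", "bytecode-cache")
if not os.path.isdir(cache_dir):
    os.makedirs(cache_dir)
```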
You can also use pywin32 or ctypes to access the native Windows APIs (since there are at least 3 different APIs for this and at least two ways to access those APIs, I don't want to give all possible ways to write this… but a quick google or SO search for "pywin32 SHGetFolderPath" or "ctypes SHGetKnownFolderPath" or whatever should give you what you need).
Or, there are multiple third-party modules to handle this. The first one both Google and PyPI turned up was winshell.
Re-reading the original question, there's a much simpler answer that probably fits your requirements.
I don't know much about Inno, but most installers give you a way to run an arbitrary command as a post-copy step.
So, you can just use python -m compileall to create the .pyc files for you at install time—while you've still got elevated privileges, so there's no problem with UAC.
In fact, if you look at pywin32, and various other Python packages that come as installer packages, they do exactly this. This is an idiomatic thing to do for installing libraries into the user's Python installation, so I don't see why it wouldn't be considered reasonable for installing an executable that uses the user's Python installation.
Of course if the user later decides to uninstall Python 2.6 and install 2.7, your .pyc files will be hosed… but from your description, it sounds like your entire program will be hosed anyway, and the recommended solution for the user would probably be to uninstall and reinstall anyway, right?
I write tools that are used in a shared workspace. Since there are multiple OS's working in this space, we generally use Python and standardize the version that is installed across machines. However, if I wanted to write some things in C, I was wondering if maybe I could have the application wrapped in a Python script, that detected the operating system and fired off the correct version of the C application. Each platform has GCC available and uses the same shell.
One idea was to compile the C code into the user's local ~/bin, with a timestamp comparison against the C source so it is recompiled only when the code is updated rather than on every run. Another was to just compile it for each platform and have the wrapper script select the proper executable.
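A sketch of what that wrapper script might look like (the binary names, the platform map, and the shared path are all made up for illustration):

```python
# Wrapper sketch: pick the per-platform binary and replace this process with it.
import os
import platform
import sys

SUFFIXES = {"Darwin": "mac", "Linux": "linux-x86", "FreeBSD": "bsd"}  # illustrative map

suffix = SUFFIXES.get(platform.system())
if suffix is None:
    sys.exit("unsupported platform: " + platform.system())

binary = "/shared/bin/toolname-" + suffix  # hypothetical shared layout
os.execv(binary, [binary] + sys.argv[1:])
```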
Is there an accepted/stable process for this? Are there any catches? Are there alternatives (assuming the absolute need to use native C code)?
Clarification: Multiple OSes are involved that do not share an ABI, e.g. OS X, various Linuxes, BSD, etc. I need to be able to update the code in place in shared folders and have the new code working more or less instantaneously. Distributing binary or source packages is less than ideal.
Launching a Python interpreter instance just to select the right binary to run would be much heavier than you need. I'd distribute a shell .rc file which provides aliases.
In /shared/bin, you put the various binaries: /shared/bin/toolname-mac, /shared/bin/toolname-debian-x86, /shared/bin/toolname-netbsd-dreamcast, etc. Then, in the common shared shell .rc file, you put the logic to set the aliases according to platform, so that on OSX, it gets alias toolname=/shared/bin/toolname-mac, and so forth.
This won't work as well if you're adding new tools all the time, because the users will need to reload the aliases.
I wouldn't recommend distributing tools this way, though. Testing and qualifying new builds of the tools should be taking up enough time and effort that the extra time required to distribute the tools to the users is trivial. You seem to be optimizing to reduce the distribution time. Replacing tools that quickly in a live environment is all too likely to result in lengthy and confusing downtime if anything goes wrong in writing and building the tools--especially when subtle cross-platform issues creep in.
Also, you could use autoconf and distribute your application in source form only. :)
You know, you should look at static linking.
These days, we all have HUGE hard drives, and a few extra megabytes (for carrying around libc and what not) is really not that big a deal anymore.
You could also try running your applications in chroot() jails and distributing those.
Depending on your mix of OSes, you might be better off creating packages for each class of system.
Alternatively, if they all share the same ABI and hardware architecture, you could also compile static binaries.