Is it possible to install only part of a Python package such as SciPy to reduce the total size? The latest version (0.19.0) appears to use 174 MB, and I'm only using it for its spectrogram feature. I'm putting this on AWS Lambda and I'm over the 512 MB storage limit.
I realize there are other options I could take, e.g. using another spectrogram implementation or manually removing scipy files, but I'm wondering if there is any automated way to do this?
There is no supported way of doing it. You may try manually removing some parts which you do not use, but you're on your own if you go down this route.
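If you do try trimming by hand, the usual approach is to delete the subpackages you never import, along with bundled test suites, and then re-run your import as a smoke test. A rough sketch in Python (the list of removable subpackages is a guess; scipy.signal pulls in several siblings such as fftpack and special, so remove things one at a time and re-test):

import os
import shutil
import scipy

# Guessed list of subpackages not needed for a spectrogram-only use case -- verify against your SciPy version.
UNUSED = ["cluster", "io", "ndimage", "odr", "optimize", "spatial"]

scipy_dir = os.path.dirname(scipy.__file__)
for name in UNUSED:
    shutil.rmtree(os.path.join(scipy_dir, name), ignore_errors=True)

# Bundled test suites are not needed at runtime either.
test_dirs = [os.path.join(root, d)
             for root, dirs, _ in os.walk(scipy_dir)
             for d in dirs if d == "tests"]
for d in test_dirs:
    shutil.rmtree(d, ignore_errors=True)

# Smoke test: this import fails loudly if a required subpackage was removed.
from scipy.signal import spectrogram

Run this against the copy of site-packages you are about to zip up for Lambda, not against your working environment.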
While it is possible to simply use pip freeze to capture the current environment, it is not appropriate to require an environment as bleeding-edge as the one I am used to.
Moreover, some developer tooling is only available in recent versions of packages (think type annotations), but is not needed by users.
My target users may want to use my package on slowly upgrading machines, and I want to get my requirements as low as possible.
For example, I cannot require better than Python 3.6 (and even then I think some users may be unable to use the package).
Similarly, I want to avoid requiring the latest NumPy or Matplotlib versions.
Is there a (semi-)automatic way of determining the oldest compatible version of each dependency?
Alternatively, I could manually try to build a conda environment with old packages, but I would be trying pretty much at random.
Unfortunately, I inherited a medium-sized codebase (~10 kLoC) with no automated tests yet (I plan on writing some, but it takes time, and it sadly cannot be my priority).
The requirements were not properly defined either, so I don't know what it was run with two years ago.
Because semantic versioning is not always honored (and because it may be difficult from a developer's standpoint to determine exactly what counts as a minor or major change for each possible user), and because only a human can read release notes to understand what has changed, there is no simple solution.
My technical approach would be to create a virtual environment with a known working combination of Python and library versions. From there, downgrade one library at a time, one version at a time, verifying each time that everything still works (which may be painful if the check is manual and/or slow).
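If you do have even a rough smoke test, the downgrade loop itself is easy to script. A sketch, assuming per-library candidate version lists and a test command that you supply yourself (the names below are placeholders):

import subprocess
import sys

# Candidate versions per library, newest first -- placeholders to fill from each project's release history.
CANDIDATES = {
    "numpy": ["1.19.5", "1.18.5", "1.17.5", "1.16.6"],
    "matplotlib": ["3.3.4", "3.2.2", "3.1.3", "3.0.3"],
}

def works():
    # Whatever quick check you trust: a smoke script, a subset of the test suite, etc.
    return subprocess.call([sys.executable, "-m", "pytest", "-x", "tests/smoke"]) == 0

oldest_ok = {}
for lib, versions in CANDIDATES.items():
    for version in versions:
        subprocess.check_call([sys.executable, "-m", "pip", "install", lib + "==" + version])
        if not works():
            break
        oldest_ok[lib] = version
    # Restore the newest version before moving on to the next library.
    subprocess.check_call([sys.executable, "-m", "pip", "install", lib + "==" + versions[0]])

print("Oldest versions that still passed:", oldest_ok)

Note that this explores one library at a time and will miss interactions between old versions of two different libraries.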
My social solution would be to timebox the technical approach to no more than a few hours, then settle for what you have reached. Indicate in the README that the library requirements may be overblown and that help is welcome.
Without fast automated tests in which you are confident, there is no way to automate the exploration of the N-dimensional space (each library is a dimension) to find the minimal working versions.
I'm pretty sure this question has been asked multiple times before; however, the solutions are usually about using npm, which as far as I know isn't applicable to Python scripts. The problem is that I hit the package size limit when trying to upload a package that contains the Chromium binary, which by itself exceeds the limit, let alone the other libraries and the code itself. If I understood correctly, Lambda layers won't help either, since a single file's size is already more than the allowed limit. Is there a workaround for this issue?
Note: the package contains the Selenium library, ChromeDriver and an unpacked Linux Chromium version
You can use Docker containers as Lambda images. This allows you to use up to 10 GB.
I ran into the same issue you did when trying to create a Chromium layer. I found out that you can load a zip file that exceeds the size limit into a layer if you use S3 as the source, so I used that method, which did work. By all means use Docker if you are familiar with it, but after struggling with the initial issue I chose the easier S3 option.
Edit: removed reference to console since most here should be comfortable using the CLI.
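For reference, publishing the oversized zip as a layer from S3 can be scripted with boto3; the bucket, key and layer name here are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# The zip must already be uploaded to S3; this avoids the direct-upload request size limit.
response = lambda_client.publish_layer_version(
    LayerName="chromium-selenium",          # placeholder name
    Content={
        "S3Bucket": "my-layer-artifacts",   # placeholder bucket
        "S3Key": "layers/chromium.zip",     # placeholder key
    },
    CompatibleRuntimes=["python3.8"],
)
print(response["LayerVersionArn"])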
I've been using spatial.cKDTree in scipy to calculate distances between points. It has always run very quickly (~1 s) for my typical data sets (finding distances for ~1000 points to an array of ~1e6 points).
I'm running this code in Python 2.7.6 on a computer with Ubuntu 14.10. Up until this morning, I had managed most Python packages with apt-get, including scipy and numpy. I wanted up-to-date versions of a few packages, though, so I decided to remove the packages installed in /usr/lib/python2.7/ by apt-get and re-install everything with pip install (taking care of scipy dependencies like liblapack-dev with apt-get, as necessary). Everything installed and is importable without a problem.
>>> import scipy
>>> import cython
>>> scipy.__version__
'0.16.0'
>>> cython.__version__
'0.22.1'
Now, running spatial.cKDTree on the same size data sets is going really slowly. I'm seeing run times of ~500 s rather than ~1 s. I'm having trouble figuring out what is going on.
Any suggestions as to what I might have done in installing using pip rather than apt-get that would have caused scipy.spatial.cKDTree to run so slowly?
In 0.16.x I added options to build the cKDTree with median or sliding-midpoint rules, as well as a choice of whether to recompute the bounding hyperrectangle at each node in the kd-tree. The defaults are based on experience with the performance of scipy.spatial.cKDTree and sklearn.neighbors.KDTree. In some contrived cases (data that are highly stretched along one dimension) this can have a negative impact, but usually it should be faster. Experiment with building the cKDTree with balanced_tree=False and/or compact_nodes=False. Setting both to False gives you the same behavior as 0.15.x. Unfortunately it is difficult to set defaults that make everyone happy, because the performance depends on the data.
Also note that with balanced_tree=True we compute medians by quickselect when the kd-tree is constructed. If the data happens to be pre-sorted, this will be very slow. In that case it helps to shuffle the rows of the input data, or you can set balanced_tree=False to avoid the partial quicksorts.
There is also a new option to multithread the nearest-neighbor query. Try to call query with n_jobs=-1 and see if it helps for you.
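A quick way to try both suggestions on data shaped like yours (random arrays here stand in for your real points; note that n_jobs was renamed to workers in later SciPy releases):

import numpy as np
from scipy.spatial import cKDTree

data = np.random.rand(1000000, 3)   # stand-in for the ~1e6 reference points
queries = np.random.rand(1000, 3)   # stand-in for the ~1000 query points

# balanced_tree=False and compact_nodes=False reproduce the 0.15.x construction behaviour.
tree = cKDTree(data, balanced_tree=False, compact_nodes=False)

# If you keep balanced_tree=True and your data is pre-sorted, shuffle the rows first:
# np.random.shuffle(data)

distances, indices = tree.query(queries, k=1, n_jobs=-1)  # n_jobs=-1 uses all available cores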
Update June 2020:
SciPy 1.5.0 will use a new algorithm (an introselect-based partial sort from the C++ STL) which solves the problems reported here.
In the next release of SciPy, balanced kd-trees will be created with introselect instead of quickselect, which is much faster on structured datasets. If you use cKDTree on a structured data set such as an image or a grid, you can look forward to a major boost in performance. It is already available if you build SciPy from its master branch on GitHub.
I have a question regarding building distribution files using cx_Freeze.
I have built several distribution packages for different sets of code and applications that I have written in Python.
I usually use cx_Freeze to make my build and distribution packages.
One of the key targets most of the time is the size of the package before and after installation.
Although cx_Freeze picks up the necessary modules, most of the time you end up adding certain libraries yourself, such as matplotlib backends, the numpy library, etc., when you use them as part of your code.
The key trick to reduce the size is excluding the modules that you don't need as part of your code.
Most of the time for me it will be trial and error.
But how can one decide on the most optimized build, stripping all non-essential modules during the build?
Say, for example, my application is not GUI based, so I end up removing tkinter; but at one point a matplotlib backend was using it and I had to bring it back again.
Is it always an iterative process?
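For reference, the stripping discussed above is usually expressed through the build_exe options in setup.py; a minimal sketch (the include/exclude lists are only examples, not a recommendation for every project):

from cx_Freeze import setup, Executable

build_exe_options = {
    # Modules you know you do not use (e.g. no GUI, so no tkinter).
    "excludes": ["tkinter", "unittest", "email"],
    # Modules cx_Freeze misses because they are imported dynamically (e.g. a matplotlib backend).
    "includes": ["matplotlib.backends.backend_agg"],
}

setup(
    name="myapp",        # placeholder
    version="0.1",
    options={"build_exe": build_exe_options},
    executables=[Executable("main.py")],
)

In practice the excludes list still tends to be refined iteratively: build, run, and add back whatever turns out to be needed.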
Is there a good (small and light) alternative to numpy for python, to do linear algebra?
I only need matrices (multiplication, addition), inverses, transposes and such.
Why?
I am tired of trying to install numpy/scipy - it is such a pain to get it to work - it never seems to install correctly (especially since I have two machines, one Linux and one Windows), no matter what I do: compile it or install from pre-built binaries. How hard is it to make a "normal" installer that just works?
I'm surprised nobody mentioned SymPy, which is written entirely in Python and, unlike NumPy, does not require compilation.
There is also tinynumpy, which is a pure Python alternative to NumPy with a limited feature set.
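For the operations listed in the question (multiplication, addition, inverses, transposes), SymPy's Matrix covers them out of the box; a small illustration:

from sympy import Matrix

A = Matrix([[1, 2], [3, 4]])
B = Matrix([[0, 1], [1, 0]])

print(A + B)      # addition
print(A * B)      # matrix multiplication
print(A.T)        # transpose
print(A.inv())    # exact inverse (rational arithmetic, no compiled code involved)
print(A.det())    # determinant

Keep in mind it is symbolic and therefore much slower than numpy for anything large.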
Given your question, I decided to just factor out the matrix code from where I was using it and put it in a publicly accessible place -
So, this is basically a pure Python, ad hoc implementation of a Matrix class which can perform addition, multiplication, matrix determinant and matrix inversion - it should be of some use -
Since it is pure Python and not concerned with performance at all, it is unsuitable for any serious calculation - but it is good enough for playing around with matrices interactively, or where matrix algebra is far from being the critical part of the code.
The repository is here,
https://bitbucket.org/jsbueno/toymatrix/
And you can download it straight from here:
https://bitbucket.org/jsbueno/toymatrix/downloads/toymatrix_0.1.tar.gz
I hear you, I have been there as well. Numpy/scipy are really wonderful libraries and it is a pity that installation problems so often get in the way of their usage.
Also, as far as I understand, there are not many good (easier to use) options either. The only possibly easier solution I know about is the "Yet Another Matrix Module" (see the NumericAndScientific/Libraries listing on python.org). I am not aware of the status of this library (stability, speed, etc.). The likelihood is that in the long run your needs will outgrow any simple library and you will end up installing numpy anyway.
Another notable downside of using any other library is that your code will potentially be incompatible with numpy, which happens to be the de facto standard for linear algebra in Python. Note also that numpy has been heavily optimized - speed is something you are not guaranteed to get with other libraries.
I would really just put more effort into solving the installation/setup problems. The alternatives are potentially much worse.
Have you ever tried anaconda? https://www.anaconda.com/download
This should allow you to install those packages easily.
conda install -c conda-forge scipy
conda install -c conda-forge numpy
Apart from offering you an easy way to install them on Linux/Mac/Windows, you get virtual environment management too.
I sometimes have this problem... not sure if this works for you, but I often install a package using my own account, then try to run it in an IDE (Komodo in my case) and it doesn't work; like your issue, it says it cannot find it. The way I solve this is to use sudo -i to get into root and then install it from there.
If that does not work, can you update your question to provide a bit more info about the type of system you're using (Linux, Mac, Windows), the version of Python/NumPy, and how you're accessing it, so it'll be easier to help.
For people who still have the problem: try Portable Python:
http://portablepython.com/wiki/Download/
Have a look at tinynumpy, tinyarray and sympy:
https://github.com/wadetb/tinynumpy
https://github.com/kwant-project/tinyarray
https://docs.sympy.org/latest/index.html