Long-time R and Python user here. I use R for my daily data analysis and Python for tasks that are heavier on text processing and shell scripting. I am working with increasingly large data sets, which usually arrive as binary or text files. What I normally do is apply statistical/machine-learning algorithms and create statistical graphics. I use R with SQLite sometimes, and write C for iteration-intensive tasks; before looking into Hadoop, I am considering investing some time in NumPy/SciPy because I've heard it has better memory management [and the transition to NumPy/SciPy for someone with my background seems not that big]. I wonder if anyone has experience using the two and could comment on the improvements in this area, and whether there are idioms in NumPy that deal with this issue. (I'm also aware of Rpy2 but am wondering whether NumPy/SciPy can handle most of my needs.) Thanks -
R's strength, when looking for an environment to do machine learning and statistics, is most certainly the diversity of its libraries. To my knowledge, SciPy + SciKits are not a replacement for CRAN.
Regarding memory usage, R uses a pass-by-value paradigm while Python passes references. Pass-by-value can lead to more "intuitive" code; pass-by-reference can help optimize memory usage. NumPy also lets you take "views" on arrays (a kind of subarray with no copy being made).
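For instance, a sliced view shares memory with its parent array; a small sketch:

import numpy as np

a = np.arange(12).reshape(3, 4)
v = a[:, 1:3]          # a view: no data are copied
v[0, 0] = 99           # modifying the view modifies the original array
print(a[0, 1])         # -> 99
print(v.base is a)     # -> True, v shares a's memory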
Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when dealing with numpy arrays (benchmark). Fortunately, Cython lets one get serious speed improvements easily.
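To get a feel for the element-access point, a quick and unscientific timing sketch (numbers will vary by machine):

import timeit

setup = "import numpy as np; a = np.arange(10**6); l = list(range(10**6))"

# summing element-by-element in pure Python: plain lists are quicker to index than ndarrays
print(timeit.timeit("sum(l[i] for i in range(len(l)))", setup=setup, number=10))
print(timeit.timeit("sum(a[i] for i in range(len(a)))", setup=setup, number=10))

# a single vectorized NumPy call is far faster than either explicit loop
print(timeit.timeit("a.sum()", setup=setup, number=10))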
If working with Big Data, I find the support for storage-based arrays better with Python (HDF5).
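For example, with h5py (one of several HDF5 bindings; PyTables is another) you can write and read slices of an on-disk array without ever holding the whole thing in memory; the file name and shapes below are arbitrary:

import numpy as np
import h5py

with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("x", shape=(10**6, 100), dtype="float64")
    dset[:1000, :] = np.random.rand(1000, 100)    # write one block at a time

with h5py.File("data.h5", "r") as f:
    block = f["x"][:1000, :]                      # only this slice is read into memory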
I am not sure you should ditch one for the other, but rpy2 can help you explore your options for a possible transition (arrays can be shuttled between R and NumPy without a copy being made).
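A minimal rpy2 sketch (the conversion API has shifted between rpy2 versions, so treat this as an outline rather than a recipe):

import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import numpy2ri

numpy2ri.activate()             # enable automatic NumPy <-> R conversion
x = np.random.normal(size=100)
r_summary = ro.r["summary"]     # grab R's summary() and call it on the NumPy array
print(r_summary(x))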
I use NumPy daily and R nearly so.
For heavy number crunching, I prefer NumPy to R by a large margin (including R packages like 'Matrix'). I find the syntax cleaner, the function set larger, and computation quicker (although I don't find R slow by any means). NumPy's broadcasting functionality, for instance, has no analog in R that I'm aware of.
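A quick illustration of broadcasting: subtracting a length-3 vector of column means from a 5x3 matrix with no explicit loop and no replication of the vector:

import numpy as np

X = np.random.rand(5, 3)       # 5 observations, 3 features
col_means = X.mean(axis=0)     # shape (3,)
Xc = X - col_means             # the (3,) vector is broadcast across all 5 rows
print(Xc.mean(axis=0))         # ~0 for every column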
For instance, to read in a data set from a csv file and 'normalize' it for input to an ML algorithm (e.g., mean center then re-scale each dimension) requires just this:
import numpy as NP

data = NP.loadtxt(data1, delimiter=",")   # 'data' is a NumPy array; data1 is the csv file path
data -= NP.mean(data, axis=0)             # mean-center each column
data /= NP.max(data, axis=0)              # re-scale each column
Also, I find that when coding ML algorithms, I need data structures that I can operate on element-wise and that also understand linear algebra (e.g., matrix multiplication, transpose, etc.). NumPy gets this and lets you create these hybrid structures easily (no operator overloading or subclassing, etc.).
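For example, the same ndarray supports element-wise arithmetic and linear algebra side by side:

import numpy as np

A = np.random.rand(4, 3)
B = np.random.rand(3, 2)

C = A * 2.0 + 1.0      # element-wise operations ...
D = A.dot(B)           # ... and matrix multiplication on the same object
At = A.T               # transpose
print(D.shape, At.shape)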
You won't be disappointed by NumPy/SciPy; more likely you'll be amazed.
So, a few recommendations--in general and in particular, given the facts in your question:
Install both NumPy and SciPy. As a rough guide, NumPy provides the core data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (e.g., statistics, signal processing, integration).
Install the repository versions, particularly w/r/t NumPy, because the dev version is 2.0. Matplotlib and NumPy are tightly integrated; you can use one without the other of course, but both are the best in their respective class among Python libraries. You can get all three via easy_install, which I assume you already have.
NumPy/SciPy have several modules specifically directed to Machine Learning/Statistics, including the Clustering package and the Statistics package (a small example of both is sketched after this list).
There are also packages directed to general computation which nonetheless make coding ML algorithms a lot faster, in particular Optimization and Linear Algebra.
Finally, there are the SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenience wrappers to streamline coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate Nearest Neighbor) and learn (a set of ML/statistics regression and classification algorithms, e.g., Logistic Regression, Multi-Layer Perceptron, Support Vector Machine).
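As a small example of the Clustering and Statistics packages mentioned above (made-up data; a sketch, not a recipe):

import numpy as np
from scipy import stats
from scipy.cluster.vq import kmeans2, whiten

obs = np.random.rand(200, 2)            # hypothetical 2-D observations

# scipy.stats: descriptive statistics and probability distributions
print(stats.describe(obs[:, 0]))
print(stats.norm.cdf(1.96))

# scipy.cluster: k-means clustering on whitened (unit-variance) data
w = whiten(obs)
centroids, labels = kmeans2(w, 3)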
Related
I've been working with spatstat in R for quite a while now and am curious what the big differences between the packages (PySAL in Python and spatstat in R) are functionality-wise. Is either more powerful or faster? Does one have more pre-built functions?
Thanks loads
This is not really an appropriate question for this forum.
However, the main differences between the two packages can be seen by reading their documentation: spatstat is designed for analysing spatial point patterns, is written by statisticians following statistical principles and conventions, contains current techniques from the statistical literature, and does not handle file input/output directly. PySAL is designed for spatial data in general (with relatively less functionality for spatial point patterns), is written by geographers, and includes capabilities for reading spatial data file formats.
While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):
np.mean(), np.std() (with ddof=0 or 1)
statistics.mean(), statistics.pstdev() (and/or statistics.stdev)
scipy.stats package
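For concreteness, here is how their defaults seem to line up on a small sample (my reading of each library's documentation):

import numpy as np
import statistics
from scipy import stats

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(np.std(x))                  # population std (ddof=0) -> 2.0
print(statistics.pstdev(x))       # population std          -> 2.0
print(np.std(x, ddof=1))          # sample std (ddof=1)
print(statistics.stdev(x))        # sample std
print(stats.tstd(x))              # scipy.stats: sample std (ddof=1) by default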
That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(). It also highlights differences in the sum operator. That post is here:
why-is-statistics-mean-so-slow
I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
It appears the statistics module is primarily for those who have data in lists (or other forms), or for widely varying ranges [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any NumPy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
Numerical algorithms generally have positive and negative aspects: some are faster, or more accurate, or require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for his/her application. Generally this is a balancing act between competing priorities and resources.
My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.
Why does NumPy duplicate features of SciPy?
From the SciPy FAQ What is the difference between NumPy and SciPy?:
In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.
It recommends using SciPy over NumPy:
In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.
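For example, an LU decomposition is available in scipy.linalg but has no numpy.linalg counterpart; a quick check:

import numpy as np
from scipy import linalg

a = np.random.rand(4, 4)
p, l, u = linalg.lu(a)      # LU factorization lives in scipy.linalg only
print(np.allclose(a, p @ l @ u))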
When should I use the statistics library?
From the statistics library documentation:
The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Thus I would not use it for serious (i.e. resource intensive) computation.
What is the difference between statsmodels and SciPy?
From the statsmodels about page:
The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.
Thus you may have a requirement that SciPy is not able to fulfill, or is better fulfilled by a dedicated library.
For example the SciPy documentation for scipy.stats.probplot notes that
Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
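A short sketch of the two side by side (random data; the statsmodels plot needs matplotlib):

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

x = np.random.normal(size=200)

# SciPy: ordered sample vs. theoretical quantiles, plus a least-squares fit
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# statsmodels: a richer ProbPlot object with several plotting methods
pp = sm.ProbPlot(x)
fig = pp.qqplot(line="45")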
Just a simple benchmark.
statistics vs NumPy
import statistics as stat
import numpy as np
from timeit import default_timer as timer
l = list(range(1_000_000))
start = timer()
m, std = stat.mean(l), stat.stdev(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
start = timer()
m, std = np.mean(l), np.std(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
(no iteration for brevity.)
Result
NumPy is way faster. (The small difference in the std values is because statistics.stdev uses the sample formula, ddof=1, while np.std defaults to the population formula, ddof=0.)
# statistics
3.603845 sec elapsed.
mean, std: 499999.5 288675.2789323441
# numpy
0.2315261999999998 sec elapsed.
mean, std: 499999.5 288675.1345946685
Recently I've been working on a problem which requires diagonalizing a huge Hermitian matrix to get all the eigenvalues. Currently I'm using Mathematica to do the job. However, it becomes unworkable due to memory limitations when the matrix size approaches (2^15, 2^15), where the diagonalization costs approximately 32 GB of memory. I've tried using Python, importing the matrix from Mathematica:
import numpy as np
from scipy.io import mmread
from scipy.sparse import csc_matrix
# import as a sparse matrix to save space
h = mmread("h.mtx")
h = csc_matrix(h)

# diagonalize the dense version
ev = np.linalg.eigvalsh(h.todense())
It works, but is unfortunately an order of magnitude slower than Mathematica. So, are there any other possible solutions, say, C++? I know nothing about C++, so I guess the simplest way may be exporting the matrix to C++ and diagonalizing it there.
Thanks!
Running some preliminary test using this matrix:
http://math.nist.gov/MatrixMarket/data/NEP/h2plus/qc2534.html
I determined that the conversion to dense does not take up much of the time. The eigenvalue calculation does.
NumPy uses highly optimized LAPACK routines for the calculation. These are the same ones you'd use from C++, so C++ won't give you much of a speedup. If you want a speedup, exploit the sparsity, move to a machine with more memory, or switch to distributed matrix storage (lots of labor here).
P.S.: if you are doing this for a university project, you might want to look around to see whether your university has a cluster of some sort. A cluster node typically has lots of memory. If not, check Amazon's AWS EC2 or Google's Compute Engine for instances with lots of RAM.
Edit:
Here Wolfram says what Mathematica does behind the scenes: http://reference.wolfram.com/language/tutorial/LinearAlgebraAppendix.html#83486633
ARPACK is an (Arnoldi) subspace solver, giving you only the highest or lowest k eigenvalues; ATLAS is a tuned BLAS/LAPACK implementation, and the rest seems to be for solving linear systems.
All methods that give you the full eigenspectrum require a decomposition of the full NxN matrix. If you only want k eigenvectors, there are methods which reduce the problem to a decomposition of a k x k matrix.
There are modern alternatives to ARPACK (http://slepc.upv.es/ or the solver that comes with MKL), but they all give you a subspace.
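For example, if the lowest few eigenvalues are enough, scipy.sparse.linalg.eigsh wraps ARPACK and works on the sparse matrix directly; a rough sketch reusing the file name from the question:

import numpy as np
from scipy.io import mmread
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import eigsh

h = csc_matrix(mmread("h.mtx"))
# the 10 smallest (algebraic) eigenvalues of the sparse Hermitian matrix,
# computed without ever forming the dense matrix
vals = eigsh(h, k=10, which="SA", return_eigenvectors=False)
print(np.sort(vals))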
C++ won't help much.
In Python you can delegate easily to C++, and a lot of SciPy routines do just that (for performance). I also expect that if you time only the eigenvalue line you will get performance similar to Mathematica, and that the difference comes from reading the data.
The best solution is to look for a more appropriate algorithm, maybe something that operates on the sparse matrix directly, or decompose the original into smaller matrices and combine them.
To make the original approach more tractable you could try increasing the amount of swap space. In Linux it's a dedicated partition; in Windows it's a setting. This should allow Mathematica/Python to use more memory, but it's going to be much slower due to memory thrashing. Get an SSD to speed this setup up, but note that it will wear out faster because of the frequent writes. Or, even better, buy more RAM.
I am interested in the performance of Pyomo for generating an OR model with a huge number of constraints and variables (about 10e6). I am currently using GAMS to launch the optimizations, but I would like to use the various Python features and therefore use Pyomo to generate the model.
I made some tests, and apparently when I write a model, the Python methods used to define the constraints are called each time a constraint is instantiated. Before going further with my implementation, I would like to know whether there is a way to directly create a block of constraints based on NumPy array data. From my point of view, constructing constraints by block may be more efficient for large models.
Do you think it is possible to obtain performance comparable to GAMS or other AML languages with Pyomo or another Python modelling library?
Thanks in advance for your help !
While you can use NumPy data when creating Pyomo constraints, you cannot currently create blocks of constraints in a single NumPy-style command with Pyomo. For what it's worth, I don't believe that you can in languages like AMPL or GAMS, either. While Pyomo may eventually support users defining constraints using matrix and vector operations, it is not likely that that interface would avoid generating the individual constraints, as the solver interfaces (e.g., NL, LP, MPS files) are all "flat" representations that explicitly represent individual constraints. This is because Pyomo needs to explicitly generate representations of the algebra (i.e., the expressions) to send out to the solvers. In contrast, NumPy only has to calculate the result: it gets its efficiency by creating the data in a C/C++ backend (i.e., not in Python), relying on low-level BLAS operations to compute the results efficiently, and only bringing the result back to Python.
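As an illustration of what the question describes, a rule-based constraint block in Pyomo looks roughly like this; the rule is invoked once per index, so every row is still generated individually (made-up data, a sketch rather than a recipe):

import numpy as np
from pyomo.environ import ConcreteModel, Var, Constraint, RangeSet, NonNegativeReals

# hypothetical coefficient data
A = np.random.rand(100, 50)
b = np.random.rand(100)

m = ConcreteModel()
m.I = RangeSet(0, A.shape[0] - 1)
m.J = RangeSet(0, A.shape[1] - 1)
m.x = Var(m.J, domain=NonNegativeReals)

def row_rule(m, i):
    # called once for each i in m.I; Pyomo builds the expression row by row
    return sum(float(A[i, j]) * m.x[j] for j in m.J) <= float(b[i])

m.rows = Constraint(m.I, rule=row_rule)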
As far as performance and scalability go, I have generated raw models with over 13e6 variables and 21e6 constraints. That said, Pyomo was designed for flexibility and extensibility over speed. Runtimes in Pyomo can be an order of magnitude slower than AMPL when using cPython (although that gap can shrink to within a factor of 4 or 5 using pypy). At least historically, AMPL has been faster than GAMS, so the gap between Pyomo and GAMS should be smaller.
I was also wondering the same when I came across this piece of code from Jonas Hörsch and Tom Brown and it was very useful to me:
https://github.com/FRESNA/PyPSA/blob/master/pypsa/opt.py
They define classes to build constraints more efficiently than the original Pyomo parser does. I did some tests on a large model that I have, and it reduced the generation time considerably.
You can build big linear (LP) and mixed-integer (MILP) optimization problems in Python with the open-source tool Linopy. Linopy promises a speedup of 4-6x and a memory reduction of roughly 50%, reaching roughly Julia JuMP performance. See the benchmark:
The tool is part of the PyPSA ecosystem. It is the next-level version of the PyPSA 'opt.py' developments that Jon Cardodo mentioned. It has roughly the same speed and performance but better usability, as reported by its developers.
I want to simulate a propagating wave with absorption and reflection on some bodies in three dimensional space. I want to do it with python. Should I use numpy? Are there some special libraries I should use?
How can I simulate the wave? Can I use the wave equation? But what if I have a reflection?
Is there a better method? Should I do it with vectors? But when the rays diverge, the intensity gets lower. Difficult.
Thanks in advance.
If you do any computationally intensive numerical simulation in Python, you should definitely use NumPy.
The most general algorithm to simulate an electromagnetic wave in arbitrarily-shaped materials is the finite-difference time domain method (FDTD). It solves the wave equation, one time-step at a time, on a 3-D lattice. It is quite complicated to program yourself, though, and you are probably better off using a dedicated package such as Meep.
There are books on how to write your own FDTD simulations: here's one, here's a document with some code for 1-D FDTD and explanations on more than 1 dimension, and Googling "writing FDTD" will find you more of the same.
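To get a feel for the method, a bare-bones 1-D free-space FDTD loop in NumPy looks roughly like this (normalized units, a hard Gaussian source, no absorbing boundaries; all constants are arbitrary choices for illustration):

import numpy as np

nx, nt, kc = 400, 1000, 200        # grid points, time steps, source position
ex = np.zeros(nx)                  # electric field
hy = np.zeros(nx)                  # magnetic field

for t in range(nt):
    # update E from the spatial difference of H (Courant factor 0.5)
    ex[1:] += 0.5 * (hy[:-1] - hy[1:])
    # inject a Gaussian pulse as a hard source
    ex[kc] = np.exp(-0.5 * ((t - 40) / 12.0) ** 2)
    # update H from the spatial difference of E
    hy[:-1] += 0.5 * (ex[:-1] - ex[1:])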
You could also approach the problem by assuming all your waves are plane waves, then you could use vectors and the Fresnel equations. Or if you want to model Gaussian beams being transmitted and reflected from flat or curved surfaces, you could use the ABCD matrix formalism (also known as ray transfer matrices). This takes into account the divergence of beams.
If you are solving 3-D custom PDEs, I would recommend at least a look at FiPy. It'll save you the trouble of building a lot of your matrix preconditioners and solvers from scratch. It uses NumPy and/or Trilinos. Here are some examples.
I recommend you use my project GarlicSim as the framework in which you build the simulation. You will still need to write your algorithm yourself, probably in Numpy, but GarlicSim may save you a bunch of boilerplate and allow you to explore your simulation results in a flexible way, similar to version control systems.
Don't use Python. I've tried using it for computationally expensive things and it just wasn't made for that.
If you need to simulate a wave in a Python program, write the necessary code in C/C++ and export it to Python.
Here's a link to the C API: http://docs.python.org/c-api/
Be warned, it isn't the easiest API in the world :)