I have a nD array, say of dimensions: (144, 522720) and I need to compute its FFT.
PyFFTW seems slower than numpy and scipy, that it is NOT expected.
Am I doing something obviously wrong?
Below is my code
import numpy
import scipy
import pyfftw
import time
n1 = 144
n2 = 522720
loops = 2
pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_ESTIMATE'
# pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
Q_2 = pyfftw.empty_aligned([n1, n2], dtype='complex_')
Q_ref = pyfftw.empty_aligned([n1, n2], dtype='complex_')
# repeat a few times to see if pyfft planner helps
for i in range(0,loops):
Q_1 = numpy.random.rand(n1,n2)
s1 = time.time()
Q_ref = numpy.fft.fft(Q_1, axis=0)
print('NUMPY - elapsed time: ', time.time() - s1, 's.')
s1 = time.time()
Q_2 = scipy.fft.fft(Q_1, axis=0)
print('SCIPY - elapsed time: ', time.time() - s1, 's.')
print('Equal = ', numpy.allclose(Q_2, Q_ref))
s1 = time.time()
Q_2 = pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0)
print('PYFFTW NUMPY - elapsed time = ', time.time() - s1, 's.')
print('Equal = ', numpy.allclose(Q_2, Q_ref))
s1 = time.time()
Q_2 = pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0)
print('PYFFTW SCIPY - elapsed time = ', time.time() - s1, 's.')
print('Equal = ', numpy.allclose(Q_2, Q_ref))
s1 = time.time()
fft_object = pyfftw.builders.fft(Q_1, axis=0)
Q_2 = fft_object()
print('FFTW PURE Elapsed time = ', time.time() - s1, 's')
print('Equal = ', numpy.allclose(Q_2, Q_ref))
Firstly, if you turn on the cache before you main loop, the interfaces work largely as expected:
pyfftw.interfaces.cache.enable()
pyfftw.interfaces.cache.set_keepalive_time(30)
It's interesting that despite wisdom that should be stored, the construction of the pyfftw objects is still rather slow when the cache is off. No matter, this is exactly the purpose of the cache. In your case you need to make the cache keep-alive time quite long because your loop is very long.
Secondly, it's not a fair comparison to include the construction time of the fft_object in the final test. If you move it outside the timer, then calling fft_object is a better measure.
Thirdly, it's also interesting to see that even with cache turned on, the call to numpy_fft is slower than the call to scipy_fft. Since there is no obvious difference in the code path, I suggest that is caching issue. This is the sort of issue that timeit seeks to mitigate. Here's my proposed timing code which is more meaningful:
import numpy
import scipy
import pyfftw
import timeit
n1 = 144
n2 = 522720
pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
pyfftw.interfaces.cache.enable()
pyfftw.interfaces.cache.set_keepalive_time(30)
times = timeit.repeat(lambda: numpy.fft.fft(Q_1, axis=0), repeat=5, number=1)
print('NUMPY fastest time = ', min(times))
times = timeit.repeat(lambda: scipy.fft.fft(Q_1, axis=0), repeat=5, number=1)
print('SCIPY fastest time = ', min(times))
times = timeit.repeat(
lambda: pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0), repeat=5, number=1)
print('PYFFTW NUMPY fastest time = ', min(times))
times = timeit.repeat(
lambda: pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0), repeat=5, number=1)
print('PYFFTW SCIPY fastest time = ', min(times))
fft_object = pyfftw.builders.fft(Q_1, axis=0)
times = timeit.repeat(lambda: fft_object(Q_1), repeat=5, number=1)
print('FFTW PURE fastest time = ', min(times))
On my machine this gives an output like:
NUMPY fastest time = 0.6622681759763509
SCIPY fastest time = 0.6572431400418282
PYFFTW NUMPY fastest time = 0.4003451430471614
PYFFTW SCIPY fastest time = 0.40362057799939066
FFTW PURE fastest time = 0.324020683998242
You can do a bit better if you don't force it to copy the input array into a complex data type by changing Q_1 to be complex128:
NUMPY fastest time = 0.6483533839927986
SCIPY fastest time = 0.847397351055406
PYFFTW NUMPY fastest time = 0.3237176960101351
PYFFTW SCIPY fastest time = 0.3199474769644439
FFTW PURE fastest time = 0.2546963169006631
That interesting scipy slow-down is repeatable.
That said, if your input is real, you should be doing a real transform (for >50% speed-up with pyfftw) and manipulating the resultant complex output.
What's interesting about this example is (I think) how important the cache is in the results (which I suggest is why switching to a real transform is so effective in speeding things up). You see something dramatic also when you use change the array size to 524288 (the next power of two, which you think might perhaps speed things up, but not slow it down dramatically). In this case everything slows down quite a bit, scipy particularly. It feels to me that scipy is more cache sensitive, which would explain the slow down with changing the input to complex128 (522720 is quite a nice number for FFTing though, so perhaps we should expect a slowdown).
Finally, if speed is secondary to accuracy, you can always use 32-bit floats as the data type. If you combine that with doing a real transform, you get a better than factor of 10 speed-up over the initial numpy best given above:
PYFFTW NUMPY fastest time = 0.09026529802940786
PYFFTW SCIPY fastest time = 0.1701313250232488
FFTW PURE fastest time = 0.06202622700948268
(numpy and scipy don't change much as I think they use 64-bit floats internally).
Edit: I forgot that the Scipy's fftpack real FFTs have a weird output structure, which pyfftw replicates with some slowdown. This is changed to be more sensible in the new FFT module.
The new FFT interface is implemented in pyFFTW and should be preferred. There was unfortunately a problem with the docs being rebuilt so the docs were a long time out of date and didn't show the new interface - hopefully that is fixed now.
Related
I have a need to do very very fast and efficient way of rolling linear regression.
I looked through these two threads :
Efficient way to do a rolling linear regression
Rolling linear regression
From them, I had inferred numpy was (computationally) the fastest. However, using my (limited) python skills, I found the time to compute the same set of rolling data, was *** the same ***.
Is there a faster way to compute than either of the 3 methods I post below? I would have thought the numpy way is much faster, but unfortunately, it wasn't.
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
return np.poly1d(poly)[1]
win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.empty(0)
for cnt_ in range(len(tmp_)-win_):
tmp1_ = tmp_[cnt_:cnt_+ win_]
grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
roll_np = np.append(roll_np, grad_)
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
tl;dr
My answer is
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T # xxx) # (xxx.T) # view.T)[0]
And it takes 1.2 ms to compute, compared to 2 seconds for your pandas and numpy version, and 3.5 seconds for your stat version.
Long version
One method could be to use sliding_window_view to transform your tmp_ array, into an array of window (a fake one: it is just a view, not really a 10000x30 array of data. It is just tmp_ but viewed differenty. Hence the _view in the function name).
No direct advantage. But then, from there, you can try to take advantage of vectorization.
I do that two different way: an easy one, and one that takes a minute of thinking. Since I put the best answer first, the rest of this message can appear inconsistent chronologically (I say things like "in my previous answer" when the previous answer come later), but I tried to redact both answer consistently.
New answer : matrix operations
One method to do that (since lstsq is of the rare numpy method that wouldn't just do it naturally) is to go back to what lstsq(X,Y) does in reality: it computes (XᵀX)⁻¹Xᵀ Y
So let's just do that. In python, with xxx being the X array (of arange and 1 in your example) and view the array of windows to your data (that is view[i] is tmp_[i:i+win_]), that would be np.linalg.inv(xxx.T#xxx)#xxx.T#view[i] for i being each row. We could vectorize that operation with np.vectorize to avoid iterating i, as I did for my first solution (see below). But the thing is, we don't need to. That is just a matrix times a vector. And the operation computing a matrix times a vector for each vector in an array of vectors, is just matrix multiplication!
Hence my 2nd (and probably final) answer
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T # xxx) # (xxx.T) # view.T)[0]
roll_mat is still identical (with one extra row because your roll_np stopped one row short of the last possible one) to roll_np (see below for graphical proof with my first answer. I could provide a new image for this one, but it is indistinguishable from the one I already used). So same result (unsurprisingly I should say... but sometimes it is still a surprise when things work exactly like theory says they do)
But timing, is something else. As promised, my previous factor 4 was nothing compared to what real vectorization can do. See updated timing table:
Method
Time
pandas
2.10 s
numpy roll
2.03 s
stat
3.58 s
numpy view/vectorize (see below)
0.46 s
numpy view/matmult
1.2 ms
The important part is 'ms', compared to other 's'.
So, this time factor is 1700 !
Old-answer : vectorize
A lame method, once we have this view could be to use np.vectorize from there. I call it lame because vectorize is not supposed to be efficient. It is just a for loop called by another name. Official documentation clearly says "not to be used for performance". And yet, it would be an improvement from your code
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
f = np.vectorize(lambda y: np.linalg.lstsq(xxx,y,rcond=None)[0][0], signature='(n)->()')
roll_vectorize=f(view)
Firt let's verify the result
plt.scatter(f(view)[:-1], roll_np))
So, obviously, same results as roll_np (which, I've checked the same way, are the same results as the two others. With also the same variation on indexing since all 3 methods have not the same strategy for border)
And the interesting part, timings:
Method
Time
pandas
2.10 s
numpy roll
2.03 s
stat
3.58 s
numpy view/vectorize
0.46 s
So, you see, it is not supposed to be for performance, and yet, I gain more that x4 times with it.
I am pretty sure that a more vectorized method (alas, lstsq doesn't allow directly it, unlike most numpy functions) would be even faster.
First if you need some tips for optimizing your python code, I believe this playlist might help you.
For making it faster; "Append" is never a good way, you think of it in terms of memory, every time you append, python may create a completely new list with a bigger size (maybe n+1; where n is old size) and copy the last items (which will be n places) and for the last one will be added at last place.
So when I changed it to be as follows
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
return np.poly1d(poly)[1]
win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.zeros(len(tmp_)-win_) ### Change
for cnt_ in range(len(tmp_)-win_):
tmp1_ = tmp_[cnt_:cnt_+ win_]
grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
roll_np[cnt_] = grad_ ### Change
# roll_np = np.append(roll_np, grad_) ### Change
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
I initialized the array from first place with the size of how it's expected to turn to be(len(tmp_)-win_ in range), and just assigned values to it later, and it was much faster.
there are also some other tips you can do, Python is interpreted language, meaning each time it takes a line, convert it to machine code, then execute it, and it does that for each line. Meaning if you can do multiple things at one line, meaning they will get converted at one time to machine code, it shall be faster, for example, think of list comprehension.
I would like to know, is there any simple method to parallel einsum in Numpy?
I found some discussions
Numpy np.einsum array multiplication using multiple cores
Any chance of making this faster? (numpy.einsum)
numpy.tensordot() only for binary contraction with a single axis, Numba needs to specify certain loops. Is there any simple and robust approach to parallel einsum (possibly including opt-einsum, tf-einsum etc) with arbitrary contractions?
A sample code is as following (if necessary I can use more complicated contraction as the example)
import numpy as np
import timeit
import time
na = nc = 1000
nb = 1000
n_iter = 10
A = np.random.random((na,nb))
B = np.random.random((nb,nc))
t_total = 0.
for i in range(n_iter):
start = time.time()
C = np.einsum('ij,jk->ik', A, B)
end = time.time()
t_total += end - start
print('AB->C',(t_total)/n_iter)
If I have an array on the GPU, it is really slow (order of hundreds of seconds) to copy back an array of shape (20, 256, 256).
My code is the following:
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
xt_gpu = cp.asarray(xt)
# Also very fast...
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
# Very very very very very slow....
result_cpu = cp.asnumpy(result_gpu)
I measured the times using cp.cuda.Event() with record and synchronize to avoid measuring any random times, but is still the same result, the GPU->CPU transfer is incredible slow. However, using PyTorch or TensorFlow this is not the case (out of experience for similar data size/shape)... What am I doing wrong?
I think you might be timing it wrong. I modified the code to synchronize between every GPU operation and it seems like the convolution takes the majority of the time with both transfer operations being very fast.
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
import time
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
t0 = time.time()
xt_gpu = cp.asarray(xt)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Also very fast...
t0 = time.time()
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Very very very very very slow....
t0 = time.time()
result_cpu = cp.asnumpy(result_gpu)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
Output:
0.1380000114440918
4.032999753952026
0.0010001659393310547
To me it seems like you are not actually synchronizing between calls when you tested it. Until the transfer back to a numpy array all operations are simply queued up and seem to finish instantly without the synchronize calls. This would lead to the measured GPU->CPU transfer time actually being the time for the convolution and the transfer.
I also meet the same problem, I found that accessing Float64 data is way faster than Float32, maybe you can try to .astype(float64).
My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1) plot(0:nIter,X)
Python:
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)
How to speed up your Python code
Your largest time sink is np.cos which performs several checks on the format of the input.
These are relevant and usually negligible for high-dimensional inputs, but for your one-dimensional input, this becomes the bottleneck.
The solution to this is to use math.cos, which only accepts one-dimensional numbers as input and thus is faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by having one state variable which you update and only writing to x once per iteration.
With all of this, you can speed up things by a factor of roughly ten:
import numpy as np
from math import cos
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
state += Step*state**2*cos(Step*i+state)
x[i+1] = state
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using uFuncs (e.g., created with SymPy’s ufuncify) and using NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos
nIter = 1000000
Step = 0.001
state = x[0] = 1
f = ufuncify([y,t],y+Step*y**2*cos(t+y))
times = np.arange(0,nIter*Step,Step)
times[0] = 1
x = f.accumulate(times)
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and is a very simple dynamics.
Moreover, you are using the Euler method, which is either not very efficient or robust, depending on your step size.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need:
With a step size of 1, You can see what’s going on just fine.
If you want a robust integration in such cases, it’s almost always best to use a modern adaptive integrator, that can adjust its step size itself, e.g., here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
lambda t,state: state**2*cos(t+state),
t_span = (0,T),
t_eval = np.arange(0,T,dt),
y0 = [1],
rtol = 1e-5
).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calcultion (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example in with a sampling step of 1.
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.
The difference is jit-compilation, which Matlab uses per default. Let's try your example with Numba(a Python jit-compiler)
Code
import numba as nb
import numpy as np
import time
nIter = 1000000
Step = .001
#nb.njit()
def integrate(nIter,Step):
x = np.zeros(1+nIter)
x[0] = 1
for i in range(1,1+nIter):
x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
return x
#Avoid measuring the compilation time,
#this would be also recommendable for Matlab to have a fair comparison
res=integrate(nIter,Step)
start = time.time()
for i in range(100):
res=integrate(nIter,Step)
end=time.time()
print((end - start)/100)
This results in 0.022s runtime per call.
Maybe I'm doing something odd, but maybe found a surprising performance loss when using numpy, seems consistent regardless of the power used. For instance when x is a random 100x100 array
x = numpy.power(x,3)
is about 60x slower than
x = x*x*x
A plot of the speed up for various array sizes reveals a sweet spot with arrays around size 10k and a consistent 5-10x speed up for other sizes.
Code to test below on your own machine (a little messy):
import numpy as np
from matplotlib import pyplot as plt
from time import time
ratios = []
sizes = []
for n in np.logspace(1,3,20).astype(int):
a = np.random.randn(n,n)
inline_times = []
for i in range(100):
t = time()
b = a*a*a
inline_times.append(time()-t)
inline_time = np.mean(inline_times)
pow_times = []
for i in range(100):
t = time()
b = np.power(a,3)
pow_times.append(time()-t)
pow_time = np.mean(pow_times)
sizes.append(a.size)
ratios.append(pow_time/inline_time)
plt.plot(sizes,ratios)
plt.title('Performance of inline vs numpy.power')
plt.ylabel('Nx speed-up using inline')
plt.xlabel('Array size')
plt.xscale('log')
plt.show()
Anyone have an explanation?
It's well known that multiplication of doubles, which your processor can do in a very fancy way, is very, very fast. pow is decidedly slower.
Some performance guides out there even advise people to plan for this, perhaps even in some way that might be a bit overzealous at times.
numpy special-cases squaring to make sure it's not too, too slow, but it sends cubing right off to your libc's pow, which isn't nearly as fast as a couple multiplications.
I suspect the issue is that np.power always does float exponentiation, and it doesn't know how to optimize or vectorize that on your platform (or, probably, most/all platforms), while multiplication is easy to toss into SSE, and pretty fast even if you don't.
Even if np.power were smart enough to do integer exponentiation separately, unless it unrolled small values into repeated multiplication, it still wouldn't be nearly as fast.
You can verify this pretty easily by comparing the time for int-to-int, int-to-float, float-to-int, and float-to-float powers vs. multiplication for a small array; int-to-int is about 5x as fast as the others—but still 4x slower than multiplication (although I tested with PyPy with a customized NumPy, so it's probably better for someone with the normal NumPy installed on CPython to give real results…)
The performance of numpys power function scales very non-linearly with the exponent. Constrast this with the naive approach which does. The same type of scaling should exist, regardless of matrix size. Basically, unless the exponent is sufficiently large, you aren't going to see any tangible benefit.
import matplotlib.pyplot as plt
import numpy as np
import functools
import time
def timeit(func):
#functools.wraps(func)
def newfunc(*args, **kwargs):
startTime = time.time()
res = func(*args, **kwargs)
elapsedTime = time.time() - startTime
return (res, elapsedTime)
return newfunc
#timeit
def naive_power(m, n):
m = np.asarray(m)
res = m.copy()
for i in xrange(1,n):
res *= m
return res
#timeit
def fast_power(m, n):
# elementwise power
return np.power(m, n)
m = np.random.random((100,100))
n = 400
rs1 = []
ts1 = []
ts2 = []
for i in xrange(1, n):
r1, t1 = naive_power(m, i)
ts1.append(t1)
for i in xrange(1, n):
r2, t2 = fast_power(m, i)
ts2.append(t2)
plt.plot(ts1, label='naive')
plt.plot(ts2, label='numpy')
plt.xlabel('exponent')
plt.ylabel('time')
plt.legend(loc='upper left')