scipy.signal.lfilter in Python

I have used
scipy.signal.lfilter(coefficient, 1, input, axis=0)
to filter a signal in Python with 9,830,000 samples (I have to use axis=0 to get an answer similar to MATLAB's). Compared to MATLAB's
filter(coefficient, 1, input),
which takes seconds, it takes a very long time (minutes), and it gets worse when I have to filter several times. Is there any suggestion for that? I have tried NumPy as well, with the same timing issue. Is there any other filter that would give me similar answers?

You can use ifilter from itertools to implement a faster filter.
The problem with Python's built-in function filter is that it returns a list, which is memory-consuming when you are dealing with lots of data. itertools.ifilter is a better option because it is lazy and only calls the function when needed.
Some other ideas, depending on your implementation:
If you are designing an FIR filter, you can apply convolution instead; see the sketch below.
Or you can use an FFT-based correlation to get faster results.
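As a sketch of the convolution idea (not from the original answer, and assuming a simple FIR case with denominator 1): lfilter(b, 1, x) is equivalent to a full convolution truncated to the signal length, and scipy.signal.fftconvolve computes that with FFTs, which is often much faster for long filters.
import numpy as np
from scipy import signal
rng = np.random.default_rng(0)
x = rng.standard_normal(9_830_000)          # a signal of roughly the size mentioned above
b = signal.firwin(101, 0.1)                 # hypothetical FIR coefficients
y_lfilter = signal.lfilter(b, 1, x)         # direct filtering
y_fft = signal.fftconvolve(x, b)[:len(x)]   # FFT-based convolution, truncated to the signal length
print(np.allclose(y_lfilter, y_fft))        # the two results agree to floating-point precision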

Related

Why is Scipy.spatial.distance.cdist fast if it uses a double for loop?

I've been told that double for loops are slow in Python. However, while digging into the implementation of Scipy.spatial.distance.cdist(), I found that it's just a double for loop and not some clever numpy trick. And yet, many answers (e.g. https://stackoverflow.com/a/62416370) recommend using cdist().
My question is: why is the function still so fast if it's just naively iterating over all elements in O(n^2)?
The function with the two for loops that you are pointing to, _cdist_callable(), is not used in most cases. If you read the cdist() function, you can see that _cdist_callable() is only run, for example, when you provide your own metric function.
In the typical scenario, when you provide the metric as a string (euclidean, chebyshev, cityblock, etc.), C-optimized functions are used instead.
And the "handles" to those C-optimized functions for those metrics are listed here, in the same file.

Is `built-in method numpy.core._multiarray_umath.implement_array_function` a performance bottleneck?

I'm using numpy v1.18.2 in some simulations, and using built-in functions such as np.unique, np.diff and np.interp. I'm using these functions on standard objects, i.e. lists or numpy arrays.
When I checked with cProfile, I saw that these functions make a call to a built-in method numpy.core._multiarray_umath.implement_array_function, and that this method accounts for 32.5% of my runtime! To my understanding, this is a wrapper that performs some checks to make sure that the arguments passed to the function are compatible with the function.
I have two questions:
Is this function (implement_array_function) actually taking up so much time, or is it the operations I'm doing (np.unique, np.diff, np.interp) that are taking up all this time? That is, am I misinterpreting the cProfile output? I was confused by the hierarchical output of snakeviz. Please see the snakeviz output here and the details for the function here.
Is there any way to disable it/bypass it, since the inputs need not be checked each time as the arguments I pass to these numpy functions are already controlled in my code? I am hoping that this will give me a performance improvement.
I already saw this question (what is numpy.core._multiarray_umath.implement_array_function and why it costs lots of time?), but I was not able to understand what exactly the function is or does. I also tried to understand NEP 18, but couldn't make out how to exactly solve the issue. Please fill in any gaps in my knowledge and correct any misunderstandings.
Also, I'd appreciate it if someone could explain this to me like I'm 5 (r/explainlikeimfive/) instead of assuming developer-level knowledge of Python.
All the information below is taken from NEP 18.
Is this function (implement_array_function) actually taking up so much time, or is it the operations I'm doing (np.unique, np.diff, np.interp) that are taking up all this time?
As @hpaulj correctly mentioned in the comments, the dispatcher adds an overhead of 2-3 microseconds to each numpy function call. This will probably be shortened to 0.5-1 microseconds once it is implemented in C. See here.
Is there any way to disable it/bypass it
Yes, from NumPy 1.17, you can set the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION to 0 (before importing numpy) and this will disable the use of implement_array_function (See here). Something like
import os
os.environ['NUMPY_EXPERIMENTAL_ARRAY_FUNCTION'] = '0'
import numpy as np
However, disabling it will probably not give you any notable performance improvement, as its overhead is just a few microseconds per call, and the dispatch mechanism is the default in later numpy versions anyway.
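To check whether the dispatch overhead actually matters for your workload (a rough sketch, not part of the original answer), time one of the numpy calls on a representative input; if a single call already takes tens of microseconds or more, a few microseconds of dispatch are negligible.
import timeit
import numpy as np
x = np.arange(1_000, dtype=float)
per_call = timeit.timeit(lambda: np.diff(x), number=100_000) / 100_000   # np.diff goes through the dispatcher
print(f"np.diff on 1000 elements: {per_call * 1e6:.2f} microseconds per call")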

Using dask for loop parallelization in nested loops

I am just learning to use dask and have read many threads on this forum related to Dask and for loops, but I am still unclear on how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location dimension (nxy locations in total), as my calculations can proceed simultaneously at different locations.
for loc in range(0, nxy):  # nxy = total no. of locations
    for it in range(0, ntimes):
        out1 = expression1 involving ( var1(loc), var2(it,loc) )
        out2 = expression2 involving ( var1(loc), var2(it,loc) )
        # <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this but might not be doing it correctly as I did not get any speed-up compared to without using dask. Here's what I did:
from dask import delayed

def test_dask(n):
    for loc in range(0, n):
        ...  # same code as before
    return var1  # just returning one variable for now

var1 = delayed(test_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple arrays and operating element by element. Please be warned that Python is very slow at this. You might want to not use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions
Use Numba
Also, given the terms you're using, like lat/lon/depth, it may be that Xarray is a good project for you.
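As a minimal sketch of the Numba route mentioned above (with made-up arithmetic standing in for expression1 and expression2, since the real expressions aren't shown), compiling the nested loop with numba.njit removes the per-element Python overhead:
import numpy as np
import numba

@numba.njit
def compute(var1, var2):  # var1 has shape (nxy,), var2 has shape (ntimes, nxy)
    ntimes, nxy = var2.shape
    out1 = np.empty((ntimes, nxy))
    out2 = np.empty((ntimes, nxy))
    for loc in range(nxy):
        for it in range(ntimes):
            out1[it, loc] = var1[loc] + var2[it, loc]  # placeholder for expression1
            out2[it, loc] = var1[loc] * var2[it, loc]  # placeholder for expression2
    return out1, out2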

Vectorizing CONDITIONAL logic in Python?

Old-school C programmer trying to get with the times and learn Python. Struggling to see how to use vectorization effectively to replace for loops. I get the basic concept that Python can do mathematical functions on entire matrices in a single statement, and that's really cool. But I seldom work with mathematical relationships. Almost all my for loops apply CONDITIONAL logic.
Here's a very simple example to illustrate the concept:
import numpy as np
# Initial values
default = [1,2,3,4,5,6,7,8]
# Override values should only replace initial values when not nan
override = [np.nan,np.nan,3.5,np.nan,5.6,6.7,np.nan,8.95]
# I wish I knew how to replace this for loop with a single line of vectorized code
for i in range(len(default)):
    if not np.isnan(override[i]):  # Only override when the override value is other than nan
        default[i] = override[i]
default
I have a feeling that for loop could be eliminated with a single python statement that only overwrites values of default with values of override that are not np.nan. But I can't see how to do it.
This is just a simplified example to illustrate the concept. My real question is whether or not vectorization is generally useful to replace for loops with conditional logic, or if it's only applicable to mathematical relationships, where the benefits and method of achieving them are obvious. All of my real code challenges are much more complex and the conditional logic is more complex than just a simple "only use this value if it's non-nan".
I found hundreds of articles online about how to use vectorization in Python, but they all seem to focus on replacing mathematical calculations in for loops. All my for loops involve conditional logic. Can vectorization help me or am I trying to fit a square peg in a round hole?
Thanks!
First things first, the vectorized version:
override_is_not_nan = np.logical_not(np.isnan(override))
np.where(override_is_not_nan, override, default)
As for your real question: vectorization is useful because it lets the work run in parallel, and not just on multi-core CPUs. Considering that today's GPUs have thousands of cores, running similar code on tensors can make it much faster. How much faster depends on your data, implementation and hardware. Evidently, the combination of vectorization and GPUs is part of what enabled the huge progress in the field of deep learning.
A list comprehension is usually the preferred one-line alternative to a for loop in Python, and it is possible to include a conditional in the comprehension as well.
In this specific case, we iterate over the elements of default and override by zipping them together and replace the values of default according to the conditional check.
>>> [y if not(np.isnan(y)) else x for (x,y) in zip(default, override)]
[1, 2, 3.5, 4, 5.6, 6.7, 7, 8.95]
To answer your broader question about vectorization and speedups: the answer, unfortunately, is that it depends. There are situations where a simple for loop performs better than its vectorized counterpart. List comprehensions, for example, mainly improve the readability of code rather than providing a serious speedup.
The answers on this question address this in more detail.
First, find the indices where the non-nan values are located.
Then replace the values at those indices in the default array with the corresponding override values.
import numpy as np
np_default = np.array(default).astype(float) # Convert default to a numpy array of floats
non_nan_indices = np.where(~np.isnan(override)) # Get non nan indices
np_default[non_nan_indices] = np.array(override)[non_nan_indices] # Replacing the values at non-nan indices
np_default # Returns array([1. , 2. , 3.5 , 4. , 5.6 , 6.7 , 7. , 8.95])
Vectorization is where numpy comes to your help: it takes advantage of the typed nature of arrays, which results in much faster operations. See [BlogPost] for details.
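To make the speedup concrete (a rough sketch, assuming a much larger array than the eight-element example above), you can time the Python loop against the equivalent np.where call:
import timeit
import numpy as np
rng = np.random.default_rng(0)
default_big = rng.random(1_000_000)
override_big = np.where(rng.random(1_000_000) < 0.5, rng.random(1_000_000), np.nan)

def loop_version():
    out = default_big.copy()
    for i in range(len(out)):
        if not np.isnan(override_big[i]):
            out[i] = override_big[i]
    return out

def vectorized_version():
    return np.where(np.isnan(override_big), default_big, override_big)

print(timeit.timeit(loop_version, number=1))        # element-by-element Python loop
print(timeit.timeit(vectorized_version, number=1))  # single vectorized call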

Should the DataFrame function groupBy be avoided?

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)
However, the DataFrame API does not offer a "reduce" option. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.
A DataFrame groupBy followed by an agg will not move the data around unnecessarily; see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true: there it is preferable to avoid groupByKey and use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group, which can be achieved with the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max, there is a multitude of functions that can be used in combination with agg; see here for all operations.
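For example (a sketch with hypothetical column names, reusing the answer's joined_df), several aggregations can be combined in a single agg call:
from pyspark.sql import functions as F
summary = joined_df.groupBy("A").agg(
    F.max("TotalValue").alias("MaxValue"),
    F.avg("TotalValue").alias("AvgValue"),  # hypothetical additional aggregations
    F.count("*").alias("RowCount"),
)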
The documentation is pretty all over the place.
There has been a lot of optimization work for DataFrames. DataFrames have additional information about the structure of your data, which helps with this. I often find that many people recommend DataFrames over RDDs due to "increased optimization."
There is a lot of heavy wizardry behind the scenes.
I recommend that you try groupBy on both RDDs and DataFrames on large datasets and compare the results. Sometimes, you just need to try it yourself.
Also, for performance improvements, I suggest fiddling (through trial and error) with:
the Spark configuration settings (docs)
spark.sql.shuffle.partitions (docs)
