Custom Interpolation on Pandas quantile function? - python

I need to implement R's quantile function type 3 in pandas, which, according to the documentation (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile), is the nearest even order statistic.
Pandas only has the basic interpolation options like lower, higher, linear, and nearest. (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)
How do I implement nearest-even-order-statistic interpolation in pandas?
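Since pandas doesn't expose this, one option is to compute it directly with numpy. R's type 3 rule picks the order statistic at rank n*q, with half-integer ranks rounded to the even neighbour, which matches numpy's default round-half-to-even behaviour (a sketch; quantile_type3 is a hypothetical helper name):

```python
import numpy as np

def quantile_type3(a, q):
    # R's quantile type 3: nearest order statistic, with ties (n*q exactly
    # halfway between two ranks) broken toward the even rank.
    a = np.sort(np.asarray(a))
    n = len(a)
    # np.round rounds half-integers to the nearest even integer,
    # which is exactly the tie-breaking rule type 3 requires.
    idx = int(np.round(n * q)) - 1  # 1-based rank -> 0-based index
    return a[np.clip(idx, 0, n - 1)]

quantile_type3(np.arange(1, 11), 0.25)  # 2, matching R's quantile(1:10, .25, type=3)
```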

Related

Rolling first derivative and second derivative in pandas

I'm trying to create a function to find the rolling derivatives (first and second) in Pandas.
I find that df.diff() is quite convenient.
I want to find the derivatives with the rolling window value = 40.
For the first derivative,
noise = np.random.normal(size=int(1e4))
noise=pd.DataFrame(noise)
first_derivative=noise.diff(periods=40)
Is it correct if I use this for the second derivative?
second_derivative=noise.diff(periods=40).diff()
I'm confused: if I put periods=40 in the second .diff() as well, wouldn't that give a 40*40 rolling window (for the second derivative)?
Thank you!
Pandas is not a symbolic-math library, and its diff() operation just takes discrete differences between elements, not derivatives.
To take symbolic derivatives, I would recommend SymPy, a nice Python library for symbolic mathematics. Check its documentation for further details.
Example:
>>> from sympy import symbols, diff, cos
>>> x = symbols('x')
>>> diff(cos(x), x)
-sin(x)
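For numerical data like the noise series above, though, plain differencing is fine; the key point is that chaining .diff(periods=40) twice spans 80 samples, not 40*40 = 1600 (a sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=10_000))

# First difference over a lag of 40: x[t] - x[t-40]
first = noise.diff(periods=40)

# Second difference at the same lag: x[t] - 2*x[t-40] + x[t-80].
# Repeating periods=40 spans only 80 samples, not a 1600-sample window.
second = noise.diff(periods=40).diff(periods=40)

# Sanity check against the closed form
t = 100
assert np.isclose(second[t], noise[t] - 2 * noise[t - 40] + noise[t - 80])
```

Dividing by the sample spacing (here, 40 steps) would turn these differences into finite-difference derivative estimates.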

How to calculate statistical significance of correlation in pandas or some other python module

I am using the Pandas library in Python.
So, I am calculating correlations between attributes in my data-set.
df.corr()
This is a nice function that returns a data-frame of correlations between all attributes.
There is a notion called the statistical significance of a correlation, which tests the probability that the observed correlation arose by chance (description here).
So does anybody know if there is a function to return this in pandas or in some other Python library?
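scipy.stats can do this: pearsonr returns both the correlation coefficient and a two-sided p-value, so a significance matrix to pair with df.corr() can be built like this (a sketch; corr_pvalues is a hypothetical helper name):

```python
import numpy as np
import pandas as pd
from scipy import stats

def corr_pvalues(df):
    # p-value of Pearson's r for every pair of numeric columns
    cols = df.columns
    p = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for i in cols:
        for j in cols:
            p.loc[i, j] = stats.pearsonr(df[i], df[j])[1]
    return p

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"a": x, "b": x + rng.normal(scale=0.1, size=200)})
corr_pvalues(df)  # off-diagonal p-values are ~0 for this strongly correlated pair
```

spearmanr and kendalltau in scipy.stats follow the same (statistic, p-value) pattern if a rank correlation is more appropriate.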

Scala: Class that is similar to QuantileTransformer in python

I am looking for a Scala implementation of Python's sklearn.preprocessing.QuantileTransformer class. There doesn't seem to be a single class in Scala that implements the entire functionality.
The Python implementation has 3 major parts:
1) Compute quantiles for the given data and percentile array using numpy.percentile(). If a quantile lies between two input data points, linear interpolation is used. The closest I can find in Scala is breeze, which has a percentile() function. (Observation: Spark's DataFrame.stat.approxQuantile() does not perform linear interpolation and thus can't be used here.)
2) Use numpy.interp() to map the input range of values onto a given range. E.g. if the input data range is 1-100, it can be mapped onto any given range, say 0-1. Again, linear interpolation is used when an input value falls between two quantiles. The closest I can find in Scala is the breeze.interpolation class.
3) Calculate the inverse CDF using scipy.stats.norm.ppf() (there is no numpy.ppf()). I believe for this I can use the NormalDistribution class as one answer below suggests, or the StandardScaler class.
Anything better to make the coding short and simple?
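For reference while porting, the three steps can be condensed in Python terms (a sketch; note that the inverse CDF comes from scipy.stats, not numpy):

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=1000)

# 1) Reference quantiles, with linear interpolation between data points
refs = np.linspace(0, 1, 100)
quantiles = np.percentile(data, refs * 100)

# 2) Map each input value onto [0, 1] by interpolating against those quantiles
uniform = np.interp(data, quantiles, refs)

# 3) Inverse normal CDF, for output_distribution='normal'; clip to avoid +/-inf
eps = 1e-7
normal = stats.norm.ppf(np.clip(uniform, eps, 1 - eps))
```

Any Scala port only needs equivalents of these three calls: percentile with linear interpolation, piecewise-linear interp, and an inverse normal CDF such as Apache Commons Math's inverseCumulativeProbability.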
The Apache Commons Math library has a NormalDistribution class, which has an inverseCumulativeProbability method that calculates the specified quantile value. That should suit your purposes.

Python - Log Transformation on variables using numpy

I am working on feature engineering process as part of a machine learning project. And currently I have to determine whether to do log transformation for certain columns or not.
I came to know that log transformation should be done on those columns which are having skew distribution of values.
Now here are my questions / doubts for which I need clarifications.
How do I determine in Python whether a particular column values belong to skew distributions(either right skew or left skew) ?
And assume that I have determined the columns over which I need to apply log transformation, there are many bases to the log function such as loge, log10, log2, etc... So do I use natural log (i.e) loge or log10 or anything else in this machine learning approach ?
And if I am not wrong, log transformation can be applied only on numeric variables. Is this right ?
You can use pandas DataFrame.skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs) to see whether the values of a particular column are skewed or not.
Basically, the natural-log transformation is preferable, and it can only be applied to numerical values, excluding zero and negative values.
The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
Yes, you can apply log transformation to numerical data only. There are other alternative ways to convert text data into numerical ex. one-hot-encoding.
Plot a histogram of each column to check for skewness in the data; a box plot also helps in this regard.
If you are using pandas, the hist() function will be helpful. Try plotting with different bin sizes.
For the log transformation, select any base; it won't have much impact. Generally loge and log10 are used.
To measure the skew you can use scipy.stats.skew or scipy.stats.skewtest.
You can also use scipy.stats.lognorm.fit() to get the parameters of a lognormal distribution.
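As a minimal sketch tying these suggestions together, on simulated right-skewed data (np.log1p is used here so exact zeros don't map to -inf):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.lognormal(mean=0.0, sigma=1.0, size=5000)})

print(df["x"].skew())           # strongly positive: right-skewed
print(stats.skewtest(df["x"]))  # test statistic and p-value for the skew

# Natural-log transform; log1p maps 0 -> 0 instead of -inf
df["x_log"] = np.log1p(df["x"])
print(df["x_log"].skew())       # much closer to zero after the transform
```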

Is there an equivalent of the matlab 'idealfilter' for Python in Scipy (or other libraries)?

I am looking for an equivalent of the time series idealfilter that is implemented in Matlab, for Python.
My goal is to implement an ideal filter using Discrete Cosine Transform as is used in the Eulerian Video Magnification paper in Python in order to obtain the heartbeat of a human being from standard video. I am using their video as my input and I have implemented the bandpass filter method, but I have not been able to find an idealfilter method to use in my script.
They state that they implement an ideal filter using DCT from 0.83 - 1.0Hz.
My problem is that the idealfilter in Matlab takes the cutoff frequencies as input, but I don't think it is implemented with the DCT.
In contrast, the DCT found in scipy.fftpack does not take frequency cutoffs as input.
If I have to use these in some type of succession please let me know.
If such a function equivalent exists I would like to attempt to use it in order to see if it yields similar results to what they have obtained.
Non-causal means that your filter depends on future inputs.
The DCT is a transform, not a filter; you want a filter.
You want to apply a band-pass filter to your data within the range you specified, so I would use a Butterworth filter.
Here is some example code: https://stackoverflow.com/a/12233959/1097117
The trickiest part of all of this is getting everything in terms of your Nyquist frequency.
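Following that suggestion, here is a sketch of a zero-phase (hence non-causal) Butterworth band-pass over the paper's 0.83-1.0 Hz band, assuming a 30 fps video sample rate:

```python
import numpy as np
from scipy import signal

fs = 30.0                  # assumed sample rate (video frame rate), Hz
low, high = 0.83, 1.0      # passband from the paper, Hz

# butter() expects cutoffs normalized by the Nyquist frequency (fs / 2)
sos = signal.butter(4, [low / (fs / 2), high / (fs / 2)],
                    btype="bandpass", output="sos")

t = np.arange(0, 60, 1 / fs)
x = np.sin(2 * np.pi * 0.9 * t) + np.sin(2 * np.pi * 5.0 * t)  # in-band + out-of-band

# sosfiltfilt runs the filter forward and backward: zero phase, non-causal
y = signal.sosfiltfilt(sos, x)
```

The second-order-sections form (output="sos") is numerically safer than (b, a) coefficients for a narrow band like this one.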
It may be worth having a look at the time-series analysis module of the statsmodels library. It implements several time-series filters, including the Hodrick-Prescott filter, which I think is non-causal.
