I am working on the feature engineering process as part of a machine learning project, and currently I have to determine whether or not to apply a log transformation to certain columns.
I have read that a log transformation should be applied to columns whose values follow a skewed distribution.
Here are my questions for which I need clarification.
How do I determine in Python whether a particular column's values follow a skewed distribution (either right-skewed or left-skewed)?
Assuming I have determined the columns to which I need to apply a log transformation: there are many bases for the log function, such as loge, log10, log2, etc. So do I use the natural log (i.e. loge), log10, or something else in this machine learning approach?
And if I am not wrong, a log transformation can be applied only to numeric variables. Is this right?
You can use pandas DataFrame.skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs) to see whether the values of a particular column are skewed or not.
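For example, a minimal sketch with a made-up DataFrame (the column name and values are hypothetical, just for illustration):

import numpy as np
import pandas as pd

# Hypothetical DataFrame with one right-skewed column
df = pd.DataFrame({"income": np.random.lognormal(mean=3.0, sigma=1.0, size=1000)})

# Skewness of every numeric column; values far from zero suggest a skewed distribution
print(df.skew(numeric_only=True))

# Skewness of a single column
print(df["income"].skew())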
Basically, the natural log transformation is preferable, and it can be applied only to numerical values that are strictly positive; it is undefined for zero and negative values.
The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
Yes, you can apply a log transformation to numerical data only. There are other ways to convert text data into numerical form, e.g. one-hot encoding.
Plot the histogram of the columns to check whether there is any skewness in the data. A box plot also helps in this regard.
If you are using pandas, the hist() function will be helpful. Try plotting with different bin sizes.
For the log transformation, select any base; the choice won't have much impact, since it only rescales the transformed values by a constant factor. Generally loge and log10 are used, as in the sketch below.
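A minimal sketch of both points, using a hypothetical right-skewed column named income (the name and values are made up for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical right-skewed column
df = pd.DataFrame({"income": np.random.lognormal(mean=3.0, sigma=1.0, size=1000)})

# Histograms with different bin sizes, plus a box plot, to eyeball the skew
df["income"].hist(bins=20)
plt.show()
df["income"].hist(bins=50)
plt.show()
df.boxplot(column="income")
plt.show()

# Log transform: the base only rescales the result by a constant,
# so log10(x) is just ln(x) / ln(10)
df["income_ln"] = np.log(df["income"])
df["income_log10"] = np.log10(df["income"])
print(np.allclose(df["income_log10"], df["income_ln"] / np.log(10)))  # True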
To measure the skew you can use scipy.stats.skew or scipy.stats.skewtest.
You can also use scipy.stats.lognorm.fit() to get the parameters of a lognormal distribution fitted to your data.
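A short sketch of those three calls on a synthetic right-skewed sample (the data here is made up, just for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic right-skewed sample

print(stats.skew(x))      # sample skewness: positive => right skew, negative => left skew
print(stats.skewtest(x))  # tests the null hypothesis that the skewness is zero

# Fit a lognormal distribution; fixing loc at 0 is a common choice for positive-only data
shape, loc, scale = stats.lognorm.fit(x, floc=0)
print(shape, loc, scale)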
I have a data set of 15,497 samples. The graph shows the raw data, angle of pendulum vs. sample number, which, obviously, looks awful. It should look like the second picture, the filtered data. Part of the assignment is introducing a mean filter to "smoothen" the data, making it look like the data on the second graph. The data is put into np.arrays in Python, but using np.arrays I can't seem to figure out how to introduce a mean filter.
I'm interested in applying a mean filter on theta in the screenshot of the Python code, as theta holds the values on the y axis of the plots. The code is included so you can easily see how the data file is read in.
There is a whole world of filtering techniques, and there is not a single unique 'mean filter'. Moreover, there are causal and non-causal filters (i.e. the difference between not using future values in the filter vs. using the future values in the filter). I'm going to assume you want a mean filter of size N, as that is pretty standard. Then, to apply this filter, convolve your 'theta' vector with a mean kernel.
I suggest printing the mean kernel and studying how it looks with different N. Then you may understand how it is averaging the values in the signal. I also urge you to think about why convolution is applying this filter to theta. I'll help you by telling you to think about the equivalent multiplication in the frequency domain. Also, investigate the different modes in the convolution function, as this may be more tailored for the specific solution you desire.
import numpy as np
N = 2                                       # size of the mean filter
mean_kernel = np.ones(N) / N                # each tap weights a sample by 1/N
filtered_sig = np.convolve(sig, mean_kernel, mode='same')  # sig here is your theta array
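Following the suggestions above, here is a small sketch that prints the kernel for a few sizes and compares the convolution modes on a synthetic stand-in for theta (the signal is made up, just for illustration):

import numpy as np

# Print the mean kernel for a few sizes: each tap weights a neighbouring sample by 1/N
for N in (2, 5, 10):
    print(N, np.ones(N) / N)

# Synthetic stand-in for theta, just to compare the convolution modes
theta = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)
kernel = np.ones(5) / 5
print(np.convolve(theta, kernel, mode='same').shape)   # same length as theta
print(np.convolve(theta, kernel, mode='valid').shape)  # shorter: only full overlaps
print(np.convolve(theta, kernel, mode='full').shape)   # longest: every partial overlap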
My dataset contains columns describing abilities of certain characters, filled with True/False values. There are no empty values. My ultimate goal is to make groups of characters with similar abilities. And here's the question:
Should I change the True/False values to 1 and 0? Or is there no need for that?
What clustering model should I use? Is KMeans okay for that?
How do I interpret the results (output)? Can I visualize it?
The thing is, I always see people perform clustering on numeric datasets that you can visualize, and it looks much easier to do. With True/False I just don't even know how to approach it.
Thanks.
In general there is no need to change True/False to 0/1. This is only necessary if you want to apply a specific algorithm for clustering that cannot deal with boolean inputs, like K-means.
K-means is not a preferred option. K-means requires continuous features as input, as it is based on computing distances, like many clustering algorithms. So no boolean inputs. And although binary input (0-1) works, it does not compute distances in a very meaningful way (many points will have the same distance to each other). In the case of 0-1 data only, I would not use clustering, but would recommend tabulating the data and seeing which cells occur frequently. If you have a large data set you might use the Apriori algorithm to find cells that occur frequently.
In general, a clustering algorithm typically returns a cluster number for each observation. In low-dimensions, this number is frequently used to give a color to an observation in a scatter plot. However, in your case of boolean values, I would just list the most frequently occurring cells.
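A small sketch of that tabulation idea, with hypothetical column names (the data is made up for illustration):

import pandas as pd

# Hypothetical abilities table with boolean columns
df = pd.DataFrame({
    "can_fly":   [True,  False, True,  True,  False],
    "can_swim":  [False, False, True,  False, False],
    "can_climb": [True,  True,  True,  True,  True],
})

# Count how often each True/False combination ("cell") occurs
# (DataFrame.value_counts needs pandas >= 1.1; df.groupby(list(df.columns)).size() works on older versions)
print(df.value_counts())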
After playing around with some autocorrelations, I noticed that the values shown in the plot by the pandas autocorrelation plotting module mentioned above differ from the values I get when I calculate them manually with x.corr(x.shift(lag)) or with x.autocorr(lag).
Does anybody know if pd.plotting.autocorrelation_plot(x) uses another calculation method for the auto correlation than the standard approach?
Usually this happens because there are NaN values in the series. This may help:
pd.plotting.autocorrelation_plot(x.dropna())
Note that, depending on how you want to deal with NaNs, this solution may not suit you. Nevertheless, if you want the ACF plot, you need to get rid of them.
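A small sketch of the comparison on a synthetic series with a few NaNs (the series is made up, just for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic series with a handful of missing values
x = pd.Series(np.sin(np.linspace(0, 20, 200)) + 0.1 * np.random.randn(200))
x.iloc[50:55] = np.nan

# Manual lag-1 autocorrelation: pandas handles the NaNs pair-wise here
print(x.autocorr(1), x.corr(x.shift(1)))

# The plotting function needs the NaNs removed first
pd.plotting.autocorrelation_plot(x.dropna())
plt.show()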
For my thesis, I am trying to identify outliers in my data set. The data set consists of 160,000 measurements of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of relying only on "expert opinion".
Now I've read about the IQR method for identifying possible outliers when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, the inverse gamma and lognormal were the best fits.
So, during my search for methods for non-symmetric distributions, I found this topic on crossvalidated where user603's answer is interesting in particular: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset and that R and Matlab have functions for this
(There is an R implementation of this (robustbase::adjbox()) as well as a MATLAB one (in a library called libra).)
I was wondering if there is such a function in Python. Or is there a way to calculate the medcouple (see the paper in user603's answer) with Python?
I really would like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of the skewness used in the Adjusted Boxplot.
With this value you can calculate the interval beyond which points are flagged as outliers.
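A minimal sketch of those fences, assuming the exponential whisker constants from the Hubert & Vandervieren adjusted-boxplot paper and a made-up right-skewed sample:

import numpy as np
from statsmodels.stats.stattools import medcouple

# Synthetic right-skewed data standing in for the process measurements
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.75, size=2000)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mc = medcouple(x)

# Adjusted-boxplot fences: the whiskers are stretched/shrunk asymmetrically by the medcouple
if mc >= 0:
    lower = q1 - 1.5 * np.exp(-4 * mc) * iqr
    upper = q3 + 1.5 * np.exp(3 * mc) * iqr
else:
    lower = q1 - 1.5 * np.exp(-3 * mc) * iqr
    upper = q3 + 1.5 * np.exp(4 * mc) * iqr

outliers = x[(x < lower) | (x > upper)]
print(lower, upper, outliers.size)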
Does anyone know if it is possible to find the power spectral density of a signal with gaps in it? For example (in MATLAB syntax, because that is what I'm familiar with):
ta=1:1000;
tb=1200:3000;
t=[ta tb]; % this is the timebase
signal=randn(size(t)); % this is a signal
figure(101)
plot(t,signal,'.')
I'd like to be able to determine frequencies on a longer time base than just the individual sections of data. Obviously I could just take the PSD of the individual sections, but that would limit the lowest frequency. I could interpolate the data, but this would colour the PSD.
Any thoughts would be much appreciated.
The Lomb-Scargle periodogram algorithm is usually used to perform analysis on unevenly spaced data (sampled at arbitrary time points) or when a proportion of the data is missing.
Here are a couple of MATLAB implementations:
lombscargle.m (FEX)
Lomb (Lomb-Scargle) Periodogram (FEX)
lomb.m - ECG tools by Gari Clifford
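If you end up doing this in Python rather than MATLAB, scipy.signal.lombscargle implements the same periodogram; here is a minimal sketch on a gapped time base like the one in the question (the frequency grid below is an arbitrary choice):

import numpy as np
from scipy.signal import lombscargle

# Time base with a gap, mirroring the MATLAB example in the question
t = np.concatenate([np.arange(1, 1001), np.arange(1200, 3001)]).astype(float)
sig = np.random.randn(t.size)

# Angular frequencies at which to evaluate the periodogram (arbitrary grid)
freqs = np.linspace(0.01, np.pi, 500)
pgram = lombscargle(t, sig - sig.mean(), freqs)
print(pgram.shape)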
I found this Non Uniform FFT, but I'm not sure that it's exactly what I need, as it might really be for data that is mostly sampled on an uneven time base, rather than evenly spaced data with significant gaps. I'll give it a go!
Leaving out segments of the Fourier basis vectors results in exactly the same FT, and thus the same PSD, as using the complete basis on a signal that has been zero-padded in the gaps, since the padded zeros contribute nothing to the sums.
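A quick numerical check of that statement on a toy signal (the sample positions and record length here are made up):

import numpy as np

# Toy gapped signal: samples 3 and 4 are missing from an 8-point record
rng = np.random.default_rng(0)
present = np.array([0, 1, 2, 5, 6, 7])
x_present = rng.standard_normal(present.size)

# DFT evaluated using only the present samples ...
k = np.arange(8)
ft_gappy = np.array([np.sum(x_present * np.exp(-2j * np.pi * f * present / 8)) for f in k])

# ... matches the FFT of the record with zeros filled into the gap
x_padded = np.zeros(8)
x_padded[present] = x_present
print(np.allclose(ft_gappy, np.fft.fft(x_padded)))  # True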