I have a DataFrame from which I'm trying to build a multiple linear regression model. The problem is that one of my Y variables is heavily skewed within the data set, so it's weighting one side far too heavily. I need a way to normalize that one column, and the only way I can think to do that is to select and delete rows until I have an evenly distributed data set. I've built a simple example of what I'm talking about below. I would want column [0] to end up normally distributed by getting rid of the long upper tail. What's the best way to go about doing this?
import pandas as pd
from matplotlib import pyplot as plt
from numpy.random import seed, randn, rand
from numpy import append

seed(1)
data = 5*randn(100) + 10        # roughly normal: mean 10, sd 5
tail = 10 + (rand(50) * 100)    # uniform values between 10 and 110 -> heavy upper tail
data = append(data, tail)
data2 = 5*randn(150) + 10       # second column: plain normal, for comparison
s1 = pd.Series(data)
s2 = pd.Series(data2)
df = pd.concat([s1, s2], axis=1)
First you need to figure out a threshold value to discriminate which values belong to the tail (i.e., are too high) and which do not.
A very empirical way to do it is by visual inspection: plot a histogram of your data and see where the tail starts.
plt.hist(df[0])
plt.show()
Using the sample data you provided, you can see that the tail starts at about 20, so you can treat every value greater than 20 as part of the tail of the distribution.
Of course, this is a very rough approach. Depending on your real data, you may have a better way to define the threshold, perhaps based on the theoretical model behind the data: I assume you know, or at least have an idea about, why there is a tail in your distribution.
In any case, whatever criteria you use to define a threshold value (this is really up to you), once you have it you can simply set to NaN all the values greater than the threshold:
import numpy as np

df.loc[df[0] > threshold, 0] = np.nan
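If you would rather delete the flagged rows outright, as the question describes, a minimal sketch, assuming the threshold of 20 read off the histogram:
threshold = 20                       # picked by eye from the histogram above
df_trimmed = df[df[0] <= threshold]  # keep only the rows outside the tail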
Disclaimer:
This approach may be considered inappropriate or even wrong, because you are tampering with the data. I don't know what your final goal is, but be careful.
You can try to use RANSAC for this: use skewness as the objective function and try to minimize it. This should give you the samples that belong to an unskewed distribution. (Example, Example with different model, Example)
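A rough sketch of that idea, written as a hand-rolled random search rather than a library RANSAC implementation, using the example df from the question:
import numpy as np
from scipy.stats import skew

def least_skewed_subset(values, subset_size, n_trials=2000, seed=1):
    # draw random subsets, keep the one whose skewness is closest to zero
    rng = np.random.default_rng(seed)
    best_idx, best_skew = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(values), size=subset_size, replace=False)
        s = abs(skew(values[idx]))
        if s < best_skew:
            best_idx, best_skew = idx, s
    return best_idx

idx = least_skewed_subset(df[0].to_numpy(), subset_size=100)  # sizes are illustrative
unskewed = df[0].iloc[idx]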
Related
My objective is to detect all kinds of seasonalities and their time periods that are present in a timeseries waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
import scipy.fft
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# https://www.kaggle.com/rakannimer/air-passengers
df = pd.read_csv('AirPassengers.csv')
df.head()

frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0]  # np.size(t)
frequencies_rel = len(A_signal_rfft) / frequency_eval_max * np.linspace(0, 1, int(n))

fig = plt.figure(3, figsize=(15, 6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't result in anything conclusive or comprehensible.
Ideally I wish to see the peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what I am doing wrong?
2) Autocorrelation:
from pandas.plotting import autocorrelation_plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing which I get a plot like the following:
But how do I read this plot and how can I derive the presence of the various seasonalities and their periods from this?
3) STL Decomposition
df.set_index('Month',inplace=True)
df.index=pd.to_datetime(df.index)
#drop null values
df.dropna(inplace=True)
df.plot()
result=seasonal_decompose(df['#Passengers'], model='multiplicative', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonalities and their time periods that are present using this method?
Hence, I've tried 3 different approaches but they seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all kinds of seasonalities and their time periods for any given timeseries data?
I still think a Fourier analysis is the way to go; it's just that the 0-frequency result is overshadowing any insight.
That 0-frequency component is essentially the square of the average of your data set, and all your records are positive, far from the typical zero-centered sinusoidal function you would analyze with Fourier transforms. So simply subtract the average of your dataset from your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT; those are tied to the sampling frequency and the span of your dataset. Reason about it and adequately label the daily, weekly, monthly and annual frequencies in your chart.
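A minimal sketch of both suggestions, assuming the monthly AirPassengers data from the question (one sample per month):
import numpy as np
import pandas as pd
import scipy.fft
from matplotlib import pyplot as plt

df = pd.read_csv('AirPassengers.csv')
x = df['#Passengers'].to_numpy(dtype=float)
x = x - x.mean()                            # remove the 0-frequency (DC) offset first

spectrum = np.abs(scipy.fft.rfft(x))
freqs = scipy.fft.rfftfreq(len(x), d=1.0)   # cycles per month, since d = 1 month

plt.stem(freqs, spectrum)
plt.axvline(1/12, color='r', linestyle='--', label='yearly = 1/12 cycles/month')
plt.xlabel('frequency [cycles/month]')
plt.ylabel('amplitude')
plt.legend()
plt.show()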
Using the FFT, you can get the fundamental frequency. You can then apply a low-pass filter, or just manually select the first n frequencies; these frequencies will correspond to the 'seasonalities'. Transform your filtered FFT back into the time domain and you can visualize the most basic underlying repetitions; from there you can easily calculate the time period of each repetition and visualize it by plotting the F0, F1, ... components individually in the time domain.
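Continuing from the demeaned x above, a sketch that keeps only the few strongest peaks and transforms them back:
spec = scipy.fft.rfft(x)                   # x already has its mean removed
strongest = np.argsort(np.abs(spec))[-3:]  # indices of the 3 largest peaks
filtered = np.zeros_like(spec)
filtered[strongest] = spec[strongest]

reconstruction = scipy.fft.irfft(filtered, n=len(x))  # back to the time domain
periods_in_months = 1 / freqs[strongest]   # period of each kept repetition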
I have several datasets with very unevenly distributed values: most values are very low, but a few are very high, as in the histogram screenshot (or even more extreme).
I am actually interested in the differences in the high values.
So what I am looking for is a classification method that sets many break values where there are few data values and large classes where there are many values. Maybe something like a reversed quantile classification.
Do you have a suggestion on which algorithm could help with this task, preferably in Python?
If you are using pandas, couldn't you just select the values above your chosen threshold and analyze them separately?
import pandas as pd

df = pd.DataFrame(your_data)  # your_data is a placeholder for your actual values
df_to_analyze_large_values = df[df.your_Column_of_interest > 100000]
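If you then want finer classes within that sparse high range (toward the 'reversed quantile' idea you mention), a sketch of one option, reusing the same made-up column name:
high = df_to_analyze_large_values.your_Column_of_interest
classes = pd.qcut(high, q=5)   # 5 equal-count classes within the tail only
print(classes.value_counts())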
I plotted a scatter plot on my dataframe which looks like this:
with code
from scipy import stats
import pandas as pd
import seaborn as sns

df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
subset = df.iloc[:, 1:10080]   # positional slice; df[:, 1:10080] raises a TypeError
df['mean'] = subset.mean(axis=1)
df.plot(x='mean', y='Result', kind='scatter')
sns.lmplot(x='mean', y='Result', data=df, order=1)
I wanted to find the slope of the regression in the graph using code
scipy.stats.mstats.linregress(Result,average)
but from the output it seems like the slope magnitude is too small:
LinregressResult(slope=-0.0001320534706614152, intercept=27.887336813241845, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=2.55977061451773e-05)
If I switch the Result and average positions,
scipy.stats.mstats.linregress(average,Result)
it still doesn't look right, as the intercept is too large:
LinregressResult(slope=-213.12489536011773, intercept=7138.48783135982, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=41.31287437069993)
Why is this happening? Do these output values need to be rescaled?
The signature for scipy.stats.mstats.linregress is linregress(x, y), so your second ordering, linregress(average, Result), is the one that is consistent with the way your graph is drawn. And on that graph, an intercept of 7138 doesn't seem unreasonable: are you getting confused by the fact that the x-axis limits you're showing don't go down to 0, where the intercept would actually occur?
In any case, your data really don't look like they follow a linear law, so the slope (or any parameter from a completely misspecified model) will not actually tell you much. Are the x and y values all strictly positive? And is there a particular reason why x can never logically go below 25? The data points certainly seem to be piling up against that vertical asymptote. If so, I would probably subtract 25 from x, then fit a linear model to logged data. In other words, do your plot and your linregress with x=numpy.log(average-25) and y=numpy.log(Result).
EDIT: since you say x is temperature, there is no logical reason why x can't go below 25 (it is meaningful to extrapolate below 25, for example, and even below 0). Therefore don't subtract 25 and don't log x; just log y.
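As a sketch of that suggestion, assuming average and Result are the same arrays used in the question:
import numpy as np
from scipy import stats

mask = Result > 0                       # the log requires strictly positive y
fit = stats.linregress(average[mask], np.log(Result[mask]))
print(fit.slope, fit.intercept, fit.rvalue)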
In your comments you talk about rescaling the slope, and eventually the suspicion emerges that you think this will give you a correlation coefficient. These are different things: the correlation coefficient is about the spread of the points around the line as well as its slope. If what you want is correlation, look up the relevant tools using that keyword.
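For instance, the rvalue field that linregress already returns is the Pearson correlation coefficient; assuming the same (unmasked) average and Result arrays, you can also compute it directly:
from scipy import stats

r, p = stats.pearsonr(average, Result)   # identical to linregress's rvalue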
Quite often I have to work with a bunch of noisy, somewhat correlated time series. Sometimes I need some mock data to test my code, or to provide some sample data for a question on Stack Overflow. I usually end up either loading some similar dataset from a different project, or just adding a few sine functions and noise and spending some time to tweak it.
What's your approach? How do you generate noisy signals with certain specs? Have I just overlooked some blatantly obvious standard package that does exactly this?
The features I would generally like to get in my mock data:
Varying noise levels over time
Some history in the signal (like a random walk?)
Periodicity in the signal
Being able to produce another time series with similar (but not exactly the same) features
Maybe a bunch of weird dips/peaks/plateaus
Being able to reproduce it (some seed and a few parameters?)
I would like to get a time series similar to the two below [A]:
I usually end up creating a time series with a bit of code like this:
import numpy as np
n = 1000
limit_low = 0
limit_high = 0.48
# noise + an amplitude-modulated sine + two squared sines at different frequencies
my_data = np.random.normal(0, 0.5, n) \
+ np.abs(np.random.normal(0, 2, n) \
* np.sin(np.linspace(0, 3*np.pi, n)) ) \
+ np.sin(np.linspace(0, 5*np.pi, n))**2 \
+ np.sin(np.linspace(1, 6*np.pi, n))**2

# rescale the result into the [limit_low, limit_high] range
scaling = (limit_high - limit_low) / (max(my_data) - min(my_data))
my_data = my_data * scaling
my_data = my_data + (limit_low - min(my_data))
Which results in a time series like this:
Which is something I can work with, but still not quite what I want. The problem here is mainly that:
it doesn't have the history/random-walk aspect (see the sketch after this list)
it's quite a bit of code and tweaking (especially a problem if I want to share a sample time series)
I need to re-tweak the values (frequencies of the sines, etc.) to produce another similar, but not exactly identical, time series.
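On the first point, a random-walk component is just the cumulative sum of noise; a minimal, seeded sketch of what I mean, continuing from the code above:
rng = np.random.default_rng(42)                  # seeding makes it reproducible
random_walk = np.cumsum(rng.normal(0, 0.02, n))  # cumulative noise = "history"
my_data = my_data + random_walk                  # adds slow drift to the series above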
[A]: For those wondering, the time series depicted in the first two images is the traffic intensity at two points along one road over three days (midnight to 6 am is clipped), in cars per second (moving Hanning-window average over 2 min), resampled to 1000 points.
Have you looked into TSimulus? By using Generators, you should be able to generate data with specific patterns, periodicity, and cycles.
The TSimulus project provides tools for specifying the shape of a time series (general patterns, cycles, importance of the added noise, etc.) and for converting this specification into time series values.
Otherwise, you can try "drawing" the data yourself and exporting those data points using Time Series Maker.
I've just started to try out a nice bootstrapping package available through scikits:
https://github.com/cgevans/scikits-bootstrap
but I've encountered a problem when trying to estimate confidence intervals for the correlation coefficient from linear regression. The confidence intervals returned lie completely outside the range of the original statistic.
Here is the code:
import numpy as np
from scipy import stats
import scikits.bootstrap as boot

np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)
r0 = stats.linregress(x, y)[2]

def my_function(y):
    return stats.linregress(x, y)[2]

ci = boot.ci(y, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')
This yields a result of ci = [-0.605, 0.644], but the original statistic is r0=0.894.
I've tried this in R and it seems to work fine there: the ci straddles r0 as expected.
Please help!
Could you provide your R code? I'd be interested in knowing how this is dealt with in R.
The problem here is that you're only passing y to boot.ci, but every time it runs my_function, it uses the entire, original x (note the lack of x input to my_function). Bootstrapping applies the statistic function to resampled data, so if you're applying your statistic function using the original x and a sample of y, you're going to have a nonsensical result. This is why the BCA method doesn't work at all, actually: it can't apply your statistic function to jackknife samples, which don't have the same number of elements.
Samples are taken along axis 0 (rows), so if you want to pass multiple 1D arrays to your statistic function, you can use multiple columns: xy = np.vstack((x, y)).T would work, and then use a statfunction that takes data from those columns:
def my_function(xysample):
    return stats.linregress(xysample[:, 0], xysample[:, 1])[2]
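Putting that together with the arrays from the question (a sketch):
import numpy as np

xy = np.vstack((x, y)).T   # one row per observation; columns are x and y
ci = boot.ci(xy, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')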
Alternatively, if you wanted to avoid messing with your data at all, you could define a function that operates on indexes, and then just pass indexes to boot.ci:
def my_function2(i):
    return stats.linregress(x[i], y[i])[2]

boot.ci(np.arange(len(x)), statfunction=my_function2, alpha=0.05, n_samples=1000, method='pi')
Note that in either of these cases, BCA works, so you may as well use method='bca' unless you really do want percentile intervals; BCA is pretty much always better.
I do realize that both of these methods are less than ideal. Honestly, I've never had a need to pass multiple arrays like this to my statfunction, and the majority of people are likely using mean as their statfunction. I think the best idea here may be to allow lists of arrays with equal length along axis 0 to be passed, e.g., boot.ci([x,y],...), and then sample all of those at the same time and pass them all to the statfunction as separate arguments. In that case, you could just have a my_function(x,y). I'll see if I can do this, but if you can show me your R code, that would be great, as I'd like to see if there is a better way of dealing with this.
Update:
In the most recent version of scikits.bootstrap (v0.3.1), a tuple of arrays can be provided, and samples from them will be passed as separate arguments to statfunction. Additionally, statfunction can provide array output, and confidence intervals will be calculated for each point in the output. Thus, this is now very easy to do. The following will give confidence intervals for every output of linregress:
cis = boot.ci( (x,y), statfunction=stats.linregress )
cis[:,2] in this case will be the desired confidence interval.