I have several datasets with very unevenly distributed values: most values are very low, but a few are very high, as in the histogram screenshot, for example, or even more extreme.
I am actually interested in the differences in the high values.
So what I am looking for is a classification method that sets many break values where there are few data values and large classes where there are many values. Maybe something like a reversed quantile classification.
Do you have a suggestion on which algorithm could help with this task, preferably in Python?
If you are using pandas, couldn't you just select the values above your chosen threshold and analyze the differences separately?
import pandas as pd
df = pd.DataFrame(your_data)  # your_data: placeholder for the raw values of interest
df_to_analyze_large_values = df[df.your_Column_of_interest > 100000]
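If you want class breaks rather than a single cut-off, here is a rough sketch of the 'reversed quantile' idea from the question: weight each observation by the inverse of its local frequency (estimated with a simple histogram) so that the break points concentrate where the data are sparse. The function name, the histogram-based density estimate and the default parameters are my own assumptions, not an established algorithm.
import numpy as np

def reversed_quantile_breaks(values, n_classes=7, bins=50):
    # Place class breaks densely where the data is sparse (hypothetical sketch).
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    bin_idx = np.digitize(values, edges[1:-1])   # histogram bin of each value
    weights = 1.0 / counts[bin_idx]              # rare values get large weights
    order = np.argsort(values)
    sorted_vals = values[order]
    cum_w = np.cumsum(weights[order])
    cum_w /= cum_w[-1]                           # normalised cumulative weight
    probs = np.linspace(0, 1, n_classes + 1)[1:-1]
    return np.interp(probs, cum_w, sorted_vals)  # weighted quantile break points

breaks = reversed_quantile_breaks(df.your_Column_of_interest, n_classes=7)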
My objective is to detect all kinds of seasonalities and their time periods that are present in a timeseries waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
#https://www.kaggle.com/rakannimer/air-passengers
df = pd.read_csv('AirPassengers.csv')
df.head()
frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0]  # number of frequency bins
frequencies_rel = len(A_signal_rfft)/frequency_eval_max * np.linspace(0, 1, int(n))
fig = plt.figure(3, figsize=(15, 6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't result in anything conclusive or comprehensible.
Ideally I wish to see the peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what I am doing wrong?
2) Autocorrelation:
from pandas.plotting import autocorrelation_plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing this, I get a plot like the following:
But how do I read this plot and how can I derive the presence of the various seasonalities and their periods from this?
3) STL Decomposition:
df.set_index('Month',inplace=True)
df.index=pd.to_datetime(df.index)
#drop null values
df.dropna(inplace=True)
df.plot()
result=seasonal_decompose(df['#Passengers'], model='multiplicative', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonalities and their time periods that are present using this method?
Hence, I've tried three different approaches, but they all seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all kinds of seasonalities and their time periods for any given timeseries data?
I still think a Fourier analysis is the way to go; it's just that the 0-frequency result is overshadowing any insight.
The 0-frequency term is essentially the sum of your data set (proportional to its average), and since all records are positive, the signal is far from the typical zero-mean sinusoidal function you would analyze with Fourier transforms. So simply subtract the average of your dataset from your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT. Those are related to the sampling frequency and span of your dataset. Reason about it and adequately label the daily, weekly, monthly and annual frequencies in your chart.
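A minimal sketch of this suggestion, assuming the monthly AirPassengers data is already loaded into df as in the question: subtract the mean before the FFT and express the frequency axis in cycles per year, so the peaks become interpretable.
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt

signal = df['#Passengers'].values.astype(float)
signal = signal - signal.mean()                    # remove the 0-frequency component
spectrum = scipy.fft.rfft(signal)
# the sampling interval is 1 month -> frequencies in cycles/month, times 12 for cycles/year
freqs_per_year = scipy.fft.rfftfreq(len(signal), d=1.0) * 12
plt.stem(freqs_per_year, np.abs(spectrum))
plt.xlabel("frequency [cycles per year]")
plt.ylabel("amplitude")
plt.show()
With this labelling, a yearly seasonality should show up as a peak at 1 cycle per year.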
Using the FFT, you can find the fundamental frequency. You can then apply a low-pass filter, or just manually select the first n strongest frequencies; these frequencies will correspond to the 'seasonalities'. Transform your filtered FFT back into the time domain and you can visualize the most basic underlying repetitions. You can easily calculate the time period of those repetitions and visualize them by individually plotting F0, F1, ... in the time domain.
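A hedged sketch of that idea, again assuming the df from the question: keep only the strongest few rFFT components of the mean-subtracted signal and transform them back to the time domain to visualise the dominant repetitions (n_keep = 3 is an arbitrary choice).
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt

signal = df['#Passengers'].values.astype(float)
centered = signal - signal.mean()
spectrum = scipy.fft.rfft(centered)
freqs = scipy.fft.rfftfreq(len(centered), d=1.0)   # cycles per month

n_keep = 3
strongest = np.argsort(np.abs(spectrum))[::-1][:n_keep]
for k in strongest:
    if freqs[k] > 0:
        print(f"component {k}: period ~ {1.0 / freqs[k]:.1f} months")

filtered = np.zeros_like(spectrum)
filtered[strongest] = spectrum[strongest]          # keep only the selected components
reconstruction = scipy.fft.irfft(filtered, n=len(centered)) + signal.mean()
plt.plot(signal, label="original")
plt.plot(reconstruction, label=f"top {n_keep} components")
plt.legend()
plt.show()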
I have a DataFrame from which I'm trying to build a multiple linear regression model. The problem I have is that one of my Y variables is heavily skewed within the data set, so it's weighting one side far too heavily. I need a way to normalize that one column, and the only way I can think to do that is to select and delete rows until I have an evenly distributed data set. I've built a simple example of what I'm talking about below. I would want column [0] to end up normally distributed by getting rid of the long tail. What's the best way to go about doing this?
import pandas as pd
from matplotlib import pyplot as plt
from numpy.random import seed
from numpy.random import randn
from numpy.random import rand
from numpy import append
seed(1)
# column 0: normally distributed values plus a long right tail
data = 5*randn(100) + 10
tail = 10 + (rand(50) * 100)
data = append(data, tail)
# column 1: normally distributed values only
data2 = 5*randn(150) + 10
s1 = pd.Series(data)
s2 = pd.Series(data2)
df = pd.concat([s1, s2], axis=1)
First you need to figure out a threshold value to discriminate which values belong to the tail (are too high) and which do not.
A very empirical way to do it is by visual inspection: plot a histogram of your data and see where the tail starts.
plt.hist(df[0])
plt.show()
Using the sample data you provided, you can see that the tail starts at about 20, so you can consider every value greater than 20 as belonging to the tail of the distribution.
Of course, this is a very rough way. Depending on your real data, you may have a better way to define your threshold, maybe based on the theoretical model behind the data. I mean, I guess you should know, or at least have an idea about, why there is a tail in your distribution.
In any case, whatever criteria you use to define a threshold value (this is really up to you), once you have it you can simply set to NaN all the values greater than the threshold:
import numpy as np
df.loc[df[0] > threshold, 0] = np.nan
Disclaimer:
This approach may be considered inappropriate or even wrong, because you are tampering with the data. I don't know what your final goal is, but be careful.
You can try to use RANSAC for this: use the skewness as the objective function and try to minimize it. This should give you the samples that belong to an unskewed distribution.
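As far as I know, sklearn's RANSACRegressor does not accept skewness as an objective directly, so here is a very rough, hypothetical RANSAC-style sketch of the idea: repeatedly draw random subsets, keep the points close to each subset's centre, and retain the consensus set whose skewness is closest to zero (all parameter values are arbitrary).
import numpy as np
from scipy.stats import skew

def least_skewed_subset(values, n_iter=200, sample_size=30, k=2.0, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    best_mask, best_skew = None, np.inf
    for _ in range(n_iter):
        sample = rng.choice(values, size=sample_size, replace=False)
        mu, sigma = sample.mean(), sample.std()
        mask = np.abs(values - mu) < k * sigma     # "inliers" of this hypothesis
        if mask.sum() < sample_size:
            continue
        s = abs(skew(values[mask]))
        if s < best_skew:
            best_mask, best_skew = mask, s
    return best_mask

inliers = least_skewed_subset(df[0].values)        # boolean mask of the unskewed part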
For the clustering algorithms in sklearn, is there a way to specify how many clusters you want the algorithm to find (instead of the algorithm finding its own number of clusters)? From my inputted data, I'm hoping for 2 clusters instead of the 3 it outputs for me.
If it helps, I'm using the MeanShift algorithm (but my question applies to all of them). Also, most tutorials seem to use make_blobs, but I'm using pandas' read_csv to load my data instead, if that changes anything.
This is the beginning part of my code:
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

df = pd.read_csv(filename, header=0)
original_headers = list(df.columns.values)
df = df._get_numeric_data()  # keep only the numeric columns
data = df.values
ms = MeanShift()
ms.fit(data)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
As some users said above, it is not possible to set the number of clusters you want with the MeanShift algorithm.
When we talk about clustering, there are a lot of models that can be employed depending on your problem. Density-based models, like MeanShift and DBSCAN, try to find areas of higher density than the remainder of the data set, so the number of clusters is defined by the data itself.
On the other hand, centroid-based methods like K-Means start their iterations from the number of centroids passed as a parameter.
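For example, a minimal sketch that fixes the number of clusters to 2 with K-Means, assuming the same data array as in the question's code:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10, random_state=0)   # the number of clusters is fixed up front
labels = km.fit_predict(data)
print("Number of clusters:", len(set(labels)))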
The following link shows a lot of the clustering algorithms available in sklearn. Try to figure out which one best suits your problem.
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
References:
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
https://en.wikipedia.org/wiki/Cluster_analysis
I have a pandas.DataFrame with a column passengers whose range may vary greatly depending on the function creating the dataframe.
The other columns are often more or less of constant ranges (they're economy indicators).
segments.head(2);
            passengers       gdp  gdp_per_capita  inflation  unemployment  population
Month
2002-01-01       11688  4461.087       31634.953    150.847        14.418      339.59
2002-02-01        9049  4142.153       29321.702    204.132        14.738      343.32
My most valuable data is the number of passengers, so I do not want to transform it. However, the differences of scale of the other measures, which I want to use as predictors, make it difficult to track the variations (sometimes in tens of thousands, sometimes in decimals).
How could I standardize the range of all my columns to be consistent with the mean(passengers)?
There are different ways you can approach this problem: you can write and apply a manual transformation function, or you can use a pre-existing one, such as sklearn.preprocessing.StandardScaler.
StandardScaler will "standardize features by removing the mean and scaling to unit variance". You can then shift the mean and adjust the variance according to your needs.
However, it looks to me like you are going to build a predictive model on that data. If so, the best approach is to test all hypotheses and keep what works best. My advice is:
Remove skew from passengers (if present). Log and log1p are the most common transforms, but depending on your data other transforms might be better. You should test arbitrary functions as well (the inverse, or 1/(X+1), for example) and use the transform that brings the skew closest to 0 (see the sketch after this list).
Test both scaled and non-scaled features. If the data is skewed, test both with and without the transform, as above.
If outliers are present, test both with and without them (outliers converted to borderline values / outliers converted to np.nan). Make a boolean feature column identifying the outliers for each feature and test whether it carries valuable information or just adds noise to the model.
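A minimal sketch of the first two points, assuming the segments DataFrame from the question; which transform and scaling to keep should be decided by testing:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

target = segments["passengers"]
print("skew before:", target.skew())
target_log = np.log1p(target)                      # one candidate transform to test
print("skew after log1p:", target_log.skew())

features = segments.drop(columns=["passengers"])   # the economy indicators
scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                      columns=features.columns, index=features.index)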
Hope that helps,
I have a DataFrame in Python and I need to preprocess my data. What is the best method to preprocess the data, knowing that some variables have a huge scale and others don't? The data doesn't have a huge deviation either. I tried the preprocessing.scale function and it works, but I'm not sure at all whether it is the best way to prepare the data for the machine learning algorithms.
There are various techniques for data preprocessing; you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assess missing values by computing their percentage per column
Compute the variance and remove variables with near-zero variance
Assess the inter-variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
import logging
import pandas as pd

data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")

# variance per column
variance = data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(inplace=True)
#reordering columns
variance = variance[["feature_names","variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file+"_variance.csv", sep="|", index=False)

# percentage of missing values per column
missing_values_percentage = data.isnull().sum()/data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names","missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file+"_missing_values.csv", sep="|", index=False)

# pairwise correlation between columns
correlation = data.corr()
correlation.to_csv(data_file+"_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands-on tutorial.
Always split your data into train and test sets to prevent overfitting.
If some of your features have a large scale and some don't, you should standardize the data. Make sure to fit the standardization on the training set only, so you don't cause overfitting through data leakage.
You also have to look for missing data and replace or remove it.
If less than 0.5% of the data in a column is missing you can use dropna; otherwise you have to replace it with something (zero, the mean, the previous value, ...).
You also have to check for outliers, for example by using a boxplot.
Outliers are points that are significantly different from the other data in the same group and can also affect your predictions in machine learning.
It is also best to check for multicollinearity.
If some features are correlated with each other, the resulting multicollinearity can cause wrong predictions from your model.
Finally, some of the columns might be categorical and should be converted to numerical values. A rough sketch of these steps is shown below.
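A generic sketch of these steps, assuming a DataFrame df with a target column named "target" (the names are placeholders, not taken from the question):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = pd.get_dummies(df.drop(columns=["target"]))    # categorical -> numerical
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X_train = X_train.fillna(X_train.mean())           # replace missing values
X_test = X_test.fillna(X_train.mean())             # reuse the training statistics

scaler = StandardScaler().fit(X_train)             # fit on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)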