I am trying to perform K-Means clustering on a set of data that is all text. I have tried the lines of code below and I am getting an error saying "ValueError: could not convert string to float: 'GIAC'".
I think the program is having problems converting my text into vectors so that it can perform the clustering.
I really do not know how to solve this.
Here are the lines of code:
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd
from sklearn.cluster import KMeans
Cert = pd.read_csv('Certification.csv')
X = Cert.iloc[:,:].values
wcss = []
for i in range(1, 5):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plot.plot(range(1,5),wcss)
plot.title('Elbow Method')
plot.xlabel('Number of Clusters')
plot.ylabel('WCSS')
plot.show()
I have also attached a screenshot of the error message.
K-means requires your data to be continuous variables.
Clearly, 'GIAC' is not a number, is it?
K-means cannot be used on this data. You'd need to do one-hot encoding or similar, but that comes with its very own set of problems for k-means... Usually, when you have data with values such as 'GIAC', there just is no sound way to cluster it in a statistically meaningful way. There are so many heuristic choices along the way to get a result that you could have gotten pretty much any other result, too. Try to approach the problem mathematically, not by copy-pasting code.
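If you do want to experiment anyway, below is a minimal sketch of one possible encoding step before k-means: one-hot encoding the categorical columns with pandas.get_dummies. Only the file name Certification.csv comes from the question; the number of clusters and everything else here are assumptions.
import pandas as pd
from sklearn.cluster import KMeans

# Sketch only: turn each text/categorical column into 0/1 indicator columns.
# As noted above, Euclidean distances on one-hot data are of questionable value.
Cert = pd.read_csv('Certification.csv')
X = pd.get_dummies(Cert).values   # one indicator column per category level

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10])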
I am trying to select the best features based on Fisher's score. In the following code, X_train and y_train are pandas DataFrames.
from skfeature.function.similarity_based import fisher_score
ranks = fisher_score.fisher_score(X_train, y_train)
V = X_train.columns[ranks[:5]]
print(V)
The above code gives the error "Length of values (1) does not match the length of index (13)".
But if I convert the DataFrames into NumPy arrays, the code executes. The following code runs perfectly.
import numpy as np
from skfeature.function.similarity_based import fisher_score
ranks = fisher_score.fisher_score(np.array(X_train), np.array(y_train))
V = X_train.columns[ranks[:5]]
print(V)
But the above code kills the kernel, probably because of the large size of the NumPy arrays. Is there any way to solve this without using NumPy arrays, or some other way to avoid the dead-kernel issue?
I am using the AirPassengers dataset for time-series prediction. I chose auto_arima to forecast the predicted values. However, it seems that the order chosen by auto_arima is unable to fit the data well, as the attached chart shows.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from pmdarima import auto_arima
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers":"Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month',inplace=True)
train,test=df[:-24],df[-24:]
model = auto_arima(train,trace=True,error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast,index = test.index,columns=['Prediction'])
plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()
from sklearn.metrics import mean_squared_error
print(mean_squared_error(test['Passengers'],forecast['Prediction']))
Thank you for reading. Any advice is appreciated.
This series is not stationary, and no amount of differencing will make it so (notice that the amplitude of the variations keeps increasing). However, transforming the data first by taking logs should do better (experiment shows that it does do better, but not what I would call well). Setting the seasonality (m=12, as I suggested in the comments) and taking logs produces a fit that is essentially perfect.
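A rough sketch of that approach, reusing the train/test split from the question (log-transform the series, pass m=12 to auto_arima, then exponentiate the forecast back; the auto_arima options beyond m are just the ones from the question):
import numpy as np
from pmdarima import auto_arima

# Sketch: the log transform stabilises the growing amplitude, m=12 encodes the
# monthly seasonality, and np.exp brings the forecast back to the passenger scale.
log_train = np.log(train['Passengers'])
model = auto_arima(log_train, m=12, seasonal=True, trace=True,
                   error_action='ignore', suppress_warnings=True)
log_forecast = model.predict(n_periods=24)
forecast = np.exp(log_forecast)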
The problem was that I did not specify m. In this case I set m to 12, denoting a monthly cycle, i.e. each data row is one month. That's how I understand it (source).
Feel free to comment; I'm not entirely sure, as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12 to denote that the data is monthly.
Result:
I am attempting to take a .dat file of about 90,000 data lines with two variables (wavelength and intensity) and apply sklearn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using to analyze the data is below:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(data)  # 'data' holds the wavelength/intensity table loaded from the .dat file
print(pca.components_)
This is the error I get when I try to apply 2 PCA components to one of the data sets:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
I think in your case, the problem is the column name, especially [W/m**2/um/sr].
Also when using PCA, do not forget to rescale the input variables into "comparable" units using StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
data = pd.DataFrame({'wavelength [um]': [196.078431372549, 192.307692307692, 188.679245283019],
                     'intensity [W/m**2/um/sr]': [1.108370393265022E-003, 1.163428008597600E-003, 1.223639983609668E-003]})
scaler = StandardScaler(with_mean=True, with_std=True)
pca = PCA(n_components=2)
pca.fit(scaler.fit_transform(data))
print(pca.components_)
Worked well for me. Maybe you just need to specify:
data.columns = data.columns.astype(str)
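Separately, the units line in the .dat file may be why the values arrive as strings in the first place. Below is a small sketch of one way to load such a file so that both columns parse as floats; the file name spectrum.dat is a placeholder, and I am assuming whitespace-separated columns with one header line and one units line, as in the excerpt above.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Sketch: skip the header and the units row ([um], [W/m**2/um/sr]) so the
# two columns are read as floats rather than strings.
data = pd.read_csv('spectrum.dat', sep=r'\s+', skiprows=2, header=None,
                   names=['wavelength', 'intensity'])

X = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(X)
print(pca.components_)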
Could someone guide me on how I could cluster Twitter data using DBSCAN in Python? I am totally new to DBSCAN. Also, how do I determine the eps value and the iloc or loc value?
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
def clusterEvaluate(cluster):
    # fraction of points in this cluster that share its most common label
    count_cluster = np.bincount(cluster)
    count_cluster = np.argmax(count_cluster)
    same_clusters = np.count_nonzero(cluster == count_cluster) / np.size(cluster)
    return same_clusters

dataset = pd.read_csv('tweetdata.csv')  # not sure if this works
X = StandardScaler().fit_transform(dataset)
y_valid = dataset.iloc[:6].values
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
y = dbscan.fit_predict(X)
cluster_labels = np.unique(y)
same_clusters = []
for index in cluster_labels:
    cluster = y_valid[y == index]
    same_clusters.append(clusterEvaluate(cluster))
You need to choose an appropriate data representation and distance function for this. Furthermore, scalability will kill you.
I do not think it will work well. I have not seen anything that gives insightful results beyond counting frequent words in an unnecessarily complex fashion. Twitter data is a pain: the messages are just too short. All the good approaches, like LDA, need much longer documents.
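If you still want to try it, here is a minimal sketch of one possible representation/distance pair: TF-IDF vectors with cosine distance. The column name 'text' and the eps value are assumptions rather than something from the question, and eps in particular would need tuning (e.g. via a k-distance plot).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Sketch: represent each tweet as a TF-IDF vector and cluster with cosine distance.
tweets = pd.read_csv('tweetdata.csv')['text']   # 'text' column is an assumption
X = TfidfVectorizer(stop_words='english', min_df=2).fit_transform(tweets)

db = DBSCAN(eps=0.7, min_samples=5, metric='cosine').fit(X)
print(pd.Series(db.labels_).value_counts())     # label -1 marks noise points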
I am using LassoCV from sklearn to select the best model by cross-validation. I found that the cross-validation gives different results depending on whether I use sklearn or the MATLAB Statistics Toolbox.
I used MATLAB and replicated the example given at http://www.mathworks.se/help/stats/lasso-and-elastic-net.html to get a figure like this.
Then I saved the MATLAB data and tried to replicate the figure with lasso_path from sklearn, and I got this.
Although there are some similarities between these two figures, there are also certain differences. As far as I understand, the parameter lambda in MATLAB and alpha in sklearn are the same; however, in these figures they seem to differ. Can somebody point out which is correct, or am I missing something? Furthermore, the coefficients obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictors')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, LassoCV
data=scipy.io.loadmat('randomData.mat')
X=data['X']
Y=data['Y'].flatten()
model = LassoCV(cv=10,max_iter=1000).fit(X, Y)
print 'alpha', model.alpha_
print 'coef', model.coef_
eps = 1e-2 # the smaller it is the longer is the path
models = lasso_path(X, Y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.semilogx(alphas_lasso,coefs_lasso)
pl.gca().invert_xaxis()
pl.xlabel('alpha')
pl.show()
I do not have MATLAB, but be careful: the value obtained with cross-validation can be unstable. This is because it is influenced by the way you subdivide the samples.
Even if you run the cross-validation twice in Python, you can obtain two different results.
Consider this example:
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
0.00645093258722
0.00691712356467
It's possible that alpha = lambda / n_samples, where n_samples = X.shape[0] in scikit-learn.
Another remark: your path is not as piecewise linear as it could/should be. Consider reducing tol and increasing max_iter.
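For illustration, a minimal sketch of both remarks, assuming the X and Y arrays loaded in the question's Python code; note that it uses the current lasso_path API, which returns arrays rather than model objects.
from sklearn.linear_model import LassoCV, lasso_path

# Tighter solver settings so the path is closer to piecewise linear.
model = LassoCV(cv=10, max_iter=100000, tol=1e-6).fit(X, Y)

# If alpha = lambda / n_samples, this puts sklearn's alpha on MATLAB's lambda scale.
lambda_matlab_scale = model.alpha_ * X.shape[0]
print(model.alpha_, lambda_matlab_scale)

alphas, coefs, _ = lasso_path(X, Y, eps=1e-3, max_iter=100000, tol=1e-6)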
Hope this helps.
I know this is an old thread, but:
I'm actually working on piping over to LassoCV from glmnet (in R), and I found that LassoCV doesn't do a very good job of normalizing the X matrix first (even if you specify the parameter normalize=True).
Try normalizing the X matrix first when using LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiply alpha by 2.
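A small sketch of that suggestion, assuming X is a pandas DataFrame and y the target vector; the comparison with 2 * alpha simply restates the remark above and is not something I have verified.
from sklearn.linear_model import LassoCV

# Sketch: z-score the design matrix by hand before fitting LassoCV.
X_std = (X - X.mean()) / X.std()
model = LassoCV(cv=10, max_iter=100000).fit(X_std, y)

print(model.alpha_)        # possibly compare 2 * model.alpha_, as suggested above
print(model.coef_)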
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
MathWorks selected an example and decided to include it in their documentation.
Your MATLAB code produces exactly the same result as the example.
The alternative does not match that result, and has provided inaccurate results in the past.
This is my assumption:
The chance that MathWorks chose to put an incorrect example in their documentation is negligible compared to the chance that a reproduction of this example by alternate means does not give the correct result.
The logical conclusion: your MATLAB implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion is that you should continue with MATLAB to select your model.