How to fit multidimensional output using scikit-learn? - python

I am trying to fit one-vs-all classification output as training data; each row of the output adds up to 1.
One possible approach is to read all the rows, find which column has the highest value, and prepare the training data from that.
E.g. y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], taking a, b, c as the classes corresponding to columns 0, 1, 2 respectively.
Is there a function in scikit-learn that helps with such transformations?

This code does what you want:
import numpy as np
y = np.array([[0.2,0.8,0],[0,1,0],[0,0.3,0.7]])
def transform(y,labels):
    # map the index of each row's largest value to the matching label
    f = np.vectorize(lambda i : labels[i])
    y = f(y.argmax(axis=1))
    return y
y = transform(y,'abc')
EDIT: Following the comment by alko, I made it more general by letting the user supply the labels to the transform function.
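As for a scikit-learn function: one option that may fit is LabelBinarizer.inverse_transform, which for multiclass input picks the class with the greatest value in each row. The sketch below assumes the columns of y are ordered like the sorted class labels a, b, c; the variable names lb and labels are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Rows are per-class scores that sum to 1 (data from the question).
y = np.array([[0.2, 0.8, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.3, 0.7]])

lb = LabelBinarizer()
lb.fit(["a", "b", "c"])   # column order follows lb.classes_ (sorted labels)

# For multiclass data, inverse_transform takes the argmax of each row.
labels = lb.inverse_transform(y)
print(labels)
```

If the column order does not match the sorted labels, the plain argmax approach above with an explicit label sequence is the safer choice.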

Related

Generate simulated data in Python while meeting a range of correlations with respect to a predefined variable

Let's denote refVar, a variable of interest that contains experimental data.
For the simulation study, I would like to generate other variables V0.05, V0.10, V0.15 and so on up to V0.95.
Note that in the variable name, the value following V is the correlation between that variable and refVar (to make it easy to track in the final data frame).
My readings led me to multivariate_normal() from numpy. However, when using this function, it generates 2 1D-arrays both with random numbers. What I want is to always keep refVar and generate other arrays filled with random numbers, while meeting the specified correlation.
Please find my code below. In short, I have no clue how to generate other variables relative to my experimental variable refVar. Ideally, I would like to build a data frame containing the following columns: refVar,V0.05,V0.10,...,V0.95. I hope you get my point, and thank you in advance for your time.
import numpy as np
import pandas as pd
from numpy.random import multivariate_normal as mvn
refVar = [75.25,77.93,78.2,61.77,80.88,71.95,79.88,65.53,85.03,61.72,60.96,56.36,23.16,73.36,64.18,83.07,63.25,49.3,78.2,30.96]
mean_refVar = np.mean(refVar)
for r in np.arange(0,1,0.05):
    var1 = 1
    var2 = 1
    cov = r
    cov_matrix = [[var1,cov],
                  [cov,var2]]
    data = mvn([mean_refVar,mean_refVar],cov_matrix,size=len(refVar))
    output = 'corr_'+str(r.round(2))+'.txt'
    df = pd.DataFrame(data,columns=['refVar','v'+str(r.round(2))])
    df.to_csv(output,sep='\t',index=False) # Ideally, instead of creating an output for each correlation, I would like to generate a DF with refVar and all these newly created Series
Following this answer, we can generate the sequence as follows:
def rand_with_corr(refVar, corr):
    # center and normalize refVar
    X = np.array(refVar) - np.mean(refVar)
    X = X/np.linalg.norm(X)
    # random sampling Y
    Y = np.random.rand(len(X))
    # center Y
    Y = Y - Y.mean()
    # find the component orthogonal to X
    Y = Y - Y.dot(X) * X
    # normalize Y
    Y = Y/np.linalg.norm(Y)
    # output
    return Y + (1/np.tan(np.arccos(corr))) * X
# test
out = rand_with_corr(refVar, 0.05)
pd.Series(out).corr(pd.Series(refVar))
# out
# 0.050000000000000086
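To get the single data frame the question asks for (refVar, V0.05, ..., V0.95), the function above can be called once per target correlation. The following is a minimal sketch of that idea; the rng argument, the targets list and the fixed seed are assumptions added here for reproducibility and are not part of the original answer.

```python
import numpy as np
import pandas as pd

def rand_with_corr(ref, corr, rng):
    """Return a random vector whose Pearson correlation with ref is corr."""
    x = np.asarray(ref, dtype=float)
    x = x - x.mean()                 # center the reference
    x = x / np.linalg.norm(x)        # scale it to unit length
    y = rng.random(len(x))           # random draw
    y = y - y.mean()                 # center it
    y = y - y.dot(x) * x             # keep only the part orthogonal to x
    y = y / np.linalg.norm(y)        # scale it to unit length
    return y + x / np.tan(np.arccos(corr))

refVar = [75.25, 77.93, 78.2, 61.77, 80.88, 71.95, 79.88, 65.53, 85.03, 61.72,
          60.96, 56.36, 23.16, 73.36, 64.18, 83.07, 63.25, 49.3, 78.2, 30.96]

rng = np.random.default_rng(0)                    # assumed seed
targets = [k / 100 for k in range(5, 100, 5)]     # 0.05, 0.10, ..., 0.95

df = pd.DataFrame({"refVar": refVar})
for r in targets:
    df[f"V{r:.2f}"] = rand_with_corr(refVar, r, rng)

print(df.corr()["refVar"].round(2))
```

Note that the generated columns are centered and on a much smaller scale than refVar. Since correlation is unaffected by shifting or positive rescaling, they can be rescaled afterwards if a particular mean or spread is needed.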

How to make a data frame combining different regression results in python?

I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The predictions are arrays of float64, while y_test is a DataFrame.
I want to create a table with the results. I tried several approaches (passing the values as lists, converting types, selecting the values) without success so far. Could anyone help?
My last attempt was:
comparison = pd.DataFrame({'Real': y_test, 'LR':y_pred_LR,'RF':y_pred_RF,'SVM':y_pred_SVR})
In this case the DataFrame is created, but the values don't appear.
Additionally, I would like to add two rows with the mean and standard deviation of the results, placed at the top of the DataFrame.
Thanks
import pandas as pd
import numpy as np
real = np.array([2] * 10).reshape(-1,1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)
real = real.flatten()
comparison = pd.DataFrame({'real':real,'y_pred_LR':y_pred_LR,'y_pred_SVR':y_pred_SVR,"y_pred_RF":y_pred_RF})
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean,StD],axis=1).T
result = pd.concat([Mean_StD,comparison],ignore_index=True)
print(result)
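A possible variant of the code above, as a sketch: DataFrame.agg with a list of function names returns the statistics as rows labelled mean and std, so the two summary rows stay identifiable after concatenation. The sample values and the shortened column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for y_test and the model predictions.
comparison = pd.DataFrame({
    "real": np.array([1.0, 2.0, 3.0]),
    "LR": np.array([1.5, 2.5, 3.5]),
    "SVR": np.array([0.0, 2.0, 4.0]),
})

# Rows indexed 'mean' and 'std' (sample standard deviation, ddof=1).
stats = comparison.agg(["mean", "std"])

# Put the summary rows first and keep the original row labels after them.
result = pd.concat([stats, comparison])
print(result)
```

Keeping the index (no ignore_index=True) means the summary rows can be read back with result.loc['mean'] and result.loc['std'].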

Python. How to import my own dataset to "k means" algorithm

I want to import my own data (sentences which are located in a .txt file) into this example algorithm, which can be found at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
The problem is that this code uses a make_blobs dataset and I have a hard time understanding how to replace it with data from a .txt file.
My guess is that I need to replace this piece of code:
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1) # For reproducibility
I also do not understand the variables X and y. I assume that X is an array of data, but what about y?
Should I just assign everything to X as shown below, and would the example code then work? And what about the make_blobs parameters such as centers and n_features? Do I need to specify them in some other way?
# open and read from the txt file
path = "C:/Users/user/Desktop/sentences.txt"
file = open(path, 'r')
# assign it to the X
X = file.readlines()
Any help is appreciated!
First, you need to map your words to numbers that the k-means algorithm can use.
For example:
I ride a bike and I like it.
1 2 3 4 5 1 6 7 # <- number ids
After that you have a numeric encoding of your dataset and can apply k-means. If you want a uniform representation, convert each word to a one-hot vector: an array of length N, where N is the total number of unique words, with a 1 at the position given by the word's id and 0 everywhere else.
For the example above, with N = 7:
1 -> 1000000
2 -> 0100000
...
Now you have an X variable containing your data in a suitable format. You don't need y, which holds the corresponding labels for your samples.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
...
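The word-to-id mapping and one-hot idea described above can be sketched as follows. Two things here are assumptions rather than part of the answer: ids start at 0 instead of 1, and each sentence is represented by the sum of its words' one-hot vectors (a word-count vector), so that every sentence becomes one row of X. The helper tokenize and the second sample sentence are hypothetical.

```python
import re
import numpy as np

# In practice these would be the lines read from the .txt file.
sentences = ["I ride a bike and I like it.", "I like a bike."]

def tokenize(sentence):
    """Lowercase the sentence and split it into words (hypothetical helper)."""
    return re.findall(r"[a-z']+", sentence.lower())

# Assign each distinct word an integer id in order of first appearance.
vocab = {}
for s in sentences:
    for word in tokenize(s):
        vocab.setdefault(word, len(vocab))

# One row per sentence, one column per word id; each cell counts occurrences,
# which equals the sum of the one-hot vectors of the words in the sentence.
X = np.zeros((len(sentences), len(vocab)), dtype=int)
for row, s in enumerate(sentences):
    for word in tokenize(s):
        X[row, vocab[word]] += 1

print(vocab)
print(X)
```

An X built this way is a 2D numeric array, so it can be passed to KMeans.fit_predict in place of the make_blobs output. Parameters such as centers and n_features only describe how make_blobs generates synthetic data and have no counterpart when the data comes from a file.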

Using Machine Learning in Python to load custom datasets?

Here's the problem:
It takes 2 input variables and predicts a result.
For example: price and volume as inputs and a decision to buy/sell as a result.
I tried implementing this using K-Neighbors with no success. How would you go about this?
X = cleanedData['ES1 End Price'] #only accounts for 1 variable, don't know how to add another.
y = cleanedData["Result"]
print(X.shape, y.shape)
kmm = KNeighborsClassifier(n_neighbors = 5)
kmm.fit(X,y) #ValueError for size inconsistency, but both are same size.
Thanks!
X needs to be a matrix/2D array in which each column is a feature, which is not the case in your code. One option is to convert X to a NumPy array and reshape it to 2D:
kmm.fit(X.to_numpy()[:, None], y)
Alternatively, without reshaping, always use a list of column names to extract features from a data frame:
X = cleanedData[['ES1 End Price']]
Or, with more than one column:
X = cleanedData[['ES1 End Price', 'volume']]
Then X is two-dimensional and can be used directly in fit:
kmm.fit(X, y)
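The difference between single and double brackets can be checked by looking at the shapes. This is a small sketch with a hypothetical stand-in for cleanedData; the values are made up, and only the column names come from the question and answer.

```python
import pandas as pd

# Hypothetical stand-in for cleanedData; the values are made up.
cleanedData = pd.DataFrame({
    "ES1 End Price": [10.0, 10.5, 9.8, 11.2],
    "volume": [100, 150, 90, 200],
    "Result": ["buy", "sell", "buy", "sell"],
})

single = cleanedData["ES1 End Price"]                 # Series: 1-D
one_col = cleanedData[["ES1 End Price"]]              # DataFrame: 2-D, one feature
two_cols = cleanedData[["ES1 End Price", "volume"]]   # DataFrame: 2-D, two features

print(single.shape)    # (4,)
print(one_col.shape)   # (4, 1)
print(two_cols.shape)  # (4, 2)
```

Only the two-dimensional forms have the (n_samples, n_features) layout that fit expects.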

Spark: how to get cluster's points (KMeans)

I'm trying to retrieve the data points belonging to a specific cluster in Spark. In the following piece of code the data is made up, but I do obtain the predicted cluster.
Here is the code I have so far:
import numpy as np
from pyspark.mllib.clustering import KMeans
# Example data (sc is an existing SparkContext)
flight_routes = np.array([[1,3,2,0],
                          [4,2,1,4],
                          [3,6,2,2],
                          [0,5,2,1]])
flight_routes = sc.parallelize(flight_routes)
model = KMeans.train(rdd=flight_routes, k=500, maxIterations=10)
route_test = np.array([[0,2,3,4]])
test = sc.parallelize(route_test)
prediction = model.predict(test)
cluster_number_predicted = prediction.collect()
print(cluster_number_predicted) # it returns [100] <-- COOL!!
Now, I'd like to have all the data points belonging to cluster number 100. How do I get those?
What I want to achieve is something like the answer given to this SO question: Cluster points after Means (Sklearn)
Thank you in advance.
If you need both the record and the prediction (and are not willing to switch to Spark ML), you can zip the RDDs, which gives (prediction, record) pairs:
predictions_and_values = model.predict(test).zip(test)
and filter on the prediction afterwards:
predictions_and_values.filter(lambda x: x[0] == 100)
