Non overlapping data in train test validation split python - python

I'm trying to create a function for a deep learning problem: satellite image classification. I have searched through a lot of libraries and haven't found what I need. I tried scikit-learn, but I feel that it is not what I need.
Any hint about a specialised function that I may have missed?

The sklearn train_test_split seems to fit all your needs.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

This should do the trick. You can use the permutation array on the X and y data separately if you like.
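import numpy as np
# 50% train, 20% validation, the remaining 30% test (data is assumed to be a NumPy array)
num_tr, num_va = int(len(data)*0.5), int(len(data)*0.2)
# Each index appears exactly once in perm, so the three slices cannot overlap
perm = np.random.permutation(len(data))
tr_data = data[perm[:num_tr]]
va_data = data[perm[num_tr:num_tr+num_va]]
te_data = data[perm[num_tr+num_va:]]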
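
To apply the permutation to X and y separately, as suggested above, it can help to compute the index arrays once and reuse them. The following is a minimal sketch, not a library function: split_indices, its parameters, and the toy X and y arrays are names made up for illustration.

```python
import numpy as np

def split_indices(n, train_frac=0.5, val_frac=0.2, seed=0):
    """Return disjoint train/validation/test index arrays covering range(n)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)  # every index appears exactly once
    num_tr = int(n * train_frac)
    num_va = int(n * val_frac)
    return perm[:num_tr], perm[num_tr:num_tr + num_va], perm[num_tr + num_va:]

# Toy stand-ins for image features and labels (illustrative only).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

tr_idx, va_idx, te_idx = split_indices(len(X))

# Indexing X and y with the same index array keeps samples and labels aligned.
X_tr, y_tr = X[tr_idx], y[tr_idx]
X_va, y_va = X[va_idx], y[va_idx]
X_te, y_te = X[te_idx], y[te_idx]
```

Because the three index arrays are slices of one permutation, no sample can end up in more than one subset.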

Related

What exactly does fit() do here?

Basically, I want to know what the fit() function does in general, but especially in the pieces of code below.
I'm taking the Machine Learning A-Z course because I'm pretty new to machine learning (I just started). I know some basic conceptual terms, but not the technical part.
CODE1:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean', verbose = 0)
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
Another example where I still have doubts:
CODE 2:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
print(sc_X)
X_train = sc_X.fit_transform(X_train)
print(X_train)
X_test = sc_X.transform(X_test)
I think that if I know the general use of this function and what exactly it does, I'll be good to go. But I'd certainly like to know what it is doing in that code.
The scikit-learn tutorial is also a useful reference: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
In machine learning, the fit method is always the step that learns something from data.
You normally have the following steps:
Separate your data into two or three datasets
Pick one part of your data to learn/train something (normally X_train) with fit
Use the trained model to predict on unseen data (normally X_test) with predict
In your first example: missingvalues.fit(X[:, 1:3])
You are training SimpleImputer on your data X, using only the columns at index 1 and 2 (the slice 1:3 excludes index 3). With transform, you use what was learned to overwrite those columns.
In your second example: you are training StandardScaler on X_train and applying what it learned to both datasets, X_train and X_test. Because StandardScaler learns only from X_train, if it learned that 10 has to be converted to 2, it will convert 10 to 2 in both X_train and X_test.
Sklearn uses Classes. See the Python documentation for more info about Classes in Python. For more info about sklearn in particular, take a look at this sklearn documentation.
Here's a short description of how you are using Classes in sklearn.
First you instantiate your sklearn Classes with sc_X = StandardScaler() or missingvalues = SimpleImputer(...).
The objects, sc_X and missingvalues, each have methods. You call a method by typing object_name.method_name(...). For example, you used the fit_transform() method of the sc_X instance when you typed sc_X.fit_transform(...). This method takes your data and returns a scaled version of it. It both fits (determines the scaling parameters) and transforms (applies the scaling to) your data. The transform() method will transform new data, using the same scaling parameters it learned from your previous data.
In the first example, you have separated the fit and transform methods into two separate lines, but the idea is similar -- you first learn the imputation parameters with the fit method, and then you transform your data.
By the way, I think missingvalues = missingvalues.fit(X[:, 1:3]) could be changed to missingvalues.fit(X[:, 1:3]).
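
To make the fit/transform distinction concrete, here is a small sketch with made-up one-column data (the arrays are illustrative, not the asker's dataset). It shows that fit stores parameters learned from the training data, and that transform reuses them unchanged on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: one feature, two training rows, one test row.
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[20.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns the mean and standard deviation of X_train only

# The learned parameters are stored on the fitted object.
learned_mean = scaler.mean_    # mean of the training column
learned_scale = scaler.scale_  # standard deviation of the training column

# transform applies (x - mean) / scale with the stored training parameters.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Here the training mean and standard deviation are both 5, so the test value 20 is mapped to (20 - 5) / 5 = 3. The test data never influences the parameters.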

feed composite inputs to model

I need to feed an image and a vector sampled from a normal distribution simultaneously. As the image dataset I'm using is too large, I create an ImageDeserializer for that part. But I also need to add a random vector (sampled from a NumPy normal distribution) to the input map before feeding it to the network. Is there any way to achieve this?
I also tried:
mb_data = reader_train.next_minibatch(mb_size, input_map=input_map)
mb_data[random_input_node] = np.random.normal((mb_size, 100))
but get the following error:
TypeError: cannot convert value of dictionary to N4CNTK13MinibatchDataE
The problem was solved with the following snippet, which feeds the data to the trainer directly:
mb_data = reader_train.next_minibatch(mb_size, input_map=input_map)
# size= sets the output shape; a bare positional argument would be read as the mean
z = np.random.normal(size=(mb_size, 100))
my_trainer.train_minibatch({feature_image: mb_data[image].data, feature_z: z})
Also thanks to @mewahl. Defining a new reader is another suitable way to solve the problem, and I think it should be faster than what I have done.
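
One detail worth checking in snippets like these is how the noise is sampled. The first positional argument of np.random.normal is the mean (loc), so a shape has to be passed through the size keyword. A short sketch, with mb_size set to an arbitrary illustrative value:

```python
import numpy as np

mb_size = 4  # hypothetical minibatch size, for illustration only

# A tuple passed positionally is read as loc (the means), not as a shape:
# this yields 2 samples, one centred on mb_size and one centred on 100.
as_loc = np.random.normal((mb_size, 100))

# A bare number is also read as loc, so this is a single float near mb_size.
single = np.random.normal(mb_size)

# Passing the shape through size= gives one 100-dimensional vector per sample.
z = np.random.normal(size=(mb_size, 100))
```

Only the last call produces an array of shape (mb_size, 100) drawn from a standard normal distribution.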

formatting design matrix for regression

I am given a test set without the response variable. I have already built the model and need to predict the response variable in the testing set.
I am having trouble formatting the test design matrix so that it would be compatible.
I am using patsy library to construct the matrix.
I want to do something like this, except the code below does not work:
X = dmatrices('Response ~ var1 + var2', test, return_type = 'dataframe')
What is the right approach? thanks
If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":
# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix
test_X = dmatrix(design_info, test, return_type="dataframe")
Alternatively, you could build a new matrix from scratch:
# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")
The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you. And suppose that there are 10 categories that show up in your training set, but only 5 of them occur in your test set. If you use the first approach, then patsy will remember what the 10 categories were, and generate a test matrix with 10 columns (some of them all-zeros). If you use the second approach, then patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code is probably going to crash because the matrix isn't the shape it expects.
Another case where this matters is if you use patsy's center function to center a variable: with the first approach it will automatically remember what value it subtracted off from the training data and re-use it for the test data, which is what you want. With the second approach it will recompute the center using the test data, which can lead to you silently getting really really wrong results.
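
The category problem can be reproduced without patsy. The sketch below is a hand-rolled one-hot encoder, not patsy's implementation; one_hot and the toy lists are hypothetical names used only to show why the encoding learned from the training data has to be reused.

```python
import numpy as np

def one_hot(values, categories):
    """Encode values against a fixed, ordered list of categories."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

train = ["a", "b", "c", "a"]
test = ["a", "c"]  # category "b" never occurs in the test set

# Remember the categories seen during training.
train_categories = sorted(set(train))
train_X = one_hot(train, train_categories)

# Reusing the training categories keeps the column layout identical.
test_X_reused = one_hot(test, train_categories)

# Rebuilding from the test data alone silently drops the "b" column.
test_X_rebuilt = one_hot(test, sorted(set(test)))
```

train_X and test_X_reused both have three columns, with the "b" column all zeros in the test matrix, while test_X_rebuilt has only two columns and no longer lines up with the fitted coefficients.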

How can we predict using RandomForestClassifier obtained from pyspark.ml

I am doing a text classification and I have built a model using the pipeline method. I have created the RF classifier object and have set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting my training data which I have created using a dataframe and it has the columns "labels" and "sentences". The labels are different question types. The DF looks like,
training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data that is coming from Kinesis and it will be only sentences (plain strings). For each sentence, I have to predict the label. But now I am not finding any predict() function when I do dir(model). How come there is no predict() method for the RandomForestClassifier obtained from pyspark.ml? If not, how can I perform my use case successfully? I need the predict() method to satisfy the requirement. What ML algorithm should I use if not RF? Am I doing anything wrong? Can anyone please suggest something? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.
I figured out that there is no predict() method that can be used here. Instead, we need to use the transform() method to make predictions. Just leave out the label column and create a new dataframe. For example, in my case, I did,
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
And then we can find the prediction using the select() method. At least for now, this solution has worked for me. Please do let me know if there is any correction or a better approach than this.
I am working on the same problem. Can you tell me what "pos" (part of speech) is in the pipeline stages and how you are getting it? Also, how are you preparing the test data? Below is my code -
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.

Issue in understanding the Spark MLlib's LinearRegressionWithSGD example in python?

I am a rookie in machine learning and Spark, and was going through the Spark MLlib documentation on regression, especially LinearRegressionWithSGD. I am having a bit of difficulty understanding the Python code. Here is what I have understood so far: the code loads the data and then forms LabeledPoint objects. After that the model is built, then it is evaluated on the training data, and the MSE is calculated.
The part that is confusing me is that in the normal machine learning process we first divide the data into a training set and a test set. Then we build the model using the training set and finally evaluate it using the test set. In the code in the Spark MLlib documentation I do not see any division into training and test sets. On top of that, I see them building the model using the data and then evaluating it using the same data.
Is there something that I am not able to understand in the code? Any help in understanding the code would be appreciated.
NOTE: This is the code on the Spark MLlib documentation page for LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Tuple parameters in a lambda are Python 2 only; indexing the pair works in Python 2 and 3
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
# Save and load model
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")
The procedure you are talking about is hold-out validation, the simplest form of cross-validation. As you observed, the example above doesn't do it. But this doesn't mean it's wrong.
The sole purpose of that example is to illustrate how to train and use a model. You are free to split the data and cross-validate the model; the procedure will be the same. Only the data changes.
In addition, performance on the training set is also valuable. Compared with performance on held-out data, it can tell you whether your model is overfitting or underfitting.
So to summarize, the example is all right; what you need is another example on cross-validation.
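
As a rough illustration of the same procedure on a split, here is a sketch in plain NumPy rather than Spark. The synthetic data, the 80/20 split, and the least-squares fit are all assumptions made for the example. In Spark, RDD.randomSplit is the usual way to obtain such a split before calling train.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y depends linearly on X plus a little noise.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows for testing; the index sets are disjoint.
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:80], perm[80:]

# Fit on the training rows only.
w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

def mse(idx):
    """Mean squared error of the fitted weights on the given rows."""
    return float(np.mean((y[idx] - X[idx] @ w) ** 2))

train_mse = mse(train_idx)  # same kind of number the documentation example prints
test_mse = mse(test_idx)    # error on rows the model has never seen
```

A training error far below the test error suggests overfitting, while a high error on both suggests underfitting.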
