Predicting from SciKitLearn RandomForestClassification with Categorical Data - python

I created a RandomForestClassification model with SkLearn, using 10 different text features and a training set of 10,000 rows. Then, I pickled the model (76 MB) in hopes of using it for prediction.
However, in order to produce the Random Forest, I used the LabelEncoder and OneHotEncoder for best results on the categorical/string data.
Now, I'd like to pull up the pickled model and get a classification prediction on 1 instance. However, I'm not sure how to encode the text of that 1 instance without loading the entire training & test dataset CSV again and going through the entire encoding process.
Loading the CSV files every time seems quite laborious, and since I'd like this to run 1000x per hour, that approach doesn't seem right to me.
Is there a way to quickly encode 1 row of data given the pickle or other variable/setting? Does encoding always require ALL the data?
If loading all the training data is required to encode a single row, would it be advantageous to encode the text data myself in a database, where each feature gets its own table with an auto-incremented numeric id and a UNIQUE key on the text/categorical field, and then pass that id to the RandomForestClassification? Obviously I would need to refit and pickle this new model, but then I would know exactly the (encoded) numeric representation of a new row and could simply request a prediction on those values.
It's highly likely that I'm missing a feature or misunderstanding SkLearn or Python; I only started with both 3 days ago. Please excuse my naivety.

Using pickle, you should save your LabelEncoder and OneHotEncoder. You can then load them each time and easily transform new instances. For example,
import pickle
import joblib  # previously sklearn.externals.joblib, now a standalone package
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
train_x = [0, 1, 2, 6, 'true', 'false']
le.fit_transform(train_x)
# Save your encoder
joblib.dump(le, '/path/to/save/model')
# OR
pickle.dump(le, open('/path/to/model', "wb"))
# Load those encodings
le = joblib.load('/path/to/save/model')
# OR
le = pickle.load(open('/path/to/model', "rb"))
# Then use as normal
new_x = [0, 0, 0, 2, 2, 2, 'false']
le.transform(new_x)
# array([0, 0, 0, 2, 2, 2, 4])  (mixed values are cast to strings, so the classes sort as '0','1','2','6','false','true')
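Building on that, here is a minimal sketch of the single-row use case from the question: persist the fitted encoder and the forest together, then encode and predict one new instance without ever touching the training CSV. This assumes a recent scikit-learn, where OneHotEncoder accepts strings directly (so the LabelEncoder step is no longer needed); the file name, toy features, and the handle_unknown='ignore' choice are illustrative assumptions, not the asker's actual setup.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# --- training time, done once ---
X_train = np.array([['red', 'small'], ['blue', 'large'], ['red', 'large']])  # toy categorical data
y_train = np.array([0, 1, 1])

enc = OneHotEncoder(handle_unknown='ignore')  # unseen categories encode to all zeros instead of raising
X_enc = enc.fit_transform(X_train)
clf = RandomForestClassifier(n_estimators=100).fit(X_enc, y_train)

# Persist the encoder and the model together so prediction needs no training data
joblib.dump({'encoder': enc, 'model': clf}, 'rf_bundle.joblib')

# --- prediction time, e.g. 1000x per hour ---
bundle = joblib.load('rf_bundle.joblib')
new_row = np.array([['blue', 'small']])  # one new instance
print(bundle['model'].predict(bundle['encoder'].transform(new_row)))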

Related

Trying to make use of a library to conduct some topic modeling, but it's not going well

I have a .csv term-document matrix, and I want to perform some latent Dirichlet allocation using gensim in Python. However, I'm not particularly familiar with Python or LDA.
I posted in the gensim... forum? I'm not sure if that's what it's called. The guy who wrote the package responded and had this to say:
how big is your term-document CSV matrix?
If it's small enough = fits in RAM, you could:
1) use numpy.loadtxt() to load your CSV into an in-memory matrix
2) convert the matrix to a corpus with gensim.matutils.Dense2Corpus() . Check out its documents_columns flag, it lets you switch between document-term and term-document transposition easily.
3) use that corpus to train your LDA model.
So that leads me to believe that the answer to this question isn't correct.
It seems like a dictionary is a necessary input to an LDA model; is this not correct? Here's what I have that I think successfully sticks the .csv into a corpus.
file = np.genfromtxt(fname=fPathName, dtype="int", delimiter=",", skip_header=True, missing_values="", filling_values=0)
corpus = gensim.matutils.Dense2Corpus(file, documents_columns=False)
Any help would be appreciated.
Edit: turns out that a Gensim dictionary and a Python dictionary are not exactly the same things.
So, from Gensim documentation I took this snip of code:
from gensim.models import LdaModel
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=10)
The file you want to analyse is a CSV, so to open it you can use pandas:
import pandas as pd
df = pd.read_csv(filename) # add header=None if the file has no column names
Once you import the file you have everything loaded into a data frame. You then need to combine all the text into a single list (see the first comment of the gensim code snippet), which should look like this:
["text one..", "text 2..", "text 3..."]
You can do that by iterating through the data frame and iteratively appending each text to an empty list. Before doing that, you also need to check which column of your CSV file contains the text to analyse.
common_texts = []  # initialise empty list
for ind, row in df.iterrows():
    text = row[name_column_with_text]
    common_texts.append(text)
Once you have your list of texts you can simply apply the code from the gensim documentation.
Of course you might run into memory problems, depending on the size of your CSV file.
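One caveat worth adding: gensim's Dictionary expects each document to be a list of tokens, not a raw string, so if the column holds plain text you would tokenize it first. A minimal sketch under that assumption (the file name "documents.csv" and the column name "text" are placeholders, and the whitespace split is only a stand-in for real preprocessing):
import pandas as pd
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

df = pd.read_csv("documents.csv")                                # placeholder file name
common_texts = [str(doc).lower().split() for doc in df["text"]]  # one token list per document

common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=10)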

python- logistic regression, save predicted probabilities & predictions to csv

I am new to data analysis and python. I have mostly been following the logistic regression demonstrated in this link on titanic survivors:
http://hamelg.blogspot.ca/2015/11/python-for-data-analysis-part-28.html
I am using my own non-Titanic dataset though. I am at the end of the example where I want to export the results to a CSV file. I made a small modification though which is getting me stuck: in addition to the prediction, I also explicitly generated the predicted probabilities, which I would also like to export to the CSV file.
test_probs = log_model.predict_proba(X=test_features)
print(test_probs)
# Create a submission for Kaggle
submission = pd.DataFrame({"AccountNumber": titanic_test["AccountNumber"],
                           "PolarPredict": test_preds,
                           "probabilities": test_probs})
This is the message I get: Exception: Data must be 1-dimensional
here is the original code from the tutorial:
# Make test set predictions
test_preds = log_model.predict(X=test_features)
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId": titanic_test["PassengerId"],
                           "Survived": test_preds})
# Save submission to CSV
submission.to_csv("tutorial_logreg_submission.csv",
                  index=False)
How can I export the prediction, probabilities and the "ID" into a csv file?
You can try converting them into lists:
submission = pd.DataFrame({
    "AccountNumber": list(titanic_test["AccountNumber"]),
    "PolarPredict": list(test_preds),
    "probabilities": list(test_probs)
})
submission.to_csv("tutorial_logreg_submission.csv",
                  index=False)
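Alternatively, since predict_proba returns a 2-D array with one column per class (which is what triggers the "Data must be 1-dimensional" error), you can give each class probability its own flat column. A sketch reusing the variables above and assuming a binary problem where column 1 is the positive class:
submission = pd.DataFrame({
    "AccountNumber": titanic_test["AccountNumber"],
    "PolarPredict": test_preds,
    "prob_class_0": test_probs[:, 0],  # probability of the first class
    "prob_class_1": test_probs[:, 1]   # probability of the second (positive) class
})
submission.to_csv("tutorial_logreg_submission.csv", index=False)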

Can I call vectorizer.fit_transform multiple times to update vectorizer

I'm training a Multinomial Naive Bayes classifier on a large dataset separated over multiple files. I would like to update the CountVectorizer with all my data, but only read one file into memory at a time.
My current code:
raw_documents = []
for f in files:
    text = np.loadtxt(open("csv/{f}".format(f=f), "r"),
                      delimiter="\t", dtype="str", comments=None)
    raw_documents.extend(list(text[:, 1]))
vectorizer = CountVectorizer(stop_words=None)
train_features = vectorizer.fit_transform(raw_documents)
Is it possible to partially call fit_transform, such that I can do
vectorizer = CountVectorizer(stop_words=None)
for f in files:
    text = np.loadtxt(open("csv/{f}".format(f=f), "r"),
                      delimiter="\t", dtype="str", comments=None)
    train_features = vectorizer.fit_transform(text[:, 1])
Relevant documentation can be found here, but I don't manage to fully understand it.
Thanks in advance!
The problem is that the CountVectorizer needs to know in advance what all the words in your corpus are, so that it can have a way of mapping words to integers. (It would be nice if you could do a "partial fit" where it appends new words to the mapping as it encounters them, but as far as I know this is not currently supported.)
An alternative would be to use HashingVectorizer; this doesn't need to be fit, as it just runs each word through a fixed hashing function to get its integer encoding.
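A minimal sketch of that approach, reusing the files list and loading code from the question (the label set and the assumption that labels sit in column 0 of each file are illustrative, not from the question): HashingVectorizer transforms each file independently, and MultinomialNB.partial_fit consumes the batches one at a time. alternate_sign=False keeps the hashed counts non-negative, which MultinomialNB requires.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = HashingVectorizer(alternate_sign=False)  # non-negative features for MultinomialNB
clf = MultinomialNB()
all_classes = np.array([0, 1])                        # assumed: the full label set, known up front

for f in files:
    text = np.loadtxt(open("csv/{f}".format(f=f), "r"),
                      delimiter="\t", dtype="str", comments=None)
    X = vectorizer.transform(text[:, 1])              # no fitting needed for a hashing vectorizer
    y = text[:, 0].astype(int)                        # assumed: labels live in column 0
    clf.partial_fit(X, y, classes=all_classes)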

How to avoid loading a large file into a python script repeatedly?

I've written a python script to take a large file (a matrix ~50k rows X ~500 cols) and use it as a dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but loading the file takes ~45 seconds, and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file loading code:
def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return (train_id_list, train_val_array)
This returns a numpy array with col. 9 as the label and cols. 12-end as the metadata to train the random forest.
I am going to train many different forms of my model with the same data, so I just want to load the file one time and have it available to feed into my random forest function. I want the file to be an object, I think (I am fairly new to Python).
If I understand you correctly, the data set does not change but the model parameters do change and you are changing the parameters after each run.
I would put the file load script in one file, and run this in the python interpreter. Then the file will load and be saved in memory with whatever variable you use.
Then you can import another file with your model code, and run that with the training data as argument.
If all your model changes can be determined as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.
If you need to change the model code between runs, save with a new filename and import that one, run again and send the source data to that one.
If you don't want to save each model modification with a new filename, you might be able to use the reload functionality depending on python version, but it is not recommended (see Proper way to reload a python module from the console)
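A minimal sketch of that workflow; the module and function names below are hypothetical, chosen only to illustrate the idea:
# In an interactive session (python or ipython); module/function names are hypothetical
from data_loading import load_train_data                    # hypothetical module holding load_train_data()
train_ids, train_vals = load_train_data("train_data.tsv")    # the slow step, done only once

import forest_models                                         # hypothetical module with the training code
model_a = forest_models.train(train_vals, train_ids, n_estimators=100)
model_b = forest_models.train(train_vals, train_ids, n_estimators=500)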
Simplest way would be to cache the results, like so:
_train_data_cache = {}
def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
Try learning about Python data serialization. You would basically store the large file as a Python-specific, serialized binary object, for example using Python's marshal module; this would drastically speed up IO of the file. See these benchmarks for performance variations. However, if these random forest models are all trained at the same time, then you could just train them against the dataset you already have in memory and release the training data after completion.
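For example, here is a sketch of disk-caching the parsed arrays with pickle (used here instead of marshal, since pickle is the usual choice for arbitrary Python objects; the cache file name is an assumption), so later runs skip the slow text parsing:
import os
import pickle

def load_train_data_cached(train_file, cache_file="train_cache.pkl"):
    # Reuse a previously parsed copy of the data if one exists on disk
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as fh:
            return pickle.load(fh)
    data = load_train_data(train_file)  # the slow parse defined in the question
    with open(cache_file, "wb") as fh:
        pickle.dump(data, fh)
    return data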
Load your data in ipython.
my_data = open("data.txt")
Write your code in a python script, say example.py, which uses this data. At the top of the script example.py add these lines:
import sys
args = sys.argv
data = args[1]
...
Now run the python script in ipython:
%run example.py $my_data
Now, when running your python script, you don't need to load data multiple times.

Dimension Discrepancy: Converting Raw Data to Numpy N-Dimensional Array for Classification in Scikit-Learn

I have a pretty straightforward question. I'm working on a script to perform multi-label classification with the Scikit-Learn library. My question concerns converting my raw data (which is in JSON format) to NumPy arrays for use with Scikit-Learn's logistic regression classifier. I know from working with the data in Pandas that the dimensions of the data should be 3369 x 76 (3369 instances, and 76 features). But, when I load my JSON data in with my script, and convert it to a NumPy array and call data.shape the dimensions are (3369, 1077), so 3369 instances with 1077 features. Perhaps this is something I don't understand about how NumPy works, but if I'm doing something incorrectly and I'm inadvertently generating all of these extra features, I'm worried that any classifiers I attempt to build won't even have a chance at decent performance due to noise introduced by extra features.
So, my questions are two: (1) am I loading in/converting the data correctly for use with Scikit-Learn's classifiers, and (2) if I am converting the data correctly (as far as anyone can tell) what is the reason for the discrepancy in dimensions between the original data and the data after it is converted to NumPy arrays? Thanks in advance for reading. Code below:
EDIT: I've pared down the code a bit to just focus on the areas where I'm converting the original data to a NumPy array.
import os, json
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.feature_extraction import DictVectorizer
from sklearn.cross_validation import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
I read in the original data like so:
with open('all_labels.json', 'rb') as infile:
    data = json.load(infile)
So now we have a variable data which is a list of Python dictionaries, one dictionary per instance in the dataset.
# Extract labels from data and assign to separate variable
labels = [i['label'] for i in data]
# Convert labels to numpy format for sklearn to use
le = preprocessing.LabelEncoder()
le.fit(['class1', 'class2', 'class3', 'class4'])
labels = le.transform(labels)
# Remove target labels from data instances
for i in data:
    del i['label']
Here is where I may be going wrong. I'm using Scikit-Learn's DictVectorizer to try to convert a Python list of dictionaries (where each dictionary represents an instance of the dataset and each key within a dictionary is a feature for the classifier) to a NumPy array.
# THIS IS WHERE I THINK MY MISUNDERSTANDING LIES
# Initialize vectorizer object to convert data instances to
# numpy format for sklearn
vec = DictVectorizer()
data = vec.fit_transform(data).toarray()
data.shape # (3369, 1077)
Hopefully this directs people to the specific problem I'm having in the script.
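As a side note on the dimension question: DictVectorizer one-hot encodes every string-valued feature (one output column per distinct string value) while passing numeric features through unchanged, which is usually why the column count grows. A minimal illustration with made-up data:
from sklearn.feature_extraction import DictVectorizer

toy = [{'color': 'red', 'size': 1},
       {'color': 'blue', 'size': 2}]
vec = DictVectorizer()
X = vec.fit_transform(toy).toarray()
print(vec.get_feature_names_out())  # ['color=blue' 'color=red' 'size'] -- 2 raw features become 3 columns
print(X.shape)                      # (2, 3)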
