Inputs:
I have an array X of images where each row is an example representing a person.
Another array y for their labels where a label is an integer between 1 and 7.
And a last array ids, where ids[i] is the id of the person in X[i]. (The same person always has the same id, and there can be multiple different images of the same person.)
Is it possible to partition X and y so that the same person doesn't go into both testing and training set?
I think that I need to use sklearn.cross_validation.train_test_split. Can someone explain what "stratify" does and is this the right method to do what I'm trying to do?
Stratified sampling means that sklearn will try to match the ratios of classes in your train and test splits to those of the overall data.
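For example (a minimal sketch, using the newer sklearn.model_selection import path), passing your labels to the stratify argument preserves the class proportions:
from sklearn.model_selection import train_test_split

# The class ratios in y are (approximately) the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)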
What information is contained in your y-labels?
It sounds like you need something like LabelKFold or LabelShuffleSplit, where the label would be the ids in your case.
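In newer versions of scikit-learn these are called GroupKFold and GroupShuffleSplit, with the ids passed as groups. A minimal sketch, assuming X, y and ids are NumPy arrays:
from sklearn.model_selection import GroupShuffleSplit

# Every person (id) ends up entirely in either the train split or the test split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=ids))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]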
Sorry if the title is a bit confusing, I don't know how else I can make this question more specific.
I am trying to create an AdaBoost implementation in Python, and I am using the MNIST dataset from Keras.
Currently, I am just trying to create a training array for a weak threshold classifier that recognizes images of the digit "0".
For that, I need to create an array, half of it being just images of "0" and the other half being images of any other random digit.
Currently, I have 2 arrays: x_train, which contains the pictures, and y_train, which contains the labels, so that we can check whether, for example, x_train[i] is a picture of the digit "0" by testing y_train[i] == 0.
So, I want to know if there's an automated way of doing that using NumPy, to grab elements from an array using a condition applied to another array.
Basically: grab n elements from x_array[i] where y_array[i] == 0 and push them into custom_array, and grab n elements from x_array[i] where y_array[i] != 0 and push them into custom_array as well.
Best regards.
Does this serve your purpose?
mask = y_array == 0           # boolean mask that is True where the label is 0
array_0 = x_array[mask]       # all images labelled 0
array_non0 = x_array[~mask]   # all other images
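If you only need n examples from each half, you could then slice and stack them, e.g. (a rough sketch; n is whatever size you want per half):
import numpy as np

n = 100  # hypothetical: however many examples you want per half
custom_x = np.concatenate([array_0[:n], array_non0[:n]])
custom_y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])  # 0 = "zero", 1 = "not zero"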
If I interpreted your question the wrong way, please correct me.
I want to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by the C# package manager (example) that have various attributes such as name, description, tags that could help to identify similar packages.
I have a prototype recommender system here that currently uses only a single attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that:
# Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def train(dataframe):
    tfidf = TfidfVectorizer(analyzer='word',
                            ngram_range=(1, 3),
                            min_df=0,
                            stop_words='english')
    tfidf_matrix = tfidf.fit_transform(dataframe['description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    for idx, row in dataframe.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-10:-1]
        similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
                         for i in similar_indices]

        id = row['id']
        similar_items = [it for it in similar_items if it[0] != id]

        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items, ())
        try_print("Top 10 recommendations for %s: %s" % (id, flattened))
How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give more context to my recommendations?
For some context, my work with content-based recommenders has revolved primarily around raw text and categorical data/features. Here's a high-level approach I've taken that has worked out nicely and is pretty simple to implement.
Suppose I have three feature columns that I can potentially use to make recommendations: description, name, and tags. To me, the path of least resistance entails combining these three feature sets in a useful way.
You're off to a good start, using TF-IDF to encode description. So why not treat name and tags in a similar way by creating a feature "corpus" consisting of description, name, and tags? Literally, this would mean concatenating the contents of each of the three columns into one long text column.
Be wise about the concatenation, though: it's probably to your advantage to preserve which column a given word came from, especially for features like name and tags, which presumably have much lower cardinality than description. To put it more explicitly: instead of just creating your corpus column like this:
df['corpus'] = (pd.Series(df[['description', 'name', 'tags']]
                          .fillna('')
                          .values.tolist()
                          ).str.join(' '))
You might try preserving information about where particular data points in name and tags come from. Something like this:
df['name_feature'] = ['name_{}'.format(x) for x in df['name']]
df['tags_feature'] = ['tags_{}'.format(x) for x in df['tags']]
And after you do that, I would take things a step further by considering how the default tokenizer (which you're using above) works in TfidfVectorizer. Suppose you have the name of a given package's author: "Johnny 'Lightning' Thundersmith". If you just concatenate that literal string, the tokenizer will split it up and roll each of "Johnny", "Lightning", and "Thundersmith" into separate features, which could potentially diminish the information added by that row's value for name. I think it's best to try to preserve that information. So I would do something like this to each of your lower-cardinality text columns (e.g. name or tags):
import string

def raw_text_to_feature(s, sep=' ', join_sep='x', to_include=string.ascii_lowercase):
    def filter_word(word):
        # Drop any character that is not in to_include
        return ''.join([c for c in word if c in to_include])
    return join_sep.join([filter_word(word) for word in s.split(sep)])

df['name_feature'] = df['name'].apply(raw_text_to_feature)
The same sort of critical thinking should be applied to tags. If you've got a comma-separated "list" of tags, you'll probably have to parse those individually and figure out the right way to use them.
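For instance, one (purely hypothetical) way to handle a comma-separated tags string would be to split it and prefix each tag, in the same spirit as the name_feature column above:
def tags_to_feature(tags):
    # e.g. "web, json,parser" -> "tags_web tags_json tags_parser"
    return ' '.join('tags_{}'.format(t.strip()) for t in str(tags).split(','))

df['tags_feature'] = df['tags'].apply(tags_to_feature)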
Ultimately, once you've got all of your <x>_feature columns created, then you can create your final "corpus" and plug that into your recommender system as inputs.
This whole system takes some engineering, to be sure, but I've found it's the easiest way to introduce new information from other columns that have different cardinalities.
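To make that last step concrete, the final assembly might look something like this (a sketch reusing the hypothetical *_feature columns from above):
from sklearn.feature_extraction.text import TfidfVectorizer

# One text column containing the description plus the engineered features
df['corpus'] = (df['description'].fillna('') + ' '
                + df['name_feature'] + ' '
                + df['tags_feature'])

tfidf_matrix = TfidfVectorizer(analyzer='word',
                               ngram_range=(1, 3),
                               stop_words='english').fit_transform(df['corpus'])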
As I understand your question, there are two ways this can be done:
1. Combine the other features with tfidf_matrix and then calculate the cosine similarity.
2. Calculate the similarity of the other features using other methods and then somehow combine them with the cosine similarity of tfidf_matrix to get a meaningful metric.
I was talking about the first one.
For example, let's say that, for your data, the tfidf_matrix (for only the 'description' column) is of shape (3000, 4000),
where 3000 are the rows in the data and 4000 are the unique words (vocabulary) found by the TfidfVectorizer.
Now let's say you do some feature processing on the other columns ('authors', 'id', etc.) and that produces 5 columns. So the shape of that data is (3000, 5).
I was saying to combine the two matrices (combine the columns) so that the new shape of your data is (3000, 4005) and then calculate the cosine_similarity.
See below example:
from scipy import sparse
# This is your original matrix
tfidf_matrix = tfidf.fit_transform(dataframe['description'])
# These are the other features
other_matrix = some_processing_on_other_columns()
combined_matrix = sparse.hstack((tfidf_matrix, other_matrix))
cosine_similarities = linear_kernel(combined_matrix, combined_matrix)
You have a vector for a user $\gamma_u$ and an item $\gamma_i$. The scoring function for your recommendation is the inner product of the two, $\hat{r}_{u,i} = \gamma_u \cdot \gamma_i$.
Right now you said your feature vector has only 1 item, but once you get more, this model will scale for that.
In this case you already engineered your vectors, but typically in recommenders, the features are learned through matrix factorization. This is called a latent factor model, whereas you have a hand-crafted model.
I have 50 products. For each product, I want to identify the following four related products using similarity measures.
1 related the most
2 partially related
1 not related
I want to compare the ranked list generated by my model (predicted) with the ranked list specified by the domain experts (ground truth).
Through reading, I found that I may use rank correlation approaches such as Kendall's Tau or Spearman's to compare the ranked lists. However, I am not sure if these approaches are suitable as my number of samples is low (4). Please correct me if I am wrong.
Another approach is to use Jaccard similarity (set intersection) to quantify the similarity between two ranked lists. Then, I may plot a histogram of setbased_list (see below).
from sklearn.metrics import jaccard_similarity_score

setbased_list = []
for index, row in evaluate.iterrows():
    d = row['Id']
    y_pred = [3, 2, 1, 0]
    y_true = [row['A'], row['B'], row['C'], row['D']]
    sim = jaccard_similarity_score(y_true, y_pred)
    setbased_list.append(sim)
Is my approach to the problem above correct?
What are other approaches that I may use if I want to take into consideration the positions of elements in the list (weight-based)?
From the way you have described the problem, it sounds as if you might as well just assign an arbitrary score for each item on your list - e.g. 3 points for the same item at the same rank as on the 'training' list, 1 point for the same item but at a different rank, or something like that.
I'm not clear on the role of the 'not related' item though - are the other 45 items all equally 'not related' to the target item and if so does it matter which one you choose? Perhaps you need to take points away from the score if the 'not related' item appears in one of the 'related' positions? That subtlety might not be captured by a standard nonparametric correlation measure.
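As a purely hypothetical illustration of that kind of ad-hoc scoring (the point values and the penalty are arbitrary):
def score_ranking(predicted, ground_truth):
    # Compare two ranked lists of item ids with an ad-hoc point scheme.
    # Assumes the last element of ground_truth is the 'not related' item.
    score = 0
    not_related = ground_truth[-1]
    for rank, item in enumerate(predicted):
        if item == ground_truth[rank]:
            score += 3   # same item at the same rank
        elif item in ground_truth:
            score += 1   # same item, but at a different rank
        if item == not_related and rank < len(predicted) - 1:
            score -= 2   # 'not related' item placed in a 'related' position
    return score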
If it's important that you use a standard, statistically based measure for some reason then you are probably better off asking on Cross Validated.
I'm having issues with presenting my data in a form that sklearn will accept.
My raw data is a few hundred strings, each classified into one of 5 classes. I have a list of the strings I'd like to classify and a parallel list of their respective classes. I'm using GaussianNB().
Example Data:
For such a large, successful business, I really feel like they need to be
either choosier in their employee selection or teach their employees to
better serve their customers.|||Class:4
Which represents a given "feature" and a classification
Naturally, the strings themselves have to be converted to vectors prior to their use in the classifier. I've attempted to use DictVectorizer to perform this task:
dictionaryTraining = convertListToSentence(data)
vec = DictVectorizer()
print(dictionaryTraining)
vec.fit_transform(dictionaryTraining)
However, in order to do it, I have to attach the actual classification of the data to the dictionary, otherwise I get the error 'str' object has no attribute 'items'. I understand this is because .fit_transform requires features and indices, but I don't fully understand the purpose of the indices:
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
My question is: how can I take a list of strings and a list of numbers representing their classifications, and provide these to a GaussianNB() classifier so that I can present it with a similar string in the future and it will estimate the string's class?
Since your input data are raw text and not in the format of a dictionary like {"word": number_of_occurrences, ...}, I believe you should go with a CountVectorizer, which will split your input text on whitespace and transform it into the input vectors you need.
A simple example of such a transformation would be:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

x = CountVectorizer().fit_transform(corpus)
print(x.todense())  # x holds your features. Here I am only visualizing it
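From there, a minimal end-to-end sketch (assuming texts is your list of strings and labels is the parallel list of class numbers; both names are placeholders) would be:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()  # GaussianNB needs a dense array, not a sparse matrix
clf = GaussianNB().fit(X, labels)

# Classify a new string by pushing it through the same vectorizer
new_X = vectorizer.transform(['some new review text']).toarray()
print(clf.predict(new_X))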
What exactly does the LogisticRegression.predict_proba function return?
In my example I get a result like this:
[[ 4.65761066e-03 9.95342389e-01]
[ 9.75851270e-01 2.41487300e-02]
[ 9.99983374e-01 1.66258341e-05]]
From other calculations, using the sigmoid function, I know that the second column contains probabilities. The documentation says that the first column is n_samples, but that can't be, because my samples are reviews, which are texts and not numbers. The documentation also says that the second column is n_classes. That certainly can't be right, since I only have two classes (namely +1 and -1) and the function is supposed to calculate the probabilities of samples belonging to a class, not the classes themselves.
What is the first column really and why it is there?
4.65761066e-03 + 9.95342389e-01 = 1
9.75851270e-01 + 2.41487300e-02 = 1
9.99983374e-01 + 1.66258341e-05 = 1
The first column is the probability that the entry has the -1 label and the second column is the probability that the entry has the +1 label. Note that classes are ordered as they are in self.classes_.
If you would like to get the predicted probabilities for the positive label only, you can use logistic_model.predict_proba(data)[:,1]. This will yield you the [9.95342389e-01, 2.41487300e-02, 1.66258341e-05] result.
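If you are ever unsure which column corresponds to which class, you can check the fitted model's classes_ attribute, since the columns of predict_proba follow that order:
print(logistic_model.classes_)  # e.g. array([-1, 1])
positive_idx = list(logistic_model.classes_).index(1)
positive_probs = logistic_model.predict_proba(data)[:, positive_idx]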