I am trying to use the scikit-learn Python library to classify a bunch of URLs for the presence of certain keywords matching a user profile. A user has a name, email address ... and a URL assigned to them. I have created a txt file with the result of matching each piece of profile data against each link, so it is in the format:
Name  Email  Address
0     1      0    => Relevant
1     1      0    => Relevant
0     1      1    => Relevant
0     0      0    => Not Relevant
where the 0 or 1 signifies whether the attribute was found on the page (each row is a webpage).
How do I give this data to scikit-learn so it can run a classifier on it? The examples I have seen all load data from a predefined scikit-learn dataset such as digits or iris, or generate it already in the required format. I just don't know how to feed the data format I have to the library.
The above is a toy example; I have many more features than 3.
The data needed is a numpy array (in this case a "matrix") with the shape (n_samples, n_features).
A simple way to read the CSV file into the right format is to use numpy.genfromtxt. Also refer to this thread.
Let the contents of a csv file (say file.csv in the current working directory) be:
a,b,c,target
1,1,1,0
1,0,1,0
1,1,0,1
0,0,1,1
0,1,1,0
To load it we do
import numpy as np

data = np.genfromtxt('file.csv', delimiter=',', skip_header=1)
skip_header is set to 1 to skip the header row (the a,b,c,target line), and delimiter=',' tells NumPy that the file is comma-separated. Refer to numpy's documentation for more details.
Once you load the data, you need to do some pre-processing based on your input data format. The preprocessing could be something like splitting the input and the targets (classification) or splitting the whole dataset into a training and validation set (for cross-validation).
To split the input (feature matrix) from the output (target vector) we do
features = data[:, :3]
targets = data[:, 3] # The last column is identified as the target
For the above CSV data, the arrays will look like:
features = array([[1., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.],
                  [0., 0., 1.],
                  [0., 1., 1.]])   # shape = (5, 3)
targets = array([0., 0., 1., 1., 0.])   # shape = (5,)
Now these arrays are passed to the estimator object's fit method. If you are using, for example, the linear SVM classifier:
>>> from sklearn.svm import LinearSVC
>>> linear_svc_model = LinearSVC()
>>> linear_svc_model.fit(X=features, y=targets)
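The preprocessing mentioned above also includes splitting the data into a training and validation set; here is a minimal sketch of that step using scikit-learn's train_test_split (my addition, not part of the original answer):
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hold out 25% of the rows for validation; random_state makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    features, targets, test_size=0.25, random_state=0)

linear_svc_model = LinearSVC()
linear_svc_model.fit(X_train, y_train)
print(linear_svc_model.score(X_val, y_val))  # accuracy on the held-out rows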
I am working with the following torch_geometric dataset object. It consists of a multitude of graphs, each representing a molecule. To give an idea, this is the data inside the Dataset.
Data(x=[1118918, 43], edge_attr=[2161762, 6], edge_index=[2, 2161762], y=[54613, 1], mol_code=[54613])
There is x (atom or node features), edge_attr (bond or edge features), edge_index (adjacency matrix), and y (target). There are 54613 graphs in this dataset: each graph is a molecule. In fact, to be clear, the dimension of x is [average_number_of_atoms * 54613, n_atom_features] and of edge_attr is [average_number_of_bonds * 54613, n_bond_features].
What is mol_code then? It pairs the molecule with an id. In fact, even if I have 54613 graphs == 54613 molecules, the molecules repeat themselves. For example, the first 20 elements of mol_code could be all 0 (the 0th molecule), while the next 21 all 1, and so on without a fixed dimension. For splitting this dataset, I do the following:
# Let's say I want the training to be all entries corresponding to molecule 0, 1, and 2
molecules = [0, 1, 2]
training_dataset = dataset[torch.isin(dataset.data.mol_code, torch.tensor(molecules))]
However, now I want to have a training_dataset with replacement. For example, molecules would be [0, 1, 1].
The easiest thing I tried was to do:
# some function generates randomly molecules with repetitions
# imagine molecules is [0, 1, 1]
training_dataset = dataset[torch.isin(dataset.data.mol_code, torch.tensor(molecules))]
The problem is that the training set has now obviously shrunk, because it takes all the entries whose mol_code is either 0 or 1, but it does not take the entries with mol_code 1 twice.
There is a sampler method I read about, but from what I understood it will sample individual entries, rather than group them by mol_code. Any ideas?
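One possible approach (my sketch, not from the original post) is to build the index tensor yourself: for every id in molecules, even a repeated one, collect the indices of the graphs with that mol_code and concatenate them, so repeated ids contribute their graphs more than once. This assumes the dataset can be indexed with an integer tensor, just as it is indexed with the boolean mask above.
import torch

molecules = [0, 1, 1]  # sampled with replacement

# Indices of all graphs belonging to each (possibly repeated) molecule id.
index = torch.cat([
    (dataset.data.mol_code == m).nonzero(as_tuple=True)[0]
    for m in molecules
])
training_dataset = dataset[index]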
I wonder how I can get the size or the len of the dataset after applying a filter. Using tf.data.experimental.cardinality gives -2, and this is not what I am looking for! I want to know how many filtered samples exist in my dataset, in order to be able to split it into training and validation datasets using take() and skip().
Example:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
dataset = dataset.filter(lambda x: x < 4)
size = tf.data.experimental.cardinality(dataset).numpy()
#size here is equal to -2 but I want to get the real size which is 3
My dataset contains images and their labels; this is just an illustrative example.
Taking a look at the documentation reveals that a cardinality of -2 means that TensorFlow is unable to determine the cardinality of the dataset (it is the value of tf.data.experimental.UNKNOWN_CARDINALITY). For your example, you can do
# materialise the filtered elements into a list; `dataset` itself stays a
# tf.data.Dataset, so you can still call take() and skip() on it afterwards
examples = list(dataset.as_numpy_iterator())
print(len(examples))  # 3
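If you would rather not pull every element into Python memory, another possibility (a sketch, using the same filtered dataset as above) is to count the elements by folding over the dataset with reduce:
# Count elements without materialising them: increment a counter per element.
size = dataset.reduce(0, lambda count, _: count + 1).numpy()
print(size)  # 3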
I am trying to plot my data in a 2-dimensional space using sklearn PCA. I want to re-use the same PCA representation to plot several datasets afterwards, but let us focus on one set first.
When I run fit_transform on my data I get the following result:
from sklearn.decomposition import PCA as sklearnPCA
import matplotlib.pyplot as plt

sklearn_pca = sklearnPCA(n_components=2, random_state=55)
X_train_proj = sklearn_pca.fit_transform(X_train)

plt.scatter(X_train_proj[:, 0],
            X_train_proj[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 1: https://i.ibb.co/B4FcV08/capture-1.png
When I run transform on the same data, using the PCA object fitted above by fit_transform, here is what I get:
X_train_proj_2 = sklearn_pca.transform(X_train)

plt.scatter(X_train_proj_2[:, 0],
            X_train_proj_2[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 2: https://i.ibb.co/0MS3Jhy/capture-2.png
My data contains absolutely no NAs and is already scaled. The size, however, is quite big, as I have ~11,000 rows and ~20 columns.
I have also quickly checked that my columns are not correlated by computing the correlation matrix.
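For what it's worth, on the same fitted PCA object, fit_transform(X) and transform(X) should give numerically identical projections (up to floating point); a quick check, assuming the arrays from the snippets above:
import numpy as np

# Maximum absolute difference between the two projections; expected to be ~0.
print(np.abs(X_train_proj - X_train_proj_2).max())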
Just getting started with this library... having some issues (I've read the docs but didn't get clarity) with RandomForestClassifier.
My question is pretty simple: say I have a training data set like
A B C
1 2 3
Where A is the dependent variable (y) and B and C are the independent variables (x). Let's say the test set looks the same; however, the order is
B A C
1 2 3
When I call forest.fit(train_data[0:, 1:], train_data[0:, 0]),
do I then need to reorder the test set to match this order before running predictions? (Ignoring the fact that I need to remove the already-predicted y value (A), so let's just say B and C are out of order...)
Yes, you need to reorder them. Imagine a simpler case, linear regression. The algorithm calculates a weight for each of the features, so, for example, if feature 1 is unimportant, it will be assigned a weight close to 0.
If at prediction time the order is different, an important feature will be multiplied by this almost-zero weight, and the prediction will be totally off.
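A tiny sketch of that argument (my own illustration with made-up data, not from the original answer):
import numpy as np
from sklearn.linear_model import LinearRegression

# Two features: the first drives the target, the second is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.arange(100.0), rng.normal(size=100)])
y = 3 * X[:, 0]

model = LinearRegression().fit(X, y)
print(model.coef_)                 # roughly [3, ~0]

# Swapping the columns at prediction time pairs the important feature with
# the near-zero weight, so the predictions are far off.
print(model.predict(X[:5]))        # close to y[:5]
print(model.predict(X[:5, ::-1]))  # wildly different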
elyase is correct. scikit-learn will simply take the data in whatever order you give it. Hence, you'll have to ensure that the data is in the same order during training and prediction time.
Here's a simple illustrating example:
Training time:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y)
# we now have a model that
# (i) predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]
Prediction time:
# positive example
http_request_payload = {
'feature_1': 0,
'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 0, as expected
# negative example
http_request_payload = {
'feature_2': 1, # notice that the order is jumbled up
'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 1, when it should have returned 0.
# scikit-learn doesn't care about the key-value mapping of the features.
# it simply vectorizes the dataframe in whatever order it comes in.
This is how I cache the column order during training so that I can use it during prediction time.
# training
x = pd.DataFrame([...])
column_order = x.columns
model = SomeModel().fit(x, y) # train model
# save the things that we need at prediction time. you can also use pickle if you don't want to pip install joblib
import joblib
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.joblib')
# load the artifacts from disk
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.joblib')
# imaginary http request payload
request_payload = { 'feature_1': ..., 'feature_2': ... }
# build a single-row dataframe with the columns in the training order
input_features = pd.DataFrame([request_payload], columns=column_order)
input_features = input_features.fillna(0)  # handle any missing data however you like
model.predict(input_features.values.tolist())
I suppose this is possible, since the definition of the fit function says:
X : array-like, shape = [n_samples, n_features]
Now I have a plot of the fitted tree (image not shown), with the nodes labelled X[0], X[1], and so on.
I can certainly generate a string representation of the decision tree and then replace X[] with the actual feature names. But I wonder whether the fit function could take feature names directly as part of its input. I tried the following formats for each sample:
[1, 2, "feature_1", "feature_2"]
[[1, 2], ["feature_1", "feature_2"]]
but neither worked. What does that shape mean? Could you please give me an example?
The fit function itself doesn't support anything like that. However, you can draw the decision tree, including feature labels, with the export_graphviz member function. (Isn't this how you generated the tree above?) Essentially, you'd do something like this:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
t = tree.DecisionTreeClassifier()
fitted_tree = t.fit(iris.data, iris.target)
tree.export_graphviz(fitted_tree, out_file='filename.dot', feature_names=iris.feature_names)
This will produce a 'dot' file, which graphviz (which must be installed separately) can then "render" into a traditional image format (PostScript, PNG, etc.). For example, to make a PNG file, you'd run:
dot -Tpng filename.dot > filename.png
The dot file itself is a plain-text format and fairly self-explanatory. If you wanted to tweak the text, a simple find-replace in the text editor of your choice would work. There are also python modules for directly interacting with graphviz and its files. PyDot seems to be pretty popular, but there are others too.
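For instance, here is a minimal sketch of rendering the dot file from Python with the graphviz package (an assumption on my part; pydot or the command line above work just as well):
import graphviz

# Read the dot source written by export_graphviz and render it to a PNG.
with open('filename.dot') as f:
    src = graphviz.Source(f.read())
src.render('filename', format='png')  # writes filename.png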
The shape reference in fit's documentation just refers to the layout of X, the training data matrix. Specifically, it expects the first index to vary over training examples, while the second index refers to features. For example, suppose your data's shape is (150, 4), as is the case for iris.data. The fit function will interpret it as containing 150 training examples, each of which consists of four values.
X should be a two-dimensional numpy ndarray where each row corresponds to a sample and each column represents the values of a feature. The shape refers to the number of rows and columns of the feature data X.
An example of a valid X which contains 3 samples and 2 features:
import numpy as np

X = np.array([[2, 2], [2, 0], [0, 2]])
y = np.array([0, 1, 1])
print(X.shape)  # Output: (3, 2)
where the first sample has the value 2 for both the first and second feature.
If you have a representation of the feature data as a list of dicts (each dict corresponds to a single sample) like so
D = [
{'feature1': 2, 'feature2': 2},
{'feature1': 2, 'feature2': 0},
{'feature1': 0, 'feature2': 2}
]
then you can use DictVectorizer to produce the matrix X:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
X = v.fit_transform(D)
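As a quick check of what this produces (my addition; get_feature_names_out requires a reasonably recent scikit-learn version):
print(v.get_feature_names_out())  # ['feature1' 'feature2']
print(X)
# [[2. 2.]
#  [2. 0.]
#  [0. 2.]]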