I am working with the following torch_geometric dataset object. It consists of a multitude of graphs, each representing a molecule. To give an idea, this is the data inside the Dataset.
Data(x=[1118918, 43], edge_attr=[2161762, 6], edge_index=[2, 2161762], y=[54613, 1], mol_code=[54613])
There is x (atom or node features), edge_attr (bond or edge features), edge_index (adjacency list), and y (target). There are 54613 graphs in this dataset: each graph is a molecule. To be clear, the dimension of x is [average_number_of_atoms * 54613, n_atom_features] and that of edge_attr is [average_number_of_bonds * 54613, n_bond_features].
What is mol_code then? It pairs each graph with a molecule id. In fact, even though I have 54613 graphs == 54613 molecules, the molecules repeat themselves. For example, the first 20 elements of mol_code could all be 0 (the 0th molecule), while the next 21 are all 1, and so on, without a fixed group size. For splitting this dataset, I do the following:
# Let's say I want the training to be all entries corresponding to molecule 0, 1, and 2
molecules = [0, 1, 2]
training_dataset = dataset[torch.isin(dataset.data.mol_code, torch.tensor(molecules))]
However, now I want to have a training_dataset with replacement. For example, molecules would be [0, 1, 1].
The easiest thing I tried was to do:
# some function generates randomly molecules with repetitions
# imagine molecules is [0, 1, 1]
training_dataset = dataset[torch.isin(dataset.data.mol_code, torch.tensor(molecules))]
The problem is that the training set has now obviously shrunk, because it's taking all the entries that have mol_code either 0 or 1, but it's not taking the entries with mol_code 1 twice.
There is a sampler method I read about, but from what I understand it samples individual entries rather than grouping them by mol_code. Any ideas?
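One workaround I can sketch here, under the assumption that the dataset can be indexed with an integer index tensor containing duplicates (as recent PyTorch Geometric versions allow): collect the graph indices of each requested molecule and concatenate them, so a molecule that appears twice in the list contributes its graphs twice.
import torch

# molecules drawn with replacement, e.g. molecule 1 appears twice
molecules = [0, 1, 1]
mol_code = dataset.data.mol_code  # one entry per graph

# concatenate the graph indices of each requested molecule, keeping duplicates
idx = torch.cat([(mol_code == m).nonzero(as_tuple=True)[0] for m in molecules])
training_dataset = dataset[idx]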
I have a network with 10 nodes. There are two connected sub-graphs in the network (from the code below): one subnetwork is nodes 0-1-2-3-4-5 linked together, and the other is nodes 6-7-8-9 linked together.
Each of these nodes has a set of features (as described in the X vector), so there are 10 of these values (nodes 0-9).
There are two classes (i.e. subnetwork 1 is in class 0 and subnetwork 2 is in class 1).
The ultimate aim is to be able to classify each subnetwork into its class.
I wrote this code, which describes the whole graph as one entity:
import numpy as np
import torch
from torch_geometric.data import Data

edge_origins = np.array([0, 1, 2, 3, 4, 6, 7, 8])
edge_destinations = np.array([1, 2, 3, 4, 5, 7, 8, 9])
target = np.array([0, 1])
x = [np.array([0.1, 0.5, 0.2]), np.array([0.5, 0.6, 0.23]),
     np.array([0.1, 0.5, 0.5]), np.array([0.1, 0.6, 0.23]),
     np.array([0.1, 0.4, 0.4]), np.array([0.52, 0.6, 0.23]),
     np.array([0.1, 0.3, 0.3]), np.array([0.3, 0.6, 0.23]),
     np.array([0.1, 0.1, 0.2]), np.array([0.4, 0.6, 0.23])]

edge_index = torch.tensor(np.array([edge_origins, edge_destinations]), dtype=torch.long)
x = torch.tensor(np.array(x), dtype=torch.float)  # (10, 3) node feature matrix
y = torch.tensor(target, dtype=torch.long)
dataset = Data(x=x, edge_index=edge_index, y=y, num_classes=len(set(target)))
However, my problem is that I cannot divide this into train and test, because it's just one graph.
My starting data is 1D (flat lists, as in the example above), and I need to convert it to 2D (one list per subnetwork) and feed each subnetwork into a PyTorch network one at a time.
I was trying to do:
from itertools import groupby
from operator import itemgetter

edge_origins_list = []
edge_destinations_list = []
# group consecutive node ids together (index - value is constant within a run)
for k, g in groupby(enumerate(edge_origins), lambda i_x: i_x[0] - i_x[1]):
    edge_origins_list.append(list(map(itemgetter(1), g)))
for k, g in groupby(enumerate(edge_destinations), lambda i_x: i_x[0] - i_x[1]):
    edge_destinations_list.append(list(map(itemgetter(1), g)))
print(edge_origins_list)
print(edge_destinations_list)
This splits the origin and destination lists as required:
[[0, 1, 2, 3, 4], [6, 7, 8]]
[[1, 2, 3, 4, 5], [7, 8, 9]]
And now I'm unclear on two things, but I'll ask them as two separate questions on SO for succinctness. How do I split the x features list so that it follows the node pattern of which subnetwork each node belongs to? I was trying:
for i in edge_origins_list:
    length = len(i)  # number of entries in this subnetwork's origin list
    x[0:length]
...but then I'm not sure about the iteration part.
So then the output should be:
edge_origins = np.array([[0,1,2,3,4],[6,7,8]])
edge_destinations = np.array([[1,2,3,4,5],[7,8,9]])
target = np.array([0,1])
x = [[np.array([0.1,0.5,0.2]),np.array([0.5,0.6,0.23]),
np.array([0.1,0.5,0.5]),np.array([0.1,0.6,0.23]),
np.array([0.1,0.4,0.4]),np.array([0.52,0.6,0.23])],
[np.array([0.1,0.3,0.3]),np.array([0.3,0.6,0.23]),
np.array([0.1,0.1,0.2]),np.array([0.4,0.6,0.23])]]
which corresponds to the first 6 nodes with 5 edges and class 0, and then 4 nodes with 3 edges and class 1.
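For illustration (this is my own sketch, not part of the original question), one possible way to turn the grouped lists above into one Data object per subnetwork, so the two graphs can then be split into train and test sets; the node groups 0-5 and 6-9 are assumed from the edge lists:
from torch_geometric.data import Data

node_groups = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]  # assumed node membership per subnetwork
graphs = []
for nodes, origins, dests, label in zip(node_groups, edge_origins_list,
                                        edge_destinations_list, target):
    offset = nodes[0]  # re-index nodes so each subgraph starts at node 0
    sub_edge_index = torch.tensor([[o - offset for o in origins],
                                   [d - offset for d in dests]], dtype=torch.long)
    sub_x = x[nodes[0]:nodes[-1] + 1]  # slice the (10, 3) node feature tensor
    graphs.append(Data(x=sub_x, edge_index=sub_edge_index,
                       y=torch.tensor([label], dtype=torch.long)))

# graphs[0] and graphs[1] are now separate subnetworks that can be assigned
# to training and test sets (e.g. via a torch_geometric DataLoader).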
I'm trying to cluster and visualise some data with xmeans from the pyclustering lib.
I copied the code directly from the example in the documentation,
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import SIMPLE_SAMPLES
sample = X # read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
# Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will
# start analysis.
amount_initial_centers = 2
initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize()
# Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum
# number of clusters that can be allocated is 20.
xmeans_instance = xmeans(sample, initial_centers, 20)
xmeans_instance.process()
# Extract clustering results: clusters and their centers
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()
# Print total sum of metric errors
print("Total WCE:", xmeans_instance.get_total_wce())
# Visualize clustering results
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, sample)
visualizer.append_cluster(centers, None, marker='*', markersize=10)
visualizer.show()
The only difference is that I assigned to sample the value of my matrix X instead of loading a sample dataset.
When I try to visualise the clustering result I get this error:
Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.
My X matrix is generated in this way:
features = ["I", "Iu", other 7 column names]
data = df[features]
...
X = scaler.fit_transform(data)
Is there a way to visualise the clusters by plotting only two or three features at a time? I can't find anything about this in the documentation.
I tried this:
visualizer.append_clusters(clusters, sample[:,[0,1]])
in order to visualise only the first two features, but I got this error:
Only clusters with the same dimension of objects can be displayed on canvas.
EDIT:
I updated the code as suggested in the answer by annoviko but now I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-69-6fd7d2ce5fcd> in <module>
20 visualizer.append_clusters(clusters, X)
21 visualizer.append_cluster(centers, None, marker='*', markersize=10)
---> 22 visualizer.show(pair_filter=[[0, 1], [0, 2]])
/usr/local/lib/python3.8/site-packages/pyclustering/cluster/__init__.py in show(self, pair_filter, **kwargs)
224 raise ValueError("There is no non-empty clusters for visualization.")
225
--> 226 cluster_data = self.__clusters[0].data or self.__clusters[0].cluster
227 dimension = len(cluster_data[0])
228
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
It is raised by visualizer.show(), and it happens even if I remove the pair_filter argument from the call.
In line with the error that you got:
Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.
You have to use cluster_visualizer_multidim, as the error message suggests. There is documentation (pyclustering 0.10.1) with an example: https://pyclustering.github.io/docs/0.10.1/html/dc/d6b/classpyclustering_1_1cluster_1_1cluster__visualizer__multidim.html
For example, if you have data with dimension D > 3 and you want to display (x0, x1) and (x0, x2), you can do it in the following way:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, sample_4d)
visualizer.show(pair_filter=[[0, 1], [0, 2]])
where pair_filter specifies which feature pairs should be shown. In the example above, it will show only (x0, x1) - [0, 1] and (x0, x2) - [0, 2].
So, in your particular case, where you want to display only the first two features, it should be:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, sample)
visualizer.show(pair_filter=[[0, 1]])
I think I should make the error message more readable and have it propose the other class in its first sentence. Let me know if this helps (if it is still relevant for you).
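Regarding the follow-up ValueError from the EDIT: the failing line applies `or` to a NumPy array, which is ambiguous. A workaround that may help (an assumption on my side, based on the visualizer internally expecting plain Python sequences) is to convert X to a list of lists before appending it:
visualizer = cluster_visualizer_multidim()
visualizer.append_clusters(clusters, X.tolist())  # plain lists avoid the ambiguous truth-value check
visualizer.show(pair_filter=[[0, 1], [0, 2]])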
I am trying to plot my data in a 2-dimensional space using sklearn PCA. I want to re-use the same PCA representation to plot several datasets afterwards, but let us focus on one set first.
When I run a sklearn.fit_transform on my data I get the following result:
from sklearn.decomposition import PCA as sklearnPCA  # assuming this alias
import matplotlib.pyplot as plt

sklearn_pca = sklearnPCA(n_components=2, random_state=55)
X_train_proj = sklearn_pca.fit_transform(X_train)

plt.scatter(X_train_proj[:, 0],
            X_train_proj[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 1: https://i.ibb.co/B4FcV08/capture-1.png
When I run a sklearn transform on the same data, using the PCA object fitted above with fit_transform, here is what I get:
X_train_proj_2 = sklearn_pca.transform(X_train)

plt.scatter(X_train_proj_2[:, 0],
            X_train_proj_2[:, 1],
            c=dic[y_train.astype(int)],
            s=y_train * 10 + 1)
Output 2: https://i.ibb.co/0MS3Jhy/capture-2.png
My data contains absolutely no NAs and is already scaled. The size, however, is quite big, as I have ~11,000 rows and ~20 columns.
I have also quickly checked that my columns are not correlated by computing the correlation matrix.
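A sanity check worth running here (my suggestion, not part of the original post) is to compare the two projections numerically; if they are identical, the difference lies in the plotting rather than in the PCA itself:
import numpy as np
print(np.allclose(X_train_proj, X_train_proj_2))  # True would mean both projections match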
Just getting started with this library... having some issues (I've read the docs but didn't get clarity) with RandomForestClassifier.
My question is pretty simple: say I have a training data set like
A B C
1 2 3
where A is the target variable (y) and B-C are the feature variables (x). Let's say the test set looks the same, except the order is
B A C
1 2 3
When I call forest.fit(train_data[0:,1:],train_data[0:,0])
do I then need to reorder the test set to match this order before predicting? (Ignoring the fact that I need to remove the already-predicted y value (A), so let's just say B and C are out of order...)
Yes, you need to reorder them. Imagine a simpler case, linear regression. The algorithm will calculate a weight for each of the features, so, for example, if feature 1 is unimportant it will be assigned a weight close to 0.
If at prediction time the order is different, an important feature will be multiplied by this almost null weight, and the prediction will be totally off.
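A tiny numeric illustration of this (mine, not part of the original answer), using made-up weights:
import numpy as np

w = np.array([5.0, 0.01])          # learned weights: feature 1 matters, feature 2 barely does
print(w @ np.array([1.0, 100.0]))  # correct order  -> 6.0
print(w @ np.array([100.0, 1.0]))  # swapped order  -> 500.01, a completely different prediction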
elyase is correct. scikit-learn will simply take the data in whatever order you give it. Hence, you'll have to ensure that the data is in the same order during training and prediction time.
Here's a simple illustrating example:
Training time:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y)
# we now have a model that
# (i) predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]
Prediction time:
# positive example
http_request_payload = {
'feature_1': 0,
'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 0, as expected
# negative example
http_request_payload = {
'feature_2': 1, # notice that the order is jumbled up
'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 1, when it should have returned 0.
# scikit-learn doesn't care about the key-value mapping of the features.
# it simply vectorizes the dataframe in whatever order it comes in.
This is how I cache the column order during training so that I can use it during prediction time.
# training
x = pd.DataFrame([...])
column_order = x.columns
model = SomeModel().fit(x, y) # train model
# save the things that we need at prediction time. you can also use pickle if you don't want to pip install joblib
import joblib
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.txt')
# load the artifacts from disk
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.txt')
# imaginary http request payload
request_payload = { 'feature_1': ..., 'feature_2': ... }
# create empty dataframe with the right shape and order (using column_order)
input_features = pd.DataFrame([], columns=column_order)
input_features = input_features.append(request_payload, ignore_index=True)
input_features = input_features.fillna(0) # handle any missing data however you like
model.predict(input_features.values.tolist())
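A shorter variant (my own suggestion, not from the original answer) is to reindex the incoming frame directly against the cached column order:
# build the frame from the payload and force the training-time column order
input_features = pd.DataFrame([request_payload]).reindex(columns=column_order).fillna(0)
model.predict(input_features.values.tolist())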
I am trying to use the scikit-learn Python library to classify a bunch of URLs for the presence of certain keywords matching a user profile. A user has a name, email address ... and a URL assigned to them. I have created a txt file with the result of each profile-data match on each link, so it is in the format:
Name Email Address
0 1 0 => Relevant
1 1 0 => Relevant
0 1 1 => Relevant
0 0 0 => Not Relevant
where the 0 or 1 signifies whether the attribute was found on the page (each row is a webpage).
How do I give this data to scikit-learn so it can use it to run a classifier? The examples I have seen all have data coming from a predefined scikit-learn dataset such as digits or iris, or are generated in the required format. I just don't know how to provide the data in the format I have to the library.
The above is a toy example, and I have many more features than 3.
The data needed is a numpy array (in this case a "matrix") with the shape (n_samples, n_features).
A simple way to read the csv file into the right format is to use numpy.genfromtxt. Also refer to this thread.
Let the contents of a csv file (say file.csv in the current working directory) be:
a,b,c,target
1,1,1,0
1,0,1,0
1,1,0,1
0,0,1,1
0,1,1,0
To load it we do
import numpy as np

data = np.genfromtxt('file.csv', delimiter=',', skip_header=1)
The skip_header=1 argument skips the header row (the a,b,c,target line), and delimiter=',' tells genfromtxt that the file is comma-separated. Refer to numpy's documentation for more details.
Once you load the data, you need to do some pre-processing based on your input data format. The preprocessing could be something like splitting the input and the targets (classification) or splitting the whole dataset into a training and validation set (for cross-validation).
To split the input (feature matrix) from the output (target vector) we do
features = data[:, :3]
targets = data[:, 3] # The last column is identified as the target
For the CSV data given above, the arrays will look like:
features = array([[1., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.],
                  [0., 0., 1.],
                  [0., 1., 1.]])  # shape = (5, 3)
targets = array([0., 0., 1., 1., 0.])  # shape = (5,)
Now these arrays are passed to the estimator object's fit method. If you are using the popular SVM classifier, then:
>>> from sklearn.svm import LinearSVC
>>> linear_svc_model = LinearSVC()
>>> linear_svc_model.fit(X=features, y=targets)
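As a quick follow-up (my addition, not part of the original answer), the fitted model can then predict labels for new rows with the same three feature columns:
>>> linear_svc_model.predict(features[:2])  # predicted labels for the first two rows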