Tutorial for scipy.cluster.hierarchy [closed] - python

I'm trying to understand how to manipulate a hierarchical clustering, but the documentation is too technical and I can't understand how it works.
Is there a tutorial that can help me get started, explaining some simple tasks step by step?
Let's say I have the following data set:
a = np.array([[0,   0  ],
              [1,   0  ],
              [0,   1  ],
              [1,   1  ],
              [0.5, 0  ],
              [0,   0.5],
              [0.5, 0.5],
              [2,   2  ],
              [2,   3  ],
              [3,   2  ],
              [3,   3  ]])
I can easily do the hierarchical clustering and plot the dendrogram:
from scipy.cluster.hierarchy import linkage, dendrogram

z = linkage(a)
d = dendrogram(z)
Now, how can I recover a specific cluster, say the one with elements [0, 1, 2, 4, 5, 6] in the dendrogram? And how can I get back the values of those elements?

There are three steps in hierarchical agglomerative clustering (HAC):
Quantify Data (metric argument)
Cluster Data (method argument)
Choose the number of clusters
Doing
z = linkage(a)
will accomplish the first two steps. Since you did not specify any parameters it uses the default values
metric = 'euclidean'
method = 'single'
So z = linkage(a) will give you a single-linkage hierarchical agglomerative clustering of a. This clustering is a hierarchy of solutions, and from this hierarchy you get some information about the structure of your data. What you might do now is:
Check which metric is appropriate, e.g. cityblock or chebyshev will quantify your data differently (cityblock, euclidean and chebyshev correspond to the L1, L2 and L_inf norms)
Check the different properties / behaviours of the methods (e.g. single, complete and average)
Check how to determine the number of clusters, e.g. by reading the wiki about it
Compute indices on the found solutions (clusterings), such as the silhouette coefficient, which gives you feedback on how well a point/observation fits the cluster it is assigned to. Different indices use different criteria to qualify a clustering.
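To come back to the original question of recovering a specific cluster, step 3 is done with fcluster. A minimal sketch (not from the original answer), assuming a and z = linkage(a) from the question:

import numpy as np
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a cophenetic distance of 0.6 (read off the dendrogram);
# with single linkage on this data, points [0, 1, 2, 4, 5, 6] merge below that height.
labels = fcluster(z, t=0.6, criterion='distance')

# Indices and original coordinates of the points in the same flat cluster as point 0:
members = np.where(labels == labels[0])[0]
print(members)      # expected: [0 1 2 4 5 6]
print(a[members])   # the values of those elements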
Here is a more complete example to start with:
import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt

a = np.array([[0.1, 2.5],
              [1.5, .4 ],
              [0.3, 1  ],
              [1  , .8 ],
              [0.5, 0  ],
              [0  , 0.5],
              [0.5, 0.5],
              [2.7, 2  ],
              [2.2, 3.1],
              [3  , 2  ],
              [3.2, 1.3]])

fig, axes23 = plt.subplots(2, 3)

for method, axes in zip(['single', 'complete'], axes23):
    z = hac.linkage(a, method=method)

    # Plotting
    axes[0].plot(range(1, len(z)+1), z[::-1, 2])
    knee = np.diff(z[::-1, 2], 2)
    axes[0].plot(range(2, len(z)), knee)

    num_clust1 = knee.argmax() + 2
    knee[knee.argmax()] = 0
    num_clust2 = knee.argmax() + 2

    axes[0].text(num_clust1, z[::-1, 2][num_clust1-1], 'possible\n<- knee point')

    part1 = hac.fcluster(z, num_clust1, 'maxclust')
    part2 = hac.fcluster(z, num_clust2, 'maxclust')

    clr = ['#2200CC' ,'#D9007E' ,'#FF6600' ,'#FFCC00' ,'#ACE600' ,'#0099CC' ,
           '#8900CC' ,'#FF0000' ,'#FF9900' ,'#FFFF00' ,'#00CC01' ,'#0055CC']

    for part, ax in zip([part1, part2], axes[1:]):
        for cluster in set(part):
            ax.scatter(a[part == cluster, 0], a[part == cluster, 1],
                       color=clr[cluster])

    m = '\n(method: {})'.format(method)
    plt.setp(axes[0], title='Screeplot{}'.format(m), xlabel='partition',
             ylabel='{}\ncluster distance'.format(m))
    plt.setp(axes[1], title='{} Clusters'.format(num_clust1))
    plt.setp(axes[2], title='{} Clusters'.format(num_clust2))

plt.tight_layout()
plt.show()
This produces a 2x3 figure: for each linkage method, a scree plot of the merge distances (with the knee estimates) and scatter plots of the two candidate partitions.
Related

Labels obtained from clustering seem visually incorrect

I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix into an affinity_matrix using a Gaussian kernel:
delta = 0.1
np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I transform the distance_matrix into a heatmap to get a better visual of the data:
import pandas as pd
import seaborn as sns

distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix into 3 clusters. Before running the actual clustering, I inspect the heatmap to anticipate the clusters. Clearly #8 is an outlier and will be a cluster on its own.
Next I run the actual clustering:
from sklearn.cluster import SpectralClustering

clustering = SpectralClustering(n_clusters=3,
                                assign_labels='kmeans',
                                affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output is
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So #8 is part of cluster 2, which contains three other data points, whereas I expected it to be a cluster on its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10? Please advise.
When we move away from relatively simple clustering algorithms such as k-means, whatever intuition we may carry along regarding algorithm results and expected behaviors breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description makes it clear enough that we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (#3).
Nevertheless, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value is arguably exactly that they can uncover regularities of different kinds in the data - arguably they would not be that useful if they just replicated results from existing algorithms like k-means, would they?
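As a further illustration (not part of the original answer), here is a hedged sketch that feeds the same precomputed distance matrix to plain hierarchical clustering from scipy.cluster.hierarchy; being distance-based, it should again single out #8:

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

condensed = squareform(distance_matrix)   # condensed form expected by linkage
z = linkage(condensed, method='average')
labels = fcluster(z, t=3, criterion='maxclust')
print(labels)   # with this data, #8 should end up in a cluster of its own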
The scikit-learn vignette Comparing different clustering algorithms on toy datasets might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets; its summary figure is worth a look.

Vehicle gear prediction using clustering algorithm (machine learning) [closed]

I am trying to predict which gear a vehicle is being driven in.
I have Engine_Speed and Vehicle_Speed columns in the data set.
I have tried the k-means clustering algorithm, but it didn't succeed.
Which algorithm should I use, and how do I implement it using Python?
Looking at the vehicle speed in relation to the engine speed, the different slopes should give the different gears.
My initial reaction would be to say that this is a linear regression problem. You don't have enough data for anything else. Looking at the data, though, we can see that it is actually two linear regression problems:
(figure: scatter plot of engine speed vs. vehicle speed, showing two roughly linear segments)
There is an inflection point at about 700 revs, so you should design a cutoff that selects one of two regression lines, depending on whether you are above or below the cutoff.
To determine the regression in Python, you can use any number of packages. In scikit-learn it looks like this:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
The example given there, using the Python console, is
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
Obviously you need to put your own data in X and y and in fact you would want two arrays for the two sections of your graph. You would also have two reg = LinearRegression().fit(X, y) expressions, and an if statement deciding which reg to use, depending on the input. The inflection point is at the intersection of your two regression lines.
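A minimal sketch of that two-regression idea (not from the original answer; the column names, the toy data and the 700 rpm cutoff are assumptions for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data with the two columns from the question
df = pd.DataFrame({'Engine_Speed': [600, 630, 660, 690, 720, 800, 900, 1000],
                   'Vehicle_Speed': [10, 11, 12, 13, 20, 30, 42, 55]})

CUTOFF = 700  # assumed inflection point in engine rpm

low = df[df['Engine_Speed'] < CUTOFF]
high = df[df['Engine_Speed'] >= CUTOFF]

reg_low = LinearRegression().fit(low[['Engine_Speed']], low['Vehicle_Speed'])
reg_high = LinearRegression().fit(high[['Engine_Speed']], high['Vehicle_Speed'])

def predict_speed(engine_speed):
    # Pick the regression line according to the cutoff
    reg = reg_low if engine_speed < CUTOFF else reg_high
    return reg.predict(pd.DataFrame({'Engine_Speed': [engine_speed]}))[0]

print(predict_speed(650), predict_speed(850))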
The two regression lines have the form y = m1 x + c1 and y = m2 x + c2, where m1, m2 are the gradients of the lines and c1, c2 the intercepts. At the point of intersection m1 x + c1 = m2 x + c2, so x = (c2 - c1) / (m1 - m2). If you don't want to do the maths, then you can use Shapely:
import shapely
from shapely.geometry import LineString, Point

# A, B and C, D are (x, y) tuples giving the endpoints of the two lines
line1 = LineString([A, B])
line2 = LineString([C, D])

int_pt = line1.intersection(line2)
point_of_intersection = int_pt.x, int_pt.y
print(point_of_intersection)
(taken from this answer on Stack Overflow: How do I compute the intersection point of two lines?)
After discussion with Sanjiv, here is the updated code (adapted from here: https://machinelearningmastery.com/clustering-algorithms-with-python/)
import os

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

matplotlib.use('TkAgg')

df = pd.read_excel("GearPredictionSanjiv.xlsx", sheet_name='FullData')
x = round(df['Engine_speed'])
y = df['Vehicle_speed']
if 'Ratio' not in df.columns or not os.path.exists('dataset.xlsx'):
    df['Ratio'] = round(x / y)

# X was not defined in the original snippet; clustering on the Ratio column is assumed here
X = df[['Ratio']]
model = KMeans(n_clusters=5)
# Fit the model
model.fit(X)
# Assign a cluster to each example
yhat = model.predict(X)
# Plot
plt.scatter(yhat, X['Ratio'], c=yhat, cmap=plt.cm.coolwarm)
# Show the plot
plt.show()
The question is somewhat confusing.
I assume you want to infer the vehicle speed from the engine speed. In that case there is only one feature in this dataset (the engine speed) and the class label is the vehicle speed. A simple IF-THEN-ELSE rule could solve this, but for the sake of answering your question with a machine-learning approach (e.g. a decision tree), I will show how to solve it as a classification problem using scikit-learn in Python.
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
### np.reshape(array, (-1, 1)) is to convert the array to 2D array
engine_speed = np.reshape([1124, 974, 405, 865, 754, 200], (-1, 1))
vehicle_speed = np.reshape([5, 4, 3, 4, 4, 2], (-1, 1))
test_engine_speed = np.reshape([1000, 900, 800, 700, 600, 500, 400], (-1, 1))
test_vehicle_speed = np.reshape([5, 4, 4, 4, 4, 3, 3], (-1, 1))
clf = tree.DecisionTreeClassifier()
clf = clf.fit(engine_speed, vehicle_speed)
y_pred = clf.predict(test_engine_speed)
print(accuracy_score(test_vehicle_speed, y_pred))
print(test_vehicle_speed.ravel()) # ravel() is to convert 2D array to 1D array
print(y_pred.ravel()) # ravel() is to convert 2D array to 1D array
I hope this would be helpful.

Linear/Order Preserving Clustering in Python

I want to group numbers in a list based on how 'large' the numbers are in comparison to their neighbors, and I want to do it continuously, via clustering if possible. To clarify, let me give you an example:
Suppose you have the list
lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
then, if we have 3 groups, it's obvious how to cluster. Running the k-means algorithm from sklearn (see code) confirms this. But, when the numbers in the list aren't that 'convenient', I run into trouble. Suppose you have the list:
lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
My problem now is two-fold:
I want some sort of 'order-preserving, linear' clustering which takes the order of the data into account. For the list above, the clustering algorithm should give me a desired output of the form
labels = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
If you look at this output, you also see that I want the value 6.2 to be clustered in the second cluster, i.e. I want the clustering algorithm to see it as an outlier, not as an entirely new cluster.
EDIT: For clarification, I want to be able to specify the number of clusters in the linear clustering process, i.e. the 'end total' of clusters.
Code:
import numpy as np
from sklearn.cluster import KMeans
lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output
lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]
As mentioned, I think a straightforward(ish) way to get the desired results is to just use a normal k-means clustering and then modify the generated output as desired.
Explanation: the idea is to take the k-means outputs and iterate through them, keeping track of the previous item's cluster group and the current cluster group, and controlling when new clusters are created based on a few conditions. Explanations are in the code.
import numpy as np
from sklearn.cluster import KMeans

lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1, 1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1, 1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]

def linear_order_clustering(km_labels, outlier_tolerance=1):
    '''Expects clustering outputs as an array/list'''
    prev_label = km_labels[0]  # keeps track of last seen item's real cluster
    cluster = 0                # like a counter for our new linear clustering outputs
    result = [cluster]         # initialize first entry
    for i, label in enumerate(km_labels[1:]):
        if prev_label == label:
            # just written for clarity of control flow,
            # do nothing special here
            pass
        else:  # current cluster label did not match previous label
            # check if previous cluster label reappears
            # on the right of current cluster label position
            # (aka current non-matching cluster is sandwiched
            # within a reasonable tolerance)
            if (outlier_tolerance and
                    prev_label in km_labels[i + 1: i + 2 + outlier_tolerance]):
                label = prev_label  # if so, overwrite current label
            else:
                cluster += 1  # it's genuinely a new cluster
        result.append(cluster)
        prev_label = label
    return result
Note that I have only tested this with a tolerance of 1 outlier, and cannot promise it works as-is out of the box for all cases. It should get you started, however.
Output:
print(km.labels_)
result = linear_order_clustering(km.labels_)
print(result)
[1 1 0 0 0 2 0 0 1 1]
[0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
I would approach this in a couple of passes. First, one function/method analyses the data to determine the cluster centers for each group and returns an array of those centers. A second function/method then takes those centers along with the list, assembles a list with the cluster id of each number in the list, and returns that list sorted. A sketch of this idea follows.
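A minimal sketch of that two-pass idea (using KMeans for the first pass is an assumption; the answer does not say how the centers are found):

import numpy as np
from sklearn.cluster import KMeans

def find_centers(values, k):
    # First pass: determine the k cluster centers
    km = KMeans(n_clusters=k, n_init=10).fit(np.array(values).reshape(-1, 1))
    return sorted(c[0] for c in km.cluster_centers_)

def assign_ids(values, centers):
    # Second pass: give every number the id of its nearest center
    return [int(np.argmin([abs(v - c) for c in centers])) for v in values]

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
centers = find_centers(lst, 3)
print(assign_ids(lst, centers))

Note that this by itself only reproduces a nearest-center labelling; the order-preserving post-processing from the answer above is still needed.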
Define a threshold.
If the values of x[i] and x[i-1] differ too much, begin a new segment.
For better results, look at KDE and CUSUM approaches.
Don't use clustering. It has a different objective.
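A minimal sketch of that thresholding idea (the threshold value is an assumption, chosen by eye for the example list):

def segment(values, threshold=10.0):
    # Start a new segment whenever consecutive values differ by more than the threshold
    labels = [0]
    for prev, cur in zip(values, values[1:]):
        labels.append(labels[-1] + 1 if abs(cur - prev) > threshold else labels[-1])
    return labels

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
print(segment(lst))   # [0, 0, 1, 1, 1, 2, 3, 3, 4, 4]

Note that this purely sequential rule treats the 6.2 outlier as its own segment; absorbing such outliers needs an extra tolerance step like the one in the answer above.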
I had a similar problem and solved it as follows:
Given a distances matrix between all the elements,
I either do a bottom-up clustering (merging the two "most similar" elements/sub-clusters) or a top-down clustering (splitting a group of elements into the "most different" sub-clusters);
To compute the distance between sub-clusters I aggregate the distances of all the elements in them (the default method is taking the average, using the minimal or maximal distance is also possible).
Either way this results in a hierarchical clustering which you can then cut to produce any desired number of clusters.
It seems the bottom-up method gave better results, but YMMV.
Here's the code for the bottom-up method (in R). It builds:
A merge matrix, where every row has two columns with the indices of the next two things to merge: negative indices for elements and positive indices for previously created sub-clusters (R uses 1-based indices)
A height array containing the distance between the two merged elements/sub-clusters. This is added to the maximal height of the merged things (0 for leaf elements) so heights are always increasing (for display of the tree, or as R calls it, the "dendrogram").
This can be used to create R hclust objects which can be displayed and manipulated in various ways.
This isn't the most efficient possible implementation, but it gets the work done in a reasonable amount of time. A more efficient approach would shrink the distances matrix as it goes, at the cost of extra bookkeeping to track the mapping between the indices of the smaller matrix and the original elements:
bottom_up <- function(distances, aggregation) {
    aggregate <- switch(aggregation, mean=mean, min=min, max=max)
    rows_count <- dim(distances)[1]
    diag(distances) <- Inf
    merge <- matrix(0, nrow=rows_count - 1, ncol=2)
    height <- rep(0, rows_count - 1)
    merged_height <- rep(0, rows_count)
    groups <- -(1:rows_count)
    for (merge_index in 1:(rows_count - 1)) {
        adjacent_distances <- pracma::Diag(distances, 1)
        low_index <- which.min(adjacent_distances)
        high_index <- low_index + 1
        grouped_indices <- sort(groups[c(low_index, high_index)])
        merged_indices <- which(groups %in% grouped_indices)
        groups[merged_indices] <- merge_index
        merge[merge_index,] <- grouped_indices
        height[merge_index] <- max(merged_height[merged_indices]) + adjacent_distances[low_index]
        merged_height[merged_indices] <- height[merge_index]
        merged_distances <- apply(distances[,merged_indices], 1, aggregate)
        distances[,merged_indices] <- merged_distances
        distances[merged_indices,] <- rep(merged_distances, each=length(merged_indices))
        distances[merged_indices, merged_indices] <- Inf
    }
    return (list(merge=merge, height=height))
}
The pracma::Diag(distances, 1) fetches the offset-by-1 diagonal (above the main diagonal).

generating correlated numbers in numpy / pandas

I’m trying to generate simulated student grades in 4 subjects, where a student record is a single row of data. The code shown here will generate normally distributed random numbers with a mean of 60 and a standard deviation of 15.
df = pd.DataFrame(15 * np.random.randn(5, 4) + 60, columns=['Math', 'Science', 'History', 'Art'])
What I can’t figure out is how to make it so that a student’s Science mark is highly correlated to their Math mark, and that their History and Art marks are less so, but still somewhat correlated to the Math mark.
I'm neither a statistician nor an expert programmer, so a less sophisticated but more easily understood solution is what I'm hoping for.
Let's put what has been suggested by @Daniel into code.
Step 1
Let's import multivariate_normal:
import numpy as np
from scipy.stats import multivariate_normal as mvn
Step 2
Let's construct covariance data and generate data:
cov = np.array([[1, 0.8,.7, .6],[.8,1.,.5,.5],[0.7,.5,1.,.5],[0.6,.5,.5,1]])
cov
array([[ 1. , 0.8, 0.7, 0.6],
[ 0.8, 1. , 0.5, 0.5],
[ 0.7, 0.5, 1. , 0.5],
[ 0.6, 0.5, 0.5, 1. ]])
This is the key step. Note that the covariance matrix has 1's on the diagonal, and the covariances decrease as you step from left to right.
Now we are ready to generate the data, let's say 1,000 points:
scores = mvn.rvs(mean = [60.,60.,60.,60.], cov=cov, size = 1000)
Sanity check (from covariance matrix to simple correlations):
np.corrcoef(scores.T):
array([[ 1. , 0.78886583, 0.70198586, 0.56810058],
[ 0.78886583, 1. , 0.49187904, 0.45994833],
[ 0.70198586, 0.49187904, 1. , 0.4755558 ],
[ 0.56810058, 0.45994833, 0.4755558 , 1. ]])
Note that np.corrcoef expects each variable in a row, which is why scores is transposed.
Finally, let's put your data into Pandas' DataFrame:
df = pd.DataFrame(data = scores, columns = ["Math", "Science","History", "Art"])
df.head()
Math Science History Art
0 60.629673 61.238697 61.805788 61.848049
1 59.728172 60.095608 61.139197 61.610891
2 61.205913 60.812307 60.822623 59.497453
3 60.581532 62.163044 59.277956 60.992206
4 61.408262 59.894078 61.154003 61.730079
Step 3
Let's visualize some data that we've just generated:
ax = df.plot(x = "Math",y="Art", kind="scatter", color = "r", alpha = .5, label = "Art, $corr_{Math}$ = .6")
df.plot(x = "Math",y="Science", kind="scatter", ax = ax, color = "b", alpha = .2, label = "Science, $corr_{Math}$ = .8")
ax.set_ylabel("Art and Science");
The statistical tool for that is the covariance matrix: https://en.wikipedia.org/wiki/Covariance.
Each cell (i, j) represents the dependency between variable i and variable j, so in your case it could be between Math and Science. If there is no dependency the value is 0.
What you did was assume that the covariance is a diagonal matrix with the same values on the diagonal. What you have to do instead is define your covariance matrix and then draw the samples from a Gaussian with numpy.random.multivariate_normal (https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html) or any other distribution function.
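A minimal sketch of that suggestion (not from the original answer; the covariance values are illustrative):

import numpy as np
import pandas as pd

# Correlation-style covariance: Math/Science strongly related, History/Art
# more weakly related to Math (values chosen for illustration only)
cov = np.array([[1.0, 0.8, 0.4, 0.3],
                [0.8, 1.0, 0.3, 0.3],
                [0.4, 0.3, 1.0, 0.2],
                [0.3, 0.3, 0.2, 1.0]])

rng = np.random.default_rng(0)
scores = rng.multivariate_normal([60, 60, 60, 60], 15**2 * cov, size=1000)
df = pd.DataFrame(scores, columns=['Math', 'Science', 'History', 'Art'])
print(df.corr().round(2))   # empirical correlations should be close to cov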
Thank you guys for the responses; they were extremely useful. I adapted the code provided by Sergey to produce the result I was looking for, which was records with Math and Science marks that are relatively close most of the time, and History and Art marks that are more independent.
The following produced data that looks reasonable:
cov = np.array([[1, 0.5,.2, .1],[.5,1.,.1,.1],[0.2,.1,1,.3],[0.1,.1,.3,1]])
scores = mvn.rvs(mean = [0.,0.,0.,0.], cov=cov, size = 100)
df = pd.DataFrame(data = 15 * scores + 60, columns = ["Math","Science","History", "Art"])
df.head(10)
The next step would be to make it so that each subject has a different mean, but I have an idea of how to do that. Thanks again.
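A hedged sketch of that next step (not from the original post; the per-subject means and spreads are assumptions): because correlations are scale-invariant, each column can be given its own mean and standard deviation without changing the correlation structure.

import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn

# Same correlation structure as above
cov = np.array([[1, 0.5, .2, .1], [.5, 1., .1, .1], [0.2, .1, 1, .3], [0.1, .1, .3, 1]])

means = np.array([65., 62., 58., 70.])   # assumed per-subject means
stds = np.array([12., 14., 15., 10.])    # assumed per-subject standard deviations

scores = mvn.rvs(mean=[0., 0., 0., 0.], cov=cov, size=100)
df = pd.DataFrame(scores * stds + means,
                  columns=["Math", "Science", "History", "Art"])
print(df.describe().round(1))   # column means/stds should be close to the targets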

How can we map training data to 0 to 1 for Theano?

As you might know, for a training task in Theano we have to map our training data to the range 0 to 1. My training data also contains negative values. Currently I am using this formula:
x' = (x - min(x)) / (max(x) - min(x))
which is implemented by this code:
for i in range(train_x.shape[0]):
    train_x[i, :] = (train_x[i, :] - train_x[i, :].min(0)) / train_x[i, :].ptp(0)
Is this formula correct? Do you have a better idea regarding feature rescaling?
Here is my way to normalize a variable t using your formula:
import numpy as np

t = np.array([[-1, -2, 3, 2], [2, 1, 3, -1]], dtype='float32')
b = (t - t.min(1).reshape(2, 1)) / t.ptp(1).reshape(2, 1)
print(b)
it gives the correct output:
[[0.2, 0, 1, 0.8]
[0.75, 0.5, 1, 0]]
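One caveat worth adding (my observation, not part of the original answer): the snippets above rescale each row, i.e. each sample, independently. If the goal is to rescale each feature (column) to [0, 1] across the whole training set, the same formula is applied along axis 0, with a small guard for constant columns:

import numpy as np

t = np.array([[-1, -2, 3, 2], [2, 1, 3, -1]], dtype='float32')

ptp = t.ptp(0)
b = (t - t.min(0)) / np.where(ptp == 0, 1, ptp)   # per-column min-max scaling
print(b)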
