Python combinations with replacement with random placement of values (list) - python

I am trying to create a list of combinations from a list ([0,1,2,4,6]).
I want combinations with 12 values.
Eg:
"(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2)"
"(0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2)"
"(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4)"
"(0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)"
"(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4)"
"(0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2)"
"(0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 4)"
"(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 6)"
"(0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2)"
This works perfectly, but now I want the position of the values in each output to be random.
Something like:
"(0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2)" should be "(2, 0, 2, 0, 1, 2, 2, 0, 2, 1, 2, 0)"
This is the code I have written:
from itertools import combinations_with_replacement
combinations_list = [comb for i in range(1, 13) for comb in combinations_with_replacement(numbers, i) if sum(comb) == match_points]
Here match_points can be any number; for the output above, match_points was 14 and numbers = [0, 1, 2, 4, 6].
How can I randomise the order of the values within each combination? Also, I need to restrict the number of 0s in a combination to at most 6.
Eg:
"(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 6, 6)"
"(0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 6)"
"(0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 6, 6)"
shouldn't be generated.

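Just shuffle each combination. Note that random.shuffle works in place and only on mutable sequences, so it cannot be applied to the tuples directly; random.sample returns a new, shuffled list instead.
import random
# .. code that builds combinations_list
shuffled = [tuple(random.sample(comb, len(comb))) for comb in combinations_list]  # values of each combination in random order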
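A minimal sketch of both requirements together, assuming combinations of exactly 12 values as in the example output and a hypothetical max_zeros limit of 6 (the names max_zeros and shuffled are not from the question):

```python
import random
from itertools import combinations_with_replacement

numbers = [0, 1, 2, 4, 6]
match_points = 14
max_zeros = 6  # assumed limit on the number of 0s per combination

shuffled = []
for comb in combinations_with_replacement(numbers, 12):
    # Keep only combinations with the right sum and not too many zeros.
    if sum(comb) == match_points and comb.count(0) <= max_zeros:
        # random.sample returns a new list with the same values in random order.
        shuffled.append(tuple(random.sample(comb, len(comb))))

print(len(shuffled), shuffled[:3])
```

Filtering before shuffling keeps the zero check cheap, since the count of 0s does not depend on the order of the values.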


Difference in prediction results from kmeans tsne on load_iris python

I am running KMeans clustering with the t-SNE dimensionality reduction technique on the iris dataset in Python. I get different prediction results when I load the iris dataset in two different ways.
Method 1:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
iris = load_iris()
X1 = iris.data
y1 = iris.target
km = KMeans(n_clusters = 3, random_state=146)
tsne = TSNE(perplexity = 30, random_state=146)
km.fit(X1)
X1_tsne = tsne.fit_transform(X1)
y1_pred = km.fit_predict(X1_tsne)
print(y1.tolist())
print(y1_pred.tolist())
print(X1[77])
print(y1[77])
print(y1_pred[77])
output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
[6.7 3. 5. 1.7]
1
2
Method 2:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
X2,y2 = load_iris(return_X_y=True, as_frame=True)
km = KMeans(n_clusters = 3, random_state=146)
tsne = TSNE(perplexity = 30, random_state=146)
# X2 & y2
km.fit(X2)
X2_tsne = tsne.fit_transform(X2)
y2_pred = km.fit_predict(X2_tsne)
print(y2.tolist())
print(y2_pred.tolist())
print(X2.iloc[77])
print(y2[77])
print(y2_pred[77])
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1]
sepal length (cm) 6.7
sepal width (cm) 3.0
petal length (cm) 5.0
petal width (cm) 1.7
Name: 77, dtype: float64
1
1
Why is index 77 predicted as 2 in Method 1 but as 1 in Method 2?

DBSCAN eps and min_samples

I have been trying to use DBSCAN to detect outliers. From my understanding, DBSCAN outputs -1 for an outlier and 1 for an inlier, but after running the code I get numbers that are neither -1 nor 1. Can someone please explain why? Also, is it normal to find the best value of eps by trial and error? I couldn't figure out a way to find the best possible eps value.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import DBSCAN
df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)
# Dropping columns with low feature importance
del df['AmbTemp_DegC']
del df['NacelleOrientation_Deg']
del df['MeasuredYawError']
# Applying DBSCAN (the instance gets its own name so it does not shadow the DBSCAN class)
dbscan = DBSCAN(eps=1.8, min_samples=10, n_jobs=-1)
df['anomaly'] = dbscan.fit_predict(df)
np.unique(df['anomaly'],return_counts=True)
(array([ -1, 0, 1, ..., 8462, 8463, 8464]),
array([1737565, 3539278, 4455734, ..., 13, 8, 8]))
Thank you.
You have slightly misunderstood how DBSCAN labels points.
This is quoted from Wikipedia:
A point p is a core point if at least minPts points are within
distance ε of it (including p).
A point q is directly reachable from p if point q is within distance ε
from core point p. Points are only said to be directly reachable from
core points.
A point q is reachable from p if there is a path p1, ..., pn with p1 =
p and pn = q, where each pi+1 is directly reachable from pi. Note that
this implies that all points on the path must be core points, with the
possible exception of q.
All points not reachable from any other point are outliers or noise
points.
In simpler terms, the idea is:
Any sample that has at least min_samples neighbours (counting itself) within distance eps is a core sample.
Any sample that is not a core sample but lies within eps of at least one core sample is directly reachable from it and is added to that core sample's cluster.
A cluster grows by chaining core samples: a sample is reachable if a path of core samples, each within eps of the next, leads to it. Non-core samples never extend a cluster further.
All remaining samples are considered noise or outliers, and those are labeled -1.
Depending on the clustering parameters (eps and min_samples), you are very likely to get more than one cluster, and scikit-learn numbers them 0, 1, 2, and so on. That is why you are seeing values other than -1 and 1 in the result of your clustering.
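The roles described above can be sketched in plain Python. This is only an illustration on a made-up 1-D dataset, not scikit-learn's implementation; classify_points and the sample values are hypothetical, and "border" is used here for a non-core sample that is directly reachable from a core sample.

```python
def classify_points(points, eps, min_samples):
    """Label each 1-D point as 'core', 'border', or 'noise'."""
    # Neighbourhood of p: every point within eps of p, including p itself.
    neighbours = [
        [j for j, q in enumerate(points) if abs(p - q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbours]

    roles = []
    for i in range(len(points)):
        if is_core[i]:
            roles.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            # Not core, but within eps of a core point.
            roles.append("border")
        else:
            roles.append("noise")
    return roles


points = [0.0, 0.1, 0.2, 0.3, 0.75, 5.0]
roles = classify_points(points, eps=0.5, min_samples=4)
print(list(zip(points, roles)))
```

Here the first four points are core, 0.75 is a border point (only 0.3 lies within eps of it), and 5.0 is noise, which DBSCAN would label -1.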
To answer your second question
Also is it normal to find the best value of eps using trial and error,
If you mean cross-validation (over a set where you know the cluster labels, or can approximate the correct clustering), then yes, I think that is the normal way to do it.
PS: The paper is very good and comprehensive. I highly suggest you have a look. Good luck.
sklearn.cluster.DBSCAN labels noise (the outliers) as -1; every other value is a cluster number. The labels_ attribute of the fitted estimator holds the label of each sample, and the total number of clusters is the number of distinct labels, not counting -1.
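As a small illustration of reading such labels, assuming a made-up label list in place of the labels_ array of a fitted estimator:

```python
from collections import Counter

# Hypothetical labels; in practice this would be the fitted estimator's labels_.
labels = [0, 0, 1, -1, 2, 1, -1, 0]

counts = Counter(labels)
n_noise = counts.get(-1, 0)                            # samples labeled as noise
n_clusters = len(counts) - (1 if -1 in counts else 0)  # distinct labels except -1

print(n_clusters, n_noise)
```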
What is eps or Epsilon value used in DBScan?
Epsilon is the local radius for expanding clusters. Think of it as a step size - DBSCAN never takes a step larger than this, but by doing multiple steps DBSCAN clusters can become much larger than eps.
How to find the best eps value?
Use any hyperparameter tuning method or package, such as GridSearchCV or Hyperopt. You can score the candidates with any of the indices mentioned here.
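One heuristic that is often used alongside a parameter search, and that is not part of the answer above, is to look at each point's distance to its k-th nearest neighbour: sorted in ascending order, these distances tend to jump sharply where noise begins, and an eps just below the jump is a reasonable starting value. A rough pure-Python sketch on made-up 2-D points (k_distances is a hypothetical helper, and the brute-force loop is only suitable for small data):

```python
import math


def k_distances(points, k):
    """Return, sorted ascending, each point's distance to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
kd = k_distances(points, k=2)
print(kd)
```

In this toy data the first eight values are 1.0 and the last one is far larger, so any eps slightly above 1.0 would separate the two groups from the isolated point.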
I have found this to be a really good example of getting to understand how DBSCAN works.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)
# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
a = np.array(labels)
a
Result:
array([ 0, 1, 0, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 1, 0, -1,
1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 0, 0, 2, 0, 1, 1, 0,
1, 0, 2, 0, 0, 2, 2, 1, 1, 1, 1, 1, 0, 2, 0, 1, 2,
2, 1, 1, 2, 2, 1, 0, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2,
0, 0, 0, 2, 0, 0, 2, 1, -1, 1, 0, 2, 1, 1, 0, 0, 0,
0, 1, 2, 1, 2, 2, 0, 1, 0, 1, -1, 1, 1, 0, 0, 2, 1,
2, 0, 2, 2, 2, 2, -1, 0, -1, 1, 1, 1, 1, 0, 0, 1, 0,
1, 2, 1, 0, 0, 1, 2, 1, 0, 0, 2, 0, 2, 2, 2, 0, -1,
2, 2, 0, 1, 0, 2, 0, 0, 2, 2, -1, 2, 1, -1, 2, 1, 1,
2, 2, 2, 0, 1, 0, 1, 0, 1, 0, 2, 2, -1, 1, 2, 2, 1,
0, 1, 2, 2, 2, 1, 1, 2, 2, 0, 1, 2, 0, 0, 2, 0, 0,
1, 0, 1, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 1, 2, 2, 2,
2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0, 0, 1, 1, 1, 2, 2,
2, 2, 1, 2, 2, 0, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 2,
1, 1, 0, 1, 2, 2, 1, 2, 2, 1, 0, 0, 1, 1, 1, 0, 1,
0, 2, 0, 2, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 0, 2,
1, -1, 2, 1, 1, 2, 1, 2, 0, 2, 2, 0, 1, 2, 2, 0, 2,
2, 0, 0, 2, 0, 2, 0, 2, 1, 0, 0, 0, 1, 2, 1, 2, 2,
0, 2, 2, 0, 0, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
0, 1, 1, 1, 0, 2, 0, 1, 2, 2, 0, 0, 2, 0, 2, 1, 0,
2, 0, 2, 0, 2, 2, 0, 1, 0, 1, 0, 2, 2, 1, 1, 1, 2,
0, 2, 0, 2, 1, 2, 2, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0,
2, 0, 1, 0, 1, 2, 1, 1, 1, 0, 1, 1, 0, 2, 1, 0, 2,
2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 0, 2, 1, 2, 1, 0, 1,
0, 1, 1, 0, 1, 2, -1, 1, 0, 0, 2, 1, 2, 2, 2, 2, 1,
0, 0, 0, 0, 1, 0, 2, 1, 0, 1, 2, 0, 0, 1, 0, 1, 1,
0, -1, 0, 2, 2, 2, 1, 1, 2, 0, 1, 0, 0, 1, 0, 1, 1,
2, 2, -1, 0, 1, 2, 2, 1, 1, 1, 1, 0, 0, 0, 2, 2, 1,
2, 1, 0, 0, 1, 2, 1, 0, 0, 2, 0, 1, 0, 2, 1, 0, 2,
2, 1, 0, 0, 0, 2, 1, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0,
1, 0, 1, 0, 0, 2, 0, 1, 1, 2, 1, 1, 0, 1, 0, 2, 1,
0, 0, 1, 0, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1,
2, 0, 0, 0, 1, 2, 2, 0, 2, 0, 2, 1, 0, 1, 1, 0, 0,
1, 2, 1, 2, 2, 0, 2, 1, 1, 1, 2, 0, 0, 2, 0, 2, 2,
0, 2, 0, 1, 1, 1, 1, 0, 0, 0, 2, 1, 1, 1, 1, 2, 2,
2, 0, 2, 1, 1, 0, 0, 1, 0, 2, 1, 2, 1, 0, 2, 2, 0,
0, 1, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0, 1, 1, 0, 0,
1, 2, 2, 0, 0, 0, 0, 2, -1, 1, 1, 2, 1, 0, 0, 2, 2,
0, 1, 2, 0, 1, 2, 2, 1, 0, 0, -1, -1, 2, 0, 0, 0, 2,
-1, 2, 0, 1, 1, 1, 1, 1, 0, 0, 2, 1, 2, 0, 1, 1, 1,
0, 2, 1, 1, -1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 0, 1, 1,
2, 0, 0, 2, 2, 1, 2, 2, 2, 0, 2, 1, 2, 1, 1, 1, 2,
0, 2, 0, 2, 2, 0, 0, 2, 1, 2, 0, 2, 0, 0, 0, 1, 0,
2, 1, 2, 0, 1, 0, 0, 2, 0, 2, 1, 1, 2, 1, 0, 1, 2,
1, 2], dtype=int64)
Those -1 data points are outliers. Let's count the number of outliers and see if it matches what we see in the image above.
b = a.tolist()
count = b.count(-1)
count
Result:
18
We got 18! Perfect!!
Relevant Link:
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

Regrouping a list positionally into quantiles

I have a dict in which each key corresponds to a gene name, and each value corresponds to a list. The length of the list is different for each gene, because each element represents a different nucleotide. The number at each position indicates the "score" of the nucleotide.
Because each gene is a different length, I want to be able to directly compare their positional score distributions by splitting each gene up into quantiles (most likely, percentiles: 100 bins).
Here is some simulation data:
myData = {
'Gene1': [3, 1, 1, 2, 3, 1, 1, 1, 3, 0, 0, 0, 3, 3, 3, 0, 1, 2, 1, 3, 2, 2, 0, 2, 0, 1, 0, 3, 0, 3, 1, 1, 0, 3, 0, 0, 1, 0, 1, 0, 1, 3, 3, 2, 3, 1, 0, 1, 2, 2, 0, 3, 0, 2, 0, 1, 1, 2, 3, 3, 1, 2, 1, 3, 1, 0, 0, 3, 2, 0, 3, 0, 2, 1, 1, 1, 2, 1, 1, 3, 0, 1, 1, 1, 3, 3, 0, 2, 2, 1, 3, 2, 3, 0, 2, 3, 2, 1, 3, 1, 3, 2, 1, 3, 0, 3, 3, 0, 0, 1, 0, 3, 1, 1, 3, 0, 0, 2, 3, 1, 0, 2, 1, 2, 1, 2, 1, 2, 0, 1, 1, 1, 3, 1, 3, 1, 3, 2, 3, 3, 3, 1, 1, 2, 1, 0, 2, 2, 2, 0, 1, 0, 3, 1, 3, 2, 1, 3, 0, 1, 3, 1, 0, 1, 2, 1, 2, 2, 3, 2, 3, 2, 2, 2, 1, 2, 2, 0, 3, 1, 2, 1, 1, 3, 2, 2, 1, 3, 1, 0, 1, 3, 2, 2, 3, 0, 0, 1, 0, 0, 3],
'Gene2': [3, 0, 0, 0, 3, 3, 1, 3, 3, 1, 0, 0, 1, 0, 1, 1, 3, 2, 2, 2, 0, 1, 3, 2, 1, 3, 1, 1, 2, 3, 0, 2, 0, 2, 1, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 0, 1, 1, 1, 1, 3, 2, 0, 3, 0, 1, 1, 2, 3, 0, 2, 1, 3, 3, 0, 3, 2, 1, 1, 2, 0, 0, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 0, 0, 1, 0, 3, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 3, 0, 3, 2, 2, 0, 2, 3, 0, 2, 2, 3, 0, 3, 1, 2, 3, 0, 1, 0, 2, 3, 1, 3, 1, 2, 3, 1, 1, 0, 1, 3, 0, 2, 3, 3, 3, 3, 0, 1, 2, 2, 2, 3, 0, 3, 1, 0, 2, 3, 1, 0, 1, 1, 0, 3, 3, 1, 2, 1, 2, 3, 2, 3, 1, 2, 0, 2, 3, 1, 2, 3, 2, 1, 2, 2, 0, 0, 0, 0, 2, 0, 2, 3, 0, 2, 0, 0, 2, 0, 3, 3, 0, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 1, 3, 2, 0, 2, 3, 0, 0, 0, 1, 1, 0, 1, 2, 0, 1, 2, 1, 3, 3, 0, 2, 2, 1, 0, 1, 1, 1, 0, 0, 2, 1, 2, 0, 1, 2, 1, 1, 3, 0, 1, 0, 1, 2, 1, 3, 0, 2, 3, 1, 2, 0, 0, 3, 2, 0, 3, 2, 1, 2, 3, 1, 0, 1, 0, 0, 1, 2, 3, 3, 2, 2, 1, 2, 2, 3, 3, 3, 3, 0, 0, 2, 2, 2, 2, 3, 2, 3, 2, 0, 3, 1, 0, 2, 3, 0, 1, 2, 2, 0, 2],
'Gene3': [2, 3, 1, 0, 3, 2, 1, 0, 1, 2, 1, 2, 1, 3, 0, 2, 2, 3, 2, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 3, 2, 2, 1, 3, 1, 2, 3, 0, 0, 3, 1, 0, 3, 2, 2, 3, 0, 0, 3, 3, 1, 1, 1, 0, 0, 2, 3, 2, 0, 2, 0, 1, 0, 2, 3, 0, 2, 0, 3, 3, 0, 0, 1, 0, 3, 2, 1, 1, 3, 3, 0, 2, 3, 1, 1, 0, 1, 3, 2, 1, 0, 3, 2, 0, 3, 2, 1, 1, 0, 3, 0, 0, 2, 0, 3, 3, 0, 2, 0, 3, 3, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 3, 3, 3, 3, 1, 3, 0, 0, 3, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 0, 0, 3, 0, 2, 2, 0, 0, 3, 0, 1, 3, 1, 1, 0, 2, 2, 3, 3, 0, 2, 0, 0, 2, 3, 1, 2, 1, 1, 2, 2, 0, 0, 3, 2, 2, 2, 1, 2, 0, 3, 2, 2, 2, 2, 1, 0, 3, 2, 2, 1, 0, 0, 2, 2, 0, 3, 2, 0, 2, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 3, 2, 0, 2, 3, 3, 0, 2, 2, 0, 1, 1, 3, 0, 0, 1, 2, 3, 1, 3, 2, 3, 3, 2, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 2, 1, 3, 1, 3, 1, 1, 0, 3, 0, 1, 1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 0, 2, 0, 0, 2, 2, 2, 3, 3, 0, 0, 3, 2, 1, 2, 1, 0, 3, 2, 3, 1, 1, 0, 1, 3, 2, 0, 3, 1, 3, 1, 2, 0, 0, 2, 3, 2, 2, 0, 3, 0, 2, 2, 2, 3, 3, 2, 1, 3, 3, 0, 2, 2, 2, 1, 1, 2, 1, 3, 2, 3, 2, 1, 3, 1, 0, 0, 2, 0, 1, 1, 3, 3, 0, 1, 2, 3, 1, 2, 3, 1, 1, 1, 2, 0, 2, 0, 1, 0, 3, 1, 0, 3, 3, 1, 3, 1, 1, 2, 2, 0, 2, 0, 1, 0, 3, 1, 1, 1, 3, 3, 0, 0, 1, 1, 2, 3, 0, 2, 0, 1, 1, 3, 3, 1, 1, 0, 0, 2, 0, 1, 2, 2, 2, 3, 1, 1, 1, 0, 3, 0, 0, 0, 1, 0, 1, 3, 1, 2, 2, 1, 2, 2]
}
As you can see, Gene1 has a length of 201, and Gene2 has a length of 301. However, Gene3 has a length of 428. I want to summarize each of these lists so that, for an arbitrary number of bins (nBins), I can partition the list into a list of lists.
For example, for the first two genes, if I chose nBins=100, then Gene1 would look like [[3,1],[1,2],[3,1],[1,1]...] while Gene2 would look like [[3,0,0],[0,3,3],[1,3,3]...]. That is, I want to partition based on the positions and not the values themselves. My dataset is large, so I'm looking for a library that can do this most efficiently.
Are you sure the length of Gene1 isn't 201?
You don't say what you want to happen in the case where the length isn't divisible by the number of bins. My code mixes sublists of length floor(length/nBins) and ceiling(length/nBins) to get the right number of bins.
new_data = {key : [value[
int(bin_number*len(value)/nBins):
int((bin_number+1)*len(value)/nBins)
]
for bin_number in range(nBins)] for key, value in myData.items()}
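The same positional split can be written with integer arithmetic only, which avoids float rounding at the bin boundaries. This is a sketch rather than the answer's exact code; split_into_bins is a hypothetical helper name.

```python
def split_into_bins(values, n_bins):
    """Split values into n_bins consecutive slices whose lengths differ by at most one."""
    n = len(values)
    # Bin i covers positions [i * n // n_bins, (i + 1) * n // n_bins).
    return [values[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


gene = list(range(201))            # stand-in for a gene with 201 positions
bins = split_into_bins(gene, 100)
sizes = [len(b) for b in bins]
print(len(bins), min(sizes), max(sizes))
```

For 201 positions and 100 bins this yields 99 bins of two items and one bin of three, and concatenating the bins gives back the original list.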
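You don't need a library. Pure Python should be fast enough in 90% of cases. Note that this splits each list into fixed-size chunks of len(l) // nBins items, so a list whose length is not a multiple of nBins yields more than nBins chunks:
nBins = 100
def group(l, size):
    # Consecutive chunks of `size` items; the last chunk may be shorter.
    return [l[i:i + size] for i in range(0, len(l), size)]
# max(1, ...) guards against a chunk size of 0 for lists shorter than nBins.
bin_data = {k: group(l, max(1, len(l) // nBins)) for k, l in myData.items()}
print(bin_data)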

chi squared in scipy different from results in SPSS

I'm trying to automate chi-squared calculations using scipy.stats.pearsonr. However, it gives me different answers than SPSS does, by roughly a factor of 10 (.07 --> .8).
I'm pretty sure that the data is the same in both cases because I'm printing out the crosstab in both cases (using pandas.crosstab) and the numbers are identical.
d1 = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
d2 = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1]
print(scipy.stats.pearsonr(d1, d2))
gives:
(-0.065191159985573108, 0.61172152831874682)
(the 1st is the coefficient, the 2nd is the p value)
However SPSS says that the Pearson Chi-Square is .057.
Is there something I should check other than the crosstab?
Apparently you are computing the chi-squared statistic and p-value for the contingency table (i.e. "cross tab") of the data. The scipy function pearsonr is not the correct function to use for this. To do the calculation with scipy, you'll need to form the contingency table and then use scipy.stats.chi2_contingency.
There are several ways you could convert d1 and d2 into a contingency table. Here I'll use the Pandas function pandas.crosstab. Then I'll use chi2_contingency for the chi-squared test.
First, here is your data. I have them in numpy arrays, but this is not necessary:
In [49]: d1
Out[49]:
array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1])
In [50]: d2
Out[50]:
array([1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1])
Use pandas to form the contingency table:
In [51]: import pandas as pd
In [52]: table = pd.crosstab(d1, d2)
In [53]: table
Out[53]:
col_0 0 1 2
row_0
0 5 7 4
1 10 34 3
Then use chi2_contingency for the chi-squared test:
In [54]: from scipy.stats import chi2_contingency
In [55]: chi2, p, dof, expected = chi2_contingency(table.values)
In [56]: p
Out[56]: 0.057230732412525138
The p value matches the value computed by SPSS.
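To see where that number comes from, the statistic can be recomputed by hand from the contingency table above. This is only a sketch: chi_square_statistic is a hypothetical helper, and the closed-form p-value exp(-x / 2) is valid only because this table has exactly 2 degrees of freedom.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


table = [[5, 7, 4], [10, 34, 3]]
chi2, dof = chi_square_statistic(table)
# Special case: with 2 degrees of freedom the survival function is exp(-x / 2).
p = math.exp(-chi2 / 2)
print(chi2, dof, p)
```

The resulting p-value agrees with the 0.0572 reported by chi2_contingency and SPSS.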
Update: Since SciPy 1.7.0, you can also create the contingency table with scipy.stats.contingency.crosstab:
In [33]: from scipy.stats.contingency import crosstab # Available since SciPy 1.7.0
In [34]: d1
Out[34]:
array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1])
In [35]: d2
Out[35]:
array([1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1,
1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
1, 0, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1])
In [36]: (vals1, vals2), table = crosstab(d1, d2)
In [37]: vals1
Out[37]: array([0, 1])
In [38]: vals2
Out[38]: array([0, 1, 2])
In [39]: table
Out[39]:
array([[ 5, 7, 4],
[10, 34, 3]])

k-means in python: Determine which data are associated with each centroid

I've been using scipy.cluster.vq.kmeans for some k-means clustering, but was wondering if there's a way to determine which centroid each of your data points is (putatively) associated with.
Clearly you could do this manually, but as far as I can tell the kmeans function doesn't return this?
The function kmeans2 in scipy.cluster.vq returns the labels as well:
In [7]: import numpy as np; from scipy.cluster.vq import kmeans2
In [8]: X = np.random.randn(100, 2)
In [9]: centroids, labels = kmeans2(X, 3)
In [10]: labels
Out[10]:
array([2, 1, 2, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1, 2, 0,
1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 0, 0,
2, 2, 0, 1, 0, 0, 0, 2, 2, 2, 0, 0, 1, 2, 1, 0, 0, 0, 2, 1, 1, 1, 1,
1, 0, 0, 1, 0, 1, 2, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2, 0, 2, 2, 0,
1, 1, 0, 1, 0, 0, 0, 2])
Otherwise, if you must use kmeans, you can also use vq to get labels:
In [17]: from scipy.cluster.vq import kmeans, vq
In [18]: codebook, distortion = kmeans(X, 3)
In [21]: code, dist = vq(X, codebook)
In [22]: code
Out[22]:
array([1, 0, 1, 0, 2, 2, 2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
2, 2, 1, 2, 0, 1, 1, 0, 2, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1,
0, 1, 2, 0, 1, 2, 2, 1, 1, 1, 2, 2, 0, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2,
0, 1, 1, 2, 1, 0, 0, 0, 0, 1, 2, 1, 2, 0, 2, 0, 2, 2, 1, 1, 1, 1, 1,
2, 0, 2, 0, 2, 1, 1, 1])
Documentation: scipy.cluster.vq
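For reference, the "manual" assignment mentioned in the question amounts to picking, for every point, the index of the nearest centroid. A minimal pure-Python sketch with made-up points and centroids (assign_to_centroids is a hypothetical helper, not a SciPy function):

```python
import math


def assign_to_centroids(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


points = [(0.0, 0.1), (0.2, -0.1), (5.0, 5.1), (4.9, 5.0), (-5.0, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]
labels = assign_to_centroids(points, centroids)
print(labels)
```

For large arrays the vectorised vq call shown above is the better choice; this version only makes the logic explicit.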
