Regrouping a list positionally into quantiles - python

I have a dict in which each key corresponds to a gene name, and each value corresponds to a list. The length of the list is different for each gene, because each element represents a different nucleotide. The number at each position indicates the "score" of the nucleotide.
Because each gene is a different length, I want to be able to directly compare their positional score distributions by splitting each gene up into quantiles (most likely, percentiles: 100 bins).
Here is some simulation data:
myData = {
'Gene1': [3, 1, 1, 2, 3, 1, 1, 1, 3, 0, 0, 0, 3, 3, 3, 0, 1, 2, 1, 3, 2, 2, 0, 2, 0, 1, 0, 3, 0, 3, 1, 1, 0, 3, 0, 0, 1, 0, 1, 0, 1, 3, 3, 2, 3, 1, 0, 1, 2, 2, 0, 3, 0, 2, 0, 1, 1, 2, 3, 3, 1, 2, 1, 3, 1, 0, 0, 3, 2, 0, 3, 0, 2, 1, 1, 1, 2, 1, 1, 3, 0, 1, 1, 1, 3, 3, 0, 2, 2, 1, 3, 2, 3, 0, 2, 3, 2, 1, 3, 1, 3, 2, 1, 3, 0, 3, 3, 0, 0, 1, 0, 3, 1, 1, 3, 0, 0, 2, 3, 1, 0, 2, 1, 2, 1, 2, 1, 2, 0, 1, 1, 1, 3, 1, 3, 1, 3, 2, 3, 3, 3, 1, 1, 2, 1, 0, 2, 2, 2, 0, 1, 0, 3, 1, 3, 2, 1, 3, 0, 1, 3, 1, 0, 1, 2, 1, 2, 2, 3, 2, 3, 2, 2, 2, 1, 2, 2, 0, 3, 1, 2, 1, 1, 3, 2, 2, 1, 3, 1, 0, 1, 3, 2, 2, 3, 0, 0, 1, 0, 0, 3],
'Gene2': [3, 0, 0, 0, 3, 3, 1, 3, 3, 1, 0, 0, 1, 0, 1, 1, 3, 2, 2, 2, 0, 1, 3, 2, 1, 3, 1, 1, 2, 3, 0, 2, 0, 2, 1, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 0, 1, 1, 1, 1, 3, 2, 0, 3, 0, 1, 1, 2, 3, 0, 2, 1, 3, 3, 0, 3, 2, 1, 1, 2, 0, 0, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 0, 0, 1, 0, 3, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 3, 0, 3, 2, 2, 0, 2, 3, 0, 2, 2, 3, 0, 3, 1, 2, 3, 0, 1, 0, 2, 3, 1, 3, 1, 2, 3, 1, 1, 0, 1, 3, 0, 2, 3, 3, 3, 3, 0, 1, 2, 2, 2, 3, 0, 3, 1, 0, 2, 3, 1, 0, 1, 1, 0, 3, 3, 1, 2, 1, 2, 3, 2, 3, 1, 2, 0, 2, 3, 1, 2, 3, 2, 1, 2, 2, 0, 0, 0, 0, 2, 0, 2, 3, 0, 2, 0, 0, 2, 0, 3, 3, 0, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 1, 3, 2, 0, 2, 3, 0, 0, 0, 1, 1, 0, 1, 2, 0, 1, 2, 1, 3, 3, 0, 2, 2, 1, 0, 1, 1, 1, 0, 0, 2, 1, 2, 0, 1, 2, 1, 1, 3, 0, 1, 0, 1, 2, 1, 3, 0, 2, 3, 1, 2, 0, 0, 3, 2, 0, 3, 2, 1, 2, 3, 1, 0, 1, 0, 0, 1, 2, 3, 3, 2, 2, 1, 2, 2, 3, 3, 3, 3, 0, 0, 2, 2, 2, 2, 3, 2, 3, 2, 0, 3, 1, 0, 2, 3, 0, 1, 2, 2, 0, 2],
'Gene3': [2, 3, 1, 0, 3, 2, 1, 0, 1, 2, 1, 2, 1, 3, 0, 2, 2, 3, 2, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 3, 2, 2, 1, 3, 1, 2, 3, 0, 0, 3, 1, 0, 3, 2, 2, 3, 0, 0, 3, 3, 1, 1, 1, 0, 0, 2, 3, 2, 0, 2, 0, 1, 0, 2, 3, 0, 2, 0, 3, 3, 0, 0, 1, 0, 3, 2, 1, 1, 3, 3, 0, 2, 3, 1, 1, 0, 1, 3, 2, 1, 0, 3, 2, 0, 3, 2, 1, 1, 0, 3, 0, 0, 2, 0, 3, 3, 0, 2, 0, 3, 3, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 3, 3, 3, 3, 1, 3, 0, 0, 3, 1, 0, 2, 2, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 0, 0, 3, 0, 2, 2, 0, 0, 3, 0, 1, 3, 1, 1, 0, 2, 2, 3, 3, 0, 2, 0, 0, 2, 3, 1, 2, 1, 1, 2, 2, 0, 0, 3, 2, 2, 2, 1, 2, 0, 3, 2, 2, 2, 2, 1, 0, 3, 2, 2, 1, 0, 0, 2, 2, 0, 3, 2, 0, 2, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 3, 2, 0, 2, 3, 3, 0, 2, 2, 0, 1, 1, 3, 0, 0, 1, 2, 3, 1, 3, 2, 3, 3, 2, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 2, 1, 3, 1, 3, 1, 1, 0, 3, 0, 1, 1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 0, 2, 0, 0, 2, 2, 2, 3, 3, 0, 0, 3, 2, 1, 2, 1, 0, 3, 2, 3, 1, 1, 0, 1, 3, 2, 0, 3, 1, 3, 1, 2, 0, 0, 2, 3, 2, 2, 0, 3, 0, 2, 2, 2, 3, 3, 2, 1, 3, 3, 0, 2, 2, 2, 1, 1, 2, 1, 3, 2, 3, 2, 1, 3, 1, 0, 0, 2, 0, 1, 1, 3, 3, 0, 1, 2, 3, 1, 2, 3, 1, 1, 1, 2, 0, 2, 0, 1, 0, 3, 1, 0, 3, 3, 1, 3, 1, 1, 2, 2, 0, 2, 0, 1, 0, 3, 1, 1, 1, 3, 3, 0, 0, 1, 1, 2, 3, 0, 2, 0, 1, 1, 3, 3, 1, 1, 0, 0, 2, 0, 1, 2, 2, 2, 3, 1, 1, 1, 0, 3, 0, 0, 0, 1, 0, 1, 3, 1, 2, 2, 1, 2, 2]
}
As you can see, the genes have different lengths: Gene1 is 201 positions long, Gene2 is 301, and Gene3 is 428. I want to summarize each of these lists so that, for an arbitrary number of bins (nBins), I can partition each list into a list of lists.
For example, for the first two genes, if I chose nBins=100, then Gene1 would look like [[3,1],[1,2],[3,1],[1,1]...] while Gene2 would look like [[3,0,0],[0,3,3],[1,3,3]...]. That is, I want to partition based on the positions, not the values themselves. My dataset is large, so I'm looking for a library that can do this efficiently.

Note that 201, 301, and 428 aren't divisible by 100, and you don't say what should happen when a gene's length isn't divisible by the number of bins. My code mixes sublists of length floor(length/nBins) and ceiling(length/nBins) to get exactly the right number of bins.
nBins = 100
new_data = {
    key: [
        # slice boundaries at int(j * len / nBins) spread the remainder
        # evenly, so every gene gets exactly nBins sublists
        value[int(bin_number * len(value) / nBins):int((bin_number + 1) * len(value) / nBins)]
        for bin_number in range(nBins)
    ]
    for key, value in myData.items()
}
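As a quick sanity check (assuming the myData dict above), every gene ends up with exactly nBins sublists whose lengths differ by at most one:
print(len(new_data['Gene1']))                       # 100
print(sorted({len(b) for b in new_data['Gene1']}))  # [2, 3]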

You don't need a library. Pure Python should be fast enough in 90% of cases:
nBins = 100
def group(l, size):
    # fixed-size chunks: all of length `size` except possibly the last,
    # so the bin count can exceed nBins when len(l) isn't divisible by it
    return [l[i:i + size] for i in range(0, len(l), size)]

bin_data = {k: group(l, len(l) // nBins) for k, l in myData.items()}
print(bin_data)
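If the dataset really is large, NumPy is the obvious library to benchmark against: np.array_split partitions an array into exactly nBins nearly equal parts (sizes differ by at most one), which is the positional binning described in the question. A minimal sketch, assuming the myData dict above:
import numpy as np

nBins = 100
bin_data = {k: np.array_split(np.asarray(v), nBins) for k, v in myData.items()}

# per-bin mean scores, e.g. for comparing positional profiles across genes
profiles = {k: [chunk.mean() for chunk in bins] for k, bins in bin_data.items()}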

Related

While rendering an OpenAI Gym environment, not able to view my print statements simultaneously in the console

For the code below, print does not work simultaneously with the rendering; all the prints are displayed only after the rendering completes.
import gym

env = gym.make("MountainCar-v0", render_mode="human")
env.reset(seed=9)
done = False
step = 0
while not done:
    print(step, end="|")
    action = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 0, 1, 2, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2][step]
    new_state, reward, done, truncated, info = env.step(action)
    step += 1
    env.render()
env.close()
Is it possible to print or view output while rendering the environment?
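One likely cause is stdout buffering: Python may hold printed text in a buffer while the render loop runs, so everything appears at once when the buffer is finally flushed. A simple thing to try (an assumption about the cause, since the behaviour depends on how the console is attached) is forcing a flush on every call:
print(step, end="|", flush=True)  # flush=True pushes the text out immediately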

Difference in prediction results from kmeans tsne on load_iris python

I am running KMeans clustering with the t-SNE dimensionality reduction technique on the iris dataset in Python. I get different prediction results when I load the iris dataset in two different ways.
Method 1:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
iris = load_iris()
X1 = iris.data
y1 = iris.target
km = KMeans(n_clusters = 3, random_state=146)
tsne = TSNE(perplexity = 30, random_state=146)
km.fit(X1)
X1_tsne = tsne.fit_transform(X1)
y1_pred = km.fit_predict(X1_tsne)
print(y1.tolist())
print(y1_pred.tolist())
print(X1[77])
print(y1[77])
print(y1_pred[77])
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
[6.7 3. 5. 1.7]
1
2
Method 2:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
X2,y2 = load_iris(return_X_y=True, as_frame=True)
km = KMeans(n_clusters = 3, random_state=146)
tsne = TSNE(perplexity = 30, random_state=146)
# X2 & y2
km.fit(X2)
X2_tsne = tsne.fit_transform(X2)
y2_pred = km.fit_predict(X2_tsne)
print(y2.tolist())
print(y2_pred.tolist())
print(X2.iloc[77])
print(y2[77])
print(y2_pred[77])
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1]
sepal length (cm) 6.7
sepal width (cm) 3.0
petal length (cm) 5.0
petal width (cm) 1.7
Name: 77, dtype: float64
1
1
Why is index 77 predicted as 2 in Method 1 but as 1 in Method 2?
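One point worth checking before comparing label values directly: the cluster IDs that KMeans assigns are arbitrary, so the "same" cluster can be numbered 1 in one run and 2 in another, and points near a cluster boundary (like index 77 here) can genuinely flip if the two t-SNE embeddings differ even slightly. A label-invariant comparison such as adjusted_rand_score (a suggestion, not part of the original code) shows whether the two partitions really disagree:
from sklearn.metrics import adjusted_rand_score

# 1.0 means the two clusterings are identical up to relabelling;
# anything lower means points actually moved between clusters
print(adjusted_rand_score(y1_pred, y2_pred))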

Counting occurrences of an item in an ndarray based on multiple conditions?

How do you specify multiple conditions in the np.count_nonzero function?
I want to count the numbers in an array whose values lie between two bounds. I know you can subtract the results of two individual count_nonzero calls, but I would like to know whether there is an easy way to pass multiple conditions to np.count_nonzero directly.
import numpy as np
array = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0],
[0, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 0],
[0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 3, 2, 1, 0],
[0, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 5, 4, 3, 2, 0],
[0, 2, 3, 4, 5, 6, 7, 8, 8, 7, 6, 5, 4, 3, 2, 0],
[0, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 5, 4, 3, 2, 0],
[0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 3, 2, 1, 0],
[0, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 0],
[0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# Count occurrences of values between 5 and 8 in the array.
result1 = np.count_nonzero(array <= 8)
result2 = np.count_nonzero(array <= 5)
result = result1 - result2
I would like to know if there is a way that looks something like:
np.count_nonzero(array >= 6 and array <= 8)
Could this be what you are looking for?
np.count_nonzero(np.logical_and(array>=5, array<=8))
# 24
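Equivalently, NumPy's element-wise & operator does the same thing; the parentheses are required because & binds more tightly than the comparison operators:
np.count_nonzero((array >= 5) & (array <= 8))
# 24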

Python combinations with replacement with random placement of values (list)

I am trying to create a list of combinations from a list ([0,1,2,4,6]).
I want combinations with 12 values.
Eg:
"(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2)"
"(0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2)"
"(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4)"
"(0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)"
"(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4)"
"(0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2)"
"(0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 4)"
"(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 6)"
"(0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2)"
This works perfectly, but now what I want is for the positions of these values in each output to be random.
Something like:
"(0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2)" should be "(2, 0, 2, 0, 1, 2, 2, 0, 2, 1, 2, 0)"
This is the code I have written:
from itertools import combinations_with_replacement

combinations_list = [comb for i in range(1, 13)
                     for comb in combinations_with_replacement(numbers, i)
                     if sum(comb) == match_points]
where match_points can be any number; for the output above, match_points was 14 and numbers = [0, 1, 2, 4, 6].
How shall I randomise the combination values? Also, I need to restrict the count of 0s in a combination to at most six.
Eg:
"(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 6, 6)"
"(0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 6)"
"(0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 6, 6)"
shouldn't be generated.
Just shuffle your list.
import random
# .. code
random.shuffle(your_list)  # shuffles the list in place and returns None
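Note that combinations_with_replacement yields tuples, which random.shuffle cannot modify in place. Below is a minimal sketch that shuffles each combination and also enforces the at-most-six-zeros rule, assuming the numbers and match_points values from the question:
import random
from itertools import combinations_with_replacement

numbers = [0, 1, 2, 4, 6]
match_points = 14

combinations_list = [comb for i in range(1, 13)
                     for comb in combinations_with_replacement(numbers, i)
                     # keep target-sum combos with at most six zeros
                     if sum(comb) == match_points and comb.count(0) <= 6]

# random.sample with k == len(comb) returns a new list in random order,
# which sidesteps the immutability of tuples
randomised = [random.sample(comb, len(comb)) for comb in combinations_list]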

k-means in python: Determine which data are associated with each centroid

I've been using scipy.cluster.vq.kmeans for doing some k-means clustering, but was wondering if there's a way to determine which centroid each of your data points is (putatively) associated with.
Clearly you could do this manually, but as far as I can tell the kmeans function doesn't return this?
There is a function kmeans2 in scipy.cluster.vq that returns the labels, too. (Note that scipy.randn below was a re-export of numpy.random.randn and has been removed from recent SciPy releases; on current versions use NumPy directly.)
In [8]: X = scipy.randn(100, 2)
In [9]: centroids, labels = kmeans2(X, 3)
In [10]: labels
Out[10]:
array([2, 1, 2, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1, 2, 0,
1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 0, 0,
2, 2, 0, 1, 0, 0, 0, 2, 2, 2, 0, 0, 1, 2, 1, 0, 0, 0, 2, 1, 1, 1, 1,
1, 0, 0, 1, 0, 1, 2, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2, 0, 2, 2, 0,
1, 1, 0, 1, 0, 0, 0, 2])
Otherwise, if you must use kmeans, you can also use vq to get labels:
In [17]: from scipy.cluster.vq import kmeans, vq
In [18]: codebook, distortion = kmeans(X, 3)
In [21]: code, dist = vq(X, codebook)
In [22]: code
Out[22]:
array([1, 0, 1, 0, 2, 2, 2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
2, 2, 1, 2, 0, 1, 1, 0, 2, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1,
0, 1, 2, 0, 1, 2, 2, 1, 1, 1, 2, 2, 0, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2,
0, 1, 1, 2, 1, 0, 0, 0, 0, 1, 2, 1, 2, 0, 2, 0, 2, 2, 1, 1, 1, 1, 1,
2, 0, 2, 0, 2, 1, 1, 1])
Documentation: scipy.cluster.vq
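Under the hood, vq performs nearest-centroid assignment. For intuition, here is a minimal sketch of the same computation using scipy.spatial.distance.cdist, assuming the X and codebook from the session above:
from scipy.spatial.distance import cdist

# distance from every point to every centroid; the index of the closest
# centroid (per row) is the cluster label
labels_manual = cdist(X, codebook).argmin(axis=1)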
