Predict z-score based on multiple conditions - python

I want to obtain the probability (or z-score) of whether clust1 or clust2 correspond to "type 1" or "type 2" using the mrna_norm.columns values.
Conditions:
Item is either "type 1" or "type 2"
If mrna_norm.columns are in clust1_tmb1 BUT NOT in clust1_tmb2, it has a higher probability to be "type 1"
If mrna_norm.columns are in clust2_tmb1 BUT NOT in clust1_tmb1, it has a higher probability to be "type 2"
If mrna_norm.columns are in clust1_tmb1 BUT NOT in clust2_tmb1, it has a higher probability to be "type 1"
If mrna_norm.columns are in clust1_tmb2 BUT NOT in clust1_tmb1, it has a higher probability to be "type 2"
If mrna_norm.columns are in methylation_cimp_cpg.columns, it has a higher probability to be "type 2"
Here, it happens that len(clust1_tmb1)==len(clust1_tmb2) and
len(clust2_tmb1)==len(clust2_tmb2). But ignore this coincidence.
My code:
prob_type1 = 0
prob_type2 = 0
total = len(mrna_norm.columns)
if not mrna_norm[clust1_tmb2].columns:
    prob_type1 += 1
    prob_type2 -= 1
if not mrna_norm[clust2_tmb1].columns:
    prob_type1 += 1
    prob_type2 -= 1
if not mrna_norm[clust2_tmb1].columns:
    prob_type2 += 1
    prob_type1 -= 1
if not mrna_norm[clust1_tmb2].columns:
    prob_type2 += 1
    prob_type1 -= 1
if mrna_norm.columns == methylation_cimp_cpg.columns:
    prob_type2 += 1
    prob_type1 -= 1
# Get the z-score
stats.zscore(prob_type1)
stats.zscore(prob_type2)
Expected output:
Sample              Type 1 probability   Type 2 probability
TCGA-2Z-A9J1-01A    0.1                  0.9
TCGA-2Z-A9J3-01A    0.8                  0.2
TCGA-2Z-A9J8-01A    0.3                  0.7
TCGA-2Z-A9JD-01A    0.4                  0.6
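One possible reading of the conditions above, as a rough sketch only: each rule a sample satisfies counts as one vote for "type 1" or "type 2", and the vote share is reported per sample. This vote share is an assumption made here for illustration; it is neither a z-score nor a calibrated probability. The function name type_probabilities and the argument cimp_samples (standing in for methylation_cimp_cpg.columns) are hypothetical, and sample IDs are assumed to be plain strings.

```python
def type_probabilities(samples, clust1_tmb1, clust1_tmb2, clust2_tmb1, cimp_samples):
    """Return {sample: (share_type1, share_type2)} from simple rule votes.

    Assumption: every rule has the same weight; this is a heuristic score,
    not a statistical probability.
    """
    c1t1, c1t2 = set(clust1_tmb1), set(clust1_tmb2)
    c2t1, cimp = set(clust2_tmb1), set(cimp_samples)
    result = {}
    for s in samples:
        type1 = 0
        type2 = 0
        # In clust1_tmb1 but not in clust1_tmb2 -> vote for type 1
        type1 += int(s in c1t1 and s not in c1t2)
        # In clust2_tmb1 but not in clust1_tmb1 -> vote for type 2
        type2 += int(s in c2t1 and s not in c1t1)
        # In clust1_tmb1 but not in clust2_tmb1 -> vote for type 1
        type1 += int(s in c1t1 and s not in c2t1)
        # In clust1_tmb2 but not in clust1_tmb1 -> vote for type 2
        type2 += int(s in c1t2 and s not in c1t1)
        # Present in the CIMP methylation samples -> vote for type 2
        type2 += int(s in cimp)
        total = type1 + type2
        # No rule fired: fall back to an uninformative 50/50 split
        result[s] = (0.5, 0.5) if total == 0 else (type1 / total, type2 / total)
    return result


# Made-up sample IDs, purely for illustration
shares = type_probabilities(
    samples=["a", "b", "c"],
    clust1_tmb1=["a"],
    clust1_tmb2=["b"],
    clust2_tmb1=["b"],
    cimp_samples=["a"],
)
print(shares)
```

With real data, samples would be list(mrna_norm.columns), and the resulting dictionary could be turned into a table with one row per sample.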

Related

Looking for a faster solution to 'maximum number of teams' problem

My solution exceeds the time limit and I can't come up with a faster solution, still very much a beginner. How can I improve it?
The problem:
A perfect ICPC team is made up of at least 1 mathematician and 1 programmer and it must have 3 members. You have the number of mathematicians, programmers and students that have no specialization. What is the maximum number of perfect teams you can make? C students are programmers, M students are mathematicians and X don't have a specialization.
Example input:
1 1 1
Example output:
1
Example input:
3 6 0
Example output:
3
Example input:
10 1 10
Example output:
1
My solution:
cmx = [int(x) for x in input().split()]
i = 0
while 0 not in cmx:
    cmx[0] -= 1
    cmx[1] -= 1
    cmx[2] -= 1
    i += 1
if cmx[0] != 0 and cmx[1] != 0 and cmx[2] == 0:
    while sum(cmx) >= 3 and cmx[0] != 0 and cmx[1] != 0:
        if cmx[0] >= cmx[1]:
            cmx[0] -= 2
            cmx[1] -= 1
            i += 1
        elif cmx[0] < cmx[1]:
            cmx[0] -= 1
            cmx[1] -= 2
            i += 1
print(i)
Assume that M ≤ C (the proof works identically if C ≤ M). How many teams can I make? If M + C + X ≥ 3M, then I can easily make M teams: every team has a mathematician, a programmer, and either a second programmer or a student with no specialization. I can't make more than M teams, because every team needs a mathematician. If M + C + X < 3M, then the most I can have is (M + C + X) // 3 teams, and again you make them the same way, since you have sufficient mathematicians and programmers.
So the answer is min(M, C, (M + C + X) // 3).
A simpler way of looking at it is that C, M, and (C + M + X)//3 are each, independently, an upper bound on the number of teams that you can form. You just have to show the smallest of these three upper bounds is, in fact, a reachable value.
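The closed form above can be sketched in a few lines, which replaces the team-by-team loop with constant-time arithmetic. The function name max_teams is just an illustrative choice; the three calls reuse the example inputs from the question.

```python
def max_teams(c, m, x):
    """Upper bounds: one programmer per team, one mathematician per team,
    and three members per team. The smallest bound is reachable."""
    return min(c, m, (c + m + x) // 3)


# Example inputs from the problem statement
print(max_teams(1, 1, 1))    # 1
print(max_teams(3, 6, 0))    # 3
print(max_teams(10, 1, 10))  # 1
```

To read the input in the same format as the original solution, c, m, x = map(int, input().split()) followed by print(max_teams(c, m, x)) should be enough.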

IndexError: arrays used as indices must be of integer (or boolean) type python3

here is my code.
for i in output:
    if output[i] >= 0.80 and output[i] < 1 :
        output[i] = "very positive"
    elif output[i] >= 0.60 and output[i] < 0.80 :
        output[i] = "positive"
    elif output[i] >= 0.40 and output[i] < 0.60 :
        output[i] = "notr"
    elif output[i] >= 0.20 and output[i] < 0.40 :
        output[i] = "negative"
    elif output[i] >= 0 and output[i] < 0.20 :
        output[i] = "very negative"
and here is error.
IndexError                                Traceback (most recent call last)
<ipython-input-81-84cbeed85d45> in <module>
      1 for i in output:
----> 2     if output[i] >= 0.80 and output[i] < 1 :
      3         output[i] = "very positive"
      4     elif output[i] >= 0.60 and output[i] < 0.80 :
      5         output[i] = "positive"
IndexError: arrays used as indices must be of integer (or boolean) type
The output variable consists of values between 0 and 1.
please help guys.
That's not how for loops work.
i is just a temporary which holds an actual value from output.
It is not an index. Assigning to it doesn't do anything useful. Your code probably was trying to do this:
output = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i in output:
    if i >= 0.80 and i < 1:
        print(i, "very positive")
    elif i >= 0.60 and i < 0.80:
        print(i, "positive")
    elif i >= 0.40 and i < 0.60:
        print(i, "notr")
    elif i >= 0.20 and i < 0.40:
        print(i, "negative")
    elif i >= 0 and i < 0.20:
        print(i, "very negative")
0.1 very negative
0.2 negative
0.3 negative
0.4 notr
0.5 notr
0.6 positive
0.7 positive
0.8 very positive
0.9 very positive
The problem is how you use i. With for i in output, i already holds each value from output, not its position, so replace every output[i] in the conditions with i. Note that assigning a label to i only rebinds the loop variable and leaves output unchanged, so collect the labels in a new list instead:
labels = []
for i in output:
    if i >= 0.80 and i < 1:
        labels.append("very positive")
    elif i >= 0.60 and i < 0.80:
        labels.append("positive")
    elif i >= 0.40 and i < 0.60:
        labels.append("notr")
    elif i >= 0.20 and i < 0.40:
        labels.append("negative")
    elif i >= 0 and i < 0.20:
        labels.append("very negative")
Or, if you need the index as well as the value, you can loop with this line:
for i, val in enumerate(output):
When you use the loop for i in output, i takes the values present in your array output, one after another. For instance, if I have an array T containing names, this loop
for name in T:
    print(name)
will print all the names present in the array T.
Back to your problem now. You want to loop over your array. To do so, you used this:
for i in output:
Therefore, the values contained in your array will be stored in i. So your code should be:
for i in output:
    if i >= 0.80 and i < 1:
        i = "very positive"
    elif i >= 0.60 and i < 0.80:
        i = "positive"
    elif i >= 0.40 and i < 0.60:
        i = "notr"
    elif i >= 0.20 and i < 0.40:
        i = "negative"
    elif i >= 0 and i < 0.20:
        i = "very negative"
Hope this helps ^^
PS: assigning a label to i only rebinds the loop variable; it does not change output. In case you were trying to print those labels ("very negative", "negative", "notr", ...), you should use print("[your label]") instead of assigning the label to i.
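As an alternative to the if/elif chain in the answers above, here is a minimal sketch that looks the label up from sorted bin edges with the standard bisect module. It assumes output is a flat sequence of floats (a 2D NumPy array would need flattening first, for example with output.ravel()); the names EDGES, NAMES and label are made up for this example.

```python
from bisect import bisect_right

# Lower edges of every bin except the first; same cut points as the question
EDGES = [0.20, 0.40, 0.60, 0.80]
NAMES = ["very negative", "negative", "notr", "positive", "very positive"]


def label(value):
    # bisect_right counts how many edges are <= value, which is the bin index.
    # Unlike the original chain, values below 0 or at/above 1 also get the
    # first or last label instead of being skipped.
    return NAMES[bisect_right(EDGES, value)]


output = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [label(v) for v in output]  # new list; output itself stays numeric
for value, name in zip(output, labels):
    print(value, name)
```

Keeping the labels in a separate list avoids writing strings into a numeric array, which a float NumPy array would not accept.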

A test interview question I could not figure out

So I wrote a piece of code in PyCharm to solve this problem:
Pick any 5 positive integers that add up to 100 such that, by addition, subtraction, or just using one of the five values, you are able to make every number up to 100.
For example, with
1, 22, 2, 3, 4
for 1 I could give 1,
for 2 I could give 2,
and so on;
for 21 I could give 22 - 1,
for 25 I could give (22 + 2) - 1.
li = [1, 1, 1, 1, 1]
lists_of_li_that_pass_T1 = []
while True:
    if sum(li) == 100:
        list_of_li_that_pass_T1.append(li)
        if li[-1] != 100:
            li[-1] += 1
        else:
            li[-1] = 1
            if li[-2] != 100:
                li[-2] += 1
            else:
                li[-2] = 1
                if li[-3] != 100:
                    li[-3] += 1
                else:
                    li[-3] = 1
                    if li[-4] != 100:
                        li[-4] += 1
                    else:
                        li[-4] = 1
                        if li[-5] != 100:
                            li[-5] += 1
                        else:
                            break
    else:
        if li[-1] != 100:
            li[-1] += 1
        else:
            li[-1] = 1
            if li[-2] != 100:
                li[-2] += 1
            else:
                li[-2] = 1
                if li[-3] != 100:
                    li[-3] += 1
                else:
                    li[-3] = 1
                    if li[-4] != 100:
                        li[-4] += 1
                    else:
                        li[-4] = 1
                        if li[-5] != 100:
                            li[-5] += 1
                        else:
                            break
This should give me all the number combinations that add up to 100 out of the total 1*10 ** 10,
but it's not working. Please help me fix it so it prints all of the sets of integers.
I also can't think of what I would do next to get the perfect sets that solve the problem.
After @JohnY's comments, I assume that the question is:
Find a set of 5 integers meeting the following requirements:
their sum is 100
any number in the [1, 100] range can be constructed using each element of the set at most once and only additions and subtractions
A brute-force approach is certainly possible, but proving that any number can be constructed that way would be tedious. A divide-and-conquer strategy works instead: to construct all numbers up to n with a set of m numbers u_0, ..., u_{m-1}, it is enough to build all numbers up to (n+2)//3 with u_0, ..., u_{m-2} and use u_{m-1} = 2n//3 (integer division). Any number in the ((n+2)//3, u_{m-1}) range can be written as u_{m-1} - x with x in the [1, (n+2)//3] range, and any number in the (u_{m-1}, n] range as u_{m-1} + y with y in the same low range.
So here we can use u_4 = 66 and find a way to build numbers up to 34 with 4 numbers.
Let us iterate: u_3 = 22 and build numbers up to 12 with 3 numbers.
One more step: u_2 = 8 and build numbers up to 4 with 2 numbers.
OK: u_0 = 1 and u_1 = 3 give immediately:
1 = u_0
2 = 3 - 1 = u_1 - u_0
3 = u_1
4 = 3 + 1 = u_1 + u_0
Done: 1 + 3 + 8 + 22 + 66 = 100.
Mathematical digression:
In fact u_0 = 1 and u_1 = 3 can build all numbers up to 4, so we can use u_2 = 9 to build all numbers up to 9 + 4 = 13. We can easily prove that the sequence u_i = 3^i satisfies sum(u_i for i in [0, m-1]) = 1 + 3 + ... + 3^{m-1} = (3^m - 1)/(3 - 1) = (u_m - 1)/2.
So we could use u_0 = 1, u_1 = 3, u_2 = 9, u_3 = 27 to build all numbers up to 40, and finally set u_4 = 60.
In fact, u_0 and u_1 can only be 1 and 3, and u_2 can be 8 or 9. Then if u_2 == 8, u_3 can be in the [22, 25] range, and if u_2 == 9, u_3 can be in the [21, 27] range. The high limit is given by the 3^i sequence, and the low limit is given by the requirement to build numbers up to 12 with 3 numbers, and up to 34 with 4 numbers.
No code was used, but I think this way is much quicker and less error-prone. It is now possible to use Python to check that all numbers up to 100 can be constructed from one of those sets.
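Such a check could look like the sketch below. It is a plain brute-force verification rather than the divide-and-conquer argument itself: each element gets a coefficient of -1, 0 or +1 (subtracted, unused, added), which gives only 3^5 = 243 combinations per set. The function names reachable and covers are illustrative, and the two candidate tuples are (1, 3, 9, 27, 60) from the digression and (1, 3, 8, 22, 66), chosen here so that the elements sum to 100.

```python
from itertools import product


def reachable(numbers):
    """All positive totals obtainable by adding or subtracting each
    element at most once."""
    totals = set()
    for signs in product((-1, 0, 1), repeat=len(numbers)):
        total = sum(sign * n for sign, n in zip(signs, numbers))
        if total > 0:
            totals.add(total)
    return totals


def covers(numbers, limit=100):
    """True if every integer in [1, limit] is reachable."""
    return set(range(1, limit + 1)) <= reachable(numbers)


for candidate in [(1, 3, 8, 22, 66), (1, 3, 9, 27, 60)]:
    print(candidate, "sum =", sum(candidate), "covers 1..100:", covers(candidate))
```

The same covers function could be applied to every 5-tuple summing to 100 to list all valid sets, although the reasoning above narrows the search considerably.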

How to find count of values within certain range in pandas?

I have a pandas dataframe which contains a list of error values. I want to find the proportion of my errors in certain ranges e.g. what percentage of my error is within +-1%, +-5%, +-10%, +-20% and +-50% etc. A histogram of my data is shown below:
So far I have looked at functions such as pd.cut() and plt.hist() but no libraries seem to give me the answer where my ranges overlap each other so I'm having to resort to a very long custom made function - which is below:
def error_distribution(df):
    total_length = len(df.index)
    one_perc = five_perc = ten_perc = fifteen_perc = twenty_perc = thirty_perc \
        = fourty_perc = fifty_perc = over_fifty = 0
    for index, row in df.iterrows():
        value = abs(row['Errors'])
        if value <= 0.01:
            one_perc += 1
            five_perc += 1
            ten_perc += 1
            fifteen_perc += 1
            twenty_perc += 1
            thirty_perc += 1
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.05:
            five_perc += 1
            ten_perc += 1
            fifteen_perc += 1
            twenty_perc += 1
            thirty_perc += 1
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.1:
            ten_perc += 1
            fifteen_perc += 1
            twenty_perc += 1
            thirty_perc += 1
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.15:
            fifteen_perc += 1
            twenty_perc += 1
            thirty_perc += 1
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.2:
            twenty_perc += 1
            thirty_perc += 1
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.3:
            thirty_perc += 1
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.4:
            fourty_perc += 1
            fifty_perc += 1
        elif value <= 0.5:
            fifty_perc += 1
        else:
            over_fifty += 1
    print("Sub 1%: {0:.2f}%".format(one_perc/total_length*100))
    print("Sub 5%: {0:.2f}%".format(five_perc/total_length*100))
    print("Sub 10%: {0:.2f}%".format(ten_perc/total_length*100))
    print("Sub 15%: {0:.2f}%".format(fifteen_perc/total_length*100))
    print("Sub 20%: {0:.2f}%".format(twenty_perc/total_length*100))
    print("Sub 30%: {0:.2f}%".format(thirty_perc/total_length*100))
    print("Sub 40%: {0:.2f}%".format(fourty_perc/total_length*100))
    print("Sub 50%: {0:.2f}%".format(fifty_perc/total_length*100))
    print("Over 50%: {0:.2f}%".format(over_fifty/total_length*100))
And the output I'm looking for is this:
error_distribution(error_dataset1)
Output:
Sub 1%: 16.55%
Sub 5%: 56.61%
Sub 10%: 71.62%
Sub 15%: 78.53%
Sub 20%: 82.97%
Sub 30%: 88.46%
Sub 40%: 91.09%
Sub 50%: 92.59%
Over 50%: 7.41%
Does anyone know of a standard library that could do this?
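Can you try the following? np.histogram counts the values in each non-overlapping bin, so a cumulative sum of the counts gives the overlapping "within ±x%" totals you are after:
import numpy as np
# Random stand-in for the absolute errors, drawn from [0, 1) so they fall inside the bins
arr = np.random.uniform(low=0, high=1, size=(200,))
count, division = np.histogram(arr, bins=[0, .01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 1])
print(count, division)
# Cumulative percentage of values up to each upper bin edge
print(np.cumsum(count) / len(arr) * 100)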
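Another way to look at the overlapping ranges is that each "Sub x%" figure is just the share of values whose absolute error is at most x, so no binning is needed at all. A minimal pure-Python sketch of that idea follows; the function name cumulative_error_shares and the sample errors are made up for illustration, and the thresholds are copied from the question.

```python
def cumulative_error_shares(errors, thresholds=(0.01, 0.05, 0.10, 0.15,
                                                0.20, 0.30, 0.40, 0.50)):
    """Percentage of errors whose absolute value is within each threshold,
    plus the percentage above the largest threshold."""
    values = [abs(e) for e in errors]
    n = len(values)
    shares = {}
    for t in thresholds:
        # Each threshold is evaluated independently, so the ranges overlap
        shares[f"Sub {t:.0%}"] = sum(v <= t for v in values) / n * 100
    last = thresholds[-1]
    shares[f"Over {last:.0%}"] = sum(v > last for v in values) / n * 100
    return shares


# Made-up errors, purely for illustration
dist = cumulative_error_shares([0.005, -0.03, 0.08, 0.6])
for name, share in dist.items():
    print("{0}: {1:.2f}%".format(name, share))
```

With a pandas column, the same share for one threshold t can be written as (df['Errors'].abs() <= t).mean() * 100, which avoids the explicit Python loop over rows.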

2D binary string pyevolve

I am new to pyevolve and GA in Python.
I am trying to make a 2D binary array representing matches. Something like this:
    A  B  C
1   1  0  0
2   0  1  0
3   1  0  0
4   0  0  1
My goal is to have only one "1" in each row, so the total number of "1"s in the array should equal the number of rows. One number can be matched with only one letter, but a letter can be matched with multiple numbers.
I wrote this code for the evaluation function:
def eval_func(chromosome):
    score = 0.0
    num_of_rows = chromosome.getHeight()
    num_of_cols = chromosome.getWidth()
    # create 2 lists. One with the sums of each row and one
    # with the sums of each column
    row_sums = [sum(chromosome[i]) for i in xrange(num_of_rows)]
    col_sums = [sum(x) for x in zip(*chromosome)]
    # if the sum of "1"s in a row is > 1 then a number (1,2,3,4) is matched with
    # more than one letter. We want to avoid that.
    for row_sum in row_sums:
        if row_sum <= 1:
            score += 0.5
        else:
            score -= 1.0
    # col_sum is actually the number of "1"s in the array
    col_sum = sum(col_sums)
    # if all the numbers are matched we increase the score
    if col_sum == num_of_rows:
        score += 0.5
    if score < 0:
        score = 0.0
    return score
It seems to work, but when I add some other checks, e.g. if 1 is in A then 2 cannot be in C, it fails.
How can I make this work with many such checks?
Thanks in advance.
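One way to keep many checks manageable, sketched here under assumptions, is to describe each "if X is matched with P, then Y cannot be matched with Q" rule as data and subtract a penalty per violated rule. The sketch works on a plain list of lists standing in for the chromosome; the names count_violations, score_matrix, forbidden_pairs and penalty are hypothetical, and the scoring deliberately rewards rows with exactly one "1", which differs slightly from the row_sum <= 1 test above.

```python
def count_violations(matrix, forbidden_pairs):
    """Count rules of the form ((row_a, col_a), (row_b, col_b)) where both
    cells are set to 1 at the same time."""
    violations = 0
    for (row_a, col_a), (row_b, col_b) in forbidden_pairs:
        if matrix[row_a][col_a] == 1 and matrix[row_b][col_b] == 1:
            violations += 1
    return violations


def score_matrix(matrix, forbidden_pairs, penalty=1.0):
    score = 0.0
    for row in matrix:
        # Reward rows with exactly one match, punish everything else
        score += 0.5 if sum(row) == 1 else -1.0
    # Every violated rule costs the same fixed penalty (an assumption)
    score -= penalty * count_violations(matrix, forbidden_pairs)
    return max(score, 0.0)  # keep the fitness non-negative, as in eval_func


# Rule: if number 1 (row 0) is in A (col 0), number 2 (row 1) cannot be in C (col 2)
rules = [((0, 0), (1, 2))]
valid = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
invalid = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
print(score_matrix(valid, rules))
print(score_matrix(invalid, rules))
```

Adding a new check then only means appending another pair to rules, and the penalty weight can be tuned so that violating a rule is never better than leaving a row unmatched.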
