GC skew method using sliding window concept in python

GC skew method using sliding window concept in python - python

I have completed a beginner's course in python and I am working on a problem to improve my coding skills. In this problem, I have to calculate the GC-skew by dividing the entire sequence into subsequences of equal length. I am working in a jupyter notebook.
I have to create a code so that I'll get the number of C's and G's from the sequence and then calculate GC skew in each window. window size = 5kb with an increment of 1kb.
What I have done so far is that first stored the sequence in a list and took user input for length of box/window and increment of the box. Then I tried to create a loop for calculating the number of C's and G's in each window but here I am facing an issue as instead of getting number of C's and G's in a window/box, I am getting number of C's and G's from the entire sequence for number of times the loop is running. I want number total number of C's and total no of G's in each window.
Please suggest how can I get the mentioned number of characters and GC skew for each overlapping sliding window/box. Also is there any concept of sliding window in python which I can use it here?
char = []
with open('keratin.txt') as f:
for line in f:
line = line.strip()
for ch in line:
char.append(ch)
print(char)
len(char)
f1 = open('keratin.txt','r')
f2 = open('keratin.txt','a+')
lob = input('Enter length of box =')
iob = input('Enter the increment of the box =')
i=0
lob = 5000
iob = 1000
nob = 1 #no. of boxes
for i in range (0,len(char)-lob):
b = i
while( b < lob + i and b < len(char)):
nC = 0
nG = 0
if char[b] == 'C':
nC = nC + 1
elif char[b] == 'G':
nG = nG + 1
b = b + 1
print(nC)
print(nG)
i = i + iob
nob = nob + 1

I hope this would help in understanding,
number_of_C_and_G = []
# Go from 0 to end, skipping length of box and increment. 0, 6000, 12000 ...
for i in range(0, len(char), lob+inc):
nC = 0
nG = 0
# Go from start to length of box, 0 to 5000, 6000 to 11000 ...
for j in range(i, lob):
if char[j] == 'C':
nC += 1
else if char[j] == 'G':
nG += 1
# Put the value for the box in the list
number_of_C_and_G.append( (nC, nG) )

Related

How to properly add gradually increasing/decreasing space between objects?

I've trying to implement transition from an amount of space to another which is similar to acceleration and deceleration, except i failed and the only thing that i got from this was this infinite stack of mess, here is a screenshot showing this in action:
you can see a very black circle here, which are in reality something like 100 or 200 circles stacked on top of each other
and i reached this result using this piece of code:
def Place_circles(curve, circle_space, cs, draw=True, screen=None):
curve_acceleration = []
if type(curve) == tuple:
curve_acceleration = curve[1][0]
curve_intensity = curve[1][1]
curve = curve[0]
#print(curve_intensity)
#print(curve_acceleration)
Circle_list = []
idx = [0,0]
for c in reversed(range(0,len(curve))):
for p in reversed(range(0,len(curve[c]))):
user_dist = circle_space[curve_intensity[c]] + curve_acceleration[c] * p
dist = math.sqrt(math.pow(curve[c][p][0] - curve[idx[0]][idx[1]][0],2)+math.pow(curve [c][p][1] - curve[idx[0]][idx[1]][1],2))
if dist > user_dist:
idx = [c,p]
Circle_list.append(circles.circles(round(curve[c][p][0]), round(curve[c][p][1]), cs, draw, screen))
This place circles depending on the intensity (a number between 0 and 2, random) of the current curve, which equal to an amount of space (let's say between 20 and 30 here, 20 being index 0, 30 being index 2 and a number between these 2 being index 1).
This create the stack you see above and isn't what i want, i also came to the conclusion that i cannot use acceleration since the amount of time to move between 2 points depend on the amount of circles i need to click on, knowing that there are multiple circles between each points, but not being able to determine how many lead to me being unable to the the classic acceleration formula.
So I'm running out of options here and ideas on how to transition from an amount of space to another.
any idea?
PS: i scrapped the idea above and switched back to my master branch but the code for this is still available in the branch i created here https://github.com/Mrcubix/Osu-StreamGenerator/tree/acceleration .
So now I'm back with my normal code that don't possess acceleration or deceleration.
TL:DR i can't use acceleration since i don't know the amount of circles that are going to be placed between the 2 points and make the time of travel vary (i need for exemple to click circles at 180 bpm of one circle every 0.333s) so I'm looking for another way to generate gradually changing space.

First, i took my function that was generating the intensity for each curves in [0 ; 2]
Then i scrapped the acceleration formula as it's unusable.
Now i'm using a basic algorithm to determine the maximum amount of circles i can place on a curve.
Now the way my script work is the following:
i first generate a stream (multiple circles that need to be clicked at high bpm)
this way i obtain the length of each curves (or segments) of the polyline.
i generate an intensity for each curve using the following function:
def generate_intensity(Circle_list: list = None, circle_space: int = None, Args: list = None):
curve_intensity = []
if not Args or Args[0] == "NewProfile":
prompt = True
while prompt:
max_duration_intensity = input("Choose the maximum amount of curve the change in intensity will occur for: ")
if max_duration_intensity.isdigit():
max_duration_intensity = int(max_duration_intensity)
prompt = False
prompt = True
while prompt:
intensity_change_odds = input("Choose the odds of occurence for changes in intensity (1-100): ")
if intensity_change_odds.isdigit():
intensity_change_odds = int(intensity_change_odds)
if 0 < intensity_change_odds <= 100:
prompt = False
prompt = True
while prompt:
min_intensity = input("Choose the lowest amount of spacing a circle will have: ")
if min_intensity.isdigit():
min_intensity = float(min_intensity)
if min_intensity < circle_space:
prompt = False
prompt = True
while prompt:
max_intensity = input("Choose the highest amount of spacing a circle will have: ")
if max_intensity.isdigit():
max_intensity = float(max_intensity)
if max_intensity > circle_space:
prompt = False
prompt = True
if Args:
if Args[0] == "NewProfile":
return [max_duration_intensity, intensity_change_odds, min_intensity, max_intensity]
elif Args[0] == "GenMap":
max_duration_intensity = Args[1]
intensity_change_odds = Args[2]
min_intensity = Args[3]
max_intensity = Args[4]
circle_space = ([min_intensity, circle_space, max_intensity] if not Args else [Args[0][3],circle_space,Args[0][4]])
count = 0
for idx, i in enumerate(Circle_list):
if idx == len(Circle_list) - 1:
if random.randint(0,100) < intensity_change_odds:
if random.randint(0,100) > 50:
curve_intensity.append(2)
else:
curve_intensity.append(0)
else:
curve_intensity.append(1)
if random.randint(0,100) < intensity_change_odds:
if random.randint(0,100) > 50:
curve_intensity.append(2)
count += 1
else:
curve_intensity.append(0)
count += 1
else:
if curve_intensity:
if curve_intensity[-1] == 2 and not count+1 > max_duration_intensity:
curve_intensity.append(2)
count += 1
continue
elif curve_intensity[-1] == 0 and not count+1 > max_duration_intensity:
curve_intensity.append(0)
count += 1
continue
elif count+1 > 2:
curve_intensity.append(1)
count = 0
continue
else:
curve_intensity.append(1)
else:
curve_intensity.append(1)
curve_intensity.reverse()
if curve_intensity.count(curve_intensity[0]) == len(curve_intensity):
print("Intensity didn't change")
return circle_space[1]
print("\n")
return [circle_space, curve_intensity]
with this, i obtain 2 list, one with the spacing i specified, and the second one is the list of randomly generated intensity.
from there i call another function taking into argument the polyline, the previously specified spacings and the generated intensity:
def acceleration_algorithm(polyline, circle_space, curve_intensity):
new_circle_spacing = []
for idx in range(len(polyline)): #repeat 4 times
spacing = []
Length = 0
best_spacing = 0
for p_idx in range(len(polyline[idx])-1): #repeat 1000 times / p_idx in [0 ; 1000]
# Create multiple list containing spacing going from circle_space[curve_intensity[idx-1]] to circle_space[curve_intensity[idx]]
spacing.append(np.linspace(circle_space[curve_intensity[idx]],circle_space[curve_intensity[idx+1]], p_idx).tolist())
# Sum distance to find length of curve
Length += abs(math.sqrt((polyline[idx][p_idx+1][0] - polyline[idx][p_idx][0]) ** 2 + (polyline [idx][p_idx+1][1] - polyline[idx][p_idx][1]) ** 2))
for s in range(len(spacing)): # probably has 1000 list in 1 list
length_left = Length # Make sure to reset length for each iteration
for dist in spacing[s]: # substract the specified int in spacing[s]
length_left -= dist
if length_left > 0:
best_spacing = s
else: # Since length < 0, use previous working index (best_spacing), could also jsut do `s-1`
if spacing[best_spacing] == []:
new_circle_spacing.append([circle_space[1]])
continue
new_circle_spacing.append(spacing[best_spacing])
break
return new_circle_spacing
with this, i obtain a list with the space between each circles that are going to be placed,
from there, i can Call Place_circles() again, and obtain the new stream:
def Place_circles(polyline, circle_space, cs, DoDrawCircle=True, surface=None):
Circle_list = []
curve = []
next_circle_space = None
dist = 0
for c in reversed(range(0, len(polyline))):
curve = []
if type(circle_space) == list:
iter_circle_space = iter(circle_space[c])
next_circle_space = next(iter_circle_space, circle_space[c][-1])
for p in reversed(range(len(polyline[c])-1)):
dist += math.sqrt((polyline[c][p+1][0] - polyline[c][p][0]) ** 2 + (polyline [c][p+1][1] - polyline[c][p][1]) ** 2)
if dist > (circle_space if type(circle_space) == int else next_circle_space):
dist = 0
curve.append(circles.circles(round(polyline[c][p][0]), round(polyline[c][p][1]), cs, DoDrawCircle, surface))
if type(circle_space) == list:
next_circle_space = next(iter_circle_space, circle_space[c][-1])
Circle_list.append(curve)
return Circle_list
the result is a stream with varying space between circles (so accelerating or decelerating), the only issue left to be fixed is pygame not updating the screen with the new set of circle after i call Place_circles(), but that's an issue i'm either going to try to fix myself or ask in another post
the final code for this feature can be found on my repo : https://github.com/Mrcubix/Osu-StreamGenerator/tree/Acceleration_v02

Rolling a roulette with chances and counting the maximum same-side sequences of a number of rolls

I am trying to generate a random sequence of numbers, with each "result" having a chance of {a}48.6%/{b}48.6%/{c}2.8%.
Counting how many times in a sequence of 6 or more {a} occurred, same for {b}.
{c} counts as neutral, meaning that if an {a} sequence is happening, {c} will count as {a}, additionally if a {b} sequence is happening, then {c} will count as {b}.
The thing is that the results seem right, but every "i" iteration seems to give results that are "weighted" either on the {a} side or the {b} side. And I can't seem to figure out why.
I would expect for example to have a result of :
{a:6, b:7, a:8, a:7, b:9} but what I am getting is {a:7, a:9, a:6, a:8} OR {b:7, b:8, b:6} etc.
Any ideas?
import sys
import random
from random import seed
from random import randint
from datetime import datetime
import time
loopRange = 8
flips = 500
median = 0
for j in range(loopRange):
random.seed(datetime.now())
sequenceArray = []
maxN = 0
flag1 = -1
flag2 = -1
for i in range(flips):
number = randint(1, 1000)
if(number <= 486):
flag1 = 0
sequenceArray.append(number)
elif(number > 486 and number <= 972):
flag1 = 1
sequenceArray.append(number)
elif(number > 972):
sequenceArray.append(number)
if(flag1 != -1 and flag2 == -1):
flag2 = flag1
if(flag1 != flag2):
sequenceArray.pop()
if(len(sequenceArray) > maxN):
maxN = len(sequenceArray)
if(len(sequenceArray) >= 6):
print(len(sequenceArray))
# print(sequenceArray)
# print(sequenceArray)
sequenceArray = []
median += maxN
print("Maximum sequence is %d " % maxN)
print("\n")
time.sleep(random.uniform(0.1, 1))
median = float(median/loopRange)
print("\n")
print(median)

I would implement something with two cursors prev (previous) and curr (current) since you need to detect a change between the current and the previous state.
I just write the code of the inner loop on i since the external loop adds complexity without focusing on the source of the problem. You can then include this in your piece of code. It seems to work for me, but I am not sure to understand perfectly how you want to manage all the behaviours (especially at start).
prev = -1
curr = -1
seq = []
maxN = 0
for i in range(flips):
number = randint(1, 1000)
if number<=486:
curr = 0 # case 'a'
elif number<=972:
curr = 1 # case 'b'
else:
curr = 2 # case 'c', the joker
if (prev==-1) and (curr==2):
# If we start with the joker, don't do anything
# You can add code here to change this behavior
pass
else:
if (prev==-1):
# At start, it is like the previous element is the current one
prev=curr
if (prev==curr) or (curr==2):
# We continue the sequence
seq.append(number)
else:
# Break the sequence
maxN = max(maxN,len(seq)) # save maximum length
if len(seq)>=6:
print("%u: %r"%(prev,seq))
seq = [] # reset sequence
seq.append(number) # don't forget to append the new number! It breaks the sequence, but starts a new one
prev=curr # We switch case 'a' for 'b', or 'b' for 'a'

Q: Expected number of coin tosses to get N heads in a row, in Python. My code gives answers that don't match published correct ones, but unsure why

I'm trying to write Python code to see how many coin tosses, on average, are required to get a sequences of N heads in a row.
The thing that I'm puzzled by is that the answers produced by my code don't match ones that are given online, e.g. here (and many other places) https://math.stackexchange.com/questions/364038/expected-number-of-coin-tosses-to-get-five-consecutive-heads
According to that, the expected number of tosses that I should need to get various numbers of heads in a row are: E(1) = 2, E(2) = 6, E(3) = 14, E(4) = 30, E(5) = 62. But I don't get those answers! For example, I get E(3) = 8, instead of 14. The code below runs to give that answer, but you can change n to test for other target numbers of heads in a row.
What is going wrong? Presumably there is some error in the logic of my code, but I confess that I can't figure out what it is.
You can see, run and make modified copies of my code here: https://trinket.io/python/17154b2cbd
Below is the code itself, outside of that runnable trinket.io page. Any help figuring out what's wrong with it would be greatly appreciated!
Many thanks,
Raj
P.S. The closest related question that I could find was this one: Monte-Carlo Simulation of expected tosses for two consecutive heads in python
However, as far as I can see, the code in that question does not actually test for two consecutive heads, but instead tests for a sequence that starts with a head and then at some later, possibly non-consecutive, time gets another head.
# Click here to run and/or modify this code:
# https://trinket.io/python/17154b2cbd
import random
# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 3
possible_tosses = [ 'h', 't' ]
num_trials = 1000
target_seq = ['h' for i in range(0,n)]
toss_sequence = []
seq_lengths_rec = []
for trial_num in range(0,num_trials):
if (trial_num % 100) == 0:
print 'Trial num', trial_num, 'out of', num_trials
# (The free version of trinket.io uses Python2)
target_reached = 0
toss_num = 0
while target_reached == 0:
toss_num += 1
random.shuffle(possible_tosses)
this_toss = possible_tosses[0]
#print([toss_num, this_toss])
toss_sequence.append(this_toss)
last_n_tosses = toss_sequence[-n:]
#print(last_n_tosses)
if last_n_tosses == target_seq:
#print('Reached target at toss', toss_num)
target_reached = 1
seq_lengths_rec.append(toss_num)
print 'Average', sum(seq_lengths_rec) / len(seq_lengths_rec)

You don't re-initialize toss_sequence for each experiment, so you start every experiment with a pre-existing sequence of heads, having a 1 in 2 chance of hitting the target sequence on the first try of each new experiment.
Initializing toss_sequence inside the outer loop will solve your problem:
import random
# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 4
possible_tosses = [ 'h', 't' ]
num_trials = 1000
target_seq = ['h' for i in range(0,n)]
seq_lengths_rec = []
for trial_num in range(0,num_trials):
if (trial_num % 100) == 0:
print('Trial num {} out of {}'.format(trial_num, num_trials))
# (The free version of trinket.io uses Python2)
target_reached = 0
toss_num = 0
toss_sequence = []
while target_reached == 0:
toss_num += 1
random.shuffle(possible_tosses)
this_toss = possible_tosses[0]
#print([toss_num, this_toss])
toss_sequence.append(this_toss)
last_n_tosses = toss_sequence[-n:]
#print(last_n_tosses)
if last_n_tosses == target_seq:
#print('Reached target at toss', toss_num)
target_reached = 1
seq_lengths_rec.append(toss_num)
print(sum(seq_lengths_rec) / len(seq_lengths_rec))
You can simplify your code a bit, and make it less error-prone:
import random
# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 3
possible_tosses = [ 'h', 't' ]
num_trials = 1000
seq_lengths_rec = []
for trial_num in range(0, num_trials):
if (trial_num % 100) == 0:
print('Trial num {} out of {}'.format(trial_num, num_trials))
# (The free version of trinket.io uses Python2)
heads_counter = 0
toss_counter = 0
while heads_counter < n:
toss_counter += 1
this_toss = random.choice(possible_tosses)
if this_toss == 'h':
heads_counter += 1
else:
heads_counter = 0
seq_lengths_rec.append(toss_counter)
print(sum(seq_lengths_rec) / len(seq_lengths_rec))

We cam eliminate one additional loop by running each experiment long enough (ideally infinite) number of times, e.g., each time toss a coin n=1000 times. Now, it is likely that the sequence of 5 heads will appear in each such trial. If it does appear, we can call the trial as an effective trial, otherwise we can reject the trial.
In the end, we can take an average of number of tosses needed w.r.t. the number of effective trials (by LLN it will approximate the expected number of tosses). Consider the following code:
N = 100000 # total number of trials
n = 1000 # long enough sequence of tosses
k = 5 # k heads in a row
ntosses = []
pat = ''.join(['1']*k)
effective_trials = 0
for i in range(N): # num of trials
seq = ''.join(map(str,random.choices(range(2),k=n))) # toss a coin n times (long enough times)
if pat in seq:
ntosses.append(seq.index(pat) + k)
effective_trials += 1
print(effective_trials, sum(ntosses) / effective_trials)
# 100000 62.19919
Notice that the result may not be correct if n is small, since it tries to approximate infinite number of coin tosses (to find expected number of tosses to obtain 5 heads in a row, n=1000 is okay since actual expected value is 62).

Longest expected head streak in 200 coinflips

I was trying to calculate the expected value for the longest consecutive heads streak in 200 coin flips, using python. I came up with a code which I think does the job right but it's just not efficient because of the amount of calculations and data storage it requires, and I was wondering if someone could help me out with this, making it faster and more efficient (I took only one course of python programming in last semester without any previous knowledge of the subject).
My code was
import numpy as np
from itertools import permutations
counter = 0
sett = 0
rle = []
matrix = np.zeros(200)
for i in range (0,200):
matrix[i] = 1
for j in permutations(matrix):
for k in j:
if k == 1:
counter += 1
else:
if counter > sett:
sett == counter
counter == 0
rle.append(sett)
After finding rle, I'd iterate over it to get how many streaks of which length there are, and their sum divided by 2^200 would give me the expected value I'm looking for.
Thanks in advance for help, much appreciated!

You don't have to try all the permutations (in fact you cannot), but you can do a simple Monte Carlo style simulation. Repeat the 200 coin flips many times. Average the lengths of longest streaks you get and this will be a good approximation of the expected value.
def oneTrial (noOfCoinFlips):
s = numpy.random.binomial(1, 0.5, noOfCoinFlips)
maxCount = 0
count = 0
for x in s:
if x == 1:
count += 1
if x == 0:
count = 0
maxCount = max(maxCount, count)
return maxCount
numpy.mean([oneTrial(200) for x in range(10000)])
Output: 6.9843
Also see this thread for exact computation without using Python simulation.

This is an answer to a slightly different question. But, as I had invested an hour and half of my time into it, I didn't wanna scrape it off.
Let E(k) denote a k head streak, i.e., you get k consecutive heads from the first toss onwards.
E(0): T { another 199 tosses that we do not care about }
E(1): H T { another 198 tosses... }
.
.
E(198): { 198 heads } T H
E(199): { 199 heads } T
E(200): { 200 heads }
Note that P(0) = 0.5, which is P(tails in first toss)
whereas P(1) = 0.25 , i.e., P(heads in first toss and tails in the second)
P(0) = 2**-1
P(1) = 2**-2
.
.
.
P(198) = 2**-199
P(199) = 2**-200
P(200) = 2**-200 #same as P(199)
Which means if you toss a coin 2**200 times, you'd get
E(0) 2**199 times
E(1) 2**198 times
.
.
E(198) 2**1 times
E(199) 2**0 times and
E(200) 2**0 times.
Thus, the expected value reduces to
(0*(2**199) + 1*(2**198) + 2*(2**197) + ... + 198*(2**1) + 199*(2**0) + 200*(2**0))/2**200
This number is virtually equal to 1.
Expected_value = 1 - 2**-200
How I got the difference.
>>> diff = 2**200 - sum([ k*(2**(199-k)) for k in range(200)], 200*(2**0))
>>> diff
1
This can be generalized to n tosses as
f(n) = 1 - 2**(-n)

sum( array, 1) giving 'nan' in Python

First of all i know nan stands for "not a number" but I am not sure how i am getting an invalid number in my code. What i am doing is using a python script that reads a file for a list of vectors (x,y,z) and then converts it to a long array of values, but if i don't use the file and i make a for loop that generates random numbers i don't get any 'nan's.
After this i am using Newtons law of gravity to calculate the pos of stars, F= GMm/r^2 to calculate positions and then that data gets sent through a socket server to my c# visualizing software that i developed for watching simulations. Unfortuanately my python script that does the calculating has only but been troublesome to get working.
poslist = []
plist = []
mlist = []
lineList = []
coords = []
with open("Hyades Vectors.txt", "r") as text_file:
content = text_file.readlines()
#remove /n
for i in range(len(content)):
for char in "\n":
line = content[i].replace(char,"")
lineList.append(line)
lines = array(lineList)
#split " " within each line
for i in range(len(lines)):
coords.append(lines[i].split(" "))
coords = array(coords)
#convert coords string to integer
for i in range(len(coords)):
x = np.float(coords[i,0])
y = np.float(coords[i,1])
z = np.float(coords[i,2])
poslist.append((x,y,z))
pos = array(poslist)
quite often it is sending nan's after the second time going through this loop
vcm = sum(p)/sum(m) #velocity of centre mass
p = p-m*vcm #make total initial momentum equal zero
Myr = 8.4
dt = 1
pos = pos-(p/m)*(dt/2.) #initial half-step
finished = False
while not finished: # or NBodyVis.Oppenned() == False
r = pos-pos[:,newaxis] #all pairs of star-to-star vectors
for n in range(Nstars):
r[n,n] = 1e6 #otherwise the self-forces are infinite
rmag = sqrt(sum(square(r),-1)) #star-to star scalar distances
F = G*m*m[:,newaxis]*r/rmag[:,:,newaxis]**3 # all force pairs
for n in range(Nstars):
F[n,n] = 5 # no self-forces
p = p+sum(F,1)*dt #sum(F,1) is where i get a nan!!!!!!!!!!!!!!!!
pos -= (p/m)*dt
if Time <= 0:
finished = True
else:
Time -= 1
What am i doing wrong?????? I don't fully understand nans but i can't have them if my visualizing software is to read a nan, as for then nothing will apear for visuals. I know that the error is sum(F,1) I went and printed everything through until i got a nan and that is where, but how is it getting a nan from summing. Here is what part of the text file looks like that i am reading:
51.48855 4.74229 -85.24499
121.87149 11.44572 -140.79644
59.81673 68.8417 18.76767
31.95567 37.23007 6.59515
29.81066 34.76371 6.18374
41.35333 49.52844 14.12314
32.10481 38.46982 7.96628
48.13239 60.4019 37.45474
26.37793 34.53385 15.9054
76.02468 103.98826 25.96607
51.52072 71.17618 32.09829
please help

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

GC skew method using sliding window concept in python - python

Related

How to properly add gradually increasing/decreasing space between objects?

Rolling a roulette with chances and counting the maximum same-side sequences of a number of rolls

Q: Expected number of coin tosses to get N heads in a row, in Python. My code gives answers that don't match published correct ones, but unsure why

Longest expected head streak in 200 coinflips

sum( array, 1) giving 'nan' in Python

Categories

Resources