I have a file with lines of DNA in a file called 'DNASeq.txt'. I need a code to read each line and split each line at random places (inserting spaces) throughout the line. Each line needs to be split at different places.
EX: I have:
AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ
And I need something like this:
AAA ADSF DFAFDDSAF ADF ADSF AFD AFAD
I have tried (!!!very new to python!!):
import random
for x in range(10):
print(random.randint(50,250))
but that prints me random numbers. Is there some way to get a random number generated as like a variable?
You can read a file line wise, write each line character-wise in a new file and insert spaces randomly:
Create demo file without spaces:
with open("t.txt","w") as f:
f.write("""ASDFSFDGHJEQWRJIJG
ASDFJSDGFIJ
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGO""")
Read and rewrite demo file:
import random
max_no_space = 9 # if max sequence length without space
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
for line in f:
for c in line:
w.write(c)
if random.randint(1,6) == 1 or no_space >= max_no_space:
w.write(" ")
no_space = 0
else:
no_space += 1
with open("n.txt") as k:
print(k.read())
Output:
ASDF SFD GHJEQWRJIJG
A SDFJ SDG FIJ
SADFJSD FJ JDSFJIDFJG I JSRGJSDJ FIDJFG
The pattern of spaces is random. You can influence it by settin max_no_spaces or remove the randomness to split after max_no_spaces all the time
Edit:
This way of writing 1 character at a time if you need to read 200+ en block is not very economic, you can do it with the same code like so:
with open("t.txt","w") as f:
f.write("""ASDFSFDGHJEQWRJIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGG
ASDFJSDGFIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJK
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJF
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG""")
import random
min_no_space = 10
max_no_space = 20 # if max sequence length without space
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
for line in f:
for c in line:
w.write(c)
if no_space > min_no_space:
if random.randint(1,6) == 1 or no_space >= max_no_space:
w.write(" ")
no_space = 0
else:
no_space += 1
with open("n.txt") as k:
print(k.read())
Output:
ASDFSFDGHJEQ WRJIJSADFJSDF JJDSFJIDFJGIJ SRGJSDJFIDJFGG
ASDFJSDGFIJSA DFJSDFJJDSFJIDF JGIJSRGJSDJFIDJ FGSADFJSDFJJ DSFJIDFJGIJK
SADFJ SDFJJDSFJIDFJG IJSRGJSDJFIDJ FGSADFJSDFJJDS FJIDFJGIJSRG JSDJFIDJF
SDFJG IKDSFGOROHPTLPASDMKFGD OKRAMGSADFJSDF JJDSFJIDFJGI JSRGJSDJFIDJFG
If you want to split your DNA fixed amount of times (10 in my example) here's what you could try:
import random
DNA = 'AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ'
splitted_DNA = ''
for split_idx in sorted(random.sample(range(len(DNA)), 10)):
splitted_DNA += DNA[len(splitted_DNA)-splitted_DNA.count(' ') :split_idx] + ' '
splitted_DNA += DNA[split_idx:]
print(splitted_DNA) # -> AAACCCHT HTH D AF HD SA F JANFAJDSNFA DK FAFJ
import random
with open('source', 'r') as in_file:
with open('dest', 'w') as out_file:
for line in in_file:
newLine = ''.join(map(lambda x:x+' '*random.randint(0,1), line)).strip() + '\n'
out_file.write(newLine)
Since you mentioned being new, I'll try to explain
I'm writing the new sequences to another file for precaution. It's
not safe to write to the file you are reading from.
The with constructor is so that you don't need to explicitly close
the file you opened.
Files can be read line by line using for loop.
''.join() converts a list to a string.
map() applies a function to every element of a list and returns the
results as a new list.
lambda is how you define a function without naming it. lambda x:
2*x doubles the number you feed it.
x + ' ' * 3 adds 3 spaces after x. random.randint(0, 1) returns
either 1 or 0. So I'm randomly selecting if I'll add a space after
each character or not. If the random.randint() returns 0, 0 spaces are added.
You can toss a coin after each character whether to add space there or not.
This function takes string as input and returns output with space inserted at random places.
def insert_random_spaces(str):
from random import randint
output_string = "".join([x+randint(0,1)*" " for x in str])
return output_string
I have found so much information from previous search on this website but I seem to be stuck on the following issue.
I have two text files that looks like this
Inter.txt ( n-lines but only showed 4 lines,you get the idea)
7275
30000
6693
855
....
rules.txt (2n-lines)
7275
8500
6693
7555
....
3
1000
8
5
....
I want to compare the first line of Inter.txt with rules.txt and in case of a match, I jump for n-lines in order to get the score of that line. (E.g. with 7275, there is a match, I jump n to get the score 3)
I produced the following code but for some reasons, I only have the ouput of the first line when I should have one for each match from my first file. With the previous example, I should have 8 as an output for 6693.
import linecache
inter = open("Inter.txt", "r")
rules = open("rules.txt", "r")
iScore = 0
jump = 266
i=0
for lineInt in inter:
#i = i+1
#print(i)
for lineRul in rules:
i = i+1
#print(i)
if lineInt == lineRul:
print("Match")
inc = linecache.getline("rules.txt", i + jump)
#print(inc)
iScore = iScore + int(inc)
print(iScore)
#break
else:
continue
All the print(i) are there because I checked that all the lines were read. I am a novice in Python.
To sum up, I don't understand why I only have one output. Thanks in advance !
Ok, I think the main thing that blocks you from getting forward is that the for loops on files gets the pointer to the end of the file, and doesn't resets when you starts the loops again.
So when you only open rules.txt once, and uses its intance in the inner loop it only goes through all the lines at the first iteration of the outer loop, the second time it tries to go over the remains lines, which are non.
The solution is to close and open the file outside the inner loop.
This code worked for me.
import linecache
inter = open("Inter.txt", "r")
iScore = 0
jump = 4
for lineInt in inter:
i=0
#i = i+1
#print(i)
rules = open("rules.txt", "r")
for lineRul in rules:
i = i+1
#print(i)
if lineInt == lineRul:
print("Match")
inc = linecache.getline("rules.txt", i + jump)
#print(inc)
iScore = iScore + int(inc)
print(iScore)
#break
else:
continue
rules.close()
I also moved where you set the i to 0 to the beginning of the outer loop, but I guess you'd find it yourself.
And I changed jump to 4 to fit the example files your gave :p
Can you please try this solution:
def get_rules_values(rules_file):
with open(rules_file, "r") as rules:
return map(int, rules.readlines())
def get_rules_dict(rules_values):
return dict(zip(rules_values[:len(rules_values)/2], rules_values[len(rules_values)/2:]))
def get_inter_values(inter_file):
with open(inter_file, "r") as inter:
return map(int, inter.readlines())
rules_dict = get_rules_dict(get_rules_values("rules.txt"))
inter_values = get_inter_values("inter.txt")
for inter_value in inter_values:
print inter_value, rules_dict[inter_value]
Hope it's working for you!
I have a data.dat file with this format:
REAL PART
FREQ 1.6 5.4 2.1 13.15 13.15 17.71
FREQ 51.64 51.64 82.11 133.15 133.15 167.71
.
.
.
IMAGINARY PART
FREQ 51.64 51.64 82.12 132.15 129.15 161.71
FREQ 5.64 51.64 83.09 131.15 120.15 160.7
.
.
.
REAL PART
FREQ 1.6 5.4 2.1 13.15 15.15 17.71
FREQ 51.64 57.64 82.11 183.15 133.15 167.71
.
.
.
IMAGINARY PART
FREQ 53.64 53.64 81.12 132.15 129.15 161.71
FREQ 5.64 55.64 83.09 131.15 120.15 160.7
All over the document REAL and IMAGINARY blocks are reported
Within the REAL PART block,
I would like to split each line that starts with FREQ.
I have managed to:
1) split lines and extract the value of FREQ and
2) append this result to a list of lists, and
3) create a final list, All_frequencies:
FREQ = []
fname ='data.dat'
f = open(fname, 'r')
for line in f:
if line.startswith(' FREQ'):
FREQS = line.split()
FREQ.append(FREQS)
print 'Final FREQ = ', FREQ
All_frequencies = list(itertools.chain.from_iterable(FREQ))
print 'All_frequencies = ', All_frequencies
The problem with this code is that it also extracts the IMAGINARY PART values of FREQ. Only the REAL PART values of FREQ would have to be extracted.
I have tried to make something like:
if line.startswith('REAL PART'):
if line.startswith('IMAGINARY PART'):
code...
or:
if line.startswith(' REAL') and line.startswith(' FREQ'):
code...
But this does not work. I would appreciate if you could help me
It appears based on the sample data in the question that lines starting with 'REAL' or 'IMAGINARY' don't have any data on them, they just mark the beginning of a block. If that's the case (and you don't go changing the question again), you just need to keep track of which block you're in. You can also use yield instead of building up an ever-larger list of frequencies, as long as this code is in a function.
def read_real_parts(fname):
f = open(fname, 'r')
real_part = False
for line in f:
if line.startswith(' REAL'):
real_part = True
elif line.startswith(' IMAGINARY'):
real_part = False
elif line.startswith(' FREQ') and real_part:
FREQS = line.split()
yield FREQS
FREQ = read_real_parts('data.dat') #this gives you a generator
All_frequencies = list(itertools.chain.from_iterable(FREQ)) #then convert to list
Think of this as a state machine having two states. In one state, when the program has read a line with REAL at the beginning it goes into the REAL state and aggregates values. When it reads a line with IMAGINARY it goes into the alternate state and ignores values.
REAL, IMAGINARY = 1,2
FREQ = []
fname = 'data.dat'
f = open(fname)
state = None
for line in f:
line = line.strip()
if not line: continue
if line.startswith('REAL'):
state = REAL
continue
elif line.startswith('IMAGINARY'):
state = IMAGINARY
continue
else:
pass
if state == IMAGINARY:
continue
freqs = line.split()[1:]
FREQ.extend(freqs)
I assume that you want only the numeric values; hence the [:1] at the end of the assignment to freqs near the end of the script.
Using your data file, without the ellipsis lines, produces the following result in FREQ:
['1.6', '5.4', '2.1', '13.15', '13.15', '17.71', '51.64', '51.64', '82.11', '133.15', '133.15', '167.71', '1.6', '5.4', '2.1', '13.15', '15.15', '17.71', '51.64', '57.64', '82.11', '183.15', '133.15', '167.71']
You would need to keep track of which part you are looking at, so you can use a flag to do this:
section = None #will change to either "real" or "imag"
for line in f:
if line.startswith("IMAGINARY PART"):
section = "imag"
elif line.startswith('REAL PART'):
section = "real"
else:
freqs = line.split()
if section == "real":
FREQ.append(freqs)
#elif section == "imag":
# IMAG_FREQ.append(freqs)
by the way, instead of appending to FREQ then needing to use itertools.chain.from_iterable you might consider just extending FREQ instead.
we start with a flag set to False. if we find a line that contains "REAL", we set it to True to start copying the data below the REAL part, until we find a line that contains IMAGINARY, which sets the flag to False and goes to the next line until another "REAL" is found (and hence the flag turns back to True)
using the flag concept in a simple way:
with open('this.txt', 'r') as content:
my_lines = content.readlines()
f=open('another.txt', 'w')
my_real_flag = False
for line in my_lines:
if "REAL" in line:
my_real_flag = True
elif "IMAGINARY" in line:
my_real_flag = False
if my_real_flag:
#do code here because we found real frequencies
f.write(line)
else:
continue #because my_real_flag isn't true, so we must have found a
f.close()
this.txt looks like this:
REAL
1
2
3
IMAGINARY
4
5
6
REAL
1
2
3
IMAGINARY
4
5
6
another.txt ends up looking like this:
REAL
1
2
3
REAL
1
2
3
Original answer that only works when there is one REAL section
If the file is "small" enough to be read as an entire string and there is only one instance of "IMAGINARY PART", you can do this:
file_str = file_str.split("IMAGINARY PART")[0]
which would get you everything above the "IMAGINARY PART" line.
You can then apply the rest of your code to this file_str string that contains only the real part
to elaborate more, file_str is a str which is obtained by the following:
with open('data.dat', 'r') as my_data:
file_str = my_data.read()
the "with" block is referenced all over stack exchange, so there may be a better explanation for it than mine. I intuitively think about it as "open a file named 'data.dat' with the ability to only read it and name it as the variable my_data. once its opened, read the entirety of the file into a str, file_str, using my_data.read(), then close 'data.dat' "
now you have a str, and you can apply all the applicable str functions to it.
if "IMAGINARY PART" happens frequently throughout the file or the file is too big, Tadgh's suggestion of a flag a break works well.
for line in f:
if "IMAGINARY PART" not in line:
#do stuff
else:
f.close()
break
I have two files both of which are tab delimited. One of the file is almost 800k lines and it is a An Exonic Coordinates file and the other file is almost 200k lines (It is a VCF File).
I am writing a code in python to find and filter the position in the VCF that is within an exonic coordinates (Exon Start and End from Exonic Coordinates File) and writes it to a file.
However, because the files are big, it took a couple of days to get the filtrated output file?
So the code below is partially solve the issue of speed but the problem is to figure out is to speed the filtration process which is why I used a break to exit the second loop and I want to start from the beginning of the outer loop instead taking the next element from the first loop (outer loop)?
Here is my code:
import
import sys
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
reader = csv.reader(csvfile, delimiter='\t')
for row in reader:
list_coord.append((row[0],row[1],row[2]))
def parseVcf(vcf,src):
done = False
with open(vcf,'r') as f:
reader=csv.reader((f),delimiter='\t')
vcf_out_split = vcf.split('.')
vcf_out_split.insert(2,"output_CORRECT2")
outpt = open('.'.join(vcf_out_split),'a')
for coord in list_coord:
for row in reader:
if '#' not in row[0]:
coor_genom = int(row[1])
coor_exon1 = int(coord[1])+1
coor_exon2 = int(coord[2])
coor_genom_chr = row[0]
coor_exon_chr = coord[0]
ComH = row[7].split(';')
for x in ComH:
if 'DP4=' in x:
DP4_split=x[4:].split(',')
if (coor_exon1 <= coor_genom <= coor_exon2):
if (coor_genom_chr == coor_exon_chr):
if ((int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1)):
done = True
outpt.write('\t'.join(row) + '\n')
if done:
break
outpt.close()
for root,dirs,files in os.walk("."):
for file in files:
pathname=os.path.join(root,file)
if file.find("1_1")==0:
print "Parsing " + file
parseVcf(pathname, "1_1")
ref_ordered.txt:
1 69090 70008
1 367658 368597
1 621095 622034
1 861321 861393
1 865534 865716
1 866418 866469
1 871151 871276
1 874419 874509
1_1 Input File:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT directory
1 14907 rs79585140 A G 20 . DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ 0/1:50,0,51:50
1 14930 rs75454623 A G 44 . DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38 GT:PL:GQ 0/1:74,0,58:61
1 15211 rs78601809 T G 9.33 . DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171 GT:PL:GQ 1/1:41,10,0:13
1 16146 . A C 25 . DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ 0/1:55,0,68:58
1 16257 rs78588380 G C 40 . DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308 GT:PL:GQ 0/1:70,0,249:73
1 16378 rs148220436 T C 39 . DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227 GT:PL:GQ 0/1:69,0,90:72
OUTPUT File:
1 877831 rs6672356 T C 44.8 . DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 878000 . C T 44.8 . DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 881627 rs2272757 G A 205 . DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1 GT:PL:GQ 1/1:238,27,0:51
First of all, I did not include any code because it looks like homework to me (I have had homework like this). I will however try to explain the steps I took to improve my scripts, even though I know my solutions are far from perfect.
your script could be slow because for every line in your csv file you open, write and close your output file. Try to make a list of lines you want to add to the output file, and after you are done with reading and filtering, then start writing.
You also might want to consider to write functions per filter and call these functions with the line as variable. That way you can easily add filters later on. I use a counter to keep track of the amount of succeeded filters and if in the end counter == len(amountOfUsedFilers) I add my line to the list.
Also, why do you use outpt = open('.'.join(vcf_out_split),'a') and with open(vcf,'r') as f: try to be consistent and smart in your choices.
Bioinformatics for the win!
If both of your files are ordered, you can save a lot of time by iterating over them in parallel, always advancing the one with lowest coordinates. This way you will only handle each line once, not many times.
Here's a basic version of your code that only does the coordinate checking (I don't fully understand your DP4 condition, so I'll leave it to you to add that part back in):
with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn) as out_f:
coords = csv.reader(coords_f, delimiter="\t")
vcf = csv.reader(vcf_f, delimiter="\t")
out = csv.writer(out_f, delimiter="\t")
next(vcf) # discard header row, or use out.writeline(next(vcf)) to preserve it!
try:
c = next(coords)
r = next(vcf)
while True:
if int(c[1]) >= int(r[1]): # vcf file is behind
r = next(vcf)
elif int(c[2]) < int(r[1]): # coords file is behind
c = next(coords)
else: # int(c[1]) < int(r[1]) <= int(c[2])
out.writeline(r) # add DP4 check here, and indent this line under it
r = next(vcf) # don't indent this line
except StopIteration: # one of the files has ended
pass