Iterating Over Two Large Lists using Python
I have two files, both tab-delimited. One is an exonic coordinates file of almost 800k lines, and the other is a VCF file of almost 200k lines.
I am writing Python code to find and filter the positions in the VCF that fall within an exonic interval (exon start and end from the exonic coordinates file) and write them to a file.
However, because the files are big, it took a couple of days to get the filtered output file.
The code below partially solves the speed issue, but the remaining problem is speeding up the filtration process itself. That is why I used a break to exit the second (inner) loop, and I want to start from the beginning of the outer loop instead of taking the next element of the outer loop.
Here is my code:
import csv
import os
import sys

# Load the exonic coordinates (chromosome, start, end) into memory.
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        list_coord.append((row[0], row[1], row[2]))

def parseVcf(vcf, src):
    done = False
    with open(vcf, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        vcf_out_split = vcf.split('.')
        vcf_out_split.insert(2, "output_CORRECT2")
        outpt = open('.'.join(vcf_out_split), 'a')
        for coord in list_coord:
            for row in reader:
                if '#' not in row[0]:
                    coor_genom = int(row[1])
                    coor_exon1 = int(coord[1]) + 1
                    coor_exon2 = int(coord[2])
                    coor_genom_chr = row[0]
                    coor_exon_chr = coord[0]
                    ComH = row[7].split(';')
                    for x in ComH:
                        if 'DP4=' in x:
                            DP4_split = x[4:].split(',')
                    if coor_exon1 <= coor_genom <= coor_exon2:
                        if coor_genom_chr == coor_exon_chr:
                            if int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1:
                                done = True
                                outpt.write('\t'.join(row) + '\n')
                if done:
                    break
        outpt.close()

# Walk the current directory and parse every file whose name starts with "1_1".
for root, dirs, files in os.walk("."):
    for file in files:
        pathname = os.path.join(root, file)
        if file.find("1_1") == 0:
            print "Parsing " + file
            parseVcf(pathname, "1_1")
ref_ordered.txt:
1 69090 70008
1 367658 368597
1 621095 622034
1 861321 861393
1 865534 865716
1 866418 866469
1 871151 871276
1 874419 874509
1_1 Input File:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT directory
1 14907 rs79585140 A G 20 . DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ 0/1:50,0,51:50
1 14930 rs75454623 A G 44 . DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38 GT:PL:GQ 0/1:74,0,58:61
1 15211 rs78601809 T G 9.33 . DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171 GT:PL:GQ 1/1:41,10,0:13
1 16146 . A C 25 . DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ 0/1:55,0,68:58
1 16257 rs78588380 G C 40 . DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308 GT:PL:GQ 0/1:70,0,249:73
1 16378 rs148220436 T C 39 . DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227 GT:PL:GQ 0/1:69,0,90:72
OUTPUT File:
1 877831 rs6672356 T C 44.8 . DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 878000 . C T 44.8 . DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 881627 rs2272757 G A 205 . DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1 GT:PL:GQ 1/1:238,27,0:51
First of all, I did not include any code because it looks like homework to me (I have had homework like this). I will however try to explain the steps I took to improve my scripts, even though I know my solutions are far from perfect.
Your script could be slow because for every line in your csv file you open, write to, and close your output file. Try to build a list of lines you want to add to the output file, and only after you are done reading and filtering, start writing.
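In general terms, the buffering pattern looks something like this; it's only a sketch, and the filenames and the passes_filters placeholder are illustrative, not your actual filtering logic:

def passes_filters(line):
    # Placeholder: substitute the real coordinate/DP4 checks here.
    return not line.startswith('#')

filtered = []
with open('input.vcf', 'r') as f:
    for line in f:
        if passes_filters(line):
            filtered.append(line)

with open('output.vcf', 'w') as out:
    out.writelines(filtered)  # one buffered write at the end instead of many small ones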
You also might want to consider writing a function per filter and calling these functions with the line as an argument. That way you can easily add filters later on. I use a counter to keep track of the number of filters that succeeded, and if in the end counter == len(amountOfUsedFilters), I add my line to the list.
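A generic sketch of what I mean; the two filters mirror checks from the question's code, but the names and structure are illustrative only:

def in_exon(row, coord):
    # Position falls inside the exon interval (start, end].
    return int(coord[1]) + 1 <= int(row[1]) <= int(coord[2])

def same_chromosome(row, coord):
    return row[0] == coord[0]

filters = [in_exon, same_chromosome]

def passes_all(row, coord):
    # Count the filters that succeed; keep the row only if all of them did.
    counter = sum(1 for f in filters if f(row, coord))
    return counter == len(filters)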
Also, why do you use outpt = open('.'.join(vcf_out_split), 'a') in one place and with open(vcf, 'r') as f: in another? Try to be consistent and smart in your choices.
Bioinformatics for the win!
If both of your files are ordered, you can save a lot of time by iterating over them in parallel, always advancing the one with the lower coordinate. This way you will only handle each line once, not many times.
Here's a basic version of your code that only does the coordinate checking (I don't fully understand your DP4 condition, so I'll leave it to you to add that part back in):
import csv

with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn, 'w') as out_f:
    coords = csv.reader(coords_f, delimiter="\t")
    vcf = csv.reader(vcf_f, delimiter="\t")
    out = csv.writer(out_f, delimiter="\t")
    next(vcf)  # discard header row, or use out.writerow(next(vcf)) to preserve it!
    try:
        c = next(coords)
        r = next(vcf)
        while True:
            if int(c[1]) >= int(r[1]):    # vcf file is behind
                r = next(vcf)
            elif int(c[2]) < int(r[1]):   # coords file is behind
                c = next(coords)
            else:                         # int(c[1]) < int(r[1]) <= int(c[2])
                out.writerow(r)           # add DP4 check here, and indent this line under it
                r = next(vcf)             # don't indent this line
    except StopIteration:  # one of the files has ended
        pass
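If you do want a starting point for the DP4 part, here is a sketch that mirrors the condition from the question's code; the helper name dp4_ok is made up for illustration:

def dp4_ok(row):
    # row[7] is the INFO column; find the DP4=a,b,c,d entry and require
    # fields 3 and 4 (the alt-supporting forward/reverse read counts in
    # samtools output) to each be at least 1, as in the original script.
    for field in row[7].split(';'):
        if field.startswith('DP4='):
            dp4 = field[4:].split(',')
            return int(dp4[2]) >= 1 and int(dp4[3]) >= 1
    return False

With that helper, the match branch above would become:

            else:                         # int(c[1]) < int(r[1]) <= int(c[2])
                if dp4_ok(r):
                    out.writerow(r)
                r = next(vcf)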
Related
Looking for first line of data with python
I have a data file that looks like this, and the file type is a list:

############################################################
# Tool
# File: test
#
# mass: mass in GeV
# spectrum: from 1 to 100 GeV
###########################################################
# mass (GeV) spectrum (1-100 GeV)
10 0.2822771608053263
20 0.8697454394829301
30 1.430461657476815
40 1.9349004472432392
50 2.3876849629827412
60 2.796620869276766
70 3.1726347734996727
80 3.5235401505002244
90 3.8513847250834106
100 4.157478780924807

To read the data I would normally have to count how many lines come before the first set of numbers and then loop through the file from there. In this file it's 8 lines:

spectrum = []
mass = []
with open('test.in') as m:
    test = m.readlines()
for i in range(8, len(test)):
    single_line = test[i].split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))

Let's say I didn't want to open the file to check how many lines there are between the intro statement and the first line of data points. How would I make Python automatically start at the first line of data points and go through to the end of the file?
This is a general solution, but it should work in your specific case: for each line, check whether it starts with a number. Pseudo-code:

for line in test:
    if line.split()[0].isdigit():
        DoStuffWithData
spectrum = []
mass = []
with open('test.in') as m:
    test = m.readlines()
for line in test:
    if line[0] == '#':
        continue
    single_line = line.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
You can filter out all lines that start with # by regex or the startswith method of string:

import re

spectrum = []
mass = []
with open('test.in') as m:
    test = [i for i in m.readlines() if not re.match("^#.*", i)]
for i in test:
    single_line = i.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))

OR

spectrum = []
mass = []
with open('test.in') as m:
    test = [i for i in m.readlines() if not i.startswith("#")]
for i in test:
    single_line = i.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))

This will filter out all the lines that start with #.
Pseudo code:

for r in m:
    if r.startswith('#'):
        continue
    spt = r.split('\t')
    if len(spt) < 2:
        continue
    ## todo: .....
Python: How to compare string from two text files and retrieve an additional line of one in case of match
I have found so much information from previous searches on this website, but I seem to be stuck on the following issue. I have two text files that look like this:

Inter.txt (n lines, but only 4 shown; you get the idea):

7275
30000
6693
855
....

rules.txt (2n lines):

7275
8500
6693
7555
....
3
1000
8
5
....

I want to compare the first line of Inter.txt with rules.txt and, in case of a match, jump n lines down to get the score of that line. (E.g. with 7275 there is a match, so I jump n to get the score 3.) I produced the following code, but for some reason I only get output for the first line when I should have one for each match from my first file. With the previous example, I should also get 8 as an output for 6693.

import linecache

inter = open("Inter.txt", "r")
rules = open("rules.txt", "r")
iScore = 0
jump = 266
i = 0
for lineInt in inter:
    #i = i+1
    #print(i)
    for lineRul in rules:
        i = i + 1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue

All the print(i) calls are there because I checked that all the lines were read. I am a novice in Python. To sum up, I don't understand why I only have one output. Thanks in advance!
OK, I think the main thing blocking you is that a for loop over a file advances the file pointer to the end of the file and doesn't reset it when you start the loop again. Since you only open rules.txt once and use its instance in the inner loop, it only goes through all the lines in the first iteration of the outer loop; the second time, it tries to go over the remaining lines, of which there are none. The solution is to close and open the file outside the inner loop. This code worked for me:

import linecache

inter = open("Inter.txt", "r")
iScore = 0
jump = 4
for lineInt in inter:
    i = 0
    #i = i+1
    #print(i)
    rules = open("rules.txt", "r")
    for lineRul in rules:
        i = i + 1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
    rules.close()

I also moved where you set i to 0 to the beginning of the outer loop, but I guess you'd find that yourself. And I changed jump to 4 to fit the example files you gave :p
Can you please try this solution:

def get_rules_values(rules_file):
    with open(rules_file, "r") as rules:
        return map(int, rules.readlines())

def get_rules_dict(rules_values):
    return dict(zip(rules_values[:len(rules_values)/2], rules_values[len(rules_values)/2:]))

def get_inter_values(inter_file):
    with open(inter_file, "r") as inter:
        return map(int, inter.readlines())

rules_dict = get_rules_dict(get_rules_values("rules.txt"))
inter_values = get_inter_values("inter.txt")
for inter_value in inter_values:
    print inter_value, rules_dict[inter_value]

Hope it's working for you!
Double if conditional in the line.startswith strategy
I have a data.dat file with this format:

 REAL PART
 FREQ 1.6 5.4 2.1 13.15 13.15 17.71
 FREQ 51.64 51.64 82.11 133.15 133.15 167.71
 .
 .
 .
 IMAGINARY PART
 FREQ 51.64 51.64 82.12 132.15 129.15 161.71
 FREQ 5.64 51.64 83.09 131.15 120.15 160.7
 .
 .
 .
 REAL PART
 FREQ 1.6 5.4 2.1 13.15 15.15 17.71
 FREQ 51.64 57.64 82.11 183.15 133.15 167.71
 .
 .
 .
 IMAGINARY PART
 FREQ 53.64 53.64 81.12 132.15 129.15 161.71
 FREQ 5.64 55.64 83.09 131.15 120.15 160.7

REAL PART and IMAGINARY PART blocks alternate all over the document. Within each REAL PART block, I would like to split each line that starts with FREQ. I have managed to: 1) split lines and extract the value of FREQ, 2) append this result to a list of lists, and 3) create a final list, All_frequencies:

import itertools

FREQ = []
fname = 'data.dat'
f = open(fname, 'r')
for line in f:
    if line.startswith(' FREQ'):
        FREQS = line.split()
        FREQ.append(FREQS)
print 'Final FREQ = ', FREQ
All_frequencies = list(itertools.chain.from_iterable(FREQ))
print 'All_frequencies = ', All_frequencies

The problem with this code is that it also extracts the IMAGINARY PART values of FREQ. Only the REAL PART values of FREQ should be extracted. I have tried something like:

if line.startswith('REAL PART'):
    if line.startswith('IMAGINARY PART'):
        code...

or:

if line.startswith(' REAL') and line.startswith(' FREQ'):
    code...

But this does not work. I would appreciate it if you could help me.
It appears based on the sample data in the question that lines starting with 'REAL' or 'IMAGINARY' don't have any data on them; they just mark the beginning of a block. If that's the case (and you don't go changing the question again), you just need to keep track of which block you're in. You can also use yield instead of building up an ever-larger list of frequencies, as long as this code is in a function.

import itertools

def read_real_parts(fname):
    f = open(fname, 'r')
    real_part = False
    for line in f:
        if line.startswith(' REAL'):
            real_part = True
        elif line.startswith(' IMAGINARY'):
            real_part = False
        elif line.startswith(' FREQ') and real_part:
            FREQS = line.split()
            yield FREQS

FREQ = read_real_parts('data.dat') #this gives you a generator
All_frequencies = list(itertools.chain.from_iterable(FREQ)) #then convert to list
Think of this as a state machine having two states. In one state, when the program has read a line with REAL at the beginning, it goes into the REAL state and aggregates values. When it reads a line with IMAGINARY, it goes into the alternate state and ignores values.

REAL, IMAGINARY = 1, 2
FREQ = []
fname = 'data.dat'
f = open(fname)
state = None

for line in f:
    line = line.strip()
    if not line:
        continue
    if line.startswith('REAL'):
        state = REAL
        continue
    elif line.startswith('IMAGINARY'):
        state = IMAGINARY
        continue
    else:
        pass
    if state == IMAGINARY:
        continue
    freqs = line.split()[1:]
    FREQ.extend(freqs)

I assume that you want only the numeric values; hence the [1:] at the end of the assignment to freqs near the end of the script. Using your data file, without the ellipsis lines, produces the following result in FREQ:

['1.6', '5.4', '2.1', '13.15', '13.15', '17.71', '51.64', '51.64', '82.11', '133.15', '133.15', '167.71', '1.6', '5.4', '2.1', '13.15', '15.15', '17.71', '51.64', '57.64', '82.11', '183.15', '133.15', '167.71']
You would need to keep track of which part you are looking at, so you can use a flag to do this:

section = None #will change to either "real" or "imag"
for line in f:
    if line.startswith("IMAGINARY PART"):
        section = "imag"
    elif line.startswith('REAL PART'):
        section = "real"
    else:
        freqs = line.split()
        if section == "real":
            FREQ.append(freqs)
        #elif section == "imag":
        #    IMAG_FREQ.append(freqs)

By the way, instead of appending to FREQ and then needing itertools.chain.from_iterable, you might consider just extending FREQ instead.
We start with a flag set to False. If we find a line that contains "REAL", we set it to True to start copying the data below the REAL part, until we find a line that contains IMAGINARY, which sets the flag to False and skips to the next line, until another "REAL" is found (and hence the flag turns back to True). Using the flag concept in a simple way:

with open('this.txt', 'r') as content:
    my_lines = content.readlines()

f = open('another.txt', 'w')
my_real_flag = False
for line in my_lines:
    if "REAL" in line:
        my_real_flag = True
    elif "IMAGINARY" in line:
        my_real_flag = False
    if my_real_flag:
        #do code here because we found real frequencies
        f.write(line)
    else:
        continue #because my_real_flag isn't true, so we must have found an IMAGINARY line
f.close()

this.txt looks like this:

REAL
1
2
3
IMAGINARY
4
5
6
REAL
1
2
3
IMAGINARY
4
5
6

another.txt ends up looking like this:

REAL
1
2
3
REAL
1
2
3

Original answer that only works when there is one REAL section:

If the file is "small" enough to be read as an entire string and there is only one instance of "IMAGINARY PART", you can do this:

file_str = file_str.split("IMAGINARY PART")[0]

which would get you everything above the "IMAGINARY PART" line. You can then apply the rest of your code to this file_str string that contains only the real part.

To elaborate more, file_str is a str obtained as follows:

with open('data.dat', 'r') as my_data:
    file_str = my_data.read()

The "with" block is referenced all over Stack Exchange, so there may be a better explanation for it than mine. I intuitively think about it as: "open a file named 'data.dat' with the ability to only read it, and name it as the variable my_data. Once it's opened, read the entirety of the file into a str, file_str, using my_data.read(), then close 'data.dat'."

Now you have a str, and you can apply all the applicable str functions to it.

If "IMAGINARY PART" happens frequently throughout the file or the file is too big, Tadgh's suggestion of a flag and a break works well:

for line in f:
    if "IMAGINARY PART" not in line:
        #do stuff
    else:
        f.close()
        break
how to create an index to parse big text file
I have two files A and B in FASTQ format, which are basically several hundred million lines of text organized in groups of 4 lines starting with an # as follows:

#120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB

I need to compare the 5:1101:1156:2031#0/ part between files A and B and write the groups of 4 lines in file B that matched to a new file. I got a piece of Python code that does that, but it only works for small files, as it parses through the entire set of #-lines of file B for every #-line in file A, and both files contain hundreds of millions of lines. Someone suggested that I should create an index for file B; I have googled around without success and would be very grateful if someone could point out how to do this or let me know of a tutorial so I can learn. Thanks.

==EDIT==

In theory each group of 4 lines should only exist once in each file. Would it increase the speed enough to break the parsing after each match, or do I need a different algorithm altogether?
An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon (':') on the #-line and the final slash ('/') near the end - as well as some kind of value.

Since the "value" in this case is the entire contents of the 4-line block, and since our index is going to store a separate entry for each block, we would be storing the entire file in memory if we used the actual value in the index.

Instead, let's use the file position of the beginning of the 4-line block. That way, you can move to that file position, print 4 lines, and stop. Total cost is the 4 or 8 or however many bytes it takes to store an integer file position, instead of however-many bytes of actual genome data.

Here is some code that does the job, but also does a lot of validation and checking. You might want to throw stuff away that you don't use.

import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates- use first occurrence.
            index[key] = pos
    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('#'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:     # #machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [ line ]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:   # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:           # lx == 3 : quality data
            data.append(line)
            len2 = len(line)
            if len2 != len1:
                error("Data length mismatch at line " + str(i-2)
                      + " (len: " + str(len1) + ") and line " + str(i)
                      + " (len: " + str(len2) + ")\n")
            #print "Yielding #%i: %s" % (pos, key)
            yield key, pos, data
            pos = offset
    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i))

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)
    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')
    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [ index[k] for k in matches ]
write_matches('afile.fastq', posns, 'outfile.fastq')

Note that this code goes back to the first file to get the blocks of data. If your data is identical between files, you would be able to copy the block from the second file when a match occurs.

Note also that depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to make sure that the keys are unique, or perhaps make sure the keys are not unique but are repeated in the order they match. That's up to you - I'm not sure what you're doing with the data.
These guys claim to parse files of a few gigabytes while using a dedicated library; see http://www.biostars.org/p/15113/

fastq_parser = SeqIO.parse(fastq_filename, "fastq")
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")

A better approach IMO would be to parse it once and load the data into some database instead of that output_file (e.g. MySQL) and later run the queries there.
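For example, the elided condition could be a membership test against the identifiers collected from file A; the set name wanted_ids and the file names below are assumptions for illustration, and this presumes standard FASTQ headers that Biopython can parse:

from Bio import SeqIO

wanted_ids = set()  # fill with the keys extracted from file A
fastq_parser = SeqIO.parse("bfile.fastq", "fastq")
# rec.id is the full read identifier; you may need to trim it down to the
# 5:1101:1156:2031#0/ style key the question compares on.
wanted = (rec for rec in fastq_parser if rec.id in wanted_ids)
SeqIO.write(wanted, "matches.fastq", "fastq")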
Complex parsing query
I have a very complex parsing problem. Any thoughts would be appreciated here. I have a test.dat file. The file to be parsed looks like this:

* Number = 40
Time = 0
1 10.13
10 10.11
12 13
.
.
Time = n
1 10
10 10
12.50 13
.
.

There are N time blocks and each block has 40 lines, as shown above. What I would like to do is take, e.g., the 1st line of the first block, then the 1st line of block #2, and so on, and add them to a new file test_1.dat. Similarly, the 2nd line of every block goes to test_2.dat, and so on. The lines in the block should be written as-is to the new _n.dat file. Is there any way to do this? The number I have assumed here is 40, so if * Number = 40 there will be 40 lines under each time block.

regards, Ris
You can read the file in as a list of strings (call it fileList), where each string is a different line:

f = open('filename')
fileList = f.readlines()

Then, remove the "header" part of your file with:

fileList.pop(0)
fileList.pop(0)

Then, do:

outFileContents = {} # This will be a dict, where number -> content of test_number.dat
for outFileName in range(1, 41): #outFileName will be the number going after the _ in your filename
    outFileContents[outFileName] = []
    for n in range(40): # Counting through the time blocks
        currentRowIndex = (42 * n) + outFileName # 42 to account for the Time = and blank row
        outFileContents[outFileName].append(fileList[currentRowIndex])

Finally, you can loop through outFileContents and write the contents of each value to separate files.
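A minimal sketch of that final writing step, assuming the test_<number>.dat naming from the question:

for outFileName, lines in outFileContents.items():
    with open('test_%d.dat' % outFileName, 'w') as outFile:
        outFile.writelines(lines)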