how to create an index to parse big text file - python

I have two files A and B in FASTQ format, which are basically several hundred million lines of text organized in groups of 4 lines, each group starting with a #, as follows:
#120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB
I need to compare the
5:1101:1156:2031#0/
part between files A and B, and write the groups of 4 lines in file B that matched to a new file. I have a piece of Python code that does this, but it only works for small files, since it parses through the entire set of #-lines of file B for every #-line in file A, and both files contain hundreds of millions of lines.
Someone suggested that I should create an index for file B; I have googled around without success and would be very grateful if someone could point out how to do this or let me know of a tutorial so I can learn. Thanks.
==EDIT==
In theory each group of 4 lines should only exist once in each file. Would it increase the speed enough to break out of the parsing after each match, or do I need a different algorithm altogether?

An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon(':') on the #-line and the final slash('/') near the end - as well as some kind of value.
Since the "value" in this case is the entire contents of the 4-line block, and since our index is going to store a separate entry for each block, we would be storing the entire file in memory if we used the actual value in the index.
Instead, let's use the file position of the beginning of the 4-line block. That way, you can move to that file position, print 4 lines, and stop. Total cost is the 4 or 8 or however many bytes it takes to store an integer file position, instead of however-many bytes of actual genome data.
Here is some code that does the job, but also does a lot of validation and checking. You might want to throw stuff away that you don't use.
import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates - use first occurrence.
            index[key] = pos
    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('#'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:  # #machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [line]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:  # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:  # lx == 3 : quality data
            data.append(line)
            len2 = len(line)
            if len2 != len1:
                error("Data length mismatch at line "
                      + str(i-2)
                      + " (len: " + str(len1) + ") and line "
                      + str(i)
                      + " (len: " + str(len2) + ")\n")
            #print "Yielding #%i: %s" % (pos, key)
            yield key, pos, data
            pos = offset
    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i))

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)
    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')
    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [index[k] for k in matches]
write_matches('afile.fastq', posns, 'outfile.fastq')
Note that this code goes back to the first file to get the blocks of data. If your data is identical between files, you would be able to copy the block from the second file when a match occurs.
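For instance, a minimal sketch of that variant, reusing parse_fastq from above; the function name write_matches_from_b is made up here, and it relies on parse_fastq already yielding the 4-line block as data:
def write_matches_from_b(bpath, index, outpath):
    # Copy each matching 4-line block straight from file B,
    # instead of seeking back into file A.
    with open(outpath, 'w') as wf:
        for key, pos, data in parse_fastq(bpath):
            if key in index:
                wf.writelines(data)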
Note also that depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to make sure that the keys are unique, or perhaps make sure the keys are not unique but are repeated in the order they match. That's up to you - I'm not sure what you're doing with the data.

This answer claims to parse files of a few gigabytes using a dedicated library; see http://www.biostars.org/p/15113/
from Bio import SeqIO
fastq_parser = SeqIO.parse(fastq_filename, "fastq")
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")
A better approach IMO would be to parse it once and load the data into some database instead of that output_file (e.g. MySQL), and later run the queries there.
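A minimal sketch of that database idea, using SQLite from the standard library instead of MySQL; the database file, table name, and file names here are illustrative only:
import sqlite3
from Bio import SeqIO

conn = sqlite3.connect('reads.db')  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS reads (key TEXT PRIMARY KEY, record TEXT)")

# Load file B once.
for rec in SeqIO.parse('bfile.fastq', 'fastq'):
    conn.execute("INSERT OR IGNORE INTO reads VALUES (?, ?)",
                 (rec.id, rec.format('fastq')))
conn.commit()

# Then stream file A and pull matching records with indexed lookups.
with open('outfile.fastq', 'w') as out:
    for rec in SeqIO.parse('afile.fastq', 'fastq'):
        row = conn.execute("SELECT record FROM reads WHERE key = ?",
                           (rec.id,)).fetchone()
        if row:
            out.write(row[0])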

Related

Replace space-char by newline-char within max-line-length in python

I am trying to write a Python script which breaks a continuous string into lines when max_line_length has been exceeded. It shall not break words, and therefore searches for the last occurrence of a whitespace character, which is then replaced by a newline character. For some reason it does not break within the specified limit: e.g. when defining max_line_length = 80, the text sometimes breaks at 82 or 83, etc. I have been trying to fix the problem for quite some time, but it feels like I have tunnel vision and don't see the problem here:
#!/usr/bin/python
import sys

if len(sys.argv) < 3:
    print('usage: $ python3 breaktext.py <max_line_length> <file>')
    print('example: $ python3 breaktext.py 80 infile.txt')
    exit()

filename = str(sys.argv[2])
with open(filename, 'r') as file:
    text_str = file.read().replace('\n', '')

m = int(sys.argv[1])  # max_line_length
text_list = list(text_str)  # convert string to list

l = 0;  # line_number
i = m+1  # line_character_index
index = m+1  # total_list_index

while index < len(text_list):
    while text_list[l * m + i] != ' ':
        i -= 1
        pass
    text_list[l * m + i] = '\n'
    l += 1
    i = m+1
    index += m+1
    pass

text_str = ''.join(text_list)
print(text_str)
I guess we'll take this from the top.
text_str = file.read().replace('\n', '')
Here's one assumption about the input data that I don't know is true: you're replacing all the newline characters with nothing, so if there weren't spaces next to them, the words on either side get fused together, and the code below can never break the lines in the same places.
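For instance, with a made-up two-line input:
text = "line one\nline two"
print(text.replace('\n', ''))  # 'line oneline two' - "one" and "line" fuse, with no space left to break on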
text_list = list(text_str) # convert string to list
This splits the input file into single character strings. I guess you might have done so to make it mutable, such that you can replace individual characters, but it's a very expensive operation and loses all the features of a string. Python is a high level language that would allow you to split into e.g. words instead.
index = m+1  # total_list_index
while index < len(text_list):
    #...
    index += m+1
Let's consider what this means. We're not entering the loop if index exceeds the text_list length. But index is advancing in steps of m+1. So we're splitting math.floor(len(text)/(max_line_length+1)) times. Unless every line is exactly max_line_length characters, not counting the space we replace with a newline, that's too few times. Too few times means too-long lines, at least at the end.
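To put rough numbers on it, with an illustrative 1000-character text and max_line_length = 80:
import math
print(math.floor(1000 / (80 + 1)))  # 12 - the loop body runs about this many times,
                                    # though shorter lines mean more breaks are needed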
l = 0;  # line_number
i = m+1  # line_character_index
#loop:
while text_list[l * m + i] != ' ':
    i -= 1
text_list[l * m + i] = '\n'
l += 1
i = m+1
This is making things difficult with index math. Quite clearly the one index we ever use is l * m + i. This moves in quite an odd way: it searches backwards for a space, then leaps forward as l increments and i resets. Whatever position it had reversed to is lost, as all the leaps are in steps of m.
Let's apply m=5 to the string "Fee fie faw fum who did you see now". For the first iteration, 0 * 5 + 5+1 hits the second word, and i seeks back to the first space. The first line then is "Fee", as expected. The second search starts at 1*5 + 5+1, which is a space, and the second line becomes "fie faw", which already exceeds our limit of 5! The reason is that l * m isn't the beginning of the line; it's actually in the middle of "fie", a discrepancy which can only grow as you continue through the file. It grows whenever you split off a line that is shorter than m.
The solution involves remembering where you did your split. That could be as simple as replacing l * m with index, and updating it by index += i instead of m+1.
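A minimal sketch of that fix, keeping the question's variables (and note it still shares the long-word flaw discussed next):
index = 0  # position where the current line actually starts
while index + m < len(text_list):
    i = m  # search backwards from the would-be break point
    while text_list[index + i] != ' ':
        i -= 1
    text_list[index + i] = '\n'  # this line is i characters long, i <= m
    index += i + 1  # the next line starts just past the newline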
Another odd effect happens if you ever encounter a word that exceeds the maximum line length. Beyond meaning a line is longer than the limit, i will still search backwards until it finds a space; that space could then be in an earlier line altogether, producing extra short lines as well as too long ones. That's a result of handling the entire text as one array and not limiting which section we're looking at.
Personally I'd much rather use Python's built-in methods, such as str.rindex, which can find a particular character in a given region within a string:
s = "Fee fie faw fum who did you see now"
maxlen = 5
start = 8
end = s.rindex(' ', start, start+maxlen)
print(s[start:end])
start = end + 1
We can also, as PaulMcG pointed out, go full "batteries included" and use the standard library textwrap module for the entire task.
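For example, with the same test string (textwrap breaks on whitespace by default):
import textwrap

s = "Fee fie faw fum who did you see now"
print(textwrap.fill(s, width=5))  # returns one string with the newlines already inserted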

Drawing multiple sequences from 1 file, based on shared fields in another file

I'm trying to run a Python script to draw sequences from a separate file (merged.fas), with respect to a list (gene_fams_eggnog.txt) I have as output from another program.
The code is as follows:
from Bio import SeqIO
import os, sys, re
from collections import defaultdict

sequences = "merged.fas"
all_seqs = SeqIO.index(sequences, "fasta")
gene_fams = defaultdict(list)
gene_fams_file = open("gene_fams_eggnog.txt")

for line in gene_fams_file:
    fields = re.split("\t", line.rstrip())
    gene_fams[fields[0]].append[fields[1]]

for fam in gene_fams.keys():
    output_filename = str(fam) + ".fasta"
    outh = open(output_filename, "w")
    for id in gene_fams[fam]:
        if id in all_seqs:
            outh.write(">" + all_seqs[id].description + "\n" + str(all_seqs[id].seq) + "\n")
        else:
            print "Uh oh! Sequence with ID " + str(id) + " is not in the all_seqs file!"
            quit()
    outh.close()
The list looks like this:
1 Saccharomycescerevisiae_DAA09367.1
1 bieneu_EED42827.1
1 Asp_XP_749186.1
1 Mag_XP_003717339.1
2 Mag_XP_003716586.1
2 Mag_XP_003709453.1
3 Asp_XP_749329.1
Field 0 denotes a grouping based on similarity between the sequences. The script was meant to take all the sequences from merged.fas that correspond to the code in field 1 and write them into a file named after field 0.
So in the case of the portion of the list I have shown, all the sequences that have a 1 in field 0 (Saccharomycescerevisiae_DAA09367.1, bieneu_EED42827.1, Asp_XP_749186.1, Mag_XP_003717339.1) would have been written into a file called 1.fasta. This should continue from 2.fasta through however many groups there are.
This has worked; however, it doesn't include all the sequences in each group, only the last one listed for that group. Using my example above, I'd only have a file (1.fasta) with one sequence (Mag_XP_003717339.1), instead of all four.
Any and all help is appreciated,
Thanks,
JT
Although I didn't spot the cause of the issue you complained about, I'm surprised your code runs at all with this error:
gene_fams[fields[0]].append[fields[1]]
i.e. append[...] instead of append(...). But perhaps that's also "not there in the actual script I'm running". I rewrote your script below, and it works fine for me. One issue was your use of the variable name id, which is a Python builtin. You'll see I go to an extreme to avoid such errors:
from Bio import SeqIO
from collections import defaultdict

SEQUENCE_FILE_NAME = "merged.fas"
FAMILY_FILE_NAME = "gene_families_eggnog.txt"

all_sequences = SeqIO.index(SEQUENCE_FILE_NAME, "fasta")
gene_families = defaultdict(list)

with open(FAMILY_FILE_NAME) as gene_families_file:
    for line in gene_families_file:
        family_id, gene_id = line.rstrip().split()
        gene_families[family_id].append(gene_id)

for family_id, gene_ids in gene_families.items():
    output_filename = family_id + ".fasta"
    with open(output_filename, "w") as output:
        for gene_id in gene_ids:
            assert gene_id in all_sequences, "Sequence {} is not in {}!".format(gene_id, SEQUENCE_FILE_NAME)
            output.write(all_sequences[gene_id].format("fasta"))

Python: How to compare string from two text files and retrieve an additional line of one in case of match

I have found so much information from previous search on this website but I seem to be stuck on the following issue.
I have two text files that look like this:
Inter.txt (n lines, but only 4 shown; you get the idea)
7275
30000
6693
855
....
rules.txt (2n lines)
7275
8500
6693
7555
....
3
1000
8
5
....
I want to compare the first line of Inter.txt with rules.txt, and in case of a match, jump n lines in order to get the score of that line (e.g. with 7275 there is a match, so I jump n to get the score 3).
I produced the following code, but for some reason I only get output for the first line, when I should have one for each match from my first file. With the previous example, I should get 8 as an output for 6693.
import linecache

inter = open("Inter.txt", "r")
rules = open("rules.txt", "r")
iScore = 0
jump = 266
i = 0

for lineInt in inter:
    #i = i+1
    #print(i)
    for lineRul in rules:
        i = i+1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
All the print(i) are there because I checked that all the lines were read. I am a novice in Python.
To sum up, I don't understand why I only have one output. Thanks in advance!
OK, I think the main thing keeping you from moving forward is that a for loop over a file advances the file pointer to the end of the file, and it doesn't reset when you start the loop again.
So since you only open rules.txt once and use its instance in the inner loop, it only goes through all the lines on the first iteration of the outer loop; the second time, it tries to go over the remaining lines, of which there are none.
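You can see this with a quick experiment on any text file:
f = open("rules.txt")
print(len(list(f)))   # all the lines on the first pass
print(len(list(f)))   # 0 - the pointer is already at end-of-file
f.close()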
The solution is to close and reopen the file outside the inner loop.
This code worked for me.
import linecache

inter = open("Inter.txt", "r")
iScore = 0
jump = 4

for lineInt in inter:
    i = 0
    #i = i+1
    #print(i)
    rules = open("rules.txt", "r")
    for lineRul in rules:
        i = i+1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
    rules.close()
I also moved setting i to 0 to the beginning of the outer loop, but I guess you'd have found that yourself.
And I changed jump to 4 to fit the example files you gave :p
Can you please try this solution:
def get_rules_values(rules_file):
    with open(rules_file, "r") as rules:
        return map(int, rules.readlines())

def get_rules_dict(rules_values):
    return dict(zip(rules_values[:len(rules_values)/2], rules_values[len(rules_values)/2:]))

def get_inter_values(inter_file):
    with open(inter_file, "r") as inter:
        return map(int, inter.readlines())

rules_dict = get_rules_dict(get_rules_values("rules.txt"))
inter_values = get_inter_values("inter.txt")

for inter_value in inter_values:
    print inter_value, rules_dict[inter_value]
Hope it's working for you!
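To see what get_rules_dict builds, here's a quick trace with the sample values from the question hard-coded (zip pairs the first half of rules.txt with the second half):
values = [7275, 8500, 6693, 7555, 3, 1000, 8, 5]  # rules.txt contents as ints
print get_rules_dict(values)  # {7275: 3, 8500: 1000, 6693: 8, 7555: 5}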

Double if conditional in the line.startswith strategy

I have a data.dat file with this format:
REAL PART
FREQ 1.6 5.4 2.1 13.15 13.15 17.71
FREQ 51.64 51.64 82.11 133.15 133.15 167.71
.
.
.
IMAGINARY PART
FREQ 51.64 51.64 82.12 132.15 129.15 161.71
FREQ 5.64 51.64 83.09 131.15 120.15 160.7
.
.
.
REAL PART
FREQ 1.6 5.4 2.1 13.15 15.15 17.71
FREQ 51.64 57.64 82.11 183.15 133.15 167.71
.
.
.
IMAGINARY PART
FREQ 53.64 53.64 81.12 132.15 129.15 161.71
FREQ 5.64 55.64 83.09 131.15 120.15 160.7
REAL PART and IMAGINARY PART blocks are repeated all over the document. Within the REAL PART blocks, I would like to split each line that starts with FREQ. I have managed to: 1) split lines and extract the value of FREQ, 2) append this result to a list of lists, and 3) create a final list, All_frequencies:
import itertools

FREQ = []
fname = 'data.dat'
f = open(fname, 'r')
for line in f:
    if line.startswith(' FREQ'):
        FREQS = line.split()
        FREQ.append(FREQS)
print 'Final FREQ = ', FREQ

All_frequencies = list(itertools.chain.from_iterable(FREQ))
print 'All_frequencies = ', All_frequencies
The problem with this code is that it also extracts the IMAGINARY PART values of FREQ, when only the REAL PART values should be extracted.
I have tried something like:
if line.startswith('REAL PART'):
    if line.startswith('IMAGINARY PART'):
        code...
or:
if line.startswith(' REAL') and line.startswith(' FREQ'):
    code...
But this does not work. I would appreciate it if you could help me.
It appears based on the sample data in the question that lines starting with 'REAL' or 'IMAGINARY' don't have any data on them, they just mark the beginning of a block. If that's the case (and you don't go changing the question again), you just need to keep track of which block you're in. You can also use yield instead of building up an ever-larger list of frequencies, as long as this code is in a function.
import itertools

def read_real_parts(fname):
    f = open(fname, 'r')
    real_part = False
    for line in f:
        if line.startswith(' REAL'):
            real_part = True
        elif line.startswith(' IMAGINARY'):
            real_part = False
        elif line.startswith(' FREQ') and real_part:
            FREQS = line.split()
            yield FREQS

FREQ = read_real_parts('data.dat')  # this gives you a generator
All_frequencies = list(itertools.chain.from_iterable(FREQ))  # then convert to list
Think of this as a state machine having two states. In one state, when the program has read a line with REAL at the beginning it goes into the REAL state and aggregates values. When it reads a line with IMAGINARY it goes into the alternate state and ignores values.
REAL, IMAGINARY = 1, 2
FREQ = []
fname = 'data.dat'
f = open(fname)
state = None

for line in f:
    line = line.strip()
    if not line:
        continue
    if line.startswith('REAL'):
        state = REAL
        continue
    elif line.startswith('IMAGINARY'):
        state = IMAGINARY
        continue
    else:
        pass
    if state == IMAGINARY:
        continue
    freqs = line.split()[1:]
    FREQ.extend(freqs)
I assume that you want only the numeric values; hence the [1:] at the end of the assignment to freqs near the end of the script.
Using your data file, without the ellipsis lines, produces the following result in FREQ:
['1.6', '5.4', '2.1', '13.15', '13.15', '17.71', '51.64', '51.64', '82.11', '133.15', '133.15', '167.71', '1.6', '5.4', '2.1', '13.15', '15.15', '17.71', '51.64', '57.64', '82.11', '183.15', '133.15', '167.71']
You would need to keep track of which part you are looking at, so you can use a flag to do this:
section = None  # will change to either "real" or "imag"
for line in f:
    if line.startswith("IMAGINARY PART"):
        section = "imag"
    elif line.startswith('REAL PART'):
        section = "real"
    else:
        freqs = line.split()
        if section == "real":
            FREQ.append(freqs)
        #elif section == "imag":
        #    IMAG_FREQ.append(freqs)
By the way, instead of appending to FREQ and then needing itertools.chain.from_iterable, you might consider just extending FREQ instead.
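In brief, with a couple of values from the sample data:
FREQ = []
FREQ.append(['51.64', '82.11'])  # [['51.64', '82.11']] - nested, needs chain.from_iterable later
FREQ = []
FREQ.extend(['51.64', '82.11'])  # ['51.64', '82.11'] - already flat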
We start with a flag set to False. If we find a line that contains "REAL", we set the flag to True and start copying the data below the REAL part, until we find a line that contains IMAGINARY, which sets the flag back to False and skips to the next line until another "REAL" is found (and hence the flag turns back to True).
Using the flag concept in a simple way:
with open('this.txt', 'r') as content:
    my_lines = content.readlines()

f = open('another.txt', 'w')
my_real_flag = False
for line in my_lines:
    if "REAL" in line:
        my_real_flag = True
    elif "IMAGINARY" in line:
        my_real_flag = False
    if my_real_flag:
        # do code here because we found real frequencies
        f.write(line)
    else:
        continue  # because my_real_flag isn't true, we must have found an IMAGINARY line
f.close()
this.txt looks like this:
REAL
1
2
3
IMAGINARY
4
5
6
REAL
1
2
3
IMAGINARY
4
5
6
another.txt ends up looking like this:
REAL
1
2
3
REAL
1
2
3
Original answer that only works when there is one REAL section
If the file is "small" enough to be read as an entire string and there is only one instance of "IMAGINARY PART", you can do this:
file_str = file_str.split("IMAGINARY PART")[0]
which would get you everything above the "IMAGINARY PART" line.
You can then apply the rest of your code to this file_str string that contains only the real part
To elaborate a bit more, file_str is a str obtained by the following:
with open('data.dat', 'r') as my_data:
    file_str = my_data.read()
the "with" block is referenced all over stack exchange, so there may be a better explanation for it than mine. I intuitively think about it as "open a file named 'data.dat' with the ability to only read it and name it as the variable my_data. once its opened, read the entirety of the file into a str, file_str, using my_data.read(), then close 'data.dat' "
now you have a str, and you can apply all the applicable str functions to it.
if "IMAGINARY PART" happens frequently throughout the file or the file is too big, Tadgh's suggestion of a flag a break works well.
for line in f:
    if "IMAGINARY PART" not in line:
        pass  # do stuff
    else:
        f.close()
        break

Conversion of Multiple Strings To ASCII

This seems fairly trivial but I can't seem to work it out
I have a text file with the contents:
B>F
I am reading this with the code below, stripping the '>' and trying to convert the strings into their corresponding ASCII values, minus 65, to give me a value that will correspond to another list index.
def readRoute():
    routeFile = open('route.txt', 'r')
    for line in routeFile.readlines():
        route = line.strip('\n' '\r')
        route = line.split('>')
        #startNode, endNode = route
        startNode = ord(route[0])-65
        endNode = ord(route[1])-65
        # Debug (this comment was for my use to explain below the print values)
        print 'Route Entered:'
        print line
        print startNode, ',', endNode, '\n'
    return [startNode, endNode]
However, I am having slight trouble doing the conversion nicely, because the text file only contains one line at the moment, but ideally I need it to support more than one line and run some code for each line.
For example it could contain:
B>F
A>D
C>F
E>D
So I would want to run the same code outside this function 4 times with the different inputs.
Anyone able to give me a hand?
Edit:
Not sure I made my issue that clear, sorry.
What I need it to do is parse the text file (possibly containing one line or multiple lines like the above). I am able to do it for one line with the lines:
startNode = ord(route[0])-65
endNode = ord(route[1])-65
But I get errors when trying to do more than one line, because ord() is getting unexpected input.
If I have the following in route.txt:
B>F
A>D
This is the error it gives me:
line 43, in readRoute
    endNode = ord(route[1])-65
TypeError: ord() expected a character, but string of length 2 found
My code above should read the route.txt file and see that B>F is the first route, strip the '>', and convert the B & F to ASCII, so 66 & 70 respectively, then subtract 65 from both to give 1 & 5 (in this example).
The 1 & 5 are corresponding indexes for another "array" (a list of lists) used to do computations and other things.
Once the other code has completed, it can then go to the next line in route.txt, which could be A>D, and perform the above again.
Perhaps this will work for you. I turned the file read into a generator, so you can do as you please with the parsed results in the for loop.
def readRoute(file_name):
    with open(file_name, 'r') as r:
        for line in r:
            yield (ord(line[0])-65, ord(line[2])-65)

filename = 'route.txt'
for startnode, endnode in readRoute(filename):
    print startnode, endnode
If you can't change readRoute, change the contents of the file before each call. Better yet, make readRoute take the filename as a parameter (default it to 'route.txt' to preserve the current behavior) so you can have it process other files.
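A minimal sketch of that suggestion applied to the question's readRoute; the body is otherwise unchanged, apart from stripping the line before splitting so the trailing newline never reaches ord():
def readRoute(file_name='route.txt'):  # default preserves the current behavior
    routeFile = open(file_name, 'r')
    for line in routeFile.readlines():
        route = line.strip().split('>')
        startNode = ord(route[0]) - 65
        endNode = ord(route[1]) - 65
    return [startNode, endNode]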
What about something like this? It takes the routes defined in your file and turns them into path objects with start and end member variables. As an added bonus PathManager.readFile() allows you to load multiple route files without overwriting the existing paths.
import re

class Path:
    def __init__(self, start, end):
        self.start = ord(start) - 65  # Scale the values as desired
        self.end = ord(end) - 65      # Scale the values as desired

class PathManager:
    def __init__(self):
        self.expr = re.compile("^([A-Za-z])[>]([A-Za-z])$")  # looks for string "C>C"
                                                             # where C is a char
        self.paths = []

    def do_logic_routine(self, start, end):
        # Do custom logic here that will execute before the next line is read
        # Return True for 'continue reading' or False to stop parsing file
        return True

    def readFile(self, path):
        file = open(path, "r")
        for line in file:
            item = self.expr.match(line.strip())  # strip whitespaces before parsing
            if item:
                '''
                item.group(0) is *not* used here; it matches the whole expression
                item.group(1) matches the first parenthesis in the regular expression
                item.group(2) matches the second
                '''
                self.paths.append(Path(item.group(1), item.group(2)))
                if not self.do_logic_routine(self.paths[-1].start, self.paths[-1].end):
                    break

# Running the example
MyManager = PathManager()
MyManager.readFile('route.txt')
for path in MyManager.paths:
    print "Start: %s End: %s" % (path.start, path.end)
Output is:
Start: 1 End: 5
Start: 0 End: 3
Start: 2 End: 5
Start: 4 End: 3
