I am a newbie in the Python world and in bioinformatics. I am dealing with an almost 50GB structured file that I need to write out, so I would like some good tips from you.
The file looks like this (the format is actually called FASTQ):
#Machinename:~:Team1:atcatg (1st line)
atatgacatgacatgaca (2nd line)
+ (3rd line)
asldjfwe!##$#%$ (4th line)
These four lines repeat in this order; each group of 4 lines forms a record (a "team").
And I have nearly 30 candidate DNA sequences, e.g. atgcat, tttagc.
What I am doing is running each candidate DNA sequence through the huge file to find whether the candidate is similar to a team's DNA sequence, meaning one mismatch is allowed (e.g. taaaaa = aaaata). If they are similar or the same, I use a dictionary to store them to write out later: the key is the candidate DNA sequence, and the value is a list holding the 4 lines of each matching record, in line order.
So what I have done is:
def myfunction(str1, str2):
    # returns True if the two sequences are similar (at most one mismatch)
    ...  # body not shown here

hugefile = 'hugefile'
f = open(hugefile)
diction = {}
mylist = ['candidate dna sequence1', 'dna2', 'dna3', 'dna4']  # ...the ~30 candidates

while True:
    line = f.readline()
    if not line:
        break
    if "machine name" in line:
        teamseq = line.split(':')[-1].strip()
        # read the rest of the 4-line record, in line order
        record = [line, f.readline(), f.readline(), f.readline()]
        for candidate in mylist:
            if myfunction(candidate, teamseq) == True:
                if candidate not in diction:
                    diction[candidate] = []
                # chances are some team dna are repeated, so just keep appending
                diction[candidate].extend(record)
f.close()

wf = open(hugefile + ".out", 'w')
for i in mylist:  # dna1, dna2, dna3, ...
    wf.write(''.join(diction.get(i, [])) + '\n')
wf.close()
My function doesn't use any global variables (I think I am happy with my function), whereas the dictionary is a global variable that takes in all the data and creates lots of list instances. The code is simple but so slow, and it is a big pain in the butt for both the CPU and the memory. I use PyPy, though.
So, any tips for writing it out in order, line by line?
I suggest opening input and output files simultaneously and writing to the output as you step through the input. As it is now, you are reading 50GB into memory and then writing it out. That is both slow and unnecessary.
IN PSEUDOCODE:
with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
    for line in fin:
        if "machine name" in line:
            # read the rest of the 4-line record from fin
            # process that record
            # write the record to fout
            # the input record is no longer needed -- allow it to be garbage collected...
As I have outlined it, the 4-line records are written as they are encountered and then disposed of. If you need to refer to diction.keys() for previous records, keep only the minimum necessary as a set() to cut down the total size of the in-memory data.
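For instance, here is a minimal concrete sketch of that pseudocode. The header marker, the file name, and the Hamming-distance style one-mismatch helper are assumptions, since the original myfunction and the real record layout aren't shown:

def is_similar(candidate, team):
    # assumed one-mismatch check: equal-length sequences differing in at most one position
    if len(candidate) != len(team):
        return False
    return sum(1 for a, b in zip(candidate, team) if a != b) <= 1

candidates = ['atgcat', 'tttagc']  # the ~30 candidate sequences

with open('hugefile') as fin, open('hugefile.out', 'w') as fout:
    for line in fin:
        if "machine name" in line:                 # start of a 4-line record
            # assumes the file ends on a record boundary
            record = [line, next(fin), next(fin), next(fin)]
            teamseq = line.split(':')[-1].strip()
            for candidate in candidates:
                if is_similar(candidate, teamseq):
                    fout.write(candidate + '\n')   # tag the record with the candidate it matched
                    fout.writelines(record)
            # the record goes out of scope here, so nothing accumulates in memory

This writes matches in file order rather than grouped per candidate, which is the trade-off of the streaming approach; if the grouping matters, keep only small per-candidate bookkeeping (e.g. line numbers) instead of the full records.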
I am working with amino acid sequences using the Biopython parser, but regardless of the data format (it is FASTA, i.e. you can think of the records as strings of letters preceded by an ID, as shown below), my problem is that I have a huge amount of data, and despite having tried to parallelize with joblib, the estimate of the hours it would take to run this simple code is 400.
Basically I have a file that contains a series of IDs that I have to remove (ids_to_drop) from the original dataset (original_dataset), to create a new file (new_dataset) that contains all the IDs of the original dataset except the ids_to_drop.
I've tried everything I can think of, but I don't know how else to do it and I'm stuck right now. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:  # output file, file of ids to remove
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried the following after the suggestion and it works.
with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq, filtered, "fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set one time, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation will be I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
# just in case there are blank lines
if "" in ids_to_remove:
ids_to_remove.remove("")
with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
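If you prefer to keep the Biopython objects from your update, the same set can be used at the record level, so a header and all of its sequence lines are kept or dropped together. A minimal sketch, assuming ids_to_remove is the set built above and contains bare record IDs such as WP_051064487.1:

from Bio import SeqIO

with open("filtered-data-set", "w") as filtered:
    for record in SeqIO.parse("original-data-set", "fasta"):
        if record.id not in ids_to_remove:          # O(1) set membership test
            SeqIO.write(record, filtered, "fasta")  # writes the header plus wrapped sequence lines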
I've tried to find the correct code and have written multiple attempts at this myself but am hoping that someone here can help.
I have a small file and a large file:
Small: 90,000 lines
Large: 1,200,000 lines
I'm trying to print any line from the large file to a result file if that line contains any line from the small file. I should also add that the comparison needs to be case-insensitive.
Example:
Large file line: '{"data": "SN#StackOverflow"}'
Small file line (ANY): '#stackoVerfloW'
-> Prints large line.
This should result in the large line being printed to the result file. Note that the small file's line would be a subset of the large file's line. This is the goal, since I am not looking for matching lines but rather for lines that contain data from the small file, and I then want to save those large-file lines as results.
I've tried to do this with a nested for loop, however it is not printing any of the results. I also loaded the files in as lists, hoping to reduce time since the lines will be stored in memory. It is also not possible for me to sort these lines, as the lines are not uniform.
results = open("results.txt", "w")

small = []
large = []

for line in open('small.txt'):
    small.append(line)

for line in open('large.txt'):
    large.append(line)

for j in large:
    for i in small:
        if i in j:
            results.write(j + '\n')
Please let me know if there is anything I can do to clarify my issue and sorry this is my first post and I hope to write better questions in the future.
The problem there, as you put it, is that the naive approach of matching all 90,000 lines against each of the 1,200,000 lines of the other file, and, while at it, performing a 'contains' check (the Python in operator) on the large-file line, leads to a prohibitive processing cost. You are basically talking about M x N x O operations, M being the large file size, N being the small file size, and O the average length of lines in the larger file (minus the average length of lines in the small file) - that is about 1 trillion operations to start with. With computers operating in the GHz range, it might be feasible in a few hours, if the small file fits into memory.
A smarter alternative uses location-independent hashes for each line in the large file. Location-independent hashes can be complicated - but a few hashes of all possible word subsets of the wider line can be matched against a dictionary containing all 90,000 lines of the smaller file in constant time, O(1). Doing that once for each of the 1,200,000 lines takes linear time, reducing this search from hours or days to a few seconds, using just string normalization and Python dictionaries.
In the end, this should be all the needed code:
import re

def normalize(text):
    text = text.strip().lower()
    # strip punctuation, keeping word characters and spaces
    text = re.sub(r'[^\w ]', ' ', text)
    words = tuple(text.split())
    return words

def normalized_subsets(text):
    words = normalize(text)
    combinations = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            combinations.add(words[i:j])
    return combinations

def main():
    small = {normalize(line): linenum
             for linenum, line in enumerate(open("small.txt"))
             if line.strip()}
    with open("large.txt") as large, open("results.txt", "w") as results:
        for line_num, line in enumerate(large):
            for combination in normalized_subsets(line):
                if combination in small:
                    results.write(f"{line_num} {small[combination]} {line}")

main()
So, while you still see nested for loops in this code, the nested version only loops through the possible subsets of words in each line - for a line with 30 words, that will be fewer than 500 combinations - so we make at most about 500 lookups to match any of these word subgroups against the 90,000-entry dictionary built from the smaller file, as opposed to 90,000 comparisons.
So it is still a quadratic algorithm in the end, but it should be sharply faster. For the sample line in the question, after removing the punctuation, it will try a match for every element in:
{('data',),
('data', 'sn'),
('data', 'sn', 'stackoverflow'),
('sn',),
('sn', 'stackoverflow'),
('stackoverflow',)}
which is just 6 lookups (down from a linear search through 90,000 lines).
(For extra value, this code also records the line numbers in the large file and in the smaller file at the beginning of each match line in the results file.)
You can also simplify the "file to list-of-lines" step by doing:
small = open('small.txt', 'r').readlines()
large = open('large.txt', 'r').readlines()
then iterating as follows:
with open("results.txt", "w") as results:
for j in large:
for i in small:
if i.lower() in j.lower():
results.write(j)
Good luck
There is no need to read all of the large file into memory at the same time. You would also almost certainly want to strip off the newline characters from the lines from the small file before doing your in test (you can use strip for this, to strip any leading and trailing whitespace).
Looking at your example strings, it also seems that you need to do a case insensitive comparison, hence using lower() here to convert both to lower case before comparing them.
You would presumably also only write each output line once even if there are multiple lines in the small file that match it, hence the break. Note also you don't need to write an additional newline if you have not stripped it from the input line from the large file.
Putting these together would give something like this.
small = []
with open('small.txt') as f:
    for line in f:
        small.append(line.strip().lower())

with open('results.txt', 'w') as fout:
    with open('large.txt') as fin:
        for line in fin:
            for i in small:
                if i in line.lower():
                    fout.write(line)
                    break  # breaks from inner loop only (for i in small)
If your sample data is indicative of the actual data you're using, it could be as simple as comparing them lower-case:
# ... your I/O code
for j in large:
    for i in small:
        if i.lower() in j.lower():
            results.write(j + '\n')
Note the .lower() calls, which are the only modification I've made to your code.
If this still doesn't work, please post a few more lines from each file to help us assess.
I am trying to write code to extract the longest ORF from a FASTA file. It is from the Coursera Genomics Data Science course.
the file is a practice file: "dna.example.fasta"
Data is here: https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below; it extracts reading frame 2 (starting from the second position of a sequence, e.g. for the sequence ATTGGG, reading frame 2 is TTGGG):
#!/usr/bin/python
import sys
import getopt

o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()

for k, v in o:
    opts[k] = v
    if '-h' in k:
        print "--help\n"

if len(a) < 0:
    print "missing fasta file\n"

f = open(a[0], "r")
seq = dict()
for line in f:
    line = line.strip()
    if line.startswith(">"):
        name = line.split()[0]
        seq[name] = ''
    else:
        seq[name] = seq[name] + line[1:]

k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp. Therefore by using this sequence alone I could get the correct answer.
However, with this code, the sequence stored in the dictionary ends up being only 4736 bp.
I am new to Python, so I cannot wrap my head around where those 100 bp went.
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how the sequence starts at the first character of each line.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (indices are zero-based). It turns out the number of missing nucleotides equals the number of lines it took to store that sequence, because you're losing the first character of every line.
The revised way is seq[name] = seq[name] + line which simply adds the line without losing that first character.
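A quick interactive check (with a made-up sequence) shows what the [1:] slice was doing to each line:

>>> line = "TCGGGCGAAGGC"
>>> line[1:]          # drops the first character of every line
'CGGGCGAAGGC'
>>> line              # the whole line is what should be appended
'TCGGGCGAAGGC'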
The quickest way to find these kind of debugging errors is to either use a formal debugger, or add a bunch of print statements on your code and test with a small portion of the file -- something that you can see the output of and check for yourself if it's coming out right. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also, for future reference, please mention the version of Python you are using beforehand. There are quite a few differences between Python 2 (the one you're using) and Python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
Addressing your comment,
To start from the 2nd position on the first line of the sequence only, and then use the full lines up to the following record, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. This tests whether we're on the first line of the sequence: if so, use the characters starting from the second; on any other line, use the whole line.
if line.startswith(">"):
    name = line.split()[0]
    seq[name] = ''
# If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
    seq[name] = seq[name] + line[1:]
else:
    seq[name] = seq[name] + line
This adaptation will start from the 2nd nucleotide of the genome without losing the first character of every remaining line of the sequence.
I have a file of sequence information, so the file will be structured like this,
[SEQUENCE ID]
atgctagctagatcga
[SEQUENCE ID]
agatcgatggctagatc
What I've been doing is comparing between files to see what sequences IDs are shared, which is simple enough, but now I want to pull out the actual sequence associated with the ID. The files I'm using are huge (10 GB+) so using a dictionary or anything that would involve reading all the lines into the system memory is out.
Basically, what the code is intended to do is this: if a sequence ID from file 1 isn't found in file 2, return the line after that sequence ID from file 1. Any tips?
So you only need line N and line N+1? In this case read the file in chunks of two lines. Then you always have access to both the sequence ID and the sequence.
from itertools import izip

with open('data.txt', 'r') as f:
    for line1, line2 in izip(*(iter(f),) * 2):
        print line1, line2
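If you happen to be on Python 3 (the snippet above is Python 2), itertools.izip is gone; the built-in zip is already lazy and does the same pairing:

with open('data.txt', 'r') as f:
    for line1, line2 in zip(*(iter(f),) * 2):  # zip is lazy in Python 3
        print(line1, line2)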
Short answer: you will have to use a third party Python library to keep one of the data sequences searchable in better than O(n).
If they are not sorted, you will have to sort at least one of the files. Think of it this way:
I get a sequence ID from file 1 - and to check that it is not present in file 2, I'd have to read the whole of file 2 - far less feasible than reading each file once.
Then - better than sorting, it would be useful to have a data structure that can hold the sorted data on disk in a way that provides fast searches and can still grow. That would take care of the sorting as well, since all you'd have to do in a first step is read the entries of file 2 and insert them into this growing, sorted, disk-persisted data structure.
While you could certainly roll your own data structure to do this, I'd suggest using ZODB - Zope's object-oriented database - with a BTree container, and have your "2 lines of data" made into a minimal object for your task.
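As a rough illustration of that idea (a sketch only: it assumes the third-party ZODB and BTrees packages are installed, and the file names are placeholders, not from the question):

import transaction
import ZODB, ZODB.FileStorage
from BTrees.OOBTree import OOBTree

db = ZODB.DB(ZODB.FileStorage.FileStorage("seq_index.fs"))  # disk-persisted object store
conn = db.open()
root = conn.root()
if "sequences" not in root:
    root["sequences"] = OOBTree()                 # sorted, disk-backed mapping
index = root["sequences"]

with open("file2.txt") as f:
    n = 0
    while True:
        seq_id = f.readline()
        seq = f.readline()
        if not seq_id:
            break
        index[seq_id.strip()] = seq.strip()       # id -> sequence, searchable in O(log n)
        n += 1
        if n % 100000 == 0:
            transaction.commit()                  # commit in batches to keep memory use bounded

transaction.commit()
db.close()

Afterwards you can reopen the database and test membership with "some_id in index" without ever holding the whole file in RAM.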
Alternatively, assuming the [SEQUENCE ID]s do fit in memory, and that the bulk of your data is actually on the sequence lines (unlike the examples provided), you have the option of parsing one file (file2 in your question) and annotating not only the [SEQUENCE ID] but also the file position of each such identifier. This approach lets you proceed without breaking much of your current workflow (like having to learn about a database):
def get_indexes(filename):
    with open(filename, "rt") as file:
        sequences = {}
        while True:
            position = file.tell()
            id = file.readline()
            if not id:
                break
            sequences[id.strip()] = position
            # skip the corresponding data line:
            file.readline()
    return sequences

def fetcher(filename1, filename2, sequences):
    with open(filename1, "rt") as file1, open(filename2, "rt") as file2:
        while True:
            id = file1.readline()
            data = file1.readline()
            if not id:
                break
            id = id.strip()
            if id in sequences:
                # position file2 at the identifier:
                file2.seek(sequences[id])
                # throw away the id line:
                file2.readline()
                data = file2.readline()
                yield id, data

if __name__ == "__main__":
    sequences = get_indexes("/data/file2")
    for id, data in fetcher("/data/file1", "/data/file2", sequences):
        print "%s\n%s" % (id, data)
I have a certain check to perform, and if the check is satisfied, I want the result to be printed. Below is the code:
import string
import codecs
import sys

y = sys.argv[1]
list_1 = []
f = 1.0
x = 0.05

write_in = open("new_file.txt", "w")
write_in_1 = open("new_file_1.txt", "w")

ligand_file = open(y, "r")  # Open the receptor.txt file
ligand_lines = ligand_file.readlines()  # Read all the lines into the array
ligand_lines = map(string.strip, ligand_lines)  # Remove the newline character from all the pdb file names
ligand_file.close()

ligand_file = open("unique_count_c_from_ac.txt", "r")  # Open the receptor.txt file
ligand_lines_1 = ligand_file.readlines()  # Read all the lines into the array
ligand_lines_1 = map(string.strip, ligand_lines_1)  # Remove the newline character from all the pdb file names
ligand_file.close()

s = []
for i in ligand_lines:
    for j in ligand_lines_1:
        j = j.split()
        if i == j[1]:
            print j
The above code works great, but when I print j, it prints something like ['351', '342'], whereas I expect 351 342 (with a single space in between). Since it is more of a Python question, I have not included the input files (basically they are just numbers).
Can anyone help me?
Cheers,
Chavanak
To convert a list of strings to a single string with spaces in between the list's items, use ' '.join(seq).
>>> ' '.join(['1','2','3'])
'1 2 3'
You can replace ' ' with whatever string you want in between the items.
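As a side note, join only accepts strings; if your items were numbers rather than strings, you would convert them first, for example:

>>> ' '.join(str(n) for n in [351, 342])
'351 342'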
Mark Rushakoff seems to have solved your immediate problem, but there are some other improvements that could be made to your code.
Always use context managers (with open(filename, mode) as f:) for opening files rather than relying on close getting called manually.
Don't bother reading a whole file into memory very often. Looping over some_file.readlines() can be replaced with looping over some_file directly.
For example, you could have used map(string.strip, ligand_file), or better yet [line.strip() for line in ligand_file].
Don't choose names to include the type of the object they refer to. This information can be found other ways.
For example, the code you posted can be simplified to something along the lines of:
import sys
from contextlib import nested

some_real_name = sys.argv[1]
other_file = "unique_count_c_from_ac.txt"

with nested(open(some_real_name, "r"), open(other_file, "r")) as (ligand_1, ligand_2):
    for line_1 in ligand_1:
        # Take care of the trailing newline
        line_1 = line_1.strip()
        # Rewind the second file so it can be scanned again for each line_1
        ligand_2.seek(0)
        for line_2 in ligand_2:
            line_2 = line_2.strip()
            numbers = line_2.split()
            if line_1 == numbers[1]:
                # If the second number from this line matches the number that is
                # in the user's file, print all the numbers from this line
                print ' '.join(numbers)
which is more reliable and I believe more easily read.
Note that the algorithmic performance of this is far from ideal because of the nested loops. Depending on your needs, this could potentially be improved, but since I don't know exactly what data you need to extract, I can't tell you whether you can.
The time this takes currently in my code and yours is O(nmq), where n is the number of lines in one file, m is the number of lines in the other, and q is the length of lines in unique_count_c_from_ac.txt. If two of these are fixed/small, then you have linear performance. If two can grow arbitrarily (I sort of imagine n and m can?), then you could look into improving your algorithm, probably using sets or dicts.
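For instance, if the goal is just "print the rows of unique_count_c_from_ac.txt whose second number appears in the user's file", a dict keyed on that second number turns the inner loop into a constant-time lookup. A sketch in the same Python 2 style as above, with hypothetical variable names; if a second-column value can repeat, you would store a list of rows instead:

import sys

rows_by_second_number = {}
with open("unique_count_c_from_ac.txt") as counts:
    for line in counts:
        numbers = line.split()
        if len(numbers) > 1:
            rows_by_second_number[numbers[1]] = numbers   # keeps the last row if a value repeats

with open(sys.argv[1]) as ligand_1:
    for line_1 in ligand_1:
        line_1 = line_1.strip()
        if line_1 in rows_by_second_number:               # O(1) average-time dict lookup
            print ' '.join(rows_by_second_number[line_1])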