Is the index_db method from the SeqIO object in Biopython slow? - python

I've got this:
files = glob.glob(str(dir_path) + "*.fa")
index = SeqIO.index_db(index_filename, files, "fasta")
seq = index[accession] # Slow
index.close()
return seq
and i'm working on big files (gene sequences) but for some reasons, it takes about 4 secondes to get the sequence I'm looking for. I'm wondering if the index_db method is suppose to be that slow? Am I using the right method?
Thanks.

The first time the database is created it can take some time. The next times, if you don't delete the index_filename created, it should go faster.
Lets say you have your 25 files each with some genes. This method creates a SQLite DB that helps locating the sequences among the files, like "Get me the gene XXX" and the SQLite/index_db knows that the gene is in the file 12.fasta and its exact location inside the file. So Biopython opens the file and scans quickly to the gene position.
Without that index_db you have to load every Records into memory, which is fast but some files might not fit in the RAM.
If you want speed to fetch regions you can use FastaFile from pysam and samtools. Like this:
You have to index all the fasta files with faidx:
$ samtools faidx big_fasta.fas
From your code write something like this:
from pysam import FastaFile
rec = FastaFile("big_fasta.fas") # big_fasta.fas.fai must exist.
seq = rec.fetch(reference=gene_name, start=1000, end= 1200)
print(s)
In my computer this times 2 orders of magnitude faster than Biopython for the same operation, but you only get the pure sequence of bases.

Related

Summarizing huge amounts of data

I have a problem that I have not been able to solve. I have 4 .txt files each between 30-70GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appear, and save this data to a new file, e.g:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempts so far has been simply to save all entries in a dictionary and count them, i.e.
entry_count_dict = defaultdict(int)
with open(file) as f:
for line in f:
entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8GB RAM available). The data follows a zipfian distribution, e.g. the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there is somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py where all the entries are saved as a h5py dataset containing the array [1], which is then updated, e.g:
import h5py
import numpy as np
entry_count_dict = h5py.File(filename)
with open(file) as f:
for line in f:
if line in entry_count_dict:
entry_count_file[line][0] += 1
else:
entry_count_file.create_dataset(line,
data=np.array([1]),
compression="lzf")
However, this method is way to slow. The writing speed gets slower and slower. As such, unless the writing speed can be increased this approach is implausible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not show any significant difference in processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdic(int)).
However, to do this the file have to iterated once for every letter, which is implausible given the file sizes (max = 69GB).
Perhaps when iterating over the file, one could open a pickle and save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries during one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painstakingly slow using the linux command:
sort file.txt > sorted_file.txt
And, I don't really know how to solve this using python given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, however they all seem to require that the whole object to be sorted needs get loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
or,
try Dask, a DARPA + Anaconda backed distributive computing library, with interfaces familiar to numpy, pandas, and works like Apache-Spark. (works on single machine too)
btw it scales
I suggest trying dask.array,
which cuts the large array into many small ones, and implements numpy ndarray interface with blocked algorithms to utilize all of your cores when computing these larger-than-memory datas.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdic(int)). However, to do this the file have to iterated once for every letter, which is implausible given the file sizes (max = 69GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ( "output/" substr($0,0,1))}` input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter
ngrams = Counter()
for line in open(filename):
ngrams[line.strip()] += 1
for key, val in ngrams.items():
print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.

Python Running Increasingly Slower, Garbage Collection Issue?

So I have code that grabs a list of files from a directory that initially had over 14 millions files. This is a hex-core machine with 20 GB RAM running Ubuntu 14.04 desktop and just grabbing a list of files takes hours - I haven't actually timed it.
Over the past week or so I've run code that doesn't nothing more than gather this list of files, open each file to determine when it was created, and move it to a directory based on the month and year it was created. (The files have been both scp'd and rsync'd so the timestamp the OS provides is meaningless at this point, hence opening the file.)
When I first started running this loop it was moving 1000 files in about 90 seconds. Then after several hours like this that 90 seconds became 2.5 min, then 4, then 5, then 9, and eventually 15 min. So I shut it down and started over.
I noticed that today once it was done gathering a list of over 9 millions files that moving 1000 files took 15 min right off the bat. I just shut the process down again and rebooted the machine because the time to move 1000 files had climbed to over 90 min.
I had hoped to find some means of doing a while + list.pop() style strategy to free memory as the loop progressed. Then found a couple of SO posts that said it could be done with for i in list: ... list.remove(...) but that this was a terrible idea.
Here's the code:
from basicconfig.startup_config import *
arc_dir = '/var/www/data/visits/'
def step1_move_files_to_archive_dirs(files):
"""
:return:
"""
cntr = 0
for f in files:
cntr += 1
if php_basic_files.file_exists(f) is False:
continue
try:
visit = json.loads(php_basic_files.file_get_contents(f))
except:
continue
fname = php_basic_files.basename(f)
try:
dt = datetime.fromtimestamp(visit['Entrance Time'])
except KeyError:
continue
mYr = dt.strftime("%B_%Y")
# Move the lead to Monthly archive
arc_path = arc_dir + mYr + '//'
if not os.path.exists(arc_path):
os.makedirs(arc_path, 0777)
if not os.path.exists(arc_path):
print "Directory: {} was not created".format(arc_path)
else:
# Move the file to the archive
newFile = arc_path + fname
#print "File moved to {}".format(newFile)
os.rename(f, newFile)
if cntr % 1000 is 0:
print "{} files moved ({})".format(cntr, datetime.fromtimestamp(time.time()).isoformat())
def step2_combine_visits_into_1_file():
"""
:return:
"""
file_dirs = php_basic_files.glob(arc_dir + '*')
for fd in file_dirs:
arc_files = php_basic_files.glob(fd + '*.raw')
arc_fname = arc_dir + php_basic_str.str_replace('/', '', php_basic_str.str_replace(arc_dir, '', fd)) + '.arc'
try:
arc_file_data = php_basic_files.file_get_contents(arc_fname)
except:
arc_file_data = {}
for f in arc_files:
uniqID = moduleName = php_adv_str.fetchBefore('.', php_basic_files.basename(f))
if uniqID not in arc_file_data:
visit = json.loads(php_basic_files.file_get_contents(f))
arc_file_data[uniqID] = visit
php_basic_files.file_put_contents(arc_fname, json.dumps(arc_file_data))
def main():
"""
:return:
"""
files = php_basic_files.glob('/var/www/html/ver1/php/VisitorTracking/data/raw/*')
print "Num of Files: {}".format(len(files))
step1_move_files_to_archive_dirs(files)
step2_combine_visits_into_1_file()
Notes:
basicconfig is essentially a bunch of constants I have for the environment and a few commonly used libraries like all the php_basic_* libraries. (I used PHP for years before picking up Python so I built a library to mimic the more common functions I used in order to be up and running with Python faster.)
The step1 def is as far as the program gets so far. The step2 def could, and likely should, be run in parallel. However, I figured I/O was the bottleneck and doing even more of it in parallel would likely slow all functions down a lot more. (I have been tempted to rsync the archive directories to another machine for aggregation thus getting parallel speed without the I/O bottleneck but figured the rsync would also be quite slow.)
The files themselves are all 3 Kb each so not very large.
----- Final Thoughts -------
Like I said, it doesn't appear, to me at least, that any data is being stored from each file opening. Therefore memory should not be an issue. However, I do notice that only 1.2 GB of RAM is being used right now and over 12 GB of was being used before. A big chunk of that 12 could be storing 14 million file names and paths. I've only just started the processing again so for next several hours python will be gathering a list of files and that list isn't in memory yet.
So I was wondering if there was a garbage collection issue or something else I was missing. Why is it slowing down as it progresses through the loop?
step1_move_files_to_archive_dirs:
Here's some reasons Step 1 might be taking longer than you expected...
The response to any exception during Step 1 is to continue to the next file. If you have any corrupted data files, they will stay in the filesystem forever, increasing the amount of work this function has to do next time (and the next, and the next...).
You are reading in every file and converting it from JSON to a dict, just to extract one date. So everything is read and converted at least once. If you control the creation of these files, it might be worth storing this value in the filename or in a separate index / log, so you don't have to go searching for that value again later.
If the input directories and output / archive directories are on separate filesystems, os.rename(f, newFile) can't just rename the file, but has to copy every byte from the source filesystem to the target filesystem. So either every file is near-instantaneously renamed, or every input file is slowly copied.
PS: It's weird that this function double-checks things like whether the input file still exists, or if os.makedirs worked, but then allows any exception from os.rename to crash you mid-loop.
step2_combine_visits_into_1_file:
All your file I/O is hidden inside that PHP library, but it looks to this PHP outsider like you're trying to store in RAM the contents of all the files in each subdirectory. Then, you accumulate all those contents inside some smaller number of archive files, while preserving (most of?) the data that was already there. Not only is that probably slow to begin with, it will get slower as time goes on.
Function code mostly replaced by comments:
file_dirs = # arch_dir/* --- Maybe lots, maybe only a few.
for fd in file_dirs:
arc_files = # arch_dir/subdir*.raw or maybe arch_dir/subdir/*.raw.
arc_fname = # subdir.arc
arc_file_data = # Contents of JSON file subdir.arc, as a dict.
for f in arc_files: # The *.raw files.
uniqID = # String based on f's filename.
if uniqID not in arc_file_data:
# Add to arc_file_data the uniqID key, and the
# _ entire contents_ of the .raw file as its value.
php_basic_files.file_put_contents # (...)
# Convert the arc_file_data dict into one _massive_ string,
# and replace the contents of the subdir.arc file.
Unless you have some maintenance job that periodically trims the *.arc files, you will eventually have the entire contents of all 14 million files (plus any older files) inside the *.arc files. Each of those .arc files gets read into a dict, converted to a mega-string, grown (probably), and then written back to the filesystem. That's a ton of I/O, even if the average .arc file isn't very big (which can only happen if there are lots of them).
Why do all this anyway? By the start of Step 2, you've already got a unique ID for each .raw input file, and it's already in the filename --- so why not use the filesystem itself to store /arch_dir/subdir/unique_id.json?
If you really do need all this data in a few huge archives, that shouldn't require so much work. The .arc files are little more than the unaltered contents of the .raw files, with bits of a JSON dictionary between them. A simple shell script could slap that together without ever interpreting the JSON itself.
(If the values are not just JSON but quoted JSON, you would have to change whatever reads the .arc files to not un-quote those values. But now I'm purely speculating, since I can only see some of what's happening.)
PS: Am I missing something, or is arc_files a list of *.raw filenames. Shouldn't it be raw_files?
Other Comments:
As others have noted, if your file-globbing function returns a mega-list of 14 million filenames, it would be vastly more memory-efficient as a generator that can yield one filename at a time.
Finally, you mentioned popping filenames off a list (although I don't see that in your code)... There is a huge time penalty for inserting or removing the first element of a large list --- del my_list[0] or my_list.pop(0) or my_list.insert(0, something) --- because items 1 though n-1 all have to be copied one index toward 0. That turns an O(n) operation into O(n**2)... again, if that's in your code anywhere.

How to efficiently iterate two files in python?

I have two text files which should have a lot of matching lines and I want to find out exactly how many lines match between the files. The problem is that both of the files are quite large (one file is about 3gb and the other is over 16gb). So obviously reading them into the system memory using read() or readlines() could be very problematic. Any tips? The code I'm writing is basically just a 2 loops and an if statement to compare them.
Since the input files are very large, if you care about performance, you should consider simply using grep -f. The -f option reads patterns from a file, so depending on the exact semantics you're after, it may do what you need. You probably want the -x option too, to take only whole-line matches. So the whole thing in Python might look something like this:
child = subprocess.Popen(['grep', '-xf', file1, file2], stdout=subprocess.PIPE)
for line in child.stdout:
print line
why not use unix grep? if you want your solution platform independent then this solution will not work. But in unix it works. Run this command from your python script.
grep --fixed-strings --file=file_B file_A > result_file
Also this problem seems to be a good reason to go for map-reduce.
UPDATE 0: To elucidate. --fixed-strings = Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. and --file= Obtain patterns from FILE, one per line.
So what we ar doing is getting patterns from file_B matched against the content in file_A and fixed-strings treats them as a sequence of patterns the way they are in a file. Hope this makes it clearer.
Since you want the count of matching lines a slight modification of the above grep we get the count -
grep --fixed-strings --file=file_B file_A | wc -l
UPDATE 1: You could you do this. first go through each file separately line by line. dont read the entire file into memory. when you read one line compute md5 hash of this line and write it to another file. When you do this 2 both files, you get 2 new files filled with md5 hashes. I am hoping that these 2 files are substantially smaller in size the original files, since md5 is 16bytes irrespective of i/p string. Now you can probably do grep or other diffing techniques with little or no memory problem. – Srikar 3 mins ago edit
UPDATE 2: (few days later) Can you do this? create 2 tables table1, table2 in mysql. Both having only 2 fields id, data. Insert both the files into both these tables, line by line. After which run a query to find count of duplicates. You have to go through both files. Thats given. We cant run away from that fact. Now the optimisations can be done in how dups are found. MySQL is one such option. It removes a lot of things that you need to do like RAM space, index creation etc.
Well thanks all for your input! But what I ended up doing was painfully simple. I was trying things like this, which read in the whole file.
file = open(xxx,"r")
for line in file:
if.....
What I ended up doing was
for line in open(xxx)
if.....
The second one takes the file line by line. It's very time consuming, but I've pretty much accepted that there isn't some magically way to do this that will take very little time :(

What is the best way to do a find and replace of multiple queries on multiple files?

I have a file that has over 200 lines in this format:
name old_id new_id
The name is useless for what I'm trying to do currently, but I still want it there because it may become useful for debugging later.
Now I need to go through every file in a folder and find all the instances of old_id and replace them with new_id. The files I'm scanning are code files that could be thousands of lines long. I need to scan every file with each of the 200+ ids that I have, because some may be used in more than one file, and multiple times per file.
What is the best way to go about doing this? So far I've been creating python scripts to figure out the list of old ids and new ids and which ones match up with each other, but I've been doing it very inefficient because I basically scanned the first file line by line and got the current id of the current line, then I would scan the second file line by line until I found a match. Then I did this over again for each line in the first file, which ended up with my reading the second file a lot. I didn't mind doing this inefficiently because they were small files.
Now that I'm searching probably somewhere around 30-50 files that can have thousands of line of code in it, I want it to be a little more efficient. This is just a hobbyist project, so it doesn't need to be super good, I just don't want it to take more than 5 minutes to find and replace everything, then look at the result and see that I made a little mistake and need to do it all over again. Taking a few minutes is fine(although I'm sure with computers nowadays they can do this almost instantly still) but I just don't want it to be ridiculous.
So what's the best way to go about doing this? So far I've been using python but it doesn't need to be a python script. I don't care about elegance in the code or way I do it or anything, I just want an easy way to replace all of my old ids with my new ids using whatever tool is easiest to use or implement.
Examples:
Here is a line from the list of ids. The first part is the name and can be ignored, the second part is the old id, and the third part is the new id that needs to replace the old id.
unlock_music_play_grid_thumb_01 0x108043c 0x10804f0
Here is an example line in one of the files to be replaced:
const v1, 0x108043c
I need to be able to replace that id with the new id so it looks like this:
const v1, 0x10804f0
Use something like multiwordReplace (I've edited it for your situation) with mmap.
import os
import os.path
import re
from mmap import mmap
from contextlib import closing
id_filename = 'path/to/id/file'
directory_name = 'directory/to/replace/in'
# read the ids into a dictionary mapping old to new
with open(id_filename) as id_file:
ids = dict(line.split()[1:] for line in id_file)
# compile a regex to do the replacement
id_regex = re.compile('|'.join(map(re.escape, ids)))
def translate(match):
return ids[match.group(0)]
def multiwordReplace(text):
return id_regex.sub(translate, text)
for code_filename in os.listdir(directory_name):
with open(os.path.join(directory, code_filename), 'r+') as code_file:
with closing(mmap(code_file.fileno(), 0)) as code_map:
new_file = multiword_replace(code_map)
with open(os.path.join(directory, code_filename), 'w') as code_file:
code_file.write(new_file)

Confusing loop problem (python)

this is similar to the question in merge sort in python
I'm restating because I don't think I explained the problem very well over there.
basically I have a series of about 1000 files all containing domain names. altogether the data is > 1gig so I'm trying to avoid loading all the data into ram. each individual file has been sorted using .sort(get_tld) which has sorted the data according to its TLD (not according to its domain name. sorted all the .com's together, .orgs together, etc)
a typical file might look like
something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org
but obviosuly longer.
I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, .orgs together, etc.
What I want to do basically is
open all the files
loop:
read 1 line from each open file
put them all in a list and sort with .sort(get_tld)
write each item from the list to a new file
the problem I'm having is that I can't figure out how to loop over the files
I can't use with open() as because I don't have 1 file open to loop over, I have many. Also they're all of variable length so I have to make sure to get all the way through the longest one.
any advice is much appreciated.
Whether you're able to keep 1000 files at once is a separate issue and depends on your OS and its configuration; if not, you'll have to proceed in two steps -- merge groups of N files into temporary ones, then merge the temporary ones into the final-result file (two steps should suffice, as they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the "merge N input files into one output file" task (it's only an issue of whether you call that function once, or repeatedly).
The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).
import heapq
def merge(inputfiles, outputfile, key):
"""inputfiles: list of input, sorted files open for reading.
outputfile: output file open for writing.
key: callable supplying the "key" to use for each line.
"""
# prepare the heap: items are lists with [thekey, k, theline, thefile]
# where k is an arbitrary int guaranteed to be different for all items,
# theline is the last line read from thefile and not yet written out,
# (guaranteed to be a non-empty string), thekey is key(theline), and
# thefile is the open file
h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
h = [[key(s), k, s, i] for k, s, i in h if s]
heapq.heapify(h)
while h:
# get and output the lowest available item (==available item w/lowest key)
item = heapq.heappop(h)
outputfile.write(item[2])
# replenish the item with the _next_ line from its file (if any)
item[2] = item[3].readline()
if not item[2]: continue # don't reinsert finished files
# compute the key, and re-insert the item appropriately
item[0] = key(item[2])
heapq.heappush(h, item)
Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,
def tld(domain):
return domain.rsplit('.', 1)[-1].strip()
or something along these lines is probably a reasonable approach under this constraint.
If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a similar "decorate / undecorate" approach from that I use in the more portable code above.
You want to use merge sort, e.g. heapq.merge. I'm not sure if your OS allows you to open 1000 files simultaneously. If not you may have to do it in 2 or more passes.
Why don't you divide the domains by first letter, so you would just split the source files into 26 or more files which could be named something like: domains-a.dat, domains-b.dat. Then you can load these entirely into RAM and sort them and write them out to a common file.
So:
3 input files split into 26+ source files
26+ source files could be loaded individually, sorted in RAM and then written to the combined file.
If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.
Your algorithm for merging sorted files is incorrect. What you do is read one line from each file, find the lowest-ranked item among all the lines read, and write it to the output file. Repeat this process (ignoring any files that are at EOF) until the end of all files has been reached.
#! /usr/bin/env python
"""Usage: unconfuse.py file1 file2 ... fileN
Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""
import sys, os
spools = {}
for name in sys.argv[1:]:
for line in file(name):
if (line == "\n"): continue
tld = line[line.rindex(".")+1:-1]
spool = spools.get(tld, None)
if (spool == None):
spool = file(tld + ".spool", "w+")
spools[tld] = spool
spool.write(line)
for tld in sorted(spools.iterkeys()):
spool = spools[tld]
spool.seek(0)
for line in spool:
sys.stdout.write(line)
spool.close()
os.remove(spool.name)

Categories