I'm trying the global alignment method from the Biopython module. Using it on short sequences is easy and gives an alignment matrix straight away. However, I really need to run it on larger sequences that I have (an average length of 2000 nucleotides/characters), and I keep running into an out-of-memory error. I looked on SO and found this previous question. The answers provided are not helpful, as they link to this same website, which can't be accessed now. Apart from this, I have tried these steps:
I tried using a 64-bit Python, since my personal computer has 4 GB of RAM.
SSHed to a small school server with 16 GB of RAM and tried running it there. It's still running after close to 4 hours.
Since it is a small script, I'm unsure how to modify it. Any help will be greatly appreciated.
My script:
import os
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Collect all .dna files in the current directory
file_list = [each for each in os.listdir(os.getcwd()) if each.endswith(".dna")]

align_file = open("seq_align.aln", "w")
seq_list = []
for each_file in file_list:
    with open(each_file, "r") as f_o:
        seq_list.append(f_o.read())

for a in pairwise2.align.globalxx(seq_list[0], seq_list[1]):
    align_file.write(format_alignment(*a))
align_file.close()
So the school server finally completed the task. What I realized was that for each alignment, 1000 matrices were constructed and calculated. The align.globalxx method has a variable MAX_ALIGNMENTS which is set to 1000 by default. Changing it via monkey patching didn't really change anything. The documentation says that the method tries all possible alignments (yes, 1000), but in my case all the matrices have the same alignment score (as do a few test sequences I tried). Finally, a small comment in the documentation states that if you need only one alignment, use the optional parameter one_alignment_only, which accepts a boolean value. All I did was this:
for a in pairwise2.align.globalxx(seq_list[0], seq_list[1], one_alignment_only=True):
    align_file.write(format_alignment(*a))
This reduced the time considerably. However, my PC still crashed, so I assume this is a very memory-intensive task and requires much more RAM (16 GB on a small server). So probably a more efficient way of reading the sequences into the matrix should be thought of.
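If only the final score matters (not the aligned strings), pairwise2 also supports a score_only flag that skips the traceback entirely; a minimal sketch, reusing seq_list and align_file from the script above:
# score_only=True returns just the best score, avoiding the memory-heavy traceback
score = pairwise2.align.globalxx(seq_list[0], seq_list[1], score_only=True)
align_file.write("Alignment score: {}\n".format(score))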
Related: How to detect when OS starts swapping some resources of the running process to disk?
I came here from basically the same question. The psutil library is obviously great and provides a lot of information, yet I don't know how to use it to solve the problem.
I created a simple test script.
import psutil
import os
import numpy as np
MAX = 45000
ar = []
pr_mem = []
swap_mem = []
virt_mem = []
process = psutil.Process()
for i in range(MAX):
    ar.append(np.zeros(100_000))
    pr_mem.append(process.memory_info())
    swap_mem.append(psutil.swap_memory())
    virt_mem.append(psutil.virtual_memory())
Then, I plotted the course of those statistics.
import matplotlib.pyplot as plt
plt.figure(figsize=(16,12))
plt.plot([x.rss for x in pr_mem], label='Resident Set Size')
plt.plot([x.vms for x in pr_mem], label='Virtual Memory Size')
plt.plot([x.available for x in virt_mem], label='Available Memory')
plt.plot([x.total for x in swap_mem], label='Total Swap Memory')
plt.plot([x.used for x in swap_mem], label='Used Swap Memory')
plt.plot([x.free for x in swap_mem], label='Free Swap Memory')
plt.legend(loc='best')
plt.show()
I cannot see how I can use the information about swap memory to detect the swapping of my process.
The 'Used Swap Memory' value is more or less meaningless, as it is high from the very first moment (it counts global consumption), when the data of the process are obviously not yet swapped.
It seems best to look at the difference between 'Virtual Memory Size' and 'Resident Set Size' of the process. If VMS greatly exceeds RSS, it is a sign that the data are not in RAM but on disk.
However, a problem with a sudden explosion of VMS is described here, which makes it irrelevant in some cases, if I understand it correctly.
Another approach is to watch 'Available Memory' and make sure it does not drop below a certain threshold, as the psutil documentation suggests. But to me it seems complicated to set the threshold properly. They suggest (in the small snippet) 100 MB, but on my machine it is something like 500 MB.
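For illustration, a rough check combining these two heuristics might look like the sketch below; the threshold and ratio values are assumptions, not recommendations:
import psutil

AVAILABLE_THRESHOLD = 500 * 1024 * 1024   # assumed: warn below 500 MB of available RAM
VMS_RSS_RATIO = 2.0                       # assumed: VMS more than 2x RSS is suspicious

def probably_swapping(process=None):
    # Heuristic only: True means the process *might* be getting swapped
    process = process or psutil.Process()
    mem = process.memory_info()
    virt = psutil.virtual_memory()
    low_ram = virt.available < AVAILABLE_THRESHOLD
    inflated_vms = mem.rss > 0 and (mem.vms / mem.rss) > VMS_RSS_RATIO
    return low_ram or inflated_vms

if probably_swapping():
    print("Warning: possible swapping; consider serializing data manually.")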
So the question stands. How to detect when OS starts swapping some resources of the running process to disk?
I work on Windows, but the solution needs to be cross-platform (or at least as cross-platform as possible).
Any suggestion, advice or useful link is welcomed. Thank you!
For the context, I write a library which needs to manage its memory consumption.
I believe that with knowledge of the program logic, my manual swapping (serializing data to disk) can work better (faster) than OS swapping. More importantly, the OS swap space is limited, so sometimes it is necessary to do the manual swapping, which does not use the OS swap space.
In order to start the manual swapping at the right time, it is crucial to know when OS swapping starts.
I ran k_clique_communities on my network data and received a memory error after a long response time. The code works perfectly fine for my other data, but not this one.
c = list(k_clique_communities(G_fb, 3))
list(c[0])
Here is a snapshot of the traceback error.
I tried running your code on my system with 16 GB RAM and an i7-7700HQ, and the kernel died after returning a memory error. I think it's because computing k-cliques of size 3 takes a lot of computational power/memory, since you have quite a large number of nodes and edges. I think you need to look into other ways to find k-cliques, like GraphX - Apache Spark.
You probably have a 32-bit installation of Python. 32-bit installations are limited to 2 gigabytes of memory regardless of how much system memory you have. This is inconvenient, but it works. Uninstall Python and install a 64-bit installation. Again, it's inconvenient, but it's the only solution that I've ever seen.
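As a quick way to confirm which one you are running, a check along these lines works (both calls are from the standard library):
import platform
import sys

print(platform.architecture()[0])                     # e.g. '64bit' or '32bit'
print("64-bit" if sys.maxsize > 2**32 else "32-bit")  # pointer width of this interpreter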
As stated in the networkx documentation:
To obtain a list of all maximal cliques, use list(find_cliques(G)). However, be aware that in the worst-case, the length of this list can be exponential in the number of nodes in the graph.
(This refers to find_cliques, but it is relevant to your problem as well.)
That means that if there is a huge number of cliques, the strings you create with list will be so many that you'll probably run out of memory: strings need more space than int values to be stored.
I don't know if there is a workaround, though.
Right now I'm in need of a really fast screenshotter to feed screenshots into a CNN that will update mouse movement based on the screenshot. I'm looking to model the same kind of behavior presented in this paper, and similarly do the steps featured in Figure 6 (without the polar conversion). Because I need really fast input, I've searched around a bit and have been able to get this script from here, slightly modified, which outputs 10 fps:
from PIL import ImageGrab
from datetime import datetime
while True:
    im = ImageGrab.grab([320, 180, 1600, 900])
    dt = datetime.now()
    fname = "pic_{}.{}.png".format(dt.strftime("%H%M_%S"), dt.microsecond // 100000)
    im.save(fname, 'png')
Can I expect anything faster? I'd be fine with using a different program if it's available.
Writing to disk is very slow, and is probably a big part of what's making your loop take so long. Try commenting out the im.save() line and see how many screenshots you can capture (add a counter variable or something similar to keep track).
Assuming the disk I/O is the bottleneck, you'll want to split these two tasks up. Have one loop that just captures the screenshots and stores them in memory (e.g. to a dictionary with the timestamp as the key), then in a separate thread pull elements out of the dictionary and write them to disk.
See this question for pointers on threading in Python if you haven't done much of that before.
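As a minimal sketch of that split, assuming PIL's ImageGrab and the same capture region as above (a queue-based hand-off is used here instead of a dictionary, which is just one way to do it):
import queue
import threading
from datetime import datetime
from PIL import ImageGrab

shots = queue.Queue()

def writer():
    # Consumer thread: do the slow disk writes off the capture loop
    while True:
        fname, im = shots.get()
        im.save(fname, 'png')
        shots.task_done()

threading.Thread(target=writer, daemon=True).start()

for _ in range(100):  # capture a fixed number of frames for this sketch
    im = ImageGrab.grab([320, 180, 1600, 900])
    dt = datetime.now()
    fname = "pic_{}.{}.png".format(dt.strftime("%H%M_%S"), dt.microsecond // 100000)
    shots.put((fname, im))

shots.join()  # block until every queued screenshot has been written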
I have implemented the Sieve of Atkin and it works great up to primes nearing 100,000,000 or so. Beyond that, it breaks down because of memory problems.
In the algorithm, I want to replace the memory based array with a hard drive based array. Python's "wb" file functions and Seek functions may do the trick. Before I go off inventing new wheels, can anyone offer advice? Two issues appear at the outset:
Is there a way to "chunk" the Sieve of Atkin to work on segment in memory, and
is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them.
Why am I doing this? An old geezer looking for entertainment and to keep the noodle working.
Implementing the SoA in Python sounds fun, but note it will probably be slower than the SoE in practice. For some good monolithic SoE implementations, see RWH's StackOverflow post. These can give you some idea of the speed and memory use of very basic implementations. The numpy version will sieve to over 10,000M on my laptop.
What you really want is a segmented sieve. This lets you constrain memory use to some reasonable limit (e.g. 1M + O(sqrt(n)), and the latter can be reduced if needed). A nice discussion and code in C++ is shown at primesieve.org. You can find various other examples in Python. primegen, Bernstein's implementation of SoA, is implemented as a segmented sieve (Your question 1: Yes the SoA can be segmented). This is closely related (but not identical) to sieving a range. This is how we can use a sieve to find primes between 10^18 and 10^18+1e6 in a fraction of a second -- we certainly don't sieve all numbers to 10^18+1e6.
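For a concrete feel of the segmentation pattern, here is a minimal segmented Sieve of Eratosthenes sketch (not the SoA, but the blocking idea carries over): sieve the base primes up to sqrt(n) once, then process fixed-size blocks one at a time.
import math

def segmented_sieve(n, block_size=1_000_000):
    # Yield all primes <= n using roughly block_size + sqrt(n) bytes of working memory
    limit = math.isqrt(n)
    base = bytearray([1]) * (limit + 1)          # base sieve up to sqrt(n)
    base[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(limit) + 1):
        if base[i]:
            base[i * i::i] = bytearray(len(base[i * i::i]))
    base_primes = [i for i in range(2, limit + 1) if base[i]]
    yield from base_primes

    lo = limit + 1
    while lo <= n:                               # sieve the rest one block at a time
        hi = min(lo + block_size - 1, n)
        block = bytearray([1]) * (hi - lo + 1)
        for p in base_primes:
            start = max(p * p, ((lo + p - 1) // p) * p)
            block[start - lo::p] = bytearray(len(block[start - lo::p]))
        for i, flag in enumerate(block):
            if flag:
                yield lo + i
        lo = hi + 1

print(sum(1 for _ in segmented_sieve(10**7)))    # e.g. count the primes below ten million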
Involving the hard drive is, IMO, going the wrong direction. We ought to be able to sieve faster than we can read values from the drive (at least with a good C implementation). A ranged and/or segmented sieve should do what you need.
There are better ways to do storage, which will help some. My SoE, like a few others, uses a mod-30 wheel so has 8 candidates per 30 integers, hence uses a single byte per 30 values. It looks like Bernstein's SoA does something similar, using 2 bytes per 60 values. RWH's python implementations aren't quite there, but are close enough at 10 bits per 30 values. Unfortunately it looks like Python's native bool array is using about 10 bytes per bit, and numpy is a byte per bit. Either you use a segmented sieve and don't worry about it too much, or find a way to be more efficient in the Python storage.
First of all, you should make sure that you store your data in an efficient manner. You could easily store primality flags for all numbers up to 100,000,000 in 12.5 MB of memory by using a bitmap, and by skipping obvious non-primes (even numbers and so on) you could make the representation even more compact. This also helps when storing the data on the hard drive. That you are getting into trouble at 100,000,000 suggests that you're not storing the data efficiently.
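As a rough illustration of such a bitmap (one bit per number; the helper names are my own):
N = 100_000_000
bits = bytearray((N + 7) // 8)   # 12.5 MB: one bit per integer below N

def set_bit(i):
    bits[i >> 3] |= 1 << (i & 7)

def get_bit(i):
    return (bits[i >> 3] >> (i & 7)) & 1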
Some hints if you don't receive a better answer.
1. Is there a way to "chunk" the Sieve of Atkin to work on a segment in memory?
Yes, for the Eratosthenes-like part, what you could do is run over multiple elements of the sieve list "in parallel" (one block at a time) and in that way minimize the disk accesses.
The first part is somewhat trickier: what you would want to do is process the 4*x**2+y**2, 3*x**2+y**2 and 3*x**2-y**2 terms in a more sorted order. One way is to first compute them and then sort the numbers; there are sorting algorithms that work well on drive storage (still being O(N log N)), but that would hurt the time complexity. A better way would be to iterate over x and y in such a way that you run over one block at a time: since a block is determined by an interval, you could for example simply iterate over all x and y such that lo <= 4*x**2+y**2 <= hi.
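A sketch of what that block-restricted enumeration might look like for the 4*x**2+y**2 form (the other two forms follow the same pattern; the function name and bound handling are my own):
import math

def quad_form_terms(lo, hi):
    # Yield (n, x, y) with n = 4*x**2 + y**2 and lo <= n <= hi
    x = 1
    while 4 * x * x + 1 <= hi:
        rem_lo = lo - 4 * x * x                  # smallest allowed y**2 (may be <= 0)
        rem_hi = hi - 4 * x * x                  # largest allowed y**2
        y_min = 1 if rem_lo <= 1 else math.isqrt(rem_lo - 1) + 1
        y_max = math.isqrt(rem_hi)
        for y in range(y_min, y_max + 1):
            yield 4 * x * x + y * y, x, y
        x += 1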
2. Is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them?
In order to achieve this (no matter how and when the program is terminated), you first have to journal your disk accesses (e.g. use an SQL database to keep the data, but with care you could do it yourself).
Second, since the operations in the first part are not idempotent, you have to make sure that you don't repeat those operations. However, since you would be running that part block by block, you could simply detect which block was processed last and resume there (if you can end up with a partially processed block, you'd just discard it and redo that block). The Eratosthenes part is idempotent, so you could just run through all of it, but to increase speed you could store a list of produced primes after they have been sieved (so you would resume sieving after the last produced prime).
As a by-product, you should even be able to construct the program in a way that makes it possible to keep the data from the first step while the second step is running, and thereby, at a later moment, extend the limit by continuing the first step and then running the second step again. Perhaps even have two programs, where you terminate the first when you've got tired of it and then feed its output to the Eratosthenes part (thereby not having to define a limit).
You could try using a signal handler to catch when your application is terminated. This could then save your current state before terminating. The following script shows a simple number count continuing when it is restarted.
import signal, os, pickle

class MyState:
    def __init__(self):
        self.count = 1

def stop_handler(signum, frame):
    # Ctrl-C sets the flag so the loop below can exit cleanly
    global running
    running = False

signal.signal(signal.SIGINT, stop_handler)
running = True

state_filename = "state.txt"
if os.path.isfile(state_filename):
    with open(state_filename, "rb") as f_state:
        my_state = pickle.load(f_state)
else:
    my_state = MyState()

while running:
    print(my_state.count)
    my_state.count += 1
    # Persist the state after every step so a restart resumes where we left off
    with open(state_filename, "wb") as f_state:
        pickle.dump(my_state, f_state)
As for improving disk writes, you could try experimenting with increasing Python's own file buffering, using a 1 MB or larger buffer, e.g. open('output.txt', 'w', 2**20). Using a with handler should also ensure your file gets flushed and closed.
There is a way to compress the array. It may cost some efficiency depending on the python interpreter, but you'll be able to keep more in memory before having to resort to disk. If you search online, you'll probably find other sieve implementations that use compression.
Neglecting compression though, one of the easier ways to persist memory to disk would be through a memory mapped file. Python has an mmap module that provides the functionality. You would have to encode to and from raw bytes, but it is fairly straightforward using the struct module.
>>> import struct
>>> struct.pack('H', 0xcafe)
b'\xfe\xca'
>>> struct.unpack('H', b'\xfe\xca')
(51966,)
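And a minimal sketch of backing a byte-per-candidate sieve array with a memory-mapped file (the file name and size are placeholders of my own; here a byte value of 1 marks a composite):
import mmap

N = 1_000_000
with open("sieve.dat", "wb") as f:
    f.truncate(N)                          # zero-filled file, one byte per candidate

with open("sieve.dat", "r+b") as f:
    flags = mmap.mmap(f.fileno(), N)       # behaves like a mutable byte buffer on disk
    flags[0] = flags[1] = 1                # 0 and 1 are not prime
    for i in range(2, int(N ** 0.5) + 1):
        if not flags[i]:
            for j in range(i * i, N, i):   # strike out multiples of each prime
                flags[j] = 1
    primes_below_100 = [i for i in range(100) if not flags[i]]
    flags.flush()
    flags.close()
print(primes_below_100)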
I'm trying to read in a somewhat large dataset using pandas read_csv or read_stata functions, but I keep running into Memory Errors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?
For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200 MB as a .dta and around 1.2 GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
I'm going to post this answer as was discussed in comments. I've seen it come up numerous times without an accepted answer.
The Memory Error is intuitive - out of memory. But sometimes the solution or the debugging of this error is frustrating as you have enough memory, but the error remains.
1) Check for code errors
This may be a "dumb step" but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something the os module that will search your entire computer and put the output in an excel file)
2) Make your code more efficient
This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open-source languages!
3) Check The Total Memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack about this, so you can search them. Popular answers are here and here
To find the size of an object in bytes, you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
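For example, a rough sketch of reading in chunks and watching how much each chunk uses (the file name and chunk size are placeholders):
import pandas as pd

chunks = []
for chunk in pd.read_csv("scf2007_ascii.csv", chunksize=10_000):
    # memory_usage(deep=True) also accounts for object/string columns
    print("chunk uses ~{:.1f} MB".format(chunk.memory_usage(deep=True).sum() / 1e6))
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)   # only do this if the whole thing fits in RAM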
4) Check the memory while running
Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check their documentation.
Use the code below to see the documentation directly in a Jupyter Notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler
def lol(x):
return x
%memit lol(500)
#output --- peak memory: 48.31 MiB, increment: 0.00 MiB
If you need help with magic functions, this is a great post.
5) This one may belong first... but check for simple things like the bit version
As in your case, simply switching the version of Python you were running solved the issue.
Usually the above steps solve my issues.