I'm writing a program which has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements and I can tolerate approximations (i.e. I just need to find a good enough alignment). The only problem is that the strings are regular strings (i.e. UTF-8 strings potentially with newlines that should be treated as a regular character); they aren't DNA sequences or protein sequences.
I can find tons of tools and information for the usual cases in bioinformatics with specific complicated file formats and a host of features I don't need, but it is unexpectedly hard to find software, libraries or example code for the simple case of strings. I could probably reimplement any one of the many algorithms for this problem or encode my strings as DNA, but there must be a better way. Do you know of any solutions?
Thanks!
The easiest way to align multiple sequences is to do a number of pairwise alignments.
First get pairwise similarity scores for each pair and store those scores. This is the most expensive part of the process. Choose the pair that has the best similarity score and do that alignment. Now pick the sequence which aligned best to one of the sequences in the set of aligned sequences, and align it to the aligned set, based on that pairwise alignment. Repeat until all sequences are in.
When you are aligning a sequence to the aligned sequences (based on a pairwise alignment), when you insert a gap in the sequence that is already in the set, you insert gaps in the same place in all sequences in the aligned set.
Lafrasu has suggested using SequenceMatcher() for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.
In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.
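To make that concrete, here is a minimal sketch of just the scoring and join-order steps, using SequenceMatcher for the pairwise scores (the gap-propagation step described above is left out, and the example strings are simply borrowed from the difflib answer below):

from difflib import SequenceMatcher
from itertools import combinations

def pairwise_scores(seqs):
    # Score every pair once; this is the most expensive part of the process.
    scores = {}
    for i, j in combinations(range(len(seqs)), 2):
        scores[(i, j)] = SequenceMatcher(None, seqs[i], seqs[j]).ratio()
    return scores

def join_order(seqs):
    # Start with the best-scoring pair, then repeatedly add the unaligned
    # sequence that scores best against anything already in the aligned set.
    scores = pairwise_scores(seqs)
    (i, j), _ = max(scores.items(), key=lambda kv: kv[1])
    aligned = [i, j]
    remaining = set(range(len(seqs))) - {i, j}
    while remaining:
        best = max(remaining, key=lambda r: max(
            scores[(min(r, a), max(r, a))] for a in aligned))
        aligned.append(best)
        remaining.remove(best)
    return aligned

print(join_order(["dsa jld lal", "dsajld kll", "dsc jle kal", "dsd jlekal"]))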
Are you looking for something quick and dirty, as in the following?
from difflib import SequenceMatcher

a = "dsa jld lal"
b = "dsajld kll"
c = "dsc jle kal"
d = "dsd jlekal"

ss = [a, b, c, d]
s = SequenceMatcher()
for i in range(len(ss)):
    x = ss[i]
    s.set_seq1(x)
    for j in range(i + 1, len(ss)):
        y = ss[j]
        s.set_seq2(y)
        print()                            # blank line between pairs
        print(s.ratio())
        print(s.get_matching_blocks())
MAFFT version 7.120+ supports multiple text alignment. Input is like FASTA format but with LATIN1 text instead of sequences and output is aligned FASTA format. Once installed, it is easy to run:
mafft --text input_text.fa > output_alignment.fa
Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is still under development, with future plans including support for user-defined scoring matrices. You can see further details in the documentation.
I've pretty recently written a python script that runs the Smith-Waterman algorithm (which is what is used to generate gapped local sequence alignments for DNA or protein sequences). It's almost certainly not the fastest implementation, as I haven't optimized it for speed at all (not my bottleneck at the moment), but it works and doesn't care about the identity of each character in the strings. I could post it here or email you the files if that's the kind of thing you're looking for.
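If it helps as a reference, a character-agnostic Smith-Waterman scorer only takes a handful of lines. This is not the script mentioned above, just a minimal sketch; the match/mismatch/gap values are arbitrary assumptions:

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Return the best local alignment score between strings a and b.
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("dsa jld lal", "dsajld kll"))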
I have a data frame which contains multiple number sequences, i.e.:
1324123
1235324
12342212
4313423
221231
...
These numbers meet the following requirement: each digit is from 1 to 4.
What I want to do is find all unique sequences and their reads. For the purpose of uniqueness, differences of up to two digits are allowed.
For example:
12344
12344
12334
1234
123444
are considered the same sequence; the original sequence is 1234 and the associated read is 5.
I want to accomplish this in python and only basic python packages are allowed: numpy, pandas, etc.
EDIT
the real case is a DNA sequence. For a simple DNA sequence ATGCTAGC, due to reading errors, the output for this actual sequence might be:
ATGCTAG(deleted), ATGCTAGG(altered), ATGCTAGCG(insertion), ATGCTAGC(unchanged).
These four sequences are considered the same sequence, and the read is the number of times they appear.
As it is, the problem isn't defined well enough - it is underconstrained.
(I'm going to be using case-sensitive sequences of [A-Za-z] for examples, since using unique characters makes the reasoning easier, but the same things apply to [1-4] and [ACGT] as well; for the same reason, I'm allowing only single-character differences in the examples. When I include a number in parentheses after a sequence, it denotes the read.)
Just a few examples off the top of my head:
For {ABCD, ABCE}, which one should be selected as the real sequence? By random?
What about {ABCD, ABCE, ABCE}? Is random still okay?
For {ABCD, ABCE, ABED}, should ABCD(3) be selected, since there's a single-letter difference between it and the other two, even though there's a two-letter difference between ABCE and ABED?
For {ABCE, ABED}, should ABCD(2) be selected, since there's a single-letter difference between it and the other two, even though the sequence doesn't exist in the input itself?
For {ABCD, ABCZ, ABYZ}, should ABCZ(3) be selected? Why not {ABCD(2), ABYZ(2)}?
For {ABCD, ABCZ, ABYZ, AXYZ}, should {ABCD(2), AXYZ(2)} be selected? Why not {ABCZ(3), ABYZ(3)}? (Or maybe you want it to chain, so you'd get a read of 4, even though the maximum difference is already 3 letters?)
In the comments, you said:
I am just listing a very simple example, the real case is much longer.
How long? What's the minimum length? (What's the maximum?) It's relevant information.
And finally - before I get to the meat of the problem - what are you doing this for? If it's just for learning - as a personal exercise - that's fine. But if you're actually doing some real research: For all that is good and holy, please research existing tools/libraries for dealing with DNA sequences and/or enlist the help of someone who is familiar with those. I'm sure there are heaps of tools available that can do better and faster, than what I'm about to present. That being said...
Let's look at this logically. If you have a big collection of strings, and you want to quickly find if it contains a specific string, you'd use a set (or a dictionary, if there's associated data). The problem, of course, is that you don't want to find only exact matches. But since the number of allowed errors is constrained and extremely small, there are some easy workarounds.
For one, you could just generate all the possible sequences with the allowable amount of error, and try to look up each of them, but that really only makes sense if the strings are short and there's only one allowable error, since the number of possible error combinations grows really fast.
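For instance, a rough sketch of generating every single-error variant of a sequence (using the digits 1-4 from the question as the assumed alphabet):

def one_error_variants(seq, alphabet="1234"):
    # All sequences reachable from seq with at most one insertion,
    # deletion or substitution.
    variants = {seq}
    for i in range(len(seq) + 1):
        for ch in alphabet:
            variants.add(seq[:i] + ch + seq[i:])        # insertion
    for i in range(len(seq)):
        variants.add(seq[:i] + seq[i + 1:])             # deletion
        for ch in alphabet:
            variants.add(seq[:i] + ch + seq[i + 1:])    # substitution
    return variants

print(len(one_error_variants("1234")))

Each variant can then be looked up directly in a set or dictionary of the input sequences.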
If the strings are long enough, and aren't expected to generally share large chunks (unless they're within the allowable error, so the strings are considered the same), you can make the observation that if there's a maximum of two modifications, and you cut a string into 3 parts (it doesn't matter if there are leftovers), then one of the parts must match the corresponding part of the original string. This can be extended to insertions and deletions by generating the 3 parts for 3 different shifts of the string (and choosing/dealing with the part lengths suitably). So by generating 9 keys for each sequence, and using a dictionary, you can quickly find all sequences that are capable of matching the sequence with 2 errors.
(Of course, as I said at the start, this doesn't work if a large part of unrelated strings share big chunks: if all of your strings only have differences at the beginning, and have the same end, you'll just end up with all the strings grouped together, and no closer to solving the problem.)
(Also: if the sequence you want to select doesn't necessarily exist in the input, like described in the 4th example, you need 5 parts with 5 shifts to guarantee a matching key, since the difference between the existing sequences can be up to 4.)
An example:
Original sequence:
ABCDEFGHIJKLMNOP
Generated parts: (Divided into 3 parts (of size 4), with 3 different shifts)
[ABCD][EFGH][IJKL]MNOP
A[BCDE][FGHI][JKLM]NOP
AB[CDEF][GHIJ][KLMN]OP
If you now make any two modifications to the original sequence, and generate parts for it in the same manner, at least one of the parts will always match. If the sequences are all approximately the same size, the part size can just be statically set to a suitable value (there must be at least 2 characters left over after the shift, as shown here, so a string with two deletions can still generate the same keys). If not, eg. powers of two can be used, taking care to generate keys for both sides when the string length is such that matching sequences could fall into a neighbouring size bucket.
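A rough sketch of that key generation and bucketing (the part length of 4, the number of parts/shifts, and the dict-of-sets layout are just illustrative choices; sequences that share a key still have to be verified with a real edit-distance check):

from collections import defaultdict

def part_keys(seq, part_len=4, parts=3, shifts=3):
    # 3 parts at 3 different shifts gives up to 9 keys per sequence; a key is
    # (part index, chunk) so only corresponding chunks can collide.
    keys = set()
    for shift in range(shifts):
        for p in range(parts):
            chunk = seq[shift + p * part_len : shift + (p + 1) * part_len]
            if len(chunk) == part_len:
                keys.add((p, chunk))
    return keys

def candidate_groups(seqs):
    # Bucket sequence indices by shared keys; each bucket holds sequences
    # that might be within the allowed error of each other.
    buckets = defaultdict(set)
    for idx, s in enumerate(seqs):
        for key in part_keys(s):
            buckets[key].add(idx)
    return [b for b in buckets.values() if len(b) > 1]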
But those are in essence just examples of how you could approach coming up with solutions when presented with this kind of a problem; just random ad hoc methods. For a smarter, more general solution, you could look at e.g. generalized suffix trees: they should allow you to find matching sequences with mismatches very quickly, though I'm not sure if that includes insertions/deletions, or how easy that would be to do.
You may use Levenshtein distance to measure the number of insertions, deletions and substitutions:
>>> import Levenshtein
>>> Levenshtein.distance( '12345', '1234' )
1
>>> Levenshtein.distance( '12345', '12354' )
2
I am using a lexicon of positive and negative words, and I want to count how many positive and negative words appear in each document from a large corpus. The corpus has almost 2 million documents, so the code I'm running is taking too long to count all these occurrences.
I have tried using numpy, but get a memory error when trying to convert the list of documents into an array.
This is the code I am currently running to count just the positive words in each document.
reviews_pos_wc = []
for review in reviews_upper:
    pos_words = 0
    for word in review:
        if word in pos_word_list:
            pos_words += 1
    reviews_pos_wc.append(pos_words)
After running this for half an hour, it only gets through 300k documents.
I have done a search for similar questions on this website. I found someone else doing a similar thing, but not nearly on the same scale as they only used one document. The answer suggested using the Counter class, but I thought this would just add more overhead.
It appears that your central problem is that you don't have the hardware needed to do the job you want in the time you want. For instance, your RAM appears insufficient to hold the names of 2M documents in both list and array form.
I do see a couple of possibilities. Note that "vectorization" is not a magic solution to large problems; it's merely a convenient representation that allows certain optimizations to occur among repeated operations.
Regularize your file names, so that you can represent their names in fewer bytes. Iterate through a descriptive expression, rather than the full file names. This could give you freedom to vectorize something later.
Your variable name implies that your lexicon is a list. This has inherently linear access. Change this to a data structure amenable to faster search, such as a set (hash-based) or some appropriate search tree; even a sorted list with an interpolation search would speed up your work (a small sketch follows below).
Do consider using popular modules (such as collections); let the module developers optimize the common operations on your behalf. Write a prototype and time its performance: given the simplicity of your processing, the coding shouldn't take long.
Does that give you some ideas for experimentation? I'm hopeful that my first paragraph proves to be unrealistically pessimistic (i.e. that something does provide a solution, especially the lexicon set).
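To illustrate the lexicon-as-set suggestion, a minimal sketch (reviews_upper and pos_word_list are the names from the question; the rest of the loop is unchanged):

pos_word_set = set(pos_word_list)   # O(1) membership tests instead of scanning a list

reviews_pos_wc = []
for review in reviews_upper:
    reviews_pos_wc.append(sum(1 for word in review if word in pos_word_set))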
I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500mil list which start with 'hi' (for eg. 'hithere', 'hihowareyou' etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown, if he types "hey", all strings starting with "hey" will show etc.
I've tried a trie, but the memory footprint to store 300 mil strings is just huge: it would require 100GB+ of RAM. And I'm pretty sure the list will grow to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph or DAWG. This generalizes #greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
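A sketch of the binary-search part using the standard bisect module (the bigram index is left out, and strings is assumed to be the already-sorted list):

import bisect

def prefix_matches(strings, prefix):
    # strings must be sorted; everything sharing the prefix is contiguous.
    lo = bisect.bisect_left(strings, prefix)
    hi = bisect.bisect_left(strings, prefix + "\x7f")  # '\x7f' sorts after any ASCII suffix
    return strings[lo:hi]

strings = sorted(["hello", "hey", "hihowareyou", "hithere"])
print(prefix_matches(strings, "hi"))   # ['hihowareyou', 'hithere']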
If you want to force the user to type at least 4 letters, for example, you can keep a key-value map, in memory or on disk, where the keys are all combinations of 4 letters (there are not too many if it is case-insensitive; otherwise you can limit it to three), and the values are lists of positions of all strings that begin with that combination.
After the user has typed the three (or four) letters you immediately have all the possible strings. From this point on you just loop over this subset.
On average this subset is small enough, i.e. 500M divided by 26^4, just as an example. It is actually bigger, because probably not every combination of 4 letters is a prefix of one of your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to its key in the map.
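A sketch of that map, kept in memory and keyed on the first four letters (big_list stands in for the 500M-entry list, and lower-casing implements the case-insensitive variant):

from collections import defaultdict

big_list = ["hithere", "hihowareyou", "heyyou", "hellothere"]   # stand-in data

prefix_map = defaultdict(list)
for pos, s in enumerate(big_list):
    prefix_map[s[:4].lower()].append(pos)

def lookup(query):
    # Only the small subset sharing the 4-letter key is scanned.
    positions = prefix_map.get(query[:4].lower(), [])
    return [big_list[p] for p in positions if big_list[p].startswith(query)]

print(lookup("hith"))   # ['hithere']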
If you don't want to use a database, you will end up recreating data-handling routines that already exist in all database engines:
Don't try to load all the data into memory.
Use a fixed length for all strings. It increases storage consumption but significantly decreases seek time (the i-th string can be found at position L*i bytes in the file, where L is the fixed length; see the sketch after this list). Create an additional mechanism for extremely long strings: store them in a different place and use special pointers.
Sort all of the strings. You can use merge sort to do it without loading all the strings into memory at once.
Create indexes (the address of the first line starting with 'a', 'b', ...); indexes can also be created for 2-grams, 3-grams, etc. Indexes can be kept in memory to increase search speed.
Use advanced strategies to avoid full index regeneration on data updates: split the data into a number of files by first letter and update only the affected indexes, create empty spaces in the data to reduce the impact of read-modify-write procedures, and keep a cache of new lines before they are added to the main storage and search in this cache as well.
Use a query cache for fast processing of popular requests.
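A sketch of the fixed-length layout mentioned above (L = 32 is an arbitrary record length; strings are NUL-padded, overlong strings are truncated here rather than handled via the special-pointer mechanism, and the file is assumed to be already sorted):

L = 32   # fixed record length, chosen arbitrarily for illustration

def write_records(path, strings):
    with open(path, "wb") as f:
        for s in strings:
            f.write(s.encode("ascii").ljust(L, b"\x00")[:L])

def read_record(f, i):
    # f is an already-open binary file; the i-th string starts at byte L*i,
    # so a lookup is a single seek.
    f.seek(i * L)
    return f.read(L).rstrip(b"\x00").decode("ascii")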
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addresses allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be performed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
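A compact sketch of one such fixed fan-out node, read with the struct module (the 26-letter alphabet, 4-byte slots, and the meaning of a zero slot are assumptions taken from the description above):

import struct

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
NODE_FMT = "<26I"                       # 26 fixed-length (4-byte) slots per node
NODE_SIZE = struct.calcsize(NODE_FMT)

def read_node(f, offset):
    # Read the 26 slots of the node stored at the given file offset.
    f.seek(offset)
    return struct.unpack(NODE_FMT, f.read(NODE_SIZE))

def child_offset(f, node_offset, ch):
    # Follow the slot for character ch; 0 means no string continues with ch.
    slots = read_node(f, node_offset)
    return slots[ALPHABET.index(ch)]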
First of all I should say that the tag you should have added to your question is "information retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version if you are comfortable with Python. But to use Apache Lucene to solve your problem, you should first learn about indexing your data (the step in which your data is compressed and saved in a more efficient manner).
Also, looking at the indexing and wildcard query sections of an IR book will give you a better picture.
I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).
My prototype uses Python's set as the obvious data type.
When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.
Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.
I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.
I can't find any, or at least not any with Python bindings. There is avltree which does what I want, but since even the pair-wise set merge takes longer than I want, I suspect I want to have all my operations done in C/C++.
Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?
Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)
(Edit: to clarify what I'm looking for from an API, I want something like:
def merge(document_sets):
    probe_i = 0
    probe_set = document_sets[probe_i]
    document_id = GET_FIRST(probe_set)
    while IS_VALID(document_id):
        # See if the document is present in all sets
        for i in range(1, len(document_sets)):
            # dynamically adapt to favor the least matching set
            target_i = (i + probe_i) % len(document_sets)
            target_set = document_sets[target_i]
            if document_id not in target_set:
                probe_i = target_i
                probe_set = document_sets[probe_i]
                document_id = GET_NEXT(probe_set, document_id)
                break
        else:
            yield document_id
            document_id = GET_NEXT(probe_set, document_id)
I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.
This algorithm dynamically adapts to the data and should preferentially favor sets with low hits. For the type of data I work with this has given a 5-10% increase in performance.)
Yes, there are some, though I'm not sure if they're suitable for your use case; it seems none of them are exactly what you asked for.
BioPython has a Trie implementation in C.
Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520
Other (some very stale) implementations:
http://pypi.python.org/pypi/radix
py-radix is an implementation of a radix tree data structure for the storage and retrieval of IPv4 and IPv6 network prefixes.
https://bitbucket.org/markon/patricia-tree/src
A Python implementation of patricia-tree
http://pypi.python.org/pypi/trie
A prefix tree (trie) implementation.
http://pypi.python.org/pypi/logilab-common/0.50.3
patricia.py: A Python implementation of PATRICIA trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric).
I've recently added iteration support to datrie, you may give it a try.
I'm developing a Genetic Algorithm in Python where chromosomes are composed of strings and integers. To apply the genetic operations, I want to convert these groups of integers and strings into bit strings.
For example, if one chromosome is:
["Hello", 4, "anotherString"]
I'd like it to become something like:
0100100100101001010011110011
(this is not an actual translation). So... how can I do this? Chromosomes will contain the same number of strings and integers within a run, but these counts can vary from one run of the algorithm to another.
To be clear, what I want to obtain is the bit representation of each element in the chromosome concatenated.
If you think this would not be the best way to apply genetic operators (such as mutation and simple crossover) just tell me! I'm open to new ideas.
Thanks a lot!
Manuel
You can turn strings and integers into bytestrings (and back) with the struct module, and that's exactly 8 bits to a byte. If for some reason you want these binary bytestrings as text strings made up of 0 and 1 characters, you can print them in binary form, of course.
Edit: forgot to remind you how to format a byte into a text string made up of 0 and 1 characters -- in Python 2.6 or better:
>>> format(23, '08b')
'00010111'
and to get back from such a string to a byte, of course:
>>> int('00010111', 2)
23
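Putting the two steps together, a minimal sketch (the layout used here, length-prefixed UTF-8 strings plus 4-byte integers, is just one arbitrary choice, not a standard encoding):

import struct

def chromosome_to_bits(chrom):
    # Pack each gene into bytes, then render the bytes as '0'/'1' characters.
    payload = b""
    for gene in chrom:
        if isinstance(gene, int):
            payload += struct.pack(">i", gene)                   # 4-byte int
        else:
            data = gene.encode("utf-8")
            payload += struct.pack(">H", len(data)) + data       # length-prefixed string
    return "".join(format(byte, "08b") for byte in payload)

print(chromosome_to_bits(["Hello", 4, "anotherString"]))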
Converting everything into one concatenated string and then applying genetic operations doesn't seem to be the best idea. Genetic operations can break many things here (especially if you have some constraints on individuals), and the effectiveness of such a solution is probably low. I would suggest a different approach.
Try implementing individuals using the SuperGene concept (wiki). An example of applying it to a GA is described here. Additionally, according to this, it is said to improve overall GA performance.
In my opinion it will make design clearer. I would try this approach.
Once you describe exactly how the translation from strings to bitstrings should go, the "how" should be fairly easy. If the genetic algorithm is to work at the bit level then obviously a bit-level string makes sense, but it is probably much slower than using numbers or character strings.