How efficient is list.index(value, start, end)?

How efficient is list.index(value, start, end)? - python

Today I realized that python's list.index can also take an optional start (and even end) parameter.
I was wondering whether or not this is efficiently implemented and which of these two is better:
pattern = "qwertyuytresdftyuioknn"
words_list = ['queen', 'quoin']
for word in words_list:
i = 1
for character in word:
try:
i += pattern[i:].index(character)
except ValueError:
break
else:
yield word
or
pattern = "qwertyuytresdftyuioknn"
words_list = ['queen', 'quoin']
for word in words_list:
i = 1
for character in word:
try:
i = pattern.index(character, i)
except ValueError:
break
else:
yield word
So basically i += pattern[i:].index(character) vs i = pattern.index(character, i).
Searching for this on generic_search_machine returns nothing helpful, except a lot of beginner tutorials trying to teach me what a list is.
Background:
This code tries to find all words from words_list which match pattern. pattern is a list of characters a user entered by swiping over the keyboard, like on most modern mobile device's keyboards.
In the actual implementation there is the additional requirement that the returned word should be longer than 5 characters and the first and last character have to exactly match. These lines are omitted here for brevity, since they are trivial to implement.

This calls a built-in function implemented in C:
i = pattern.index(character, i)
Even without looking at the source code, you can always assume that the underlying implementation is smart enough to implement that efficiently, i.e. that it does not look at the first i values in the list.
As a rule of thumb, using a built-in functionality is always faster than (or at least as fast as) the best thing you can implement yourself.
The attempt to make it better:
i += pattern[i:].index(character)
This is deffinitely worse. It makes a copy of pattern[i:] and then looks for character in it.
So, in the worst case, if you have a pattern of 1 GB and i=1, this copies 1 GB of data in memory in attempt to skip the first element (which whould have been skipped anyway).

Related

python basic regex function

I am trying to write a function that implements a simple regex matching algorithm. The special characters "*" and "?" should stand for 1 and n>=0 degrees of freedom respectively. For example the strings
y="abc" and x="a*c",
y="abc" and x="a?c",
y="abddddzfjc" and x="a?" or x="a?c"
should return True, whereas the strings
y="abcd" and x="a*d",
y="abcdef" and x="a?d*"
should return False.
My method is to run in a loop and shorten the strings as each subsequent match is identified, which works fine for identical matches or single * with alphabet character matches, but I am a stumped on about how to do it for edge cases like the last example. To handle the case where "?" has n degrees of freedom, I loop forward in the right string to find the next alphabet character, then try to find that character in the left string, looking from right to left. I am sure there is a more elegant way (maybe with a generator?!).
def match_func(x,y):
x, y = list(x), list(y)
if len(x)==len(y)==1:
if x[0] == y[0] or bool((set(x)|set(y)) & {"?","*"})
return True
elif len(x)>0 and len(y)==0:
return False
else:
for ix, char in enumerate(x):
if char==y[ix] or char=="*":
return match_func(x[ix+1:],y[ix+1:])
else:
if char=="?"
if ix==len(x)=1: return True
##check if the next letter in x has an eventual match in y
peek = ix+1
next_char = x[peek]
while peek<len(x)-1:
next_char = x[peek]
if next_char.isalpha():
break
else: peek+=1
if peek == len(x)-1:
return True
ys = ''.join(y)
next_char_ix = ys[ix].rfind(next_char)
##search y for next possible match?
if next_char_ix!=-1:
return match_func(x[peek:], y[next_char_ix:])
else:
return False
else:
return False
return True

First decide whether to make your match algorithm a minimal or maximal search. Meaning, if your pattern is a, and your subject string is aa, does the match occur at the first or second position? As you state the problem, either choice seems to be acceptable.
Having made that choice, it will become clear how you should traverse the string - either as far to the right as possible and then working backward until you either match or fail; or starting at the left and backtracking after each attempt.
I recommend a recursive implementation either way. At each position, evaluate whether you have a possible match. If so, make your recursive call advancing the appropriate amount down both the pattern and subject string. If not, give up. If there is no match for the first character of the pattern, advance only the subject string (according to your minimal/maximal choice) and try again.
The tricky part is, you have to consider variable-length tokens in your pattern as possible matches even if the same character also matches a literal character following that wildcard. That puts you in the realm of depth-first search. Evaluating patterns like a?a?a?a on subject strings like aaaabaaaa will be lots of fun, and if you push it too far, may take years to complete.
Your professor chose well the regex operators to give you to make the assignment of meaningful depth, without the tedium of writing a full-on parser and lexer just to make the thing work.
Good luck!

How to cut up a string in a specific way, retaining what is useful?

I'm trying to make a program for my biology research.
I need to take this sequence:
NNNNNNNNNNCCNNAGTGNGNACAGACGACGGGCCCTGGCCCCTCGCACACCCTGGACCA
AGTCAATCGCACCCACTTCCCTTTCTTCTCGGATGTCAAGGGCGACCACCGGTTGGTGTT
GAGCGTCGTGGAGACCACCGTTCTGGGGCTCATCTTTGTCGTCTCACTGCTGGGCAACGT
GTGTGCTCTAGTGCTGGTGGCGCGCCGTCGGCGCCGTGGGGCGACAGCCAGCCTGGTGCT
CAACCTCTTCTGCGCGGATTTGCTCTTCACCAGCGCCATCCCTCTAGTGCTCGTCGTGCG
CTGGACTGAGGCCTGGCTGTTGGGGCCCGTCGTCTGCCACCTGCTCTTCTACGTGATGAC
AATGAGCGGCAGCGTCACGATCCTCACACTGGCCGCGGTCAGCCTGGAGCGCATGGTGTG
CATCGTGCGCCTCCGGCGCGGCTTGAGCGGCCCGGGGCGGCGGACTCAGGCGGCACTGCT
GGCTTTCATATGGGGTTACTCGGCGCTCGCCGCGCTGCCCCTCTGCATCTTGTTCCGCGT
GGTCCCGCAGCGCCTTCCCGGCGGGGACCAGGAAATTCCGATTTGCACATTGGATTGGCC
CAACCGCATAGGAGAAATCTCATGGGATGTGTTTTTTGTGACTTTGAACTTCCTGGTGCC
GGGACTGGTCATTGTGATCAGTTACTCCAAAATTTTACAGATCACGAAAGCATCGCGGAA
GAGGCTTACGCTGAGCTTGGCATACTCTGAGAGCCACCAGATCCGAGTGTCCCAACAAGA
CTACCGACTCTTCCGCACGCTCTTCCTGCTCATGGTTTCCTTCTTCATCATGTGGAGTCC
CATCATCATCACCATCCTCNCATCTTGATCCAAAACTTCCGGCAGGACCTGGNCATCTGG
NCATCCCTTTTCTTCTGGGNNGTNNNNNCACGTTGCNACTCTNCCTAAANCCCATACTGT
ANNANATGNCGCTNNNAGGAANGAATGGAGGAANANTTTTTGNNNNNNNNN
...and remove everything past the last N in the beginning and the first N at the end. In other words, to make it look something like this:
ACAGACGACGGGCCCTGGCCCCTCGCACACCCTGGACCA
AGTCAATCGCACCCACTTCCCTTTCTTCTCGGATGTCAAGGGCGACCACCGGTTGGTGTT
GAGCGTCGTGGAGACCACCGTTCTGGGGCTCATCTTTGTCGTCTCACTGCTGGGCAACGT
GTGTGCTCTAGTGCTGGTGGCGCGCCGTCGGCGCCGTGGGGCGACAGCCAGCCTGGTGCT
CAACCTCTTCTGCGCGGATTTGCTCTTCACCAGCGCCATCCCTCTAGTGCTCGTCGTGCG
CTGGACTGAGGCCTGGCTGTTGGGGCCCGTCGTCTGCCACCTGCTCTTCTACGTGATGAC
AATGAGCGGCAGCGTCACGATCCTCACACTGGCCGCGGTCAGCCTGGAGCGCATGGTGTG
CATCGTGCGCCTCCGGCGCGGCTTGAGCGGCCCGGGGCGGCGGACTCAGGCGGCACTGCT
GGCTTTCATATGGGGTTACTCGGCGCTCGCCGCGCTGCCCCTCTGCATCTTGTTCCGCGT
GGTCCCGCAGCGCCTTCCCGGCGGGGACCAGGAAATTCCGATTTGCACATTGGATTGGCC
CAACCGCATAGGAGAAATCTCATGGGATGTGTTTTTTGTGACTTTGAACTTCCTGGTGCC
GGGACTGGTCATTGTGATCAGTTACTCCAAAATTTTACAGATCACGAAAGCATCGCGGAA
GAGGCTTACGCTGAGCTTGGCATACTCTGAGAGCCACCAGATCCGAGTGTCCCAACAAGA
CTACCGACTCTTCCGCACGCTCTTCCTGCTCATGGTTTCCTTCTTCATCATGTGGAGTCC
CATCATCATCACCATCCTC
How would I do this?

I think you may be looking for the longest sequence of non-N characters in the input.
Otherwise, you have no rule to distinguish the last N in the prefix from the first N in the suffix. There is nothing at all different about the N you want to start after (before the ACAGAC…) and the next N (before the CATCCC), or, for that matter, the previous one (before the GN) except that it picks out the longest sequence. In fact, other than the 10 N's at the very start and the 9 at the very end, there doesn't seem to be anything special about any of the N's.
The easiest way to do that is to just grab all the sequences and keep the longest:
max(s.split('N'), key=len)
If you have some additional rule on top of this—e.g., the longest sequence whose length is divisible by three (which in this case is the same thing)—you can do the same basic thing:
max((seq for seq in s.split('N') if len(seq) % 3 == 0), key=len)

#abarnert's answer is correct but str.split() returns the a list sub strings. Meaning the memory usage is literally O(N) (e.g. use tons of memory). This isn't a problem when you're input is short but when processing DNA sequences, your input is typically very long. To avoid the memory overhead, your need to use a iterator. I recommend the re's finditer.
import re
_find_n_free_substrings = re.compile(r'[^N]+', re.MULTILINE).finditer
def longest_n_free_substring(string):
substrings = (match.group(0) for match in _find_n_free_substrings(string))
return max(substrings, key=len)

Python text search question [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I wonder, if you open a text file in Python. And then you'd like to search of words containing a number of letters.
Say you type in 6 different letters (a,b,c,d,e,f) you want to search.
You'd like to find words matching at least 3 letters.
Each letter can only appear once in a word.
And the letter 'a' always has to be containing.
How should the code look like for this specific kind of search?

Let's see...
return [x for x in document.split()
if 'a' in x and sum((1 if y in 'abcdef' else 0 for y in x)) >= 3]
split with no parameters acts as a "words" function, splitting on any whitespace and removing words that contain no characters. Then you check if the letter 'a' is in the word. If 'a' is in the word, you use a generator expression that goes over every letter in the word. If the letter is inside of the string of available letters, then it returns a 1 which contributes to the sum. Otherwise, it returns 0. Then if the sum is 3 or greater, it keeps it. A generator is used instead of a list comprehension because sum will accept anything iterable and it stops a temporary list from having to be created (less memory overhead).
It doesn't have the best access times because of the use of in (which on a string should have an O(n) time), but that generally isn't a very big problem unless the data sets are huge. You can optimize that a bit to pack the string into a set and the constant 'abcdef' can easily be a set. I just didn't want to ruin the nice one liner.
EDIT: Oh, and to improve time on the if portion (which is where the inefficiencies are), you could separate it out into a function that iterates over the string once and returns True if the conditions are met. I would have done this, but it ruined my one liner.
EDIT 2: I didn't see the "must have 3 different characters" part. You can't do that in a one liner. You can just take the if portion out into a function.
def is_valid(word, chars):
count = 0
for x in word:
if x in chars:
count += 1
chars.remove(x)
return count >= 3 and 'a' not in chars
def parse_document(document):
return [x for x in document.split() if is_valid(x, set('abcdef'))]
This one shouldn't have any performance problems on real world data sets.

Here is what I would do if I had to write this:
I'd have a function that, given a word, would check whether it satisfies the criteria and would return a boolean flag.
Then I'd have some code that would iterate over all words in the file, present each of them to the function, and print out those for which the function has returned True.

I agree with aix's general plan, but it's perhaps even more general than a 'design pattern,' and I'm not sure how far it gets you, since it boils down to, "figure out a way to check for what you want to find and then check everything you need to check."
Advice about how to find what you want to find: You've entered into one of the most fundamental areas of algorithm research. Though LCS (longest common substring) is better covered, you'll have no problems finding good examples for containment either. The most rigorous discussion of this topic I've seen is on a Google cs wonk's website: http://neil.fraser.name. He has something called diff-match-patch which is released and optimized in many different languages, including python, which can be downloaded here:
http://code.google.com/p/google-diff-match-patch/
If you'd like to understand more about python and algorithms, magnus hetland has written a great book about python algorithms and his website features some examples within string matching and fuzzy string matching and so on, including the levenshtein distance in a very simple to grasp format. (google for magnus hetland, I don't remember address).
WIthin the standard library you can look at difflib, which offers many ways to assess similarity of strings. You are looking for containment which is not the same but it is quite related and you could potentially make a set of candidate words that you could compare, depending on your needs.
Alternatively you could use the new addition to python, Counter, and reconstruct the words you're testing as lists of strings, then make a function that requires counts of 1 or more for each of your tested letters.
Finally, on to the second part of the aix's approach, 'then apply it to everything you want to test,' I'd suggest you look at itertools. If you have any kind of efficiency constraint, you will want to use generators and a test like the one aix proposes can be most efficiently carried out in python with itertools.ifilter. You have your function that returns True for the values you want to keep, and the builtin function bool. So you can just do itertools.ifilter(bool,test_iterable), which will return all the values that succeed.
Good luck

words = 'fubar cadre obsequious xray'
def find_words(src, required=[], letters=[], min_match=3):
required = set(required)
letters = set(letters)
words = ((word, set(word)) for word in src.split())
words = (word for word in words if word[1].issuperset(required))
words = (word for word in words if len(word[1].intersection(letters)) >= min_match)
words = (word[0] for word in words)
return words
w = find_words(words, required=['a'], letters=['a', 'b', 'c', 'd', 'e', 'f'])
print list(w)
EDIT 1: I too didn't read the requirements closely enough. To ensure a word contains only 1 instance of a valid letter.
from collections import Counter
def valid(word, letters, min_match):
"""At least min_match, no more than one of any letter"""
c = 0
count = Counter(word)
for letter in letters:
char_count = count.get(letter, 0)
if char_count > 1:
return False
elif char_count == 1:
c += 1
if c == min_match:
return True
return True
def find_words(srcfile, required=[], letters=[], min_match=3):
required = set(required)
words = (word for word in srcfile.split())
words = (word for word in words if set(word).issuperset(required))
words = (word for word in words if valid(word, letters, min_match))
return words

Fastest way in Python to find a 'startswith' substring in a long sorted list of strings

I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.
I am writing an implementation of the Ghost for MIT Introduction to Programming, assignment 5.
As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").
Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:
def word_exists(wordlist, word_fragment):
return any(w.startswith(word_fragment) for w in wordlist)
I previously had:
wordlist = [w for w in wordlist if w.startswith(word_fragment)]
(from here) to narrow the list down to the list of valid words that start with that fragment and consider it a loss if wordlist is empty. The reason that I took this approach was that I (incorrectly, see below) thought that this would save time, as subsequent lookups would only have to search a smaller list.
It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:
newlist = []
for w in wordlist:
if w[:len(word_fragment)] > word_fragment:
# Take advantage of the fact that the list is sorted
break
if w.startswith(word_fragment):
newlist.append(w)
return newlist
but that is about the same speed, which I thought may be because list comprehensions run as compiled code?
I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?
Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.
UPDATE:
I have since tested the below suggestions with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:
The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows:
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452
The overhead of creating a second list etc clearly took more time than searching the longer list. In hindsight, this was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst case scenario.
Thanks all for some excellent suggestions, and well done Peter for the best answer!!!

Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient since it doesn't necessarily force it to build the full list once it finds a match:
def word_exists(wordlist, word_fragment):
return any(w.startswith(word_fragment) for w in wordlist)
Note that the lack of square brackets is important for this to work.
However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:
from bisect import bisect_left
def word_exists(wordlist, word_fragment):
try:
return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
except IndexError:
return False # word_fragment is greater than all entries in wordlist
bisect_left runs in O(log(n)) so is going to be considerably faster for a large wordlist.
Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.

You're right that you can do this more efficiently given that the list is sorted.
I'm building off of #Peter's answer, which returns a single element. I see that you want all the words that start with a given prefix. Here's how you do that:
from bisect import bisect_left
wordlist[bisect_left(wordlist, word_fragment):
bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1])+1))]
This returns the slice from your original sorted list.

As Peter suggested I would use the Bisect module. Especially if you're reading from a large file of words.
If you really need speed you could make a daemon ( How do you create a daemon in Python? ) that has a pre-processed data structure suited for the task
I suggest you could use "tries"
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries
There are many algorithms and data structures to index and search
strings inside a text, some of them are included in the standard
libraries, but not all of them; the trie data structure is a good
example of one that isn't.
Let word be a single string and let dictionary be a large set of
words. If we have a dictionary, and we need to know if a single word
is inside of the dictionary the tries are a data structure that can
help us. But you may be asking yourself, "Why use tries if set
and hash tables can do the same?" There are two main reasons:
The tries can insert and find strings in O(L) time (where L represent
the length of a single word). This is much faster than set , but is it
a bit faster than a hash table.
The set and the hash tables
can only find in a dictionary words that match exactly with the single
word that we are finding; the trie allow us to find words that have a
single character different, a prefix in common, a character missing,
etc.
The tries can be useful in TopCoder problems, but also have a
great amount of applications in software engineering. For example,
consider a web browser. Do you know how the web browser can auto
complete your text or show you many possibilities of the text that you
could be writing? Yes, with the trie you can do it very fast. Do you
know how an orthographic corrector can check that every word that you
type is in a dictionary? Again a trie. You can also use a trie for
suggested corrections of the words that are present in the text but
not in the dictionary.
an example would be:
start={'a':nodea,'b':nodeb,'c':nodec...}
nodea={'a':nodeaa,'b':nodeab,'c':nodeac...}
nodeb={'a':nodeba,'b':nodebb,'c':nodebc...}
etc..
then if you want all the words starting with ab you would just traverse
start['a']['b'] and that would be all the words you want.
to build it you could iterate through your wordlist and for each word, iterate through the characters adding a new default dict where required.

In case of binary search (assuming wordlist is sorted), I'm thinking of something like this:
wordlist = "ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m"
fragment = "bc"
a, m, b = 0, 0, len(wordlist)-1
iterations = 0
while True:
if (a + b) / 2 == m: break # endless loop = nothing found
m = (a + b) / 2
iterations += 1
if wordlist[m].startswith(fragment): break # found word
if wordlist[m] > fragment >= wordlist[a]: a, b = a, m
elif wordlist[b] >= fragment >= wordlist[m]: a, b = m, b
if wordlist[m].startswith(fragment):
print wordlist[m], iterations
else:
print "Not found", iterations
It will find one matched word, or none. You will then have to look to the left and right of it to find other matched words. My algorithm might be incorrect, its just a rough version of my thoughts.

Here's my fastest way to narrow the list wordlist down to a list of valid words starting with a given fragment :
sect() is a generator function that uses the excellent Peter's idea to employ bisect, and the islice() function :
from bisect import bisect_left
from itertools import islice
from time import clock
A,B = [],[]
iterations = 5
repetition = 10
with open('words.txt') as f:
wordlist = f.read().split()
wordlist.sort()
print 'wordlist[0:10]==',wordlist[0:10]
def sect(wordlist,word_fragment):
lgth = len(word_fragment)
for w in islice(wordlist,bisect_left(wordlist, word_fragment),None):
if w[0:lgth]==word_fragment:
yield w
else:
break
def hooloo(wordlist,word_fragment):
usque = len(word_fragment)
for w in wordlist:
if w[:usque] > word_fragment:
break
if w.startswith(word_fragment):
yield w
for rep in xrange(repetition):
te = clock()
for i in xrange(iterations):
newlistA = list(sect(wordlist,'VEST'))
A.append(clock()-te)
te = clock()
for i in xrange(iterations):
newlistB = list(hooloo(wordlist,'VEST'))
B.append(clock() - te)
print '\niterations =',iterations,' number of tries:',repetition,'\n'
print newlistA,'\n',min(A),'\n'
print newlistB,'\n',min(B),'\n'
result
wordlist[0:10]== ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']
iterations = 5 number of tries: 30
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.0286089433154
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.415578236899
sect() is 14.5 times faster than holloo()
PS:
I know the existence of timeit, but here, for such a result, clock() is fully sufficient

Doing binary search in the list is not going to guarantee you anything. I am not sure how that would work either.
You have a list which is ordered, it is a good news. The algorithmic performance complexity of both your cases is O(n) which is not bad, that you just have to iterate through the whole wordlist once.
But in the second case, the performance (engineering performance) should be better because you are breaking as soon as you find that rest cases will not apply. Try to have a list where 1st element is match and rest 38000 - 1 elements do not match, you will the second will beat the first.

String Occurrence Counting Algorithm

I am curious what is the most efficient algorithm (or commonly used) to count the number of occurrences of a string in a chunk of text.
From what I read, the Boyer–Moore string search algorithm is the standard for string searches but I am not sure if counting occurrences in an efficient way would be same as searching a string.
In Python this is what I want:
text_chunck = "one two three four one five six one"
occurance_count(text_chunck, "one") # gives 3.
EDIT: It seems like python str.count serves as such a method; however, I am not able to find what algorithm it uses.

For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.
The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).
If, on the other hand, you have one source string and a rapidly-changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can, in O(n) time per substring, check how many times a particular substring of length n appears in the string.
Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider looking into the Rabin-Karp algorithm, which uses a roling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, has no preprocessing time, and for normal text strings (lots of text with few matches) can find all matches very quickly.
Hope this helps!

Boyer-Moore would be a good choice for counting occurrences, since it has some overhead that you would only need to do once. It does better the longer the pattern string is, so for "one" it would not be a good choice.
If you want to count overlaps, start the next search one character after the previous match. If you want to ignore overlaps, start the next search the full pattern string length after the previous match.
If your language has an indexOf or strpos method for finding one string in another, you can use that. If it proves to slow, then choose a better algorithm.

Hellnar,
You can use a simple dictionary to count occurrences in a String. The algorithm is a counting algorithm, here is an example:
"""
The counting algorithm is used to count the occurences of a character
in a string. This allows you to compare anagrams and strings themselves.
ex. animal, lamina a=2,n=1,i=1,m=1
"""
def count_occurences(str):
occurences = {}
for char in str:
if char in occurences:
occurences[char] = occurences[char] + 1
else:
occurences[char] = 1
return occurences
def is_matched(s1,s2):
matched = True
s1_count_table = count_occurences(s1)
for char in s2:
if char in s1_count_table and s1_count_table[char]>0:
s1_count_table[char] -= 1
else:
matched = False
break
return matched
#counting.is_matched("animal","laminar")
This example just returns True or False if the strings match. Keep in mind, this algorithm counts the number of times a character shows up in a string, this is good for anagrams.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.