Caesar Cipher algorithm with strings and for loop Python - python

The assignment is to write a Caesar Cipher algorithm that receives 2 parameters, the first being a String parameter, the second telling how far to shift the alphabet. The first part is to set up a method and set up two strings, one normal and one shifted. I have done this. Then I need to make a loop to iterate through the original string to build a new string, by finding the original letters and selecting the appropriate new letter from the shifted string. I've spent at least two hours staring at this one, and talked to my teacher so I know I'm doing some things right. But as for what goes in the while loop, I really don't have a clue. Any hints or pushes in the right direction would be very helpful, so that I at least have somewhere to start. Thank you.
def cipher(x, dist):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    shifted = "xyzabcdefghijklmnopqrstuvw"
    stringspot = 0
    shiftspot = (x.find("a"))
    aspot = (x.find("a"))
    while stringspot < 26:
        aspot = shifted(dist)
        shifted =
        stringspot = stringspot + 1
    ans =
    return ans
print(cipher("abcdef", 1))
print(cipher("abcdef", 2))
print(cipher("abcdef", 3))
print(cipher("dogcatpig", 1))

Here are some pushes and hints:
You should validate your inputs. In particular, make sure that the shift distance is "reasonable," where reasonable means something you can handle. I recommend <=25.
If the maximum shift amount is 25, the letter 'a' plus 25 would get 'z'. The letter 'z' plus 25 will go past the end of the alphabet. But it wouldn't go past the end of TWO alphabets. So that's one way to handle wrap-around.
User @zondo, in his solution, handles upper-case letters. You didn't mention whether you want to handle them. You may want to clarify that with your teacher.
If you know about dictionaries, you might want to build one to make it easy to map the old letters to the new letters.
You need to realize that strings are sequences, just like tuples or lists, so you can index them. I don't see you doing that in your code.
You can get an "ASCII code" number for a letter using ord(). The numbers are arbitrary, but both upper and lower case numbers are packed together tightly in ranges of 26. This means you can do math with them. (For example, ord('a') is 97. Not super useful. But ord('b') - ord('a') is 1, which might be good to know.)
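To make the ord() hint concrete, here is a minimal sketch that shifts a single lowercase letter and wraps around with the modulo operator. The helper name shift_letter is made up for illustration, and the sketch assumes the input is one of 'a' to 'z'.

```python
def shift_letter(letter, dist):
    """Shift one lowercase letter by dist places, wrapping past 'z'.

    Assumes letter is a single character in 'a'..'z' (hypothetical helper).
    """
    offset = ord(letter) - ord('a')               # 0 for 'a', 25 for 'z'
    return chr(ord('a') + (offset + dist) % 26)   # modulo handles wrap-around

print(shift_letter('a', 25))  # z
print(shift_letter('z', 25))  # y
```

chr() is the inverse of ord(), so the math happens on numbers and only the final result is turned back into a letter.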

alphabet and shifted are supposed to be a mapping between the original stream and the ciphertext. The loop's job is to iterate over all letters in the stream and substitute them. More specifically, a letter in alphabet and its substitute in shifted reside at the same index, hence the mapping. In pseudocode:
ciphertext = empty
for each letter in x
    i = index of letter in alphabet
    new_letter = shifted[i]
    add new_letter to ciphertext
The whole loop can be simplified to a list comprehension, but this shouldn't be your primary concern.
For more direct mapping than doing as in the pseudocode above, look into dictionaries.
Another thing that stands out in your code is the generation of shifted, which should depend on the argument dist so it can't just be hardcoded. So, if dist is 5, the first letter in shifted should be whatever lies at the 0+5 in alphabet, and so on. Hint: modulo operator.
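For reference, one possible way to turn the pseudocode and the modulo hint into Python is sketched below. It keeps the asker's names cipher, alphabet and shifted, and it makes one assumption the question does not state: characters that are not lowercase letters are copied through unchanged.

```python
def cipher(x, dist):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    dist %= len(alphabet)                          # keep the shift inside 0..25
    # Build the shifted alphabet from dist instead of hardcoding it.
    shifted = alphabet[dist:] + alphabet[:dist]
    ciphertext = ""
    for letter in x:
        i = alphabet.find(letter)                  # -1 if not a lowercase letter
        # Assumption: anything outside the alphabet is left as it is.
        ciphertext += shifted[i] if i != -1 else letter
    return ciphertext

print(cipher("abcdef", 1))     # bcdefg
print(cipher("dogcatpig", 1))  # ephdbuqjh
```

With dist equal to 5 the first letter of shifted is alphabet[5], which matches the description above.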

Related

Guess a 3 letter string that contains an uppercase, lowercase and number (in any order)

I'm relatively new to Python and I'm making a program where we receive a 3-character string with one upper case letter, one lower case letter and a number in any given order. The program is then supposed to find it through a brute-force attack.
I've tried doing this by defining the uppercase letters, lowercase letters and numbers as strings, iterating through them with for loops, and trying to match the characters of the string we want to find against the uppercase letters, lowercase letters or numbers accordingly.
What I tried to do:
uppers="ABCDEFGHIJ"
lowers="abcdefghij"
numbers="1234567890"
secret="Je1" #The string the computer is supposed to find through a brute-force attack
password = ""
counter = 0
for upper in uppers:
    if upper in secret:
        password += upper
        break
    else:
        counter += 1
for lower in lowers:
    if lower in secret:
        password += lower
        break
    else:
        counter += 1
for number in numbers:
    if number in secret:
        password += number
        break
    else:
        counter += 1
print(password)
print("Counter: {0}".format(counter))
When I run the code it does work, but only when the secret string is in the order uppercase, lowercase, number ("Je1" works, "eJ1" doesn't). The program doesn't work properly unless I rearrange the for loops to match the order of the secret.
Any help is highly appreciated!
Your code very specifically always outputs uppercase, lowercase, and number in exactly this order, regardless of what the input was.
This is possible to fix in several ways, each with some tradeoffs.
The simplest is to really do brute force, i.e. examine all three sets at every position.
As an optimization, you could drop the category you found a match in from subsequent iterations, reducing the search space as you go. This is unlikely to be scalable to real-world problems, where the search space is more complex (you don't know if there will be one or more occurrences of characters in any one category, except if one category is not yet attested for near the end of the string).
Alternatively, you could remember at which position you found a character in a particular category, and reassemble the password in the right order at the end with this information. In some sense, this is the most elegant fix, but again, it suffers from the problem that it's not going to be very useful in a real-world password cracking program.
So in other words, the "full brute force" solution generalizes best, because it carries over to real-world problems, even though it is computationally the most expensive.
Going forward, think about how you could enumerate all possible passwords in the search space, so that each password candidate gets a predictable index, and just loop over that enumeration.
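As a rough sketch of that idea, the loop below enumerates every 3-character candidate over the combined character set and pairs each one with its index. Note that it uses itertools.product, which also generates candidates with repeated characters, so it is full brute force rather than something tailored to the one-of-each rule. The function name crack is made up for illustration.

```python
from itertools import product

uppers = "ABCDEFGHIJ"
lowers = "abcdefghij"
numbers = "1234567890"
chars = uppers + lowers + numbers

def crack(secret, length=3):
    """Return (index, candidate) for the first candidate equal to secret, else None."""
    # product() yields candidates in a fixed order, so every candidate
    # has a predictable index in the enumeration.
    for index, candidate in enumerate(product(chars, repeat=length)):
        guess = "".join(candidate)
        if guess == secret:
            return index, guess
    return None

print(crack("Je1"))  # (8540, 'Je1')
print(crack("eJ1"))  # (12890, 'eJ1')
```

The index plays the role of the counter in the question: it is the number of candidates tried before the match.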
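An easy way to accomplish this in Python is using itertools.permutations. This gives you all possible ordered arrangements of a given length drawn from a collection of items, without reusing any item. That is enough here, because the secret has one character from each category and so never repeats a character.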
In your case the "collection of items" are all the lower-case letters, upper-case letters and numbers. So to use permutations you need to put them together into one collection. For this you can just concatenate the strings together:
chars = uppers + lowers + numbers
Or simply define them as one string:
chars = "ABCDEFGHIJabcdefghij1234567890"
You can then run permutations(chars, 3) which gives you items of 3 characters in length as a tuple. One example would be ('a', 'C', '3'). You need to compare this with the password-string. You can either split the password string into a tuple (which needs to be done once), or join the permutation tuple into a string (which needs to be done for each item). In your case, I am assuming that you want to use the generated password for something, so let's join the tuple into a string, which gives us the following code:
from itertools import permutations
uppers="ABCDEFGHIJ"
lowers="abcdefghij"
numbers="1234567890"
secret="Je1"
for candidate in permutations(uppers + lowers + numbers, 3):
    if ''.join(candidate) == secret:
        print(candidate)
The following program is slightly inefficient but works for your purpose.
import re
def m(secret):
    if len(secret)==3 and re.search(r'[A-Z]', secret) and re.search(r'[a-z]', secret) and re.search(r'[0-9]', secret):
        print("Yes")
    else:
        print("No")
This can be simplified further with positive lookaheads, as described in the accepted answer to "Match regex in any order":
if re.search(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{3}$", secret):
    print("Yes")
else:
    print("No")
You're looping in the order upper, lower, number, so you can only match strings that are in that specific order. You need to loop in all possible permutations of upper, lower, number. That is, you need to repeat what you did for these permutations:
upper, number, lower
lower, upper, number
lower, number, upper
number, lower, upper
number, upper, lower
You can shorten your three loops this way:
for upper in uppers:
    for lower in lowers:
        for number in numbers:
            password = upper + lower + number
            if password == secret:
                print('Matched. Secret is', secret)
Repeat this five more times by changing the order of upper, lower, number.
Or, you can use the itertools package to avoid having to write the loops so many times.
from itertools import permutations
for upper in uppers:
    for lower in lowers:
        for number in numbers:
            perms = list(permutations([upper, lower, number]))
            if tuple(secret) in perms:
                print('Matched. Secret is', secret)

How to create a Boggle Board from a list of Words? (reverse Boggle solver!)

I am trying to solve the reverse Boggle problem. Simply put, given a list of words, come up with a 4x4 grid of letters in which as many words in the list can be found in sequences of adjacent letters (letters are adjacent both orthogonally and diagonally).
I DO NOT want to take a known board and solve it. That is an easy TRIE problem and has been discussed/solved to death here for people's CS projects.
Example word list:
margays, jaguars, cougars, tomcats, margay, jaguar, cougar, pumas, puma, toms
Solution:
ATJY
CTSA
OMGS
PUAR
This problem is HARD (for me). Algorithm I have so far:
For each word in the input, make a list of all possible ways it can legally appear on the board by itself.
Try all possible combinations of placing word #2 on those boards and keep the ones that have no conflicts.
Repeat till end of list.
...
Profit!!! (for those that read /.)
Obviously, there are implementation details. Start with the longest word first. Ignore words that are substrings of other words.
I can generate all 68k possible boards for a 7 character word in around 0.4 seconds. Then when I add an additional 7 character board, I need to compare 68k x 68k boards x 7 comparisons. Solve time becomes glacial.
There must be a better way to do this!!!!
Some code:
BOARD_SIDE_LENGTH = 4
class Board:
    def __init__(self):
        pass
    def setup(self, word, start_position):
        self.word = word
        self.indexSequence = [start_position,]
        self.letters_left_over = word[1:]
        self.overlay = []
        # set up template for overlay. When we compare boards, we will add to this if the board fits
        for i in range(BOARD_SIDE_LENGTH*BOARD_SIDE_LENGTH):
            self.overlay.append('')
        self.overlay[start_position] = word[0]
        self.overlay_count = 0
    @classmethod
    def copy(boardClass, board):
        newBoard = boardClass()
        newBoard.word = board.word
        newBoard.indexSequence = board.indexSequence[:]
        newBoard.letters_left_over = board.letters_left_over
        newBoard.overlay = board.overlay[:]
        newBoard.overlay_count = board.overlay_count
        return newBoard
    # need to check if otherboard will fit into existing board (allowed to use blank spaces!)
    # otherBoard will always be just a single word
    @classmethod
    def testOverlay(self, this_board, otherBoard):
        for pos in otherBoard.indexSequence:
            this_board_letter = this_board.overlay[pos]
            other_board_letter = otherBoard.overlay[pos]
            if this_board_letter == '' or other_board_letter == '':
                continue
            elif this_board_letter == other_board_letter:
                continue
            else:
                return False
        return True
    @classmethod
    def doOverlay(self, this_board, otherBoard):
        # otherBoard will always be just a single word
        for pos in otherBoard.indexSequence:
            this_board.overlay[pos] = otherBoard.overlay[pos]
        this_board.overlay_count = this_board.overlay_count + 1
    @classmethod
    def newFromBoard(boardClass, board, next_position):
        newBoard = boardClass()
        newBoard.indexSequence = board.indexSequence + [next_position]
        newBoard.word = board.word
        newBoard.overlay = board.overlay[:]
        newBoard.overlay[next_position] = board.letters_left_over[0]
        newBoard.letters_left_over = board.letters_left_over[1:]
        newBoard.overlay_count = board.overlay_count
        return newBoard
    def getValidCoordinates(self, board, position):
        # integer division, so this also works on Python 3
        row = position // BOARD_SIDE_LENGTH
        column = position % BOARD_SIDE_LENGTH
        for r in range(row - 1, row + 2):
            for c in range(column - 1, column + 2):
                if r >= 0 and r < BOARD_SIDE_LENGTH and c >= 0 and c < BOARD_SIDE_LENGTH:
                    if (r*BOARD_SIDE_LENGTH+c not in board.indexSequence):
                        yield r, c
class boardgen:
    def __init__(self):
        self.boards = []
    def createAll(self, board):
        # get the next letter
        if len(board.letters_left_over) == 0:
            self.boards.append(board)
            return
        next_letter = board.letters_left_over[0]
        last_position = board.indexSequence[-1]
        for row, column in board.getValidCoordinates(board, last_position):
            new_board = Board.newFromBoard(board, row*BOARD_SIDE_LENGTH+column)
            self.createAll(new_board)
And use it like this:
words = ['margays', 'jaguars', 'cougars', 'tomcats', 'margay', 'jaguar', 'cougar', 'pumas', 'puma']
words.sort(key=len)
first_word = words.pop()
# generate all boards for the first word
overlaid_boards = []
for i in range(BOARD_SIDE_LENGTH*BOARD_SIDE_LENGTH):
    test_board = Board()
    test_board.setup(first_word, i)
    generator = boardgen()
    generator.createAll(test_board)
    overlaid_boards += generator.boards
This is an interesting problem. I can't quite come up with a full, optimized solution, but here are some ideas you might try.
The hard part is the requirement to find the optimal subset if you can't fit all the words in. That's going to add a lot to the complexity. Start by eliminating word combinations that obviously aren't going to work. Cut any words with >16 letters. Count the number of unique letters needed. Be sure to take into account letters repeated in the same word. For example, if the list includes "eagle" I don't think you are allowed to use the same 'e' for both ends of the word. If your list of needed letters is >16, you have to drop some words. Figuring out which ones to cut first is an interesting sub-problem... I'd start with the words containing the least used letters. It might help to have all sub-lists sorted by score.
Then you can do the trivial cases where the total of word lengths is <16. After that, you start with the full list of words and see if there's a solution for that. If not, figure out which word(s) to drop and try again.
Given a word list then, the core algorithm is to find a grid (if one exists) that contains
all of those words.
The dumb brute-force way would be to iterate over all the grids possible with the letters you need, and test each one to see if all your words fit. It's pretty harsh though: the middle case is 16! = 2 x 10^13 boards. The exact formula for n unique letters is 16!/(16-n)! x n^(16-n), which gives a worst case in the range of 3 x 10^16. Not very manageable.
Even if you can avoid rotations and flips, that only shrinks the search space by a factor of 8 (four rotations, each with or without a flip).
A somewhat smarter greedy algorithm would be to sort the words by some criteria, like difficulty or length. A recursive solution would be to take the top word remaining on the list, and attempt to place it on the grid. Then recurse with that grid and the remaining word list. If you fill up the grid before you run out of words, then you have to back track and try another way of placing the word. A greedier approach would be to try placements that re-use the most letters first.
You can do some pruning. If at any point the number of spaces left in the grid is less than the remaining set of unique letters needed, then you can eliminate those sub-trees. There are a few other cases where it's obvious there's no solution that can be cut, especially when the remaining grid spaces are < the length of the last word.
The search space for this depends on word lengths and how many letters are re-used. I'm sure it's better than brute-force, but I don't know if it's enough to make the problem reasonable.
The smart way would be to use some form of dynamic programming. I can't quite see the complete algorithm for this. One idea is to have a tree or graph of the letters, connecting each letter to "adjacent" letters in the word list. Then you start with the most-connected letter and try to map the tree onto the grid. Always place the letter that completes the most of the word list. There'd have to be some way of handling the case of multiple of the same letter in the grid. And I'm not sure how to order it so you don't have to search every combination.
The best thing would be to have a dynamic algorithm that also included all the sub word lists. So if the list had "fog" and "fox", and fox doesn't fit but fog does, it would be able to handle that without having to run the whole thing on both versions of the list. That's adding complexity because now you have to rank each solution by score as you go. But in the cases where all the words won't fit it would save a lot of time.
Good luck on this.
There are a couple of general ideas for speeding up backtrack search you could try:
1) Early checks. It usually helps to discard partial solutions that can never work as early as possible, even at the cost of more work. Consider all two-character sequences produced by chopping up the words you are trying to fit in - e.g. PUMAS contributes PU, UM, MA, and AS. These must all be present in the final answer. If a partial solution does not have enough overlapped two-character spaces free to contain all of the overlapped two-character sequences it does not yet have, then it cannot be extended to a final answer.
2) Symmetries. I think this is probably most useful if you are trying to prove that there is no solution. Given one way of filling in a board, you can rotate and reflect that solution to find other ways of filling in a board. If you have 68K starting points and one starting point is a rotation or reflection of another starting point, you don't need to try both, because if you can (or could) solve the problem from one starting point you can get the answer from the other starting point by rotating or reflecting the board. So you might be able to divide the number of starting points you need to try by some integer.
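Both ideas are cheap to prototype. The sketch below assumes a board is stored as a 16-character row-major string, which is not how the Board class in the question stores it, and all function names are made up for illustration. bigrams collects the two-character sequences from point 1, and canonical picks one representative out of the 8 rotations and reflections from point 2, so starting points that share a canonical form only need to be tried once.

```python
N = 4  # board side length

def bigrams(words):
    """All two-letter sequences that must sit on adjacent cells somewhere."""
    return {w[i:i + 2] for w in words for i in range(len(w) - 1)}

def rotate(board):
    """Rotate a row-major board string 90 degrees clockwise."""
    return "".join(board[(N - 1 - c) * N + r] for r in range(N) for c in range(N))

def mirror(board):
    """Flip the board left to right."""
    return "".join(board[r * N + (N - 1 - c)] for r in range(N) for c in range(N))

def symmetries(board):
    """The 8 rotations and reflections of a board (may repeat for symmetric boards)."""
    result = []
    for start in (board, mirror(board)):
        current = start
        for _ in range(4):
            result.append(current)
            current = rotate(current)
    return result

def canonical(board):
    """One representative per symmetry class: the lexicographically smallest."""
    return min(symmetries(board))

board = "ATJYCTSAOMGSPUAR"  # the solution board from the question
print(sorted(bigrams(["pumas"])))
print(canonical(board) == canonical(rotate(board)))  # True
```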
This problem is not the only one to have a large number of alternatives at each stage. This also affects the traveling salesman problem. If you can accept not having a guarantee that you will find the absolute best answer, you could try not following up the least promising of these 68k choices. You need some sort of score to decide which to keep - you might wish to keep those which use as many letters already in place as possible. Some programs for the traveling salesman problems discard unpromising links between nodes very early. A more general approach which discards partial solutions rather than doing a full depth first search or branch and bound is Limited Discrepancy Search - see for example http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2426.
Of course some approaches to the TSP discard tree search completely in favor of some sort of hill-climbing approach. You might start off with a filled boggle square and repeatedly attempt to find your words in it, modifying a few characters in order to force them in, trying to find steps which successively increase the number of words that can be found in the square. The easiest form of hill-climbing is repeated simple hill-climbing from multiple random starts. Another approach is to restart the hill-climbing by randomizing only a portion of the solution so far - since you don't know the best size of portion to randomize you might decide to choose the size of portion to randomize at random, so that at least some fraction of the time you are randomizing the correct size of region to produce a new square to start from. Genetic algorithms and simulated annealing are very popular here. A paper on a new idea, Late Acceptance Hill-Climbing, also describes some of its competitors - http://www.cs.nott.ac.uk/~yxb/LAHC/LAHC-TR.pdf
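Any hill-climbing variant needs a score for a candidate square. A minimal sketch of one, counting how many words from the list can be traced on a board, is shown below. It assumes the board is a 16-character row-major string in the same case as the words, and the names neighbours, contains and score are made up for illustration. The search loop itself (how to mutate the square, when to restart) is left out.

```python
N = 4  # board side length

def neighbours(pos):
    """Indices of the up-to-8 cells adjacent to pos on an N x N board."""
    row, col = divmod(pos, N)
    for r in range(max(0, row - 1), min(N, row + 2)):
        for c in range(max(0, col - 1), min(N, col + 2)):
            if (r, c) != (row, col):
                yield r * N + c

def contains(board, word):
    """True if word can be traced through adjacent cells without reusing a cell."""
    def extend(pos, k, used):
        if board[pos] != word[k]:
            return False
        if k == len(word) - 1:
            return True
        used = used | {pos}   # cells already on the current path
        return any(extend(n, k + 1, used) for n in neighbours(pos) if n not in used)
    return any(extend(start, 0, frozenset()) for start in range(N * N))

def score(board, words):
    """Number of words from the list that can be found on the board."""
    return sum(contains(board, w) for w in words)

words = ["margays", "jaguars", "cougars", "tomcats", "margay",
         "jaguar", "cougar", "pumas", "puma", "toms"]
print(score("atjyctsaomgspuar", words))  # 10
```

A hill-climbing step would then change one or a few cells and keep the change only if the score does not drop.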

Python text encryption: rot13

I am currently doing an assignment that encrypts text by using rot13, but some of my text won't register.
# cgi is to escape html
# import cgi
def rot13(s):
    #string encrypted
    scrypt=''
    alph='abcdefghijklmonpqrstuvwxyz'
    for c in s:
        # check if char is in alphabet
        if c.lower() in alph:
            #find c in alph and return its place
            i = alph.find(c.lower())
            #encrypt char = c incremented by 13
            ccrypt = alph[ i+13 : i+14 ]
            #add encrypted char to string
            if c==c.lower():
                scrypt+=ccrypt
            if c==c.upper():
                scrypt+=ccrypt.upper()
        #dont encrypt special chars or spaces
        else:
            scrypt+=c
    return scrypt
    # return cgi.escape(scrypt, quote = True)
given_string = 'Rot13 Test'
print(rot13(given_string))
OUTPUT:
13 r
[Finished in 0.0s]
Hmmm, seems like a bunch of things are not working.
Main problem should be in ccrypt = alph[ i+13 : i+14 ]: you're missing a % len(alph), so if, for example, i is equal to 18, the slice falls outside the string.
In your output, in fact, only e is encoded to r because it's the only letter in your test string which, moved by 13, stays within bounds.
The rest of this answer is just tips to clean the code a little bit:
instead of alph='abc.. you can import string at the beginning of the script and use string.ascii_lowercase (string.lowercase in Python 2)
instead of string slicing, for just one character it's better to index with string[i], which gets the work done
instead of c == c.upper(), you can use the string method c.isupper()
The trouble you're having is with your slice. It will be empty if your character is in the second half of the alphabet, because i+13 will be off the end. There are a few ways you could fix it.
The simplest might be to simply double your alphabet string (literally: alph = alph * 2). This means you can access indexes up to 51, rather than just up to 25. This is a pretty crude solution though, and it would be better to just fix the indexing.
A better option would be to subtract 13 from your index, rather than adding 13. Rot13 is symmetric, so both will have the same effect, and it will work because negative indexes are legal in Python (they refer to positions counted backwards from the end).
In either case, it's not actually necessary to do a slice at all. You can simply grab a single value (unlike C, there's no char type in Python, so single characters are strings too). If you were to make only this change, it would probably make it clear why your current code is failing, as trying to access a single value off the end of a string will raise an exception.
Edit: Actually, after thinking about what solution is really best, I'm inclined to suggest avoiding index-math based solutions entirely. A better approach is to use Python's fantastic dictionaries to do your mapping from original characters to encrypted ones. You can build and use a Rot13 dictionary like this:
alph="abcdefghijklmnopqrstuvwxyz"
rot13_table = dict(zip(alph, alph[13:]+alph[:13])) # lowercase character mappings
rot13_table.update((c.upper(),rot13_table[c].upper()) for c in alph) # uppercase
def rot13(s):
    return "".join(rot13_table.get(c, c) for c in s) # non-letters are left unchanged
First thing that may have caused you some problems - your string list has the n and the o switched, so you'll want to adjust that :) As for the algorithm, when you run:
ccrypt = alph[ i+13 : i+14 ]
Think of what happens when you get 25 back for z. You are now looking up the slice alph[38:39], which is far past the bounds of the 26-character string, so it returns '' (side note: for a position that is in range you can index a single character directly, e.g. alph[12], instead of slicing):
In [1]: s = 'abcde'
In [2]: s[2]
Out[2]: 'c'
In [3]: s[2:3]
Out[3]: 'c'
In [4]: s[49:50]
Out[4]: ''
As for how to fix it, there are a number of interesting methods. Your code functions just fine with a few modifications. One thing you could do is create a mapping of characters that are already 'rotated' 13 positions:
alph = 'abcdefghijklmnopqrstuvwxyz'
coded = 'nopqrstuvwxyzabcdefghijklm'
All we did here is split the original list into halves of 13 and then swap them - we now know that if we take a letter like a and get its position (0), the same position in the coded list will be the rot13 value. As this is for an assignment I won't spell out how to do it, but see if that gets you on the right track (and @Makoto's suggestion is a perfect way to check your results).
This line
ccrypt = alph[ i+13 : i+14 ]
does not do what you think it does - it returns a string slice from i+13 to i+14, but if these indices are greater than the length of the string, the slice will be empty:
"abc"[5:6] #returns ''
This means your solution turns everything from n onward into an empty string, which produces your observed output.
The correct way of implementing this would be (1.) using a modulo operation to constrain the index to a valid number and (2.) using simple character access instead of string slices, which is easier to read, faster, and throws an IndexError for invalid indices, meaning your error would have been obvious.
ccrypt = alph[(i+13) % 26]
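Put together with the rest of the question's function, the fix could look roughly like the sketch below. It keeps the asker's names (alph, ccrypt), corrects the swapped n and o in the alphabet string, and is only one of several reasonable ways to write it.

```python
def rot13(s):
    alph = 'abcdefghijklmnopqrstuvwxyz'
    out = []
    for c in s:
        i = alph.find(c.lower())
        if i == -1:
            out.append(c)                  # not a letter: leave unchanged
            continue
        ccrypt = alph[(i + 13) % 26]       # modulo keeps the index in range
        out.append(ccrypt.upper() if c.isupper() else ccrypt)
    return ''.join(out)

print(rot13('Rot13 Test'))         # Ebg13 Grfg
print(rot13(rot13('Rot13 Test')))  # Rot13 Test
```

Applying the function twice gives back the original text, which is a quick sanity check for any rot13 implementation.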
If you're doing this as an exercise for a course in Python, ignore this, but just saying...
>>> import codecs
>>> codecs.encode('Some text', 'rot13')
'Fbzr grkg'
>>>

multilevel caesar cipher

Hey, I'm trying to decode a multilevel Caesar cipher. By that I mean a string of letters could have been shifted several times, so if I say apply_shifts[(2,3),(4,5)], that means I shift everything from the 2nd letter by 3 followed by everything from the 4th letter by 5. Here's my code so far.
def find_best_shifts_rec(wordlist, text, start):
    """
    Given a scrambled string and a starting position from which
    to decode, returns a shift key that will decode the text to
    words in wordlist, or None if there is no such key.
    Hint: You will find this function much easier to implement
    if you use recursion.
    wordlist: list of words
    text: scrambled text to try to find the words for
    start: where to start looking at shifts
    returns: list of tuples. each tuple is (position in text, amount of shift)
    """
    for shift in range(27):
        text=apply_shifts(text, [(start,-shift)])
        #first word is text.split()[0]
        #test if first word is valid. if not, go to next shift
        if is_word(wordlist,text.split()[0])==False:
            continue
        #enter the while loop if word is valid, otherwise never enter and go to the next shift
        i=0
        next_index=0
        shifts={}
        while is_word(wordlist,text.split()[i])==True:
            next_index+= len(text.split()[i])
            i=i+1
            #once a word isn't valid, then try again, starting from the new index.
            if is_word(wordlist,text.split()[i])==False:
                shifts[next_index]=i
                find_best_shifts_rec(wordlist, text, next_index)
    return shifts
My problems are
1) my code isn't running properly and I don't understand why it is messing up (it's not entering my while loop)
and
2) I don't know how to test whether none of my "final shifts" (e.g. the last part of my string) are valid words and I also don't know how to go from there to the very beginning of my loop again.
Help would be much appreciated.
I think the problem is that you always work on the whole text, but apply the (new) shifting at some start inside of the text. So your check is_word(wordlist,text.split()[0]) will always check the first word, which is - of course - a word after your first shift.
What you need to do instead is to get the first word after your new starting point, so check the actually unhandled parts of the text.
edit
Another problem I noticed is the way you are trying out to find the correct shift:
for shift in range(27):
    text=apply_shifts(text, [(start,-shift)])
So you basically want to try all shifts from 0 to 26 until the first word is accepted. It is okay to do it like that, but note that after the first tried shifting, the text has changed. As such you are not shifting it by 1, 2, 3, ... but by 1, 3, 6, 10, ... which is of course not what you want, and you will of course miss some shifts while doing some identical ones multiple times.
So you need to temporarily shift your text and check the status of that temporary text, before you continue to work with the text. Or alternatively, you always shift by 1 instead.
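The difference is easy to see in a small experiment. The sketch below uses a made-up helper shift over lowercase letters only (the real apply_shifts from the assignment is not shown in the question) and compares reassigning the text on every iteration with shifting the original each time.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase letters only

def shift(text, amount):
    """Hypothetical stand-in for apply_shifts: shift every letter by amount."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) + amount) % 26] if c in ALPHABET else c
        for c in text
    )

text = "jgnnq"  # "hello" shifted by 2

# Buggy: reassigning the text makes the shifts accumulate (0, 1, 3, 6, ...),
# so a total shift of 2 is never tried.
t = text
cumulative = []
for s in range(4):
    t = shift(t, -s)
    cumulative.append(t)

# Fixed: always shift the original text and keep the result in a temporary.
independent = [shift(text, -s) for s in range(4)]

print(cumulative)   # ['jgnnq', 'ifmmp', 'gdkkn', 'dahhk']
print(independent)  # ['jgnnq', 'ifmmp', 'hello', 'gdkkn']
```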
edit²
And another problem I noticed is with the way you are trying to use recursion to get your final result. Usually recursion (with a result) works the way that you keep calling the function itself and pass the return values along, or collect the results. In your case, as you want to have multiple values, and not just a single value from somewhere inside, you need to collect each of the shifting results.
But right now, you are throwing away the return values of the recursive calls and just return the last value. So store all the values and make sure you don't lose them.
Pseudo-code for recursive function:
coded_text = text from start-index to end of string
if length of coded_text is 0, return "valid solution (no shifts)"
for shift in possible_shifts:
    decoded_text = apply shift of (-shift) to coded_text
    first_word = split decoded_text and take first piece
    if first_word is a valid word:
        rest_of_solution = recurse on (text preceding start-index)+decoded_text, starting at start+(length of first_word)+1
        if rest_of_solution is a valid solution
            if shift is 0
                return rest_of_solution
            else
                return (start, -shift mod alphabet_size) + rest_of_solution
# no valid solution found
return "not a valid solution"
Note that this is guaranteed to give an answer composed of valid words - not necessarily the original string. One specific example: 'a add hat' can be decoded in place of 'a look at'.
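A runnable sketch of that pseudocode follows. Several things in it are assumptions rather than facts from the question: the alphabet is taken to be the 26 lowercase letters plus the space (suggested by the range(27) in the question), apply_shift and apply_shifts are simple stand-ins for the assignment's helpers, and the word list is a plain set instead of is_word.

```python
import string

ALPHABET = string.ascii_lowercase + " "  # assumption: 27 symbols, space included
SIZE = len(ALPHABET)

def apply_shift(text, start, shift):
    """Shift every alphabet symbol from index start onward by shift places."""
    tail = "".join(
        ALPHABET[(ALPHABET.index(c) + shift) % SIZE] if c in ALPHABET else c
        for c in text[start:]
    )
    return text[:start] + tail

def apply_shifts(text, shifts):
    """Apply a list of (start, shift) pairs one after another."""
    for start, shift in shifts:
        text = apply_shift(text, start, shift)
    return text

def find_shifts(words, text, start=0):
    """Return a list of (position, shift) pairs that decodes text, or None."""
    if start >= len(text):
        return []                                   # nothing left: valid, no shifts
    for shift in range(SIZE):
        decoded = apply_shift(text, start, -shift)  # temporary, text is untouched
        first_word = decoded[start:].split(" ")[0]
        if first_word in words:
            rest = find_shifts(words, decoded, start + len(first_word) + 1)
            if rest is not None:                    # keep the recursive result
                if shift == 0:
                    return rest
                return [(start, -shift % SIZE)] + rest
    return None                                     # no valid solution found

words = {"hello", "world"}
cipher = apply_shifts("hello world", [(0, 3), (6, 5)])
key = find_shifts(words, cipher)
print(key)                        # [(0, 24), (6, 22)]
print(apply_shifts(cipher, key))  # hello world
```

The returned shifts are the decoding shifts, so applying them to the ciphertext gives the decoded text directly.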

String Occurrence Counting Algorithm

I am curious what is the most efficient algorithm (or commonly used) to count the number of occurrences of a string in a chunk of text.
From what I read, the Boyer–Moore string search algorithm is the standard for string searches but I am not sure if counting occurrences in an efficient way would be same as searching a string.
In Python this is what I want:
text_chunck = "one two three four one five six one"
occurance_count(text_chunck, "one") # gives 3.
EDIT: It seems like python str.count serves as such a method; however, I am not able to find what algorithm it uses.
For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.
The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).
If, on the other hand, you have one source string and a rapidly-changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can, in O(n) time per substring, check how many times a particular substring of length n appears in the string.
Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider looking into the Rabin-Karp algorithm, which uses a rolling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, has no preprocessing time, and for normal text strings (lots of text with few matches) can find all matches very quickly.
Hope this helps!
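To give an idea of what the Rabin-Karp option looks like, here is a minimal sketch that counts occurrences, overlaps included. The function name and the base and mod values are arbitrary illustrative choices, not tuned recommendations.

```python
def rabin_karp_count(text, pattern, base=256, mod=1000000007):
    """Count (possibly overlapping) occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, mod)   # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        # Compare the actual strings on a hash hit to rule out collisions.
        if w_hash == p_hash and text[i:i + m] == pattern:
            count += 1
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return count

print(rabin_karp_count("one two three four one five six one", "one"))  # 3
```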
Boyer-Moore would be a good choice for counting occurrences, since it has some overhead that you would only need to do once. It does better the longer the pattern string is, so for "one" it would not be a good choice.
If you want to count overlaps, start the next search one character after the previous match. If you want to ignore overlaps, start the next search the full pattern string length after the previous match.
If your language has an indexOf or strpos method for finding one string in another, you can use that. If it proves too slow, then choose a better algorithm.
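In Python the equivalent of indexOf is str.find, so the overlap rule above can be sketched as follows. The function name count_occurrences and its overlapping flag are made up for illustration.

```python
def count_occurrences(text, pattern, overlapping=True):
    """Count occurrences of pattern in text using repeated str.find calls."""
    if not pattern:
        return 0
    # Overlaps allowed: resume one character after the previous match.
    # Overlaps ignored: resume a full pattern length after it.
    step = 1 if overlapping else len(pattern)
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + step)
    return count

print(count_occurrences("one two three four one five six one", "one"))  # 3
print(count_occurrences("aaaa", "aa"))                                  # 3
print(count_occurrences("aaaa", "aa", overlapping=False))               # 2
```

The non-overlapping count is what the built-in str.count returns.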
Hellnar,
You can use a simple dictionary to count occurrences in a String. The algorithm is a counting algorithm, here is an example:
"""
The counting algorithm is used to count the occurences of a character
in a string. This allows you to compare anagrams and strings themselves.
ex. animal, lamina a=2,n=1,i=1,m=1
"""
def count_occurences(text):
    occurences = {}
    for char in text:
        if char in occurences:
            occurences[char] = occurences[char] + 1
        else:
            occurences[char] = 1
    return occurences
def is_matched(s1,s2):
    matched = True
    s1_count_table = count_occurences(s1)
    for char in s2:
        if char in s1_count_table and s1_count_table[char]>0:
            s1_count_table[char] -= 1
        else:
            matched = False
            break
    return matched
#counting.is_matched("animal","laminar")
This example just returns True or False depending on whether the strings match. Keep in mind that this algorithm counts the number of times each character shows up in a string, which is good for anagrams.
