Why is one way more efficient than another? - python

I was trying out hackerrank, and I came across a problem which I tried to solve using python3.
The problem was
"A kidnapper wrote a ransom note but is worried it will be traced back to him. He found a magazine and wants to know if he can cut out whole words from it and use them to create an untraceable replica of his ransom note. The words in his note are case-sensitive and he must use whole words available in the magazine, meaning he cannot use substrings or concatenation to create the words he needs.
Given the words in the magazine and the words in the ransom note, print Yes if he can replicate his ransom note exactly using whole words from the magazine; otherwise, print No."
I tried using the following approach,
def ransom_note(magazine, ransom):
    # comparing based on the number of times each word occurs in the list
    for word in set(ransom):
        if ransom.count(word) > magazine.count(word):
            return False
    return True
This did work; I got 18 out of 20 test cases right.
But the other two cases were timing out, so I had to find a more cost-effective way of doing this.
I tried storing the words in a dictionary, using each word as the key and its count as the value. I still failed those two cases; when I looked into them, each input had 30,000 words and the expected output was "Yes".
I checked the discussion page and found a piece of code that got me through.
from collections import Counter

def ransom_note(magazine, ransom):
    return not (Counter(ransom) - Counter(magazine))
Can someone explain why this was more efficient than my method?
Thanks in advance :)

As I understand it, in your second attempt at the problem, both ransom and magazine were dictionaries, so theoretically your code was as fast as it could be. (Your first attempt, by contrast, calls list.count inside a loop; each call scans the entire list, so the total work grows with the number of distinct words times the length of the lists.)
The Python Counter collection is designed specifically to work with simple integer counts, and is optimized to perform common operations very quickly. It turns out that checking whether there are enough things in one list to satisfy the requests from another list is a really common operation, so time was spent optimizing Counter to do exactly that: the subtraction above keeps only positive counts, so an empty result means the magazine covers every word in the note.
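For reference, here is a minimal sketch of what the Counter approach amounts to with plain dicts: one pass over each list instead of one full list scan per distinct word (the helper name ransom_note_manual is just for illustration).

def ransom_note_manual(magazine, ransom):
    # Count magazine words in a single pass.
    counts = {}
    for word in magazine:
        counts[word] = counts.get(word, 0) + 1
    # Spend counts while walking the note; going negative means a word is missing.
    for word in ransom:
        counts[word] = counts.get(word, 0) - 1
        if counts[word] < 0:
            return False
    return True

print(ransom_note_manual(["give", "me", "one", "grand", "today"],
                         ["give", "one", "grand", "today"]))  # True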

Related

python word decomposition into subwords: e.g. motorbike -> motor, bike

I have a list of words like [bike, motorbike, copyright].
Now I want to check whether a word consists of subwords which are also standalone words. That means the output of my algorithm should be something like: [bike, motor, motorbike, copy, right, copyright].
I already know how to check if a word is an English word:
import enchant

english_words = []
arr = ["bike", "motorbike", "copyright", "apfel"]
d_brit = enchant.Dict("en_GB")
for word in arr:
    if d_brit.check(word):
        english_words.append(word)
I also found an algorithm which decomposes the word in all possible ways: Splitting a word into all possible 'subwords' - All possible combinations
Unfortunately, splitting the word like this and then checking whether each part is an English word simply takes too long, because my dataset is far too large.
Can anyone help?
The nested for loops used in the code are extremely slow in Python. As performance seems to be the main issue, I would recommend looking for available Python packages to do parts of the job, building your own extension module (e.g. using Cython), or not using Python at all.
Some alternatives to splitting the word every possible way:
Search for dictionary words that match the first characters of str; if a found word is a prefix of str, check whether the rest is also a word (or decomposable) in the dataset. A sketch of this idea follows below.
Split str into two portions at points that make sense given the length distribution of the dataset, i.e. the most common word lengths, then check for matches with basic comparison. (Just a wild idea.)
These are a few quick ideas for faster algorithms I can think of. But if these are not quick enough, then BernieD is right.
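A minimal sketch of the prefix idea, assuming the word list fits in a Python set (so membership tests are O(1)); the enchant check from the question could stand in for the set lookup, and the memo dict avoids re-splitting the same suffix twice:

def decomposable(word, lexicon, memo=None):
    # True if word splits into two or more words from the lexicon.
    if memo is None:
        memo = {}
    if word in memo:
        return memo[word]
    for i in range(1, len(word)):
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon and (rest in lexicon or decomposable(rest, lexicon, memo)):
            memo[word] = True
            return True
    memo[word] = False
    return False

lexicon = {"bike", "motor", "copy", "right", "motorbike", "copyright"}
print(sorted(w for w in lexicon if decomposable(w, lexicon)))
# ['copyright', 'motorbike']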

Scoring word similarity between arbitrary text

I have a list of over 500 very important, but arbitrary, strings. They look like:
list_important_codes = ['xido9','uaid3','frps09','ggix21']
What I know
*Casing is not important, but all other characters must match exactly.
*Every string starts with 4 alphabetical characters and ends with one or two numerical characters.
*I have a list of about 100,000 strings, list_recorded_codes, that were hand-typed and should match list_important_codes exactly, but about 10,000 of them don't. Because these strings were typed manually, the incorrect strings are usually only about 1 character off (errors such as: an added space, two letters switched around, "01" instead of "1", etc.).
What I need to do
I need to iterate through list_recorded_codes and find all of their perfect matches within list_important_codes.
What I tried
I spent about 10 hours trying to manually program a way to fix each word, but it proved impractical and incredibly tedious. Not to mention, when my list doubles in size at a later date, I would have to go through that manual process all over again.
The solution I think I need, and the expected output
I'm hoping that Python's NLTK can efficiently 'score' these arbitrary terms to find a 'best score'. For example, if the word in question is inputword = "gdix88", and that word gets compared as score(inputword, "gdox89") = .84 and score(inputword, "sudh88") = .21, then my expected output is highscore = .84, highscoreword = 'gdox89'.
for manually_entered_text in ['xido9', 'uaid3', 'frp09', 'ggix21']:
    get_highest_score_from_important_words()  # returns word_with_highest_score
    manually_entered_text = word_with_highest_score
I am also willing to use a different set of tools to fix this issue if needed. But also, the simpler the better! Thank you!
The 'score' you are looking for is called an edit distance. There is a lot of literature and there are many algorithms available - easy to find, but only after you know the proper term :)
See the corresponding wikipedia article.
The nltk package provides an implementation of the so-called Levenshtein edit-distance:
from nltk.metrics.distance import edit_distance

if __name__ == '__main__':
    print(edit_distance("xido9", "xido9 "))
    print(edit_distance("xido9", "xido8"))
    print(edit_distance("xido9", "xido9xxx"))
    print(edit_distance("xido9", "xido9"))
The results are 1, 1, 3 and 0 in this case.
Here is the documentation of the corresponding nltk module
There are more specialized versions of this score that take into account how frequent various typing errors are (for example, 'e' instead of 'r' might occur quite often because the keys are next to each other on a QWERTY keyboard).
But classic Levenshtein is where I would start.
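As a minimal sketch of the lookup loop the question describes - assuming lower-casing both sides handles the "casing is not important" rule, and with best_match as a hypothetical helper name - you could pick the candidate with the smallest distance instead of a normalized 'score':

from nltk.metrics.distance import edit_distance

def best_match(code, important_codes):
    # Pick the important code with the smallest edit distance to the typed code.
    return min(important_codes, key=lambda c: edit_distance(code.lower(), c.lower()))

list_important_codes = ['xido9', 'uaid3', 'frps09', 'ggix21']
for typed in ['xido9', 'uaid3', 'frp09', 'ggix21']:
    print(typed, '->', best_match(typed, list_important_codes))  # e.g. frp09 -> frps09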
You could apply a dynamic programming approach to this problem. Once you have your scoring matrix, your alignment_matrix, and your local and global alignment functions set up, you can iterate through the list_important_codes and find the highest-scoring alignment in the list_recorded_codes. Here is a project I did for DNA sequence alignment: DNA alignment. You can easily adapt it to your problem.

Identify Visually Similar Strings in Python

I am working on a python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance

try:
    from string import maketrans  # Python 2
except ImportError:
    maketrans = str.maketrans  # Python 3

t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a mapping between the visually similar characters and what they can represent, i.e.

subs = {'3': 'e', '1': 'l', '0': 'o'}  # etc.... (named subs rather than dict, to avoid shadowing the built-in)

and then you can use this to test against your database of forbidden words.
e.g. for the input he11, check whether each character has an entry in subs:

subs.get('h')  # no entry, keep 'h'
subs.get('e')  # no entry, keep 'e'
subs.get('1')  # 'l'
subs.get('1')  # 'l'

Put this together to form a word ("hell") and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
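A minimal sketch of that lookup, with normalize as a hypothetical helper name:

subs = {'3': 'e', '1': 'l', '0': 'o'}  # visually similar characters

def normalize(word):
    # Replace each look-alike character with its letter equivalent.
    return ''.join(subs.get(ch, ch) for ch in word.lower())

forbidden = {'hell'}
print(normalize('He11') in forbidden)  # True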
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' if you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with monetary symbols (Euro, Yen, and Pound are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
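A minimal sketch of that first step, with could_be as a hypothetical helper: translate the candidate into a wildcard pattern and match it against each taboo word.

import re

def could_be(candidate, taboo_word):
    # Keep alphabetic characters; every other character becomes a one-character wildcard.
    pattern = ''.join(ch if ch.isalpha() else '.' for ch in candidate.lower())
    return re.fullmatch(pattern, taboo_word) is not None

print(could_be('h311o', 'hello'))  # True: the pattern is 'h...o'
print(could_be('h311o', 'world'))  # False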
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes S (German ß), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
While this still means that all the words will be hard-coded, it will allow me to update the list more easily.

Implement hashing function with collision

For a demo project, I want to create a hashing function with a very high probability of collision. Something simple is fine since the aim of the project is NOT security - but to demonstrate hash collisions.
Can anyone help me get started with an algorithm, or a sample implementation, or just point me in the right direction?
I am doing this in Python, though maybe that should not matter.
You could use the sum of the characters in a string. It's the first hash function I was taught back when I was first learning BASIC in high school, and I ran into the collision problem right away and had to figure out how to deal with it.
sum(ord(c) for c in text)
Collisions are easily achieved by transposing characters or even whole words, since the sum ignores order. For more fun you could also make it case-insensitive:
sum(ord(c) for c in text.lower())
I'll even give you a sample collision for that last one: Jerry Kindall -> Dillan Kyrjer :-)
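To verify that collision, a quick check (the two names are anagrams of each other, so the case-insensitive sums match):

def char_sum_hash(text):
    # Order-insensitive: anagrams always collide.
    return sum(ord(c) for c in text.lower())

print(char_sum_hash("Jerry Kindall") == char_sum_hash("Dillan Kyrjer"))  # True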
One algorithm that comes to mind is hashing using the first letter of the string.
Something like
hash[ord(text[0]) - ord('a')] = text
So anything starting with the same letter will be hashed together. As you can see, that's a lot of collisions.
Another idea is to hash according to the length of the string.
hash[len(text)] = text
You can use what hayden suggests in a comment above, and cause further collisions by taking the length modulo some number, e.g.
hash[len(text) % 5] = text
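And a quick way to see how crowded those buckets get - a sketch that chains colliding entries in lists (the usual remedy for collisions):

table = {}
for text in ["cat", "dog", "mouse", "horse", "of"]:
    table.setdefault(len(text) % 5, []).append(text)
print(table)  # {3: ['cat', 'dog'], 0: ['mouse', 'horse'], 2: ['of']}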

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times) and went letter by letter, matching whole words. It works well, but there are numerous ambiguities ("asshit" can be "ass hit" or "as shit"). Resolving those ambiguities would require much more sophisticated grammar analysis.
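A minimal sketch of that scan, assuming a toy set stands in for the real word list (greedy_tokens is a hypothetical name). Note how the shortest-match rule swallows "we" out of "we'll" - exactly the kind of ambiguity mentioned above:

LEXICON = {"like", "we", "said", "we'll", "do", "what", "can"}  # toy word list

def greedy_tokens(text):
    # Scan letter by letter, emitting a word as soon as the buffer is in the lexicon.
    words, buf = [], ''
    for ch in text.lower():
        if not (ch.isalpha() or ch == "'"):
            continue  # skip punctuation and spaces
        buf += ch
        if buf in LEXICON:
            words.append(buf)
            buf = ''
    return words

print(greedy_tokens("likewesaid, we'lldowhatwecan."))
# ['like', 'we', 'said', 'we'] - the leftover "'lldowhatwecan" never resynchronizes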
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other - getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave... It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.
textstring = ("\"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant,\" "
              "said the Sheep Man. \"Butwecan'tdoit-alone. Yougottaworktoo.\"")

english_dictionary = set()  # fill from a word list, or swap in an API call that returns True/False

indiv_characters = list(textstring)  # splits the string into individual characters
teststring = ''
sequential_indiv_word_list = []
for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against the English dictionary
    if teststring in english_dictionary:
        sequential_indiv_word_list.append(teststring)
        teststring = ''
# at the end, assemble a sentence from the pieces of sequential_indiv_word_list
# by putting a space between each word
There are some more issues to be worked out: if it never returns a match, this approach breaks down, since it would just keep adding more characters forever. However, since your demo string had some spaces, you could have it recognize those too and automatically start over at each of them.
Also, you need to account for punctuation, writing conditionals like

if cur_char == ',' or cur_char == '.':
    # do action to start a new "word" automatically
