The title is a mess, so bear with me while I explain my question in more detail (or really, it's a set of semi-related questions). I'm compiling a list of certain words from a large text file and storing them in a dictionary as keys, with their respective occurrence counts (integers) as the values. I want to apply several processes to consolidate the dictionary so that 'related' words get lumped together.
First operation is plurals. I see no reason to have a 'cat' and a 'cats' key in the dictionary. Same with car vs. cars and book vs. books and so on. I want to write a function that (upon seeing a new word not currently in the dictionary) checks to see if the new word is a plural form of any key currently in the dict (and vice versa).
if new_word ends with 's' -> check the dict for a key matching new_word[:-1]
else -> check the dict for new_word + 's'
Is there a better way to approach this problem? (I would obviously have to handle edge cases for plurals...this is very general at this point)
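A minimal sketch of that check, assuming the counts live in a plain dict called counts (the name is mine, purely for illustration):

def add_word(word, counts):
    # naive plural folding: only handles the regular '-s' plural
    if word in counts:
        counts[word] += 1
    elif word.endswith('s') and word[:-1] in counts:
        # 'cats' arrives after 'cat': fold it into the singular key
        counts[word[:-1]] += 1
    elif word + 's' in counts:
        # 'cat' arrives after 'cats': fold it into the existing plural key
        counts[word + 's'] += 1
    else:
        counts[word] = 1

Irregular plurals (mouse/mice) and '-es' forms would need extra rules or a proper stemmer/lemmatizer.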
On the same topic: what if I want to determine whether words are similar by consulting a database of known suffixes and prefixes, and seeing if a new_word is just a previously seen word with a suffix or prefix attached?
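Along the same lines, a rough sketch of such an affix check; the SUFFIXES and PREFIXES sets here are illustrative, not taken from any library:

SUFFIXES = {'s', 'es', 'ed', 'ing', 'ly'}
PREFIXES = {'un', 're', 'pre'}

def find_base(word, counts):
    # return a key already in counts that word appears to be derived from
    for suf in SUFFIXES:
        if word.endswith(suf) and word[:-len(suf)] in counts:
            return word[:-len(suf)]
    for pre in PREFIXES:
        if word.startswith(pre) and word[len(pre):] in counts:
            return word[len(pre):]
    return None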
I use nltk to handle a lot of other tasks in my program such as splitting into sentences and individual words but I would prefer to write my 'similar-ness' algorithm myself. Thank you in advance for your help guys!
Related
I'm writing a project on extracting a semantic orientation from a review stored in a text file.
I have a 400*2 array; each row contains a word and its weight. I want to check which of these words are in the text file, and calculate the weight of the whole content.
My question is: what is the most efficient way to do this? Should I search for each word separately, for example with a for loop?
Do I get any benefit from storing the content of the text file in a string object?
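One hedged sketch of a common approach: load the word/weight pairs into a dict, count the words in the text with collections.Counter, and sum the products (the file name and weights below are placeholders):

from collections import Counter
import re

weights = {'good': 1.0, 'awful': -2.0}   # stand-in for the 400x2 array

with open('review.txt', encoding='utf-8') as f:
    text = f.read().lower()

counts = Counter(re.findall(r"\w+", text))
total = sum(weight * counts[word] for word, weight in weights.items())
print(total)

Reading the file once into a string (or a Counter) and doing dict lookups avoids re-scanning the text for each of the 400 words.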
https://docs.python.org/3.6/library/mmap.html
This may work for you: you can memory-map the file and use its find() method.
This may be out-of-the-box thinking, but if you don't care about the semantic/grammatical connection of the words:
sort all words from the text by length
sort your array by length
Write a for-loop:
Call len() (length) on each word from the text.
Then only check against those words which have the same length.
With some tinkering it might give you a good performance boost instead of the "naive" search.
Also look into search algorithms if you want an additional boost (e.g. find the first word of the 400 with 6 letters, then go down the list until the first word with 5 letters comes up, then stop).
Alternatively you could also build an index array with the indexes of the first and last of all 5-letter words (and analogously for the other lengths), assuming your words don't change.
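A minimal sketch of that length-bucket idea, grouping the (word, weight) pairs by word length so each text word is only compared against candidates of the same length (the pairs are illustrative):

from collections import defaultdict

pairs = [('good', 1.0), ('awful', -2.0)]   # stand-in for the 400x2 array

by_length = defaultdict(list)
for word, weight in pairs:
    by_length[len(word)].append((word, weight))

def score(text_words):
    total = 0.0
    for w in text_words:
        # only candidates with the same length need to be checked
        for word, weight in by_length.get(len(w), []):
            if w == word:
                total += weight
    return total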
I've been trying to come up with an efficient solution for the following problem. I have a sorted list of words that contain diacritics and I want to be able to do a search without using diacritics. So for example I want to match 'kříž' just using 'kriz'. After a bit of brainstorming I came up with the following and I want to ask you, more experienced (or clever) ones, whether it's optimal or there's a better solution. I'm using Python but the problem is language independent.
First I provide a mapping of those characters that have some diacritical siblings. So in case of Czech:
cz_map = {'a' : ('á',), ... 'e' : ('é', 'ě') ... }
Now I can easily create all variants of a word on the input. So for 'lama' I get: ['lama', 'láma', 'lamá', 'lámá']. I could already use this to search for words that match any of those permutations, but when it comes to a word like 'nepredvidatelny' (unpredictable) one gets 13824 permutations. Even though my laptop has a shining Intel i5 logo on it, this solution is, to my taste, too naive.
Here's an improvement I came up with. The dictionary of words I'm using has a variant of binary search for prefix matching (it returns the word at the lowest index with a matching prefix), which is very useful in this case. I start with the first character, check whether it exists in the dictionary as a prefix, and if it does, I keep it for the next round, in which the next character (and its diacritical siblings) is appended to every surviving prefix. This way I'm propagating only those strings that lead to a match. Here's the code:
def dia_search(word, cmap, dictionary):
    prefixes = ['']
    for c in word:
        # each character maps to itself...
        subchars = [c]
        # ...plus its diacritical siblings, if there are any
        if c in cmap:   # dict.has_key() is gone in Python 3
            subchars += cmap[c]
        # keep only those extended prefixes that still match something
        prefixes = [p + s for s in subchars
                    for p in prefixes
                    if dictionary.psearch(p + s) > 0]
    return prefixes
This technique gives very good results, but could it be even better? Or is there a technique that doesn't need the character mapping as in this case? I'm not sure this is relevant, but the dictionary I'm using isn't sorted by any collation rules, so the sequence is 'a', 'z', 'á' rather than 'a', 'á', 'z' as one might expect.
Thanks for all comments.
EDIT: I cannot create any auxiliary precomputed database that would be a copy of the original one but without diacritics. Let's say the original database is too big to be replicated.
Using the standard library only (str.maketrans and str.translate) you could do this:
intab = "řížéě" # ...add all the other characters
outtab = "rizee" # and the characters you want them translated to
transtab = str.maketrans(intab, outtab)
strg = "abc kříž def "
print(strg.translate(transtab)) # abc kriz def
this is for python3.
for python 2 you'd need to:
from string import maketrans
transtab = maketrans(intab, outtab)
# the rest remains the same
Have a look at Unidecode, with which you can convert diacritics to the closest ASCII equivalent, e.g. unidecode(u'kříž').
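A quick illustration (assuming the third-party unidecode package is installed, e.g. via pip install unidecode):

from unidecode import unidecode

print(unidecode('kříž'))   # -> kriz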
As has been suggested, what you want to do is to translate your Unicode words (containing diacritics) to the closest standard 26-letter-alphabet version.
One way of implementing this would be to create a second list of words (of the same size of the original) with the corresponding translations. Then you do the query in the translated list, and once you have a match look up the corresponding location in the original list.
Or in case you can alter the original list, you can translate everything in-place and strip duplicates.
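A minimal sketch of that shadow-list idea, with the mapping and word list kept tiny for illustration (the real translation table would cover the full character set):

transtab = str.maketrans("řížéěáý", "rizeeay")    # extend with the full mapping
words = ['kříž', 'láma']                          # stand-in for the original list

stripped = [w.translate(transtab) for w in words] # parallel, diacritic-free copy

def lookup(query):
    q = query.translate(transtab)
    return [words[i] for i, s in enumerate(stripped) if s == q]

print(lookup('kriz'))   # -> ['kříž']

Note that this builds an auxiliary copy, which the question's edit rules out for the full database, so it is only viable if a translated index of some kind is acceptable.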
I am working on a Python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance

try:
    from string import maketrans   # Python 2
except ImportError:
    maketrans = str.maketrans      # Python 3

t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a mapping between the visually similar characters and what they can represent, i.e.
subs = {'3': 'e', '1': 'l', '0': 'o'}   # etc....
and then you can use this to test against your database of forbidden words.
e.g.
input: he11
If any of the characters have an entry in subs:
subs.get('h')   # no entry
subs.get('e')   # no entry
subs.get('1')   # 'l'
subs.get('1')   # 'l'
Put this together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
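A rough sketch of putting that together; the substitution table and banned list are illustrative:

def normalize(word, subs):
    # replace look-alike characters with the letters they stand for
    return ''.join(subs.get(ch, ch) for ch in word)

subs = {'3': 'e', '1': 'l', '0': 'o', '4': 'a'}
banned = {'hell', 'hello'}                  # stand-in for the forbidden list

print(normalize('he11', subs) in banned)    # True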
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' if you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with monetary symbols (the Euro, Yen, and Pound signs are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
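A hedged sketch of the linear-replacement matching: keep the alphabetic characters, turn everything else into a single-character wildcard, and test the result against each taboo word of the same length (the banned list is illustrative):

import re

banned = ['hello']   # illustrative taboo list

def matches_taboo(token):
    # letters stay literal, every other character becomes a one-char wildcard
    pattern = ''.join(ch if ch.isalpha() else '.' for ch in token)
    return any(len(word) == len(token) and re.fullmatch(pattern, word)
               for word in banned)

print(matches_taboo('h311o'))   # True: 'h...o' matches 'hello'

A second pass could then check the wildcarded positions against the substitution dictionary ('3' for 'E', '1' for 'L', and so on) to cut down on false positives.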
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes S (German ß), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
While this still means that all these words will be hard-coded, it will allow me to update the list more easily.
I will have to perform a spelling check-like operation in Python as follows:
I have a huge list of words (let's call it the lexicon). I am now given some text (let's call it the sample). I have to search for each sample word in the lexicon. If I cannot find it, that sample word is an error.
In short - a brute-force spelling checker. However, searching through the lexicon linearly for each sample word is bound to be slow. What's a better method to do this?
The complicating factor is that neither the sample nor the lexicon is in English. It is in a language which instead of 26 characters, can have over 300 - stored in Unicode.
A suggestion of any algorithm / data structure / parallelization method will be helpful. Algorithms which have high speed at the cost of less than 100% accuracy would be perfect, since I don't need 100% accuracy. I know about Norvig's algorithm for this, but it seems English-specific.
You can use a set of Unicode strings:
s = {u"rabbit", u"lamb", u"calf"}
and use the in operator to check whether a word occurs:
>>> u"rabbit" in s
True
>>> u"wolf" in s
False
This look-up is essentially O(1), so the size of the dictionary does not matter.
Edit: Here's the complete code for a (case-sensitive) spell checker (Python 2.6 or above):
from io import open
import re

with open("dictionary", encoding="utf-8") as f:
    words = set(line.strip() for line in f)

with open("document", encoding="utf-8") as f:
    for w in re.findall(r"\w+", f.read()):
        if w not in words:
            print "Misspelled:", w.encode("utf-8")
(The print assumes your terminal uses UTF-8.)
This is where sets come in place. Create a set of all the words in your dictionary and then use a membership operator to check if the word is present in the dictionary or not.
Here is a simplified example
>>> dictionary = {'Python', 'check-like', 'will', 'perform', 'follows:', 'spelling', 'operation'}
>>> for word in "I will have to perform a spelling check-like operation in Python as follows:".split():
...     if word in dictionary:
...         print "Found {0} in the dictionary".format(word)
...     else:
...         print "{0} not present in the dictionary".format(word)
...
I not present in the dictionary
Found will in the dictionary
have not present in the dictionary
to not present in the dictionary
Found perform in the dictionary
a not present in the dictionary
Found spelling in the dictionary
Found check-like in the dictionary
Found operation in the dictionary
in not present in the dictionary
Found Python in the dictionary
as not present in the dictionary
Found follows: in the dictionary
>>>
Try it with a set, like everyone is telling you. Set lookups were optimized in Python's C implementation by experienced programmers, so there's no way you can do better in your little application.
Unicode is not an issue: set and dictionary keys can be Unicode or English text, it doesn't matter. The only consideration for you might be Unicode normalization, since different orders of diacritics would not compare equal. If this is an issue for your language, I would first ensure the lexicon is stored in normalized form, and then normalize each word before you check it, e.g. unicodedata.normalize('NFC', word).
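For example, a minimal sketch of NFC-normalizing both sides before the membership test (the lexicon here is illustrative):

import unicodedata

def norm(s):
    # compose characters so precomposed and decomposed forms compare equal
    return unicodedata.normalize('NFC', s)

lexicon = {norm(w) for w in [u'kříž', u'láma']}   # illustrative lexicon
print(norm(u'kříž') in lexicon)                   # True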
Use a tree structure to store the words, such that each path from root to leaf represents a single word. If your traversal cannot reach a leaf, or reaches a leaf before the end of the word, you have a word not in your lexicon.
Apart from the benefits Emil mentions in the comments, note also that this allows you to do things like back-tracking to find alternative spellings.
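A small sketch of such a tree (a trie) built from nested dicts, with an arbitrary sentinel key marking the end of a word:

END = '$'   # arbitrary end-of-word marker

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = build_trie([u'rabbit', u'lamb', u'calf'])
print(contains(trie, u'lamb'), contains(trie, u'lam'))   # True False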
The average time complexity of hashed search in a python dictionary is O(1). You can therefore use a "dictionary with no values" (a.k.a. a set)
That's what python dictionaries and sets are for! :)
Either store your lexicon in a dictionary if each word has some value (say frequency), or a set if you just need to check for existence. Searching them is O(1) so it will be damn fast.
lex = set(('word1', 'word2', .....))
for w in words:
    if w not in lex:
        print "Error: %s" % w
First, you need to create an index of your lexicon. You could build your own indexing system, but a better way is to use a full-text search engine.
I can recommend Apache Lucene or Sphinx; both are fast and open source.
Then you can send search queries from Python to the search engine and process the replies.
Here is a post I wrote on checking such things. It's similar to how the Google suggestion/spell checker works.
http://blog.mattalcock.com/2012/12/5/python-spell-checker/
Hope it helps.
I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
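A rough sketch of that letter-by-letter idea, using a set in place of the hashtable and greedy longest-match at each position (the tiny word set is purely illustrative, and this strategy runs into exactly the ambiguities described above):

WORDS = {'like', 'we', 'said', 'will', 'do', 'what', 'can'}   # illustrative

def segment(text):
    # greedy longest-match: at each position take the longest known word
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in WORDS:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])   # unknown character: emit it and move on
            i += 1
    return out

print(segment('likewesaid'))   # ['like', 'we', 'said']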
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other, getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
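A tiny sketch of the pair-counting part, using collections.Counter over an illustrative corpus:

from collections import Counter

corpus = "we said we will do what we can".split()   # stand-in corpus

bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(a, b):
    # relative probability of word b directly following word a
    total = sum(c for (x, _), c in bigrams.items() if x == a)
    return bigrams[(a, b)] / total if total else 0.0

print(p_next('we', 'said'))   # 1/3 in this toy corpus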
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave... It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged type of solution.
textstring = '''"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."'''

indiv_characters = list(textstring)   # splits the string into individual characters

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against an English dictionary you can query
    # (API call, set lookup, etc.) that returns True/False
    in_english_dict = check_english_dict(teststring)   # placeholder function
    if in_english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# at the end, assemble a sentence from the pieces of sequential_indiv_word_list
# by putting a space between each word
There are some more issues to be worked out. For example, if the test never returns a match, this would obviously not work, as it would just keep adding more characters forever. However, since your demo string had some spaces, you could have it recognize these too and automatically start over at each of them.
Also, you need to account for punctuation; write conditionals like
if cur_char == ',' or cur_char == '.':
    # do action to start a new "word" automatically