Looping through substrings of a string against a dictionary - python

Please be gentle - I am new to this and English is not my first language. For a school project, the assignment is to create a program that allows the user to input text for a ticket and the result for who it should be routed to. Below is the code I have so far. It works fine for single word keywords, but not for keywords with two or more words. I don't necessarily need the answer as much as a push in the right direction. We are not supposed to use things we haven't learned yet so that limits me to some very basic functionality - lists, dictionaries, simple loops, etc.
If multiple keywords are in the input, only the keyword found at the highest position in the table is used for routing.
The program needs to treat keywords as case insensitive.
If there is no keyword present, it should be routed to "Next Available Technician"
The goal is to have output that is formatted exactly as such:
Also, I think I am supposed to use the .find() function, but I am not sure how it would be implemented.

Loop over the techRouting dictionary. Check if the keyword is in the text, and if it is, use the corresponding routed to value.
c = ''
for keyword in techRouting:
    if keyword in ticket_text:
        c = keyword
        break

There are a couple of issues here:
The way of extracting keywords from input
ticketText = ticketText.split(" ")
gives you a list consisting of the parts that had whitespace " " in between.
In other words, ticketText breaks into separate words, and none
of the words contains a space.
In [61]: ticketText = "Password doesn't work"
    ...: ticketKeywords = ticketText.split(" ")
    ...: print(ticketKeywords)
results in
['Password', "doesn't", 'work']
The keywords from the list should also be lowercased before comparison, as should the input.
You probably should not split the input into keywords at all. Just use keyword in inputText to check whether inputText contains the whole keyword as a substring.
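Putting those pieces together, here is a minimal sketch of the routing loop; the table entries and the prompt below are placeholders rather than the actual assignment data:

techRouting = {
    "password": "Tech A",
    "internet connectivity": "Tech B",
}  # placeholder table: put the real keyword -> technician pairs here, highest priority first

ticket_text = input("Enter the ticket text: ").lower()  # lowercase once for case-insensitive matching

routed_to = "Next Available Technician"  # default when no keyword is present
for keyword in techRouting:
    if keyword.lower() in ticket_text:   # substring check also works for multi-word keywords
        routed_to = techRouting[keyword]
        break                            # first match wins, i.e. the highest position in the table

print("Routed to:", routed_to)

Since Python 3.7, dictionaries keep insertion order, so listing the keywords in priority order makes the first match correspond to the highest position in the table.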


Identify Visually Similar Strings in Python

I am working on a python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance
try:
    from string import maketrans  # Python 2
except ImportError:
    maketrans = str.maketrans     # Python 3
t = maketrans("01345", "oleas")   # map look-alike digits to the letters they resemble
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a relationship between the visually similar characters and what they can represent i.e.
dict = {'3': 'e', '1': 'l', '0': 'o'} #etc....
and then you can use this to test against your database of forbidden words.
e.g.
input: he11
If any of the characters have an entry in dict:
dict['h']  # no entry
dict['e']  # no entry
dict['1']  # 'l'
dict['1']  # 'l'
Put this together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
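If it helps, here is one rough sketch of that idea; the substitution map and the forbidden list below are made up for illustration:

substitutions = {'3': 'e', '1': 'l', '0': 'o', '4': 'a', '5': 's'}  # look-alike characters
forbidden = {"hell", "damn"}  # illustrative forbidden-word list

def normalize(word):
    # replace each look-alike character with the letter it resembles
    return ''.join(substitutions.get(ch, ch) for ch in word.lower())

print(normalize("he11") in forbidden)  # True: "he11" normalizes to "hell"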
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' if you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with monetary symbols (Euro, Yen, and Pound are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
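A tiny illustration of that wildcard translation (the function and the examples here are mine, not part of the answer):

import re

def could_be(candidate, taboo_word):
    # Treat every non-alphabetic character in the candidate as a one-character
    # wildcard and check whether the taboo word fits the resulting pattern.
    if len(candidate) != len(taboo_word):
        return False
    pattern = ''.join(ch if ch.isalpha() else '.' for ch in candidate.lower())
    return re.fullmatch(pattern, taboo_word) is not None

print(could_be("h311o", "hello"))  # True: "h311o" becomes the pattern "h...o"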
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes S (German), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
While this still means that all these words will be hard coded, it will allow me to update the list more easily.

Searching a string for an exact match from a list in Python

I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, but if the string appears inside a longer word it still matches (for instance, if the desired string was only "man" but they wrote "manager", it'd get retweeted). I'm still pretty new to Python, but my hunch is RegEx will be the way to go; my attempts have proved useless thus far.
if tweet["user"]["screen_name"] in friends:
    for phrase in list:
        if phrase in tweet["text"].lower():
            print tweet
            api.retweet(tweet["id"])
            return True
Since you only want to match whole words the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
if tweet["user"]["screen_name"] in friends:
    tweet_words = set(tweet["text"].lower().split())
    for phrase in list:
        if phrase in tweet_words:
            print tweet
            api.retweet(tweet["id"])
            return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a .split() method call.
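For example, a helper along those lines might look like this sketch (the function name is mine):

import string

def words_from(text):
    # lowercase, split on whitespace, and strip leading/trailing punctuation from each token
    return set(w.strip(string.punctuation) for w in text.lower().split())

You could then build the set with tweet_words = words_from(tweet["text"]) instead of the bare .split() call.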
Given that optimization it occurred to me that iteration in Python could be avoided altogether if the phrases were a set also (the iteration will still happen, but at C speeds rather than Python speeds). So in the code that follows let's suppose that you have during initialization executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
list = ['Python', 'Perl', 'COBOL']
tweets = [
    "This vacation just isn't worth the bother",
    "Goodness me she's a great Perl programmer",
    "This one slides by under the radar",
    "I used to program COBOL but I'm all right now",
    "A visit to the doctor is not reported"
]
tweet_words = set(w.lower() for w in list)
for tweet in tweets:
    if set(tweet.lower().split()) & tweet_words:
        print(tweet)
If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
import re

pattern = re.compile(r"\bman\b")
if re.search(pattern, tweet["text"].lower()):
    pass  # do your thing
\b looks for a word boundary in regex. So prefixing and suffixing your pattern with it will match only the pattern. Hope it helps.
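If you want to apply that to every phrase in your list rather than hard-coding "man", one possible sketch (word_list is the renamed variable suggested in the other answer; the phrases are illustrative) is:

import re

word_list = ['man', 'python', 'perl']  # illustrative phrases
patterns = [re.compile(r"\b" + re.escape(w.lower()) + r"\b") for w in word_list]

def should_retweet(text):
    # True only when one of the phrases appears as a whole word
    text = text.lower()
    return any(p.search(text) for p in patterns)

print(should_retweet("She is a great manager"))  # False: "man" appears only inside "manager"
print(should_retweet("What a man"))              # True

re.escape keeps any special characters in the user-supplied phrases from being treated as regex syntax.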

Python comparing elements in two lists

I have two lists:
a - dictionary which contains keywords such as ["impeccable", "obvious", "fantastic", "evident"] as elements of the list
b - sentences which contains sentences such as ["I am impeccable", "you are fantastic", "that is obvious", "that is evident"]
The goal is to use the dictionary list as a reference.
The process is as follows:
Take an element from the sentences list and run it against each element in the dictionary list. If any of the elements exists in the sentence, then spit out that sentence to a new list.
Repeat step 1 for each of the elements in the sentences list.
Any help would be much appreciated.
Thanks.
Below is the code:
sentences = "The book was awesome and envious", "splendid job done by those guys", "that was an amazing sale"
dictionary = "awesome", "amazing", "fantastic", "envious"
##Find Matches
for match in dictionary:
    if any(match in value for value in sentences):
        print match
Now that you've fixed the original problem, and fixed the next problem with doing the check backward, and renamed all of your variables, you have this:
for match in dictionary:
    if any(match in value for value in sentences):
        print match
And your problem with it is:
The way I have the code written, I can get the dictionary items, but instead I want to print the sentences.
Well, yes, your match is a dictionary item, and that's what you're printing, so of course that's what you get.
If you want to print the sentences that contain the dictionary item, you can't use any, because the whole point of that function is to just return True if any elements are true. It won't tell you which ones; in fact, if there is more than one, it'll stop at the first one.
If you don't understand functions like any and the generator expressions you're passing to them, you really shouldn't be using them as magic invocations. Figure out how to write them as explicit loops, and you will be able to answer these problems for yourself easily. (Note that the any docs directly show you how to write an equivalent loop.)
For example, your existing code is equivalent to:
for match in dictionary:
    for value in sentences:
        if match in value:
            print match
            break
Written that way, it should be obvious how to fix it. First, you want to print the sentence instead of the word, so print value instead of match (and again, it would really help if you used meaningful variable names like sentence and word instead of meaningless names like value and misleading names like match…). Second, you want to print all matching sentences, not just the first one, so don't break. So:
for match in dictionary:
    for value in sentences:
        if match in value:
            print value
And if you go back to my first answer, you may notice that this is the exact same structure I suggested.
You can simplify or shorten this by using comprehensions and iterator functions, but not until you understand the simple version, and how those comprehensions and iterator functions work.
First translate your algorithm into pseudocode instead of a vague description, like this:
for each sentence:
    for each element in the dictionary:
        if the element is in the sentence:
            spit out the sentence to a new list
The only one of these steps that isn't completely trivial to convert to Python is "spit out the sentence to a new list". To do that, you'll need to have a new list before you get started, like a_new_list = [], and then you can call append on it.
Once you convert this to Python, you will discover that "I am impeccable and fantastic" gets spit out twice. If you don't want that, you need to find the appropriate place to break out of the inner loop and move on to the next sentence. Which is also trivial to convert to Python.
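A direct translation of that pseudocode might look like the sketch below, using the example lists from the question (the name a_new_list is just illustrative):

sentences = ["I am impeccable", "you are fantastic", "that is obvious", "that is evident"]
dictionary = ["impeccable", "obvious", "fantastic", "evident"]

a_new_list = []                      # the new list the matching sentences are spit out to
for sentence in sentences:
    for word in dictionary:
        if word in sentence:
            a_new_list.append(sentence)
            break                    # move on so the same sentence isn't added twice

print(a_new_list)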
Now that you've posted your code… I don't know what problem you were asking about, but there's at least one thing obviously wrong with it.
sentences is a list of sentences.
So, for partial in sentences means each partial will be a sentence, like "I am impeccable".
dictionary is a list of words. So, for value in dictionary means each value will be a word, like "impeccable".
Now, you're checking partial in value for each value for each partial. That will never be true. "I am impeccable" is not in "impeccable".
If you turn that around, and check whether value in partial, it will give you something that's at least true sometimes, and that may even be what you actually want, but I'm not sure.
As a side note, if you used better names for your variables, this would be a lot more obvious. partial and value don't tell you what those things actually are; if you'd called them sentence and word it would be pretty clear that sentence in word is never going to be true, and that word in sentence is probably what you wanted.
Also, it really helps to look at intermediate values to debug things like this. When you use an explicit for statement, you can print(partial) to see each thing that partial holds, or you can put a breakpoint in your debugger, or you can step through in a visualizer like this one. If you have to break the any(genexpr) up into an explicit loop to do, then do so. (If you don't know how, then you probably don't understand what generator expressions or the any function do, and have just copied and pasted random code you didn't understand and tried changing random things until it worked… in which case you should stop doing that and learn what they actually mean.)

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this', 'is', 'just', 'a', 'test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter, so if I just have 't' then I get a bunch of t's. This tells me that I am matching single letters; however, I want to match whole words, and even more, to preserve the wildcard structure.
What I would like is for the user to enter text (including what will be a wildcard) that I can substitute into the place where 'th*' is, with the wildcard still doing what it should. That leads to the question: can I just stick in a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', for example, and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas", and perhaps an idea of what I have wrong in the code, as it is not operating on the string of incoming text in the second snippet as it does (correctly) in the first?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split(' ')  # split line on spaces to make a list
filtered = fnmatch.filter(file_contents, 'th*')
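As for substituting user input for the 'th*' literal: yes, the pattern argument can simply be a variable, so a sketch along these lines (the prompt text is made up) should work:

import fnmatch

with open('testfilefolder/wssnt10.txt') as f:
    words = f.read().lower().split()                     # whole words, not single characters

pattern = input("Enter a search pattern, e.g. th*: ")    # user-supplied wildcard pattern
matches = sorted(set(fnmatch.filter(words, pattern)))    # unique matching words
print(matches)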

Spell check program in python

Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I didn't get solutions to the problem so can somebody tell me how I went and what the correct answer should be?:
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?
There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right
I don't know what you are trying to iterate over, but why don't you just first iterate over your words (which are in the variable a, I guess?) and then, for every word in a, check whether or not that word is in the word list.
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.
It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
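A rough sketch of that set-difference idea, for reference (the word list here is illustrative; the file path is the one from the question):

# Spell check by set difference: words in the text that are not in the word list.
with open(r"C:\Documents and Settings\bhaa\Desktop\blablabla.txt") as f:
    text_words = set(f.read().lower().split())

word_list = {"the", "quick", "brown", "fox"}   # the known-good word list goes here

for word in sorted(text_words - word_list):    # unique words missing from the word list
    print(word)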
