Python RegEx code to detect specific features in a sentence

I created a simple word feature detector. So far I have been able to find particular features jumbled within the string, but the algorithm gets confused by certain sequences of words. Let me illustrate:
from nltk.tokenize import word_tokenize
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_descriptors)
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
def feature_match(message, keywords, negative_descriptors):
    if re.search(r"("+negative_descriptors+")" + r".*?" + r"("+keywords+")", message): return True
    if re.search(r"("+keywords+")" + r".*?" + r"("+negative_trailers+")", message): return True
The above returns True for the following messages:
message = 'There is no evidence of a collection.'
message = 'A collection is not present.'
That is correct as it implies that the keyword/condition I am looking for is NOT present. However, it returns None for the following messages:
message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'
message = 'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.'
It seems to be matching 'or vertebral osteomyelitis' in the first message and '/ collection' in the second message as negative matches, but this is wrong and implies that the message reads 'the condition that I am looking for IS present'. It should really return True instead.
How do I prevent this?

There are several problems with the code you posted:
negative_trailers = '|'.join(negative_descriptors) should be negative_trailers = '|'.join(negative_trailers).
You should also convert your keywords list to a string, as you did for the other lists, so that it can be used in a regex.
There is no need to write the r prefix three times in each regex.
After these corrections, your code should look like this:
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_trailers)
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
keywords = '|'.join(keywords)
if re.search(r"("+negative_descriptors+").*("+keywords+")", message): neg_desc_present = True
if re.search(r"("+keywords+").*("+negative_trailers+")", message): neg_desc_present = True
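The corrected snippet above can also be wrapped in a reusable function. The following is only a minimal sketch under some assumptions that are not in the original code: it adds re.escape and word boundaries (so that 'no' does not match inside 'not'), ignores case, and uses the hypothetical names build_pattern and feature_negated.

```python
import re

negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_trailers = ['not present', 'not evident']
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']


def alternation(phrases):
    # Escape each phrase and try longer phrases first.
    ordered = sorted(phrases, key=len, reverse=True)
    return '|'.join(re.escape(phrase) for phrase in ordered)


def build_pattern(first, second):
    # Match any phrase from `first`, followed later by any phrase from `second`.
    return re.compile(
        r'\b(?:%s)\b.*\b(?:%s)\b' % (alternation(first), alternation(second)),
        re.IGNORECASE,
    )


NEGATION_BEFORE = build_pattern(negative_descriptors, keywords)
NEGATION_AFTER = build_pattern(keywords, negative_trailers)


def feature_negated(message):
    # True when a keyword is preceded by a negative descriptor
    # or followed by a negative trailer.
    return bool(NEGATION_BEFORE.search(message) or NEGATION_AFTER.search(message))


print(feature_negated('There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.'))
```

Note that this still treats one negation as covering every keyword that follows it in the same message, which is what the question asks for but may be too broad for longer reports.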

Related

How to store string in quotation that contains two words?

I wrote the search code below, and I want to store whatever is between " " as a single item in the list. How can I do that? In this case I have three lists, but the second one, should, is not what I want.
import re
message='read read read'
others = ' '.join(re.split('\(.*\)', message))
others_split = others.split()
to_compile = re.compile('.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
should = ors_string.split(' ')
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
Output:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly', 'needed"', 'empty']
must_not: ['russia', '"destination good"']
Wanted result:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly needed"', 'empty'] <---
must_not: ['russia', '"destination good"']
When I edit the message, I get the following error. How should I handle it?
Traceback (most recent call last):
ors_string = to_match.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Your should list is split on whitespace: should = ors_string.split(' '). This is why the quoted phrase is split into two list items. The following code gives you the output you requested, but I'm not sure that it solves your problem for future inputs.
import re
message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split(r'\(.*\)', message))
others_split = others.split()
to_compile = re.compile(r'.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
# Split on OR instead of whitespace.
should = ors_string.split('OR')
to_remove_or = "OR"
while to_remove_or in should:
    should.remove(to_remove_or)
# Remove the whitespace that is left around each term after the split.
should = [word.strip() for word in should]
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
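The AttributeError in the question appears when the message contains no parenthesised group, because match() then returns None. A minimal sketch of one way to guard against that follows. The function name parse_should is hypothetical, and it assumes that alternatives are separated by the word OR surrounded by whitespace.

```python
import re


def parse_should(message):
    # Look for the first "(...)" group; return an empty list when there is none.
    match = re.search(r'\((.*)\)', message)
    if match is None:
        return []
    # Split on OR only when it stands alone between whitespace,
    # so words that merely contain "OR" are left intact.
    terms = re.split(r'\s+OR\s+', match.group(1))
    return [term.strip() for term in terms if term.strip()]


message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
print(parse_should(message))
print(parse_should('read read read'))
```

Returning an empty list for a message without parentheses is one possible design choice; raising a descriptive exception would be another.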

Divide list into two parts based on condition python

I have a list which contains a chat conversation between agent and customer.
chat = ['agent',
'Hi',
'how may I help you?',
'customer',
'I am facing issue with internet',
'agent',
'Can i know your name and mobile no.?',
'customer',
'john doe',
'111111',..... ]
This is a sample of chat list.
I am looking to divide the list into two parts, agent_chat and customer_chat, where agent_chat contains all the lines that agent said, and customer_chat containing the lines said by customer.
Something like this(final output).
agent_chat = ['Hi','how may I help you?','Can i know your name and mobile no.?'...]
customer_chat = ['I am facing issue with internet','john doe','111111',...]
I'm facing issues while solving this. I tried using the list.index() method to split the chat list based on indexes, but I'm getting the same index multiple times.
For example, the following snippet:
[chat.index(l) for l in chat if l=='agent']
This displays [0, 0], since index() only gives me the first occurrence.
Is there a better way to achieve the desired output?

index() returns only the first index of the element, so you'll need to collect the indexes of all occurrences by iterating over the list.
I would suggest solving this with a simple for loop:
agent_chat = []
customer_chat = []
chat_type = 'agent'
for message in chat:
    if message in ['agent', 'customer']:
        chat_type = message
        continue
    if chat_type == 'agent':
        agent_chat.append(message)
    else:
        customer_chat.append(message)
Other approaches like list comprehension will require two iterations of the list.
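For reference, if the positions of the speaker markers themselves are needed, enumerate() yields every index rather than just the first one. This is only an illustrative sketch; the variable names agent_positions and customer_positions are made up for the example.

```python
chat = ['agent',
        'Hi',
        'how may I help you?',
        'customer',
        'I am facing issue with internet',
        'agent',
        'Can i know your name and mobile no.?',
        'customer',
        'john doe',
        '111111']

# enumerate() pairs each element with its own position,
# so repeated values keep their distinct indexes.
agent_positions = [i for i, line in enumerate(chat) if line == 'agent']
customer_positions = [i for i, line in enumerate(chat) if line == 'customer']

print(agent_positions)     # [0, 5]
print(customer_positions)  # [3, 7]
```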

This would be my solution to your problem.
chat = ['agent',
'Hi',
'how may I help you?',
'customer',
'I am facing issue with internet',
'agent',
'Can i know your name and mobile no.?',
'customer',
'john doe',
'111111']
agent_list = []
customer_list = []
agent = False
customer = False
for message in chat:
    if message == 'agent':
        agent = True
        customer = False
    elif message == 'customer':
        agent = False
        customer = True
    elif agent:
        agent_list.append(message)
    elif customer:
        customer_list.append(message)
    else:
        pass

Here is my solution. I don't know whether this is the best one, but I hope it helps.
def chat_lists(chat):
    agent_chat = []
    customer_chat = []
    user_flag = ""
    for message in chat:
        if message == 'agent':
            user_flag = 'agent'
        elif message == 'customer':
            user_flag = 'customer'
        else:
            if user_flag == 'agent':
                agent_chat.append(message)
            else:
                customer_chat.append(message)
    return customer_chat, agent_chat

You can do something like this.
chat = ['agent',
'Hi',
'how may I help you?',
'customer',
'I am facing issue with internet',
'agent',
'Can i know your name and mobile no.?',
'customer',
'john doe',
'111111']
agent = []
customer = []
for j in chat:
    if j=='agent':
        curr = 'agent'
        continue
    if j=='customer':
        curr = 'customer'
        continue
    if(curr=='agent'):
        agent.append(j)
    else:
        customer.append(j)
print(agent)
print(customer)

You can set up a loop to step through the messages, with a variable acting as a 'switch' for whether the agent or the customer is talking.
# Get the current speaker (first speaker)
current_speaker = chat[0]
# Make the chat logs
agent_chat = []
customer_chat = []
# Iterate over the list
for message in chat:
    # If the current speaker is updated
    if message in ['agent', 'customer']:
        # Then update the speaker
        current_speaker = message
        # Skip to the next iteration
        continue
    # Add the message to the current speaker's list
    if current_speaker == 'agent':
        agent_chat.append(message)
    else:
        customer_chat.append(message)
Another note to keep in mind is that this whole system will bug out heavily if a customer or agent decides, for whatever reason, to type in the word 'agent' or 'customer'.
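The same switch idea extends to any number of speakers by collecting lines in a dictionary keyed by speaker. The sketch below is only an illustration: split_by_speaker and its speakers parameter are hypothetical names, and it does not solve the ambiguity just mentioned when a message is itself the word 'agent' or 'customer'.

```python
from collections import defaultdict


def split_by_speaker(chat, speakers=('agent', 'customer')):
    lines_by_speaker = defaultdict(list)
    current = None
    for line in chat:
        if line in speakers:
            # A speaker marker switches who the following lines belong to.
            current = line
        elif current is not None:
            lines_by_speaker[current].append(line)
    return dict(lines_by_speaker)


chat = ['agent', 'Hi', 'how may I help you?',
        'customer', 'I am facing issue with internet',
        'agent', 'Can i know your name and mobile no.?',
        'customer', 'john doe', '111111']

result = split_by_speaker(chat)
print(result['agent'])
print(result['customer'])
```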

Python - possible to only send message when keyword is detected in BOTH keyword lists?

I have a python script that detects a keyword from a keyword list keywords = ['camera', 'nikon'] and then sends a message to Slack like the following
Keyword camera detected
'Reddit post url'
'reddit comment that contains the keyword'
If the script detects a keyword from a second keyword list color_keywords = ['red', 'blue'] then it posts the following
Keyword camera detected
'Reddit post url'
'reddit comment that contains the keyword'
Color was detected
My question is, am I somehow able to have the script so it ONLY sends a message if a keyword from EACH keyword list is found?
So if it only finds a keyword from the first list, it will be ignored; if it only finds one from the second list, it will also be ignored. But if it finds a keyword from BOTH lists, it will send the message to Slack.
Below is my current code
MSG_TEMPLATE = """Keyword *{keyword}* detected
https://www.reddit.com{permalink}
```{comment_body}```"""
keywords = ['camera', 'nikon', 'canon']
color_keywords = ['blue', 'red']
with open(save_path, 'r') as fp:
    alerted_comments = json.load(fp)
for comment in comment_stream:
    if comment.id in alerted_comments:
        continue
    if comment.author:  # if comment author hasn't deleted
        if comment.author.name in ignore_users:
            continue
    if any(kw.lower() in comment.body.lower() for kw in keywords):
        found_kws = [kw for kw in keywords if kw.lower() in comment.body.lower()]
        msg = MSG_TEMPLATE.format(
            keyword=found_kws[0],
            permalink=comment.permalink,
            comment_body=comment.body
        )
        if any(kw.lower() in comment.body.lower() for kw in color_keywords):
            msg += "\n<!here> *A color was detected*"
        slack_data = {'text': msg, 'mrkdwn': True,}
        response = requests.post('https://hooks.slack.com/services/TB7AH6U2G/xxxxxxx/0KOjl9251TZExxxxxxxx',
                                 data=json.dumps(slack_data), headers={'Content-Type': 'application/json'})
Any help will be greatly appreciated!

Sure! The code below is excerpted for brevity:
def find_keywords(comment, word_list):
    """:returns: List of matching keywords present in the comment, or the empty list"""
    return [word for word in word_list if word.lower() in comment.body.lower()]
for comment in comment_stream:
    if not should_be_ignored(comment):
        found_kws = find_keywords(comment, keywords)
        found_colors = find_keywords(comment, color_keywords)
        if found_kws and found_colors:
            # At this point, we're guaranteed to have *both* one or more keywords *and* one or more colors
            send_message(comment, found_kws, found_colors)
The key insight here is: you create your lists of matches first, and then afterward examine them to decide if you want to send a message. In this case, only if both lists are not empty will you progress to sending the message.
(Implementation of should_be_ignored() and send_message() are, of course, left as an exercise to the reader. :) )
EDIT: Complete implementation of the original code:
def send_message(comment, keywords, colors):
    assert keywords and colors, "At this point, we should *only* be calling this function if we have at least one keyword and one color"
    MSG_TEMPLATE = """Keyword *{keyword}* and color *{color}* detected
https://www.reddit.com{permalink}
```{comment_body}```"""
    msg = MSG_TEMPLATE.format(
        keyword=keywords[0],
        color=colors[0],
        permalink=comment.permalink,
        comment_body=comment.body
    )
    slack_data = {'text': msg, 'mrkdwn': True,}
    response = requests.post('https://hooks.slack.com/services/TB7AH6U2G/xxxxxxx/0KOjl9251TZExxxxxxxx',
                             data=json.dumps(slack_data), headers={'Content-Type': 'application/json'})
def should_be_ignored(comment, alerted):
    return comment.id in alerted or (comment.author and comment.author.name in ignore_users)
def find_keywords(comment, word_list):
    """:returns: List of matching keywords present in the comment, or the empty list"""
    return [word for word in word_list if word.lower() in comment.body.lower()]
keywords = ['camera', 'nikon', 'canon']
color_keywords = ['blue', 'red']
with open(save_path, 'r') as fp:
    alerted_comments = json.load(fp)
for comment in comment_stream:
    if not should_be_ignored(comment, alerted_comments):
        found_kws = find_keywords(comment, keywords)
        found_colors = find_keywords(comment, color_keywords)
        if found_kws and found_colors:
            # At this point, we're guaranteed to have *both* one or more keywords *and* one or more colors
            send_message(comment, found_kws, found_colors)
Note that all I've done (aside from the new requirement that we have both a color and a keyword before sending a message) is to pull out some of your business logic into the should_be_ignored() and send_message() functions, hopefully clarifying the intent of the main body of code. This should be a drop-in replacement for the sample you started with.
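To try the both-lists condition without Reddit or Slack, the check can be run against stand-in comment objects. The sketch below is an assumption-laden illustration: SimpleNamespace objects replace real Reddit comments, and matches_both is a hypothetical helper that is not part of the answer's code.

```python
from types import SimpleNamespace

keywords = ['camera', 'nikon', 'canon']
color_keywords = ['blue', 'red']


def find_keywords(comment, word_list):
    # Case-insensitive substring match, as in the answer above.
    body = comment.body.lower()
    return [word for word in word_list if word.lower() in body]


def matches_both(comment):
    # True only when at least one keyword AND at least one color are present.
    return bool(find_keywords(comment, keywords)) and bool(find_keywords(comment, color_keywords))


# Stand-in comments; real ones would come from the comment stream.
comments = [
    SimpleNamespace(body='Selling my red Nikon'),
    SimpleNamespace(body='Any camera tips?'),
    SimpleNamespace(body='The sky is blue'),
]

to_alert = [c.body for c in comments if matches_both(c)]
print(to_alert)  # ['Selling my red Nikon']
```

Because this is substring matching, a color such as 'red' would also match inside a longer word like 'bored'; a word-boundary regex would be one way to tighten that if it matters.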

Python - Delete Conditional Lines of Chat Log File

I am trying to delete my side of the conversation from a chat log file and analyse only the other person's data. When I load the file into Python like this:
with open(chatFile) as f:
chatLog = f.read().splitlines()
The data is loaded like this (much longer than the example):
'My Name',
'08:39 Chat data....!',
"Other person's name",
'08:39 Chat Data....',
'08:40 Chat data...',
'08:40 Chat data...?',
I would like it to look like this:
"Other person's name",
'08:39 Chat Data....',
'08:40 Chat data...',
'08:40 Chat data...?',
I was thinking of using an if statement with regular expressions:
name = 'My Name'
for x in chatLog:
if x == name:
"delete all data below until you get to reach the other
person's name"
I could not get this code to work properly, any ideas?

I think you misunderstand what "regular expressions" means... It doesn't mean you can just write English language instructions and the python interpreter will understand them. Either that or you were using pseudocode, which makes it impossible to debug.
If you don't have the other person's name, we can probably assume it doesn't begin with a number. Assuming all of the non-name lines do begin with a number, as in your example:
name = 'My Name'
skipLines = False
results = []
for x in chatLog:
    if x == name:
        skipLines = True
    elif not x[0].isdigit():
        skipLines = False
    if not skipLines:
        results.append(x)

others = []
on = True
for line in chatLog:
    if not line[0].isdigit():
        on = line != name
    if on:
        others.append(line)

You can delete all of your messages using re.sub with an empty string as the second argument which is your replacement string.
Assuming each chat message starts on a new line beginning with a time stamp, and that nobody's name can begin with a digit, the regular expression pattern re.escape(yourname) + r',\n(?:\d.*?\n)*' should match all of your messages, and then those matches can be replaced with the empty string.
import re
with open(chatfile) as f:
    chatlog = f.read()
yourname = 'My Name'
pattern = re.escape(yourname) + r',\n(?:\d.*?\n)*'
others_messages = re.sub(pattern, '', chatlog)
print(others_messages)
This will work to delete the messages of any user from any chat log where an arbitrary number of users are chatting.
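The loop-based answers above index line[0], which raises an IndexError on an empty line. A minimal sketch of the same idea with that guard added is shown here. The function name drop_speaker is hypothetical, and it keeps the assumption that message lines start with a digit while name lines do not.

```python
def drop_speaker(lines, name):
    kept = []
    keep = True
    for line in lines:
        # A non-empty line that does not start with a digit is treated as a name line.
        if line and not line[0].isdigit():
            keep = (line != name)
        if keep:
            kept.append(line)
    return kept


chat_log = [
    'My Name',
    '08:39 Chat data....!',
    "Other person's name",
    '08:39 Chat Data....',
    '08:40 Chat data...',
    '08:40 Chat data...?',
]

for line in drop_speaker(chat_log, 'My Name'):
    print(line)
```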

trying to automate translation on babelfish with python

I have modified a Python babelizer script to help me translate English to Chinese.
## {{{ http://code.activestate.com/recipes/64937/ (r4)
# babelizer.py - API for simple access to babelfish.altavista.com.
# Requires python 2.0 or better.
#
# See it in use at http://babel.MrFeinberg.com/
"""API for simple access to babelfish.altavista.com.
Summary:
import babelizer
print ' '.join(babelizer.available_languages)
print babelizer.translate( 'How much is that doggie in the window?',
'English', 'French' )
def babel_callback(phrase):
print phrase
sys.stdout.flush()
babelizer.babelize( 'I love a reigning knight.',
'English', 'German',
callback = babel_callback )
available_languages
A list of languages available for use with babelfish.
translate( phrase, from_lang, to_lang )
Uses babelfish to translate phrase from from_lang to to_lang.
babelize(phrase, from_lang, through_lang, limit = 12, callback = None)
Uses babelfish to translate back and forth between from_lang and
through_lang until either no more changes occur in translation or
limit iterations have been reached, whichever comes first. Takes
an optional callback function which should receive a single
parameter, being the next translation. Without the callback
returns a list of successive translations.
It's only guaranteed to work if 'english' is one of the two languages
given to either of the translation methods.
Both translation methods throw exceptions which are all subclasses of
BabelizerError. They include
LanguageNotAvailableError
Thrown on an attempt to use an unknown language.
BabelfishChangedError
Thrown when babelfish.altavista.com changes some detail of their
layout, and babelizer can no longer parse the results or submit
the correct form (a not infrequent occurance).
BabelizerIOError
Thrown for various networking and IO errors.
Version: $Id: babelizer.py,v 1.4 2001/06/04 21:25:09 Administrator Exp $
Author: Jonathan Feinberg <jdf#pobox.com>
"""
import re, string, urllib
import httplib, urllib
import sys
"""
Various patterns I have encountered in looking for the babelfish result.
We try each of them in turn, based on the relative number of times I've
seen each of these patterns. $1.00 to anyone who can provide a heuristic
for knowing which one to use. This includes AltaVista employees.
"""
__where = [ re.compile(r'name=\"q\">([^<]*)'),
re.compile(r'td bgcolor=white>([^<]*)'),
re.compile(r'<\/strong><br>([^<]*)')
]
# <div id="result"><div style="padding:0.6em;">??</div></div>
__where = [ re.compile(r'<div id=\"result\"><div style=\"padding\:0\.6em\;\">(.*)<\/div><\/div>', re.U) ]
__languages = { 'english' : 'en',
'french' : 'fr',
'spanish' : 'es',
'german' : 'de',
'italian' : 'it',
'portugese' : 'pt',
'chinese' : 'zh'
}
"""
All of the available language names.
"""
available_languages = [ x.title() for x in __languages.keys() ]
"""
Calling translate() or babelize() can raise a BabelizerError
"""
class BabelizerError(Exception):
    pass
class LanguageNotAvailableError(BabelizerError):
    pass
class BabelfishChangedError(BabelizerError):
    pass
class BabelizerIOError(BabelizerError):
    pass
def saveHTML(txt):
    f = open('page.html', 'wb')
    f.write(txt)
    f.close()
def clean(text):
    return ' '.join(string.replace(text.strip(), "\n", ' ').split())
def translate(phrase, from_lang, to_lang):
    phrase = clean(phrase)
    try:
        from_code = __languages[from_lang.lower()]
        to_code = __languages[to_lang.lower()]
    except KeyError, lang:
        raise LanguageNotAvailableError(lang)
    html = ""
    try:
        params = urllib.urlencode({'ei':'UTF-8', 'doit':'done', 'fr':'bf-res', 'intl':'1' , 'tt':'urltext', 'trtext':phrase, 'lp' : from_code + '_' + to_code , 'btnTrTxt':'Translate'})
        headers = {"Content-type": "application/x-www-form-urlencoded","Accept": "text/plain"}
        conn = httplib.HTTPConnection("babelfish.yahoo.com")
        conn.request("POST", "http://babelfish.yahoo.com/translate_txt", params, headers)
        response = conn.getresponse()
        html = response.read()
        saveHTML(html)
        conn.close()
        #response = urllib.urlopen('http://babelfish.yahoo.com/translate_txt', params)
    except IOError, what:
        raise BabelizerIOError("Couldn't talk to server: %s" % what)
    #print html
    for regex in __where:
        match = regex.search(html)
        if match:
            break
    if not match:
        raise BabelfishChangedError("Can't recognize translated string.")
    return match.group(1)
    #return clean(match.group(1))
def babelize(phrase, from_language, through_language, limit = 12, callback = None):
    phrase = clean(phrase)
    seen = { phrase: 1 }
    if callback:
        callback(phrase)
    else:
        results = [ phrase ]
    flip = { from_language: through_language, through_language: from_language }
    next = from_language
    for i in range(limit):
        phrase = translate(phrase, next, flip[next])
        if seen.has_key(phrase): break
        seen[phrase] = 1
        if callback:
            callback(phrase)
        else:
            results.append(phrase)
        next = flip[next]
    if not callback: return results
if __name__ == '__main__':
    import sys
    def printer(x):
        print x
        sys.stdout.flush();
    babelize("I won't take that sort of treatment from you, or from your doggie!",
             'english', 'french', callback = printer)
## end of http://code.activestate.com/recipes/64937/ }}}
and the test code is
import babelizer
print ' '.join(babelizer.available_languages)
result = babelizer.translate( 'How much is that dog in the window?', 'English', 'chinese' )
f = open('result.txt', 'wb')
f.write(result)
f.close()
print result
The result is expected inside a div block. I modified the script to save the HTML response, and what I found is that all UTF-8 characters are turned into NUL. Do I need to take special care when handling the UTF-8 response?

I think you need to use codecs.open (after import codecs) instead of the plain open in your saveHTML function to handle UTF-8 documents. See the Python Unicode HOWTO for a complete explanation.
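As a rough illustration of the idea, the sketch below decodes the response bytes explicitly and writes them through a file object that knows its encoding. It is written for current Python rather than the Python 2 of the question, it swaps in io.open (which takes the same encoding argument) for codecs.open, and save_html, load_html and the sample bytes are hypothetical stand-ins for the real server response.

```python
import io
import os
import tempfile


def save_html(raw_bytes, path, encoding='utf-8'):
    # Decode the bytes from the server, then write text with an explicit encoding.
    text = raw_bytes.decode(encoding)
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)
    return text


def load_html(path, encoding='utf-8'):
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


# Stand-in for response.read(): UTF-8 bytes containing Chinese characters.
sample = u'<div id="result">\u7a97\u53e3\u91cc\u7684\u72d7</div>'.encode('utf-8')
path = os.path.join(tempfile.mkdtemp(), 'page.html')

save_html(sample, path)
round_tripped = load_html(path)
print(round_tripped == sample.decode('utf-8'))
```

This only shows that the characters survive a save and reload when the encoding is stated explicitly. Whether the server actually returns UTF-8 would need to be checked against the response headers.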
