I wrote this search code, and I want to store whatever is between double quotes as a single item in the list. How can I do that? In this case I have three lists, but the second one, should, is not what I want.
import re
message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split(r'\(.*\)', message))
others_split = others.split()
to_compile = re.compile(r'.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
should = ors_string.split(' ')
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
Output:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly', 'needed"', 'empty']
must_not: ['russia', '"destination good"']
Wanted result:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly needed"', 'empty'] <---
must_not: ['russia', '"destination good"']
I also get an error when I edit the message so that it has no parentheses. How do I handle it?
Traceback (most recent call last):
ors_string = to_match.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Your should list is split on whitespace (should = ors_string.split(' ')), which is why the quoted phrase is broken up in the list. The following code gives you the output you requested, but I'm not sure that it solves your problem for future inputs.
import re
message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split(r'\(.*\)', message))
others_split = others.split()
to_compile = re.compile(r'.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
# Split on OR instead of whitespace.
should = ors_string.split('OR')
to_remove_or = "OR"
while to_remove_or in should:
should.remove(to_remove_or)
# Remove trailing whitespace that is left after the split.
should = [word.strip() for word in should]
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
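As for the AttributeError on your edited message: match() returns None when the pattern finds no parentheses, so check the match object before calling group(). A minimal sketch along the lines of the code above:

to_match = to_compile.match(message)
if to_match is not None:
    ors_string = to_match.group(1)
    should = [word.strip() for word in ors_string.split('OR')]
else:
    should = []  # no parenthesized OR-group in this message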
I created a csv file like this:
"CAMERA", "Camera", "kamera", "cam", "Kamera"
"PICTURE", "Picture", "bild", "photograph"
and used it somewhat like this:
nlp = de_core_news_sm.load()
text = "Cam is not good"
doc = nlp(text)
name_dict, desc_dict = load_entities()
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=96)
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)  # 342 is an arbitrary value here
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1])  # 100% prior probability P(entity|alias)
Printing values like this:
print(f"Entities in the KB: {kb.get_entity_strings()}")
print(f"Aliases in the KB: {kb.get_alias_strings()}")
gives me:
Entities in the KB: ['PICTURE', 'CAMERA']
Aliases in the KB: [' "Camera"', ' "Picture"']
However, if I try to check for candidates, I only get an empty list:
candidates = kb.get_candidates("Camera")
print(candidates)
for c in candidates:
    print(" ", c.entity_, c.prior_prob, c.entity_vector)
Aliases in the KB: [' "Camera"', ' "Picture"']
It looks to me as if your parsing script added the literal string "Camera", with spaces and quotes and all, to the KB, instead of just the raw string Camera?
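A minimal sketch of a loader that keeps only the bare alias strings (the file name and the dict layout here are assumptions, since your load_entities isn't shown):

import csv

def load_aliases(path='entities.csv'):
    name_dict = {}
    with open(path, newline='', encoding='utf-8') as f:
        # skipinitialspace drops the blank after each comma, and the csv
        # module itself strips the surrounding double quotes
        for row in csv.reader(f, skipinitialspace=True):
            qid, aliases = row[0], row[1:]
            name_dict[qid] = aliases  # e.g. 'CAMERA' -> ['Camera', 'kamera', 'cam', 'Kamera']
    return name_dict

With clean alias strings in the KB, kb.add_alias(alias='Camera', ...) then lines up with kb.get_candidates('Camera').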
I'm working on a classification task, using a movie reviews dataset from Kaggle. The part with which I'm struggling is a series of functions, in which the output of one becomes the input of the next.
Specifically, in the code provided, the function "word_token" takes the input "phraselist", tokenizes it, and returns a tokenized document called "phrasedocs". The only problem is that it doesn't seem to be working: when I take that theoretical document "phrasedocs" and pass it into the next function, "process_token", I get:
NameError: name 'phrasedocs' is not defined
I am completely willing to accept that there is something simple I have overlooked, but I've been on this for hours and I can't figure it out. I would appreciate any help.
I have tried proofreading and debugging the code, but my Python expertise is not great.
import os
import re
import random
import nltk

# This function obtains data from train.tsv
def processkaggle(dirPath, limitStr):
    # Convert the limit argument from a string to an int
    limit = int(limitStr)
    os.chdir(dirPath)
    f = open('./train.tsv', 'r')
    # Loop over lines in the file and use their first 'limit' entries
    phrasedata = []
    for line in f:
        # Ignore the first line starting with Phrase, then read all lines
        if (not line.startswith('Phrase')):
            # Remove final end of line character
            line = line.strip()
            # Each line has four items, separated by tabs
            # Ignore the phrase and sentence IDs, keep the phrase and sentiment
            phrasedata.append(line.split('\t')[2:4])
    return phrasedata
# Randomize and subset data
def random_phrase(phrasedata):
    random.shuffle(phrasedata)  # phrasedata initiated in function processkaggle
    phraselist = phrasedata[:limit]  # NB: 'limit' is local to processkaggle, so this line has the same scoping problem
    for phrase in phraselist[:10]:
        print(phrase)
    return phraselist
# Tokenization
def word_token(phraselist):
    phrasedocs = []
    for phrase in phraselist:
        tokens = nltk.word_tokenize(phrase[0])
        phrasedocs.append((tokens, int(phrase[1])))
    return phrasedocs
# Pre-processing
# Convert all tokens to lower case
def lower_case(doc):
    return [w.lower() for w in doc]
# Clean text, fixing confusion over apostrophes
def clean_text(doc):
    cleantext = []
    for review_text in doc:
        review_text = re.sub(r"it 's", "it is", review_text)
        review_text = re.sub(r"that 's", "that is", review_text)
        review_text = re.sub(r"\'s", "\'s", review_text)
        review_text = re.sub(r"\'ve", "have", review_text)
        review_text = re.sub(r"wo n't", "will not", review_text)
        review_text = re.sub(r"do n't", "do not", review_text)
        review_text = re.sub(r"ca n't", "can not", review_text)
        review_text = re.sub(r"sha n't", "shall not", review_text)
        review_text = re.sub(r"n\'t", "not", review_text)
        review_text = re.sub(r"\'re", "are", review_text)
        review_text = re.sub(r"\'d", "would", review_text)
        review_text = re.sub(r"\'ll", "will", review_text)
        cleantext.append(review_text)
    return cleantext
# Remove punctuation and numbers
def rem_no_punct(doc):
    remtext = []
    for text in doc:
        punctuation = re.compile(r'[-_.?!/\%#,":;\'{}<>~`()|0-9]')
        word = punctuation.sub("", text)
        remtext.append(word)
    return remtext
# Remove stopwords
def rem_stopword(doc):
    stopwords = nltk.corpus.stopwords.words('english')
    updatestopwords = [word for word in stopwords if word not in ['not','no','can','has','have','had','must','shan','do','should','was','were','won','are','cannot','does','ain','could','did','is','might','need','would']]
    return [w for w in doc if not w in updatestopwords]
# Lemmatization
def lemmatizer(doc):
    wnl = nltk.WordNetLemmatizer()
    lemma = [wnl.lemmatize(t) for t in doc]
    return lemma

# Stemming
def stemmer(doc):
    porter = nltk.PorterStemmer()
    stem = [porter.stem(t) for t in doc]
    return stem
# This function combines all the previous pre-processing functions into one, which is helpful
# if I want to alter these settings for experimentation later
def process_token(phrasedocs):
    phrasedocs2 = []
    for phrase in phrasedocs:
        tokens = nltk.word_tokenize(phrase[0])
        tokens = lower_case(tokens)
        tokens = clean_text(tokens)
        tokens = rem_no_punct(tokens)
        tokens = rem_stopword(tokens)
        tokens = lemmatizer(tokens)
        tokens = stemmer(tokens)
        phrasedocs2.append((tokens, int(phrase[1])))  # Any words that pass through the
                                                      # processing steps above are added to phrasedocs2
    return phrasedocs2
dirPath = 'C:/Users/J/kagglemoviereviews/corpus'
processkaggle(dirPath, 5000) # returns 'phrasedata'
random_phrase(phrasedata) # returns 'phraselist'
word_token(phraselist) # returns 'phrasedocs'
process_token(phrasedocs) # returns phrasedocs2
NameError Traceback (most recent call last)
<ipython-input-120-595bc4dcf121> in <module>()
5 random_phrase(phrasedata) # returns 'phraselist'
6 word_token(phraselist) # returns 'phrasedocs'
----> 7 process_token(phrasedocs) # returns phrasedocs2
8
9
NameError: name 'phrasedocs' is not defined
Simply put: you defined "phrasedocs" inside a function, where it is not visible from outside, and each function's return value should be captured in a variable.
Edit your code:
dirPath = 'C:/Users/J/kagglemoviereviews/corpus'
phrasedata = processkaggle(dirPath, 5000) # returns 'phrasedata'
phraselist = random_phrase(phrasedata) # returns 'phraselist'
phrasedocs = word_token(phraselist) # returns 'phrasedocs'
phrasedocs2 = process_token(phrasedocs) # returns phrasedocs2
You have only created the variable phrasedocs inside a function, so the variable is not defined for all of your other code outside that function. When you pass the variable to the next function, Python can't find any variable with that name. You must create a variable called phrasedocs in your main code by assigning the function's return value to it.
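A minimal sketch of the scoping rule itself, independent of this dataset:

def make_list():
    items = [1, 2, 3]    # 'items' exists only inside make_list
    return items

make_list()              # return value is discarded; 'items' is gone after the call
result = make_list()     # assign the return value to use it afterwards
print(result)            # [1, 2, 3]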
I created a simple word feature detector. So far I have been able to find particular features jumbled within the string, but the algorithm gets confused by certain sequences of words. Let me illustrate:
import re
from nltk.tokenize import word_tokenize

negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_descriptors)
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']

def feature_match(message, keywords, negative_descriptors):
    if re.search(r"("+negative_descriptors+")" + r".*?" + r"("+keywords+")", message): return True
    if re.search(r"("+keywords+")" + r".*?" + r"("+negative_trailers+")", message): return True
The above returns True for the following messages:
message = 'There is no evidence of a collection.'
message = 'A collection is not present.'
That is correct as it implies that the keyword/condition I am looking for is NOT present. However, it returns None for the following messages:
message = 'There is no evidence of disc prolapse, collection or vertebral osteomyelitis.'
message = 'There is no evidence of disc prolapse/vertebral osteomyelitis/ collection.'
It seems to be matching 'or vertebral osteomyelitis' in the first message and '/ collection' in the second message as negative matches, but this is wrong: it implies that the message reads 'the condition that I am looking for IS present'. It should really be returning True instead.
How do I prevent this?
There are several problems with the code you posted:
negative_trailers = '|'.join(negative_descriptors) should be negative_trailers = '|'.join(negative_trailers)
You should also convert your keywords list to a string, as you did for your other lists, so that it can be passed to the regex.
There is no need to repeat the r prefix three times within a single regex; once per string literal is enough.
After these corrections your code should look like this:
negative_descriptors = ['no', 'unlikely', 'no evidence of']
negative_descriptors = '|'.join(negative_descriptors)
negative_trailers = ['not present', 'not evident']
negative_trailers = '|'.join(negative_trailers)
keywords = ['disc prolapse', 'vertebral osteomyelitis', 'collection']
keywords = '|'.join(keywords)
if re.search(r"("+negative_descriptors+").*("+keywords+")", message): neg_desc_present = True
if re.search(r"("+keywords+").*("+negative_trailers+")", message): neg_desc_present = True
I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n    on to the next line...
              ^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
   5311 def checkSubIndent(s,l,t):
   5312     curCol = col(l,s)
   5313     if curCol > indentStack[-1]:
   5314         indentStack.append( curCol )
   5315     else:
-> 5316         raise ParseException(s,l,"not a subentry")
   5317
ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented, will not work in my case, as it will consume the subsequent definitions.
When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a built-in expression for "everything from here to the end of the line", named restOfLine, so we will just use that as the expression for each line in the indented block:
import pyparsing as pp

NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]

# see corrected version below
# description = pp.Group(pp.Empty()
#                        + pp.restOfLine + NL
#                        + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack)))

description = pp.Group(pp.restOfLine + NL
                       + pp.Optional(pp.ungroup(~pp.StringEnd()
                                                + pp.indentedBlock(pp.restOfLine,
                                                                   indent_stack))))

labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock but we also need to remove the per-line nesting that is created internally in indentedBlock. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
At this point, we are pretty much done. Here is your sample text parsed and dumped:
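(For reference, sample here is assumed to hold the indented text from the question:)

sample = """\
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
"""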
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah
I am using Python's regex with an if-statement: if the match is None, then it should go to the else clause. But it shows this error:
AttributeError: 'NoneType' object has no attribute 'group'
The script is:
import re
import string

chars = re.escape(string.punctuation)
sub = 'FW: Re: 29699'
if re.search("^FW: (\w{10})", sub).group(1) is not None:
    d = re.search("^FW: (\w{10})", sub).group(1)
else:
    a = re.sub(r'[' + chars + ']', ' ', sub)
    d = '_'.join(a.split())
Every help is great help!
Your problem is this: if your search doesn't find anything, it will return None. You can't call None.group(1), which is what your code amounts to. Instead, check whether the search result itself is None, not the search result's first group.
import re
import string
chars = re.escape(string.punctuation)
sub='FW: Re: 29699'
search_result = re.search(r"^FW: (\w{10})", sub)
if search_result is not None:
    d = search_result.group(1)
else:
    a = re.sub(r'[' + chars + ']', ' ', sub)
    d = '_'.join(a.split())
print(d)
# FW_Re_29699
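On Python 3.8+ you can also bind and test the match in one step with an assignment expression; a sketch of the same logic:

if (match := re.search(r"^FW: (\w{10})", sub)) is not None:
    d = match.group(1)
else:
    d = '_'.join(re.sub(r'[' + chars + ']', ' ', sub).split())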