I'm trying to check whether a string occurs in a sentence using an if condition, but I'm not getting the expected output. I've also tried a regex pattern, but that doesn't help either. Can anyone help me with this? The value of my string changes every time, so I'm not sure whether that is the problem.
r="Carnival: monitor service-eu Beta cloudwatch_module"
if "Carnival: monitor service-eu Beta" in r:
    test_string="EU"
    test_string1="eu"
elif "Carnival: monitor service-na Beta" in r:
    test_string="NA"
    test_string1="na"
elif "Carnival: monitor service-fe Beta" in r:
    test_string="FE"
    test_string1="fe"
else:
    print("None found")
I also tried a regex, something like the following, but it is not working either.
re_pattern = r'\b(?:service-eu|Beta|monitor|Carnival)\b'
new_= re.findall(re_pattern, r)
new1_=new_[2]
If it isn't necessary to use this short_description function, I suggest you use the find method:
if r.find("Carnival: monitor service-eu Beta") != -1:
    test_string="EU"
    test_string1="eu"
elif r.find("Carnival: monitor service-na Beta") != -1:
    test_string="NA"
    test_string1="na"
elif r.find("Carnival: monitor service-fe Beta") != -1:
    test_string="FE"
    test_string1="fe"
else:
    print("None found")
I'm not 100% sure I understand your problem, but hopefully this helps:
import re
def get_string(r):
    return re.findall(r"service-[a-z]{2}",r)[0][-2:]
get_string("Carnival: monitor service-na Beta")
>>> 'na'
get_string("Carnival: monitor service-fe Beta")
>>> 'fe'
Here, [a-z]{2} matches exactly two consecutive lowercase letters (a to z).
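Note that findall(...)[0] raises an IndexError when the string contains no service-xx part. A slightly more defensive sketch, assuming you would rather get None back in that case (get_region is a hypothetical name, not part of the answer above):

```python
import re

# Hypothetical helper: returns the two-letter code, or None if there is no match.
def get_region(text):
    # Capture exactly two lowercase letters right after "service-".
    match = re.search(r"service-([a-z]{2})\b", text)
    return match.group(1) if match else None

print(get_region("Carnival: monitor service-na Beta"))
print(get_region("Carnival: monitor Beta"))
```

The caller can then test the result with a plain if before using it.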
You can use a pattern with a capture group and .upper() for the group value.
\bCarnival: monitor service-([a-z]{2}) Beta\b
See the capture group value at the regex 101 demo and a Python demo.
Example
import re
pattern = r"\bCarnival: monitor service-([a-z]{2}) Beta\b"
r = "Carnival: monitor service-eu Beta cloudwatch_module"
m = re.search(pattern, r)
if m:
    test_string1 = m.group(1)
    test_string = test_string1.upper()
    print(test_string)
    print(test_string1)
Output
EU
eu
Related
I am trying to use a regex to exclude disambiguation pages when scraping Wikipedia. I looked around for tips about using the negative lookahead, but I cannot seem to make it work. I think I am missing something fundamental about its use, but as of now I am totally clueless. Could someone please point me in the right direction? (I don't want to use if 'disambiguation' in y; I am trying to grasp the workings of the negative lookahead.) Thank you.
Here is the code:
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
def findString(string):
    regex1 = r'(/wiki/)(_\($)(!?disambiguation)'
    for x in list_links:
        y = re.findall(regex1, x)
        print(y)
findString(list_links)
You can use either of the regexes below, depending on your needs. I have also changed the function definition to follow PEP 8.
import re
def remove_disambiguation_link(list_of_links):
    # Raw string so that \( and \) reach the regex engine unchanged.
    # Note: (!?disambiguation) matches an optional literal "!"; it is not a negative lookahead.
    regex = r"(.*)\((!?disambiguation)\)"
    # regex = r"(/wiki/)(.*)\((!?disambiguation)\)"
    # return [links for links in list_of_links if not re.search(regex, links)]
    return list(filter(lambda link: not re.search(regex, link), list_of_links))
list_links = remove_disambiguation_link(list_links)
print(list_links)
[
"/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg",
"/wiki/Taiwanese_tea",
"/wiki/Tung-ting_tea",
"/wiki/Nantou_County",
"/wiki/Taiwan",
"/wiki/Dongfang_Meiren",
"/wiki/Alishan_National_Scenic_Area",
"/wiki/Chiayi_County",
"/wiki/Dayuling",
"/wiki/Baozhong_tea",
"/wiki/Pinglin_Township",
]
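Since the question asks specifically about negative lookahead, here is a minimal sketch of that approach. The pattern and the function name drop_disambiguation are my own assumptions, not taken from the answer above. A negative lookahead is written (?!...) and succeeds only when the enclosed pattern does not match at that position:

```python
import re

# After "/wiki/", require that "(disambiguation)" does NOT appear
# anywhere in the rest of the link.
pattern = re.compile(r"/wiki/(?!.*\(disambiguation\))")

# Hypothetical helper: keep only the links the pattern accepts.
def drop_disambiguation(links):
    return [link for link in links if pattern.match(link)]

links = ['/wiki/Oolong_(disambiguation)', '/wiki/Taiwanese_tea', '/wiki/Taiwan']
print(drop_disambiguation(links))
```

Here the lookahead consumes no characters; it only vetoes the match when the unwanted text follows.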
For your case, the simplest solution would be not to use a regex for this at all.
Just do something like:
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
def findString(string):
    regex1 = r'(/wiki/)(_\($)'
    for x in string:
        if 'disambiguation' in x:
            continue # skip
        y = re.findall(regex1, x)
        print(y)
findString(list_links)
You do not need to use regex. You can iterate through list_links and check whether the string you are looking for, 'disambiguation', is in each item of list_links.
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
to_find = 'disambiguation'
def findString(list_links):
    # iterate over a copy: removing items from the list being looped over would skip elements
    for link in list_links[:]:
        if to_find in link:
            # remove match from list
            list_links.remove(link)
    # print new list without 'disambiguation' items
    print(list_links)
findString(list_links)
Python - Identify certain keywords in a user's input, to then lead to an answer. For example, the user inputs "There is no display on my phone".
The keywords 'display' and 'phone' would link to a set of solutions.
I just need help finding a general idea on how to identify and then lead to a set of solutions. I would appreciate any help.
Use the NLTK library and import its stopwords.
Write code that removes every word of your text that is in the stopword list. What remains is the filtered output.
Also,
make a negative list file containing any other words, apart from the stopwords, that you want to remove, and extend the stopwords with these words before running the code above. That should leave only the words you care about.
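The filtering idea above can be sketched without any downloads. The tiny STOPWORDS and EXTRA_WORDS sets below are hypothetical stand-ins chosen for this example; with NLTK installed and its stopwords corpus downloaded, stopwords.words('english') from nltk.corpus would take the place of STOPWORDS:

```python
import re

# Hypothetical, deliberately tiny stopword list (stand-in for NLTK's list).
STOPWORDS = {"there", "is", "no", "on", "my", "the", "a"}
# Hypothetical "negative list": extra words you also want to drop.
EXTRA_WORDS = {"please"}

def filter_keywords(text, stopwords=STOPWORDS | EXTRA_WORDS):
    # Lowercase the text, pull out alphabetic words, drop the unwanted ones.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

print(filter_keywords("There is no display on my phone"))
```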
A simple way if you don't want to use any external libraries would be the following.
def bool_to_int(flags):
    # treat the list of 0/1 flags as a binary number (index k contributes 2**k)
    num = 0
    for k, v in enumerate(flags):
        if v==1:
            num+=(2**k)
    return num
def take_action(code):
    if code==1:
        pass  # do this
    elif code==2:
        pass  # do this
    ...
keywords = ['display', 'phone']  # ..... add further keywords here
data = "There is no display on my phone"  # the user's input
list_of_words = data.split(" ")
code = [0]*len(keywords)
for i in list_of_words:
    if i in keywords:
        idx = keywords.index(i)
        code[idx]=1
code = bool_to_int(code)
take_action(code)
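Another way to sketch the same lookup is a dictionary keyed by keyword sets, which avoids the integer encoding. The SOLUTIONS table, its texts, and the name find_solutions are hypothetical, made up purely for illustration:

```python
import re

# Hypothetical table: each key is the set of keywords that must ALL be present.
SOLUTIONS = {
    frozenset({"display", "phone"}): "Restart the phone and check the screen brightness.",
    frozenset({"battery"}): "Check the charger and the cable.",
}

def find_solutions(user_input):
    # Split the input into lowercase words, ignoring punctuation.
    words = set(re.findall(r"[a-z']+", user_input.lower()))
    # A solution applies when its keyword set is a subset of the input words.
    return [answer for keys, answer in SOLUTIONS.items() if keys <= words]

print(find_solutions("There is no display on my phone"))
```

An empty list means no rule applied, so the caller can fall back to a default reply.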
I'm trying to print a text while highlighting certain words and word bigrams. This would be fairly straightforward if I didn't also have to print the other tokens, such as punctuation.
I have a list of words to highlight and another list of word bigrams to highlight.
Highlighting individual words is fairly easy, like for example:
import re
import string
regex_pattern = re.compile("([%s \n])" % string.punctuation)
def highlighter(content, terms_to_hightlight):
    tokens = regex_pattern.split(content)
    for token in tokens:
        if token.lower() in terms_to_hightlight:
            print('\x1b[6;30;42m' + token + '\x1b[0m', end="")
        else:
            print(token, end="")
Highlighting only words that appear in sequence is more complex. I have been playing around with iterators but haven't been able to come up with anything that isn't overly complicated.
If I understand the question correctly, one solution is to look ahead to the next word token and check if the bigram is in the list.
import re
import string
regex_pattern = re.compile("([%s \n])" % string.punctuation)
def find_next_word(tokens, idx):
    nonword = string.punctuation + " \n"
    for i in range(idx+1, len(tokens)):
        if tokens[i] not in nonword:
            return (tokens[i], i)
    return (None, -1)
def highlighter(content, terms, bigrams):
    tokens = regex_pattern.split(content)
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        (next_word, nw_idx) = find_next_word(tokens, idx)
        if token.lower() in terms:
            print('*' + token + '*', end="")
            idx += 1
        elif next_word and (token.lower(), next_word.lower()) in bigrams:
            concat = "".join(tokens[idx:nw_idx+1])
            print('-' + concat + '-', end="")
            idx = nw_idx + 1
        else:
            print(token, end="")
            idx += 1
terms = ['man', 'the']
bigrams = [('once', 'upon'), ('i','was')]
text = 'Once upon a time, as I was walking to the city, I met a man. As I was tired, I did not look once... upon this man.'
highlighter(text, terms, bigrams)
When called, this gives :
-Once upon- a time, as -I was- walking to *the* city, I met a *man*. As -I was- tired, I did not look -once... upon- this *man*.
Please note that:
This is a greedy algorithm: it will match the first bigram it finds. For instance, if you check for yellow banana and banana boat, yellow banana boat is always highlighted as -yellow banana- boat. If you want different behavior, you should update the test logic.
You probably also want to update the logic to handle the case where a word is both in terms and the first part of a bigram.
I haven't tested all edge cases; some things may break, and there may be fence-post errors.
You can optimize performance if necessary by:
building a list of the first words of the bigrams and checking whether a word is in it before doing the look-ahead to the next word;
and/or using the result of the look-ahead to handle in one step all the non-word tokens between two words (implementing this step should be enough to ensure linear performance).
Hope this helps.
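As a rough sketch of the first optimization, and of one possible answer to the "word is both a term and the start of a bigram" case, the variant below only looks ahead when the current word can start a bigram, and it prefers the bigram over the single term. That preference, the name highlight, and returning a string instead of printing are my own assumptions:

```python
import re
import string

# Same tokenizer idea as above; re.escape keeps the character class well-formed.
TOKEN_RE = re.compile("([%s \n])" % re.escape(string.punctuation))
NONWORD = set(string.punctuation + " \n")

def highlight(content, terms, bigrams):
    terms = set(terms)
    bigrams = set(bigrams)
    # Words that can start a bigram; only these trigger a look-ahead.
    first_words = {first for first, _ in bigrams}
    tokens = TOKEN_RE.split(content)
    out = []
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        low = token.lower()
        if low in first_words:
            # Index of the next real word (skipping separators and empty tokens).
            nxt = next((i for i in range(idx + 1, len(tokens))
                        if tokens[i] and tokens[i] not in NONWORD), -1)
            if nxt != -1 and (low, tokens[nxt].lower()) in bigrams:
                out.append('-' + ''.join(tokens[idx:nxt + 1]) + '-')
                idx = nxt + 1
                continue
        # Not (the start of) a bigram: fall back to the single-word check.
        out.append('*' + token + '*' if low in terms else token)
        idx += 1
    return ''.join(out)

print(highlight('Once upon a time, I met the man.',
                ['man', 'the'], [('once', 'upon')]))
```

Using sets for terms, bigrams, and first_words keeps each membership test constant-time on average.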
I am simply extracting the top hashtags from Twitter using the tweepy module in Python. There is one major problem I face: I wish to check whether a tag is in English or not. Tags that are not in English should be removed.
example:
tags=['AskOrange','CharlestonShooting','ReplyToASong','UberLIVE','Otecmatkasyn']
should not have Otecmatkasyn.
What you need to use is a language detector API. A good one is the one offered by Google, but it is not free. Another good option is Language Detection API.
After you choose the best API for you, you'd need to parse your text so that it makes sense as a sentence. For example, the tag 'AskOrange' must be split to read 'Ask Orange'. You can iterate over each character of the string, check if it is uppercase and insert a space there:
new_tags = []
for tag in tags:
    new_word = tag
    uppercases = 0 # spaces inserted so far, in case the tag has several uppercase letters
    for i in range(1, len(tag)):
        if tag[i].isupper():
            new_word = new_word[:i+uppercases] + ' ' + new_word[i+uppercases:]
            uppercases = uppercases + 1
    new_tags.append(new_word)
Finally, send your list of new_tags to the API to detect the language.
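The character-by-character split can also be sketched with a single regex. The pattern and the name split_hashtag are assumptions of mine; unlike the loop above, this variant keeps runs of capitals such as LIVE together:

```python
import re

# Hypothetical helper: split a CamelCase hashtag into space-separated words.
def split_hashtag(tag):
    # In order: a run of capitals not followed by a lowercase letter (acronym),
    # a capitalized word, a lowercase word, or a run of digits.
    # Any other characters (e.g. "#" or "_") are simply dropped.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", tag)
    return " ".join(parts)

tags = ['AskOrange', 'CharlestonShooting', 'ReplyToASong', 'UberLIVE', 'Otecmatkasyn']
print([split_hashtag(tag) for tag in tags])
```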
import re
from urllib.request import urlopen
def find_words(each_func):
    i=0
    wordsineach_func=[]
    while len(each_func) >0:
        i=i+1
        word_found=longest_word(each_func)
        if len(word_found)>0:
            wordsineach_func.append(word_found)
            each_func=each_func.replace(word_found,"")
            # print(i, word_found, each)
    return wordsineach_func
def longest_word(phrase):
    # longest prefix of phrase found in the word list, or the whole phrase if none is
    phrase_length=len(phrase)
    words_found=[];index=0
    outerstring=""
    while index < phrase_length:
        outerstring=outerstring+phrase[index]
        index=index+1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) ==0:
        words_found.append(phrase)
    return max(words_found, key=len)
data = urlopen('https://s3.amazonaws.com/hr-testcases/479/assets/words.txt')
words=[]
for line in data:
    # urlopen yields bytes in Python 3, so decode before stripping the newline
    words.append(line.decode("utf-8").replace("\n",""))
string="#honesthournow20"
string=string.replace("#","")
new_words=re.split(r'(\d+)',string)
output=[]
for each in new_words:
    each_words=find_words(each)
    for each_word in each_words:
        output.append(each_word)
print(output)
Then check language.
I am making a simple chat bot in Python. It has a text file with regular expressions which help to generate the output. The user input and the bot output are separated by a | character.
my name is (?P<name>\w*) | Hi {name}!
This works fine for single sets of input and output responses, however I would like the bot to be able to store the regex values the user inputs and then use them again (i.e. give the bot a 'memory'). For example, I would like to have the bot store the value input for 'name', so that I can have this in the rules:
my name is (?P<word>\w*) | You said your name is {name} already!
my name is (?P<name>\w*) | Hi {name}!
Having no value for 'name' yet, the bot will first output 'Hi steve', and once the bot does have this value, the 'word' rule will apply. I'm not sure if this is easily feasible given the way I have structured my program. I have made it so that the text file is made into a dictionary with the key and value separated by the | character, when the user inputs some text, the program compares whether the user input matches the input stored in the dictionary, and prints out the corresponding bot response (there is also an 'else' case if no match is found).
I must need something to happen at the comparing part of the process so that the user's regular expression text is saved and then substituted back into the dictionary somehow. All of my regular expressions have different names associated with them (there are no two instances of 'word', for example...there is 'word', 'word2', etc), I did this as I thought it would make this part of the process easier. I may have structured the thing completely wrong to do this task though.
Edit: code
import re
io = {}
with open("rules.txt") as brain:
    for line in brain:
        key, value = line.split('|')
        io[key] = value
string = str(raw_input('> ')).lower()+' word'
x = 1
while x == 1:
    for regex, output in io.items():
        match = re.match(regex, string)
        if match:
            print(output.format(**match.groupdict()))
            string = str(raw_input('> ')).lower()+' word'
        else:
            print ' Sorry?'
            string = str(raw_input('> ')).lower()+' word'
I had some difficulty understanding the principle of your algorithm because I'm not used to working with named groups.
The following code is the way I would solve your problem; I hope it will give you some ideas.
I think that having only one dictionary isn't a good principle: it complicates both the reasoning and the algorithm. So I based the code on two dictionaries: direg and memory.
The keys of these two dictionaries are group indexes: not every index, only the index of the last group in each individual pattern.
That is because, for the fun of it, I decided that the regexes must be able to have several groups.
What I call individual patterns in my code are the following strings:
"[mM]y name [Ii][sS] (\w*)"
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
You see that the second individual pattern has 2 capturing groups: consequently there are 3 individual patterns, but a total of 4 groups across all the individual patterns.
So creating the dictionaries needs some additional care, because the index of the last matching group (which I get from the lastindex attribute of a regex MatchObject) may not correspond to the position of the individual pattern within the combined regex: it's harder to explain than to understand. That's why the function distr() counts the occurrences of the strings {0} {1} {2} {3} {4} etc. in each reply; their number MUST be the same as the number of groups defined in the corresponding individual pattern.
I found the suggestion of Laurence D'Oliveiro to use '||' instead of '|' as separator interesting.
My code simulates a session in which several inputs are done:
import re
regi = ("[mM]y name [Ii][sS] (\w*)"
"||Hi {0}!"
"||You said that your name was {0} !!!",
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"||OK here's your file {0}\\{1} :"
"||I already gave you the file {0}\\{1} !",
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
"||OK, I will do {0}"
"||You already did {0}. Do yo really want again ?")
direg = {}
memory = {}
def distr(regi,cnt = 0,di = direg,mem = memory,
          regnb = re.compile('{\d+}')):
    for i,el in enumerate(regi,start=1):
        sp = el.split('||')
        cnt += len(regnb.findall(sp[1]))
        di[cnt] = sp[1]
        mem[cnt] = sp[2]
        yield sp[0]
regx = re.compile('|'.join(distr(regi)))
print 'direg :\n',direg
print
print 'memory :\n',memory
for inp in ('I say that my name is Armano the 1st',
            'In repertory ONE I want file SPACE',
            'I want to record music',
            'In repertory ONE I want file SPACE',
            'I say that my name is Armstrong',
            'But my name IS Armstrong now !!!',
            'In repertory TWO I want file EARTH',
            'Now my name is Helena'):
    print '\ninput ==',inp
    mat = regx.search(inp)
    if direg[mat.lastindex]:
        print 'output ==',direg[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        direg[mat.lastindex] = None
        memory[mat.lastindex] = memory[mat.lastindex]\
                                .format(*(d for d in mat.groups() if d))
    else:
        print 'output ==',memory[mat.lastindex]\
              .format(*(d for d in mat.groups() if d))
        if not memory[mat.lastindex].startswith('Sorry'):
            memory[mat.lastindex] = 'Sorry, ' \
                                    + memory[mat.lastindex][0].lower()\
                                    + memory[mat.lastindex][1:]
result
direg :
{1: 'Hi {0}!', 3: "OK here's your file {0}\\{1} :", 4: 'OK, I will do {0}'}
memory :
{1: 'You said that your name was {0} !!!', 3: 'I already gave you the file {0}\\{1} !', 4: 'You already did {0}. Do yo really want again ?'}
input == I say that my name is Armano the 1st
output == Hi Armano!
input == In repertory ONE I want file SPACE
output == OK here's your file ONE\SPACE :
input == I want to record music
output == OK, I will do record music
input == In repertory ONE I want file SPACE
output == I already gave you the file ONE\SPACE !
input == I say that my name is Armstrong
output == You said that your name was Armano !!!
input == But my name IS Armstrong now !!!
output == Sorry, you said that your name was Armano !!!
input == In repertory TWO I want file EARTH
output == Sorry, i already gave you the file ONE\SPACE !
input == Now my name is Helena
output == Sorry, you said that your name was Armano !!!
OK, let me see if I understand this:
You want a dictionary of key-value pairs. This will be the “memory” of the chatbot.
You want to apply regular-expression rules to user input. But which rules might apply is conditional on which keys are already present in the memory dictionary: if “name” is not yet defined, then the rule that defines “name” applies; but if it is, then the rule that mentions “word” applies.
Seems to me you need more information attached to your rules. For example, the “word” rule you gave above shouldn’t actually add “word” to the dictionary, otherwise it would only apply once (imagine if the user keeps trying to say “my name is x” more than twice).
Does that give you a bit more idea about how to proceed?
Oh, by the way, I think “|” is a poor choice for a separator character, because it can occur in regular expressions. Not sure what to suggest: how about “||”?
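To make "more information attached to your rules" a little more concrete, here is a minimal Python 3 sketch. The rule layout (pattern, reply, required memory keys, whether to store the captured groups) and the names RULES and respond are hypothetical, just one way it could be arranged:

```python
import re

# Hypothetical rule format: (pattern, reply, keys required in memory, store groups?)
RULES = [
    (r"my name is (?P<word>\w+)", "You said your name is {name} already!", {"name"}, False),
    (r"my name is (?P<name>\w+)", "Hi {name}!", set(), True),
]

def respond(text, memory):
    for pattern, reply, requires, stores in RULES:
        # Skip rules whose required keys are not in memory yet.
        if not requires.issubset(memory):
            continue
        match = re.match(pattern, text)
        if match:
            if stores:
                # Only rules flagged to store write to the memory.
                memory.update(match.groupdict())
            # Remembered values take precedence over freshly captured ones.
            return reply.format(**{**match.groupdict(), **memory})
    return "Sorry?"

memory = {}
print(respond("my name is steve", memory))
print(respond("my name is bob", memory))
```

Because the “word” rule never stores anything, it keeps applying however many times the user repeats “my name is x”.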