Python regex preference if multiple matches


I am searching for city names in a string:
import re
mystring = r'SDM\Austin'
city_search = r'(SD|Austin)'
mo_city = re.search(city_search, mystring, re.IGNORECASE)
city = mo_city.group(1)
print(city)
This will return city as 'SD'.
Is there a way to make 'Austin' the preference?
Switching the order to (Austin|SD) doesn't work.
The answer is the same as How can I find all matches to a regular expression in Python?, but the use case is a little different since one match is preferred.

You're using re.search, which stops at the first match. Use re.findall instead, which returns a list of all matches.
So if you modify your code to:
import re
mystring = r'SDM\Austin'
city_search = r'(SD|Austin)'
mo_city = re.findall(city_search, mystring, re.IGNORECASE)
city = mo_city[1]
print(city)
it will work fine, outputting:
Austin
So, mo_city is a list: ['SD', 'Austin'] and since we want to assign the second element (Austin) to city, we take index 1 with mo_city[1].
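Indexing with mo_city[1] relies on 'Austin' happening to be the second match in this particular string. A minimal sketch that makes the preference explicit instead, assuming a hypothetical priority list `preferred` and helper `pick_city` (neither is in the original question):

```python
import re

mystring = r'SDM\Austin'
# Hypothetical priority order: earlier entries win over later ones.
preferred = ['Austin', 'SD']

def pick_city(text, candidates):
    """Return the first candidate (in priority order) found in text, else None."""
    for name in candidates:
        # re.escape guards against regex metacharacters in a city name.
        if re.search(re.escape(name), text, re.IGNORECASE):
            return name
    return None

city = pick_city(mystring, preferred)
print(city)
```

This searches once per candidate rather than once overall, which should be fine for a short list of city names.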

Brief
You already have a great answer here (using findall instead of search with regex). This is another alternative (without using regex) that checks a string against a list of strings and returns matches. Based on the sample code you provided, this should work for you and is probably easier than the regex method.
Code
words = ['SD', 'Austin']  # renamed from "list" to avoid shadowing the builtin
s = r'SDM\Austin'
for w in words:
    if w in s:
        print('"{}" exists in "{}"'.format(w, s))

Related

Find the word from the list given and replace the words so found

My question is pretty simple, but I haven't been able to find a proper solution.
Given below is my program:
given_list = ["Terms","I","want","to","remove","from","input_string"]
input_string = input("Enter String:")
if any(x in input_string for x in given_list):
    # Find the detected word
    # Not in bool format
    a = input_string.replace(detected_word,"")
    print("Some Task",a)
Here, given_list contains the terms I want to exclude from the input_string.
Now, the problem I am facing is that any() produces a bool result, but I need the word that any() detected so that I can replace it with a blank and then perform some task.
Edit: any() function is not required at all, look for useful solutions below.

Iterate over given_list and replace them:
for i in given_list:
    input_string = input_string.replace(i, "")
print("Some Task", input_string)

No need to detect at all:
for w in given_list:
    input_string = input_string.replace(w, "")
str.replace will not do anything if the word is not there and the substring test needed for the detection has to scan the string anyway.

The problem with finding each word and replacing it is that Python will have to iterate over the whole string repeatedly. Another problem is that you will match substrings where you don't want to. For example, "to" is in the exclude list, so you'd end up changing "tomato" to "ma".
It seems you want to replace whole words. Parsing is a whole new subject, but let's simplify. I'm just going to assume everything is lowercase with no punctuation, although that can be improved later. Let's use input_string.split() to iterate over whole words.
We want to replace some words with nothing, so let's iterate over the words of input_string and filter out the ones we don't want, using the builtin function filter.
exclude_list = ["terms","i","want","to","remove","from","input_string"]
input_string = "one terms two i three want to remove"
keepers = filter(lambda w: w not in exclude_list, input_string.lower().split())
output_string = ' '.join(keepers)
print(output_string)
one two three
Note that we create an iterator that allows us to go through the whole input string just once. And instead of replacing words, we just basically skip the ones we don't want by having the iterator not return them.
Since filter requires a function for the boolean check on whether to include or exclude each word, we had to define one. I used "lambda" syntax to do that. You could just replace it with
def keep(word):
    return word not in exclude_list
keepers = filter(keep, input_string.lower().split())
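The split-and-filter approach above assumes lowercase input without punctuation. As a hedged alternative sketch, whole words can also be removed with a regex word boundary (\b), which leaves "tomato" alone and keeps the original case and punctuation of the remaining text. The sample sentence and the variable names `alternatives`, `pattern`, and `without_words` are assumptions for illustration:

```python
import re

exclude_list = ["terms", "i", "want", "to", "remove", "from", "input_string"]
input_string = "I want one tomato, to go"

# Escape each term, then join them into one alternation wrapped in word boundaries.
alternatives = "|".join(re.escape(w) for w in exclude_list)
pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

without_words = pattern.sub("", input_string)
# Collapse the whitespace left behind by the removed words.
output_string = " ".join(without_words.split())
print(output_string)
```

Here "to" is removed as a standalone word, but the "to" inside "tomato" is not, because no word boundary separates it from the neighbouring letters.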

To answer your question about any, use an assignment expression (Python 3.8+).
if any((word := x) in input_string for x in given_list):
    print(word)  # match captured in variable word
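A small runnable sketch of that idea, with a fixed string standing in for input() (an assumption, so the example can run unattended). It also shows next() with a default, which gives the first detected word without an assignment expression; the name `detected` is hypothetical:

```python
given_list = ["Terms", "I", "want", "to", "remove", "from", "input_string"]
input_string = "Please remove this"  # stand-in for input("Enter String:")

# Option 1: assignment expression inside any(); `word` stays bound after the call
# because any() stops at the first item for which the test is true.
if any((word := x) in input_string for x in given_list):
    print("Some Task", input_string.replace(word, ""))

# Option 2: next() with a default returns the first hit, or None if there is none.
detected = next((x for x in given_list if x in input_string), None)
print(detected)
```

Both options report only the first word found; to strip every listed word, the plain replace loop from the other answers is simpler.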

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
    g=""
    for j in range(len(i)):
        if i[j] == ':':
            index=j+1
    res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?

Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea if you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x) and splits it on the : character. The split returns a list containing the prefix (if it exists) and the suffix, and the comprehension builds a new list with only the suffix (i.e. item [-1] in the list that results from the split). In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']

Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
    s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)
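The two split variants seen in these answers differ when a string contains more than one colon. A brief sketch of the difference, using made-up sample strings; it also shows slice assignment as one way to update the existing list in place, which is an assumption about what "without creating an extra list" was meant to cover:

```python
ll = ["pp17:gaurav", "perfect", "a:b:c"]

# Everything after the FIRST colon (maxsplit=1).
after_first = [s.split(":", 1)[-1] for s in ll]
# Everything after the LAST colon (no maxsplit).
after_last = [s.split(":")[-1] for s in ll]

# Slice assignment rewrites the existing list object instead of binding a new name.
# The comprehension on the right still builds a temporary list first.
ll[:] = after_first
print(after_first, after_last)
```

Strings without a colon pass through unchanged in both variants, since split() then returns a one-element list.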

Matching if any keyword from a list is present in a string

I have a list of keywords. A sample is:
['IO', 'IO Combination','CPI Combos']
Now what I am trying to do is see if any of these keywords is present in a string. For example, if my string is: there is a IO competition coming in Summer 2018. So for this example since it contains IO, it should identify that but if the string is there is a competition coming in Summer 2018 then it should not identify any keywords.
I wrote this Python code but it also identifies IO in competition:
if any(word.lower() in string_1.lower() for word in keyword_list):
    print('FOUND A KEYWORD IN STRING')
I also want to identify which keyword was identified in the string (if any present). What is the issue in my code and how can I make sure that it matches only complete words?

Regex solution
You'll need to implement word boundaries here:
import re
keywords = ['IO', 'IO Combination','CPI Combos']
words_flat = "|".join(r'\b{}\b'.format(word) for word in keywords)
rx = re.compile(words_flat)
string = "there is a IO competition coming in Summer 2018"
match = rx.search(string)
if match:
    print("Found: {}".format(match.group(0)))
else:
    print("Not found")
Here, your list is joined with | and \b on both sides.
Afterwards, you can search with rx.search(), which finds IO in this example, so the script prints "Found: IO".
Even shorter with a direct comprehension:
rx = re.compile("|".join(r'\b{}\b'.format(word) for word in keywords))
Non-regex solution
Please note that you can even use a non-regex solution for single words, you just have to reorder your comprehension and use split() like
found = any(word in keywords for word in string.split())
if found:
    pass  # do sth. here
Notes
The latter has the drawback that strings like
there is a IO. competition coming in Summer 2018
# ---^---
won't work, because split() leaves the period attached to the word (IO.), while the regex solution still treats IO as a word there (hence the two approaches yield different results). Additionally, because of the split() function, multi-word phrases like CPI Combos cannot be found. The regex solution can also match regardless of case (just pass flags=re.IGNORECASE to re.compile()).
It really depends on your actual requirements.
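One thing to watch with the joined pattern: alternation is tried left to right, so with 'IO' listed before 'IO Combination' the shorter keyword is reported even when the longer phrase is present. A hedged sketch that sorts the keywords longest first, escapes them, and returns every keyword found; the helper name `find_keywords` and the sample sentence are assumptions:

```python
import re

keywords = ['IO', 'IO Combination', 'CPI Combos']

# Longest first, so 'IO Combination' is tried before 'IO' at the same position.
ordered = sorted(keywords, key=len, reverse=True)
rx = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in ordered) + r')\b',
                re.IGNORECASE)

def find_keywords(text):
    """Return every keyword occurrence in text, exactly as it appears there."""
    return rx.findall(text)

print(find_keywords("An IO Combination beats the io competition"))
```

Because the group is non-capturing, findall() returns the full matched text of each hit rather than a group.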

for index,key in enumerate(mylist):
    if key.find(mystring) != -1:
        return index
It loops over your list and, for every item, checks whether your string is contained in that item. find() returns -1 when the string is not found, so any other result means the item contains it; in that case you get the index of that item with the help of enumerate(). The return statement assumes the loop sits inside a function.
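For reference, the same loop as a self-contained function. The function name `first_index_containing` is hypothetical, and returning -1 when nothing matches is an assumption, since the snippet above does not say what should happen in that case:

```python
def first_index_containing(mylist, mystring):
    """Return the index of the first item that contains mystring, or -1 if none does."""
    # `in` performs the same substring test as `find(...) != -1`.
    return next((i for i, item in enumerate(mylist) if mystring in item), -1)

print(first_index_containing(["abc", "xIOy", "IO"], "IO"))
```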

How do you effectively use regular expressions to find alliterative expressions?

I have an assignment that requires me to use regular expressions in python to find alliterative expressions in a file that consists of a list of names. Here are the specific instructions:
" Open a file and return all of the alliterative names in the file.
For our purposes a "name" is two sequences of letters separated by a space, with capital letters only in the leading positions. We call a name alliterative if the first and last names begin with the same letter, with the exception that s and sh are considered distinct, and likewise for c/ch and t/th. The names file will contain a list of strings separated by commas. Suggestion: Do this in two stages."
This is my attempt so far:
def check(regex, string, flags=0):
    return not (re.match("(?:" + regex + r")\Z", string, flags=flags)) is None
def alliterative(names_file):
    f = open(names_file)
    string = f.read()
    lst = string.split(',')
    lst2 = []
    for i in lst:
        x=lst[i]
        if re.search(r'[A-Z][a-z]* [A-Z][a-z]*', x):
            k=x.split(' ')
            if check('{}'.format(k[0][0]), k[1]):
                if not check('[cst]', k[0][0]):
                    lst2.append(x)
                elif len(k[0])==1:
                    if len(k[1])==1:
                        lst2.append(x)
                    elif not check('h',k[1][1]):
                        lst2.append(x)
                elif len(k[1])==1:
                    if not check('h',k[0][1]):
                        lst2.append(x)
    return lst2
There are two issues that I have. First, what I coded seems to make sense to me. The general idea is that I first check that the names are in the correct format (first name, last name, letters only, only the first letters of the first and last names capitalized), then check whether the starting letters of the first and last names match, then check whether those first letters are c, s, or t. If they aren't, we add the name to the new list; if they are, we check that we aren't accidentally matching a [cst] with a [cst]h. The code compiles, but when I tried to run it on this list of names:
Umesh Vazirani, Vijay Vazirani, Barbara Liskov, Leslie Lamport, Scott Shenker, R2D2 Rover, Shaq, Sam Spade, Thomas Thing
it returns an empty list instead of ["Vijay Vazirani", "Leslie Lamport", "Sam Spade", "Thomas Thing"], which it is supposed to return. I added print statements to alliterative to see where things were going wrong, and it seems that the line
if check('{}'.format(k[0][0]), k[1]):
is an issue.
More than the issues with my program though, I feel like I am missing the point of regular expressions: am I overcomplicating this? Is there a nicer way to do this with regular expressions?

Please consider improving your question.
As written, it is only useful to someone with exactly the same problem, which is unlikely.
Please think about how to generalize it so that this Q&A can be helpful to others.
I think your direction is about right.
It's a good idea to validate the input using a regular expression. r'[A-Z][a-z]* [A-Z][a-z]*' is a good expression.
You can group parts of the pattern with parentheses, so that you can easily get the first and last name later on.
Keep in mind the difference between re.match and re.search: re.search looks for a match anywhere in the string, so re.search(r'[A-Z][a-z]* [A-Z][a-z]*', 'aaRob Smith') still returns a match object, whereas re.match only matches at the start of the string.
A comment on general programming style:
It's better to name variables first and last for readability, rather than k[0] and k[1] (and how was the letter k picked!?).
Here's one way to do it:
import re
FULL_NAME_RE = re.compile(r'^([A-Z][a-z]*) ([A-Z][a-z]*)$')
def is_alliterative(name):
    """Returns True if it matches the alliterative requirement otherwise False"""
    # Reject anything that does not match the name requirement
    match = FULL_NAME_RE.match(name)
    if not match:
        return False
    first, last = match.group(1, 2)
    first, last = first.lower(), last.lower()  # easier to compare in lower case
    if first[0] != last[0]:
        return False
    if first[0] in 'cst':  # Check sh/ch/th
        # Both names must agree on whether an 'h' follows the initial
        return _is_cst_h(first) == _is_cst_h(last)
    # All checks passed!
    return True
def _is_cst_h(text):
    """Returns True if the second letter of text is 'h' (as in 'ch', 'sh', 'th')."""
    # Assumes the caller has already checked that the first letter is c, s, or t
    return text[1:].startswith('h')
names = [
    'Umesh Vazirani', 'Vijay Vazirani', 'Barbara Liskov',
    'Leslie Lamport', 'Scott Shenker', 'R2D2 Rover', 'Shaq', 'Sam Spade', 'Thomas Thing'
]
print([name for name in names if is_alliterative(name)])
# Ans
# ['Vijay Vazirani', 'Leslie Lamport', 'Sam Spade', 'Thomas Thing']

Try this regular expression:
[a[0] for a in re.findall('((?P<caps>[A-Z])[a-z]*\\s(?P=caps)[a-z]*)', names)]
Note: Here names is the comma-separated string read from the file. The pattern does not handle the sh/ch/th special case.
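A possible regex-only variant that does cover sh/ch/th, sketched under the assumption that the rule applies only to the initials C, S, and T. It uses one alternation branch per case with a backreference to the captured initial; the name `ALLITERATIVE_RE` is made up for this example:

```python
import re

ALLITERATIVE_RE = re.compile(
    r'([CST])h[a-z]* \1h[a-z]*'            # Ch/Sh/Th paired with the same digraph
    r'|([CST])(?!h)[a-z]* \2(?!h)[a-z]*'   # plain C/S/T paired with plain C/S/T
    r'|([ABD-RU-Z])[a-z]* \3[a-z]*'        # any other shared initial
)

names = ("Umesh Vazirani, Vijay Vazirani, Barbara Liskov, Leslie Lamport, "
         "Scott Shenker, R2D2 Rover, Shaq, Sam Spade, Thomas Thing")

# Stage 1: split on commas. Stage 2: keep names the pattern matches in full.
result = [n.strip() for n in names.split(',')
          if ALLITERATIVE_RE.fullmatch(n.strip())]
print(result)
```

Using fullmatch() rejects entries such as "R2D2 Rover" and "Shaq" without needing explicit ^ and $ anchors.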

replace multiple words - python

There can be an input "some word".
I want to replace this input with "<strong>some</strong> <strong>word</strong>" in some other text which contains this input
I am trying with this code:
input = "some word".split()
pattern = re.compile('(%s)' % input, re.IGNORECASE)
result = pattern.sub(r'<strong>\1</strong>',text)
but it is failing, and I know why. I am wondering how to pass all the elements of the list input to compile() so that (%s) can catch each of them.
I'd appreciate any help.

The right approach, since you're already splitting the string into a list, is to surround each item of the list directly (never using a regex at all):
sterm = "some word".split()
result = " ".join("<strong>%s</strong>" % w for w in sterm)
In case you're wondering, the pattern you were looking for was:
pattern = re.compile('(%s)' % '|'.join(sterm), re.IGNORECASE)
This works on your string because the regular expression would become
(some|word)
which means "matches some or matches word".
However, this is not a good approach, as it does not work for all strings. For example, consider cases where one search word is contained in another word, such as when the search terms and the text are both
a banana and an apple
which becomes:
<strong>a</strong> <strong>banana</strong> <strong>a</strong>nd <strong>a</strong>n <strong>a</strong>pple
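If a regex is still wanted, word boundaries avoid the partial highlights shown above. A minimal sketch, assuming the search terms are "a banana" and reusing the name sterm from the answer; the variable `alternatives` is hypothetical:

```python
import re

text = "a banana and an apple"
sterm = "a banana".split()  # assumed search terms for this example

# Escape each term and try longer terms first, then require word boundaries.
alternatives = '|'.join(re.escape(w) for w in sorted(sterm, key=len, reverse=True))
pattern = re.compile(r'\b(' + alternatives + r')\b', re.IGNORECASE)

result = pattern.sub(r'<strong>\1</strong>', text)
print(result)
```

With \b on both sides, the "a" inside "and", "an", and "apple" is no longer wrapped.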

It looks like you're wanting to search for multiple words - this word or that word. Which means you need to separate your searches by |, like the script below:
import re
text = "some word many other words"
input = '|'.join('some word'.split())
pattern = re.compile('(%s)' % input, flags=0)
print(pattern.sub(r'<strong>\1</strong>',text))

I'm not completely sure I know what you're asking, but if you want to pass all the elements of input as separate arguments in a function call, you can use *input instead of input; * unpacks the list into its elements. Note, though, that re.compile() takes a single pattern string (plus optional flags), so unpacking does not help here. As an alternative, couldn't you just try joining the list with | and adding ( at the beginning and ) at the end?

Alternatively, you can use str.join with a list comprehension to create the intended result.
text = "some word many other words".split()
result = ' '.join(['<strong>'+i+'</strong>' for i in text])

We Keep Coding
