I have a script that takes an argument and tries to find a match using regex. With single values I don't have any issues, but when I pass multiple words, the order matters. What can I do so that the regex matches no matter what order the supplied words are in? Here is my example script:
import re
from sys import argv
data = 'some things other stuff extra words'
pattern = re.compile(argv[1])
search = re.search(pattern, data)
print search
if search:
    print search.group(0)
    print data
So based on my example, if I pass "some things" as an arg, then it matches, but if I pass "things some", it doesn't match, and I would like it to. Optionally, I would like it to also match if either "some" or "things" is found.
The argument passed could possibly be a regex itself.
I think you want something like this:
search = list(filter(None, (re.search(arg, data) for arg in argv[1].split())))
Or
search = re.search('|'.join(argv[1].split()), data)
You can then check the search results: if len(search) == len(argv[1].split()), then all patterns matched, and if search is truthy, then at least one of them matched.
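Spelled out with a fixed word list standing in for argv[1], the two checks could look like this (using a list comprehension rather than filter, so len() works on Python 3 as well):

```python
import re

data = 'some things other stuff extra words'
words = 'things some'.split()  # stands in for argv[1].split()

# run one search per word and keep only the successful matches
matches = [m for m in (re.search(w, data) for w in words) if m]

print(len(matches) == len(words))  # True: every word was found
print(bool(matches))               # True: at least one word was found
```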
OK, I think I got it: you can use lookahead assertions like this:
>>> re.search('(?=.*things)(?=.*some)', data)
You can obviously build such a regex programmatically:
re.search(''.join('(?=.*{})'.format(arg) for arg in argv[1].split()), data)
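With a fixed string standing in for argv[1], the generated pattern and a quick check could look like this:

```python
import re

data = 'some things other stuff extra words'
pattern = ''.join('(?=.*{})'.format(w) for w in 'things some'.split())
print(pattern)                         # (?=.*things)(?=.*some)
print(bool(re.search(pattern, data)))  # True: word order no longer matters
```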
I think it would be better to just create several regexes and match each of them against the string. If any of them matches, you return True.
If you are just trying to match constant strings, the in operator is enough:
'some' in data or 'things' in data
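For plain substrings, any() and all() generalize that check to a whole word list:

```python
data = 'some things other stuff extra words'
words = ['things', 'some']

print(any(w in data for w in words))  # True if at least one word occurs
print(all(w in data for w in words))  # True only if every word occurs
```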
You could also just split the data text into sublists, and check if the ordering/reverse ordering of search exists in it:
import re
data = 'some, things other stuff extra words blah.'
search = "things, some"
def search_text(data, search):
    data_words = re.findall(r'\w+', data)
    # ['some', 'things', 'other', 'stuff', 'extra', 'words', 'blah']
    search_words = re.findall(r'\w+', search)
    # ['things', 'some']
    len_search = len(search_words)
    candidates = [data_words[i:i + len_search]
                  for i in range(len(data_words) - len_search + 1)]
    # [['some', 'things'], ['things', 'other'], ['other', 'stuff'], ['stuff', 'extra'], ['extra', 'words'], ['words', 'blah']]
    return search_words in candidates or search_words[::-1] in candidates
print(search_text(data, search))
Which Outputs:
True
Related
I'm practicing questions from Cracking the Coding Interview to become better and, just in case, be prepared. The first problem states: find out whether a string has all unique characters or not. I wrote this and it works perfectly:
def isunique(string):
    x = []
    for i in string:
        if i in x:
            return False
        else:
            x.append(i)
    return True
Now, my question is, what if I have all unique characters like in:
'I am J'
which would be pretty rare, but let's say it occurs by mere chance. How can I create an exception for the spaces, in a way that it doesn't count the space as a character, so the function returns True and not False?
Now, no matter how many spaces or special characters your string contains, it will only count word characters:
import re
def isunique(string):
    pattern = r'\w'
    search = re.findall(pattern, string)
    x = []
    for i in search:
        if i in x:
            return False
        else:
            x.append(i)
    return True
print(isunique('I am J'))
output:
True
test case without spaces:
print(isunique('war'))
True
test case with spaces:
print(isunique('w a r'))
True
repeated letters:
print(isunique('warrior'))
False
Create a list of characters you want to treat as non-characters and remove them from the string, then run your function code.
As an alternative, a better approach to checking the uniqueness of characters is to compare the length of the final string with the length of its set:
def isunique(my_string):
    nonchars = [' ', '.', ',']
    for nonchar in nonchars:
        my_string = my_string.replace(nonchar, '')
    return len(set(my_string)) == len(my_string)
Sample Run:
>>> isunique( 'I am J' )
True
As per Python's set() documentation:
Return a new set object, optionally with elements taken from iterable.
set is a built-in class. See set and Set Types — set, frozenset for
documentation about this class.
And... a pool of answers is never complete unless there is also a regex solution:
import re

def is_unique(string):
    patt = re.compile(r"^.*?(.).*?(\1).*$")
    return not re.search(patt, string)
(I'll leave the whitespace handling as an exercise to the OP)
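One way to do that exercise, assuming "whitespace" means anything \s matches, is to strip it out before searching for a repeated character:

```python
import re

def is_unique(string):
    # drop all whitespace, then look for any character that occurs twice
    stripped = re.sub(r'\s+', '', string)
    return not re.search(r'(.).*\1', stripped)

print(is_unique('I am J'))   # True: the spaces are ignored
print(is_unique('warrior'))  # False: 'r' repeats
```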
An elegant approach (YMMV), with collections.Counter.
from collections import Counter

def isunique(string):
    return Counter(string.replace(' ', '')).most_common(1)[0][-1] == 1
Alternatively, if your strings contain more than just spaces (tabs and newlines, for instance), I'd recommend regex-based substitution:
import re
string = re.sub(r'\s+', '', string, flags=re.M)
Simple solution
def isunique(string):
    return all(string.count(i) == 1 for i in string if i != ' ')
Normally we would write the following to replace one match:
namesRegex = re.compile(r'(is)|(life)', re.I)
replaced = namesRegex.sub(r"butter", "There is no life in the void.")
print(replaced)
output:
There butter no butter in the void.
What I want is to replace each group with a specific text, probably using back references. Namely, I want to replace the first group (is) with "are" and the second group (life) with "butterflies".
Maybe something like this, but the following is not working code:
namesRegex = re.compile(r'(is)|(life)', re.I)
replaced = namesRegex.sub(r"(are) (butterflies)", r"\1 \2", "There is no life in the void.")
print(replaced)
Is there a way to replace multiple groups in one statement in python?
You can use a replacement by lambda, mapping the keywords you want to associate:
>>> re.sub(r'(is)|(life)', lambda x: {'is': 'are', 'life': 'butterflies'}[x.group(0)], "There is no life in the void.")
'There are no butterflies in the void.'
You can define a map of keys and replacements first and then use a lambda function in replacement:
>>> repl = {'is': 'are', 'life': 'butterflies'}
>>> print re.sub(r'is|life', lambda m: repl[m.group()], "There is no life in the void.")
There are no butterflies in the void.
I would also suggest using word boundaries around your keys to safeguard your search patterns:
>>> print re.sub(r'\b(?:is|life)\b', lambda m: repl[m.group()], "There is no life in the void.")
There are no butterflies in the void.
You may use a dictionary with search-replacement values and use a simple \w+ regex to match words:
import re
dt = {'is' : 'are', 'life' : 'butterflies'}
namesRegex = re.compile(r'\w+')
replaced = namesRegex.sub(lambda m: dt[m.group()] if m.group() in dt else m.group(), "There is no life in the void.")
print(replaced)
With this approach, you do not have to worry about creating a too large regex pattern based on alternation. You may adjust the pattern to include word boundaries, or only match letters (e.g. [\W\d_]+), etc. as per the requirements. The main point is that the pattern should match all the search terms that are keys in the dictionary.
The if m.group() in dt else m.group() part is checking if the found match is present as a key in the dictionary, and if it is not, just returns the match back. Else, the value from the dictionary is returned.
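The same fallback can be written a bit more compactly with dict.get, which returns the match itself whenever the word is not a key in the dictionary:

```python
import re

dt = {'is': 'are', 'life': 'butterflies'}
replaced = re.sub(r'\w+', lambda m: dt.get(m.group(), m.group()),
                  "There is no life in the void.")
print(replaced)  # There are no butterflies in the void.
```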
If you want just to replace specific words, go no further than str.replace().
s = "There is no life in the void."
s.replace('is', 'are').replace('life', 'butterflies') # => 'There are no butterflies in the void.'
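One caveat with this approach: str.replace() substitutes substrings, not whole words, so an "is" hidden inside another word gets replaced as well:

```python
s = "This is no life in the void."
print(s.replace('is', 'are'))  # "Thare are no life in the void." -- 'This' was hit too
```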
Note, this is not a duplicate of this question:
How to check if a string contains an element from a list in Python
However, it's based on its logic.
I have the following string:
my_string = 'this is my complex description'
I have a list of keywords:
keywords = ['my', 'desc', 'complex']
I know I can use the following to check if the keywords exist:
if any(ext in my_string for ext in keywords):
    print(my_string)
But I'd like to show which keywords actually match the description. I know I can loop through the keywords and check each one individually, but is it possible in a single statement?
It doesn't matter which version of Python the solution uses.
If you want to match complete words, you could use set intersection:
>>> my_string = 'this is my complex description'
>>> keywords = ['my', 'desc', 'complex']
>>> set(my_string.split()) & set(keywords)
{'complex', 'my'}
>>> my_string = 'this is my complex description'
>>> keywords = ['my', 'desc', 'complex']
>>> print(*[c for c in my_string.split() if c in keywords])
my complex
Note that this only works, to my knowledge, on Python 3.x (I am not too sure how it would behave in Python 2).
If you're confused about what it is doing: the * unpacks the list built by the comprehension, which keeps only the words of my_string that also appear in keywords, and passes them as separate arguments to print. In Python 3, separate args to print are printed with spaces in between them.
found_words = [ word for word in keywords if word in my_string ]
This will give you a list of the keywords that are found in my_string. Performance will be better if you make keywords a set, though:
keywords = set(['my', 'desc', 'complex'])
found_words = [ word for word in my_string.split() if word in keywords ]
But the latter relies on the fact that my_string doesn't separate words with anything other than whitespace.
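Putting the set variant together with the sample data:

```python
my_string = 'this is my complex description'
keywords = {'my', 'desc', 'complex'}

found_words = [word for word in my_string.split() if word in keywords]
print(found_words)  # ['my', 'complex'] -- 'desc' is not a whole word in the string
```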
Let's say I have a string that looks like this:
myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
What I would like to obtain in the end would be:
myStr_l1 = '(Txt_l1) or (Txt2_l1)'
and
myStr_l2 = '(Txt_l2) or (Txt2_l2)'
Some properties:
all "Txt_"-elements of the string start with an uppercase letter
the string can contain much more elements (so there could also be Txt3, Txt4,...)
the suffixes '_l1' and '_l2' look different in reality; they cannot be used for matching (I chose them for demonstration purposes)
I found a way to get the first part done by using:
myStr_l1 = re.sub(r'\(\w+\)', '', myStr)
which gives me
'(Txt_l1 ) or (Txt2_l1 )'
However, I don't know how to obtain myStr_l2. My idea was to remove everything between two open parentheses. But when I do something like this:
re.sub(r'\(w+\(', '', myStr)
the entire string is returned.
re.sub(r'\(.*\(', '', myStr)
removes - of course - far too much and gives me
'Txt2_l2))'
Does anyone have an idea how to get myStr_l2?
When there is an "and" instead of an "or", the strings look slightly different:
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'
Then I can still use the command from above:
re.sub(r'\(\w+\)', '', myStr2)
which gives:
'(Txt_l1 and Txt2_l1 )'
but I again fail to get myStr2_l2. How would I do this for these kind of strings?
And how would one then do this for mixed expressions with "and" and "or" e.g. like this:
myStr3 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2)) or (Txt3_l1 (Txt3_l2) and Txt4_l1 (Txt2_l2))'
re.sub(r'\(\w+\)', '', myStr3)
gives me
'(Txt_l1 and Txt2_l1 ) or (Txt3_l1 and Txt4_l1 )'
but again: How would I obtain myStr3_l2?
Regexp is not powerful enough for nested expressions (in your case: nested elements in parentheses). You will have to write a parser. Look at https://pyparsing.wikispaces.com/
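That said, for this specific one-level nesting, a sketch built on [^()] (any character except a parenthesis) can pull the two levels apart; a real parser remains the robust route for deeper nesting:

```python
import re

myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'

# level 1: drop every innermost "(...)" group (one containing no parens)
myStr_l1 = re.sub(r'\s*\([^()]*\)', '', myStr)
print(myStr_l1)  # (Txt_l1) or (Txt2_l1)

# level 2: collect the contents of those innermost groups instead
inner = re.findall(r'\(([^()]*)\)', myStr)
myStr_l2 = ' or '.join('({})'.format(t) for t in inner)
print(myStr_l2)  # (Txt_l2) or (Txt2_l2)
```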
I'm not entirely sure what you want, but I wrote this to strip everything between the parentheses.
import re
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
sets = mystr.split(' or ')
noParens = []
for line in sets:
    mat = re.match(r'\((.* )\((.*\)\))', line, re.M)
    if mat:
        noParens.append(mat.group(1))
        noParens.append(mat.group(2).replace(')', ''))
print(noParens)
This takes all the parentheses away and puts your elements in a list. Here's an alternate way of doing it without using regular expressions:
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
noParens = []
mystr = mystr.replace(' or ', ' ')
mystr = mystr.replace(')','')
mystr = mystr.replace('(','')
noParens = mystr.split()
print(noParens)
I'm trying to parse the following string:
constructor: function(some, parameters, here) {
With the following regex:
re.search(r"(\w*):\s*function\((?:(\w*)(?:,\s)*)*\)", line).groups()
And I'm getting:
('constructor', '')
But I was expecting something more like:
('constructor', 'some', 'parameters', 'here')
What am I missing?
If you change your pattern to:
print re.search(r"(\w*):\s*function\((?:(\w+)(?:,\s)?)*\)", line).groups()
You'll get:
('constructor', 'here')
This is because (from docs):
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
If you can do this in one step, I don't know how. Your alternative, of course is to do something like:
def parse_line(line):
    cons, args = re.search(r'(\w*):\s*function\((.*)\)', line).groups()
    mats = re.findall(r'(\w+)(?:,\s*)?', args)
    return [cons] + mats
print parse_line(line) # ['constructor', 'some', 'parameters', 'here']
One option is to use more advanced regex instead of the stock re. Among other nice things, it supports captures, which, unlike groups, save every matching substring:
>>> line = "constructor: function(some, parameters, here) {"
>>> import regex
>>> regex.search(r"(\w*):\s*function\((?:(\w+)(?:,\s)*)*\)", line).captures(2)
['some', 'parameters', 'here']
The re module doesn't support repeated captures: the group count is fixed. Possible workarounds include:
1) Capture the parameters as a string and then split it:
match = re.search(r"(\w*):\s*function\(([\w\s,]*)\)", line).groups()
args = [arg.strip() for arg in match[1].split(",")]
2) Capture the parameters as a string and then findall it:
match = re.search(r"(\w*):\s*function\(([\w\s,]*)\)", line).groups()
args = re.findall(r"(\w+)(?:,\s)*", match[1])
3) If your input string has already been verified, you can just findall the whole thing:
re.findall(r"(\w+)[:,)]", string)
Alternatively, you can use the regex module and captures(), as suggested by @georg.
You might need two operations here (search and findall):
[re.search(r'[^:]+', given_string).group()] + re.findall(r'(?<=[ (])\w+?(?=[,)])', given_string)
Output: ['constructor', 'some', 'parameters', 'here']