I've been using word boundaries to look up exact words in a string but just learned that they ignore non-word characters.
So when I look for "height" in the string "height (in stories)" I get the results I expect:
p = re.compile(r'\b%s\b' % 'height')
p.search('height (in stories)') # match
But when I look for "height (in stories)" in the string "height (in stories)" I don't get a match:
p = re.compile(r'\b%s\b' % 'height (in stories)')
p.search('height (in stories)') # no match
How can I get the parentheses recognized?
There are two problems.
First, you need to use re.escape(text) to build a regular expression that matches the string text literally.
Second, there's no word boundary between a closing parenthesis and the end of the string, so the trailing \b can never match there.
p = re.compile(r'\bheight \(in stories\)')
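A minimal sketch that combines both points, assuming the goal is "the phrase, not embedded in a longer word". The helper name find_exact is hypothetical, and the lookarounds (?<!\w) and (?!\w) are swapped in for \b here because they also succeed next to a non-word character such as a parenthesis:

```python
import re

def find_exact(phrase, text):
    """Search for phrase as a literal, not preceded or followed by a word character."""
    # re.escape makes every metacharacter in phrase literal, e.g. '(' becomes '\('.
    # (?<!\w) and (?!\w) replace \b: they only require that no word character
    # is adjacent, so they also hold between ')' and the end of the string.
    pattern = re.compile(r'(?<!\w)' + re.escape(phrase) + r'(?!\w)')
    return pattern.search(text)

print(find_exact('height', 'height (in stories)'))               # a match object
print(find_exact('height (in stories)', 'height (in stories)'))  # a match object
print(find_exact('height', 'heights (in stories)'))              # None
```

Unlike \b, these lookarounds do not require a word character on the inside edge of the phrase, which is what makes phrases that start or end with punctuation work.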
When using regular expressions, you should read their syntax.
The ( and ) characters have a special meaning in regular expressions; if you want to match them literally, you need to escape them, just like the dot (.).
Consider using re.escape.
I am trying to match strings against a wildcard pattern.
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern; it is a convention.
So, if there are patterns like:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' is followed by exactly one word
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (characters a-zA-Z0-9_). Can anyone please help me do this in Python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
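import sre_parse  # undocumented module; deprecated since Python 3.11

def convert_to_regexp(pattern):
    # Escape every regex metacharacter except '*', which has its own meaning here
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern])
    # Replace the wildcard with "one or more word characters"
    return safe_pattern.replace('*', '\\w+')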
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend to use the output pattern with re.search or re.findall, you might want to wrap it in \b word-boundary anchors.
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
Plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth.
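The conversion rules above can be sketched as follows. This is one possible implementation, not the only one: the function names wildcard_to_regex and wildcard_match are hypothetical, re.escape handles the literal parts, and re.fullmatch stands in for the explicit ^ and $ anchors:

```python
import re

def wildcard_to_regex(pattern):
    """Turn a '*' wildcard pattern into a compiled regex where '*' is one word."""
    # Split on the wildcard, escape each literal piece, and rejoin with \w+.
    literal_parts = pattern.split('*')
    return re.compile(r'\w+'.join(re.escape(part) for part in literal_parts))

def wildcard_match(pattern, text):
    """Return True only if the whole of text matches the wildcard pattern."""
    # fullmatch requires the entire string to match, like ^...$ would.
    return wildcard_to_regex(pattern).fullmatch(text) is not None

print(wildcard_match('*.key', 'door.key'))               # True
print(wildcard_match('*.key', 'brown.door.key'))         # False
print(wildcard_match('*.key.*', 'brown.key.door'))       # True
print(wildcard_match('*.key.*', 'brown.iron.key.door'))  # False
```

Because \w does not match a dot, each * can only consume a single dot-free word, which is what rejects "brown.door.key".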
^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s matches a whitespace character.
\S matches a non-whitespace character.
+ means one or more repetitions of the preceding element.
Now, if a single word means a run of characters containing no whitespace, the required regular expression is:
^\S+$
Note that \S also matches a dot, so on its own this does not enforce the question's rule that a word contains no dots.
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods" (including none), would allow whitespace characters (which of course aren't periods), so using \w+, "one or more 'word characters'", like the other answers is better.
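To make the difference concrete, here is a small comparison of the two approaches. The variable names and the sample strings are made up for illustration:

```python
import re

# Permissive: any run (possibly empty) of characters that are not periods.
permissive = re.compile(r'^[^.]*\.key$')
# Strict: one or more word characters (letters, digits, underscore).
strict = re.compile(r'^\w+\.key$')

for candidate in ['door.key', 'front door.key', '.key', 'brown.door.key']:
    print(candidate,
          bool(permissive.match(candidate)),
          bool(strict.match(candidate)))
```

Both patterns accept 'door.key' and reject 'brown.door.key', but only the permissive one accepts 'front door.key' (whitespace) and '.key' (an empty word).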
For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find substrings of the form "#..'...'", such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How should I modify the regex to find each substring separately?
Make the inner groups non-capturing and, instead of matching any character, match only characters that are not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
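A short side-by-side sketch of why the original pattern overshoots and how restricting the character set fixes it. The variable names are illustrative, and the restricted pattern is just one of several that would work on this input:

```python
import re

xpath = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Greedy .+ runs to the end of the string, then backtracks only as far as
# the LAST "~'", so the two attributes are swallowed into one match.
greedy = re.findall(r"#.+~'.*?'", xpath)

# Restricting each part to characters that cannot cross its delimiter
# ('~', ']' or a quote) keeps every match inside a single attribute.
restricted = re.findall(r"#[^~\]]+~'[^']*'", xpath)

print(greedy)      # one long match spanning both attributes
print(restricted)  # ["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
```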
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# Match everything from '#' up to (but not including) the next ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
If I have a string that may look like this:
"[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
How do I extract the categories and put them into a list?
I'm having a hard time getting the regular expression to work.
To expand on the explanation of the regex used by Avinash in his answer:
Category:([^\[\]]*) consists of several parts:
Category: which matches the text "Category:"
(...) is a capture group meaning roughly "the expression inside this group is a block that I want to extract"
[^...] is a negated set which means "match any single character that is not in this set".
\[ and \] match "[" and "]" in the text respectively.
* means "match zero or more of the preceding regex defined items"
Where I have used ... to indicate that I removed some characters that were not important for the explanation.
So putting it all together, the regex does this:
It finds "Category:" and then matches any number (including zero) of the following characters that are not the excluded characters "[" or "]". When it hits an excluded character it stops, and the text matched by the regex inside the (...) part is returned. So the regex does not actually look for "[[" or "]]" as you might expect, and will match even if they are left out. You could force it to look for the double square brackets at the beginning and end by changing it to \[\[Category:([^\[\]]*)\]\].
For the second regex, Category:[^\[\]]*, the capture group (...) is excluded, so Python returns everything matched which includes "Category:".
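The difference between the loose pattern and the bracket-anchored one can be seen with a small sketch. The sample text below is hypothetical: it is the question's string with an extra unbracketed "Category:" mention appended to show the distinction:

```python
import re

# Hypothetical input: two real category links plus one stray, unbracketed mention.
text = ("[[Category:Political culture]]\n\n"
        " [[Category:Political ideologies]]\n\n"
        " Category:Stray mention")

# Loose: anything after "Category:" up to the next bracket (or end of text).
loose = re.findall(r'Category:([^\[\]]*)', text)

# Strict: the same capture, but only inside [[ ... ]].
strict = re.findall(r'\[\[Category:([^\[\]]*)\]\]', text)

print(loose)   # ['Political culture', 'Political ideologies', 'Stray mention']
print(strict)  # ['Political culture', 'Political ideologies']
```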
Seems like you want something like this,
>>> import re
>>> s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
>>> re.findall(r'Category:([^\[\]]*)', s)
['Political culture', 'Political ideologies']
>>> re.findall(r'Category:[^\[\]]*', s)
['Category:Political culture', 'Category:Political ideologies']
When the pattern contains a capturing group, re.findall returns only the text captured by that group. If there is no capturing group, it returns the whole matches as a list. So in our case, Category: matches the literal text "Category:" and ([^\[\]]*) captures zero or more characters that are not [ or ]. findall then returns the text captured by group 1.
Python code:
s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line]
print(cats)
Output:
['Category:Political culture', 'Category:Political ideologies']
Which regular expression pattern will match a substring not containing a specific character in Python? For example, I have the string "abc,5 * de", and I want to match "abc" and "5 * de" as two substrings, but not the ,.
Use a negated character class that contains all characters you don't want to match.
Something like
[^,]+
The [] denotes the character class and the ^ as first character makes it a negated class.
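As a quick illustration in Python, using the string from the question (the variable names and the second sample string are illustrative only):

```python
import re

s = "abc,5 * de"

# [^,]+ matches each maximal run of characters that are not commas.
parts = re.findall(r'[^,]+', s)
print(parts)  # ['abc', '5 * de']

# Note the + quantifier: empty fields between adjacent commas are skipped,
# whereas str.split keeps them as empty strings.
print(re.findall(r'[^,]+', "a,,b"))  # ['a', 'b']
print("a,,b".split(','))             # ['a', '', 'b']
```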
s = "abc,5 * de"
result = s.split(',')
result[0] # "abc"
result[1] # "5 * de"
Regular expressions are not always the only way to solve string problems.
I don't know Python, but with all the regexp engines I know, that would be /[^,]*/. Or if Python has a built-in function to split a string on a regexp, then you could just split on /,/.
I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use a lookbehind and a lookahead as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print([fixRepeatedQuotes(s) for s in alist])
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
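One possible workaround that avoids lookbehind entirely is to anchor the pattern to the whole string, so quotes are removed only when they wrap the entire item. This is a sketch under a stated assumption: "regular text" is taken to mean ASCII letters only, and the helper name strip_simple_quotes is hypothetical:

```python
import re

alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]',
         "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]

def strip_simple_quotes(text):
    """Drop the surrounding quotes only if the whole string is 'letters'."""
    # ^ and $ tie the quotes to both ends of the string; the assumption that
    # "regular text" means [A-Za-z]+ leaves the escaped-quote items untouched.
    return re.sub(r"^'([A-Za-z]+)'$", r"\1", text)

cleaned = [strip_simple_quotes(item) for item in alist]
print(cleaned)
```

If the plain items can also contain digits or spaces, the character class would need to be widened accordingly.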
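This should work for simple cases such as 'ABC'. Be aware that it also strips quotes that surround a lone backslash, so items like '\'(4.5-inf)\'' are altered too: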
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?:        # Non-capturing group (the ?: suppresses capture)
   '       # Match the character "'" literally
)
(          # Capture the match into backreference number 1
   [^'"]   # Match a single character that is neither a single nor a double quote
      +    # One or more times, as many as possible, giving back as needed (greedy)
)
(?:        # Non-capturing group (the ?: suppresses capture)
   '       # Match the character "'" literally
)
"""
re.sub accepts a function as the replacement. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")
yields
"ABC"