Why doesn't the below regex print True?
print re.compile(r'^\b[a-z]\b$').search('(s)')
I want to match single char alphabeticals that may have non alphanumeric characters before and after, but do not have any more alphanumeric characters anywhere in the string. So the following should be matches:
'b'
'b)'
'(b)'
'b,'
and the following should be misses:
'b(s)'
'blah(b)'
'bb)'
'b-b'
'bb'
The solutions here don't work.
The ^ at the beginning and $ at the end cause the expression to match only if the entire string is a single character. (Thus, they make each \b redundant.) Remove the anchors to match inside a larger string:
print re.compile(r'\b[a-z]\b').search('b(s)')
Alternatively, ensure only one character like:
print re.compile(r'^\W*[a-z]\W*$').match('b(s)')
Note that in the first case, 'b-b' and 'blah(b)' will match because they contain single alphabetical characters not touching others inside them. In the second case, 'b(s)' will not be a match, because it contains two alphabetical characters, but the other four cases will match correctly, and all of the no-match cases will return None (false logical value) as intended.
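As a quick check of the two suggestions above, here is a small sketch that runs both patterns over the sample strings from the question. The names samples and hits are only illustrative, and print() is called as a function so the snippet also runs on Python 3.

```python
import re

# Sample strings from the question: four intended matches, then five intended misses.
samples = ['b', 'b)', '(b)', 'b,', 'b(s)', 'blah(b)', 'bb)', 'b-b', 'bb']

unanchored = re.compile(r'\b[a-z]\b')      # a lone letter anywhere in the string
anchored = re.compile(r'^\W*[a-z]\W*$')    # exactly one letter in the whole string


def hits(pattern, strings):
    """Return the strings in which `pattern` finds a match."""
    return [s for s in strings if pattern.search(s)]


print(hits(unanchored, samples))
print(hits(anchored, samples))
```

Only the anchored version rejects 'b(s)', 'blah(b)' and 'b-b', which is the behaviour the question asks for.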
Ok here is the answer:
print re.compile(r'^[(,\[]?[a-z][),;\]]?[,;]?$').search('(s)')
It catches a variety of complex patterns for single character alphanumerics. I realize this is different than what I asked for but in reality it works better.
I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case given a string like: "should print nŌt thìs"
should print should and print but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the special characters - I just want to invert that.
For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w*"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Demo. Note in regex101.com you must specify Python flavor for \b to work properly with special characters.
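To see the lookahead at work, here is a minimal sketch, assuming Python 3 (where \w is Unicode-aware and the re module understands \uXXXX escapes). It uses \w+ rather than \w*, which is a deliberate change from the pattern above: with *, findall can also return empty strings at word boundaries. The name plain_word is made up for this example.

```python
import re

# Same character range as in the question, written with escapes.
SPECIAL = r"\u00C0-\u01DA"

# \w+ instead of \w* (an assumption of this sketch) so empty matches are never returned.
plain_word = re.compile(r"\b(?!\w*[" + SPECIAL + r"])\w+")

text = "should print n\u014Ct th\u00ECs"   # "should print nŌt thìs"
print(plain_word.findall(text))            # ['should', 'print']
```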
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']\
if any(['WITH' in ud.name(letter) for letter in word])]
['Cá', 'Lá']
Or use ... 'WITH' not in to reverse.
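The reversed check might look like the sketch below, assuming Python 3. The helper has_accent is a made-up name, and passing '' as the default to unicodedata.name is an addition so that characters without a Unicode name do not raise ValueError.

```python
import unicodedata as ud

words = ['C\u00e1', 'L\u00e1', 'Aqui']   # 'Cá', 'Lá', 'Aqui'


def has_accent(word):
    """True if any character's Unicode name contains 'WITH' (e.g. '... A WITH ACUTE')."""
    # ud.name(ch, '') returns '' instead of raising for unnamed characters.
    return any('WITH' in ud.name(ch, '') for ch in word)


print([w for w in words if not has_accent(w)])   # ['Aqui']
```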
I bumped into the problem while playing around in Python: when I create a random string, let's say "test 1981", the following Python call returns with an empty string.
>>> re.search('\d?', "test 1981").group()
''
I was wondering why this is. I was reading through some other posts, and it seems that it has to do with greedy vs. non-greedy operators. Is it that the '?' checks to see if the first value is a digit, and if it's not, it takes the easier, quicker path and just outputs nothing?
Any clarification would help. Thanks!
Your pattern matches either a digit or the empty string. The search starts at the first character and tries to match a digit there; when that fails, it tries the alternative the quantifier allows, the empty string, and so a match is found before the first character.
I think you expected it to move on and try the next character, but that is not what happens: the engine first tries everything the quantifier allows at the first position, and that is zero or one digit.
The optional quantifier only makes sense in combination with a required part. Say you want a digit followed by an optional one:
>>> re.search('\d\d?', "test 1981").group()
'19'
Otherwise your pattern always matches.
Regex
\d?
simply means that it should optionally (?) match single digit (\d).
If you use something like this, it will work as you expect (match single digit anywhere in the string):
\d
re.search('\d?', "test 1981").group() greedily matches the first match of the pattern (0 or 1 digits) it can find. In this case that's zero digits. Note that re.search('\d?', "1981 test").group() actually matches the string '1' at the beginning of the string. What you're probably looking for here is re.search('\d+', "test 1981").group(), which finds the whole string 1981 no matter where it is.
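The behaviour described in these answers can be checked side by side with a short sketch. span() is used for the first call to show that the empty match sits at position 0.

```python
import re

s = "test 1981"

print(re.search(r'\d?', s).span())              # (0, 0): empty match at the start
print(re.search(r'\d?', "1981 test").group())   # '1': a digit is available at position 0
print(re.search(r'\d', s).group())              # '1': a required digit forces the scan forward
print(re.search(r'\d\d?', s).group())           # '19'
print(re.search(r'\d+', s).group())             # '1981'
```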
I have a Python string of the form
mystr = "hi.this(is?my*string+"
Here I need to get the position of the 'is' that is surrounded by special characters or non-alphabetic characters (i.e. the second 'is' in this example). However, using
mystr.find('is')
will return the position of the 'is' that is part of 'this', which is not desired. How can I find the position of a substring that is surrounded by non-alphabetic characters in a string? Using Python 2.7.
Here the best option is to use a regular expression. Python has the re module for working with regular expressions.
We use a simple search to find the position of the "is":
>>> match = re.search(r"[^a-zA-Z](is)[^a-zA-Z]", mystr)
This returns the first match as a match object. We then simply use MatchObject.start() to get the starting position:
>>> match.start(1)
8
Edit: A good point made, we make "is" a group and match that group to ensure we get the correct position.
As pointed out in the comments, this makes a few presumptions. One is that "surrounded" means "is" cannot be at the beginning or end of the string. If it can, a different regular expression is needed, as this one only matches strings with a character on both sides.
Another is that this counts numbers as the special characters - you stated non-alphabetic, which I take to mean numbers included. If you don't want numbers to count, then using r"\b(is)\b" is the correct solution.
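If "is" may also sit at the very start or end of the string, one possible variant (not from the answer above, just a sketch) swaps the surrounding character classes for lookarounds, which assert that no letter is adjacent without consuming a character. The name standalone_is is made up for this example.

```python
import re

mystr = "hi.this(is?my*string+"

# Lookarounds assert "no letter on either side" without consuming characters,
# so the match itself starts at the 'i' and also works at the string edges.
standalone_is = re.compile(r"(?<![a-zA-Z])is(?![a-zA-Z])")

print(standalone_is.search(mystr).start())          # 8
print(standalone_is.search("is this it").start())   # 0
```

Because nothing outside "is" is consumed, start() can be called without a group number.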
I want to find strings that do NOT match a particular sequence of characters. For example:
something like
REGEX = r'[^XY]*'
I'd like to look for strings that have any number of characters except an X and Y next to each other...the REGEX above doesn't work since it blocks X's and Y's separately.
How about:
if "XY" not in s:
    print "matched"
else:
    print "not matched"
Or is this for inclusion in some longer regexp? Then maybe you want a negative lookahead expression:
REGEXP="...(?!XY)..."
EDIT: fixed typo
There are a few ways to do that.
^(?!.*XY).*$
The lookahead expression tries to match a XY sequence anywhere in the string. It's a negative lookahead, so if it finds one, the match attempt fails. Otherwise the .* goes ahead and consumes the whole string.
^(?:(?!XY).)*$
This one repeatedly matches any character (.), but only after the lookahead confirms that the character is not the beginning of a XY sequence.
^(?:[^X]+|X(?!Y))*$
Repeatedly matches one or more of any character except X, or X if it's not followed by Y.
With the first two regexes, you have to apply the DOTALL modifier (re.DOTALL in Python) if there might be newlines in the source string. The third one doesn't need that because it uses a negated character class - [^X] - instead of a dot.
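A rough way to convince yourself that the three patterns agree is to run them against a handful of strings and compare with the plain substring test. The sample strings below are made up for illustration, and re.DOTALL is applied to the first two patterns as described.

```python
import re

no_xy_patterns = [
    re.compile(r'^(?!.*XY).*$', re.DOTALL),     # one lookahead for the whole string
    re.compile(r'^(?:(?!XY).)*$', re.DOTALL),   # lookahead before every character
    re.compile(r'^(?:[^X]+|X(?!Y))*$'),         # negated class, no dot needed
]

samples = ['', 'abc', 'aXbY', 'YX', 'XXY', 'aXYb', 'a\nb', 'a\nXY']

for s in samples:
    results = [bool(p.match(s)) for p in no_xy_patterns]
    # The last column is the plain-Python check the regexes should agree with.
    print(repr(s), results, 'XY' not in s)
```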
Here's the problem:
split=re.compile('\\W*')
This regular expression works fine when dealing with regular words, but there are occasions where I need the expression to include words like k&auml;ytt&auml;j&auml;.
What should I add to the regex to include the & and ; characters?
I would treat the entities as a unit (since they also can contain numerical character codes), resulting in the following regular expression:
(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+
This matches
    either a word character (including “_”), or
    an HTML entity consisting of
        the character “&”,
            the character “#”,
                the character “x” followed by at least one hexadecimal digit, or
                at least one decimal digit, or
            at least one letter (= named entity),
        a semicolon
at least once.
/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.
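As a sketch of how this expression could be used from Python, the snippet below rewrites the groups as non-capturing ones so that findall returns whole tokens. That rewrite, the name token, and the sample text are assumptions of this example and not part of the answer.

```python
import re

# Same pattern as above, with non-capturing groups so findall returns whole tokens.
token = re.compile(r'(?:\w|&(?:#(?:x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+')

text = 'k&auml;ytt&auml;j&auml; on &#228; tai &#xE4;'   # made-up sample text
print(token.findall(text))
# ['k&auml;ytt&auml;j&auml;', 'on', '&#228;', 'tai', '&#xE4;']
```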
You probably want to approach the problem in reverse, i.e. find all runs of characters that are not whitespace:
[^ \t\n]*
Or you want to add the extra characters:
[a-zA-Z0-9&;]*
In case you want to match HTML entities, you should try something like:
(\w+|&\w+;)*
You should make a character class that includes the extra characters. For example:
split=re.compile('[\w&;]+')
This should do the trick. For your information
\w (lower case 'w') matches word characters (alphanumerics and the underscore)
\W (capital W) is the negated character class (meaning it matches any non-word character)
* matches 0 or more times and + matches one or more times, so a pattern ending in * can match the empty string (even if there are no characters there).
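A minimal sketch of the character-class approach, with made-up sample text: findall collects the runs directly, and split with the negated class gives the same words as long as + is used so the separator can never be empty.

```python
import re

text = 'k&auml;ytt&auml;j&auml; on, t&auml;&auml;ll&auml;'   # made-up sample text

# Collect runs of word characters, '&' and ';'.
print(re.findall(r'[\w&;]+', text))

# Equivalent view: split on runs of everything else ('+' so the separator is never empty).
print(re.split(r'[^\w&;]+', text))
```

Note that this treats every '&' and ';' as part of a word, so a real semicolon in the text would be glued to the word before it.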
Looks like this RegEx did the trick:
split=re.compile('(\\\W+&\\\W+;)*')
Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.