Is there a simpler regular expression to match anything that is not a letter, hyphen, space, or apostrophe?
This is the regex I was using...
[^\w\s'-]|\d|_|\xa0
It works; I was just curious whether there is a simpler expression.
[^a-zA-Z-' ]
Matches everything except the letters a-z and A-Z, hyphens, spaces, and apostrophes.
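\w already includes \d and _, so a negated class built on \w never matches digits or underscores. If leaving those unmatched is acceptable for your input, the simplest regex is: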
[^\w\s\-']
The following pattern...
[^a-z- ']
...is simpler and should do what you want with case-insensitivity set:
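import re
# Python 3: a plain raw string replaces the Python 2 ur'' prefix.
p = re.compile(r"[^a-z- ']", re.IGNORECASE)
test_str1 = "9"
test_str2 = "["
test_str3 = "_"
print(p.search(test_str1))  # a match object: a digit is not a letter
print(p.search(test_str2))  # a match object
print(p.search(test_str3))  # a match object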
Echoing Maroun Maroun's comment: \w matches _ and also 0-9, so saying "not a-z or A-Z or 0-9 or _" with [^\w...] and then saying "0-9 or _" with |\d|_ is confusing and needlessly complicated.
The same goes for \s, which matches more than a space (it also matches a carriage return, newline, tab, and form feed, among others). That does not jibe with wanting to match "anything that is not...a space...", so given your description, use a literal space rather than the \s character class.
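To make those differences concrete, here is a small sketch under Python 3 that compares the original pattern with the literal-space version. The helper name classify and the sample characters are mine, not from the question.

```python
import re

# Pattern from the question and the simplified character class from the answers.
original = re.compile(r"[^\w\s'-]|\d|_|\xa0")
simplified = re.compile(r"[^a-zA-Z' -]")

def classify(ch):
    """Return (original_matches, simplified_matches) for one character."""
    return bool(original.search(ch)), bool(simplified.search(ch))

# Illustrative sample characters (an assumption, not data from the question).
for ch in ["a", "-", " ", "'", "9", "_", "[", "\t", "\xa0", "é"]:
    print(repr(ch), classify(ch))
```

The two patterns agree on ASCII letters, digits, underscores, hyphens, spaces, and apostrophes. They differ on a tab, which \s shields from the original pattern, and on non-ASCII letters such as é, which Python 3's Unicode-aware \w counts as word characters.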
I'm using re to extract the questions from a text. I just want the sentence containing the question, but it's also capturing the sentences before the question. My code looks like this:
match = re.findall(r"[A-Z].*\?", data2)
print(match)
An example of a result I get is:
'He knows me, and I know him. Do YOU know me? Hey?'
The two questions should be separated, and the non-question sentence shouldn't be there. Thanks for any help.
The . character in a regex matches any character except a newline, including periods, which you don't want to include. Why not simply match anything besides the sentence-ending punctuation?
questions = re.findall(r"\s*([^\.\?]+\?)", data2)
# \s* sentence beginning space to ignore
# ( start capture group
# [^\.\?]+ negated character class matching one or more characters other than "." and "?"
# \? question mark to end sentence
# ) end capture group
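As a quick check, the same idea can be run against the sentence from the question. This is only a sketch: the name data2 is reused from the question, the sample text is the asker's example, and the class is written as [^.?] because . and ? need no escaping inside a character class.

```python
import re

data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Capture runs of characters that contain neither "." nor "?" and end in "?".
questions = re.findall(r"\s*([^.?]+\?)", data2)
print(questions)  # ['Do YOU know me?', 'Hey?']
```

A sentence ending in "!" is not treated as a boundary by this pattern, so it would be folded into the question that follows it.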
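You could look for runs of word characters (letters, digits, and underscores) and whitespace that end with a '?'.
>>> import re
>>> s = 'He knows me, and I know him. Do YOU know me? Hey?'
>>> [i.strip() for i in re.findall(r'[\w\s]+\?', s)]
['Do YOU know me?', 'Hey?']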
There are still edge cases to handle, such as a question that contains a ',' or other punctuation, which this character class does not match.
You can use
(?<!\S)[A-Z][^?.]*\?(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
[A-Z] Match a single uppercase char A-Z
[^?.]*\? Match 0+ times any char except ? and . and then match a ?
(?!\S) Negative lookahead, assert a whitespace boundary to the right
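A minimal sketch of this pattern in Python, assuming the text from the question plus a made-up lowercase fragment at the end to show the effect of [A-Z]:

```python
import re

pattern = re.compile(r"(?<!\S)[A-Z][^?.]*\?(?!\S)")

# The trailing "really?" is a hypothetical addition for illustration.
text = "He knows me, and I know him. Do YOU know me? Hey? really?"
matches = pattern.findall(text)
print(matches)  # ['Do YOU know me?', 'Hey?']
```

The first sentence is skipped because [^?.]* stops at its period, and "really?" is skipped because it does not start with an uppercase letter.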
You should use ^ at the beginning of your expression, so your regex should look like this: "^[A-Z].*\?".
"Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character."
If you have multiple sentences in your line, you can use the following regex:
"(?<=\.\s)[A-Z].*\?"
(?<=...) is a positive lookbehind. Here it finds sentences that are preceded by a period (.) and a whitespace character. The period is escaped so that it is literal, and the lookbehind uses \s rather than \s+ because Python's re module only accepts fixed-width lookbehinds.
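Python's built-in re module only accepts fixed-width lookbehinds, so a lookbehind that contains \s+ does not compile. The sketch below shows the error and one possible fixed-width variant. The variant pattern is my own assumption rather than the answer's: it also treats a preceding ? as a sentence end and stops each match at the first . or ?.

```python
import re

text = "He knows me, and I know him. Do YOU know me? Hey?"

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\.\s+)[A-Z]")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Fixed-width alternative: start of string, or one of ". " / "? " before it.
fixed = re.findall(r"(?:^|(?<=[.?]\s))[A-Z][^.?]*\?", text)

print(variable_width_ok)  # False
print(fixed)              # ['Do YOU know me?', 'Hey?']
```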
Let's say I have this string:
Alpha+*&Numeric%$^String%%$
I want to get the non-alphanumeric characters that are between alphanumeric characters:
+*& %$^
I have this regex: [^0-9a-zA-Z]+ but it's giving me
+*& %$^ %%$
which includes the trailing non-alphanumeric characters, which I do not want. I have also tried [0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z] but it's giving me
a+*&N c%$^S
which includes the characters a, N, c and S.
If you don't mind treating the _ character as alphanumeric, you can extract all the non-alphanumeric runs with this:
import re
some_string = "A+*&N%$^S%%$"
result = re.findall(r'\b\W+\b', some_string)  # sets result to: ['+*&', '%$^']
Note my use of \b instead of something like \w or [^\W].
\w and [^\W] each consume one character, so if the alphanumeric run between two pieces of text you want is exactly one character long, the first match uses it up and the next match you expect is not found.
But since \b is a zero-width "word boundary," it doesn't care how many alpha-numeric characters there are, as long as there is at least one.
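The difference is easy to see on the single-letter string used above. The following sketch compares a consuming pattern with \b and with an ASCII-only lookaround variant. The lookaround pattern is an addition of mine, offered for the case where _ should not count as alphanumeric.

```python
import re

some_string = "A+*&N%$^S%%$"

# \w on both sides consumes N, so the second run has no leading \w left.
consuming = re.findall(r"\w(\W+)\w", some_string)

# \b is zero-width, so N can border both runs.
boundary = re.findall(r"\b\W+\b", some_string)

# Lookarounds are also zero-width and keep the class ASCII-only.
lookaround = re.findall(
    r"(?<=[0-9a-zA-Z])[^0-9a-zA-Z]+(?=[0-9a-zA-Z])", some_string
)

print(consuming)   # ['+*&']
print(boundary)    # ['+*&', '%$^']
print(lookaround)  # ['+*&', '%$^']
```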
The only problem with your second attempt is the location of the + quantifier: it should be inside the parentheses. You can also use the word character class \w and its inverse \W to pull out these items, which is the same as your second regex except that underscores _ count as parts of words:
import re
s = "Alpha+*&Numeric%$^String%%$"
print(re.findall(r"\w(\W+)\w", s)) # adds _ character
print(re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)) # your version fixed
print(re.findall(r"(?i)[0-9A-Z]([^0-9A-Z]+)[0-9A-Z]", s)) # same as above
Output:
['+*&', '%$^']
['+*&', '%$^']
['+*&', '%$^']
Set-up
I've got a string of names which need to be separated into a list.
Following this answer, I have,
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)
where the last line gives me,
['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']
Problems
1) Whitespace is ignored
'Prenzlauer Berg' is actually 1 name but the code splits according to the 'split-at-capital-letter' rule.
How do I keep it from splitting at a capital letter when the preceding character is whitespace?
2) Special characters not handled well
The code used cannot handle 'ö'. How do I include such 'German' characters?
I.e. I want to obtain,
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
You can use positive and negative lookbehind and just list the Umlauts explicitly:
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
(?<!\s)...: matches ... that is not preceded by \s
(?<=\s)...: matches ... that is preceded by \s
(?:...): non-capturing group so as to not mess with the findall results
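Wrapped in a small helper, the pattern can be tried on other inputs. The function name split_names and the second test string are hypothetical, chosen only to exercise Ö and ß.

```python
import re

# Same pattern as above, compiled once from a raw string.
NAME = re.compile(r"(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*")

def split_names(text):
    """Split a run-together string of capitalised names into a list."""
    return NAME.findall(text)

print(split_names("KreuzbergLichtenbergNeuköllnPrenzlauer Berg"))
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

# Hypothetical second input, not from the question.
print(split_names("MitteÖstliche VorstadtWeißensee"))
# ['Mitte', 'Östliche Vorstadt', 'Weißensee']
```

Names containing hyphens or other characters outside the listed classes are not covered by this sketch.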
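This works for the example, although the range [a-ü] spans every code point from a to ü and so also accepts symbols such as { and ×:
import re
string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']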
I am somewhat puzzled by the way regular expressions work in Python. I want to replace all commas that are preceded by a letter and followed by either a letter or whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line effectively has 6 columns, while the second line has 7. I am therefore trying to replace the comma between (N, L) in the second line with whitespace (N L), like so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), consumes a word character (an alphanumeric or an underscore, [\w]) before the comma, then matches the comma, and then consumes one more character: whitespace, a word character, or a literal | (inside a character class, the pipe is a literal pipe symbol, not the alternation operator).
So, the main problem is that \w matches both letters and digits.
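To see what the attempted pattern actually matches, here is a short sketch using the second sample line from the question. The variable names are mine.

```python
import re

attempt = re.compile(r"([\w],[\s\w|\w])")
line = "2015,2135,602832/09,DOYLE V ICON, LLC,15,15"

# Every comma is hit, including the ones between numeric fields.
matches = attempt.findall(line)
print(matches)  # ['5,2', '5,6', '9,D', 'N, ', 'C,1', '5,1']

# The | inside the character class is a literal pipe, not alternation.
pipe_match = attempt.search("a,|")
print(pipe_match.group(0))  # a,|
```

Because each match also consumes the characters on both sides of the comma, replacing the match with an empty string deletes those neighbouring characters as well.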
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and the underscore from it by putting the opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
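One practical difference between the two patterns: [a-zA-Z] is ASCII-only, while [^\W\d_] also accepts accented letters, because \w is Unicode-aware for str patterns in Python 3. A sketch, in which the second input line is invented for illustration:

```python
import re

ascii_letters = re.compile(r"(?<=[a-zA-Z]),(?=[a-zA-Z\s])")
unicode_letters = re.compile(r"(?<=[^\W\d_]),(?=[^\W\d_]|\s)")

line = "2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
accented = "2015,1,1/09,CAFÉ, LLC,15,15"  # hypothetical line, not from the question

# Both patterns remove the comma after ICON.
print(ascii_letters.sub("", line))
print(unicode_letters.sub("", line))

# Only the \w-based pattern treats É as a letter.
print(ascii_letters.sub("", accented))
print(unicode_letters.sub("", accented))
```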
\w matches a-z, A-Z, 0-9, and _, so your regex also matches the commas between the numeric fields. You could try the following regex and replace with \1\2:
([a-zA-Z]),(\s|[a-zA-Z])
Is there an error in the way Python handles '.' or '\b'? I'm not sure why this produces differing results.
import re
regex1 = r'\.?\b'
print(bool(re.match(regex1, '.')))
regex2 = r'a?\b'
print(bool(re.match(regex2, 'a')))
Output:
False
True
\b, a word boundary, matches between a word character and a non-word character, or between a word character and the start or end of the string. As such, it will match between a word character like a and the end of the string, but not between a non-word character like . and the end of the string.
As geekosaur pointed out, \b is merely a short way of writing
(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
In your case you may want to use
(?!\w)
or
(?!\S)
instead of \b.
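The following sketch checks these claims against the two inputs from the question. The constant name WORD_BOUNDARY and the helper matches are mine; the expanded pattern is the lookaround form quoted above.

```python
import re

# Lookaround expansion of \b, as quoted above.
WORD_BOUNDARY = r"(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))"

def matches(pattern, text):
    """Return True if pattern matches at the start of text."""
    return bool(re.match(pattern, text))

print(matches(r"\.?\b", "."))                # False: no word character next to "."
print(matches(r"\.?" + WORD_BOUNDARY, "."))  # False: behaves like \b
print(matches(r"a?\b", "a"))                 # True: boundary between "a" and the end
print(matches(r"\.?(?!\w)", "."))            # True: only requires no word character ahead
print(matches(r"\.?(?!\S)", "."))            # True: only requires no non-space ahead
```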