Cannot understand the code for removing words with numbers [duplicate] - python

I want to remove words that contain numbers. After some research I found this:
s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\S*\d\S*", "", s).strip()
This code solves my problem.
However, I do not understand how it works. I know some regex: \d matches any digit [0-9], \S is for white spaces, and * means 0 or more occurrences of the pattern to its left.
"\S*\d\S*"
This is the part I do not understand. In particular, I am not sure how it identifies AB55.
Can anyone please explain it to me? Thanks

This replaces every digit, together with any non-space characters around it, with the empty string "".
Because the first \S* is greedy, it takes as much as it can and gives back only the last digit to \d:
AB55 : AB5 is \S*, 5 is \d, the empty string is \S*
55CD : 5 is \S*, 5 is \d, CD is \S*
A55D : A5 is \S*, 5 is \d, D is \S*
5555 : 555 is \S*, 5 is \d, the empty string is \S*
re.sub("\S*\d\S*", "", s) replaces all these substrings with the empty string "". The spaces between the removed words are left behind, so .strip() removes the leftover whitespace at the beginning and end of the result.
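One way to check how the engine splits each word is to wrap the three parts of the pattern in capture groups. The sketch below uses the sample string from the question; the capture groups and variable names are added here for inspection only and are not part of the original pattern. Because the first \S* is greedy, \d ends up matching the last digit of each word.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

# Same pattern as in the question, with each part captured separately.
parts = re.findall(r"(\S*)(\d)(\S*)", s)
print(parts)
# [('AB5', '5', ''), ('5', '5', 'CD'), ('A5', '5', 'D'), ('555', '5', '')]

# Removing the matches leaves the separating spaces behind ...
removed = re.sub(r"\S*\d\S*", "", s)
print(repr(removed))

# ... so the trailing .strip() still has whitespace to remove.
print(repr(removed.strip()))  # 'ABCD abcd'
```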

You misunderstand the code. \S is the opposite of \s: it matches any character except whitespace.
Since the Kleene star (*) is greedy, the pattern matches as many non-space characters as possible, followed by a digit, followed by as many non-space characters as possible. It thus matches a full word in which at least one character is a digit.
All these matches are then replaced by the empty string, and are therefore removed from the original string.

Your code first matches non-whitespace chars 0+ times with \S* (whereas \s* matches whitespace chars), which runs all the way to the end of the "word". Then it backtracks until it can match a digit, and again matches 0+ non-whitespace chars.
The pattern will, for example, also match a single digit.
You could slightly optimize the pattern by first matching [^\s\d]*, a negated character class for any char that is neither whitespace nor a digit, so that the first part cannot run past the first digit and no backtracking is needed to find it.
[^\s\d]*\d\S*
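As a quick check that the narrower pattern removes the same words, the sketch below runs both patterns on the sample string from the question. The variable names and the capture groups in the last step are illustrative additions.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

original = re.compile(r"\S*\d\S*")
narrowed = re.compile(r"[^\s\d]*\d\S*")

print(original.sub("", s).strip())  # ABCD abcd
print(narrowed.sub("", s).strip())  # ABCD abcd

# With the narrowed pattern the leading part stops at the first digit,
# so \d matches the first digit of each word instead of the last one.
splits = re.findall(r"([^\s\d]*)(\d)(\S*)", s)
print(splits)
# [('AB', '5', '5'), ('', '5', '5CD'), ('A', '5', '5D'), ('', '5', '555')]
```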

You mention that \S is for white spaces, but it is not.
This is what the Python documentation says about \s and \S:
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
If you use \s, which matches whitespace characters, instead of \S,
you get output like this:
>>> import re
>>>
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\s*\d\s*", "", s).strip()
'ABCD abcd ABCD AD'

Related

Regex for questions taking multiple sentences

I'm using re to take the questions from a text. I just want the sentence with the question, but it's taking multiple sentences before the question as well. My code looks like this:
match = re.findall("[A-Z].*\?", data2)
print(match)
an example of a result I get is:
'He knows me, and I know him. Do YOU know me? Hey?'
the two questions should be separated and the non question sentence shouldn't be there. Thanks for any help.
The . character in regex matches any character (except a newline, by default), including periods, which you don't want to include. Why not simply match anything besides the sentence-ending punctuation?
questions = re.findall(r"\s*([^\.\?]+\?)", data2)
# \s*        leading whitespace before the sentence, ignored
# (          start capture group
# [^\.\?]+   negated character class matching anything besides "." and "?" (one or more)
# \?         question mark to end the sentence
# )          end capture group
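To see the difference on the sentence from the question, here is a small sketch that runs the original greedy pattern and the negated-class pattern side by side. The sample text is copied from the question; the variable names greedy and questions are illustrative.

```python
import re

data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Greedy .* runs from the first capital letter to the last question mark.
greedy = re.findall(r"[A-Z].*\?", data2)
print(greedy)
# ['He knows me, and I know him. Do YOU know me? Hey?']

# Excluding '.' and '?' keeps each match inside one sentence.
questions = re.findall(r"\s*([^.?]+\?)", data2)
print(questions)
# ['Do YOU know me?', 'Hey?']
```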
You could look for runs of letters, digits, and whitespace that end with a '?' (here s holds the text; \w already covers digits, so the \d is redundant but harmless).
>>> [i.strip() for i in re.findall('[\w\d\s]+\?', s)]
['Do YOU know me?', 'Hey?']
There are still edge cases to handle: for example, a question that contains punctuation such as ',' would be cut off at that character.
You can use
(?<!\S)[A-Z][^?.]*\?(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
[A-Z] Match a single uppercase char A-Z
[^?.]*\? Match 0+ times any char except ? and . and then match a ?
(?!\S) Negative lookahead, assert a whitespace boundary to the right
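The breakdown above can be tried out directly. This is a minimal sketch: the first string is the sample from the question, and the second is a made-up string that shows the effect of requiring an uppercase first letter.

```python
import re

question_re = re.compile(r"(?<!\S)[A-Z][^?.]*\?(?!\S)")

sample = "He knows me, and I know him. Do YOU know me? Hey?"
found = question_re.findall(sample)
print(found)  # ['Do YOU know me?', 'Hey?']

# A question that starts with a lowercase letter is skipped by [A-Z].
other = "is this ok? Is it?"
found_other = question_re.findall(other)
print(found_other)  # ['Is it?']
```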
You could put ^ at the beginning of your expression, so that it looks like this: "^[A-Z].*\?".
"Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character."
Because .* is greedy, this still runs from the first capital letter to the last question mark on the line. If you have multiple sentences in your line you can use the following regex:
"(?<=\.\s+)[A-Z].*\?"
(?<=...) is called a positive lookbehind. Here it requires a period (.) and one or more whitespace characters before the sentence. Note that Python's built-in re module only accepts fixed-width lookbehinds, so the \s+ inside the lookbehind raises an error there; the pattern needs an engine that supports variable-length lookbehind.

Match Pattern based on multiple special characters

Regex to match more than one special characters after a string
I am trying to come up with regex to match in the order of importance as below
String plus 2 or more special characters followed by some word
String plus 1 special character followed by some word
String (and no special characters) followed by some word
I am able to match all patterns with below regex
re.compile(r'keyword\W*\s*(\S*)', re.IGNORECASE|re.MULTILINE|re.UNICODE)
but it does not differentiate between different scenarios after keyword.
For example, taking keyword as the string above:
If I have the string 'keyword-+blah', it should match scenario 1 only
If I have the string 'keyword-blah', it should match scenario 2 only
If I have the string 'keywordblah', it should match scenario 3 only
You could use a character class to specify which chars you consider to be special. Then use a quantifier {0,2} to match a repetition of 0, 1 or 2 times.
The following \w+ matches 1+ times a word character.
Note that \S matches a non whitespace char so that would also match - or +
keyword[+-]{0,2}\w+
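To tell the three scenarios apart, one option is to capture the special characters and count them. This is only a sketch: the capture groups, the classify helper, and the scenario numbering are assumptions added for illustration, and the character class [+-] covers only the two special characters used in the examples.

```python
import re

# Group 1 holds the special characters, group 2 the word that follows.
pattern = re.compile(r"keyword([+-]{0,2})(\w+)", re.IGNORECASE)

def classify(text):
    """Return (scenario, word), or None when the text does not match."""
    m = pattern.search(text)
    if m is None:
        return None
    specials, word = m.groups()
    # 2 special chars -> scenario 1, 1 -> scenario 2, 0 -> scenario 3
    scenario = {2: 1, 1: 2, 0: 3}[len(specials)]
    return scenario, word

for text in ("keyword-+blah", "keyword-blah", "keywordblah"):
    print(text, classify(text))
```

Since {0,2} allows at most two special characters, a string with three or more of them before the word is not matched by this sketch.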

extract string using regular expression

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release) #fix_release is None
I want to extract the string 'Ubuntu 16.04'
But, result is None.... How can I extract the correct sentence?
You confused the word boundary \b with whitespace. \b matches the position between a word character and a non-word character and consumes no characters, so it cannot match the space between Ubuntu and 16.04. You can simply use r'Ubuntu \d+\.\d+' for your case:
import re
fix_release = 'Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: In your regex, you are using \b where the space is. \b produces zero-length matches at the boundary between a word character and a non-word character, so it never consumes the space. You can use \s instead.
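A short sketch comparing the pattern from the question with the \s* version, using the string from the question. The variable names are illustrative.

```python
import re

fix_release = "Ubuntu 16.04 LTS"

# \b matches a position, not the space itself, so this finds nothing.
with_boundary = re.search(r"(Ubuntu)\b(\d+[.]\d+)\b", fix_release)
print(with_boundary)  # None

# \s* consumes the space between the name and the version number.
with_space = re.search(r"Ubuntu\s*\d+(?:\.\d+)?", fix_release)
print(with_space.group(0) if with_space else None)  # Ubuntu 16.04
```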

Regular expressions: replace comma in string, Python

Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), consumes a word char (an alphanumeric or an underscore, [\w]) before the comma, then matches the comma, and then matches (and again consumes) one more character: a whitespace, a word character, or a literal | (inside a character class, the pipe is a literal pipe symbol, not the alternation operator). Since the replacement is the empty string, all three characters are removed.
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
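The two lookaround patterns behave the same on plain ASCII letters; they differ once a non-ASCII letter sits next to the comma. The sketch below illustrates this with a made-up row containing an accented letter (not from the original data); the variable names are illustrative.

```python
import re

ascii_letters = re.compile(r"(?<=[a-zA-Z]),(?=[a-zA-Z\s])")
any_letters = re.compile(r"(?<=[^\W\d_]),(?=[^\W\d_]|\s)")

# Hypothetical row: \u00c9 is an accented capital E.
line = "2015,2135,602832/09,CAF\u00c9, LLC,15,15"

kept = ascii_letters.sub("", line)    # comma after the accented letter stays
removed = any_letters.sub("", line)   # comma after the accented letter is removed

print(kept)
print(removed)
```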
\w matches a-z, A-Z, 0-9 and _, so your regex also matches the commas between digits. You could try the following regex, which captures the characters on both sides of the comma, and replace with \1\2 to put them back.
([a-zA-Z]),(\s|[a-zA-Z])
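A minimal sketch of the capture-group approach on the two rows from the question; the variable names text and new_text follow the question's own code.

```python
import re

text = (
    "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n"
    "2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
)

# \1 and \2 put back the characters captured on each side of the comma.
new_text = re.sub(r"([a-zA-Z]),(\s|[a-zA-Z])", r"\1\2", text)
print(new_text)
```

Only the comma in "ICON, LLC" is dropped; commas next to digits are left alone.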

Regex to include alphanumeric and _

I'm trying to create a regular expression to match alphanumeric characters and the underscore _. This is my regex: "\w_*[^-$\s\]" and my impression is that this regex means any alphanumeric character \w, an underscore _, and no -,$, or whitespace. Is this correct?
Regular expressions are read as patterns which actually match characters in a string, left to right, so your pattern actually matches an alphanumeric, THEN an underscore (0 or more), THEN at least one character that is not a hyphen, dollar, or whitespace.
Since you're trying to alternate on character types, just use a character class to show what characters you're allowing:
[\w_]
This matches if ANY part of the string matches it, so let's anchor it to the beginning and end of the string:
^[\w_]$
And now we see that the character class lacks a quantifier, so we are matching on exactly ONE character. We can fix that using + (if you want one or more characters, no empty strings) or * (if you want to allow empty strings). I'll use + here.
^[\w_]+$
As it turns out, the \w character class already includes the underscore, so we can remove the redundant underscore from the pattern:
^[\w]+$
And now we have only one item in the character class, so we no longer need the character class brackets at all:
^\w+$
And that's all you need, unless I'm missing something about your requirements.
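The effect of each step can be checked on a few strings. This is a sketch with made-up sample inputs and a hypothetical check helper; it reports whether each of the three patterns finds a match.

```python
import re

def check(text):
    """Return match results for the unanchored, single-char, and final patterns."""
    return (
        bool(re.search(r"[\w_]", text)),    # a word character anywhere
        bool(re.search(r"^[\w_]$", text)),  # exactly one word character
        bool(re.search(r"^\w+$", text)),    # one or more, whole string
    )

for text in ["abc_123", "a", "", "has space", "dash-ed", "a$b"]:
    print(repr(text), check(text))
```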
You are partly correct, provided the closing bracket is not escaped. Also, the token \w already matches the underscore, so you do not need to repeat this character. Your regular expression says:
\w # word characters (a-z, A-Z, 0-9, _)
_* # '_' (0 or more times)
[^-$\s] # any character except: '-', '$', whitespace (\n, \r, \t, \f, and " ")
You could simply write your entire regex as follows to match word characters:
\w+ # word characters ( a-z, A-Z, 0-9, _ ) (1 or more times)
If you want to match an entire string, be sure to anchor your expression.
^\w+$
Explanation:
^ # the beginning of the string
\w+ # word characters ( a-z, A-Z, 0-9, _ ) (1 or more times)
$ # before an optional \n, and the end of the string
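The "before an optional \n" part of $ matters when the input may end with a newline. The sketch below shows this in Python, along with two stricter alternatives, re.fullmatch and the \Z anchor; the sample strings and variable names are made up for illustration.

```python
import re

plain = "abc_123"
with_newline = "abc_123\n"

dollar_plain = bool(re.match(r"^\w+$", plain))           # True
dollar_newline = bool(re.match(r"^\w+$", with_newline))  # True: $ matches before the final \n

# Stricter options that reject the trailing newline.
full_newline = bool(re.fullmatch(r"\w+", with_newline))  # False
z_newline = bool(re.match(r"^\w+\Z", with_newline))      # False

print(dollar_plain, dollar_newline, full_newline, z_newline)
```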
