How to include hyphenated words in this regex? [duplicate] - python

I have the following regex for matching words:
\w+(?:'|\-\w+)?
For the following string:
' 's yea' don't -yeah no- ice-cream '
it gives the following matches:
s yea' don't yeah no ice-cream
However, I would like the following matches:
's yea' don't yeah no ice-cream
This is because a word can start or end with an apostrophe, but not with a hyphen. Note that a ' on its own should not be matched.

Your \w+(?:'|\-\w+)? starts matching with a word character \w, so a leading ' is never included in the match, contrary to the requirements.
In general, you can match words with and without hyphens with
\w+(?:-\w+)*
In the current scenario, you may include the \w and ' into a character class and use
'?\w[\w']*(?:-\w+)*'?
See the regex demo
If a "word" can only have 1 hyphen, replace * at the end with the ? quantifier.
Breakdown:
'? - optional apostrophe
\w - a word character
[\w']* - 0+ word characters or apostrophes
(?:-\w+)* - 0+ sequences of:
- - a hyphen
\w+ - 1+ word characters
'? - optional apostrophe
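As a quick check, the pattern can be run against the sample string from the question with re.findall. This is only a minimal sketch; the variable names are illustrative and not part of the original answer.

```python
import re

# Pattern from the answer: optional ', a word char, then word chars or
# apostrophes, then optional hyphenated parts, then an optional trailing '
word_rx = re.compile(r"'?\w[\w']*(?:-\w+)*'?")

text = "' 's yea' don't -yeah no- ice-cream '"
matches = word_rx.findall(text)
print(matches)
# ["'s", "yea'", "don't", 'yeah', 'no', 'ice-cream']
```

The lone apostrophes at both ends of the string are skipped because the pattern always requires at least one word character.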

Related

word boundary \b doesn't work on string with dot in Python regex [duplicate]

For example: George R.R. Martin
I want to match only George and Martin.
I have tried: \w+\b. But it doesn't work!
The \w+\b. pattern matches 1+ word chars that are followed with a word boundary, and then any char, which must be a non-word char (since \b restricts what the following . can match). Note that this does not negate anything, and you are missing an important point: to match a literal dot, the . in the regex pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
See the regex demo
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
See more about how negative lookahead works here.
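Since the question is about Python, the same pattern can also be tried with the re module. The snippet below is a small sketch of the idea using the sample string from the question; the variable names are illustrative.

```python
import re

s = "George R.R. Martin"
# \b\w+\b matches a whole word; (?!\.) rejects it when a literal dot follows
names = re.findall(r"\b\w+\b(?!\.)", s)
print(names)
# ['George', 'Martin']
```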

Split String into alpha & punctuation with exceptions regex

I am trying to split a string into 2 parts: alphanumeric chars & special chars. I want to limit the occurrence of the escape character.
b.sc.... = ['b.sc.','...'] (Preserve "." inside word & outside word just once)
really???? = ['really','????'] (split when any other special char encountered)
I went through a lot of SO questions before posting here. This is what I have come up with so far: re.findall(r"[\w+|\-.+\w]+|\W+", text)
How to proceed further?
You can use
[re.sub(r'([.-])+', r'\1', x) for x in re.findall(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+', text)]
See this regex demo
Details
\w+(?:-+\w+)+ - one or more word chars followed with one or more occurrences of one or more - chars and one or more word chars
| - or
\w+(?:\.+\w+)*\.? - one or more word chars followed with zero or more occurrences of one or more . chars and one or more word chars, and then an optional dot
| - or
[^\w\s]+ - one or more non-word and non-whitespace chars.
The re.sub(r'([.-])+', r'\1', x) part is a post-processing step to replace one or more consecutive . or - chars with a single occurrence.
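To see what each step contributes, the tokenizing and the post-processing can be run separately. This is a sketch only: split_tokens and TOKEN_RX are hypothetical names introduced here, and "well--known" is an extra sample input that is not in the question. Note that the re.sub step also collapses a standalone run such as ... to a single dot, so if the punctuation run must be kept as in the question's expected output, the raw re.findall result may be the better fit.

```python
import re

# Alternatives: hyphenated words | dotted words with an optional trailing dot
# | runs of punctuation (non-word, non-whitespace chars)
TOKEN_RX = re.compile(r"\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+")

def split_tokens(text):
    """Hypothetical helper: tokenize, then collapse repeated . or - chars."""
    return [re.sub(r"([.-])+", r"\1", tok) for tok in TOKEN_RX.findall(text)]

raw = TOKEN_RX.findall("b.sc....")
print(raw)                          # ['b.sc.', '...']
print(split_tokens("b.sc...."))     # ['b.sc.', '.']
print(split_tokens("really????"))   # ['really', '????']
print(split_tokens("well--known"))  # ['well-known']
```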

Restricting re.findall for number of words in quotation

I'm trying to get only the quotation out of a sentence - but! only if it's one or two words long. So for the sentence
mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
The output should be
lesson
never try
So far I've got
import re
print(re.findall(r'"(.*?)"', mysentence))
Any suggestions how to solve this?
You can try this regex:
"[^"\s]+(?:\s[^"\s]+)?"
The " at the start and end match the quotes beginning and ending the quoted word/phrase. Between them, we first match one word: [^"\s]+. [^"\s] is any character that is not a quote or whitespace. Whitespace is excluded to make sure that this only matches a single word.
The next part is all in an optional group, because the second word is optional. The second word is a space followed by a single word: \s[^"\s]+.
Demo
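A short Python sketch of this pattern on the sentence from the question follows. Since the pattern has no capturing group, re.findall returns the whole match, quotes and trailing punctuation included; the cleanup step at the end is an assumption about what the asker wants, not part of the answer.

```python
import re

mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
rx = r'"[^"\s]+(?:\s[^"\s]+)?"'
quoted = re.findall(rx, mysentence)
print(quoted)
# ['"lesson"', '"never try."']

# Hypothetical cleanup: drop the surrounding quotes and any trailing dot
cleaned = [q.strip('"').rstrip('.') for q in quoted]
print(cleaned)
# ['lesson', 'never try']
```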
You may use
"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"
See the regex demo.
Details
" - a " char
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
(\w+(?:\s+\w+)?) - Group 1:
\w+ - 1+ word chars
(?:\s+\w+)? - an optional sequence of 1+ whitespace chars followed with 1+ word chars
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
" - a " char
Python demo:
import re
rx = r'"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"'
s = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
print( re.findall(rx, s) )
Try this:
"((?:\w+[ .]*){1,2})"
You can easily change the maximum number of words to match by replacing 2 with the number you need.
See the demo.
" - a " char
((?:\w+[ .]*){1,2}) - Group 1:
(?:\w+[ .]*) - non-capturing group
\w+ - sequence of 1+ 'word' chars
[ .]* - 0+ word delimiter chars; in our case, spaces and dots
{1,2} - repeat the non-capturing group from one to two times
" - a " char
As a variant, the word separator can be described as "a sequence of 0+ chars that are neither a word char nor a " char", i.e. [^"\w]*
For example:
"((?:\w+[^"\w]*){1,2})"
See the demo
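Both variants can be compared in Python on the sentence from the question. This is a sketch with illustrative variable names; it uses the [^"\w]* separator described above. Note that, unlike the earlier answer, the captured group keeps the trailing dot because the separator is part of the repeated group.

```python
import re

s = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'

# Separator limited to spaces and dots
by_space_dot = re.findall(r'"((?:\w+[ .]*){1,2})"', s)
# Separator: any chars that are neither word chars nor "
by_non_word = re.findall(r'"((?:\w+[^"\w]*){1,2})"', s)
print(by_space_dot)  # ['lesson', 'never try.']
print(by_non_word)   # ['lesson', 'never try.']

# Raising the upper bound to 3 also lets the three-word quotation through
up_to_three = re.findall(r'"((?:\w+[^"\w]*){1,3})"', s)
print(up_to_three)   # ['tried your best', 'lesson', 'never try.']
```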

extract string using regular expression

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release) #fix_release is None
I want to extract the string 'Ubuntu 16.04'
But the result is None. How can I extract the correct string?
You confused the word boundary \b with whitespace. A word boundary matches the position between a word character and a non-word character and consumes no characters, so it cannot match the space between Ubuntu and 16.04. You can simply use r'Ubuntu \d+\.\d+' for your case:
import re
fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Click for Demo
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: In your regex, you are using \b where the spaces are. \b returns zero-length matches between a word character and a non-word character. You can use \s instead.
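The difference between \b and \s can be checked directly. The sketch below runs the pattern from the question and the one from this answer side by side; the last pattern, with capturing groups and \s+, is a variation added here for illustration and does not come from either answer.

```python
import re

fix_release = 'Ubuntu 16.04 LTS'

# Original attempt: \b is zero-width, so nothing consumes the space
broken = re.search(r'(Ubuntu)\b(\d+[.]\d+)\b', fix_release)
print(broken)  # None

# \s* consumes the whitespace between the name and the version
fixed = re.search(r'Ubuntu\s*\d+(?:\.\d+)?', fix_release)
print(fixed.group(0))  # Ubuntu 16.04

# Variant with capturing groups, if the parts are needed separately
parts = re.search(r'(Ubuntu)\s+(\d+\.\d+)\b', fix_release)
print(parts.groups())  # ('Ubuntu', '16.04')
```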

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string, match it anywhere but at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostrophe is searched for only inside a word (in fact, inside a pair of words where the second one is shortened). In that case, you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed with a word boundary, so the ' must have a word char (a letter, digit or _) on both sides. Yes, the _ char and digits are allowed on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
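The patterns above differ only in what they allow around the apostrophe, which is easiest to see by listing the positions each one matches. This is a sketch: apostrophe_positions is a hypothetical helper, and the inputs "9'5" and "l'été" are extra samples chosen here to show the differences.

```python
import re

def apostrophe_positions(pattern, text):
    """Hypothetical helper: indices of every apostrophe the pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

s = "'didn't'"
print(apostrophe_positions(r"(?!^)'(?!$)", s))  # [5]
print(apostrophe_positions(r"\b'\b", s))        # [5]

# \b'\b also accepts digits and _ around the apostrophe; the letter-only
# lookarounds do not
print(apostrophe_positions(r"\b'\b", "9'5"))                       # [1]
print(apostrophe_positions(r"(?<=[A-Za-z])'(?=[A-Za-z])", "9'5"))  # []

# ASCII-only vs. Unicode-aware letter classes
word = "l'\u00e9t\u00e9"  # l'été
print(apostrophe_positions(r"(?<=[A-Za-z])'(?=[A-Za-z])", word))   # []
print(apostrophe_positions(r"(?<=[^\W\d_])'(?=[^\W\d_])", word))   # [1]
```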
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word char (i.e. [a-zA-Z0-9_] in ASCII mode; in Python 3, \w also matches Unicode letters and digits by default) after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, but it only fetches a letter, as [^\W\d_] matches any char other than a non-word char, a digit or _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.
