I want my function to continue processing only when the input string matches the following pattern:
whitespace_or_beginning_of_line word_from_letters slash word_from_letters whitespace_or_end_of_line
I've tried:
import re
text = "[url=}}{{cz.csob.cebmobile://deeplink?screen=AL03&tab=overview/detail/cards/standing_orders]"
if re.search(r" [a-aZ-Z]/[a-aZ-Z] ", text) or re.search(r"\n[a-aZ-Z]/[a-aZ-Z]\n", text):
...process further (do some logic)
You can use
(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)
In Python:
re.findall(r'(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)', text)
See the regex demo. Details:
(?<!\S) - a left-hand whitespace boundary
[a-zA-Z]+ - one or more ASCII letters
/ - a slash
[a-zA-Z]+ - one or more ASCII letters
(?!\S) - a right-hand whitespace boundary.
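To tie this back to the `if` check in the question, here is a minimal sketch. The helper name `has_letters_slash_letters` and the short sample strings are made up for illustration; `url_text` is the string from the question.

```python
import re

# (?<!\S) and (?!\S) act as whitespace boundaries: each one is satisfied
# either by a whitespace character or by the start/end of the string.
PATTERN = re.compile(r'(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)')

def has_letters_slash_letters(text):
    """Return True if text contains a standalone letters/letters token."""
    return PATTERN.search(text) is not None

url_text = "[url=}}{{cz.csob.cebmobile://deeplink?screen=AL03&tab=overview/detail/cards/standing_orders]"

for sample in ["see foo/bar here", "foo/bar", "a\nfoo/bar\nb", "a/b/c", url_text]:
    if has_letters_slash_letters(sample):
        print("match:   ", repr(sample))   # process further here
    else:
        print("no match:", repr(sample))
```

With these inputs, the first three samples should match, while `a/b/c` and the URL-like string should not, because every letters/letters candidate in them touches another non-whitespace character.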
Related
How can I split "This Is ABC Title" into "This Is", "ABC", "Title" in Python? If I use [A-Z] as the regex, it will be split into "This", "Is", "ABC", "Title". I do not want to split on whitespace.
You can use
re.split(r'\s*\b([A-Z]+)\b\s*', text)
Details:
\s* - zero or more whitespaces
\b - word boundary
([A-Z]+) - Capturing group 1: one or more ASCII uppercase letters
\b - word boundary
\s* - zero or more whitespaces
Note the use of a capturing group, which makes re.split also output the captured substring.
See the Python demo:
import re
text = "This Is ABC Title"
print( re.split(r'\s*\b([A-Z]+)\b\s*', text) )
# => ['This Is', 'ABC', 'Title']
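To illustrate the note about the capturing group, the sketch below runs the same split with and without the parentheses. The variable names `with_group` and `without_group` are only for this illustration.

```python
import re

text = "This Is ABC Title"

# With a capturing group, re.split keeps the matched separator text.
with_group = re.split(r'\s*\b([A-Z]+)\b\s*', text)

# Without the group, the all-uppercase word is treated as a plain
# separator and is dropped from the result.
without_group = re.split(r'\s*\b[A-Z]+\b\s*', text)

print(with_group)     # ['This Is', 'ABC', 'Title']
print(without_group)  # ['This Is', 'Title']
```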
For example: George R.R. Martin
I want to match only George and Martin.
I have tried \w+\b. but it doesn't work!
The \w+\b. pattern matches 1+ word chars that are followed by a word boundary, and then any char, which has to be a non-word char (as \b restricts the following . subpattern). Note that this pattern does not negate anything, and it misses an important point: a literal dot in a regex pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
See the regex demo
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
See more about how negative lookahead works here.
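The answer above is written in JavaScript; the same pattern can be tried in Python as a rough equivalent. The variable names `attempt` and `fixed` are made up for this sketch.

```python
import re

s = "George R.R. Martin"

# The original attempt: the unescaped dot matches any char after the word,
# so it grabs "George " and both "R." initials, and misses "Martin".
attempt = re.findall(r'\w+\b.', s)

# The negative lookahead (?!\.) rejects words directly followed by a dot.
fixed = re.findall(r'\b\w+\b(?!\.)', s)

print(attempt)  # ['George ', 'R.', 'R.']
print(fixed)    # ['George', 'Martin']
```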
I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this using regex in Python?
My effort so far:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split('(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expect:
['UPPERCASE.UPPERCASE. Name.']
The \s in [A-Z]+\. Name should not cause a split.
You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an uppercase letter and a . that is followed by 1+ whitespaces and Name. Note that the + quantifier is used only inside the lookahead, which is zero-width, so the lookbehind itself stays fixed-width and works with Python re. The \s+ in the lookahead is necessary to check for Name after the whitespace that will be matched and consumed by the \s+ pattern below.
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
Using Python with Matthew Barnett's regex module.
I have this string:
The well known *H*rry P*tter*.
I'm using this regex to process the asterisks to obtain <em>H*rry P*tter</em>:
REG = re.compile(r"""
(?<!\p{L}|\p{N}|\\)
\*
([^\*]*?) # I need this part to deal with nested patterns; I really can't omit it
\*
(?!\p{L}|\p{N})
""", re.VERBOSE)
PROBLEM
The problem is that this regex doesn't match this kind of string unless I protect intraword asterisks first (I convert them to decimal entities), which is awfully expensive in documents with lots of asterisks.
QUESTION
Is it possible to tell the negative class to block at internal asterisks only if they are not surrounded by word characters?
I tried these patterns in vain:
([^(?:[^\p{L}|\p{N}]\*[^\p{L}|\p{N}])]*?)
([^(?<!\p{L}\p{N})\*(?!\p{L}\p{N})]*?)
I suggest a single regex replacement for the cases like you mentioned above:
re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'<em>\1</em>', s)
See the regex demo
Details:
\B\*\b - a * that is preceded with a non-word boundary and followed with a word boundary
([^*]*(?:\b\*\b[^*]*)*) - Group 1 capturing:
[^*]* - 0+ chars other than *
(?:\b\*\b[^*]*)* - zero or more sequences of:
\b\*\b - a * enclosed with word boundaries
[^*]* - 0+ chars other than *
\b\*\B - a * that is followed with a non-word boundary and preceded with a word boundary
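The replacement can be checked against the string from the question with the standard `re` module, as a minimal sketch. The second sample string and the names `EM_RE` and `to_em` are made up here. One caveat to keep in mind: `\b` relies on `\w`, which also treats the underscore as a word character, so it is not an exact equivalent of the `\p{L}|\p{N}` checks in the question.

```python
import re

# An opening * must have a non-word char (or string start) on its left and
# a word char on its right; the closing * is the mirror image. Asterisks
# with word chars on both sides are allowed inside the captured text.
EM_RE = re.compile(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B')

def to_em(s):
    """Wrap *...* spans in <em> tags, keeping intraword asterisks."""
    return EM_RE.sub(r'<em>\1</em>', s)

print(to_em("The well known *H*rry P*tter*."))
# The well known <em>H*rry P*tter</em>.

print(to_em("Compute 2*3 here."))
# Compute 2*3 here.
```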
More information on word boundaries and non-word boundaries:
Word boundaries at regular-expressions.info
Difference between \b and \B in regex
What are non-word boundary in regex (\B), compared to word-boundary?
I have a text like
var12.1
a
a
dsa
88
123!!!
secondVar12.1
The lines between var and secondVar may differ (and their number may vary).
How can I extract them with a regexp?
I'm trying something like this, to no avail:
re.findall(r"^var[0-9]+\.[0-9]+[\n.]+^secondVar[0-9]+\.[0-9]+", str, re.MULTILINE)
You can grab it with:
var(\d+(?:(?!var\d).)*?)secondVar
See demo. The re.S (or re.DOTALL) modifier must be used with this regex so that . can match a newline. The text between the delimiters will be in Group 1.
NOTE: The closest match will be found due to the (?:(?!var\d).)*? tempered greedy token (i.e. if there is another var + a digit after var + 1+ digits, then the match will be between the second var and secondVar).
NOTE2: You might want to use \b word boundaries to match only words beginning with the delimiters: \bvar(\d+(?:(?!var\d).)*?)\bsecondVar.
REGEX EXPLANATION
var - match the starting delimiter
( - start of capturing group 1
\d+ - 1+ digits
(?:(?!var\d).)*? - a tempered greedy token that matches any char, 0 or more (but as few as possible) repetitions, that does not start a var + digit char sequence
) - end of capturing group 1
secondVar - match secondVar literally.
IDEONE DEMO
import re
p = re.compile(r'var(\d+(?:(?!var\d).)*?)secondVar', re.DOTALL)
test_str = "var12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1\nvar12.1\na\na\ndsa\n\n88\n123!!!\nsecondVar12.1"
print(p.findall(test_str))
Result for the input string (I doubled it for demo purposes):
['12.1\na\na\ndsa\n\n88\n123!!!\n', '12.1\na\na\ndsa\n\n88\n123!!!\n']
You're looking for the re.DOTALL flag, with a regex like this: var(.*?)secondVar. This regex would capture everything between var and secondVar.
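The difference between the plain lazy pattern and the tempered greedy token only shows up when a second var + digit appears before secondVar. The sketch below compares the two on a made-up input; the tempered pattern is written here with a capturing group so that both calls return the text between the delimiters, and the names `lazy`, `tempered`, and `text` are illustrative.

```python
import re

# Plain lazy match: starts at the first "var" and stops at the first "secondVar".
lazy = re.compile(r'var(.*?)secondVar', re.DOTALL)

# Tempered greedy token: the match may not run across another "var" + digit,
# so it starts at the "var" closest to "secondVar".
tempered = re.compile(r'var(\d+(?:(?!var\d).)*?)secondVar', re.DOTALL)

text = "var1 x\nvar2 y\nsecondVar3"

print(lazy.findall(text))      # ['1 x\nvar2 y\n']
print(tempered.findall(text))  # ['2 y\n']
```

Which of the two is preferable depends on whether a stray opening delimiter should be swallowed into the match or should restart it.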