How can I perform conditional splitting with exceptions in Python?

I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this using regex in Python?
My efforts so far:
import re
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split(r'(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expected (for the last part):
['UPPERCASE.UPPERCASE. Name.']
The whitespace (\s) between [A-Z]+\. and Name should not cause a split.

You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if the current location is immediately preceded by an uppercase letter and a ., and immediately followed by one or more whitespace chars and Name. The variable-width \s+ sits inside the lookahead rather than directly in the lookbehind, which is why this works with Python re, whose lookbehinds must be fixed-width. The \s+ in the lookahead is necessary because the whitespace before Name has not been consumed yet at this point; it is matched and consumed by the \s+ pattern below.
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
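If more than one word should be protected from splitting, the same idea can be extended with an alternation inside the lookahead. The following is only a minimal sketch and not part of the original answer: the helper name split_sentences and its exceptions parameter are hypothetical.

```python
import re

def split_sentences(text, exceptions=("Name",)):
    """Split on whitespace after . ? or !, except between an uppercase
    letter + dot and one of the exception words (hypothetical helper)."""
    if not exceptions:
        # No exception words: plain split after sentence-ending punctuation.
        return re.split(r'(?<=[.?!])\s+', text)
    # Escape each word so it is matched literally inside the lookahead.
    words = "|".join(re.escape(word) for word in exceptions)
    pattern = rf'(?<=[.?!])(?<![A-Z]\.(?=\s+(?:{words})))\s+'
    return re.split(pattern, text)

text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(split_sentences(text))
print(split_sentences("ABC. Smith. Text.", exceptions=("Smith",)))
```

The lookbehind itself still has a fixed width of two characters, so the alternation may contain words of different lengths.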

Related

regex split on uppercase, but ignore titlecase

How can I split This Is ABC Title into This Is, ABC, Title in Python? If I use [A-Z] as the regex, it will be split into This, Is, ABC, Title. I do not want to split on whitespace.
You can use
re.split(r'\s*\b([A-Z]+)\b\s*', text)
Details:
\s* - zero or more whitespaces
\b - word boundary
([A-Z]+) - Capturing group 1: one or more ASCII uppercase letters
\b - word boundary
\s* - zero or more whitespaces
Note that the capturing group makes re.split also return the captured substrings.
See the Python demo:
import re
text = "This Is ABC Title"
print( re.split(r'\s*\b([A-Z]+)\b\s*', text) )
# => ['This Is', 'ABC', 'Title']
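To make the effect of the capturing group concrete, here is a small sketch that compares the same pattern with and without the group. The second input string, "ABC Title", is an assumption added for illustration; it shows that a match at the very start produces a leading empty string, which can be filtered out.

```python
import re

text = "This Is ABC Title"

# With a capturing group, the matched uppercase word is kept in the result.
with_group = re.split(r'\s*\b([A-Z]+)\b\s*', text)
# Without the group, the matched uppercase word is dropped.
without_group = re.split(r'\s*\b[A-Z]+\b\s*', text)

# A match at the start of the string yields a leading empty string;
# the list comprehension removes empty items.
parts = [p for p in re.split(r'\s*\b([A-Z]+)\b\s*', "ABC Title") if p]

print(with_group)     # ['This Is', 'ABC', 'Title']
print(without_group)  # ['This Is', 'Title']
print(parts)          # ['ABC', 'Title']
```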

regex match all after a string with positive lookbehind and input it behind every selection

copyright: hololive hololive_english
character: mori_calliope takanashi_kiara takanashi_kiara_(phoenix)
artist: xu_chin-wen
species:
meta: web
I want to select every word after a key such as character: so that I can put that key (e.g. character:) in front of every selected word:
character:mori_calliope character:takanashi_kiara character:takanashi_kiara_(phoenix)
The closest I have got is
(?<=(\w*):\s*\S*\s.*)(?<=\s)(?=\S)
which works, but it breaks when there is only a single entry after the key (e.g. character: something) or when the entry is empty.
I would be really thankful if someone could help.
You can install the regex module from PyPI (unlike the built-in re, it supports variable-width lookbehinds) and use
regex.sub(r'(?<=(\w+):.*)(?<=\s)(?=\S)', r'\1:', text)
# or
# regex.sub(r'(?<=(\w+:).*)(?<=\s)(?=\S)', r'\1', text)
See the regex demo.
Details:
(?<=(\w+):.*) - a positive lookbehind that matches a location that is immediately preceded with any word (captured into Group 1) followed by a : char and then any zero or more chars other than line break chars as many as possible
(?<=\s) - a positive lookbehind that matches a location that is immediately preceded with a whitespace char
(?=\S) - a positive lookahead that matches a location that is immediately followed with a non-whitespace char.
See the Python demo:
import regex
text = "copyright: hololive hololive_english\ncharacter: mori_calliope takanashi_kiara takanashi_kiara_(phoenix)\nartist: xu_chin-wen\nspecies:\nmeta: web"
print( regex.sub(r'(?<=(\w+):.*)(?<=\s)(?=\S)', r'\1:', text) )
Output:
copyright: copyright:hololive copyright:hololive_english
character: character:mori_calliope character:takanashi_kiara character:takanashi_kiara_(phoenix)
artist: artist:xu_chin-wen
species:
meta: meta:web
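If installing a third-party module is not an option, a similar result can be obtained with the built-in re module by processing each line with a replacement function instead of a variable-width lookbehind. This is a different technique from the one in the answer, shown only as a sketch; the helper name prefix_values is hypothetical, and it assumes each key sits at the start of a line.

```python
import re

def prefix_values(text):
    """Repeat each line's key in front of every value on that line
    (hypothetical helper, standard library only)."""
    def repl(match):
        key, values = match.group(1), match.group(2).split()
        if not values:
            return f"{key}:"  # an empty entry such as "species:" stays unchanged
        return f"{key}: " + " ".join(f"{key}:{value}" for value in values)
    # ^(\w+): is the key at the start of a line; (.*) is the rest of that line.
    return re.sub(r'(?m)^(\w+):(.*)$', repl, text)

text = "copyright: hololive hololive_english\ncharacter: mori_calliope takanashi_kiara takanashi_kiara_(phoenix)\nartist: xu_chin-wen\nspecies:\nmeta: web"
print(prefix_values(text))
```

Unlike the lookbehind version, this sketch normalizes runs of whitespace between values to a single space.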

Testing a string for a regex pattern in Python

I want to process further in my function when an input string matches following regex pattern:
whitespace_or_beginning_of_line word_from_letters slash word_from_letters whitespace_or_end_of_line
I've tried:
import re
text = "[url=}}{{cz.csob.cebmobile://deeplink?screen=AL03&tab=overview/detail/cards/standing_orders]"
if re.search(r" [a-aZ-Z]/[a-aZ-Z] ", text) or re.search(r"\n[a-aZ-Z]/[a-aZ-Z]\n", text):
    ...  # process further (do some logic)
You can use
(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)
In Python:
re.findall(r'(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)', text)
See the regex demo. Details:
(?<!\S) - a left-hand whitespace boundary
[a-zA-Z]+ - one or more ASCII letters
/ - a slash
[a-zA-Z]+ - one or more ASCII letters
(?!\S) - a right-hand whitespace boundary.
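Since the question asks for a yes/no test rather than a list of matches, the same pattern can be used with re.search inside a condition. A minimal sketch follows; the function name has_word_slash_word and the short sample strings are assumptions added for illustration.

```python
import re

# Compile once; the pattern is the one given in the answer above.
WORD_SLASH_WORD = re.compile(r'(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)')

def has_word_slash_word(text):
    """Return True if text contains letters/letters delimited by
    whitespace or by the start/end of the string."""
    return WORD_SLASH_WORD.search(text) is not None

url = "[url=}}{{cz.csob.cebmobile://deeplink?screen=AL03&tab=overview/detail/cards/standing_orders]"
for sample in ("see foo/bar here", "foo/bar", "a/b/c", url):
    if has_word_slash_word(sample):
        print("process further:", sample)
```

Only the first two samples pass the test; a/b/c fails because a second slash follows the match, and the URL string contains no whitespace-delimited word/word token.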

regex to match string before colon till whitespace

I have a sample string from a text file. I want to find every word that comes directly before a colon, going back only as far as the preceding whitespace.
I have written code like this:
import re
text = 'From: mathew <mathew#mantis.co.uk>\nSubject: Alt.Atheism FAQ: Atheist Resources\n\nArchive-name: atheism/resources\nAlt-atheism-archive-name:'
email_data = re.findall("[^\s].*(?=:)", text)
print(email_data)
Output:
['From', 'Subject: Alt.Atheism FAQ', 'Archive-name', 'Alt-atheism-archive-name']
Desired Output:
['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
The code picks up data up to the newline character because of the .* used. I want to restrict the match to start at the preceding whitespace, so I put [^\s], but it's not working. What could I do instead?
You may use
email_data = re.findall(r"\S[^:\s]+(?=:)", text)
See the Python demo and the regex demo.
Details
\S - a non-whitespace char
[^:\s]+ - 1+ chars other than : and whitespace
(?=:) - immediately to the right, there must be a : char (it is not consumed, not added to the match value).
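Alternatively, use the re.IGNORECASE flag with this regex pattern, which only accepts letters and hyphens and requires the colon to be followed by whitespace or the end of the string: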
\b[a-z-]+(?=:(?:\s|$))
https://regex101.com/r/0UHsbo/1
https://ideone.com/oz91bP
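The two patterns give the same result on the sample text but behave differently on other input. The sketch below runs both side by side; the second string, containing a time and a URL, is an assumption added to show where they differ.

```python
import re

text = ('From: mathew <mathew#mantis.co.uk>\nSubject: Alt.Atheism FAQ: Atheist Resources\n\n'
        'Archive-name: atheism/resources\nAlt-atheism-archive-name:')

# Pattern 1: any run of non-colon, non-whitespace chars right before a colon.
first = re.findall(r"\S[^:\s]+(?=:)", text)
# Pattern 2: letters and hyphens only, colon must be followed by whitespace or end of string.
second = re.findall(r"\b[a-z-]+(?=:(?:\s|$))", text, flags=re.IGNORECASE)

# Illustrative input where the patterns disagree.
other = "at 12:30 see http://x"
first_other = re.findall(r"\S[^:\s]+(?=:)", other)
second_other = re.findall(r"\b[a-z-]+(?=:(?:\s|$))", other, flags=re.IGNORECASE)

print(first)         # ['From', 'Subject', 'FAQ', 'Archive-name', 'Alt-atheism-archive-name']
print(second)        # same list
print(first_other)   # ['12', 'http']
print(second_other)  # []
```

The stricter second pattern is the safer choice when the text may contain times or URLs.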

need regex expression to avoid " \n " character

I want to apply a regex to the string below in Python, where I only want to capture Model Number : 123. I tried the regex below, but it didn't give me the result I wanted.
string = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'(?s)Model Number:.*?\n',string)
The output is Model Number : 123\n. How can I avoid the \n at the end of the output?
Remove the DOTALL (?s) inline modifier to avoid matching a newline char with ., add \s* after Number and use .* instead of .*?\n:
r'Model Number\s*:.*'
See the regex demo
Here, Model Number will match a literal substring, \s* will match 0+ whitespaces, : will match a colon and .* will match 0 or more chars other than line break chars.
Python demo:
import re
s = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'Model Number\s*:.*',s)
print(model_number) # => ['Model Number : 123']
If you need to extract just the number use
r'Model Number\s*:\s*(\d+)'
See another regex demo and this Python demo.
Here, (\d+) will capture 1 or more digits and re.findall will only return these digits. Or, use it with re.search and once the match data object is obtained, grab it with match.group(1).
NOTE: If the match is expected at the start of the string, use re.match. To match at the start of any line instead, add ^ at the start of the pattern and use the re.M flag (or add (?m) at the start of the pattern).
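The re.search variant with a line anchor can be sketched as follows. The helper name find_number and its parameters are hypothetical, added only to show the same pattern working for labels on any line.

```python
import re

s = """Model Number : 123
Serial Number : 456"""

def find_number(label, text):
    """Return the digits after 'label :' on a line that starts with label,
    or None if there is no such line (hypothetical helper)."""
    # re.M makes ^ match at the start of every line, not just the string.
    match = re.search(rf'^{re.escape(label)}\s*:\s*(\d+)', text, flags=re.M)
    return match.group(1) if match else None

print(find_number("Model Number", s))   # 123
print(find_number("Serial Number", s))  # 456
print(find_number("Part Number", s))    # None
```

Checking the match object before calling group(1) avoids an AttributeError when the label is missing.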
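You can also use the strip() method. Since re.findall returns a list, apply it to each matched string:
model_number = [m.strip() for m in model_number]
This removes leading and trailing whitespace, including the trailing \n.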
