regex split on uppercase, but ignore titlecase - python

How can I split This Is ABC Title into This Is, ABC, Title in Python? If is use [A-Z] as regex expression it will be split into This, Is, ABC, Title? I do not want to split on whitespace.

You can use
re.split(r'\s*\b([A-Z]+)\b\s*', text)
Details:
\s* - zero or more whitespaces
\b - word boundary
([A-Z]+) - Capturing group 1: one or more ASCII uppercase letters
\b - word boundary([A-Z]+)
\s* - zero or more whitespaces
Note the use of capturing group that makes re.split also output the captured substring.
See the Python demo:
import re
text = "This Is ABC Title"
print( re.split(r'\s*\b([A-Z]+)\b\s*', text) )
# => ['This Is', 'ABC', 'Title']

Related

Regex for for creating an acronym

I'm trying to build a function that will collect an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE
You don't need regex for the problem you have stated. You can just split the words on space, then take the first character and convert it to the upper case, and finally join them all.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
>>> 'DS'
You need to deal with special condition such as word starting with character other than alphabets, with something like if w[0].isalpha()
The another approach using re.sub and negative lookbehind:
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
See Python proof.
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
[\d_] any character of: digits (0-9), '_'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
. any character except \n

Testing a string for a regex pattern in Python

I want to process further in my function when an input string matches following regex pattern:
whitespace_or_beginning_of_line word_from_letters slash
word_from_letters whitespace_or_end_of_line
I've tried:
import re
text = "[url=}}{{cz.csob.cebmobile://deeplink?screen=AL03&tab=overview/detail/cards/standing_orders]"
if re.search(r" [a-aZ-Z]/[a-aZ-Z] ", text) or re.search(r"\n[a-aZ-Z]/[a-aZ-Z]\n", text):
...process further (do some logic)
You can use
(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)
In Python:
re.findall(r'(?<!\S)[a-zA-Z]+/[a-zA-Z]+(?!\S)', text)
See the regex demo. Details:
(?<!\S) - a left-hand whitespace boundary
[a-zA-Z]+ - one or more ASCII letters
/ - a slash
[a-zA-Z]+ - one or more ASCII letters
(?!\S) - a right-hand whitespace boundary.

how can I perform conditional splitting with exceptions in python

I want to split a string into sentences.
But there is some exceptions that I did not expected:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do using regex python
My efforts so far,
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split('(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expect:
['UPPERCASE.UPPERCASE. Name.']
The \s in [A-Z]+\. Name do not split
You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if there is an uppercase letter and a . followed with 1+ whitespaces + Name immediately to the left of the current location (note the + is used in the lookahead, that is why it works with Python re, and \s+ in the lookahead is necessary to check for the Name presence after whitespace that will be matched and consumed with the next \s+ pattern below)
\s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']

Regex to match words followed by whitespace or punctuation

If I have the word india
MATCHES
"india!" "india!" "india." "india"
NON MATCHES "indian" "indiana"
Basically, I want to match the string but not when its contained within another string.
After doing some research, I started with
exp = "(?<!\S)india(?!\S)"
num_matches = len(re.findall(exp))
but that doesn't match the punctuation and I'm not sure where to add that in.
Assuming the objective is to match a given word (e.g., "india") in a string provided the word is neither preceded nor followed by a character that is not in the string " .,?!;" you could use the following regex:
(?<![^ .,?!;])india(?![^ .,?!;\r\n])
Demo
Python's regex engine performs the following operations
(?<! # begin a negative lookbehind
[^ .,?!;] # match 1 char other than those in " .,?!;"
) # end the negative lookbehind
india # match string
(?! # begin a negative lookahead
[^ .,?!;\r\n] # match 1 char other than those in " .,?!;\r\n"
) # end the negative lookahead
Notice that the character class in the negative lookahead contains \r and \n in case india is at the end of a line.
\"india(\W*?)\"
this will catch anything except for numbers and letters
Try this
^india[^a-zA-Z0-9]$
^ - Regex starts with India
[^a-zA-Z0-9] - not a-z, A-Z, 0-9
$ - End Regex
Try with:
r'\bindia\W*\b'
See demo
To ignore case:
re.search(r'\bindia\W*\b', my_string, re.IGNORECASE).group(0)
you may use:
import re
s = "india."
s1 = "indiana"
print(re.search(r'\bindia[.!?]*\b', s))
print(re.search(r'\bindia[.!?]*\b', s1))
output:
<re.Match object; span=(0, 5), match='india'>
None
If you also want to match the punctuation, you could use make use of a negated character class where you could match any char except a word character or a newline.
(?<!\S)india[^\w\r\n]*(?!\S)
(?<!\S) Assert a whitspace bounadry to the left
india Match literally
[^\w\r\n] Match 0+ times any char except a word char or a newline
(?!\S) Assert a whitspace boundary to the right
Regex demo

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
\(?<=-)\w+\g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex \(?!.*-)(?<=-)\w+\g
Result:
YIFI (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern search for a hyphen, followed with 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).

Categories