Split String into alpha & punctuation with exceptions regex - python

I am trying to split a string into two parts: alphanumeric chars and special chars. I want to limit the occurrence of the escape character:
b.sc.... = ['b.sc.','...'] (Preserve "." inside word & outside word just once)
really???? = ['really','????'] (split when any other special char encountered)
I went through a lot of SO questions before posting here. This is what I have come up with so far: re.findall(r"[\w+|\-.+\w]+|\W+", text)
How to proceed further?

You can use
[re.sub(r'([.-])+', r'\1', x) for x in re.findall(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+', text)]
See this regex demo
Details
\w+(?:-+\w+)+ - one or more word chars followed with one or more occurrences of one or more - chars and one or more word chars
| - or
\w+(?:\.+\w+)*\.? - one or more word chars followed with zero or more occurrences of one or more . chars and one or more word chars, and then an optional dot
| - or
[^\w\s]+ - one or more non-word and non-whitespace chars.
The re.sub(r'([.-])+', r'\1', x) part is a post-processing step to replace one or more consecutive . or - chars with a single occurrence.
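
As a rough illustration of the answer above, the sketch below wraps the two steps in a small helper. The name tokenize and its collapse parameter are assumptions made for this example, not part of the original answer. Note that the post-processing step also collapses punctuation-only tokens, so a token such as ... becomes a single dot; passing collapse=False keeps such runs intact.

```python
import re

# Pattern taken from the answer above.
TOKEN_RE = re.compile(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+')

def tokenize(text, collapse=True):
    """Hypothetical helper: split text into word-like and punctuation tokens."""
    tokens = TOKEN_RE.findall(text)
    if collapse:
        # Replace each run of . or - chars with a single char (the last one in the run).
        tokens = [re.sub(r'([.-])+', r'\1', tok) for tok in tokens]
    return tokens

print(tokenize('really????'))                # ['really', '????']
print(tokenize('b.sc....', collapse=False))  # ['b.sc.', '...']
print(tokenize('b..sc well--known'))         # ['b.sc', 'well-known']
```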

Related

Vowels not at the end or start of the words in string

I am trying to find the words in string not starting or ending with letters 'aıoueəiöü'. But regex fails to find words when I use this code:
txt = "Nasa has fixed a problem with malfunctioning equipment on a new rocket designed to take astronauts to the Moon."
re.findall(r"\b[^aıoueəiöü]\w+[^aıoueəiöü]\b",txt)
Instead, it works fine when whitespace character \s is added in negation part:
re.findall(r"\b[^aıoueəiöü\s]\w+[^aıoueəiöü\s]\b",txt)
I cannot understand the issue in first example of code, why should I specify whitespace characters too?
Note that [^aıoueəiöü] matches any char other than a, ı, o, u, e, ə, i, ö and ü. It can match a whitespace, a digit, punctuation, etc.
Also, your regex matches strings of at least three chars; you need to adjust it to match one- and two-char strings, too.
You do not have to rely on excluding whitespace from the pattern. Since you only want to match word chars other than vowels, add \W rather than \s:
\b[^\Waıoueəiöü](?:\w*[^\Waıoueəiöü])?\b
See the regex demo.
Details:
\b - a word boundary
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
(?:\w*[^\Waıoueəiöü])? - an optional occurrence of
\w* - any zero or more word chars
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
\b - a word boundary
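
A minimal sketch of the pattern above applied to the sample text from the question. The variable names are assumptions made for this example. Matching is case-sensitive as written, so uppercase vowels are not excluded, and digits and underscores still count as word chars.

```python
import re

VOWELS = 'aıoueəiöü'  # vowel set from the question

# \W inside the negated class excludes every non-word char,
# so only word chars that are not in VOWELS can match there.
pattern = re.compile(rf'\b[^\W{VOWELS}](?:\w*[^\W{VOWELS}])?\b')

txt = ("Nasa has fixed a problem with malfunctioning equipment "
       "on a new rocket designed to take astronauts to the Moon.")

words = pattern.findall(txt)
print(words)
# ['has', 'fixed', 'problem', 'with', 'malfunctioning', 'new', 'rocket', 'designed', 'Moon']
```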

regex - dont want to tokenize certain part of the input

I have to tokenize a string at every non-word character and at every "_". But I must not split the two words "START_CALL" and "END_CALL", which contain "_".
So far I came up with :
split_tokens = re.split(r'([\W_])', string_to_be_replaced)
But it splits on every underscore (_), so "START_CALL" becomes "START", "_", "CALL".
I could split on "START_CALL" first and then tokenize the sub-strings.
But I would be interested to know whether there is a more elegant way of doing this.
You can use
(\W|_\b|\b_)
([\W_])(?<!\B_\B)
See the regex demo #1 / regex demo #2. Details:
( - start of a capturing group
\W| - any non-word char (a char other than letter, digit, some connector punctuation and most diacritic chars), or
_\b| - an underscore that is not followed with a word char, or
\b_ - an underscore that is not preceded with a word char
) - end of the group.
[\W_](?<!\B_\B) - any non-word char or _ that is not a _ both preceded and followed with word chars.
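
The sketch below runs both patterns through re.split on a made-up input string. The helper name tokens and the sample string are assumptions for illustration only. Because re.split can emit empty strings between adjacent separators, the helper filters them out.

```python
import re

s = "x=START_CALL+_y"  # made-up sample input

# Both patterns come from the answer; the capturing group keeps the separators.
pat_alternation = re.compile(r'(\W|_\b|\b_)')
pat_lookbehind = re.compile(r'([\W_])(?<!\B_\B)')

def tokens(pattern, text):
    """Hypothetical helper: split and drop the empty strings re.split can emit."""
    return [t for t in pattern.split(text) if t]

print(tokens(pat_alternation, s))  # ['x', '=', 'START_CALL', '+', '_', 'y']
print(tokens(pat_lookbehind, s))   # ['x', '=', 'START_CALL', '+', '_', 'y']
```

The underscore inside START_CALL has word chars on both sides, so neither pattern splits on it, while the underscore after + is treated as a separator.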

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am using regex101 to test, and it appears the regex above does quite a bit of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag set (re.IGNORECASE in Python).
Demo
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.* # match 0+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
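
A minimal sketch of how the pattern above might be applied to the sample text from the question. Testing each line separately is an assumption of this example; it keeps ^ and $ anchored to a single sentence without needing the multiline flag.

```python
import re

# Pattern from the answer; IGNORECASE makes "Not" and "not" count as the same word.
sentence_re = re.compile(
    r'^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$',
    re.IGNORECASE,
)

text = """This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in."""

# Keep only the lines that satisfy all three conditions.
matches = [line for line in text.splitlines() if sentence_re.search(line)]
print(matches)
# ['This is a sentence I would like to parse.',
#  'Another sentence that I would be interested in.']
```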
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the " " character and checking word by word. Quicker, no sweat.
Edit
This can be done with a lookahead, as Cary said.
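
The split-and-check approach suggested in this answer could look roughly like the sketch below. The function name is_valid_sentence, the min_words parameter, the set of stripped punctuation and the case-insensitive comparison are all assumptions made for this example.

```python
def is_valid_sentence(line, min_words=5):
    """Hypothetical helper: check the three conditions without a regex."""
    line = line.strip()
    # Condition 3: the sequence must end with a punctuation character.
    if not line or line[-1] not in '.?!':
        return False
    # Split on whitespace and drop punctuation attached to the words.
    words = [w.strip('.,;:?!').lower() for w in line.split()]
    words = [w for w in words if w]
    # Condition 1: enough words. Condition 2: all of them distinct.
    return len(words) >= min_words and len(set(words)) == len(words)

samples = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]
results = [is_valid_sentence(s) for s in samples]
print(results)  # [True, False, False, False, True]
```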

Restricting re.findall for number of words in quotation

I'm trying to get only the quotation out of a sentence - but! only if it's one or two words long. So for the sentence
mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
The output should be
lesson
never try
So far I've got
import re
print(re.findall(r'"(.*?)"', mysentence))
Any suggestions how to solve this?
You can try this regex:
"[^"\s]+(?:\s[^"\s]+)?"
The " at the start and end matches the quotes beginning and ending the quoted word or phrase. Between them we first match one word: [^"\s]+, where [^"\s] is any character that is not a quote or whitespace. I excluded whitespace to make sure that this only matches a single word.
The next part is all in an optional group, because the second word is optional. The second word is a space followed by a single word: \s[^"\s]+.
Demo
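
A short sketch of this pattern with re.findall on the sentence from the question. The matches include the surrounding quotes and any trailing punctuation, so the last step, which strips them, is an assumption added for this example.

```python
import re

mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'

# One word, optionally followed by a single whitespace char and a second word.
rx = r'"[^"\s]+(?:\s[^"\s]+)?"'
quoted = re.findall(rx, mysentence)
print(quoted)   # ['"lesson"', '"never try."']

# Assumption: strip the quotes and trailing punctuation to get the bare words.
cleaned = [q.strip('"').rstrip('.,!?') for q in quoted]
print(cleaned)  # ['lesson', 'never try']
```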
You may use
"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"
See the regex demo.
Details
" - a " char
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
(\w+(?:\s+\w+)?) - Group 1:
\w+ - 1+ word chars
(?:\s+\w+)? - an optional sequence of 1+ whitespace chars followed with 1+ word chars
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
" - a " char
Python demo:
import re
rx = r'"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"'
s = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
print( re.findall(rx, s) )
Try this:
"((?:\w+[ .]*){1,2})"
You can easily change the maximum number of words to match by changing 2 to the number you need.
See the demo.
" - a " char
((?:\w+[ .]*){1,2}) - Group 1:
(?:\w+[ .]*) - non-capturing group
\w+ - sequence of 1+ 'word' chars
[ .]* - optional chars set for word delimiter. In our case spaces and dots.
{1,2} - number of repeating 'from one to two' of non-capturing group
" - a " char
As a variant, the word separator can be described as "a sequence of 0+ chars that are neither a word char nor a " char", like this: [^"\w]*
For example:
"((?:\w+[^"\w]*){1,2})"
See the demo
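
To illustrate how the word limit can be changed, here is a sketch that builds the variant pattern with a separator class excluding word chars and quotes. The helper name quoted_phrases and its max_words parameter are assumptions for this example. The captured text keeps trailing punctuation such as the final dot.

```python
import re

def quoted_phrases(text, max_words=2):
    """Hypothetical helper: quoted phrases of 1 to max_words words."""
    # Separator: any run of chars that are neither word chars nor a quote.
    rx = r'"((?:\w+[^"\w]*){1,%d})"' % max_words
    return re.findall(rx, text)

s = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
print(quoted_phrases(s))               # ['lesson', 'never try.']
print(quoted_phrases(s, max_words=3))  # ['tried your best', 'lesson', 'never try.']
```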

Remove duplicate words in a string using regex

I'm working on my regex skills and I found that some of my strings have duplicate words at the start. I would like to remove the duplicates and keep just one occurrence of the word:
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I still get both server_server in my output:
((.*?))_(?!\D)
How can I reduce the output to a single server_ if there are two or more, and keep it as is if there is only one server_?
The output doesn't have to contain the digits or the part after the ., i.e. .zzz, .xyz etc.
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could back-reference the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "one or more times" quantifier so that it still works if there are more than 2 occurrences:
>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hardest part: just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
