regex - don't want to tokenize certain parts of the input - python

I have to tokenize a string on every non-word character and on "_". However, two words, "START_CALL" and "END_CALL", contain "_" and must not be split.
So far I have come up with:
split_tokens = re.split(r'([\W_])', string_to_be_replaced)
But it splits on every underscore (_), so "START_CALL" becomes "START", "_", "CALL".
I could split on "START_CALL" first and then tokenize the remaining substrings.
But I would be interested to know whether there is a more elegant way of doing this.

You can use
(\W|_\b|\b_)
([\W_])(?<!\B_\B)
See the regex demo #1 / regex demo #2. Details:
( - start of a capturing group
\W| - any non-word char (a char other than letter, digit, some connector punctuation and most diacritic chars), or
_\b| - an underscore that is not followed with a word char, or
\b_ - an underscore that is not preceded with a word char
) - end of the group.
[\W_](?<!\B_\B) - any non-word char or _ that is not a _ both preceded and followed with word chars.
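As a minimal sketch, both patterns can be tried with re.split. The sample string and the helper name tokenize are made up for illustration; the empty strings that re.split emits between two adjacent delimiters are dropped here, which is an extra step not in the answer itself:

```python
import re

# Both patterns from the answer above, each with a capturing group so that
# re.split keeps the delimiters in the result.
PATTERN_ALTERNATION = re.compile(r'(\W|_\b|\b_)')
PATTERN_LOOKBEHIND = re.compile(r'([\W_])(?<!\B_\B)')

def tokenize(text, pattern=PATTERN_ALTERNATION):
    """Split on non-word chars and on underscores that are not inside a word."""
    # re.split yields '' between two adjacent delimiters; drop those.
    return [tok for tok in pattern.split(text) if tok]

sample = "START_CALL a_ b;END_CALL"  # hypothetical input
print(tokenize(sample))
# ['START_CALL', ' ', 'a', '_', ' ', 'b', ';', 'END_CALL']
print(tokenize(sample, PATTERN_LOOKBEHIND) == tokenize(sample))
# True
```

The underscore in START_CALL and END_CALL sits between two word chars, so neither pattern splits there, while the trailing underscore in a_ is still treated as a delimiter.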

Related

Vowels not at the end or start of the words in string

I am trying to find the words in a string that neither start nor end with the letters 'aıoueəiöü'. But the regex fails to find such words when I use this code:
txt = "Nasa has fixed a problem with malfunctioning equipment on a new rocket designed to take astronauts to the Moon."
re.findall(r"\b[^aıoueəiöü]\w+[^aıoueəiöü]\b",txt)
Instead, it works fine when whitespace character \s is added in negation part:
re.findall(r"\b[^aıoueəiöü\s]\w+[^aıoueəiöü\s]\b",txt)
I cannot understand the issue in the first code example: why should I specify whitespace characters too?
Note that [^aıoueəiöü] matches any char other than a, ı, o, u, e, ə, i, ö and ü. It can match a whitespace, a digit, punctuation, etc.
Also, your regex matches strings of at least three chars; you need to adjust it to match one- and two-char strings, too.
You do not have to rely on excluding whitespace from the pattern. Since you only want to match word chars other than vowels, add \W rather than \s:
\b[^\Waıoueəiöü](?:\w*[^\Waıoueəiöü])?\b
See the regex demo.
Details:
\b - a word boundary
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
(?:\w*[^\Waıoueəiöü])? - an optional occurrence of
\w* - any zero or more word chars
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
\b - a word boundary
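A short sketch of the difference, using the sentence from the question. The VOWELS constant is only a convenience introduced here so the set is written once:

```python
import re

txt = ("Nasa has fixed a problem with malfunctioning equipment on a new "
       "rocket designed to take astronauts to the Moon.")

VOWELS = "aıoueəiöü"  # helper constant, not part of the original answer

# The plain negated class also accepts whitespace, digits and punctuation.
print(bool(re.fullmatch(f"[^{VOWELS}]", " ")))     # True
# Adding \W inside the negated class restricts it to word chars only.
print(bool(re.fullmatch(rf"[^\W{VOWELS}]", " ")))  # False

pattern = rf"\b[^\W{VOWELS}](?:\w*[^\W{VOWELS}])?\b"
words = re.findall(pattern, txt)
print(words)
# ['has', 'fixed', 'problem', 'with', 'malfunctioning', 'new', 'rocket', 'designed', 'Moon']
```

Note that the vowel set is lowercase only, so a word starting with an uppercase vowel would still match unless the uppercase letters are added to the set.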

A word starting with t but ends with other than e

I am trying to create a regex that matches a word that starts with t or T and doesn't end with the letter e. I have tried the code below so far, but it's not giving me the desired result. Could anyone show me what exactly is missing here?
my_str = my_file.read()
word = re.findall("[tT].*[^e]$", my_str)
print(word)
You can use
\bt(?:[a-z]*[a-df-z])?\b
\bt[a-z]*\b(?<!e)
Just for completeness, here is a regex to match any word starting with a Cyrillic т and not ending with a Cyrillic е:
\bт[^\W\d_]*\b(?<!е)
See the regex demo #1, regex demo #2 and a Cyrillic regex demo.
If you need a case insensitive matching, add re.I:
re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
And a note on word boundaries: if the words can be glued to _ or digits, use letter boundaries rather than word boundaries:
r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])'
r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)' # Unicode letter boundaries
Regex details
\b - word boundary (start of string or a position immediately after a char other than a digit, letter, underscore)
(?<![a-z]) ((?<![^\W\d_]) is a Unicode aware equivalent) - a negative lookbehind that matches a location that is not immediately preceded with a letter
t - a t letter
(?:[a-z]*[a-df-z])? - an optional non-capturing group matching 0 or more letters and then a letter other than e
\b - word boundary
(?![a-z]) ((?![^\W\d_]) is a Unicode aware equivalent) - a negative lookahead that matches a location that is not immediately followed with a letter.
Also,
\bt[a-z]*\b(?<!e) matches a word boundary, t, and any zero or more lowercase ASCII letters (any ASCII letters with re.I); then a word boundary marks the end of the word, and the negative lookbehind (?<!e) fails the match if the word ends with e
[^\W\d_]* - matches zero or more Unicode letters.
See a Python demo:
import re
text = r't, train => main,teene!'
cyr_text = r'таня тане работе'
print( re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bт[^\W\d_]*\b(?<!е)', cyr_text, re.I) )
# => ['таня']
print( re.findall(r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)', cyr_text, re.I) )
# => ['таня']
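To illustrate the note on letter boundaries, here is a small sketch with a made-up sample where the words are glued to _ or a digit:

```python
import re

text = "my_train 2trains tube"  # hypothetical sample

# \b does not fire between "_" or a digit and "t", since all are word chars.
word_bounded = re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
# Letter boundaries only require that no ASCII letter is adjacent.
letter_bounded = re.findall(r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])', text, re.I)

print(word_bounded)    # => []
print(letter_bounded)  # => ['train', 'trains']
```

In both cases tube is rejected because it ends with e.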
There is also another way of doing it:
re.findall(r"\b[Tt]+[a-zA-Z]*[^Ee\s]\b", my_str)
Maybe:
[\W]([Tt]\w*[^e])[\W]
Any non-word character, followed by the captured word (T or t, some optional word characters, then a character other than e), followed by a non-word character.

Split String into alpha & punctuation with exceptions regex

I am trying to split a string into 2 parts: alphanumeric chars & special chars. I want to limit the occurrence of the escape character.
b.sc.... = ['b.sc.','...'] (Preserve "." inside word & outside word just once)
really???? = ['really','????'] (split when any other special char encountered)
I went through a lot of SO questions before posting here. This is what I have come up with so far: re.findall(r"[\w+|\-.+\w]+|\W+", text)
How to proceed further?
You can use
[re.sub(r'([.-])+', r'\1', x) for x in re.findall(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+', text)]
See this regex demo
Details
\w+(?:-+\w+)+ - one or more word chars followed by one or more occurrences of one or more - chars and one or more word chars
| - or
\w+(?:\.+\w+)*\.? - one or more word chars followed by zero or more occurrences of one or more . chars and one or more word chars, and then an optional dot
| - or
[^\w\s]+ - one or more non-word and non-whitespace chars.
The re.sub(r'([.-])+', r'\1', x) part is a post-processing step to replace one or more consecutive . or - chars with a single occurrence.
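A runnable sketch of the two steps. The function name split_tokens and its collapse parameter are introduced here for illustration only, so the effect of the post-processing step can be seen separately:

```python
import re

TOKEN_RE = re.compile(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+')

def split_tokens(text, collapse=True):
    """Find word-like and punctuation tokens, optionally collapsing . and - runs."""
    tokens = TOKEN_RE.findall(text)
    if collapse:
        # Replace every run of consecutive . or - chars with a single char.
        tokens = [re.sub(r'([.-])+', r'\1', tok) for tok in tokens]
    return tokens

print(split_tokens('really????'))                # ['really', '????']
print(split_tokens('b..sc well--known'))         # ['b.sc', 'well-known']
print(split_tokens('b.sc....', collapse=False))  # ['b.sc.', '...']
print(split_tokens('b.sc....'))                  # ['b.sc.', '.']
```

Note that the post-processing step also collapses a standalone run such as ... to a single dot; skip it for punctuation-only tokens if those runs should be kept.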

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in Python, except numbers that are directly attached to words.
I have succeeded in removing all special characters and numbers, whether attached to words or not. How can I do it in such a way that the attached numbers are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
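A quick check of both patterns against the string from the question (the variable names are chosen here for illustration):

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

two_branch = r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+'
compact = r'\d*[^\W\d_][^\W_]*'

result = re.findall(two_branch, text.lower())
print(result)
# ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']

# The shorter pattern gives the same tokens for this input.
print(re.findall(compact, text.lower()) == result)
# True
```

The digits-only chunk 64664646 is skipped by both patterns because neither can match without at least one letter.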
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: it matches any character that is not a non-word character, i.e. any alphanumeric character or _ (\W is the negation of \w, which matches any alphanumeric character plus _; a common ASCII equivalent is [a-zA-Z0-9_]).
It proves useful here for composing:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
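The two composed classes can be checked one character at a time. The helper names is_alnum and is_alpha below are hypothetical and exist only for this sketch:

```python
import re

def is_alnum(ch):
    # [^\W_]: a word char that is not an underscore
    return re.fullmatch(r'[^\W_]', ch) is not None

def is_alpha(ch):
    # [^\W\d_]: a word char that is neither a digit nor an underscore
    return re.fullmatch(r'[^\W\d_]', ch) is not None

for ch in ['a', '\u00e9', '7', '_', '-']:
    print(repr(ch), is_alnum(ch), is_alpha(ch))
# 'a' True True
# 'é' True True
# '7' True False
# '_' False False
# '-' False False
```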
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example flipping the * and + on the first and last sets to capture string like "win32" and "01ex" equally.

Restricting re.findall for number of words in quotation

I'm trying to get only the quotation out of a sentence - but! only if it's one or two words long. So for the sentence
mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
The output should be
lesson
never try
So far I've got
import re
print(re.findall(r'"(.*?)"', mysentence))
Any suggestions how to solve this?
You can try this regex:
"[^"\s]+(?:\s[^"\s]+)?"
The " at the start and end matches the quotes beginning and ending the quoted word or phrase. Between them we match one word: [^"\s]+. [^"\s] is any character that is not a quote or whitespace. Whitespace is excluded to make sure that this only matches a single word.
The next part is all in an optional group, because the second word is optional. The second word is a space followed by a single word: \s[^"\s]+.
Demo
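A small sketch of this pattern on the sentence from the question. Since the pattern has no capturing group, re.findall returns the matches with their quotes; stripping them afterwards is an extra step added here, not part of the answer:

```python
import re

mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'

quoted = re.findall(r'"[^"\s]+(?:\s[^"\s]+)?"', mysentence)
print(quoted)
# ['"lesson"', '"never try."']

# Remove the surrounding quotes from each match.
stripped = [q.strip('"') for q in quoted]
print(stripped)
# ['lesson', 'never try.']
```

Punctuation inside the quotes, such as the final dot, stays part of the match because [^"\s] accepts any non-whitespace char other than a quote.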
You may use
"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"
See the regex demo.
Details
" - a " char
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
(\w+(?:\s+\w+)?) - Group 1:
\w+ - 1+ word chars
(?:\s+\w+)? - an optional sequence of 1+ whitespace chars followed with 1+ word chars
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
" - a " char
Python demo:
import re
rx = r'"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"'
s = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
print( re.findall(rx, s) )
# => ['lesson', 'never try']
Try this:
"((?:\w+[ .]*){1,2})"
You can easily change the number of words to match by replacing 2 with the desired number.
See the demo.
" - a " char
((?:\w+[ .]*){1,2}) - Group 1:
(?:\w+[ .]*) - non-capturing group
\w+ - sequence of 1+ 'word' chars
[ .]* - an optional run of word-delimiter chars, in our case spaces and dots.
{1,2} - repeat the non-capturing group one to two times
" - a " char
As a variant, word separators can be described as "0+ chars that are neither word chars nor a " char", like this: [^"\w]*
For example:
"((?:\w+[^"\w]*){1,2})"
See the demo
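A sketch of the separator variant with a configurable word limit. The function quoted_phrases and its max_words parameter are assumptions made for this example:

```python
import re

mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'

def quoted_phrases(text, max_words=2):
    """Return quoted phrases containing between 1 and max_words words."""
    # [^"\w]* is the separator: any run of chars that are neither word chars nor quotes.
    pattern = r'"((?:\w+[^"\w]*){1,%d})"' % max_words
    return re.findall(pattern, text)

print(quoted_phrases(mysentence))
# ['lesson', 'never try.']
print(quoted_phrases(mysentence, 3))
# ['tried your best', 'lesson', 'never try.']
```

Because the separator run belongs to the captured group, trailing punctuation such as the dot in never try. is returned with the phrase.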
