Ignore optional suffix without preceding delimiter - python

I would like to capture the first part of a word, ignoring the optional suffix. Both the suffix and preceding text are composed of the same class of characters (that is, there is no delimiter before the suffix).
My first try only captures the first letter:
m = re.search(r'([A-Za-z]+?)(?:Suff)?', 'textSuff')
m.groups()
>>> ('t',)
I want to capture "text" only, but when I make the first group element greedy, it grabs the entire string.
m = re.search(r'([A-Za-z]+)(?:Suff)?', 'textSuff')
m.groups()
>>> ('textSuff',)
Is it feasible without a different character to delimit the suffix?

If your pattern is all constructed from optional patterns, be sure you will get as few characters in return as possible. Thus, there must be at least a boundary. I guess the word boundary \b is a valid way to go here (since you need to match words):
([A-Za-z]+?)(?:Suff)?\b
See demo
IDEONE DEMO:
import re
p = re.compile(r'([A-Za-z]+?)(?:Suff)?\b')
test_str = "textSuff more words tSuff"
print(re.findall(p, test_str))
Outputs:
['text', 'more', 'words', 't']

You need to specify that after everything either the string must end or there must be a non-acceptable character....
m = re.search(r'([A-Za-z]+?)(?:Suff)?(?:[^A-Za-z]|$)'

Related

Need Regex that matches all patterns with format as `{word}{.,#}{word}` with strict matching

So I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and seperate it into [word,',' (or '.','#'), word].
But i am not able to create one that does strict matching for this pattern and ignores everything else.
I used the following regex
r"[\w]+|[.]"
this one is doing well , but it doesnt do strict matching, as in if (,, # or .) characters dont occur in text, it will still give me words, which i dont want.
I would like to have a regex which strictly matches the above pattern and gives me the splits(using re.findall) and if not returns the whole word as it is.
Please Note: word on either side of the {,.#} , both words are not strictly to be present but atleast one should be present
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#,'400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. #viktor's version works as needed with only one problem, It ignores ALL other words during re.findall
eg. ONE TWO THREE #400 with the viktor's regex gives me ['','#','400']
but what was expected was ['ONE','TWO','THREE','#',400]
this can be done with NLTK or spacy, but use of those is a limitation.
I suggest using
(\w+)?([.,#])((?(1)\w*|\w+))
See the regex demo.
Details
(\w+)? - An optional group #1: one or more word chars
([.,#]) - Group #2: ., , or #
((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).
See the Python demo:
import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
print(s, ' -> ', pattern.findall(s))
Output:
no.16 -> [('no', '.', '16')]
#400 -> [('', '#', '400')]
word1.word2 -> [('word1', '.', 'word2')]
word -> []
123 -> []
The answer to your edit is
if re.search(r'\w[.,#]|[.,#]\w', text):
print( re.findall(r'[.,#]|[^\s.,#]+', text) )
If there is a word char, then any of the three punctuation symbols, and then a word char again in the input string, you can find and extract all occurrences of the [.,#]|[^\s.,#]+ pattern, namely a ., , or #, or one or more occurrences of any one or more chars other than whitespace, ., , and #.
I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:
a='no.16'
b='#400'
c='word1.word2'
lst=[a, b, c]
for elem in lst:
result= re.split('(\.|#|,)',elem)
while('' in result):
result.remove('')
print(result)
You could do something like this:
import re
str = "no.16"
pattern = re.compile(r"(\w+)([.|#])(\w+)")
result = list(filter(None, pattern.split(str)))
The list(filter(...)) part is needed to remove the empty strings that split returns (see Python - re.split: extra empty strings that the beginning and end list).
However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.

How to match and replace this pattern in Python RE?

s = "[abc]abx[abc]b"
s = re.sub("\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string, s, I want to match 'abc' when it's enclosed in [], and followed by a 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above, it matches:
match anything starting with a '[', followed by any number of characters which is not ']', then followed by the character 'a'.
However, in the replacement, I want the string to be like:
[ABC]abx[abc]b . // NOT ABCbx[abc]b
Namely, I don't want the whole matched pattern to be replaced, but only anything with the bracket []. How to achieve that?
match.group(1) will return the content in []. But how to take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub("\[([^\]]*)\]a", "[ABC]a", s)
There exist more than 1 method, one of them is exploting groups.
import re
s = "[abc]abx[abc]b"
out = re.sub('(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in brackets) in first argument of re.sub, then I refer to 1st and 3rd (note indexing starts at 1) so they remain unchanged, instead of 2nd group I put ABC. Second argument of re.sub is raw string, so I do not need to escape \.
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
stuff, word = m.groups()
return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and you want to exclude them or etc. you need to remember that most of the regex engines (most of them use traditional NFA) analyze the strings by characters. And here since you want to exclude two word and want to use a negative lookahead you need to define the allowed strings as words (using word boundary) and since in sub it replaces the matched patterns with it's replace string you can't just pass the _ because in that case it will replace a part like I am with 3 underscore (I, ' ', 'am' ). So you can use a function to pass as the second argument of sub and multiply the _ with length of matched string to be replace.

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
I explain a little bit based on #eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
print re.split(r'(?<=\D),\s*|\s*,(?=\D)',d)
re.split(pattern, string) will split string by the occurrences of regex pattern.
(plz read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two part: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can be appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbebind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace and followed by non-digit characters. Similarly, (?<=\D),\s* matches a , that is preceded by non-digit characters and followed by zero or more whitespace. The whole regex will find , that satisfy either case, which is equivalent to your requirement: ',' with only two non-numeric characters on either side.
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

Match single quotes from python re

How to match the following i want all the names with in the single quotes
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How to extract the name within the single quotes only
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve, this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"

Categories