If I have the word india
MATCHES
"india!" "india!" "india." "india"
NON MATCHES "indian" "indiana"
Basically, I want to match the string but not when it's contained within another string.
After doing some research, I started with
exp = r"(?<!\S)india(?!\S)"
num_matches = len(re.findall(exp, text))  # text is the string being searched
but that doesn't match the punctuation and I'm not sure where to add that in.
Assuming the objective is to match a given word (e.g., "india") in a string provided the word is neither preceded nor followed by a character that is not in the string " .,?!;" you could use the following regex:
(?<![^ .,?!;])india(?![^ .,?!;\r\n])
Python's regex engine performs the following operations:
(?<! # begin a negative lookbehind
[^ .,?!;] # match 1 char other than those in " .,?!;"
) # end the negative lookbehind
india # match string
(?! # begin a negative lookahead
[^ .,?!;\r\n] # match 1 char other than those in " .,?!;\r\n"
) # end the negative lookahead
Notice that the character class in the negative lookahead contains \r and \n in case india is at the end of a line.
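As a rough illustration of how this pattern could be used to count matches (the helper name count_word and the sample strings are made up for this sketch, not taken from the question):

```python
import re

# Pattern from the answer above: "india" must not be preceded or followed by
# any character outside the allowed set " .,?!;" (plus \r and \n on the right).
PATTERN = re.compile(r"(?<![^ .,?!;])india(?![^ .,?!;\r\n])")

def count_word(text):
    """Return how many standalone occurrences of 'india' appear in text."""
    return len(PATTERN.findall(text))

# Hypothetical sample strings, chosen for illustration only.
samples = ["india!", "india.", "india", "indian", "indiana", "in india, today"]
counts = {s: count_word(s) for s in samples}
print(counts)
```

The first three samples and the last one each count as one match, while "indian" and "indiana" count as zero.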
\"india(\W*?)\"
This captures anything after india that is not a letter, digit, or underscore.
Try this
^india[^a-zA-Z0-9]$
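^ - anchors the match at the start of the string, so it must begin with india
[^a-zA-Z0-9] - exactly one character that is not a-z, A-Z, or 0-9
$ - anchors the match at the end of the string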
Try with:
r'\bindia\W*\b'
To ignore case:
re.search(r'\bindia\W*\b', my_string, re.IGNORECASE).group(0)
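Note that re.search returns None when nothing matches, so calling .group(0) directly would raise AttributeError in that case. A minimal sketch of a guarded version, assuming a hypothetical helper named find_india:

```python
import re

def find_india(my_string):
    """Return the matched text, or None when the word is not found."""
    match = re.search(r'\bindia\W*\b', my_string, re.IGNORECASE)
    # Guard against a failed search before reading the match.
    return match.group(0) if match else None

print(find_india("India!"))   # India
print(find_india("indiana"))  # None
```

In "India!" the trailing "!" is not part of the result, because the final \b can only be satisfied right after the last letter when no word character follows the punctuation.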
you may use:
import re
s = "india."
s1 = "indiana"
print(re.search(r'\bindia[.!?]*\b', s))
print(re.search(r'\bindia[.!?]*\b', s1))
output:
<re.Match object; span=(0, 5), match='india'>
None
If you also want to match the punctuation, you could make use of a negated character class that matches any char except a word character or a newline.
(?<!\S)india[^\w\r\n]*(?!\S)
(?<!\S) Assert a whitespace boundary to the left
india Match literally
[^\w\r\n]* Match 0+ times any char except a word char or a newline
(?!\S) Assert a whitespace boundary to the right
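A small sketch of this pattern in use, under the assumption that you want the matched text including its trailing punctuation (find_word and the sample strings are illustrative names only):

```python
import re

# Whitespace boundary on both sides, with optional trailing punctuation.
PATTERN = re.compile(r"(?<!\S)india[^\w\r\n]*(?!\S)")

def find_word(text):
    """Return the matched text (with punctuation) or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

# Hypothetical inputs for illustration.
results = {s: find_word(s) for s in
           ["india!", "india.", "india", "indian", "indiana", "visit india! now"]}
print(results)
```

Here "india!" and "india." are returned with their punctuation, "indian" and "indiana" give None, and in "visit india! now" the match stops after the "!" because the right-hand assertion needs whitespace or the end of the string next.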
I'm trying to build a function that will collect an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE
You don't need regex for the problem you have stated. You can just split the words on space, then take the first character and convert it to the upper case, and finally join them all.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You also need to handle special cases, such as a word that starts with a non-alphabetic character, with something like if w[0].isalpha().
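One way this could look with the isalpha() check folded in. This is a sketch that assumes words starting with a non-letter should simply be skipped; the function name acronym is made up here:

```python
def acronym(phrase):
    """Build an acronym from the first letter of each alphabetic word."""
    # split() with no argument splits on any run of whitespace, so it
    # never yields empty strings and w[0] is always safe to read.
    return "".join(w[0].upper() for w in phrase.split() if w[0].isalpha())

print(acronym("Data Science"))      # DS
print(acronym("3D data  science"))  # DS ("3D" is skipped)
```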
Another approach uses re.sub and a negative lookbehind:
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
\b          the boundary between a word char (\w) and something that is not a word char
(?!         look ahead to see if there is not:
  [\d_]     any character of: digits (0-9), '_'
)           end of look-ahead
(           group and capture to \1:
  \w        word characters (a-z, A-Z, 0-9, _)
)           end of \1
|           OR
.           any character (including \n here, since re.DOTALL is set)
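To see why the call in the question duplicated the text, and how a function replacement avoids it, here is a short sketch (the helper name to_acronym is only for illustration):

```python
import re

some_words = "Data Science"

# The second argument of re.sub is the replacement for EACH match, so passing
# the whole upper-cased string inserts it once per matched first letter.
duplicated = re.sub(r"(\b\w)", some_words.upper(), some_words)

def to_acronym(text):
    """Keep word-initial letters (upper-cased) and drop everything else."""
    pattern = r"\b(?![\d_])(\w)|."
    return re.sub(
        pattern,
        # group(1) is set only when the first alternative matched.
        lambda m: m.group(1).upper() if m.group(1) else "",
        text,
        flags=re.DOTALL,
    )

print(duplicated)              # DATA SCIENCEata DATA SCIENCEcience
print(to_acronym(some_words))  # DS
```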
I have the regex (?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)(?!\w).
Given the string #first#nope #second#Hello #my-friend, email# whats.up#example.com #friend, what can I do to exclude the strings #first and #second, since they are not whole words on their own?
In other words, exclude them since they are immediately followed by #.
You can use
(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])
(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])
Details:
(?<![a-zA-Z0-9_.-]) - a negative lookbehind that matches a location that is not immediately preceded with ASCII digits, letters, _, . and -
# - a # char
(?=([A-Za-z]+[A-Za-z0-9_-]*)) - a positive lookahead with a capturing group inside that captures one or more ASCII letters and then zero or more ASCII letters, digits, - or _ chars
\1 - the Group 1 value (backreferences are atomic, no backtracking is allowed through them)
(?![#\w]) - a negative lookahead that fails the match if there is a word char (letter, digit or _) or a # char immediately to the right of the current location.
Note that I put the hyphens at the end of the character classes; this is best practice.
The (?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w]) alternative uses shorthand character classes and the (?a) inline modifier (the equivalent of re.ASCII / re.A), which makes \w match only ASCII chars (as in the original version). Remove (?a) if you plan to match any Unicode digits/letters.
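A quick sketch of the first pattern applied to the string from the question; the variable names are illustrative:

```python
import re

# Pattern from the answer above: the capture happens inside a lookahead and
# is then consumed through the \1 backreference.
HASHTAG = re.compile(
    r"(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])"
)

text = "#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend"

# findall returns the contents of group 1 (the name without the "#").
names = HASHTAG.findall(text)
# group(0) of each match is the full hashtag, including the "#".
hashtags = [m.group(0) for m in HASHTAG.finditer(text)]

print(names)     # ['my-friend', 'friend']
print(hashtags)  # ['#my-friend', '#friend']
```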
Another option is to assert a whitespace boundary to the left, and assert no word char or # sign to the right.
(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])
The pattern matches:
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left
# Match literally
([A-Za-z]+[\w-]+) Capture group1, match 1+ chars A-Za-z and then 1+ word chars or -
(?![#\w]) Negative lookahead, assert not # or word char to the right
Or match a non word boundary \B before the # instead of a lookbehind.
\B#([A-Za-z]+[\w-]+)(?![#\w])
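The two variants in this answer give the same result on the question's string, but they are not interchangeable in general. A small comparison sketch (the extra input "see.#tag" is a made-up example):

```python
import re

text = "#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend"

lookbehind_version = r"(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])"
non_boundary_version = r"\B#([A-Za-z]+[\w-]+)(?![#\w])"

tags_a = re.findall(lookbehind_version, text)
tags_b = re.findall(non_boundary_version, text)
print(tags_a, tags_b)

# The variants differ when "#" follows punctuation: \B only needs a non-word
# character (or the start of the string) before "#", while (?<!\S) needs
# whitespace (or the start of the string).
after_dot_a = re.findall(lookbehind_version, "see.#tag")
after_dot_b = re.findall(non_boundary_version, "see.#tag")
print(after_dot_a, after_dot_b)
```

Which one fits depends on whether a hashtag glued to preceding punctuation should count as a match.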
Python 3.8.2
The task at hand is simple: match lowercase characters separated by a single underscore. So the pattern could be r"[a-z]+_[a-z]+".
Now my issue is that I expected re.findall() to pair up all the following:
"ash_tonic_transit_so_kern_err_looo_"
Instead of pairing all the words around each underscore ('ash_tonic', 'tonic_transit', 'transit_so', etc.), I get three pairs: ['ash_tonic', 'transit_so', 'kern_err']
Does Python's re skip part of the string once a match has been found instead of running the search again?
import re
def match_lower(s):
    patternRegex = re.compile(r'[a-z]+_[a-z]+')
    mo = patternRegex.findall(s)
    return mo
print(match_lower('ash_tonic_transit_so_kern_err_looo_'))
You could use a positive lookahead with a capturing group to get the matches, and start the match asserting what is directly to the left is not a char a-z using a negative lookbehind.
Use re.findall which will return the values from the capturing group.
(?<![a-z])(?=([a-z]+_[a-z]+))
Explanation
(?<![a-z]) Negative lookbehind, assert what is directly to the left is not a char a-z
(?= Positive lookahead, assert what is on the right is
([a-z]+_[a-z]+) Capture group 1, match 1+ chars a-z, then _, then 1+ chars a-z
) Close lookahead
import re
regex = r"(?<![a-z])(?=([a-z]+_[a-z]+))"
test_str = "ash_tonic_transit_so_kern_err_looo_"
print(re.findall(regex, test_str))
Output
['ash_tonic', 'tonic_transit', 'transit_so', 'so_kern', 'kern_err', 'err_looo']
This is explicitly mentioned in the documentation of re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings.
For instance, 'ash_tonic' and 'tonic_transit' overlap, so they won't be considered two distinct matches.
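The non-overlapping behaviour can be made visible by printing where each match starts and ends. A minimal sketch using re.finditer with the pattern from the question:

```python
import re

text = "ash_tonic_transit_so_kern_err_looo_"
pattern = re.compile(r"[a-z]+_[a-z]+")

# Each match consumes its characters; the scan for the next match resumes
# at the end of the previous one, so "tonic" cannot be reused.
spans = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
for matched, start, end in spans:
    print(matched, start, end)
```

The first match ends at index 9, right before the underscore that follows "tonic", so the next match can only begin at "transit".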
I am trying to create a regex that matches words that start with t or T and don't end with the letter e. I tried the code below, but it's not giving me the desired result. Could anyone show me what exactly is missing here?
my_str = my_file.read()
word = re.findall("[tT].*[^e]$", my_str)
print(word)
You can use
\bt(?:[a-z]*[a-df-z])?\b
\bt[a-z]*\b(?<!e)
Just for completeness, here is a regex to match any word starting with a Cyrillic т and not ending with a Cyrillic е:
\bт[^\W\d_]*\b(?<!е)
If you need case-insensitive matching, add re.I:
re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
And a note on word boundaries: if the words can be glued to _ or digits, use letter boundaries rather than word boundaries:
r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])'
r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)' # Unicode letter boundaries
Regex details
\b - word boundary (start of string or a position immediately after a char other than a digit, letter, underscore)
(?<![a-z]) ((?<![^\W\d_]) is a Unicode aware equivalent) - a negative lookbehind that matches a location that is not immediately preceded with a letter
t - a t letter
(?:[a-z]*[a-df-z])? - an optional non-capturing group matching 0 or more letters and then a letter other than e
\b - word boundary
(?![a-z]) ((?![^\W\d_]) is a Unicode aware equivalent) - a negative lookahead that matches a location that is not immediately followed with a letter.
Also,
\bt[a-z]*\b(?<!e) matches a word boundary, t, any zero or more lowercase ASCII letters (any ASCII letters with re.I), then a word boundary marks the end of a word and the negative lookbehind (?<!e) fails the match if there is e at the end of the word
[^\W\d_]* - matches zero or more Unicode letters.
See a Python demo:
import re
text = r't, train => main,teene!'
cyr_text = r'таня тане работе'
print( re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bт[^\W\d_]*\b(?<!е)', cyr_text, re.I) )
# => ['таня']
print( re.findall(r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)', cyr_text, re.I) )
# => ['таня']
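For contrast, a short sketch of how the pattern from the question behaves on the same sample text, next to the word-level version (variable names are illustrative):

```python
import re

text = "t, train => main,teene!"

# The question's pattern: ".*" is greedy and "$" anchors at the end of the
# string, so it yields one long match instead of individual words.
original = re.findall(r"[tT].*[^e]$", text)

# Word-level version from the answer above.
words = re.findall(r"\bt(?:[a-z]*[a-df-z])?\b", text, re.I)

print(original)  # ['t, train => main,teene!']
print(words)     # ['t', 'train']
```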
There is also another way of doing it:
re.findall(r"\b[Tt]+[a-zA-Z]*[^Ee\s]\b", my_str)
Maybe:
[\W]([Tt]\w*[^e])[\W]
A non-word character, followed by a captured group (t or T, optional word characters, then a character other than e), followed by another non-word character.
I'm extracting the paragraph of text that follows a heading like "OBSERVATION #1" or "OBSERVATION #2" in the output of a library like PyPDF2.
However, the extraction contains some errors, so a heading can look like "OBSERVA'TION #2", and I have to avoid matching things like "Suite #300". So the rule is: if the heading contains letters, they are capitals.
Currently the Python code snippet looks like this:
inspection_observation = pdfFile.getPage(z).extractText()
if 'OBSERVATION' in inspection_observation:
    for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
        # print(inspection_observation)
        print(finding)
Please advise on an appropriate regular expression for this case.
If there should be a capital and the word can contain a ', you could use a character class where you can list the characters that are allowed and a positive lookahead.
Then you can capture the content between those capital words and use a positive lookahead to check if what follows is another capital word followed by # and 1+ digits or the end of the string. This regex makes use of re.DOTALL where the dot matches a newline.
(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
Explanation
(?=[A-Z']*[A-Z]) Positive lookahead, assert that what follows contains at least one char A-Z, where a ' can occur before it
[A-Z']+\s+#\d+ Match 1+ times A-Z or ', 1+ whitespace chars, # and 1+ digits
( Capture group
.*? Match any char, as few times as possible
(?= Positive lookahead, assert that what follows is
[A-Z']*[A-Z][A-Z']* Match an uppercase char A-Z where a ' can occur before and after
\s+#\d+ Match 1+ whitespace chars, # and 1+ digits
|$ Or assert the end of the string
) Close lookahead
) Close capture group
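A small sketch of how this pattern might be applied. The sample text below is invented for illustration and stands in for the PDF output; stripping the surrounding whitespace is an extra step not in the answer:

```python
import re

HEADING_BLOCK = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,  # let "." cross line breaks inside a paragraph
)

# Hypothetical extracted text, including a garbled heading and "Suite #300".
sample = (
    "OBSERVATION #1 The floor was wet. Suite #300 was locked. "
    "OBSERVA'TION #2 No labels found."
)

# findall returns group 1, i.e. the text between one heading and the next.
findings = [f.strip() for f in HEADING_BLOCK.findall(sample)]
for finding in findings:
    print(finding)
```

"Suite #300" stays inside the first finding because "Suite" contains lowercase letters, so it does not satisfy the all-capitals heading part of the lookahead.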