I'm trying to write a function that builds an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
import re
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE
You don't need regex for the problem you have stated. You can just split the words on space, then take the first character and convert it to the upper case, and finally join them all.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You also need to handle special cases, such as a word that starts with a non-alphabetic character, with something like if w[0].isalpha()
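A minimal sketch of that guard, assuming words are separated by whitespace. The function name `acronym` is hypothetical, and `split()` without an argument is swapped in for `split(' ')` so that repeated spaces do not produce empty strings (for which `w[0]` would raise an IndexError).

```python
def acronym(text):
    """Build an acronym from the first letter of each word (no regex)."""
    # split() with no argument splits on any run of whitespace,
    # so there are never empty words and w[0] is always safe.
    return ''.join(w[0].upper() for w in text.split() if w[0].isalpha())


print(acronym('Data Science'))                # DS
print(acronym('the 3rd data  science _lab'))  # TDS
```

Words that start with a digit or an underscore are skipped here; whether that is the right behaviour depends on the data.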
Another approach uses re.sub with a negative lookbehind:
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
[\d_] any character of: digits (0-9), '_'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
. any character (including \n here, because re.DOTALL is set)
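For comparison, a short sketch of why the call in the question duplicated the text: with a string replacement, re.sub substitutes the whole replacement string for every match. The last part is an alternative using re.findall that does not come from the answers above; it assumes that only letters count as word starts, like the (?![\d_]) check does. The variable names are illustrative.

```python
import re

some_words = 'Data Science'

# With a string replacement, re.sub substitutes that whole string for
# every match, so each first letter becomes 'DATA SCIENCE'.
duplicated = re.sub(r'(\b\w)', some_words.upper(), some_words)
print(duplicated)  # DATA SCIENCEata DATA SCIENCEcience

# With a callable replacement, each match is handled individually:
# captured first letters are upper-cased, everything else is dropped.
pattern = r'\b(?![\d_])(\w)|.'
acronym = re.sub(pattern,
                 lambda m: m.group(1).upper() if m.group(1) else '',
                 some_words, flags=re.DOTALL)
print(acronym)  # DS

# Alternative sketch: collect the first letters directly with findall.
# [^\W\d_] matches a word character that is neither a digit nor '_'.
first_letters = re.findall(r'\b[^\W\d_]', 'data 2nd _x science')
print(''.join(first_letters).upper())  # DS
```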
How do I make a regex in python that returns a string with all underscores between lowercase letters?
For example, it should detect and return: 'aa_bb_cc' , 'swd_qq' , 'hello_there_friend'
But it should not return these: 'aA_bb' , 'aa_' , '_ddQ' , 'aa_baa_2cs'
My code is ([a-z]+_[a-z]+)+ , but it returns only one underscore. It should return all underscores separated by lowercase letters.
For example, when I pass the string "aab_cbbbc_vv", it returns only 'aab_cbbbc' instead of 'aab_cbbbc_vv'
Thank you
Your regex is almost correct. If you change it to:
^([a-z]+)(_[a-z]+)+$
It works.
^ - matches the beginning of the string
$ - the end of the string
You need these so that you are not getting partial matches when matching the strings you don't want to get matched.
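A quick way to check the anchored pattern against the examples from the question. This is only a sketch; the list names `accepted` and `rejected` are made up for illustration.

```python
import re

pattern = re.compile(r'^([a-z]+)(_[a-z]+)+$')

accepted = ['aa_bb_cc', 'swd_qq', 'hello_there_friend', 'aab_cbbbc_vv']
rejected = ['aA_bb', 'aa_', '_ddQ', 'aa_baa_2cs']

for s in accepted + rejected:
    # ^ and $ force the whole string to fit the pattern,
    # so partial matches such as 'aa_baa' in 'aa_baa_2cs' are ruled out.
    print(s, bool(pattern.match(s)))
```

Every string in `accepted` prints True and every string in `rejected` prints False.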
Try this code:
import re
s = "aa_bb_cc swd_qq hello_there_friend aA_bb aa_ _ddQ aa_baa_2cs"
print(re.findall(r"[a-z][a-z_]+\_[a-z]+",s))
The output should be:
['aa_bb_cc', 'swd_qq', 'hello_there_friend', 'aa_baa']
The reason that you get only results with 1 underscore for your example data is that ([a-z]+_[a-z]+)+ repeats a match of [a-z]+, then an underscore and then again [a-z]+
That fully matches, for example, a_b or a_bc_d, but only partially matches a_b_c, because every iteration has to start with at least one char a-z before its _.
You could update your pattern to:
\b[a-z]+(?:_[a-z]+)+\b
Explanation
\b A word boundary
[a-z]+ Match 1+ chars a-z
(?:_[a-z]+)+ Repeat 1+ times matching _ and 1+ chars a-z
\b A word boundary
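A small sketch that compares the original pattern with the updated one on a single string. The sample text is the one used in another answer in this thread; the variable names are illustrative.

```python
import re

text = "aa_bb_cc swd_qq hello_there_friend aA_bb aa_ _ddQ aa_baa_2cs"

# Updated pattern: lowercase runs joined by underscores, delimited by
# word boundaries so that no partial match is returned.
fixed = re.findall(r'\b[a-z]+(?:_[a-z]+)+\b', text)
print(fixed)  # ['aa_bb_cc', 'swd_qq', 'hello_there_friend']

# Original pattern: each repetition needs letters on both sides of the
# underscore, so 'aab_cbbbc_vv' is only matched up to 'aab_cbbbc'.
partial = re.search(r'([a-z]+_[a-z]+)+', 'aab_cbbbc_vv').group(0)
print(partial)  # aab_cbbbc
```

Because _ is itself a word character, \b does not sit between a letter and an underscore, which is why 'aa_baa_2cs' yields no match at all.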
I am trying to create a regex that matches words that start with t or T and don't end with the letter e. I tried the code below, but it's not giving me the desired result. Could anyone show me what exactly is missing here?
my_str = my_file.read()
word = re.findall("[tT].*[^e]$", my_str)
print(word)
You can use
\bt(?:[a-z]*[a-df-z])?\b
\bt[a-z]*\b(?<!e)
Just for completeness, here is a regex to match any word starting with a Cyrillic т and not ending with a Cyrillic е:
\bт[^\W\d_]*\b(?<!е)
If you need case-insensitive matching, add re.I:
re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
And a note on word boundaries: if the words can be glued to _ or digits, use letter boundaries rather than word boundaries:
r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])'
r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)' # Unicode letter boundaries
Regex details
\b - word boundary (start of string or a position immediately after a char other than a digit, letter, underscore)
(?<![a-z]) ((?<![^\W\d_]) is a Unicode aware equivalent) - a negative lookbehind that matches a location that is not immediately preceded with a letter
t - a t letter
(?:[a-z]*[a-df-z])? - an optional non-capturing group matching 0 or more letters and then a letter other than e
\b - word boundary
(?![a-z]) ((?![^\W\d_]) is a Unicode aware equivalent) - a negative lookahead that matches a location that is not immediately followed with a letter.
Also,
\bt[a-z]*\b(?<!e) matches a word boundary, t, then zero or more lowercase ASCII letters (any ASCII letters with re.I); the second word boundary marks the end of the word, and the negative lookbehind (?<!e) fails the match if the word ends with e
[^\W\d_]* - matches zero or more Unicode letters.
See a Python demo:
import re
text = r't, train => main,teene!'
cyr_text = r'таня тане работе'
print( re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bт[^\W\d_]*\b(?<!е)', cyr_text, re.I) )
# => ['таня']
print( re.findall(r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)', cyr_text, re.I) )
# => ['таня']
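To illustrate the note on letter boundaries, here is a sketch with words glued to _ or a digit. The sample string is made up for this example.

```python
import re

glued = 'my_train 2two tone'

# \b treats '_' and digits as word characters, so there is no word
# boundary in front of the t in 'my_train' or '2two'.
word_bounded = re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', glued, re.I)
print(word_bounded)  # []

# Letter boundaries only look at the neighbouring letters.
letter_bounded = re.findall(
    r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])', glued, re.I)
print(letter_bounded)  # ['train', 'two']
```

In both cases 'tone' is rejected because it ends with e.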
There is also another way of doing it:
re.findall(r"\b[Tt]+[a-zA-Z]*[^Ee\s]\b", my_str)
Maybe:
[\W]([Tt]\w*[^e])[\W]
Any non word character followed by (capture: Tt, some optional word characters, not e) followed by first non word character
If I have the word india
MATCHES
"india!" "india!" "india." "india"
NON MATCHES "indian" "indiana"
Basically, I want to match the string but not when it's contained within another string.
After doing some research, I started with
exp = r"(?<!\S)india(?!\S)"
num_matches = len(re.findall(exp, my_string))  # my_string: the text being searched
but that doesn't match the punctuation and I'm not sure where to add that in.
Assuming the objective is to match a given word (e.g., "india") in a string provided the word is neither preceded nor followed by a character that is not in the string " .,?!;" you could use the following regex:
(?<![^ .,?!;])india(?![^ .,?!;\r\n])
Python's regex engine performs the following operations
(?<! # begin a negative lookbehind
[^ .,?!;] # match 1 char other than those in " .,?!;"
) # end the negative lookbehind
india # match string
(?! # begin a negative lookahead
[^ .,?!;\r\n] # match 1 char other than those in " .,?!;\r\n"
) # end the negative lookahead
Notice that the character class in the negative lookahead contains \r and \n in case india is at the end of a line.
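A short sketch that applies this lookaround pattern to the examples from the question and counts matches in a longer string. The sample sentence and variable names are made up for illustration.

```python
import re

pattern = re.compile(r'(?<![^ .,?!;])india(?![^ .,?!;\r\n])')

# Examples from the question: the first three should match.
for word in ['india!', 'india.', 'india', 'indian', 'indiana']:
    print(word, bool(pattern.search(word)))

# Counting occurrences in running text (the search is case-sensitive).
sentence = 'india! indiana, india. Indian india'
num_matches = len(pattern.findall(sentence))
print(num_matches)  # 3
```

The match itself is always just 'india'; the lookarounds only inspect the neighbouring characters and do not consume them.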
\"india(\W*?)\"
This captures any trailing characters other than letters, digits and underscores.
Try this
^india[^a-zA-Z0-9]$
^ - start of the string, so india must come first
[^a-zA-Z0-9] - one character that is not a-z, A-Z or 0-9
$ - end of the string
Try with:
r'\bindia\W*\b'
To ignore case:
re.search(r'\bindia\W*\b', my_string, re.IGNORECASE).group(0)
you may use:
import re
s = "india."
s1 = "indiana"
print(re.search(r'\bindia[.!?]*\b', s))
print(re.search(r'\bindia[.!?]*\b', s1))
output:
<re.Match object; span=(0, 5), match='india'>
None
If you also want to match the punctuation, you could make use of a negated character class that matches any char except a word character or a newline.
(?<!\S)india[^\w\r\n]*(?!\S)
(?<!\S) Assert a whitespace boundary to the left
india Match literally
[^\w\r\n]* Match 0+ times any char except a word char or a newline
(?!\S) Assert a whitespace boundary to the right
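A sketch of this pattern applied to a line containing the examples from the question. The sample string is made up for illustration.

```python
import re

line = 'india! india. india indian indiana'

# The negated class picks up trailing punctuation; the whitespace
# boundaries keep 'indian' and 'indiana' from matching.
matches = re.findall(r'(?<!\S)india[^\w\r\n]*(?!\S)', line)
print(matches)  # ['india!', 'india.', 'india']
```

Note that a space is also matched by [^\w\r\n]; the engine backtracks out of it because (?!\S) must hold right after the match.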
I'm extracting the paragraphs that follow markers like "OBSERVATION #1" or "OBSERVATION #2" in the text output of a library like PyPDF2.
However, the extraction sometimes introduces errors, so a marker may come out as "OBSERVA'TION #2", and I have to avoid matching things like "Suite #300". So the rule is: any letters in the marker must be capitals.
Currently the Python code snippet looks like this:
inspection_observation = pdfFile.getPage(z).extractText()
if 'OBSERVATION' in inspection_observation:
    for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
        # print(inspection_observation)
        print(finding)
Please advise on the appropriate regular expression for this case.
If there should be a capital and the word can contain a ', you could use a character class where you can list the characters that are allowed and a positive lookahead.
Then you can capture the content between those capital words and use a positive lookahead to check if what follows is another capital word followed by # and 1+ digits or the end of the string. This regex makes use of re.DOTALL where the dot matches a newline.
(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
Explanation
(?=[A-Z']*[A-Z]) Positive lookahead asserting that what follows contains at least one char A-Z, where ' or other capitals can occur before it
[A-Z']+\s+#\d+ Match 1+ times A-Z or ', 1+ whitespace characters, # and 1+ digits
( Capture group
.*? Match any character, as few times as possible
(?= Positive lookahead to assert what follows is
[A-Z']*[A-Z][A-Z']* Match an uppercase char A-Z where a ' can be before and after
\s+#\d+ Match 1+ whitespace chars, # and 1+ digits
|$ Or assert the end of the string
) Close the positive lookahead
) Close capture group
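A sketch of how the pattern could be used with re.findall. The sample text is invented to mimic the extracted PDF text (it is not real PyPDF2 output), and the variable names are illustrative.

```python
import re

# Hypothetical extracted text: two markers, one with a stray apostrophe,
# plus a 'Suite #300' that must not be treated as a marker.
sample = ("OBSERVATION #1 The floor was wet near Suite #300. "
          "OBSERVA'TION #2 Records were incomplete.")

pattern = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+"
    r"(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,
)

# findall returns the content of the single capture group per marker.
findings = [f.strip() for f in pattern.findall(sample)]
for finding in findings:
    print(finding)
```

'Suite #300' is not taken as a marker because the capital S is followed by lowercase letters rather than whitespace and #.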
I have a regex that matches all three-character words in a string:
\b[^\s]{3}\b
When I use it with the string:
And the tiger attacked you.
this is the result:
regex = re.compile(r"\b[^\s]{3}\b")
regex.findall(string)
[u'And', u'the', u'you']
As you can see it matches you as a word of three characters, but I want the expression to take "you." with the "." as a 4 chars word.
I have the same problem with ",", ";", ":", etc.
I'm pretty new with regex but I guess it happens because those characters are treated like word boundaries.
Is there a way of doing this?
Thanks in advance,
EDIT
Thanks to the answers of @BrenBarn and @Kendall Frey I managed to get to the regex I was looking for:
(?<!\w)[^\s]{3}(?=$|\s)
If you want to make sure the word is preceded and followed by a space (and not a period like is happening in your case), then use lookaround.
(?<=\s)\w{3}(?=\s)
If you need it to match punctuation as part of words (such as 'in.') then \w won't be adequate, and you can use \S (anything but a space)
(?<=\s)\S{3}(?=\s)
As described in the documentation:
A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., restrict the match but not be included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
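A sketch comparing the word-boundary version with the lookaround versions on the sentence from the question. The second sample string is made up to show punctuation being counted as part of a word.

```python
import re

s = 'And the tiger attacked you.'

# \b sees a boundary between 'u' and '.', so 'you' counts as 3 chars.
with_b = re.findall(r'\b[^\s]{3}\b', s)
print(with_b)  # ['And', 'the', 'you']

# Requiring whitespace on both sides drops 'you.' (4 chars), but also
# 'And', because nothing precedes it at the start of the string.
spaces = re.findall(r'(?<=\s)[^\s]{3}(?=\s)', s)
print(spaces)  # ['the']

# The pattern from the question's EDIT also accepts the start and the
# end of the string.
edited = re.findall(r'(?<!\w)[^\s]{3}(?=$|\s)', s)
print(edited)  # ['And', 'the']

# Punctuation now counts as a character of the word.
punct = re.findall(r'(?<!\w)[^\s]{3}(?=$|\s)', 'go on, bla tw; end')
print(punct)  # ['on,', 'bla', 'tw;', 'end']
```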
This would be my approach. It also matches words that come right after punctuation marks.
import re
r = r'''
\b # word boundary
( # capturing parentheses
[^\s]{3} # anything but whitespace 3 times
\b # word boundary
(?=[^\.,;:]|$) # dont allow . or , or ; or : after word boundary but allow end of string
| # OR
[^\s]{2} # anything but whitespace 2 times
[\.,;:] # a . or , or ; or :
)
'''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'
print(re.findall(r, s, re.X))
output:
['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']