I have the following examples:
Tortillas Bolsa 2a 1kg 4118
Tortillinas 50p 1 31Kg TAB TR 46113
Bollos BK 4in 36p 1635g SL 131
Super Pan Bco Ajonjoli 680g SP WON 100
Pan Blanco Bimbo Rendidor 567g BIM 49973
Gansito ME 5p 250g MTA MLA 49860
I want to keep everything before the first number, but without the two-letter uppercase words (for example ME, BK). I'm using ^((\D*).*?) [^A-Z]{2,3}
The expected result should be
Tortillas Bolsa
Tortillinas
Bollos
Super Pan Bco Ajonjoli
Pan Blanco Bimbo Rendidor
Gansito
With the regex I'm using I'm still getting the two capital letter words Bollos BK and Gansito ME
Pre-compile a regex pattern with a lookahead (explained below) and employ regex.match inside a list comprehension:
>>> import re
>>> p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')
>>> [p.match(x).group() for x in data]
[
'Tortillas Bolsa',
'Tortillinas',
'Bollos',
'Super Pan Bco Ajonjoli',
'Pan Blanco Bimbo Rendidor',
'Gansito'
]
Here, data is your list of strings.
Details
\D+? # anything that isn't a digit (non-greedy)
(?= # regex-lookahead
\s* # zero or more wsp chars
([A-Z]{2})? # an optional pair of uppercase letters
\s*
\d # digit
)
In the event of any string not containing the pattern you're looking for, the list comprehension will error out (with an AttributeError), since re.match returns None in that instance. You can then employ a loop and test the value of re.match before extracting the matched portion.
matches = []
for x in data:
    m = p.match(x)
    if m:
        matches.append(m.group())
Or, if you want a placeholder None when there's no match:
matches = []
for x in data:
    m = p.match(x)
    matches.append(m.group() if m else None)
My 2 cents
^.*?(?=\s[\d]|\s[A-Z]{2,})
https://regex101.com/r/7xD7DS/1/
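As a quick check, the pattern above can be run over the sample strings from the question with a compiled regex. This is only a sketch: the names samples and product_name are made up for illustration and are not part of the original answer.

```python
import re

# Sample strings copied from the question.
samples = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Tortillinas 50p 1 31Kg TAB TR 46113",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
    "Pan Blanco Bimbo Rendidor 567g BIM 49973",
    "Gansito ME 5p 250g MTA MLA 49860",
]

# Lazily take characters until a space followed by a digit,
# or a space followed by two or more uppercase letters.
pattern = re.compile(r'^.*?(?=\s[\d]|\s[A-Z]{2,})')


def product_name(text):
    """Return the leading name, or None when the pattern does not match."""
    m = pattern.match(text)
    return m.group() if m else None


names = [product_name(s) for s in samples]
print(names)
```

Returning None for non-matching input avoids the AttributeError that calling .group() on a failed match would raise.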
You may use the lookahead feature:
import re
I_WANT = r'(.+?)' # This is what you want
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))' # Stop-patterns (raw string, so \s is not treated as a string escape)
RE = '{}(?={})'.format(I_WANT, I_DO_NOT_WANT) # Combine the parts
[re.findall(RE, x)[0] for x in test_strings]
#['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli',
# 'Pan Blanco Bimbo Rendidor', 'Gansito']
Supposing that:
All the words you want to match in your capture group start with an uppercase letter
The rest of each word contains only lowercase letters
Words are separated by a single space
...you can use the following regular expressions:
Using Unicode character properties (not supported by Python's built-in re module; the third-party regex module accepts this syntax):
^((\p{Lu}\p{Ll}+ )+)
> Try this regex on regex101.
Without Unicode support:
^(([A-Z][a-z]+ )+)
> Try this regex on regex101.
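A minimal Python sketch of the ASCII-only variant, with the first character class written as [A-Z]. It relies on the three assumptions listed above; the helper name leading_words is hypothetical. Because each matched word includes its trailing space, the captured text is stripped on the right.

```python
import re

# One or more "Capitalized " words at the start of the string.
ascii_words = re.compile(r'^(([A-Z][a-z]+ )+)')


def leading_words(text):
    """Return the run of capitalized words at the start, or None."""
    m = ascii_words.match(text)
    # Group 1 ends with the space that follows the last word.
    return m.group(1).rstrip() if m else None


print(leading_words("Bollos BK 4in 36p 1635g SL 131"))
print(leading_words("Super Pan Bco Ajonjoli 680g SP WON 100"))
```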
I suggest splitting on the first two uppercase letter word or a digit and grab the first item:
r = re.compile(r'\b[A-Z]{2}\b|\d')
[r.split(item)[0].strip() for item in my_list]
# => ['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli', 'Pan Blanco Bimbo Rendidor', 'Gansito']
See the Python demo
Pattern details
\b[A-Z]{2}\b - a whole (since \b are word boundaries) two uppercase ASCII letter word
| - or
\d - a digit.
With .strip(), all trailing and leading whitespace will get trimmed.
A slight variation for a re.sub:
re.sub(r'\s*(?:\b[A-Z]{2}\b|\d).*', '', s)
See the regex demo
Details
\s* - 0+ whitespace chars
(?:\b[A-Z]{2}\b|\d) - either a two uppercase letter word or a digit
.* - the rest of the line.
I'm trying to build a function that will collect an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE
You don't need regex for the problem you have stated. You can just split the words on space, then take the first character and convert it to the upper case, and finally join them all.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You need to handle special cases, such as a word starting with a non-alphabetic character, with something like if w[0].isalpha().
Another approach uses re.sub with a negative lookbehind (it only removes characters and does not change their case):
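A small sketch of that idea, assuming words that do not start with a letter should simply be skipped. The function name acronym is made up for illustration.

```python
def acronym(text):
    """Join the uppercased first letters of the words in text."""
    # split() with no argument also collapses runs of whitespace,
    # so no empty strings reach w[0].
    return ''.join(w[0].upper() for w in text.split() if w[0].isalpha())


print(acronym('Data Science'))
```

Using split() rather than split(' ') matters here: splitting on a single space yields empty strings for consecutive spaces, and indexing an empty string raises IndexError.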
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
See Python proof.
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
[\d_] any character of: digits (0-9), '_'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
. any character except \n
How can I use a regular expression in Python to add a dot and a space after a single letter in a name, but only when it is a single letter at the beginning? For example, I want to turn:
A G Mark
AG Mark
A.G. Mark
to this
A. G. Mark
I have tried this, but it does not work in some cases:
import re
line = "A G Mark"
b = re.sub(r' ', r'. ', line)
print (b)
a = re.sub(r'(?<=[.])(?=[^\s])', r' ', line)
print (a)
Is it possible to use one case (a) or another (b)?
For example, if
A.G. Mark use "a"
else
A G Mark use "b"
else
AG Mark use "c"
In all your cases, you could try:
(?<=[A-Z])\.?\s?(?![a-z])
And replace with a dot followed by a space (". ").
See the online demo.
(?<=[A-Z]) - Positive lookbehind to assert position after capital alpha.
\.?\s? - Both optional dot and space character.
(?![a-z]) - Negative lookahead to prevent being followed by lowercase alpha.
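A short Python sketch of this substitution, assuming the replacement is a dot followed by a space. The helper name normalize_initials is hypothetical.

```python
import re

# After a capital letter, consume an optional dot and an optional space,
# but only when a lowercase letter does not follow.
initial_gap = re.compile(r'(?<=[A-Z])\.?\s?(?![a-z])')


def normalize_initials(name):
    """Rewrite leading initials as 'X. ' using the lookaround pattern."""
    return initial_gap.sub('. ', name)


for line in ("A G Mark", "AG Mark", "A.G. Mark"):
    print(normalize_initials(line))
```

Note that the pattern can match an empty string (as in AG Mark, between A and G), which is what lets the replacement insert text where nothing stood before.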
So the distinction of a full name is "Capital letter followed by lower-case letters".
Therefore we want to replace a capital letter, followed by a possible dot, not followed by a lower-case letter and followed by any number of spaces.
This translates to the following substitution:
re.sub(r'([A-Z])\.?(?![a-z])\s*', r'\g<1>. ', line)
Explanation:
pattern:
([A-Z]) - Capture group of a single capital letter
\.? - matches the character . literally. ? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy).
(?![a-z]) - Negative Lookahead. Assert that the following character doesn't match a single lower-case letter.
\s* matches any whitespace character (equal to [\r\n\t\f\v ]). * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy).
replacement:
\g<1> - grab the first capture group (the capital letter).
. - add a dot and a space.
Remember that in the pattern we already match any amount of spaces, so adding the space here will not result in excessive spaces
Regex Demo
Code demo:
import re
lines = """A G Mark
AG Mark
A.G. Mark"""
for line in lines.splitlines():
print(re.sub(r'([A-Z])\.?(?![a-z])\s*', r'\g<1>. ', line))
Which gives:
A. G. Mark
A. G. Mark
A. G. Mark
Try copying and pasting this into the find-and-replace boxes:
Find: (\w)(\s\w)(\s\w+)
Replace: $1.$2.$3
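For reference, a rough Python equivalent of that find-and-replace, where the $1-style references become \1-style backreferences. The function name dot_initials is made up for illustration. As the second call suggests, this pattern only covers the space-separated form and leaves AG Mark untouched.

```python
import re


def dot_initials(name):
    """Add dots after two space-separated single characters before a word."""
    return re.sub(r'(\w)(\s\w)(\s\w+)', r'\1.\2.\3', name)


print(dot_initials("A G Mark"))
print(dot_initials("AG Mark"))
```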
I want to remove all occurrences of dots separated by single characters, I also want to replace all occurrences of dots separated by more than one consecutive character with a space (if one side has len > 1 char).
For example. Given a string,
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
After processing the output should look like:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
Notice that in the case of A.B.C.D.E., all dots are removed (this should be true for when there is no trailing dot also)
Notice that in the case of K.L.M.NO, the first two dots are removed, the last one is replaced with a space (because NO is not a single character)
Notice that in the case of PQ.R.S, the first dot is replaced with a space, the second dot is removed.
I almost have a working solution:
re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
But in the example given, T.U.VWXYZ gets translated to TUVWXYZ, whereas it should be TU VWXYZ
Note: it's not important for this to be solved with a single regex, or even regex at all for that matter.
Edit: changed PQ.RS to PQ.R.S in the example string.
I'd take two steps.
replace (\b[A-Z])\.(?=[A-Z]\b|\s|$) with r'\1'
replace (\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,}) with r'\1\2 '
Sample
import re
re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
r = re2.sub(r'\1\2 ', re1.sub(r'\1', s)).strip()
print(r)
outputs
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
which matches your desired result:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
re1 matches all dots that are preceded by a free-standing letter and followed by either another free-standing letter, or whitespace, or the end of the string.
re2 matches all dots that are preceded by at least 2 letters and followed by at least 1 letter (or the other way around).
You can first replace all dots followed by two characters by spaces, and then remove the remaining dots:
re.sub(r'\.([A-Z]{2})', r' \1', s).replace(".", "")
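This gives " ABCDE FGH IJ KLM NO PQ RS TU VWXYZ" (note the leading space) on the original example string, which contained PQ.RS. With the edited string containing PQ.R.S it yields PQRS instead, because the dot after PQ is not followed by two uppercase letters.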
hopefully this is slightly neater:
import re
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
s = re.sub(r"\.(\w{2})", r" \1", s)
s = re.sub(r"(\w{2})\.(\w)", r"\1 \2", s)
s = re.sub(r"\.", "",s)
s = s.strip()
print(s)
You can use a single regex solution if you consider using a dynamic replacement:
import re
rx = r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.'
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print( re.sub(rx, lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s.strip()) )
# => ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
See the Python demo and a regex demo.
The regex matches:
\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?) - a word boundary, then Group 1 (that will be replaced with itself after stripping off all periods) capturing:
[A-Z] - an uppercase ASCII letter
(?:\.[A-Z])+ - one or more sequences of a dot and an uppercase ASCII letter
\b - word boundary
(?:\.(?![A-Z]))? - an optional sequence of . that is not followed with an uppercase ASCII letter
| - or
\. - a . in any other context (it will be replaced with a space).
The lambda x: x.group(1).replace('.', '') if x.group(1) else ' ' replacement means that if Group 1 matches, the replacement string is Group 1 value without dots, and if Group 1 does not match the replacement is a single regular space.
If I have the word india
MATCHES
"india!" "india!" "india." "india"
NON MATCHES "indian" "indiana"
Basically, I want to match the string but not when its contained within another string.
After doing some research, I started with
exp = "(?<!\S)india(?!\S)"
num_matches = len(re.findall(exp))
but that doesn't match the punctuation and I'm not sure where to add that in.
Assuming the objective is to match a given word (e.g., "india") in a string provided the word is neither preceded nor followed by a character that is not in the string " .,?!;" you could use the following regex:
(?<![^ .,?!;])india(?![^ .,?!;\r\n])
Demo
Python's regex engine performs the following operations
(?<! # begin a negative lookbehind
[^ .,?!;] # match 1 char other than those in " .,?!;"
) # end the negative lookbehind
india # match string
(?! # begin a negative lookahead
[^ .,?!;\r\n] # match 1 char other than those in " .,?!;\r\n"
) # end the negative lookahead
Notice that the character class in the negative lookahead contains \r and \n in case india is at the end of a line.
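A minimal sketch of counting matches with this pattern in Python. The test sentence and the variable names are invented for illustration.

```python
import re

text = "india! indian india. indiana india"

# The word must not touch any character other than a space or " .,?!;".
word = re.compile(r'(?<![^ .,?!;])india(?![^ .,?!;\r\n])')

num_matches = len(word.findall(text))
print(num_matches)
```

Both lookarounds are negative with a negated class, so they also succeed at the very start and end of the string, where there is no neighbouring character at all.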
\"india(\W*?)\"
Here \W*? matches any run of characters other than letters, digits, and underscores.
Try this
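^india[^a-zA-Z0-9]*$
^ - start of the string, immediately followed by india
[^a-zA-Z0-9]* - zero or more characters that are not a-z, A-Z, 0-9 (the * lets plain india match as well)
$ - end of the string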
Try with:
r'\bindia\W*\b'
See demo
To ignore case:
re.search(r'\bindia\W*\b', my_string, re.IGNORECASE).group(0)
you may use:
import re
s = "india."
s1 = "indiana"
print(re.search(r'\bindia[.!?]*\b', s))
print(re.search(r'\bindia[.!?]*\b', s1))
output:
<re.Match object; span=(0, 5), match='india'>
None
If you also want to match the punctuation, you could make use of a negated character class that matches any char except a word character or a newline.
(?<!\S)india[^\w\r\n]*(?!\S)
(?<!\S) Assert a whitespace boundary to the left
india Match literally
[^\w\r\n]* Match 0+ times any char except a word char or a newline
(?!\S) Assert a whitespace boundary to the right
Regex demo
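A short sketch showing what this pattern returns with re.findall, including the trailing punctuation. The sample text and names are made up for illustration.

```python
import re

text = "india! indian india. indiana india"

# Whitespace boundaries on both sides, with optional trailing punctuation.
with_punct = re.compile(r'(?<!\S)india[^\w\r\n]*(?!\S)')

found = with_punct.findall(text)
print(found)
```

The negated class also accepts spaces, but the trailing (?!\S) forces the engine to backtrack so that a match never runs into the next word.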
Input is a two-sentence string:
s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
I'd like to .split s into sentences based on the logic that:
sentences end with one or more periods, exclamation marks, question marks, or period+quotation mark
and are then followed by 1+ whitespace characters and a capitalized alpha character.
Desired result:
['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
Also okay:
['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']
But I currently chop off the 0th element of each sentence because the uppercase character is captured:
import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
Notice the missing T. How can I tell .split to ignore certain elements of the compiled pattern?
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])
Try it here.
Although I would suggest the below to allow quotes to be followed by any of .!? to be a split condition
((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])
Try it here.
Explanation
The common stuff in both +(?=[A-Z])
' +' #One or more spaces(The actual splitting chars used.)
(?= #START positive look ahead check if it followed by this, but do not consume
[A-Z] #Any capitalized alphabet
) #END positive look ahead
The conditions for what comes before the space
For Solution1
( #GROUP START
(?<= #START Positive look behind, Make sure this comes before but do not consume
[.!?] #any one of these chars should come before the splitting space
) #END positive look behind
| #OR condition this is also the reason we had to put all this in GROUP
(?<= #START Positive look behind,
\.\" #splitting space could precede by .", covering a condition that is not by the previous set of . or ! or ?
) #END positive look behind
) #END GROUP
For Solution2
( #GROUP START
(?<=[.!?]) #Same as the previous look behind
| #OR condition
(?<=[.!?]\") #Only difference here is that we are allowing quote after any of . or ! or ?
) #GROUP END
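One caveat when passing these patterns to Python's re.split: if the pattern contains a capturing group, the captured text is also returned in the result list, which here would add an empty string between sentences. A minimal sketch that avoids this by making the outer group non-capturing (the name SENT_SPLIT is only illustrative):

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Same logic as the second pattern, but with (?:...) so that
# re.split does not add the (empty) group match to the result.
SENT_SPLIT = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')

sentences = SENT_SPLIT.split(s)
print(sentences)
```

Only the spaces are consumed by the split; the punctuation and the following capital letter are checked by lookarounds and therefore stay in the sentences.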
It's easier to describe the sentence than trying to identify the delimiter. So instead of re.split try with re.findall:
re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
To preserve the next uppercase letter, the pattern uses a lookahead that is only a test and doesn't consume characters.
details:
( # capture group: re.findall return only the capture group content if any
[^.?!\s] # the first character isn't a space or a punctuation character
.*? # a non-greedy quantifier
[.?!]* # optional trailing punctuation characters
)
\s* # zero or more white-spaces
(?![^A-Z]) # not followed by a character that isn't an uppercase letter
# (this includes an uppercase letter and the end of the string)
Obviously, for more complicated cases with abbreviations, names, etc., you have to use tools like nltk or any other nlp tools trained with dictionaries.