I want to remove every dot that separates single characters. I also want to replace every dot that separates longer runs of characters with a space (that is, whenever at least one side of the dot is more than one character long).
For example. Given a string,
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
After processing the output should look like:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
Notice that in the case of A.B.C.D.E., all dots are removed (this should be true for when there is no trailing dot also)
Notice that in the case of K.L.M.NO, the first two dots are removed, the last one is replaced with a space (because NO is not a single character)
Notice that in the case of PQ.R.S, the first dot is replaced with a space, the second dot is removed.
I almost have a working solution:
re.sub(r'(?<!\w)([A-Z])\.', r'\1', s)
But in the example given, T.U.VWXYZ gets translated to TUVWXYZ, whereas it should be TU VWXYZ
Note: it's not important for this to be solved with a single regex, or even regex at all for that matter.
Edit: changed PQ.RS to PQ.R.S in the example string.
I'd take two steps.
replace (\b[A-Z])\.(?=[A-Z]\b|\s|$) with r'\1'
replace (\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,}) with r'\1\2 '
Sample
import re
re1 = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
re2 = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
r = re2.sub(r'\1\2 ', re1.sub(r'\1', s)).strip()
print(r)
outputs
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
which matches your desired result:
'ABCDE FGH IJ KLM NO PQ RS TU VWXYZ'
re1 matches all dots that are preceded by a free-standing letter and followed by either another free-standing letter, or whitespace, or the end of the string.
re2 matches all dots that are preceded by at least 2 letters and followed by at least 1 letter (or the other way around).
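The two passes described above can be wrapped in one helper and checked against the edited example string from the question (the one containing PQ.R.S). This is only a minimal sketch: the name collapse_dots is made up for illustration, and it assumes Python 3.5 or later, where a group that did not take part in the match expands to an empty string in the replacement template.

```python
import re

# Pass 1: drop a dot that follows a free-standing letter when the next thing
# is another free-standing letter, whitespace, or the end of the string.
RE_REMOVE = re.compile(r'(\b[A-Z])\.(?=[A-Z]\b|\s|$)')
# Pass 2: a dot with a multi-letter run on at least one side becomes a space.
RE_SPACE = re.compile(r'(\b[A-Z]{2,})\.(?=[A-Z])|(\b[A-Z])\.(?=[A-Z]{2,})')


def collapse_dots(text):
    """Hypothetical helper: run both substitutions, then trim the ends."""
    without_single_dots = RE_REMOVE.sub(r'\1', text)
    return RE_SPACE.sub(r'\1\2 ', without_single_dots).strip()


edited = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print(collapse_dots(edited))
```

The order matters in this sketch: removing the dots between single letters first turns K.L.M.NO into KLM.NO, which is what lets the second pass see a multi-letter run on the left of the remaining dot.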
You can first replace all dots followed by two characters by spaces, and then remove the remaining dots:
re.sub(r'\.([A-Z]{2})', r' \1', s).replace(".", "")
This gives " ABCDE FGH IJ KLM NO PQ RS TU VWXYZ" on your example.
hopefully this is slightly neater:
import re
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.RS T.U.VWXYZ'
s = re.sub(r"\.(\w{2})", r" \1", s)
s = re.sub(r"(\w{2})\.(\w)", r"\1 \2", s)
s = re.sub(r"\.", "",s)
s = s.strip()
print(s)
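For comparison, here is the same three-substitution idea wrapped in a function and run on the edited example string from the question (with PQ.R.S). The function name dots_to_words is an assumption made for this sketch, not something from the answer.

```python
import re


def dots_to_words(text):
    """Hypothetical wrapper around the three substitutions shown above."""
    # A dot followed by two word characters marks the start of a longer token.
    text = re.sub(r"\.(\w{2})", r" \1", text)
    # A dot preceded by two word characters marks the end of a longer token.
    text = re.sub(r"(\w{2})\.(\w)", r"\1 \2", text)
    # Every remaining dot sits between single characters (or trails one).
    return text.replace(".", "").strip()


result = dots_to_words(' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ')
print(result)
```

The second substitution is the step that keeps PQ and RS apart in PQ.R.S, since the dot after PQ is not followed by two word characters.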
You can use a single regex solution if you consider using a dynamic replacement:
import re
rx = r'\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?)|\.'
s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
print( re.sub(rx, lambda x: x.group(1).replace('.', '') if x.group(1) else ' ', s.strip()) )
# => ABCDE FGH IJ KLM NO PQ RS TU VWXYZ
See the Python demo and a regex demo.
The regex matches:
\b([A-Z](?:\.[A-Z])+\b(?:\.(?![A-Z]))?) - a word boundary, then Group 1 (that will be replaced with itself after stripping off all periods) capturing:
[A-Z] - an uppercase ASCII letter
(?:\.[A-Z])+ - one or more sequences of a dot and an uppercase ASCII letter
\b - word boundary
(?:\.(?![A-Z]))? - an optional sequence of . that is not followed with an uppercase ASCII letter
| - or
\. - a . in any other context (it will be replaced with a space).
The lambda x: x.group(1).replace('.', '') if x.group(1) else ' ' replacement means that if Group 1 matches, the replacement string is Group 1 value without dots, and if Group 1 does not match the replacement is a single regular space.
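The same pattern may be easier to follow when written with re.VERBOSE and a named replacement function in place of the lambda. This is a sketch of that layout only; the names RX and fix_dots are made up here, and the behaviour is meant to be identical to the one-liner above.

```python
import re

RX = re.compile(r'''
    \b(                    # Group 1: a dotted run of single letters
        [A-Z]              #   first letter
        (?:\.[A-Z])+       #   one or more dot-plus-letter pairs
        \b                 #   the last letter must end a word
        (?:\.(?![A-Z]))?   #   optional trailing dot not followed by a letter
    )
    | \.                   # any other dot
''', re.VERBOSE)


def fix_dots(match):
    """Strip dots inside Group 1; turn any other dot into a space."""
    run = match.group(1)
    return run.replace('.', '') if run else ' '


s = ' A.B.C.D.E. FGH.IJ K.L.M.NO PQ.R.S T.U.VWXYZ'
cleaned = RX.sub(fix_dots, s.strip())
print(cleaned)
```

In K.L.M.NO the inner \b forces the engine to give back .N, so Group 1 ends at K.L.M and the dot before NO falls through to the second alternative.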
Related
I am new to Python and am working through some problems as a way to learn.
I want to match space-separated words that contain two or fewer o characters.
That is what I actually did:
import re
pattern = r'\b(?:[^a\s]*o){1}[^a\s]*\b'
text = "hop hoop hooop hoooop hooooop"
print(re.findall(pattern, text))
When I run my code, it matches all the words in the string.
Any suggestion?
You can use
import re
pattern = r'(?<!\S)(?:[^\so]*o){0,2}[^o\s]*(?!\S)'
text = "hop hoop hooop hoooop hooooop"
print(re.findall(pattern, text))
# Non-regex solution:
print([x for x in text.split() if x.count("o") < 3])
See the Python demo. Both yield ['hop', 'hoop'].
The (?<!\S)(?:[^\so]*o){0,2}[^o\s]*(?!\S) regex matches
(?<!\S) - a left-hand whitespace boundary
(?:[^\so]*o){0,2} - zero, one or two occurrences of any zero or more chars other than whitespace and o char, and then an o char
[^o\s]* - zero or more chars other than o and whitespace
(?!\S) - a right-hand whitespace boundary
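To see the boundaries at work on something other than the question's input, here is a small sketch that runs the pattern on a made-up sentence and cross-checks it against the str.count approach. It assumes the words are separated by single spaces with no leading or trailing whitespace; with leading or repeated spaces the pattern can also return empty strings, which would need filtering out.

```python
import re

# Same pattern as above: at most two "o" characters per whitespace-delimited word.
PATTERN = re.compile(r'(?<!\S)(?:[^\so]*o){0,2}[^o\s]*(?!\S)')

text = "foo bar boo book oooh no"   # made-up sample, single spaces only
by_regex = PATTERN.findall(text)
by_count = [w for w in text.split() if w.count("o") <= 2]

print(by_regex)
print(by_regex == by_count)
```

Words with no o at all (such as bar) are matched too, because the group is allowed to repeat zero times.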
How can I use a regular expression in Python to add a dot and a space after a single letter in a name, but only if it is a single letter at the beginning? For example, I want to turn each of these:
A G Mark
AG Mark
A.G. Mark
to this
A. G. Mark
I have tried this, but it does not work in some cases:
import re
line = "A G Mark"
b = re.sub(r' ', r'. ', line)
print (b)
a = re.sub(r'(?<=[.])(?=[^\s])', r' ', line)
print (a)
Is it possible to use one case(a) or another(b)?
For example, if
A.G. Mark use "a"
else
A G Mark use "b"
else
AG Mark use "c"
In all your cases, you could try:
(?<=[A-Z])\.?\s?(?![a-z])
And replace with a dot followed by a space (". ").
See the online demo.
(?<=[A-Z]) - Positive lookbehind to assert position after capital alpha.
\.?\s? - Both optional dot and space character.
(?![a-z]) - Negative lookahead to prevent being followed by lowercase alpha.
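As a quick check, the pattern can be applied to the three spellings from the question with re.sub and a replacement of a dot plus a space. This is a sketch assuming Python 3.7 or later (the handling of empty matches in re.sub changed in that release); the name INITIAL_GAP is made up for the example.

```python
import re

# Matches the gap after a capital letter: optional dot, optional space,
# as long as a lowercase letter does not come next.
INITIAL_GAP = re.compile(r'(?<=[A-Z])\.?\s?(?![a-z])')

names = ["A G Mark", "AG Mark", "A.G. Mark"]
results = [INITIAL_GAP.sub('. ', name) for name in names]

for name, fixed in zip(names, results):
    print(repr(name), '->', repr(fixed))
```

For "AG Mark" the match after the A is empty, so the substitution inserts text there instead of replacing anything.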
So the distinction of a full name is "Capital letter followed by lower-case letters".
Therefore we want to replace a capital letter, followed by a possible dot, not followed by a lower-case letter and followed by any number of spaces.
This translates to the following substitution:
re.sub(r'([A-Z])\.?(?![a-z])\s*', r'\g<1>. ', line)
Explanation:
pattern:
([A-Z]) - Capture group of a single capital letter
\.? - matches the character . literally. ? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy).
(?![a-z]) - Negative Lookahead. Assert that the following character doesn't match a single lower-case letter.
\s* matches any whitespace character (equal to [\r\n\t\f\v ]). * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy).
replacement:
\g<1> - grab the first capture group (the capital letter).
. - add a dot and a space.
Remember that in the pattern we already match any amount of spaces, so adding the space here will not result in excessive spaces
Regex Demo
Code demo:
import re
lines = """A G Mark
AG Mark
A.G. Mark"""
for line in lines.splitlines():
    print(re.sub(r'([A-Z])\.?(?![a-z])\s*', r'\g<1>. ', line))
Which gives:
A. G. Mark
A. G. Mark
A. G. Mark
Try copying and pasting these into the find-and-replace boxes:
Find: (\w)(\s\w)(\s\w+)
Replace: $1.$2.$3
Given a string like this: aa5f5 aa5f5, I am trying to split the tokens wherever a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I am trying to prevent tokens with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that i only came up with this ugly loop:
import re

sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
But could not find a way to make this possible with single line regexp. Need help with this, thank you.
You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches either $ and any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or a digit that is preceded by a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement), else, the space is inserted before the matched digit (see else rf' {x.group()}').
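The "capture what to keep, match what to change" trick reads a little more clearly with a named callback. The following is a minimal sketch of the same substitution; KEEP_OR_SPLIT and space_before_digit are names invented for this example.

```python
import re

# Alternative 1 captures a whole "$"-prefixed token; alternative 2 matches a
# digit that directly follows a non-digit.
KEEP_OR_SPLIT = re.compile(r'(\$\S*)|(?<=\D)\d')


def space_before_digit(match):
    # Group 1 is a "$"-prefixed token: return it untouched.
    if match.group(1):
        return match.group(1)
    # Otherwise the match is a single digit that follows a non-digit.
    return ' ' + match.group()


sentence = '$aa5f5 aa5f5'
new_sentence = KEEP_OR_SPLIT.sub(space_before_digit, sentence)
print(new_sentence)
```

Because the protected token is consumed by the first alternative, the digits inside it are never offered to the second one.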
With the PyPI regex module, you can do it in a simpler way:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).
This is not something a regular expression handles well. Even where it can be done, the regex will be complex and hard to understand, and a new developer joining your team will not understand it right away. It is better to keep the code the way you already wrote it. For the regex part, the following code will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'
I have the following examples:
Tortillas Bolsa 2a 1kg 4118
Tortillinas 50p 1 31Kg TAB TR 46113
Bollos BK 4in 36p 1635g SL 131
Super Pan Bco Ajonjoli 680g SP WON 100
Pan Blanco Bimbo Rendidor 567g BIM 49973
Gansito ME 5p 250g MTA MLA 49860
Where I want to keep everything before the number but I also don't want the two uppercase letter word example: ME, BK. I'm using ^((\D*).*?) [^A-Z]{2,3}
The expected result should be
Tortillas Bolsa
Tortillinas
Bollos
Super Pan Bco Ajonjoli
Pan Blanco Bimbo Rendidor
Gansito
With the regex I'm using I'm still getting the two capital letter words Bollos BK and Gansito ME
Pre-compile a regex pattern with a lookahead (explained below) and call the compiled pattern's match method inside a list comprehension:
>>> import re
>>> p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')
>>> [p.match(x).group() for x in data]
[
'Tortillas Bolsa',
'Tortillinas',
'Bollos',
'Super Pan Bco Ajonjoli',
'Pan Blanco Bimbo Rendidor',
'Gansito'
]
Here, data is your list of strings.
Details
\D+? # anything that isn't a digit (non-greedy)
(?= # regex-lookahead
\s* # zero or more wsp chars
([A-Z]{2})? # two optional uppercase letters
\s*
\d # digit
)
In the event of any string not containing the pattern you're looking for, the list comprehension will error out (with an AttributeError), since re.match returns None in that instance. You can then employ a loop and test the value of re.match before extracting the matched portion.
matches = []
for x in data:
    m = p.match(x)
    if m:
        matches.append(m.group())
Or, if you want a placeholder None when there's no match:
matches = []
for x in data:
    m = p.match(x)
    matches.append(m.group() if m else None)
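Putting the pieces together, here is a self-contained sketch with a few of the question's lines plus one made-up entry that contains no digit, to show the None placeholder. The helper name leading_name and the "Sin numero" entry are assumptions added for illustration.

```python
import re

p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')

data = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Bollos BK 4in 36p 1635g SL 131",
    "Gansito ME 5p 250g MTA MLA 49860",
    "Sin numero",   # made-up entry with no digit, so there is no match
]


def leading_name(line):
    """Return the text before the first number (and optional two-letter code)."""
    m = p.match(line)
    return m.group() if m else None


names = [leading_name(x) for x in data]
print(names)
```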
My 2 cents
^.*?(?=\s[\d]|\s[A-Z]{2,})
https://regex101.com/r/7xD7DS/1/
You may use the lookahead feature:
I_WANT = r'(.+?)' # This is what you want
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))' # Stop-patterns (raw string, so \s is not treated as a string escape)
RE = '{}(?={})'.format(I_WANT, I_DO_NOT_WANT) # Combine the parts
[re.findall(RE, x)[0] for x in test_strings]
#['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli',
# 'Pan Blanco Bimbo Rendidor', 'Gansito']
Supposing that:
All the words you want to match in your capture group start with an uppercase letter
The rest of each word contains only lowercase letters
Words are separated by a single space
...you can use the following regular expressions:
Using Unicode character properties:
^((\p{Lu}\p{Ll}+ )+)
> Try this regex on regex101.
Without Unicode support:
^(([A-Z][a-z]+ )+)
> Try this regex on regex101.
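Python's built-in re module does not understand \p{Lu} or \p{Ll}, so trying this idea in Python means either the third-party regex package or an ASCII pattern. Below is a small sketch of the ASCII route, written with [A-Z] for the first letter of each word; the variable names are made up, and the three stated assumptions about the input still apply.

```python
import re

# ASCII-only variant: capitalised words, each followed by a single space.
CAPITALISED_WORDS = re.compile(r'^(([A-Z][a-z]+ )+)')

samples = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
]

names = []
for line in samples:
    m = CAPITALISED_WORDS.match(line)
    # Group 1 keeps the trailing space of the last word, so strip it.
    names.append(m.group(1).rstrip() if m else None)

print(names)
```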
I suggest splitting on the first two-uppercase-letter word or digit and grabbing the first item:
r = re.compile(r'\b[A-Z]{2}\b|\d')
[r.split(item)[0].strip() for item in my_list]
# => ['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli', 'Pan Blanco Bimbo Rendidor', 'Gansito']
See the Python demo
Pattern details
\b[A-Z]{2}\b - a whole (since \b are word boundaries) two uppercase ASCII letter word
| - or
\d - a digit.
With .strip(), all trailing and leading whitespace will get trimmed.
A slight variation for a re.sub:
re.sub(r'\s*(?:\b[A-Z]{2}\b|\d).*', '', s)
See the regex demo
Details
\s* - 0+ whitespace chars
(?:\b[A-Z]{2}\b|\d) - either a two uppercase letter word or a digit
.* - the rest of the line.
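The split and re.sub variants should agree with each other on the question's data. Here is a short sketch that runs both side by side on three of the sample lines; STOP, TAIL and my_list are illustrative names, and each input is assumed to be a single line, since .* does not cross newlines.

```python
import re

STOP = re.compile(r'\b[A-Z]{2}\b|\d')                 # where the wanted text ends
TAIL = re.compile(r'\s*(?:\b[A-Z]{2}\b|\d).*')        # everything from there on

my_list = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Bollos BK 4in 36p 1635g SL 131",
    "Pan Blanco Bimbo Rendidor 567g BIM 49973",
]

by_split = [STOP.split(item)[0].strip() for item in my_list]
by_sub = [TAIL.sub('', item) for item in my_list]

print(by_split)
print(by_split == by_sub)
```

The re.sub form avoids building the list of split pieces, while the split form leaves the rest of the line available if it is needed later.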
I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But, if i could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one, without splitting up the chemicals in entries like the first, that have numbers in their name separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
Let me explain a little, building on @eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two part: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (includes tabs and line breaks).
, matches character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
(?<= ... ) and (?= ...) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex finds every , that satisfies either case, which is how it meets your requirement of splitting on ',' with non-numeric characters next to it while leaving the commas between digits alone.
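To make the two cases concrete, here is a small Python 3 sketch that applies the same pattern to both entries from the question. The names SEPARATOR and entries are made up for the example.

```python
import re

# Split on a comma that has a non-digit directly before it or directly after it;
# surrounding whitespace is swallowed as part of the separator.
SEPARATOR = re.compile(r'(?<=\D),\s*|\s*,(?=\D)')

entries = [
    '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP',
    'Lead,Paints/Pigments,Zinc',
]

parts = [SEPARATOR.split(entry) for entry in entries]
for p in parts:
    print(p)
```

The commas inside 2,4-D and 2,4,5-TP have a digit on both sides, so neither alternative matches there and those names stay in one piece.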
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']