changing a word surrounded by numbers using re.sub - python

I want to change "to" to "-" only when it is surrounded by numbers, or by numbers and spaces, e.g. 34to55 and 34 to 55 should both become 34-55. I also don't want to change it when the surrounding word begins or ends with letters, e.g. abc34to55def.
I tried
re.sub('(?<![a-z])(\d)to(\d)(?![a-z])', '\\1-\\2', '34to55')
but it doesn't give me what I want.
Any help would be greatly appreciated.

You can use 2 capture groups with word boundaries and use the groups in the replacement.
\b(\d+)\s*to\s*(\d+)\b
\b A word boundary to prevent a partial match
(\d+) Capture group 1, match 1+ digits
\s*to\s* Match to between optional whitespace chars
(\d+) Capture group 2, match 1+ digits
\b A word boundary
and replace with
\1-\2
import re
pattern = r"\b(\d+)\s*to\s*(\d+)\b"
s = ("34to55 or 34 to 55\n"
     "abc34to55def")
result = re.sub(pattern, r"\1-\2", s)
if result:
    print(result)
Result
34-55 or 34-55
abc34to55def
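To see how the word boundaries behave on a few more inputs, the substitution can be wrapped in a small helper. This is only a sketch: the name dash_ranges and the extra sample strings are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
RANGE_PATTERN = re.compile(r"\b(\d+)\s*to\s*(\d+)\b")


def dash_ranges(text):
    """Replace 'N to M' or 'NtoM' with 'N-M' when the numbers stand alone."""
    return RANGE_PATTERN.sub(r"\1-\2", text)


# Hypothetical samples: the last two have letters glued to the digits.
for sample in ("34to55", "34 to 55", "abc34to55def", "34to55def"):
    print(f"{sample!r} -> {dash_ranges(sample)!r}")
```

A \b only matches between a word character and a non-word character, so a digit that directly touches a letter or an underscore has no boundary next to it. That is why the last two samples come back unchanged.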

Related

Pattern to extract, expand and form a sentence based on a certain delimiter

I was trying to solve a regex problem:
The input is a sentence in one of these forms: Number1,2,3 or Number1/2/3 or Number1-2-3, so the three delimiters are , / -
The expected output is: Number1,Number2,Number3
The pattern I've tried so far:
(?<=,)[^,]+(?=,)
but this misses the edge cases, i.e. the first and last elements. I am also not able to make it work for '/'.
You could separate the key from the values, then use a list comprehension to build the output you want.
import re
inp = "Number1,2,3"
matches = re.search(r'(\D+)(.*)', inp)
output = [matches[1] + x for x in re.split(r'[,/-]', matches[2])]
print(output) # ['Number1', 'Number2', 'Number3']
You can do it in several steps: 1) validate the string to match your pattern, and once validated 2) add the first non-digit chunk to the numbers while replacing - and / separator chars with commas:
import re
texts = ['Number1,2,3', 'Number1/2/3', 'Number1-2-3']
for text in texts:
    m = re.search(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$', text)
    if m:
        print( re.sub(r'(?<=,)(?=\d)', m.group(1).replace('\\', '\\\\'), text.replace('/',',').replace('-',',')) )
    else:
        print(f"NO MATCH in '{text}'")
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
The ^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$ regex validates your three types of input:
^ - start of string
(\D+) - Group 1: one or more non-digits
(\d+(?=([,/-]))(?:\3\d+)*) - Group 2: one or more digits, and then zero or more repetitions of ,, / or - and one or more digits (and the separator chars should be consistent due to the capture used in the positive lookahead and the \3 backreference to that value used in the non-capturing group)
$ - end of string.
The re.sub pattern, (?<=,)(?=\d), matches a location between a comma and a digit, the Group 1 value is placed there (note the .replace('\\', '\\\\') is necessary since the replacement is dynamic).
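The validate-then-rebuild idea can also be sketched without a dynamic replacement string, by reading the separator from Group 3 and splitting on it. The function name expand_numbers and the choice to return None for invalid input are assumptions made for this illustration.

```python
import re

# Validation regex from the answer above: Group 1 is the non-digit prefix,
# Group 2 the digit list, Group 3 the single separator that must be reused.
VALID = re.compile(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$')


def expand_numbers(text):
    """Return 'Number1,Number2,...' or None when the input does not validate."""
    m = VALID.search(text)
    if m is None:
        return None
    prefix, numbers, sep = m.group(1), m.group(2), m.group(3)
    # Split on the separator that was actually used, then re-attach the prefix.
    return ",".join(prefix + n for n in numbers.split(sep))


for text in ("Number1,2,3", "Number1/2/3", "Number1-2-3", "Number1/2,3"):
    print(text, "->", expand_numbers(text))
```

Since the prefix is joined with ordinary string operations here, no backslash escaping is needed, unlike when it is passed to re.sub as a replacement.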
import re
for text in ("Number1,2,3", "Number1-2-3", "Number1/2/3"):
    print(re.sub(r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)", r"\1\2,\1\3,\1\4", text))
\D+ matches "Number" or any other non-digit text
\d+ matches one or more digits
[/,-] matches any of /, ,, -
The rest repeats the same pieces for each of the three numbers.
The substitution consists of backreferences to the matched "Number" string (\1) and then each group of the (\d+)s.
This works if you're sure that it's always three numbers divided by that separator. This does not ensure that it's the same separator between each number. But it's short.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
If you can use the PyPI regex module, you can use the captures collection with a named capture group.
([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)
([^\d\s,/]+) Capture group 1, match 1+ chars other than the listed
(?<num>\d+) Named capture group num matching 1+ digits
([,/-]) Capture either , / - in group 3
(?<num>\d+) Named capture group num matching 1+ digits
(?:\3(?<num>\d+))* Optionally repeat a backreference to group 3 to keep the separators the same and match 1+ digits in group num
(?!\S) Assert a whitespace boundary to the right to prevent a partial match
import regex as re
pattern = r"([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)"
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
matches = re.finditer(pattern, s)
for m in matches:
    print(','.join([m.group(1) + c for c in m.captures("num")]))
Output
Number1,Number2,Number3
Number4,Number5,Number6

does Regex omit part of a string if it has already been matched?

Python 3.8.2
The task at hand is simple: match lowercase characters separated by a single underscore, so the pattern could be r"[a-z]+_[a-z]+".
My issue is that I expected re.findall() to pair up all the words in the following:
"ash_tonic_transit_so_kern_err_looo_"
Instead of pairing all the words around each underscore ('ash_tonic', 'tonic_transit', 'transit_so', etc.), I get three pairs: ['ash_tonic', 'transit_so', 'kern_err']
Does Python's re skip part of the string once a match has been found, instead of running the search again?
import re
def match_lower(s):
    patternRegex = re.compile(r'[a-z]+_[a-z]+')
    mo = patternRegex.findall(s)
    return mo
print(match_lower('ash_tonic_transit_so_kern_err_looo_'))
You could use a positive lookahead with a capturing group to get the matches, and start the match asserting what is directly to the left is not a char a-z using a negative lookbehind.
Use re.findall which will return the values from the capturing group.
(?<![a-z])(?=([a-z]+_[a-z]+))
Explanation
(?<![a-z]) Negative lookbehind, assert what is directly to the left is not a char a-z
(?= Positive lookahead, assert that what is on the right is
([a-z]+_[a-z]+) Capture group 1, match 1+ chars a-z _ 1+ chars a-z
) Close lookahead
import re
regex = r"(?<![a-z])(?=([a-z]+_[a-z]+))"
test_str = "ash_tonic_transit_so_kern_err_looo_"
print(re.findall(regex, test_str))
Output
['ash_tonic', 'tonic_transit', 'transit_so', 'so_kern', 'kern_err', 'err_looo']
This is explicitly mentioned in the documentation of re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings.
For instance, 'ash_tonic' and 'tonic_transit' overlap, so they won't be considered two distinct matches.
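The difference between consuming and non-consuming matches can be checked side by side. The following is a small sketch using the string and the two patterns from this thread; only the variable names are new.

```python
import re

text = "ash_tonic_transit_so_kern_err_looo_"

# Plain pattern: each match consumes its characters, so the scan resumes
# after the end of the previous match and the neighbouring pair is skipped.
consumed = re.findall(r"[a-z]+_[a-z]+", text)

# Lookahead pattern: the match itself is empty, so nothing is consumed and
# every word start gets a chance; the pair is read from capture group 1.
overlapping = re.findall(r"(?<![a-z])(?=([a-z]+_[a-z]+))", text)

print(consumed)      # ['ash_tonic', 'transit_so', 'kern_err']
print(overlapping)   # ['ash_tonic', 'tonic_transit', 'transit_so', 'so_kern', 'kern_err', 'err_looo']
```

The trailing "looo_" produces no pair in either case, because nothing follows the final underscore.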

How to extract filename from path using regex

I would like to extract a filename from a path using regular expression:
mysting = '/content/drive/My Drive/data/happy (463).jpg'
How do I extract 'happy.jpg'?
I have tried this: '[^/]*$' but the result still includes the number in parentheses, which I do not want: 'happy (463).jpg'
How could I improve it?
You could use 2 capturing groups. After the last /, capture 1+ word chars in group 1.
Then match 1+ digits between parentheses and capture .jpg in group 2, asserting the end of the string.
^.*/(\w+)\s*\(\d+\)(\.jpg)$
In parts that will match
^.*/ Match until last /
(\w+) Capture group 1, match 1+ word chars
\s* Match 0+ whitespace chars
\(\d+\) Match 1+ digits between parentheses
(\.jpg) Capture group 2, match .jpg
$ End of string
Then use group 1 and group 2 in the replacement to get happy.jpg
import re
regex = r"^.*/(\w+)\s*\(\d+\)(\.jpg)$"
test_str = "/content/drive/My Drive/data/happy (463).jpg"
result = re.sub(regex, r"\1\2", test_str, count=1)
if result:
    print(result)
Output
happy.jpg
Without Regex; str methods (str.partition and str.rpartition):
In [185]: filename = mysting.rpartition('/')[-1]
In [186]: filename
Out[186]: 'happy (463).jpg'
In [187]: f"{filename.partition(' ')[0]}.{filename.rpartition('.')[-1]}"
Out[187]: 'happy.jpg'
With Regex; re.sub:
re.sub(r'.*/(?!.*/)([^\s]+)[^.]+(\..*)', r'\1\2', mysting)
.*/ greedily matches up to the last /
The zero-width negative lookahead (?!.*/) ensures there is no / anywhere ahead
([^\s]+) matches up to the next whitespace and is stored as the first captured group
[^.]+ matches up to the next .
(\..*) matches a literal . followed by any number of characters and is stored as the second captured group; if you want to match more conservatively, e.g. exactly 3 characters or even a literal .jpg, you can do that as well
In the replacement, only the captured groups are used
Example:
In [183]: mysting = '/content/drive/My Drive/data/happy (463).jpg'
In [184]: re.sub(r'.*/(?!.*/)([^\s]+)[^.]+(\..*)', r'\1\2', mysting)
Out[184]: 'happy.jpg'
I use JavaScript.
In JavaScript, you can strip the parenthesized number like this:
const myString="happy (463).jpg";
const result=myString.replace(/\s\(\d*\)/,'');
Apply this code after splitting the path on the slash separator
to isolate the filename.
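The same split-then-strip idea can be sketched in Python. Using pathlib to take the last path component is a substitution for the manual split, and the helper name clean_filename and the slightly stricter pattern (digits required, tag must sit right before the extension) are assumptions for this example.

```python
import re
from pathlib import PurePosixPath


def clean_filename(path):
    """Return the last path component without a trailing ' (digits)' tag."""
    name = PurePosixPath(path).name  # e.g. 'happy (463).jpg'
    # Drop optional whitespace plus '(digits)' only when it sits directly
    # before the final extension.
    return re.sub(r"\s*\(\d+\)(?=\.[^.]+$)", "", name)


print(clean_filename("/content/drive/My Drive/data/happy (463).jpg"))  # happy.jpg
```

Anchoring the parenthesized number to the extension keeps directory names or earlier parts of the filename that happen to contain parentheses out of the substitution.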

Remove duplicate words in a string using regex

I'm working on my regex skills and I found that some of my strings have duplicate words at the start. I would like to remove the duplicate and keep just one occurrence of the word -
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I still get both server_server in my output,
((.*?))_(?!\D)
How can I reduce the output to a single server_ if there are two or more, and take it as is if there is only one server_?
The output doesn't have to contain the digits or the part after the ., i.e. .zzz, .xyz etc.
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could use a backreference to the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "many times" quantifier so that it still works if there are more than 2 occurrences:
>>> s = "server_server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
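Getting rid of the suffix is then straightforward: capture the rest and discard the end. Note that this pattern requires at least one repetition, so a name without a duplicate, such as server_dev1_1233.zzz, is left unchanged: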
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
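When a name does not fit the expected shape, re.sub silently returns it unchanged. If that case should be handled explicitly, the same regex can be used with a match object instead. This is a sketch only: the helper name dedupe_prefix and the decision to fall back to the original string are assumptions.

```python
import re

# Same regex as above: Group 1 is the first word, Group 2 the middle part.
RX = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')


def dedupe_prefix(name):
    """Collapse a repeated leading word and drop the '_digits.ext' tail."""
    m = RX.match(name)
    if m is None:
        # Not in the expected '<word>[_<word>...]..._<digits>.<ext>' shape.
        return name
    return m.group(1) + m.group(2)


for s in ("server_server_dev1_check_1233.zzz", "server_dev1_1233.zzz", "readme.txt"):
    print(s, "->", dedupe_prefix(s))
```

The explicit None check marks the one place where a different fallback, such as raising an error or logging, could be plugged in.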

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names, each preceded by a '-' hyphen, which I extract with the regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can get the very last result using greediness with the regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th and last result)
Can you please help me get the 1st, 2nd, or 3rd result individually from the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen followed by 0+ whitespace characters, and then captures 1+ digits, letters or underscores. When the pattern contains a capturing group, re.findall returns only the captured text, so you get just the Group 1 values captured with (\w+).
To get these matches one by one with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the number of the match you want, counting from 1.
A quick Python demo (in real code, assign the result of re.search and only access the Group 1 value after checking that there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ characters other than a newline (since no re.DOTALL modifier is used), as few as possible, up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
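The check recommended above (test the re.search result before reading Group 1) can be folded into a small helper. This is a sketch under assumptions: the function name nth_after_hyphen is hypothetical, and n is expected to be a positive integer counted from 1.

```python
import re


def nth_after_hyphen(text, n):
    """Return the n-th word that follows a hyphen (1-based), or None."""
    # Build ^(?:.*?-\s*(\w+)){n}; Group 1 keeps the value of the last repetition.
    pattern = r'^(?:.*?-\s*(\w+)){' + str(n) + '}'
    m = re.search(pattern, text)
    return m.group(1) if m else None


s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
for n in range(1, 6):
    print(n, nth_after_hyphen(s, n))
```

Because every repetition needs its own hyphen, asking for a fifth match in this string finds nothing and the helper returns None rather than raising AttributeError.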
