I was trying to solve a regex problem:
The input is a sentence in one of these forms: Number1,2,3 or Number1/2/3 or Number1-2-3. The three possible delimiters are: , / -
The expected output is: Number1,Number2,Number3
Pattern I've tried so far:
(?<=,)[^,]+(?=,)
but this misses the edge cases, i.e. the first and last elements. I also can't get it to work for '/'.
You could separate out the key from values, then use a list comprehension to build the output you want.
import re
inp = "Number1,2,3"
matches = re.search(r'(\D+)(.*)', inp)
# Split on any of the three delimiters: , / -
output = [matches[1] + x for x in re.split(r'[,/-]', matches[2])]
print(output) # ['Number1', 'Number2', 'Number3']
You can do it in several steps: 1) validate the string to match your pattern, and once validated 2) add the first non-digit chunk to the numbers while replacing - and / separator chars with commas:
import re
texts = ['Number1,2,3', 'Number1/2/3', 'Number1-2-3']
for text in texts:
    m = re.search(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$', text)
    if m:
        print( re.sub(r'(?<=,)(?=\d)', m.group(1).replace('\\', '\\\\'), text.replace('/',',').replace('-',',')) )
    else:
        print(f"NO MATCH in '{text}'")
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
The ^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$ regex validates your three types of input:
^ - start of string
(\D+) - Group 1: one or more non-digits
(\d+(?=([,/-]))(?:\3\d+)*) - Group 2: one or more digits, and then zero or more repetitions of ,, / or - and one or more digits (and the separator chars should be consistent due to the capture used in the positive lookahead and the \3 backreference to that value used in the non-capturing group)
$ - end of string.
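The re.sub pattern, (?<=,)(?=\d), matches the position between a comma and a digit, and the Group 1 value is inserted there. The .replace('\\', '\\\\') call is needed because the replacement string is built dynamically, and re.sub would otherwise treat any backslash in it as the start of an escape sequence.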
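To make the effect of the validation step visible, here is a minimal sketch that wraps the same two steps in a helper. The name expand is made up for illustration and is not part of the answer; the point is only that input mixing separators is rejected rather than silently converted.

```python
import re

# Validation pattern from the answer above: group 3 captures the first
# separator, and the \3 backreference forces every later one to be the same.
VALID = re.compile(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$')

def expand(text):
    """Return 'Number1,Number2,...' or None if text does not validate.

    `expand` is a hypothetical helper name used only for this sketch.
    """
    m = VALID.search(text)
    if not m:
        return None
    normalized = text.replace('/', ',').replace('-', ',')
    # Escape backslashes because the prefix is used as a replacement template.
    prefix = m.group(1).replace('\\', '\\\\')
    return re.sub(r'(?<=,)(?=\d)', prefix, normalized)

print(expand('Number1/2/3'))  # consistent separators: converted
print(expand('Number1/2,3'))  # mixed separators: None
```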
import re
for text in ("Number1,2,3", "Number1-2-3", "Number1/2/3"):
    print(re.sub(r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)", r"\1\2,\1\3,\1\4", text))
\D+ matches "Number" or any other non-number text
\d+ matches a number (or more than one)
[/,-] matches any of /, ,, -
The rest is copy paste 3 times.
The substitution consists of backreferences to the matched "Number" string (\1) and then each group of the (\d+)s.
This works if you're sure that it's always three numbers divided by that separator. This does not ensure that it's the same separator between each number. But it's short.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
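The separator caveat can be seen with a mixed input. The sketch below also shows one possible tightening, which is my assumption rather than part of the answer: capture the first separator and reuse it with a backreference, which shifts the number groups to 2, 4 and 5.

```python
import re

loose = r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)"
# Hypothetical stricter variant: group 3 is the separator, \3 repeats it.
strict = r"(\D+)(\d+)([/,-])(\d+)\3(\d+)"

mixed = "Number1/2-3"
loose_result = re.sub(loose, r"\1\2,\1\3,\1\4", mixed)
strict_mixed = re.sub(strict, r"\1\2,\1\4,\1\5", mixed)
strict_same = re.sub(strict, r"\1\2,\1\4,\1\5", "Number1/2/3")

print(loose_result)  # mixed separators are accepted and converted
print(strict_mixed)  # no match, so the string comes back unchanged
print(strict_same)   # consistent separators are converted
```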
If you can use the regex module from PyPI, you can use its captures collection with a named capture group.
([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)
([^\d\s,/]+) Capture group 1, match 1+ chars other than the listed
(?<num>\d+) Named capture group num matching 1+ digits
([,/-]) Capture either , / - in group 3
(?<num>\d+) Named capture group num matching 1+ digits
(?:\3(?<num>\d+))* Optionally repeat a backreference to group 3 to keep the separators the same and match 1+ digits in group num
(?!\S) Assert a whitespace boundary to the right to prevent a partial match
import regex as re
pattern = r"([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)"
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
matches = re.finditer(pattern, s)
for m in matches:
    print(','.join(m.group(1) + c for c in m.captures("num")))
Output
Number1,Number2,Number3
Number4,Number5,Number6
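If installing the third-party module is not an option, a similar result can be sketched with the standard library. This is an adaptation, not the answer's code: since re has no captures(), the whole number chain is captured as one group and then split on the separator that was matched.

```python
import re

# Group 1: prefix, group 2: the whole number chain, group 3: the separator.
pattern = re.compile(r"([^\d\s,/]+)(\d+([,/-])\d+(?:\3\d+)*)(?!\S)")

s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
results = []
for m in pattern.finditer(s):
    # The chain only contains digits and the one separator in group 3.
    numbers = m.group(2).split(m.group(3))
    results.append(','.join(m.group(1) + n for n in numbers))

print(results)
```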
The task is to separate the attributes of a product from its string. I am using a regex to extract the required parts, but I'm having difficulty distinguishing "L" from "ML" (or "l" from "ml"). The same happens with "kg" and "g", as the regex always chooses the shorter string.
import re
prod = 'TestProduct- 200 ML x24'
searchobj = re.findall(r'([0-9]+).*(g|kg|ltr|l|ml)\s*x*[*]*([0-9]+)', prod, re.I)
print(searchobj)
#output
[('200', 'L', '24')]
How can I get the following output instead?
[('200', 'ML', '24')]
Thanks.
You could specify that you only want whole words of the form (g|kg|ltr|l|ml)\s by changing that to \s(g|kg|ltr|l|ml)\s (require a space before and after the expression).
You can use
(\d+(?:\.\d+)?)\s*(g|kg|ltr|l|ml)\s*x*\**(\d+(?:\.\d+)?)
Details
(\d+(?:\.\d+)?) - Group 1: one or more digits, and then an optional sequence of a dot and then one or more digits
\s* - 0+ whitespaces
(g|kg|ltr|l|ml) - Group 2: one of the char (sequences)
\s* - 0+ whitespaces
x* - 0 or more x chars
\** - 0 or more * chars
(\d+(?:\.\d+)?) - Group 3: one or more digits, and then an optional sequence of a dot and then one or more digits
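As a quick check, the sketch below runs both the question's pattern and the pattern above on the sample product string, assuming the re.I flag from the question is kept. The question's pattern yields "L" because the greedy .* consumes the "M" before the unit group gets a chance to match; replacing it with \s* keeps the unit directly after the number. The second product string is a made-up example.

```python
import re

prod = 'TestProduct- 200 ML x24'

# Pattern from the question: the greedy .* swallows " M", leaving only "L".
greedy = re.findall(r'([0-9]+).*(g|kg|ltr|l|ml)\s*x*[*]*([0-9]+)', prod, re.I)

# Pattern from the answer, compiled case-insensitively as in the question.
unit_rx = re.compile(r'(\d+(?:\.\d+)?)\s*(g|kg|ltr|l|ml)\s*x*\**(\d+(?:\.\d+)?)', re.I)
fixed = unit_rx.findall(prod)
kilos = unit_rx.findall('Rice 5 kg x10')  # hypothetical second product

print(greedy)
print(fixed)
print(kilos)
```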
I'm working on my regex skills, and I have strings that start with a duplicated word. I would like to remove the duplicates and keep just one copy of the word:
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I still get server_server in my output:
((.*?))_(?!\D)
How can I reduce the output to a single server_ when there are two or more, and keep it as is when there is only one?
The output doesn't need to contain the digits or the part after the dot, i.e. .zzz, .xyz etc.
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could back-reference the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "one or more" quantifier so that it still works when there are more than two occurrences:
>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hardest part: just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
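One thing worth checking with this pattern: because \1{1,} requires at least one repetition, a string whose prefix is not duplicated does not match at all and comes back unchanged, suffix included. A small sketch using two of the question's sample strings:

```python
import re

rx = r"(.*_)\1{1,}(.*)(_\d+\..*)"

duplicated = re.sub(rx, r"\1\2", "data_data_dev9_check_660.log")
single = re.sub(rx, r"\1\2", "server_dev1_1233.zzz")

print(duplicated)  # prefix collapsed and suffix removed
print(single)      # no match: returned as is, suffix still present
```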
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
I have a bunch of line data I need to capture like so:
Level production data TD Index
Total Agriculture\Production data TS Index
I need to capture everything before the last two words; in this case, for example, my regex output should be Level production data for the first line. How can I do this while allowing for a varying number of words before the TD Index? Thanks!
Try this regex:
^.*(?=(?:\s+\S+){2}$)
Explanation:
^ - asserts the start of the string
.* - matches 0+ occurrences of any character except a newline character
(?=(?:\s+\S+){2}$) - positive lookahead asserting that the current position is followed by exactly two words (each one being 1+ whitespace characters followed by 1+ non-whitespace characters) and then the end of the string
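A minimal Python sketch of this pattern applied to the two sample lines from the question. It assumes every line has at least three words; for shorter lines rx.match returns None and the list comprehension would need a guard.

```python
import re

rx = re.compile(r'^.*(?=(?:\s+\S+){2}$)')

lines = [
    "Level production data TD Index",
    r"Total Agriculture\Production data TS Index",
]
# group(0) is everything the greedy .* kept before the last two words.
heads = [rx.match(line).group(0) for line in lines]
print(heads)
```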
You can try this:
import re
s = ["Level production data TD Index", r"Total Agriculture\Production data TS Index"]
new_s = [re.findall(r'[\w\s\W]{1,}(?=\s\w+\s\w+$)', i)[0] for i in s]
print(new_s)
Output:
['Level production data', 'Total Agriculture\\Production data']
Code
.*(?= \S+ \S+)
Alternatively: .*(?= [\w\/]+ [\w\/]+) replacing \S with what you define as your valid word character set.
You can also add + after the spaces if there is a possibility of more than 1 space being present as such: .*(?= +\S+ +\S+)
Usage
import re
r = r".*(?= \S+ \S+)"
l = [
    "Level production data TD Index",
    "Total Agriculture\\Production data TS Index"
]
for s in l:
    m = re.match(r, s)
    if m:
        print(m.group(0))
Explanation
.* Match any character any number of times
(?= \S+ \S+) Positive lookahead ensuring what follows matches:
    ' ' Match a literal space
    \S+ Match any non-whitespace character one or more times
    ' ' Match a literal space
    \S+ Match any non-whitespace character one or more times
I have a string that looks like either of these three examples:
1: Name = astring Some comments
2: Typ = one two thee Must be "sand", "mud" or "bedload"
3: RDW = 0.02 [ - ] Some comment about RDW
I first split the variable name and rest like so:
re.findall(r'\s*([a-zA-Z0-9_]+)\s*=\s*(.*)', line)
I then want to split the right-hand part of the string into a part containing the values and a part containing the comments (if there are any). I want to do this by looking at the number of whitespace characters: if it exceeds, say, 4, then I assume the comments start there.
Any idea on how to do this?
I currently have
re.findall(r'(?:(\S+)\s{0,3})+', dataString)
However if I test this using the string:
'aa aa23r234rf2134213^$&$%& bb'
Then it also selects 'bb'
You may use a single regex with re.findall:
^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$
Details:
^ - start of string
\s* - 0+ whitespaces
(\w+) - capturing group #1 matching 1 or more letters/digits/underscores
\s*=\s* - = enclosed with 0+ whitespaces
(.*?) - capturing group #2 matching any 0+ chars, as few as possible, up to the first...
(?:(?:\s{4,}|\[)(.*))? - an optional group matching
    (?:\s{4,}|\[) - 4 or more whitespaces or a [
    (.*) - capturing group #3 matching 0+ chars up to
$ - the end of string.
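A rough sketch of the pattern in use. The helper name parse is hypothetical, and the sample lines assume the value and the comment are separated by at least four spaces, as described in the question. Note that when the comment part starts at a [, the bracket itself is consumed by the delimiter alternative and does not appear in group 3.

```python
import re

line_rx = re.compile(r'^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$')

def parse(line):
    """Return (name, value, comment); comment is '' when there is none.

    `parse` is a hypothetical helper name used only for this sketch.
    """
    m = line_rx.match(line)
    if not m:
        return None
    # default='' turns the unmatched optional group into an empty string.
    name, value, comment = m.groups(default='')
    return name, value.strip(), comment.strip()

print(parse('Name = astring     Some comments'))
print(parse('Typ = one two three'))
print(parse('RDW = 0.02 [ - ]     Some comment about RDW'))
```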