How do I select variable Regular expression using Python?

I have some lines like below with numbers and strings. Some have only numbers while some have some strings as well before them:
'abc' (17245...64590)
'cde' (12244...67730)
'dsa' complement (12345...67890)
I would like to extract both formats, with and without the string. The first two lines should yield only the numbers, while the third line should also yield the string before the numbers.
I am using this command to achieve this:
result = re.findall("\bcomplement\b|\d+", line)
Any idea how to do it?
Expected output would be like this:
17245, 64590
12244, 67730
complement, 12345, 67890

If the number of digit chunks inside the parentheses is always 2 and they are separated by 1+ dots, use
re.findall(r'\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)', s)
See the regex demo. And a sample Python demo:
import re
# The columns are assumed to be separated by 2 or more spaces
s = ''''abc'    (17245...64590)
'cde'    (12244...67730)
'dsa'    complement (12345...67890)'''
rx = r"\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)"
for x in re.findall(rx, s):
    print(", ".join([y for y in x if y]))
Details
\s{2,} - 2 or more whitespaces
(?:(\w+)\s*)? - an optional sequence of:
(\w+) - Group 1: one or more word chars
\s* - 0+ whitespaces
\( - a (
(\d+) - Group 2: one or more digits
\.+ - 1 or more dots
(\d+) - Group 3: one or more digits
\) - a ) char.
If the number of digit chunks inside the parentheses can vary you may use
import re
# The columns are assumed to be separated by 2 or more spaces
s = ''''abc'    (17245...64590)
'cde'    (12244...67730)
'dsa'    complement (12345...67890)'''
for m in re.finditer(r'\s{2,}(?:(\w+)\s*)?\(([\d.]+)\)', s):
    res = []
    if m.group(1):
        res.append(m.group(1))
    res.extend(re.findall(r'\d+', m.group(2)))
    print(", ".join(res))
Both Python snippets output:
17245, 64590
12244, 67730
complement, 12345, 67890
See the online Python demo. Note that it can match any number of digit chunks inside the parentheses, and it assumes that there are at least 2 whitespace chars between Column 1 and Column 2.
See the regex demo, too. The difference with the first one is that there is no third group, the second and third groups are replaced with one second group ([\d.]+) that captures 1 or more dots or digits (the digits are later extracted with re.findall(r'\d+', m.group(2))).
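As a side note on the attempt in the question: in a normal (non-raw) Python string literal, "\b" is the backspace character, so the word-boundary assertion never reaches the regex engine. A minimal sketch, assuming each input line is processed separately (the lines list below is just the sample data from the question):

```python
import re

lines = [
    "'abc' (17245...64590)",
    "'cde' (12244...67730)",
    "'dsa' complement (12345...67890)",
]

# In a non-raw literal, "\b" is a single backspace character (\x08).
print("\b" == "\x08")  # True

# With a raw string, \b reaches the regex engine as a word boundary.
pattern = re.compile(r"\bcomplement\b|\d+")

results = [pattern.findall(line) for line in lines]
for found in results:
    print(", ".join(found))
# 17245, 64590
# 12244, 67730
# complement, 12345, 67890
```

This only picks up the literal word complement; the patterns above are the more general option when other words may appear in that column.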

Related

Pattern to extract, expand and form a sentence based on a certain delimiter

I was trying to solve a regex problem:
There is an input sentence in one of these forms: Number1,2,3 or Number1/2/3 or Number1-2-3. These are the 3 delimiters: , / -
The expected output is: Number1,Number2,Number3
Pattern I've tried so far:
(?<=,)[^,]+(?=,)
but this misses the edge cases, i.e. the 1st element and the last element. I am also not able to make it work for '/'.
You could separate out the key from the values, then use a list comprehension to build the output you want:
import re
inp = "Number1,2,3"
matches = re.search(r'(\D+)(.*)', inp)
# Split on any of the three delimiters: , / -
output = [matches[1] + x for x in re.split(r'[,/-]', matches[2])]
print(output) # ['Number1', 'Number2', 'Number3']
You can do it in several steps: 1) validate the string to match your pattern, and once validated 2) add the first non-digit chunk to the numbers while replacing - and / separator chars with commas:
import re
texts = ['Number1,2,3', 'Number1/2/3', 'Number1-2-3']
for text in texts:
    m = re.search(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$', text)
    if m:
        print( re.sub(r'(?<=,)(?=\d)', m.group(1).replace('\\', '\\\\'), text.replace('/',',').replace('-',',')) )
    else:
        print(f"NO MATCH in '{text}'")
See this Python demo.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
The ^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$ regex validates your three types of input:
^ - start of string
(\D+) - Group 1: one or more non-digits
(\d+(?=([,/-]))(?:\3\d+)*) - Group 2: one or more digits, and then zero or more repetitions of ,, / or - and one or more digits (and the separator chars should be consistent due to the capture used in the positive lookahead and the \3 backreference to that value used in the non-capturing group)
$ - end of string.
The re.sub pattern, (?<=,)(?=\d), matches a location between a comma and a digit, the Group 1 value is placed there (note the .replace('\\', '\\\\') is necessary since the replacement is dynamic).
import re
for text in ("Number1,2,3", "Number1-2-3", "Number1/2/3"):
    print(re.sub(r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)", r"\1\2,\1\3,\1\4", text))
\D+ matches "Number" or any other non-number text
\d+ matches a number (one or more digits)
[/,-] matches any of /, ,, -
The rest is copy paste 3 times.
The substitution consists of backreferences to the matched "Number" string (\1) and then each group of the (\d+)s.
This works if you're sure that it's always three numbers divided by that separator. This does not ensure that it's the same separator between each number. But it's short.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
If you can make use of the PyPI regex module, you can use the captures collection with a named capture group.
([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)
([^\d\s,/]+) Capture group 1, match 1+ chars other than the listed
(?<num>\d+) Named capture group num matching 1+ digits
([,/-]) Capture either , / - in group 3
(?<num>\d+) Named capture group num matching 1+ digits
(?:\3(?<num>\d+))* Optionally repeat a backreference to group 3 to keep the separators the same and match 1+ digits in group num
(?!\S) Assert a whitespace boundary to the right to prevent a partial match
Regex demo | Python demo
import regex as re
pattern = r"([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)"
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
matches = re.finditer(pattern, s)
for m in matches:
    print(','.join([m.group(1) + c for c in m.captures("num")]))
Output
Number1,Number2,Number3
Number4,Number5,Number6
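For an arbitrary number of values without third-party modules, the validate-then-expand idea can also be sketched with the standard re module alone. This is only an illustrative sketch: expand_numbers and ITEM_RX are hypothetical names, and it assumes the whole input string is a single item such as Number1,2,3.

```python
import re

# Hypothetical helper pattern: group 1 is the non-digit prefix, group 2 the
# whole number list, group 3 the first separator. The \3 backreference forces
# every later separator to be the same character.
ITEM_RX = re.compile(r"(\D+)(\d+(?:([,/-])\d+(?:\3\d+)*)?)")

def expand_numbers(text):
    m = ITEM_RX.fullmatch(text)
    if m is None:
        return None  # not in the expected form, or mixed separators
    prefix, numbers, sep = m.groups()
    # sep is None when there is only one number and therefore no separator
    parts = numbers.split(sep) if sep else [numbers]
    return ",".join(prefix + part for part in parts)

for text in ("Number1,2,3", "Number1/2/3", "Number1-2-3", "Number7/8,9"):
    print(text, "->", expand_numbers(text))
# Number1,2,3 -> Number1,Number2,Number3
# Number1/2/3 -> Number1,Number2,Number3
# Number1-2-3 -> Number1,Number2,Number3
# Number7/8,9 -> None
```

Returning None for input that does not validate keeps malformed strings from passing through silently; whether that or raising an exception is preferable depends on the caller.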

Pandas regex to remove digits before consecutive dots

I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
I want to remove all the numbers that come at the end of a word, right before a dot (or at the end of the string).
The first part of the string, i.e. "Node57Name123", should be ignored.
Digits inside words should not be removed.
I tried re.sub(r"\d+","",string), but it removed all the other digits as well.
The output should look like this: "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me in the right direction?
You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed with a dot or end of string (=(?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
Just to give you a non-regex alternative using rstrip(): we can feed this function a bunch of characters to remove from the right of the string, e.g. rstrip('0123456789'). Alternatively, we can also use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
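The two ideas above (leave the first dot-separated part alone, strip digits only at the end of the later parts) can also be combined. A minimal sketch, assuming the part to keep untouched is everything before the first dot; strip_trailing_digits is a hypothetical helper name:

```python
import re

def strip_trailing_digits(path):
    # Split off the first part so it is never modified.
    head, dot, rest = path.partition(".")
    # In the remainder, drop digit runs that sit right before a dot
    # or at the very end of the string.
    return head + dot + re.sub(r"\d+(?=\.|$)", "", rest)

s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print(strip_trailing_digits(s))
# Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
```

Digits inside a word, such as the 23 in grp23Symbol, are kept because the lookahead only succeeds when the digit run is followed by a dot or the end of the string.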

Python REGEX How to extract particular numbers from variable

I have the following problem:
var a = ' 15159970 (30.12.2015), 15615115 (01.01.1970), 11224455, 55441123'
I'd like a regex to extract only the numbers: 15159970, 15615115, 11224455, 55441123
What I have so far:
re.findall(r'(\d+\s)\(', a)
which only extracts the first 2 numbers: 15159970, 15615115
I also have a second var b = 15159970, 15615115, 11224455, 55441126. I would like to compare the 2 vars and, if they differ, print("vars are different!")
Thanks!
You may extract all chunks of digits not preceded with a digit or digit + dot and not followed with a dot + digit or a digit:
(?<!\d)(?<!\d\.)\d+(?!\.?\d)
See the regex demo
Details
(?<!\d) - a negative lookbehind that fails a location immediately preceded with a digit
(?<!\d\.) - a negative lookbehind that fails a location immediately preceded with a digit and a dot
\d+ - 1+ digits
(?!\.?\d) - a negative lookahead that fails a location immediately followed with a digit or a dot + a digit.
Python demo:
import re
a = ' 15159970 (30.12.2015), 15615115 (01.01.1970), 11224455, 55441123 '
print( re.findall(r'(?<!\d)(?<!\d\.)\d+(?!\.?\d)', a) )
# => ['15159970', '15615115', '11224455', '55441123']
Another solution: only extract the digit chunks outside of parentheses.
See this Python demo:
import re
text = "15159970 (30.12.2015), 15615115 (01.01.1970), 11224455, 55441123 (28.11.2014 12:43:14)"
print( list(filter(None, re.findall(r'\([^()]+\)|(\d+)', text))) )
# => ['15159970', '15615115', '11224455', '55441123']
Here, \([^()]+\)|(\d+) matches
\([^()]+\) - (, any 1+ chars other than ( and ) and then )
| - or
(\d+) - matches and captures into Group 1 one or more digits (re.findall only includes captured substrings if there is a capturing group in the pattern).
Empty items appear in the result whenever the parenthesized alternative matches (Group 1 is empty in that case), thus we need to remove them (either with list(filter(None, results)) or with [x for x in results if x]).
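The second part of the question, comparing the two variables, can be sketched by extracting the numbers from both and comparing the resulting lists. This assumes b is a plain string in the same comma-separated style; the names ids_a and ids_b are only illustrative:

```python
import re

a = ' 15159970 (30.12.2015), 15615115 (01.01.1970), 11224455, 55441123 '
b = '15159970, 15615115, 11224455, 55441126'  # assumed to be a string

# Same pattern as above: digit chunks that are not part of a dotted date.
pattern = re.compile(r'(?<!\d)(?<!\d\.)\d+(?!\.?\d)')

ids_a = pattern.findall(a)
ids_b = pattern.findall(b)

# List comparison is order-sensitive; compare set(ids_a) and set(ids_b)
# instead if the order of the numbers should not matter.
if ids_a != ids_b:
    print("vars are different!")
# vars are different!
```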

Remove duplicate words in a string using regex

I'm working on my regex skills and I found that some of my strings have duplicate words at the start. I would like to remove the duplicate and keep just one occurrence of the word:
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I get both server_server in my output:
((.*?))_(?!\D)
How can I reduce the output to just one server_ if there are two or more, and keep it as is if there is only one server_?
The output doesn't have to contain the digits or the part after the ., i.e. .zzz, .xyz etc.
Expected output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could back-reference the word in your search expression:
>>> import re
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "many times" suffix so that it still works if there are more than 2 occurrences:
>>> s = 'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hardest part; just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
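One thing to keep in mind with re.sub is that a string that does not match the pattern is returned unchanged. If non-conforming names should be detected instead, the same pattern can be used with a match object. A minimal sketch; NAME_RX and clean_name are hypothetical names:

```python
import re

# Same pattern as above, compiled once for reuse.
NAME_RX = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')

def clean_name(name):
    m = NAME_RX.match(name)
    if m is None:
        return None  # name does not follow the expected layout
    # Group 1 is the leading word, Group 2 the part before _digits.ext
    return m.group(1) + m.group(2)

for name in ("server_server_dev1_check_1233.zzz", "server_dev1_1233.zzz", "readme.txt"):
    print(name, "->", clean_name(name))
# server_server_dev1_check_1233.zzz -> server_dev1_check
# server_dev1_1233.zzz -> server_dev1
# readme.txt -> None
```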

the use of regular expression

I'm new to regular expressions, but I want to match a pattern in about 2 million strings.
There are three forms of the original strings, shown as follows:
EC-2A-07<EC-1D-10>
EC-2-07
T1-ZJF-4
I want to get the three substrings separated by -, which is to say I want to get EC, 2A, 07 respectively. For the first string in particular, I just want to split the part before <.
I have tried .+[\d]\W, but it cannot recognize EC-2-07. Then I used .split('-') to split the string, and then indexed into the returned list to get what I want. But it is inefficient.
Can you figure out an efficient regular expression that meets my requirements? Thanks a lot!
You need to use
^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})
See the regex demo
Details:
^ - start of string
([A-Z0-9]{2}) - Group 1 capturing 2 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,3}) - Group 2 capturing 1 to 3 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,2}) - Group 3 capturing 1 to 2 uppercase ASCII letters or digits.
You may adjust the values in the {min,max} quantifiers as required.
Sample Python demo:
import re
regex = r"^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})"
test_str = "EC-2A-07<EC-1D-10>\nEC-2-07\nT1-ZJF-4"
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
# or line by line
lines = test_str.split('\n')
rx = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")
for line in lines:
    m = rx.match(line)
    if m:
        print('{0} :: {1} :: {2}'.format(m.group(1), m.group(2), m.group(3)))
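For a large batch of strings, it may be worth compiling the pattern once and comparing it against a plain string-method version on your own data before deciding which is faster. The sketch below only shows that both approaches give the same parts for the three sample forms; parts_regex and parts_split are hypothetical helper names, and no timing claim is made here:

```python
import re

# Compiled once, reused for every string.
PART_RX = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")

def parts_regex(s):
    m = PART_RX.match(s)
    return m.groups() if m else None

def parts_split(s):
    # Keep only the text before "<", then split on hyphens.
    head = s.partition("<")[0]
    pieces = head.split("-")
    return tuple(pieces) if len(pieces) == 3 else None

samples = ["EC-2A-07<EC-1D-10>", "EC-2-07", "T1-ZJF-4"]
for s in samples:
    print(parts_regex(s), parts_split(s))
# ('EC', '2A', '07') ('EC', '2A', '07')
# ('EC', '2', '07') ('EC', '2', '07')
# ('T1', 'ZJF', '4') ('T1', 'ZJF', '4')
```

The regex version also validates the character classes and lengths, while the split version accepts any three hyphen-separated pieces, so the two are only interchangeable when the input is known to be well formed.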
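You can try this:
^(\w+)-(\w+)-(\w+)(?=\W).*$
Note that (?=\W) requires a non-word character after the third part, so it matches EC-2A-07<EC-1D-10> but not a string that ends right after the third part, such as EC-2-07 on its own; replacing it with (?!\w) covers both cases.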
