I'm new to regular expressions, but I need to match a pattern in about 2 million strings.
The original strings come in three forms, shown below:
EC-2A-07<EC-1D-10>
EC-2-07
T1-ZJF-4
I want to get the three substrings separated by -, which is to say I want to get EC, 2A, 07 respectively. For the first string, I only want to split the part before <.
I have tried .+[\d]\W, but it does not recognize EC-2-07. I then used .split('-') to split the string and indexed into the returned list to get what I want, but that is inefficient.
Can you suggest an efficient regular expression that meets my requirements? Thanks a lot!
You need to use
^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})
See the regex demo
Details:
^ - start of string
([A-Z0-9]{2}) - Group 1 capturing 2 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,3}) - Group 2 capturing 1 to 3 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,2}) - Group 3 capturing 1 to 2 uppercase ASCII letters or digits.
You may adjust the values in the {min,max} quantifiers as required.
Sample Python demo:
import re
regex = r"^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})"
test_str = "EC-2A-07<EC-1D-10>\nEC-2-07\nT1-ZJF-4"
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
# or line by line
lines = test_str.split('\n')
rx = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")
for line in lines:
    m = rx.match(line)  # match() only matches at the start of the string
    if m:
        print('{0} :: {1} :: {2}'.format(m.group(1), m.group(2), m.group(3)))
You can try this:
^(\w+)-(\w+)-(\w+)(?=\W).*$
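As a quick check of this pattern (a sketch only; the extract helper and the relaxed variant are my own assumptions, not part of the answer above): the lookahead (?=\W) requires a non-word character right after the third part, so of the three sample strings only the one followed by < seems to be matched. Dropping the lookahead and the trailing .*$ appears to cover all three forms.

```python
import re

samples = ["EC-2A-07<EC-1D-10>", "EC-2-07", "T1-ZJF-4"]

# Pattern exactly as posted in the answer above.
pattern_as_posted = re.compile(r"^(\w+)-(\w+)-(\w+)(?=\W).*$")
# Assumed variant: same three groups, without the lookahead and the tail.
pattern_relaxed = re.compile(r"^(\w+)-(\w+)-(\w+)")


def extract(rx, strings):
    """Return the captured groups for each string, or None when there is no match."""
    results = []
    for s in strings:
        m = rx.match(s)
        results.append(m.groups() if m else None)
    return results


posted = extract(pattern_as_posted, samples)
relaxed = extract(pattern_relaxed, samples)
print(posted)
print(relaxed)
```

Note that \w also matches underscores and lowercase letters, so the character classes from the first answer are stricter if the input may contain anything else.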
I've got a list of strings.
input=['XX=BB|3|3|1|1|PLP|KLWE|9999|9999', 'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999', '999|999|999|9999|999', ....]
Strings like '999|999|999|9999|999' should remain unchanged.
I need to replace 9999|9999 with 12|21.
I wrote (?<=BB\|\d\|\d\|\d\|\d\|\S{3}\|\S{4}\|)9{2,9}\|9{2,9} to match 9999|9999. However, there can be 4 to 6 \|\d groups in the middle. How can I match the \|\d pattern a variable number of times?
Desired result:
['XX=BB|3|3|1|1|PLP|KLWE|12|21', 'XX=BB|3|3|1|1|2|PLP|KPOK|12|21', '999|999|999|9999|999'...]
thanks
You can use
re.sub(r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)', r'\g<1>12|21', text)
See the regex demo.
Details:
(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|) - Capturing group 1:
BB - a BB string
(?:\|\d){4,6} - four, five or six repetitions of a | char followed by a single digit
\| - a | char
[^\s|]{3} - three chars other than whitespace and a pipe
\|[^\s|]{4}\| - a |, four chars other than whitespace and a pipe, and then a pipe char
9{2,9}\|9{2,9} - two to nine 9 chars, | and again two to nine 9 chars...
(?!\d) - not followed with another digit (note you may remove this if you do not need to check for the digit boundary here. You may also use (?![^|]) instead if you need to check if there is a | char or end of string immediately on the right).
The \g<1>12|21 replacement includes an unambiguous backreference to Group 1 (\g<1>) and a 12|21 substring appended to it.
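To make the difference between the two right-hand boundaries concrete, here is a small sketch. The two input strings are hypothetical (they are not from the question), and the variable names are mine; the patterns are the one from this answer and the (?![^|]) variant it mentions.

```python
import re

prefix = r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}'
digit_boundary = re.compile(prefix + r'(?!\d)')      # only forbids a following digit
field_boundary = re.compile(prefix + r'(?![^|])')    # requires | or end of string
repl = r'\g<1>12|21'

# Hypothetical inputs: the 9s are followed by a letter, or by another field.
trailing_letter = 'XX=BB|3|3|1|1|PLP|KLWE|9999|9999A'
trailing_field = 'XX=BB|3|3|1|1|PLP|KLWE|9999|9999|X'

loose = digit_boundary.sub(repl, trailing_letter)    # a letter is not a digit, so this is replaced
strict = field_boundary.sub(repl, trailing_letter)   # the letter blocks the match
next_field = field_boundary.sub(repl, trailing_field)
print(loose, strict, next_field, sep='\n')
```

Which of the two is right depends on whether the 9s are expected to fill the whole field.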
See the Python demo:
import re
texts=['XX=BB|3|3|1|1|PLP|KLWE|9999|9999', 'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999', '999|999|999|9999|999']
pattern = r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)'
repl = r'\g<1>12|21'
for text in texts:
    print(re.sub(pattern, repl, text))
Output:
XX=BB|3|3|1|1|PLP|KLWE|12|21
XX=BB|3|3|1|1|2|PLP|KPOK|12|21
999|999|999|9999|999
I would just use re.sub here and search for the pattern \b9{2,9}\|9{2,9}\b:
import re
inp = ["XX=BB|3|3|1|1|PLP|KLWE|9999|9999", "XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999"]
output = [re.sub(r'\b9{2,9}\|9{2,9}\b', '12|21', i) for i in inp]
print(output)
# ['XX=BB|3|3|1|1|PLP|KLWE|12|21', 'XX=BB|3|3|1|1|2|PLP|KPOK|12|21']
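One caveat that may matter here: the question says strings like '999|999|999|9999|999' must stay unchanged, and the short pattern has no way of telling them apart. A small sketch comparing it with the anchored pattern from the earlier answer on all three inputs from the question (the variable names are mine):

```python
import re

texts = [
    'XX=BB|3|3|1|1|PLP|KLWE|9999|9999',
    'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999',
    '999|999|999|9999|999',
]

# Short pattern: any two adjacent runs of 9s separated by a pipe.
simple = [re.sub(r'\b9{2,9}\|9{2,9}\b', '12|21', t) for t in texts]

# Anchored pattern: the runs of 9s must follow the BB|d|...|XXX|XXXX| prefix.
anchored_rx = re.compile(r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)')
anchored = [anchored_rx.sub(r'\g<1>12|21', t) for t in texts]

print(simple[2])    # the all-9s string is rewritten by the short pattern
print(anchored[2])  # and left alone by the anchored one
```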
I have the following list of expressions in python
LIST1=["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679" ]
In the strings above, the sequences are alphanumeric, but in some of them the parts are separated by spaces instead of underscores. I expect the following output:
"AR BR_18_0138249"
"AR R_16_01382649"
"BR 16 0138264"
"R 16 01382679"
I have tried the following code
import regex as re
pattern = r"(\bB?R_\w+)(?!.*\1)|(\bB?R \w+)(?!.*\1)|(\bR?^sd \w+)(?!.*\1)"
for i in LIST1:
    rest = re.search(pattern, i)
    if rest:
        print(rest.group(1))
I have obtained the following result
BR_18_0138249
R_16_01382649
None
None
I am unable to get the sequences with the spaces. I request someone to guide me in this regard
You can use
\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)
See the regex demo
Details
\b - a word boundary
(B?R(?=([\s_]))(?:\2\d+)+) - Group 1: an optional B, then R, then one or more sequences of a whitespace or underscore (the same separator char each time, since the lookahead captures it into Group 2) followed by one or more digits (if you need to support letters here, replace \d+ with [^\W_]+)
\b - a word boundary
(?!.*\b\1\b) - a negative lookahead that fails the match if there are
.* - any zero or more chars other than line break chars, as many as possible
\b\1\b - the same value as in Group 1 matched as a whole word (not enclosed with letters, digits or underscores).
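Two consequences of these details can be checked with a short sketch. Both input strings below are hypothetical, made up for illustration: one mixes the two separators, the other repeats the same value.

```python
import re

pattern = re.compile(r"\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)")

# Mixed separators: \2 holds the first separator ("_"), so the match
# stops before the space-separated chunk.
mixed = pattern.search("BR_18 0138249")
print(mixed.group(1))

# Repeated value: the negative lookahead rejects the first occurrence
# because the same value appears again later, so the last one is returned.
repeated = pattern.search("BR_18 and BR_18")
print(repeated.group(1), repeated.start())
```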
See a Python re demo (you do not need the PyPi regex module here):
import re
LIST1=["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679" ]
pattern = r"\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)"
for i in LIST1:
    rest = re.search(pattern, i)
    if rest:
        print(rest.group(1))
This does the work:
[A-Z]{1,2}\s([A-Z]{1,2}+(?:_[0-9]+)*|[0-9]+(?:\s[0-9]+)*)
This regex gives below output:
AR BR_18_0138249
AR R_16_01382649
BR 16 0138264
R 16 01382679
See demo here
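A note on running this in Python: the + right after {1,2} makes that quantifier possessive, and as far as I know the built-in re module only accepts possessive quantifiers from Python 3.11 on (older versions raise an error when compiling the pattern). Below is a sketch with the possessive + dropped, which I believe gives the same matches on these four strings; the variable names are mine.

```python
import re

LIST1 = ["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679"]

# Same pattern as above, with the plain {1,2} quantifier instead of {1,2}+.
pattern = re.compile(r"[A-Z]{1,2}\s([A-Z]{1,2}(?:_[0-9]+)*|[0-9]+(?:\s[0-9]+)*)")

matches = []
for item in LIST1:
    m = pattern.search(item)
    if m:
        matches.append(m.group(0))  # whole match, including the leading letters

print(matches)
```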
I have the following string; the leading letters can differ, and there may be two, three or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date. I tried:
filename_without_ending.split(".")[0][-6:]
Sometimes it looks like this:
PZA191030_392001_USB
This solution is not valid since the rest of the name may also differ from time to time. The only REAL pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
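The question also asks for a valid date. Here is a minimal sketch of that step, assuming the six digits are in YYMMDD order (the question does not say so) and dropping the trailing \. so that names like PZA191030_392001_USB are covered too. The extract_date helper is a name I made up.

```python
import re
from datetime import datetime

# Assumption: 2 to 4 uppercase letters, then six digits in YYMMDD order.
PREFIX_DATE = re.compile(r"^[A-Z]{2,4}(\d{6})")


def extract_date(name):
    """Return a datetime.date for the six digits after the prefix, or None."""
    m = PREFIX_DATE.match(name)
    if m is None:
        return None
    try:
        return datetime.strptime(m.group(1), "%y%m%d").date()
    except ValueError:
        # Six digits that do not form a real calendar date.
        return None


print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```

If the digits are in a different order (for example DDMMYY), only the format string passed to strptime needs to change.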
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2::]
This slices the part before the first dot from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the leading letters can differ, we have to ignore the letters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string; it returns the part of the string that matches the pattern.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in the given string, while search() finds only the first occurrence.
import re
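s = "PR191030.213101.ABD"  # renamed from str to avoid shadowing the built-in
print(re.findall(r"\d+", s)[0])      # first run of digits via findall
print(re.search(r"\d+", s).group())  # same result via search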
I have a string. If the alphabetical part of a word is more than 3 letters long, I want to store that part in a list. For the example below, I need to store "hour" and "lalal" in a list.
I wrote a regex pattern for alpha-digit and digit alpha sequences like below.
import re
regex = ["([a-zA-Z])-([0-9])*", "([0-9])*-([a-zA-Z])"]
tring = 'f-16 is 1-hour, lalal-54'
d = []
for r in regex:
    m = re.search(r, tring)
    d.append(m.group(0))
print(d)
But this obviously gives me all the alphanumeric patterns which are being stored too. So, I thought I could extend this to count the letters in each pattern and store it differently too. Is that possible?
Edit: Another example would be
tring = 'I will be there in 1-hour'
and the output for this should be ['hour']
So you want to capture alphabetic text only if it is either followed by a hyphen and a number or preceded by a number and a hyphen. You can use this regex, which uses alternation to capture both cases,
([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})
Explanation:
([a-zA-Z]{4,}) - Captures alphabetic text of length four or more and stores it in group 1
-\d+ - Ensures it is followed by a hyphen and one or more digits
| - Alternation as there are two cases
\d+- - Matches one or more digits and a hyphen
([a-zA-Z]{4,}) - Captures alphabetic text of length four or more and stores it in group 2
Check this python code,
import re
s = 'f-16 is 1-hour, lalal-54 I will be there in 1-hours'
d = []
for m in re.finditer(r'([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})', s):
    if m.group(1):
        d.append(m.group(1))
    elif m.group(2):
        d.append(m.group(2))
print(d)
# A simpler variant: every run of 4+ letters, ignoring the digit and hyphen around it
s = 'f-16 is 1-hour, lalal-54'
arr = re.findall(r'[a-zA-Z]{4,}', s)
print(arr)
Prints,
['hour', 'lalal', 'hours']
['hour', 'lalal']
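The if/elif over the two groups can also be folded into one expression. This is a sketch rather than part of the answer above: re.findall returns one tuple per match with an empty string for the group that did not take part, so `or` picks the non-empty one. The long_words helper is a name I made up.

```python
import re

pattern = re.compile(r'([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})')


def long_words(text):
    """Collect the 4+ letter words that sit next to a hyphen and a number."""
    # leading: letters before "-digits"; trailing: letters after "digits-"
    return [leading or trailing for leading, trailing in pattern.findall(text)]


print(long_words('f-16 is 1-hour, lalal-54'))
print(long_words('I will be there in 1-hour'))
```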
I have some lines like below with numbers and strings. Some have only numbers while some have some strings as well before them:
'abc'    (17245...64590)
'cde'    (12244...67730)
'dsa'    complement (12345...67890)
I would like to handle both formats, with and without a string before the numbers. The output for the first two lines should contain only the numbers, while the output for the third line should also contain the string before the numbers.
I am using this command to achieve this.
result = re.findall("\bcomplement\b|\d+", line)
Any idea how to do this?
Expected output would be like this:
17245, 64590
12244, 67730
complement, 12345, 67890
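One detail in the attempt above may be worth checking before reaching for a different pattern: in a regular (non-raw) Python string literal, "\b" is the backspace character rather than a regex word boundary, so the complement alternative can never match. A minimal sketch comparing the two spellings on the sample lines (the variable names are mine, and \d is written as \\d in the non-raw string only to avoid an invalid-escape warning):

```python
import re

lines = [
    "'abc' (17245...64590)",
    "'cde' (12244...67730)",
    "'dsa' complement (12345...67890)",
]

# Non-raw string: "\b" is a backspace character, so "complement" is never found.
non_raw = [re.findall("\bcomplement\b|\\d+", line) for line in lines]

# Raw string: \b is passed to the regex engine as a word boundary.
raw = [re.findall(r"\bcomplement\b|\d+", line) for line in lines]

for row in raw:
    print(", ".join(row))
```

This only works as long as the quoted name in the first column contains no digits, which is an assumption based on the three sample lines.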
If the number of digit chunks inside the parentheses is always 2 and they are separated by one or more dots, use
re.findall(r'\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)', s)
See the regex demo. And a sample Python demo:
import re
s = ''''abc'    (17245...64590)
'cde'    (12244...67730)
'dsa'    complement (12345...67890)'''
rx = r"\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)"
for x in re.findall(rx, s):
    print(", ".join([y for y in x if y]))
Details
\s{2,} - 2 or more whitespaces
(?:(\w+)\s*)? - an optional sequence of:
(\w+) - Group 1: one or more word chars
\s* - 0+ whitespaces
\( - a (
(\d+) - Group 2: one or more digits
\.+ - 1 or more dots
(\d+) - Group 3: one or more digits
\) - a ) char.
If the number of digit chunks inside the parentheses can vary you may use
import re
s = ''''abc'    (17245...64590)
'cde'    (12244...67730)
'dsa'    complement (12345...67890)'''
for m in re.finditer(r'\s{2,}(?:(\w+)\s*)?\(([\d.]+)\)', s):
    res = []
    if m.group(1):
        res.append(m.group(1))
    res.extend(re.findall(r'\d+', m.group(2)))
    print(", ".join(res))
Both Python snippets output:
17245, 64590
12244, 67730
complement, 12345, 67890
See the online Python demo. Note it can match any number of digit chunks inside parentheses and it assumes that there are at least 2 whitespace chars between Column 1 and Column 2.
See the regex demo, too. The difference with the first one is that there is no third group, the second and third groups are replaced with one second group ([\d.]+) that captures 1 or more dots or digits (the digits are later extracted with re.findall(r'\d+', m.group(2))).
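To see the second approach handle more than two digit chunks, here is a small sketch with a hypothetical input (the 'xyz' row with three chunks is made up, as is the parse_rows helper):

```python
import re

rx = re.compile(r'\s{2,}(?:(\w+)\s*)?\(([\d.]+)\)')


def parse_rows(text):
    """Return one list per matched row: optional word first, then every digit chunk."""
    rows = []
    for m in rx.finditer(text):
        row = [m.group(1)] if m.group(1) else []
        row.extend(re.findall(r'\d+', m.group(2)))
        rows.append(row)
    return rows


# Hypothetical sample: three digit chunks on the first row, two on the second.
sample = "'xyz'    complement (10..20..30)\n'abc'    (1...2)"
rows = parse_rows(sample)
for row in rows:
    print(", ".join(row))
```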