python regex unexpected match groups - python

I am trying to find all occurrences of either "_"+digit or "^"+digit, using the regex ((_\^)[1-9])
The groups I'd expect back eg for "X_2ZZZY^5" would be [('_2'), ('^5')] but instead I am getting [('_2', '_'), ('^5', '^')]
Is my regex incorrect? Or is my expectation of what gets returned incorrect?
Many thanks
** my original re used (_|\^) this was incorrect, and should have been (_\^) -- question has been amended accordingly

Your regex has 2 capture groups, so findall returns 2 groups per match. The pattern should also match at least 1 digit after the special character.
try this:
([_\^][1-9]+)

Demand at least 1 digit (1-9) following the special characters _ or ^, placed inside a single capture group:
import re
text = "X_2ZZZY^5"
pattern = r"([_\^][1-9]{1,})"
regex = re.compile(pattern)
res = re.findall(regex, text)
print(res)
Returning:
['_2', '^5']
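To see where the extra tuple element comes from, here is a small sketch comparing three variants of the pattern on the question's input. It assumes the alternation form (_|\^), which is the form the reported output corresponds to; the variable names are only illustrative.

```python
import re

text = "X_2ZZZY^5"

# Two capturing groups: findall returns one tuple per match,
# with one element per group.
nested = re.findall(r"((_|\^)[1-9])", text)

# Inner group made non-capturing with (?:...): only the outer group is reported.
non_capturing = re.findall(r"((?:_|\^)[1-9])", text)

# No groups at all: findall returns the whole match as a string.
no_groups = re.findall(r"[_^][1-9]", text)

print(nested)
print(non_capturing)
print(no_groups)
```

The rule of thumb is that findall returns whole matches when the pattern has no groups, plain strings when it has exactly one group, and tuples when it has more than one.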

Related

finding an exact match using RegEx in Python

I'm searching for exact course codes in a text. Codes look like this
MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*
So 3 or 4 uppercase letters followed by 4 digits.
I only want ones that do not end with "*" symbol.
I have tried
course_code = re.compile('[A-Z]{4}[0-9]{4}|[A-Z]{3}[0-9]{4}')
which is probably one of the worse ways to do it but kinda works as I can get all the courses listed above. The issue is I don't want those 3 course codes ending with a "*" (failed courses have a * next to their codes) to be included in the list.
I tried adding \w or $ to the end of the expression. Whichever I add, the code returns an empty list.
If I read your requirements correctly, you want this pattern:
^[A-Z]{3,4}[0-9]{4}$
This assumes that you would be searching your entire text stored in a Python string using regex in multiline mode, q.v. this demo:
import re
inp = """MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*"""
matches = re.findall(r'^[A-Z]{3,4}[0-9]{4}$', inp, flags=re.M)
print(matches) # ['MAT1051']
import re
# Add a "$" at the end of the re.
# It requires the match to end after the 4 digits.
course_code = re.compile('[A-Z]{4}[0-9]{4}$|[A-Z]{3}[0-9]{4}$')
# No match here
m = re.match(course_code, "MAT1051*")
print(m)
# This matches
m = re.match(course_code, "MAT1051")
print(m)
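If the codes are embedded in a longer text rather than sitting one per line, a negative lookahead is another way to skip the starred ones. This is only a sketch: the sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text; a trailing "*" marks a failed course.
text = "Passed: MAT1051, ENG1003. Failed: CMP1401*, PHY1001*."

# \b keeps the match on whole codes; (?!\*) rejects a code followed by "*".
course_code = re.compile(r"\b[A-Z]{3,4}[0-9]{4}\b(?!\*)")

passed = course_code.findall(text)
print(passed)
```

Unlike the $-anchored patterns, this one does not depend on multiline mode or on each code ending a line.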

Python replace between two chars (no split function)

I currently investigate a problem that I want to replace something in a string.
For example. I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers like '123.49, 19.30'. I can't use the split function, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working properly. Can someone help me?
Thanks in advance
You could do something like this:
import re
reg = r'(\d+\.\d+), (\d+\.\d+).*'
# your_text is the input string to search
if re.search(reg, your_text):
    match = re.search(reg, your_text)
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning so that only the first two numbers are ever taken.
import re
string = '123.49, 19.30, 02\n'
# Raw string, with the dot escaped so it matches only a literal "."
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you have to use regex.search instead of re.search.
But in this case you can just use re.
If you use , (.*) you would capture all after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question, '123.49, 19.30', separated by a comma, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
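Since the data reportedly mixes lines with and without the third number, the idea can be sketched as a small helper that is applied line by line. The helper name and the sample lines are assumptions for illustration, not part of the original question.

```python
import re

# Two decimal numbers separated by a comma; match() anchors at the line start.
FIRST_TWO = re.compile(r"\s*(\d+\.\d+),\s*(\d+\.\d+)")

def first_two_numbers(line):
    """Return the first two numbers as 'a, b', or None if the line does not fit."""
    match = FIRST_TWO.match(line)
    if match is None:
        return None
    return ", ".join(match.groups())

# Hypothetical sample lines, with and without a trailing third value.
lines = ["123.49, 19.30, 02\n", "7.5, 0.25\n", "no numbers here\n"]
print([first_two_numbers(line) for line in lines])
```

Returning None for lines that do not fit keeps the caller in charge of deciding whether to skip or report them.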

Extract date from inside a string with Python

I have the following string. The leading letters can differ, and there can be two, three or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
I tried this:
filename_without_ending.split(".")[0][-6:]
but sometimes the string looks like this:
PZA191030_392001_USB
so this solution is not valid, since the format might differ from time to time. The only real pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2::]
This slices the first segment from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the leading letters can differ, we have to ignore the letters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string. It returns the part of the string that matches the pattern.
'\d' matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
# Named s rather than str, to avoid shadowing the built-in type
s = "PR191030.213101.ABD"
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())
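The question also asks to convert the digits to a valid date, which the snippets above stop short of. A minimal sketch, assuming the six digits are in YYMMDD order (the question does not state the order) and using a hypothetical helper name:

```python
import re
from datetime import datetime

def extract_date(filename):
    """Return the date encoded in the six digits after the letter prefix."""
    # 2 to 4 uppercase letters, then six digits; the separator that
    # follows ("." or "_") is deliberately not part of the pattern.
    match = re.match(r"[A-Z]{2,4}(\d{6})", filename)
    if match is None:
        return None
    # Assumption: digits are YYMMDD. strptime raises ValueError
    # for impossible dates such as month 13.
    return datetime.strptime(match.group(1), "%y%m%d").date()

print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```

Leaving the separator out of the pattern lets the same helper handle both filename layouts shown in the question.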

Get last part after number regex python

There are always 2 digits in the middle. I want to extract everything before the 3 (so Salvatore) and everything after the 2 (so Abdulla).
For example I have the following:
txt = "Salvatore32Abdulla"
first = re.findall("^\D+", txt)
last = re.search(,txt)
Expected result:
first = 'Salvatore'
last = 'Abdulla'
I can get the first part, but after 2 I can't get the last part
You could also do this in a single line by slightly changing the solution suggested by @ctwheels, as follows. I would suggest using re.findall, as it gets the job done in a single call.
import re
txt = "Salvatore32Abdulla"
Option-1
Single line extraction of the non-numeric parts.
first, last = re.findall(r"\D+", txt)
print((first, last))
('Salvatore', 'Abdulla')
Option-2
If you would (for some reason) also want to keep track of the number in between:
# With several groups, findall returns a list of tuples, so take the first tuple
first, num, last = re.findall(r"(\D+)(\d{2})(\D+)", txt)[0]
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Option-3
As an extension of Option-2 and considering the text with a form 'Salvatore####...###Abdulla', where ####...### denotes a continuous block of digits separating the non-numeric parts and you may or may not have any idea of how many digits could be in-between, you could use the following:
# Again, unpack the first (and only) tuple returned by findall
first, num, last = re.findall(r"(\D+)(\d*)(\D+)", txt)[0]
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Why am I not getting the expected results?
You currently have one issue with your regex and one with your code.
Your regex contains ^, which anchors it to the start of the string. This will only allow you to match Salvatore. You're using findall (which is the appropriate choice if you change the regex to simply \D+), but right now it's only getting one result.
The second re.search call is not needed as you can capture first and last with the findall given an appropriate pattern (see below).
How do I fix it?
import re
txt = "Salvatore32Abdulla"
x = re.findall(r"\D+", txt)
print(x)
Result:
['Salvatore', 'Abdulla']
You could use a regex like this:
txt = "Salvatore32Abdulla"
regex = r"(\D+)\d\d(\D+)"
match = re.match(regex, txt)
first = match.group(1)
last = match.group(2)
Part after last digit:
match = re.search(r'\D+$',txt)
if match:
    print(match.group())
Results: Abdulla
EXPLANATION
--------------------------------------------------------------------------------
  \D+    non-digits (all but 0-9) (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
  $      before an optional \n, and the end of the string
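For input that might not follow the expected shape, the approaches above can be combined into one checked helper. This is a sketch under the assumption that the text is exactly letters, two digits, then letters; the helper and group names are illustrative.

```python
import re

# Assumption: non-digits, then exactly two digits, then non-digits, nothing else.
NAME_PATTERN = re.compile(r"(?P<first>\D+)\d{2}(?P<last>\D+)")

def split_name(txt):
    """Return (first, last), or None if txt does not have the expected shape."""
    match = NAME_PATTERN.fullmatch(txt)
    if match is None:
        return None
    return match.group("first"), match.group("last")

print(split_name("Salvatore32Abdulla"))
print(split_name("Salvatore"))
```

Using fullmatch means a string with no digits, or with more than two, yields None instead of a partial result.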

Python RegEx - translate string

I have the following string:
ref:_00D30jPy._50038vQl5C:ref
And would like to produce the following output string:
5003800000vQl5C
The required regex actions are:
Remove all leading characters until the digit '5'.
Insert 5 zeros after the fifth character of what remains.
Remove the closing ':ref'.
I initially made the following regex to match the whole string:
(ref:(\S+):ref)
How can I alter the Python RegEx to achieve the above?
Use re.sub:
import re
s = 'ref:_00D30jPy._50038vQl5C:ref'
result = re.sub(r'^[^5]*(5.{4})(.*?):ref$', r'\g<1>00000\g<2>', s, count=0, flags=re.MULTILINE)
print(result)
Output:
5003800000vQl5C
Explanation:
^[^5]*: match any characters other than 5 from the beginning
(5.{4}): capture the 5 and the next four characters into group 1
(.*?):ref$: capture the remainder into group 2, excluding the :ref at the end
\g<1>00000\g<2>: the replacement for the whole line, where \g<1> and \g<2> are substituted by groups 1 and 2 respectively.
Regex is not required for this task. It can be achieved more simply using string slicing.
If the input strings maintain the same format and lengths you can simply do this:
s = 'ref:_00D30jPy._50038vQl5C:ref'
new = '{}00000{}'.format(s[15:20], s[20:-4])
If there is some variability then search for the first '5' in the string and slice from there:
start = s.index('5')
new = '{}00000{}'.format(s[start:start+5], s[start+5:-4])
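Whichever approach is used, malformed input is worth handling explicitly: re.sub returns the string unchanged when nothing matches, and s.index('5') raises ValueError. A sketch of a checked variant, reusing the pattern from the re.sub answer; the helper name and the padding parameter are assumptions for illustration.

```python
import re

# Same pattern as in the re.sub answer: skip everything up to the first "5",
# capture five characters, then the rest up to the trailing ":ref".
REF_PATTERN = re.compile(r"^[^5]*(5.{4})(.*?):ref$")

def expand_ref(ref, padding="00000"):
    """Insert the padding after the first five characters of the embedded id."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"unexpected reference format: {ref!r}")
    head, tail = match.groups()
    return f"{head}{padding}{tail}"

print(expand_ref("ref:_00D30jPy._50038vQl5C:ref"))
```

Raising on a non-matching string makes bad input visible instead of silently passing it through.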
