How to create non-greedy regular expression from right? - python

I have a file named 'ab9c_xy8z_12a3.pdf' . I want to capture part after the last underscore and before '.pdf'.
Writing regular expression like :
s = 'ab9c_xy8z_12a3.pdf'
m = re.search(r'_.*?\.pdf',s)
m.group(0)
returns:
'_xy8z_12a3.pdf'
In this example, I would like to capture only '12a3' part. Thank you for your help.

The _.*?\.pdf regex matches the first underscore with _, then matches any 0+ chars other than a newline, as few as possible, but up to the leftmost occurrence of .pdf, which turns out to be at the end of the string. So, . matched all underscores on its way to .pdf, just because of the way a regex engine parses the string (from left to right) and due to . pattern.
You may fix the pattern by using a negated character class [^_] instead of . that will "subtract" underscores from . pattern.
([^_]+)\.pdf
and grab Group 1 value. See the regex demo.
Python demo:
import re
rx = r"([^_]+)\.pdf"
s = "ab9c_xy8z_12a3.pdf"
m = re.search(rx, s)
if m:
print(m.group(1)) # => 12a3

Use re.split instead:
>>> re.split('[_.]', 'ab9c_xy8z_12a3.pdf')[-2]
'12a3'

Related

Pandas regex to remove digits before consecutive dots

I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
Removing all the numbers that are before the dot and after the word.
Ignoring the first part of the string i.e. "Node57Name123".
Should not remove the digits if they are inside words.
Tried re.sub(r"\d+","",string) but it removed every other digit.
The output should look like this "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me to the right direction.
You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed with a dot or end of string (=(?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
Just to give you a non-regex alternative' using rstrip(). We can feed this function a bunch of characters to remove from the right of the string e.g.: rstrip('0123456789'). Alternatively we can also use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape

Regex to find N characters between underscore and period

I have a filename having numerals like test_20200331_2020041612345678.csv.
So I just want to read only first 8 characters from the number between last underscore and .csv using a regex.
For e.g: From the file name test_20200331_2020041612345678.csv --> i want to read only 20200416 using regex.
Regex tried: (?<=_)(\d+)(?=\.)
But it is returning the full number between underscore and period i.e 2020041612345678
Also, when tried quantifier like (?<=_)(\d{8})(?=\.) its not matching with any string
The (?<=_)(\d{8})(?=\.) does not work because the (?=\.) positive lookahead requires the presence of a . char immediately to the right of the current location, i.e. right after the eigth digit, but there are more digits in between.
You may add \d* before \. to match any amount of digits after the required 8 digits, use
(?<=_)\d{8}(?=\d*\.)
Or, with a capturing group, you do not even need lookarounds (just make sure you access Group 1 when a match is obtained):
_(\d{8})\d*\.
See the regex demo
Python demo:
import re
s = "test_20200331_2020041612345678.csv"
m = re.search(r"(?<=_)\d{8}(?=\d*\.)", s)
# m = re.search(r"_(\d{8})\d*\.", s) # capturing group approach
if m:
print(m.group()) # => 20200416
# print(m.group(1)) # capturing group approach

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search since re.match tries to match from the beging of the string. To match until a space or period is encountered.
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will retrun match this.. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash aleady pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occassions to use it, but if you just want to extract the text between test: and ., then I don't think is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).

Using Python Regex to find a phrase between 2 tags

I've got a string that I want to use regex to find the characters encapsulated between two known patterns, "Cp_6%3A" then some characters then "&" and potentially more characters, or no & and just the end of string.
My code looks like this:
def extract_id_from_ref(ref):
id = re.search("Cp\_6\%3A(.*?)(\& | $)", ref)
print(id)
But this isn't producing anything, Any ideas?
Thanks in advance
Note that (\& | $) matches either the & char and a space after it, or a space and end of string (the spaces are meaningful here!).
Use a negated character class [^&]* (zero or more chars other than &) to simplify the regex (no need for an alternation group or lazy dot matching pattern) and then access .group(1):
def extract_id_from_ref(ref):
m = re.search(r"Cp_6%3A([^&]*)", ref)
if m:
print(m.group(1))
Note that neither _ nor % are special regex metacharacters, and do not have to be escaped.
See the regex demo.
The problem is that spaces in a regex pattern, are also taken into account. Furthermore in order to add a backspace to the string, you either have to add \\ (two backslashes) or use a raw string:
So you should write:
r"Cp_6\%3A(.*?)(?:\&|$)"
If you then match with:
def extract_id_from_ref(ref):
id = re.search(r"Cp_6\%3A(.*?)(?:\&|$)", ref)
print(id)
It should work.

Regular expressions: replace comma in string, Python

Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), is consuming a word char (= an alphanumeric or an underscore, [\w]) before a ,, then matches the comma, and then matches (and again, consumes) 1 character - a whitespace, a word character, or a literal | (as inside the character class, the pipe character is considered a literal pipe symbol, not alternation operator).
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
See the regex demo
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
See another regex demo
\w matches a-z,A-Z and 0-9, so your regex will replace all commas. You could try the following regex, and replace with \1\2.
([a-zA-Z]),(\s|[a-zA-Z])
Here is the DEMO.

Categories