Leave behind a substring when extracting from a regex match - python

I have the following regex:
^https?://www.example.com(:80)?/([^/]+)/$
It is intended to match URLs like:
http://www.example.com:80/about-me/
https://www.example.com/about-me/
What I want to do when given a URL:
1. Ensure that the URL matches the regex.
2. If the URL matches the regex, extract the whole URL without :80.
I know how to do (1), but I need help with (2). For example, for http://www.example.com:80/about-me/, I want to match it with the regex first, then extract http://www.example.com/about-me/ out of it. I want to discard :80 during extraction. How can I do this?
I am using the re module from the standard library in Python 3.6.

You can extract just the relevant groups, as in the following:
import re
s = "http://www.example.com:80/about-me/"
exp = r'^(https?://www\.example\.com)(:80)?(/[^/]+/)$'
m = re.match(exp, s)
groups = m.groups()
print(groups[0] + groups[2])
# ==> http://www.example.com/about-me/
Note that you should escape the URL's dots using \..
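As a rough sketch, the same idea can be wrapped in a small helper that returns None for a URL that does not match, instead of failing on m.groups(). The names URL_RE and strip_default_port are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as above: group 1 is scheme + host, group 2 the optional port,
# group 3 the single path segment.
URL_RE = re.compile(r'^(https?://www\.example\.com)(:80)?(/[^/]+/)$')

def strip_default_port(url):
    """Hypothetical helper: the URL without ':80', or None if it does not match."""
    m = URL_RE.match(url)
    if m is None:
        return None
    return m.group(1) + m.group(3)

print(strip_default_port("http://www.example.com:80/about-me/"))  # http://www.example.com/about-me/
print(strip_default_port("https://www.example.com/about-me/"))    # https://www.example.com/about-me/
print(strip_default_port("http://wwwXexample.com/about-me/"))     # None, because the dots are escaped
```

The last call shows why escaping the dots matters: with an unescaped . the third URL would have been accepted.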

You might use urlparse to remove the port from the URL:
from urllib.parse import urlparse, urlunparse
parsedUrl = urlparse('http://www.example.com:80/about-me/')
if parsedUrl.netloc == "www.example.com:80":
    stripped = parsedUrl._replace(netloc=parsedUrl.netloc.replace(":" + str(parsedUrl.port), ""))
    print(urlunparse(stripped))
Output
http://www.example.com/about-me/
Or use a pattern with 2 capturing groups and use those in the replacement.
If you want to match 1 or more digits instead of only 80, use \d+. Also note that the dots should be escaped as \. so that they match literally.
^(https?://www\.example\.com)(?::80)?(/[^/]+/)$
import re
regex = r"^(https?://www\.example\.com)(?::80)?(/[^/]+/)$"
s = "http://w...content-available-to-author-only...e.com:80/about-me/"
result = re.sub(regex, r"\1\2", s, 1)
print(result)
Output
http://www.example.com/about-me/
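The \d+ variant mentioned above could look roughly like this. It is only a sketch: ANY_PORT_RE and drop_port are illustrative names, and the explicit match check is an addition so that non-matching input is reported as None rather than returned unchanged.

```python
import re

# (?::\d+)? optionally matches a colon followed by one or more digits without
# capturing it, so \1\2 rebuilds the URL without the port.
ANY_PORT_RE = re.compile(r"^(https?://www\.example\.com)(?::\d+)?(/[^/]+/)$")

def drop_port(url):
    """Hypothetical helper: the URL without its port, or None when it does not match."""
    if not ANY_PORT_RE.match(url):
        return None
    return ANY_PORT_RE.sub(r"\1\2", url, count=1)

print(drop_port("http://www.example.com:8080/about-me/"))  # http://www.example.com/about-me/
print(drop_port("https://www.example.com/about-me/"))      # https://www.example.com/about-me/
print(drop_port("http://other.example.org:80/about-me/"))  # None
```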

Related

Python replace between two chars (no split function)

I am currently working on a problem where I want to replace something in a string.
For example, I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers, like '123.49, 19.30'. The split function is not an option, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working. Can someone help me?
Thanks in advance
You could do something like this:
reg=r'(\d+\.\d+), (\d+\.\d+).*'
if re.search(reg, your_text):
    match = re.search(reg, your_text)
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning to make sure you always take only the first two numbers.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you would have to use regex.search instead of re.search.
But in this case you can just use re.
If you use , (.*) you would capture all after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question, '123.49, 19.30', separated by commas, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
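Since the question mentions many lines, some with and some without a third value, here is a small sketch that applies the two-number pattern line by line. The sample lines and the name first_two_numbers are invented for illustration.

```python
import re

# Exactly two decimal numbers separated by a comma and optional whitespace.
FIRST_TWO = re.compile(r"\b\d+\.\d+,\s*\d+\.\d+\b")

def first_two_numbers(line):
    """Hypothetical helper: the first 'x.y, x.y' pair in line, or None."""
    match = FIRST_TWO.search(line)
    return match.group() if match else None

# Made-up sample data: with a third value, without one, and with no numbers.
lines = ["123.49, 19.30, 02\n", "7.5, 8.25\n", "no numbers here\n"]
for line in lines:
    print(first_two_numbers(line))
# 123.49, 19.30
# 7.5, 8.25
# None
```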

Python re.sub not replacing all occurrences of a string

I'm not getting the desired output: re.sub is only replacing the last occurrence when I use a Python regular expression. Please explain what I'm doing wrong.
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
re.sub("http://.*[#]", "", srr)
'image-1CE005XG03'
The desired output is the above string without http://www.google.com/#:
image-1CCCC|image-1VVDD|image-123|image-1CE005XG03
I would use re.findall here, rather than trying to do a replacement to remove the portions you don't want:
src = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
matches = re.findall(r'https?://www\.\S+#([^|\s]+)', src)
output = '|'.join(matches)
print(output) # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Note that if you want to be more specific and match only Google URLs, you may use the following pattern instead:
https?://www\.google\.\S+#([^|\s]+)
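As for why the original call kept only the last item: .* is greedy, so http://.*[#] runs from the first http:// to the last # in the string and removes everything in between. The sketch below compares the greedy pattern with a lazy .*? variant on a shortened sample string. The lazy pattern and its leading \s* are assumptions added here for illustration.

```python
import re

# Shortened version of the string from the question.
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-123"

# Greedy: one match spanning from the first 'http://' to the LAST '#'.
greedy = re.sub(r"http://.*[#]", "", srr)

# Lazy: each match stops at the nearest '#'; \s* also eats the space after '|'.
lazy = re.sub(r"\s*http://.*?[#]", "", srr)

print(greedy)  # image-123
print(lazy)    # image-1CCCC|image-123
```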
>>> "|".join(re.findall(r'#([^|\s]+)', srr))
'image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03'
Here is another solution,
"|".join(i.split("#")[-1] for i in srr.split("|"))
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Using correct regex in re.sub as suggested in comment above:
import re
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
print(re.sub(r"\s*https?://[^#\s]*#", "", srr))
Output:
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
RegEx Details:
\s*: Match 0 or more whitespaces
https?: Match http or https
://: Match ://
[^#\s]*: Match 0 or more of any characters that are not # and whitespace
#: Match a #
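The breakdown above can also be kept next to the pattern itself with re.VERBOSE. This is just a sketch with a shortened input string; note that in verbose mode a # outside a character class starts a comment, so the final literal # has to be escaped.

```python
import re

STRIP_URL_PREFIX = re.compile(r"""
    \s*        # 0 or more whitespace characters
    https?     # http or https
    ://        # a literal ://
    [^#\s]*    # 0 or more characters that are neither a hash nor whitespace
    \#         # a literal hash, escaped because of verbose mode
""", re.VERBOSE)

# Shortened version of the string from the question.
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-123"
result = STRIP_URL_PREFIX.sub("", srr)
print(result)  # image-1CCCC|image-123
```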

regex pipe delimiter with groups

I have a url within a URL that's not encoded. It looks like this
https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
My domain can be mydomain.com or mydomain.io. Also, the /400x400/ part can vary, for example /blahblah/XxY/blahblah, or it can be missing entirely. The image can be jpg, jpeg, or png.
I want to extract the second part of the URL at the end
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
I have regex like this
https://myhost.mydomain.com/[a-zA-Z0-9=]*/.+[\/a-zA-Z0-9]?(/https://[a-zA-Z0-9=-]*.mydomain.(com|io)/images/[a-zA-Z0-9-]*.(png|jpg|jpeg))
This identifies it as 4 groups
However, I want to extract the second URL as a group - so the whole https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
Can you please help me fix my regex? Thanks !
Try using
import re
s = "https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png"
m = re.search(r"https://.+(https.+)$", s)
if m:
    print(m.group(1))
Output:
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
I would suggest this approach:
https?(?!.*https?):\/\/.*\bmydomain\.(?:com|io).*
This regex uses a negative lookahead to ensure that the URL we match is the last one in the input string. Sample script:
import re
inp = "https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png"
url = re.findall(r'https?(?!.*https?):\/\/.*\bmydomain\.(?:com|io).*', inp)[0]
print(url)
This prints:
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
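To see the negative lookahead at work, the sketch below runs the same pattern on a nested URL and on a plain one. The helper name last_url and the shortened sample URLs are made up for illustration, and re.search with a None check is used instead of indexing into re.findall so that a non-matching input does not raise.

```python
import re

# 'https?' may only match where no further 'http' follows later in the string,
# so the match can only start at the last URL.
LAST_URL = re.compile(r'https?(?!.*https?)://.*\bmydomain\.(?:com|io).*')

def last_url(text):
    """Hypothetical helper: the last mydomain URL in text, or None."""
    m = LAST_URL.search(text)
    return m.group(0) if m else None

nested = "https://myhost.mydomain.com/abc=/400x400/https://myhost.mydomain.io/images/pic.jpg"
plain = "https://myhost.mydomain.com/images/pic.png"

print(last_url(nested))        # https://myhost.mydomain.io/images/pic.jpg
print(last_url(plain))         # https://myhost.mydomain.com/images/pic.png
print(last_url("no url here")) # None
```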
As there are 2 links, you could match the first link and capture the second link in group 1.
https?://myhost\.mydomain\.(?:com|io)/\S*?(https?://myhost\.mydomain\.(?:com|io)/\S*\.(?:jpe?g|png))
https?://myhost\.mydomain\.(?:com|io)/ Match the start of the first link
\S*? Match 0+ times a non whitespace char non greedy
( Capture group 1
https?://myhost\.mydomain\.(?:com|io)/ Match the start of the second link
\S* Match 0+ times a non whitespace char
\.(?:jpe?g|png) Match either .jpg or .jpeg or .png
) Close group 1
For example
import re
regex = r"https?://myhost\.mydomain\.(?:com|io)/\S*?(https?://myhost\.mydomain\.(?:com|io)/\S*\.(?:jpe?g|png))"
test_str = ("https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png")
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
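You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered (note the space after the colon inside the lookbehind; without it the match is empty, because the next character is a space):
re.search(r'(?<=test : )[^.\s]*',text)
To match all the characters until a period is encountered:
re.search(r'(?<=test : )[^.]*',text)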
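A quick sketch of the difference between re.match and re.search for this input. The lookbehind here includes the space after the colon, which is an assumption about the exact text being searched.

```python
import re

text = "test : match this."
pattern = r'(?<=test : )[^.]*'

# re.match only tries position 0, where nothing precedes the cursor,
# so the lookbehind cannot succeed there.
at_start = re.match(pattern, text)

# re.search scans forward until the lookbehind is satisfied.
anywhere = re.search(pattern, text)

print(at_start)          # None
print(anywhere.group())  # match this
```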
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. (including the trailing period).
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match(), since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
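The effect of \b and of the r prefix can be checked with a short sketch like the following. The sample strings are made up for illustration.

```python
import re

s = "a test : match this."

# Raw string: \b is a regex word boundary.
raw = re.search(r'\btest\s*:\s*(.*)\.', s)

# Non-raw string: '\b' is a literal BACKSPACE character (the other backslashes
# are doubled so that only \b changes meaning), and no backspace occurs in s.
not_raw = re.search('\btest\\s*:\\s*(.*)\\.', s)

# With \b, 'test' inside 'contest' is not matched as a whole word.
inside_word = re.search(r'\btest\s*:\s*(.*)\.', "contest : nope.")

print(raw.group(1))  # match this
print(not_raw)       # None
print(inside_word)   # None
```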
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
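A minimal sketch of that generalisation, assuming you want the text between a prefix and the first following period. The function name text_after is hypothetical, and the .strip() call is an addition that drops the space after the colon.

```python
def text_after(line, prefix):
    """Hypothetical helper: text between prefix and the next '.', or None."""
    if not line.startswith(prefix):
        return None
    end = line.find('.', len(prefix))  # first period after the prefix
    if end == -1:                      # no period: take the rest of the line
        end = len(line)
    return line[len(prefix):end].strip()

for prefix in ["test:", "other:"]:
    print(text_after("test: match this.", prefix))
# match this
# None
```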

How to create non-greedy regular expression from right?

I have a file named 'ab9c_xy8z_12a3.pdf' . I want to capture part after the last underscore and before '.pdf'.
Writing regular expression like :
s = 'ab9c_xy8z_12a3.pdf'
m = re.search(r'_.*?\.pdf',s)
m.group(0)
returns:
'_xy8z_12a3.pdf'
In this example, I would like to capture only '12a3' part. Thank you for your help.
The _.*?\.pdf regex matches the first underscore with _, then matches any 0+ chars other than a newline, as few as possible, but up to the leftmost occurrence of .pdf, which turns out to be at the end of the string. So, . matched all underscores on its way to .pdf, just because of the way a regex engine parses the string (from left to right) and due to . pattern.
You may fix the pattern by using a negated character class [^_] instead of . that will "subtract" underscores from . pattern.
([^_]+)\.pdf
and grab Group 1 value. See the regex demo.
Python demo:
import re
rx = r"([^_]+)\.pdf"
s = "ab9c_xy8z_12a3.pdf"
m = re.search(rx, s)
if m:
    print(m.group(1)) # => 12a3
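For comparison, here is a short sketch that puts the lazy pattern from the question next to the negated character class, plus a greedy-prefix variant. The third pattern is not from the answer above; it is an extra illustration of anchoring "from the right" by letting a greedy .* consume everything up to the last underscore.

```python
import re

s = "ab9c_xy8z_12a3.pdf"

# Lazy .*? still starts at the FIRST underscore, so it over-matches.
lazy = re.search(r"_.*?\.pdf", s).group(0)

# The negated class cannot cross an underscore, so only the last part fits.
negated = re.search(r"([^_]+)\.pdf", s).group(1)

# Greedy '.*_' backtracks only as far as the LAST underscore.
greedy_prefix = re.search(r".*_(.*)\.pdf", s).group(1)

print(lazy)           # _xy8z_12a3.pdf
print(negated)        # 12a3
print(greedy_prefix)  # 12a3
```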
Use re.split instead:
>>> re.split('[_.]', 'ab9c_xy8z_12a3.pdf')[-2]
'12a3'
