regex pipe delimiter with groups - python

I have a url within a URL that's not encoded. It looks like this
https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
My domain can be mydomain.com or mydomain.io.
Also, the /400x400/ part can vary, for example /blahblah/XxY/blahblah, or it can be missing entirely. The image can be jpg, jpeg, or png.
I want to extract the second part of the URL at the end
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
I have regex like this
https://myhost.mydomain.com/[a-zA-Z0-9=]*/.+[\/a-zA-Z0-9]?(/https://[a-zA-Z0-9=-]*.mydomain.(com|io)/images/[a-zA-Z0-9-]*.(png|jpg|jpeg))
This identifies it as 4 groups
However, I want to extract the second URL as a group - so the whole https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
Can you please help me fix my regex? Thanks !

Try using
import re
s = "https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png"
m = re.search(r"https://.+(https.+)$", s)
if m:
    print(m.group(1))
Output:
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png

I would suggest this approach:
https?(?!.*https?):\/\/.*\bmydomain\.(?:com|io).*
This regex uses a negative lookahead to ensure that the URL we match is the last one in the input string. Sample script:
import re
inp = "https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png"
url = re.findall(r'https?(?!.*https?):\/\/.*\bmydomain\.(?:com|io).*', inp)[0]
print(url)
This prints:
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
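To see what the lookahead is doing on its own, here is a minimal sketch that applies only the `https?(?!.*https?)` part of the pattern. The variable names are illustrative, and it assumes the input is a single line, since `.` does not cross newlines by default.

```python
import re

inp = ("https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/"
       "https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png")

# The scheme only matches where no further "http" or "https" follows it,
# so the first URL's scheme is rejected and the second one is accepted.
last_scheme = re.search(r"https?(?!.*https?)", inp)

# Everything from the last scheme to the end of the string is the inner URL.
inner_url = inp[last_scheme.start():]
print(inner_url)
```

Note that this part alone does not check the domain or the image extension; the full pattern above adds the domain check.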

As there are 2 links, you could match the first link and capture the second link in group 1.
https?://myhost\.mydomain\.(?:com|io)/\S*?(https?://myhost\.mydomain\.(?:com|io)/\S*\.(?:jpe?g|png))
https?://myhost\.mydomain\.(?:com|io)/ Match the start of the first link
\S*? Match 0+ times a non whitespace char non greedy
( Capture group 1
https?://myhost\.mydomain\.(?:com|io)/ Match the start of the second link
\S* Match 0+ times a non whitespace char
\.(?:jpe?g|png) Match either .jpg or .jpeg or .png
) Close group 1
For example
import re
regex = r"https?://myhost\.mydomain\.(?:com|io)/\S*?(https?://myhost\.mydomain\.(?:com|io)/\S*\.(?:jpe?g|png))"
test_str = ("https://myhost.mydomain.com/pnLVyL7HjrxMlxjBQkhcOMr2WUs=/400x400/https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png")
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
https://myhost.mydomain.com/images/98f9a734-52e2-4616-adf7-bf0165bbf738.png
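Since the question says the middle part can vary or be missing, a quick check against a few variants may help. This is only a sketch: the sample URLs below are made up to follow the shapes described in the question.

```python
import re

pattern = re.compile(
    r"https?://myhost\.mydomain\.(?:com|io)/\S*?"
    r"(https?://myhost\.mydomain\.(?:com|io)/\S*\.(?:jpe?g|png))"
)

# Made-up variants: with a size segment, with extra path segments, with none
samples = [
    "https://myhost.mydomain.com/k=/400x400/https://myhost.mydomain.com/images/a-1.png",
    "https://myhost.mydomain.io/k=/blah/10x20/blah/https://myhost.mydomain.io/images/b-2.jpg",
    "https://myhost.mydomain.io/k=/https://myhost.mydomain.io/images/c-3.jpeg",
]

# Group 1 holds the second URL in every case
inner_urls = [pattern.search(s).group(1) for s in samples]
for url in inner_urls:
    print(url)
```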


how to extract the front and back of a designated special token using regex?

How to extract the front and back of a designated special token(in this case, -, not #)?
And if those that are connected by - are more than two, I want to extract those too. (In the example, Bill-Gates-Foundation)
e.g)
from 'Meinda#Bill-Gates-Foundation#drug-delivery' -> ['Bill-Gates-Foundation', 'drug-delivery']
I tried p = re.compile('#(\D+)\*(\D+)')
but that was not what I wanted.
You can exclude the # char from the match and repeat the - part 1 or more times:
#([^\s#-]+(?:-[^\s#-]+)+)
Explanation
# Match literally
( Capture group 1 (returned by re.findall)
[^\s#-]+ Match 1+ non whitespace chars except - and #
(?:-[^\s#-]+)+ Repeat 1+ times matching - and again 1+ non whitespace chars except - and #
) Close group 1
import re
pattern = r"#([^\s#-]+(?:-[^\s#-]+)+)"
s = r"Meinda#Bill-Gates-Foundation#drug-delivery"
print(re.findall(pattern, s))
Output
['Bill-Gates-Foundation', 'drug-delivery']
#ahmet-buğra-buĞa gave an answer with regex.
If you don't have to use regex, an easier way is to just use split.
test_str = "Meinda#Bill-Gates-Foundation#drug-delivery"
test_str.split("#")[1:]
This outputs
['Bill-Gates-Foundation', 'drug-delivery']
You can make it a function like so
def get_list_of_strings_after_first(original_str, token_to_split_on):
    # Split on the given token and drop everything before its first occurrence
    return original_str.split(token_to_split_on)[1:]
get_list_of_strings_after_first("Meinda#Bill-Gates-Foundation#drug-delivery", "#")
This gives the same output
['Bill-Gates-Foundation', 'drug-delivery']
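The regex and split answers agree on the sample, but they can differ when a token has no hyphen. A small sketch of the difference, where the token "solo" is made up for illustration:

```python
import re

# "solo" is a hypothetical token with no hyphen, added for comparison
s = "Meinda#Bill-Gates-Foundation#solo#drug-delivery"

with_regex = re.findall(r"#([^\s#-]+(?:-[^\s#-]+)+)", s)
with_split = s.split("#")[1:]

print(with_regex)  # only tokens that contain at least one -
print(with_split)  # every token after the first #
```

So the regex is the better fit if only hyphen-connected tokens should be returned, while split is enough when every token after the first # is wanted.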

split string based on pattern python

I am trying to delete a pattern off my string and only bring back the word I want to store.
example return
2022_09_21_PTE_Vendor PTE
2022_09_21_SSS_01_Vendor SSS_01
2022_09_21_OOS_market OOS
what I tried
fileName = "2022_09_21_PTE_Vendor"
newFileName = fileName.strip(re.split('[0-9]','_Vendor.xlsx'))
With Python's re module, try its sub function. The following code was written and tested in Python 3 with the shown samples.
import re
fileName = "2022_09_21_PTE_Vendor"
re.sub(r'^\d{4}(?:_\d{2}){2}_(.*?)_.+$', r'\1', fileName)
'PTE'
Explanation of the regex used:
^\d{4} ##From starting of the value matching 4 digits here.
(?: ##opening a non-capturing group here.
_\d{2} ##Matching underscore followed by 2 digits
){2} ##Closing non-capturing group and matching its 2 occurrences.
_ ##Matching only underscore here.
(.*?) ##Creating capturing group here where using lazy match concept to get values before next mentioned character.
_.+$ ##Matching _ till end of the value here.
Use a regular expression replacement, not split.
newFileName = re.sub(r'^\d{4}_\d{2}_\d{2}_(.+)_[^_]+$', r'\1', fileName)
^\d{4}_\d{2}_\d{2}_ matches the date at the beginning. [^_]+$ matches the part after the last _. And (.+) captures everything between them, which is copied to the replacement with \1.
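As a quick check, here is a sketch that runs both regex approaches from this thread over the three sample names in the question. The variable names are illustrative.

```python
import re

names = [
    "2022_09_21_PTE_Vendor",
    "2022_09_21_SSS_01_Vendor",
    "2022_09_21_OOS_market",
]

# Greedy capture: keeps everything up to the last underscore
greedy = r"^\d{4}_\d{2}_\d{2}_(.+)_[^_]+$"
# Lazy capture: stops at the first underscore after the date
lazy = r"^\d{4}(?:_\d{2}){2}_(.*?)_.+$"

greedy_results = [re.sub(greedy, r"\1", n) for n in names]
lazy_results = [re.sub(lazy, r"\1", n) for n in names]

print(greedy_results)
print(lazy_results)
```

The two differ on the second name: the greedy capture keeps SSS_01, which is what the question asks for, while the lazy capture stops at SSS.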
Assuming that the date characters at the beginning are always "YYYY_MM_DD" you could do something like this:
fileName = "2022_09_21_SSS_01_Vendor"
fileName = fileName.lstrip()[11:]  # Removes the date portion
fileName = fileName.rstrip()[:fileName.rfind('_')]  # Finds the last underscore and removes underscore to end
print(fileName)
This should work:
newFileName = fileName[11:].rsplit("_", 1)[0]

ignore first occurrence of letter regex

using the following strings
9989S90K72MF-1
9989S90S-1
9989S75K60MF-1
9989S75S-1
I would like to extract the below from those strings.
9989S90
9989S90
9989S75
9989S75
So far I have:
(^.*?(?=K|-))
Which gives me:
9989S90
9989S90S
9989S75
9989S75S
Here's a link https://regex101.com/r/d1nQj0/1
I've tried a few different regex but can't seem to nail it. Is there a way to ignore the first occurrence of a digit/letter? Which in my case would be S
The following regex matches a string at the beginning of a line that contains a single S up to but not including the first occurrence of S or K
^(.*?S.*?)(?=K|S)
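A minimal sketch of that pattern applied to the sample strings, assuming they arrive as one newline-separated string (hence re.MULTILINE):

```python
import re

data = "9989S90K72MF-1\n9989S90S-1\n9989S75K60MF-1\n9989S75S-1"

# Group 1 runs from the line start, through the first S,
# up to (but not including) the next K or S.
prefixes = re.findall(r"^(.*?S.*?)(?=K|S)", data, re.MULTILINE)
print(prefixes)
```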
For the example data, you could also match 1+ digits, then S followed by 1+ digits.
^\d+S\d+
If there has to be an S, K, or - on the right:
^\d+S\d+(?=[KS-])
Example
import re
regex = r"^\d+S\d+(?=[KS-])"
s = ("9989S90K72MF-1\n"
"9989S90S-1\n"
"9989S75K60MF-1\n"
"9989S75S-1")
print(re.findall(regex, s, re.MULTILINE))
Output
['9989S90', '9989S90', '9989S75', '9989S75']

Leave behind a substring when extracting from a regex match

I have the following regex:
^https?://www.example.com(:80)?/([^/]+)/$
It is intended to match URLs like:
http://www.example.com:80/about-me/
https://www.example.com/about-me/
What I want to do when given a URL:
Ensure that the URL matches the regex.
If the URL matches the regex, extract the whole URL without :80.
I know how to do (1), but I need help with (2). For example, for http://www.example.com:80/about-me/, I want to match it with the regex first, then extract http://www.example.com/about-me/ out of it. I want to discard :80 during extraction. How can I do this?
I am using the re module from the standard library in Python 3.6.
You can extract just the relevant groups, as in the following:
import re
s = "http://www.example.com:80/about-me/"
exp = r'^(https?://www\.example\.com)(:80)?(/[^/]+/)$'
m = re.match(exp, s)
groups = m.groups()
print(groups[0] + groups[2])
# ==> http://www.example.com/about-me/
Note that you should escape the URL's dots using \..
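Since the question asks to validate first and extract second, the two steps can be wrapped together. This is a sketch only: the function name strip_port_80 is made up, and it uses a non-capturing group for the port so that only two groups remain.

```python
import re

PATTERN = re.compile(r"^(https?://www\.example\.com)(?::80)?(/[^/]+/)$")

def strip_port_80(url):
    """Return url without ':80' if it matches PATTERN, otherwise None."""
    m = PATTERN.match(url)
    if m is None:
        # Step 1 failed: the URL does not have the expected shape
        return None
    # Step 2: join the host part and the path part, leaving the port out
    return m.group(1) + m.group(2)

print(strip_port_80("http://www.example.com:80/about-me/"))
print(strip_port_80("https://www.example.com/about-me/"))
print(strip_port_80("http://www.example.com:8080/about-me/"))
```

Returning None for a non-matching URL avoids the AttributeError that calling groups() on a failed match would raise.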
You might use urlparse to remove the port from the URL:
from urllib.parse import urlparse, urlunparse
parsedUrl = urlparse('http://www.example.com:80/about-me/')
if parsedUrl.netloc == "www.example.com:80":
    # Drop the ":80" suffix from the network location
    stripped = parsedUrl._replace(netloc=parsedUrl.netloc.replace(":" + str(parsedUrl.port), ""))
    print(urlunparse(stripped))
Output
http://www.example.com/about-me/
Or use a pattern with 2 capturing groups and use those in the replacement.
If you want to match 1 or more digits instead of only 80, use \d+ and note to escape the dot \.
^(https?://www\.example\.com)(?::80)?(/[^/]+/)$
import re
regex = r"^(https?://www\.example\.com)(?::80)?(/[^/]+/)$"
s = "http://w...content-available-to-author-only...e.com:80/about-me/"
result = re.sub(regex, r"\1\2", s, 1)
print(result)
Output
http://www.example.com/about-me/
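For the \d+ variant mentioned above, a short sketch using the URL shapes from the question. The 8080 and other.example.org inputs are made up for illustration. One thing to keep in mind is that re.sub returns the input unchanged when the pattern does not match, so on its own it does not tell you whether the URL was valid.

```python
import re

any_port = r"^(https?://www\.example\.com)(?::\d+)?(/[^/]+/)$"

urls = [
    "http://www.example.com:80/about-me/",
    "http://www.example.com:8080/about-me/",   # hypothetical non-default port
    "http://other.example.org:80/about-me/",   # hypothetical host that does not match
]

# Matching URLs lose their port; the non-matching one comes back as-is
results = [re.sub(any_port, r"\1\2", u, count=1) for u in urls]
for r in results:
    print(r)
```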

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The space between the two rows is important and plan to split later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work but it is for Java.
It doesn't find anything. I assume this is because I need to incorporate multiline but unsure how to read file in differently or incorporate. Have been trying many adjustments to regex but have not been successful.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
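# re.DOTALL lets (.*) run across newlines, so both rows are captured
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str, re.DOTALL)
new_str = m[2]
# Match the comma-separated decimals, including the blank line between the rows
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])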
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then \D+ matches 1+ non-digit chars, which runs up to just before 5.7. Then .+ matches 1+ chars except a newline, which matches 5.7,-9.0,6.2. It will not match the following empty line or the next line.
One option could be to match your string and then match all the following lines that start with a decimal in a capturing group.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
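A sketch of that last step, assuming the block of numbers has already been captured as in the earlier result. The names block, rows, and values are illustrative.

```python
import re

# The captured block from the previous step: two rows separated by a blank line
block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

number = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")

# Split on runs of newlines, then pull the numbers out of each row
rows = [number.findall(line) for line in re.split(r"\n+", block.strip())]
values = [[float(x) for x in row] for row in rows]

print(rows)
print(values)
```

Splitting first keeps each row separate, so the $ in the lookahead refers to the end of that row and re.MULTILINE is not needed.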
