Why does capturing urls with proper protocols fail? - python

I started writing a regex to try to capture as many urls as possible. However, for some reason, I cannot make it work.
Regex:
^(https?|ftps?|mailto|gopher|telnet|www\.)\:.+?\/(?=\s)
Demo: Regex101
Any help is appreciated. Thx in advance.

You may use
^(?:(?:https?|ftps?|gopher|telnet):\/\/|www\.|mailto:)\S+
See the regex demo.
Details
^ - start of string
(?:(?:https?|ftps?|gopher|telnet):\/\/|www\.|mailto:) - any of
(?:https?|ftps?|gopher|telnet):\/\/ - http, https, ftp, ftps, gopher or telnet and then :// substring
| - or
www\. - www. substring
| - or
mailto: - mailto: substring
\S+ - 1 or more non-whitespace chars.
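As a minimal sketch of how this pattern could be used from Python: the re.MULTILINE flag, the helper name urls_at_line_start and the sample text are assumptions added for illustration, so that ^ anchors at the start of every line rather than only at the start of the whole string.

```python
import re

# The pattern from the answer; slashes need no escaping in Python.
# re.MULTILINE is an assumption so that ^ matches at each line start.
URL_AT_START = re.compile(
    r'^(?:(?:https?|ftps?|gopher|telnet)://|www\.|mailto:)\S+',
    re.MULTILINE,
)


def urls_at_line_start(text):
    """Return every URL-like token that begins a line of text."""
    # Only non-capturing groups are used, so findall returns whole matches.
    return URL_AT_START.findall(text)


sample = """https://example.com/path?q=1
www.example.org
mailto:someone@example.com
example.com is not matched
ftp://files.example.net/pub and some trailing words"""

for url in urls_at_line_start(sample):
    print(url)
```

The bare example.com line is skipped because it has neither a scheme nor a www. prefix, and the last line stops at the first whitespace character.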

Your pattern \/(?=\s) requires the URL to end in a slash. You can check this by adding a slash to the end of any of the URLs in your test snippet.
There's no real reason to do that - you can just remove the \/ and allow the URL to end on any character followed by a whitespace.
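The trailing-slash requirement can be checked directly in Python. This is only a sketch: the helper original_matches and the example.com inputs are made up for illustration, and the pattern is the one from the question.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r'^(https?|ftps?|mailto|gopher|telnet|www\.)\:.+?\/(?=\s)')


def original_matches(text):
    """Return the matched text, or None when the pattern does not match."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None


# No slash is directly followed by whitespace, so nothing matches.
print(original_matches('https://example.com/page more text'))
# With a trailing slash before the space, the URL is found.
print(original_matches('https://example.com/page/ more text'))
# The alternation is followed by \: so a bare www. prefix never matches.
print(original_matches('www.example.com/ more text'))
```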
However, you should also be aware that relying on whitespace to mark the end of a URL isn't very robust. If a URL occurs in running text, it may be followed by punctuation or parentheses. These are technically valid URL characters, so your filter (minus the \/) will include them, even though they are likely not part of the URL.
There's obviously some ambiguity in these cases, but it might be a better heuristic to exclude any punctuation characters at the end of the URL.
(If you want to be really sophisticated about it, you can do what GitHub's markdown parser does, and include closing parentheses at the end if and only if they match with opening parentheses inside the URL. This helps recognize links in contexts like (See https://en.wikipedia.org/wiki/Something_(disambiguation)). But this isn't feasible with only regex, and requires some extra processing.)
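The "extra processing" could look roughly like the sketch below. It is a simplified heuristic written for illustration, not GitHub's actual algorithm: the names CANDIDATE, TRAILING, trim_url and find_urls are hypothetical, and the candidate pattern only covers http(s) and www. prefixes.

```python
import re

# Grab a generous candidate first, then clean up its tail in plain Python.
CANDIDATE = re.compile(r'(?:https?://|www\.)\S+')
TRAILING = '.,;:!?\'"'  # punctuation assumed not to end a URL


def trim_url(candidate):
    """Drop trailing punctuation and any unbalanced closing parentheses."""
    while candidate:
        stripped = candidate.rstrip(TRAILING)
        # Keep a final ')' only if an opening '(' inside the URL pairs with it.
        if stripped.endswith(')') and stripped.count(')') > stripped.count('('):
            stripped = stripped[:-1]
        if stripped == candidate:
            break
        candidate = stripped
    return candidate


def find_urls(text):
    return [trim_url(m.group(0)) for m in CANDIDATE.finditer(text)]


text = ('(See https://en.wikipedia.org/wiki/Something_(disambiguation)). '
        'Also https://example.com/a, which ends before the comma.')
for url in find_urls(text):
    print(url)
```

Counting parentheses keeps the one that closes "(disambiguation)" and discards the one that closes "(See ...".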

Related

Extract URL from text without space between URL in Python3

I have a problem with a Python regex: I would like to extract any URL in a text, except email addresses. My current regex pattern still can't extract a URL if there is no space before it. This is my regex pattern
\b((?:(?:https|ftp|http)?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))
you can check on this regex editor (https://regex101.com/r/lcNc9N/9) , my pattern still can't recognize URL if there's no space before it, any hints or solutions are welcome.
If there are characters other than a space before it, it's no longer a URL :)
From the RFC:
In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
So I don't really know what you mean by "URL," but replacing the first \b with something like:
[\s\w]*?
Might be what you want. The first group will match URLs even if there are digits, alphabet letters or underscores before them.
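The effect of the leading \b can be seen on a much smaller pattern. The sketch below uses a deliberately simplified https?://\S+ pattern rather than the long one from the question, and the helper first_match and the sample strings are hypothetical.

```python
import re

# Simplified stand-ins for the long pattern in the question.
WITH_BOUNDARY = re.compile(r'\bhttps?://\S+')
WITHOUT_BOUNDARY = re.compile(r'https?://\S+')


def first_match(pattern, text):
    m = pattern.search(text)
    return m.group(0) if m else None


spaced = 'visit us http://example.com today'
glued = 'visit ushttp://example.com today'

# With a space before the URL, both variants find it.
print(first_match(WITH_BOUNDARY, spaced))
print(first_match(WITHOUT_BOUNDARY, spaced))
# When a letter is glued to the URL, there is no word boundary between
# that letter and "h", so the \b variant finds nothing.
print(first_match(WITH_BOUNDARY, glued))
print(first_match(WITHOUT_BOUNDARY, glued))
```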

How to search for and replace a term within another search term

I have a url I get from parsing a swagger's api.json file in Python.
The URL looks something like this and I want to replace the dashes with underscores, but only inside the curly brackets.
10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name
So, {pet-owner} will become {pet_owner}, but pet-store-account will remain the same.
I am looking for a regular expression that will allow me to perform a non-greedy search and then do a search-replace on each of the first search's findings.
A Python re approach is what I am looking for, but I would also appreciate a Vim one-liner.
The expected final result is:
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
Provided that all '{...}' blocks are well formed, you can use a lookahead to determine whether a given dash is inside a block: the dash only needs to be followed by a run of characters other than '{' and then a '}'.
exp = re.compile(r'(?=[^{]*})-')
...
substituted_url = re.sub(exp,'_',url_string)
Using lookahead and lookbehind in Vim:
s/\({[^}]*\)\#<=-\([^{]*}\)\#=/_/g
The pattern has three parts:
\({[^}]*\)\#<= matches, but does not consume, an opening brace followed by anything except a closing brace, immediately behind the next part.
- matches a hyphen.
\([^{]*}\)\#= matches, but does not consume, anything except an opening brace, followed by a closing brace, immediately ahead of the previous part.
The same technique can't be exactly followed in Python regular expressions, because they only allow fixed-width lookbehinds.
Result:
Before
outside-braces{inside-braces}out-again{in-again}out-once-more{in-once-more}
After
outside-braces{inside_braces}out-again{in_again}out-once-more{in_once_more}
Because it checks for braces in the right place both before and after the hyphen, this solution (unlike others which use only lookahead assertions) behaves sensibly in the face of unmatched braces:
Before
b-c{d-e{f-g}h-i
b-c{d-e}f-g}h-i
b-c{d-e}f-g{h-i
b-c}d-e{f-g}h-i
After
b-c{d-e{f_g}h-i
b-c{d_e}f-g}h-i
b-c{d_e}f-g{h-i
b-c}d-e{f_g}h-i
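The fixed-width restriction on lookbehinds in Python's re module, mentioned above, can be observed with a small sketch. The helper compiles and the two patterns are made up for illustration; the first pattern is a rough translation of the Vim lookbehind.

```python
import re


def compiles(pattern):
    """Report whether the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# [^}]* can match any number of characters, so this lookbehind is rejected.
variable_width = r'(?<=\{[^}]*)-'
# A lookbehind of known length is accepted.
fixed_width = r'(?<=\{)-'

print(compiles(variable_width))
print(compiles(fixed_width))
```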
Use a two-step approach:
import re
url = "10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name"
rx = re.compile(r'{[^{}]+}')
def replacer(match):
    return match.group(0).replace('-', '_')
url = rx.sub(replacer, url)
print(url)
Which yields
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
This looks for pairs of { and } and replaces every - with _ inside it.
There may be solutions with just one line but this one is likely to be understood in a couple of months as well.
Edit: For one-line-gurus:
url = re.sub(r'{[^{}]+}',
lambda x: x.group(0).replace('-', '_'),
url)
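To see how the Python variants behave with unmatched braces, the sketch below runs the lookahead-only pattern and the callback approach on the same four test strings used in the Vim answer. The function names with_lookahead and with_callback are hypothetical.

```python
import re

LOOKAHEAD_ONLY = re.compile(r'(?=[^{]*})-')
BRACED = re.compile(r'{[^{}]+}')


def with_lookahead(text):
    # Replaces any dash that is followed by a '}' with no '{' in between.
    return LOOKAHEAD_ONLY.sub('_', text)


def with_callback(text):
    # Replaces dashes only inside a complete {...} pair.
    return BRACED.sub(lambda m: m.group(0).replace('-', '_'), text)


samples = [
    'b-c{d-e{f-g}h-i',
    'b-c{d-e}f-g}h-i',
    'b-c{d-e}f-g{h-i',
    'b-c}d-e{f-g}h-i',
]
for s in samples:
    print(s, with_lookahead(s), with_callback(s))
```

On these inputs the callback version only touches dashes inside a complete brace pair, while the lookahead-only version also rewrites dashes that merely precede a stray closing brace.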
Solution in Vim:
%s/\({.*\)\#<=-\(.*}\)\#=/_/g
Explanation of matched pattern:
\({.*\)\#<=-\(.*}\)\#=
\({.*\)\#<= Forces the match to have a {.* behind
- Specifies a dash (-) as the match
\(.*}\)\#= Forces the match to have a .*} ahead
Use python lookahead to ignore the string enclosed within curly brackets {}:
Description:
(?=...):
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Solution
a = "10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name"
import re
re.sub(r"(?=[^{]*})-", "_", a)
Output:
'10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name'
Another way to do in Vim is to use a sub-replace-expression:
:%s/{\zs[^}]*\ze}/\=substitute(submatch(0),'-','_','g')/g
Using \zs and \ze, we limit the match to the text between the { and } characters. Using \={expr} evaluates {expr} as the replacement for each substitution. Here the expression calls Vimscript's substitute({text}, {pat}, {replace}, {flag}) function on the entire match, submatch(0), to convert - to _.
For more help see:
:h sub-replace-expression
:h /\zs
:h submatch()
:h substitute()

How to set regex for website url pattern

The url pattern is
http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500
This website has similar urls. The unique identifier is -p- for this url.
The url pattern always has -p- before word which is at end of url.
I used the following regex
(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z
it matched but it match many patterns on this website.
For example regex should match url above but it shouldnt match with
http://www.hepsiburada.com/bilgisayarlar-c-2147483646
Since you are using re.match, the pattern has to match the string from the beginning. However, the main problem is that your -p- is inside a character class, so it is treated as a set of separate symbols, any one of which can be matched. The same goes for \w+: inside the class it is read as \w and + separately.
So, use a sequence:
(.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$
See this regex demo
Or
^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$
See the regex demo
Note that you most probably do not need the capture groups at all, so the (...) parentheses can be removed from the pattern.
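A short sketch of the difference in Python, using the two URLs from the question. The variable names ORIGINAL, FIXED, product_url and category_url are chosen here for illustration.

```python
import re

# Pattern from the question: the final character class matches ONE character.
ORIGINAL = re.compile(r'(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z')
# Second pattern from the answer: -p- is a literal sequence.
FIXED = re.compile(r'^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$')

product_url = ('http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-'
               '200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500')
category_url = 'http://www.hepsiburada.com/bilgisayarlar-c-2147483646'

# The original pattern accepts both URLs, which is the reported problem.
print(bool(ORIGINAL.match(product_url)), bool(ORIGINAL.match(category_url)))

# The fixed pattern accepts only the product URL.
m = FIXED.match(product_url)
print(m.group(2) if m else None)
print(FIXED.match(category_url))
```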

How to apply Condition in regex

Hello, I am a newbie and am currently trying to learn about regex patterns by experimenting with various patterns. I tried to create a regex pattern for this URL but failed. It's a pagination link from Amazon.
http://www.amazon.in/s/lp_6563520031_pg_2?rh=n%3A5866078031%2Cn%3A%215866079031%2Cn%3A6563520031&page=2s&ie=UTF8&qid=1446802571
Or
http://www.amazon.in/Tena-Wet-Wipe-Pulls-White/dp/B001O1G242/ref=sr_1_46?s=industrial&ie=UTF8&qid=1446802608&sr=1-46
I just want to check the url by only these two things.
If the url has dp directory or product directory
If the url has query string page having any digit
I tried to create the regex pattern but failed. I want that if the first thing is not there the regex pattern should match the second (or vice versa).
Here's the regex pattern I made:
.*\/(dp|product)\/ | .*page
Here is my regex101 link: https://regex101.com/r/zD2gP5/1#python
Since you just want to check if a string contains some pattern, you can use
\/(?:dp|product)\/|[&?]page=
See regex demo
In Python, just check with re.search:
import re
p = re.compile(r'/(?:dp|product)/|[&?]page=')
test_str = "http://w...content-available-to-author-only...n.in/s/lp_6563520031_pg_2?rh=n%3A5866078031%2Cn%3A%215866079031%2Cn%3A6563520031&page=2s&ie=UTF8&qid=14468025716"
if p.search(test_str):
    print("Found!")
Also, in Python regex patterns, there is no need to escape / slashes.
The regex matches two alternative subpatterns (\/(?:dp|product)\/ and [&?]page=):
/ - a forward slash
(?:dp|product) - either dp or product (without storing the capture inside the capture buffer since it is a non-capturing group)
/ - a slash
| - or...
[&?] - either a & or ? (we check the start of a query string parameter)
page= - literal sequence of symbols page=.
\/(dp|product)\/|page=(?=[^&]*\d)[^&]+
This would be my idea; please test it and let me know if you have any questions about it.
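The two suggested patterns differ in what they accept, which the sketch below tries to show. The helper matched_text and the no_digit_url example are hypothetical; the other two URLs are the ones from the question.

```python
import re

# First answer: just check that the marker is present.
CONTAINS = re.compile(r'/(?:dp|product)/|[&?]page=')
# Second answer: additionally require a digit in the page value.
WITH_DIGIT = re.compile(r'/(dp|product)/|page=(?=[^&]*\d)[^&]+')


def matched_text(pattern, url):
    m = pattern.search(url)
    return m.group(0) if m else None


product_url = ('http://www.amazon.in/Tena-Wet-Wipe-Pulls-White/dp/B001O1G242/'
               'ref=sr_1_46?s=industrial&ie=UTF8&qid=1446802608&sr=1-46')
listing_url = ('http://www.amazon.in/s/lp_6563520031_pg_2?rh=n%3A5866078031%2Cn'
               '%3A%215866079031%2Cn%3A6563520031&page=2s&ie=UTF8&qid=1446802571')
no_digit_url = 'http://example.com/list?page=abc'  # hypothetical URL

for url in (product_url, listing_url, no_digit_url):
    print(matched_text(CONTAINS, url), matched_text(WITH_DIGIT, url))
```

Note that the second pattern has no [&?] before page=, so it would also accept a parameter whose name merely ends in "page".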

How does this code work to extract URL's from a string with regex

I'm using a snippet I found on Stack Exchange that finds all URLs in a string, using re.findall(). It works perfectly; however, to further my knowledge, I would like to know exactly how it works. The code is as follows:
re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', site)
As far as I understand, it's finding all strings starting with http or https (is that why the [s] is in square brackets?), but I'm not really sure about all the stuff after that, the (?:[etc etc etc]))+. I think the stuff in the square brackets, e.g. [a-zA-Z], means all letters from a to z, caps or not, but what about the rest? And how does it manage to get only the URL and not some random string at the end of the URL?
Thanks in advance :)
Using this link you can get your regex explained:
Your regex explained
To add a bit more:
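[s]? means "an optional 's' character", but that's because of the ?, not because of the brackets (I think they are superfluous).
Space isn't one of the accepted characters, so the match does stop there. '/' is different: it is not listed literally, but it falls inside the character range $-_ (see http://www.asciitable.com/index/asciifull.gif), which runs from code 36 to code 95 and so also covers the digits, the uppercase letters and characters such as : ; = ? and @. That is how the path part of a URL gets matched.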
(?:%[0-9a-fA-F][0-9a-fA-F]) this matches hexadecimal character codes in URLs e.g. %2f for the '/' character.
A non-capturing group means that the group is matched but that the resulting match is not stored in the regex return value, i.e. you cannot extract that matching bit of the string after the regex has been run against your string.
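One way to see where the match stops is to test individual characters against the $-_ range and then run the full pattern on a sample. This is only a sketch: the helper in_range, the RANGE_ONLY pattern and the sample text are assumptions for illustration, and a raw string is used so the backslashes reach the regex engine unchanged.

```python
import re

URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
RANGE_ONLY = re.compile(r'[$-_]')  # just the range from '$' (36) to '_' (95)


def in_range(ch):
    """True if the single character ch falls inside the $-_ range."""
    return RANGE_ONLY.fullmatch(ch) is not None


for ch in ['/', '?', '=', 'A', ' ', '~', 'a']:
    print(repr(ch), ord(ch), in_range(ch))

text = 'Docs at https://example.com/a%2Fb?x=1 and http://example.org/~user here'
# Every group is non-capturing, so findall returns the full matches.
print(URL_PATTERN.findall(text))
```

The first URL ends at the space, and the second is cut short at ~ because that character is not in any of the listed classes.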
