Extract URL from text without space between URL in Python3


I have a problem with a Python regex: I would like to extract every URL in a text, except email addresses. My current pattern still fails to extract a URL when there is no space before it. This is my regex pattern:
\b((?:(?:https|ftp|http)?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))
You can check it in this regex editor (https://regex101.com/r/lcNc9N/9). My pattern still can't recognize a URL if there's no space before it. Any hints or solutions are welcome.

If there are characters other than a space before it, it's no longer a URL :)
From the RFC:
In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
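So I don't really know what you mean by "URL," but replacing the first \b with something like:
[\s\w]*?
might be what you want. The first group will then match URLs even when digits, letters, or underscores come directly before them. Note that the overall match then also includes those preceding characters, so take the URL from the first group.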
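To see why \b gets in the way, here is a minimal sketch. It uses a much shorter stand-in pattern (https?://\S+) and a made-up sample string, both assumptions for illustration rather than the full pattern from the question.

```python
import re

# Hypothetical sample text: the URL is glued to the preceding word.
text = "Visit us athttp://example.com/page today"

# \b needs a word/non-word transition; between "t" and "h" there is none,
# so the URL cannot start there.
with_boundary = re.findall(r"\bhttps?://\S+", text)

# Replacing \b with a lazy prefix lets the URL start mid-word.
# findall returns the contents of the single capturing group.
with_prefix = re.findall(r"[\s\w]*?(https?://\S+)", text)

print(with_boundary)  # []
print(with_prefix)    # ['http://example.com/page']
```

The same idea should carry over to the long pattern, since its leading \b sits outside the first capturing group.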

Related

Trying to search for a certain pattern in text file

Okay, so in Python I'm trying to search for the pattern "comma, space, any lowercase character", but I can't get a regular expression that seems to work. Regular expressions are pretty new to me and I have no idea what I'm doing. I was able to search for "number, space, any character" using "[1-9]+ [a-zA-z]", but I'm not sure how to search for the pattern mentioned above. The picture included is an example of the pattern I am trying to search for in the text file.
Thanks,
Schulzy

A regular expression that would work is
, [a-z]
The comma and space are matched literally, and [] is a character class: any single character listed in it can match. You want any lowercase character, so we use [a-z], which matches any one character from lowercase a to z.
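A quick way to try this in Python is re.findall. The sample string below is made up for illustration:

```python
import re

# Hypothetical sample line; only ", w" and ", y" have a lowercase
# letter after the comma and space.
text = "Hello, world, Again, yes"

matches = re.findall(r", [a-z]", text)
print(matches)  # [', w', ', y']
```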

Why does capturing urls with proper protocols fail?

I started writing a regex to try to capture as many urls as possible. However, for some reason, I cannot make it work.
Regex:
^(https?|ftps?|mailto|gopher|telnet|www\.)\:.+?\/(?=\s)
Demo: Regex101
Any help is appreciated. Thx in advance.

You may use
^(?:(?:https?|ftps?|gopher|telnet):\/\/|www\.|mailto:)\S+
See the regex demo.
Details
^ - start of string
(?:(?:https?|ftps?|gopher|telnet):\/\/|www\.|mailto:) - any of
    (?:https?|ftps?|gopher|telnet):\/\/ - http, https, ftp, ftps, gopher or telnet and then :// substring
    | - or
    www\. - www. substring
    | - or
    mailto: - mailto: substring
\S+ - 1 or more non-whitespace chars.
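As a rough sketch, this is how the pattern could be used from Python. The re.MULTILINE flag and the sample lines are assumptions; the flag makes ^ match at the start of every line instead of only at the start of the whole string.

```python
import re

# The pattern from the answer; "/" needs no escaping in Python.
URL_START = re.compile(
    r"^(?:(?:https?|ftps?|gopher|telnet)://|www\.|mailto:)\S+",
    re.MULTILINE,  # assumption: one candidate URL per line
)

# Hypothetical test lines.
sample = "\n".join([
    "https://example.com/path and more text",
    "www.example.org",
    "mailto:someone@example.com",
    "see http://example.net",  # not at line start, so not matched
])

found = URL_START.findall(sample)
print(found)
```

Because the pattern contains only non-capturing groups, findall returns the full matches.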

Your pattern \/(?=\s) requires the URL to end in a slash. You can check this by adding a slash to the end of any of the URLs in your test snippet.
There's no real reason to do that - you can just remove the \/ and allow the URL to end on any character followed by a whitespace.
However, in addition to this you should be aware that the whitespace thing isn't very robust. If a URL occurs in text, it may be followed by punctuation or parentheses, which are technically valid URL characters and which your filter (minus the \/) will include, even though they are likely not part of it.
There's obviously some ambiguity in these cases, but it might be a better heuristic to exclude any punctuation characters at the end of the URL.
(If you want to be really sophisticated about it, you can do what GitHub's markdown parser does, and include closing parentheses at the end if and only if they match with opening parentheses inside the URL. This helps recognize links in contexts like (See https://en.wikipedia.org/wiki/Something_(disambiguation)). But this isn't feasible with only regex, and requires some extra processing.)
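The extra processing could look roughly like the sketch below. It is a simplified heuristic written for illustration, not GitHub's actual implementation; the function name trim_url and the set of punctuation characters are assumptions.

```python
def trim_url(url):
    """Strip trailing punctuation and unbalanced ')' from a candidate URL."""
    while url:
        last = url[-1]
        if last in ".,;:!?'\"":
            # Trailing sentence punctuation is most likely not part of the URL.
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # Drop a closing parenthesis only if it has no partner inside the URL.
            url = url[:-1]
        else:
            break
    return url


# A candidate as a \S+ style pattern would capture it from "(See <url>)."
candidate = "https://en.wikipedia.org/wiki/Something_(disambiguation))."
print(trim_url(candidate))
# https://en.wikipedia.org/wiki/Something_(disambiguation)
```

The regex finds the candidate; the function then decides where it really ends.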

Regex to identify Reddit usernames

I am making a bot with an option to not post if the username is not a certain user.
Reddit usernames can contain letters in both cases, and have numbers.
Which regex can be used to identify such a username? The format is /u/USERNAME where username can have letters of both cases and numbers, such as ExaMp13.
I have tried /u/[A-Z][a-z][0-9]

Reddit usernames are preceded by /u/ and may contain the following characters:
UPPERCASE
lowercase
Digits
Underscore
Hyphen
This regex meets those criteria:
/u/[A-Za-z0-9_-]+
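For comparison, here is a small sketch that runs the pattern from the question and the pattern above on the same text. The sample text and the second username are made up.

```python
import re

# Pattern from the question: exactly one uppercase letter, one lowercase
# letter, and one digit, in that order.
original = re.compile(r"/u/[A-Z][a-z][0-9]")

# Pattern from this answer: one or more allowed characters.
fixed = re.compile(r"/u/[A-Za-z0-9_-]+")

text = "ping /u/ExaMp13 and /u/some-user_1"

print(original.findall(text))  # []
print(fixed.findall(text))     # ['/u/ExaMp13', '/u/some-user_1']
```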

Brief
Thanks for updating your post with something you've tried as this gives us an idea of what you may not be understanding (and helps us explain where you went wrong and how to fix it).
Your regex doesn't work because it checks for exactly one [A-Z], followed by one [a-z], then one [0-9]. So your regex will only match something like Be1.
Answer
Instead, use [a-zA-Z0-9] or \w together with a quantifier such as + (one or more).
For your specific problem, you should use \/u\/(\w+) (or /u/(\w+), since / needs no escaping in Python's re module). This will allow you to then check the first capture group against a list of users you do not want to post for.
These regular expressions match /u/ followed by one or more word characters ([a-zA-Z0-9_] for ASCII text; in Python 3, \w also matches Unicode letters and digits unless re.ASCII is set).
See a working example here
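Checking the capture group against a list of users might be sketched like this. The comment text, the skip list, and the variable names are hypothetical.

```python
import re

USER = re.compile(r"/u/(\w+)")

# Hypothetical set of users the bot should not post for.
skip_users = {"ExaMp13"}

comment = "thanks /u/ExaMp13 and /u/other_user"

# findall returns only the capture group, i.e. the bare usernames.
mentioned = USER.findall(comment)
allowed = [name for name in mentioned if name not in skip_users]

print(mentioned)  # ['ExaMp13', 'other_user']
print(allowed)    # ['other_user']
```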

You can use a regex like this:
/u/\w+

How to set regex for website url pattern

The url pattern is
http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500
This website has many similar URLs. The unique identifier for this kind of URL is -p-.
The URL always has -p- before the last word, which is at the end of the URL.
I used the following regex
(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z
It matched, but it also matches many other URL patterns on this website.
For example, the regex should match the URL above, but it shouldn't match
http://www.hepsiburada.com/bilgisayarlar-c-2147483646

Since you are using re.match, the pattern has to match the string from the beginning. However, the main problem is that your -p- is inside a character class, so it is treated as a set of separate symbols, any one of which can match. The same goes for \w+: inside the class it is read as \w and + separately.
So, use a sequence:
(.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$
See this regex demo
Or
^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$
See the regex demo
Note that you most probably don't even need the capture groups, so the (...) parentheses can be removed from the pattern.
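A short check of the second pattern against the two URLs from the question (the variable names are mine):

```python
import re

PRODUCT_URL = re.compile(
    r"^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$"
)

product = (
    "http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-"
    "200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500"
)
category = "http://www.hepsiburada.com/bilgisayarlar-c-2147483646"

product_match = PRODUCT_URL.match(product)
category_match = PRODUCT_URL.match(category)

print(product_match.group(2))  # -p-EVPHI40PFK5500
print(category_match)          # None
```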

How does this code work to extract URL's from a string with regex

I'm using a snippet I found on Stack Exchange that finds all URLs in a string, using re.findall(). It works perfectly, but to further my knowledge I would like to know exactly how it works. The code is as follows:
re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', site)
As far as I understand, it's finding all strings starting with http or https (is that why the [s] is in square brackets?), but I'm not really sure about all the stuff after that, the (?:[etc etc etc]))+. I think the stuff in the square brackets, e.g. [a-zA-Z], means all letters from a to z, caps or not, but what about the rest? And how does it manage to get only the URL and not some random string at the end of the URL?
Thanks in advance :)

Using this link you can get your regex explained:
Your regex explained
To add a bit more:
[s]? means "an optional 's' character", but that's because of the ?, not the brackets (I think they are superfluous).
A space isn't one of the accepted characters, so the match stops there. '/' is not mentioned literally, but it is part of the character range $-_ (ASCII 0x24 to 0x5F, see http://www.asciitable.com/index/asciifull.gif). That range also covers the digits, the uppercase letters, and punctuation such as %, :, ? and =, which is why the path and query of a URL are matched.
(?:%[0-9a-fA-F][0-9a-fA-F]) this matches hexadecimal character codes in URLs e.g. %2f for the '/' character.
A non-capturing group means that the group is matched but that the resulting match is not stored in the regex return value, i.e. you cannot extract that matching bit of the string after the regex has been run against your string.
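These points can be checked directly in Python. The snippet below is a small sketch with made-up sample strings: it tests which characters fall into the range $-_, runs the pattern from the question, and compares a capturing with a non-capturing group.

```python
import re

# "/" (0x2F) lies inside the range $-_ (0x24..0x5F); a space (0x20) does not.
slash_in_range = re.fullmatch(r"[$-_]", "/") is not None
space_in_range = re.fullmatch(r"[$-_]", " ") is not None
print(slash_in_range, space_in_range)  # True False

# The pattern from the question, applied to a hypothetical string.
URL = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
site = "docs at https://example.com/a%2Fb?x=1 and more"
urls = re.findall(URL, site)
print(urls)  # ['https://example.com/a%2Fb?x=1']

# With a capturing group, findall returns the group (its last repetition);
# with a non-capturing group, it returns the whole match.
capturing = re.findall(r"(ab)+", "ababab")
non_capturing = re.findall(r"(?:ab)+", "ababab")
print(capturing)      # ['ab']
print(non_capturing)  # ['ababab']
```

This is also why the snippet returns complete URLs: all of its groups are non-capturing, so findall falls back to the full match.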
