I have such complex file: http://regexr.com/3a8n4
I need to regex every domain out of it, meaning such a line:
http://liqueur.werbeschalter.com/if/?http%3A%2F%2Fwww.vornamenkartei.de
should yield me:
liqueur.werbeschalter.com and www.vornamenkartei.de
I could do this with python.
Any ideas?
Trying this:
https?:\/\/(.+?)\/
Should be ok, but I wanted to get also the other domains after the "http%3A..."
(?:https?:\/\/|www\.)([^\/]+)\/.*$
Relatively simple, gets everything between the scheme and the start of the path, and captures it on group 1.
(?:): non-capturing group
https?|www.\: matches http with a optional s, OR www.
:\/\/: just the start of a URL, no special meaning. \s are for escaping
([^\/]+): creates a matching group (()) that matches any character except \/ one or more times
\/: matches a literal slash
See here: http://regexr.com/3a8n7
But ideally you wouldn't use regexes directly to parse the URL. Instead, use urlparse:
import re
import urlparse
with open("yourfile") as f:
for line in f:
referrer = re.match("Referrer: (.*)$")
url = urlparse.urlparse(referrer)
print(url.netloc) # or whatever you want to do
To get both the domain names and the URL-encoded domain names, you might want to try the following:
(?:https?(?::\/\/|%3A%2F%2F))([^\/%]*)
The reason for the % in the character class is in case there is a URL-encoded forward slash in the URL.
Please see Regex Demo here.
How about this ?
for url in urls:
result = urlparse(url)
print("{}://{}".format(result.scheme, result.netloc))
unquoted = unquote(result.query)
parsed_qs = parse_qs(unquoted, keep_blank_values=True)
extracted_strings = list(parsed_qs.keys())
for get_arg_values in parsed_qs.values():
extracted_strings.extend(get_arg_values)
for possible_url in extracted_strings:
if possible_url.startswith('http'):
parsed_url = urlparse(possible_url)
print("{}://{}".format(parsed_url.scheme, parsed_url.netloc))
Python has means to parse urls and get params, we also need to process special case when get parameter doesn't have value and process keys as well.
EDIT: updated code
Related
I currently have the Python code for parsing markdown text in order to extract the content inside the square brackets of a markdown link along with the hyperlink.
import re
# Extract []() style links
link_name = "[^]]+"
link_url = "http[s]?://[^)]+"
markup_regex = f'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
name = match[0]
url = match[1]
print(url)
# url will be https://wiki.com/atopic_(subtopic
This will fail to grab the proper link because it matches up to the first bracket, rather than the last one.
How can I make the regex respect up till the final bracket?
For those types of urls, you'd need a recursive approach which only the newer regex module supports:
import regex as re
data = """
It's very easy to make some words **bold** and other words *italic* with Markdown.
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""
pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')
for match in pattern.finditer(data):
description, _, url = match.groups()
print(f"{description}: {url}")
This yields
link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)
See a demo on regex101.com.
This cryptic little beauty boils down to
\[([^][]+)\] # capture anything between "[" and "]" into group 1
(\( # open group 2 and match "("
((?:[^()]+|(?2))+) # match anything not "(" nor ")" or recurse group 2
# capture the content into group 3 (the url)
\)) # match ")" and close group 2
NOTE: The problem with this approach is that it fails for e.g. urls like
[some nasty description](https://google.com/()
# ^^^
which are surely totally valid in Markdown. If you're to encounter any such urls, use a proper parser instead.
I think you need to distinguish between what makes a valid link in markdown, and (optionally) what is a valid url.
Valid links in markdown can, for example, also be relative paths, and urls may or may not have the 'http(s)' or the 'www' part.
Your code would already work by simply using link_url = "http[s]?://.+" or even link_url = ".*". It would solve the problem of urls ending with brackets, and would simply mean that you rely on the markdown structure []() to find links .
Validating urls is an entirely different discussion: How do you validate a URL with a regular expression in Python?
Example code fix:
import re
# Extract []() style links
link_name = "[^\[]+"
link_url = "http[s]?://.+"
markup_regex = f'\[({link_name})]\(\s*({link_url})\s*\)'
for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
name = match[0]
url = match[1]
print(url)
# url will be https://wiki.com/atopic_(subtopic)
Note that I also adjusted link_name, to prevent problems with a single '[' somewhere in the markdown text.
I am a newbie in python and I am trying to cut piece of string in another string at python.
I looked at other similar questions but I could not find my answer.
I have a variable which contain a domain list which the domains look like this :
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird etc)
I know I must be able to do it with re.findall() but I have no idea about regular expressions I need to use.
Any idea?
Update:
Here is the part I used:
for final in match2:
netname=re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
print final
print netname
Which did not work. Then I tried to do this one which only cut the ip address (92.230.28.21) but not the name:
for final in match2:
netname=re.findall('\d+\.\d+\.\d+\.\d+', final)
print final
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
... print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
... print(pattern.search(url).group(1))
...
Hotbird
Coldbird
where we are using a capturing group (\w+) to capture the mp3 filename consisting of one or more aplhanumeric characters which is followed by a dot, mp3 at the end of the url.
How about ?
([^/]*mp3)$
I think that might work
Basically it says...
Match from the end of the line, start with mp3, then match everything back to the first slash.
Think it will perform well.
I'm using python and django to match urls for my site. I need to match a url that looks like this:
/company/code/?code=34k3593d39k
The part after ?code= is any combination of letters and numbers, and any length.
I've tried this so far:
r'^company/code/(.+)/$'
r'^company/code/(\w+)/$'
r'^company/code/(\D+)/$'
r'^company/code/(.*)/$'
But so far none are catching the expression. Any ideas? Thanks
code=34k3593d39k is GET parameter and you don't need to define the pattern for it in URL pattern. You can access it using request.GET.get('code') under view. The pattern should be just:
r'^company/code/$'
Usage, accessing GET parameter:
def my_view(request):
code = request.GET.get('code')
print code
Check the documentation:
The URLconf searches against the requested URL, as a normal Python
string. This does not include GET or POST parameters, or the domain
name.
The first pattern will work if you move the last / to just after the ^:
>>> import re
>>> re.match(r'^/company/code/(.+)$', '/company/code/?code=34k3593d39k')
<_sre.SRE_Match object at 0x0209C4A0>
>>> re.match(r'^/company/code/(.+)$', '/company/code/?code=34k3593d39k').groups()
('?code=34k3593d39k',)
>>>
Note too that the ^ is unnecessary because re.match matches from the start of the string:
>>> re.match(r'/company/code/(.+)$', '/company/code/?code=34k3593d39k').groups()
('?code=34k3593d39k',)
>>>
I'm using Django's URLconf, the URL I will receive is /?code=authenticationcode
I want to match the URL using r'^\?code=(?P<code>.*)$' , but it doesn't work.
Then I found out it is the problem of '?'.
Becuase I tried to match /aaa?aaa using r'aaa\?aaa' r'aaa\\?aaa' even r'aaa.*aaa' , all failed, but it works when it's "+" or any other character.
How to match the '?', is it special?
>>> s="aaa?aaa"
>>> import re
>>> re.findall(r'aaa\?aaa', s)
['aaa?aaa']
The reason /aaa?aaa won't match inside your URL is because a ? begins a new GET query.
So, the matchable part of the URL is only up to the first 'aaa'. The remaining '?aaa' is a new query string separated by the '?' mark, containing a variable "aaa" being passed as a GET parameter.
What you can do here is encode the variable before it makes its way into the URL. The encoded form of ? is %3F.
You should also not match a GET query such as /?code=authenticationcode using regex at all. Instead, match your URL up to / using r'^$'. Django will pass the variable code as a GET parameter to the request object, which you can obtain in your view using request.GET.get('code').
You are not allowed to use ? in a URL as a variable value. The ? indicates that there are variables coming in.
Like: http://www.example.com?variable=1&another_variable=2
Replace it or escape it. Here's some nice documentation.
Django's urls.py does not parse query strings, so there is no way to get this information at the urls.py file.
Instead, parse it in your view:
def foo(request):
code = request.GET.get('code')
if code:
# do stuff
else:
# No code!
"How to match the '?', is it special?"
Yes, but you are properly escaping it by using the backslash. I do not see where you have accounted for the leading forward slash, though. That bit just needs to be added in:
r'^/\?code=(?P<code>.*)$'
supress the regex metacharacters with []
>>> s
'/?code=authenticationcode'
>>> r=re.compile(r'^/[?]code=(.+)')
>>> m=r.match(s)
>>> m.groups()
('authenticationcode',)
I'm making a regex so I can find youtube links (can be multiple) in a piece of HTML text posted by an user.
Currently I'm using the following regex to change 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' into displaying the corresponding youtube video:
return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)
(where the variable 'tag' is a bit of html so the video works and 'value' a user post)
Now this works.. until the url is like this:
'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'
Now I'm hoping you guys could help me figure how to also match the '&feature...' part so it disappears.
Example HTML:
No replies to this post..
Youtube vid:
http://www.youtube.com/watch?v=-JyZLS2IhkQ
More blabla
Thanks for your thoughts, much appreciated
Stefan
Here how I'm solving it:
import re
def youtube_url_validation(url):
youtube_regex = (
r'(https?://)?(www\.)?'
'(youtube|youtu|youtube-nocookie)\.(com|be)/'
'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')
youtube_regex_match = re.match(youtube_regex, url)
if youtube_regex_match:
return youtube_regex_match
return youtube_regex_match
TESTS:
youtube_urls_test = [
'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
'http://youtu.be/5Y6HSHwhVlY',
'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
'http://www.youtube.com/',
'http://www.youtube.com/?feature=ytca']
for url in youtube_urls_test:
m = youtube_url_validation(url)
if m:
print('OK {}'.format(url))
print(m.groups())
print(m.group(6))
else:
print('FAIL {}'.format(url))
You should specify your regular expressions as raw strings.
You don't have to escape every character that looks special, just the ones which are.
Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.
If you want to include - in a character set, you have to escape it or put it at right after the opening bracket.
You can use special character sets like \w (equals [a-zA-Z0-9_]) to shorten your regex.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).
In this example I took everything except -, =, %, & and alphanumerical characters to end the URL (too lazy to think about it any harder).
Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.
r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'
Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
What if you used the urlparse module to pick apart the youtube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire url and then use urlparse to do the heavy lifting of picking it apart for you.
from urlparse import urlparse,parse_qs,urlunparse
from urllib import urlencode
youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
new_params = {'v': params['v'][0]}
cleaned_youtube_url = urlunparse((youtube_url.scheme, \
youtube_url.netloc, \
youtube_url.path,
None, \
urlencode(new_params), \
youtube_url.fragment))
It's a bit more code, but it allows you to avoid regex madness.
And as hop said, you should use raw strings for the regex.
Here's how I implemented it in my script:
string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)
if youtube:
print youtube
That outputs:
["", "youtube.com/watch?v=BS5P_LAqiVg", ".com", "watch", "com", "bS5P_LAqiVg", ""]
If you just wanted to grab the video id, for example, you would do:
video_id = [c for c in youtube[0] if c] # Get rid of empty list objects
video_id = video_id[len(video_id)-1] # Return the last item in the list