I'm using Django's URLconf, the URL I will receive is /?code=authenticationcode
I want to match the URL using r'^\?code=(?P<code>.*)$' , but it doesn't work.
Then I found out the problem is the '?'.
Because I tried to match /aaa?aaa using r'aaa\?aaa', r'aaa\\?aaa', and even r'aaa.*aaa'. All failed, but it works when the character is "+" or anything else.
How to match the '?', is it special?
>>> s="aaa?aaa"
>>> import re
>>> re.findall(r'aaa\?aaa', s)
['aaa?aaa']
The reason /aaa?aaa won't match inside your URL is because a ? begins a new GET query.
So, the matchable part of the URL is only up to the first 'aaa'. The remaining '?aaa' is a new query string separated by the '?' mark, containing a variable "aaa" being passed as a GET parameter.
What you can do here is encode the variable before it makes its way into the URL. The encoded form of ? is %3F.
You should also not match a GET query such as /?code=authenticationcode using regex at all. Instead, match your URL up to / using r'^$'. Django will pass the variable code as a GET parameter to the request object, which you can obtain in your view using request.GET.get('code').
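To see the split for yourself, here is a minimal sketch using only the standard library (Python 3's urllib.parse; the variable names are just for illustration, they are not part of Django). It shows that the path and the query string are separate pieces, and what the percent-encoded form of ? looks like:

```python
from urllib.parse import urlsplit, parse_qs, quote

# The "?" separates the path from the query string.
parts = urlsplit("/aaa?aaa")
path_only = parts.path    # '/aaa'  (the only part a path matcher sees)
query_only = parts.query  # 'aaa'

# The query string is parsed separately into parameters.
code_values = parse_qs(urlsplit("/?code=authenticationcode").query)["code"]

# Percent-encoding turns "?" into "%3F" so it can live inside a path segment.
encoded = quote("aaa?aaa", safe="")

print(path_only, query_only, code_values, encoded)
```

In a Django view the same value would come from request.GET.get('code'), as described above.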
You cannot use a literal ? as part of a variable value in a URL. The ? indicates that query variables follow.
Like: http://www.example.com?variable=1&another_variable=2
Replace it or escape it. Here's some nice documentation.
Django's urls.py does not parse query strings, so there is no way to get this information at the urls.py file.
Instead, parse it in your view:
def foo(request):
    code = request.GET.get('code')
    if code:
        pass  # do stuff
    else:
        pass  # No code!
"How to match the '?', is it special?"
Yes, but you are properly escaping it by using the backslash. I do not see where you have accounted for the leading forward slash, though. That bit just needs to be added in:
r'^/\?code=(?P<code>.*)$'
Suppress the regex metacharacters with []
>>> s
'/?code=authenticationcode'
>>> r=re.compile(r'^/[?]code=(.+)')
>>> m=r.match(s)
>>> m.groups()
('authenticationcode',)
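As a quick check on a plain string (not inside Django's URLconf, which never sees the query string), the backslash form and the character-class form behave the same. This is only a sketch with illustrative variable names; re.escape is the standard-library helper that produces the escaped form for you:

```python
import re

s = "/?code=authenticationcode"

# Two equivalent ways to make "?" a literal character.
with_backslash = re.match(r"^/\?code=(?P<code>.*)$", s)
with_class = re.match(r"^/[?]code=(?P<code>.*)$", s)

# re.escape() builds the escaped form of a literal string.
escaped_literal = re.escape("?")

print(with_backslash.group("code"), with_class.group("code"), escaped_literal)
```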
Related
I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/ so that even if I have s3://tester/folder/anotherone/test.pdf, I get the entire path after s3://tester/
I have attempted to use the split & partition methods but I can't seem to get it.
Currently I am trying:
string.partition('/')[3]
But I get an error saying that the index is out of range.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
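A small runnable sketch of the lookbehind, assuming the sample paths below (they are made up for illustration). It also shows the limitation: because the bucket name is hard-coded in the pattern, a different bucket gives no match at all.

```python
import re

pattern = re.compile(r"(?<=s3://tester/).+")

s = "s3://tester/folder/anotherone/test.pdf"
match = pattern.search(s)
p = match.group() if match else None

# A different bucket name does not satisfy the lookbehind.
other = pattern.search("s3://another-bucket/test.pdf")

print(p, other)
```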
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
The re method below should work whatever the S3 bucket name is, returning everything after the third / in the string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]
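For comparison, here is a sketch of two non-regex ways to get the same result for any bucket name. The function names are hypothetical, and the second one assumes Python 3's urllib.parse, which treats the bucket as the network location of the URI:

```python
from urllib.parse import urlsplit


def key_after_third_slash(uri):
    """Return everything after the third '/', or '' if there is none."""
    parts = uri.split("/", 3)
    return parts[3] if len(parts) > 3 else ""


def key_via_urlsplit(uri):
    """Return the path part of the URI without its leading slash."""
    return urlsplit(uri).path.lstrip("/")


uri = "s3://another-bucket/folder/anotherone/test.pdf"
print(key_after_third_slash(uri))
print(key_via_urlsplit(uri))
```

The length check in the first function avoids returning the bucket name when the URI has no key at all.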
I have a complex file like this: http://regexr.com/3a8n4
I need to extract every domain from it with a regex, meaning a line like this:
http://liqueur.werbeschalter.com/if/?http%3A%2F%2Fwww.vornamenkartei.de
should yield me:
liqueur.werbeschalter.com and www.vornamenkartei.de
I could do this with python.
Any ideas?
Trying this:
https?:\/\/(.+?)\/
That should be OK, but I also want to get the other domains after the "http%3A..."
(?:https?:\/\/|www\.)([^\/]+)\/.*$
Relatively simple, gets everything between the scheme and the start of the path, and captures it on group 1.
(?:): non-capturing group
https?|www\.: matches http with an optional s, OR a literal www.
:\/\/: just the start of a URL, no special meaning. The backslashes are for escaping
([^\/]+): creates a capturing group (()) that matches any character except / one or more times
\/: matches a literal slash
See here: http://regexr.com/3a8n7
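Applied in Python to the sample line from the question, the pattern can be tried like this (a minimal sketch; the slashes do not need escaping in Python, so they are written plainly here):

```python
import re

line = "http://liqueur.werbeschalter.com/if/?http%3A%2F%2Fwww.vornamenkartei.de"

# Group 1 captures everything between the scheme and the first slash of the path.
match = re.search(r"(?:https?://|www\.)([^/]+)/.*$", line)
domain = match.group(1) if match else None

print(domain)
```

Note that this only yields the first domain; the URL-encoded one in the query string is consumed by the trailing .*$.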
But ideally you wouldn't use regexes directly to parse the URL. Instead, use urlparse:
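import re
from urllib.parse import urlparse  # Python 3; on Python 2 use "from urlparse import urlparse"

with open("yourfile") as f:
    for line in f:
        referrer = re.match(r"Referrer: (.*)$", line)
        if referrer:  # skip lines that do not start with "Referrer: "
            url = urlparse(referrer.group(1))
            print(url.netloc)  # or whatever you want to do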
To get both the domain names and the URL-encoded domain names, you might want to try the following:
(?:https?(?::\/\/|%3A%2F%2F))([^\/%]*)
The reason for the % in the character class is in case there is a URL-encoded forward slash in the URL.
Please see Regex Demo here.
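A quick sketch of that pattern with re.findall on the sample line from the question. Because there is exactly one capturing group, findall returns just the captured domains:

```python
import re

line = "http://liqueur.werbeschalter.com/if/?http%3A%2F%2Fwww.vornamenkartei.de"

# Matches both the plain "://" and the URL-encoded "%3A%2F%2F" separator.
pattern = re.compile(r"(?:https?(?:://|%3A%2F%2F))([^/%]*)")
domains = pattern.findall(line)

print(domains)
```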
How about this ?
from urllib.parse import urlparse, unquote, parse_qs  # Python 3

for url in urls:
    result = urlparse(url)
    print("{}://{}".format(result.scheme, result.netloc))
    unquoted = unquote(result.query)
    parsed_qs = parse_qs(unquoted, keep_blank_values=True)
    extracted_strings = list(parsed_qs.keys())
    for get_arg_values in parsed_qs.values():
        extracted_strings.extend(get_arg_values)
    for possible_url in extracted_strings:
        if possible_url.startswith('http'):
            parsed_url = urlparse(possible_url)
            print("{}://{}".format(parsed_url.scheme, parsed_url.netloc))
Python has tools to parse URLs and get their parameters. We also need to handle the special case where a GET parameter has no value, so the keys are checked as well as the values.
EDIT: updated code
I am scraping a page with Python and BeautifulSoup library.
I have to get the URL only from this string. This actually is in href attribute of the a tag. I have scraped it but cannot seem to find a way to extract the URL from this
javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
You can write a straightforward regex to extract the URL.
>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'
The regex in question here is
'(.*?)'
Which reads "find a single-quote, followed by whatever (and capture the whatever), followed by another single quote, and do so non-greedily because of the ? operator". This extracts the arguments of window.open; then, just pick the first one to get the URL.
You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
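If you would rather not rely on the URL being the first quoted string, a slightly stricter variant is to anchor the pattern on the window.open( call itself. This is only a sketch under the assumption that the href always has this shape:

```python
import re

href = ("javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en',"
        "'TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,"
        "scrollbars=yes,resizable=no');")

# Capture only the first argument of window.open(...).
match = re.search(r"window\.open\('([^']*)'", href)
url = match.group(1) if match else None

print(url)
```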
I did it that way.
terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
terms.split("('")[1].split("','")[0]
outputs
/Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Instead of a regex, you could just partition it twice on something, (eg: '):
s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Here's a quick and ugly answer
href.split("'")[1]
I'm using python and django to match urls for my site. I need to match a url that looks like this:
/company/code/?code=34k3593d39k
The part after ?code= is any combination of letters and numbers, and any length.
I've tried this so far:
r'^company/code/(.+)/$'
r'^company/code/(\w+)/$'
r'^company/code/(\D+)/$'
r'^company/code/(.*)/$'
But so far none are catching the expression. Any ideas? Thanks
code=34k3593d39k is a GET parameter, and you don't need to define a pattern for it in the URL pattern. You can access it using request.GET.get('code') in the view. The pattern should be just:
r'^company/code/$'
Usage, accessing GET parameter:
def my_view(request):
    code = request.GET.get('code')
    print(code)
Check the documentation:
The URLconf searches against the requested URL, as a normal Python
string. This does not include GET or POST parameters, or the domain
name.
The first pattern will work if you move the last / to just after the ^:
>>> import re
>>> re.match(r'^/company/code/(.+)$', '/company/code/?code=34k3593d39k')
<_sre.SRE_Match object at 0x0209C4A0>
>>> re.match(r'^/company/code/(.+)$', '/company/code/?code=34k3593d39k').groups()
('?code=34k3593d39k',)
>>>
Note too that the ^ is unnecessary because re.match matches from the start of the string:
>>> re.match(r'/company/code/(.+)$', '/company/code/?code=34k3593d39k').groups()
('?code=34k3593d39k',)
>>>
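The difference between re.match and re.search can be checked with a short sketch on the same string (variable names are illustrative). Keep in mind this is plain re on a full string; Django's URLconf never sees the query string:

```python
import re

url = "/company/code/?code=34k3593d39k"

# re.match only tries at position 0, where the string starts with "/".
anchored = re.match(r"company/code/", url)

# re.search scans forward and finds the pattern one character in.
searched = re.search(r"company/code/", url)

print(anchored, searched.start())
```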
I read this thread about extracting URLs from a string. https://stackoverflow.com/a/840014/326905
Really nice, I got all URLs from an XML document containing http://www.blabla.com with
>>> s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
At first I thought this was the answer
re.findall(r'(https?://\S+\")', s)
or this
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
Btw, the question mark after the "s" of https means the "s" may or may not occur. Am I right?
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
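If lxml is not available, the same idea (let a parser find the attribute instead of a regex) can be sketched with the standard library's html.parser. The HrefCollector class is a hypothetical helper written for this example, not part of any library:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute seen in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
collector = HrefCollector()
collector.feed(s)
hrefs = collector.hrefs

print(hrefs)
```

Since the parser strips the surrounding quotes, the trailing double quote problem does not come up.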
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters not a whitespace," but has the addition of not including double quotes either. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
You want the double quotes to appear as a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way they won't appear as part of the match. Also, yes the ? means the character is optional.
See example here: http://regexr.com?347nk
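The character-class and look-ahead patterns give the same result on the string from the question, but they differ on text where the URL is not followed by a quote. A short sketch (the second sample string is made up for illustration):

```python
import re

s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
plain = "see http://example.com/x now"

char_class = re.compile(r'(https?://[^\s"]+)')
lookahead = re.compile(r'(https?://\S+)(?=")')

from_class = char_class.findall(s)
from_lookahead = lookahead.findall(s)

# Without a closing quote, the look-ahead can never succeed.
plain_class = char_class.findall(plain)
plain_lookahead = lookahead.findall(plain)

print(from_class, from_lookahead, plain_class, plain_lookahead)
```

So the look-ahead version is the stricter of the two: it only returns URLs that are directly followed by a double quote.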
I used to extract URLs from text through this piece of code:
import re

# Python 3 version: the ur'' prefix is Python 2 only; re itself understands \xab and \u201c escapes
url_rgx = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print(urls)
It works great!
Thanks. I just read this https://stackoverflow.com/a/13057368/326905
and tried this, which also works:
re.findall(r'"(https?://\S+)"', urls)