I would like to parse through a set of URLs, so I would like to concatenate an integer where the page id is changing like this.
In the middle of the URL there is %count% but it seems not working. How can I concatenate it?
count=2
while (count < pages):
mech = Browser()
url = 'http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
url = int(raw_input(url))
mech = Browser()
page = mech.open(url)
soup = BeautifulSoup(page)
print url
for thediv in soup.findAll('li',{'class':' ilo2'}):
links = thediv.find('a')
links = links['href']
print links
count = count+1
I am getting this error:
TypeError: not all arguments converted during string formatting
Final Url Format
http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491
The % operator does not work like that in python.
Here is how you should use it :
url = 'http://....../ref=sr_pg_%s?rh=.............' % (count, )
As you already have % symbols in your URL pattern, you should begin by doubling them so they won't be seen as placeholders by python :
url = 'http://www.amazon.com/s/ref=sr_pg_%s?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491' % (count, )
That being said, there is python module dedicated to parse and create URL, it is named urllib and you can find its documentation here : https://docs.python.org/3.3/library/urllib.parse.html
You have urlencoded entities in your string (%3A etc.). You might try using {} syntax instead:
url = 'http://.....{}...{}...'.format(first_arg, second_arg)
then you'll see any other issues in the string also..
If you were looking to keep the string as is (not inserting a variable value inside), the problem would be due to the fact that you use single quotes ' to delimit your string that contains itself quotes inside. You can use instead double quotes:
url = "http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491"
A better solution is escaping the quotes:
url = 'http://www.amazon.com/s/ref=sr_pg_%s\'% count %\'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
Instead of trying to parse or edit URLs using raw strings, one should use the dedicated module, urllib2 (or urllib, depending on the python version).
Here is a simple example, using the OP's url :
from urllib2 import urlparse
original_url = (
"""http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2"""
"""Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date"""
"""%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491""")
parsed = urlparse.urlparse(original_url)
This returns something like that :
ParseResult(
scheme='http', netloc='www.amazon.com', path='/s/ref=sr_pg_2',
params='',
query='rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491', fragment='')
Then we edit the path part of the url
scheme, netloc, path, params, query, fragment = parsed
path = '/s/ref=sr_pg_%d' % (count, )
And we "unparse" the url :
new_url = urlparse.urlunparse((scheme, netloc, path, params, query, fragment))
And we have a new url with path edited :
'http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
Related
I need to split an url which is changing the positions of it's values very oftenly.
for example:-
This is the url with three different positions of request token
01:-https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA
02:-https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success
03:-https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success
From thses url i need only the value of request token which comes after the '=' with an alphanumeric number like this '43CbEWSxdqztXNRpb2zmypCr081eF92d'.
And to split this url i'm using this code
request_token = driver.current_url.split('=')[1].split('&action')[0]
But it gives me error when the url is not in the specified position.
So can anyone please give me a solution to this url splitting in just a single line in python and it'd be a great blessing for me from my fellow stack members.
Note:- Here i'm using driver.current_url because i'm working in selenium to do the thing.
You can use the urllib.parse module to parse URLs properly.
>>> from urllib.parse import urlparse, parse_qs
>>> url = "?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success"
>>> query = parse_qs(urlparse(url).query)
>>> query['request_token']
['43CbEWSxdqztXNRpb2zmypCr081eF92d']
>>> query['request_token'][0]
'43CbEWSxdqztXNRpb2zmypCr081eF92d'
This handles the actual structure of the URLs and doesn't depend on the position of the parameter or other special cases you'd have to handle in a regex.
Assuming you have the URLs as strings then you could use a regular expression to isolate the request tokens.
import re
urls = ['https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA',
'https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success',
'https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success']
for url in urls:
m = re.match('.*request_token=(.*?)(?:&|$)', url)
if m:
print(m.group(1))
I am writing a simple script that checks if a website is present on google first search for a determined keyword.
Now,this is the function that parse a url and return the host name:
def parse_url(url):
url = urlparse(url)
hostname = url.netloc
return hostname
and starting from a list of tags selected by:
linkElems = soup.select('.r a') #in google first page the resulting urls have class r
I wrote this:
for link in linkElems:
l = link.get("href")[7:]
url = parse_url(l)
if "www.example.com" == url:
#do stuff (ex store in a list, etc)
in this last one, in the second line, i have to start from the seventh index, because all href values start with '/url?q='.
I am learning python, so i am wondering if there is a better way to do this, or simply an alternative one (maybe with regex or replace method or from urlparse library)
You can use python lxml module to do that which is also order of magnitude faster than BeautifulSoup.
This can be done something like this :
import requests
from lxml import html
blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
r = requests.get(blah_url).content
root = html.fromstring(r)
print(root.xpath('//h3[#class="r"]/a/#href')[0].replace('/url?q=', ''))
print([url.replace('/url?q=', '') for url in root.xpath('//h3[#class="r"]/a/#href')])
This will result in :
http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
When I append a Unicode string to the end of str, I can not click on the URL.
Bad:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Ángel_Garasa"
print url
Good:
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
url = base_url + u"Toby_Maquire"
print url
It appears that you're printing the results in an IDE, perhaps PyCharm. You need to percent encode a UTF-8 encoded version of the string:
import urllib
base_url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles='
name = u"Ángel_Garasa"
print base_url + urllib.quote(name.encode("utf-8"))
This shows:
In your case you need to update your code, so that the relevant field from the database is percent encoded. You only need to encode this one field to UTF-8 just for the percent encoding.
This function takes a string as an input, and if the string starts with http:// or the string starts with https://, the function will assume that the string is an absolute link. If the URL starts with /, the function will convert it to an absolute link.
Note that base is a global variable for now. My main concern is that this function is making too many assumptions. Is there a way to accomplish the task of resolving URLs without so many assumptions?
def get_url(item):
#absolute link
if item.startswith('http://') or item.startswith('https://'):
url = item
#root-relative link
elif item.startswith('/'):
url = base + item
else:
url = base + "/" + item
return url
Use urljoin from the urlparse module.
from urlparse import urljoin
base = 'http://myserver.com'
def get_url(item):
return urljoin(base, item)
urljoin handles absolute or relative links itself.
Examples
print get_url('/paul.html')
print get_url('//otherserver.com/paul.html')
print get_url('https://paul.com/paul.html')
print get_url('dir/paul.html')
Output
http://myserver.com/paul.html
http://otherserver.com/paul.html
https://paul.com/paul.html
http://myserver.com/dir/paul.html
1-Use regex
2-Add a trailing / to you base url
import re
base = 'http://www.example.com/'
def get_url(item):
#absolute link
pattern = "(http|https)://[\w\-]+(\.[\w\-]+)+\S*" # regex pattern to approve http and https started strings
if re.search(pattern, item):
url = item
#root-relative link
else:
url = base + item.lstrip('/')
return url
I am trying to join a string in a URL, but the problem is that since it's spaced the other part does not get recognized as part of the URL.
Here would be an example:
import urllib
import urllib2
website = "http://example.php?id=1 order by 1--"
request = urllib2.Request(website)
response = urllib2.urlopen(request)
html = response.read()
The "order by 1--" part is not recognized as part of the URL.
You should better use urllib.urlencode or urllib.quote:
website = "http://example.com/?" + urllib.quote("?id=1 order by 1--")
or
website = "http://example.com/?" + urllib.urlencode({"id": "1 order by 1 --"})
and about the query you're trying to achieve:
I think you're forgetting a ; to end the first query.
Of course not. Spaces are invalid in a query string, and should be replaced by +.
http://example.com/?1+2+3