How to join a string to a URL in Python? - python

I am trying to join a string onto a URL, but because the string contains spaces, the trailing part is not recognized as part of the URL.
Here would be an example:
import urllib
import urllib2
website = "http://example.php?id=1 order by 1--"
request = urllib2.Request(website)
response = urllib2.urlopen(request)
html = response.read()
The "order by 1--" part is not recognized as part of the URL.

You should use urllib.quote or urllib.urlencode:
website = "http://example.com/?" + urllib.quote("id=1 order by 1--")
or
website = "http://example.com/?" + urllib.urlencode({"id": "1 order by 1 --"})
As for the query you're trying to achieve: I think you're forgetting a ; to end the first query.

Of course not: spaces are invalid in a query string, and should be replaced by +.
http://example.com/?1+2+3
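In Python 3, the same encoding functions live in urllib.parse; a minimal sketch (example.com stands in for the real site):

```python
# Python 3 sketch: the Python 2 urllib.quote / urllib.urlencode calls above
# moved into urllib.parse; example.com is a placeholder host.
from urllib.parse import quote_plus, urlencode

base = "http://example.com/?"

# quote_plus percent-encodes unsafe characters and turns spaces into "+"
print(base + "id=" + quote_plus("1 order by 1--"))
# → http://example.com/?id=1+order+by+1--

# urlencode builds the whole query string from a dict
print(base + urlencode({"id": "1 order by 1--"}))
# → http://example.com/?id=1+order+by+1--
```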

Related

String splitting of a URL which always changes the position of its values in Python

I need to split a URL whose values change position very often.
For example, here is the URL with the request token in three different positions:
01: https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA
02: https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success
03: https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success
From these URLs I need only the value of request_token, the alphanumeric string that comes after the '=', like '43CbEWSxdqztXNRpb2zmypCr081eF92d'.
To split the URL I'm using this code:
request_token = driver.current_url.split('=')[1].split('&action')[0]
But it gives me an error when the token is not in the expected position.
Can anyone give me a single-line solution for this URL splitting in Python?
Note: I'm using driver.current_url because I'm doing this in Selenium.
You can use the urllib.parse module to parse URLs properly.
>>> from urllib.parse import urlparse, parse_qs
>>> url = "?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success"
>>> query = parse_qs(urlparse(url).query)
>>> query['request_token']
['43CbEWSxdqztXNRpb2zmypCr081eF92d']
>>> query['request_token'][0]
'43CbEWSxdqztXNRpb2zmypCr081eF92d'
This handles the actual structure of the URLs and doesn't depend on the position of the parameter or other special cases you'd have to handle in a regex.
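For instance, the same lookup works for all three orderings from the question (a small sketch):

```python
from urllib.parse import urlparse, parse_qs

# the three URL variants from the question
urls = [
    'https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA',
    'https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success',
    'https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success',
]

for url in urls:
    # parse_qs maps each key to a list of values; take the first one
    token = parse_qs(urlparse(url).query)['request_token'][0]
    print(token)
```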
Assuming you have the URLs as strings, you could use a regular expression to isolate the request tokens.
import re
urls = ['https://127.0.0.1/?action=login&type=login&status=success&request_token=oCS44HJQT2ZSCGb39H76CjgXb0s2klwA',
        'https://127.0.0.1/?request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&action=login&type=login&status=success',
        'https://127.0.0.1/?&action=login&request_token=43CbEWSxdqztXNRpb2zmypCr081eF92d&type=login&status=success']
for url in urls:
    m = re.match('.*request_token=(.*?)(?:&|$)', url)
    if m:
        print(m.group(1))

Regular Expression In Order To Identify Tor Domains

I am working on a scraper that goes through HTML code trying to scrape Tor domains. However, I am having trouble coming up with a piece of code to match Tor domains.
Tor domains are typically in the format of:
http://sitegoeshere.onion
or
https://sitegoeshere.onion
I just want to match URLs that would be contained within a page, in the format http://sitetexthere.onion or https://sitehereitis.onion. This is within a bunch of text that may not be URLs; it should just pull out the URLs.
I am sure there is an easy or good piece of regex that'll do this, but I have not been able to find one. If anyone is able to link one or quickly spin one up, that'd be much appreciated. Many thanks.
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
r = session.get('http://facebookcorewwwi.onion')
print(r.text)
re.match will return None if the URL isn't matched.
import re
regex = re.compile(r"^https?://[\w.-]+\.onion")
url = 'https://sitegoes-here.onion'
if regex.match(url):
    print('Valid Tor Domain!')
else:
    print('Invalid Tor Domain!')
For optional http(s):
regex = re.compile(r"^(?:https?://)?[\w.-]+\.onion")
Regex patterns are mostly standard, so I would recommend this pattern:
r'\.onion$'
The backslash escapes the dot, and the '$' character means end of string. Since all the URLs start with 'http(s)://', there's no need to include it in the pattern.
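Since the goal is to pull URLs out of a page of mixed text, anchoring to the end of the string won't help by itself; a findall sketch over made-up sample text:

```python
import re

# made-up sample text mixing .onion URLs with other content
text = """Visit http://sitegoeshere.onion for details, or try
https://sitehereitis.onion instead; http://example.com is not Tor."""

# match an http(s) scheme followed by a hostname ending in .onion
onion_re = re.compile(r"https?://[\w.-]+\.onion")
print(onion_re.findall(text))
# → ['http://sitegoeshere.onion', 'https://sitehereitis.onion']
```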
Assuming these are taken from href attributes, you could try an attribute = value CSS selector with the $ (ends-with) operator:
from bs4 import BeautifulSoup as bs
import requests
resp = requests.get("https://en.wikipedia.org/wiki/Tor_(anonymity_network)") #example url. Replace with yours.
soup = bs(resp.text,'lxml')
links = [item['href'] for item in soup.select('[href$=".onion"]')]

python - parsing a URL

I am writing a simple script that checks if a website is present on the first page of Google results for a given keyword.
Now, this is the function that parses a URL and returns the host name:
def parse_url(url):
    url = urlparse(url)
    hostname = url.netloc
    return hostname
and starting from a list of tags selected by:
linkElems = soup.select('.r a')  # on Google's first page the resulting urls have class r
I wrote this:
for link in linkElems:
    l = link.get("href")[7:]
    url = parse_url(l)
    if "www.example.com" == url:
        # do stuff (e.g. store in a list, etc)
In the second line I have to start from the seventh index, because all href values start with '/url?q='.
I am learning Python, so I am wondering if there is a better way to do this, or simply an alternative one (maybe with a regex, the replace method, or the urlparse library).
You can use the Python lxml module, which is also an order of magnitude faster than BeautifulSoup.
It can be done something like this:
import requests
from lxml import html
blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
r = requests.get(blah_url).content
root = html.fromstring(r)
print(root.xpath('//h3[@class="r"]/a/@href')[0].replace('/url?q=', ''))
print([url.replace('/url?q=', '') for url in root.xpath('//h3[@class="r"]/a/@href')])
This will result in:
http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
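An alternative to slicing off the first seven characters is to parse the Google redirect href properly with urllib.parse (a sketch; the href value is made up, but follows the '/url?q=' shape from the question):

```python
from urllib.parse import urlparse, parse_qs

# hypothetical href in the shape Google's result links use
href = "/url?q=http://www.example.com/page&sa=U&ved=0ahUKE"

# the real target is the "q" parameter of the redirect's query string
target = parse_qs(urlparse(href).query)["q"][0]
print(target)                   # → http://www.example.com/page
print(urlparse(target).netloc)  # → www.example.com
```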

concatenate an integer to the url in python giving error

I would like to loop through a set of URLs, so I would like to concatenate an integer where the page id changes, like this.
In the middle of the URL there is %count%, but it doesn't seem to work. How can I concatenate it?
count = 2
while (count < pages):
    mech = Browser()
    url = 'http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
    url = int(raw_input(url))
    mech = Browser()
    page = mech.open(url)
    soup = BeautifulSoup(page)
    print url
    for thediv in soup.findAll('li',{'class':' ilo2'}):
        links = thediv.find('a')
        links = links['href']
        print links
    count = count+1
I am getting this error:
TypeError: not all arguments converted during string formatting
Final URL format:
http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491
The % operator does not work like that in Python.
Here is how you should use it:
url = 'http://....../ref=sr_pg_%s?rh=.............' % (count, )
As you already have % symbols in your URL pattern, you should begin by doubling them so they won't be seen as placeholders by Python:
url = 'http://www.amazon.com/s/ref=sr_pg_%s?rh=n%%3A2858778011%%2Cp_drm_rights%%3APurchase%%7CRental%%2Cn%%3A2858905011%%2Cp_n_date%%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491' % (count, )
That being said, there is a Python module dedicated to parsing and building URLs: urlparse in Python 2, urllib.parse in Python 3. You can find its documentation here: https://docs.python.org/3.3/library/urllib.parse.html
You have url-encoded entities in your string (%3A etc.). You might try using {} syntax instead:
url = 'http://.....{}...{}...'.format(first_arg, second_arg)
Then you'll also see any other issues in the string.
If you were looking to keep the string as is (not inserting a variable value into it), the problem would be that you use single quotes ' to delimit a string that itself contains quotes. You can use double quotes instead:
url = "http://www.amazon.com/s/ref=sr_pg_%s'% count %'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491"
A better solution is escaping the quotes:
url = 'http://www.amazon.com/s/ref=sr_pg_%s\'% count %\'%s?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
Instead of trying to parse or edit URLs with raw string manipulation, one should use the dedicated module: urlparse in Python 2, urllib.parse in Python 3.
Here is a simple example, using the OP's url :
import urlparse  # Python 2; in Python 3 use: from urllib import parse as urlparse

original_url = (
    """http://www.amazon.com/s/ref=sr_pg_2?rh=n%3A2858778011%2"""
    """Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date"""
    """%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491""")
parsed = urlparse.urlparse(original_url)
This returns something like this:
ParseResult(
    scheme='http', netloc='www.amazon.com', path='/s/ref=sr_pg_2',
    params='',
    query='rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491',
    fragment='')
Then we edit the path part of the URL:
scheme, netloc, path, params, query, fragment = parsed
path = '/s/ref=sr_pg_%d' % (count, )
And we "unparse" the URL:
new_url = urlparse.urlunparse((scheme, netloc, path, params, query, fragment))
And we have a new URL with the path edited:
'http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011%2Cp_drm_rights%3APurchase%7CRental%2Cn%3A2858905011%2Cp_n_date%3A2693527011&page=3&sort=csrank&ie=UTF8&qid=1403073491'
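In Python 3, the same edit can lean on ParseResult being a namedtuple (a sketch with a shortened query string; _replace is part of the standard namedtuple API):

```python
from urllib.parse import urlparse, urlunparse

count = 423  # the page number to splice into the path
original_url = ("http://www.amazon.com/s/ref=sr_pg_2"
                "?rh=n%3A2858778011&page=3&ie=UTF8")  # query shortened for the sketch

parsed = urlparse(original_url)
# ParseResult is a namedtuple, so _replace returns a copy with a new path
new_url = urlunparse(parsed._replace(path='/s/ref=sr_pg_%d' % count))
print(new_url)
# → http://www.amazon.com/s/ref=sr_pg_423?rh=n%3A2858778011&page=3&ie=UTF8
```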

Using urllib.urlencode values to complete a search form

I've used this community a number of times, and the answers for the questions I search are awesome. I have searched around for a solution to this one, but I am having problems; I think it has to do with my lack of knowledge about HTML structure. Right now I am trying to use urllib.urlencode to fill out a form on a website. Unfortunately, no matter what combination of values I add to the dictionary, the HTML data returned as 'soup' is the same webpage with a list of the search options. I'm guessing that means it is not passing the search data properly with urllib.urlencode.
An example of the webpage is:
http://www.mci.mndm.gov.on.ca/Claims/Cf_Claims/clm_cls.cfm?Div=80
which is the URL I will go to, where the trailing Div=80 (or Div=70, etc.) is built in the first two lines with a call to another function, urlData(division). After these lines is where the problem happens. I've tried to include a value for each input line under the search form, but I am definitely missing something.
Code:
def searchHolder(Name, division):
    url = ('http://www.mci.mndm.gov.on.ca/Claims/Cf_Claims/clm_cls.cfm'+
           '?Div='+str(urlData(division)))  # creates the url given above
    print url  # checked: same url as given above for the problem case
    values = ({'HolderName': Name, 'action':'clm_clr.cfm', 'txtDiv' : 80,
               'submit': 'Start Search'})
    data = urllib.urlencode(values)
    html = urllib.urlopen(url, data)
    soup = bs4.BeautifulSoup(html)
    soup.unicode
    print soup.text
    return soup
The form "action" isn't a parameter you pass. Rather, it's the URL you need to send your request to in order to get results. Give this a try:
def searchHolder(Name, division):
    url = 'http://www.mci.mndm.gov.on.ca/Claims/Cf_Claims/clm_clr.cfm'
    values = ({'HolderName': Name, 'txtDiv' : 80})
    data = urllib.urlencode(values)
    html = urllib.urlopen(url, data)
    soup = bs4.BeautifulSoup(html)
    soup.unicode
    print soup.text
    return soup
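For reference, a Python 3 sketch of the same POST with urllib (URL and field names copied from the answer above; untested against the live site, so treat it as a shape rather than a verified scraper):

```python
# Python 3 sketch: urlencode/urlopen moved into urllib.parse and
# urllib.request. Field names come from the answer above; not verified
# against the live site.
from urllib.parse import urlencode
from urllib.request import urlopen

def search_holder(name):
    url = 'http://www.mci.mndm.gov.on.ca/Claims/Cf_Claims/clm_clr.cfm'
    # passing a bytes `data` argument makes urlopen issue a POST
    data = urlencode({'HolderName': name, 'txtDiv': 80}).encode('ascii')
    with urlopen(url, data) as resp:
        return resp.read().decode('utf-8', errors='replace')
```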
