I know many people have encountered this error before, but I couldn't find a solution to my problem.
I have a URL that I want to normalize:
url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
scheme, host_port, path, query, fragment = urlsplit(url)
path = urllib.unquote(path)
path = urllib.quote(path,safe="%/")
This gives an error message:
/usr/lib64/python2.6/urllib.py:1236: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
res = map(safe_map.__getitem__, s)
Traceback (most recent call last):
File "url_normalization.py", line 246, in <module>
logging.info(get_canonical_url(url))
File "url_normalization.py", line 102, in get_canonical_url
path = urllib.quote(path,safe="%/")
File "/usr/lib64/python2.6/urllib.py", line 1236, in quote
res = map(safe_map.__getitem__, s)
KeyError: u'\xc3'
If I remove the unicode prefix "u" from the URL string, I do not get the error message. But how can I get rid of the unicode automatically, since I read the URL directly from a database?
urllib.quote() does not handle Unicode strings properly. To get around this, you can call the .encode() method on the URL when reading it (or on the variable you read from the database). So run url = url.encode('utf-8'). With this you get:
import urllib
import urlparse
from urlparse import urlsplit
url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
url = url.encode('utf-8')
scheme, host_port, path, query, fragment = urlsplit(url)
path = urllib.unquote(path)
path = urllib.quote(path,safe="%/")
and then your output for the path variable will be:
>>> path
'/Dienste/Fachbeitr%C3%A4ge.aspx'
Does this work?
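If you also need the full URL back after normalizing the path (not just the path variable), here is a minimal sketch along the same lines, using urlparse.urlunsplit to reassemble the pieces:
import urllib
from urlparse import urlsplit, urlunsplit

url = u"http://www.dgzfp.de/Dienste/Fachbeitr%C3%A4ge.aspx?EntryId=267&Page=5"
url = url.encode('utf-8')  # work on a byte string, as above

scheme, host_port, path, query, fragment = urlsplit(url)
path = urllib.quote(urllib.unquote(path), safe="%/")

# Reassemble the pieces into one normalized URL string.
print urlunsplit((scheme, host_port, path, query, fragment))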
I am getting familiar with Python and am struggling to do the following with BeautifulSoup.
What is expected:
If the output of the script below contains the string 5378, it should email me the line where the string appears.
#! /usr/bin/env python
from bs4 import BeautifulSoup
from lxml import html
import urllib2,re
import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)
BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"
webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name
I tried using findall(5378, name), but it returns an empty list: [].
I am also struggling with Unicode issues when I try to use it together with grep.
$ python dell.py | grep 5378
Traceback (most recent call last):
File "dell.py", line 18, in <module>
print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)
Can someone tell me what I am doing wrong in both cases?
The function findall (from the re module) expects the first parameter to be a regular expression, which is a string, but you provided an integer. Try this instead:
re.findall("5378", name)
When printed, this will output [u'5378'] if it found something, or [] if it didn't.
I suspect you want to retrieve the product name from the number, which means you have to iterate over the elements in findcolumn. We can use re.search() here to check for a match within each element's text.
for input_element in findcolumn.find_all("div"):
    name = unicode(input_element.text.strip())
    if re.search("5378", name) != None:
        print unicode(name)
As for the Unicode error, there are several solutions, depending on your operating system and configuration: reconfigure your system locale on Ubuntu, or encode your script output with .encode()/unicode().
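For example, here is a minimal sketch of the second option, assuming UTF-8 output is acceptable for your terminal and pipeline:
# Encode the Unicode string as UTF-8 bytes before printing, so piping
# the output to grep no longer trips over the ASCII codec.
print name.encode('utf-8')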
Scraping a site with Chinese symbols.
How do I scrape Chinese symbols?
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)

def main():
    parse_items()

if __name__ == '__main__':
    main()
The error looks like this:
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume you are using Python 3, which matters here because of how it handles Unicode.
So we can expect that href, title and title2 are all Unicode strings (i.e. Python 3 str objects). But the print function tries to convert the strings to an encoding acceptable to the output system, and for some reason your system defaults to ASCII, hence the error.
How to fix it:
The best way would be to make your system accept Unicode. On Linux or other Unixes, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8); on Windows you can try chcp 65001, though the latter is far from guaranteed to work.
If that does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out the offending characters, since Python 3 natively uses Unicode strings.
I would use:
import sys

def u_filter(s, encoding=sys.stdout.encoding):
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)
That means: if s is a Unicode string, encode it in the encoding used for stdout, replacing any non-convertible character with a replacement character, and decode it back into a now-clean string.
and next:
def fprint(*args, **kwargs):
    fargs = [u_filter(arg) for arg in args]
    print(*fargs, **kwargs)
This filters out any offending characters from the Unicode strings and prints the rest unchanged.
With that, you can safely replace the print call that throws the exception with:
fprint(href, title, title2)
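Another option (an assumption on my part, not something from the question) is to rewrap sys.stdout once at the top of the script, assuming UTF-8 output is acceptable on your system:
import io
import sys

# Rewrap stdout as UTF-8 and substitute any unencodable characters,
# so plain print() calls stop raising UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8',
                              errors='replace')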
I had a script in Python 2 that was working great.
def _generate_signature(data):
return hmac.new('key', data, hashlib.sha256).hexdigest()
Where data was the output of json.dumps.
Now, if I try to run the same kind of code in Python 3, I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/hmac.py", line 144, in new
return HMAC(key, msg, digestmod)
File "/usr/lib/python3.4/hmac.py", line 42, in __init__
raise TypeError("key: expected bytes or bytearray, but got %r" %type(key).__name__)
TypeError: key: expected bytes or bytearray, but got 'str'
If I try something like transforming the key to bytes like so:
bytes('key')
I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
I'm still struggling to understand the encodings in Python 3.
You can use a bytes literal: b'key'
def _generate_signature(data):
return hmac.new(b'key', data, hashlib.sha256).hexdigest()
In addition to that, make sure data is also bytes. For example, if it is read from a file, you need to use binary mode (rb) when opening the file.
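Since the question says data is the output of json.dumps (a str in Python 3), here is a minimal sketch that encodes both the key and the data (the payload below is hypothetical):
import hashlib
import hmac
import json

def _generate_signature(data):
    # data is the str returned by json.dumps(); encode it to bytes as well
    return hmac.new(b'key', data.encode('utf-8'), hashlib.sha256).hexdigest()

payload = json.dumps({"example": "value"})  # hypothetical payload
print(_generate_signature(payload))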
Not to resurrect an old question, but I did want to add something I feel is missing from this answer, for which I had trouble finding an appropriate explanation or example anywhere else:
Aquiles Carattino was pretty close with his attempt at converting the string to bytes, but was missing the second argument, the encoding of the string to be converted to bytes.
If someone would like to convert a string to bytes through some other means than static assignment (such as reading from a config file or a DB), the following should work:
(Python 3+ only, not compatible with Python 2)
import hmac, hashlib

def _generate_signature(data):
    key = 'key'  # Defined as a simple string.
    key_bytes = bytes(key, 'latin-1')    # Commonly 'latin-1' or 'ascii'.
    data_bytes = bytes(data, 'latin-1')  # Assumes `data` is also an ASCII string.
    return hmac.new(key_bytes, data_bytes, hashlib.sha256).hexdigest()

print(
    _generate_signature('this is my string of data')
)
Try codecs.encode(), which can be used in both Python 2.7.12 and 3.5.2:
import hashlib
import codecs
import hmac
a = "aaaaaaa"
b = "bbbbbbb"
hmac.new(codecs.encode(a), msg=codecs.encode(b), digestmod=hashlib.sha256).hexdigest()
For Python 3, this is how I solved it:
import codecs
import hashlib
import hmac

key = 'key'  # the secret key as a str

def _generate_signature(data):
    return hmac.new(codecs.encode(key), codecs.encode(data), hashlib.sha256).hexdigest()
I want to open a connection to an LDAP directory using an LDAP URL that will be given at run time. For example:
ldap://192.168.2.151/dc=directory,dc=example,dc=com
It is valid as far as I can tell; the python-ldap URL parser ldapurl.LDAPUrl accepts it:
>>> url = 'ldap://192.168.2.151/dc=directory,dc=example,dc=com'
>>> parsed_url = ldapurl.LDAPUrl(url)
>>> parsed_url.dn
'dc=directory,dc=example,dc=com'
But if I use it to initialize an LDAPObject, I get an ldap.LDAPError exception:
>>> ldap.initialize(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/ldap/functions.py", line 91, in initialize
return LDAPObject(uri,trace_level,trace_file,trace_stack_limit)
File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 70, in __init__
self._l = ldap.functions._ldap_function_call(ldap._ldap_module_lock,_ldap.initialize,uri)
File "/usr/lib/python2.7/dist-packages/ldap/functions.py", line 63, in _ldap_function_call
result = func(*args,**kwargs)
ldap.LDAPError: (0, 'Error')
I found that if I manually encode the DN part of the URL, it works:
>>> url = 'ldap://192.168.2.151/dc=directory%2cdc=example%2cdc=com'  # URL still valid
>>> parsed_url = ldapurl.LDAPUrl(url)
>>> parsed_url.dn
'dc=directory,dc=example,dc=com'
>>> ldap.initialize(url)  # and it returns a valid connection
<ldap.ldapobject.SimpleLDAPObject instance at 0x1400098>
How can I ensure robust URL handling in ldap.initialize without encoding parts of the URL myself? (Which, I'm afraid, won't be that robust anyway.)
You can programmatically encode the last part of the URL:
from urllib import quote # works in Python 2.x
from urllib.parse import quote # works in Python 3.x
url = 'ldap://192.168.2.151/dc=directory,dc=paralint,dc=com'
idx = url.rindex('/') + 1
url[:idx] + quote(url[idx:], '=')
=> 'ldap://192.168.2.151/dc=directory%2Cdc=paralint%2Cdc=com'
One can use the LDAPUrl.unparse() method to get a properly encoded version of the URI, like this:
>>> import ldapurl
>>> url = ldapurl.LDAPUrl('ldap://192.168.2.151/dc=directory,dc=example,dc=com')
>>> url.unparse()
'ldap://192.168.2.151/dc%3Ddirectory%2Cdc%3Dparalint%2Cdc%3Dcom???'
>>> ldap.initialize(url.unparse())
<ldap.ldapobject.SimpleLDAPObject instance at 0x103d998>
And LDAPUrl.unparse() will not re-encode an already encoded URL:
>>> url = ldapurl.LDAPUrl('ldap://example.com/dc%3Dusers%2Cdc%3Dexample%2Cdc%3Dcom%2F???')
>>> url.unparse()
'ldap://example.com/dc%3Dusers%2Cdc%3Dexample%2Cdc%3Dcom%2F???'
So you can use it blindly on any LDAP URI your program must handle.
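Putting that together, a small helper like the following (just a sketch; the connect name is mine) lets you accept any LDAP URI at run time and normalize it before connecting:
import ldap
import ldapurl

def connect(uri):
    # Parse whatever URI the caller supplies, then pass the re-encoded
    # form produced by unparse() to ldap.initialize().
    normalized = ldapurl.LDAPUrl(uri).unparse()
    return ldap.initialize(normalized)

conn = connect('ldap://192.168.2.151/dc=directory,dc=example,dc=com')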
I do geocoding with Python, and I think I need to encode the variable region with urlencode so that it works with content that has whitespace and other special characters:
url = urllib.urlencode('http://maps.googleapis.com/maps/api/geocode/json?address='+region+'&sensor=false')
logging.info('url:'+url)
result = urlfetch.fetch(url)
It generates an error log when the variable region contains whitespace:
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~montaoproject/pricehandling.355268396595012751/in.py", line 153, in get
url = urllib.urlencode('http://maps.googleapis.com/maps/api/geocode/json?address='+region+'&sensor=false')
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib.py", line 1275, in urlencode
raise TypeError
TypeError: not a valid non-string sequence or mapping object
The background is another question I asked, where I had a problem that I'm troubleshooting: the code works, but not for regions whose names consist of two or more words, i.e. names with whitespace.
https://stackoverflow.com/questions/8441063/how-should-i-use-urlfetch-here
On production I used another variable; I thought it did not matter that it contained whitespace. When I try variables that do not contain whitespace, it works.
So could you please tell me how I should encode the url variable so that it accepts whitespace and other "special" characters?
Thank you
Just encode your query string part, like:
param = {"address" : region,
"sensor" : "false"
}
or
param = [("address", region), ("sensor", "false")]
then
encoded_param = urllib.urlencode(param)
url = 'http://maps.googleapis.com/maps/api/geocode/json'
url = url + '?' + encoded_param
result = urlfetch.fetch(url)
Use urllib.pathname2url; it works directly on a single string value, with no need for a dictionary.
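For example, here is a minimal sketch of that approach (the region value is hypothetical; on a POSIX system pathname2url simply percent-encodes the string, so the space becomes %20):
import urllib

region = 'New York'  # hypothetical value containing whitespace
url = ('http://maps.googleapis.com/maps/api/geocode/json?address='
       + urllib.pathname2url(region) + '&sensor=false')
# url is now '...address=New%20York&sensor=false'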