How to read text file into pandas and convert utf8 chars to string / latin? [duplicate] - python

I have a list containing URLs with escaped characters in them. Those characters have been set by urllib2.urlopen when it recovers the html page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in python?
P.S.: The URLs are encoded in utf-8

Using urllib package (import urllib) :
Python 2.7
From official documentation :
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Python 3
From official documentation :
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
[…]
Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.

And if you are using Python3 you could use:
import urllib.parse
urllib.parse.unquote(url)
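As a minimal sketch, assuming Python 3, this applies urllib.parse.unquote to the three URLs from the question. The variable names urls and decoded are only illustrative:

```python
from urllib.parse import unquote

# URLs copied from the question; the %xx escapes are UTF-8 bytes.
urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# unquote() decodes the escapes as UTF-8 by default.
decoded = [unquote(u) for u in urls]
for u in decoded:
    print(u)
```

Because UTF-8 is the default encoding of unquote() in Python 3, no extra decode step should be needed here.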

or urllib.unquote_plus
>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
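In Python 3 both functions live in urllib.parse. A short sketch of the same comparison, using the string from the session above (the names plain and plus are illustrative):

```python
from urllib.parse import unquote, unquote_plus

s = "erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29"

# unquote() only resolves %xx escapes and leaves '+' untouched.
plain = unquote(s)
# unquote_plus() additionally turns '+' into a space (form encoding).
plus = unquote_plus(s)

print(plain)
print(plus)
```

unquote_plus is usually the right choice for query strings produced by HTML forms, while unquote is safer for paths, where a literal + may be meaningful.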

You can use urllib.unquote

import re
def unquote(url):
    return re.compile('%([0-9a-fA-F]{2})',re.M).sub(lambda m: chr(int(m.group(1),16)), url)
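The regex version above maps each escape to a single character, which is only correct for ASCII escapes in Python 3. A possible adaptation, sketched here under the assumption that the escapes encode UTF-8 bytes, does the substitution on bytes and decodes at the end. The name unquote_utf8 is hypothetical:

```python
import re

# Matches one percent escape such as %C3 in a bytes object.
_ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")

def unquote_utf8(url):
    """Replace %xx escapes by raw bytes, then decode the result as UTF-8."""
    raw = url.encode("utf-8")
    # Each match becomes the single byte named by its two hex digits.
    replaced = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return replaced.decode("utf-8")

print(unquote_utf8("/El%20Ni%C3%B1o/"))
```

This is meant only to illustrate the mechanism; the standard library function is the more robust choice in practice.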

Related

Converting %25C3%25BC in string to Umlaut python [duplicate]

In Python 2.7, given a URL like example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, how can I decode it to the expected result, example.com?title=правовая+защита?
I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.
The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it. In Python 3, urllib.parse.unquote() handles both steps transparently, from percent-encoded data to UTF-8 bytes and then to text:
from urllib.parse import unquote
url = unquote(url)
Demo:
>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:
from urllib import unquote
url = unquote(url).decode('utf8')
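The two-step decode that Python 2 requires can also be made explicit in Python 3 with urllib.parse.unquote_to_bytes. A small sketch, assuming the escapes are UTF-8 (the names raw and text are illustrative):

```python
from urllib.parse import unquote, unquote_to_bytes

url = ("example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F"
       "+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0")

# Step 1: percent escapes -> raw bytes (no text decoding yet).
raw = unquote_to_bytes(url)
# Step 2: raw bytes -> text, with the encoding chosen explicitly.
text = raw.decode("utf-8")

print(text)
# unquote() performs both steps at once with UTF-8 as the default.
print(text == unquote(url))
```

Splitting the steps can be useful when the encoding is not UTF-8 and has to be detected or tried separately.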
If you are using Python 3, you can use urllib.parse.unquote:
url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""
import urllib.parse
urllib.parse.unquote(url)
gives:
'example.com?title=правовая+защита'
You can achieve an expected result with requests library as well:
import requests
url = "http://www.mywebsite.org/Data%20Set.zip"
print(f"Before: {url}")
print(f"After: {requests.utils.unquote(url)}")
Output:
$ python3 test_url_unquote.py
Before: http://www.mywebsite.org/Data%20Set.zip
After: http://www.mywebsite.org/Data Set.zip
Might be handy if you are already using requests, without using another library for this job.
In HTML the URLs can contain html entities.
This replaces them, too.
# from urllib import unquote  # Python 2
from urllib.parse import unquote
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
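To make the effect of both calls visible, here is a short sketch on a made-up link that contains an HTML entity as well as percent escapes. The URL and the variable names are hypothetical:

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical link as it might appear inside an HTML attribute:
# '&amp;' is an HTML entity, '%2F' and '%2B' are percent escapes.
link = "https://example.com/p?id=1&amp;token=a%2Fb%2Bc"

# unquote() resolves the percent escapes, unescape() the HTML entities.
cleaned = unescape(unquote(link))
print(cleaned)
```

Note that decoding %2B to a literal + inside a query string changes its meaning for parsers that read + as a space, so it may be safer to parse the query before unquoting the individual values.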
I know this is an old question, but I stumbled upon this via Google search and found that no one has proposed a solution with only built-in features.
So I quickly wrote my own.
A URL can only contain these characters literally: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =. Everything else is percent-encoded.
Percent-encoding is straightforward: each byte of the character's encoding (UTF-8 here) is written as a percent sign followed by two hexadecimal digits.
So a simple while loop over the characters is enough. If the character is not a percent sign, add its byte as is and advance the index by one; otherwise add the byte given by the two hex digits after the percent sign and advance the index by three. Accumulating the bytes and decoding them as UTF-8 gives the result.
Here is the code:
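def url_parse(url):
    l = len(url)
    data = bytearray()
    i = 0
    while i < l:
        if url[i] != '%':
            # literal character: keep its byte value (assumes ASCII input)
            d = ord(url[i])
            i += 1
        else:
            # percent escape: the next two hex digits give one byte
            d = int(url[i+1:i+3], 16)
            i += 3
        data.append(d)
    # the collected bytes are interpreted as UTF-8
    return data.decode('utf8')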
I have tested it and it works perfectly.
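The loop above assumes that every literal character is ASCII and that every percent sign starts a valid escape. A slightly more defensive variant could look like the following sketch; the name url_parse_safe and the choice to keep malformed escapes as literal text are assumptions, not part of the original answer:

```python
import string

def url_parse_safe(url):
    """Decode %xx escapes; keep malformed escapes and non-ASCII text as is."""
    data = bytearray()
    i = 0
    while i < len(url):
        hex_pair = url[i + 1:i + 3]
        is_escape = (
            url[i] == '%'
            and len(hex_pair) == 2
            and all(c in string.hexdigits for c in hex_pair)
        )
        if is_escape:
            # Valid escape: append the byte named by the two hex digits.
            data.append(int(hex_pair, 16))
            i += 3
        else:
            # Literal character (possibly non-ASCII): append its UTF-8 bytes.
            data.extend(url[i].encode('utf-8'))
            i += 1
    # Invalid UTF-8 sequences become U+FFFD instead of raising.
    return data.decode('utf-8', errors='replace')

print(url_parse_safe('caf%C3%A9 100%'))
```

The trailing percent sign in the example is left untouched because it is not followed by two hex digits.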

Converting url encoded string(utf-8) to string in python?

I have a URL-encoded token (UTF-8), "EC%2d2EC7", and I want to convert it to "EC-2EC7", i.e. convert %2d to -.
>>> token = "EC%2d2EC7"
>>> token.encode("utf-8")
'EC%2d2EC7'
I also tried urllib.quote but same result. Is the problem that token is already in utf-8 so it can't convert? What can I do?
My python version: 2.7.10
You can use urllib.unquote:
from urllib import unquote
print unquote("EC%2d2EC7")
Another way is to use requests.utils.unquote:
from requests.utils import unquote
print unquote("EC%2d2EC7")
Output:
EC-2EC7
You are looking for unquote instead of decode.
urllib.unquote('EC%2d2EC7')
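For readers on Python 3, the same conversion might look like the sketch below. It also shows why encode() did not help in the question: encoding changes the text-to-bytes representation, not the percent escapes. The variable names are illustrative:

```python
from urllib.parse import unquote

token = "EC%2d2EC7"

# encode() only converts text to bytes; the escape %2d stays as is.
as_bytes = token.encode("utf-8")
print(as_bytes)

# unquote() is what resolves %2d to '-'.
decoded_token = unquote(token)
print(decoded_token)
```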

Approximately converting unicode string to ascii string in python

I don't know whether this is trivial or not, but I need to convert a unicode string to an ascii string, and I would rather not have all those escape chars around. Is it possible to do an "approximate" conversion to a similar ascii character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?
Thank you very much!
Marco
Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
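Output (the curly apostrophe has no ASCII decomposition, so 'ignore' drops it; this approach only helps with characters such as accented letters):
Gavin OConnor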
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
b = str(a.encode('utf-8').decode('ascii', 'ignore'))
should work fine.
There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
Try simple character replacement
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get an error
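The normalization and replacement ideas from the answers above can be combined. NFKD normalization decomposes accented letters, while punctuation such as curly quotes needs an explicit mapping. The following Python 3 sketch assumes a small, hand-picked mapping table; the names PUNCTUATION and to_ascii are hypothetical, and a package like Unidecode covers far more characters:

```python
import unicodedata

# Hand-picked replacements for common typographic quotes (an assumption;
# extend as needed).
PUNCTUATION = {
    0x2018: "'",  # left single quotation mark
    0x2019: "'",  # right single quotation mark
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
}

def to_ascii(text):
    """Approximate text in ASCII; unmapped non-ASCII characters are dropped."""
    text = text.translate(PUNCTUATION)
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with 'ignore' drops the combining marks and anything unmapped.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("\u201cI am the greatest\u201d, said Gavin O\u2019Connor"))
```

Characters without a decomposition or a table entry (for example æ) are silently removed, which may or may not be acceptable depending on the data.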

Python urllib urlencode problem with æøå

How can I urlencode a string with special chars æøå?
ex.
urllib.urlencode('http://www.test.com/q=testæøå')
I get this error :(
not a valid non-string sequence or mapping object
urlencode is intended to take a dictionary, for example:
>>> q= u'\xe6\xf8\xe5' # u'æøå'
>>> params= {'q': q.encode('utf-8')}
>>> 'http://www.test.com/?'+urllib.urlencode(params)
'http://www.test.com/?q=%C3%A6%C3%B8%C3%A5'
If you just want to URL-encode a single string, the function you're looking for is quote:
>>> 'http://www.test.com/?q='+urllib.quote(q.encode('utf-8'))
'http://www.test.com/?q=%C3%A6%C3%B8%C3%A5'
I'm guessing UTF-8 is the right encoding (it should be, for modern sites). If what you actually want is ?q=%E6%F8%E5, then the encoding you want is probably cp1252 (similar to iso-8859-1).
You should pass dictionary to urlencode, not a string. See the correct example below:
from urllib import urlencode
print 'http://www.test.com/?' + urlencode({'q': 'testæøå'})
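In Python 3 these functions moved to urllib.parse and accept text directly, so the manual .encode('utf-8') step is no longer needed. A brief sketch under that assumption (the variable names are illustrative):

```python
from urllib.parse import quote, urlencode

q = "æøå"

# urlencode() takes a mapping (or a sequence of pairs), not a full URL.
query = urlencode({"q": q})
url = "http://www.test.com/?" + query
print(url)

# quote() encodes a single string; UTF-8 is the default encoding.
utf8_quoted = quote(q)
# A different target encoding can be requested explicitly.
cp1252_quoted = quote(q, encoding="cp1252")
print(utf8_quoted, cp1252_quoted)
```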
