Remove strange characters from a string in Python

I have to parse some data fetched from the web. The content can be in different regional languages, which I am handling without any problem. But some invalid characters appear in some strings, like
I am wokring
8qîÚ4½-ôMºÝCQ´Dɬ)Q+R±}Ûýï7üÛ²ëlY&53|8ïôóg/^ÿûêþ?ï¯a #ï?¼ºy{5­+B^ß¿ß~¾¿½¦ÓûÆk.c¹~WÚ#ë¤KÈh4rF-G¦!¹ÿ¬¦a~µuÓñµ_»|þì
daily statstistics
I have to remove such strange characters and extract only the valid string. I am using Python, and I am encoding each string with UTF-8.

If by strange you mean non-ASCII, you could try:
import string
"".join(filter(lambda char: char in string.printable, s))
Where s is your string.
Here are some string constants you can filter for:
https://docs.python.org/3/library/string.html
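For instance, with a hypothetical input string polluted by stray bytes, a minimal sketch of this filtering approach might look like:

```python
import string

# Hypothetical input containing a couple of stray non-printable characters.
s = "daily stats\x9e\xc2tistics"

# Keep only characters that Python considers printable ASCII.
cleaned = "".join(ch for ch in s if ch in string.printable)
print(cleaned)  # daily statstistics
```

Note that this drops every non-ASCII character, including legitimate accented letters, so it is only appropriate when you genuinely want plain ASCII output.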

Related

UTF-8 decoding doesn't decode special characters in python

Hi I have the following data (abstracted) that comes from an API.
"Product" : "T\u00e1bua 21X40"
I'm using the following code to decode the data byte:
var = json.loads(cleanhtml(str(json.dumps(response.content.decode('utf-8')))))
The cleanhtml is a regex function that I've created to remove HTML tags from the returned data (it's working correctly). However, decode('utf-8') is not converting sequences like \u00e1. My expected output is:
"Product" : "Tábua 21X40"
I've tried to use replace("\\u00e1", "á") but with no success. How can I replace this type of character and what type of character is this?
\u00e1 is another way of representing the á character when displaying the contents of a Python string.
If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you'll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn't exist in the string as those individual characters.
The \u escape sequence indicates that the following numbers specify a Unicode character.
Attempting to replace \u00e1 with á won't achieve anything, because that's what it already is. Additionally, replace("\\u00e1", "á") attempts to replace the individual characters (a backslash, a u, and so on) and, as mentioned, those don't actually exist in the string in that way.
If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.
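To see that for yourself, here is a tiny interactive check (Python 3):

```python
# The \u00e1 escape denotes a single character, not six separate ones.
s = "T\u00e1bua 21X40"
print(s)                    # Tábua 21X40
print(len(s))               # 11
print(s == "Tábua 21X40")   # True
```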
what type of character is this
Here
"Product" : "T\u00e1bua 21X40"
you might observe the \u escape sequence, which is followed by 4 hex digits: 00e1. Note that this is a different representation of the same character, so
print("\u00e1" == "á")
output
True
These are Unicode escape sequences (not character entities in the HTML/XML sense). There are different styles of escaping, and this is the JSON style. For demonstration, paste your string into an online JSON unescape tool and click unescape.
For your question: if you are using Python, you can solve the issue by importing the json module and decoding the value as follows.
import json
product = json.loads('"T\u00e1bua 21X40"')
print(product)

Python: how to split a string by a delimiter that is invisible (<0x0c>)

I have a document which contains the character <0x0c>.
Using re.split.
The problem is that it looks like this:
import re
re.split('', text)
(the delimiter between the quotes above is a literal form-feed character, which is invisible). Although it works, you can't see the character, and short of leaving a nice comment it is a great candidate to become one of those legacy-code lines that only I would understand.
How can I write it in a different, readable way?
You can express any character using escape codes. The 0x0C Form Feed ASCII codepoint can be expressed as \f or as \x0c:
re.split('\f', text)
See the Python string and byte literals syntax for more details on what escape sequences Python supports when defining a string literal value.
Note: you don't need the regex module to split on a fixed character sequence; you can just as well use str.split() here:
text.split('\f')
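As a quick sanity check, with a made-up document string, both forms produce the same result:

```python
import re

# Hypothetical document text with form-feed page breaks.
text = "page one\fpage two\fpage three"

# \f and \x0c name the same character, and str.split agrees with re.split here.
assert re.split('\f', text) == text.split('\f')
print(text.split('\f'))  # ['page one', 'page two', 'page three']
```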

Lxml trying to extract data with windows-1250 characters

Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this :
\xc2\x9ea
instead of the characters. Then I have a problem storing it in the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I somehow convert this back to the original characters?
Try (Python 2):
text = "\xc2\x9ea"
print text.decode('windows-1250').encode('utf-8')
Output:
ža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals and work with those:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
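That guard might be sketched like this in Python 3 (safe_ord and rebuild_bytes are names of my own choosing, and building bytes directly from the ordinals replaces the hex round-trip):

```python
def safe_ord(ch):
    # Refuse characters that don't fit in a single byte instead of
    # silently mangling them.
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError("character %r does not fit in one byte" % ch)
    return n

def rebuild_bytes(s):
    # Reinterpret each code point of the mis-decoded string as a raw byte.
    return bytes(safe_ord(ch) for ch in s)

print(rebuild_bytes('\x9ea').decode('cp1250'))  # ža
```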

How to url-safe encode a string with Python? urllib.quote seems wrong

Hello, I was wondering if you know any other way to encode a string to be URL-safe, because urllib.quote seems to be doing it wrong; the output is different than expected:
If i try
urllib.quote('á')
i get
'%C3%A1'
But that's not the correct output; it should be
%E1
As demonstrated by the tool provided on this site.
And this is not me being difficult: the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then try the JavaScript tool I mentioned, I get these strings respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how they are almost the same, yet the URL produced by quote doesn't work while the other one does.
I tried messing with encode('utf-8') on the string passed to quote, but it makes no difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a python bug?
Do you know some module that get this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
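For what it's worth, in Python 3 urllib.parse.quote percent-encodes via UTF-8 by default, and it also accepts an encoding argument if you really need the legacy single-byte form:

```python
from urllib.parse import quote

print(quote('á'))                         # %C3%A1  (RFC 3986 / UTF-8)
print(quote('á', encoding='iso-8859-1'))  # %E1     (legacy single-byte form)
```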
OK, got it: I have to encode to iso-8859-1, like this
word = u'á'
word = word.encode('iso-8859-1')
print word
Python 2 interprets source files as ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpreted as two ASCII chars.
Try putting a comment like the one below on the first or second line of your code to match the file encoding; you might also need to use u'á'.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII URLs; that's what I need. But I was hoping there was some encoding tool in the std lib for the job.

Replace numeric character references in XML document using Python

I am struggling with the following issue: I have an XML string that contains the following tag and I want to convert this, using cElementTree, to a valid XML document:
<tag>#55296;#57136;#55296;#57149;#55296;#57139;#55296;#57136;#55296;#57151;#55296;
#57154;#55296;#57136;</tag>
but each # sign is preceded by a & sign and hence the output looks like: 𐌰𐌽𐌳𐌰𐌿𐍂𐌰
This is a unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal XML in a valid XML document (see Parser error using Perl XML::DOM module, "reference to invalid character number")
I have tried different regular expression to match these numeric character references. For example, I have tried the following (Python) regex:
RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')
This does work in regular python session but as soon as I use the same regex in my code then it doesn't work, presumably because those numeric characters have been interpreted (and are shown as boxes or question marks).
I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.
Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?
Eurgh. You've got surrogates (UTF-16 code units in the range D800-DFFF), which some fool has incorrectly encoded individually instead of using a pair of code units for a single character. It would be ideal to replace this mess with what it should look like:
<tag>&#66352;&#66365;&#66355;&#66352;&#66367;&#66370;&#66352;</tag>
Or, just as valid, in literal characters (if you've got a font that can display the Gothic alphabet):
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Usually, it would be best to do replacement operations like this on parsed text nodes, to avoid messing up non-character-reference sequences in other places like comments or PIs. However of course that's not possible in this case since this isn't really XML at all. You could try to fix it up with a crude regex, though it would be better to find out where the invalid input is coming from and kick the person responsible until they fix it.
>>> def lenient_deccharref(m):
... return unichr(int(m.group(1)))
...
>>> tag= '<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>'
>>> re.sub('&#(\d+);', lenient_deccharref, tag).encode('utf-8')
'<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd\xf0\x90\x8c\xb3\xf0\x90\x8c\xb0\xf0\x90\x8c\xbf\xf0\x90\x8d\x82\xf0\x90\x8c\xb0</tag>'
This is the correct UTF-8 encoding of 𐌰𐌽𐌳𐌰𐌿𐍂𐌰. The utf-8 codec allows you to encode a sequence of surrogates to correct UTF-8 even on a wide-Unicode platform where the surrogates should not have appeared in the string in the first place.
>>> _.decode('utf-8')
u'<tag>\U00010330\U0001033d\U00010333\U00010330\U0001033f\U00010342\U00010330</tag>'
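On Python 3, where unichr is gone and encoding lone surrogates to UTF-8 raises an error, one way to sketch the same repair (assuming the same input string) is to decode the references with chr and then merge the surrogate pairs via a UTF-16 round trip with the surrogatepass error handler:

```python
import re

tag = '<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>'

# Turn each decimal character reference into the (surrogate) code unit it names.
decoded = re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), tag)

# Round-trip through UTF-16 so adjacent surrogate pairs combine into the
# astral characters they encode; surrogatepass permits the lone surrogates
# on the way out.
repaired = decoded.encode('utf-16', 'surrogatepass').decode('utf-16')
print(repaired)  # <tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
```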
