Python string / bytes encoding with non ascii characters

Python string / bytes encoding with non ascii characters - python

I need to write a very simple function that reads from a json file and writes some of the content back to csv file.
The trouble is that the input json file has weird encoding format, for example :
{
"content": "b\"Comment minimiser l'impact environnemental d\\xe8s la R&D des proc\\xe9d\\xe9s micro\\xe9lectroniques."
}
I would like to write back
Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
The first problem is the 'b' so the content should read as a byte array but it is read as a string.
The second one is how to replace the weird characters ?
Thank you

You could use something like this:
json_file_path = 'your_json_file.json'
with open(json_file_path, 'r', encoding='utf-8') as j:
# Remove problematic "b\ character
j = j.read().replace('\"b\\',"");
# Process json
contents = json.loads(j)
# Decode string to process correctly double backslashes
output = contents['content'].encode('utf-8').decode('unicode_escape')
print(output)
# Output
# Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.

Related

More pythonic to convert bytes to string while processing urllib response instead of chr(int(x))

I am late convert to Python 3. I am trying to process output from a REST api for protein sequences using urllib.
In legacy python I could use:
self.seq_fileobj = urllib2.urlopen("http://www.uniprot.org/uniprot/{}.fasta".format(uniprot_id))
self.seq_header = self.seq_fileobj.next()
print "Read in sequence information for {}.".format(self.seq_header[:-1])
self.sequence = [achar for a_line in self.seq_fileobj for achar in a_line if achar != "\n"]
print("Sequence:{}\n".format("".join(self.sequence)))
For the same section of code in python 3, I use:
context = ssl._create_unverified_context()
self.seq_fileobj = urllib.request.urlopen("https://www.uniprot.org/uniprot/{}.fasta".format(uniprot_id),context=context)
self.seq_header = next(self.seq_fileobj)
print("Read in sequence information for {}.".format(self.seq_header.rstrip()))
self.b_sequence = [str(achar).encode('utf-8') for a_line in self.seq_fileobj for achar in a_line]
self.sequence = [chr(int(x)) for x in self.b_sequence]
I have read a little about string encoding and decoding to modify my list comprehension for python 3:
self.b_sequence = [str(achar).encode('utf-8') for a_line in self.seq_fileobj for achar in a_line]
self.sequence = [chr(int(x)) for x in self.b_sequence]
Although my code is working- is this the best way to achieve this result where I go from an array of bytes of ascii characters encoded with utf-8 to their resulting strings?. The chr(int(x)) bit is what seems un pythonic to me and I fear I may be missing something.

You don't need to convert the bytes to strings on a character-to-character basis. Since you want to strip out the newline characters, you can instead read the entire file as bytes, convert the bytes to strings with the decode method (which defaults to the utf-8 encoding as you are using) and remove the newline characters using the str.replace method:
self.sequence = list(self.seq_fileobj.read().decode().replace('\n', ''))

Formatting json data fetched from URL by removing escape character

I have fetched json data from url and write it to in a file name urljson.json
i want to format the json data removing '\' and result [] key for requirment purpose
In my json file the data are arranged like this
{\"result\":[{\"BldgID\":\"1006AVE \",\"BldgName\":\"100-6th Avenue SW (Oddfellows) \",\"BldgCity\":\"Calgary \",\"BldgState\":\"AB \",\"BldgZip\":\"T2G 2C4 \",\"BldgAddress1\":\"100-6th Avenue Southwest \",\"BldgAddress2\":\"ZZZ None\",\"BldgPhone\":\"4035439600 \",\"BldgLandlord\":\"1006AV\",\"BldgLandlordName\":\"100-6 TH Avenue SW Inc. \",\"BldgManager\":\"AVANDE\",\"BldgManagerName\":\"Alyssa Van de Vorst \",\"BldgManagerType\":\"Internal\",\"BldgGLA\":\"34242\",\"BldgEntityID\":\"1006AVE \",\"BldgInactive\":\"N\",\"BldgPropType\":\"ZZZ None\",\"BldgPropTypeDesc\":\"ZZZ None\",\"BldgPropSubType\":\"ZZZ None\",\"BldgPropSubTypeDesc\":\"ZZZ None\",\"BldgRetailFlag\":\"N\",\"BldgEntityType\":\"REIT \",\"BldgCityName\":\"Calgary \",\"BldgDistrictName\":\"Downtown \",\"BldgRegionName\":\"Western Canada \",\"BldgAccountantID\":\"KKAUN \",\"BldgAccountantName\":\"Kendra Kaun \",\"BldgAccountantMgrID\":\"LVALIANT \",\"BldgAccountantMgrName\":\"Lorretta Valiant \",\"BldgFASBStartDate\":\"2012-10-24\",\"BldgFASBStartDateStr\":\"2012-10-24\"}]}
I want it like this format
[
{
"BldgID":"1006AVE",
"BldgName":"100-6th Avenue SW (Oddfellows) ",
"BldgCity":"Calgary ",
"BldgState":"AB ",
"BldgZip":"T2G 2C4 ",
"BldgAddress1":"100-6th Avenue Southwest ",
"BldgAddress2":"ZZZ None",
"BldgPhone":"4035439600 ",
"BldgLandlord":"1006AV",
"BldgLandlordName":"100-6 TH Avenue SW Inc. ",
"BldgManager":"AVANDE",
"BldgManagerName":"Alyssa Van de Vorst ",
"BldgManagerType":"Internal",
"BldgGLA":"34242",
"BldgEntityID":"1006AVE ",
"BldgInactive":"N",
"BldgPropType":"ZZZ None",
"BldgPropTypeDesc":"ZZZ None",
"BldgPropSubType":"ZZZ None",
"BldgPropSubTypeDesc":"ZZZ None",
"BldgRetailFlag":"N",
"BldgEntityType":"REIT ",
"BldgCityName":"Calgary ",
"BldgDistrictName":"Downtown ",
"BldgRegionName":"Western Canada ",
"BldgAccountantID":"KKAUN ",
"BldgAccountantName":"Kendra Kaun ",
"BldgAccountantMgrID":"LVALIANT ",
"BldgAccountantMgrName\":" Lorretta Valiant ",
"BldgFASBStartDate":"2012-10-24",
"BldgFASBStartDateStr":"2012-10-24"
} `
]
i have tried replace("\","") but nothing changed
Here is my code
import json
import urllib2
urllink=urllib2.urlopen("url").read()
print urllink -commented out
with open('urljson.json','w')as outfile:
json.dump(urllink,outfile)
jsonfile='urljson.json'
jsondata=open(jsonfile)
data=json.load(jsondata)
data.replace('\'," ") --commented out
print (data)
but it is saying fileobject has no replace attribute, I didnt find any idea how to remove 'result' and most outer "{}"
kindly guide me
i think the file object is not parsed in string somehow .i am beginner in python
thank you

JSON is a serialized encoding for data. urllink=urllib2.urlopen("url").read() read that serialized string. With json.dump(urllink,outfile) you serialized that single serialized JSON string again. You double-encoded it and that's why you see those extra "\" escape characters. json needs to escape those characters so as not to confuse them with the quotes it uses to demark strings.
If you wanted the file to hold the original json, you wouldn't need to encode it again, just do
with open('urljson.json','w')as outfile:
outfile.write(urllink)
But it looks like you want to grab the "result" list and only save that. So, decode the JSON into python, grab the bits you want, and encode it again.
import json
import codecs
import urllib2
# read a json string from url
urllink=urllib2.urlopen("url").read()
# decode and grab result list
result = json.loads(urllink)['result']
# write the json to a file
with open('urljson.json','w')as outfile:
json.dump(result, outfile)

\ is escape character in json:
you can load json string to python dict:

Tidy up the JSON object before writing it to file. It has lot of whitespace noise. Try like this:
urllink = {a.strip():b.strip() for a,b in json.loads(urllink).values()[0][0].items()}
jsonobj = json.loads(json.dumps(urllink))
with open('urljson.json','w') as outfile:
json.dump(jsonobj, outfile)
For all objects:
jsonlist = []
for dirtyobj in json.loads(urllink)['result']:
jsonlist.append(json.loads(json.dumps({a.strip():b.strip() for a,b in dirtyobj.items()})))
with open('urljson.json','w') as outfile:
json.dump(json.loads(json.dumps(jsonlist)), outfile)
Don't wanna tidy up? Then simply do this:
jsonobj = json.loads(urllink)
And you can't do '\', it's syntax error. The second ' is escaped and is not considered as closing quote.
data.replace('\'," ")
Why can't Python's raw string literals end with a single backslash?

Remove Weird Characters using python

I have this large SQL file with about 1 milllion inserts in it, some of the inserts are corrupted (about 6000) with weird characters that i need to remove so i can insert them into my DB.
Ex:
INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la dÃƒ?Ã‚Â©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');
i want to remove only the weird characters and leave all of the normal ones
I tried using the following code to do so:
import fileinput
import string
fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")
for line in fileOld: #in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
print(line)
s = line
printable = set(string.printable)
filter(lambda x: x in printable, s)
print(s)
file.write(s)
but it doesnt seem to be working, when i print s it is the same as what is printed during line and whats stranger is that nothing gets written to the file.
Any advice or tips on how to solve this would be useful

import string
strg = "'2268032019', Petite histoire de la dÃƒ?Ã‚Â©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
newstrg = ""
acc = """ '",{}[].`;: """
for x in strg:
if x in string.ascii_letters or x in string.digits or x in acc:
newstrg += x
print (newstrg)
Output;
'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
>>>
You can check if the element of the string is in ASCII letters and then create a new string without non-ASCII letters.
Also it depends on your variable type. If you work with lists, you don't have to define a new variable. Just del mylist[x] will work.

You can use regular expressions sub() to do simple string replacements.
https://docs.python.org/2/library/re.html#re.sub
# -*- coding: utf-8 -*-
import re
dirty_string = u'Ã‚Â©sinformation'
# in first param, put a regex to screen for, in this case I negated the desired characters.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)
print clean_string
# Outputs
>>> sinformation

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

I'm trying to save concrete content of the dictionary to a file but when I try to write it, I get the following error:
Traceback (most recent call last):
File "P4.py", line 83, in <module>
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)
And here is the code:
from collections import Counter
with open("corpus.txt") as inf:
wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)
with open("lexic.txt", "w") as outf:
outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
for word,count in wordtagcount.iteritems():
outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Dados los ficheros de test, para cada palabra, asignarle el tag mas
probable segun el modelo. Guardar el resultado en ficheros que tengan
este formato para cada linea: Palabra Prediccion
"""
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
diccionario = {}
"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
aux = linea.decode('latin_1').encode('utf-8')
sintagma = aux.split('\t') # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
if (diccionario.has_key(sintagma[0])): #Here we check if the word was included before in the dictionary
aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
aux_list_else = ([sintagma[1],sintagma[2]])
diccionario.update({sintagma[0]:aux_list_else})
"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])
For retrieve the information from diccionario, we have to keep in mind:
In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:
diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
for keyword in diccionario:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) #tagSugerido is the tag with more ocurrences for a concrete keyword
maximo = float(diccionario.get(keyword)[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
suma = float(diccionario.get(keyword)[1])
for i in range (2, len(diccionario.get(keyword))):
suma += float(diccionario.get(keyword)[i][1])
if (diccionario.get(keyword)[i][1] > maximo):
tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8'))
maximo = float(diccionario.get(keyword)[i][1])
probabilidad = float(maximo/suma);
diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})
else:
diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))
The desired output will look like this:
keyword(String) tagSugerido(String):
Hello NC
Friend N
Run V
...etc
The conflictive line is:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
Thank you.

Like zmo suggested:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
A note on unicode in Python 2
Your software should only work with unicode strings internally, converting to a particular encoding on output.
Do prevent from making the same error over and over again you should make sure you understood the difference between ascii and utf-8 encodings and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
Ascii needs just one byte to represent all possible characters in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
The difference between str and unicode objects:
You can say that str is baiscally a byte string and unicode is a unicode string. Both can have a different encoding like ascii or utf-8.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
/\
/ \
str unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding=”utf-8″): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u”: Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.

As you're not giving a simple concise code to illustrate your question, I'll just give you a general advice on what should be the error:
If you're getting a decode error, it's that tagSugerido is read as ASCII and not as Unicode. To fix that, you should do:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))
to store it as an unicode.
Then you're likely to get an encode error at the write() stage, and you should fix your write the following way:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
I litterally answered a very similar question moments ago. And when working with unicode strings, switch to python3, it'll make your life easier!
If you cannot switch to python3 just yet, you can make your python2 behave like it is almost python3, using the python-future import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
N.B.: instead of doing:
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
which will fail to close properly the file descriptor upon failure during readlines, you should better do:
with open("lexic.txt", "r") as f:
data=f.readlines()
which will take care of always closing the file even upon failure.
N.B.2: Avoid using file as this is a python type you're shadowing, but use f or lexic_file…

Converting a utf16 csv to array

I've tried to convert a CSV file coded in UTF-16 (exported by another program) to a simple array in Python 2.7 with very few luck.
Here's the nearest solution I've found:
from io import BytesIO
with open ('c:\\pfm\\bdh.txt','rb') as f:
x = f.read().decode('UTF-16').encode('UTF-8')
for line in csv.reader(BytesIO(x)):
print line
This code returns:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
['1\tnom1\tetq1\text1 ...
What I'm trying to get it's something like this:
[['','Nombre','Etiqueta','Extensión de archivo','Tamaño lógico','Categoría']
['1','nom1','etq1','ext1','123','cat1']
['2','nom2','etq2','ext2','456','cat2']]
So, I'd need to convert those hexadecimal chars to latin typos (as: á,é,í,ó,ú, or ñ), and those tab-separated strings into arrays fields.
Do I really need to use dictionaries for the first part? I think there should be an easier solution, as I can see and write all these characater by keyboard.
For the second part, I think the CSV library won't help in this case, as I read it can't manage UTF-16 yet.
Could you give me a hand? Thank you!

ITEM #1: The hexadecimal characters
You are getting the:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
output because you are printing a list. The behaviour of the list is to print the representation of each item. That is, it is the equivalent of:
print('[{0}]'.format(','.join[repr(item) for item in lst]))
If you use print(line[0]) you will get the output of the line.
ITEM #2: The output
The problem here is that the csv parser is not parsing the content as a tab-separated file, but a comma-separated file. You can fix this by using:
for line in csv.reader(BytesIO(s), delimiter='\t'):
print(line)
instead.
This will give you the desired result.

Processing a UTF-16 file with the csv module in Python 2 can indeed be a pain. Re-encoding to UTF-8 works, but you then still need to decode the resulting columns to produce unicode values.
Note also that your data appears to be tab delimited; the csv.reader() by default uses commas, not tabs, to separate columns. You'll need to configure it to use tabs instead by setting delimiter='\t' when constructing the reader.
Use io.open() to read UTF-16 and produce unicode lines. You can then use codecs.iterencode() to translate the decoded unicode values from the UTF-16 file to UTF-8.
To decode the rows back to unicode values, you could use an extra generator to do so as you iterate:
import csv
import codecs
import io
def row_decode(reader, encoding='utf8'):
for row in reader:
yield [col.decode('utf8') for col in row]
with io.open('c:\\pfm\\bdh.txt', encoding='utf16') as f:
wrapped = codecs.iterencode(f, 'utf8')
reader = csv.reader(wrapped, delimiter='\t')
for row in row_decode(reader):
print row
Each line will still use repr() on each contained value, which means that you'll see Python string literal syntax to represent strings. Any non-printable or non-ASCII codepoint will be represented by an escape code:
>>> [u'', u'Nombre', u'Etiqueta', u'Extensión de archivo', u'Tamaño lógico', u'Categoría']
[u'', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
This is normal; the output is meant to be useful as a debugging aid and can be pasted back into any Python session to reproduce the original value, without worrying about terminal encodings.
For example, ó is represented as \xf3, representing the Unicode codepoint U+00F3 LATIN SMALL LETTER O WITH ACUTE. If you were to print this one column, Python will encode the Unicode string to bytes matching your terminal encoding, resulting in your terminal showing you the correct string again:
>>> u'Extensi\xf3n de archivo'
u'Extensi\xf3n de archivo'
>>> print u'Extensi\xf3n de archivo'
Extensión de archivo
Demo:
>>> import csv, codecs, io
>>> io.open('/tmp/demo.csv', 'w', encoding='utf16').write(u'''\
... \tNombre\tEtiqueta\tExtensi\xf3n de archivo\tTama\xf1o l\xf3gico\tCategor\xeda
... ''')
63L
>>> def row_decode(reader, encoding='utf8'):
... for row in reader:
... yield [col.decode('utf8') for col in row]
...
>>> with io.open('/tmp/demo.csv', encoding='utf16') as f:
... wrapped = codecs.iterencode(f, 'utf8')
... reader = csv.reader(wrapped, delimiter='\t')
... for row in row_decode(reader):
... print row
...
[u' ', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
>>> # the row is displayed using repr() for each column; the values are correct:
...
>>> print row[3], row[4], row[5]
Extensión de archivo Tamaño lógico Categoría

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python string / bytes encoding with non ascii characters - python

Related

More pythonic to convert bytes to string while processing urllib response instead of chr(int(x))

Formatting json data fetched from URL by removing escape character

Remove Weird Characters using python

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

Converting a utf16 csv to array

Categories

Resources