Converting a UTF-16 CSV to an array - Python

I've tried to convert a CSV file encoded in UTF-16 (exported by another program) to a simple array in Python 2.7, with very little luck.
Here's the nearest solution I've found:
import csv
from io import BytesIO

with open('c:\\pfm\\bdh.txt', 'rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')
for line in csv.reader(BytesIO(x)):
    print line
This code returns:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
['1\tnom1\tetq1\text1 ...
What I'm trying to get it's something like this:
[['','Nombre','Etiqueta','Extensión de archivo','Tamaño lógico','Categoría']
['1','nom1','etq1','ext1','123','cat1']
['2','nom2','etq2','ext2','456','cat2']]
So, I'd need to convert those hexadecimal escapes into Latin characters (such as á, é, í, ó, ú, or ñ), and split those tab-separated strings into array fields.
Do I really need to use dictionaries for the first part? I think there should be an easier solution, as I can see and type all these characters on my keyboard.
For the second part, I think the csv library won't help in this case, as I read it can't manage UTF-16 yet.
Could you give me a hand? Thank you!

ITEM #1: The hexadecimal characters
You are getting the:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
output because you are printing a list. The behaviour of the list is to print the representation of each item. That is, it is the equivalent of:
print('[{0}]'.format(','.join(repr(item) for item in lst)))
If you use print(line[0]) you will get the output of the line.
ITEM #2: The output
The problem here is that the csv parser is not parsing the content as a tab-separated file, but a comma-separated file. You can fix this by using:
for line in csv.reader(BytesIO(x), delimiter='\t'):
    print(line)
instead.
This will give you the desired result.
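As a cross-check, here is the same pair of fixes (UTF-16 decoding plus delimiter='\t') sketched in Python 3, where open() does the decoding for you; the file name and sample data are hypothetical:

```python
import csv

# Write a small UTF-16 tab-separated sample file (hypothetical data).
with open('demo_utf16.txt', 'w', encoding='utf-16') as f:
    f.write('\tNombre\tEtiqueta\n1\tnom1\tetq1\n')

# open() decodes UTF-16 to text; csv.reader just needs the tab delimiter.
with open('demo_utf16.txt', encoding='utf-16', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))

print(rows)  # [['', 'Nombre', 'Etiqueta'], ['1', 'nom1', 'etq1']]
```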

Processing a UTF-16 file with the csv module in Python 2 can indeed be a pain. Re-encoding to UTF-8 works, but you then still need to decode the resulting columns to produce unicode values.
Note also that your data appears to be tab delimited; the csv.reader() by default uses commas, not tabs, to separate columns. You'll need to configure it to use tabs instead by setting delimiter='\t' when constructing the reader.
Use io.open() to read UTF-16 and produce unicode lines. You can then use codecs.iterencode() to translate the decoded unicode values from the UTF-16 file to UTF-8.
To decode the rows back to unicode values, you could use an extra generator to do so as you iterate:
import csv
import codecs
import io
def row_decode(reader, encoding='utf8'):
    for row in reader:
        yield [col.decode(encoding) for col in row]

with io.open('c:\\pfm\\bdh.txt', encoding='utf16') as f:
    wrapped = codecs.iterencode(f, 'utf8')
    reader = csv.reader(wrapped, delimiter='\t')
    for row in row_decode(reader):
        print row
Each line will still use repr() on each contained value, which means that you'll see Python string literal syntax to represent strings. Any non-printable or non-ASCII codepoint will be represented by an escape code:
>>> [u'', u'Nombre', u'Etiqueta', u'Extensión de archivo', u'Tamaño lógico', u'Categoría']
[u'', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
This is normal; the output is meant to be useful as a debugging aid and can be pasted back into any Python session to reproduce the original value, without worrying about terminal encodings.
For example, ó is represented as \xf3, representing the Unicode codepoint U+00F3 LATIN SMALL LETTER O WITH ACUTE. If you were to print this one column, Python will encode the Unicode string to bytes matching your terminal encoding, resulting in your terminal showing you the correct string again:
>>> u'Extensi\xf3n de archivo'
u'Extensi\xf3n de archivo'
>>> print u'Extensi\xf3n de archivo'
Extensión de archivo
Demo:
>>> import csv, codecs, io
>>> io.open('/tmp/demo.csv', 'w', encoding='utf16').write(u'''\
... \tNombre\tEtiqueta\tExtensi\xf3n de archivo\tTama\xf1o l\xf3gico\tCategor\xeda
... ''')
63L
>>> def row_decode(reader, encoding='utf8'):
...     for row in reader:
...         yield [col.decode('utf8') for col in row]
...
>>> with io.open('/tmp/demo.csv', encoding='utf16') as f:
...     wrapped = codecs.iterencode(f, 'utf8')
...     reader = csv.reader(wrapped, delimiter='\t')
...     for row in row_decode(reader):
...         print row
...
[u' ', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
>>> # the row is displayed using repr() for each column; the values are correct:
...
>>> print row[3], row[4], row[5]
Extensión de archivo Tamaño lógico Categoría

Python string / bytes encoding with non ascii characters

I need to write a very simple function that reads from a json file and writes some of the content back to csv file.
The trouble is that the input json file has weird encoding format, for example :
{
"content": "b\"Comment minimiser l'impact environnemental d\\xe8s la R&D des proc\\xe9d\\xe9s micro\\xe9lectroniques."
}
I would like to write back
Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
The first problem is the 'b' prefix: the content should be read as a byte array, but it is read as a string.
The second is how to replace the weird characters.
Thank you
You could use something like this:
import json

json_file_path = 'your_json_file.json'
with open(json_file_path, 'r', encoding='utf-8') as j:
    # Remove the problematic "b\ prefix
    contents = j.read().replace('\"b\\', '')
    # Process the JSON
    contents = json.loads(contents)
    # Decode the string so the double backslashes are processed correctly
    output = contents['content'].encode('utf-8').decode('unicode_escape')
    print(output)
# Output
# Comment minimiser l'impact environnemental dès la R&D des procédés microélectroniques.
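If you'd rather not patch the raw JSON text, another sketch (the sample string here is hypothetical) strips the stringified b"..." wrapper and round-trips through latin-1/unicode_escape to turn the literal \xNN escapes back into characters; note this only works while the text stays within the Latin-1 range:

```python
# The value as it comes out of json.loads: a str that *looks* like a bytes literal.
raw = 'b"Comment minimiser l\'impact environnemental d\\xe8s la R&D"'
inner = raw[2:-1]  # drop the leading b" and the trailing "
# latin-1 maps each character to the same byte value; unicode_escape then resolves \xe8.
text = inner.encode('latin-1').decode('unicode_escape')
print(text)  # Comment minimiser l'impact environnemental dès la R&D
```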

Remove Weird Characters using python

I have this large SQL file with about 1 million inserts in it; some of the inserts are corrupted (about 6000) with weird characters that I need to remove so I can insert them into my DB.
Ex:
INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');
I want to remove only the weird characters and leave all of the normal ones.
I tried using the following code to do so:
import fileinput
import string

fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")
for line in fileOld:  # in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
    print(line)
    s = line
    printable = set(string.printable)
    filter(lambda x: x in printable, s)
    print(s)
    file.write(s)
but it doesn't seem to be working: when I print s it is the same as line, and what's stranger is that nothing gets written to the file.
Any advice or tips on how to solve this would be useful
import string

strg = "'2268032019', Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
newstrg = ""
acc = """ '",{}[].`;: """
for x in strg:
    if x in string.ascii_letters or x in string.digits or x in acc:
        newstrg += x
print(newstrg)
Output:
'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
You can check if the element of the string is in ASCII letters and then create a new string without non-ASCII letters.
Also it depends on your variable type. If you work with lists, you don't have to define a new variable. Just del mylist[x] will work.
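As an aside, the filter() call in the question discards its result, which is why nothing appears to change; assigning the filtered characters back (sketched here in Python 3 syntax, with é standing in for the corrupted bytes) makes the printable-only approach work:

```python
import string

printable = set(string.printable)
line = "Petite histoire de la d\u00e9sinformation"  # é is not in string.printable
s = ''.join(ch for ch in line if ch in printable)   # keep only printable ASCII
print(s)  # Petite histoire de la dsinformation
```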
You can use regular expressions' re.sub() to do simple string replacements.
https://docs.python.org/2/library/re.html#re.sub
# -*- coding: utf-8 -*-
import re

dirty_string = u'©sinformation'
# In the first param, put a regex to screen for; in this case I negated the desired characters.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)
print clean_string
# Output:
# sinformation

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

I'm trying to save concrete content of the dictionary to a file but when I try to write it, I get the following error:
Traceback (most recent call last):
File "P4.py", line 83, in <module>
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)
And here is the code:
from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)
with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word, count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Given the test files, for each word, assign it the most probable tag
according to the model. Save the result in files where each line has
this format: Palabra Prediccion
"""
file=open("lexic.txt", "r")  # we open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()
diccionario = {}
"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Here we split the string into a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"):  # We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])):  # Here we check if the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0])  # We know the name already exists in the dict, so we get its list of values
            aux_list.append([sintagma[1], sintagma[2]])  # We add to the list the tag and the ocurrences for this concrete word
            diccionario.update({sintagma[0]: aux_list})  # Update the value with the new list (new list = previous list + newly appended element)
        else:  # If the key does not exist in the dict, we add the values to the empty list (no need to append)
            aux_list_else = ([sintagma[1], sintagma[2]])
            diccionario.update({sintagma[0]: aux_list_else})
"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])
For retrieve the information from diccionario, we have to keep in mind:
In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:
diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())  # We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))  # tagSugerido is the tag with more ocurrences for a concrete keyword
        maximo = float(diccionario.get(keyword)[1])  # maximo is a variable for the maximum number of ocurrences in a keyword
        if ((len(diccionario.get(keyword))) > 2):  # in case we have > 2 tags for a concrete word
            suma = float(diccionario.get(keyword)[1])
            for i in range(2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8')
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma)
            diccionario2.update({keyword: ([tagSugerido, suma, probabilidad])})
        else:
            diccionario2.update({keyword: ([diccionario.get(keyword)[0], diccionario.get(keyword)[1], 1])})
        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))
The desired output will look like this:
keyword(String) tagSugerido(String):
Hello NC
Friend N
Run V
...etc
The conflictive line is:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
Thank you.
Like zmo suggested:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
A note on unicode in Python 2
Your software should only work with unicode strings internally, converting to a particular encoding on output.
To prevent making the same error over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
ASCII needs just one byte to represent all possible characters in the ASCII charset/encoding. UTF-8 needs up to four bytes per character to represent the complete Unicode charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
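Those byte counts can be verified directly; a quick sketch (Python 3 syntax, where all string literals are Unicode):

```python
# One sample code point per range, encoded to UTF-8 and measured in bytes.
assert len('A'.encode('utf-8')) == 1           # code point < 128: one byte
assert len('\u00e9'.encode('utf-8')) == 2      # é (U+00E9): 128..0x7ff, two bytes
assert len('\u20ac'.encode('utf-8')) == 3      # € (U+20AC): > 0x7ff, three bytes
assert len('\U0001D11E'.encode('utf-8')) == 4  # 𝄞 (U+1D11E): four bytes
print('all byte counts match')
```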
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a unicode string. A str can hold bytes in any encoding, such as ascii or utf-8, while a unicode object holds abstract code points.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
/\
/ \
str unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u'': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
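The rules above amount to one round trip, sketched here in Python 3 terms (where str plays the role of Python 2's unicode, and bytes the role of its str):

```python
word = 'c\u00f3digo'               # Unicode text, "código"
encoded = word.encode('utf-8')     # Unicode -> bytes on output
assert encoded == b'c\xc3\xb3digo' # ó becomes the two bytes 0xC3 0xB3
decoded = encoded.decode('utf-8')  # bytes -> Unicode on input
assert decoded == word             # the round trip is lossless
```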
As you're not giving a simple concise code to illustrate your question, I'll just give you a general advice on what should be the error:
If you're getting a decode error, it's that tagSugerido is read as ASCII and not as Unicode. To fix that, you should do:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))
to store it as an unicode.
Then you're likely to get an encode error at the write() stage, and you should fix your write the following way:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
I literally answered a very similar question moments ago. And when working with unicode strings, switch to Python 3, it'll make your life easier!
If you cannot switch to python3 just yet, you can make your python2 behave like it is almost python3, using the python-future import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
N.B.: instead of doing:
file=open("lexic.txt", "r")  # we open the lexic file (our model) (try with this one)
data=file.readlines()
file.close()
which will fail to properly close the file descriptor if readlines fails, you should instead do:
with open("lexic.txt", "r") as f:
    data = f.readlines()
which will take care of always closing the file even upon failure.
N.B.2: Avoid using file as this is a python type you're shadowing, but use f or lexic_file…

CSV to dict, dict not finding the item

I am converting a CSV to dict, all the values are loaded correctly but with one issue.
CSV :
Testing testing\nwe are into testing mode
My\nServer This is my server.
When I convert the CSV to dict and if I try to use dict.get() method it is returning None.
When I debug, I get the following output:
{'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
The My\nServer key is having an extra backslash.
If I do .get("My\nServer"), I am getting the output as None.
Can anyone help me?
#!/usr/bin/env python
import os
import codecs
import json
from csv import reader

def get_dict(path):
    with codecs.open(path, 'r', 'utf-8') as msgfile:
        data = msgfile.read()
        data = reader([r.encode('utf-8') for r in data.splitlines()])
        newdata = []
        for row in data:
            newrow = []
            for val in row:
                newrow.append(unicode(val, 'utf-8'))
            newdata.append(newrow)
        return dict(newdata)
thanks
You either need to escape the newline properly, using \\n:
>>> d = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> d.get('My\\nServer')
'This is my server.'
or you can use a raw string literal which doesn't need extra escaping:
>>> d.get(r'My\nServer')
'This is my server.'
Note that raw string will treat all the backslash escape sequences this way, not just the newline \n.
In case you are getting the values dynamically, you can use str.encode with string_escape or unicode_escape encoding:
>>> k = 'My\nServer' # API call result
>>> k.encode('string_escape')
'My\\nServer'
>>> d.get(k.encode('string_escape'))
'This is my server.'
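Note that the string_escape codec exists only in Python 2; in Python 3 a comparable round trip (a sketch, not part of the original answer) goes through the unicode_escape codec:

```python
k = 'My\nServer'  # key containing a real newline
# Escape the newline into the two characters backslash + 'n'.
escaped = k.encode('unicode_escape').decode('ascii')
print(escaped == 'My\\nServer')  # True
```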
"\n" is a newline.
If you want to represent text like "---\n---" in Python without an actual newline in it, you have to escape it.
The way you write it in code and how it gets printed differ: in code, you have to write "\\" (unless you use a raw string); when printed, the extra backslash will not be seen.
So in your code, you shall ask:
>>> dct = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> dct.get("My\\nServer")
'This is my server.'

iterate through unicode strings and compare with unicode in python dictionary

I have two python dictionaries containing information about japanese words and characters:
vocabDic : contains vocabulary, key: word, value: dictionary with information about it
kanjiDic : contains kanji ( single japanese character ), key: kanji, value: dictionary with information about it
Now I would like to iterate through each character of each word in the vocabDic and look up this character in the kanji dictionary. My goal is to create a csv file which I can then import into a database as join table for vocabulary and kanji.
My Python version is 2.6
My code is as following:
kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
kanjiVocabJoinCount = 1
# loop through dictionary
for key, val in vocabDic.iteritems():
    if val['lang'] is 'jpn':  # only check japanese words
        vocab = val['text']
        print vocab
        # loop through vocab string
        for v in vocab:
            test = kanjiDic.get(v)
            print v
            print test
            if test is not None:
                print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                kanjiVocabJoinWriter([str(kanjiVocabJoinCount),str(test['id']),str(val['id'])])
                kanjiVocabJoinCount = kanjiVocabJoinCount+1
If I print the variables to the command line, I get:
vocab : works, prints in japanese
v ( one character of the vocab in the for loop ) : �
test ( character looked up in the kanjiDic ) : None
To me it seems like the for loop messes the encoding up.
I tried various functions ( decode, encode.. ) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.
From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.
For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:
In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8') # val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'
If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.
However if you loop over the unicode object, you'd get one unicode character at a time:
In [45]: for v in u'債務の天井':
   ....:     print(v)
   ....:
債
務
の
天
井
Note that the first unicode character, encoded in utf-8, is 3 bytes:
In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'
That's why looping over the bytes, printing one byte at a time, (e.g. print \xe5) fails to print a recognizable character.
So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:
vocab=val['text'].decode('utf-8')
If you are not sure what encoding val['text'] is in, post the output of
print(repr(vocab))
and maybe we can guess the encoding.
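The diagnosis above can be checked in a few lines (Python 3 shown; the byte/character counts are the same in Python 2 with u'' literals):

```python
vocab = '\u50b5\u52d9\u306e\u5929\u4e95'  # 債務の天井
encoded = vocab.encode('utf-8')           # the kind of str object the question loops over
assert len(vocab) == 5                    # five characters
assert len(encoded) == 15                 # three UTF-8 bytes per character here
assert list(vocab)[0] == '\u50b5'         # iterating text yields whole characters
print('character vs byte counts check out')
```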
