I am trying to parse each line of a database file to get it ready for import. It has fixed-width lines, but the widths are in characters, not in bytes. I have coded something based on Martineau's answer, but I am having trouble with the special characters.
Sometimes they break the expected width, and other times they throw a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using struct.unpack and still decode the special characters correctly? I think the problem is that they are encoded in multiple bytes, which messes up the expected field widths, since those widths are in bytes and not in characters.
import os, csv

def ParseLine(arquivo):
    import struct, string
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)

with open("Sample data.txt", 'r') as arq:
    with open("Out" + ".csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])
Sample data:
| field 1| field 2 | field 3 |
| sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
| resaodra | rôn. 2x 17/220V | sreao.tttra v |
| esarod sê | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |
Actual output:
field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x 17/22;V | sreao.tttra
In the output we see lines 1 and 2 are as expected. Line 3 got wrong widths, probably due to the multibyte ô. Line 4 throws the following exception:
Traceback (most recent call last):
File "C:\Sample\FindSample.py", line 18, in <module>
for line in ParseLine(arq):
File "C:\Sample\FindSample.py", line 9, in ParseLine
fields = unpack(line)
File "C:\Sample\FindSample.py", line 7, in <lambda>
unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
File "C:\Sample\FindSample.py", line 7, in <genexpr>
unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data
I will need to perform specific operations on each field, so I can't use a re.sub on the whole file as I was doing before. I would like to keep this code, as it seems efficient and is on the brink of working. If there is a much more efficient way to parse, though, I could give it a try. I need to keep the special characters.
Indeed, the struct approach falls down here because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of codepoints.
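You can see the mismatch directly: an accented character is one codepoint but two UTF-8 bytes, so every such character shifts the byte offsets that struct.unpack_from counts on:

>>> len('rôn')                    # codepoints -- what your column widths count
3
>>> len('rôn'.encode('utf-8'))    # bytes -- what struct sees
4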
I'd not use struct here at all. Your lines are already decoded to Unicode values; just use slicing to extract your data:
def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]
This deals entirely in characters in an already decoded line, rather than bytes. If you have field widths instead of indices, the slice() objects could also be generated:
def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # delimiter
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo):
    widths = (12, 18, 16)
    for line in arquivo:
        yield [line[s].strip() for s in widths_to_slices(widths)]
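As a quick sanity check, the generated slices match the hand-written ones above:

>>> list(widths_to_slices((12, 18, 16)))
[slice(1, 13), slice(14, 32), slice(33, 49)]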
Demo:
>>> sample = '''\
... | field 1| field 2 | field 3 |
... | sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
... | resaodra | rôn. 2x 17/220V | sreao.tttra v |
... | esarod sê | raesodaso t.thl o| .tdosadot. osa |
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
... '''.splitlines()
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
...
>>> for line in ParseLine(sample):
...     print(line)
...
['field 1', 'field 2', 'field 3']
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
['resaodra', 'rôn. 2x 17/220V', 'sreao.tttra v']
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
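Tying this back to your original script, the generator drops straight in; the one thing worth adding is an explicit encoding on both files (utf-8 below is an assumption — use whatever encoding your database export actually uses):

import csv

def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

# encoding='utf-8' is an assumption; match the encoding of your export
with open("Sample data.txt", "r", encoding="utf-8") as arq:
    with open("Out.csv", "w", newline="", encoding="utf-8") as sai:
        writer = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL)
        for fields in ParseLine(arq):
            writer.writerow(fields)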
I am writing a little script that should help me edit the EXIF metadata of JPEG files in Python, especially the 'artist' field, using the exif module in Python 3. However, as I am German, I have to work on a few files where the author field contains an Umlaut, such as 'ü'. If I now open one of these files in 'rb' mode, create an exif Image object with myimgobj=Image(myfile) and try to access myimgobj.artist, I get a long list of multiple (!) UnicodeDecodeErrors which are basically all the same:
'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
For some of the error messages, it is not position 9, but 0, but I guess this can all be traced back to the same reason - the Umlaut. Everything works fine if there is no Umlaut in the field.
Is there any way I can work with the exif package and extract the artist, even if it contains an Umlaut?
Edit: To provide a minimal example, please consider any JPEG image where you set the artist field to 'ä' ( I'd upload one, but the EXIF tags get removed during the upload). It then fails for example when I try to print the artist like this:
from exif import Image
with open('Umlaut.jpg','rb') as imgfile:
    my_image = Image(imgfile)
    print(my_image.artist)
Use the following:
import exifread
with open('Umlaut.jpg','rb') as imgfile:
    tags = exifread.process_file(imgfile)
    print(tags)  # all tags
    for i, tag in enumerate(tags):
        print(i, tag, tags[tag])  # tag by tag
Result, tested with the string äüÃ (== b'\xc3\xa4\xc3\xbc\xc3\x83'.decode('utf8')) inserted manually into the Authors field: .\SO\65720067.py
{'Image Artist': (0x013B) ASCII=äüÃ @ 2122, 'Image ExifOffset': (0x8769) Long=2130 @ 30, 'Image XPAuthor': (0x9C9D) Byte=äüÃ @ 4210, 'Image Padding': (0xEA1C) Undefined=[] @ 62, 'EXIF Padding': (0xEA1C) Undefined=[] @ 2148}
0 Image Artist äüÃ
1 Image ExifOffset 2130
2 Image XPAuthor äüÃ
3 Image Padding []
4 EXIF Padding []
In the light of these facts, you can change your code to
from exif import Image
with open('Umlaut.jpg','rb') as imgfile:
    my_image = Image(imgfile)
    # print(my_image.artist)  # error described below
    print(my_image.xp_author)  # äüÃ as expected
BTW, running your code unchanged, the following occurs (where each … horizontal ellipsis represents a bunch of messages omitted from the full error traceback):
…
+--------+-----------+-------+----------------------+------------------+
| Offset | Access | Value | Bytes | Type |
+--------+-----------+-------+----------------------+------------------+
| | | | | AsciiZeroTermStr |
| | [0:0] | '' | | |
| 0 | --error-- | | c3 a4 c3 bc c3 83 00 | |
+--------+-----------+-------+----------------------+------------------+
UnicodeDecodeError occurred during unpack operation:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
…
+--------+-----------+-------+-------------------+----------+
| Offset | Access | Value | Bytes | Type |
+--------+-----------+-------+-------------------+----------+
| | | | | AsciiStr |
| | [0:0] | '' | | |
| 0 | --error-- | | c3 a4 c3 bc c3 83 | |
+--------+-----------+-------+-------------------+----------+
UnicodeDecodeError occurred during unpack operation:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
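As a side note: if a tool ever hands you the mojibake variant instead (e.g. 'Ã¤' where 'ä' was meant, from UTF-8 bytes decoded as Latin-1), the usual round-trip repairs it. This is a general technique, not part of the exif or exifread APIs:

>>> 'Ã¤Ã¼'.encode('latin-1').decode('utf-8')  # undo a Latin-1 misdecode
'äü'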
I want to print the items from a list on the same line.
The code I have tried:
dice_art = ["""
-------
| |
| N |
| |
------- ""","""
-------
| |
| 1 |
| |
------- """] etc...
player = [0, 1, 2]
for i in player:
    print(dice_art[i], end='')
The output is:
ASCII0
ASCII1
ASCII2
I want the output to be:
ASCII0 ASCII1 ASCII2
This code still prints the ASCII art representation of my die on a new line. I would like to print it on the same line to save space and show each player's roll on one screen.
Since the elements of dice_art are multiline strings, this is going to be harder than a simple end='' change.
First, remove newlines from the beginning of each string and make sure all lines in ASCII art have the same length.
Then try the following
player = [0, 1, 2]
lines = [dice_art[i].splitlines() for i in player]
for l in zip(*lines):
    print(*l, sep='')
If you apply the described changes to your ASCII art, the code will print
------- ------- -------
| || || |
| N || 1 || 2 |
| || || |
------- ------- -------
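The normalization described above doesn't have to be done by hand in the source. A small sketch (normalize is a hypothetical helper, not part of the question's code) that strips the leading newline and pads every line to the width of the art's widest line:

def normalize(art):
    # drop the leading newline, then pad each line to a uniform width
    lines = art.lstrip('\n').splitlines()
    width = max(len(line) for line in lines)
    return '\n'.join(line.ljust(width) for line in lines)

dice_art = [normalize(art) for art in dice_art]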
The fact that your boxes are multiline changes everything.
Your intended output, as I understand it, is this:
------- -------
| || |
| N || 1 | ...and so on...
| || |
------- -------
You can do this like so:
art_split = [art.split("\n") for art in dice_art]
zipped = zip(*art_split)
for elems in zipped:
    print("".join(elems))
# ------- -------
# | || |
# | N || 1 |
# | || |
# ------- -------
N.B. You need to guarantee that each line is the same length in your output. If the lines of hyphens are shorter than the others, your alignment will be off.
In the future, if you provide the intended output, you can get much better responses.
Change print(dice_art[i], end='') to:
print(dice_art[i], end=' '), (notice the space in between the two quotes, and the comma after your previous code)
If you want to print the data dynamically, use the following syntax:
print(dice_art[i], sep=' ', end='', flush=True),
A join command should do it.
dice_art = ['ASCII0', 'ASCII1', 'ASCII2']
print(" ".join(dice_art))
The output would be:
ASCII0 ASCII1 ASCII2
What's an easy way to convert the output of Python PrettyTable to a programmatically usable format such as CSV?
The output looks like this:
C:\test> nova list
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
Perhaps this will get you close:
nova list | grep -v '\-\-\-\-' | sed 's/^[^|]\+|//g' | sed 's/|\(.\)/,\1/g' | tr '|' '\n'
This will strip the --- lines
Remove the leading |
Replace all but the last | with ,
Replace the last | with \n
Here's a real ugly one-liner
import csv
s = """\
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
result = [tuple(filter(None, map(str.strip, splitline))) for line in s.splitlines() for splitline in [line.split("|")] if len(splitline) > 1]
with open('output.csv', 'w', newline='') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerows(result)
I can unwrap it a bit to make it nicer:
splitlines = s.splitlines()
splitdata = [line.split("|") for line in splitlines]
# toss the lines that don't have any data in them -- pure separator lines
splitdata = [line for line in splitdata if len(line) > 1]
header, *data = [[field.strip() for field in line if field.strip()] for line in splitdata]
result = [header] + data
# I'm really just separating these, then re-joining them, but sometimes having
# the headers separately is an important thing!
Or possibly more helpful:
result = []
for line in s.splitlines():
    splitdata = line.split("|")
    if len(splitdata) == 1:
        continue  # skip lines with no separators
    linedata = []
    for field in splitdata:
        field = field.strip()
        if field:
            linedata.append(field)
    result.append(linedata)
@AdamSmith's answer has a nice method for parsing the raw table string. Here are a few additions to turn it into a generic function (I chose not to use the csv module, so there are no additional dependencies).
def ptable_to_csv(table, filename, headers=True):
    """Save PrettyTable results to a CSV file.

    Adapted from @AdamSmith https://stackoverflow.com/questions/32128226

    :param PrettyTable table: Table object to get data from.
    :param str filename: Filepath for the output CSV.
    :param bool headers: Whether to include the header row in the CSV.
    :return: None
    """
    raw = table.get_string()
    data = [tuple(filter(None, map(str.strip, splitline)))
            for line in raw.splitlines()
            for splitline in [line.split('|')] if len(splitline) > 1]
    if table.title is not None:
        data = data[1:]
    if not headers:
        data = data[1:]
    with open(filename, 'w') as f:
        for d in data:
            f.write('{}\n'.format(','.join(d)))
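A minimal usage sketch (the table here is made up for illustration; any PrettyTable instance should work, since the function only relies on get_string() and title):

from prettytable import PrettyTable

table = PrettyTable(['ID', 'Name', 'Status'])
table.add_row(['6bca09f8', 'tester', 'ACTIVE'])
ptable_to_csv(table, 'output.csv')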
Here's a solution using a regular expression. It also works for an arbitrary number of columns (the number of columns is determined by counting the number of plus signs in the first input line).
input_string = """spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
import re, csv, sys
def pretty_table_to_tuples(input_str):
    lines = input_str.split("\n")
    num_columns = len(re.findall(r"\+", lines[0])) - 1
    line_regex = r"\|" + (r" +(.*?) +\|" * num_columns)
    for line in lines:
        m = re.match(line_regex, line.strip())
        if m:
            yield m.groups()
w = csv.writer(sys.stdout)
w.writerows(pretty_table_to_tuples(input_string))
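One last note: if you control the PrettyTable object itself rather than just its printed output, recent releases of the prettytable package can export CSV directly — to the best of my knowledge via get_csv_string(), but check that your installed version has it before relying on it:

# get_csv_string() exists in recent prettytable releases -- verify against
# your installed version before depending on it
print(table.get_csv_string())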
I have this CSV that I have to modify using Python.
The number of files varies each time. The input CSV files I have are just lists of coordinates (x, y, z), and I have to modify each file into a 'model' which contains the same coordinates plus some information/headers.
The model looks like this :
Number | 1 | |
Head | N | E | El
Begin | list | list | list
| . | . | .
| . | . | .
| . | . | .
| . | . | .
End | . | . | .
| . | . | .
BeginR | Ok | |
EndR | | |
The dots are the coordinates that are in the lists.
So far I've managed to write almost everything.
What's left is to write the Begin and the End in the first column.
Because the size of the list varies, I have difficulty placing them where they should be: Begin on the same line as the first coordinate, and End at the second-to-last coordinate line.
This is my updated code :
for i in ficList:
    with open(i, newline='') as f:
        reader = csv.reader(f, delimiter=';')
        next(reader)  # skip the header
        for row in reader:
            coord_x.append(row[0])  # X
            coord_y.append(row[1])  # Y
            coord_z.append(row[2])  # Z
    list_list = [coord_x, coord_y, coord_z]  # list of coordinates
    len_x = len(coord_x)  # length of list
    with open(i, 'w', newline='') as fp:
        writer = csv.writer(fp, delimiter=';')
        writer.writerow(['Number', number])
        writer.writerow(['Head', 'N', 'E', 'El'])
        for l in range(len_x):
            if l == 0:
                writer.writerow(['Begin', list_list[0][l], list_list[1][l], list_list[2][l]])
            if l == len_x-2:
                writer.writerow(['End', list_list[0][l], list_list[1][l], list_list[2][l]])
            writer.writerow(['', list_list[0][l], list_list[1][l], list_list[2][l]])  # write the coordinates
        writer.writerow(['BeginR', 'Ok'])
        writer.writerow(['EndR'])
    coord_x.clear()  # empty list x
    coord_y.clear()  # empty list y
    coord_z.clear()  # empty list z
You're probably better off defining the row labels in advance in a map, then looking them up for each row. Also, list_list is not really needed; you should just stick to the separate vectors:
...
with open(i, 'w', newline='') as fp:
    writer = csv.writer(fp, delimiter=';')
    writer.writerow(['Number', number])
    writer.writerow(['Head', 'N', 'E', 'El'])
    row_label_map = {0: 'Begin', len_x-2: 'End'}
    for l in range(len_x):
        row_label = row_label_map.get(l, "")
        writer.writerow([row_label, coord_x[l], coord_y[l], coord_z[l]])
    writer.writerow(['BeginR', 'Ok'])
    writer.writerow(['EndR'])
...
Also, you don't need to clear the vectors coord_x etc. afterwards, as they will be deleted when they go out of scope.
With your latest code, I am guessing the issue is that you first write the line with the Begin tag and then write it again without it; move the logic into an if..elif..else block:
for l in range(len_x):
    if l == 0:
        writer.writerow(['Begin', list_list[0][l], list_list[1][l], list_list[2][l]])
    elif l == len_x-2:
        writer.writerow(['End', list_list[0][l], list_list[1][l], list_list[2][l]])
    else:
        writer.writerow(['', list_list[0][l], list_list[1][l], list_list[2][l]])  # write the coordinates
To me it seems like it would be easier to first modify the input CSV to include an extra column that has the Begin and End tags, with sed like this:
sed -e 's/^/,/' -e '1s/^/Begin/' -e '$ s/^/End/' -e 's/^,/ ,/' test.csv
Then you can simply print the columns as they are, without having to add logic in Python for when to add the additional tags. This assumes that the input CSV is called test.csv.
Two of these statements run while the other fails with a syntax error. What am I doing wrong?
>>> Timer('for i in xrange(10): oct(i)').repeat(3)
[2.7091379165649414, 2.6934919357299805, 2.689150094985962]
>>> Timer('n = [] ; n = [oct(i) for i in xrange(10)]').repeat(3)
[4.0500171184539795, 3.6979520320892334, 3.701982021331787]
>>> Timer('n = [] ; for i in xrange(10): n.append(oct(i))').repeat(3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py", line 136, in __init__
code = compile(src, dummy_src_name, "exec")
File "<timeit-src>", line 6
n = [] ; for i in xrange(10): n.append(oct(i))
^
SyntaxError: invalid syntax
Your failing statement is syntactically incorrect. If you need to time multiple statements, define them in a function and call Timer after importing the function from __main__:
>>> def foo():
...     n = []
...     for i in xrange(10): n.append(oct(i))
...
>>> Timer("foo()", "from __main__ import foo")
Now you need to understand why the failing statement is incorrect.
Excerpt from the docs for compound statements:
A suite can be one or more semicolon-separated simple statements on the same line as the header, following the header’s colon, or it can be one or more indented statements on subsequent lines.
stmt_list ::= simple_stmt (";" simple_stmt)* [";"]
and similarly, a simple statement is
simple_stmt ::= expression_stmt
| assert_stmt
| assignment_stmt
| augmented_assignment_stmt
| pass_stmt
| del_stmt
| print_stmt
| return_stmt
| yield_stmt
| raise_stmt
| break_stmt
| continue_stmt
| import_stmt
| global_stmt
| exec_stmt
It should now be clear to you when a semi-colon can (not should) be used.
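To make the rule concrete — simple statements chain with semicolons, but a compound statement such as for can never follow one:

n = []; n.append(1)                    # legal: simple_stmt ; simple_stmt
n = []; for i in xrange(10): pass      # SyntaxError: a compound statement cannot follow ';'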
Timer('n = []\nfor i in xrange(10): n.append(oct(i))').repeat(3)
[2.026008492408778, 2.065228002189059, 2.048982731136192]
You can use triple quotes as well:
statement = '''n = []
for i in xrange(10):
n.append(oct(i))'''
Timer(statement).repeat(3)