Failed to decode bytes - IronPython

I have some files with Unicode data. The following code works fine under CPython when reading those files, but it crashes on IronPython with "failed to decode bytes at index 67":
import codecs
from unicodedata import normalize

for f in self.list_of_files:
    all_words_in_file = []
    with codecs.open(f, encoding="utf-8-sig") as file_obj:
        for line in file_obj:
            all_words_in_file.extend(line.split(" "))
    #print "Normalising unicode strings"
    normal_list = []
    # get all the words and remove duplicates;
    # the list will contain unique normalized words
    for l in all_words_in_file:
        normal_list.append(normalize('NFKC', l))
    file_listing.update({f: normal_list})
return file_listing
I cannot understand the reason. Is there another way to read Unicode data in IronPython?

How about this one:
def lines(filename):
    f = open(filename, "rb")
    # skip the 3-byte UTF-8 BOM on the first line
    yield f.readline()[3:].strip().decode("utf-8")
    for line in f:
        yield line.strip().decode("utf-8")
    f.close()

for line in lines("text-utf8-with-bom.txt"):
    all_words_in_file.extend(line.split(" "))
I have also filed an IronPython bug: https://ironpython.codeplex.com/workitem/34951
As long as you are feeding entire lines to decode, things will be OK.
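To see why whole lines matter, here is a minimal illustration (plain CPython syntax, not IronPython-specific): a multi-byte UTF-8 character split across two reads cannot be decoded on its own.
data = u'\u00e9'.encode('utf-8')  # LATIN SMALL LETTER E WITH ACUTE: two bytes, '\xc3\xa9'
data[:1].decode('utf-8')          # raises UnicodeDecodeError: unexpected end of data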

Related

How do I write a list to a file?

How do I write a list whose elements are words from different languages and numbers to a file?
s = ['привет', 'hi', 235, 235, 45]
with open('test.txt', 'wb') as f:
    f.write(s)
TypeError: a bytes-like object is required, not 'list'
Well, since you are opening a .txt file and you mentioned different languages, I will assume that you want to write values with different encodings to a text file.
(If that's not the case and you want to write binary, you can check the answers above.)
You should specify the encoding for the file depending on the values you have.
s = ['привет', 'hi', 235, 235, 45]
with open("filename.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(str(val) for val in s))
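Reading the values back then just needs the same encoding (a small sketch, assuming the file written above):
with open("filename.txt", encoding="utf-8") as f:
    values = f.read().splitlines()  # ['привет', 'hi', '235', '235', '45'] -- note they are all strings now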
Using 'wb' to open the file causes you to write in binary mode. I'm not sure if this is what you want, since you are writing to a .txt file.
If you are sure you want to write in binary mode, I would suggest you use .bin or .dat as the file extension for your file.
If you are sure you want to write the output as binary, here is one way to do so (I make several assumptions here, because you don't give details in your question):
import sys

s = ['привет', 'hi', 235, 235, 45]

# Create an empty bytearray
byte_array = bytearray()

# For each item in the list s, convert it to bytes and append it to the
# output bytearray
for item in s:
    if isinstance(item, str):
        byte_array.extend(bytearray(item, 'utf-8'))
    elif isinstance(item, int):
        # Assumption: all ints in the list can be represented with only 2 bytes
        # Assumption: we want to output with the system's byte order
        byte_array.extend(item.to_bytes(2, sys.byteorder, signed=True))

# Here I use the .bin extension, because it is a binary file
with open('test.bin', 'wb') as f:
    f.write(byte_array)
If you instead do not want to write as binary, but want to write as text, see Spiros Gkogkas's answer.
with open('filename.txt', 'w') as file:
    file.write(', '.join([str(item) for item in s]))
This writes every item in the list (converting each item to a string first, of course) separated by commas.
Maybe convert (serialize) the list to a JSON string, like this:
import json

dataList = ["po", "op", "oo"]
dataJson = json.dumps(dataList)
# Then your code becomes (note 'w' rather than 'wb': json.dumps returns a str):
# with open('test.txt', 'w') as f:
#     f.write(dataJson)
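Reading the list back is then the symmetric call (a small sketch using the same file):
import json

with open('test.txt') as f:
    dataList = json.load(f)  # ['po', 'op', 'oo']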

Alternatives to tell() while iterating over lines of a file in Python 3?

How can I find out the location of the file cursor when iterating over a file in Python 3?
In Python 2.7 it's trivial: use tell(). In Python 3 the same call throws an OSError:
Traceback (most recent call last):
File "foo.py", line 113, in check_file
pos = infile.tell()
OSError: telling position disabled by next() call
My use case is making a progress bar for reading large CSV files. Computing a total line count is too expensive and requires an extra pass. An approximate value is plenty useful; I don't care about buffers or other sources of noise. I want to know whether it'll take 10 seconds or 10 minutes.
Simple code to reproduce the issue. It works as expected on Python 2.7, but throws on Python 3:
file_size = os.stat(path).st_size
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))
This answer https://stackoverflow.com/a/29641787/321772 suggests that the problem is that the next() method disables tell() during iteration. Alternatives are to manually read line by line instead, but that code is inside the CSV module so I can't get at it. I also can't fathom what Python 3 gains by disabling tell().
So what is the preferred way to find out your byte offset while iterating over the lines of a file in Python 3?
The csv module just expects the first parameter of the reader call to be an iterator that returns one line per next() call, so you can use an iterator wrapper that counts the characters. If you want the count to be accurate, you will have to open the file in binary mode. But in fact this is fine, because there is then no end-of-line conversion, which is what the csv module expects.
So a possible wrapper is:
class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding  # specify encoding in constructor, with utf-8 as default

    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)  # returns a decoded line (a true Python 3 string)

    def __iter__(self):
        return self
Your code would then become:
file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))
The good news here is that you keep all the power of the csv module, including newlines in quoted fields...
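For example, here is a small sketch reusing the SizedReader above on an in-memory list of byte strings (standing in for the file handle): a quoted field containing a newline spans two physical lines but still comes back as a single row, and size still counts every byte.
import csv

raw = b'a,"first\nsecond",c\n'
szrdr = SizedReader(iter(raw.splitlines(keepends=True)))
for row in csv.reader(szrdr):
    print(row, szrdr.size)  # ['a', 'first\nsecond', 'c'] 19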
If you are comfortable working without the csv module in particular, you can do something like:
import os

file_size = os.path.getsize('SampleCSV.csv')
pos = 0
with open('SampleCSV.csv', "r") as infile:
    for line in infile:
        pos += len(line)  # line keeps its trailing '\n'; add 1 more per line if the file has '\r\n' endings
        row = line.rstrip().split(',')
        print("At byte {} of {}".format(pos, file_size))
But this might not work in cases where the fields themselves contain quoted commas or escaped quotes (\").
Ex: 1,"Hey, you..",22:04. Though these can also be taken care of using regular expressions.
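For instance, a hedged sketch: split only on commas that sit outside double quotes. This handles quoted commas, but not multi-line fields or escaped quotes.
import re

line = '1,"Hey, you..",22:04'
row = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)
print(row)  # ['1', '"Hey, you.."', '22:04']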
Since your CSV file is too large for an extra pass, there is also another solution, according to the page you mentioned:
use offset += len(line) instead of file.tell(). For example,
offset = 0
with open(path, mode) as file:
    for line in file:
        offset += len(line)

filtering a weird text file in python

I have a text file in which each ID line starts with > and the next line(s) contain a sequence of characters. The line after the sequence is another ID line starting with >, but for some of them, instead of a sequence, I have "Sequence unavailable". The sequence after an ID line can span one or more lines,
like this example:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
>ENSG00000004139|ENST00000003834
Sequence unavailable
I want to filter out those IDs with “Sequence unavailable”. The output should look like this:
output:
>ENSG00000173153|ENST00000000442|64073050;64074640|64073208;64074651
AAGCAGCCGGCGGCGCCGCCGAGTGAGGGGACGCGGCGCGGTGGGGCGGCGCGGCCCGAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCGGGCCGCTC
Do you know how to do that in Python?
Unlike the other answers, I'd strongly recommend against parsing the FASTA format manually. It's not too hard, but there are pitfalls, and it's completely unnecessary since efficient, well-tested implementations exist:
Use Bio.SeqIO from BioPython; for example:
from Bio import SeqIO

with open(out_filename, 'w') as outfile:  # out_filename: wherever the filtered records go
    for record in SeqIO.parse(filename, 'fasta'):
        if record.seq != 'Sequenceunavailable':
            SeqIO.write(record, outfile, 'fasta')
Note the missing space in 'Sequenceunavailable': reading the sequences in FASTA format will omit spaces.
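If you want the check to be robust to spacing or capitalization variants (an assumption about how the marker might vary in your data, not something the format guarantees), you can loosen the comparison:
if 'unavailable' not in str(record.seq).lower():
    SeqIO.write(record, outfile, 'fasta')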
How about this:
with open(filename, 'r+') as f:
    data = f.read()
    data = data.split('>')
    result = ['>{}'.format(item) for item in data if item and 'Sequence unavailable' not in item]
    f.seek(0)
    for line in result:
        f.write(line)
    f.truncate()  # drop any leftover bytes beyond what was rewritten
def main():
    with open('text.txt', 'rU') as infile:
        filterFile(infile.readlines())

def filterFile(sequence_lines):
    outfile = open('outfile', 'w')
    lines = iter(sequence_lines)
    for line in lines:
        if line.startswith('>'):
            sequence = next(lines)
            if sequence.startswith('Sequence unavailable'):
                pass  # nothing should happen, I suppose?
            else:
                outfile.write(line + sequence)  # both lines keep their trailing '\n'
    outfile.close()

main()
I unfortunately can't test this code right now; I wrote it off the top of my head! Please test it and let me know what the outcome is so I can adjust the code :-)
So I don't exactly know how large these files will get; just in case, I'm doing it without mapping the whole file into memory:
with open(filename) as fh:
    with open(filename + '.new', 'w+') as fh_new:
        for idline, geneseq in zip(*[iter(fh)] * 2):
            if geneseq.strip() != 'Sequence unavailable':
                fh_new.write(idline)
                fh_new.write(geneseq)
It works by creating a new file; the zip(*[iter(fh)] * 2) idiom reads the file two lines at a time, so idline gets the ID line of each pair and geneseq the sequence line.
This solution should be relatively cheap in computing power, but it creates an extra output file (one way to fold it back is sketched below).
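A possible follow-up (my assumption about the desired behaviour, not part of the answer above): once the new file is written, swap it over the original.
import os

os.replace(filename + '.new', filename)  # atomically replaces the original with the filtered file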

Same value in list keeps getting repeated when writing to text file

I'm a total noob to Python and need some help with my code.
The code is meant to take Input.txt [http://pastebin.com/bMdjrqFE], split it into separate Pokemon (in a list), and then split those into separate values, which I use to reformat the data and write it to Output.txt.
However, when I run the program, only the last Pokemon gets output, 386 times. [http://pastebin.com/wkHzvvgE]
Here's my code:
f = open("Input.txt", "r")#opens the file (input.txt)
nf = open("Output.txt", "w")#opens the file (output.txt)
pokeData = []
for line in f:
#print "%r" % line
pokeData.append(line)
num = 0
tab = """ """
newl = """NEWL
"""
slash = "/"
while num != 386:
current = pokeData
current.append(line)
print current[num]
for tab in current:
words = tab.split()
print words
for newl in words:
nf.write('%s:{num:%s,species:"%s",types:["%s","%s"],baseStats:{hp:%s,atk:%s,def:%s,spa:%s,spd:%s,spe:%s},abilities:{0:"%s"},{1:"%s"},heightm:%s,weightkg:%s,color:"Who cares",eggGroups:["%s"],["%s"]},\n' % (str(words[2]).lower(),str(words[1]),str(words[2]),str(words[3]),str(words[4]),str(words[5]),str(words[6]),str(words[7]),str(words[8]),str(words[9]),str(words[10]),str(words[12]).replace("_"," "),str(words[12]),str(words[14]),str(words[15]),str(words[16]),str(words[16])))
num = num + 1
nf.close()
f.close()
There are quite a few problems with your program starting with the file reading.
To read the lines of a file to an array you can use file.readlines().
So instead of
f = open("Input.txt", "r")#opens the file (input.txt)
pokeData = []
for line in f:
#print "%r" % line
pokeData.append(line)
You can just do this
pokeData = open("Input.txt", "r").readlines() # This will return each line within an array.
Next, you are misunderstanding the uses of for and while.
A for loop in Python is designed to iterate through an array or list, as shown below. I don't know what you were trying to do with for newl in words; a for loop creates a new variable and then iterates through an array, setting the value of that new variable on each pass. Refer below.
array = ["one", "two", "three"]
for i in array:  # i is created
    print (i)
The output will be:
one
two
three
So to fix a lot of this code, you can replace the whole while loop with something like this
(the code below assumes your input file has been formatted such that all the words are split by tabs):
for line in pokeData:
    words = line.split(tab)  # Split the line by tabs
    nf.write('your very long and complicated string')
Other helpers
The formatted string that you write to the output file looks very similar to the JSON format. There is a built-in Python module called json that can convert a native Python dict to a JSON string. This will probably make things a lot easier for you, but either way works.
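For example, a minimal sketch (the keys and values here are made up for illustration, not taken from your Input.txt):
import json

pokemon = {"num": 1, "species": "Bulbasaur", "types": ["Grass", "Poison"]}
nf.write(json.dumps(pokemon) + "\n")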
Hope this helps

writing data into file with binary packed format in python

I am reading some values from a file and want to write the modified values back into a file. My file is in the .ktx format [a binary packed format].
I am using struct.pack(), but it seems that something is going wrong with that:
bytes = file.read(4)
bytesAsInt = struct.unpack("l",bytes)
number=1+(bytesAsInt[0])
number=hex(number)
no=struct.pack("1",number)
outfile.write(no)
I want to write it both ways, little-endian and big-endian.
number = bytesAsInt[0] + 1            # keep it an int; don't call hex() before packing
no_little = struct.pack("<l", number) # little-endian
no_big = struct.pack(">l", number)    # big-endian; native byte order is the default when no prefix is given
Again, you can check the docs and see the format characters you need:
https://docs.python.org/3/library/struct.html
>>> struct.unpack("l","\x05\x04\x03\03")
(50529285,)
>>> struct.pack("l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack("<l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack(">l",50529285)
'\x03\x03\x04\x05'
Also note that it is a lowercase L, not a one (as also covered in the docs).
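For reference, the same calls on Python 3 take and return bytes, and an explicit byte-order prefix also pins l to 4 bytes regardless of platform (a sketch, not part of the original answer):
>>> import struct
>>> struct.unpack("<l", b"\x05\x04\x03\x03")
(50529285,)
>>> struct.pack(">l", 50529285)
b'\x03\x03\x04\x05'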
I haven't tested this, but the following function should solve your problem. At the moment it reads the file contents completely, creates a buffer, and then writes out the updated contents. You could also modify the file buffer directly using unpack_from and pack_into, but it might be slower (again, not tested). I'm using the struct.Struct class since you seem to want to unpack the same format many times.
import struct
from io import BytesIO  # on Python 2, use: from StringIO import StringIO

def modify_values(in_file, out_file, increment=1, num_code="i", endian="<"):
    with open(in_file, "rb") as file_h:
        content = file_h.read()
    num = struct.Struct(endian + num_code)
    buf = BytesIO()
    try:
        while len(content) >= num.size:
            value = num.unpack(content[:num.size])[0]
            value += increment
            buf.write(num.pack(value))
            content = content[num.size:]
    except struct.error as err:
        raise  # handle the error as appropriate for your application
    else:
        buf.seek(0)
        with open(out_file, "wb") as file_h:
            file_h.write(buf.read())
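A possible invocation (the filenames and the 4-byte little-endian layout are assumptions about the .ktx file, not facts from the question):
modify_values("input.ktx", "output.ktx", increment=1, num_code="l", endian="<")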
An alternative is to use the array module, which makes it quite easy. I don't know how to implement endianness with an array, though one possibility is sketched after the code below.
from array import array

def modify_values(filename, increment=1, num_code="i"):
    with open(filename, "rb") as file_h:
        arr = array(num_code, file_h.read())
    for i in range(len(arr)):
        arr[i] += increment
    with open(filename, "wb") as file_h:
        arr.tofile(file_h)
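One hedged possibility for endianness with array: the array module always stores values in native byte order, but byteswap() flips them in place, so you can swap whenever the file's byte order differs from sys.byteorder.
import sys
from array import array

arr = array("i", [1, 2, 3])
if sys.byteorder != "little":  # assuming the file is meant to be little-endian
    arr.byteswap()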
