Read file from sys.stdin in 'rb' mode: Python

I have written the code below to convert a CSV file to an XML file. I am reading the file from sys.stdin and writing the output back to sys.stdout. I am getting the error below while reading a file.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 7652: invalid continuation byte
I have researched the error and found that reading the input file in 'rb' mode may resolve it. How do I change the code below to read the input file from sys.stdin in 'rb' mode? I have not found an answer yet.
import csv
import sys
import os
from xml.dom.minidom import Document
filename = sys.argv[1]
filename = os.path.splitext(filename)[0]+'.xml'
pathname = "/tmp/"
output_file = pathname + filename
f = sys.stdin
reader = csv.reader(f)
fields = next(reader)
fields = [x.lower() for x in fields]
fieldsR = fields
doc = Document()
dataRoot = doc.createElement("rowset")
dataRoot.setAttribute('xmlns:xsi', "http://www.w3.org/2001/XMLSchema-instance")
dataRoot.setAttribute('xsi:schemaLocation', "./schema.xsd")
doc.appendChild(dataRoot)
for line in reader:
    dataElt = doc.createElement("row")
    for i in range(len(fieldsR)):
        dataElt.setAttribute(fieldsR[i], line[i])
    dataRoot.appendChild(dataElt)
xmlFile = open(output_file,'w')
xmlFile.write(doc.toprettyxml(indent = '\t'))
xmlFile.close()
sys.stdout.write(output_file)

In Python 3, stdin, stdout and stderr are all wrapped in text I/O layers that encode and decode the streams on the fly.
If you want direct access to the underlying binary stream, it is available as an attribute of these wrappers.
For stdin, instead of calling .read() on sys.stdin, call sys.stdin.buffer.read()
(and likewise for stderr and stdout: use .buffer to get the binary stream; .buffer.raw goes one level further down, to the unbuffered raw stream).
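Reading bytes alone is not enough for the script in the question, because csv.reader in Python 3 expects text. A minimal sketch of one way to combine the two, assuming the input is Latin-1 encoded (an assumption; the real encoding of the file is not stated in the question) and using a hypothetical helper name read_csv_rows:

```python
import csv
import io


def read_csv_rows(binary_stream, encoding="latin-1"):
    """Decode a binary stream with an explicit encoding and parse it as CSV."""
    # newline="" lets the csv module handle line endings itself.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.reader(text_stream))
```

In the script above this would be called as read_csv_rows(sys.stdin.buffer) in place of csv.reader(sys.stdin). If the encoding guess is wrong the rows will still parse, but non-ASCII characters may come out garbled.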

Related

Can't reopen Django File as rb

Why does reopening a django.core.files File as binary not work?
from django.core.files import File
f = open('/home/user/test.zip')
test_file = File(f)
test_file.open(mode="rb")
test_file.read()
This gives me the error 'utf-8' codec can't decode byte 0x82 in position 14: invalid start byte, so opening in 'rb' obviously didn't work. I need this because I want to open a FileField as binary.
You need to open(…) the underlying file handle in binary mode, so:
with open('/home/user/test.zip', mode='rb') as f:
    test_file = File(f)
    test_file.open(mode='rb')
    test_file.read()
Without opening it in binary mode, the underlying reader will try to read the file as text, and will thus raise an error on bytes that are not valid UTF-8.
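The difference between the two modes can be reproduced without Django. The sketch below uses plain open() on a throwaway file; the payload bytes and the helper name read_text_and_binary are made up for illustration.

```python
import os
import tempfile


def read_text_and_binary(path):
    """Return (text_error, raw_bytes) for the file at path."""
    text_error = None
    try:
        # Text mode: the bytes are decoded as UTF-8 while reading.
        with open(path, "r", encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as exc:
        text_error = exc
    # Binary mode: the bytes are returned untouched.
    with open(path, "rb") as f:
        raw = f.read()
    return text_error, raw


# Hypothetical payload: a zip-like header followed by bytes that are not valid UTF-8.
payload = b"PK\x03\x04\x82\xff\x00"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
error, raw = read_text_and_binary(tmp.name)
os.remove(tmp.name)
```

Here error holds the UnicodeDecodeError raised by the text-mode read, while raw is identical to the bytes that were written.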

GZip and output file

I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip
jsonString = 'JSON encoded string here created by a previous process in the application'
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))
# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
  File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
    with gzip.GzipFile(fileobj=out, mode="w") as f:
  File "C:\Python38\lib\gzip.py", line 204, in __init__
    self._write_gzip_header(compresslevel)
  File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
    self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
PS ignore the rubbish indenting here...I know it doesn't look right.
What I'm wanting to do is to create a json file and gzip it in place in memory before saving the gzipped file to the filesystem (windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.
You have to use bytes everywhere when working with gzip, not strings and text. First, use BytesIO instead of StringIO. Second, use binary modes: 'wb' instead of 'w' (and likewise 'ab' instead of 'a' when appending); the 'b' stands for "binary". This matters most for the final open() call, where a plain 'w' or 'a' would open the file in text mode. Full corrected code below:
from io import BytesIO
import gzip
jsonString = 'JSON encoded string here created by a previous process in the application'
out = BytesIO()
with gzip.GzipFile(fileobj = out, mode = 'wb') as f:
    f.write(str.encode(jsonString))
currenttimestamp = '2021-01-29'
# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
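To check that the in-memory buffer really holds valid gzip data before it is written to disk, the compression step can be pulled into a small function and round-tripped. This is a sketch; the helper name gzip_text_in_memory and the sample JSON string are illustrative, not part of the original application.

```python
import gzip
import io


def gzip_text_in_memory(text, encoding="utf-8"):
    """Compress text into gzip-formatted bytes without touching the disk."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        gz.write(text.encode(encoding))
    # The gzip trailer is written when the GzipFile closes, so read the buffer afterwards.
    return buffer.getvalue()


compressed = gzip_text_in_memory('{"status": "ok"}')
restored = gzip.decompress(compressed).decode("utf-8")
```

Note that getvalue() is called only after the with block has closed the GzipFile; reading the buffer earlier could return an incomplete stream.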

Change file encoding scheme in Python

I'm trying to open a file using latin-1 encoding in order to produce a file with a different encoding. I get a NameError stating that unicode is not defined. Here is the piece of code I use to do this:
sourceEncoding = "latin-1"
targetEncoding = "utf-8"
source = open(r'C:\Users\chsafouane\Desktop\saf.txt')
target = open(r'C:\Users\chsafouane\Desktop\saf2.txt', "w")
target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
I'm not used to handling files at all, so I don't know whether there is a module I should import to use "unicode".
The fact that you see unicode not defined suggests that you're on Python 3. Here's a code snippet that generates a latin1-encoded file and then does what you want: it slurps the latin1-encoded file and spits out a UTF8-encoded file:
# Generate a latin1-encoded file
txt = u'U+00AxNBSP¡¢£¤¥¦§¨©ª«¬SHY­®¯U+00Bx°±²³´µ¶·¸¹º»¼½¾¿U+00CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏU+00DxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßU+00ExàáâãäåæçèéêëìíîïU+00Fxðñòóôõö÷øùúûüýþÿ'
latin1 = txt.encode('latin1')
with open('example-latin1.txt', 'wb') as fid:
    fid.write(latin1)
# Read in the latin1 file
with open('example-latin1.txt', 'r', encoding='latin1') as fid:
    contents = fid.read()
assert contents == latin1.decode('latin1') # sanity check
# Spit out a UTF8-encoded file (explicit encoding, since open()'s default is platform-dependent)
with open('converted-utf8.txt', 'w', encoding='utf8') as fid:
    fid.write(contents)
If you want the output to be something other than UTF8, pass the corresponding encoding argument to open, e.g.,
with open('converted-utf_32.txt', 'w', encoding='utf_32') as fid:
    fid.write(contents)
The docs have a list of all supported codecs.
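For larger files, the same conversion can be done line by line instead of reading everything into memory. The following is a sketch under that assumption; transcode_file and the temporary file names are hypothetical and exist only for the demonstration.

```python
import os
import tempfile


def transcode_file(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Copy a text file, decoding with src_encoding and encoding with dst_encoding."""
    # newline="" on both sides passes line endings through unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Demo with throwaway files; the file names are placeholders.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "source-latin1.txt")
dst_path = os.path.join(workdir, "target-utf8.txt")
with open(src_path, "wb") as f:
    f.write("caf\u00e9\r\n".encode("latin-1"))
transcode_file(src_path, dst_path)
with open(dst_path, "rb") as f:
    converted = f.read()
```

Both encodings are passed explicitly, so the result does not depend on the platform's default encoding.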

Open file from zip without extracting it in Python?

I am working on a script that fetches a zip file from a URL using the requests library. That zip file contains a csv file. I'm trying to read that csv file without saving it. But while parsing, it gives me this error: _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
import csv
import requests
from io import BytesIO, StringIO
from zipfile import ZipFile
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
files = zip_file.namelist()
with zip_file.open(files[0]) as csvfile:
    csvreader = csv.reader(csvfile)
    # _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
    for row in csvreader:
        print(row)
Try this:
import pandas as pd
import requests
from io import BytesIO, StringIO
from zipfile import ZipFile
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
files = zip_file.namelist()
with zip_file.open(files[0]) as csvfile:
    print(pd.read_csv(csvfile, encoding='utf8', sep=","))
As @Aran-Fey alluded to:
import zipfile
import csv
import io
with open('/path/to/archive.zip', 'rb') as f: # the archive itself must be opened in binary mode
    with zipfile.ZipFile(f) as zf:
        csv_filename = zf.namelist()[0] # see namelist() for the list of files in the archive
        with zf.open(csv_filename) as csv_f:
            csv_f_as_text = io.TextIOWrapper(csv_f)
            reader = csv.reader(csv_f_as_text)
csv.reader (and csv.DictReader) require a file-like object opened in text mode. Normally this is not a problem when you simply open(...) a file in 'r' mode because, as the Python 3 docs say, text mode is the default: "The default mode is 'r' (open for reading text, synonym of 'rt')". ZipFile.open() is different: it always returns a binary stream, and if you pass it 'rt' you'll see an error saying that open() requires mode "r" or "w":
with zf.open(csv_filename, 'rt') as csv_f:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: open() requires mode "r" or "w"
That's what io.TextIOWrapper is for -- for wrapping byte streams to be readable as text, decoding them on the fly.
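The whole flow can be tried without a network request or a file on disk. In this sketch a small archive is built in memory to stand in for response.content; the helper name read_csv_from_zip, the member name data.csv and the UTF-8 encoding are assumptions made for the example.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes, encoding="utf-8"):
    """Parse the first member of an in-memory zip archive as CSV."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        first_member = zf.namelist()[0]
        with zf.open(first_member) as raw:
            # Wrap the binary member stream so csv.reader receives text.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.reader(text))


# Build a small archive in memory to stand in for response.content.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.csv", "id,name\r\n1,Ada\r\n")
rows = read_csv_from_zip(archive.getvalue())
```

The rows are collected into a list before the member stream is closed, so the result stays usable after the function returns.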

python - writing hex digits to csv

I have the following string:
>>> line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I type the variable line in the Python terminal, it shows the following:
>>> line
'\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I print it, it shows the following:
>>> print line
7 Cardio Metabolic Care 12,788,528.04
In the variable line each word is separated using \t and I wanted to save it to a csv file. So I tried using the following code:
import csv
with open('test.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(line.split('\t'))
When I look into the test.csv file, I am getting only the following
,,,,,,
Is there any way to get the words into the csv file? Kindly help.
Your input text is not corrupted, it's encoded - as UTF-16 (Big Endian in this case). And it's CSV itself, just with tab as the delimiter.
You must decode it into a string, after that you can use it normally.
Ideally you declare the proper byte encoding when you read it from a source. For example, when you open a file you can state the encoding the file uses so that the file reader will decode the contents for you.
If you have that byte string from a source where you can't declare an encoding while reading it, you can decode manually:
line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
decoded = line.decode('utf_16_be')
print decoded
# 7 Cardio Metabolic Care 12,788,528.04
But since I suppose that you are actually reading it from a file:
import csv
import codecs
with codecs.open('input.txt', 'r', encoding='utf16') as in_file, codecs.open('output.csv', 'w', encoding='utf8') as out_file:
    reader = csv.reader(in_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter=',', quotechar='"')
    writer.writerows(reader)
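The snippets above are written for Python 2 (note the print statement). A rough Python 3 equivalent of the decode-then-convert step might look like the sketch below, where the data arrives as a bytes object. The helper name utf16_tsv_to_csv is hypothetical, and the sample is the question's line re-encoded as UTF-16 big endian.

```python
import csv
import io


def utf16_tsv_to_csv(raw, encoding="utf_16_be"):
    """Decode tab-separated UTF-16 bytes and return them re-serialized as CSV text."""
    decoded = raw.decode(encoding)
    # newline="" keeps the original line endings so the csv module can interpret them.
    reader = csv.reader(io.StringIO(decoded, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=",", quotechar='"')
    writer.writerows(reader)
    return out.getvalue()


sample = "\t7\tCardio Metabolic Care\t\t\t\t 12,788,528.04\r\n".encode("utf_16_be")
csv_text = utf16_tsv_to_csv(sample)
```

Because the last field contains commas, the writer wraps it in double quotes so it survives as a single column.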
