unable to read large bz2 file - python

I am trying to read a large bz2 file with this code:
import bz2
file= bz2.BZ2File("20150219.csv.bz2","rb")
print file.read()
file.close()
But after 4525 lines, it stops without an error message. The bz2 file is much bigger.
How can I read the whole file line by line?

Your file.read() call tries to read the entire file into memory and decompress all of it there, too. Try reading it a line at a time:
import bz2
with bz2.BZ2File("20150219.csv.bz2", "r") as file:
    for line in file:
        print(line)
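A minimal sketch of the line-by-line approach on Python 3, assuming the archive holds UTF-8 text. Opening with bz2.open in text mode ("rt") yields decoded str lines, so only a small buffer is held in memory at a time. The helper name count_lines and the generated sample file are illustrative, not part of the original question. If the original script stops early on Python 2, one possible cause is a multi-stream archive, which BZ2File only handles from Python 3.3 onward.

```python
import bz2
import os
import tempfile


def count_lines(path):
    """Stream a bz2-compressed text file and count its lines."""
    total = 0
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for _line in handle:
            total += 1
    return total


# Build a small sample archive so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv.bz2")
with bz2.open(sample_path, mode="wt", encoding="utf-8") as handle:
    for i in range(10000):
        handle.write("{},{}\n".format(i, i * 2))

line_count = count_lines(sample_path)
print(line_count)
```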

Why do you want to print a binary file line by line? If you only need the decompressed data, read it into a bytes object instead:
bs = file.read()

Related

How to un-tar in-memory data using Python3?

I've got some tar data in bytes, and want to read it without writing it to the file system.
Writing it to the file system works:
with open('out.tar', 'wb') as f:
    f.write(data)
then, in the shell: tar -xzvf out.tar
But the following errors:
import tarfile
tarfile.open(data, 'r')
'''
File ".../lib/python3.7/tarfile.py", line 1591, in open
return func(name, filemode, fileobj, **kwargs)
File ".../lib/python3.7/tarfile.py", line 1638, in gzopen
fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
File ".../lib/python3.7/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(fil
'''
What is the right way to read the tar in memory?
Update
The following works:
from io import BytesIO
tarfile.open(fileobj=BytesIO(data), mode='r')
Why?
tarfile.open is supposed to be able to work with bytes. Converting the bytes to a file-like object myself and then telling tarfile.open to use the file-like object works, but why is the transformation necessary? When does the raw bytes-based API work vs. not work?
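The first positional argument of tarfile.open is name, a path on disk, so raw bytes passed there are treated as a file name. To read an archive held in memory, wrap the bytes in a BytesIO and pass it as fileobj:
import tarfile
from io import BytesIO
# data holds the raw bytes of the archive
with tarfile.open(fileobj=BytesIO(data)) as tar:
    for tar_file in tar:
        if tar_file.isfile():
            inner_data = tar.extractfile(tar_file).read().decode('utf-8')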
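As a self-contained illustration of the fileobj approach, the sketch below builds a small gzip-compressed tar archive in memory and reads it back without touching the file system. The helper names build_tar_bytes and read_tar_bytes and the sample member hello.txt are made up for this example, and it assumes the members contain UTF-8 text.

```python
import io
import tarfile


def build_tar_bytes(files):
    """Create an in-memory gzip-compressed tar archive from {name: bytes}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_tar_bytes(data):
    """Return {member name: decoded text} for every regular file."""
    contents = {}
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                raw = tar.extractfile(member).read()
                contents[member.name] = raw.decode("utf-8")
    return contents


data = build_tar_bytes({"hello.txt": b"hello world\n"})
result = read_tar_bytes(data)
print(result)
```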

Can I use python to open an XML file, format it, then save it again?

Hello Python Developers,
I want to write a python script that will search a directory for any files with the extension ".err", then format the file (which is an XML format), then overwrite the file with the correct formatting. Here is the code I have so far:
import xml.dom.minidom
import glob
import os
path = "/qond/apps/tomcat_qx/webapps/qxiqonddb/qxtendQueues/qdocqueue/responses/"
os.chdir(path)
for file in glob.glob("*.err"):
    with open(path + file) as f:
        xml_data = f.read()
        xml = xml.dom.minidom.parse(xml_data)
        xml_formatted = dom.toprettyxml()
        f.write(xml_formatted)
        f.close()
Many Thanks in advance!
Edit:
The current issue I face with the above code is:
Traceback (most recent call last):
File "qxtend_formatter.py", line 12, in <module>
xml = xml.dom.minidom.parse(xml_data)
File "/usr/lib64/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib64/python2.6/xml/dom/expatbuilder.py", line 922, in parse
fp = open(file, 'rb')
IOError: [Errno 36] File name too long: '<?xml version="1.0" encoding="UTF-8"?>\n<soapenv:Envelope xmlns:soapenv="http://schemas.xm.....
It seems as though it is using the file contents as the file name, but I would like the file to keep its existing name.
I have resolved this issue by doing two things:
Ensuring that the OS user has full access to the file (chmod 777)
Creating an '.f' instance to read and a '.fl' instance to write the file
My code now looks like this:
from xml.dom import minidom
import glob
import os
path = "/qond/apps/tomcat_qx/webapps/qxiqonddb/qxtendQueues/qdocqueue/responses/"
os.chdir(path)
for file in glob.glob("*.err"):
    with open(file, "r") as f:
        xml_data = f.read()
    doc = minidom.parseString(xml_data)
    with open(file, "w") as fl:
        fl.write(doc.toprettyxml(indent=" ", newl="\n"))
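The change that addresses the "File name too long" error is most likely the switch from minidom.parse to minidom.parseString: parse expects a file name or file object, while parseString accepts the XML text itself. A small sketch of that distinction follows; the helper name pretty_xml and the sample document are illustrative only.

```python
from xml.dom import minidom


def pretty_xml(xml_text):
    """Parse XML held in a string and return an indented version."""
    # parseString takes the document text; minidom.parse would try to
    # open its argument as a file name instead.
    doc = minidom.parseString(xml_text)
    return doc.toprettyxml(indent="  ", newl="\n")


sample = "<root><item>1</item><item>2</item></root>"
formatted = pretty_xml(sample)
print(formatted)
```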

not able to replace string in a file using replace method

I am not able to replace a string in a file:
with open("dc_setup.tcl",'r+') as file:
    for line in file:
        if str0 in line:
            line1=line
            print(line1)
    contents=file.read()
    contents=contents.replace(line1,new_str)
    file.seek(0)
    file.truncate()
    file.write(contents)
I expect the code to replace the string in that file, but I'm getting an empty file.
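This section:
file.seek(0)
file.truncate()
file.write(contents)
overwrites the entire file, not just the current line. Because the for loop has already consumed the file by the time file.read() is called, read() returns an empty string, so contents is empty and that is what gets written back. Editing text files in place is generally hard, so the usual approach is to write to a new file. You can copy the new file over the old one once you have finished, if you like.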
with open("dc_setup.tcl") as infile, open("new_dc_setup.tcl", "w") as outfile:
    for line in infile:
        if old_str in line:
            line = line.replace(old_str, new_str)
        outfile.write(line)
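The "write to a new file, then copy it back" idea can be sketched as below, using a temporary file in the same directory and os.replace to swap it in. The function name replace_in_file and the sample file contents are hypothetical; the sketch assumes the file is small enough to process line by line in text mode with the default encoding.

```python
import os
import tempfile


def replace_in_file(path, old_str, new_str):
    """Rewrite path with old_str replaced by new_str via a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w") as outfile, open(path) as infile:
        for line in infile:
            outfile.write(line.replace(old_str, new_str))
    # Swap the finished copy in place of the original file.
    os.replace(tmp_path, path)


# Demonstration on a throwaway file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "dc_setup.tcl")
with open(demo_path, "w") as f:
    f.write("set a 1\nset target old_lib\nset b 2\n")

replace_in_file(demo_path, "old_lib", "new_lib")

with open(demo_path) as f:
    updated = f.read()
print(updated)
```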

How do I create a PDF file from a binary code using Python?

I am trying to send myself PDF files by e-mail with Python. I am able to send myself the binary code of a PDF file, but I am not able to reconstruct the PDF file from this binary code.
Here is how I obtain the binary code of a PDF file:
file = open('code.txt', 'w')
for line in open('somefile.pdf', 'rb').readlines():
    file.write(str(line))
file.close()
Here is how I try to create a PDF file from the binary code:
file = open('new.pdf', 'wb')
for line in open('code.txt', 'r').readlines():
    file.write(bytes(line))
file.close()
I then receive this error:
Traceback (most recent call last):
File "something.py", line 3, in <module>
file.write(bytes(line))
TypeError: string argument without an encoding
What did I do wrong?
In your first block, open the output file in binary write mode (wb), since you are writing bytes to it. You also don't need to convert each line to str. It should look like this:
file = open('code.txt', 'wb')
for line in open('somefile.pdf', 'rb').readlines():
    file.write(line)
file.close()
For the second block, open the input file in binary read mode (rb). Here, too, there is no need to convert explicitly to bytes. It should look like this:
file = open('new.pdf', 'wb')
for line in open('code.txt', 'rb').readlines():
    file.write(line)
file.close()
This should do the trick. But why do you need to convert the file in the first place? Sending it unchanged saves both effort and computation.
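If the file really does have to travel as plain text (for example inside an e-mail body), base64 is a common way to make the round trip lossless. The sketch below is one possible approach, not the asker's method; the helper names encode_file_to_text and decode_text_to_file and the sample bytes are invented for illustration.

```python
import base64
import os
import tempfile


def encode_file_to_text(src_path, text_path):
    """Write the base64 text representation of a binary file."""
    with open(src_path, "rb") as src:
        encoded = base64.b64encode(src.read()).decode("ascii")
    with open(text_path, "w", encoding="ascii") as dst:
        dst.write(encoded)


def decode_text_to_file(text_path, dst_path):
    """Rebuild the binary file from its base64 text representation."""
    with open(text_path, "r", encoding="ascii") as src:
        raw = base64.b64decode(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(raw)


# Round trip on sample bytes standing in for a real PDF.
work_dir = tempfile.mkdtemp()
pdf_path = os.path.join(work_dir, "somefile.pdf")
txt_path = os.path.join(work_dir, "code.txt")
new_path = os.path.join(work_dir, "new.pdf")

original = b"%PDF-1.4\n\x00\xff\xfe binary payload\r\n%%EOF\n"
with open(pdf_path, "wb") as f:
    f.write(original)

encode_file_to_text(pdf_path, txt_path)
decode_text_to_file(txt_path, new_path)

with open(new_path, "rb") as f:
    restored = f.read()
print(restored == original)
```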
Just to add: in my case, I was downloading the PDF file from an API, and response.content came base64-encoded. I also didn't need to write it line by line.
I only needed to decode the bytes first:
import base64
import requests
response = requests.get(self.download_url,
                        allow_redirects=True,
                        headers=headers,
                        params=query_params)
# Avoid naming the variable "bytes", which would shadow the built-in type.
pdf_bytes = base64.b64decode(response.content)
with open('file.pdf', 'wb') as f:
    f.write(pdf_bytes)

Use codecs to read file with correct encoding: TypeError

I need to read from a file line by line. I also need to make sure the encoding is handled correctly.
I wrote the following code:
#!/bin/bash
import codecs
filename = "something.x10"
f = open(filename, 'r')
fEncoded = codecs.getreader("ISO-8859-15")(f)
totalLength = 0
for line in fEncoded:
    totalLength+=len(line)
print("Total Length is "+totalLength)
This code does not work on all files; on some files I get:
Traceback (most recent call last):
File "test.py", line 11, in <module>
for line in fEncoded:
File "/usr/lib/python3.2/codecs.py", line 623, in __next__
line = self.readline()
File "/usr/lib/python3.2/codecs.py", line 536, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python3.2/codecs.py", line 480, in read
data = self.bytebuffer + newdata
TypeError: can't concat bytes to str
I'm using Python 3.3, and the script must work with this Python version.
What am I doing wrong? I was not able to find out which files work and which do not; even some plain ASCII files fail.
You are opening the file in non-binary (text) mode. If you read from it, you get a string that has already been decoded according to your default encoding (http://docs.python.org/3/library/functions.html?highlight=open%20builtin#open).
The codecs StreamReader needs a byte stream (http://docs.python.org/3/library/codecs#codecs.StreamReader).
So this should work:
import codecs
filename = "something.x10"
f = open(filename, 'rb')
f_decoded = codecs.getreader("ISO-8859-15")(f)
total_length = 0
for line in f_decoded:
    total_length += len(line)
print("Total Length is " + str(total_length))
or you can use the encoding parameter on open:
f_decoded = open(filename, mode='r', encoding='ISO-8859-15')
The reader returns decoded data, so I changed your variable name to reflect that. Also, consider PEP 8 as a guide for formatting and coding style.
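The encoding-parameter variant can be sketched end to end as follows. The function name total_decoded_length and the generated sample file are illustrative; the sketch assumes the file is valid ISO-8859-15 and uses a with block so the file is closed automatically.

```python
import os
import tempfile


def total_decoded_length(path, encoding="ISO-8859-15"):
    """Sum the lengths of all decoded lines in a text file."""
    total_length = 0
    with open(path, mode="r", encoding=encoding) as f_decoded:
        for line in f_decoded:
            total_length += len(line)
    return total_length


# Write a small sample containing a non-ASCII character (the euro sign).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "something.x10")
with open(sample_path, "wb") as f:
    f.write("10 \u20ac\nabc\n".encode("ISO-8859-15"))

total = total_decoded_length(sample_path)
print("Total Length is", total)
```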
