GZip and output file - python

I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
with gzip.GzipFile(fileobj=out, mode="w") as f:
File "C:\Python38\lib\gzip.py", line 204, in __init__
self._write_gzip_header(compresslevel)
File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
What I want to do is create a JSON file and gzip it in place in memory before saving the gzipped file to the filesystem (Windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.

You have to use bytes everywhere when working with gzip, not strings and text. First, use BytesIO instead of StringIO. Second, the mode should be 'wb' for bytes instead of 'w' (which is for text); likewise 'ab' instead of 'a' when appending. The 'b' character means "bytes". Full corrected code below:
from io import BytesIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode='wb') as f:
    f.write(str.encode(jsonString))

currenttimestamp = '2021-01-29'

# Write the file once finished rather than streaming it.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
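As a side note (not part of the original answer): if the whole payload is already in memory, gzip.compress can replace the BytesIO plumbing entirely. A minimal sketch using the same names as above:

import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'
currenttimestamp = '2021-01-29'

# gzip.compress returns the complete gzip stream as bytes in one call.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(gzip.compress(jsonString.encode()))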


os.path.getsize() returns "0"

Getting "0" output, when I am trying to use os.path.getsize()
Not sure what's wrong, using PyCharm, I see that the file was created and the "comments" were added to the file. But PyCharm shows the output "0" :(
Here is the code:
import os

def create_python_script(filename):
    comments = "# Start of a new Python program"
    with open(filename, "w") as file:
        file.write(comments)
        filesize = os.path.getsize(filename)
    return filesize

print(create_python_script("program.py"))
Please point out the error I don't see.
You're getting size 0 due to the buffering behaviour of the write function.
When you call write, the content goes to an internal buffer first. This buffer is kept for performance reasons (to limit overly frequent I/O calls).
So at this point you can't be sure that the content has actually been written to the file on disk when you call getsize:
with open(filename, "w") as file:
    file.write(comments)
    filesize = os.path.getsize(filename)
To ensure that the content is written to the file before calling getsize, you can call the flush method.
flush empties the internal buffer and writes its content through to the file on disk.
with open(filename, "w") as file:
    file.write(comments)
    file.flush()
    filesize = os.path.getsize(filename)
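Note that flush only pushes Python's internal buffer to the operating system, which is enough for os.path.getsize to see the bytes; if you also need them physically committed to disk, os.fsync goes one step further. A minimal sketch, going slightly beyond the original answer:

import os

with open(filename, "w") as file:
    file.write(comments)
    file.flush()             # push Python's buffer to the OS
    os.fsync(file.fileno())  # ask the OS to commit its cache to disk
    filesize = os.path.getsize(filename)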
Or, a better way would be to first close the file and then call the getsize method.
with open(filename, "w") as file:
    file.write(comments)
filesize = os.path.getsize(filename)
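Putting that together, a corrected version of the original function might look like this sketch; it should print 31, the length of the comment string:

import os

def create_python_script(filename):
    comments = "# Start of a new Python program"
    with open(filename, "w") as file:
        file.write(comments)
    # The file is closed (and therefore flushed) when the with block exits,
    # so getsize now reports the real on-disk size.
    return os.path.getsize(filename)

print(create_python_script("program.py"))  # 31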

Not able to fix file handling issue in python

I wrote Python code to search for a pattern in a Tcl file and replace it with a string. It prints the expected output, but the change is not saved back to the Tcl file.
import re
import fileinput

filename = open("Fdrc.tcl", "r+")
for i in filename:
    if i.find("set qa_label") != -1:
        print(i)
        a = re.sub(r'REL.*', 'harsh', i)
        print(a)
filename.close()
Actual result:
set qa_label REL_ts07n0g42p22sadsl01msaA04_2018-09-11-11-01
set qa_label harsh
The expected result is that the file itself reflects the replacement shown above, but it does not.
You need to actually write your changes back to disk if you want to see them there. As @ImperishableNight says, you don't want to do this by writing to a file you're also reading from; you want to write to a new file. Here's an expanded version of your code that does that:
import re

fin = open("/tmp/Fdrc.tcl")
fout = open("/tmp/FdrcNew.tcl", "w")
for i in fin:
    if i.find("set qa_label") != -1:
        print(i)
        a = re.sub(r'REL.*', 'harsh', i)
        print(a)
        fout.write(a)
    else:
        fout.write(i)
fin.close()
fout.close()
Input and output file contents:
> cat /tmp/Fdrc.tcl
set qa_label REL_ts07n0g42p22sadsl01msaA04_2018-09-11-11-01
> cat /tmp/FdrcNew.tcl
set qa_label harsh
If you wanted to overwrite the original file, you would read the entire file into memory and close the input file stream, then open the file again for writing and write the modified content back to it.
Here's a cleaner version of your code that does this: it produces an in-memory result and then writes it out using a new file handle. I'm still writing to a different file here, because that's usually what you want to do, at least while you're testing your code. Simply change the second file name to match the first and this code will overwrite the original file with the modified content:
import re

lines = []
with open("/tmp/Fdrc.tcl") as fin:
    for i in fin:
        if i.find("set qa_label") != -1:
            print(i)
            i = re.sub(r'REL.*', 'harsh', i)
            print(i)
        lines.append(i)

with open("/tmp/FdrcNew.tcl", "w") as fout:
    fout.writelines(lines)
Open a tempfile for the updated contents while reading the original file. After modifying the lines, write the buffered content back to the file:
import re
from tempfile import TemporaryFile

with TemporaryFile() as t:
    with open("Fdrc.tcl", "r") as file_reader:
        for line in file_reader:
            if line.find("set qa_label") != -1:
                t.write(str.encode(re.sub(r'REL.*', 'harsh', line)))
            else:
                t.write(str.encode(line))
    t.seek(0)
    with open("Fdrc.tcl", "wb") as file_writer:
        file_writer.writelines(t)
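Incidentally, the question's code imports fileinput without ever using it, and that module can actually do this whole job by itself: with inplace=True it redirects stdout into the file being read, so printed lines replace the original content. A minimal sketch, assuming the same file and pattern:

import fileinput
import re

# inplace=True makes fileinput move the original file aside and redirect
# stdout into a replacement, so whatever we print becomes the new content.
with fileinput.input("Fdrc.tcl", inplace=True) as f:
    for line in f:
        if "set qa_label" in line:
            line = re.sub(r'REL.*', 'harsh', line)
        print(line, end="")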

Converting cloud-init logs to json using a script

I am trying to convert the cloud-init logs to JSON so that Filebeat can pick them up and send them to Kibana. I want to do this with a shell script or Python script. Is there any script that converts such logs to JSON?
My Python script is below:
import json
import subprocess

filename = "/home/umesh/Downloads/scripts/cloud-init.log"

def convert_to_json_log(line):
    """ convert each line to json format """
    log = {}
    log['msg'] = line
    log['logger-name'] = 'cloud-init'
    log['ServiceName'] = 'Contentprocessing'
    return json.dumps(log)

def log_as_json(filename):
    f = subprocess.Popen(['cat', '-F', filename],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    while True:
        line = f.stdout.readline()
        log = convert_to_json_log(line)
        print log
        with open("/home/umesh/Downloads/outputs/cloud-init-json.log", 'a') as new:
            new.write(log + '\n')

log_as_json(filename)
The script produces a file in JSON format, but the msg field is an empty string. I want each line of the log to become the message string.
Firstly, try reading the raw log file using Python's built-in functions rather than running OS commands through subprocess, because it:
- will be more portable (works across OSes)
- is faster and less prone to errors
Re-writing your log_as_json function as follows worked for me:
inputfile = "cloud-init.log"
outputfile = "cloud-init-json.log"

def log_as_json(filename):
    # Open cloud-init log file for reading
    with open(inputfile, 'r') as log:
        # Open the output file to append json entries
        with open(outputfile, 'a') as jsonlog:
            # Read line by line
            for line in log.readlines():
                # Convert to json and write to file
                jsonlog.write(convert_to_json_log(line) + "\n")
After spending some time preparing a customised script, I finally came up with the one below. It might be helpful to others.
import json

def convert_to_json_log(line):
    """ convert each line to json format """
    log = {}
    log['msg'] = json.dumps(line)
    log['logger-name'] = 'cloud-init'
    log['serviceName'] = 'content-processing'
    return json.dumps(log)

# Open the file with read only permit
f = open('/var/log/cloud-init.log', "r")
# use readlines to read all lines in the file
# The variable "lines" is a list containing all lines in the file
lines = f.readlines()
# close the file after reading the lines.
f.close()

jsonData = ''
for line in lines:
    jsonLine = convert_to_json_log(line)
    jsonData = jsonData + "\n" + jsonLine

with open("/var/log/cloud-init/cloud-init-json.log", 'w') as new:
    new.write(jsonData)
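If you also want the timestamp and log level as separate JSON fields rather than one opaque msg string, a regex split is one option. A sketch, assuming the default cloud-init line format ("timestamp - module[LEVEL]: message"); verify it against your actual log, since lines that don't match simply pass through as plain messages:

import json
import re

# Assumed format: "2021-01-29 12:34:56,789 - util.py[DEBUG]: message text"
LINE_RE = re.compile(r'^(?P<ts>\S+ \S+) - (?P<module>\S+)\[(?P<level>\w+)\]: (?P<msg>.*)$')

def convert_to_json_log(line):
    log = {'logger-name': 'cloud-init', 'serviceName': 'content-processing'}
    match = LINE_RE.match(line.rstrip('\n'))
    if match:
        log.update(match.groupdict())   # ts, module, level, msg as separate fields
    else:
        log['msg'] = line.rstrip('\n')  # unparseable lines pass through whole
    return json.dumps(log)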

Open binary file in zip archive as ZipExtFile

I'm trying to access a binary stream (via a ZipExtFile object) from a data file contained in a Zip archive. To incrementally read in a text file object from the archive, this would be fairly straightforward:
with ZipFile("myziparchive.zip", 'r') as ziparchive:
    with ziparchive.open("mybigtextfile.txt", 'r') as txtfile:
        for line in txtfile:
            ....
Ideally the byte stream equivalent would be something like:
with ZipFile("myziparchive.zip", 'r') as ziparchive:
    with ziparchive.open("mybigbinary.bin", 'rb') as binfile:
        while not EOF:
            binchunk = binfile.read(MYCHUNKSIZE)
            ....
Unfortunately, ZipFile.open doesn't seem to support reading binary data to a ZipExtFile object. From the docs:
    The mode parameter, if included, must be one of the following: 'r' (the default), 'U', or 'rU'.
Given this constraint, how best to incrementally read in the binary file directly from the archive? Since the uncompressed file is quite large I'd like to avoid extracting it first.
I managed to solve the issue that I described in my comment on the OP. I have adapted it here for your purpose, but I think there is probably a way to just change the encoding of chunk_str to avoid using BytesIO.
Anyway - here's my code if it helps:
from io import BytesIO
from zipfile import ZipFile

MYCHUNKSIZE = 10
archive_file = r"test_resources\0000232514_bom.zip"
src_file = r"0000232514_bom.xls"
no_of_chunks_to_read = 10

with ZipFile(archive_file, 'r') as zf:
    with zf.open(src_file) as src_f:
        while no_of_chunks_to_read > 0:
            chunk_str = src_f.read(MYCHUNKSIZE)
            chunk_stream = BytesIO(chunk_str)
            chunk_bytes = chunk_stream.read()
            print(type(chunk_bytes), len(chunk_bytes), chunk_bytes)
            if len(chunk_str) < MYCHUNKSIZE:
                # End of file
                break
            no_of_chunks_to_read -= 1
For line by line reading:
with ZipFile("myziparchive.zip", 'r') as ziparchive:
    with ziparchive.open("mybigtextfile.txt", 'r') as binfile:
        for line in binfile:
            line = line.decode()  # bytes to str
            ...
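For what it's worth, in current Python 3 ZipFile.open always returns a binary file-like object (there is no separate 'rb' mode; 'r' already yields bytes), so the BytesIO round-trip above isn't needed and chunked binary reading is just a loop over read(). A minimal sketch:

from zipfile import ZipFile

CHUNKSIZE = 64 * 1024  # 64 KiB per read; adjust to taste

with ZipFile("myziparchive.zip", "r") as ziparchive:
    with ziparchive.open("mybigbinary.bin") as binfile:
        while True:
            chunk = binfile.read(CHUNKSIZE)
            if not chunk:  # read() returns b"" at end of file
                break
            # process the bytes in chunk here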

AttributeError when rewriting code so it works in Python 3.4

I am trying to alter the code below so that it works in Python 3.4. However, I get AttributeError: 'int' object has no attribute 'replace' on the line line.replace(",", "\t"). I am trying to understand how to rewrite this part of the code.
import os
import gzip
from io import BytesIO
import pandas as pd

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file="
filename = "data/irt_euryld_d.tsv.gz"
outFilePath = filename.split('/')[1][:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = BytesIO()
compressedFile.write(response.read())
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')

with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read().decode("utf-8", errors="ignore"))

# Now have to deal with tsv file
import csv
csvout = 'C:/Sidney/ECB.tsv'
outfile = open(csvout, "w")

with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)

outfile.close()
Thank You
You're writing text (with the 'w' mode), but the file you're getting that content from is being read as bytes (with the 'rb' mode). Open that file with 'r' instead.
And then, as Sebastian suggests, just iterate over the file object with for line in f:. Using f.read() reads the entire thing into a single string, so iterating over that means iterating over each character of the file. Strictly speaking, since all you're doing is replacing a single character, the end result would be identical, but iterating over the file object is preferred (it uses less memory).
Let's make better use of the with construct and go from this:
outfile = open(csvout, "w")
with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)
outfile.close()
to this:
with open(outFilePath, "r") as f, open(csvout, 'w') as outfile:
    for line in f:
        outfile.write(line.replace(",", "\t"))
Also, I should note that this is much easier to do with find-and-replace in your text editor of choice (I like Notepad++).
Try rewriting it as this:
with open(outFilePath, "r") as f:
    for line in f:  # don't iterate over the entire file at once; go line by line
        line = line.replace(",", "\t")  # replace returns a new string, so keep it
        outfile.write(line)
Originally you were opening it as a 'read binary' (rb) file and iterating over f.read(), which is a bytes object; iterating over bytes yields integers, not the strings you were expecting. In Python, int objects do not have a .replace() method, but str objects do, and that is the cause of your AttributeError. Opening the file in regular 'read' (r) mode and iterating line by line gives you strings, which have .replace() available to call.
Related post on return type of .read() here and more information available from the docs here.
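As a final aside, the code imports csv but never uses it; the csv module handles the comma-to-tab conversion more robustly than str.replace, since a quoted field containing a comma won't be split incorrectly. A minimal sketch, with hypothetical file names standing in for outFilePath and csvout:

import csv

with open("irt_euryld_d.tsv", "r", newline="") as f, \
        open("ECB.tsv", "w", newline="") as outfile:
    reader = csv.reader(f)                        # splits on commas, honours quoting
    writer = csv.writer(outfile, delimiter="\t")  # writes tab-separated rows
    for row in reader:
        writer.writerow(row)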
