Writing append only gzipped log files in Python - python

I am building a service where I log plain text format logs from several sources (one file per source). I do not intend to rotate these logs as they must be around forever.
To make these forever around files smaller I hope I could gzip them in fly. As they are log data, the files compress very well.
What is a good approach in Python to write append-only gzipped text files, so that the writing can be later resumed when service goes on and off? I am not that worried about losing few lines, but if gzip container itself breaks down and the file becomes unreadable that's no no.
Also, if it's no go, I can simply write them in as plain text without gzipping if it's not worth of the hassle.

Note: On unix systems you should seriously consider using an external program, written for this exact task:
logrotate (rotates, compresses, and mails system logs)
You can set the number of rotations so high, that the first file would be deleted in 100 years or so.
In Python 2, logging.FileHandler takes an keyword argument encoding that can be set to bz2 or zlib.
This is because logging uses the codecs module, which in turn treats bz2 (or zlib) as encoding:
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2") as fh:
... fh.write("Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World
Python 3 version (although the docs mention bz2 as alias, you'll actually have to use bz2_codec - at least w/ 3.2.3):
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2_codec") as fh:
... fh.write(b"Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World

Related

Decompressing bz2 files on Windows

I am trying to decompress a bz2 file with below code snippet which is provided in various places:
bz2_data = bz2.BZ2File(DATA_FILE+".bz2").read()
open(DATA_FILE, 'wb').write(bz2_data)
However, I am getting a much smaller file than I expect.
When I extract the file with 7z GUI I am receiving a file with a size of 248MB. However, with above code the file I get is 879kb.
When I read the extracted XML file, I can see that rest of the file is missing as I expect.
I am running anaconda on Windows machine, and as far as understand bz2 reaches an EOF before file actually ends.
By the way, I already run into this and this both did no good.
If this is a multi-stream file, then Python's bz2 module (before 3.3) doesn't support it:
Note This class does not support input files containing multiple streams (such as those produced by the pbzip2 tool). When reading such an input file, only the first stream will be accessible. If you require support for multi-stream files, consider using the third-party bz2file module (available from PyPI). This module provides a backport of Python 3.3’s BZ2File class, which does support multi-stream files.
An alternative, drop-in replacement: bz2file should work though.
If it is a multistream file, you have to set mode to "r" or it will silently fail (e.g. output the compressed data as is).
This should do what you want:
with open(out_file_path) as out_file, BZ2File(bz2_file_path, "r") as bz2_file:
for data in iter(lambda: bz2_file.read(100 * 1024), b""):
out_file.write(data)
From the documentation:
If mode is 'r', the input file may be the concatenation of multiple compressed streams.
https://docs.python.org/3/library/bz2.html#bz2.BZ2File

subprocess.run does run properly pandoc

If you run pandoc directly with minimal example no problem:
$ cat workfile.md
This is a **test** see ![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png)
$ pandoc workfile.md
<p>This is a <strong>test</strong> see <img src="https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png" alt="alt text" /></p>
But if you call it via subprocess.run, then it fails. This minimal example:
import subprocess, os
path = 'workfile.md'
contents = "This is a **test** see ![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png)"
with open(path, 'w') as f:
f.write(contents)
pbody = subprocess.run(["pandoc", "{}".format(path)], check=True, stdout=subprocess.PIPE)
print("**** pbody: ", pbody)
gives us
**** pbody: CompletedProcess(args=['pandoc', 'workfile.md'], returncode=0, stdout=b'\n')
One of the things that Python (and all other programming languages) do to increase performance of common operations is to maintain a buffer for things like file printing. Depending on how much you write to a file, not all of it will get written immediately, which lets the language reduce the amount of (slow) operations it has to do to actually get things on the disk. If you call f.flush() after f.write(contents), you should see pandoc pick up the actual contents of the file.
There's a further layer of buffering that's also worth noting- your operating system may have an updated version of the file in memory, but may not have actually written it to disk. If you're writing a server, you may also want to call os.fsync, which will force the OS to write it to disk.

How to process huge text files that contain EOF / Ctrl-Z characters using Python on Windows?

I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.
On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.
Here are my ideas so far:
Read the file in binary mode, throwing out bytes that equal chr(26). This would work, but it would take approximately forever.
Use something like sed to eliminate the EOF characters. Unfortunately, as far as I can tell, sed on Windows has the same problem and will quit when it sees the EOF.
Use some kind of Notepad program and do a find-and-replace. But it turns out that Notepad-type programs don't cope well with 15GB files.
My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?
It's easy to use Python to delete the DOS EOF chars; for example,
def delete_eof(fin, fout):
BUFSIZE = 2**15
EOFCHAR = chr(26)
data = fin.read(BUFSIZE)
while data:
fout.write(data.translate(None, EOFCHAR))
data = fin.read(BUFSIZE)
import sys
ipath = sys.argv[1]
opath = ipath + ".new"
with open(ipath, "rb") as fin, open(opath, "wb") as fout:
delete_eof(fin, fout)
That takes a file path as its first argument, and copies the file but without chr(26) bytes to the same file path with .new appended. Fiddle to taste.
By the way, are you sure that DOS EOF characters are your only problem? It's hard to conceive of a sane way in which they could end up in files intended to be treated as text files.

pickle.load Not Working

I got a file that contains a data structure with test results from a Windows user. He created this file using the pickle.dump command. On Ubuntu, I tried to load this test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error inside pickle module that no module named "my_module".
May the problem be due to corruption in the file, or maybe moving from Widows to Linux is the couse?
The problem lies in pickle's way of handling newline characters. Some of the line feed characters cripple module names in dumped / loaded data.
Storing and loading files in binary mode may help, but I was having trouble with them too. After a long time reading docs and searching I found that pickle handles several different "protocols" for storing data and due to backward compatibility it uses the oldest one: protocol 0 - the original ASCII protocol.
User can select modern protocol by specifing the protocol keyword while storing data in dump file, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or, by choosing the highest protocol available (currently 2)
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
Protocol version is stored in dump file, so Load() function handles it automaticaly.
Regards
You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this and this questions for an explanation.

AppDailySales: Works, but the downloaded gzip file is corrupted

I am trying to use the appdailysales.py module to download daily our iPhone apps. I am a .NET developer, so I tried running this using IronPython in a C# solution using the following code:
using IronPython.Hosting;
var ipy = Python.CreateRuntime();
dynamic appSales = ipy.UseFile("appdailysales.py");
appSales.main();
Because I didn't have gzip, I took out the references to that module. I was going to use the GZipStream C# class to decompress the file (Apple, provides their downloads as .gz files). So, I commented out lines 75 and 429-435.
I have tried executing appdailysales.py in my C# solution, directly from IronPython and using Python 2.7 (installed ActivePython last night); all with the same results: When I try to open the .gz file using 7zip, I get the following error:
CRC Failed ... file is broken
When I try using the GZipStream class I get:
The CRC in GZip footer does not match the CRC calculated from the decompressed data
If I download the .gz file manually, I can decompress the file just fine using 7Zip or GZipStream.
I am fluent in C#, but new to Python. Any help you can provide would be much appreciated.
Thanks for your time.
Looks like line 444 is the problem. Here are lines 444-446:
downloadFile = open(filename, 'w')
downloadFile.write(filebuffer)
downloadFile.close()
At this stage, IF you have deleted lines 429-435 OR selected not to unzip, then filebuffer refers to the raw gzipped stream that you got from the web. The output file is opened in TEXT mode, and you are on Windows, so every \n in the BINARY gzipped stream will be converted to \r\n -- CORRUPTION, like the error message said.
So: for the module to be used portably on both Windows and other platforms, the open mode must be "wb" (b for binary). If the gunzipped result file is also a binary file, "wb" can be hardcoded in the open call. However if the gunzipped file is a text file (meant to be capable of being opened in a text editor), then you need just "w" for that purpose, and you should set a variable mode to either "wb" or "w" as appropriate, and use mode in the open call.
Big question: I understand why you removed the gzip references for IronPython usage. Did you remove those lines for Python 2.7? Or did you run it under Python 2.7 with those lines still in, but set options.unzipFile to False?

Categories