My program does a lot of file processing, and as the files are large I prefer to write them as GZIP. One challenge is that I often need to read files as they are being written. This is not a problem without GZIP compression, but when compression is on, the reading complains about failed CRC, which I presume might have something to do with compression info not being flushed properly when writing. Is there any way to use GZIP with Python such that, when I write and flush to a file (but not necessarily close the file), that it can be read as well?
Flushing a GzipFile does write the compressed data to the file, but the CRC and length trailer are only written on close(), so a normal gzip reader will complain until the file has been closed. If you need to write large amounts of data, you could also try a database such as PostgreSQL or MySQL, where you can define a table with compression (archive, compressed storage engines): you insert and select rows normally and the database software does the rest for you (compression on inserts, decompression on selects).
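As a concrete sketch of the behavior and a possible workaround (the file name here is made up): the gzip header goes out as soon as writing starts, flush() emits the compressed bytes, and only close() writes the trailer. A reader that tolerates the missing trailer, such as zlib.decompressobj, can therefore read mid-stream where gzip.open() would fail:

```python
import gzip
import zlib

# Writer stays open; GzipFile.flush() uses Z_SYNC_FLUSH, so the compressed
# bytes reach the file even though the CRC/length trailer isn't written yet.
writer = gzip.open("live.gz", "wb")
writer.write(b"hello world\n")
writer.flush()

# gzip.open() on the reader side would raise EOFError (no trailer yet);
# zlib.decompressobj tolerates the truncated stream.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = expect a gzip header
with open("live.gz", "rb") as f:
    data = d.decompress(f.read())
print(data)  # b'hello world\n'

writer.close()  # writes the trailer; ordinary gzip readers work from now on
```

The reader can be called repeatedly on the growing file, feeding new bytes into the same decompressobj.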
I am trying to load a .csv.gz file into Dask.
Reading it this way loads it successfully, but into only one partition:
dd.read_csv(fp, compression="gzip")
My workaround for now is to unzip the file using gzip, load it into Dask, then remove it when I am finished. Is there a better way?
There is a very similar question here:
The fundamental reason here is that formats like bz2, gz or zip do not allow random access; the only way to read the data is from the start.
The recommendation there is:
The easiest solution is certainly to stream your large files into several compressed files each (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file will become one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory as you process the data with Dask.
As an alternative option, if disk space is consideration, you can use .to_parquet instead of .to_csv upstream, since parquet data is compressed by default.
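The splitting step recommended above might look like this (a sketch; the file names, chunk size, and the assumption that the first line is a header are all made up for illustration):

```python
import gzip
import itertools

def split_gzip_csv(src, dst_template, lines_per_chunk=100_000):
    # Stream one large .csv.gz into several smaller gzip files so that
    # Dask can give each file its own partition. Repeats the header line
    # at the top of every output file.
    with gzip.open(src, "rt") as f:
        header = f.readline()
        for i in itertools.count():
            chunk = list(itertools.islice(f, lines_per_chunk))
            if not chunk:
                break
            with gzip.open(dst_template.format(i), "wt") as out:
                out.write(header)
                out.writelines(chunk)

# Afterwards: dd.read_csv("part-*.csv.gz", compression="gzip")
```

Each output file stays well below memory limits as long as lines_per_chunk is chosen sensibly for the row width.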
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, the ZIP file format needs to be downloaded in full before the "central directory" data identifying the file entries can be read; however, in our case we can assume there is exactly one large text file that was zipped, so we could begin extracting and parsing data immediately without waiting for the whole ZIP file to buffer.
If we were using C#, we could use https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream (implementation here) which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist() which seeks to the central directory data, then open(name) which seeks back to the file entry.
Many other examples on StackOverflow recommend using BytesIO(response.content) which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the entire thing to memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields the bytes of a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you just want the first file, you can break out after it:
for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each file has a "local" header that contains its name, and it can be worked out where the compressed data for any member file ends (from information in the local header if it's there, or from the compressed data itself). While there is more information in the central directory at the end, if you just need the names and bytes of the files, it is possible to start unzipping a ZIP file that contains multiple files as it's downloading.
I can't claim it's possible in absolutely all cases: technically ZIP allows many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the one most commonly used, it is possible.
It's even possible to download one specific file from a .zip without downloading the whole archive. All you need is a server that allows reading bytes in ranges: fetch the end-of-central-directory record (to learn the size of the central directory), fetch the central directory (to learn where the file starts and ends), and then fetch the proper bytes and decompress them.
Using OnlineZip you can handle the file as if it were a local file. Even the API is identical to Python's ZipFile.
[full disclosure: I'm the author of the library]
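The range-request recipe above can be sketched with only the standard library. Here `read_range` stands in for an HTTP Range request (over HTTP it would send a `Range: bytes=start-end` header); the sketch assumes no archive comment, no ZIP64, and DEFLATE or stored members:

```python
import struct
import zlib

def extract_member(read_range, total_size, wanted_name):
    # read_range(start, end) must return bytes [start, end) of the archive.
    # 1. End-of-central-directory record (22 bytes when there is no comment).
    eocd = read_range(total_size - 22, total_size)
    if eocd[:4] != b"PK\x05\x06":
        raise ValueError("EOCD not found (archive comment or ZIP64?)")
    cd_size, cd_offset = struct.unpack("<II", eocd[12:20])

    # 2. Central directory: one 46-byte entry (+ variable fields) per member.
    cd = read_range(cd_offset, cd_offset + cd_size)
    pos = 0
    while pos < len(cd):
        method, = struct.unpack("<H", cd[pos + 10:pos + 12])
        comp_size, = struct.unpack("<I", cd[pos + 20:pos + 24])
        n_len, e_len, c_len = struct.unpack("<HHH", cd[pos + 28:pos + 34])
        header_offset, = struct.unpack("<I", cd[pos + 42:pos + 46])
        name = cd[pos + 46:pos + 46 + n_len].decode()
        if name == wanted_name:
            # 3. Local header (30 bytes + name + extra), then the member bytes.
            local = read_range(header_offset, header_offset + 30)
            ln, le = struct.unpack("<HH", local[26:30])
            start = header_offset + 30 + ln + le
            data = read_range(start, start + comp_size)
            if method == 8:              # DEFLATE
                return zlib.decompress(data, -15)
            return data                  # method 0: stored
        pos += 46 + n_len + e_len + c_len
    raise KeyError(wanted_name)
```

Three small ranged reads (plus one for the data) replace downloading the whole archive, which is exactly the trick libraries like OnlineZip package up.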
I've written a script that fetches bitcoin data and saves it in .txt files, or, in the case where the .txt files already exist, updates them. The .txt files contain nodes and the relationships connecting the nodes, for neo4j.
At the beginning of the script:
It checks whether the files exist, so it opens them and appends new lines OR
In case the files do not exist, the script creates them and starts appending lines.
The .txt files are constantly open while the script writes the new data. They close when all the data has been written or when I terminate the execution.
My question is:
Should I open, write, close each .txt file for each iteration and for each .txt file?
or
Should I keep it the way it is now: open the .txt files, do all the writing, and close them when the writing is done?
I am saving data from 6013 blocks. Which way would minimize risk of corrupting the data written in the .txt files?
Keeping files open will be faster. But in the comments you mentioned that "loss of data previously written is not an option". The probability of corrupting an open file is higher, so opening and closing the file on each iteration is more reliable.
There is also the option to keep data in a buffer and write/append the buffer to the file when all data has been received, or on a user/system interrupt or network timeout.
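A sketch of that buffering idea (the file name, flush threshold, and record format are made up): collect records in memory and append them in one short-lived open/write/close, so the file is closed almost all of the time:

```python
def flush_buffer(path, buffer):
    # Append all buffered lines in one short open/write/close cycle.
    if not buffer:
        return
    with open(path, "a") as f:
        f.writelines(buffer)
    buffer.clear()

buffer = []
for block in range(3):                 # stand-in for the 6013-block loop
    buffer.append(f"block {block}\n")
    if len(buffer) >= 2:               # flush every N records
        flush_buffer("nodes.txt", buffer)
flush_buffer("nodes.txt", buffer)      # flush the final partial buffer
```

The window in which a crash can corrupt the file shrinks to the brief writelines call, at the cost of losing at most one buffer's worth of unflushed records.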
I think keeping the file open will be more efficient, because Python won't need to look up and open the file every time you want to read/write it.
I guess it should look like this:
with open(filename, "a") as file:
    while True:
        data = ...  # get data
        file.write(data)
"Run a benchmark and see for yourself" would be the typical answer for this kind of question.
Nevertheless, opening and closing a file does have a cost. Python needs to allocate memory for the buffer and the data structures associated with the file, and call some operating-system functions, e.g. the open syscall, which in turn looks the file up in the cache or on disk.
On the other hand there is a limit on the number of files a program, the user, the whole system, etc can open at the same time. For example on Linux, the value in /proc/sys/fs/file-max denotes the maximum number of file-handles that the kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit (source).
If your program runs in such a restrictive environment then it would be good to keep the file open only when needed.
So before I start: I know this is not the proper way to go about this, but it is the only method I have for accessing the data I need on the fly.
I have a system which is writing telemetry data to a .csv file while it is running. I need to see some of this data while it is being written but it is not being broadcast in a manner which allows me to do this.
Question: how do I safely read from a CSV file while it is being written to?
Typically I would open the file and look at the values but I am hoping to be able to write a python script which is able to examine the csv for me and report the most recent values written without compromising the systems ability to write to the file.
I have absolutely NO access to the system or to the manner in which it writes the CSV; I can only see that the CSV file is updated as the system runs.
Again I know this is NOT the right way to do this but any help you could provide would be extremely helpful.
This is mostly being run in a Windows environment
You can do something like:
tail -f csv_file | python your_script.py
and read from sys.stdin
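A minimal consumer for that pipeline might look like this (a sketch; in the pipeline you would pass sys.stdin instead of the in-memory stream used here for the demo):

```python
import csv
import io

def follow(stream):
    # csv.reader yields each row as soon as a complete line is available,
    # so this works on a stream that is still growing.
    for row in csv.reader(stream):
        yield row

# Demo with an in-memory stream; with the tail pipeline, pass sys.stdin.
for row in follow(io.StringIO("t1,10\nt2,20\n")):
    print("latest:", row)
```

Because the generator yields row by row, the script reports each new record as tail delivers it, without ever holding the whole file in memory.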
I am writing a Python logger script which writes to a CSV file in the following manner:
Open the file
Append data
Close the file (I think this is necessary to save the changes, to be safe after every logging routine.)
PROBLEM:
The file is very much accessible through Windows Explorer (I'm using XP). If the file is opened in Excel, access to it is locked by Excel. When the script tries to append data, it obviously fails and then aborts altogether.
OBJECTIVE:
Is there a way to lock a file using Python so that any access to it remains exclusive to the script? Or perhaps my methodology is poor in the first place?
Rather than closing and reopening the file after each access, just flush its buffer:
theloggingfile.flush()
This way, you keep it open for writing in Python, which should lock the file from other programs opening it for writing. I think Excel will be able to open it as read-only while it's open in Python, but I can't check that without rebooting into Windows.
EDIT: I don't think you need the step below. .flush() should send it to the operating system, and if you try to look at it in another program, the OS should give it the cached version. Use os.fsync to force the OS to really write it to the hard drive, e.g. if you're concerned about sudden power failures.
os.fsync(theloggingfile.fileno())
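Put together with the logging loop from the question, the pattern might look like this (the file name and record format are made up):

```python
import os

log = open("telemetry.csv", "a")   # keep it open so Python holds the write handle
for i in range(3):
    log.write(f"sample,{i}\n")
    log.flush()                    # hand the bytes to the OS after each record
    os.fsync(log.fileno())         # optional: force them to disk (power-failure safety)
log.close()
```

flush() after every record is cheap; fsync() is much slower and usually only worth it when losing the last few records on power loss is unacceptable.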
Windows does in fact support file locking: an application can open a file with restrictive share modes, which is exactly how Excel prevents your script from appending. Python's built-in open() does not request an exclusive share mode, so other applications can still open your file (the msvcrt module does expose byte-range locks via msvcrt.locking, if you need them).
You might want to write to a temporary file first (one that Excel does not know about) and replace the original file with it later on.
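A sketch of that temporary-file approach using the standard library (the function and file names are hypothetical): write the update somewhere Excel is not looking, then swap it into place in one step.

```python
import os
import tempfile

def publish(path, text):
    # Write to a temp file in the same directory, then atomically swap it
    # into place. os.replace is atomic on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, path)

publish("report.csv", "a,b\n1,2\n")
```

Note that on Windows the final replace can still fail if another program holds the destination open with a restrictive share mode, so the script should be prepared to retry.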