uncompressing tar.Z file with python?

uncompressing tar.Z file with python? - python

I need to write a python script that retrieves tar.Z files from an FTP server, and uncompress them on a windows machine. tar.Z, if I understood correctly is the result of a compress command in Unix.
Python doesn't seem to know how to handle these, it's not gz, nor bz2 or zip. Does anyone know a library that would handle these ?
Thanks in advance

If GZIP -- the application -- can handle it, you have two choices.
Try the Python gzip library. It may work.
Use subprocess Popen to run gzip for you.
It may be an InstallShield .Z file. You may want to use InstallShield to unpack it and extract the .TAR file. Again, you may be able to use subprocess Popen to process the file.
It may also be a "LZW compressed file". Look at this library, it may help.
http://www.chilkatsoft.com/compression-python.asp

Since you target a specific platform (Windows), the simplest solution may be to run gzip in a system call: http://www.gzip.org/#exe
Are there other requirements in your project that the decompression needs to be done in Python?

A plain Python module that uncompresses is inexistant, AFAIK, but it's feasible to build one, given some knowledge:
the .Z format header specification
the .Z compression format
Almost all necessary information can be found the unarchiver CompressAlgorithm. Additional info from wikipedia for adaptive LZW and perhaps the compress man page.
Basically, you read the first three bytes (first two are magic bytes) to modify your algorithm, and then start reading and decompressing.
There's a lot of bit fiddling (.Z files begin having 9-bit tokens, up to 16-bit ones and then resetting the symbol table to the initial 256+2 values), which probably you'll deal with doing binary operations (&, <<= etc).

Related

Named memory-mapped files in Python?

I'm using OpenCV to process some video data in a web service. Before calling OpenCV, the video is already loaded to a bytearray buffer, which I would like to pass to VideoCapture object:
# The following raises cv2.error because it can't convert '_io.BytesIO' to 'str' for 'filename'
cap = cv2.VideoCapture(buffer)
Unfortunately, VideoCapture() expects a string filename, not a buffer. For now, I'm saving the bytearray to a temporary file, and pass its name to VideoCapture().
Questions:
Is there a way to create named in-memory files in Python, so I can pacify OpenCV?
Alternatively, is there another OpenCV API which does support buffers?

Note: POSIX-specific! As you haven't provided OS tag, I assume it's okay.
According to this answer (and this shm_overview manpage) there is /dev/shm always present on the system. That's a tmpfs mapped in a shared (not Python process memory) memory pool, as suggested here, but the plus is that you don't need to create it, so no funny inventing of:
os.system("mount ...") or
Popen(["mount", ...]) wrappers.
Simply use tempfile.NamedTemporaryFile() like this:
from tempfile import NamedTemporaryFile
with NamedTemporaryFile(dir="/dev/shm") as file:
print(file.name)
# /dev/shm/tmp2m86e0e0
which you could then feed into OpenCV's API wrapper. Alternatively, utilize pyfilesystem as a more extensive wrapper around that device/FS.
Also, multiprocessing.heap.Arena uses it too, so if it didn't work, there'd be much more trouble present. For Windows check this implementation which uses winapi.
For the size of /dev/shm:
this is one of the size "specifications" I found,
shm.h, shm_add_rss_swap(), newseg() from Linux source code may hold more details
Judging by sudo ipcs it's most likely the way you want to utilize when sharing stuff between processes if you don't use sockets, pipes or disk.
As it's POSIX, it should work on POSIX-compliant systems, thus also on MacOS(no) or Solaris, but I have no means to try it.

Partially to answer the question: there is no way I know of in python to create named file-like objects which point to memory: that's something for an operating system to do. There is a very easy way to do something very like creating named memory mapped files in most modern *nixs: save the file to /tmp. These days /tmp is almost always a ramdisk. But of course it might be zram (basically a compressed ramdisk) and you likely want to check that first. At any rate it's better than thrashing your disk or depending on os caching.
Incidentally making a dedicated ramdisk is as easy as mount -t tmpfs -o size=1G tmpfs /path/to/tmpfs or similarly with ramfs.
Looking into it I don't think you're going to have much luck with alternative apis either: the use of filenames goes right down to cap.cpp, where we have things like:
VideoCapture::VideoCapture(const String& filename, int apiPreference) : throwOnFail(false)
{
CV_TRACE_FUNCTION();
open(filename, apiPreference);
}
It seems the python bindings are just a thin layer on top of this. But I'm willing to be proven wrong!
References
https://github.com/opencv/opencv/blob/master/modules/videoio/src/cap.cpp#L72

If VideoCapture was a regular Python object, and it accepted "file-like objects" in addition to paths, you could feed it a "file-like object", and it could read from that.
Python's StringIO and BytesIO are file-like objects in memory. Something useful to remember ;)
OpenCV specifically expects a file system path there, so that's out of the question.
OpenCV is a library for computer vision. It's not a library for handling video files.
You should look into PyAV. It's a (proper!) wrapper for ffmpeg's libraries. You can feed data directly in there and it will decode. Here are some examples and here are its tests that demonstrate further functionality. Its documentation is thin because most usage is (or should have been...) documented by ffmpeg itself.

You might be able to get away with a named pipe. You can use os.mkfifo to create one, then use the multiprocess module to spawn a background process that feeds the video file into it. Note that mkfifo is not supported on Windows.
The most important limitation is that a pipe does not support seeking, so your video won't be seekable or rewindable either. And whether it actually works might depend on the video format and on the backend (gstreamer, v4l2, ...) that OpenCV is using.

Is there a good way to store large amounts of similar scraped HTML files in Python?

I've written a web scraper in Python and I have a ton (thousands) of files that are extremely similar, but not quite identical. The disk space used currently used by the files is 1.8 GB, but if I compress them into a tar.xz, they compress to 14.4 MB. I want to be closer to that 14.4 MB than the 1.8 GB.
Here are some things I've considered:
I could just use tarfile in Python's standard library and store the files there. The problem with that is I wouldn't be able to modify the files within the tar without recompressing all of the files which would take a while.
I could just use the difflib in Python's standard library, but I've found that this library doesn't offer any way of applying "patches" to recreate the new file.
I could use Google's diff-match-patch Python library, but when I was reading the documentation, they said "Attempting to feed HTML, XML or some other structured content through a fuzzy match or patch may result in problems.", well considering I wanted to use this library to more efficiently store HTML files, that doesn't sound like it'll help me.
So is there a way of saving disk space when storing a large amount of similar HTML files?

You can use a dictionary.
Python's zlib interface supports dictionaries. The compressobj and decompressobj functions both take an optional zdict argument, which is a "dictionary". A dictionary in this case is nothing more than 32K of data with sequences of bytes that you expect will appear in the data you are compressing.
Since your files are about 30K each, this works out quite well for your application. If indeed your files are "extremely similar", then you can take one of those files and use it as the dictionary to compress all of the other files.
Try it, and measure the improvement in compression over not using a dictionary.

Hadoop: Process image files in Python code

I'm working on a side project where we want to process images in a hadoop mapreduce program (for eventual deployment to Amazon's elastic mapreduce). The input to the process will be a list of all the files, each with a little extra data attached (the lat/long position of the bottom left corner - these are aerial photos)
The actual processing needs to take place in Python code so we can leverage the Python Image Library. All the Python streaming examples I can find use stdin and process text input. Can I send image data to Python through stdin? If so, how?
I wrote a Mapper class in Java that takes the list of files and saves the names, the extra data, and the binary contents to a sequence file. I was thinking maybe I need to write a custom Java mapper that takes in the sequence file and pipes it to Python. Is that the right approach? If so, what should the Java to pipe the images out and the Python to read them in look like?
In case it's not obvious, I'm not terribly familiar with Java OR Python, so it's also possible I'm just biting off way more than I can chew with this as my introduction to both languages...

There are a few possible approaches that I can see:
Use both the extra data and the file contents as input to your python program. The tricky part here will be the encoding. I frankly have no idea how streaming works with raw binary content, and I'm assuming that basic answer is "not well." The main issue is that the stdin/stdout communication between processes is very text-based, relying on delimiting input with tabs and newlines, and things like that. You would need to worry about the encoding of the image data, and probably have some sort of pre-processing step, or a custom InputFormat so that you could represent the image as text.
Use only the extra data and the file location as input to your python program. Then the program can independently read the actual image data from the file. The hiccup here is making sure that the file is available to the python script. Remember this is a distributed environment, so the files would have to be in HDFS or somewhere similar, and I don't know if there are good libraries for reading files from HDFS in python.
Do the java-python interaction yourself. Write a java mapper that uses the Runtime class to start the python process itself. This way you get full control over exactly how the two worlds communicate, but obviously its more code and a bit more involved.

Open an lzo file in python, without decompressing the file

I'm currently working on a 3rd year project involving data from Twitter. The department have provided me with .lzo's of a months worth of Twitter. The smallest is 4.9gb and when decompressed is 29gb so I'm trying to open the file and read as I'm going. Is this possible or do I need to decompress and work with the data that way?
EDIT: Have attempted to read it line by line and decompress the read line
UPDATE: Found a solution - reading the STDOUT of lzop -dc works like a charm

How about starting an lzop binary in a subprocess with -c switch and then read its STDOUT line by line?

I know only one library for LZO with Python — https://github.com/jd-boyd/python-lzo and it requires full decompression (moreover — it decompress contents in memory).
So I think you'll need to decompress files before work with them.

I know this is a very old question and the answer is really good. I enchountered a samilar problem, google brought me here.
I just write down my experience on lzo compression and lzop program. Hope I can help someone like me encounter the same quesion. And I write a simple python module to deal with lzo file, you can find it on https://github.com/ir193/python-lzo/
Regarding the quesion, reading lzo compressed file in place (without decompress the whole file) can be done by reading one block at one time. The lzo file is divided into serveral blocks and there is a maximum size of the block about serveral MB. In my module, you can just using read(4096) or so.
Actually *.lzo is created by lzop and has little to do with the python-lzo provided by another answer (https://github.com/jd-boyd/python-lzo). This module is used for compress/decompress string, not handle lzop file header and checksum. Don't use it if you want to open some exist lzo file.

unrar archive while downloading it

I've got a program that downloads part01, then part02 etc of a rar file split across the internet.
My program downloads part01 first, then part02 and so on.
After some tests, I found out that using, on example, UnRAR2 for python I can extract the first part of the file (an .avi file) contained in the archive and I'm able to play it for the first minutes. When I add another file it extracts a bit more and so on. What I wonder is: is it possible to make it extract single files WHILE downloading them?
I'd need it to start extracting part01 without having to wait for it to finish downloading... is that possible?
Thank you very much!
Matteo

You are talking about an .avi file inside the rar archives. Are you sure the archives are actually compressed? Video files released by the warez scene do not use compression:
Ripped movies are still packaged due to the large filesize, but compression is disallowed and the RAR format is used only as a container. Because of this, modern playback software can easily play a release directly from the packaged files, and even stream it as the release is downloaded (if the network is fast enough).
(I'm thinking VLC, BSPlayer, KMPlayer, Dziobas Rar Player, rarfilesource, rarfs,...)
You can check for the compression as follows:
Open the first .rar archive in WinRAR. (name.part01.rar or name.rar for old style volumes names)
Click the info button.
If Version to extract indicates 2.0, then the archive uses no compression. (unless you have decade old rars) You can see Total size and Packed size will be equal.
is it possible to make it extract
single files WHILE downloading them?
Yes. When no compression is used, you can write your own program to extract the files. (I know of someone who wrote a script to directly download the movie from external rar files; but it's not public and I don't have it.) Because you mentioned Python I suggest you take a look at rarfile 2.2 by Marko Kreen like the author of pyarrfs did. The archive is just the file chopped up with headers (rar blocks) added. It will be a copy operation that you need to pause until the next archive is downloaded.
I strongly believe it is also possible for compressed files. Your approach here will be different because you must use unrar to extract the compressed files. I have to add that there is also a free RARv3 implementation to extract rars implemented in The Unarchiver.
I think this parameter for (un)rar will make it possible:
-vp Pause before each volume
By default RAR asks for confirmation before creating
or unpacking next volume only for removable disks.
This switch forces RAR to ask such confirmation always.
It can be useful if disk space is limited and you wish
to copy each volume to another media immediately after
creation.
It will give you the possibility to pause the extraction until the next archive is downloaded.
I believe that this won't work if the rar was created with the 'solid' option enabled.
When the solid option is used for rars, all packed files are treated as one big file stream. This should not cause any problems if you always start from the first file even if it doesn't contain the file you want to extract.
I also think it will work with passworded archives.

I highly doubt it. By nature of compression (from my understanding), every bit is needed to uncompress it. It seems that the source of where you are downloading from has intentionally broken the avi into pieces before compression, but by the time you apply compression, whatever you compressed is now one atomic unit. So they kindly broke the whole avi into Parts, but each Part is still an atomic nit.
But I'm not an expert in compression.
The only test I can currently think of is something like: curl http://example.com/Part01 | unrar.

I don't know if this was asked with a specific language in mind, but it is possible to stream a compressed RAR directly from the internet and have it decompressed on the fly. I can do this with my C# library http://sharpcompress.codeplex.com/
The RAR format is actually kind of nice. It has headers preceding each entry and the compressed data itself does not require random access on the stream of bytes.
Do it multi-part files, you'd have to fully extract part 1 first, then continue writing when part 2 is available.
All of this is possible with my RarReader API. Solid archive are also streamable (in fact, they're only streamable. You can't randomly access files in a solid archive. You pretty much have to extract them all at once.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.