Compressing a video file in Python with the standard library

Is there a way to effectively compress a video file with the standard library of python? I wrote a quick script to accomplish this, but it barely compresses the video file. Take a look:
import sys
import zlib

with open('Some_Video.mp4', 'rb') as f:
    original_data = f.read()

original_size = sys.getsizeof(original_data)
compress_data = zlib.compress(original_data, level=5)
compressed_size = sys.getsizeof(compress_data)
print(original_size)
print(compressed_size)
This was the output:
2793876
2788282
Why is the difference so small, and how can I compress further?

Video files are already compressed. You cannot compress them further, at least not significantly.
Your only option would be to decompress them, and then recompress them with a more effective compressor, e.g. HEVC.

I believe the small reduction in file size is due to zlib being a lossless compression library, and mp4 is already a compressed format, so there's little margin for improvement.
From the standard library, lzma claims to have the best compression ratio. But keep in mind it's also lossless, so I would not expect much of a difference.
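As a quick check, a minimal sketch with the standard-library lzma module (same caveat applies: it's lossless, so don't expect much on an mp4):
import lzma

with open('Some_Video.mp4', 'rb') as f:
    original_data = f.read()

# preset=9 is the slowest/strongest setting, still lossless
compressed_data = lzma.compress(original_data, preset=9)
print(len(original_data), len(compressed_data))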
I recommend using the third-party library ffmpeg-python. It's a wrapper for the command-line application ffmpeg, which lets you transcode your mp4 with more efficient encoders such as H.265/HEVC.
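For reference, a rough sketch of what that could look like with ffmpeg-python, assuming ffmpeg and the libx265 encoder are installed (the crf value is just an example; tune it for your quality/size trade-off):
import ffmpeg

(
    ffmpeg
    .input('Some_Video.mp4')
    # libx265 with a higher CRF gives a smaller, lossy file
    .output('Some_Video_h265.mp4', vcodec='libx265', crf=28)
    .run()
)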

Related

Named memory-mapped files in Python?

I'm using OpenCV to process some video data in a web service. Before calling OpenCV, the video is already loaded to a bytearray buffer, which I would like to pass to VideoCapture object:
# The following raises cv2.error because it can't convert '_io.BytesIO' to 'str' for 'filename'
cap = cv2.VideoCapture(buffer)
Unfortunately, VideoCapture() expects a string filename, not a buffer. For now, I'm saving the bytearray to a temporary file and passing its name to VideoCapture().
Questions:
Is there a way to create named in-memory files in Python, so I can pacify OpenCV?
Alternatively, is there another OpenCV API which does support buffers?
Note: POSIX-specific! As you haven't provided an OS tag, I assume that's okay.
According to this answer (and this shm_overview manpage), /dev/shm is always present on the system. It's a tmpfs mapped into a shared memory pool (not the Python process's memory), as suggested here; the plus is that you don't need to create it yourself, so there's no need for workarounds like:
os.system("mount ...") or
Popen(["mount", ...]) wrappers.
Simply use tempfile.NamedTemporaryFile() like this:
from tempfile import NamedTemporaryFile

with NamedTemporaryFile(dir="/dev/shm") as file:
    print(file.name)
    # /dev/shm/tmp2m86e0e0
which you could then feed into OpenCV's API wrapper. Alternatively, utilize pyfilesystem as a more extensive wrapper around that device/FS.
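For instance, a rough sketch of feeding the in-memory data to VideoCapture this way (here 'buffer' is assumed to hold the video bytes from the question):
import cv2
from tempfile import NamedTemporaryFile

# 'buffer' holds the raw video data (bytes/bytearray); use .getvalue() if it's a BytesIO
with NamedTemporaryFile(dir="/dev/shm", suffix=".mp4") as tmp:
    tmp.write(buffer)
    tmp.flush()                       # make sure the data is visible before OpenCV opens it
    cap = cv2.VideoCapture(tmp.name)  # OpenCV gets the filename it insists on
    ok, frame = cap.read()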
multiprocessing.heap.Arena uses it too, so if it didn't work there would be much bigger problems. For Windows, check this implementation, which uses winapi.
For the size of /dev/shm:
this is one of the size "specifications" I found,
shm.h, shm_add_rss_swap(), newseg() from Linux source code may hold more details
Judging by sudo ipcs, this is most likely the approach you want when sharing data between processes, if you're not using sockets, pipes or disk.
As it's POSIX, it should work on other POSIX-compliant systems such as Solaris (though notably not on macOS, which has no /dev/shm), but I have no means to try it.
Partially to answer the question: there is no way I know of in Python to create named file-like objects which point to memory; that's something for the operating system to do. There is a very easy way to do something very like creating named memory-mapped files on most modern *nixes: save the file to /tmp. These days /tmp is very often a ramdisk, but it might be zram (basically a compressed ramdisk), so you'll likely want to check that first. At any rate it's better than thrashing your disk or depending on OS caching.
Incidentally making a dedicated ramdisk is as easy as mount -t tmpfs -o size=1G tmpfs /path/to/tmpfs or similarly with ramfs.
Looking into it I don't think you're going to have much luck with alternative apis either: the use of filenames goes right down to cap.cpp, where we have things like:
VideoCapture::VideoCapture(const String& filename, int apiPreference) : throwOnFail(false)
{
    CV_TRACE_FUNCTION();
    open(filename, apiPreference);
}
It seems the python bindings are just a thin layer on top of this. But I'm willing to be proven wrong!
References
https://github.com/opencv/opencv/blob/master/modules/videoio/src/cap.cpp#L72
If VideoCapture were a regular Python object that accepted file-like objects in addition to paths, you could feed it one and it could read from that.
Python's StringIO and BytesIO are file-like objects in memory. Something useful to remember ;)
OpenCV specifically expects a file system path there, so that's out of the question.
OpenCV is a library for computer vision. It's not a library for handling video files.
You should look into PyAV. It's a (proper!) wrapper for ffmpeg's libraries. You can feed data directly in there and it will decode. Here are some examples and here are its tests that demonstrate further functionality. Its documentation is thin because most usage is (or should have been...) documented by ffmpeg itself.
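A rough sketch of the PyAV route, assuming the video bytes are already in memory (the bgr24 conversion is only there to get OpenCV-style numpy frames):
import io
import av

# 'video_bytes' is assumed to be the raw mp4 data from your web service
container = av.open(io.BytesIO(video_bytes))

for frame in container.decode(video=0):
    img = frame.to_ndarray(format="bgr24")  # numpy array in the layout OpenCV expects
    # ... run your processing on img ...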
You might be able to get away with a named pipe. You can use os.mkfifo to create one, then use the multiprocessing module to spawn a background process that feeds the video file into it. Note that mkfifo is not supported on Windows.
The most important limitation is that a pipe does not support seeking, so your video won't be seekable or rewindable either. And whether it actually works might depend on the video format and on the backend (gstreamer, v4l2, ...) that OpenCV is using.
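A minimal sketch of that idea (POSIX only; the path is made up, 'video_bytes' is assumed to hold the file's contents, and whether VideoCapture copes with a FIFO depends on the backend):
import os
import cv2
import multiprocessing

FIFO_PATH = "/tmp/video_fifo"  # hypothetical path

def feed_fifo(data):
    # Opening for write blocks until a reader (VideoCapture) opens the other end
    with open(FIFO_PATH, "wb") as fifo:
        fifo.write(data)

os.mkfifo(FIFO_PATH)
writer = multiprocessing.Process(target=feed_fifo, args=(video_bytes,))
writer.start()

cap = cv2.VideoCapture(FIFO_PATH)  # no seeking or rewinding on a pipe
ok, frame = cap.read()
while ok:
    # ... process frame, then keep reading so the writer doesn't block ...
    ok, frame = cap.read()

writer.join()
os.remove(FIFO_PATH)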

How to read compressed(.gz) file faster using Pandas/Dask?

I have a couple of gzip files (each about 3.5 GB). At the moment I'm using Pandas to read them, but it is very slow. I have also tried Dask, but it does not seem to support splitting gzip files into chunks. Is there a better way to load these massive gzip files quickly?
Dask and Pandas code:
import dask.dataframe as dd

df = dd.read_csv(r'file', sample=200000000000, compression='gzip')
I expect it to read the whole file as quickly as possible.
gzip is, inherently, a pretty slow compression method and, as you say, does not support random access. This means that the only way to get to position x is to scan through the file from the start, which is why Dask does not try to parallelise in this case.
Your best bet, if you want to make use of parallel parsing at least, is first to decompress the whole file, so that the chunking mechanism makes sense. You could also break it into several files and compress each one, so that the total space required is similar.
Note that, in theory, some compression mechanisms support block-wise random access, but we have not found any with sufficient community support to implement them in Dask.
The best answer, though, is to store your data in parquet or orc format, which have internal compression and partitioning.
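A one-off conversion sketch, assuming the data is CSV and using dask (paths and the compression codec are placeholders; to_parquet needs pyarrow or fastparquet installed):
import dask.dataframe as dd

# One-time, slow pass: gzip forces a single-threaded read (blocksize=None)
df = dd.read_csv('file.csv.gz', compression='gzip', blocksize=None)

# Write out as partitioned, internally compressed parquet
df.to_parquet('file_parquet/', compression='snappy')

# Subsequent loads are fast and parallel
df = dd.read_parquet('file_parquet/')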
One option is to use package datatable for python:
https://github.com/h2oai/datatable
It can read files (even gzipped ones) significantly faster than pandas using its fread function, for example:
import datatable as dt
df = dt.fread('file.csv.gz')
Later, one can convert it to a pandas dataframe:
df1 = df.to_pandas()
Currently datatable is only available on Linux/Mac.
You can try using the gzip library:
import gzip

with gzip.open('Your File', 'rb') as f:
    file_content = f.read()
print(file_content)

Exporting Audio for Google Speech using pydub

I'm trying to export audio files to LINEAR16 for Google Speech and I notice that they specify little-endian byte ordering. I'm using pydub to export to 'raw' format, but I can't tell from the documentation (or the source) whether the exported files are in little or big endian format?
I'm using the following command for exporting:
audio = pydub.AudioSegment.from_file(self.mFilePathName, "mp4")
fullFileNameRaw = "audio.raw"
audio.export(fullFileNameRaw, format='raw')
Thank you.
-K
According to this answer, standard (RIFF) wave files are little-endian. Pydub uses the stdlib wave module to write wave files, so I'm guessing it is little-endian (if you write the file with the wave headers, it does in fact have RIFF at the beginning).
Looking into it a little further, it seems like it may depend on the hardware platform's endianness. x86 and AMD64 are both little-endian though, so that covers basically all the places people would run pydub (I think?)
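If you'd rather be explicit than rely on the platform, a sketch along these lines should produce LINEAR16-style data, i.e. 16-bit mono PCM (the file name and the 16 kHz rate are assumptions, not from the question):
import sys
import pydub

audio = pydub.AudioSegment.from_file("audio.mp4", "mp4")

# LINEAR16 is 16-bit mono PCM; pick the sample rate you declare to the API
audio = audio.set_channels(1).set_sample_width(2).set_frame_rate(16000)

# raw_data is the bare PCM payload with no RIFF/WAV header
pcm_bytes = audio.raw_data

# On little-endian hosts (x86/AMD64) these samples are already little-endian
print(sys.byteorder)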

Importing audio track (wav or aiff) in Python

I have an audio track in AIFF format. I would like to open this audio file with Python, and import the amplitudes of the sound and perform some mathematical analysis such as Fourier Transform, etc.
Is this possible in Python?
Are there libraries or modules, which allow me to acquire an audio file?
Throughout my search, I have found scipy.io.wavfile, which works for WAV audio files.
Are there other libraries to import audio files in Python?
Is there something similar for AIFF files?
Obviously, I can convert the AIFF into a WAV file, but I would like to import the AIFF file directly, if possible.
As a side question: are there some more specific (by specific, I mean better than Python) programming languages to perform such kind of analysis and acquisition of audio files?
Python comes with AIFF support as part of the standard library -- see the aifc module.
This module provides support for reading and writing AIFF and AIFF-C files. AIFF is Audio Interchange File Format, a format for storing digital audio samples in a file. AIFF-C is a newer version of the format that includes the ability to compress the audio data.
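A minimal sketch of reading the samples with aifc and handing them to numpy for an FFT (the filename is made up, 16-bit samples are assumed, and AIFF stores sample data big-endian, hence the '>i2' dtype; note that aifc was removed from the standard library in Python 3.13, so this applies to older versions):
import aifc
import numpy as np

with aifc.open("track.aiff", "r") as f:
    n_channels = f.getnchannels()
    frame_rate = f.getframerate()  # useful for the frequency axis of the FFT
    raw = f.readframes(f.getnframes())

# Assuming 2-byte (16-bit) big-endian samples
samples = np.frombuffer(raw, dtype=">i2")
if n_channels > 1:
    samples = samples.reshape(-1, n_channels)[:, 0]  # keep the first channel

spectrum = np.fft.rfft(samples)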
Depending on what your end goals are, you may be more productive using a tool like PureData that's designed just for working with audio and has things like reading audio files and performing ffts as primitives.
Yes, I also came across this problem using scipy.io.wavfile. Looking into it, scikits.audiolab might be an interesting way around this WAV-only limitation:
https://sites.google.com/site/ldpyproject/scikits-audiolab
As for Pure Data, I use it a lot, but of course it does depend on what you wish to do with your sound file.

Dealing with huge (potentially over 30000x30000) images in Python?

I'm trying to use a python script called deepzoom.py to convert large overhead renders (often over 1GP) to the Deep Zoom image format (ie, google maps-esque tile format), but unfortunately it's powered by PIL, which usually ends up crashing due to memory limitations. The creator has said he's delving into VIPS, but even nip2 (the GUI frontend for VIPS) fails to open the image. In another question by someone else (though on the same topic), someone suggested OpenImageIO, which looks like it has the ability, and has Python wrappers, but there aren't any proper binaries provided, and trying to compile it on Windows is a nightmare.
Are there any alternative libraries for Python I can use? I've tried PythonMagickWand (wrapper for ImageMagick) and PythonMagick (wrapper for GraphicsMagick), but both of those also run into memory problems.
I had a very similar problem and I ended up solving it by using netpbm, which works fine on Windows. Netpbm had no problem converting huge .png files and then slicing, cropping, re-combining (using pamcrop, pamdice, and pamundice) and converting back to .png without using much memory at all. I just included the necessary netpbm binaries and dlls with my application and called them from Python.
It sounds like you're trying to use georeferenced imagery or something similar, for which a GIS solution sounds more appropriate. I'd use GDAL -- it's an excellent library and comes with easy-to-use Python bindings via Swig.
On Windows, the easiest way to install it is via Frank Warmerdam's FWTools package.
I'm able to use pyvips to read images with size (50000, 50000, 3):
import numpy as np
import pyvips

img = pyvips.Image.new_from_file('xxx.jpg')
arr = np.ndarray(buffer=img.write_to_memory(),
                 dtype=np.uint8,
                 shape=[img.height, img.width, img.bands])
Is a partial load useful? If you use PIL and the image format is BMP, you can open() an image file (which doesn't load it), then do a crop(), and then load() - which will only actually load the part of the image you've selected with the crop. It will probably also work with TGA, maybe even for JPG, and less efficiently for PNG and other formats.
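A sketch of what that looks like with PIL/Pillow (the filename and tile size are made up; how much is actually decoded lazily depends on the format's decoder):
from PIL import Image

img = Image.open("huge_render.bmp")    # reads only the header; pixels stay on disk
tile = img.crop((0, 0, 4096, 4096))    # crop is lazy as well
tile.load()                            # decoding happens here, ideally only for the tile
tile.save("tile_0_0.png")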
libvips comes with a very fast DeepZoom creator that can work with images of any size. Try:
$ vips dzsave huge.tif mydz
This will write the tiles to mydz_files and also write a mydz.dzi info file for you. It's typically 10x faster than deepzoom.py and has no size limit.
See this chapter in the manual for an introduction to dzsave.
You can do the same thing from Python using pyvips like this:
import pyvips
my_image = pyvips.Image.new_from_file("huge.tif", access="sequential")
my_image.dzsave("mydz")
The access="sequential" tells pyvips it can stream the image rather than having to read the whole thing into memory.
