Access in-memory unzipped file with codecs.open() - python

I'm trying to open in-memory unzipped files with codecs.open(). I've figured out how to unzip a file in memory, but I don't know how to create a file object and open it with codecs. I've experimented with different ZipFile properties, but no luck.
So, here's how I opened the zip in memory:
import zipfile, io
with open('somezipfile.zip', 'rb') as f:
    memory_object = io.BytesIO(f.read())
zip_in_memory = zipfile.ZipFile(memory_object)

You don't need codecs.open() to access data in memory -- it is meant for loading files from disk. You can read a member's contents from your ZipFile object with its read() method and decode the resulting bytes with decode(). If you insist on using the codecs module, you can also get a file-like object from zip_in_memory.open(...) and wrap the returned object with a codecs stream reader.
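A minimal sketch of both approaches (the member name somefile.txt and the utf-8 encoding are assumptions; codecs.getreader is used for the wrapping step since it yields decoded text directly):

```python
import codecs
import io
import zipfile

# Build a small zip in memory so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('somefile.txt', 'héllo'.encode('utf-8'))

zip_in_memory = zipfile.ZipFile(io.BytesIO(buf.getvalue()))

# 1) read() returns bytes; decode them directly.
text = zip_in_memory.read('somefile.txt').decode('utf-8')

# 2) or wrap the file-like object returned by open() with a codecs reader.
with zip_in_memory.open('somefile.txt') as member:
    reader = codecs.getreader('utf-8')(member)
    assert reader.read() == text
```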

Related

How do you convert a ZipFile object into binary in Python?

Let's say I create a ZipFile object like so:
with ZipFile(StringIO(), mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', "Some document")
zf.to_bytes() # there is no such method
Can I convert zf in to bytes?
Note: I mean the bytes of the zip file itself, not the files inside the archive.
I also prefer to do it in memory without dumping to disk.
Need it to test mocked request that I get from requests.get when downloading a zip file.
The data is stored to the StringIO object, which you didn't save a reference to. You should have saved a reference. (Also, unless you're on Python 2, you need a BytesIO, not a StringIO.)
memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    ...
data = memfile.getvalue()
Note that it's important to call getvalue outside the with block (or after the close, if you want to handle closing the ZipFile object manually). Otherwise, your output will be corrupt, missing final records that are written when the ZipFile is closed.
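A complete sketch of the pattern, reusing the file names from the question:

```python
import io
from zipfile import ZIP_DEFLATED, ZipFile

memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', 'Some document')

# Safe now: the with block has closed the ZipFile and flushed its final records.
data = memfile.getvalue()
assert data[:2] == b'PK'  # zip files start with the 'PK' signature
```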

Python: Direct conversion from download to mp3 without saving files (requests/moviepy)

I need to convert an MP4 downloaded with the requests module into an MP3 with the moviepy module.
The two operations work perfectly on their own.
However, in order to convert the MP4 into an MP3 (using moviepy's audio.write_audiofile() method),
I need to save the MP4 to disk first (by writing the requests response content to a file),
which is basically useless since I will delete it right after.
Do you know of a method that takes the content downloaded with requests and converts it directly into an MP3 file?
Thank you in advance!
I am not so familiar with this, but I think you can use io.BytesIO (https://docs.python.org/3/library/io.html#io.BytesIO). With it you can write data (from requests) into a BytesIO object instead of a file (you can use it in place of a real file in most read/write operations). An example:
import io
b = io.BytesIO()
with open("file.dat", "br") as f:
    b.write(f.read())
b.seek(0)
with open("new_file.dat", "bw") as f:
    f.write(b.read())
b.close()
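Applied to the download itself, the same round trip works without touching disk; a minimal sketch (the bytes below stand in for response.content from requests.get, since no real download happens here):

```python
import io

# Stand-in for requests.get(url).content -- the raw MP4 bytes.
payload = b"fake mp4 payload"

buf = io.BytesIO()
buf.write(payload)  # write the downloaded bytes into memory
buf.seek(0)         # rewind so the next consumer reads from the start
roundtrip = buf.read()
buf.close()
```

Note that, as far as I know, moviepy's VideoFileClip expects a filename, so this only helps if the consuming API accepts file-like objects.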

Python: Read compressed (.gz) HDF file without writing and saving uncompressed file

I have a large number of compressed HDF files, which I need to read.
file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...
I can read in uncompressed HDF files with the following method
from pyhdf.SD import SD, SDC
import os
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')
and repeat this for each file. However, this is more time-consuming than I would like.
I'm thinking most of the time overhead comes from writing the compressed file out as a new uncompressed copy, and that I could speed things up if I could simply read an uncompressed version of the file into the SD function in one step.
Am I correct in this thinking? And if so, is there a way to do what I want?
According to the pyhdf package documentation, this is not possible.
__init__(self, path, mode=1)
SD constructor. Initialize an SD interface on an HDF file,
creating the file if necessary.
There is no other way to instantiate an SD object that takes a file-like object. This is likely because they are conforming to an external interface (NCSA HDF). The HDF format also normally handles massive files that are impractical to store in memory at one time.
Unzipping it as a file is likely your most performant option.
If you would like to stay in Python, use the gzip module (docs):
import gzip
import shutil
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
sascha is correct that HDF transparent compression is more appropriate than gzipping; nonetheless, if you can't control how the HDF files are stored, you're looking for the gzip module (docs), which can read the data from these files.
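For completeness, the gzip module can also decompress entirely in memory when a consumer does accept bytes; a minimal sketch with made-up stand-in contents:

```python
import gzip
import io

# Round-trip: compress some stand-in HDF bytes, then read them back in memory.
original = b"stand-in for HDF file contents"
compressed = gzip.compress(original)

with gzip.open(io.BytesIO(compressed), 'rb') as f:
    decompressed = f.read()
```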

how to decompress .tar.bz2 in memory with python

How to decompress *.bz2 file in memory with python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory, and it works, but it brings in some dirty data such as the filename of the csv file and its author name. Is there any better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
    content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
    outfile.write(decompressedFile)
I found this question, which deals with gzip; however, my data is in bz2 format. I tried to follow its instructions, but it seems bz2 cannot be handled this way.
Edit:
No matter whether I use the answer from @metatoaster or the code above, both bring extra dirty data into the final decompressed file.
For example: my original data is attached below and is in csv format with the name res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv to get res_test.tar.bz2. This file simulates the bz2 data I will get from the internet, which I wish to decompress in memory without caching it to disk first. But what I get is the data below, which contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into clean data identical to the original, instead of having to extract the real data from so much noise?
For generic bz2 decompression, the BZ2File class may be used.
from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used:
import tarfile
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 flag opens the tar archive in a way that allows seeking backwards, which is important because the alternative streaming mode r|bz2 makes it impractical to extract members with extractfile. The second line simply calls extractfile to return the contents of 'res_test.csv' from the archive as a string.
The transparent open mode ('r:*') is typically recommended, however, so that no failure is encountered if the input tar file turns out to be compressed with gzip instead.
Naturally, the tarfile module also has a lower-level open method which may be used on arbitrary stream objects. If the file was already opened using BZ2File, this can also be used:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
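Taking this one step further, the whole round trip can stay in memory by handing tarfile a BytesIO; a self-contained sketch (the csv contents are made up, the member name res_test.csv comes from the question):

```python
import io
import tarfile

# Build res_test.csv inside an in-memory tar.bz2.
csv_bytes = b"a,b\n1,2\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:bz2') as tf:
    info = tarfile.TarInfo('res_test.csv')
    info.size = len(csv_bytes)
    tf.addfile(info, io.BytesIO(csv_bytes))

# Rewind and extract the member straight back out of memory.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:bz2') as tf:
    extracted = tf.extractfile('res_test.csv').read()
```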

Adding a file-like object to a Zip file in Python

The Python ZipFile API seems to allow passing a file path to ZipFile.write or a byte string to ZipFile.writestr, but nothing in between. I would like to be able to pass a file-like object -- in this case a django.core.files.storage.DefaultStorage, but any file-like object in principle. At the moment I think I'm going to have to either save the file to disk or read it into memory. Neither of these is perfect.
You are correct, those are the only two choices. If your DefaultStorage object is large, you may want to go with saving it to disk first; otherwise, I would use:
zipped = ZipFile(...)
zipped.writestr('archive_name', default_storage_object.read())
If default_storage_object is a StringIO object, you can use default_storage_object.getvalue() instead.
While there's no option that takes a file-like object, there is an option to open a zip entry for writing (ZipFile.open with mode 'w', available since Python 3.6). [doc]
import zipfile
import shutil
with zipfile.ZipFile('test.zip','w') as archive:
    with archive.open('test_entry.txt','w') as outfile:
        with open('test_file.txt','rb') as infile:
            shutil.copyfileobj(infile, outfile)
You can use your input stream as the source instead, and not have to copy the file to disk first. The downside is that if something goes wrong with your stream, the zip file will be unusable. In my application, we bypass files with errors, so we end up getting a local copy of the file anyway to ensure integrity and keep a usable zip file.
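The same pattern works with in-memory streams on both sides; a minimal sketch (the entry name and payload are made up):

```python
import io
import shutil
import zipfile

# Stream an in-memory source (any file-like object) into a zip entry, no disk I/O.
source = io.BytesIO(b"payload from a file-like object")
archive_buf = io.BytesIO()
with zipfile.ZipFile(archive_buf, 'w') as archive:
    # ZipFile.open(..., 'w') needs Python 3.6+
    with archive.open('entry.bin', 'w') as entry:
        shutil.copyfileobj(source, entry)

# Read the entry back to confirm the round trip.
with zipfile.ZipFile(io.BytesIO(archive_buf.getvalue())) as archive:
    roundtrip = archive.read('entry.bin')
```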