I am working with MNIST data in ML (for digit recognition) and I want to convert my 'mnist.pkl' to 'mnist.pkl.gz' because the tutorial I am watching uses that extension.
Also, if possible, please tell me what those dots he has before the file name ('.../data/mnist.pkl.gz', 'rb') mean, if you are familiar with them. Thank you.
The extension .gz indicates that the file was compressed using gzip, which you can do by invoking
gzip mnist.pkl
on the command line. The command removes the original file and replaces it with a compressed version named mnist.pkl.gz.
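If you would rather do the compression from Python itself, here is a minimal sketch (it assumes mnist.pkl is in the current working directory; adjust the paths to your layout):

import gzip
import shutil

# Copy the raw pickle into a gzip-compressed file, streaming in chunks.
with open('mnist.pkl', 'rb') as src, gzip.open('mnist.pkl.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)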
That said, you don't have to compress/decompress the file in your particular case. Just use
f = open('../data/mnist.pkl', 'rb')
instead of
f = gzip.open('../data/mnist.pkl.gz', 'rb')
As for the dots: the leading ../ in the path simply means "the parent of the current directory", so ../data/mnist.pkl.gz refers to a data folder one level above the directory the script runs from.
What I'm trying to do is extract PNG images from files; by reading the hex data it's easy to find where they are hidden, since PNG images always start and end with certain byte values. I wrote a script that opens a .bin file, searches for those values, and exports the matches as PNG. The problem is that in Python 2.7 nothing happens, and in Python 3 I get errors about the encoding of the file. I've tried the ignore-errors and utf-8 encoding flags, but the problems persist. The code in question:
import binascii
import re
import os
for directory, subdirectories, files in os.walk('.'):
    for file in files:
        if not file.endswith('.bin'):
            continue
        filenumber = 0
        with open(os.path.join(directory, file)) as f:
            hexaPattern = re.compile(
                r'(89504E47.*?AE426082)',
                re.IGNORECASE
            )
            for match in hexaPattern.findall(binascii.hexlify(f.read())):
                with open('{}-{}.png'.format(file, filenumber), 'wb+') as f:
                    f.write(binascii.unhexlify(match))
                filenumber += 1
So, as you can see: extract hex values beginning with "89504E47" and ending with "AE426082", plus anything in between, from the imported file. I think the logic for matching these values is fine, but I'm having trouble getting Python to actually read the file as hexadecimal. Thoughts?
Thank you @Thierry Lathuille, that fixed it. I'm on Python 3.9 and made the change with:
with open(os.path.join(directory, file), 'rb+') as f:
and everything output correctly!
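For anyone landing here later, a complete Python 3 version of the loop might look like the sketch below. It is the question's code with two changes: the input file is opened in binary mode, and the regex pattern is a bytes pattern, since binascii.hexlify returns bytes in Python 3 (names kept from the question):

import binascii
import os
import re

# PNG data, once hexlified, starts with the PNG signature 89504E47
# and ends with AE426082 (the CRC of the IEND chunk).
hexaPattern = re.compile(rb'(89504E47.*?AE426082)', re.IGNORECASE)

for directory, subdirectories, files in os.walk('.'):
    for file in files:
        if not file.endswith('.bin'):
            continue
        filenumber = 0
        with open(os.path.join(directory, file), 'rb') as f:  # binary mode
            hexdata = binascii.hexlify(f.read())  # bytes in, bytes out
        for match in hexaPattern.findall(hexdata):
            with open('{}-{}.png'.format(file, filenumber), 'wb') as out:
                out.write(binascii.unhexlify(match))
            filenumber += 1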
I am currently writing json files to disk using
print('writing to disk .... ')
f = open('mypath/myfile', 'wb')
f.write(getjsondata.read())
f.close()
Which works perfectly, except that the json files are very large and I would like to compress them as I write them. How can I do that automatically?
Thanks!
Python has a standard module for zlib, which can compress and decompress data for you. You can use it directly on your data to write (and read) a custom format, or use the gzip module, which wraps the inner workings of zlib to read and write gzip-compatible files while automatically compressing or decompressing the data, so that it looks like an ordinary file object. It thus neatly replaces the default open call for interacting with files, and all you need is this:
import gzip
print('writing to disk .... ')
with gzip.open('mypath/myfile', 'wb') as f:
    f.write(getjsondata.read())
(Note the change in the open line: I highly recommend the with syntax for handling file objects.)
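If you ever need the compression purely in memory rather than as a file on disk, the zlib route mentioned above is one call in each direction. A small sketch with made-up data:

import zlib

raw = b'{"key": "value"}' * 1000        # some repetitive JSON-like bytes
packed = zlib.compress(raw, level=9)    # 9 = best compression, slowest
assert zlib.decompress(packed) == raw
print(len(raw), '->', len(packed))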
How do I decompress a *.bz2 file in memory with Python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory. It works, but it brings along some dirty data, such as the filename of the csv file and its author name. Is there a better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
outfile.write(decompressedFile)
I found this question; it is about gzip, whereas my data is in bz2 format. I tried to do as instructed there, but it seems that bz2 cannot handle it that way.
Edit:
Neither @metatoaster's answer nor the code above avoids the problem: both bring some extra dirty data into the final decompressed file.
For example, my original data (attached below) is in csv format, with the name res_test.csv:
Then I cd into the directory the file is in and compress it with tar -cjf res_test.tar.bz2 res_test.csv, producing the compressed file res_test.tar.bz2. This file simulates the bz2 data I will get from the internet, and I wish to decompress it in memory without caching it to disk first. But what I get is the data below, which contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into pure data, identical to the original, instead of decompressing it and then having to extract the real data from so much noise?
For generic bz2 decompression, BZ2File class may be used.
from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used:
import tarfile

tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 flag opens the tar archive in a way that makes it possible to seek backwards, which is important because the alternative mode r|bz2 makes it impractical to extract files from the members it returns via extractfile. The second line simply calls extractfile to return the contents of 'res_test.csv' from the archive as a string.
The transparent open mode ('r:*') is typically recommended, however, so if the input tar file is compressed using gzip instead no failure will be encountered.
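A sketch of that transparent form, using the same path as above:

import tarfile

# 'r:*' lets tarfile detect the compression (bz2, gzip, xz, or none) itself.
with tarfile.open('/app/tmp/res_test.tar.bz2', 'r:*') as tf:
    csvfile = tf.extractfile('res_test.csv').read()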
Naturally, the tarfile module has a lower-level open method that may be used on arbitrary stream objects. If the file was already opened using BZ2File, that handle can be used as well:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
I have another question, related to reading a .dat file.
The file format is DAT file (.dat)
The content inside the file is in bytes.
When I ran the file-opening code, the program built and ran successfully. However, the Python shell shows no output (I can't see the contents of the file).
Since the content inside the file is in bytes, should I modify the code? What code should I use for bytes?
Thank you.
There is no "DAT" file format and, as you say, the file contains bytes - as do all files.
It's possible that the file contains binary data for which it's best to open the file in binary mode. You do that by specifying b as part of the mode parameter to open(), like this:
f = open('file.dat', 'rb')
data = f.read() # read the entire file into data
print(data)
f.close()
Note that the full mode parameter is set to rb which means open the file in binary mode for reading.
A better way is to use with:
with open('file.dat', 'rb') as f:
    data = f.read()
    print(data)
No need to explicitly close the file.
If you know that the file contains text, possibly encoded in some specific encoding, e.g. UTF8, then you can specify the encoding when you open the file (Python 3):
with open('file.dat', encoding='UTF8') as f:
    for line in f:
        print(line)
In Python 2 you can use io.open().
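A minimal sketch of the Python 2 case (same hypothetical file name; io.open takes the same encoding argument as Python 3's open):

import io

# io.open decodes the bytes for you, yielding unicode lines.
with io.open('file.dat', encoding='UTF8') as f:
    for line in f:
        print(line)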
I am a python newbie here...
I have a gzipped file (C:\sample.gz) that I downloaded off the internet, and I need help with code that will extract the csv file inside it to its own file (C:\sample.csv). All I see is code to load it into memory... is there any way to do this?
You could use Python's gzip module.
import gzip

with gzip.open('/sample.gz', 'rb') as my_gz:
    file_content = my_gz.read()

with open('/sample.csv', 'wb') as my_csv:
    my_csv.write(file_content)
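Since the question mentions wanting to avoid loading everything into memory, a streaming variant (same hypothetical paths) copies the data in chunks instead:

import gzip
import shutil

# shutil.copyfileobj streams in fixed-size chunks, so the whole
# decompressed file never has to fit in memory at once.
with gzip.open('/sample.gz', 'rb') as my_gz, open('/sample.csv', 'wb') as my_csv:
    shutil.copyfileobj(my_gz, my_csv)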