I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems, they must have embedded quite a lot of binary data, such as icons, sound, graphics, etc, into the script. I am wondering how that can be done, since this can be potentially useful in simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ..., terribly inefficient, takes half a MB for a 100KB wav file.
base64, better than option 1, enlarge the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2
>>> bz2.compress(b'\x00' * 100 + b'\x01' * 200).encode('base64').replace('\n', '')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2
data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'w') as fdesc:
fdesc.write(bz2.decompress(data.decode('base64')))
Here's a quick and dirty way. Create the following script called MyInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.
Related
So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100Gb+. Specifically, they are in the fastq format. This format is used for DNA sequencing data, and consists of records of four lines, something like this:
#REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65
#REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
.
.
.
For the sake of this question, just focus on the header lines, starting with a '#'.
So, for QA purposes, I need to compare two such files. These files should have matching headers, so the first record in the other file should also have the header '#REC1', the next should be '#REC2' and so on. I want to make sure that this is the case, before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration a string comparisson would take very long, but this QA step will be run numerous times, and I can't afford to wait that long. So I thought a better way would be to sample records from a few points in the files, for example every 10% of the records. If the order of the records is messed up, I'd be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and than using python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:
file_size = os.stat(fastq_file).st_size
start_point = int(file_size/2)
with open(fastq_file) as f:
f.seek(start_point)
# look for the next beginning of record, never mind how
But now the problem is more complex, since I don't know how to coordinate between the two files, since the bytes location is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th lines in both files to make sure they are the same, without going over the whole file?
Would appreciate any ideas\hints. Maybe iterating in parallel? but how exactly?
Thanks!
Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:
Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
Run diff on a pair of linearized files. If there is a difference, diff will report it.
To linearize, you can run your FASTQ file through awk:
$ awk '\
BEGIN { \
n = 0; \
} \
{ \
a[n % 4] = $0; \
if ((n+1) % 4 == 0) { \
print a[0]"\t"a[1]"\t"a[2]"\t"a[3]; \
} \
n++; \
}' example.fq > example.fq.linear
To compare a pair of files:
$ diff example_1.fq.linear example_2.fq.linear
If there's any difference, diff will find it and tell you which FASTQ record is different.
You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.
So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.
If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:
$ awk -f linearize_fq.awk example.fq > example.fq.linear
This could be useful with your 100+ Gb files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:
$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)
Or you can use named pipes:
$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear
Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ Gb files to disk could take a while to do, and those copies could also use disk space you may not have much of.
Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.
You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:
with open(file1) as f1, open(file2) as f2:
for l1, l2 in zip(f1,f2):
if l1.startswith("#REC"):
if l1 != l2:
print("Difference at record", l1)
break
else:
print("No differences")
This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.
Have you looked into using the rdiff command.
The upsides of rdiff are:
with the same 4.5GB files, rdiff only ate about 66MB of RAM and scaled very well. It never crashed to date.
it is also MUCH faster than diff.
rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program
The downsides of rdiff are:
it's not part of standard Linux/UNIX distribution – you have to
install the librsync package.
delta files rdiff produces have a slightly different format than diff's.
delta files are slightly larger (but not significantly enough to care).
a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The
first one produces a special signature file. In the second step, a
delta is created using another rdiff call (all shown below). While
the 2-step process may seem annoying, it has the benefits of
providing faster deltas than when using diff.
See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
import sys
import re
""" To find of the difference record in two HUGE files. This is expected to
use of minimal memory. """
def get_rec_num(fd):
""" Look for the record number. If not found return -1"""
while True:
line = fd.readline()
if len(line) == 0: break
match = re.search('^#REC(\d+)', line)
if match:
num = int(match.group(1))
return(num)
return(-1)
f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')
hf1 = dict()
hf2 = dict()
while f1 or f2:
if f1:
r = get_rec_num(f1)
if r < 0:
f1.close()
f1 = None
else:
# if r is found in f2 hash, no need to store in f1 hash
if not r in hf2:
hf1[r] = 1
else:
del(hf2[r])
pass
pass
if f2:
r = get_rec_num(f2)
if r < 0:
f2.close()
f2 = None
else:
# if r is found in f1 hash, no need to store in f2 hash
if not r in hf1:
hf2[r] = 1
else:
del(hf1[r])
pass
pass
print('Records found only in f1:')
for r in hf1:
print('{}, '.format(r));
print('Records found only in f2:')
for r in hf2:
print('{}, '.format(r));
Both answers from #AlexReynolds and #TimPietzcker are excellent from my point of view, but I would like to put my two cents in. You also might want to speed up your hardware:
Raplace HDD with SSD
Take n SSD's and create a RAID 0. In the perfect world you will get n times speed up for your disk IO.
Adjust the size of chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to be executed faster than sixteen 1 MB reads. (this applies to a single SSD, for RAID 0 optimization one has to take a look at RAID controller options and capabilities).
The last option is especially relevant to NOR SSD's. Don't pursuit the minimal RAM utilization, but try to read as much as it needs to keep your disk reading fast. For instance, parallel reads of single rows from two files can probably speed down reading - imagine an HDD where two rows of the two files are always on the same side of the same magnetic disk(s).
I'm making a dump file of a postgres base.
By using pg_dump base > base.sql, postgres creates the base.sql but until the process is not done, the file appears with 0 bytes.
Is there a way to acess the size of the file to make a kind of progress bar?
I tried it with python with the os.stat() but it shows 0 bytes too:
>>> import os
>>> file = os.stat('base.sql')
>>> file.st_size
>>> 0L
One non-python option is to use the pv command line program (man page), which is available in the universe repository:
apt-get install pv
Now you can pipe through pv to get continuous updates on how much data has been written to the output file so far:
pg_dump base | pv > base.sql
Of course, pv has no way of knowing the final size in this case, so it will only display the amount of data piped so far. If you've got a guesstimate you may provide it with the -s argument, and you'll get a proper progress bar:
pg_dump base | pv -s 500M > base.sql
Here is a blog post I found with install instructions, examples and screenshots.
I don't think you can, until all the file has been written to the HDD.
In order to make a progress bar you have to know what is the space occupied at 100% by the file. Only then, by using that number and the current space used by the file you're writing you can calculate the current percentage of written file.
I am building a service where I log plain text format logs from several sources (one file per source). I do not intend to rotate these logs as they must be around forever.
To make these forever around files smaller I hope I could gzip them in fly. As they are log data, the files compress very well.
What is a good approach in Python to write append-only gzipped text files, so that the writing can be later resumed when service goes on and off? I am not that worried about losing few lines, but if gzip container itself breaks down and the file becomes unreadable that's no no.
Also, if it's no go, I can simply write them in as plain text without gzipping if it's not worth of the hassle.
Note: On unix systems you should seriously consider using an external program, written for this exact task:
logrotate (rotates, compresses, and mails system logs)
You can set the number of rotations so high, that the first file would be deleted in 100 years or so.
In Python 2, logging.FileHandler takes an keyword argument encoding that can be set to bz2 or zlib.
This is because logging uses the codecs module, which in turn treats bz2 (or zlib) as encoding:
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2") as fh:
... fh.write("Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World
Python 3 version (although the docs mention bz2 as alias, you'll actually have to use bz2_codec - at least w/ 3.2.3):
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2_codec") as fh:
... fh.write(b"Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World
I'm writing a simple MP3 cataloguer to keep track of which MP3's are on my various devices. I was planning on using MD5 or SHA2 keys to identify matching files even if they have been renamed/moved, etc. I'm not trying to match MP3's that are logically equivalent (i.e.: same song but encoded differently). I have about 8000 MP3's. Only about 6700 of them generated unique keys.
My problem is that I'm running into collisions regardless of the hashing algorithm I choose. In one case, I have two files that happen to be tracks #1 and #2 on the same album, they are different file sizes yet produce identical hash keys whether I use MD5, SHA2-256, SHA2-512, etc...
This is the first time I'm really using hash keys on files and this is an unexpected result. I feel something fishy is going on here from the little I know about these hashing algorithms. Could this be an issue related to MP3's or Python's implementation?
Here's the snippet of code that I'm using:
data = open(path, 'r').read()
m = hashlib.md5(data)
m.update(data)
md5String = m.hexdigest()
Any answers or insights to why this is happening would be much appreciated. Thanks in advance.
--UPDATE--:
I tried executing this code in linux (with Python 2.6) and it did not produce a collision. As demonstrated by the stat call, the files are not the same. I also downloaded WinMD5 and this did not produce a collision(8d327ef3937437e0e5abbf6485c24bb3 and 9b2c66781cbe8c1be7d6a1447994430c). Is this a bug with Python hashlib on Windows? I tried the same under Python 2.7.1 and 2.6.6 and both provide the same result.
import hashlib
import os
def createMD5( path):
fh = open(path, 'r')
data = fh.read()
m = hashlib.md5(data)
md5String = m.hexdigest()
fh.close()
return md5String
print os.stat(path1)
print os.stat(path2)
print createMD5(path1)
print createMD5(path2)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=6617216L, st_atime=1303808346L, st_mtime=1167098073L, st_ctime=1290222341L)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=4921346L, st_atime=1303808348L, st_mtime=1167098076L, st_ctime=1290222341L)
>>> a7a10146b241cddff031eb03bd572d96
>>> a7a10146b241cddff031eb03bd572d96
I sort of have the feeling that you are reading a chunk of data which is smaller than the expected, and this chunk happens to be the same for both files. I don't know why, but try to open the file in binary with 'rb'. read() should read up to end of file, but windows behaves differently. From the docs
On Windows, 'b' appended to the mode
opens the file in binary mode, so
there are also modes like 'rb', 'wb',
and 'r+b'. Python on Windows makes a
distinction between text and binary
files; the end-of-line characters in
text files are automatically altered
slightly when data is read or written.
This behind-the-scenes modification to
file data is fine for ASCII text
files, but it’ll corrupt binary data
like that in JPEG or EXE files. Be
very careful to use binary mode when
reading and writing such files. On
Unix, it doesn’t hurt to append a 'b'
to the mode, so you can use it
platform-independently for all binary
files.
The files you're having a problem with are almost certainly identical if several different hashing algorithms all return the same hash results on them, or there's a bug in your implementation.
As a sanity test write your own "hash" that just returns the file's contents wholly, and see if this one generates the same "hashes".
As others have stated, a single hash collision is unlikely, and multiple nigh on impossible, unless the files are identical. I would recommend generating the sums with an external utility as something of a sanity check. For example, in Ubuntu (and most/all other Linux distributions):
blair#blair-eeepc:~$ md5sum Bandwagon.mp3
b87cbc2c17cd46789cb3a3c51a350557 Bandwagon.mp3
blair#blair-eeepc:~$ sha256sum Bandwagon.mp3
b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0 Bandwagon.mp3
A quick Google search shows there are similar utilities available for Windows machines. If you see the collisions with the external utilities, then the files are identical. If there are no collisions, you are doing something wrong. I doubt the Python implementation is wrong, as I get the same results when doing the hash in Python:
>>> import hashlib
>>> hashlib.md5(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b87cbc2c17cd46789cb3a3c51a350557'
>>> hashlib.sha256(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0'
Like #Delan Azabani said, there is something fishy here; collisions are bound to happen, but not that often. Check if the songs are the same, and update your post please.
Also, if you feel that you don't have enough keys, you can use two (or even more) hashing algorithms at the same time: by using MD5 for example, you have 2**128, or 340282366920938463463374607431768211456 keys. By using SHA-1, you have 2**160 or 1461501637330902918203684832716283019655932542976 keys. By combining them, you have 2**128 * 2**160, or 497323236409786642155382248146820840100456150797347717440463976893159497012533375533056.
(But if you ask me, MD5 is more than enough for your needs.)
I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
def readmyfile(filename, bytes=2, endian='>h'):
totalBytes = os.path.getsize(filename)
values = empty(totalBytes/bytes)
with open(filename, 'rb') as f:
for i in range(len(values)):
values[i] = struct.unpack(endian, f.read(bytes))[0]
return values
filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
I would directly read until EOF (it means checking for receiving an empty string), removing then the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store all that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time
values[i] = struct.unpack(endian,f.read(bytes))[0]
Why don't you read like, 1024 bytes at a time?
I have had the same kind of problem, although in my particular case I have had to convert a very strange binary format (500 MB) file with interlaced blocks of 166 elements that were 3-bytes signed integers; so I've had also the problem of converting from 24-bit to 32-bit signed integers that slow things down a little.
I've resolved it using NumPy's memmap (it's just a handy way of using Python's memmap) and struct.unpack on large chunk of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2) with f being a bigish file are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (with disc sectors usually several KB) off disc into memory because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the single byte conversions to a short and single calls to numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call struct.unpack_from to convert everything in a buffer vs one short at a time.
Consider:
#!/usr/bin/python
import random
import os
import struct
import numpy
import ctypes
def read_wopper(filename,bytes=2,endian='>h'):
buf_size=1024*2
buf=ctypes.create_string_buffer(buf_size)
new_buf=[]
with open(filename,'rb') as f:
while True:
st=f.read(buf_size)
l=len(st)
if l==0:
break
fmt=endian[0]+str(l/bytes)+endian[1]
new_buf+=(struct.unpack_from(fmt,st))
na=numpy.array(new_buf)
return na
fn='bigintfile'
def createmyfile(filename):
bytes=165924350
endian='>h'
f=open(filename,"wb")
count=0
try:
for int in range(0,bytes/2):
# The first 32,767 values are [0,1,2..0x7FFF]
# to allow testing the read values with new_buf[value<0x7FFF]
value=count if count<0x7FFF else random.randint(-32767,32767)
count+=1
f.write(struct.pack(endian,value&0x7FFF))
except IOError:
print "file error"
finally:
f.close()
if not os.path.exists(fn):
print "creating file, don't count this..."
createmyfile(fn)
else:
read_wopper(fn)
print "Done!"
I created a file of random shorts signed ints of 165,924,350 bytes (158.24 MB) which comports to 82,962,175 signed 2 byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, python 2.6.1 64 bit, 2.93 gHz Core i7, 8 GB ram. If you change buf_size=1024*2 in read_wopper to buf_size=2**16 the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottle neck, I think, is the single byte calls to unpack -- not your 2 byte reads from disc. You might want to make sure that your data files are not fragmented and if you are using OS X that your free disc space (and here) is not fragmented.
Edit I posted the full code to create then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create since the creation is one short at a time.