The Telegram documentation says the following about uploading files:
The file’s binary content is then split into parts. All parts must
have the same size (part_size) and the following conditions must be
met:
part_size % 1024 = 0 (divisible by 1KB)
524288 % part_size = 0 (512KB must be evenly divisible by part_size)
The last part does not have to satisfy these conditions, provided its
size is less than part_size. Each part should have a sequence number,
file_part, with a value ranging from 0 to 2,999.
My code:
def check_conditions(file_name):
    b = False
    file_binary_data = open("D:\\" + file_name, "rb").read()
    length = len(bytearray(file_binary_data))
    print(file_name + ", size: " + str(length) + " bytes")
    for i in range(1, 3000):
        part = length // i
        if part % 1024 == 0 and 524288 % part == 0:
            print("i: " + str(i) + " | part size: " + str(part))
            b = True
    if not b:
        print("No matches")
    print()
check_conditions("The White Stripes - Truth Doesn't Make A Noise.mp3")
check_conditions("Depeche Mode - Precious.mp3")
check_conditions("Placebo - Meds.mp3")
Output:
The White Stripes - Truth Doesn't Make A Noise.mp3, size: 7782220 bytes
No matches
Depeche Mode - Precious.mp3, size: 10298248 bytes
i: 1257 | part size: 8192
i: 2514 | part size: 4096
Placebo - Meds.mp3, size: 11808625 bytes
No matches
Where is my mistake? Or if everything is OK, what should I do with files that don't meet the conditions?
You are getting it wrong.
You simply divide your file into pieces of equal size.
"The White Stripes - Truth Doesn't Make A Noise.mp3" , size: 7782220
bytes
Say you are using the maximum piece size of 512 KB (i.e. 524288 bytes); then you simply have:
7782220 / 524288 = 14 rem 442188
Hence you have 14 pieces of 512 KB and a last piece of 442188 bytes.
Apply same logic to the other files.
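For reference, here is a minimal sketch of that splitting in Python (the 512 KB part size and the D:\ path are just carried over from the question and the answer above; only the last part may be smaller):
PART_SIZE = 524288  # 512 KB: divisible by 1024, and it evenly divides 524288

def split_file(path, part_size=PART_SIZE):
    """Yield (file_part, chunk) pairs; only the last chunk may be shorter."""
    with open(path, "rb") as f:
        file_part = 0
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            yield file_part, chunk
            file_part += 1

for file_part, chunk in split_file("D:\\The White Stripes - Truth Doesn't Make A Noise.mp3"):
    print(file_part, len(chunk))  # 14 chunks of 524288 bytes, then one of 442188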
I'm trying to understand what's going on internally with Python in the following.
Situation (Python 3 on Debian):
A (large) dict that has integers as keys (running from zero) and tuples as values.
The elements of the tuples are ALL integers (randomly chosen between zero and the largest key).
All tuples have exactly 30 elements.
Problem:
The pickled dict on my hard disk is significantly (approx. 10 times!) smaller than what the sizes of its individual elements in memory would suggest.
Details:
The size of an integer is 28 bytes (except 0, which is just 24 bytes).
The size of a tuple is dependent on the number of elements it contains; assuming 30 elements it is 288 bytes.
The size of a dictionary is dependent on the number of elements it contains; assuming 1000 elements it is 49248 bytes.
Given the situation above, 1000 elements in the dict, and assuming the number 0 appears 29 times in the tuples, I get:
size of the integers in the tuples: 28 x 30 x 1000 - 4 x 29 = 839,884 bytes
size of the tuples: 288 x 1000 = 288,000 bytes
size of the keys: 28 x 1000 - 4 (the first key is zero) = 27,996 bytes
size of the dict with 1000 elements: 49,248 bytes
Sum of this all = 1,205,128 bytes
Now I pickle this dict to harddisk as a binary file and I actually get 91,207 bytes as the size of the file.
So my question is now: what is going on here?
Is the pickling "compressing" the integers down to just their bits (or something like that)? The number 1000, for example, can be represented with just 10 bits and would fit into 2 bytes (instead of 28).
Code that might be useful:
import os
import pickle
import random
import sys

max_key = 1000
zeros = 0
theoretical_size = 0
the_dict = {}

for i in range(max_key):
    the_tuple = tuple()
    ii = 0
    while ii < 30:
        number = random.randint(0, (max_key - 1))
        if number not in the_tuple:
            the_tuple += (number, )
            theoretical_size += sys.getsizeof(number)
            ii += 1
            if not number:
                zeros += 1
    theoretical_size += sys.getsizeof(the_tuple)
    theoretical_size += sys.getsizeof(i)
    the_dict[i] = the_tuple

theoretical_size += sys.getsizeof(the_dict)

outfile = '/path/to/outfile/outfilename'
with open(outfile, 'wb') as f:
    pickle.dump(the_dict, f)

print("           zeros:", zeros)
print("theoretical size:", theoretical_size)
print("      Calculated:", 28*30*max_key - 4*zeros + 288*max_key + 28*max_key - 4 + sys.getsizeof(the_dict))
print("         On disk:", os.path.getsize(outfile))
I have to implement a BCH error-correcting code. I have found code using a BCH library in Python and a BCH encoder in MATLAB. However, the implementations have different performance: BCH(127,70) in Python can correct up to 70 bit flips in a block size of 127, whereas the MATLAB code can correct only up to 15 bits in 127 bits with BCH(127,15).
Why do these implementations perform differently?
Python Code
import bchlib
import hashlib
import os
import random
# create a bch object
BCH_POLYNOMIAL = 8219
BCH_BITS = 72
bch = bchlib.BCH(BCH_POLYNOMIAL, BCH_BITS)
# random data
data = bytearray(os.urandom(127))
# encode and make a "packet"
ecc = bch.encode(data)
packet = data + ecc
# print length of ecc, data, and packet
print('data size: %d' % (len(data)))
print('ecc size: %d' % (len(ecc)))
print('packet size: %d' % (len(packet)))
# print hash of packet
sha1_initial = hashlib.sha1(packet)
print('sha1: %s' % (sha1_initial.hexdigest(),))
def bitflip(packet):
    byte_num = random.randint(0, len(packet) - 1)
    bit_num = random.randint(0, 7)
    packet[byte_num] ^= (1 << bit_num)

# make BCH_BITS errors
for _ in range(BCH_BITS):
    bitflip(packet)
# print hash of packet
sha1_corrupt = hashlib.sha1(packet)
print('sha1: %s' % (sha1_corrupt.hexdigest(),))
# de-packetize
data, ecc = packet[:-bch.ecc_bytes], packet[-bch.ecc_bytes:]
# correct
bitflips = bch.decode_inplace(data, ecc)
print('bitflips: %d' % (bitflips))
# packetize
packet = data + ecc
# print hash of packet
sha1_corrected = hashlib.sha1(packet)
print('sha1: %s' % (sha1_corrected.hexdigest(),))
if sha1_initial.digest() == sha1_corrected.digest():
print('Corrected!')
else:
print('Failed')
This outputs
data size: 127
ecc size: 117
packet size: 244
sha1: 4ee71f947fc5d561b211a551c87fdef18a83404b
sha1: a072664312114fe59f5aa262bed853e35d70d349
bitflips: 72
sha1: 4ee71f947fc5d561b211a551c87fdef18a83404b
Corrected!
MATLAB code
%% bch params
M = 7;
n = 2^M-1; % Codeword length
k = 15; % Message length
nwords = 2; % Number of words to encode
% create a msg
msgTx = gf(randi([0 1],nwords,k));
%disp(msgTx)
%Find the error-correction capability.
t = bchnumerr(n,k)
% Encode the message.
enc = bchenc(msgTx,n,k);
%Corrupt up to t bits in each codeword.
noisycode = enc + randerr(nwords,n,1:t);
%Decode the noisy code.
msgRx = bchdec(noisycode,n,k);
% Validate that the message was properly decoded.
isequal(msgTx,msgRx)
which outputs:
t = 27
ans = logical 1
Increasing k beyond 15 in the MATLAB code gives the following error:
Error using bchnumerr (line 72)
The values for N and K do not produce a valid narrow-sense BCH code.
Error in bchTest (line 10)
t = bchnumerr(n,k)
I discovered this question today (24 January 2021) as I searched for other information about BCH codes.
See Appendix A: Code Generators for BCH Codes (pdf) of Error-Correction Coding for Digital Communications by George C. Clark and J. Bibb Cain:
For n = 127 and k = 15, t = 27 is the number of errors that can be corrected.
For n = 127, the next option with larger k is k = 22 and t = 23.
Your use of the Python library is confusing. For standard usage of BCH codes, the length of a codeword is equal to 2^m - 1 for some positive integer m. The codeword in your example is not of this form.
I have not used that Python library, so I cannot write with certainty. If ecc is of length 127, then I suspect that it is a codeword. Concatenating ecc and data yields a packet that has a copy of the original message data as well as a copy of the codeword. This is not how BCH codes are used. When you have the codeword, you don't need to send it and a separate copy of the original message.
If you do read the reference linked above, be aware of the notation used to describe the polynomials. For the n = 127 table, the polynomial g1(x) is denoted by 211, which is octal notation. The nonzero bits in the binary expressions indicate the nonzero coefficients of the polynomial.
octal: 211
binary: 010 001 001
polynomial: x^7 + x^3 + 1
The polynomial g2(x) is equal to g1(x) multiplied by another polynomial:
octal: 217
binary: 010 001 111
polynomial: x^7 + x^3 + x^2 + x + 1
This means that
g2(x) = (x^7 + x^3 + 1)(x^7 + x^3 + x^2 + x + 1)
Each g_(t+1)(x) is equal to g_t(x) multiplied by another polynomial.
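As a small illustration (not from the reference itself), the octal notation can be expanded into polynomial terms like this:
def octal_to_poly(octal_str):
    """Expand an octal generator such as '211' into its polynomial terms."""
    bits = bin(int(octal_str, 8))[2:]        # '10001001' for octal 211
    degree = len(bits) - 1
    terms = []
    for i, bit in enumerate(bits):
        if bit == '1':
            power = degree - i
            terms.append("x^%d" % power if power > 1 else ("x" if power == 1 else "1"))
    return " + ".join(terms)

print(octal_to_poly("211"))  # x^7 + x^3 + 1
print(octal_to_poly("217"))  # x^7 + x^3 + x^2 + x + 1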
I am trying to create a list of size 1 MB. While the following code works:
dummy = ['a' for i in xrange(0, 1024)]
sys.getsizeof(dummy)
Out[1]: 9032
The following code does not work.
import os
import sys
dummy = []
dummy.append(os.urandom(1024))
sys.getsizeof(dummy)
Out[1]: 104
Can someone explain why?
If you're wondering why I am not using the first code snippet, I am writing a program to benchmark my memory by writing a for loop that writes blocks (of size 1 B, 1 KB and 1 MB) into memory.
start = time.time()
for i in xrange(1, (1024 * 10)):
    dummy.append(os.urandom(1024)) #loop to write 1 MB blocks into memory
end = time.time()
If you check the size of a list, it will provide the size of the list data structure, including the pointers to its constituent elements. It won't consider the size of the elements themselves.
str1_size = sys.getsizeof(['a' for i in xrange(0, 1024)])
str2_size = sys.getsizeof(['abc' for i in xrange(0, 1024)])
int_size = sys.getsizeof([123 for i in xrange(0, 1024)])
none_size = sys.getsizeof([None for i in xrange(0, 1024)])
str1_size == str2_size == int_size == none_size
The size of empty list: sys.getsizeof([]) == 72
Add an element: sys.getsizeof([1]) == 80
Add another element: sys.getsizeof([1, 1]) == 88
So each element adds 8 bytes.
To get 1024 bytes, we need (1024 - 72) / 8 = 119 elements.
The size of the list with 119 elements: sys.getsizeof([None for i in xrange(0, 119)]) == 1080.
This is because a list maintains an extra buffer for inserting more items, so that it doesn't have to resize every time. (The size comes out to be the same, 1080, for any number of elements between 107 and 126.)
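You can watch the over-allocation happen with a minimal sketch like this (the exact numbers depend on the CPython version):
import sys

lst = []
previous = sys.getsizeof(lst)
print(0, previous)
for n in range(1, 130):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != previous:  # the reported size only jumps when the list grows its buffer
        print(n, size)
        previous = size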
So what we need is an immutable data structure, which doesn't need to keep this buffer: a tuple.
empty_tuple_size = sys.getsizeof(()) # 56
single_element_size = sys.getsizeof((1,)) # 64
pointer_size = single_element_size - empty_tuple_size # 8
n_1mb = (1024 - empty_tuple_size) / pointer_size # (1024 - 56) / 8 = 121
tuple_1mb = (1,) * n_1mb
sys.getsizeof(tuple_1mb) == 1024
So this is your answer to get a 1024-byte (1 KB) data structure: (1,)*121
But note that this is only the size of tuple and the constituent pointers. For the total size, you actually need to add up the size of individual elements.
Alternate:
sys.getsizeof('') == 37
sys.getsizeof('1') == 38 # each character adds 1 byte
For 1024 bytes, we need 987 characters:
sys.getsizeof('1'*987) == 1024
And this is the actual size, not just the size of pointers.
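If you do want the size of the container plus its elements, a rough sketch (Python 3 here; it ignores nesting and shared references) looks like this:
import os
import sys

def total_size(container):
    """Size of the container object itself plus the sizes of its direct elements."""
    return sys.getsizeof(container) + sum(sys.getsizeof(e) for e in container)

blocks = [os.urandom(1024) for _ in range(1024)]  # roughly 1 MB of payload
print(total_size(blocks))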
Disclaimer: This is a section from a uni assignment
I have been given the following AES-128-CBC key and told that up to 3 bits in the key have been changed/corrupted.
d9124e6bbc124029572d42937573bab4
The original key's SHA-1 hash is provided:
439090331bd3fad8dc398a417264efe28dba1b60
and I have to find the original key by trying all combinations of up to 3 bit flips.
Supposedly this is possible in 349633 guesses; however, I don't have a clue where that number came from. I would have assumed it would be closer to 128*127*126, which is over 2M combinations, and that's where my first problem lies.
Secondly, I created the Python script below containing a triple nested loop (I know, far from the best code...) to iterate over all 2M possibilities. However, after it completed an hour later, it hadn't found any matches, which I really don't understand.
Hoping someone can at least point me in the right direction. Cheers.
#!/usr/bin/python2
import sys
import commands

global binary

def inverseBit(index):
    global binary
    if binary[index] == "0":
        return "1"
    return "0"

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "Usage: bitflip.py <hex> <sha-1>"
        sys.exit()
    global binary
    binary = ""
    sha = str(sys.argv[2])
    binary = str(bin(int(sys.argv[1], 16)))
    binary = binary[2:]
    print binary
    b2 = binary
    tries = 0
    file = open("shas", "w")
    for x in range(-2, 128):
        for y in range(-1, 128):
            for z in range(0, 128):
                if x >= 0:
                    b2 = b2[:x] + inverseBit(x) + b2[x+1:]
                if y >= 0:
                    b2 = b2[:y] + inverseBit(y) + b2[y+1:]
                b2 = b2[:z] + inverseBit(z) + b2[z+1:]
                #print b2
                hexOut = hex(int(b2, 2))
                command = "echo -n \"" + hexOut + "\" | openssl sha1"
                cmdOut = str(commands.getstatusoutput(command))
                cmdOut = cmdOut[cmdOut.index('=')+2:]
                cmdOut = cmdOut[:cmdOut.index('\'')]
                file.write(str(hexOut) + " | " + str(cmdOut) + "\n")
                if len(cmdOut) != 40:
                    print cmdOut
                if cmdOut == sha:
                    print "Found bit reversals in " + str(tries) + " tries. Corrected key:"
                    print hexOut
                    sys.exit()
                b2 = binary
                tries = tries + 1
                if tries % 10000 == 0:
                    print tries
EDIT:
Changing the for loop to
for x in range(-2, 128):
    for y in range(x+1, 128):
        for z in range(y+1, 128):
drastically cuts down on the number of guesses while (I think?) still covering the whole space. I'm still getting some duplicates, and still no luck finding the match though.
Your code, if not very efficient, looks fine except for one thing:
hexOut = hex(int(b2,2))
as the output of hex
>>> hex(int('01110110000101',2))
'0x1d85'
starts with '0x', which shouldn't be part of the key. So you should be fine after removing those two characters.
For the number of possible keys to try, you have:
1 with no bit flipped
128 with 1 bit flipped
128*127/2 = 8128 with 2 bits flipped (128 ways to choose the first one, 127 ways to choose the second, and each pair will appear twice)
128*127*126/6 = 341376 with 3 bits flipped (each triplet appears 6 times). This is the number of combinations of 128 bits taken 3 at a time.
So, the total is 1 + 128 + 8128 + 341376 = 349633 possibilities.
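You can verify that total directly (a quick sketch; math.comb needs Python 3.8+):
from math import comb

total = sum(comb(128, k) for k in range(4))  # 0, 1, 2 or 3 flipped bits
print(total)  # 349633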
Your code tests each of them many times. You could avoid the useless repetitions by looping like this (for 3 bits):
for x in range(0, 128):
    for y in range(x+1, 128):
        for z in range(y+1, 128):
            .....
You could adapt your trick of starting at -2 with:
for x in range(-2, 128):
    for y in range(x+1, 128):
        for z in range(y+1, 128):
            .... same code you used ...
You could also generate the combinations with itertools.combinations:
from itertools import combinations
for x, y, z in combinations(range(128), 3): # for 3 bits
    ......
but you'd need a bit more work to manage the cases with 0, 1, 2 and 3 flipped bits.
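Putting it together, here is a minimal sketch of the whole search using hashlib instead of shelling out to openssl. It assumes the target SHA-1 was computed over the 32-character lowercase hex string of the key (which is what your openssl pipeline hashes once the '0x' prefix is removed); if it was computed over the raw 16 key bytes instead, hash bytes.fromhex(cand_hex):
import hashlib
from itertools import combinations

KEY_HEX = "d9124e6bbc124029572d42937573bab4"         # from the question
TARGET = "439090331bd3fad8dc398a417264efe28dba1b60"  # from the question

key = int(KEY_HEX, 16)
tries = 0
found = None
for k in range(4):                                # 0, 1, 2 or 3 bit flips
    for bits in combinations(range(128), k):
        candidate = key
        for b in bits:
            candidate ^= 1 << b                   # flip this bit
        cand_hex = format(candidate, "032x")      # zero-padded hex, no '0x' prefix
        tries += 1
        if hashlib.sha1(cand_hex.encode()).hexdigest() == TARGET:
            found = cand_hex
            break
    if found:
        break
print(tries, found)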
I have a big text file of 13 GB with 158,609,739 lines and I want to randomly select 155,000,000 lines.
I have tried to shuffle the file and then cut the first 155,000,000 lines, but it seems that my 16 GB of RAM isn't big enough for this. The pipelines I have tried are:
shuf file | head -n 155000000
sort -R file | head -n 155000000
Now, instead of selecting lines, I think it is more memory-efficient to delete 3,609,739 random lines from the file to get a final file of 155,000,000 lines.
As you copy each line of the file to the output, compute the probability that it should be deleted. The first line should have a 3,609,739/158,609,739 chance of being deleted. If you generate a random number between 0 and 1 and that number is less than that ratio, don't copy it to the output. The odds for the second line are then 3,609,738/158,609,738 (assuming the first line was deleted); if that line is not deleted, the odds for the third line are 3,609,738/158,609,737. Repeat until done.
Because the odds change with each line processed, this algorithm guarantees the exact line count. Once you've deleted 3,609,739 lines, the odds go to zero; if at any time you would need to delete every remaining line in the file, the odds go to one.
You could always pre-generate which line numbers (a list of 3,609,739 random numbers selected without replacement) you plan on deleting, then just iterate through the file and copy to another, skipping lines as necessary. As long as you have space for a new file this would work.
You could select the random numbers with random.sample
E.g.,
random.sample(xrange(158609739), 3609739)
Proof of Mark Ransom's Answer
Let's use numbers easier to think about (at least for me!):
10 items
delete 3 of them
First time through the loop we will assume that the first three items get deleted -- here's what the probabilities look like:
first item: 3 / 10 = 30%
second item: 2 / 9 = 22%
third item: 1 / 8 = 12%
fourth item: 0 / 7 = 0 %
fifth item: 0 / 6 = 0 %
sixth item: 0 / 5 = 0 %
seventh item: 0 / 4 = 0 %
eighth item: 0 / 3 = 0 %
ninth item: 0 / 2 = 0 %
tenth item: 0 / 1 = 0 %
As you can see, once it hits zero, it stays at zero. But what if nothing is getting deleted?
first item: 3 / 10 = 30%
second item: 3 / 9 = 33%
third item: 3 / 8 = 38%
fourth item: 3 / 7 = 43%
fifth item: 3 / 6 = 50%
sixth item: 3 / 5 = 60%
seventh item: 3 / 4 = 75%
eighth item: 3 / 3 = 100%
ninth item: 2 / 2 = 100%
tenth item: 1 / 1 = 100%
So even though the probability varies per line, overall you get the results you are looking for. I went a step further and coded a test in Python for one million iterations as a final proof to myself -- remove seven items from a list of 100:
# python 3.2
from __future__ import division
from stats import mean  # http://pypi.python.org/pypi/stats
import random

counts = dict()
for i in range(100):
    counts[i] = 0

removed_failed = 0
for _ in range(1000000):
    to_remove = 7
    from_list = list(range(100))
    removed = 0
    while from_list:
        current = from_list.pop()
        probability = to_remove / (len(from_list) + 1)
        if random.random() < probability:
            removed += 1
            to_remove -= 1
            counts[current] += 1
    if removed != 7:
        removed_failed += 1

print(counts[0], counts[1], counts[2], '...',
      counts[49], counts[50], counts[51], '...',
      counts[97], counts[98], counts[99])
print("remove failed: ", removed_failed)
print("min: ", min(counts.values()))
print("max: ", max(counts.values()))
print("mean: ", mean(counts.values()))
and here's the results from one of the several times I ran it (they were all similar):
70125 69667 70081 ... 70038 70085 70121 ... 70047 70040 70170
remove failed: 0
min: 69332
max: 70599
mean: 70000.0
A final note: Python's random.random() is [0.0, 1.0) (doesn't include 1.0 as a possibility).
I believe you're looking for "Algorithm S" from section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981).
You can see several implementations at http://rosettacode.org/wiki/Knuth%27s_algorithm_S
The Perlmonks list has some Perl implementations of Algorithm S and Algorithm R that might also prove useful.
These algorithms rely on there being a meaningful interpretation of floating point numbers like 3609739/158609739, 3609738/158609738, etc. which might not have sufficient resolution with a standard Float datatype, unless the Float datatype is implemented using numbers of double precision or larger.
Here's a possible solution using Python:
import random

# a set makes the membership test in the loop fast
skipping = set(random.sample(range(158609739), 3609739))

infile = open('infile.txt')         # path to the 13 GB input file
outfile = open('outfile.txt', 'w')  # path to the reduced output file
for i, line in enumerate(infile):
    if i in skipping:
        continue
    outfile.write(line)
infile.close()
outfile.close()
Here's another using Mark's method:
import random

lines_in_file = 158609739
lines_left_in_file = lines_in_file
lines_to_delete = lines_in_file - 155000000

infile = open('infile.txt')         # path to the 13 GB input file
outfile = open('outfile.txt', 'w')  # path to the reduced output file
try:
    for line in infile:
        current_probability = lines_to_delete / lines_left_in_file
        lines_left_in_file -= 1
        if random.random() < current_probability:
            lines_to_delete -= 1
            continue
        outfile.write(line)
except ZeroDivisionError:
    print("More than %d lines in the file" % lines_in_file)
finally:
    infile.close()
    outfile.close()
I wrote this code before seeing that Darren Yin had expressed the same principle.
I've modified my code to adopt the name skipping (I didn't dare to choose kangaroo ...) and the keyword continue from Ethan Furman, whose code follows the same principle too.
I defined default arguments for the function's parameters so that it can be used several times without having to re-assign them at each call.
import random
import os.path

def spurt(ff, skipping):
    for i, line in enumerate(ff):
        if i in skipping:
            print 'line %d excluded : %r' % (i, line)
            continue
        yield line

def randomly_reduce_file(filepath, nk=None,
                         d={0: 'st', 1: 'nd', 2: 'rd', 3: 'th'}, spurt=spurt,
                         sample=random.sample, splitext=os.path.splitext):
    # count of the lines of the original file
    with open(filepath) as f:
        nl = sum(1 for _ in f)
    # asking for the number of lines to keep, if not given as argument
    if nk is None:
        nk = int(raw_input(' The file has %d lines.'
                           ' How many of them do you '
                           'want to randomly keep ? : ' % nl))
    # transfer of the lines to keep,
    # from one file to another file with a different name
    if nk < nl:
        with open(filepath, 'rb') as f,\
             open('COPY'.join(splitext(filepath)), 'wb') as g:
            g.writelines(spurt(f, sample(xrange(0, nl), nl - nk)))
            # sample(xrange(0, nl), nl - nk) is the list
            # of the line numbers to be excluded
    else:
        print ' %d is %s than the number of lines (%d) in the file\n'\
              ' no operation has been performed'\
              % (nk, 'the same' if nk == nl else 'greater', nl)
With the $RANDOM variable you can get a random number between 0 and 32,767.
With this, you could read in each line, and see if $RANDOM is less than 155,000,000 / 158,609,739 * 32,767 (which is 32,021), and if so, let the line through.
Of course, this wouldn't give you exactly 155,000,000 lines, but pretty close to it, depending on the uniformity of the random number generator.
EDIT: Here is some code to get you started:
#!/bin/bash
while read -r line; do
    if (( $RANDOM < 32021 ))
    then
        echo "$line"
    fi
done
Call it like so:
thatScript.sh <inFile.txt >outFile.txt