Python: My script will not allow me to create large files

I'm making a small Python script that creates random files in all shapes and sizes, but it will not let me create large files. I want to be able to create files up to around 8GB in size; I know this would take a long time, but I'm not concerned about that.
The problem is that Python 2.7 will not handle the large numbers I am throwing at it in order to create the random text that will fill my files.
The aim of my code is to create files with random names and extensions, fill them with a random amount of junk text, and save them. It keeps repeating this until I close the command-line window.
import os
import string
import random

ext = ['.zip', '.exe', '.txt', '.pdf', '.msi', '.rar', '.jpg', '.png', '.html', '.iso']

min = raw_input("Enter a minimum file size eg: 112 (meaning 112 bytes): ")
minInt = int(min)
max = raw_input("Enter a maximum file size: ")
maxInt = int(max)

def name_generator(chars=string.ascii_letters + string.digits):
    return ''.join(random.choice(chars) for x in range(random.randint(1, 10)))

def text_generator(chars=string.printable + string.whitespace):
    return ''.join(random.choice(chars) for x in range(random.randint(minInt, maxInt)))

def main():
    fileName = name_generator()
    extension = random.choice(ext)
    file = fileName + extension
    print 'Creating ==> ' + file
    fileHandle = open(file, 'w')
    fileHandle.write(text_generator())
    fileHandle.close()
    print file + ' ==> Was born!'

while 1:
    main()
Any help will be much appreciated!

Make it lazy, as per the following:
import string
import random
from itertools import islice

chars = string.printable + string.whitespace

# make an endless iterator of random chars
random_chars = iter(lambda: random.choice(chars), '')

with open('output_file', 'w', buffering=102400) as fout:
    fout.writelines(islice(random_chars, 1000000))  # write 'n' many

The problem is not that Python cannot handle large numbers. It can.
However, you are trying to hold the whole file contents in memory at once. You might not have enough RAM for that, and you do not want to do it anyway.
The solution is to use a generator and write the data in chunks:
def text_generator(chars=string.printable + string.whitespace):
    return (random.choice(chars) for x in range(random.randint(minInt, maxInt)))

for char in text_generator():
    fileHandle.write(char)
This is still horribly inefficient, though: you want to write your data in blocks of e.g. 10 kB instead of single bytes.
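A minimal sketch of that chunked approach, reusing minInt and maxInt from the question (the 10 kB block size and the file name junk.txt are just illustrations):
import string
import random

CHUNK = 10 * 1024  # 10 kB per write; an arbitrary illustrative block size

def text_chunks(total, chars=string.printable + string.whitespace):
    # yield the junk text in blocks instead of one character at a time
    written = 0
    while written < total:
        n = min(CHUNK, total - written)
        yield ''.join(random.choice(chars) for _ in xrange(n))
        written += n

with open('junk.txt', 'w') as fh:
    for block in text_chunks(random.randint(minInt, maxInt)):
        fh.write(block)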

A comment about performance: you could improve it by using os.urandom() to generate random bytes and str.translate() to translate them into the range of input characters:
import os
import string

def generate_text(size, chars=string.printable + string.whitespace):
    # make translation table from 0..255 to chars[0..len(chars)-1]
    all_chars = string.maketrans('', '')
    assert 0 < len(chars) <= len(all_chars)
    result_chars = ''.join(chars[b % len(chars)] for b in range(len(all_chars)))
    # generate `size` random bytes and translate them into given `chars`
    return os.urandom(size).translate(string.maketrans(all_chars, result_chars))
Example:
with open('output.txt', 'wb') as outfile:  # use binary mode
    chunksize = 1 << 20  # 1MB
    N = 8 * (1 << 10)    # (N * chunksize) == 8GB
    for _ in xrange(N):
        outfile.write(generate_text(chunksize))
Note: to avoid skewing the random distribution, bytes larger than k*len(chars)-1 returned by os.urandom() should be discarded, where k*len(chars) <= 256 < (k+1)*len(chars).
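A minimal sketch of that rejection step (the function name generate_text_unbiased and its structure are my own illustration, not part of the answer above):
import os
import string

def generate_text_unbiased(size, chars=string.printable + string.whitespace):
    # largest multiple of len(chars) that fits in a byte; bytes at or above
    # this limit are discarded so every character stays equally likely
    limit = (256 // len(chars)) * len(chars)
    result = []
    while len(result) < size:
        for b in bytearray(os.urandom(size)):
            if b < limit:
                result.append(chars[b % len(chars)])
    return ''.join(result[:size])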


Is the for loop in my code the speed bottleneck?

The following code looks through 2,500 markdown files, 76,475 lines in total, to check each one for the presence of two strings.
#!/usr/bin/env python3
# encoding: utf-8
import re
import os

zettelkasten = '/Users/will/Dropbox/zettelkasten'

def zsearch(s, *args):
    for x in args:
        r = (r"(?=.* " + x + ")")
        p = re.search(r, s, re.IGNORECASE)
        if p is None:
            return None
    return s

for filename in os.listdir(zettelkasten):
    if filename.endswith('.md'):
        with open(os.path.join(zettelkasten, filename), "r") as fp:
            for line in fp:
                result_line = zsearch(line, "COVID", "vaccine")
                if result_line != None:
                    UUID = filename[-15:-3]
                    print(f'›[[{UUID}]] OR', end=" ")
This correctly gives output like:
›[[202202121717]] OR ›[[202003311814]] OR
but it takes almost two seconds to run on my machine, which I think is much too slow. What, if anything, can be done to make it faster?
The main bottleneck is the regular expressions you're building.
If we print(f"{r=}") inside the zsearch function:
>>> zsearch("line line covid line", "COVID", "vaccine")
r='(?=.* COVID)'
r='(?=.* vaccine)'
The (?=.*) lookahead is what is causing the slowdown - and it's also not needed.
You can achieve the same result by searching for:
r=' COVID'
r=' vaccine'
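A minimal sketch of the function with the lookahead dropped (same behaviour, under the assumption stated above that each pattern just has to occur somewhere in the line, preceded by a space):
import re

def zsearch(s, *args):
    # every pattern must appear in the line (preceded by a space);
    # plain patterns avoid the cost of the (?=.*) lookahead
    for x in args:
        if re.search(" " + x, s, re.IGNORECASE) is None:
            return None
    return s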

How to insert random spaces in txt file?

I have a file called 'DNASeq.txt' with lines of DNA. I need code that reads each line and splits it at random places (inserting spaces) throughout the line. Each line needs to be split at different places.
EX: I have:
AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ
And I need something like this:
AAA ADSF DFAFDDSAF ADF ADSF AFD AFAD
I have tried (!!!very new to python!!):
import random
for x in range(10):
    print(random.randint(50, 250))
but that just prints random numbers. Is there some way to get a random number generated as a variable?
You can read a file line by line, write each line character by character to a new file, and insert spaces randomly:
Create demo file without spaces:
with open("t.txt","w") as f:
f.write("""ASDFSFDGHJEQWRJIJG
ASDFJSDGFIJ
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGO""")
Read and rewrite demo file:
import random

max_no_space = 9  # maximum sequence length without a space
no_space = 0

with open("t.txt", "r") as f, open("n.txt", "w") as w:
    for line in f:
        for c in line:
            w.write(c)
            if random.randint(1, 6) == 1 or no_space >= max_no_space:
                w.write(" ")
                no_space = 0
            else:
                no_space += 1

with open("n.txt") as k:
    print(k.read())
Output:
ASDF SFD GHJEQWRJIJG
A SDFJ SDG FIJ
SADFJSD FJ JDSFJIDFJG I JSRGJSDJ FIDJFG
The pattern of spaces is random. You can influence it by setting max_no_space, or remove the randomness to split after max_no_space characters every time, as sketched below.
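A minimal sketch of that deterministic variant (splitting after exactly max_no_space characters, no randomness; this is my illustration, not part of the answer above):
max_no_space = 9

with open("t.txt") as f, open("n.txt", "w") as w:
    for line in f:
        line = line.rstrip("\n")
        # cut the line into fixed-width pieces and join them with spaces
        pieces = [line[i:i + max_no_space] for i in range(0, len(line), max_no_space)]
        w.write(" ".join(pieces) + "\n")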
Edit:
Writing one character at a time is not very economical if you need blocks of 200+ characters; you can do it with the same code like so:
with open("t.txt","w") as f:
f.write("""ASDFSFDGHJEQWRJIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGG
ASDFJSDGFIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJK
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJF
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG""")
import random
min_no_space = 10
max_no_space = 20 # if max sequence length without space
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
for line in f:
for c in line:
w.write(c)
if no_space > min_no_space:
if random.randint(1,6) == 1 or no_space >= max_no_space:
w.write(" ")
no_space = 0
else:
no_space += 1
with open("n.txt") as k:
print(k.read())
Output:
ASDFSFDGHJEQ WRJIJSADFJSDF JJDSFJIDFJGIJ SRGJSDJFIDJFGG
ASDFJSDGFIJSA DFJSDFJJDSFJIDF JGIJSRGJSDJFIDJ FGSADFJSDFJJ DSFJIDFJGIJK
SADFJ SDFJJDSFJIDFJG IJSRGJSDJFIDJ FGSADFJSDFJJDS FJIDFJGIJSRG JSDJFIDJF
SDFJG IKDSFGOROHPTLPASDMKFGD OKRAMGSADFJSDF JJDSFJIDFJGI JSRGJSDJFIDJFG
If you want to split your DNA a fixed number of times (10 in my example), here's what you could try:
import random

DNA = 'AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ'
splitted_DNA = ''
for split_idx in sorted(random.sample(range(len(DNA)), 10)):
    splitted_DNA += DNA[len(splitted_DNA) - splitted_DNA.count(' '):split_idx] + ' '
splitted_DNA += DNA[split_idx:]
print(splitted_DNA)  # -> AAACCCHT HTH D AF HD SA F JANFAJDSNFA DK FAFJ
import random

with open('source', 'r') as in_file:
    with open('dest', 'w') as out_file:
        for line in in_file:
            newLine = ''.join(map(lambda x: x + ' ' * random.randint(0, 1), line)).strip() + '\n'
            out_file.write(newLine)
Since you mentioned being new, I'll try to explain:
- I'm writing the new sequences to another file as a precaution. It's not safe to write to the file you are reading from.
- The with construct is so that you don't need to explicitly close the file you opened.
- Files can be read line by line using a for loop.
- ''.join() converts a list to a string.
- map() applies a function to every element of a list and returns the results as a new list.
- lambda is how you define a function without naming it; lambda x: 2*x doubles the number you feed it.
- x + ' ' * 3 adds 3 spaces after x, and random.randint(0, 1) returns either 0 or 1, so I'm randomly selecting whether to add a space after each character. If random.randint() returns 0, 0 spaces are added.
You can toss a coin after each character to decide whether to add a space there or not. The following function takes a string as input and returns it with spaces inserted at random places:
from random import randint

def insert_random_spaces(s):
    # after each character, append a space with probability 1/2
    return "".join(x + randint(0, 1) * " " for x in s)
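For example, applied to the string from the question (the output differs on each run):
print(insert_random_spaces("AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ"))
# e.g. AAAC CCHTH THDAFHD SAFJ ANFAJDS NFADKFA FJ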

Given a string in Python, how would I split it at every nth byte and perform an action with each chunk automatically?

buf = "\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\xcd\x80";
Given this shellcode string (just an example), I would like to split it into multiple chunks of nth size.
Once it has been split into an unknown number of chunks, I would then like it to automatically perform a function such as
os.system("echo " + chunk[1] + ">>/tmp/final")
os.system("echo " + chunk[2] + ">>/tmp/final")
but without specifying each action every time, and without knowing the number of chunks it has been split into.
See if the code below helps you; it divides the string into chunks of n characters and echoes each one to the file:
import subprocess

n = 4  # example chunk size; pick whatever nth size you need
nbuf = [buf[i:i + n] for i in range(0, len(buf), n)]
for st in nbuf:
    subprocess.call("echo " + st + " >> /tmp/final", shell=True)
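If the goal is simply to append each chunk to /tmp/final, a minimal sketch that skips the shell round-trip entirely (my assumption about the intent; note that echoing raw shellcode bytes through a shell can easily mangle them):
n = 4  # hypothetical chunk size
with open("/tmp/final", "a") as out:
    for i in range(0, len(buf), n):
        out.write(buf[i:i + n] + "\n")  # one chunk per line, like echo >>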

Python - How can I avoid b"\n" being written in a text file with "wb" causing a line break?

Here is my case:
# Small hashing script example
import hashlib
import os

def original_pass(Password):
    salt = os.urandom(64)
    hashed = hashlib.pbkdf2_hmac("sha512", Password.encode(), salt, 300000, dklen=124)
    with open("Hash.txt", "wb") as file:
        file.write(salt + b"\n" + hashed)
        file.close()

def check_pass(New_Pass):
    with open("Hash.txt", "rb") as file:
        f = file.readlines()
        print(f)
        check_hash = hashlib.pbkdf2_hmac("sha512", New_Pass.encode(), f[0].strip(b"\n"), 300000, dklen=124)
        file.close()
    if check_hash == f[1]:
        return True
    else:
        return False

original_pass("Password")
print(check_pass("Password"))
My problem is that occasionally the hash will contain characters such as \n, e.g. b"x004\n4\no5". That line gets split into b"x004\n" and b"4\no5", which causes errors when I try to read back things like the salt, since it may be split into multiple pieces. Is there any way of avoiding it being read like that, or of stopping it from being written that way?
To address the duplicate remark:
I am dealing specifically with byte strings here, not regular strings. The two are distinct data types, more so in Python 3 (the version I'm using), as seen here: What is the difference between a string and a byte string?. This distinction means that certain string methods such as .encode() don't work on byte strings, so there is a clear difference in the data types I'm dealing with and how they are manipulated.
Based on @Blckknght's comment, here is code using knowledge of the fixed salt length:
import hashlib
import os

SALT_LEN = 64

def hash(password, salt):
    return hashlib.pbkdf2_hmac('sha512', password.encode(), salt, 300000, dklen=124)

def original_pass(password):
    salt = os.urandom(SALT_LEN)
    hashed = hash(password, salt)
    with open('hash.bin', 'wb') as file:
        file.write(salt + hashed)

def check_pass(password):
    with open('hash.bin', 'rb') as file:
        data = file.read()
    salt, hashed = data[:SALT_LEN], data[SALT_LEN:]
    check_hash = hash(password, salt)
    return check_hash == hashed

original_pass('Password')
print(check_pass('Password'))

How to read the n last lines of a file?

I have to read the last 4 lines of a file.
I tried the following:
top_tb_comp_file = open('../../ver/sim/top_tb_compile.tcl', 'r+')
top_tb_comp_end = top_tb_comp_file.readlines()[:-4]
top_tb_comp_file.close()
Didn't work (I get the first line of the file in top_tb_comp_end).
The following example opens a file named names.txt and prints the last 4 lines in the file. Applied to your example, you only need to take away the pattern given on lines 2, 5, and 7. The rest is simple.
#! /usr/bin/env python3
import collections
def main():
    with open('names.txt') as file:
        lines = collections.deque(file, 4)
        print(*lines, sep='')
if __name__ == '__main__':
    main()
Your indexing is wrong. With the [:-4], you are asking for the exact opposite of what you actually want.
Try the following:
top_tb_comp_file = open('../../ver/sim/top_tb_compile.tcl', 'r+')
top_tb_comp_end = top_tb_comp_file.readlines()[-4:]
# note that the '-4' is now before the ':'
top_tb_comp_file.close()
EDIT
Thanks to @Noctis, I have done some benchmarking around the question, comparing the speed and memory usage of the collections.deque option and the file.readlines one.
The collections option suggested by @Noctis seems to be better in terms of memory usage AND speed: in my results I observed a small peak in memory usage at the critical line file.readlines()[-4:] which did not happen at the line collections.deque(file, 4). Moreover, when I repeated the speed test including the file-reading phase, the collections option also came out faster.
I have experienced some issues displaying the output of this code with the SO rendering, but if you install the packages memory_profiler and psutil you should be able to see for yourself (with a large file).
import sys
import collections
import time
from memory_profiler import profile

@profile
def coll_func(filename):
    with open(filename) as file:
        lines = collections.deque(file, 4)
    return 0

@profile
def indexing_func(filename):
    with open(filename) as file:
        lines = file.readlines()[-4:]
    return 0

@profile
def witness_func(filename):
    with open(filename) as file:
        pass
    return 0

def square_star(s_toprint, ext="-"):
    def surround(s, ext="+"):
        return ext + s + ext
    hbar = "-" * (len(s_toprint) + 1)
    return (surround(hbar) + "\n"
            + surround(s_toprint, ext='|') + "\n"
            + surround(hbar))

if __name__ == '__main__':
    s_fname = sys.argv[1]
    s_func = sys.argv[2]
    d_func = {
        "1": coll_func,
        "2": indexing_func,
        "3": witness_func,
    }
    func = d_func[s_func]
    start = time.time()
    func(s_fname)
    elapsed_time = time.time() - start
    s_toprint = square_star("Elapsed time:\t{}".format(elapsed_time))
    print(s_toprint)
Just type the following:
python3 -m memory_profiler profile.py "my_file.txt" n
n being 1, 2 or 3.
