I get packages of binary strings of size 61440 in hex values, something like:
b'004702AF42324fe380ac...'
I need to split those into batches of 4 and convert them to integers. 16 bit would be preferred, but casting this later is not a problem. The way I did it looks like this, and it works.
out = [int(img[i][j:j+4],16) for j in range(0,len(img[i]), 4)]
The issue I'm having is performance. I get a minimum of 200 of those a second, possibly more, and without multithreading I can only process 100-150 a second.
Can I improve the speed of this in some way?
This is a rewrite of my earlier offering showing how multithreading does, in fact, make a very significant difference - possibly depending on the system architecture.
The following code executes in ~0.05s on my machine:-
import random
from datetime import datetime
import concurrent.futures
N = 10
R = 61440
IMG = []
for _ in range(N):
    IMG.append(''.join(random.choice('0123456789abcdef')
                       for _ in range(R)))
"""
now IMG has N elements each containing R pseudo randomly generated hexadecimal values
"""
def tfunc(img, k):
    return k, [int(img[j:j + 4], 16) for j in range(0, len(img), 4)]
R = [0] * N
start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    """
    note that we pass the relevant index to the worker function
    because we can't be sure of the order of completion
    """
    for i in range(N):
        futures.append(executor.submit(tfunc, IMG[i], i))
    for future in concurrent.futures.as_completed(futures):
        k, r = future.result()
        R[k] = r
"""
list R now contains the converted values from the same relative indexes in IMG
"""
print(f'Duration={datetime.now()-start}')
I don't think multi-threading will help in this case as it's purely CPU intensive. The overheads of breaking it down over, say, 4 threads would outweigh any theoretical advantages. Your list comprehension appears to be as efficient as it can be although I'm unclear as to why img seems to have multiple dimensions. I've written the following simulation and on my machine this consistently executes in ~0.8 seconds. I think the performance you'll get from your code is going to be highly dependent on your CPU's capabilities. Here's the code:-
import random
from datetime import datetime
hv = '0123456789abcdef'
img = ''.join(random.choice(hv) for _ in range(61440))
start = datetime.now()
for _ in range(200):
    out = [int(img[j:j + 4], 16) for j in range(0, len(img), 4)]
print(f'Duration={datetime.now()-start}')
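If the per-chunk int(..., 16) calls turn out to be the bottleneck, one alternative worth timing is to decode the whole hex string to raw bytes once and then unpack the 16-bit values in bulk. This is only a sketch using the standard library. It assumes the input length is a multiple of 4 and that each group of 4 hex characters should be read as a big-endian unsigned 16-bit value (which is what int(chunk, 16) gives). The helper name hex_to_uint16 is made up for illustration.

```python
import binascii
import struct

def hex_to_uint16(hex_data):
    """Convert a hex string (str or bytes) to a tuple of 16-bit unsigned ints."""
    if isinstance(hex_data, str):
        hex_data = hex_data.encode("ascii")
    raw = binascii.unhexlify(hex_data)   # 2 hex characters -> 1 byte
    count = len(raw) // 2                # 2 bytes -> one 16-bit value
    # ">" = big-endian, "H" = unsigned 16-bit; same value as int(chunk, 16)
    return struct.unpack(f">{count}H", raw)

sample = b"004702AF42324fe3"
values = hex_to_uint16(sample)
print(values)
```

Whether this is actually faster on a given machine should be measured with the same timing loop as above rather than assumed.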
I did some more research and found that it is not multithreading but multiprocessing that I need. That gave me a speedup from 220 batches per second to ~370 batches per second. This probably bottlenecks somewhere else now, since I only get 15% load on all cores, but it puts me comfortably above spec and that's good enough.
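import numpy as np
from multiprocessing import Pool
def combine(img):
    # parse each group of 4 hex characters as an integer, then shape into a 24x640 array
    return np.array([int(img[j:j+4],16) for j in range(0,len(img), 4)]).reshape((24,640))
p = Pool(20)
# tmp is assumed to be the list of received hex strings
img = p.map(combine, tmp)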
A friend and I want to create a script that gives us every possible permutation of a six-character code, drawn from 36 alphanumeric characters (0-9 and a-z), in alphabetical order, and then be able to see them in a .txt file.
And I want it to use all of the CPU and RAM it can, so that it takes less time to complete the task.
So far, this is the code:
import random
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
file = open("codes.txt", "a")
for g in range(0, 36**6):
    key = ""
    base = ""
    print(str(g))
    for i in range(0, 6):
        char = random.choice(charset)
        key += char
    base += key
    file.write(base + "\n")
file.close()
This code randomly generates the combinations and immediately writes them to a .txt file, while printing the number of codes it has already created. However, the output isn't in alphabetical order (I have to sort it afterwards), and it takes too long.
How can the code be improved to give the desired outcome?
Thanks to #R0Best for providing the best answer
Although this post already has 6 answers, I'm not content with any of them, so I've decided to contribute a solution of my own.
First, note that many of the answers provide the combinations or permutations of letters, however the post actually wants the Cartesian product of the alphabet with itself (repeated N times, where N=6). There are (at this time) two answers that do this, however they both write an excessive number of times, resulting in subpar performance, and also concatenate their intermediate results in the hottest portion of the loop (also bringing down performance).
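To put rough numbers on the difference between those three interpretations, here is a small arithmetic sketch. It assumes Python 3.8+ (for math.perm and math.comb) and one byte per character plus a newline per line. The variable names are mine.

```python
import math

ALPHABET_SIZE = 36
CODE_LENGTH = 6

# Cartesian product: characters may repeat, order matters
with_repeats = ALPHABET_SIZE ** CODE_LENGTH
# What itertools.permutations yields: no repeats, order matters
no_repeats_ordered = math.perm(ALPHABET_SIZE, CODE_LENGTH)
# What itertools.combinations yields: no repeats, order ignored
no_repeats_unordered = math.comb(ALPHABET_SIZE, CODE_LENGTH)

# Each output line is CODE_LENGTH characters plus a newline
full_file_bytes = with_repeats * (CODE_LENGTH + 1)

print(with_repeats, no_repeats_ordered, no_repeats_unordered, full_file_bytes)
```

Only the first count includes codes such as "aaaaaa", which is why permutations and combinations produce noticeably smaller outputs.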
In the interest of taking optimization to the absolute max, I present the following code:
from string import digits, ascii_lowercase
from itertools import chain
ALPHABET = (digits + ascii_lowercase).encode("ascii")
def fast_brute_force():
    # Define some constants to make the following sections more readable
    base_size = 6
    suffix_size = 4
    prefix_size = base_size - suffix_size
    word_size = base_size + 1
    # define two containers
    # word_blob - placeholder words, with hyphens in the unpopulated characters (followed by newline)
    # sleds - a tuple of repeated bytes, used for substituting a bunch of characters in a batch
    word_blob = bytearray(b"-" * base_size + b"\n")
    sleds = tuple(bytes([char]) for char in ALPHABET)
    # iteratively extend word_blob and sleds, and filling in unpopulated characters using the sleds
    # in doing so, we construct a single "blob" that contains concatenated suffixes of the desired
    # output with placeholders so we can quickly substitute in the prefix, write, repeat, in batches
    for offset in range(prefix_size, base_size)[::-1]:
        word_blob *= len(ALPHABET)
        word_blob[offset::word_size] = chain.from_iterable(sleds)
        sleds = tuple(sled * len(ALPHABET) for sled in sleds)
    with open("output.txt", "wb") as f:
        # I've expanded out the logic for substituting in the prefixes into explicit nested for loops
        # to avoid both redundancy (reassigning the same value) and avoiding overhead associated with
        # a recursive implementation
        # I assert this below, so any changes in suffix_size will fail loudly
        assert prefix_size == 2
        for sled1 in sleds:
            word_blob[0::word_size] = sled1
            for sled2 in sleds:
                word_blob[1::word_size] = sled2
                # we write to the raw FileIO since we know we don't need buffering or other fancy
                # bells and whistles, however in practice it doesn't seem that much faster
                f.raw.write(word_blob)
There's a lot of magic happening in that code block, but in a nutshell:
I batch the writes, so that I'm writing 36**4 or 1679616 entries at once, so there's less context switching.
I update all 1679616 entries per batch simultaneously with the new prefix, using bytearray slicing / assignment.
I operate on bytes, write to the raw FileIO, expand the loops for the prefix assignments, and other small optimizations to avoid encoding/buffering/function call overhead/other performance hits.
Note, unless you have a very fast disk and slowish CPU, you won't see much benefit from the smaller optimizations, just the write batching probably.
On my system, it takes about 45 seconds to produce and write the 14880348 KB file, and that's writing to my slowest disk. On my NVMe drive, it takes 6.868 seconds.
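The slice-assignment trick described above is easier to follow on a toy alphabet. The following is a scaled-down sketch of the same idea, not the code from the answer: the function name build_suffix_blob and the two-letter alphabet are made up for illustration, and it fills in every column rather than leaving a prefix to be substituted later.

```python
from itertools import chain

def build_suffix_blob(alphabet: bytes, width: int) -> bytearray:
    """Build all newline-terminated words of `width` characters in one bytearray."""
    word_size = width + 1                       # characters plus the newline
    blob = bytearray(b"-" * width + b"\n")      # one placeholder word
    sleds = tuple(bytes([c]) for c in alphabet)
    # Fill columns from right to left; each pass multiplies the number of words
    for offset in reversed(range(width)):
        blob *= len(alphabet)
        # Extended-slice assignment hits the same column of every word at once
        blob[offset::word_size] = bytes(chain.from_iterable(sleds))
        # Longer sleds make the next (more significant) column change more slowly
        sleds = tuple(sled * len(alphabet) for sled in sleds)
    return blob

blob = build_suffix_blob(b"ab", 2)
print(blob.decode("ascii"), end="")
```

With the toy input the blob holds the four words aa, ab, ba, bb in sorted order, one per line, which is the same layout the full-size version writes in each batch.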
The fastest way I can think of is using pypy3 with this code:
import functools
import time
from string import digits, ascii_lowercase
#functools.lru_cache(maxsize=128)
def main():
    cl = []
    cs = digits + ascii_lowercase
    for letter in cs:
        cl.append(letter)
    ct = tuple(cl)
    with open("codes.txt", "w") as file:
        for p1 in ct:
            for p2 in ct:
                for p3 in ct:
                    for p4 in ct:
                        for p5 in ct:
                            for p6 in ct:
                                file.write(f"{p1}{p2}{p3}{p4}{p5}{p6}\n")
if __name__ == '__main__':
    start = time.time()
    main()
    print(f"Done!\nTook {time.time() - start} seconds!")
It writes at around 10-15MB/s. The total file is around 15GB I believe, so it would take around 990-1500 seconds to generate. These results are from an unraid VM with one 3.4 GHz server CPU core and an old SATA3 SSD. You will probably get better results with an NVMe drive and a faster single-core CPU.
random can be very inefficient. You can try:
from itertools import permutations
from pandas import Series
charset = list("0123456789abcdefghijklmnopqrstuvwxyz")
links = []
file = open("codes.txt", "a")
comb = permutations(charset,6)
comb = list(comb)
comb = list(map(lambda x: ''.join(x), comb))  # a lambda body is an expression, so no `return`
mySeries = Series(comb)
mySeries = mySeries.sort_values()
for k in mySeries:
    file.write(k + "\n")  # write each code on its own line
file.close()
You could use itertools.permutations from the standard library. You can also specify the number of characters in each permutation.
from itertools import permutations
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
c = permutations(charset, 6)
with open('code.txt', 'w') as f:
    for i in c:
        f.write("".join(i) + '\n')
Runs on my computer in about 200 milliseconds for creating the list of permutations, then spends a lot of time writing to the file
For permutations, this would do the trick:
from itertools import permutations
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
with open("codes.txt", "w") as f:
    for permutation in permutations(charset, 6):
        f.write(''.join(permutation) + '\n')
FYI, it would create a 7.8 GigaByte file
For combinations, this would do the trick:
from itertools import combinations
charset = "0123456789abcdefghijklmnopqrstuvwxyz"
links = []
with open("codes.txt", "w") as f:
    for comb in combinations(charset, 6):
        f.write(''.join(comb)+ '\n')
FYI, it would create a 10.8 megabyte file
First thing: there are better ways to do this, but I want to write something clear and understandable.
Pseudo Code:
base = "";
for(x1=0; x1<charset.length(); x1++)
for(x2=0; x2<charset.length(); x2++)
for(x3=0; x3<charset.length(); x3++)
.
.
.
{ base = charset[x1]+charset[x2]+charset[x3]+.....+charset[x6];
file.write(base + "\n")
}
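The nested loops in the pseudocode above can be expressed in Python with itertools.product, which walks the same index combinations lazily. A minimal sketch, assuming the charset is already in the desired sort order; the names iter_codes and write_codes are hypothetical.

```python
from itertools import product

def iter_codes(charset, length):
    """Yield every string of `length` characters from `charset`, rightmost position varying fastest."""
    for combo in product(charset, repeat=length):
        yield "".join(combo)

def write_codes(path, charset, length):
    """Stream the codes to a text file, one per line, without building a list first."""
    with open(path, "w") as f:
        f.writelines(code + "\n" for code in iter_codes(charset, length))

demo = list(iter_codes("0a", 2))
print(demo)
```

For the real task the call would be something like write_codes("codes.txt", "0123456789abcdefghijklmnopqrstuvwxyz", 6), which produces a very large file, so it is worth trying a shorter length first.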
This is a combination problem where you are trying to get combinations of length 6 from a character set of length 36. This will produce an output of size 36!/(30!*6!). You can refer to the combinations function in the itertools documentation for solving a problem like yours. It is recommended not to perform such a performance-intensive computation in Python.
I'm running a very large for loop and I'll try to explain how it works. There are 4320 matrices (40x80 each) that have been taken from a MATLAB file.
This loop takes one matrix at a time: it assigns to each value the corresponding values of H and T. Once finished, it moves on to the next matrix, and so on.
The resulting dataframe is then written to a CSV file needed to create a database of wave energy converter productivity.
The problem is that this code has been running for 9 days and is only halfway through the computations. Is there any way to drastically reduce the computational time?
indice_4 = 0
configuration_id=-1
n_configurations=4320
for z in range(0,n_configurations,1): #iteration on all the configurations
    print(z)
    power_matrix=P_mat[z]
    energy_wave_period_converted = pd.DataFrame([],columns=['energy_wave_period'])
    H_start=0.25
    H_end=10
    H_step=0.25
    T_start=3
    T_end=17
    T_step=0.177
    y=T_start
    relative_direction = int(direc[z])
    if relative_direction==0:
        configuration_id = configuration_id + 1
    print(configuration_id)
    r=0 #r=row
    c=0 #c=column
    while y <= T_end:
        energy_wave_period= float('%.2f'%y)
        x=H_start #initialize on the right wave heights
        r=0
        while x <= H_end:
            significant_wave_height= float('%.2f'%x)
            average_power=float('%.2f'%power_matrix[r,c])
            new_line_4 = pd.Series([indice_4 , configuration_id, significant_wave_height , energy_wave_period ,relative_direction ,average_power] , index =['id','configuration_id','significant_wave_height','energy_wave_period','relative_direction','average_output_power'])
            seastate_productivity = seastate_productivity.append([new_line_4], ignore_index=True)
            indice_4= indice_4 + 1
            r=r+1
            x=x+H_step
        c=c+1
        y = y + T_step
seastate_productivity.to_csv('seastate_productivity.csv',index=False,sep=';')
One of the main things slowing your code down is that you do pandas operations inside the loop. Specifically, using pd.Series and pd.DataFrame.append in the loop (which runs over 12 million times) really slows you down. When using pandas you should aim to vectorize your operations (meaning performing operations in batch). When I tried your original code every iteration took about 4 seconds, and the time increased gradually. When removing the pd.append every iteration only took 0.5 seconds, and when removing the pd.Series it dropped even more.
I did some improvements by saving the data in lists and later to a dataframe in one go, which took about 2 minutes to run till completion on my laptop:
import time
import numpy as np
import pandas as pd
# Generate random data for testing
P_mat = np.random.rand(4320,40,80)
direc=np.random.rand(4320)
H_start=0.25
H_end=10
H_step=0.25
T_start=3
T_end=17
T_step=0.177
indice_4 = 0
configuration_id=-1
n_configurations=4320
data = []
# Time it
t0 = time.perf_counter()
for z in range(n_configurations):
    power_matrix=P_mat[z]
    print(z)
    y=T_start
    relative_direction = int(direc[z])
    if relative_direction==0:
        configuration_id = configuration_id + 1
    r=0 #r=row
    c=0 #c=column
    while y <= T_end:
        energy_wave_period= float('%.2f'%y)
        x=H_start #initialize on the right wave heights
        r=0
        while x <= H_end:
            significant_wave_height= float('%.2f'%x)
            average_power=float('%.2f'%power_matrix[r,c])
            # Save data to list
            new_line_4 = [indice_4 , configuration_id, significant_wave_height , energy_wave_period ,relative_direction ,average_power]
            data.append(new_line_4) # Append to create a list of lists
            indice_4= indice_4 + 1
            r=r+1
            x=x+H_step
        c=c+1
        y = y + T_step
# Make dataframe from list of lists
seastate_productivity = pd.DataFrame.from_records(data,columns =['id','configuration_id','significant_wave_height','energy_wave_period','relative_direction','average_output_power'])
# Save data
seastate_productivity.to_csv('seastate_productivity.csv',index=False,sep=';')
# Print time it took
print("Done in:",time.perf_counter()-t0)
You could probably still optimize this solution, by moving the rounding from the loop to outside, by rounding the pandas columns. Also, since you are only moving data around, there is probably also a completely vectorized solution (without a loop) but this is probably sufficient for you.
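For what it's worth, here is a rough sketch of what such a loop-free version might look like with numpy alone. It is not tested against the real data and makes several assumptions: the H and T grid sizes are taken from the matrix shape (40 and 80 in the question), the T grid is computed by multiplication instead of repeated addition, and np.round is used in place of '%.2f' formatting, which can differ in rare halfway cases. The function name build_columns is made up.

```python
import numpy as np

def build_columns(P_mat, direc, H_start=0.25, H_step=0.25, T_start=3.0, T_step=0.177):
    """Return the six output columns as flat arrays, in the same row order as the loop."""
    n, n_h, n_t = P_mat.shape                 # matrices, wave heights (rows), periods (columns)
    rows_per_matrix = n_h * n_t
    H = np.round(H_start + H_step * np.arange(n_h), 2)
    T = np.round(T_start + T_step * np.arange(n_t), 2)
    rel_dir = np.asarray(direc).astype(int)   # same truncation as int(direc[z])
    # The id increments each time a matrix has relative direction 0 (starting from -1)
    config_id = np.cumsum(rel_dir == 0) - 1
    return {
        "id": np.arange(n * rows_per_matrix),
        "configuration_id": np.repeat(config_id, rows_per_matrix),
        # Inner loop runs over H, outer loop over T, repeated for every matrix
        "significant_wave_height": np.tile(H, n_t * n),
        "energy_wave_period": np.tile(np.repeat(T, n_h), n),
        "relative_direction": np.repeat(rel_dir, rows_per_matrix),
        # Transpose so that the column index varies slowest, matching power_matrix[r, c]
        "average_output_power": np.round(P_mat.transpose(0, 2, 1).reshape(-1), 2),
    }

P_demo = np.arange(18, dtype=float).reshape(3, 2, 3)
cols = build_columns(P_demo, [0.2, 1.5, 0.7])
print(cols["average_output_power"][:6])
```

The returned dict can be passed straight to pd.DataFrame(...) and then to to_csv as before.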
A way to find out what the issue is with slow code is by timing portions of code. You can use the timeit module, or the time module like I used. You can then isolate lines of code, and run them and analyse the performance.
You should consider using numpy. Using numpy's matrix operations you should be able to reduce computation time.
I also suggest you look into concurrent.futures.
It lets you run tasks in parallel and so reduce run time.
You need to convert your code into a function and then hand it to the executor, which calls it on one element at a time.
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with threads, using ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.
https://docs.python.org/3/library/concurrent.futures.html
This is a textbook example:
import concurrent.futures
nums = range(10)
def f(x):
    return x * x
def main():
    print([val for val in map(f, nums)])
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print([val for val in executor.map(f, nums)])
if __name__ == '__main__':
    main()
I am using the ipyparallel module to speed up an all by all list comparison but I am having issues with huge memory consumption.
Here is a simplified version of the script that I am running:
From a SLURM script start the cluster and run the python script
ipcluster start -n 20 --cluster-id="cluster-id-dummy" &
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy"
In Python, make two lists of lists for the simplified example
import ipyparallel as ipp
from itertools import compress
list1 = [ [i, i, i] for i in range(4000000)]
list2 = [ [i, i, i] for i in range(2000000, 6000000)]
Then define my list comparison function:
def loop(item):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
Then connect to my ipython engines, push list2 to each of them and map my function:
rc = ipp.Client(profile='default', cluster_id = "cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True
mydict = dict(list2 = list2)
dview.push(mydict)
trueorfalse = list(lview.map(loop, list1))
As mentioned, I am running this on a cluster using SLURM and getting the memory usage from the sacct command. Here is the memory usage that I am getting for each of the steps:
Just creating the two lists: 1.4 Gb
Creating two lists and pushing them to 20 engines: 22.5 Gb
Everything: 62.5 Gb++ (this is where I get an OUT_OF_MEMORY failure)
From running htop on the node while running the job, it seems that the memory usage is going up slowly over time until it reaches the maximum memory and fails.
I combed through this previous thread and implemented a few of the suggested solutions without success
Memory leak in IPython.parallel module?
I tried clearing the view with each loop:
def loop(item):
    lview.results.clear()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
I tried purging the client with each loop:
def loop(item):
    rc.purge_everything()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
And I tried using the --nodb and --sqlitedb flags with ipcontroller and started my cluster like this:
ipcontroller --profile=pierrj --nodb --cluster-id='cluster-id-dummy' &
sleep 60
for (( i = 0 ; i < 20; i++)); do ipengine --profile=pierrj --cluster-id='cluster-id-dummy' & done
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy" --profile=pierrj
Unfortunately none of this has helped and has resulted in the exact same out of memory error.
Any advice or help would be greatly appreciated!
Looking around, there seem to be lots of people complaining about LoadBalancedViews being very memory inefficient, and I have not been able to find any useful suggestions on how to fix this, for example.
However, I suspect given your example that's not the place to start. I assume that your example is a reasonable approximation of your code. If your code is doing list comparisons with several million data points, I would advise you to use something like numpy to perform the calculations rather than iterating in python.
If you restructure your algorithm to use numpy vector operations it will be much, much faster than indexing into a list and performing the calculation in python. numpy is a C library and calculation done within the library will benefit from compile time optimisations. Furthermore, performing operations on arrays also benefits from processor predictive caching (your CPU expects you to use adjacent memory looking forward and preloads it; you potentially lose this benefit if you access the data piecemeal).
I have done a very quick hack of your example to demonstrate this. This example compares your loop calculation with a very naïve numpy implementation of the same question. The python loop method is competitive with small numbers of entries, but the numpy version quickly heads towards 100x faster at the number of entries you are dealing with. I suspect looking at the way you structure data will outweigh the performance gain you are getting through parallelisation.
Note that I have chosen a matching value in the middle of the distribution; performance differences will obviously depend on the distribution.
import numpy as np
import time
def loop(item, list2):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
def run_comparison(scale):
    list2 = [ [i, i, i] for i in range(4 * scale)]
    arr2 = np.array([i for i in range(4 * scale)])
    test_value = (2 * scale)
    np_start = time.perf_counter()
    res1 = test_value in arr2
    np_end = time.perf_counter()
    np_time = np_end - np_start
    loop_start = time.perf_counter()
    res2 = loop((test_value, 0, 0), list2)
    loop_end = time.perf_counter()
    loop_time = loop_end - loop_start
    assert res1 == res2
    return (scale, loop_time / np_time)
print([run_comparison(v) for v in [100, 1000, 10000, 100000, 1000000, 10000000]])
returns:
[
(100, 1.0315526939407524),
(1000, 19.066806587378263),
(10000, 91.16463510672537),
(100000, 83.63064249916434),
(1000000, 114.37531283123414),
(10000000, 121.09979997458508)
]
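A different angle from the numpy comparison above, using only built-ins: since loop only asks whether item[0] appears as the first element of some row in list2, those first elements can be collected into a set once, after which each lookup is a hash probe instead of a scan. This is a swapped-in technique, not what the question or the timing above uses, and it assumes the first elements are hashable. The function name first_elements_present is hypothetical.

```python
def first_elements_present(list1, list2):
    """For each row of list1, report whether its first element is the first element of any row in list2."""
    keys = {row[0] for row in list2}          # built once, O(len(list2))
    return [row[0] in keys for row in list1]  # each test is an average O(1) lookup

# Small stand-ins for the 4-million-row lists in the question
list1 = [[i, i, i] for i in range(6)]
list2 = [[i, i, i] for i in range(3, 9)]
trueorfalse = first_elements_present(list1, list2)
print(trueorfalse)
```

If the real comparison is more involved than an equality test on one field, this shortcut may not apply directly.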
Assuming that a single task on the two lists is being divided up between the worker threads, you will want to ensure that the individual workers are using the same copy of the lists. In most cases it looks like ipython parallel will pickle objects sent to workers (relevant doc). If you are able to use one of the types that are not copied (as stated in the doc)
buffers/memoryviews, bytes objects, and numpy arrays.
the memory issue might be resolved since a reference is distributed. This answer also assumes that the individual tasks do not need to operate on the lists while working (thread-safe).
TL;DR It looks like moving the objects passed to the parallel workers into a numpy array may resolve the explosion in memory.
I have a list of (possibly long) strings.
When I convert it to np.array I quickly run out of RAM, because it seems to take much more memory than a simple list. Why, and how do I deal with it? Or maybe I'm just doing something wrong?
The code:
import random
import string
import numpy as np
from sys import getsizeof
cnt = 100
sentences = []
for i in range(0, cnt):
    word_cnt = random.randrange(30, 100)
    words = []
    for j in range(0, word_cnt):
        word_length = random.randrange(20)
        letters = [random.choice(string.ascii_letters) for x in range(0, word_length)]
        words.append(''.join(letters))
    sentences.append(' '.join(words))
list_size = sum([getsizeof(x) for x in sentences]) + getsizeof(sentences)
print(list_size)
arr = np.array(sentences)
print(getsizeof(arr))
print(arr.nbytes)
The output:
76345
454496
454400
I'm not sure if I use getsizeof() correctly, but I started to investigate it when I noticed memory problems, so I'm pretty sure there's something going on : )
(Bonus question)
I'm trying to run something similar to https://autokeras.com/examples/imdb/. The original example requires about 3GB of memory, and I wanted to use a bigger dataset. Maybe there's some better way?
I'm using python3.6.9 with numpy==1.17.0 on Ubuntu 18.04.
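A likely explanation, offered as a sketch and not a definitive diagnosis: when numpy builds an array from Python strings it picks a fixed-width unicode dtype, so every element gets a slot as wide as the longest string, at 4 bytes per character. The reported nbytes of 454400 for 100 sentences would fit that pattern if the longest sentence was 1136 characters (100 x 1136 x 4). The toy data below is made up; dtype=object is one alternative that stores only references to the existing Python strings.

```python
import numpy as np

sentences = ["a", "bb", "c" * 10]

# Default conversion: fixed-width unicode, sized for the longest string
fixed = np.array(sentences)
chars_per_slot = max(len(s) for s in sentences)
expected_nbytes = len(sentences) * chars_per_slot * 4   # 4 bytes per character

# Object array: one pointer per element, the strings themselves stay Python objects
as_objects = np.array(sentences, dtype=object)

print(fixed.dtype, fixed.nbytes, expected_nbytes)
print(as_objects.dtype, as_objects.nbytes)
```

Note that nbytes on the object array counts only the pointers, so the strings still need to be added on top, much like the getsizeof sum in the question.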
I've created a Python script that generates a list of words by permutation of characters. I'm using itertools.product to generate my permutations. My char list is composed of letters and numbers 01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ. Here is my code:
#!/usr/bin/python
import itertools, hashlib, math
class Words:
    chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
    def __init__(self, size):
        self.make(size)
    def getLenght(self, size):
        res = []
        for i in range(1, size+1):
            res.append(math.pow(len(self.chars), i))
        return sum(res)
    def getMD5(self, text):
        m = hashlib.md5()
        m.update(text.encode('utf-8'))
        return m.hexdigest()
    def make(self, size):
        file = open('res.txt', 'w+')
        res = []
        i = 1
        for i in range(1, size+1):
            prod = list(itertools.product(self.chars, repeat=i))
            res = res + prod
        j = 1
        for r in res:
            text = ''.join(r)
            md5 = self.getMD5(text)
            res = text+'\t'+md5
            print(res + ' %.3f%%' % (j/float(self.getLenght(size))*100))
            file.write(res+'\n')
            j = j + 1
        file.close()
Words(3)
This script works fine for list of words with max 4 characters. If I try 5 or 6 characters, my computer consumes 100% of CPU, 100% of RAM and freezes.
Is there a way to restrict the use of those resources or optimize this heavy processing?
Does this do what you want?
I've made all the changes in the make method:
def make(self, size):
    with open('res.txt', 'w+') as file_: # file is a builtin function in python 2
        # also, use with statements for files used on only a small block, it handles file closure even if an error is raised.
        for i in range(1, size+1):
            prod = itertools.product(self.chars, repeat=i)
            for j, r in enumerate(prod):
                text = ''.join(r)
                md5 = self.getMD5(text)
                res = text+'\t'+md5
                # getLenght is the method name as spelled in the original class
                print(res + ' %.3f%%' % ((j+1)/float(self.getLenght(size))*100))
                file_.write(res+'\n')
Be warned this will still chew up gigabytes of memory, but not virtual memory.
EDIT: As noted by Padraic, there is no file keyword in Python 3, and as it is a "bad builtin", it's not too worrying to override it. Still, I'll name it file_ here.
EDIT2:
To explain why this works so much faster and better than the previous, original version, you need to know how lazy evaluation works.
Say we have a simple expression as follows (for Python 3) (use xrange for Python 2):
a = [i for i in range(10**12)]
This immediately evaluates 1 trillion elements into memory, overflowing your memory.
So we can use a generator to solve this:
a = (i for i in range(10**12))
Here, none of the values have been evaluated, just given the interpreter instructions on how to evaluate it. We can then iterate through each item one by one and do work on each separately, so almost nothing is in memory at a given time (only 1 integer at a time). This makes the seemingly impossible task very manageable.
The same is true with itertools: it allows you to do memory-efficient, fast operations by using iterators rather than lists or arrays to do operations.
In your example, you have 62 characters and want to do the cartesian product with 5 repeats, or 62**5 (nearly a billion elements, or over 30 gigabytes of ram). This is prohibitively large.
In order to solve this, we can use iterators.
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
for i in itertools.product(chars, repeat=5):
    print(i)
Here, only a single item from the cartesian product is in memory at a given time, meaning it is very memory efficient.
However, if you evaluate the full iterator using list(), it then exhausts the iterator and adds it to a list, meaning the nearly one billion combinations are suddenly in memory again. We don't need all the elements in memory at once: just 1. Which is the power of iterators.
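To make the iterator point concrete, here is a minimal sketch of a generator that yields one (word, md5) pair at a time, combined with itertools.islice to look at just the first few items without ever building the full list. The function name iter_word_hashes and the two-letter alphabet are made up for illustration.

```python
import hashlib
from itertools import islice, product

def iter_word_hashes(chars, max_size):
    """Lazily yield (word, md5 hex digest) for every word of length 1..max_size."""
    for size in range(1, max_size + 1):
        for combo in product(chars, repeat=size):
            word = "".join(combo)
            yield word, hashlib.md5(word.encode("utf-8")).hexdigest()

# Only three items are ever computed here, however large the full product is
first_three = list(islice(iter_word_hashes("ab", 5), 3))
for word, digest in first_three:
    print(word, digest)
```

Writing to a file works the same way: loop over the generator and write each pair as it arrives, so memory use stays flat regardless of max_size.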
Here are links to the itertools module and another explanation on iterators in Python 2 (mostly true for 3).