np.array of texts memory size - python

I have a list of (possibly long) strings.
When I convert it to an np.array I quickly run out of RAM, because it seems to take much more memory than a plain list. Why, and how can I deal with it? Or maybe I'm just doing something wrong?
The code:
import random
import string
import numpy as np
from sys import getsizeof
cnt = 100
sentences = []
for i in range(0, cnt):
    word_cnt = random.randrange(30, 100)
    words = []
    for j in range(0, word_cnt):
        word_length = random.randrange(20)
        letters = [random.choice(string.ascii_letters) for x in range(0, word_length)]
        words.append(''.join(letters))
    sentences.append(' '.join(words))
list_size = sum([getsizeof(x) for x in sentences]) + getsizeof(sentences)
print(list_size)
arr = np.array(sentences)
print(getsizeof(arr))
print(arr.nbytes)
The output:
76345
454496
454400
I'm not sure if I'm using getsizeof() correctly, but I only started to investigate when I noticed memory problems, so I'm pretty sure something is going on. :)
(Bonus question)
I'm trying to run something similar to https://autokeras.com/examples/imdb/. The original example requires about 3GB of memory, and I wanted to use a bigger dataset. Maybe there's some better way?
I'm using python3.6.9 with numpy==1.17.0 on Ubuntu 18.04.
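A quick way to see where the extra memory goes: by default, np.array(sentences) builds a fixed-width unicode array, so every element is padded to the length of the longest sentence at 4 bytes per character, whereas dtype=object stores only references to the existing Python strings. A minimal sketch of the comparison:
import numpy as np

sentences = ['short', 'a considerably longer sentence that forces padding on the rest']

fixed = np.array(sentences)                # fixed-width unicode dtype sized to the longest string
objs = np.array(sentences, dtype=object)   # stores references to the original str objects

print(fixed.dtype, fixed.nbytes)           # itemsize = 4 bytes * length of the longest string
print(objs.dtype, objs.nbytes)             # 8 bytes per reference on a 64-bit build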

Related

Convert stream of hex values to 16-bit ints

I get packets of binary strings of size 61440 in hex values, something like:
b'004702AF42324fe380ac...'
I need to split those into groups of 4 hex digits and convert them to integers. 16-bit would be preferred, but casting later is not a problem. The way I did it looks like this, and it works:
out = [int(img[i][j:j+4],16) for j in range(0,len(img[i]), 4)]
The issue I'm having is performance. I get a minimum of 200 of those per second, possibly more, and without multithreading I can only process 100-150 per second.
Can I improve the speed of this in some way?
This is a rewrite of my earlier offering, showing how multithreading does, in fact, make a very significant difference (possibly depending on the system architecture).
The following code executes in ~0.05 s on my machine:
import random
from datetime import datetime
import concurrent.futures
N = 10
R = 61440
IMG = []
for _ in range(N):
    IMG.append(''.join(random.choice('0123456789abcdef')
                       for _ in range(R)))
"""
now IMG has N elements, each containing R pseudo-randomly generated hexadecimal digits
"""

def tfunc(img, k):
    return k, [int(img[j:j + 4], 16) for j in range(0, len(img), 4)]

R = [0] * N
start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    """
    note that we pass the relevant index to the worker function
    because we can't be sure of the order of completion
    """
    for i in range(N):
        futures.append(executor.submit(tfunc, IMG[i], i))
    for future in concurrent.futures.as_completed(futures):
        k, r = future.result()
        R[k] = r
"""
list R now contains the converted values from the same relative indexes in IMG
"""
print(f'Duration={datetime.now()-start}')
I don't think multithreading will help in this case, as it's purely CPU intensive; the overhead of breaking the work down over, say, 4 threads would outweigh any theoretical advantage. Your list comprehension appears to be as efficient as it can be, although I'm unclear as to why img seems to have multiple dimensions.
I've written the following simulation, and on my machine it consistently executes in ~0.8 seconds. I think the performance you'll get from your code is going to be highly dependent on your CPU's capabilities. Here's the code:
import random
from datetime import datetime
hv = '0123456789abcdef'
img = ''.join(random.choice(hv) for _ in range(61440))
start = datetime.now()
for _ in range(200):
    out = [int(img[j:j + 4], 16) for j in range(0, len(img), 4)]
print(f'Duration={datetime.now()-start}')
I did some more research and found that it's not multithreading but multiprocessing that I need. That gave me a speedup from 220 batches per second to ~370 batches per second. It probably bottlenecks somewhere else now, since I only get about 15% load on all cores, but it puts me comfortably above spec, and that's good enough.
import numpy as np
from multiprocessing import Pool

def combine(img):
    return np.array([int(img[j:j+4], 16) for j in range(0, len(img), 4)]).reshape((24, 640))

p = Pool(20)
img = p.map(combine, tmp)
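As a side note, the per-batch conversion itself can also be vectorized with numpy instead of a Python list comprehension, which may cut the work per batch before any multiprocessing is applied. A sketch, assuming each batch is an ASCII hex string whose length is a multiple of 4 (decode it first if it arrives as bytes):
import numpy as np

def combine_np(img_hex):
    # every 4 hex characters form one big-endian 16-bit unsigned integer
    raw = bytes.fromhex(img_hex)                  # 61440 hex chars -> 30720 bytes
    return np.frombuffer(raw, dtype='>u2').reshape((24, 640))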

Large memory consumption by iPython Parallel module

I am using the ipyparallel module to speed up an all by all list comparison but I am having issues with huge memory consumption.
Here is a simplified version of the script that I am running:
From a SLURM script start the cluster and run the python script
ipcluster start -n 20 --cluster-id="cluster-id-dummy" &
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy"
In Python, make two lists of lists for the simplified example:
import ipyparallel as ipp
from itertools import compress
list1 = [ [i, i, i] for i in range(4000000)]
list2 = [ [i, i, i] for i in range(2000000, 6000000)]
Then define my list comparison function:
def loop(item):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
Then connect to my ipython engines, push list2 to each of them and map my function:
rc = ipp.Client(profile='default', cluster_id = "cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True
mydict = dict(list2 = list2)
dview.push(mydict)
trueorfalse = list(lview.map(loop, list1))
As mentioned, I am running this on a cluster using SLURM and getting the memory usage from the sacct command. Here is the memory usage that I am getting for each of the steps:
Just creating the two lists: 1.4 GB
Creating the two lists and pushing them to 20 engines: 22.5 GB
Everything: 62.5 GB and climbing (this is where I get an OUT_OF_MEMORY failure)
From running htop on the node during the job, it seems that memory usage climbs slowly over time until it reaches the maximum and the job fails.
I combed through this previous thread and implemented a few of the suggested solutions, without success:
Memory leak in IPython.parallel module?
I tried clearing the view with each loop:
def loop(item):
    lview.results.clear()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
I tried purging the client with each loop:
def loop(item):
    rc.purge_everything()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
And I tried using the --nodb and --sqlitedb flags with ipcontroller and started my cluster like this:
ipcontroller --profile=pierrj --nodb --cluster-id='cluster-id-dummy' &
sleep 60
for (( i = 0 ; i < 20; i++)); do ipengine --profile=pierrj --cluster-id='cluster-id-dummy' & done
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy" --profile=pierrj
Unfortunately none of this has helped; I still get the exact same out-of-memory error.
Any advice or help would be greatly appreciated!
Looking around, there seem to be lots of people complaining about LoadBalancedViews being very memory inefficient, and I have not been able to find any useful suggestions on how to fix this.
However, I suspect given your example that's not the place to start. I assume that your example is a reasonable approximation of your code. If your code is doing list comparisons with several million data points, I would advise you to use something like numpy to perform the calculations rather than iterating in python.
If you restructure your algorithm to use numpy vector operations, it will be much, much faster than indexing into a list and performing the calculation in Python. numpy is a C library, and calculations done within the library benefit from compile-time optimisations. Furthermore, operations on arrays benefit from processor predictive caching (your CPU expects you to use adjacent memory and preloads it; you potentially lose this benefit if you access the data piecemeal).
I have done a very quick hack of your example to demonstrate this. It compares your loop calculation with a very naïve numpy implementation of the same question. The Python loop method is competitive for small numbers of entries, but the numpy version quickly becomes roughly 100x faster at the number of entries you are dealing with. I suspect that rethinking how you structure your data will outweigh the performance gain you are getting through parallelisation.
Note that I have chosen a matching value in the middle of the distribution; performance differences will obviously depend on the distribution.
import numpy as np
import time
def loop(item, list2):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False

def run_comparison(scale):
    list2 = [[i, i, i] for i in range(4 * scale)]
    arr2 = np.array([i for i in range(4 * scale)])
    test_value = (2 * scale)

    np_start = time.perf_counter()
    res1 = test_value in arr2
    np_end = time.perf_counter()
    np_time = np_end - np_start

    loop_start = time.perf_counter()
    res2 = loop((test_value, 0, 0), list2)
    loop_end = time.perf_counter()
    loop_time = loop_end - loop_start

    assert res1 == res2
    return (scale, loop_time / np_time)
print([run_comparison(v) for v in [100, 1000, 10000, 100000, 1000000, 10000000]])
returns:
[
(100, 1.0315526939407524),
(1000, 19.066806587378263),
(10000, 91.16463510672537),
(100000, 83.63064249916434),
(1000000, 114.37531283123414),
(10000000, 121.09979997458508)
]
Assuming that a single task on the two lists is being divided up between the workers, you will want to ensure that the individual workers are using the same copy of the lists. In most cases it looks like ipython parallel will pickle objects sent to workers (relevant doc). If you are able to use one of the types that are not copied (as stated in the doc):
buffers/memoryviews, bytes objects, and numpy arrays,
then the memory issue might be resolved, since a reference is distributed rather than a copy. This answer also assumes that the individual tasks do not need to modify the lists while working (i.e. access is effectively read-only).
TL;DR It looks like moving the objects passed to the parallel workers into a numpy array may resolve the explosion in memory.
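A minimal sketch of that idea, building on the setup from the question: push only the first column of list2 as a numpy array (the names arr2_keys and loop_np are illustrative), so the engines receive a buffer rather than a pickled list of lists, and the membership test becomes a vectorized comparison:
import numpy as np

arr2_keys = np.array([row[0] for row in list2])     # only the values actually compared
dview.push(dict(arr2_keys=arr2_keys))

def loop_np(item):
    # vectorized membership test against the array pushed to each engine
    return bool((arr2_keys == item[0]).any())

trueorfalse = list(lview.map(loop_np, list1))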

Python: itertools.product consuming too much resources

I've created a Python script that generates a list of words by permutation of characters. I'm using itertools.product to generate my permutations. My char list is composed of letters and numbers: 01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ. Here is my code:
#!/usr/bin/python
import itertools, hashlib, math
class Words:
    chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'

    def __init__(self, size):
        self.make(size)

    def getLenght(self, size):
        res = []
        for i in range(1, size+1):
            res.append(math.pow(len(self.chars), i))
        return sum(res)

    def getMD5(self, text):
        m = hashlib.md5()
        m.update(text.encode('utf-8'))
        return m.hexdigest()

    def make(self, size):
        file = open('res.txt', 'w+')
        res = []
        i = 1
        for i in range(1, size+1):
            prod = list(itertools.product(self.chars, repeat=i))
            res = res + prod
        j = 1
        for r in res:
            text = ''.join(r)
            md5 = self.getMD5(text)
            res = text+'\t'+md5
            print(res + ' %.3f%%' % (j/float(self.getLenght(size))*100))
            file.write(res+'\n')
            j = j + 1
        file.close()

Words(3)
This script works fine for lists of words with at most 4 characters. If I try 5 or 6 characters, my computer consumes 100% CPU, 100% RAM, and freezes.
Is there a way to restrict the use of those resources or optimize this heavy processing?
Does this do what you want?
I've made all the changes in the make method:
def make(self, size):
    # 'file' is a builtin name in Python 2, so use a different identifier;
    # the with statement also handles closing the file even if an error is raised
    with open('res.txt', 'w+') as file_:
        for i in range(1, size+1):
            prod = itertools.product(self.chars, repeat=i)
            for j, r in enumerate(prod):
                text = ''.join(r)
                md5 = self.getMD5(text)
                res = text+'\t'+md5
                print(res + ' %.3f%%' % ((j+1)/float(self.getLenght(size))*100))
                file_.write(res+'\n')
Be warned this will still chew up gigabytes of memory, but not virtual memory.
EDIT: As noted by Padraic, there is no file builtin in Python 3, and since it is a "bad builtin" anyway, it's not too worrying to shadow it. Still, I'll name it file_ here.
EDIT2:
To explain why this works so much faster and better than the previous, original version, you need to know how lazy evaluation works.
Say we have a simple expression as follows (Python 3; in Python 2, use xrange instead of range):
a = [i for i in range(10**12)]
This immediately evaluates a trillion elements into memory, overflowing your RAM.
So we can use a generator expression to solve this:
a = (i for i in range(10**12))
Here, none of the values have been evaluated yet; we have just given the interpreter instructions on how to produce them. We can then iterate through the items one by one and work on each separately, so almost nothing is in memory at a given time (only one integer at a time). This makes the seemingly impossible task very manageable.
The same is true of itertools: it allows you to do memory-efficient, fast operations by working with iterators rather than lists or arrays.
In your example, you have 62 characters and want the Cartesian product with 5 repeats, i.e. 62**5 (nearly a billion elements, or over 30 gigabytes of RAM). This is prohibitively large.
In order to solve this, we can use iterators.
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
for i in itertools.product(chars, repeat=5):
    print(i)
Here, only a single item from the cartesian product is in memory at a given time, meaning it is very memory efficient.
However, if you evaluate the full iterator using list(), it exhausts the iterator and stores everything in a list, meaning the nearly one billion combinations are suddenly in memory again. We don't need all the elements in memory at once, just one at a time; that is the power of iterators.
Here are links to the itertools module and another explanation on iterators in Python 2 (mostly true for 3).
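Putting the pieces together, a sketch of the whole task done lazily (the character set here is a stand-in for the one in the question, and the chained generator never holds more than one combination in memory):
import hashlib
import itertools
import string

chars = string.digits + string.ascii_letters

def combos(size):
    # lazily chain the products of length 1..size; nothing is materialized up front
    return itertools.chain.from_iterable(
        itertools.product(chars, repeat=i) for i in range(1, size + 1))

with open('res.txt', 'w') as fh:
    for combo in combos(3):
        text = ''.join(combo)
        fh.write(text + '\t' + hashlib.md5(text.encode('utf-8')).hexdigest() + '\n')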

Python list-like string representation of numpy array

Consider a few rather long numpy arrays:
import random
import numpy as np

long_array1 = np.array([random.random() for i in range(10000)])
long_array2 = np.array([random.random() for i in range(10000)])
long_array3 = np.array([random.random() for i in range(10000)])
I would like to save the arrays into the file file.dat, one row per numpy array.
The text representation of an array should be in a Python-list-like format, i.e. in the case of the following numpy array:
a = np.array([0.3213,0.145323,0.852,0.723,0.421452])
I want to save the following line in the file:
[0.3213,0.145323,0.852,0.723,0.421452]
This is what I do:
array1_str = ",".join([str(item) for item in long_array1])
array2_str = ",".join([str(item) for item in long_array2])
array3_str = ",".join([str(item) for item in long_array3])

with open("file.dat", "w") as file_arrays:
    file_arrays.write("[" + array1_str + "]\n")
    file_arrays.write("[" + array2_str + "]\n")
    file_arrays.write("[" + array3_str + "]\n")
Everything actually works fine; I am just doubtful about the efficiency of my code. I am almost sure there has to be a better and more efficient way to do this.
I welcome comments on the random list generation as well.
This is the fastest way:
','.join(map(str, long_array1.tolist()))
If you want to keep the text more compact, this is fast too:
','.join(map(lambda x: '%.7g' % x, long_array1.tolist()))
Source: I benchmarked every possible method for this as the maintainer of the pycollada library.
Since you want a Python-list-like format, how about actually using the Python list format?
array1_str = repr(list(long_array1))
That's going to stay mostly in C-land and performance should be much better.
If you don't want the spaces, take them out afterwards (this uses the Python 2 str.translate signature; in Python 3, use .replace(" ", "") instead):
array1_str = repr(list(long_array1)).translate(None, " ")
Memory usage may be an issue, however.
Sounds like you might be able to use numpy.savetxt() for this;
something like:
def dump_array(outfile, arraylike):
    outfile.write('[')
    numpy.savetxt(outfile, arraylike, newline=',', fmt="%s")
    outfile.write(']\n')
although I don't think the corresponding numpy.loadtxt() will be able to read this format back in.
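If the rows do need to be read back later, one option is to rely on the fact that a bracketed, comma-separated line like the one in the question is also valid JSON. A sketch, assuming file.dat holds lines of the form [0.3213,0.145323,...]:
import json
import numpy as np

# each line looks like [0.3213,0.145323,...], which json.loads can parse directly
with open('file.dat') as fh:
    arrays = [np.array(json.loads(line)) for line in fh]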

constructing a wav file and writing it to disk using scipy

I wish to deconstruct a wave file into small chunks, reassemble it in a different order and then write it to disk.
I seem to have problems with writing it after reassembling the pieces, so for now I'm just trying to debug this section and will worry about the rest later.
Basically, I read the original wav into a 2D numpy array, break it into 100 pieces stored in a list of smaller 2D numpy arrays, and then stack these arrays vertically using vstack:
import scipy.io.wavfile as sciwav
import numpy
[sr,stereo_data] = sciwav.read('filename')
nparts = 100
stereo_parts = list()
part_length = len(stereo_data) / nparts
for i in range(nparts):
    start = i*part_length
    end = (i+1)*part_length
    stereo_parts.append(stereo_data[start:end])

new_data = numpy.array([0,0])
for i in range(nparts):
    new_data = numpy.vstack([new_data, stereo_parts[i]])
sciwav.write('new_filename', sr, new_data)
So far I verified that new_data looks similar to stereo_data with two exceptions:
1. it has [0,0] padded at the beginning.
2. It is 88 samples shorter because len(stereo_data)/nparts does not divide without remainder.
When I try to listen to the resulting new_data wave file, all I hear is silence, which I think does not make much sense.
Thanks for the help!
omer
It is very likely the dtype that is different. When you generate the zeros to pad at the beginning, you are not specifying a dtype, so they are probably np.int32. Your original data is probably np.uint8 or np.uint16, so the whole array gets promoted to np.int32, which is not the right bit depth for your data. Simply do:
new_data = numpy.array([0, 0], dtype=stereo_data.dtype)
I would actually rather do:
new_data = numpy.zeros((1, 2), dtype=stereo_data.dtype)
You could, by the way, streamline your code quite a bit, and get rid of a lot of for loops:
sr, stereo_data = sciwav.read('filename')
nparts = 100
part_length = len(stereo_data) // nparts
stereo_parts = numpy.split(stereo_data[:part_length*nparts], nparts)
new_data = numpy.vstack([numpy.zeros((1, 2), dtype=stereo_data.dtype)] +
                        stereo_parts)
sciwav.write('new_filename', sr, new_data)
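For the original goal of reassembling the pieces in a different order, a sketch building on the streamlined version above (the file names are placeholders from the question, and random.shuffle is just one possible reordering):
import random
import numpy
import scipy.io.wavfile as sciwav

sr, stereo_data = sciwav.read('filename')

nparts = 100
part_length = len(stereo_data) // nparts
parts = numpy.split(stereo_data[:part_length * nparts], nparts)

random.shuffle(parts)                # reorder the chunks in place
new_data = numpy.concatenate(parts)  # dtype is preserved, so no padding row is needed
sciwav.write('new_filename', sr, new_data)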
