Large Array of binary data - python

I'm working with a large 3-dimensional array of binary data; each value is one of two possible values. I currently have this data stored in a numpy array as int32 values that are either 1 or 0.
It works fine for small arrays, but eventually I will need to make the array 5000x5000x20, which I can't even get close to without getting a MemoryError.
Does anyone have any suggestions for a better way to do this? I am really hoping that I can keep it all together in one data structure because I will need to access slices of it along all three axes.

Another possibility is to pack the last axis of 20 bits into a single 32-bit integer. That way a 5000x5000 array would suffice.
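For illustration, a rough sketch of that packing idea using numpy broadcasting (the shape here is a small stand-in for 5000x5000x20, and the names are made up):
import numpy as np

# Small demo volume of 0/1 values; the real one would be 5000x5000x20.
bits = np.random.randint(0, 2, size=(4, 4, 20), dtype=np.uint8)

# Pack the 20 bits of the last axis into one uint32 per (row, col) cell.
weights = np.uint32(1) << np.arange(20, dtype=np.uint32)   # 2**0 .. 2**19
packed = (bits.astype(np.uint32) * weights).sum(axis=-1, dtype=np.uint32)  # shape (4, 4)

# Recover bit k for every cell as an array of 0s and 1s.
k = 7
bit_k = (packed >> np.uint32(k)) & np.uint32(1)
assert np.array_equal(bit_k, bits[:, :, k])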

You'll get better performance if you change the datatype of your numpy array to something smaller.
For data which can take one of two values, you could use uint8, which will always be a single byte:
arr = np.array(your_data, dtype=np.uint8)
Alternatively, you could use np.bool, though I'm not sure offhand whether that is in fact an 8-bit value or whether it uses the native word size. (I tend to explicitly use the 8-bit value for clarity, though that's more a personal choice.)
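For a sense of scale, here is a quick footprint estimate for the target shape; nothing below allocates the full array:
import numpy as np

n = 5000 * 5000 * 20                                  # 500,000,000 elements
print(n * np.dtype(np.int32).itemsize / 1e9)          # ~2.0 GB as int32
print(n * np.dtype(np.uint8).itemsize / 1e9)          # ~0.5 GB as uint8
print(n * np.dtype(np.bool_).itemsize / 1e9)          # ~0.5 GB as bool_ (1 byte per element)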
At the end of the day, though, you're talking about a lot of data, and it's quite possible that even with a smaller set of values, you won't be able to load it all into Python at once.
In that case, it might be worth investigating whether you can break up your problem into smaller parts.
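If even the smaller dtype won't fit, one option in that direction is a disk-backed array, which still lets you slice along all three axes while only the touched parts are pulled into memory. A minimal sketch with np.memmap (the file name is illustrative):
import numpy as np

shape = (5000, 5000, 20)
flags = np.memmap("flags.dat", dtype=np.uint8, mode="w+", shape=shape)  # ~0.5 GB file on disk

flags[0, :, :] = 1        # write a slice
layer = flags[:, :, 3]    # read a slice along the last axis
flags.flush()             # make sure changes reach the disk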

Related

Resizing list from a 2D array

I came across a problem when trying to do the resize as follows:
I have a 2D array after processing data, and now I want to resize the data so that each row ignores its first 5 items.
What I am doing right now is to:
Edit: this approach works fine as long as you make sure you are working with a list, not a string. It failed on my side because I hadn't done the conversion from string to list properly,
so it ended up eliminating the first five characters of the entire string.
2dArray=[[array1],[array2]]
new_Resize_2dArray= [array[5:] for array in 2dArray]
However, it does not seem to work, as it just copies all the elements over to new_Resize_2dArray.
I would like to ask for help to see what I did wrong, or whether there is any scientific calculation library I could use to achieve this.
First, because Python list indexing is zero-based, your code should read new_Resize_2dArray = [array[5:] for array in 2dArray] if you want to exclude the first 5 columns; beyond that, I see no issue with your single line of code.
As for scientific computing libraries, numpy is a highly prevalent 3rd party package with a high-performance multidimensional array type ndarray. Using ndarrays, your code could be shortened to new_Resize_2dArray = 2dArray[:,5:]
Aside: It would help to include a bit more of your code or a minimum example where you are getting the unexpected result (e.g., use a fake/stand-in 2d array to see if it works as expected or still fails).
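For example, with a small stand-in array (names are illustrative):
import numpy as np

data = np.arange(20).reshape(2, 10)   # two rows, ten columns of fake data
trimmed = data[:, 5:]                 # drop the first 5 columns of every row
print(trimmed.shape)                  # (2, 5)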

How to apply operations with conditionals, like if, to a large numpy array efficiently in python?

Good afternoon everybody. I was putting raw data into numpy arrays, and then I wanted to perform operations on those arrays, such as base-10 logarithms combined with "if" conditions. However, the numpy arrays are too big, and consequently the operations take a lot of time to complete.
[picture of the raw data]
x = [ 20*math.log10(i) if i>0 and 20*math.log10(i)>=-60 else (-(120+20*math.log10(abs(i))) if i<0 and 20*math.log10(abs(i))>=-60 else -60) for i in a3 ]
In the code above, I use one of the channel arrays extracted from the raw audio data, "a3", and I build another array, "x", containing the values to plot, from -120 to 0 on the y axis. Furthermore, as you can see, I needed to treat the positive elements, the negative elements, and the zeros of the original numpy array separately, with -60 being the value that 0 maps to after the operations; that produces the final plot.
The problem with this code is that, as I said before, it takes approximately 10 seconds to finish computing, and that is for only 1 channel; I need to compute 8 channels, so I have to wait approximately 80 seconds.
I wanted to know if there is a faster way to perform this. I did find a way to apply numpy.log10 to the whole numpy array, and it computes in less than two seconds:
x = 20*numpy.log10(abs(a3))
But I did not find anything about combining that operation, numpy.log10, with ifs, conditionals, or something like that. I really need to distinguish the negative and positive original values, and also the 0s, and transform the 0s to -60, making -60 the minimum limit and the reference point, as in the code I showed before.
Note: I already tried to do it with loops, like "for" and "while", but they take much more time than the current method, about 14 seconds each.
Thank you for your responses!!
In general, when posting questions, it's best practice to include a small working example. I know you included a picture of your data, but that is hard for others to use, so it would have been better to just give us a small array of data. This is important because the solution often depends on the data. For example, all your data is (I think) between -1 and 1, so the log is always negative. If this isn't the case, then your solution might not work.
There is no need to check if i>0 and then apply abs if i is negative. This is exactly what applying abs does in the first place.
As you noticed, we can also use numpy vectorization to avoid the list comprehension. It is usually faster to do something like np.sin(X) than [ np.sin(x) for x in X].
Finally, if you do something like X>0 in numpy, it returns a boolean array saying if each element is >0.
Note that another way to write your list comprehension would be to first take 20*math.log10(abs(i)), replace all values < -60 with -60, and then, anywhere i<0, flip the data about -60. We can do this as a vectorized operation.
-120*(a3<0)+np.sign(a3)*np.maximum(20*np.log10(np.abs(a3)),-60)
This can probably be optimized a bit since a3<0 and np.sign(a3) are doing similar things. That said, I'm pretty sure this is faster than list comprehensions.
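As a variation, here is a sketch of the same idea written with np.where, which also maps the i == 0 case explicitly to -60 (the name a3 follows the question; the sample values are made up):
import numpy as np

a3 = np.array([0.5, -0.5, 0.0, 1e-5, -1e-5])        # stand-in channel data

with np.errstate(divide="ignore"):                   # silence the log10(0) warning
    mag = np.maximum(20 * np.log10(np.abs(a3)), -60) # clipped dB magnitude

x = np.where(a3 > 0, mag,                            # positive samples: clipped dB value
    np.where(a3 < 0, -(120 + mag),                   # negative samples: flipped about -60
             -60.0))                                  # zeros: the reference point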

Memory-efficient 2d growable array in python?

I'm working on an app that processes a lot of data.
.... and keeps running my computer out of memory. :(
Python has a huge amount of memory overhead on variables (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
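You can check those overheads directly (the exact numbers vary a little by Python version and build):
import sys

print(sys.getsizeof((1,)))   # one-element tuple: tens of bytes
print(sys.getsizeof([]))     # empty list: tens of bytes
print(sys.getsizeof(1.0))    # a lone float object: roughly 24 bytes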
Numpy arrays are great for reducing overhead. But they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). Array (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of two ints (ideally uint32) for the other. Obviously, using ~80 bytes of python structure to store 12 or 8 bytes of data per row is going to total my memory consumption.
Is the only realistic way to keep memory usage down in Python to "fake" 2d, i.e., by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)/WIDTH?
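For reference, the "fake 2d" addressing described above could look like this with the stdlib array module (the typecode, names, and WIDTH are illustrative; 'f' stores 4-byte C floats):
from array import array

WIDTH = 3
data = array('f')                       # flat storage, 4 bytes per element

def append_row(row):
    data.extend(row)                    # row is a sequence of WIDTH floats

def get(row, col):
    return data[row * WIDTH + col]

append_row((1.0, 2.0, 3.0))
append_row((4.0, 5.0, 6.0))
print(get(1, 2))                        # 6.0
print(len(data) // WIDTH)               # number of rows: 2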
Based on your comments, I'd suggest that you split your task into two parts:
1) In part 1, parse the JSON files using regexes and generate two CSV files in simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to.
2) In part 2, use pandas read_csv() function to slurp in the CSV files directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
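A rough sketch of part 2, assuming pandas is installed and the two CSV files from part 1 are called "triples.csv" (3 float columns) and "pairs.csv" (2 integer columns); the file names and dtypes are illustrative:
import numpy as np
import pandas as pd

triples = pd.read_csv("triples.csv", header=None, dtype=np.float32)   # n rows x 3 floats
pairs = pd.read_csv("pairs.csv", header=None, dtype=np.uint32)        # n rows x 2 ints

# The backing numpy arrays are compact: 12 and 8 bytes per row respectively.
print(triples.to_numpy().nbytes, pairs.to_numpy().nbytes)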

Fast access and update integer matrix or array in Python

I need to create an array of integer arrays, like [[0,1,2],[4,4,5,7],...,[4,5]]. The sizes of the internal arrays are changeable. The maximum number of internal arrays is 2^26. So what do you recommend as the fastest way to update this array?
When I use list = [[]] * 2**26, initialization is very fast but updating is very slow. Instead I use:
list = []
for i in range(2**26):
    list.append([])
Now initialization is slow but updating is fast. For example, with 16777216 internal arrays and an average of 0.213827311993 elements per internal array, it takes 1.67728900909 sec for a 2^26-element array. That is good, but I will be working with much bigger data, hence I need the best way. Initialization time is not important.
Thank you.
What you are asking is quite a problem. Different data structures have different properties. In general, if you need quick lookups, do not use lists: searching a list takes linear time, which means the more you put in it, the longer it will take on average to find an element.
You could perhaps use numpy? That library has matrices that can be accessed quite fast and can be reshaped on the fly. However, if you want to add or delete rows, it might be a bit slow because it generally reallocates (and thus copies) the entire data. So it is a trade-off.
If you are going to have so many internal arrays of different sizes, perhaps you could have a dictionary that contains the internal arrays. I think that if it is indexed by integers it will be much faster than a list. Then the internal arrays themselves could be created with numpy.
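One way to set that up is sketched below, using collections.defaultdict so the internal lists are created lazily (no up-front initialization cost), then converting each one to a numpy array once it stops changing; the keys and values are illustrative:
import numpy as np
from collections import defaultdict

buckets = defaultdict(list)   # integer key -> growable internal array

buckets[0].append(4)
buckets[0].append(5)
buckets[3].append(7)

# Convert the internal arrays to numpy once updates are done.
arrays = {k: np.asarray(v, dtype=np.int64) for k, v in buckets.items()}
print(arrays[0])   # [4 5]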

Efficient Datatype Python (list or numpy array?)

I'm still confused whether to use a list or a numpy array. I started with the latter, but since I have to do a lot of appends, I ended up with many vstacks slowing my code down. Using a list would solve this problem, but I also need to delete elements, which in turn works well with delete on a numpy array. As it looks now, I'll have to write my own data type (in a compiled language, and wrap it). I'm just curious whether there isn't a way to get the job done using a Python type.
To summarize, these are the criteria my data type would have to fulfill:
2d: n (variable) rows, each row with k (fixed) elements
in memory in one piece (would be nice for efficient operations)
append row (with amortized constant time, like a C++ vector; rows always have k elements)
delete a set of elements (best: in place, keeping free space at the end for later appends)
access an element given the row and column index (O(1), like data[row*k + column])
A data type like this seems generally useful to me, and not impossible to implement in C/Fortran.
What would be the closest I could get with Python?
(Or maybe: do you think it would work to write a Python class for the data type? What performance should I expect in that case?)
As I see it, if you were doing this in C or Fortran, you'd have to have an idea of the size of the array so that you can allocate the correct amount of memory (ignoring realloc!). So assuming you do know this, why do you need to append to the array?
In any case, numpy arrays have the resize method, which you can use to extend the size of the array.
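If you do end up appending, here is a rough sketch of using the resize method to grow an (n, k) array in place (k fixed at 3 purely for illustration; growing one row at a time like this reallocates often, so growing in larger steps would be better in practice):
import numpy as np

k = 3
data = np.empty((0, k), dtype=np.float32)

def append_row(arr, row):
    # Grow arr by one row in place and write the new values into it.
    n = arr.shape[0]
    arr.resize((n + 1, k), refcheck=False)
    arr[n] = row
    return arr

data = append_row(data, [1.0, 2.0, 3.0])
data = append_row(data, [4.0, 5.0, 6.0])
print(data)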
