How to deal with lots of data? - python

I need to handle a lot of data (mostly floats) in my program, which uses a great deal of memory. I also create some data structures to organize the data, and those cost memory too.
Here is the example:
Heap at the end of the function:
Partition of a set of 6954910 objects. Total size = 534417168 bytes.
Index    Count   %       Size   %  Cumulative   %  Kind (class / dict of class)
    0  3446006  50  248112432  46   248112432  46  array.array
    1  1722999  25  124055928  23   372168360  70  vertex.Vertex
    2   574705   8   82894088  16   455062448  85  list
  ...
Any solution?

Python supports array objects that are internally maintained in packed binary arrays of simple data.
For example
import array
a = array.array('f', (0.0 for x in range(100000)))
will create an array object containing 100,000 floats whose size is roughly 400 KB (4 bytes per element).
Of course, you can store only values of one specific type in an array object, not arbitrary Python values as you would with a regular list.
The numpy module extends this concept and gives you many ways to quickly manipulate multidimensional data structures of this kind (including viewing parts of arrays as arrays that share the same memory, reshaping arrays, performing math and search operations, and much more).
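As a small illustration of that point (my own sketch, not part of the original answer), a float32 numpy array of the same size as the array example above, with a reshaped view that shares its memory:
import numpy as np

a = np.zeros(100000, dtype=np.float32)   # same ~400 KB as the array.array example
print(a.nbytes)                          # 400000 bytes

v = a.reshape(1000, 100)                 # a view: no copy, same underlying memory
print(np.shares_memory(a, v))            # True
v[0, 0] = 1.0
print(a[0])                              # 1.0 - writes through the view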

If you need to deal with billions of rows of data per day, by far the simplest way to do that is to create a simple indexer script that splits the billions of rows into small files based on some key (e.g. the first two digits of the IP address in a log file row).
If you need to deal with things like number theory, or log files, or something else where you have a lot of ints or floats:
1) Learn to use Numpy arrays well
2) Start using Numba's just-in-time compilation (see the sketch after this list)
3) Learn Cython (you can do much more than with Numba)
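As a rough illustration of point 2 (my own sketch, not code from this answer), a plain Python loop compiled with Numba, assuming the numba package is installed:
import numpy as np
from numba import njit

@njit
def pairwise_max_diff(values):
    # Ordinary nested loops, compiled to machine code on the first call.
    best = 0.0
    for i in range(values.size):
        for j in range(i + 1, values.size):
            d = abs(values[i] - values[j])
            if d > best:
                best = d
    return best

print(pairwise_max_diff(np.random.rand(2000)))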
At least moderate Linux skills are a huge plus when dealing with large data sets. Some things take seconds to do directly from the command line, while it might not be at all obvious how to do the same thing in Python.
At the very least, use %timeit to test a range of scales leading up to your target scale (e.g. 2.5 billion rows per day). This is an easy way to identify possible performance drops and to reduce the size of arrays or other factors accordingly.
Learn more about profiling / performance hacking as soon as you're doing something with data.
To make the point about the 'indexer' clear, here is a very simple example indexer I've created and used for doing a lot of computation on files with billions of rows of data on a $60 per month server.
https://github.com/mikkokotila/indexer
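The linked repository has the real thing; as a rough illustration of the idea only (not the repository's actual code), a key-based splitter can be as small as this, here bucketing log rows by the first two octets of a leading IP address:
import os

def split_by_key(path, out_dir="buckets"):
    # One output file per key; rows with the same key end up together,
    # so later passes can process one small file at a time.
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            ip = line.split()[0]
            key = ".".join(ip.split(".")[:2])   # e.g. "192.168"
            if key not in handles:
                handles[key] = open(os.path.join(out_dir, key + ".txt"), "a")
            handles[key].write(line)
    for h in handles.values():
        h.close()

# split_by_key("access.log")   # hypothetical log file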

Related

Memory-efficient 2d growable array in python?

I'm working on an app that processes a lot of data.
.... and keeps running my computer out of memory. :(
Python has a huge amount of memory overhead on variables (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
Numpy arrays are great for reducing overhead. But they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). Array (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of two ints (ideally uint32) for the other. Obviously, using ~80 bytes of python structure to store 12 or 8 bytes of data per row is going to total my memory consumption.
Is the only realistic way to keep memory usage down in Python to "fake" 2d, aka by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)/WIDTH?
Based on your comments, I'd suggest that you split your task into two parts:
1) In part 1, parse the JSON files using regexes and generate two CSV files in simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to.
2) In part 2, use pandas read_csv() function to slurp in the CSV files directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
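A minimal sketch of part 2, with hypothetical file and column names (not taken from the question), assuming headerless CSVs with three float columns and two unsigned-int columns respectively:
import numpy as np
import pandas as pd

# Hypothetical file/column names; dtype keeps the frames compact.
points = pd.read_csv("points.csv", header=None,
                     names=["x", "y", "z"], dtype=np.float32)
edges = pd.read_csv("edges.csv", header=None,
                    names=["a", "b"], dtype=np.uint32)

points_arr = points.to_numpy()   # (n_rows, 3) float32 array
edges_arr = edges.to_numpy()     # (n_rows, 2) uint32 array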

Sensible storage of 1 billion+ values in a python list type structure

I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give two lists of 207,360,000 entries each, I am able to slowly get the program running. Beyond this I run into "MemoryError" reports - e.g. for a 240x240 px region I would have 3,317,760,000 entries.
At the beginning of the program, I create two empty lists:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the Stack Overflow pages and have seen a similar issue discussed here. That solution would potentially improve the program's run time; however, the solution offered only goes up to 100 million entries and therefore doesn't help me with the larger regions (such as the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to use some kind of list of the proportions I have defined for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store arbitrary objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of reduced precision). An array of 3 billion such single-precision floats would fit into 12 GB of memory.
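For instance, a minimal sketch (not the asker's actual loop) of accumulating the two value streams into single-precision arrays:
from array import array

variance = array('f')     # 4 bytes per entry; use 'd' for full doubles
lag = array('f')

# Stand-in pairs; append() works exactly like list.append() in the real loop.
for var_val, lag_val in [(1.5, 0.3), (2.0, 0.7)]:
    variance.append(var_val)
    lag.append(lag_val)

print(len(variance), variance.itemsize)   # 2 4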
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
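A minimal sqlite3 sketch of that idea, with a hypothetical file name and table layout:
import sqlite3

conn = sqlite3.connect("pairs.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS pairs (lag REAL, variance REAL)")

rows = [(0.5, 1.2), (0.8, 2.3)]      # stand-in for the real pair generator
conn.executemany("INSERT INTO pairs VALUES (?, ?)", rows)
conn.commit()

# Binning and averaging can then be pushed into SQLite instead of Python lists.
for lag_bin, mean_var in conn.execute(
        "SELECT ROUND(lag, 1), AVG(variance) FROM pairs GROUP BY ROUND(lag, 1)"):
    print(lag_bin, mean_var)
conn.close()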
Try using heapq (import heapq). It uses the heap for storage rather than the stack, allowing you to access the computer's full memory.

Best data structures for 3D array in Python

I am developing a moving average filter for position "tracks" in Touch Designer, which embeds a Python runtime. I'm new to Python and I'm unclear on the best data structures to use. The pseudocode is roughly:
Receive a new list of tracks formatted as:
id, posX, posY
2001, 0.54, 0.21
2002, 0.43, 0.23
...
Add incoming X and Y values to an existing data structure keyed on "id," if that id is already present
Create new entries for new ids
Remove any id entries that are not present in the incoming data
Return a moving average of X and Y values per id
Question: Would it be a good idea to do this as a hashtable where the key is id, and the values are a list of lists? eg:
ids = {2001: posVals, 2002: posVals2}
where posVals is a list of [x,y] pairs?
I think this is like a 3D array, but where I want to use id as a key for lots of operations.
Thanks
First, if the IDs are all relatively small positive integers, and the array is almost completely dense (that is, almost all IDs will exist), a dict is just adding extra overhead. On the other hand, if there are large numbers, or large gaps, or keys of different types, using a dict as a "sparse array" makes perfect sense.
Meanwhile, for the other two dimensions… how many IDs are you going to have, and how many pairs for each? If we're talking a handful of pairs per ID and a few thousand total pairs across all IDs, a list of pairs per ID is perfectly fine (although I'd probably represent each pair as a tuple rather than a list), and anything else would be adding unnecessary complexity. But if there are going to be a lot of pairs per ID, or a lot in total, you may run into problems with storage or performance.
If you can use the third-party library numpy, it can store a 2D array of numbers in much less memory than a list of pairs of numbers, and it can perform calculations like moving averages with much briefer/more readable code, and with much less CPU time. In fact, it can even store a sparse 3D array for you.
If you can only use the standard library, the array module can get you most of the same memory benefits, but without the simplicity benefits (in fact, your code becomes slightly more complex, as you have to represent the 2D array as a 1D array, old-school-C-style—although you can wrap that up pretty easily), or the time performance benefits, and it can't help with the sparseness.
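A minimal standard-library sketch of the dict-keyed approach (my own illustration; the window length is made up):
from collections import deque

WINDOW = 10      # hypothetical moving-average window length
tracks = {}      # id -> deque of (x, y) tuples

def update(incoming):
    # incoming: iterable of (id, x, y) tuples for the current frame
    seen = set()
    for tid, x, y in incoming:
        seen.add(tid)
        tracks.setdefault(tid, deque(maxlen=WINDOW)).append((x, y))
    for tid in list(tracks):          # drop ids missing from this frame
        if tid not in seen:
            del tracks[tid]
    return {tid: (sum(p[0] for p in buf) / len(buf),
                  sum(p[1] for p in buf) / len(buf))
            for tid, buf in tracks.items()}

print(update([(2001, 0.54, 0.21), (2002, 0.43, 0.23)]))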
Yes, this is how I would do it. It's very intuitive this way, assuming you're always looking things up by their id and don't need to sort in some other way.
Also, the terminology in Python is dict (as in dictionary), rather than hashtable.

Techniques for working with large Numpy arrays? [duplicate]

This question already has answers here: Very large matrices using Python and NumPy (11 answers).
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that Pickling (Pickle, CPickle, Pytables etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image that is equivalent to needing six times the storage of the image in order to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop iterating over 10 arrays of 100x1,000, it is still going to beat a Python iterator over 1,000,000 items by a very wide margin! It's going to be slower, yes, but not by as much.
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them in 64KB of memory, then look the values up one by one for the whole image, rather than redoing the same operations for every pixel at a huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with six times the number of pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
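A tiny sketch of both the saving and the hazard (my own example, not from the answer):
import numpy as np

a32 = np.zeros((1000, 1000), dtype=np.int32)
a8 = np.zeros((1000, 1000), dtype=np.uint8)
print(a32.nbytes, a8.nbytes)   # 4000000 vs 1000000 bytes

# The hazard: fixed-width integer arithmetic wraps around silently.
x = np.array([200], dtype=np.uint8)
y = np.array([100], dtype=np.uint8)
print(x + y)                   # [44], not [300]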
First most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays. Sounds a little bit old-fashioned, but with careful programming speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that the OS's caching of disk accesses is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
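For the memmap suggestion, a minimal sketch with a made-up file name and shape:
import numpy as np

# A float32 array backed by a file on disk rather than RAM.
m = np.memmap("scratch.dat", dtype=np.float32, mode="w+",
              shape=(1_000_000, 16))

m[:1000, :] = 1.0   # only the touched pages need to be resident in memory
m.flush()           # write dirty pages back to the file
print(m[:5, 0])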
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (with a and b being arrays), it:
- will compile machine code that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) if the same array occurs several times in your expression,
- uses all cores of your dual- or quad-core CPU,
- is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page linked above; there are examples that will help you understand whether numexpr is for you.
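A minimal sketch of the expression above with numexpr (assuming the numexpr package is installed):
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Single pass over the data; no temporaries for a**2, b**2 or 2*a*b.
result = ne.evaluate("a**2 + b**2 + 2*a*b")

reference = a**2 + b**2 + 2*a*b   # plain numpy allocates the intermediates
print(np.allclose(result, reference))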
On top of everything said in the other answers: when we do want to keep all the intermediate results of a computation (we don't always need them in memory), we can also use numpy's accumulate alongside the various types of aggregation:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one, or more, large numpy arrays can give you great results without using any additional libraries.

How to efficiently convert dates in numpy record array?

I have to read a very large (1.7 million records) csv file to a numpy record array. Two of the columns are strings that need to be converted to datetime objects. Additionally, one column needs to be the calculated difference between those datetimes.
At the moment I made a custom iterator class that builds a list of lists. I then use np.rec.fromrecords to convert it to the array.
However, I noticed that calling datetime.strptime() so many times really slows things down. I was wondering if there is a more efficient way to do these conversions. The times are accurate to the second within the span of a day. So, assuming that the times are uniformly distributed (they're not), it seems like I'm doing about 20x more conversions than necessary (1.7 million / (60 x 60 x 24) ≈ 20).
Would it be faster to store converted values in a dictionary {string dates: datetime obj} and first check the dictionary, before doing unnecessary conversions?
Or should I be using numpy functions (I am still new to the numpy library)?
I could be wrong, but it seems to me that your issue is having repeated occurrences, thus doing the same conversion more times than necessary. If that interpretation is correct, the most efficient method would depend on how many repeats there are. If you have 100,000 repeats out of 1.7 million, then writing 1.6 million entries to a dictionary and checking it 1.7 million times might not be more efficient, since it does 1.6 + 1.7 million reads/writes. However, if you have 1 million repeats, then returning an answer (O(1)) for those rather than doing the conversion an extra million times would be much faster.
All-in-all, though, python is very slow and you might not be able to speed this up much at all, given that you are using 1.7 million inputs. As for numpy functions, I'm not that well versed in it either, but I believe there's some good documentation for it online.
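As a sketch of the dictionary-cache idea from the question (the format string is an assumption, not taken from the asker's data):
from datetime import datetime

_cache = {}

def parse(ts, fmt="%Y-%m-%d %H:%M:%S"):
    # strptime with a dictionary cache: each distinct string is parsed once.
    try:
        return _cache[ts]
    except KeyError:
        dt = _cache[ts] = datetime.strptime(ts, fmt)
        return dt

a = parse("2013-05-01 12:00:00")
b = parse("2013-05-01 12:00:00")   # cache hit: no second strptime call
print(a is b)                      # True - same cached datetime object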
