I have to read a very large (1.7 million records) csv file to a numpy record array. Two of the columns are strings that need to be converted to datetime objects. Additionally, one column needs to be the calculated difference between those datetimes.
At the moment I made a custom iterator class that builds a list of lists. I then use np.rec.fromrecords to convert it to the array.
However, I noticed that calling datetime.strptime() so many times really slows things down. I was wondering if there was a more efficient way to do these conversions. The times are accurate to the second within the span of a date. So, assuming the times are uniformly distributed (they're not), it seems like I'm doing roughly 20x more conversions than necessary (1.7 million / (60 * 60 * 24)).
Would it be faster to store converted values in a dictionary {string dates: datetime obj} and first check the dictionary, before doing unnecessary conversions?
Or should I be using numpy functions (I am still new to the numpy library)?
I could be wrong, but it seems to me like your issue is having repeated occurrences, and thus doing the same conversion more times than necessary. If that interpretation is correct, the most efficient method would depend on how many repeats there are. If you have 100,000 repeats out of 1.7 million, then writing 1.6 million entries to a dictionary and checking it 1.7 million times might not be more efficient, since it does 1.6 + 1.7 million reads/writes. However, if you have 1 million repeats, then returning an answer (O(1)) for those rather than doing the conversion an extra million times would be much faster.
All in all, though, Python is quite slow and you might not be able to speed this up much at all, given that you are using 1.7 million inputs. As for numpy functions, I'm not that well versed in them either, but I believe there's some good documentation for them online.
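For illustration, a minimal sketch of that dictionary cache might look like this (the format string here is an assumption; use whatever the actual columns contain):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"   # assumed format, adjust to the real data
_cache = {}

def parse_cached(s):
    # Reuse an earlier conversion if the same string has been seen before.
    dt = _cache.get(s)
    if dt is None:
        dt = datetime.strptime(s, FMT)
        _cache[s] = dt
    return dt

The more repeats there are in the column, the more strptime() calls this avoids.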
Related
I'm working on an app that processes a lot of data.
.... and keeps running my computer out of memory. :(
Python has a huge amount of memory overhead on variables (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
Numpy arrays are great for reducing overhead. But they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). The array module (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of two ints (ideally uint32) for the other. Obviously, using ~80 bytes of Python structure to store 12 or 8 bytes of data per row is going to blow up my memory consumption.
Is the only realistic way to keep memory usage down in Python to "fake" 2d, i.e. by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)/WIDTH?
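To make the question concrete, the "fake 2d" idea would look roughly like this (the names and row width are just illustrative):

from array import array

WIDTH = 3                  # three float32 values per conceptual row
rows = array('f')          # flat 1d array of single-precision floats

def append_row(x, y, z):
    rows.extend((x, y, z))

def get(r, c):
    # Address element (r, c) of the conceptual 2d array.
    return rows[r * WIDTH + c]

def n_rows():
    return len(rows) // WIDTH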
Based on your comments, I'd suggest that you split your task into two parts:
1) In part 1, parse the JSON files using regexes and generate two CSV files in simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to.
2) In part 2, use pandas' read_csv() function to slurp in the CSV files directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
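A rough sketch of part 2, assuming headerless files named floats.csv and ints.csv (the file names, column names and dtypes are assumptions):

import pandas as pd

# Three float32 columns per row in one file, two uint32 columns in the other.
floats = pd.read_csv("floats.csv", header=None, names=["x", "y", "z"], dtype="float32")
ints = pd.read_csv("ints.csv", header=None, names=["a", "b"], dtype="uint32")

float_arr = floats.values   # compact numpy array, shape (n_rows, 3), dtype float32
int_arr = ints.values       # compact numpy array, shape (n_rows, 2), dtype uint32

The resulting numpy arrays store the data at roughly 12 and 8 bytes per row, with no per-row Python objects.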
I need to deal with lots of data (such as floats) in my program, which costs me a lot of memory. I also create some data structures to organize my data, which cost memory too.
Here is the example:
Heap at the end of the function:
Partition of a set of 6954910 objects. Total size = 534417168 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 3446006 50 248112432 46 248112432 46 array.array
1 1722999 25 124055928 23 372168360 70 vertex.Vertex
2 574705 8 82894088 16 455062448 85 list
.......
Any solution?
Python's array module provides array objects that are maintained internally as packed binary arrays of simple data.
For example
import array
a = array.array('f', (0. for x in range(100000)))
will create an array object containing 100,000 floats, and its size will be approximately just 400 KB (4 bytes per element).
Of course you can store only values of the specific type in an array object, not any Python value as you would do with regular list objects.
The numpy module extends over this concept and provides you many ways to quickly manipulate multidimensional data structures of this kind (including viewing part of arrays as arrays sharing the same memory, reshaping arrays, performing math and search operations and much more).
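For example, a numpy equivalent of the array above, plus a reshape that views the same memory as 2d (the sizes are arbitrary):

import numpy as np

a = np.zeros(100000, dtype=np.float32)   # ~400 KB, 4 bytes per element
b = a.reshape(1000, 100)                 # 2d view sharing the same memory
b[0, :] = 1.0                            # writes through to a as well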
If you need to deal with billions of rows of data per day, by far the simplest way to do that is to create a simple indexer script that splits the billions of rows in to small files based on some key (e.g. the first two digits of the IP address in a log file row).
If you need to deal with things like number theory, or log files, or something else where you have a lot of ints or floats:
1) Learn to use Numpy arrays well
2) Start using Numba's just-in-time compiling
3) Learn Cython (you can do much more than with Numba)
At least moderate Linux skills are a huge plus in dealing with large sets of data. Some things take seconds to do directly from the command line, while it might not be at all obvious how to do the same thing in Python.
At the very least, use %timeit to test a range of scales leading up to your desired scale (e.g. 2.5 billion rows per day). This is an easy way to identify possible performance drops, and to reduce the size of arrays or other factors accordingly.
Learn more about profiling / performance hacking as soon as you're doing something with data.
To make the point about the 'indexer' clear, here is a very simple example indexer I've created and used for doing a lot of computation on files with billions of rows of data, using a $60-per-month server:
https://github.com/mikkokotila/indexer
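The idea, reduced to a sketch (keying on the first two characters of each line is an assumption about the input format):

# Split a huge input file into small files keyed by a prefix of each row,
# so later passes only have to load one small file at a time.
handles = {}
with open("big.log") as src:
    for line in src:
        key = line[:2]                        # e.g. first two digits of an IP
        out = handles.get(key)
        if out is None:
            out = open("part_%s.csv" % key, "w")
            handles[key] = out
        out.write(line)
for out in handles.values():
    out.close()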
I have a really big, like huge, dictionary (it isn't really, but pretend, because it is easier and not relevant) that contains the same strings over and over again. I have verified that I can store a lot more in memory if I do poor man's compression on the system and instead store ints that correspond to the strings.
animals = ['ape', 'butterfly', 'cat', 'dog']
exists in a list and therefore has an index value such that animals.index('cat') returns 2
This allows me to store in my object BobsPets = {2, 3}
rather than 'cat' and 'dog'.
For the number of items involved, the memory savings are astronomical. (Really, don't try to dissuade me; this is well tested.)
Currently I then convert the ints back to strings with a for loop:
tempWordList = set()
for IntegOfIndex in TempSet:
    tempWordList.add(animals[IntegOfIndex])
return tempWordList
This code works. It feels "Pythonic," but it feels like there should be a better way. I am in Python 2.7 on AppEngine if that matters. It may since I wonder if Numpy has something I missed.
I have about 2.5 million things in my object, each has an average of 3 of these "pets", and there are 7,500-ish ints that represent the pets. (No, they aren't really pets.)
I have considered using a dictionary with the position instead of using index(). This doesn't seem faster, but I am interested if anyone thinks it should be. (It took more memory and seemed to be the same speed, or really close.)
I am considering running a bunch of tests with numpy and its arrays rather than lists, but before I do, I thought I'd ask the audience and see if I would be wasting time on something for which I have already reached the best solution.
Last thing: the solution should be picklable, since I do that for loading and transferring data.
It turns out that since my list of strings is fixed, and I just want the index of each string, I am building what is essentially an immutable index array, which is, in short, a tuple.
Moving to a tuple rather than a list gains about a 30% improvement in speed, far more than I would have anticipated.
The bonus is largest on very large lists. It seems that each time you cross a bit threshold the bonus increases, so for sub-1024 lists there is basically no bonus, and at a million entries it is pretty significant.
The Tuple also uses very slightly less memory for the same data.
As an aside: playing with the lists of integers, you can make these significantly smaller by using a numpy array, but the advantage doesn't extend to pickling. The pickles will be about 15% larger. I think this is because of the object description being stored in the pickle, but I didn't spend much time looking.
So, in short, the only change was to make the animals list a tuple. I was really hoping the answer would be something more exotic.
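In code, the change is just the container for the fixed string list; the lookup stays the same (the set comprehension below is merely a shorter spelling of the original for loop):

animals = ('ape', 'butterfly', 'cat', 'dog')   # tuple instead of list

def pets_to_words(TempSet):
    # Same index lookups as before, written as a set comprehension.
    return {animals[i] for i in TempSet}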
I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2, ...) and then call them with the counter (something like 'array' + str(counter) or 'array' + str(counter - 1))?
Why I want to do it:
The data are 400x700 arrays for 15-minute time steps over a year (so I have about 35,000 400x700 arrays). Each time step is read into Python individually. Now I need to apply one running filter that checks if the last four time steps are equal (element-wise) and, if they are, sets all four values to zero. The next filter uses the data after the first filter has run and checks if the sum of the last twelve time steps exceeds a certain value. When both filters are done I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered, accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where for each time step a new variable for the 400x700 array is created and the two filters run. The older arrays that are filtered I could then add to the yearly sum and delete, so that I do not have more than 16 (4+12) time steps(arrays) in memory at all times.
I don't know if it's correct of me to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure to keep a certain number of arrays in memory, in this case I would suggest using a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many arrays you need to look back at. The advantage of this is that your memory use will be constant, since you load new arrays into the larger one. The disadvantage is that you need to shift all arrays manually.
If you don't want to deal with the shifting yourself I'd suggest using a deque with a maxlen of 12.
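A rough sketch of the deque variant, assuming each new 400x700 time step arrives from some reader (read_time_steps() here is a hypothetical generator, and the filters themselves are left as a placeholder):

import numpy as np
from collections import deque

history = deque(maxlen=12)           # keeps only the last 12 time steps
yearly_sum = np.zeros((400, 700))

for step in read_time_steps():       # hypothetical: yields 400x700 arrays
    history.append(step)             # the oldest step drops out automatically
    if len(history) == history.maxlen:
        window = np.stack(history)   # shape (12, 400, 700), oldest first
        # ... apply the two running filters to `window` here, then accumulate:
        yearly_sum += window[0]      # placeholder for the filtered oldest step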
"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could, for example, be a list, dict or numpy array. Also, the collections.deque proposed by @Midnighter seems to be a good candidate for such a running filter.
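A minimal illustration of that lesson: instead of inventing names like array1, array2, keep the arrays in a list (or dict) indexed by the counter (load_step() is a hypothetical loader for one time step):

arrays = []
for counter in range(16):
    arrays.append(load_step(counter))    # 'array' + str(counter) becomes arrays[counter]
    if counter >= 1:
        previous = arrays[counter - 1]   # 'array' + str(counter - 1) becomes this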
I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give two lists of 207,360,000 entries, I am able to slowly get the program running. Greater than this and I run into "MemoryError" reports - e.g. for a 240x240 px region, I would have 3,317,760,000 entries.
At the beginning of the program, I create two empty lists:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the Stack Overflow pages and have seen a similar issue discussed here. This solution would potentially improve the program's running time; however, the solution offered only goes up to 100 million entries and therefore doesn't help me out with the larger regions (as with the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to use some kind of list of the proportions I have described for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory-efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of reduced precision). An array of 3 billion such single-precision floats would fit into 12 GB of memory.
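A sketch of the swap for the two lists in the question (type code 'f' is single precision; 'd' would keep double precision; pair_values() is a hypothetical generator standing in for the existing pair loop):

from array import array

variance = array('f')    # 4 bytes per entry instead of a full Python float object
lag = array('f')

for var_val, lag_val in pair_values():   # hypothetical source of (variance, lag) pairs
    variance.append(var_val)             # same append calls as with plain lists
    lag.append(lag_val)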
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
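A rough PyTables sketch of that idea: two extendable arrays that live on disk and are appended to in chunks (the file and node names are made up):

import numpy as np
import tables

f = tables.open_file("pairs.h5", mode="w")
variance = f.create_earray(f.root, "variance", tables.Float32Atom(), shape=(0,))
lag = f.create_earray(f.root, "lag", tables.Float32Atom(), shape=(0,))

# Append chunks of values instead of growing an in-memory list.
variance.append(np.array([1.0, 2.0, 3.0], dtype=np.float32))
lag.append(np.array([0.5, 1.5, 2.5], dtype=np.float32))
f.close()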
Try using heapq (import heapq). It uses the heap for storage rather than the stack, allowing you to access the computer's full memory.