astropy.io fits efficient element access of a large table - python

I am trying to extract data from a binary table in a FITS file using Python and astropy.io. The table contains an events array with over 2 million events. What I want to do is store the TIME values of certain events in an array, so I can then do analysis on that array. The problem I have is that, whereas the same operation in Fortran (using FITSIO) takes maybe a couple of seconds on a much slower processor, the exact same operation in Python using astropy.io takes several minutes. I would like to know where exactly the bottleneck is, and whether there is a more efficient way to access the individual elements in order to decide whether or not to store each time value in the new array. Here is the code I have so far:
from astropy.io import fits
minenergy=0.3
maxenergy=0.4
xcen=20000
ycen=20000
radius=50
datafile=fits.open('datafile.fits')
events=datafile['EVENTS'].data
datafile.close()
times=[]
for i in range(len(events)):
    energy = events['PI'][i]
    if energy < maxenergy*1000:
        if energy > minenergy*1000:
            x = events['X'][i]
            y = events['Y'][i]
            radius2 = (x-xcen)*(x-xcen) + (y-ycen)*(y-ycen)
            if radius2 <= radius*radius:
                times.append(events['TIME'][i])
print times
Any help would be appreciated. I am an OK programmer in other languages, but I have not really had to worry about efficiency in Python before. The reason I have chosen to do this in Python now is that I was using Fortran with both FITSIO and PGPLOT, as well as some routines from Numerical Recipes, but the newish Fortran compiler I have on this machine cannot be persuaded to produce a properly working program (there are some issues of 32- vs. 64-bit, etc.). Python seems to have all the functionality I need (FITS I/O, plotting, etc.), but if it takes forever to access the individual elements in a list, I will have to find another solution.
Thanks very much.

You need to do this using numpy vector operations. Without special tools like numba, doing large loops like you've done will always be slow in Python because it is an interpreted language. Your program should look more like:
energy = events['PI'] / 1000.                        # convert PI channel to keV
e_ok = (energy > minenergy) & (energy < maxenergy)   # boolean mask: events in the energy band
rad2 = (events['X'][e_ok] - xcen)**2 + (events['Y'][e_ok] - ycen)**2
r_ok = rad2 <= radius**2                             # mask, within the energy-selected events, inside the circle
times = events['TIME'][e_ok][r_ok]
This should have performance comparable to Fortran. You can also filter the entire event table, for instance:
events_filt = events[e_ok][r_ok]
times = events_filt['TIME']
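If you would rather keep the explicit loop, a hedged alternative is to JIT-compile it with numba, which is only mentioned in passing above. This is just a sketch reusing the variable and column names from the question; the .astype() calls are my own precaution because FITS columns are often stored big-endian, which numba will not accept.

import numba
import numpy as np

@numba.njit
def selection_mask(pi, x, y, minenergy, maxenergy, xcen, ycen, radius):
    # Build a boolean mask applying the same energy and radius cuts as the loop above.
    mask = np.zeros(pi.shape[0], dtype=np.bool_)
    for i in range(pi.shape[0]):
        e = pi[i]
        if e > minenergy * 1000 and e < maxenergy * 1000:
            r2 = (x[i] - xcen) ** 2 + (y[i] - ycen) ** 2
            if r2 <= radius * radius:
                mask[i] = True
    return mask

keep = selection_mask(events['PI'].astype(np.float64),
                      events['X'].astype(np.float64),
                      events['Y'].astype(np.float64),
                      minenergy, maxenergy, xcen, ycen, radius)
times = events['TIME'][keep]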

Related

Efficiently executing complex Python code in R

I'm looking for a fast and flexible way to execute Python code from within R.
The problem is that I want to execute a particular method from topological data analysis, called persistent homology, which you don't really need to know in detail to understand the problem I'm dealing with.
The thing is that computing persistent homology is not the easiest thing to do when working with large data sets, as it requires both a lot of memory and a lot of computational time.
Executing the method implemented in the R-package TDA is however very inconvenient: it crashes starting at around 800 data points and doesn't really allow me to use approximation algorithms.
In contrast, the Python package Ripser allows me to do the computation on a few thousands of points with ease in Python. Furthermore, it also allows one to provide a parameter that approximates the result for even larger data sets, and can also store the output I want. In summary, it is much more convenient to compute persistent homology with this package.
However, since everything else I'm working on is in R, it would also be a lot more convenient for me to execute the code from the Ripser package in R as well. An illustrative example is as follows:
# Import R libraries
library("reticulate") # Conduct Python code in R
library("ggplot2") # plotting
# Import Python packages
ripser <- import("ripser") # Python package for computing persistent homology
persim <- import("persim") # Python package for many tools used in analyzing Persistence Diagrams
# Generate data on circle
set.seed(42)
npoints <- 200
theta <- 2 * pi * runif(npoints)
noise <- cbind(rnorm(npoints, sd=0.1), rnorm(npoints, sd=0.1))
X <- data.frame(x=cos(theta) + noise[,1], y=sin(theta) + noise[,2])
# Compute persistent homology on full data
PHfull <- ripser$ripser(X, do_cocycles=TRUE) # slow if X is large
# Plot diagrams
persim$plot_diagrams(PHfull$dgms) # crashes
Now I have two problems when using this code:
The persistent homology computation performed by the ripser function works perfectly fine in this example. However, when I increase the number of data points in X, say npoints ~ 2000, the computation takes ages compared to the roughly 30 seconds it takes when I perform the computation straightforwardly in Python. I don't really know what's happening behind the scenes that causes this huge difference in computation time. Is it because R is perhaps less convenient than Python for this example, and instead of converting my arguments to the corresponding Python types and executing the code in Python, the Python code is converted to R code? I am looking for an approach that combines this flexibility and efficient type conversion with the speed of Python.
In Python, the analog to the last line would plot an image based on the result of my persistent homology computation. However, executing this line causes R to crash. Is there a possible way to show the image that would normally result in Python in R?
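For reference, here is a hedged sketch of what the "straightforwardly in Python" baseline mentioned in the question might look like, mirroring the R example above; saving the diagram to a file instead of opening a window is my own assumption, made to sidestep any interactive-display issues.

import numpy as np
import matplotlib.pyplot as plt
from ripser import ripser
from persim import plot_diagrams

# Same noisy circle as in the R example
rng = np.random.default_rng(42)
npoints = 200
theta = 2 * np.pi * rng.uniform(size=npoints)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(scale=0.1, size=(npoints, 2))

PHfull = ripser(X, do_cocycles=True)   # X is a plain float ndarray, not a data frame
plot_diagrams(PHfull['dgms'])
plt.savefig("diagrams.png")            # write to a file rather than show interactively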

Python script for analyzing pair separations in huge arrays

I am writing a Python script to compare two matrices (A and B) and perform a specific range query. Each matrix has three columns and over a million rows. The range query involves finding, for each row of B, the indices of A such that the pair separation distance lies in a given range (e.g. 15 units < pair separation < 20 units). (It doesn't matter which array comes first.) I then have to repeat this for other slices of pair separation, viz. 20 units < pair separation < 25 units, and so on until a maximum pair separation is reached.
I have read that cKDTree from scipy.spatial can do this job very quickly. I tried using cKDTree followed by query_ball_tree. query_ball_tree returns its output as a list of lists and works perfectly fine for matrices with ~100,000 rows (small size). But once the number of rows gets close to a million, I start getting memory errors and my code stops working; the computer's memory simply doesn't like storing such huge arrays. I tried using query_ball_point and that also runs into memory errors.
This looks like a typical big-data problem. Execution time is extremely important for the nature of my work, so I am searching for ways to make the code faster, but I keep running into memory errors and the code comes to a grinding halt.
I was wondering if someone can tell me what is the best method to approach in such a scenario.
1. Is the use of scipy.spatial's cKDTree with query_ball_point (or query_ball_tree) the best approach for my problem?
2. Can I hope to keep using cKDTree successfully with some modification?
3. I tried to see if I could make some headway by modifying the source of scipy.spatial's cKDTree, specifically query_ball_tree or query_ball_point, to prevent it from storing huge arrays. But I am not very experienced with that, so I haven't been successful yet.
4. It's very possible that there are solutions that haven't struck my mind. I will very much appreciate any useful suggestion from the people here.
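Not part of the question, but as a hedged sketch of one way to bound memory: build the tree on A once and walk B in chunks, so only one chunk's neighbour lists are alive at a time (the function and parameter names here are made up).

import numpy as np
from scipy.spatial import cKDTree

def pairs_in_band(A, B, r_lo, r_hi, chunk=100000):
    # Yield (row_of_B, indices_of_A) whose separation d satisfies r_lo < d <= r_hi.
    tree_A = cKDTree(A)
    for start in range(0, len(B), chunk):
        block = B[start:start + chunk]
        hits = tree_A.query_ball_point(block, r_hi)   # neighbours within r_hi, this chunk only
        for j, idx in enumerate(hits):
            idx = np.asarray(idx, dtype=np.intp)
            if idx.size:
                d = np.linalg.norm(A[idx] - block[j], axis=1)
                yield start + j, idx[d > r_lo]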

Prime number hard drive storage for very large primes - Sieve of Atkin

I have implemented the Sieve of Atkin and it works great up to primes nearing 100,000,000 or so. Beyond that, it breaks down because of memory problems.
In the algorithm, I want to replace the memory based array with a hard drive based array. Python's "wb" file functions and Seek functions may do the trick. Before I go off inventing new wheels, can anyone offer advice? Two issues appear at the outset:
1. Is there a way to "chunk" the Sieve of Atkin to work on a segment in memory, and
2. is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them?
Why am I doing this? An old geezer looking for entertainment and to keep the noodle working.
Implementing the SoA in Python sounds fun, but note it will probably be slower than the SoE in practice. For some good monolithic SoE implementations, see RWH's StackOverflow post. These can give you some idea of the speed and memory use of very basic implementations. The numpy version will sieve to over 10,000M on my laptop.
What you really want is a segmented sieve. This lets you constrain memory use to some reasonable limit (e.g. 1M + O(sqrt(n)), and the latter can be reduced if needed). A nice discussion and code in C++ is shown at primesieve.org. You can find various other examples in Python. primegen, Bernstein's implementation of SoA, is implemented as a segmented sieve (Your question 1: Yes the SoA can be segmented). This is closely related (but not identical) to sieving a range. This is how we can use a sieve to find primes between 10^18 and 10^18+1e6 in a fraction of a second -- we certainly don't sieve all numbers to 10^18+1e6.
Involving the hard drive is, IMO, going the wrong direction. We ought to be able to sieve faster than we can read values from the drive (at least with a good C implementation). A ranged and/or segmented sieve should do what you need.
There are better ways to do storage, which will help some. My SoE, like a few others, uses a mod-30 wheel so has 8 candidates per 30 integers, hence uses a single byte per 30 values. It looks like Bernstein's SoA does something similar, using 2 bytes per 60 values. RWH's python implementations aren't quite there, but are close enough at 10 bits per 30 values. Unfortunately it looks like Python's native bool array is using about 10 bytes per bit, and numpy is a byte per bit. Either you use a segmented sieve and don't worry about it too much, or find a way to be more efficient in the Python storage.
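To make the segmented idea concrete, here is a minimal sketch, not taken from any of the implementations mentioned above: a segmented Sieve of Eratosthenes rather than Atkin, purely for brevity (Python 3.8+ for math.isqrt). Memory use stays at O(segment + sqrt(limit)) no matter how far you sieve.

import math

def small_primes(n):
    # Plain sieve for the base primes up to n (n ~ sqrt(limit), so this is cheap).
    flags = bytearray([1]) * (n + 1)
    flags[:2] = b'\x00\x00'
    for p in range(2, math.isqrt(n) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(flags[p * p::p]))
    return [i for i, f in enumerate(flags) if f]

def segmented_primes(limit, segment=1 << 20):
    # Yield the primes up to `limit`, one fixed-size block at a time.
    base = small_primes(math.isqrt(limit))
    for p in base:
        yield p
    lo = math.isqrt(limit) + 1
    while lo <= limit:
        hi = min(lo + segment - 1, limit)
        flags = bytearray([1]) * (hi - lo + 1)
        for p in base:
            start = max(p * p, (lo + p - 1) // p * p)   # first multiple of p in [lo, hi]
            flags[start - lo::p] = bytearray(len(flags[start - lo::p]))
        for i, f in enumerate(flags):
            if f:
                yield lo + i
        lo = hi + 1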
First of all, you should make sure that you store your data in an efficient manner. You could easily store the flags for up to 100,000,000 numbers in 12.5 MB of memory by using a bitmap, and by skipping obvious non-primes (even numbers and so on) you could make the representation even more compact. This also helps when storing the data on the hard drive. That you are getting into trouble at 100,000,000 suggests you're not storing the data efficiently.
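As a rough sketch of that bitmap idea (the one-bit-per-odd-number layout below is just an illustrative choice): the flags for every odd number below 100,000,000 fit in about 6 MB.

N = 100000000
bits = bytearray(N // 16 + 1)        # bit i stands for the odd number 2*i + 1

def mark_composite(n):               # n assumed odd
    i = n // 2
    bits[i >> 3] |= 1 << (i & 7)

def is_marked(n):                    # n assumed odd
    i = n // 2
    return (bits[i >> 3] >> (i & 7)) & 1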
Some hints if you don't receive a better answer.
1. Is there a way to "chunk" the Sieve of Atkin to work on a segment in memory?
Yes, for the Eratosthenes-like part what you could do is run over multiple elements of the sieve list in "parallel" (one block at a time) and that way minimize the disk accesses.
The first part is somewhat trickier. What you would want to do is process the 4*x**2+y**2, 3*x**2+y**2 and 3*x**2-y**2 terms in a more sorted order. One way is to first compute them and then sort the numbers; there are sorting algorithms that work well on drive storage (still O(N log N)), but that would hurt the time complexity. A better way is to iterate over x and y so that you work on one block at a time: since a block is determined by an interval, you could for example simply iterate over all x and y such that lo <= 4*x**2+y**2 <= hi (a rough sketch of this block-bounded iteration appears after point 2 below).
2. Is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them?
In order to achieve this (no matter how and when the program is terminated) you first have to journal your disk accesses (for example, use an SQL database to keep the data, although with care you could do it yourself).
Second, since the operations in the first part are not idempotent, you have to make sure that you don't repeat those operations. However, since you would be running that part block by block, you could simply detect which block was processed last and resume there (if you can end up with a partially processed block, you'd just discard it and redo that block). The Eratosthenes part is idempotent, so you could just run through all of it, but to speed things up you could store a list of produced primes after they have been sieved (so you would resume sieving after the last produced prime).
As a by-product, you should even be able to construct the program so that the data from the first step is kept while the second step is running, letting you extend the limit at a later moment by continuing the first step and then running the second step again. You could perhaps even have two programs, terminating the first when you've got tired of it and then feeding its output to the Eratosthenes part (thereby not having to define a limit up front).
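Here is the promised rough sketch of the block-bounded iteration for point 1. It covers only the 4*x**2 + y**2 form and ignores the mod-12 filtering and flag flipping that the real Sieve of Atkin performs (Python 3.8+ for math.isqrt).

import math

def quad_terms_in_block(lo, hi):
    # Yield every n = 4*x*x + y*y with lo <= n < hi, never generating values
    # outside the current block.
    x = 1
    while 4 * x * x + 1 < hi:
        base = 4 * x * x
        # smallest y >= 1 with base + y*y >= lo
        y = 1 if base + 1 >= lo else math.isqrt(lo - base - 1) + 1
        while base + y * y < hi:
            yield base + y * y
            y += 1
        x += 1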
You could try using a signal handler to catch when your application is terminated. This could then save your current state before terminating. The following script shows a simple number count continuing when it is restarted.
import signal, os, cPickle

class MyState:
    def __init__(self):
        self.count = 1

def stop_handler(signum, frame):
    global running
    running = False

signal.signal(signal.SIGINT, stop_handler)
running = True
state_filename = "state.txt"

if os.path.isfile(state_filename):
    with open(state_filename, "rb") as f_state:
        my_state = cPickle.load(f_state)
else:
    my_state = MyState()

while running:
    print my_state.count
    my_state.count += 1

with open(state_filename, "wb") as f_state:
    cPickle.dump(my_state, f_state)
As for improving disk writes, you could try experimenting with increasing Python's own file buffering by passing a buffer size of 1 MB or more, e.g. open('output.txt', 'w', 2**20). Using a with statement should also ensure your file gets flushed and closed.
There is a way to compress the array. It may cost some efficiency depending on the python interpreter, but you'll be able to keep more in memory before having to resort to disk. If you search online, you'll probably find other sieve implementations that use compression.
Neglecting compression though, one of the easier ways to persist memory to disk would be through a memory mapped file. Python has an mmap module that provides the functionality. You would have to encode to and from raw bytes, but it is fairly straightforward using the struct module.
>>> import struct
>>> struct.pack('H', 0xcafe)
b'\xfe\xca'
>>> struct.unpack('H', b'\xfe\xca')
(51966,)
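Putting the two modules together, here is a hedged sketch of a memory-mapped file of 16-bit slots; the file name and layout are made up for illustration.

import mmap, struct

N_SLOTS = 1000000
with open("sieve.bin", "wb") as f:
    f.truncate(N_SLOTS * 2)                            # reserve zero-filled space on disk

with open("sieve.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[10 * 2:10 * 2 + 2] = struct.pack('H', 0xcafe)   # write slot 10
    value, = struct.unpack('H', mm[10 * 2:10 * 2 + 2]) # read it back -> 51966
    mm.flush()
    mm.close()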

Sensible storage of 1 billion+ values in a python list type structure

I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give two lists of 207,360,000 entries each, I am able to slowly get the program running. Beyond that, I run into "MemoryError" reports - e.g. for a 240x240 px region I would have 3,317,760,000 entries.
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the Stack Overflow pages and have seen a similar issue discussed here. That solution would potentially improve runtime performance; however, the solution offered only goes up to 100 million entries and therefore doesn't help me with the larger regions (as with the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for some kind of list-like structure that can hold the number of entries I have described for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
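A minimal sketch of that suggestion, reusing the question's list names:

from array import array

variance = array('f')           # 'f' = single-precision float, 4 bytes per entry
lag = array('f')

variance.append(0.42)           # same append() interface as a plain list
lag.append(15.0)
print(len(variance), variance.itemsize)   # itemsize confirms 4 bytes per value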
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
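Here is a hedged sketch of the PyTables route, with made-up file and array names: an EArray lives on disk and can be appended to and sliced much like a numpy array.

import numpy as np
import tables

with tables.open_file("variogram.h5", mode="w") as h5:
    variance = h5.create_earray(h5.root, "variance", tables.Float32Atom(), shape=(0,))
    lag = h5.create_earray(h5.root, "lag", tables.Float32Atom(), shape=(0,))
    # append results in batches rather than one Python float at a time
    variance.append(np.random.rand(1000).astype(np.float32))
    lag.append(np.random.rand(1000).astype(np.float32))

with tables.open_file("variogram.h5", mode="r") as h5:
    first_chunk = h5.root.variance[:1000]   # slicing reads only what you ask for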
Try using heapq (import heapq). It uses the heap for storage rather than the stack, allowing you to access the computer's full memory.

Simultaneous assignment to a numpy array

I have a data structure which serves as a wrapper to a 2D numpy array in order to use labeled indices and perform statements such as
myMatrix[ "rowLabel", "colLabel" ] = 1.0
Basically this is implemented as
def __setitem__(self, key, value):
    row, col = key   # the ("rowLabel", "colLabel") pair arrives as a single tuple key
    ...              # check validity of row/col labels
    self.__matrixRepresentation[self.__rowMap[row], self.__colMap[col]] = value
I am assigning the values in a database table to this data structure, and it was straightforward to write a loop for this. However, I want to execute this loop 100 million or more times, and iteratively retrieving chunks of values from the database table and moving them to this structure takes more time than I would prefer.
All of the values I retrieve from the database table have different (row,column) pairs.
Therefore, it seems that I could parallelize the above assignment, but I don't know if numpy arrays permit simultaneous assignment using some sort of internal locking mechanism for atomic operations, or if it altogether prohibits any such thought process. If anyone has suggestions or criticisms, I would appreciate it. (If possible, in this case I'd prefer not to resort to cython or PyPy.)
Parallel execution at that level is unlikely to help here. The global interpreter lock will ruin your day. Besides, you're still going to have to pull each set of values out of the database sequentially, which is quite possibly going to dwarf the in-process map lookups and array assignment, especially if the database is on a remote machine.
If at all possible, don't store your matrix in that database. Dedicated formats exist for efficiently storing large arrays. HDF5/NetCDF come to mind. There are excellent Python/NumPy libraries for using HDF5 datasets. With no further information on the format or purpose of the database and/or matrix, I can't really give you better storage recommendations.
If you don't control how that data is stored, you're just going to have to wait for it to get slurped in. Depending on what you're using it for and how often it gets updated, you might just wait for it once, and then updates from the database can be written to it from a separate thread as they become available.
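As a hedged illustration of the HDF5 suggestion, here is a sketch using h5py with made-up file, dataset, and index names; the matrix lives on disk and scalar assignments write through to the file.

import h5py
import numpy as np

with h5py.File("matrix.h5", "w") as f:
    dset = f.create_dataset("matrix", shape=(10000, 10000), dtype="f8")
    # one retrieved chunk of (row, col, value) triples, already mapped to integer indices
    rows = np.array([3, 17, 42])
    cols = np.array([5, 8, 99])
    vals = np.array([1.0, 2.5, -0.3])
    for r, c, v in zip(rows, cols, vals):
        dset[r, c] = v                      # scalar writes go straight to disk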
(Unrelated terminology issue: "CPython" is the standard Python interpreter. I assume you meant that you don't want to write an extension for CPython using C, Cython, Boost Python, or whatever.)
