what changes when your input is giga/terabyte sized? - python

I took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying on in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)

I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files, and some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic: 10, even 5, years ago Oracle and its kin just weren't up to the task of managing single datasets of O(TB), let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me (a) it's just another file format so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too).
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran, one of its strengths is its array syntax which fits very well onto a lot of scientific computing; another is the good support for parallelisation available. I believe that Python has all sorts of parallelisation support too so it's probably not a bad choice for you.
Sure you can add parallelism on to sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures which work well when all your data is resident in memory. Very often, trying to adapt them to the situation where you can't get the data into memory all at once is much harder (and less performant) than brute force and regarding the entire file as one large array.
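To make the "entire file as one large array" idea concrete, here is a minimal sketch using only the standard library. It assumes the data is a flat binary file of native float64 values; the function names and the chunk size are made up for illustration, and a real format such as HDF5 would be read through its own library instead.

```python
import array
import os
import tempfile


def chunked_sum(path, chunk_items=1 << 16):
    """Sum a flat binary file of float64 values, one fixed-size chunk at a time."""
    total = 0.0
    item_size = array.array("d").itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break
            chunk = array.array("d")
            chunk.frombytes(raw)  # only this chunk is ever held in memory
            total += sum(chunk)
    return total


def write_demo_file(n_items):
    """Create a small stand-in data file holding 0.0, 1.0, ..., n_items - 1."""
    fd, path = tempfile.mkstemp(suffix=".f64")
    with os.fdopen(fd, "wb") as f:
        array.array("d", (float(i) for i in range(n_items))).tofile(f)
    return path
```

Memory use is bounded by chunk_items regardless of how large the file is, which is the property that matters once the file no longer fits in RAM.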
Performance starts to matter in a serious way, both the execution performance of programs and developer performance. It's not that a 1TB dataset requires 10 times as much code as a 1GB dataset so you have to work faster; it's that some of the ideas that you will need to implement will be crazily complex, and will probably have to be written by domain specialists, i.e. the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long, I'd better get back to work

In a nutshell, the main differences IMO:
You should know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address this. I/O quite frequently is the bottleneck.
Choice and fine-tuning of an algorithm often dominates any other choice made. Even modest changes to algorithms and access patterns can impact performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
Talk to your colleagues and other scientists to profit from their experiences with these data sets. A lot of tricks cannot be found in textbooks.
Pre-computing and storing can be extremely successful.
Bandwidth and I/O
Initially, bandwidth and I/O are often the bottleneck. To give you a perspective: at the theoretical limit for SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, need to read several times, or need to write, you want to do this in memory most of the time or need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel in different processes, or HDF5 on top of MPI-2 I/O, is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two is "for free".
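The SATA 3 figure can be checked with a little arithmetic. This sketch assumes a decimal terabyte (10^12 bytes) and the usual ~600 MB/s payload rate for SATA 3 (6 Gbit/s line rate with 8b/10b encoding); the function name is just for illustration.

```python
def read_time_minutes(size_bytes, bandwidth_bytes_per_s):
    """Best-case time to stream size_bytes at a sustained bandwidth."""
    return size_bytes / bandwidth_bytes_per_s / 60


SATA3_BYTES_PER_S = 600e6  # assumption: 6 Gbit/s line rate, 8b/10b encoding
ONE_TB = 1e12              # assumption: decimal terabyte

print(f"{read_time_minutes(ONE_TB, SATA3_BYTES_PER_S):.1f} minutes")  # 27.8 minutes
```

Real disks rarely sustain the interface limit, so treat this as a lower bound rather than an estimate.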
Clusters
Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can effectively distribute your tasks (for example, with MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
Algorithms
In choosing between algorithms, the big O of an algorithm is very important, but algorithms with similar big O can be dramatically different in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main-memory misses), the worse the performance will be: each step down the memory hierarchy, from cache to main memory to storage, is typically at least an order of magnitude slower than the one above it. Classical examples of improvements would be tiling for matrix multiplications or loop interchange.
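As an illustration of tiling, here is a rough sketch of a blocked matrix multiplication next to the naive triple loop. It is written in pure Python only to show the access pattern; interpreter overhead hides any cache effect here, and the payoff shows up in compiled code (C, Fortran, Numba and so on). The function names and the tile size are arbitrary choices.

```python
def matmul_naive(a, b):
    """Textbook i-j-k loop; walks down a column of b in the inner loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same result, but works on tile x tile blocks that can stay in cache."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Multiply one block of a by one block of b.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik = a[i][k]
                        # Inner loop runs along a row of b (loop interchange).
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The tiled version also swaps the two inner loops so that the innermost one moves along rows, which is the loop-interchange idea mentioned above.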
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or programming languages / data structures with less memory consumption (e.g., in Python __slots__ might be useful), because more memory might mean less I/O per CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
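For the __slots__ remark, a small sketch of what it changes: a class that declares __slots__ stores its attributes in fixed slots instead of a per-instance dictionary, which is where the memory saving comes from. The class names are invented for the example, and the actual saving depends on the Python version and on how many instances you keep.

```python
class PointDict:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Slotted class: attributes live in fixed slots, no per-instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y
```

The trade-off is that a slotted instance cannot grow new attributes at runtime: assigning PointSlots(1, 2).z = 3 raises AttributeError.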
Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
The way you find and access data, and store meta information, can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational db directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The PyTables you mention would be another example.

While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
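A small sketch of "doing the work in the database", using the standard-library sqlite3 module as a stand-in for whatever database you actually run. The readings table and its columns are invented for the example.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table standing in for a much larger one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sample TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
    )
    return conn


def mean_by_sample_in_db(conn):
    """Let the database aggregate; only one row per group reaches Python."""
    rows = conn.execute(
        "SELECT sample, AVG(value) FROM readings GROUP BY sample ORDER BY sample"
    ).fetchall()
    return dict(rows)
```

The alternative, SELECT * followed by grouping in Python, gives the same answer but drags every row across the process boundary first.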

The main assumptions are about the amount of CPU/cache/RAM/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here at Stack Overflow still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 RAM modules at 220 EUR, machines with 512 GB of RAM and 48 cores can be built at reasonable prices. The switch from hard disks to SSDs is another important change.

Related

Can everything that can be done with Pandas Dataframes be reproduced with Pyspark Dataframes?

I'm trying to think of a reason (other than you only have a small dataset) that you wouldn't use Pyspark Dataframes.
Can everything that can be done with Pandas Dataframes be reproduced with Pyspark Dataframes?
Are there some Pandas-exclusive functions or some functions that are incredibly difficult to reproduce with Pyspark?
Spark is a distributed processing framework. In addition to supporting the DataFrame functionality, it needs to run a JVM, a scheduler, cross-process/machine communication, it spins up databases, etc. So while the answer to your question is, of course, no (not exactly everything is implemented the same way), the wider answer is that any distributed processing library naturally involves immense overhead. Lots of work goes into reducing this overhead, but it will never be trivial.
Dask (another distributed processing library with a DataFrame implementation) has a great section on best practices. In it, the first recommendation is not to use Dask unless you have to:
Parallelism brings extra complexity and overhead. Sometimes it’s necessary for larger problems, but often it’s not. Before adding a parallel computing system like Dask to your workload you may want to first try some alternatives:
Use better algorithms or data structures: NumPy, Pandas, Scikit-Learn may have faster functions for what you’re trying to do. It may be worth consulting with an expert or reading through their docs again to find a better pre-built algorithm.
Better file formats: Efficient binary formats that support random access can often help you manage larger-than-memory datasets efficiently and simply. See the Store Data Efficiently section below.
Compiled code: Compiling your Python code with Numba or Cython might make parallelism unnecessary. Or you might use the multi-core parallelism available within those libraries.
Sampling: Even if you have a lot of data, there might not be much advantage from using all of it. By sampling intelligently you might be able to derive the same insight from a much more manageable subset.
Profile: If you’re trying to speed up slow code it’s important that you first understand why it is slow. Modest time investments in profiling your code can help you to identify what is slowing you down. This information can help you make better decisions about if parallelism is likely to help, or if other approaches are likely to be more effective.
There's a very good reason for this. In-memory, single-threaded applications are always going to be much faster for small datasets.
Very simplistically, if you imagine the single-threaded runtime for your workflow is T, the wall time of a distributed workflow will be T_parallelizable / n_cores + T_not_parallelizable + overhead. For pyspark, this overhead is very significant. It's worth it a lot of the time. But it's not nothing.
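The same back-of-the-envelope model in code, with made-up numbers purely to show how the overhead term dominates small jobs. The function and parameter names follow the formula above; the timings are assumptions, not measurements.

```python
def distributed_wall_time(t_parallelizable, t_not_parallelizable, n_cores, overhead):
    """Simplistic wall-time model: parallel part / cores + serial part + overhead."""
    return t_parallelizable / n_cores + t_not_parallelizable + overhead


# Hypothetical small job: 10 s single-threaded, 9 s of it parallelizable.
small = distributed_wall_time(9, 1, n_cores=10, overhead=30)

# Hypothetical large job: 1000 s single-threaded, 900 s of it parallelizable.
large = distributed_wall_time(900, 100, n_cores=10, overhead=30)
```

With an assumed 30 s of overhead, the small job gets slower (about 32 s instead of 10 s) while the large one drops from 1000 s to 220 s.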

Is it faster and more memory efficient to manipulate data in Python or PostgreSQL?

Say I had a PostgreSQL table with 5-6 columns and a few hundred rows. Would it be more effective to use psycopg2 to load the entire table into my Python program and use Python to select the rows I want and order the rows as I desire? Or would it be more effective to use SQL to select the required rows, order them, and only load those specific rows into my Python program.
By 'effective' I mean in terms of:
Memory Usage.
Speed.
Additionally, how would these factors start to vary as the size of the table increases? Say, the table now has a few million rows?
It's almost always going to be faster to perform all of these operations in PostgreSQL. These database systems have been designed to scale well for huge amounts of data, and are highly optimised for their typical use cases. For example, they don't have to load all of the data from disk to perform most basic filters[1].
Even if this were not the case, the network latency / usage alone would be enough to balance this out, especially if you were running the query often.
Actually, if you are comparing data that is already loaded into memory to data being retrieved from a database, then the in-memory operations are often going to be faster. Databases have overhead:
They are in separate processes on the same server or on a different server, so data and commands need to move between them.
Queries need to be parsed and optimized.
Databases support multiple users, so other work may be going on using up resources.
Databases maintain ACID properties and data integrity, which can add additional overhead.
The first two of these in particular add overhead compared to equivalent in-memory operations for every query.
That doesn't mean that databases do not have advantages, particularly for complex queries:
They implement multiple different algorithms and have an optimizer to choose the best one.
They can take advantage of more resources -- particularly by running in parallel.
They can (sometimes) cache results saving lots of time.
The advantage of databases is not that they provide the best performance all the time. The advantage is that they provide good performance across a very wide range of requests with a simple interface (even if you don't like SQL, I think you need to admit that it is simpler, more concise, and more flexible than writing code in a 3rd generation language).
In addition, databases protect data, via ACID properties and other mechanisms to support data integrity.

How to compute the performance of a python program on different machines

I would like to know which performance characteristics can be used to measure the performance of Python code on two different systems. Also, is it possible to extrapolate its performance to a different machine? Is this kind of thing possible?
Let's assume that one of the two systems does its computation on a GPU and the other on a CPU.
I want to extrapolate the Python code's performance to a different, CPU-based system.
Can this be also derived analytically?
In my experience, making assumptions based on hands-on performance analysis has been sufficient for identifying initial instance sizes/requirements, followed by using real-time telemetry and instrumentation to closely monitor those solutions.
There are a couple of ways I've used to compute performance (the terms are gibberish I've made up):
Informal Characterization of Bottlenecks
This involves informally understanding where the bottlenecks of your application are likely to be, to give a very rough idea of capacity/machine requirements. If you're performing CPU-bound calculations with little to no network traffic, then you could skip starting with a network-optimized instance. Also, if you're materializing processing to the filesystem, and memory overhead is pretty small or bounded, then you don't need a high-memory instance.
External Performance Experiments
This involves creating performance test harnesses to establish baseline experiments, allowing you to change machine variables to determine what sort of effect they have on your program's performance. I like to set up queue-based systems with throughput tests, e.g. at 10k requests/second, what is the queue saturation and what is the service time? It involves adding logging/telemetry to the code to record those numbers. Also set up a backlog to understand how fast a single instance can process it.
For HTTP there are many tools to generate load.
Hopefully there is an automated tool to support your input format but if not you may have to write your own.
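If you do end up writing your own harness, a very rough starting point could look like the sketch below. It drains an in-process backlog through a handler and records service time and throughput; run_backlog, the handler, and the metric names are all hypothetical, and a real setup would feed an actual queue and ship the numbers to your telemetry system.

```python
import time
from collections import deque


def run_backlog(handler, backlog):
    """Drain a backlog through handler and report basic timing numbers."""
    queue = deque(backlog)
    service_times = []
    start = time.perf_counter()
    while queue:
        item = queue.popleft()
        t0 = time.perf_counter()
        handler(item)
        service_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    processed = len(service_times)
    return {
        "processed": processed,
        "elapsed_s": elapsed,
        "throughput_per_s": processed / elapsed if elapsed > 0 else float("inf"),
        "mean_service_time_s": sum(service_times) / processed if processed else 0.0,
    }
```

Running the same backlog on two machines, or on two instance sizes, gives you directly comparable numbers for the one variable you changed.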
Performance Profiling
I consider this to be using "low level" tools to scientifically (as opposed to the informal analysis) determine where your code is spending its time. It usually involves using a Python profiler to determine which routines you're spending time in, and then trying to optimize them. http://www.brendangregg.com/linuxperf.html
For this step if the performance test harness has acceptable performance then this can be ignored :p
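For the Python profiler step, a minimal sketch with the standard-library cProfile and pstats modules. The slow_sum_of_squares function is just a stand-in workload and profile_call is a helper name made up here.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop to give the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return result, stream.getvalue()
```

The report lists call counts and cumulative time per function, which is usually enough to see which routine deserves attention first.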
Real time telemetry
After acceptable performance and instance size have been determined, real-time telemetry is critical to see how the program performs, in real-time-ish, under real-life workloads.
I've found throughput, processing counts, errors, etc. to all be critical to maintaining high-performance systems:
http://www.brendangregg.com/usemethod.html

How to speed up Python code for running on a powerful machine? [closed]

I've completed writing a multiclass classification algorithm that uses boosted classifiers. One of the main calculations consists of weighted least squares regression.
The main libraries I've used include:
statsmodels (for regression)
numpy (pretty much everywhere)
scikit-image (for extracting HoG features of images)
I've developed the algorithm in Python, using Anaconda's Spyder.
I now need to use the algorithm to start training classification models. So I'll be passing approximately 7000-10000 images to this algorithm, each about 50x100, all in gray scale.
Now I've been told that a powerful machine is available in order to speed up the training process. And they asked me "am I using GPU?" And a few other questions.
To be honest I have no experience in CUDA/GPU, etc. I've only ever heard of them. I didn't develop my code with any such thing in mind. In fact I had the (ignorant) impression that a good machine will automatically run my code faster than a mediocre one, without my having to do anything about it. (Apart from obviously writing regular code efficiently in terms of loops, O(n), etc).
Is it still possible for my code to get sped up simply by virtue of being on a high-performance computer? Or do I need to modify it to make use of a parallel-processing machine?
The comments and Moj's answer give a lot of good advice. I have some experience on signal/image processing with python, and have banged my head against the performance wall repeatedly, and I just want to share a few thoughts about making things faster in general. Maybe these help figuring out possible solutions with slow algorithms.
Where is the time spent?
Let us assume that you have a great algorithm which is just too slow. The first step is to profile it to see where the time is spent. Sometimes the time is spent doing trivial things in a stupid way. It may be in your own code, or it may even be in the library code. For example, if you want to run a 2D Gaussian filter with a largish kernel, direct convolution is very slow, and even FFT may be slow. Approximating the filter with computationally cheap successive sliding averages may speed things up by a factor of 10 or 100 in some cases and give results which are close enough.
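To show what "successive sliding averages" means, here is a one-dimensional sketch in plain Python. Each pass is a box filter computed with a running sum, so its cost does not depend on the kernel width, and a few passes give a roughly Gaussian-shaped response. The function names, the edge padding, and the choice of three passes are assumptions for the example; for images you would apply the same idea along rows and then along columns.

```python
def box_filter(signal, width):
    """Sliding average of odd width, using a running sum (O(n) for any width)."""
    half = width // 2
    n = len(signal)
    # Repeat the edge values so the output has the same length as the input.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    window_sum = sum(padded[:width])
    out = [window_sum / width]
    for i in range(1, n):
        # Slide the window: add the entering sample, drop the leaving one.
        window_sum += padded[i + width - 1] - padded[i - 1]
        out.append(window_sum / width)
    return out


def approx_gaussian(signal, width, passes=3):
    """Approximate a Gaussian blur with repeated box filters."""
    for _ in range(passes):
        signal = box_filter(signal, width)
    return signal
```

Feeding in a single spike shows the effect: the output is a smooth, symmetric bump centred on the spike whose values still sum to one.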
If a lot of time is spent in some module/library code, you should check if the algorithm is just a slow algorithm, or if there is something slow with the library. Python is a great programming language, but for pure number crunching operations it is not good, which means most great libraries have some binary libraries doing the heavy lifting. On the other hand, if you can find suitable libraries, the penalty for using python in signal/image processing is often negligible. Thus, rewriting the whole program in C does not usually help much.
Writing a good algorithm even in C is not always trivial, and sometimes the performance may vary a lot depending on things like CPU cache. If the data is in the CPU cache, it can be fetched very fast, if it is not, then the algorithm is much slower. This may introduce non-linear steps into the processing time depending on the data size. (Most people know this from the virtual memory swapping, where it is more visible.) Due to this it may be faster to solve 100 problems with 100 000 points than 1 problem with 10 000 000 points.
One thing to check is the precision used in the calculation. In some cases float32 is as good as float64 but much faster. In many cases there is no difference.
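A quick way to see the storage side of that trade-off without any third-party library is the standard array module, where type code "f" is single precision and "d" is double precision. Speed is not shown here, since that depends on the numerical library and hardware; this sketch only illustrates the halved storage and the rounding you accept in exchange.

```python
from array import array

values = [0.1, 0.2, 0.3]

single = array("f", values)  # 32-bit floats
double = array("d", values)  # 64-bit floats

bytes_single = single.itemsize * len(single)
bytes_double = double.itemsize * len(double)

# Error introduced by storing 0.1 in single precision.
rounding_error = abs(single[0] - 0.1)
```

Half the bytes per value also means half the memory bandwidth per value, which is often where the speed difference comes from.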
Multi-threading
Python - did I mention? - is a great programming language, but one of its shortcomings is that in its basic form it runs a single thread. So, no matter how many cores you have in your system, the wall clock time is always the same. The result is that one of the cores is at 100 %, and the others spend their time idling. Making things parallel and having multiple threads may improve your performance by a factor of, e.g., 3 in a 4-core machine.
It is usually a very good idea if you can split your problem into small independent parts. It helps with many performance bottlenecks.
And do not expect technology to come to rescue. If the code is not written to be parallel, it is very difficult for a machine to make it parallel.
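One common way to get past the single-thread limit for CPU-bound work is to split the data into independent chunks and hand them to worker processes. The sketch below uses concurrent.futures from the standard library; the function names and the sum-of-squares workload are placeholders for your own computation.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split data into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_sum_of_squares(chunk):
    """Work done independently on each chunk (must be a top-level function)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Process the chunks in separate worker processes and combine the results."""
    chunks = split_into_chunks(data, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(chunk_sum_of_squares, chunks))
```

Call parallel_sum_of_squares from inside an if __name__ == "__main__": block in a script, since some platforms start workers by re-importing the main module. For a workload this small, the process start-up cost will outweigh any gain.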
GPUs
Your machine may have a great GPU with maybe 1536 number-hungry cores ready to crunch everything you toss at them. The bad news is that making GPU code is a bit different from writing CPU code. There are some slightly generic APIs around (CUDA, OpenCL), but if you are not accustomed to writing parallel code for GPUs, prepare for a steepish learning curve. On the other hand, it is likely someone has already written the library you need, and then you only need to hook to that.
With GPUs the sheer number-crunching power is impressive, almost frightening. We may talk about 3 TFLOPS (3 x 10^12 single-precision floating-point ops per second). The problem there is how to get the data to the GPU cores, because the memory bandwidth will become the limiting factor. This means that even though using GPUs is a good idea in many cases, there are a lot of cases where there is no gain.
Typically, if you are performing a lot of local operations on the image, the operations are easy to make parallel, and they fit well a GPU. If you are doing global operations, the situation is a bit more complicated. A FFT requires information from all over the image, and thus the standard algorithm does not work well with GPUs. (There are GPU-based algorithms for FFTs, and they sometimes make things much faster.)
Also, beware that making your algorithms run on a GPU binds you to that GPU. The portability of your code across OSes or machines suffers.
Buy some performance
Also, one important thing to consider is if you need to run your algorithm once, once in a while, or in real time. Sometimes the solution is as easy as buying time from a larger computer. For a dollar or two an hour you can buy time from quite fast machines with a lot of resources. It is simpler and often cheaper than you would think. Also GPU capacity can be bought easily for a similar price.
One possibly slightly under-advertised property of some cloud services is that in some cases the IO speed of the virtual machines is extremely good compared to physical machines. The difference comes from the fact that there are no spinning platters with the average penalty of half-revolution per data seek. This may be important with data-intensive applications, especially if you work with a large number of files and access them in a non-linear way.
I am afraid you cannot speed up your program just by running it on a powerful computer. I had this issue a while back. I first used Python (very slow), then moved to C (slow), and then had to use other tricks and techniques. For example, it is sometimes possible to apply some dimensionality reduction to speed things up while still getting reasonably accurate results, or, as you mentioned, to use multiprocessing techniques.
Since you are dealing with an image processing problem, you do a lot of matrix operations, and a GPU would for sure be a great help. There are some nice and active CUDA wrappers in Python that you can easily use without knowing too much CUDA. I tried Theano, pycuda and scikit-cuda (there should be more than that by now).

Experience with using h5py to do analytical work on big data in Python?

I do a lot of statistical work and use Python as my main language. Some of the data sets I work with though can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But, I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).
I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).
Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB etc) need to hold data in memory. SAS continues to have a strong selling point here because while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.
HDF5 advantages:
data can be inspected conveniently using the h5view application, h5py/ipython and the h5* commandline tools
APIs are available for different platforms and languages
structure data using groups
annotating data using attributes
worry-free built-in data compression
io on single datasets is fast
HDF5 pitfalls:
Performance breaks down if an h5 file contains too many datasets/groups (> 1000), because traversing them is very slow. On the other hand, io is fast for a few big datasets.
Advanced data queries (SQL like) are clumsy to implement and slow (consider SQLite in that case)
HDF5 is not thread-safe in all cases: one has to ensure that the library was compiled with the correct options
changing h5 datasets (resize, delete etc.) blows up the file size (in the best case) or is impossible (in the worst case) (the whole h5 file has to be copied to flatten it again)
I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.
Three reasons:
you can mine the source code of any of those packages for ideas that might help you generally
you might find the package names useful in searching for Python equivalents; a lot of R users are Python users, too
under some circumstances, it might prove convenient to just link to R for a particular analysis using one of the above-linked packages and then draw the results back into Python
Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.
