I'm doing Pandas data wrangling with several nested functions on a 1 GB+ CSV file (one of possibly many). I've tried various code optimization techniques and I'm still not happy with the performance. I realize there are possible optimizations with Cython etc. that I haven't explored yet, but before I go further down the path of complex code optimizations I would like to understand the most natural next steps for throwing more computing power at the problem.
What techniques would you suggest trying first? I understand that going the Spark route would require rewriting many of the Pandas operations as Spark DataFrame operations. I have also read about Dask. Any suggestions and advantages/disadvantages would be welcome.
I'm trying to think of a reason (other than having only a small dataset) why you wouldn't use PySpark DataFrames.
Can everything that can be done with Pandas DataFrames be reproduced with PySpark DataFrames?
Are there some Pandas-exclusive functions, or some functions that are incredibly difficult to reproduce with PySpark?
Spark is a distributed processing framework. In addition to supporting the DataFrame functionality, it needs to run a JVM, a scheduler, cross-process/machine communication, it spins up databases, etc. So while the narrow answer to your question is no, not exactly everything is implemented in the same way, the wider answer is that any distributed processing library naturally involves immense overhead. A lot of work goes into reducing this overhead, but it will never be trivial.
Dask (another distributed processing library with a DataFrame implementation) has a great section on best practices. In it, the first recommendation is not to use dask unless you have to:
Parallelism brings extra complexity and overhead. Sometimes it’s necessary for larger problems, but often it’s not. Before adding a parallel computing system like Dask to your workload you may want to first try some alternatives:
Use better algorithms or data structures: NumPy, Pandas, Scikit-Learn may have faster functions for what you’re trying to do. It may be worth consulting with an expert or reading through their docs again to find a better pre-built algorithm.
Better file formats: Efficient binary formats that support random access can often help you manage larger-than-memory datasets efficiently and simply. See the Store Data Efficiently section below.
Compiled code: Compiling your Python code with Numba or Cython might make parallelism unnecessary. Or you might use the multi-core parallelism available within those libraries.
Sampling: Even if you have a lot of data, there might not be much advantage from using all of it. By sampling intelligently you might be able to derive the same insight from a much more manageable subset.
Profile: If you’re trying to speed up slow code it’s important that you first understand why it is slow. Modest time investments in profiling your code can help you to identify what is slowing you down. This information can help you make better decisions about if parallelism is likely to help, or if other approaches are likely to be more effective.
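As a minimal illustration of that last point (profiling), here is a sketch; the wrangle() function, the "key" column, and "data.csv" are hypothetical stand-ins for your own pipeline and file:

import cProfile
import pstats

import pandas as pd

def wrangle(path):
    # Hypothetical pipeline: in practice, profiling often shows that CSV
    # parsing or row-wise Python loops dominate the runtime.
    df = pd.read_csv(path)
    return df.groupby("key").sum()

cProfile.run('wrangle("data.csv")', "wrangle.prof")
pstats.Stats("wrangle.prof").sort_stats("cumulative").print_stats(10)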
There's a very good reason for this. In-memory, single-threaded applications are always going to be much faster for small datasets.
Very simplistically, if you imagine the single-threaded runtime for your workflow is T, the wall time of the distributed workflow will be roughly T_parallelizable / n_cores + T_not_parallelizable + overhead. For PySpark, this overhead is very significant. It's worth it a lot of the time, but it's not nothing.
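To make that formula concrete, a toy calculation in which every number is an assumption, purely to show the shape of the trade-off:

# All numbers below are hypothetical, only to illustrate the formula above.
T_parallelizable = 90.0      # seconds of work that can be split across cores
T_not_parallelizable = 10.0  # serial part of the workflow
overhead = 20.0              # assumed scheduler/serialization/JVM startup cost
n_cores = 8

wall_time = T_parallelizable / n_cores + T_not_parallelizable + overhead
print(wall_time)  # 41.25 s, vs. 100 s single-threaded: a win, but nowhere near 8x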
This question is a follow-up to this one: How to increase the performance of a Python loop?
Basically I have a script that takes as inputs a few csv files and after some data manipulation it outputs 2 csv files. In this script there is a loop on a table with ~14 million rows whose objective is to create another table with the same number of rows. I am working with Python on this project but the loop is just too slow (I know this because I used the tqdm package to measure speed).
So I'm looking for suggestions on what I should use in order to achieve my objective. Ideally the technology would be free and wouldn't take long to learn. I have already received a few suggestions from other people: Cython and Power BI. The latter is paid and the former seems complicated, but I am willing to learn it if it is indeed useful.
If more details are necessary just ask. Thanks.
Read about Vaex. Vaex can help you process your data much, much faster. You should first convert your CSV file to HDF5 format using the Vaex library; CSV files are very slow to read and write.
Vaex will do the multiprocessing for your operations.
Also check whether you can vectorize your computation (you probably can). I glanced at your code: try to avoid using lists and use NumPy arrays instead where you can.
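A minimal sketch of that one-off conversion, assuming Vaex is installed and your file is called big.csv (exact arguments and the output filename can vary by Vaex version):

import vaex

# One-off conversion: writes an HDF5 copy next to the CSV and returns a
# memory-mappable Vaex DataFrame, processing the CSV in chunks.
df = vaex.from_csv("big.csv", convert=True, chunk_size=5_000_000)

# On later runs, open the converted file directly; this is much faster
# than re-parsing the CSV every time.
df = vaex.open("big.csv.hdf5")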
If you're willing to stay with Python, I would probably recommend using the multiprocessing module. Corey Schafer has a good tutorial on how it works here.
Multiprocessing is a bit like threading, but it uses multiple interpreter processes to complete the main task, unlike the threading module, which interleaves threads within a single interpreter (and is therefore limited by the GIL for CPU-bound work).
Divide up the work across however many cores your CPU has:
import os
cores = os.cpu_count()
This should speed up the workload by spreading the work across all of your machine's cores.
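Putting that together, a rough sketch; process_chunk, input.csv, and output.csv are hypothetical placeholders for your own per-chunk logic and files:

import os
from multiprocessing import Pool

import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: replace with your own row/column transformations.
    return chunk

if __name__ == "__main__":
    df = pd.read_csv("input.csv")              # hypothetical input file
    cores = os.cpu_count()
    step = -(-len(df) // cores)                # ceiling division: rows per chunk
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]

    with Pool(cores) as pool:                  # one worker process per core
        result = pd.concat(pool.map(process_chunk, chunks))

    result.to_csv("output.csv", index=False)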
14 million rows is very achievable with Python, but you can't do it with inefficient looping methods. I had a glance at the code you posted here, and saw that you're using iterrows(). iterrows() is fine for small DataFrames, but it is (as you know) painfully slow when used on DataFrames the size of yours. Instead, I suggest you start by looking into the apply() method (see docs here). That should get you up to speed!
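For illustration, the same made-up per-row calculation done three ways; only the column names are hypothetical:

import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "qty": [10, 20, 30]})

# Slow: iterrows() builds a Series object for every row.
totals = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Faster: apply() over rows avoids some of that per-row overhead.
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fastest, when the operation allows it: a fully vectorized column expression.
df["total"] = df["price"] * df["qty"]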
I am currently working on my thesis, which involves dealing with quite a sizable dataset: ~4 million observations and ~260 thousand features. It is a dataset of chess games, where most of the features are player dummies (130k for each colour).
As for the hardware and the software, I have around 12GB of RAM on this computer. I am doing all my work in Python 3.5 and use mainly pandas and scikit-learn packages.
My problem is that I obviously can't load this amount of data into RAM. What I would love to do is generate the dummy variables, then slice the database into a thousand or so chunks, apply the Random Forest to each, and aggregate the results again.
However, to do that I would first need to be able to create the dummy variables, which I am not able to do due to a memory error, even if I use sparse matrices. Theoretically, I could just slice up the database first and then create the dummy variables. However, the effect of that would be that I would have different features in different slices, so I'm not sure how to aggregate such results.
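To illustrate the mismatch with a toy example (the player IDs are made up):

import pandas as pd

chunk_a = pd.DataFrame({"white_player": ["p1", "p2"]})
chunk_b = pd.DataFrame({"white_player": ["p2", "p3"]})

print(pd.get_dummies(chunk_a["white_player"]).columns.tolist())  # ['p1', 'p2']
print(pd.get_dummies(chunk_b["white_player"]).columns.tolist())  # ['p2', 'p3']
# The two chunks end up with different dummy columns, so the per-chunk
# Random Forests would not share a common feature space.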
My questions:
1. How would you guys approach this problem? Is there a way to "merge" the results of my estimation despite having different features in different "chunks" of data?
2. Perhaps it is possible to avoid this problem altogether by renting a server. Are there any trial versions of such services? I'm not sure exactly how much CPU/RAM I would need to complete this task.
Thanks for your help, any kind of tips will be appreciated :)
I would suggest you give CloudxLab a try.
Though it is not free, it is quite affordable ($25 for a month). It provides a complete environment to experiment with various tools such as HDFS, MapReduce, Hive, Pig, Kafka, Spark, Scala, Sqoop, Oozie, Mahout, MLlib, ZooKeeper, R, etc. Many of the popular trainers are using CloudxLab.
I have been using numpy/scipy for data analysis. I recently started to learn Pandas.
I have gone through a few tutorials and I am trying to understand what the major improvements of Pandas over NumPy/SciPy are.
It seems to me that the key idea of Pandas is to wrap up different NumPy arrays in a DataFrame, with some utility functions around them.
Is there something revolutionary about Pandas that I just stupidly missed?
Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish its goals, along with some key Cython code. It can be seen as a simpler API to that functionality, with the addition of key utilities like joins and a simpler group-by capability that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.
For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):
Array of Structures (independent-storage of disparate types instead of the contiguous storage of structured arrays in NumPy) --- this will allow faster processing in many cases.
Simpler interfaces to common operations (file-loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
Index arrays which mean that operations are always aligned instead of having to keep track of alignment yourself.
Split-Apply-Combine is a powerful way of thinking about and implementing data processing.
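For instance, a brief split-apply-combine and index-alignment sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "temp": [3.1, 4.2, 6.0, 5.5],
})

# Split by city, apply a mean, and combine back into one result, all in one call.
mean_temp = df.groupby("city")["temp"].mean()

# Index alignment: the per-city means broadcast back to the original rows automatically.
df["anomaly"] = df["temp"] - df.groupby("city")["temp"].transform("mean")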
However, there are downsides to Pandas:
Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need to, which slows down code that gets called over and over again.
Pandas typically takes up more memory as it is generous with the creation of object arrays to solve otherwise sticky problems of things like string handling.
If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
I feel like characterising Pandas as "improving on" NumPy/SciPy misses much of the point. NumPy/SciPy are quite focused on efficient numeric calculation and on solving the sort of numeric problems that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.
Pandas is much more aligned with problems that start with data stored in files or databases and which contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can call read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in NumPy/SciPy.
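A minimal sketch of that one-liner, using sqlite3 for the connection; the database file, table, and column names are made up:

import sqlite3

import pandas as pd

conn = sqlite3.connect("games.db")                            # hypothetical database file
df = pd.read_sql_query("SELECT player, rating FROM results", conn)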
For data featuring strings or discrete rather than continuous data, there is no equivalent to the groupby capability, or the database-like joining of tables on matching values.
For time series, there is the massive benefit of handling time series data using a datetime index, which allows you to resample smoothly to different intervals, fill in values and plot your series incredibly easily.
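For example, a tiny resampling sketch with made-up daily data:

import pandas as pd

ts = pd.Series(
    range(60),
    index=pd.date_range("2024-01-01", periods=60, freq="D"),
)

weekly = ts.resample("W").mean()   # downsample daily values to weekly averages
weekly.plot()                      # plotting works directly off the datetime index (needs matplotlib)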
Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats with a uniform interface.
There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy stuff.
A main point is that it introduces new data structures like DataFrames, Panels, etc. and has good interfaces to other structures and libraries. So in general it's more a great extension to the Python ecosystem than an improvement over other libraries. For me it's a great tool among others like NumPy and bcolz. I often use it to reshape my data and get an overview before starting to do data mining, etc.
I am currently using Python pandas and want to know if there is a way to output the data from pandas into Julia DataFrames and vice versa. (I think you can call Python from Julia with PyCall, but I am not sure if it works with DataFrames.) Is there a way to call Julia from Python and have it take in pandas DataFrames (without saving to another file format like CSV)?
When would it be advantageous to use Julia DataFrames rather than Pandas, other than with extremely large datasets and workloads with many loops (like neural networks)?
So there is a library developed for this:
PyJulia is a library used to interface with Julia from Python 2 and 3.
https://github.com/JuliaLang/pyjulia
It is experimental, but it somewhat works.
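A minimal usage sketch, assuming Julia and the pyjulia package are both installed (exact setup and behaviour can vary by version):

from julia import Main  # pyjulia's entry point into the Julia runtime

Main.xs = [1.0, 2.0, 3.0]        # plain Python containers are converted to Julia arrays
print(Main.eval("sum(xs)"))      # run Julia code against them -> 6.0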
Secondly, Julia also has a front end for pandas, which is Pandas.jl:
https://github.com/malmaud/Pandas.jl
It looks to be just a wrapper around pandas, but you might be able to execute multiple functions using Julia's parallel features.
As for which is better: so far pandas has faster I/O, according to this: Reading csv in Julia is slow compared to Python.
I'm a novice at this sort of thing, but I have definitely been using both as of late. Truth be told, they seem quite comparable, but there is far more documentation, Stack Overflow questions, etc. pertaining to Pandas, so I would give it a slight edge. Do not let that fact discourage you, however, because Julia has some amazing functionality that I'm only beginning to understand. With large datasets, say over a couple of gigs, both packages are pretty slow, but again Pandas seems to have a slight edge (by no means would I consider my benchmarking to be definitive).
Without a more nuanced understanding of what you are trying to achieve, it's difficult for me to envision a circumstance where you would even want to call a Pandas function while working with a Julia DataFrame or vice versa. Unless you are doing something pretty cerebral or working with really large datasets, I can't see going too wrong with either. When you say "output the data", what do you mean? Couldn't you write the Pandas data object to a file and then open/manipulate that file as a Julia DataFrame (as you mention)? Again, unless you have a really good machine, reading gigs of data into either pandas or a Julia DataFrame is tedious and can be prohibitively slow.