I have been using numpy/scipy for data analysis. I recently started to learn Pandas.
I have gone through a few tutorials and I am trying to understand what the major improvements of Pandas over NumPy/SciPy are.
It seems to me that the key idea of Pandas is to wrap up different numpy arrays in a Data Frame, with some utility functions around it.
Is there something revolutionary about Pandas that I just stupidly missed?
Pandas is not particularly revolutionary; it builds on the NumPy and SciPy ecosystem to accomplish its goals, along with some key Cython code. It can be seen as a simpler API to that functionality, with the addition of key utilities like joins and simpler group-by capabilities that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.
For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):
Independent, column-wise storage of disparate types (instead of the contiguous "array of structures" storage of NumPy's structured arrays) --- this allows faster processing in many cases.
Simpler interfaces to common operations (file-loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
Index arrays, which mean that operations are automatically aligned instead of you having to keep track of alignment yourself.
Split-Apply-Combine is a powerful way of thinking about and implementing data processing (a short sketch of alignment and group-by follows this list).
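To make the alignment and split-apply-combine points concrete, here is a minimal sketch with made-up data and column names:

```python
import pandas as pd

# Two Series with partially overlapping indexes.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on the index automatically; "x" has no match, so it becomes NaN.
print(a + b)

# Split-apply-combine: group rows by a key column and aggregate each group.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
print(df.groupby("key")["value"].sum())
```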
However, there are downsides to Pandas:
Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need to, slowing down code that gets called over and over again.
Pandas typically takes up more memory as it is generous with the creation of object arrays to solve otherwise sticky problems of things like string handling.
If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
I feel like characterising Pandas as "improving on" Numpy/SciPy misses much of the point. Numpy/Scipy are quite focussed on efficient numeric calculation and solving numeric problems of the sort that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.
Pandas is much more aligned with problems that start with data stored in files or databases and which contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can call read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in NumPy/SciPy.
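For example, a minimal sketch assuming a hypothetical SQLite database and table name:

```python
import sqlite3
import pandas as pd

# "example.db" and the "measurements" table are purely illustrative.
conn = sqlite3.connect("example.db")
df = pd.read_sql_query("SELECT * FROM measurements", conn)
conn.close()

print(df.head())
```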
For data featuring strings, or discrete rather than continuous values, NumPy/SciPy offers no equivalent to the groupby capability or the database-like joining of tables on matching values.
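A small sketch of both, using made-up tables and column names:

```python
import pandas as pd

# Hypothetical tables: orders keyed by customer_id, joined to a customers table.
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 15.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bob"]})

# Database-like join on matching values.
merged = orders.merge(customers, on="customer_id", how="left")

# Group by a discrete (string) column and aggregate.
print(merged.groupby("name")["amount"].sum())
```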
For time series, there is the massive benefit of handling time series data using a datetime index, which allows you to resample smoothly to different intervals, fill in values and plot your series incredibly easily.
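A minimal resampling sketch with a made-up minutely series (the exact frequency aliases accepted can differ slightly between pandas versions):

```python
import numpy as np
import pandas as pd

# A hypothetical minutely series, resampled to hourly means with gaps filled.
idx = pd.date_range("2021-01-01", periods=180, freq="min")
ts = pd.Series(np.random.randn(180), index=idx)

hourly = ts.resample("h").mean()   # downsample to hourly averages
filled = hourly.ffill()            # forward-fill any missing intervals
filled.plot()                      # quick plot (requires matplotlib)
```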
Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats with a uniform interface.
There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy stuff.
A main point is that it introduces new data structures like DataFrames, Panels, etc., and has good interfaces to other structures and libraries. So in general it is more a great extension to the Python ecosystem than an improvement over other libraries. For me it is a great tool among others like NumPy and bcolz. Often I use it to reshape my data and get an overview before starting to do data mining, etc.
Related
I'm running Pandas data wrangling with several nested functions on a 1 GB+ CSV file (one of possibly many). I've tried various code-optimization techniques and I'm still not happy with the performance. I realize there are possible optimizations with Cython etc. which I have not explored. Before I go for further complex code optimizations, I would like to understand the most natural next steps for throwing more computing power at the problem.
What techniques would you suggest trying first? I understand that going the Spark route would require rewriting many of the Pandas operations against Spark DataFrames. I have also read about Dask. Any suggestions and advantages/disadvantages would be welcome.
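Not an answer in itself, but to make the Dask option concrete, this is roughly the shape of that route; the file pattern and column names here are hypothetical:

```python
import dask.dataframe as dd

# Hypothetical file pattern and column names, just to show the shape of the API.
ddf = dd.read_csv("data/part-*.csv")

# Most pandas-style operations build a lazy task graph...
result = ddf.groupby("user_id")["amount"].sum()

# ...and only run (possibly in parallel and out of core) when you call compute().
print(result.compute())
```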
I am currently working on large datasets in CSV format. In some cases, it is faster to use Excel functions to get the work done. However, I want to write Python scripts to read/write CSV files and carry out the required functions. In what cases would Python scripts be better than Excel functions for data manipulation tasks? What would be the long-term advantages?
Using Python is recommended for the following scenarios:
Repeated action: performing a similar set of actions over a similar dataset repeatedly. For example, say you get monthly forecast data and have to perform various slicing, dicing, and plotting. Here the structure of the data and the steps of the analysis are more or less the same, but the data differs every month. Using Python and Pandas will save you a lot of time and also reduce manual error (see the short sketch after this list).
Exploratory analysis: once you establish a certain familiarity with Pandas, NumPy, and Matplotlib, analysis using these Python libraries is faster and much more efficient than Excel analysis. One simple use case to justify this is backtracking. With Pandas, you can quickly trace back and recover the dataset in its original form or an earlier analysed form. With Excel, you can get lost in a maze of analysis and be unable to backtrack to an earlier form beyond Ctrl+Z.
Teaching tool: in my opinion, this is the most underutilized feature. An IPython notebook can be an excellent teaching tool and reference document for data analysis. Using it, you can transfer knowledge between colleagues much more efficiently than by sharing a complicated Excel file.
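As promised above, a minimal sketch of the "repeated action" scenario; the file names and column names ("date", "region", "forecast") are hypothetical:

```python
import pandas as pd

def monthly_report(csv_path):
    """Hypothetical reusable analysis for one month's forecast file."""
    df = pd.read_csv(csv_path, parse_dates=["date"])
    summary = df.groupby("region")["forecast"].mean()
    summary.plot(kind="bar")   # quick visual check (requires matplotlib)
    return summary

# Re-run the same steps on each new month's data without manual rework.
# monthly_report("forecast_2021_01.csv")
# monthly_report("forecast_2021_02.csv")
```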
After learning Python, you are more flexible. The operations you can do through the user interface of MS Excel are limited, whereas there are essentially no limits if you use Python.
Another benefit is that you automate the modifications, e.g. you can re-use them or re-apply them to a different dataset. The speed depends heavily on the algorithm and library you use and on the operation.
You can also use VBA scripts/macros in Excel to automate things, but Python is usually less cumbersome and more flexible.
Pandas Coding practice: Is it better to build functions returning a DataFrame or Series?
This is a pretty fundamental question (and apologies if it has already been asked), but it would be great to hear views on this. I am leaning towards Series, as it appears to be the more fundamental building block (i.e. indexing into a DataFrame returns a Series), but there are some limits on the functionality that can be applied to a Series. Equally, the same argument could be taken one step further to NumPy arrays, where I begin to lose development speed.
The most obvious constraint you should consider when building such functions is memory. There are many techniques for estimating memory usage (linked below), including writing your DataFrames to .csv files and checking their size on disk, or calling DataFrame.memory_usage(). However, if you're working with a small dataset, managing several DataFrames shouldn't be an issue.
How to estimate how much memory a Pandas' DataFrame will need?
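For instance, a quick sketch with made-up data showing the per-column estimate:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000_000), "b": ["text"] * 1_000_000})

# Per-column estimate in bytes; deep=True also counts the Python strings
# inside object columns, which is usually where the surprises are.
print(df.memory_usage(deep=True))
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
```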
That said, you can also structure the work as multiple functions and compare their processing-time statistics:
What do 'real', 'user' and 'sys' mean in the output of time(1)?
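As a rough sketch of that comparison using Python's timeit rather than the shell's time(1), with two hypothetical functions (one returning a Series, one a DataFrame):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randn(100_000)})

def as_series(frame):
    # Hypothetical transformation returning a Series.
    return frame["x"] * 2

def as_frame(frame):
    # The same transformation, wrapped back into a one-column DataFrame.
    return (frame["x"] * 2).to_frame("x")

print(timeit.timeit(lambda: as_series(df), number=1000))
print(timeit.timeit(lambda: as_frame(df), number=1000))
```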
There is really no more detail I can provide without clarity/specificity around the question above.
I am currently using Python pandas and want to know if there is a way to output the data from pandas into Julia DataFrames and vice versa. (I think you can call Python from Julia with PyCall, but I am not sure whether it works with DataFrames.) Is there a way to call Julia from Python and have it take in pandas DataFrames (without saving to another file format like CSV)?
When would it be advantageous to use Julia DataFrames rather than Pandas, other than for extremely large datasets and for running things with many loops (like neural networks)?
So there is a library developed for this:
PyJulia is a library for interfacing with Julia from Python 2 and 3:
https://github.com/JuliaLang/pyjulia
It is experimental, but it works to some extent.
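A minimal sketch of what calling Julia from Python via PyJulia can look like, assuming Julia is installed and PyJulia has been set up (e.g. `pip install julia` followed by `python -c "import julia; julia.install()"`); note that plain arrays convert cleanly, while a pandas DataFrame typically crosses the boundary as a wrapped Python object rather than a native Julia DataFrame:

```python
import pandas as pd
from julia import Main  # requires a working Julia + PyJulia installation

df = pd.DataFrame({"a": [1, 2, 3]})

# Assign a NumPy array into Julia's Main module and evaluate Julia code on it.
Main.xs = df["a"].to_numpy()
print(Main.eval("sum(xs)"))
```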
Secondly, Julia also has a front end for pandas, which is Pandas.jl:
https://github.com/malmaud/Pandas.jl
It looks to be just a wrapper for pandas, but you might be able to execute multiple functions using Julia's parallel features.
As for which is better: so far, pandas has faster I/O, according to this question: "reading csv in Julia is slow compared to Python".
I'm a novice at this sort of thing but have definitely been using both as of late. Truth be told, they seem quite comparable, but there is far more documentation, Stack Overflow questions, etc. pertaining to Pandas, so I would give it a slight edge. Do not let that fact discourage you, however, because Julia has some amazing functionality that I'm only beginning to understand. With large datasets, say over a couple of gigs, both packages are pretty slow, but again Pandas seems to have a slight edge (by no means would I consider my benchmarking definitive). Without a more nuanced understanding of what you are trying to achieve, it's difficult for me to envision a circumstance where you would even want to call a Pandas function while working with a Julia DataFrame or vice versa.
Unless you are doing something pretty cerebral or working with really large datasets, I can't see you going too wrong with either. When you say "output the data", what do you mean? Couldn't you write the Pandas data object to a file and then open/manipulate that file in a Julia DataFrame (as you mention)? Again, unless you have a really good machine, reading gigs of data into either pandas or a Julia DataFrame is tedious and can be prohibitively slow.
I'm trying to decide the best way to store my time series data in mongodb. Outside of mongo I'm working with them as numpy arrays or pandas DataFrames. I have seen a number of people (such as in this post) recommend pickling it and storing the binary, but I was under the impression that pickle should never be used for long term storage. Is that only true for data structures that might have underlying code changes to their class structures? To put it another way, numpy arrays are probably stable so fine to pickle, but pandas DataFrames might go bad as pandas is still evolving?
UPDATE:
A friend pointed me to this, which seems to be a good start on exactly what I want:
http://docs.scipy.org/doc/numpy/reference/routines.io.html
NumPy has its own binary file format, which should be stable for long-term storage. Once I get it actually working, I'll come back and post my code. If someone else has made this work already, I'll happily accept your answer.
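A rough sketch of that approach, serializing with NumPy's .npy format (rather than pickle) and storing the bytes in MongoDB; the database, collection, and document names are placeholders, and it assumes a local MongoDB instance:

```python
import io

import numpy as np
from bson.binary import Binary
from pymongo import MongoClient

arr = np.arange(10, dtype=np.float64)

# Serialize with NumPy's own .npy format instead of pickle.
buf = io.BytesIO()
np.save(buf, arr)

client = MongoClient()  # assumes MongoDB running on localhost
coll = client["tsdb"]["series"]
coll.insert_one({"name": "example", "data": Binary(buf.getvalue())})

# Round-trip: load the bytes back into an array.
doc = coll.find_one({"name": "example"})
restored = np.load(io.BytesIO(doc["data"]))
```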
We've built an open source library for storing numeric data (Pandas, numpy, etc.) in MongoDB:
https://github.com/manahl/arctic
Best of all, it's easy to use, pretty fast, and supports data versioning, multiple data libraries, and more.
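A minimal sketch based on the usage shown in the project's README; the library name "NASDAQ" and the symbol are placeholders, and it assumes a MongoDB instance on localhost:

```python
import pandas as pd
from arctic import Arctic

# Connect to a local MongoDB instance and create a versioned library.
store = Arctic("localhost")
store.initialize_library("NASDAQ")
library = store["NASDAQ"]

df = pd.DataFrame({"price": [1.0, 1.1, 1.2]},
                  index=pd.date_range("2021-01-01", periods=3))

# Each write creates a new version; read() returns the latest by default.
library.write("SYMBOL", df, metadata={"source": "example"})
item = library.read("SYMBOL")
print(item.data)
print(item.metadata)
```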