I am trying to use the Python pandas library with time series data that go back before AD 1. Pandas' datetime objects apparently rely on NumPy's datetime64, whose range is limited to 1/1/1 - 9999/12/31.
It is possible to index a DataFrame with a non-datetime index, but let's face it: the real strength of pandas (e.g. resampling) shines when rows are indexed by a proper datetime object. Without the resample() method I may not need pandas at all.
Several Python projects have implemented alternate calendars, notably astropy and FlexiDate, but neither produces a datetime object. I just learned that the NumPy developers do not plan to support non-standard calendars anytime soon.
Before posting a ticket on the pandas project GitHub, I am hereby appealing to the StackOverflow hive mind: do you know of a not-too-clunky way to generate a datetime index that pandas will recognize and that handles non-standard calendars?
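For anyone reproducing the problem, this is roughly what the limitation looks like with nanosecond-based pandas Timestamps (behaviour may differ in newer releases that support coarser resolutions):

```python
import pandas as pd

# Nanosecond-resolution Timestamps only span roughly 1677-2262:
print(pd.Timestamp.min)   # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)   # 2262-04-11 23:47:16.854775807

# Dates outside that window (let alone BC dates) raise OutOfBoundsDatetime:
try:
    pd.to_datetime("0001-01-01")
except pd.errors.OutOfBoundsDatetime as exc:
    print("out of bounds:", exc)
```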
Related
The documentation of pandas.Timestamp states a concept well-known to every pandas user:
Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases.
But I don't understand why pandas.Timestamp is needed at all.
Why is, or was, it useful to have an object different from Python's Datetime? Wouldn't it be cleaner to simply build pandas.DatetimeIndex out of Datetimes?
You can go through the pandas documentation for the details:
"pandas.Timestamp" is a replacement for python datetime.datetime for
Padas usage.
Timestamp is the pandas equivalent of python’s Datetime and is
interchangeable with it in most cases. It’s the type used for the
entries that make up a DatetimeIndex, and other timeseries oriented
data structures in pandas.
Notes
There are essentially three calling conventions for the constructor.
The primary form accepts four parameters. They can be passed by
position or keyword.
The other two forms mimic the parameters from datetime.datetime. They
can be passed by either position or keyword, but not both mixed
together.
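To make those calling conventions concrete, here are a couple of examples (adapted from the pandas documentation):

```python
import datetime
import pandas as pd

# Primary form: a single value plus unit/timezone-style parameters.
ts1 = pd.Timestamp(1513393355.5, unit='s')

# datetime.datetime-style forms: all positional or all keyword, not mixed.
ts2 = pd.Timestamp(2017, 1, 1, 12)
ts3 = pd.Timestamp(year=2017, month=1, day=1, hour=12)

# Timestamp subclasses datetime.datetime, which is why the two are
# interchangeable in most cases.
print(isinstance(ts3, datetime.datetime))  # True
```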
Timedeltas are differences in times, expressed in difference units,
e.g. days, hours, minutes, seconds. They can be both positive and
negative.
Timedelta is a subclass of datetime.timedelta, and behaves in a
similar manner, but allows compatibility with np.timedelta64 types
as well as a host of custom representation, parsing, and attributes.
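A quick illustration of that compatibility:

```python
import datetime
import pandas as pd

td = pd.Timedelta('1 days 2 hours')

# Behaves like datetime.timedelta ...
print(isinstance(td, datetime.timedelta))   # True
print(td + datetime.timedelta(hours=1))     # 1 days 03:00:00

# ... but also interoperates with np.timedelta64 and adds extra attributes:
print(td.to_timedelta64())                  # 93600000000000 nanoseconds
print(td.components)                        # Components(days=1, hours=2, ...)
```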
I would say that because pandas works better with time series data, Timestamp has ended up as a kind of wrapper around the original built-in datetime module.
The weaknesses of Python's datetime format inspired the NumPy team to
add a set of native time series data types to NumPy. The datetime64
dtype encodes dates as 64-bit integers, and thus allows arrays of
dates to be represented very compactly.
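A small sketch of what that compact encoding buys you:

```python
import numpy as np

# Each date is stored as a single 64-bit integer at a chosen resolution,
# so an array of dates is compact and supports vectorized arithmetic:
dates = np.array(['2015-07-04', '2015-07-05', '2015-07-06'], dtype='datetime64[D]')
print(dates.dtype)            # datetime64[D]
print(dates + np.arange(3))   # ['2015-07-04' '2015-07-06' '2015-07-08']
```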
I have been using numpy/scipy for data analysis. I recently started to learn Pandas.
I have gone through a few tutorials and I am trying to understand what the major improvements of Pandas over NumPy/SciPy are.
It seems to me that the key idea of Pandas is to wrap up different NumPy arrays in a DataFrame, with some utility functions around it.
Is there something revolutionary about Pandas that I just stupidly missed?
Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish its goals, along with some key Cython code. It can be seen as a simpler API to that functionality, with the addition of key utilities like joins and simpler group-by capability that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.
For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):
Array of Structures (independent-storage of disparate types instead of the contiguous storage of structured arrays in NumPy) --- this will allow faster processing in many cases.
Simpler interfaces to common operations (file-loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
Index arrays, which mean that operations are always aligned instead of you having to keep track of alignment yourself (see the short sketch after this list).
Split-Apply-Combine is a powerful way of thinking about and implementing data processing.
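Here is a minimal sketch of the index alignment point from the list above:

```python
import pandas as pd

# Two Series with partially overlapping index labels:
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['y', 'z', 'w'])

# Arithmetic aligns on labels automatically; with bare NumPy arrays you
# would have to line the positions up yourself.
print(a + b)
# w     NaN
# x     NaN
# y    12.0
# z    23.0
```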
However, there are downsides to Pandas:
Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need to, slowing down code that gets called over and over again.
Pandas typically takes up more memory as it is generous with the creation of object arrays to solve otherwise sticky problems of things like string handling.
If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
I feel like characterising Pandas as "improving on" Numpy/SciPy misses much of the point. Numpy/Scipy are quite focussed on efficient numeric calculation and solving numeric problems of the sort that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.
Pandas is much more aligned with problems that start with data stored in files or databases and which contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in Numpy/SciPy.
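As a sketch of that one-liner (the connection and the `prices` table here are hypothetical):

```python
import sqlite3
import pandas as pd

# Any DB-API or SQLAlchemy connection works; sqlite3 is just the simplest.
conn = sqlite3.connect("example.db")
df = pd.read_sql_query("SELECT date, ticker, price FROM prices", conn)
```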
For data featuring strings or discrete rather than continuous data, there is no equivalent to the groupby capability, or the database-like joining of tables on matching values.
For time series, there is the massive benefit of handling time series data using a datetime index, which allows you to resample smoothly to different intervals, fill in values and plot your series incredibly easily.
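For example, a rough sketch with synthetic minute-level data:

```python
import numpy as np
import pandas as pd

# Synthetic minute-level series indexed by datetime:
idx = pd.date_range("2021-01-01", periods=1440, freq="min")
s = pd.Series(np.random.randn(1440), index=idx)

# Resample to hourly means and forward-fill any gaps ("H" is the hourly
# alias; newer pandas also accepts "h"):
hourly = s.resample("H").mean().ffill()
print(hourly.head())
```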
Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats with a uniform interface.
There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy stuff.
A main point is that it introduces new data structures like DataFrames and Panels and has good interfaces to other structures and libraries. So in general it's more a great extension to the Python ecosystem than an improvement over other libraries. For me it's a great tool among others like NumPy and bcolz. Often I use it to reshape my data and get an overview before starting data mining, etc.
I'm trying to decide the best way to store my time series data in mongodb. Outside of mongo I'm working with them as numpy arrays or pandas DataFrames. I have seen a number of people (such as in this post) recommend pickling it and storing the binary, but I was under the impression that pickle should never be used for long term storage. Is that only true for data structures that might have underlying code changes to their class structures? To put it another way, numpy arrays are probably stable so fine to pickle, but pandas DataFrames might go bad as pandas is still evolving?
UPDATE:
A friend pointed me to this, which seems to be a good start on exactly what I want:
http://docs.scipy.org/doc/numpy/reference/routines.io.html
Numpy has its own binary file format, which should be long term storage stable. Once I get it actually working I'll come back and post my code. If someone else has made this work already I'll happily accept your answer.
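In case it helps, this is roughly what the .npy round trip looks like in memory; actually storing the resulting bytes in MongoDB (e.g. in a binary field or GridFS) is left out:

```python
import io
import numpy as np

arr = np.random.randn(1000, 3)

# Serialize to NumPy's documented .npy binary format, in memory:
buf = io.BytesIO()
np.save(buf, arr)
raw = buf.getvalue()     # bytes, suitable for a MongoDB binary field

# Read it back later:
restored = np.load(io.BytesIO(raw))
assert np.array_equal(arr, restored)
```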
We've built an open source library for storing numeric data (Pandas, numpy, etc.) in MongoDB:
https://github.com/manahl/arctic
Best of all, it's easy to use, pretty fast and supports data versioning, multiple data libraries and more.
Before, there were larry and structured/record arrays in NumPy, but I wonder whether they are still used with any frequency given the rapid development of the pandas package. Coming from R, I would always get stuck having to unpack record arrays to modify values from multiple columns and reassign them back into the structure, so I'm glad that pandas now allows this for its data frames. I wonder if there are any uses for which record arrays are still superior (do they have some useful methods that pandas does not have)?
Here's a good explanation and simple comparison between pandas and numpy record arrays - Normalize/Standardize a numpy recarray
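As a small illustration of the multi-column pain point mentioned in the question (using a made-up two-field record array):

```python
import numpy as np
import pandas as pd

# A small structured/record array with two fields:
rec = np.array([(1, 2.0), (3, 4.0)], dtype=[('a', 'i8'), ('b', 'f8')]).view(np.recarray)

# With NumPy you generally update one field at a time:
rec.a = rec.a * 10
rec.b = rec.b * 10

# With pandas, several columns can be modified and reassigned in one step:
df = pd.DataFrame.from_records(rec)
df[['a', 'b']] = df[['a', 'b']] * 10

# ... and you can round-trip back to a record array if needed:
rec2 = df.to_records(index=False)
```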
There are at least four data structures in pandas:
->Series
->DataFrame
->DataMatrix
->Panel
What are the use cases for these? The documentation seems to highlight Series and DataFrame.
Please give examples of use cases. I know where the doc is located.
The 3 main data structures are Series (1-dimensional), DataFrame (2D), and Panel (3D) (http://pandas.pydata.org/pandas-docs/stable/dsintro.html). A DataFrame is like a collection of Series while a Panel is like a collection of DataFrames. In many problem domains (statistics, economics, social sciences, ...) these are the 3 major kinds of data that are dealt with.
http://pandas.pydata.org/pandas-docs/stable/overview.html
Also, DataMatrix has been deprecated. In pandas >= 0.4.0, DataMatrix is just an alias for DataFrame.
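A tiny sketch of the "collection of Series" idea described above (note that Panel has since been deprecated and removed in later pandas versions):

```python
import pandas as pd

# A Series is a labelled 1-D array:
price = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
volume = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame behaves like a dict of Series sharing one index:
df = pd.DataFrame({'price': price, 'volume': volume})
print(df['price'])   # pulling out one column gives a Series again
```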