Why does pandas have its own datetime object, Timestamp? - python

The documentation of pandas.Timestamp states a concept well-known to every pandas user:
Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases.
But I don't understand why pandas.Timestamps are needed at all.
Why is, or was, it useful to have a different object than python's Datetime? Wouldn't it be cleaner to simply build pandas.DatetimeIndex out of Datetimes?

You can go through the pandas documentation for the details:
pandas.Timestamp is a replacement for python's datetime.datetime for
pandas usage.
Timestamp is the pandas equivalent of python’s Datetime and is
interchangeable with it in most cases. It’s the type used for the
entries that make up a DatetimeIndex, and other timeseries oriented
data structures in pandas.
Notes
There are essentially three calling conventions for the constructor.
The primary form accepts four parameters. They can be passed by
position or keyword.
The other two forms mimic the parameters from datetime.datetime. They
can be passed by either position or keyword, but not both mixed
together.
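For illustration, a quick sketch of those calling conventions (the values here are arbitrary):

    import datetime
    import pandas as pd

    # Primary form: a datetime-like input (string, epoch value, datetime, ...)
    # plus optional keywords such as unit or tz.
    pd.Timestamp("2017-01-01T12")
    pd.Timestamp(1513393355.5, unit="s")

    # Forms that mimic datetime.datetime: positional or keyword, but not mixed.
    pd.Timestamp(2017, 1, 1, 12)
    pd.Timestamp(year=2017, month=1, day=1, hour=12)

    # A plain datetime.datetime is accepted as well.
    pd.Timestamp(datetime.datetime(2017, 1, 1, 12))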
Timedeltas are differences in times, expressed in difference units,
e.g. days, hours, minutes, seconds. They can be both positive and
negative.
Timedelta is a subclass of datetime.timedelta, and behaves in a
similar manner, but allows compatibility with np.timedelta64 types
as well as a host of custom representation, parsing, and attributes.
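A small sketch of that compatibility (again with arbitrary values):

    import datetime
    import numpy as np
    import pandas as pd

    td = pd.Timedelta("1 days 2 hours")           # parsed from a string
    print(td + datetime.timedelta(hours=1))       # mixes with datetime.timedelta
    print(td.to_timedelta64())                    # converts to np.timedelta64
    print(pd.Timedelta(np.timedelta64(90, "m")))  # built from np.timedelta64
    print(td.total_seconds(), td.components)      # convenience attributes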
I would say that because pandas works better with time series data, Timestamp is essentially a wrapper around the original built-in datetime module.
The weaknesses of Python's datetime format inspired the NumPy team to
add a set of native time series data types to NumPy. The datetime64
dtype encodes dates as 64-bit integers, and thus allows arrays of
dates to be represented very compactly.
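A short sketch of that interchangeability and the compact datetime64 representation (values made up):

    import datetime
    import pandas as pd

    ts = pd.Timestamp("2021-06-01 09:30")
    print(isinstance(ts, datetime.datetime))  # True: Timestamp subclasses datetime
    print(ts.to_datetime64())                 # the underlying np.datetime64 value

    # A DatetimeIndex stores its entries compactly as a datetime64[ns] array.
    idx = pd.date_range("2021-06-01", periods=3, freq="D")
    print(idx.dtype)                          # datetime64[ns]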

Related

Why is there a seemingly random conversion of floats to ints with aggregation functions

From the python pandas pivot table reference, the difference between the fourth-last and third-last example is simply the addition of this parameter: 'fill_value=0'.
Yet the difference in the resulting table is much broader: while previously all values (whether NaN or not) were shown as floats, they are now presented as ints (that is, with no '.0' suffix).
I have seen similar behaviour with particular groupby() operations, for instance. Is there a way to force consistent behaviour?
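For reference, a minimal sketch that tries to reproduce the described behaviour (the frame is made up, and whether the filled result actually shows ints depends on the pandas version, since fill_value may trigger dtype downcasting):

    import pandas as pd

    df = pd.DataFrame({
        "A": ["x", "x", "y"],
        "B": ["p", "q", "p"],
        "C": [1, 2, 3],
    })

    # Without fill_value the missing cell is NaN, so the column is float.
    print(pd.pivot_table(df, values="C", index="A", columns="B", aggfunc="sum"))

    # With fill_value=0 the NaN is replaced; in some pandas versions the result
    # is then downcast to int, which is the change in appearance described above.
    print(pd.pivot_table(df, values="C", index="A", columns="B", aggfunc="sum",
                         fill_value=0))

One hedge against version-dependent dtypes is an explicit .astype(float) (or .astype(int)) on the result.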

How can I implement time-based time series windowing?

I want to window a pandas series which has a DatetimeIndex to the last X seconds. Usually I'd use pandas.Series.rolling for windowing. However, the datetime indices are not equidistant, which means I cannot calculate the number of data points in a reliable manner. How can I implement a time-based windowing (e.g. by implementing a BaseIndexer subclass and passing it to the window parameter of rolling())?
The easiest way I came up with to get the last X seconds from a datetime-indexed series is to take the newest_timestamp = series.index.max(), calculate the oldest timestamp to consider from it, oldest_timestamp = newest_timestamp - pd.to_timedelta(<X-seconds>, unit='s'), and slice: windowed_series = series[oldest_timestamp:newest_timestamp]. Because the oldest_timestamp is calculated rather than extracted from the series, the slice endpoints will usually not match exact index entries. However, this does not matter because it is handled automatically.
NOTE: series.rolling() is usually used in a time series data pre-processing context (e.g. weighting samples within a window dependent on a function, as part of a forecasting application), not for plain windowing use cases.
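A minimal sketch of the slicing approach described above (the series and the 10-second window are made up for illustration):

    import pandas as pd

    # Irregularly spaced DatetimeIndex, as described in the question.
    index = pd.to_datetime([
        "2021-01-01 00:00:00", "2021-01-01 00:00:03",
        "2021-01-01 00:00:07", "2021-01-01 00:00:15",
    ])
    series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

    window_seconds = 10
    newest_timestamp = series.index.max()
    oldest_timestamp = newest_timestamp - pd.to_timedelta(window_seconds, unit="s")

    # Label-based slicing does not require the endpoints to match exactly.
    windowed_series = series[oldest_timestamp:newest_timestamp]
    print(windowed_series)

For the rolling pre-processing use case from the note, pandas also accepts an offset string, e.g. series.rolling("10s"), which builds time-based windows even on a non-equidistant (but sorted) DatetimeIndex.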

storing datetime in python matrix

I have a 100 by 2 matrix. The first column has dates in numerical format. The dates are not necessarily sequentially increasing or decreasing. The granularity of the dates is 5 minutes. Therefore, there could be rows whose year, month, day, and hour are the same but whose minutes are different. I need to do some operations on the matrix; how can I do that? Is there any way to save date and time in the matrix?
Yes, that all depends on the data structure that you want to use:
numpy has a datetime dtype: see its documentation
pandas does too: see its tutorial
You can also choose to store them as unix timestamps, which are basically integers counting the number of seconds since 1970-01-01.
If you instead choose to use built-in types such as lists and dictionaries, you can use the datetime library, which provides datetime objects.
If you want more information, a simple google search for "python datetime" will probably shed some light...
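As a rough sketch of the first two options (the numeric dates here are assumed to be unix timestamps in seconds; adapt the conversion to whatever numerical format the dates actually use):

    import numpy as np
    import pandas as pd

    # Hypothetical 3 x 2 matrix: numeric dates (unix seconds) and values.
    raw = np.array([
        [1609459200, 10.5],
        [1609459500, 11.0],   # 5 minutes later
        [1609459800,  9.8],
    ])

    # Option 1: a numpy datetime64 array alongside the values.
    dates = raw[:, 0].astype("int64").astype("datetime64[s]")
    values = raw[:, 1]

    # Option 2: a pandas DataFrame, which keeps both columns together and
    # gives datetime-aware operations (sorting, grouping, resampling, ...).
    df = pd.DataFrame({"date": pd.to_datetime(raw[:, 0], unit="s"),
                       "value": raw[:, 1]})
    print(df.sort_values("date"))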

pandas data format to preserve DateTimeIndex

I do a lot of work with data that has DateTime indexes and multi-indexes. Saving and reading as a .csv is tedious: every time, I have to reset_index and name it "date", and when I read it again, I have to convert the date back to a datetime and set the index. What format will help me avoid this? I'd prefer something open source; for instance, I think SAS and Stata will do this, but they are proprietary.
feather was made for this:
https://github.com/wesm/feather
Feather provides binary columnar serialization for data frames. It is
designed to make reading and writing data frames efficient, and to
make sharing data across data analysis languages easy. This initial
version comes with bindings for python (written by Wes McKinney) and R
(written by Hadley Wickham).
Feather uses the Apache Arrow columnar memory specification to
represent binary data on disk. This makes read and write operations
very fast. This is particularly important for encoding null/NA values
and variable-length types like UTF8 strings.
Feather is a part of the broader Apache Arrow project. Feather defines
its own simplified schemas and metadata for on-disk representation.
Feather currently supports the following column types:
A wide range of numeric types (int8, int16, int32, int64, uint8,
uint16, uint32, uint64, float, double). Logical/boolean values. Dates,
times, and timestamps. Factors/categorical variables that have a fixed
set of possible values. UTF-8 encoded strings. Arbitrary binary data.
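A minimal round-trip sketch (the file name and columns are made up; note that, depending on the pandas version, to_feather may require a default integer index, in which case the DatetimeIndex still has to be reset on write and restored on read):

    import pandas as pd

    df = pd.DataFrame(
        {"value": [1.0, 2.0, 3.0]},
        index=pd.date_range("2021-01-01", periods=3, freq="D", name="date"),
    )

    # Write: move the DatetimeIndex into a regular column if your pandas
    # version refuses to serialize a non-default index to feather.
    df.reset_index().to_feather("data.feather")

    # Read: restore the index in one chained call.
    restored = pd.read_feather("data.feather").set_index("date")

If that reset/set round trip is exactly what you want to avoid, the Arrow-based Parquet format (df.to_parquet / pd.read_parquet) and pickle (df.to_pickle / pd.read_pickle) preserve a DatetimeIndex directly, at the cost of a different on-disk format.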

Non-standard calendars in pandas

I am trying to use the Python pandas library with timeseries data that go beyond AD 1. Pandas' datetime objects apparently use the numpy datetime64 object, whose range is limited to 1/1/1 - 9999/12/31.
It is possible to index a DataFrame with a non-datetime index, but let's face it: the real strength of pandas (e.g. resampling) shines when rows are indexed by a proper datetime object. Without the resample() method I may not need pandas at all.
Several Python projects have implemented alternate calendars, in particular astropy and FlexiDate, but neither generates a datetime object. I just learned that the NumPy folks do not envision supporting non-standard calendars anytime soon.
Before posting a ticket on the pandas project GitHub, I am hereby appealing to the StackOverflow hive mind: do you know of a not-too-clunky way to generate a datetime index that pandas will recognize and that handles non-standard calendars?
