I have a 100 by 2 matrix. The first column has dates in numerical format. The dates are not necessarily in increasing or decreasing order. The granularity of the dates is 5 minutes, so there can be rows whose year, month, day and hour are the same but whose minutes differ. I need to do some operations on the matrix; how can I do that? Is there any way to store date and time in the matrix?
Yes, that all depends on the data structure you want to use:
numpy has a datetime dtype: doc here
pandas does too: tutorial here
You can also choose to store them as Unix timestamps, which are basically integers counting the number of seconds since 1/1/1970.
If you choose to use built-in types such as lists and dictionaries instead, then you can use the datetime library, which provides datetime objects.
If you want more information, a simple Google search for "python datetime" will probably shed some light.
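For example, a minimal sketch of the numpy option; the 5-minute timestamps and values below are made up for illustration, and since a plain 2-D matrix cannot mix dtypes, a structured array is one way to keep both columns together:

import numpy as np

# Hypothetical 5-minute-granularity timestamps; datetime64[m] keeps minute
# precision, so rows sharing year/month/day/hour but differing in minutes
# stay distinct.
times = np.array(['2023-01-05T09:00', '2023-01-05T09:05', '2023-01-04T17:35'],
                 dtype='datetime64[m]')
values = np.array([1.2, 3.4, 5.6])

# A structured array stores both columns in a single object.
data = np.zeros(3, dtype=[('time', 'datetime64[m]'), ('value', 'f8')])
data['time'] = times
data['value'] = values

# The rows need not arrive in order; sorting by time is straightforward.
data_sorted = np.sort(data, order='time')
print(data_sorted['time'])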
Basically I have a few date fields, say day1, day2, and I want to calculate the year/month/day differences between these fields. Directly subtracting those fields gives a raw numeric duration (ns format):
test_date_df = df['END_DT'] - df['START_DT']
test_date_df.dtype
The dtype is dtype('<U').
Example from one row:
1.97096e+12
I did not find a proper way to convert this 1.97096e+12 into year/month/day with a native H2O function. It would be great if the operation could be done with a native H2O function; I also expect such a function to work on multiple cores, so that out-of-memory errors are not a concern, since I have a huge volume of data.
With pandas it is very handy to work this out with multiprocessing, but I want to check whether someone has a solution with H2O.
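For reference, a minimal pandas sketch of the conversion being discussed; this is not a native H2O function, and it assumes the raw difference is a duration in milliseconds (H2O stores epoch times in milliseconds), with the column name made up:

import pandas as pd

# Hypothetical frame holding the raw END_DT - START_DT difference.
df = pd.DataFrame({'diff_ms': [1.97096e+12]})

# Interpret the number as a millisecond duration and express it in days.
delta = pd.to_timedelta(df['diff_ms'], unit='ms')
df['days'] = delta.dt.days
# Rough year figure (approximate, since calendar months and years vary):
df['years_approx'] = df['days'] / 365.25
print(df)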
I am currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike the resample method from pandas, which resamples according to a timedelta, I'm trying to specify conditions on other columns to determine which data points to keep.
I'm not sure it's clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but it was very slow because of the size of my datasets.
I thought about using the diff method, but of course it can only compute the difference over a specified period; the same goes for pct_change, which could have been used to keep only the points in the regions where the variations are maximal.
Thanks in advance if you have any suggestions on how to proceed with this resampling.
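For what it's worth, here is a straightforward (if slow) loop-based reference implementation of the rule described above, just to make the condition concrete; the column names time (in seconds) and temperature, and the two thresholds, are only illustrative:

import pandas as pd

def undersample(df, time_step=1.0, temp_step=2.0):
    # Keep a row whenever at least time_step seconds have elapsed or the
    # temperature has moved by at least temp_step degrees since the last
    # kept row.
    kept = [0]                      # always keep the first row
    last = df.iloc[0]
    for i in range(1, len(df)):
        row = df.iloc[i]
        if (row['time'] - last['time'] >= time_step or
                abs(row['temperature'] - last['temperature']) >= temp_step):
            kept.append(i)
            last = row
    return df.iloc[kept]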
I need to use multivariate linear regression for my project, where I have two dependent variables: mean_1 and mean_2. The independent variable is a date in YYYY-mm-dd format. I have been going through various Stack Overflow posts to understand how to use a date as a variable in regression. Some suggest converting the date to a numerical value (https://stackoverflow.com/a/40217971/13713750), while the other option is to convert the date to dummy variables.
What I don't understand is how to convert every date in the dataset to a dummy variable and use it as an independent variable. Is it even possible, or are there better ways to use a date as an independent variable?
Note: I would prefer keeping the date in date format so it is easy to plot and analyse the results of the regression. Also, I am working with pyspark, but I can switch to pandas if necessary, so any example implementations would be helpful. Thanks!
You could create new columns year, month, day_of_year, day_of_month, day_of_week. You could also add some binary columns like is_weekday, is_holiday. In some cases it is beneficial to add third-party data, such as daily weather statistics (I was working on a case where extra daily weather data proved very useful). It really depends on the domain you're working in. Any of those columns could unveil some pattern behind your data.
As for dummy variables, converting month and day_of_week to dummies makes sense, as in the sketch below.
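A minimal pandas sketch of that feature engineering, assuming the column is named date:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2021-01-04', '2021-06-19'])})

# Calendar features derived from the date column.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_year'] = df['date'].dt.dayofyear
df['day_of_month'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekday'] = (df['day_of_week'] < 5).astype(int)

# One-hot (dummy) encoding for month and day_of_week.
df = pd.get_dummies(df, columns=['month', 'day_of_week'], prefix=['m', 'dow'])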
Another option is to build a model for each month.
If you want to transform a date to a numeric value (but I don't recommend it), you can convert it to the total number of seconds since the epoch:
(pd.to_datetime(df.date) - pd.Timestamp('1970-01-01')).dt.total_seconds()
Append .astype(int) if you need an integer rather than a float.
Also, you can use a baseline date and subtract it from your date variable to obtain the number of days between them. This gives you an integer that makes sense: a bigger value means a date further into the future, while a smaller value means an older date. This value makes sense to me as an independent variable in a model.
First, we create a baseline date (it can be whatever you want) and add it to the dataframe in the column static:
import datetime
df['static'] = pd.to_datetime(datetime.date(2017, 6, 28))
Then we obtain the difference in days between your date and the static date:
df['days'] = (df['date'] - df['static']).dt.days
And there you have a number ready to be used as an independent variable.
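Putting it together, a small self-contained example (the baseline date and sample dates are arbitrary):

import datetime
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-06-01', '2017-07-15'])})

# Arbitrary baseline; any fixed date works.
df['static'] = pd.to_datetime(datetime.date(2017, 6, 28))
df['days'] = (df['date'] - df['static']).dt.days
print(df[['date', 'days']])   # -27 and 17: later dates give larger values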
I want to window a pandas series that has a DatetimeIndex to the last X seconds. Usually I'd use pandas.Series.rolling for windowing. However, the datetime indices are not equidistant, which means I cannot calculate the number of data points in a reliable manner. How can I implement time-based windowing (e.g. by implementing a BaseIndexer subclass and passing it to the window parameter of rolling())?
The easiest way I came up with to get the last X seconds from a datetime-indexed series is: take the newest timestamp with newest_timestamp = series.index.max(), calculate the oldest timestamp to consider from it with oldest_timestamp = newest_timestamp - pd.to_timedelta(<X-seconds>, unit='s'), and slice with windowed_series = series[oldest_timestamp:newest_timestamp]. Because oldest_timestamp is calculated rather than extracted from the series, the lower bound will usually not match an index entry exactly; however, this does not matter, because pandas handles inexact slice bounds on a sorted index automatically.
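A short sketch of that slicing approach (the index values and the 30-second window are arbitrary):

import pandas as pd

# Non-equidistant DatetimeIndex, as in the question.
idx = pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:00:07',
                      '2021-01-01 00:00:25', '2021-01-01 00:00:41'])
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

window_seconds = 30
newest_timestamp = series.index.max()
oldest_timestamp = newest_timestamp - pd.to_timedelta(window_seconds, unit='s')

# Label-based slicing; an inexact lower bound is fine on a sorted index.
windowed_series = series[oldest_timestamp:newest_timestamp]
print(windowed_series)   # keeps only the samples within the last 30 s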
NOTE: series.rolling() is usually used in a time-series pre-processing context (e.g. weighting samples within a window with some function, as part of a forecasting application), not for plain windowing use cases.
The documentation of pandas.Timestamp states a concept well-known to every pandas user:
Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases.
But I don't understand why pandas.Timestamps are needed at all.
Why is, or was, it useful to have an object distinct from python's Datetime? Wouldn't it be cleaner to simply build a pandas.DatetimeIndex out of Datetimes?
You can go through the pandas documentation for the details:

pandas.Timestamp is a replacement for python datetime.datetime for pandas usage.

Timestamp is the pandas equivalent of python's Datetime and is interchangeable with it in most cases. It's the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data structures in pandas.
Notes

There are essentially three calling conventions for the constructor. The primary form accepts four parameters. They can be passed by position or keyword.

The other two forms mimic the parameters from datetime.datetime. They can be passed by either position or keyword, but not both mixed together.

Timedeltas are differences in times, expressed in difference units, e.g. days, hours, minutes, seconds. They can be both positive and negative.

Timedelta is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes.
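To make the quoted calling conventions concrete, a few standard examples:

import datetime
import pandas as pd

# Epoch-based form: a numeric value plus a unit.
t1 = pd.Timestamp(1513393355.5, unit='s')

# String form, parsed much like dateutil would.
t2 = pd.Timestamp('2017-01-01T12')

# datetime.datetime-mimicking form: year, month, day, hour, ...
t3 = pd.Timestamp(2017, 1, 1, 12)

# Timedelta behaves like datetime.timedelta and can be negative.
delta = pd.Timestamp('2017-01-02') - pd.Timestamp('2017-01-01 18:00')
print(delta)                               # 0 days 06:00:00
print(t3 - t2 == datetime.timedelta(0))    # True: interchangeable comparison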
I would say that since pandas works better with time series data, Timestamp is essentially a wrapper over the original built-in datetime module.
The weaknesses of Python's datetime format inspired the NumPy team to add a set of native time series data types to NumPy. The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.
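For example, a quick sketch contrasting the compact NumPy representation with plain Python date objects (the dates are arbitrary):

import datetime
import numpy as np

# A vectorized array of dates: one 64-bit integer per entry.
dates = np.array(['2015-07-04', '2015-07-05'], dtype='datetime64[D]')
print(dates + 1)    # vectorized arithmetic: next day for every entry

# Equivalent python dates are full objects, one allocation each.
py_dates = [datetime.date(2015, 7, 4), datetime.date(2015, 7, 5)]
print([d + datetime.timedelta(days=1) for d in py_dates])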