I have multiple time series as dataframes in Python 3, imported from Excel files that look like this:
However, they come in varying levels of granularity, such as hourly, daily, monthly and yearly series. To be perfectly clear: within a single file the granularity is consistent; it only varies across the different files. There might also be missing time stamps, either predictable ones (such as daylight saving time) or unpredictable ones (e.g. a technical failure to record, say in the context of weather data).
I would like to efficiently recognize each series' granularity level with a function, assuming the first column is always a datetime. An hourly series would have datetime steps like 2022-11-11 01:00 and 2022-11-11 02:00, whereas a yearly series would have steps like 2022-01-01 00:00 and 2023-01-01 00:00.
As a first approach I thought about taking the difference between the datetime series and its lagged version and averaging it over the whole horizon to infer the granularity level, but that seems rather inefficient. I'm hoping there is already a built-in function in the datetime package, or that someone can come up with a reliable and more efficient method.
Edit 1
The above screenshot shows an example of a df featuring daily time-series granularity. Here is another screenshot showing a time-series with hourly granularity:
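One possible approach (a sketch, not a built-in): pandas offers pd.infer_freq, which can recover the frequency of a fully regular DatetimeIndex, but it tends to give up when stamps are missing, so taking the mode of the consecutive differences is more robust than averaging them. The column name timestamp below is an assumption:

import pandas as pd

def infer_granularity(df, ts_col="timestamp"):
    """Guess the sampling granularity of a datetime column, tolerating gaps."""
    ts = pd.to_datetime(df[ts_col]).sort_values()
    step = ts.diff().mode().iloc[0]  # the most common step survives DST/outage gaps
    if step <= pd.Timedelta(hours=1):
        return "hourly"
    if step <= pd.Timedelta(days=1):
        return "daily"
    if step <= pd.Timedelta(days=31):
        return "monthly"
    return "yearly"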
Related
This seems like a trivial thing to solve, but every idea I have is very hacky. I have a series of timestamps that spans multiple days. What I'm interested in, is the distribution of these timestamps (events) within 24h: e.g., see whether there are more events in the morning.
My data is in a pandas.DataFrame, and the timestamp column has dtype datetime64[ns]. I tried matplotlib.pyplot.plot(data.timestamp.dt.time), but that gives an error. I also thought of subtracting the date from my timestamps so they all start on 'day 0', and formatting the X-axis in the plot to not show the date. Feels very clumsy. Is there a better way?
If you are interested in distribution with resolution limited to e.g. hours, you can:
Create a new column with extracted hour from your source timestamp.
Group your data by hour.
Generate your plot.
As you failed to post any sample data, I'm not able to post any code.
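For illustration only, a minimal sketch of those three steps on made-up sample data (the column name timestamp is an assumption):

import pandas as pd
import matplotlib.pyplot as plt

# toy data standing in for the real timestamps
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2023-05-01 08:12", "2023-05-01 09:47", "2023-05-02 08:05", "2023-05-03 21:30"])})

df["hour"] = df["timestamp"].dt.hour  # 1. extract the hour from the timestamp
counts = df.groupby("hour").size()    # 2. group by hour and count events
counts.plot(kind="bar", xlabel="hour of day", ylabel="events")  # 3. plot the distribution
plt.show()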
I need to use multivariate linear regression for my project, where I have two dependent variables: mean_1 and mean_2. The independent variable is a date in YYYY-mm-dd format. I have been going through various Stack Overflow posts to understand how to use a date as a variable in regression. Some suggest converting the date to a numerical value (https://stackoverflow.com/a/40217971/13713750), while the other option is to convert the date to dummy variables.
What I don't understand is how to convert every date in the dataset to a dummy variable and use it as an independent variable. Is that even possible, or are there better ways to use a date as an independent variable?
Note: I would prefer keeping the date in date format so it is easy to plot and analyse the results of the regression. Also, I am working with PySpark but I can switch to pandas if necessary, so any example implementations would be helpful. Thanks!
You could create new columns year, month, day_of_year, day_of_month, day_of_week. You could also add some binary columns like is_weekday, is_holiday. In some cases it is beneficial to add third-party data, such as daily weather statistics (I was working on a case where extra daily weather data proved very useful). It really depends on the domain you're working in. Any of those columns could unveil some pattern behind your data.
As for dummy variables, converting month and day_of_week to dummies makes sense.
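A rough pandas sketch of those calendar features (mean_1, mean_2 and date follow the question; the toy values are made up):

import pandas as pd

# toy frame standing in for the real data
df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-06-15"],
                   "mean_1": [1.0, 1.2, 0.9], "mean_2": [3.1, 2.8, 3.3]})
df["date"] = pd.to_datetime(df["date"])

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_month"] = df["date"].dt.day
df["day_of_year"] = df["date"].dt.dayofyear
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekday"] = (df["day_of_week"] < 5).astype(int)

# dummy-encode the categorical calendar features
df = pd.get_dummies(df, columns=["month", "day_of_week"], drop_first=True)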
Another option is to build a model for each month.
If you want to transform a date to a numeric value (though I don't recommend it), you can convert it to the number of seconds since the Unix epoch:
(pd.to_datetime(df.date) - pd.Timestamp('1970-01-01')).dt.total_seconds().astype(int)
Drop the .astype(int) if you prefer the seconds as a float:
(pd.to_datetime(df.date) - pd.Timestamp('1970-01-01')).dt.total_seconds()
Also, you can pick a baseline date, subtract it from your date variable and obtain the number of days. This gives you an integer that behaves sensibly (a bigger value means a date further in the future, a smaller one an older date) and makes sense to use as an independent variable in a model.
First, we create a baseline date (it can be whatever you want) and add it to the dataframe as the column static:
import datetime
df['static'] = pd.to_datetime(datetime.date(2017, 6, 28))
Then we obtain the difference in days between your date and the static date (subtracting in this order so that later dates get larger values, as described above):
df['days'] = (df['date'] - df['static']).dt.days
And there you have a number ready to be used as an independent variable.
I have a 100-by-2 matrix. The first column has dates in numerical format. The dates are not necessarily sequentially increasing or decreasing. The granularity of the dates is 5 minutes, so there could be rows whose year, month, day and hour are the same but whose minutes differ. I need to do some operations on the matrix; how can I do that? Is there any way to store date and time in the matrix?
Yes, that all depends on the data structure that you want to use:
numpy has a dtype datetime: doc here
pandas too: tuto here
You can also choose to store them as unix timestamps, which are basically integers counting the number of seconds from 1/1/1970.
If you instead choose to use built-in types such as lists and dictionaries, you can use the datetime library, which provides datetime objects.
If you want more information, a simple google search for "python datetime" will probably shed some light...
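A tiny sketch of the first three options (the sample values are made up):

import numpy as np
import pandas as pd

# numpy: datetime64 with minute resolution
times = np.array(["2022-11-11T01:00", "2022-11-11T01:05"], dtype="datetime64[m]")

# pandas: keep the timestamps and the second column together in a DataFrame
df = pd.DataFrame({"time": pd.to_datetime(times), "value": [1.2, 3.4]})

# unix timestamps: integer seconds since 1970-01-01
unix = df["time"].astype("int64") // 10**9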
What would be the best way to approach this problem using python and pandas?
I have an excel file of electricity usage. It comes in an awkward structure and I want to transform it so that I can compare it to weather data based on date and time.
The structure looks like this (foo is a string and xx is a number):
100,foo,foo,foo,foo
200,foo,foo,foo,foo,foo,0000,kWh,15
300,20181101,xx,xx,xx,xx...(96 columns)xx,A
... several hundred more 300 type rows
The 100 and 200 rows identify the meter and provide a partial schema, i.e. the data is in kWh at 15-minute intervals. The 300 rows contain a date, 96 columns of 15-minute power consumption (96 = 24 hours × 4 blocks of 15 minutes) and one column with a data quality flag.
I have previously processed all the data in other tools but I'm trying to learn how to do it in Python (jupyter notebook to be precise) and tap into the far more advanced analysis, modeling and visualisation tools available.
I think the thing to do is transform the data into a series of datetime and power. From there I can aggregate, filter and compare however I like.
I am at a loss even to know what question to ask or what resource to look up to tackle this problem. I could just import the 300 rows as-is and loop through the rows and columns to create a new series in the right structure - easy enough to do. However, I strongly suspect there is an inbuilt method for doing this sort of thing and I would greatly appreciate any advice on what might be the best strategies. Possibly I don't need to transform the data at all.
You can read the data easily enough into a DataFrame; you just have to step over the metadata rows, e.g.:
df = pd.read_csv(<file>, skiprows=[0,1], index_col=1, parse_dates=True, header=None)
This will read in the CSV, skip over the first 2 lines, make the date column the index and try to parse it as a date type.
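Getting from there to the long (datetime, kWh) series described in the question could look roughly like this (the file name and column positions are assumptions based on the layout above, and stamping each value at the start of its 15-minute block is just one choice):

import pandas as pd

# read only the 300 rows; the two metadata rows are skipped
df = pd.read_csv("usage.csv", skiprows=2, header=None)
df = df[df[0] == 300]                                # keep just the data rows

df["date"] = pd.to_datetime(df[1], format="%Y%m%d")  # column 1 holds YYYYMMDD
readings = df.set_index("date").iloc[:, 2:98]        # 96 quarter-hour kWh columns
readings.columns = range(96)                         # interval number 0..95

# wide -> long: one row per 15-minute interval with a proper timestamp
long = readings.stack().reset_index()
long.columns = ["date", "interval", "kWh"]
long["datetime"] = long["date"] + pd.to_timedelta(long["interval"] * 15, unit="m")
series = long.set_index("datetime")["kWh"].sort_index()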
I'm new to pandas and would like some insight from the pros. I need to perform various statistical analyses (multiple regression, correlation, etc.) on >30 time series of financial securities' daily Open, High, Low, Close prices. Each series has 500-1500 days of data. As each analysis looks at multiple securities, I'm wondering whether it is preferable, from an ease-of-use and efficiency perspective, to store each time series in a separate df with date as the index, or to merge them all into a single df with a single date index, which would effectively be a 3D df. If the latter, any recommendations on how to structure it?
Any thoughts much appreciated.
PS. I'm working my way up to working with intraday data across multiple timezones but that's a bit much for my first pandas project; this is a first step in that direction.
Since you're only dealing with OHLC, it's not that much data to process, so that's good.
For these types of things I usually use a MultiIndex (http://pandas.pydata.org/pandas-docs/stable/indexing.html) with symbol as the first level and date as the second. Then you can have just the columns OHLC and you're all set.
To access the MultiIndex, use the .xs function.
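A small sketch of that layout (tickers and numbers are made up):

import numpy as np
import pandas as pd

# two hypothetical OHLC frames, each indexed by date
dates = pd.date_range("2022-01-03", periods=5, freq="B")
cols = ["Open", "High", "Low", "Close"]
df_aapl = pd.DataFrame(np.random.rand(5, 4), index=dates, columns=cols)
df_msft = pd.DataFrame(np.random.rand(5, 4), index=dates, columns=cols)

# combine into one frame with a (symbol, date) MultiIndex
panel = pd.concat({"AAPL": df_aapl, "MSFT": df_msft}, names=["symbol", "date"])

# pull a single symbol's time series back out with .xs
aapl = panel.xs("AAPL", level="symbol")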
Unless you are going to correlate everything with everything, my suggestion is to put these into separate dataframes and keep them all in a dictionary, i.e. {"Timeseries1": df1, "Timeseries2": df2, ...}. Then, when you want to correlate some of the time series together, you can merge them and put suffixes on the columns of each df to differentiate between them.
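For example (names and numbers are placeholders):

import numpy as np
import pandas as pd

# hypothetical closes for two securities, kept in a dictionary of DataFrames
idx = pd.date_range("2022-01-03", periods=5, freq="B")
dfs = {"Timeseries1": pd.DataFrame({"Close": np.random.rand(5)}, index=idx),
       "Timeseries2": pd.DataFrame({"Close": np.random.rand(5)}, index=idx)}

# merge just the two you want to correlate, suffixing the overlapping columns
merged = dfs["Timeseries1"].join(dfs["Timeseries2"], lsuffix="_1", rsuffix="_2")
print(merged["Close_1"].corr(merged["Close_2"]))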
You will probably be interested in the talk Python for Financial Data Analysis with pandas by the author of pandas himself.