Plot timestamp distribution in a day over multiple dates - python

This seems like a trivial thing to solve, but every idea I have is very hacky. I have a series of timestamps that spans multiple days. What I'm interested in, is the distribution of these timestamps (events) within 24h: e.g., see whether there are more events in the morning.
My data is in a pandas.DataFrame, and the timestamp column has dtype datetime64[ns]. I tried matplotlib.pyplot.plot(data.timestamp.dt.time), but that gives an error. I also thought of subtracting the date from my timestamps so they all start on 'day 0', and formatting the X-axis in the plot to not show the date. That feels very clumsy. Is there a better way?

If you are interested in the distribution with resolution limited to e.g. hours, you can:

1. Create a new column with the hour extracted from your source timestamp.
2. Group your data by hour.
3. Generate your plot.

As you didn't post any sample data, any example code has to assume a structure for the DataFrame.
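A minimal sketch along those lines, with made-up timestamps, assuming a DataFrame named data with a datetime64[ns] column timestamp as described in the question:

    import pandas as pd
    import matplotlib.pyplot as plt

    # hypothetical sample data, since none was posted in the question
    data = pd.DataFrame({
        "timestamp": pd.to_datetime([
            "2023-05-01 08:15", "2023-05-01 09:40", "2023-05-02 08:05",
            "2023-05-02 21:30", "2023-05-03 07:55", "2023-05-03 22:10",
        ])
    })

    # extract the hour of day, count events per hour, and plot the distribution
    counts = data["timestamp"].dt.hour.value_counts().sort_index()
    counts.plot(kind="bar")
    plt.xlabel("Hour of day")
    plt.ylabel("Number of events")
    plt.show()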

Related

Efficient way to recognize time series's granularity in dataframe?

I have multiple time series as dataframes in Python 3, imported from Excel files.
They come in varying levels of data granularity, such as hourly, daily, monthly and yearly series. To be perfectly clear: within a single file the granularity is consistent; it only varies across the different files. However, there might be missing timestamps (predictable ones like daylight-saving gaps, or unpredictable ones caused by a technical failure to record, e.g. in weather data).
I would like to efficiently recognize each series' granularity level with a function, assuming the first column is always a datetime, such that an hourly series would have steps like 2022-11-11 01:00 and 2022-11-11 02:00, whereas a yearly series would have steps like 2022-01-01 00:00 and 2023-01-01 00:00.
As a first approach I thought about taking the difference between the datetime series and its lagged version and calculating the average over the whole horizon to infer the granularity level, but that seems rather inefficient. I'm hoping there is some built-in function in the datetime package already, or that someone can come up with a reliable and more efficient method.
Edit 1
The screenshots (not reproduced here) show one df with daily time-series granularity and another with hourly granularity.
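One possible approach, sketched below: pandas ships pd.infer_freq, but it assumes a regularly spaced, gap-free index, so with missing timestamps a more robust trick is to look at the most frequent step between consecutive values. The function name infer_granularity and the thresholds are illustrative assumptions, not a built-in API.

    import pandas as pd

    def infer_granularity(ts: pd.Series) -> str:
        # most frequent gap between consecutive timestamps; the mode is not
        # distorted by occasional missing stamps (DST gaps, recording failures)
        step = ts.sort_values().diff().value_counts().idxmax()
        if step <= pd.Timedelta(hours=1):
            return "hourly"
        if step <= pd.Timedelta(days=1):
            return "daily"
        if step <= pd.Timedelta(days=31):
            return "monthly"
        return "yearly"

    # usage on a hypothetical hourly series with one missing timestamp
    ts = pd.Series(pd.to_datetime(["2022-11-11 01:00", "2022-11-11 02:00",
                                   "2022-11-11 04:00", "2022-11-11 05:00"]))
    print(infer_granularity(ts))  # -> "hourly"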

Plotting data over time with different start/end and frequency

I'm new to data science and I'm looking for an approach to plot data over time, and I don't know where to start.
Right now I have an SQLite database containing datapoints consisting of a name, a timestamp and a value.
I can export this database to a CSV file from my Shiny app if necessary.
My goal is to plot the values by "name" over time (the one in the timestamps).
The problem I have is that the values have different frequencies and different start/end times.
I'm looking for an approach in either R or Python, but I prefer R.
I experimented with the R plot function but I get errors like:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
Is there a library or approach that can help me achieve this goal?
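A minimal pandas/matplotlib sketch of one approach (Python, since either language was acceptable); the database file data.db, the table datapoints and the columns name, timestamp, value are assumptions based on the description:

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt

    # read the datapoints out of the SQLite database (names are assumed)
    con = sqlite3.connect("data.db")
    df = pd.read_sql_query("SELECT name, timestamp, value FROM datapoints",
                           con, parse_dates=["timestamp"])

    # give each name its own line with its own x values, so differing
    # frequencies and start/end times are not a problem
    fig, ax = plt.subplots()
    for name, grp in df.groupby("name"):
        ax.plot(grp["timestamp"], grp["value"], label=name)
    ax.legend()
    ax.set_xlabel("timestamp")
    plt.show()

The same idea in R would be ggplot2 with aes(timestamp, value, colour = name) and geom_line(), which also handles series of different lengths.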

Upsampling high rate data using pandas

Yes, you read that correctly. I want to upsample multi-hertz data by a factor of 5. I have a spreadsheet with dozens of columns and thousands of rows, and I would like to upsample all the data at once. So I cracked open the pandas web pages and tutorials. I read the CSV file in with pandas "read_csv", then used the floating-point seconds column to create a datetime-like column and set that as the index, since resampling seems to want that.
Then I tried both "resample" and "asfreq" on it, e.g. df.resample("Xms").sum() and df.asfreq(freq="Xms"), where X is the number of milliseconds I want to upsample to. Resample filled all the rows in between with zeros, and "interpolate" won't touch those. "asfreq" has my first data row, followed by rows of NaNs, but my subsequent data rows seem to have disappeared! Note the floating-point seconds values were not necessarily on clean Xms boundaries. And yet when I interpolate that data, it becomes meaningful again (albeit for some reason it only gave me the first 25k points). I have no idea how...
I note with dismay that all of the examples I find for this function relate to data taken over hours, days, weeks, months, years... so I'm beginning to think this isn't the right way to go about it. Does anyone have tips to help me understand what I'm seeing / how to proceed? Thanks.
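A sketch of one way to proceed, with made-up 10 Hz data upsampled by a factor of 5: the zeros come from .sum() over empty bins (the sum of nothing is 0, and interpolate() only fills NaN), so aggregating with .mean() instead leaves NaNs that interpolate() can fill.

    import numpy as np
    import pandas as pd

    # made-up 10 Hz data: floating-point seconds plus a couple of channels
    t = np.arange(0.0, 2.0, 0.1)
    df = pd.DataFrame({"chan_a": np.sin(t), "chan_b": np.cos(t)},
                      index=pd.to_datetime(t, unit="s"))

    # upsample by 5x (10 Hz -> 50 Hz, i.e. 20 ms bins): .mean() leaves NaN in
    # the empty bins, which interpolate() then fills; .sum() would leave zeros
    up = df.resample("20ms").mean().interpolate(method="time")

If the original samples must stay exactly where they are rather than being binned to Xms boundaries, the alternative is to reindex to the union of the old and new index and interpolate over that.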

Use date as independent variable for multi-target/multi-variate regression - python

I need to use multi-variate linear regression for my project, where I have two dependent variables: mean_1 and mean_2. The independent variable is date in YYYY-mm-dd format. I have been going through various Stack Overflow posts to understand how to use date as a variable in regression. Some suggest converting the date to a numerical value (https://stackoverflow.com/a/40217971/13713750), while the other option is to convert the date to dummy variables.
What I don't understand is how to convert every date in the dataset to a dummy variable and use it as an independent variable. Is it even possible, or are there better ways to use date as an independent variable?
Note: I would prefer keeping date in date format so it is easy to plot and analyse the results of the regression. Also, I am working with pyspark but I can switch to pandas if necessary, so any examples of implementations would be helpful. Thanks!
You could create new columns year, month, day_of_year, day_of_month, day_of_week. You could also add some binary columns like is_weekday, is_holiday. In some cases it is beneficial to add third-party data, like daily weather statistics for example (I was working on a case where extra daily weather data proved very useful). It really depends on the domain you're working in. Any of those columns could unveil some pattern behind your data.
As for dummy variables, converting month and day_of_week to dummies makes sense.
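A minimal pandas sketch of the columns described above, assuming a dataframe with a datetime column date and the two targets from the question (the holiday flag and any third-party data are left out):

    import pandas as pd

    # made-up frame with the date column and the two targets from the question
    df = pd.DataFrame({
        "date": pd.to_datetime(["2021-01-04", "2021-01-09", "2021-07-10"]),
        "mean_1": [1.0, 1.2, 0.9],
        "mean_2": [3.1, 3.0, 2.8],
    })

    # calendar features
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["day_of_month"] = df["date"].dt.day
    df["day_of_year"] = df["date"].dt.dayofyear
    df["day_of_week"] = df["date"].dt.dayofweek  # Monday = 0
    df["is_weekday"] = (df["day_of_week"] < 5).astype(int)

    # dummy-encode month and day_of_week, as suggested above
    df = pd.get_dummies(df, columns=["month", "day_of_week"], drop_first=True)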
Another option is to build a model for each month.
If you want to transform a date to a numeric value (though I don't recommend it), you can take the elapsed time since a fixed reference date, for example as whole days:

    (pd.to_datetime(df.date) - pd.Timestamp('1970-01-01')).dt.days

You can do the same but with the total number of seconds:

    (pd.to_datetime(df.date) - pd.Timestamp('1970-01-01')).dt.total_seconds()
Also, you can use a baseline date, subtract it from your date variable and obtain the number of days. This gives you an integer that makes sense: a bigger value means a date further in the future, while a smaller value means an older date. This value makes sense to me as an independent variable in a model.
First, we create a baseline date (it can be whatever you want) and add it to the dataframe in a column static:

    df['static'] = pd.Timestamp('2017-06-28')

Then we obtain the difference in days between your date and the static date:

    df['days'] = (df['date'] - df['static']).dt.days

And there you will have a number ready to be used as an independent variable.
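As a usage note, the days column can then feed a regression with both targets at once. The sketch below uses scikit-learn and made-up numbers, which are my assumptions rather than anything from the question (which mentions pyspark and pandas):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # hypothetical data with the two targets from the question
    df = pd.DataFrame({
        "date": pd.to_datetime(["2021-01-01", "2021-01-08",
                                "2021-01-15", "2021-01-22"]),
        "mean_1": [1.0, 1.1, 1.3, 1.2],
        "mean_2": [4.0, 3.8, 3.9, 3.7],
    })
    df["static"] = pd.Timestamp("2017-06-28")
    df["days"] = (df["date"] - df["static"]).dt.days

    # LinearRegression fits both targets at once when y has two columns
    model = LinearRegression().fit(df[["days"]], df[["mean_1", "mean_2"]])
    print(model.coef_)  # one row of coefficients per target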

Take dates and times from multiple columns to one datetime object with Python

I've got a dataset with multiple time values as below.
Area,Year,Month,Day of Week,Time of Day,Hour of Day
x,2016,1,6.0,108,1.0
z,2016,1,6.0,140,1.0
n,2016,1,6.0,113,1.0
p,2016,1,6.0,150,1.0
r,2016,1,6.0,158,1.0
I have been trying to transform this into a single datetime object to simplify the dataset and be able to do proper time series analysis against it.
For some reason I have been unable to get the right outcome using the datetime library from Python. Would anyone be able to point me in the right direction?
Update - Example of stats here.
https://data.pa.gov/Public-Safety/Crash-Incident-Details-CY-1997-Current-Annual-Coun/dc5b-gebx/data
I don't think there is a week column. Hmm. I wonder if I've missed something?
Any suggestions would be great. I'm really just looking to simplify this dataset, maybe even create another table/sheet for the causes of the crash, as there are a lot of superfluous columns taking up a lot of space, which could be labeled with simple ints.
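One possible sketch, using the sample rows above: pd.to_datetime can assemble timestamps from year/month/day component columns, but this data has no day of month, only a day of week, so the code below makes the explicit assumption that "Day of Week" uses 1 = Monday ... 7 = Sunday and maps each row onto the first matching weekday of its month. If the full export has a crash-date or day-of-month column, use that instead.

    import pandas as pd

    # the sample rows from the question
    df = pd.DataFrame({
        "Area": ["x", "z", "n", "p", "r"],
        "Year": [2016] * 5,
        "Month": [1] * 5,
        "Day of Week": [6.0] * 5,
        "Time of Day": [108, 140, 113, 150, 158],
        "Hour of Day": [1.0] * 5,
    })

    # build a timestamp for the first day of each year/month, then shift it
    # forward to the first occurrence of the given weekday, then add the hour
    base = pd.to_datetime(pd.DataFrame({"year": df["Year"],
                                        "month": df["Month"],
                                        "day": 1}))
    offset = (df["Day of Week"].astype(int) - 1 - base.dt.dayofweek) % 7
    df["datetime"] = (base
                      + pd.to_timedelta(offset, unit="D")
                      + pd.to_timedelta(df["Hour of Day"], unit="h"))
    print(df[["Area", "datetime"]])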
