Predicting future values with datetime data using Linear Regression - python

As the title says, I'm trying to predict future values from dates with linear regression. I'm working for a company that wants to predict, months or maybe years ahead, what the solar radiation value will be. We have DataFrames (pandas) in the following format:
Date        Hour   Pressure (hPa)  Radiation (KJ/m²)  Temperature (C)  Humidity (%)
2021-11-05  00:00  911.6           0.10               14.9             96.0
2021-11-05  01:00  911.0           0.60               14.3             93.0
2021-11-05  02:00  913.9           0.50               13.6             92.0
2021-11-05  03:00  913.5           1.00               12.9             91.0
2021-11-05  04:00  913.2           2.90               12.6             92.0
As those data are collected from real stations across the country, they are susceptible to errors. We identify those errors, such as blank or inconsistent values, and remove them, so there will be gaps in time in the dataframe. The gaps could be just a few hours, or months with no data at all. Another thing to think about is whether we keep the night-time data. Radiation during those hours is basically zero, but we could still make use of the other columns. My teammate, who is doing the training, said the model performed better without the night data.
We're thinking about the best method of doing this. We had a column named Rain, but a member of our team said it would be better if we dropped it, because Humidity and Pressure are already correlated with it.
Now, linear regression can't work with dates directly, so what can we do?
We can discard the Hour column and convert Date to ordinal.
We can combine Date and Hour to a new column, Time, consisting of a number of seconds since the epoch.
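In code, the two options would look roughly like this (just a sketch, assuming the dataframe is called df with the Date and Hour columns shown above; the column names Date_ordinal and Time are only illustrative):
import pandas as pd
# Option 1: discard Hour and convert Date to an ordinal day number
df["Date_ordinal"] = pd.to_datetime(df["Date"]).map(pd.Timestamp.toordinal)
# Option 2: combine Date and Hour into seconds since the Unix epoch
timestamps = pd.to_datetime(df["Date"].astype(str) + " " + df["Hour"])
df["Time"] = timestamps.astype("int64") // 10**9  # nanoseconds -> seconds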
What do you suggest?
Thank you!

Related

pandas: how to avoid iterating through rows when you need to verify the data sequentially?

I have a dataframe which is composed of a timestamp and two variables:
A pressure measurement which varies sequentially, representing a specific process batch (in red);
A lab analysis, which is a measurement that characterizes each batch. The analysis always occurs at the end of the batch and its value remains constant until a new analysis is made. Caution: not every batch is analyzed, and I don't have a flag indicating when the batch started.
I need to create a dataframe which calculates, for each batch, the average, maximum and minimum pressure, and how long it took from start to end (timedelta).
I had an idea to loop through all analysis values from the end to the start and, every time I find a new analysis value OR the pressure drops below a certain value (since this is a characteristic of the process, all batches start with low pressure), consider that the batch start (to calculate the timedelta and to define the interval over which to take the pressure min, max, and average).
However, I know it is not efficient to loop through all dataframe rows (especially with 1 million rows), so, any ideas?
Dataset sample: https://cl1p.net/5sg45sf5 or https://wetransfer.com/downloads/321bc7dc2a02c6f713963518fdd9271b20201115195604/08c169
Edit: there is no clear/ready indication of when a batch starts in the current data (as someone asked), but you can identify a batch by the following characteristics:
Every batch starts with pressure below 30, rising quickly (in less than one hour) up to 61.
Then it stabilizes around 65 (the plateau value can be anywhere between 61 and 70) and stays there for at least two and a half hours.
It ends with a pressure drop (taking less than one hour) to a value below 30.
The cycle repeats.
Note: there can be smaller/shorter peaks between two valid batches, but these should not be considered batches.
Thanks!
This solution assumes that the batches change when the value of lab analysis changes.
First, I'll plot those changes, so we can get an idea of how frequently the value changes:
df['lab analysis'].plot()
There are not many changes, so we just need to identify these:
# keep only the rows where the lab analysis value changes (the first diff() is NaN, so exclude it)
df_shift = df.loc[(df['lab analysis'].diff() != 0) & df['lab analysis'].diff().notna()]
df_shift
time pressure lab analysis
2632 2020-09-15 19:52:00 356.155 59.7
3031 2020-09-16 02:31:00 423.267 59.4
3391 2020-09-16 08:31:00 496.583 59.3
4136 2020-09-16 20:56:00 625.494 59.4
4971 2020-09-17 10:51:00 469.114 59.2
5326 2020-09-17 16:46:00 546.989 58.9
5677 2020-09-17 22:37:00 53.730 59.0
6051 2020-09-18 04:51:00 573.789 59.2
6431 2020-09-18 11:11:00 547.015 58.7
8413 2020-09-19 20:13:00 27.852 58.5
10851 2020-09-21 12:51:00 570.747 58.9
15816 2020-09-24 23:36:00 553.846 58.7
Now we can run a loop over these few changes, categorize each batch, and then compute the descriptive statistics:
index_shift = df_shift.index
i = 0
batch = 1
for shift in index_shift:
    df.loc[i:shift, 'batch number'] = batch
    batch = batch + 1
    i = shift
stats = df.groupby('batch number')['pressure'].describe()[['mean','min','max']]
Then compute the time differences and add them to stats as well:
df_shift.loc[0] = df.iloc[0, :3]   # prepend the very first row so batch 1 has a start time
df_shift.sort_index(inplace=True)
time_difference = [*df_shift['time'].diff()][1:]   # drop the leading NaT
stats['duration'] = time_difference
stats
mean min max duration
batch number
1.0 518.116150 24.995 671.315 1 days 19:52:00
2.0 508.353105 27.075 670.874 0 days 06:39:00
3.0 508.562450 26.715 671.156 0 days 06:00:00
4.0 486.795097 25.442 672.548 0 days 12:25:00
5.0 491.437620 24.234 671.611 0 days 13:55:00
6.0 515.473651 29.236 671.355 0 days 05:55:00
7.0 509.180860 25.566 670.714 0 days 05:51:00
8.0 490.876639 25.397 671.134 0 days 06:14:00
9.0 498.757555 24.973 670.445 0 days 06:20:00
10.0 497.000796 25.561 670.667 1 days 09:02:00
11.0 517.255608 26.107 669.476 1 days 16:38:00
12.0 404.859498 20.594 672.566 3 days 10:45:00
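If you want to avoid the explicit loop entirely, here is a vectorized sketch of the same idea (the boundary rows where the value changes may end up in the adjacent batch compared with the loop above, and it assumes the 'time' column is already a datetime):
# start a new batch wherever the lab analysis value changes
change = df['lab analysis'].diff().ne(0) & df['lab analysis'].diff().notna()
df['batch number'] = change.cumsum() + 1
stats = df.groupby('batch number').agg(
    mean=('pressure', 'mean'),
    min=('pressure', 'min'),
    max=('pressure', 'max'),
    duration=('time', lambda t: t.max() - t.min()),
)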

How to search missing values from timestamp which are not in regular interval

I have a dataset like this, with data at 10-second intervals.
rec NO2_RAW NO2
0 2019-05-31 13:42:15 0.01 9.13
1 2019-05-31 13:42:25 17.0 51.64
2 2019-05-31 13:42:35 48.4 111.69
The timestamps are not consistent throughout the table. There are instances where, after a long gap, the timestamps resume at a new time; for example, after 2019-05-31 16:00:00 the next record is at 2019-06-01 00:00:08.
I want to fill in the missing timestamps, based on the expected 10-second spacing between consecutive rows, and assign NaN values to the missing times.
I saw this example Search Missing Timestamp and display in python? but it is meant for regularly spaced data. I want to calculate a 15-minute moving average from this data, so I need the data on a consistent grid.
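For reference, this is the kind of regularization I have in mind (just a sketch; I'm assuming the timestamp column is named rec, as shown above):
import pandas as pd
df['rec'] = pd.to_datetime(df['rec'])
df = df.set_index('rec')
# reindex onto a regular 10-second grid; missing timestamps get NaN in every column
full_index = pd.date_range(df.index.min(), df.index.max(), freq='10s')
df_regular = df.reindex(full_index)
# the 15-minute moving average then works on the regular grid
df_regular['NO2_ma'] = df_regular['NO2'].rolling('15min').mean()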
Can someone please help?

How to find the area below and above a point for a given time period using scipy?

I have a pandas dataset with date being the index and a number in the value column. There is one year's worth of data.
How can I find the area (integral) below and above each date's value for the next two months using scipy.integrate?
E.g. If 2009-01-01 has 5 as the value, I am trying to find the integral below and above 5 for the next two months, depending on the points for the next two months.
EDIT: I guess I don't know what to use as the function, since the function is unknown and I only have points to use for the integration. I am thinking I may have to integrate day by day and sum up over the two months?
Below is a sample of my dataset:
DATE Y
2008-01-01 4
2008-01-02 10.4
2008-01-03 2
2008-01-04 9
2008-01-05 4.3
2008-01-06 7
2008-01-07 8.2
2008-01-08 5
2008-01-09 6.5
2008-01-10 2.3
...
2008-02-28 6.6
2008-03-01 7
2008-03-02 5.4
My objective is to start from 2008-01-01 with a value of 4, use that as the reference point, and find the integral below and above 4 (i.e. between 4 and each day's y value) for the next two months. So it is not a rolling integral but a forward-looking one.
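Something like this is what I'm imagining (a sketch only; it assumes df has a DatetimeIndex and a Y column as in the sample, and it ignores the small error where the curve crosses the reference between points):
import numpy as np
import pandas as pd
from scipy.integrate import trapezoid
start = pd.Timestamp('2008-01-01')
window = df.loc[start:start + pd.DateOffset(months=2)]
ref = window['Y'].iloc[0]                      # reference value (4 here)
x = (window.index - window.index[0]).days      # time axis in days
diff = window['Y'].to_numpy() - ref
# integrate the positive and negative deviations separately with the trapezoidal rule
area_above = trapezoid(np.clip(diff, 0, None), x)
area_below = trapezoid(np.clip(-diff, 0, None), x)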

How to remove NaN values from corr() function output

EDITED TO SHOW EXAMPLE OF ORIGINAL DATAFRAME:
df.head(4)
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
I've successfully generated the following dataframe using get_dummies():
wedding_seasons = pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )
wedding_seasons.head(3)
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
The goal of the above is to help assess frequency of weddings across seasons, so I've used corr() to generate the following result:
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
I'm unsure why the weddings column is generating NaN values, but my gut feeling is that it originates from how I originally created wedding_seasons. Any guidance would be greatly appreciated so that I can properly assess the column correlations.
I don't think what you're interested in seeing here is the "correlation".
All of the columns in the dataframe wedding_seasons contain floating point values; however, if my suspicions are correct, the rows in your original dataframe df contain something like transaction records, where each row corresponds to an individual.
Please tell me if I'm incorrect, but I'll proceed with my reasoning.
Correlation measures, intuitively, the tendency of values to vary together/against each other within the same observation (e.g. if X and Y are negatively correlated, then when we see X above its mean, we'd expect Y to be below its mean).
However, what you have here is data where, if one transaction is summer, then categorically it cannot also be winter. When you create wedding_seasons, pandas creates dummy variables that are treated as floating-point values when computing the correlation matrix; since it's impossible for any row to contain two 1.0 season entries at the same time, your resulting correlation matrix is bound to have negative entries everywhere.
As for the NaNs: the weddings column is constant (every row is 1.0 after filtering to weddings), so it has zero variance and its correlation with anything is undefined, which is why that row and column are all NaN. You can simply drop the weddings column before calling corr():
wedding_seasons = wedding_seasons.drop(columns=['weddings'])
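If the underlying goal is really to see how often weddings occur in each season, a plain count per season (a sketch, assuming the original df shown in the question) may be more direct than a correlation matrix:
# hypothetical sketch: count wedding rows per season in the original df
wedding_counts = df.loc[df['category'] == 'weddings', 'season'].value_counts()
print(wedding_counts)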

Replace NaN or missing values with rolling mean or other interpolation

I have a pandas dataframe with monthly data that I want to compute a 12-month moving average for. However, data for every month of January is missing (NaN), so I am using
pd.rolling_mean(data["variable"], 12, center=True)
but it just gives me all NaN values.
Is there a simple way that I can ignore the NaN values? I understand that in practice this would become an 11-month moving average.
The dataframe has other variables which have January data, so I don't want to just throw out the January rows and do an 11-month moving average.
There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp': [65, 50, 45, np.nan, 40, 43]}).set_index('month')
You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in #user394430's answer.)
df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3'] = df['temp'].rolling( 3,center=True,min_periods=1).mean()
Those are improvements, but they still have the problem of overwriting existing values with rolling means. To avoid this, you could combine them with the update() method (see the pandas documentation for update()).
df['update'] = df['rollmean3']
df['update'].update( df['temp'] ) # note: this is an inplace operation
There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.
df['ffill'] = df['temp'].ffill() # previous month
df['bfill'] = df['temp'].bfill() # next month
df['interp'] = df['temp'].interpolate() # mean of prev/next
In this case, interpolate() defaults to simple linear interpolation, but you have several other interpolation options as well. See the documentation on pandas interpolate for more info, or this Stack Overflow question:
Interpolation on DataFrame in pandas
Here is the sample data with all the results:
temp rollmean12 rollmean3 update ffill bfill interp
month
10 65.0 48.6 57.500000 65.0 65.0 65.0 65.0
11 50.0 48.6 53.333333 50.0 50.0 50.0 50.0
12 45.0 48.6 47.500000 45.0 45.0 45.0 45.0
1 NaN 48.6 42.500000 42.5 45.0 40.0 42.5
2 40.0 48.6 41.500000 40.0 40.0 40.0 40.0
3 43.0 48.6 41.500000 43.0 43.0 43.0 43.0
In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
The real key is having min_periods=1. Also, as of pandas 0.18, the proper way to call this is via a Rolling object. Therefore, your code should be
data["variable"].rolling(min_periods=1, center=True, window=12).mean()
