Cleaning timeseries data and Standardizing the sample rate in Python

I have timeseries data and I would like to clean it by approximating the missing data points and standardizing the sample rate.
Since there might be some unevenly spaced datapoints, I would like to define a function that takes the timeseries and an interval X (e.g., 30 minutes or any other interval) as input and returns the timeseries with points spaced at X intervals as output.
As you can see below, the readings come every 10 minutes but some data points are missing. The algorithm should detect the missing timestamps, insert the appropriate times, and generate values for them. Then, based on the defined function, the sample rate should be changed and standardized.
For approximating the missing data, either averaging or linear interpolation would work.
Here is a part of raw data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Time": ["10:09:00","10:19:00","10:29:00","10:43:00","10:59:00 ", "11:09:00"],
"Value": ["378","378","379","377","376", "377"],
})
df

First of all you need to convert "Time" into a datetime index. Make pandas recognize the times as actual datetimes with df["Time"] = pd.to_datetime(df["Time"]). Then set "Time" as the index: df = df.set_index("Time").
Once you have the datetime index, you can do all sorts of time-based operations with it. In your case, you want to resample: df.resample('10T')
This leaves us with the following code:
df["Time"] = pd.to_datetime(df["Time"].str.strip(), format="%H:%M:%S")  # note the stray trailing space in "10:59:00 "
df = df.set_index("Time")
df.resample('10T')  # returns a Resampler; chain an aggregation such as .mean() or .interpolate()
From here on you have a lot of options on how to treat cases in which you have missing data (fill / interpolate / ...), or in which you have multiple data points for one new one (average / sum / ...). I suggest you take a look at the pandas resampling api. For conversions and formatting between string and datetime refer to strftime.
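Putting this together, a minimal sketch of the helper the question asks for might look like the following; the function name, the cast of Value to numeric, and the choice of linear interpolation are my own assumptions:
import pandas as pd

def standardize_sample_rate(df, interval="30min"):
    # assumption: df has a "Time" column of time strings and a "Value" column of numbers-as-strings
    out = df.copy()
    out["Time"] = pd.to_datetime(out["Time"].str.strip(), format="%H:%M:%S")
    out["Value"] = pd.to_numeric(out["Value"])
    out = out.set_index("Time")
    # resample to the requested interval and fill the gaps by linear interpolation
    return out.resample(interval).mean().interpolate(method="linear")

cleaned = standardize_sample_rate(df, "10min")  # evenly spaced 10-minute series
Calling it with "30min" instead gives the coarser, standardized series described in the question.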

Related

Two csv files read differently in python

I have two .csv files with data arranged date-wise. I want to compute the monthly accumulated value for each month and for all the years. Both csv files read without any error. However, while computing the monthly accumulated values, one time series (in one csv file) is handled correctly, but for the other time series the same code malfunctions. The only difference I notice is that when I read the first csv file (with a 'Date' and a 'Value' column, and a total of 826 rows), the dataframe has 827 rows (the last row being NaN). This NaN row is not observed for the other csv file.
Please note that my timeseries runs from 06-06-20xx to 01-10-20xx every year from 2008 to 2014. I am obtaining the monthly accumulated value for each month and then removing the zero values (for the months Jan-May and Nov-Dec). When my code runs, for the first csv I get monthly accumulated values starting from June 2008. But for the second, it starts from January 2008 (and has a non-zero value, which ideally should be zero).
Since I am new to Python coding, I cannot figure out where the issue is. Any help is appreciated. Thanks in advance.
Here is my code:
import pandas as pd
import numpy as np
# read the csv file
df_obs = pd.read_csv("..path/first_csv.csv")
df_fore = pd.read_csv("..path/second_csv.csv")
# convert 'Date' column to datetime index
df_obs['Date'] = pd.to_datetime(df_obs['Date'])
df_fore['Date'] = pd.to_datetime(df_fore['Date'])
# perform GroupBy operation over monthly frequency
monthly_accumulated_obs = df_obs.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
monthly_accumulated_fore = df_fore.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
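One thing worth checking before anything else: without an explicit format, pd.to_datetime has to guess between month-first and day-first, so a date like 01-10-2008 can land in January instead of October, which would explain the non-zero January sum. The sketch below replaces the two to_datetime lines above; the day-first assumption and the dropna for the stray trailing row are guesses about your files:
# assumption: dates are written day-first (e.g. 06-06-2008, 01-10-2008); use an explicit format string if you know it
df_obs['Date'] = pd.to_datetime(df_obs['Date'], dayfirst=True)
df_fore['Date'] = pd.to_datetime(df_fore['Date'], dayfirst=True)
# drop any all-NaN row picked up from a trailing blank line in the csv
df_obs = df_obs.dropna(subset=['Date'])
df_fore = df_fore.dropna(subset=['Date'])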
Sometimes more verbose but explicit solutions work better and are more flexible, so here's an alternative one, using convtools:
from datetime import date, datetime
from convtools import conversion as c
from convtools.contrib.tables import Table
# generate an ad hoc grouper;
# it's a simple function to be reused further
converter = (
    c.group_by(c.item("Date"))
    .aggregate(
        {
            "Date": c.item("Date"),
            "Observed": c.ReduceFuncs.Sum(c.item("Observed")),
        }
    )
    .gen_converter()
)
# read the stream of prepared rows
rows = (
    Table.from_csv("..path/first_csv.csv", header=True)
    .update(
        Date=c.call_func(
            datetime.strptime, c.col("Date"), "%m-%d-%Y"
        ).call_method("replace", day=1),
        Observed=c.col("Observed").as_type(float),
    )
    .into_iter_rows(dict)
)
# process them
result = converter(rows)

How to display aggregated and non-aggregated values at the same time?

I've got an hourly time series over the stretch of a year. I'd like to display daily and/or monthly aggregated values along with the source data in a plot. The most solid way would supposedly be to add those aggregated values to the source dataframe and take it from there. I know how to take an hourly series like this:
And show hour by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown together with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
import pandas as pd
import numpy as np
np.random.seed(1)
time = pd.date_range(start='01.01.2020', end='31.12.2020', freq='1H')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time':time, 'A':np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()
times = pd.DatetimeIndex(df1.index)
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
You are looking for a step function, and also a different way to group by:
import matplotlib.pyplot as plt

# replace '7D' with '1D' to match your code
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()
plt.step(df2.index, df2.A, c='r')
plt.plot(df1.index, df1.A)
Output:
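If you then want the same overlay for the other aggregation levels you mention (day, week, month, quarter, year), one option is to loop over resample frequencies. This is only a sketch of the idea built on the df1 from above, not something tailored to your real data:
import matplotlib.pyplot as plt

freqs = ['1D', 'W', 'MS', 'QS', 'YS']  # day, week, month start, quarter start, year start
fig, axes = plt.subplots(len(freqs), 1, figsize=(10, 12), sharex=True)
for ax, freq in zip(axes, freqs):
    agg = df1.resample(freq).mean()
    ax.plot(df1.index, df1.A, alpha=0.5)            # hourly source data
    ax.step(agg.index, agg.A, where='post', c='r')  # aggregated means as steps
    ax.set_title(f'hourly data with {freq} mean')
plt.tight_layout()
plt.show()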

Using a sampling of a datetime index to select the features (rows) in a pandas dataframe at those datetimes

I have a datetime index object that consists of the index values of randomly sampled data from a LARGER dataframe on which I am training a learner. I'd like to use the date time index e.g.
DatetimeIndex(['1911-11-18', '2015-05-02', '1934-08-15', '1950-09-16',
'1944-06-01', '2004-07-30', '1947-11-18', '1977-07-08',
'1945-05-31', '1944-01-31',
...
'1884-06-24', '1999-11-22', '1960-02-02', '1883-03-08',
'1952-11-19', '1993-02-04', '1965-04-26', '1885-09-30',
'1890-02-26', '2008-03-28'],
dtype='datetime64[ns]', length=300000, freq=None)
of each training example to go back to the full data frame and look up the target value on those days, AND THEN go 1 year into the future from that date to use as the real target.
The overall context is training on a random sample from time series data, and targeting a value in the future.
My big data frame is called toLearn. And the sample dataframe on which I am training is called dataSlice (a subset of toLearn).
Something like the following works for what I was trying to do.
# Find target 7 years after each training sample
indicesOfTrainSamples = trainSamples.index
indicesOfTarget = indicesOfTrainSamples + pd.Timedelta(weeks=7*52)
targets = []
for i in indicesOfTarget:
    targets.append(toLearn.loc[i])
targetSlices = pd.DataFrame(targets, index=indicesOfTarget)
targetFeature = targetSlices['targetValues']
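As an aside, the loop can usually be replaced by a single label-based lookup, which matters with 300,000 samples. This sketch assumes toLearn has a unique datetime index; reindex turns shifted dates that are missing from the frame into NaN rows instead of raising:
# vectorized alternative to the loop above
indicesOfTarget = trainSamples.index + pd.Timedelta(weeks=7 * 52)
targetSlices = toLearn.reindex(indicesOfTarget)  # missing dates become NaN rows
targetFeature = targetSlices['targetValues']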

Lag values and differences in pandas dataframe with missing quarterly data

Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both pictures show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year, MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying csv data, see this link for the full data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try Pandas' shift method it fails when there are gaps; try it yourself using the csv files and the code below. All columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
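The same merge pattern also works for the quarterly change (D_MV_Q): build a one-quarter lag key, merge in the previous quarter's MV, and subtract. The helper column names below (lagQ1, MV_prev_q) are just illustrative:
# previous quarter's MV via the same merge trick
dfg['lagQ1'] = dfg['date'] + QuarterEnd(-1)
prev_q = dfg[['Firm', 'date', 'MV']].rename(columns={'date': 'lagQ1', 'MV': 'MV_prev_q'})
dfg = pd.merge(dfg, prev_q, on=['Firm', 'lagQ1'], how='left')
dfg['D_MV_Q'] = dfg['MV'] - dfg['MV_prev_q']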

Pandas: creating an indexed time series [starting from 100] from returns data

I have data on logarithmic returns of a variable in a Pandas DataFrame. I would like to turn these returns into an indexed time series which starts from 100 (or any arbitrary number). This kind of operation is very common for example when creating an inflation index or when comparing two series of different magnitude:
So the first value in, say, Jan 1st 2000 is set to equal 100 and the next value in Jan 2nd 2000 equals 100 * exp(return_2000_01_02) and so on. Example below:
I know that I can loop through rows in a Pandas DataFrame using .iteritems() as presented in this SO question:
iterating row by row through a pandas dataframe
I also know that I can turn the DataFrame into a numpy array, loop through the values in that array and turn the numpy array back to a Pandas DataFrame. The .as_matrix() method is explained here:
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.html
An even simpler way to do it is to iterate the rows by using the Python and numpy indexing operators [] as documented in Pandas indexing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
The problem is that all these solutions (except for iteritems) work "outside" Pandas and are, according to what I have read, inefficient.
Is there a way to create an indexed time series purely in Pandas? If not, could you please suggest the most efficient way to do this? Finding solutions is surprisingly difficult, because index and indexing have a specific meaning in Pandas, which is not what I am after this time.
You can use a vectorized approach instead of a loop/iteration:
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, 0.01, -0.02, 0.05, 0.07, 0.01, -0.01])})
df['series'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
#In [29]: df
#Out[29]:
#   return      series
#0     NaN  100.000000
#1    0.01  101.005017
#2   -0.02   99.004983
#3    0.05  104.081077
#4    0.07  111.627807
#5    0.01  112.749685
#6   -0.01  111.627807
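If you want an arbitrary base rather than the hardcoded 100, the same vectorized idea can be wrapped in a tiny helper (the function name here is my own invention):
def index_from_log_returns(log_returns, base=100):
    # exponentiate the cumulative log return and rebase; NaN is treated as a zero return
    return base * np.exp(np.nan_to_num(log_returns.cumsum()))

df['series'] = index_from_log_returns(df['return'], base=100)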
I have created a framework to index prices in pandas quickly!
See my GitHub below for the file:
https://github.com/meinerst/JupyterWorkflow
It shows how you can pull prices from Yahoo Finance and/or how you can work with your existing dataframes.
I can't show the dataframe tables here. If you want to see them, follow the GitHub link.
Indexing financial time series (pandas)
This example uses data pulled from yahoo finance. If you have a dataframe from elsewhere, go to part 2.
Part 1 (Pulling data)
For this, make sure the yfinance package is installed.
#pip install yfinance
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import datetime as dt
Insert the yahoo finance tickers into the variable 'tickers'. You can choose as many as you like.
tickers =['TSLA','AAPL','NFLX','MSFT']
Choose timeframe.
start=dt.datetime(2019,1,1)
end= dt.datetime.now()
In this example, the 'Adj Close' column is selected.
assets=yf.download(tickers,start,end)['Adj Close']
Part 2 (Indexing)
To graph comparable price developments, the assets dataframe needs to be indexed. New columns are added for this purpose.
First, the indexing row is determined; in this case, the initial prices.
assets_indexrow=assets[:1]
New columns are added to the original dataframe with the indexed price developments.
Insert your desired indexing value below. In this case, it is 100.
for ticker in tickers:
    assets[ticker + '_indexed'] = (assets[ticker] / assets_indexrow[ticker][0]) * 100
The original price columns are then dropped:
assets.drop(columns =tickers, inplace=True)
Graphing the result.
plt.figure(figsize=(14, 7))
for c in assets.columns.values:
    plt.plot(assets.index, assets[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Value Change')
I can't insert the graph due to limited reputation points, but see here:
Indexed Graph
