How to display aggregated and non-aggregated values at the same time? - python

I've gout an hourly time series over the strecth of a year. I'd like to display daily, and/or monthly aggregated values along with the source data in a plot. The most solid way would supposedly be to add those aggregated values to the source dataframe and take it from there. I know how to take an hourly series like this:
And show hour by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown togehter with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
import pandas as pd
import numpy as np
np.random.seed(1)
time = pd.date_range(start='01.01.2020', end='31.12.2020', freq='1H')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time':time, 'A':np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()
times = pd.DatetimeIndex(df1.index)
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
Code sample:

You are looking for step function, and also, a different way to groupby:
# replace '7D' with '1D' to match your code
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()
plt.step(df2.index, df2.A, c='r')
plt.plot(df1.index, df1.A)
Output:

Related

How to check continuity on a datetime index dataframe

I got a datetime indexed dataframe with 1 entry per hour of the year (format is "2019-01-01 00:00:00" for exemple).
I created a program which will plot every weeks, but some of the plots I obtain are weird
I was thinking that it may be a continuity problem in my dataframe, some data that would'nt be indexed at the good place, but I don't know how to check this.
If someone got a clue, it'll help me a lot!
Have a nice day all
Edit : I'll try to provide you some code
First of all I can't provide you the exact data i'm using since it's professionnal, but I'll try to adapt my code to a randomly generate dataframe
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
mpl.rc('figure', max_open_warning = 0)
df = pd.DataFrame({'datetime': pd.date_range('2019-01-01', '2020-12-31',freq='1H', closed='left')})
df['datetime']=pd.to_datetime(df['datetime'])
df['week'] = df['datetime'].dt.isocalendar().week
df['month'] = df['datetime'].dt.month
df=df.set_index(['datetime'])
df=df[['data1','data2','data3','data4','week','month']]
df19=df.loc['2019-01':'2019-12']
df20=df.loc['2020-01':'2020-12']
if not os.path.exists('mypath/Programmes/Semaines/2019'):
os.mkdir('mypath/Programmes/Semaines/2019')
def graph(a): #Creating the function that will generate all the data I need for 2019 and place them in the good folder, skipping the 1st week of the year cause it's buggued
for i in range (2,53):
if not os.path.exists('mypath/Programmes/Semaines/2019/'+str(a)):
os.mkdir('mypath/Programmes/Semaines/2019/'+str(a))
folder='mypath/Programmes/Semaines/2019/'+str(a)
plt.figure(figsize=[20,20])
x=df19[[a]][(df19["week"]==i)]
plt.plot(x)
name=str(a)+"_"+str(i)
plt.savefig(os.path.join(folder,name))
return
for j in df19.columns :
graph(j)
Hoping this can help even if i'm not providing the datas directly :/

Cleaning timeseries data and Standardizing the sample rate in Python

I have a timeseries data and I would like to clean the data by approximating the missing data points and standardizing the sample rate.
Given the fact that there might be some unevenly spaced datapoints, I would like to define a function to get the timeseries and an interval X (e.g., 30 minutes or any other interval) as an input and gives the timeseries with points being spaced within X intervals as an output.
As you can see below, the periods are every 10 minutes but some data points are missing. So the algorithm should detect the missing times and remove them and create the appropriate times and generate the value for them. Then based on the defined function, the sample rate should be changed and standardized.
For approximating missing data and cleaning it, either average or linear interpolation would work.
Here is a part of raw data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Time": ["10:09:00","10:19:00","10:29:00","10:43:00","10:59:00 ", "11:09:00"],
"Value": ["378","378","379","377","376", "377"],
})
df
First of all you need to convert "Time"" into a datetime index. Make pandas recognize the dates as actual dates with df["Time"] = pd.to_datetime(df["Time"]). Then Set time as the index: df = df.set_index("Time").
Once you have the datetime index, you can do all sorts of time-based operations with it. In your case, you want to resample: df.resample('10T')
This leaves us with the following code:
df["Time"] = pd.to_datetime(df["Time"], format="%H:%S:%M")
df = df.set_index("Time")
df.resample('10T')
From here on you have a lot of options on how to treat cases in which you have missing data (fill / interpolate / ...), or in which you have multiple data points for one new one (average / sum / ...). I suggest you take a look at the pandas resampling api. For conversions and formatting between string and datetime refer to strftime.

Lag values and differences in pandas dataframe with missing quarterly data

Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below, the lower picture has complete data, the upper has gaps. Both pics show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year, MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
See for underlying cvs data this link for the full data and this link for the gapped data. There are two firms 1020180 and 1020201.
However, when I try Pandas shift method it fails when there are gaps, try yourself using the csv files and the code below. All columns (DiffEq, Dif1MV, Lag4MV) differ - for some quarters - from IssdQtr, D_MV_Q, L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
TIMESTAMP TYPE
0 2014-07-25 11:50:30.640 2
1 2014-07-25 11:50:46.160 3
2 2014-07-25 11:50:57.370 2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month then plot it as a bar chart with months in the x-axis (April 2014, May 2014... etc.). I managed to calculate these values using the code below
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(),dfWIM.TIMESTAMP.dt.year.max()+1):
for j in range(1,13):
print dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean()
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is ok as it is, and with some more work, I can map to results to correct month names, then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import pandas as pd
# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.

How do I find all rows with a certain date using Pandas?

I have a simple Pandas DataFrame containing columns 'valid_time' and 'value'. The frequency of the sampling is roughly hourly, but irregular and with some large gaps. I want to be able to efficiently pull out all rows for a given day (i.e. within a calender day). How can I do this using DataFrame.where() or something else?
I naively want to do something like this (which obviously doesn't work):
dt = datetime.datetime(<someday>)
rows = data.where( data['valid_time'].year == dt.year and
data['valid_time'].day == dt.day and
data['valid_time'].month == dt.month)
There's at least a few problems with the above code. I am new to pandas so am fumbling with something that is probably straightforward.
Pandas is absolutely terrific for things like this. I would recommend making your datetime field your index as can be seen here. If you give a little bit more information about the structure of your dataframe, I would be happy to include more detailed directions.
Then, you can easily grab all rows from a date using df['1-12-2014'] which would grab everything from Jan 12, 2014. You can edit that to get everything from January by using df[1-2014]. If you want to grab data from a range of dates and/or times, you can do something like:
df['1-2014':'2-2014']
Pandas is pretty powerful, especially for time-indexed data.
Try this (is just like the continuation of your idea):
import pandas as pd
import numpy.random as rd
import datetime
times = pd.date_range('2014/01/01','2014/01/6',freq='H')
values = rd.random_integers(0,10,times.size)
data = pd.DataFrame({'valid_time':times, 'values': values})
dt = datetime.datetime(2014,1,3)
rows = data['valid_time'].apply(
lambda x: x.year == dt.year and x.month==dt.month and x.day== dt.day
)
print data[rows]

Categories