I have a datetime-indexed dataframe with one entry per hour of the year (the format is "2019-01-01 00:00:00", for example).
I wrote a program that plots every week, but some of the plots I obtain look wrong.
I suspect it may be a continuity problem in my dataframe, i.e. some data that isn't indexed in the right place, but I don't know how to check this.
If someone has a clue, it would help me a lot!
Have a nice day all
Edit: I'll try to provide some code.
First of all, I can't share the exact data I'm using since it's professional, so I've adapted my code to a randomly generated dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import os

mpl.rc('figure', max_open_warning=0)

# hourly index over two years; date_range already returns datetimes,
# so no extra to_datetime conversion is needed
df = pd.DataFrame({'datetime': pd.date_range('2019-01-01', '2020-12-31',
                                             freq='1H', closed='left')})  # on pandas >= 1.4 use inclusive='left'
df['week'] = df['datetime'].dt.isocalendar().week
df['month'] = df['datetime'].dt.month
df = df.set_index('datetime')

# random stand-ins for the real (confidential) data columns
for col in ['data1', 'data2', 'data3', 'data4']:
    df[col] = np.random.rand(len(df))

df = df[['data1', 'data2', 'data3', 'data4', 'week', 'month']]
df19 = df.loc['2019-01':'2019-12']
df20 = df.loc['2020-01':'2020-12']
if not os.path.exists('mypath/Programmes/Semaines/2019'):
    os.mkdir('mypath/Programmes/Semaines/2019')

# Generate all the 2019 plots for one column and save them in the right
# folder, skipping the first week of the year because it is buggy
def graph(a):
    for i in range(2, 53):
        folder = 'mypath/Programmes/Semaines/2019/' + str(a)
        if not os.path.exists(folder):
            os.mkdir(folder)
        plt.figure(figsize=[20, 20])
        x = df19[[a]][df19["week"] == i]
        plt.plot(x)
        name = str(a) + "_" + str(i)
        plt.savefig(os.path.join(folder, name))
    return

# note: df19.columns also includes the helper 'week' and 'month' columns
for j in df19.columns:
    graph(j)
Hoping this helps even though I'm not providing the data directly :/
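A quick way to check the continuity the question asks about is to compare the index against a complete hourly range and look at the steps between consecutive timestamps. A minimal sketch on toy data (the column name is made up):

```python
import pandas as pd
import numpy as np

# toy hourly frame with two rows dropped to simulate gaps
idx = pd.date_range('2019-01-01', periods=48, freq='H')
df = pd.DataFrame({'data1': np.arange(48)}, index=idx).drop(idx[[5, 20]])

# 1) any step that is not exactly one hour reveals a gap (or a duplicate)
steps = df.index.to_series().diff()
gaps = steps[steps != pd.Timedelta('1H')].iloc[1:]  # skip the leading NaT
print(gaps)

# 2) compare against the full hourly range: the difference lists missing hours
full = pd.date_range(df.index.min(), df.index.max(), freq='H')
missing = full.difference(df.index)
print(missing)
```

`df.index.is_monotonic_increasing` and `df.index.has_duplicates` are also worth checking before plotting week by week.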
I am using PyCharm to create plots of data, following along with a Kaggle tutorial on seaborn. The bar plot shows flight delays across 12 months; in the tutorial the x axis runs from 1 to 12, but when I execute the code myself it only goes up to 11.
I am very new to Python and to coding in general and am trying to teach myself, but I'm having a lot of trouble navigating PyCharm and solving this issue.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
print("Setup Complete")
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv')
print(flight_delays)
plt.figure(figsize=(10,6))
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
plt.ylabel("Arrival delay (In minutes)")
plt.title("Average Arrival Delay for Spirit Airline Flights, by Month")
plt.show()
I have tried using xlim to show all 12 x values, but that does not work for me, or I don't understand how to use the command.
https://www.kaggle.com/code/alexisbcook/bar-charts-and-heatmaps/tutorial
Here is the link to the tutorial I am following as well.
Thank you
I assume you use https://www.kaggle.com/datasets/alexisbcook/data-for-datavis?select=flight_delays.csv ?
When you read a CSV with pandas, the index will by default be numbers starting from 0. So if there are 12 months (rows), the index runs from 0 to 11, unlike the Month column, which contains the numbers 1 to 12.
You can either replace the x argument with the Month column instead of the index:
sns.barplot(x=flight_delays['Month'], y=flight_delays['NK'])
Or you can first set the Month column as the index and then use the same command as before:
flight_delays.set_index('Month', inplace=True)
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
You could also explicitly set which column should be used as the index when you call read_csv; then you don't need to make any other changes to your code:
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv', index_col='Month')
I've got an hourly time series over the stretch of a year. I'd like to display daily and/or monthly aggregated values along with the source data in a plot. The most solid way would presumably be to add those aggregated values to the source dataframe and take it from there. I know how to take an hourly series like this:
And show hour by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown together with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
import pandas as pd
import numpy as np

np.random.seed(1)
# ISO date strings avoid any day-first/month-first ambiguity
time = pd.date_range(start='2020-01-01', end='2020-12-31', freq='1H')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time': time, 'A': np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()

times = pd.DatetimeIndex(df1.index)
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
You are looking for the step function, and also a different way to group:
# replace '7D' with '1D' to match your code
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()
plt.step(df2.index, df2.A, c='r')
plt.plot(df1.index, df1.A)
Output:
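A self-contained variant of the same idea, using resample instead of groupby/floor, which generalizes to the daily, weekly and monthly aggregations the question asks for (the frequencies and figure layout here are illustrative choices):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # so the sketch also runs headless
import matplotlib.pyplot as plt

np.random.seed(1)
time = pd.date_range('2020-01-01', '2020-12-31', freq='1H')
df1 = pd.DataFrame({'A': np.cumsum(np.random.uniform(-1, 1, len(time)))},
                   index=time)

fig, axes = plt.subplots(3, 1, figsize=(10, 9), sharex=True)
for ax, freq in zip(axes, ['D', 'W', 'M']):  # daily, weekly, monthly means
    agg = df1['A'].resample(freq).mean()
    ax.plot(df1.index, df1['A'], alpha=0.5)
    # note: resample labels bins at the left or right edge depending on
    # the frequency, so the step alignment is approximate here
    ax.step(agg.index, agg.values, where='post', c='r')
    ax.set_title('hourly data with ' + freq + ' mean')
```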
My database has the following format:
database
I'm trying to get a count of each region grouped by month, and I've tried different methods but still can't figure it out. I keep getting a total count for each month but can't get it to break down by region within each month.
The last method I tried was set_index:
import pandas as pd
import numpy as np
import datetime
%matplotlib inline

# copy first, then convert: ax has to exist before its columns are used
ax = report[['date', 'region']].copy()
ax['date'] = pd.to_datetime(ax['date'])
ax['month'] = pd.DatetimeIndex(ax['date']).month
ax = ax.set_index('date').groupby(pd.Grouper(freq='M')).count()
ax
But this one returns a total count of occurrences per month, without the breakdown by region.
output
Any ideas how to solve it?
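One possible approach (a sketch with made-up sample data, since the original report frame isn't shown) is to group by a monthly Grouper and the region column together, then unstack so each region gets its own column:

```python
import pandas as pd

# made-up stand-in for the original 'report' frame
report = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-05', '2021-01-20', '2021-02-03',
                            '2021-02-10', '2021-02-25']),
    'region': ['East', 'West', 'East', 'East', 'West'],
})

counts = (report.set_index('date')
                .groupby([pd.Grouper(freq='M'), 'region'])
                .size()
                .unstack(fill_value=0))
print(counts)  # one row per month, one column per region
```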
I have data on logarithmic returns of a variable in a pandas DataFrame. I would like to turn these returns into an indexed time series that starts from 100 (or any arbitrary number). This kind of operation is very common, for example, when creating an inflation index or when comparing two series of different magnitudes:
So the first value on, say, Jan 1st 2000 is set equal to 100, the next value on Jan 2nd 2000 equals 100 * exp(return_2000_01_02), and so on. Example below:
I know that I can loop through rows in a Pandas DataFrame using .iteritems() as presented in this SO question:
iterating row by row through a pandas dataframe
I also know that I can turn the DataFrame into a numpy array, loop through the values in that array and turn the numpy array back to a Pandas DataFrame. The .as_matrix() method is explained here:
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.html
An even simpler way to do it is to iterate the rows by using the Python and numpy indexing operators [] as documented in Pandas indexing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
The problem is that all these solutions (except for iteritems) work "outside" pandas and are, from what I have read, inefficient.
Is there a way to create an indexed time series using pure pandas? If not, could you please suggest the most efficient way to do this? Finding solutions is surprisingly difficult, because "index" and "indexing" have a specific meaning in pandas, which is not what I am after this time.
You can use a vectorized approach instead of a loop/iteration:
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, 0.01, -0.02, 0.05, 0.07, 0.01, -0.01])})
df['series'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
#In [29]: df
#Out[29]:
# return series
#0 NaN 100.000000
#1 0.01 101.005017
#2 -0.02 99.004983
#3 0.05 104.081077
#4 0.07 111.627807
#5 0.01 112.749685
#6 -0.01 111.627807
I have created a framework to index prices in pandas quickly!
See the file on my GitHub below:
https://github.com/meinerst/JupyterWorkflow
It shows how you can pull prices from Yahoo Finance and how you can work with your existing dataframes.
I can't show the dataframe tables here. If you want to see them, follow the GitHub link.
Indexing financial time series (pandas)
This example uses data pulled from yahoo finance. If you have a dataframe from elsewhere, go to part 2.
Part 1 (Pulling data)
For this, make sure the yfinance package is installed.
#pip install yfinance
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import datetime as dt
Insert the yahoo finance tickers into the variable 'tickers'. You can choose as many as you like.
tickers =['TSLA','AAPL','NFLX','MSFT']
Choose timeframe.
start=dt.datetime(2019,1,1)
end= dt.datetime.now()
In this example, the 'Adj Close' column is selected.
assets=yf.download(tickers,start,end)['Adj Close']
Part 2 (Indexing)
To graph a comparable price development, the assets dataframe needs to be indexed. New columns are added for this purpose.
First, the indexing row is determined. In this case, the initial prices.
assets_indexrow=assets[:1]
New columns are added to the original dataframe with the indexed price developments.
Insert your desired indexing value below. In this case, it is 100.
for ticker in tickers:
    assets[ticker + '_indexed'] = (assets[ticker] / assets_indexrow[ticker].iloc[0]) * 100
The original price columns are then dropped.
assets.drop(columns =tickers, inplace=True)
Graphing the result.
plt.figure(figsize=(14, 7))
for c in assets.columns.values:
plt.plot(assets.index, assets[c], lw=3, alpha=0.8,label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Value Change')
I can't insert the graph due to limited reputation points, but see here:
Indexed Graph
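The per-ticker loop above can also be written as a single vectorized division: divide every column by its first row and scale to 100. A sketch with made-up prices:

```python
import pandas as pd

# made-up prices; any frame of positive values works the same way
assets = pd.DataFrame({'TSLA': [50.0, 55.0, 60.0],
                       'AAPL': [200.0, 210.0, 190.0]},
                      index=pd.date_range('2019-01-01', periods=3))

# every column starts at 100; later rows show the relative development
indexed = assets.div(assets.iloc[0]).mul(100)
print(indexed)
```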
I have a simple Pandas DataFrame containing columns 'valid_time' and 'value'. The frequency of the sampling is roughly hourly, but irregular and with some large gaps. I want to be able to efficiently pull out all rows for a given day (i.e. within a calender day). How can I do this using DataFrame.where() or something else?
I naively want to do something like this (which obviously doesn't work):
dt = datetime.datetime(<someday>)
rows = data.where( data['valid_time'].year == dt.year and
data['valid_time'].day == dt.day and
data['valid_time'].month == dt.month)
There are at least a few problems with the above code. I am new to pandas, so I am fumbling with something that is probably straightforward.
Pandas is absolutely terrific for things like this. I would recommend making your datetime field your index, as can be seen here. If you give a little more information about the structure of your dataframe, I would be happy to include more detailed directions.
Then you can easily grab all rows from a date using df.loc['2014-01-12'], which grabs everything from Jan 12, 2014. You can edit that to get everything from January with df.loc['2014-01']. If you want data from a range of dates and/or times, you can do something like:
df.loc['2014-01':'2014-02']
Pandas is pretty powerful, especially for time-indexed data.
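A runnable sketch of the approach described above (in current pandas, row selection by date string should go through .loc):

```python
import pandas as pd
import numpy as np

# hourly toy data with the datetime column as the index
times = pd.date_range('2014-01-01', '2014-01-06', freq='H')
data = pd.DataFrame({'value': np.arange(len(times))}, index=times)

one_day = data.loc['2014-01-03']             # every row on Jan 3, 2014
jan = data.loc['2014-01']                    # the whole month
span = data.loc['2014-01-02':'2014-01-04']   # inclusive date-range slice
print(len(one_day), len(jan), len(span))
```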
Try this (it just continues your idea):
import pandas as pd
import numpy as np
import datetime

times = pd.date_range('2014/01/01', '2014/01/06', freq='H')
values = np.random.randint(0, 11, times.size)  # random_integers is deprecated
data = pd.DataFrame({'valid_time': times, 'values': values})

dt = datetime.datetime(2014, 1, 3)
rows = data['valid_time'].apply(
    lambda x: x.year == dt.year and x.month == dt.month and x.day == dt.day
)
print(data[rows])
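A vectorized alternative to the row-wise apply above: normalize the timestamps to midnight and compare against the target day, which avoids calling a Python lambda per row:

```python
import pandas as pd
import numpy as np
import datetime

times = pd.date_range('2014-01-01', '2014-01-06', freq='H')
data = pd.DataFrame({'valid_time': times,
                     'values': np.random.randint(0, 11, times.size)})

dt = datetime.datetime(2014, 1, 3)
# .dt.normalize() floors each timestamp to 00:00 of its day
mask = data['valid_time'].dt.normalize() == pd.Timestamp(dt)
print(data[mask])
```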