Though Pandas has time series functionality, I am still struggling with DataFrames that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year; MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in market value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying csv data, see this link for the full data and this link for the gapped data. There are two firms: 1020180 and 1020201.
However, when I try the Pandas shift method, it fails when there are gaps; try it yourself using the csv files and the code below. All computed columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np

# read the gapped data and build date/quarter/year helper columns
dfg = pd.read_csv('example_soverflow_gaps.csv', low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year

# row-based differences/shifts per firm; these go wrong where quarters are missing
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm', 'year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
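To illustrate the problem with a minimal, made-up example (not the linked csv data): diff() and shift() count rows, not quarters, so when a quarter is missing they silently compare against the wrong period.

# Illustrative only: one firm with 2012Q3 missing
mini = pd.DataFrame({
    'Firm': [1020180] * 3,
    'Q': pd.PeriodIndex(['2012Q1', '2012Q2', '2012Q4'], freq='Q'),
    'MV': [100.0, 110.0, 130.0],
})
# diff(1) works on the previous *row*, so for 2012Q4 it returns
# 130 - 110 (Q4 minus Q2), not the true quarter-on-quarter change
mini['Dif1MV'] = mini.groupby('Firm')['MV'].diff(1)
print(mini)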
I solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a DataFrame with the keys (Firm and date) and the relevant variable (here MV):
lagset = dfg[['Firm', 'date', 'MV']].copy()
lagset.rename(columns={'MV': 'Lag_MV', 'date': 'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg = pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
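The same merge trick should work for the one-quarter lag needed for D_MV_Q; a sketch, reusing the columns above (the _check name is just illustrative):

# Sketch: the same merge idea, one quarter back, for the quarterly MV change
dfg['lagQ1'] = dfg['date'] + QuarterEnd(-1)

lag1 = dfg[['Firm', 'date', 'MV']].copy()
lag1.rename(columns={'MV': 'Lag1_MV', 'date': 'lagQ1'}, inplace=True)

dfg = pd.merge(dfg, lag1, on=['Firm', 'lagQ1'], how='left')
dfg['D_MV_Q_check'] = dfg['MV'] - dfg['Lag1_MV']   # NaN where the previous quarter is missing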
I have two .csv files with data arranged date-wise. I want to compute the monthly accumulated value for each month across all the years. Both csv files read without any error. However, when computing the monthly accumulated values, the code works correctly for one time series (in one csv file) but malfunctions for the other. The only difference I notice is that when I read the first csv file (with a 'Date' and a 'Value' column and a total of 826 rows), the resulting DataFrame has 827 rows, the last row being NaN. This NaN row does not appear for the other csv file.
Please note that my time series runs from 06-06 to 01-10 each year, for every year from 2008 to 2014. I am obtaining the monthly accumulated value for each month and then removing the zero values (for the months Jan-May and Nov-Dec). When my code runs, for the first csv I get monthly accumulated values starting from June 2008. But for the second, it starts from January 2008 (and has a non-zero value, which ideally should be zero).
Since I am new to Python coding, I cannot figure out where the issue is. Any help is appreciated. Thanks in advance.
Here is my code:
import pandas as pd
import numpy as np
# read the csv file
df_obs = pd.read_csv("..path/first_csv.csv")
df_fore = pd.read_csv("..path/second_csv.csv")
# convert 'Date' column to datetime index
df_obs['Date'] = pd.to_datetime(df_obs['Date'])
df_fore['Date'] = pd.to_datetime(df_fore['Date'])
# perform GroupBy operation over monthly frequency
monthly_accumulated_obs = df_obs.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
monthly_accumulated_fore = df_fore.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
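The zero-value filtering for Jan-May and Nov-Dec mentioned above isn't shown in the snippet; a minimal sketch of that step, assuming the column names used here:

# keep only the months that actually accumulated something (June-October)
monthly_accumulated_obs = monthly_accumulated_obs[monthly_accumulated_obs['Observed'] != 0]
monthly_accumulated_fore = monthly_accumulated_fore[monthly_accumulated_fore['Observed'] != 0]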
Sometimes more verbose but explicit solutions work better and are more flexible, so here's an alternative, using convtools:
from datetime import date, datetime
from convtools import conversion as c
from convtools.contrib.tables import Table
# generate an ad hoc grouper;
# it's a simple function to be reused further
converter = (
    c.group_by(c.item("Date"))
    .aggregate(
        {
            "Date": c.item("Date"),
            "Observed": c.ReduceFuncs.Sum(c.item("Observed")),
        }
    )
    .gen_converter()
)

# read the stream of prepared rows
rows = (
    Table.from_csv("..path/first_csv.csv", header=True)
    .update(
        Date=c.call_func(
            datetime.strptime, c.col("Date"), "%m-%d-%Y"
        ).call_method("replace", day=1),
        Observed=c.col("Observed").as_type(float),
    )
    .into_iter_rows(dict)
)
# process them
result = converter(rows)
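If I'm reading convtools right, result here is a list of dicts, one per month-start date; the zero months mentioned in the question could then be filtered out with plain Python, for example:

# result is a list of dicts like {"Date": datetime(2008, 6, 1), "Observed": 123.4}
non_zero = [row for row in result if row["Observed"] != 0]
for row in non_zero:
    print(row["Date"].strftime("%Y-%m"), row["Observed"])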
As part of a treatment for a health-related issue, I need to measure my liquid intake (along with some other parameters), registering the amount of liquid every time I drink. I have a DataFrame of several months of such registrations.
I want to sum my daily amount in an additional column (in red, image below).
As you may see, I would like to store it in the first column of the slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total']= df.groupby(df.Date)['Drank'].fillna(0).sum()
But that seems not to be the way to do it.
Grateful for any advice.
Thanks
Michael
Use the fact that False == 0.
The first row of each date group is the row where Date is not equal to Date.shift().
Then merge() the per-date sum back onto that row.
import numpy as np
import pandas as pd

## construct a sample data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20, 300, len(d)).astype(float)
A.ravel()[np.random.choice(A.size, A.size // 2, replace=False)] = np.nan
df = pd.DataFrame({"datetime": d, "Drank": A})
df = df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time).drop(columns=["datetime"]).loc[:, ["Date", "Time", "Drank"]]
## construction done

# the first row of each date has a different Date to shift() -> row == False (== 0)
# merge the per-date Total back onto that first row only
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank", "sum")).assign(row=0),
    on=["Date", "row"],
    how="left",
).drop(columns="row")
I've got an hourly time series over the stretch of a year. I'd like to display daily and/or monthly aggregated values along with the source data in a plot. The most solid way would supposedly be to add those aggregated values to the source DataFrame and take it from there. I know how to take an hourly series like this:
And show hour by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown together with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
import pandas as pd
import numpy as np
np.random.seed(1)
time = pd.date_range(start='01.01.2020', end='31.12.2020', freq='1H')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time':time, 'A':np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()
times = pd.DatetimeIndex(df1.index)
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
You are looking for a step function, and also a different way to group by:
import matplotlib.pyplot as plt

# replace '7D' with '1D' to match your code
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()

plt.step(df2.index, df2.A, c='r')
plt.plot(df1.index, df1.A)
Output:
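For coarser aggregations (month, quarter, year), a variant of the same idea, sketched here, is to group on to_period instead of floor and convert back to timestamps for plotting:

# Sketch: monthly aggregation with the same step-plot idea
df_m = df1.groupby(df1.index.to_period('M')).mean()
plt.step(df_m.index.to_timestamp(), df_m['A'], c='r', where='post')
plt.plot(df1.index, df1['A'])
plt.show()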
I have data on logarithmic returns of a variable in a Pandas DataFrame. I would like to turn these returns into an indexed time series which starts from 100 (or any arbitrary number). This kind of operation is very common for example when creating an inflation index or when comparing two series of different magnitude:
So the first value in, say, Jan 1st 2000 is set to equal 100 and the next value in Jan 2nd 2000 equals 100 * exp(return_2000_01_02) and so on. Example below:
I know that I can loop through rows in a Pandas DataFrame using .iteritems() as presented in this SO question:
iterating row by row through a pandas dataframe
I also know that I can turn the DataFrame into a numpy array, loop through the values in that array and turn the numpy array back to a Pandas DataFrame. The .as_matrix() method is explained here:
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.html
An even simpler way to do it is to iterate the rows by using the Python and numpy indexing operators [] as documented in Pandas indexing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
The problem is that all these solutions (except for the iteritems) work "outside" Pandas and are, according to what I have read, inefficient.
Is there a way to create an indexed time series using purely Pandas? And if not, could you please suggest the most efficient way to do this? Finding solutions is surprisingly difficult, because "index" and "indexing" have a specific meaning in Pandas, which is not what I am after this time.
You can use a vectorized approach instead of a loop/iteration:
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, 0.01, -0.02, 0.05, 0.07, 0.01, -0.01])})
df['series'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
#In [29]: df
#Out[29]:
# return series
#0 NaN 100.000000
#1 0.01 101.005017
#2 -0.02 99.004983
#3 0.05 104.081077
#4 0.07 111.627807
#5 0.01 112.749685
#6 -0.01 111.627807
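The same idea applied to a date-indexed series, as in the question's example (a sketch with made-up dates and returns):

# Sketch: index a date-indexed series of log returns to 100 at the start
dates = pd.date_range('2000-01-01', periods=5, freq='D')
ret = pd.Series([np.nan, 0.01, -0.02, 0.05, 0.07], index=dates)
indexed = 100 * np.exp(ret.fillna(0).cumsum())
#2000-01-01    100.000000
#2000-01-02    101.005017
#...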
I have created a framework to index prices in pandas quickly!
See my GitHub below for the file:
https://github.com/meinerst/JupyterWorkflow
It shows how you can pull prices from Yahoo Finance and how you can work with your existing DataFrames.
I can't show the DataFrame tables here. If you want to see them, follow the GitHub link.
Indexing financial time series (pandas)
This example uses data pulled from Yahoo Finance. If you have a DataFrame from elsewhere, go to Part 2.
Part 1 (Pulling data)
For this, make sure the yfinance package is installed.
#pip install yfinance
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import datetime as dt
Insert the Yahoo Finance tickers into the variable 'tickers'. You can choose as many as you like.
tickers =['TSLA','AAPL','NFLX','MSFT']
Choose timeframe.
start=dt.datetime(2019,1,1)
end= dt.datetime.now()
In this example, the 'Adj Close' column is selected.
assets=yf.download(tickers,start,end)['Adj Close']
Part 2 (Indexing)
To graph comparable price developments, the assets DataFrame needs to be indexed. New columns are added for this purpose.
First the indexing row is determined; in this case, the initial prices.
assets_indexrow = assets[:1]
New columns with the indexed price developments are added to the original DataFrame.
Insert your desired indexing value below. In this case, it is 100.
for ticker in tickers:
    assets[ticker + '_indexed'] = (assets[ticker] / assets_indexrow[ticker][0]) * 100
The original price columns are then dropped:
assets.drop(columns=tickers, inplace=True)
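As an aside, the per-ticker loop above (before the drop) could also be written as a single vectorized expression; a sketch, equivalent under the same assumptions:

# divide every column by its first row and scale to 100
assets_indexed = assets.div(assets.iloc[0]).mul(100)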
Graphing the result.
plt.figure(figsize=(14, 7))
for c in assets.columns.values:
    plt.plot(assets.index, assets[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Value Change')
I can't insert the graph due to limited reputation points, but see here:
Indexed Graph
I want to create a time series of the valuation of a stock portfolio, by aggregating time series valuation data on the individual stock holdings of that portfolio. The problem I have is that on certain dates there may not be a valuation for a given stock holding and thus aggregating on that date would produce erroneous results.
The solution I have come up with is to exclude dates for which valuation (actually price) data doesn't exist for a given holding and then aggregate on these dates where I have complete data. The procedure I use is as follows:
# Get the individual holding valuation data
valuation = get_valuation(portfolio = portfolio, df = True)
# The next few lines retrieve the dates for which I have complete price data
# for the assets that comprise this portfolio.
# First get a list of the assets that this portfolio contains (or has contained).
unique_assets = valuation['asset'].unique().tolist()
# Then I get the price data for these assets
ats = get_ats(assets = unique_assets, df = True )[['data_date','close_price']]
# I mark those dates for which I have a 'close_price' for each asset:
ats = ats.groupby('data_date')['close_price'].agg({'data_complete':lambda x: len(x) == len(unique_assets)} ).reset_index()
# And extract the corresponding valid dates.
valid_dates = ats['data_date'][ats['data_complete']]
# Filter the valuation data for those dates for which I have complete data:
valuation = valuation[valuation['data_date'].apply(lambda x: x in valid_dates.values)]
# Group by date and sum the individual holding valuations to get the portfolio valuation
portfolio_valuation = valuation[['data_date','valuation']].groupby('data_date').agg(lambda df: sum(df['valuation'])).reset_index()
My question is two fold:
1) The above approach feels quite convoluted, and I am confident that Pandas has a better way of implementing my solution. Any suggestions?
2) The approach I have used isn't ideal. A better method would be: for those dates on which we have no valuation data for a given holding, use the most recent valuation for that holding. So let's say I am calculating the valuation of the portfolio on 21 June 2012 and have valuation data for GOOG on that date, but for AAPL only on 20 June 2012. Then the valuation of the portfolio on 21 June 2012 should still be the sum of these two valuations. Is there an efficient way to do this in Pandas? I want to avoid having to iterate through the data.
It seems like some combination of resample and/or fillna is going to get you what you're looking for (realize this is coming a little late!).
Go grab your data just like you're doing. You get back this thing with a few gaps. Check out this:
import pandas as pd
import numpy as np
dates = pd.date_range(start='2012-01-01', periods=10, freq='2D')
df = pd.DataFrame(np.random.randn(20).reshape(10, 2), index=dates)
So now you have this data that has lots of gaps in it -- but you want to have this daily resolution data.
Just do:
df.resample('1D').asfreq()
This will fill in your DataFrame with a bunch of NaNs where you have missing data. Then, when you do aggregations on it, just use functions (e.g., np.nansum, np.nanmean) that ignore NaNs!
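A sketch of how this could look for the portfolio case in the question (column names 'data_date', 'asset' and 'valuation' assumed from the code above): upsample each holding to daily, forward-fill the most recent valuation, then sum across holdings per date.

# per-holding daily valuations, forward-filled, then summed per date
daily = (
    valuation.set_index('data_date')
             .groupby('asset')['valuation']
             .resample('1D')
             .ffill()        # carry the most recent valuation forward within each asset
)
portfolio_valuation = daily.groupby(level='data_date').sum()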
Still a little unclear on the exact format of the data you've got. Hope it helps.