Pandas average per timeslot over different days, then sum within hour - python

I have a question about pandas and working with time series.
I get my data as JSON from an API; the data includes traffic counts at different locations, measured every 5 minutes. The simplified data looks like this:
[
{
"name": "123", // location id
"date": "2020-01-01T00:45:00Z", // date and time
"intensity": 7 // number of vehicles counted
},
...
]
There is a month's worth of data, read using pandas and concatenated into one big dataframe:
dfs = []
# in a loop over the input files:
df = pd.read_json(de.path)
df['date'] = pd.to_datetime(df['date'])
dfs.append(df)
# after the loop:
data = pd.concat(dfs)
I average the counts for equivalent time slots over different days:
data = data.set_index('date')
data = data.groupby(data.index.time).aggregate("mean")
The final step is where I have a problem. I have tried using the pandas.resample function, but that requires a DatetimeIndex, which is lost in the previous step. If I print the index out, I get this:
print(data.index)
>>> Index([00:00:00, 00:05:00, 00:10:00, 00:15:00, 00:20:00, 00:25:00, 00:30:00,
00:35:00, 00:40:00, 00:45:00,
...
23:10:00, 23:15:00, 23:20:00, 23:25:00, 23:30:00, 23:35:00, 23:40:00,
23:45:00, 23:50:00, 23:55:00],
dtype='object', length=288)
I tried converting the index to PeriodIndex, but failed.
Is there a common way of doing this? I feel I have probably missed something simple.

I solved the problem by using the groupby function like this:
for i, g in df.groupby(np.arange(len(df)) // 12):  # 12 five-minute slots = 1 hour
    counts_by_name[name].append((i+1, sum(g['intensity'])))
Does anyone know of a nicer solution that would allow actual resampling?
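One way to keep resampling available is to rebuild a DatetimeIndex from the time-of-day labels after the groupby. A minimal sketch, assuming data is the concatenated frame indexed by date as above (the dummy date that format="%H:%M:%S" produces is irrelevant, only the time of day matters):
# mean per 5-minute slot across all days
averaged = data.groupby(data.index.time).mean()
# the index now holds datetime.time objects; turn it back into a DatetimeIndex
# (on the dummy date 1900-01-01) so that resample() works again
averaged.index = pd.to_datetime(averaged.index.astype(str), format="%H:%M:%S")
# sum the per-slot averages within each hour
hourly = averaged.resample("1H").sum()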

assigning list of strings as name for dataframe

I have searched and searched and not found what I would think was a common question. Which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one has a column for date, the currency exchange rate, etc.
I would be stoked to assign each one of those strings in col_names as a variable name for a dataframe in df_list. I did make a dictionary where the key/value pairs were the currency name and the corresponding df, but I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_names and df_list together? I could also just unpack each df in df_list and use the title of its second column as the name of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then hand put them into the function I needed. Super kludgy but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
    ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
    index = col_names.index(currency)
    df = df_list[index]  # this is a dataframe containing a single currency and the columns built in cell 3
    return df
brazilian_real = framer("brazilian_real")
Which unpacks a df (but only if I type out the name), and then:
def volatizer(currency):
    all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of dataframes for each year
    c_name = currency.columns[1]
    df_dict = {}
    for frame in all_the_years:
        year_name = frame.iat[0, 4]  # the year for each df, becomes the "year" cell for the annual volatility df
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        df_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"])  # indexing on year, not sure if this is cool
    return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and its annual volatility. Then I want to concatenate them and use that for more charts, and ultimately make a little dashboard that lets you switch between weekly, monthly, and annual views and maybe set date limits.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically same question, how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
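If col_names and df_list are already in matching order, the same kind of lookup table can also be built by zipping the two lists, which is essentially the zip the question asks about (a sketch, assuming the order really does line up):
dfs = dict(zip(col_names, df_list))
dfs['brazilian_real']  # same kind of lookup as above
Also worth noting: since Python 3.7, regular dictionaries preserve insertion order, so the ordering concern mentioned in the question no longer applies.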
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
    annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of annual dfs
    c_name = currency.columns[1]
    row_dict = {}  # dictionary with year:annual_volatility as key:value
    for frame in annual_df_list:
        year_name = frame.iat[0, 4]  # first cell of the "year" column, becomes the "year" key for row_dict
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        row_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"])  # new df from the dictionary, indexed on year
    return df
# apply volatizer to each currency df
for key in df_dict:
    df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.
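To combine the per-currency results into the single table mentioned earlier for further charts, the annual-volatility frames share a year index and can simply be concatenated side by side; a small sketch, assuming df_dict now holds the outputs of volatizer:
all_vol = pd.concat(df_dict.values(), axis=1)  # one column per currency, indexed by year
all_vol.plot()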

Faster way to iterate through xarray and dataframe

I'm new to Python and don't know all the aspects.
I want to loop through a dataframe (2D) and assign some of those values to an xarray (3D).
The coordinates of my xarray are company ticker symbols (1), financial variables (2) and daily dates (3).
The columns of the dataframe for each company are some of the same financial variables as in the xarray and the index is made up of quarterly dates.
My goal is to take an already generated dataframe for each company and look for a value that is in the column of a certain variable and the row of a certain date and assign it to its corresponding spot in the xarray.
Since some dates are not going to be in the index of the dataframe (only has 4 dates per calendar year), I want to assign either a 0 to that spot on the xarray or the value from the previous date on the xarray if that value is also not 0.
I have tried to do it using nested for loops, but it takes around 20 seconds to go through all the dates in just one variable.
My date list is made up of around 8000 dates, the variable list has around 30 variables, and the company list has around 800 companies.
If I were to loop over all of that, it would take several days to complete the nested for loops.
Is there a faster way to assign these values to the xarray? My guess is something similar to iterrows() or iteritems() but in xarray.
Here is a sample of my program's code, with shorter lists for the companies and variables:
import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time
start_time = time.time()
# We create the df. This is an auxiliary made-up df. It's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
        'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# This is just to set the date format so that I can later look up the values
for item, i in zip(df.iloc[:, 0], range(len(df.iloc[:, 0]))):
    df.iloc[i, 0] = datetime.strptime(str(item), '%Y-%m-%d')
df.set_index('date', inplace=True)
# Coordinates for the xarray:
companies = ['AAPL'] # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity'] # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()
# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])
# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign the value found in the df and note that the first value has been found.
            # Once the first value has been found, when we don't find a value in the df, instead of 0 we use the value of the previous date in the ds.
            if first_value_found == False:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except:
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except:
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]
print("My program took", time.time() - start_time, "to run")
The main problem is with the for loops; I have tested these loops in separate files and they seem to be what takes the most time.
One possible strategy is to loop over the actual index of the DataFrame, rather than all possible indices
avail_dates = df.index
for date in avail_dates:
    # Copy the data
That should already reduce the number of iterations by quite a bit. You still have to make sure all the blanks are filled, so you'd do something like
da.loc[company, variables, date:] = df.loc[date, variables]
That's right, you can index into DataArray and DataFrame with lists. (Also I wouldn't use ds as a variable name for something from xarray other than a DataSet)
What you probably want to use, though, is pandas.DataFrame.reindex().
If I understand what you're trying to do, this should more or less do the trick (not tested)
complete_df = df.reindex(dates, method='pad', fill_value=0)
da.loc[company, variables, :] = complete_df.loc[:, variables].T
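For completeness, here is roughly how the suggested reindex approach could slot into the original loop (an untested sketch; get_company_df is a hypothetical stand-in for the external method that produces each company's quarterly dataframe):
for company in companies:
    quarterly_df = get_company_df(company)  # hypothetical: the external method mentioned in the question
    complete_df = quarterly_df.reindex(dates, method='pad', fill_value=0)
    # transpose so the (variables, dates) shape matches the DataArray slice
    ds.loc[company, variables, :] = complete_df.loc[:, variables].T.to_numpy()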

How to display aggregated and non-aggregated values at the same time?

I've got an hourly time series over the stretch of a year. I'd like to display daily and/or monthly aggregated values along with the source data in a plot. The most solid way would supposedly be to add those aggregated values to the source dataframe and take it from there. I know how to take an hourly series like this:
And show hour by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown together with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
Code sample:
import pandas as pd
import numpy as np
np.random.seed(1)
time = pd.date_range(start='01.01.2020', end='31.12.2020', freq='1H')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time':time, 'A':np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()
times = pd.DatetimeIndex(df1.index)
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
You are looking for a step function, and also a different way to group by:
import matplotlib.pyplot as plt

# replace '7D' with '1D' to match your code
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()
plt.step(df2.index, df2.A, c='r')
plt.plot(df1.index, df1.A)
Output:
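The same pattern extends to the other aggregation levels mentioned in the question; a sketch using resample() instead of groupby, overlaying daily and monthly means on the hourly series (assuming df1 from the code sample above):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(df1.index, df1.A, label='hourly')
for freq, colour in [('D', 'orange'), ('MS', 'red')]:  # daily and month-start means
    agg = df1.A.resample(freq).mean()
    ax.step(agg.index, agg.values, where='post', c=colour, label=f'{freq} mean')
ax.legend()
plt.show()

Swapping the frequencies for 'W', 'QS' or 'AS' gives the weekly, quarterly and yearly variants.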

python resample/group by OHLC data

I have hourly OHLC data that I am trying to regroup so that each 21:00-05:00 stretch shows up as one row, and then do that for every day.
I've tried several ways suggested here, but without success.
index_21_09 = eur.index.indexer_between_time('21:00','05:00')
df = eur.iloc[index_21_09]
With this I filter the data from 21:00 to 05:00, but it comes out as several rows; I need them in one row.
Then I tried this:
df_day_max = df.groupby(pd.Grouper(freq='8h')).max()
df_day_min = df.groupby(pd.Grouper(freq = '8h')).min()
df_group = (pd.concat([df_day_max['High'], df_day_min['Low ']], axis=1).dropna())
But I get data from 16:00 to 00:00! How, if I previously filtered them from 21:00 to 05:00?
df_resample = df.resample('8H').ohlc()
With this, I get the same result, only with NaN values.
Any help with this? Thanks.
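One way to get a single row per 21:00-05:00 session is to shift the 8-hour resampling bins so they start at 21:00 instead of midnight, then aggregate per bin. A sketch, assuming eur has a DatetimeIndex with High and Low columns and a pandas version that supports the offset argument of resample:
# keep only the 21:00-05:00 part of each day (drop 05:00 itself so it
# does not spill into the next bin)
night = eur.iloc[eur.index.indexer_between_time('21:00', '05:00', include_end=False)]

# shift the 8-hour bins so each one starts at 21:00 rather than 00:00,
# giving exactly one row per 21:00-05:00 session
session = night.resample('8H', offset='21h').agg({'High': 'max', 'Low': 'min'}).dropna()

Open and Close columns could be added to the agg dict with 'first' and 'last' if the full OHLC per session is needed.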

numpy.where does not work properly with pandas dataframe

I am trying to divide a huge log data set containing StartTime, EndTime and other fields.
I am using np.where to compare the pandas dataframe and then divide it into hourly (maybe half-hour or quarter-hour) chunks, depending on the hr and timeWindow values.
Below, I am trying to divide the entire day's logs into 1-hour chunks, but it does not give me the expected output.
I am out of ideas about where exactly my mistake is!
# Holding very first time in the log data and stripping off
# second, minutes and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(second = 0, minute = 0, microsecond = 0)
today_ts = int(time.mktime(today.timetuple())*1e9)
hr = 1
timeWindow = int(hr*60*60*1e9) #hour*minute*second*restdigits
parts = [df.loc[np.where((df["StartTime"] >= (today_ts + i*timeWindow)) &
                         (df["StartTime"] < (today_ts + (i+1)*timeWindow)))].dropna(axis=0, how='any')
         for i in range(0, rngCounter)]
If I check the first log entry inside each part of my parts data, it is something like below:
00:00:00.
00:43:23.
01:12:59.
01:53:55.
02:23:52.
....
Whereas I expect the output to be like below:
00:00:00
01:00:01
02:00:00
03:00:00
04:00:01
....
I have implemented it in an alternative way, but that's a workaround and I lost a few features by not doing it like this.
Can someone please figure out what exactly is wrong with this approach?
Note: I am using a Python notebook with pandas and numpy.
Thanks to #raganjosh.
I got my solution to the problem by using pandas Grouper.
Below is my implementation.
I have used a dynamic value for 'hr'.
timeWindow = str(hr)+'H'
# Dividing the log into "n" parts. Depends on timewindow initialisation.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
# Using the copied column as an index.
df.index = pd.to_datetime(df.index)
# Here the parts contain grouped chunks of data as per timewindow, list[0] = key of the group, list[1] = values.
parts = list(df.groupby(pd.Grouper(freq=timeWindow))[['StartTime', 'ProcessTime', 'EndTime']])
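For reference, the grouped result can then be walked chunk by chunk; a small usage sketch for the parts list built above:
for window_start, chunk in parts:
    # window_start is the left edge of each time window, chunk is the matching slice of rows
    print(window_start, len(chunk))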
