Faster way to iterate through xarray and dataframe - python

I'm new to Python and don't know all of its aspects.
I want to loop through a dataframe (2D) and assign some of those values to an xarray (3D).
The coordinates of my xarray are company ticker symbols (1), financial variables (2) and daily dates (3).
The columns of the dataframe for each company are some of the same financial variables as in the xarray and the index is made up of quarterly dates.
My goal is to take an already generated dataframe for each company and look for a value that is in the column of a certain variable and the row of a certain date and assign it to its corresponding spot in the xarray.
Since some dates are not going to be in the index of the dataframe (only has 4 dates per calendar year), I want to assign either a 0 to that spot on the xarray or the value from the previous date on the xarray if that value is also not 0.
I have tried to do it using nested for loops, but it takes around 20 seconds to go through all the dates in just one variable.
My date list is made up of around 8000 dates, the variable list has around 30 variables and the company list has around 800 companies.
If I were to loop around all of that it would take me several days to complete the nested for loops.
Is there a faster way to assign these values to the xarray? My guess is something similar to iterrows() or iteritems() but in xarray.
Here is a sample of my program's code with shorter lists for the companies and variables:
import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time
start_time = time.time()
# We create the df. This is an auxiliary made-up df; it's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# This is just to set the date format so that I can later look up the values
for item, i in zip(df.iloc[:, 0], range(len(df.iloc[:, 0]))):
    df.iloc[i, 0] = datetime.strptime(str(item), '%Y-%m-%d')
df.set_index('date', inplace=True)
# Coordinates for the xarray:
companies = ['AAPL'] # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity'] # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()
# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])
# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give that spot the value 0 in the ds.
            # If we do find it, we assign the value found in the df and note that the first value has been found.
            # Once the first value has been found, when we don't find a value in the df, instead of assigning 0 we reuse the value from the previous date.
            if not first_value_found:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except KeyError:
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except KeyError:
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]
print("My program took", time.time() - start_time, "to run")
The main problem is the for loops; I have tested them in separate files and they are what takes the most time.

One possible strategy is to loop over the actual index of the DataFrame, rather than over all possible dates:
avail_dates = df.index
for date in avail_dates:
    # Copy the data
That should already reduce the number of iterations by quite a bit. You still have to make sure all the blanks are filled, so you'd do something like
da.loc[company, variables, date:] = df.loc[date, variables]
That's right, you can index into a DataArray and a DataFrame with lists. (Also, I wouldn't use ds as a variable name for anything from xarray other than a Dataset.)
What you probably want to use, though, is pandas.DataFrame.reindex().
If I understand what you're trying to do, this should more or less do the trick (not tested)
complete_df = df.reindex(dates, method='pad', fill_value=0)
da.loc[company, variables, :] = complete_df.loc[:, variables].T
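Putting that together with the question's sample objects (df, companies, variables, dates), a minimal sketch of the reindex approach might look like this; it assumes, as in the sample code, that there is one quarterly df per company and that every name in variables is a column of that df:
import numpy as np
import xarray as xr

# zero-filled array; called da here to avoid suggesting an xarray Dataset
da = xr.DataArray(np.zeros((len(companies), len(variables), len(dates))),
                  coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])

for company in companies:
    # in the real program, df would be the DataFrame generated for this company;
    # reindex expands the quarterly index to the daily dates, forward-filling each
    # quarter's value and writing 0 where there is nothing yet to carry forward
    complete_df = df.reindex(dates, method='pad', fill_value=0)
    # one vectorized assignment per company instead of a loop over ~8000 dates
    da.loc[company, variables, :] = complete_df.loc[:, variables].T.to_numpy()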

Related

assigning list of strings as name for dataframe

I have searched and searched and not found what I would think was a common question, which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one has a column for the date, the currency exchange rate, etc. Here's the head of one of them (screenshot omitted):
I would be stoked to assign each of those strings in col_names as a variable name for a dataframe in df_list. I did make a dictionary where the key/value pairs were the currency name and the corresponding df, but I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_names and df_list together? I could also just unpack each df in df_list and use the title of the second column as the title of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then put them by hand into the function I needed. It's super kludgy, but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
    ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
    index = col_names.index(currency)
    df = df_list[index]  # this is a dataframe containing a single currency and the columns built in cell 3
    return df
brazilian_real = framer("brazilian_real")
which unpacks a df (but only if I type out the name), and then:
def volatizer(currency):
    all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of dataframes for each year
    c_name = currency.columns[1]
    df_dict = {}
    for frame in all_the_years:
        year_name = frame.iat[0, 4]  # the year for each df, becomes the "year" cell for the annual volatility df
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        df_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"])  # indexing on year, not sure if this is cool
    return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and annual volatility. Then I want to concatenate them and use that for more charts. Ultimately make a little dashboard that lets you switch between weekly, monthly, annual and maybe set date lims.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically the same question: how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
    annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()]  # list of annual dfs
    c_name = currency.columns[1]
    row_dict = {}  # dictionary with year:annual_volatility as key:value
    for frame in annual_df_list:
        year_name = frame.iat[0, 4]  # first cell of the "year" column, becomes the "year" key for row_dict
        annual_volatility = frame["log_rate"].std()*253**.5  # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        row_dict[year_name] = annual_volatility  # dictionary with year:annual_volatility as key:value
    df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"])  # new df from dictionary, indexing on year
    return df

# apply volatizer to each currency df
for key in df_dict:
    df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.
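If the next step is the concatenation mentioned above, a minimal sketch (assuming df_dict now maps currency names to the volatizer output) could be:
import pandas as pd

# combine the per-currency annual-volatility frames into one wide table:
# one column per currency, one row per year
annual_vol = pd.concat(df_dict.values(), axis=1).sort_index()
print(annual_vol.head())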

Populating new Dataframe in Python taking too long, need to remove explicit loops to improve performance

I am building Python code to analyze the growth of COVID-19 across different nations. I am using the OWID database to get the latest values each time the code is run:
data = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")
data.to_csv('covid.csv')
data
OWID provides not just the CSV file but also XLSX and JSON formats; the JSON even has a 3D structure. Might that help with efficiency?
I am trying to create a new Dataframe with the country name as the column headings and date range containing all the listed dates as the index:
data['date'] = pd.to_datetime(data['date'])
buffer = 30
cases = pd.DataFrame(columns=data['location'].drop_duplicates(),
                     index=pd.date_range(start=data['date'].min() - datetime.timedelta(buffer), end=data['date'].max()))
deaths = pd.DataFrame(columns=data['location'].drop_duplicates(),
                      index=pd.date_range(start=data['date'].min() - datetime.timedelta(buffer), end=data['date'].max()))
I am doing differentials on the values so I need to make sure each consecutive element is at equal time-steps (1 day).
The database does not have all the dates within the date range for most countries; many of them even have data missing for dates in the middle of the range. All I could think of was using nested loops to populate the new dataframes:
location = data['location'].drop_duplicates()
date_range = pd.date_range(data['date'].min(), data['date'].max())
for l, t in itertools.product(location, date_range):
    c = data.loc[(data['location'] == l) & (data['date'] == t), 'total_cases']
    d = data.loc[(data['location'] == l) & (data['date'] == t), 'total_deaths']
    if c.size != 0:
        cases[l][t] = c.iloc[0]
    if d.size != 0:
        deaths[l][t] = d.iloc[0]
This gets the job done, but it takes more than 20 minutes to complete on my fairly good PC. I know there is some way to do this without using explicit loops, but I am new to Python.
Here is a faster implementation.
The key functions are pivot and reindex.
You can use the interpolate function for smarter filling of NaN values.
import pandas as pd
filename = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
df = pd.read_csv(
    filename,
    # parse dates while reading
    parse_dates=["date"],
    # select subset of columns
    usecols=["date", "location", "total_cases", "total_deaths"],
)
locations = df["location"].unique()
date_range = pd.date_range(df["date"].min(), df["date"].max())
# select needed columns and reshape
cases = (
    # turn location into columns and date into index
    pd.pivot(df, index="date", columns="location", values="total_cases")
    # fill missing dates
    .reindex(date_range)
    # fill missing locations
    .reindex(columns=locations)
    # fill NaN in total_cases using the previous value
    # another good option is .interpolate("time")
    .fillna(method="ffill")
    # sort columns
    .sort_index(axis=1)
)
# use the same logic for `total_deaths`
...
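For completeness, a sketch of that last step, simply repeating the same pipeline with values="total_deaths":
deaths = (
    pd.pivot(df, index="date", columns="location", values="total_deaths")
    .reindex(date_range)
    .reindex(columns=locations)
    .fillna(method="ffill")
    .sort_index(axis=1)
)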

PYTHON: Filtering a dataset and truncating a date

I am fairly new to python, so any help would be greatly appreciated. I have a dataset that I need to filter down to specific events. For example, I have a column with dates and I need to know what dates are in the current month and have happened within the past week. The column is called POS_START_DATE with dates formatted like 2019-01-27T00:00:00-0500. I need to truncate that date and compare it to the previous week. No luck so far.
Here is my code so far:
## import data package
import datetime
## assign date variables
today = datetime.date.today()
six_day = datetime.timedelta(days = 6)
## Create week parameter
week = today + six_day
## Statement to extract recent job movements
if fields.POS_START_DATE < week and fields.POS_START_DATE > today:
    out1 += in1
Here is a sample of the table:
[Sample table image]
I am looking for the same table filtered down to only the rows that happened within one week. The bottom of the sample table (not shown) will have dates in this month. I'd like the final output to only show those rows, and any other rows in the current month of November.
I am not quite sure I understand what your expected output is, but this will help you create an extra column that can be used as a flag for the cases that fulfill the condition you state in your if-statement:
import numpy as np
fields['flag_1'] = np.where(((fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)),1,0)
This will generate an extra column in your dataframe with a 1 for the cases that meet the criteria you stated. Finally you can perform this calculation to get the total of cases that actually met the criteria:
total_cases = fields['flag_1'].sum()
Edit:
If you need to filter the data with only the cases that meet the criteria you can either use pandas filtering with the original if-statement (without creating the extra flag field) like this:
df_filtered = fields[(fields['POS_START_DATE'] < week) & (fields['POS_START_DATE'] > today)]
Or, if you created the flag, then much simpler:
df_filtered = fields[fields['flag_1'] == 1]
Both should work to generate a new dataframe, with only the cases that match your criteria.
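One caveat not covered above: if POS_START_DATE is still stored as strings like 2019-01-27T00:00:00-0500, the comparisons with week and today mix types. A minimal sketch of truncating and parsing the column first, assuming fields is a pandas DataFrame:
import datetime
import pandas as pd

# keep only the YYYY-MM-DD part of strings like "2019-01-27T00:00:00-0500", then parse
fields['POS_START_DATE'] = pd.to_datetime(fields['POS_START_DATE'].str[:10])
today = pd.Timestamp(datetime.date.today())
week = today + pd.Timedelta(days=6)
df_filtered = fields[(fields['POS_START_DATE'] > today) & (fields['POS_START_DATE'] < week)]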

Generating rolling averages from a time series, but subselecting based on month

I have long time series of weekly data. For a given observation, I want to calculate that week's value versus the average of the three previous years' average value for the same month.
Concrete example: For the 2019-02-15 datapoint, I want to compare it to the value of the average of all the feb-2018, feb-2017, and feb-2016 datapoints.
I want to populate the entire timeseries in this way. (the first three years will be np.nans of course)
I made a really gross single-datapoint example of the calculation I want to do, but I am not sure how to implement this in a vectorized solution. I also am not impressed that I had to use this intermediate helper table "mth_avg".
import pandas as pd
ix = pd.date_range(freq='W-FRI',start="20100101", end='20190301' )
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix) #weekly data
mth_avg = df.resample("M").mean() #data as a monthly average over time
mth_avg['month_hack'] = mth_avg.index.month
#average of previous three years' same-month averages
df['avg_prev_3_year_same-month'] = "?"
#single arbitrary example of my intention
df.loc['2019-02-15', "avg_prev_3_year_same-month"] = (
    mth_avg[mth_avg.month_hack == 2]
    .loc[:'2019-02-15']
    .iloc[-3:]
    .loc[:, 'foo']
    .mean()
)
df[-5:]
I think it's actually a nontrivial problem - there's no existing functionality I'm aware of in Pandas for this. Making a helper table saves calculation time; in fact I used two. My solution uses a loop (namely a list comprehension) and Pandas datetime awareness to avoid your month_hack. Otherwise I think it was a good start. Would be happy to see something more elegant!
# your code (plus the imports it relies on)
import numpy as np
import pandas as pd

ix = pd.date_range(freq='W-FRI', start="20100101", end='20190301')
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix)
mth_avg = df.resample("M").mean()
# use multi-index of month/year with month first
mth_avg.index = [mth_avg.index.month, mth_avg.index.year]
tmp = mth_avg.sort_index().groupby(level=0).rolling(3).foo.mean()
tmp.index = tmp.index.droplevel(0)
# get rolling value from tmp
res = [tmp.xs((i.month, i.year - 1)) for i in df[df.index > '2010-12-31'].index]
# NaNs for 2010
df['avg_prev_3_year_same-month'] = np.NaN
df.loc[df.index > '2010-12-31', 'avg_prev_3_year_same-month'] = res
# output
df.sort_index(ascending=False).head()
foo avg_prev_3_year_same-month
2019-03-01 478 375.833333
2019-02-22 477 371.500000
2019-02-15 476 371.500000
2019-02-08 475 371.500000
2019-02-01 474 371.500000

Pandas calculated column from datetime index groups loop

I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np
index=pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all other columns to get the payoffs from above.
I feel like I need to use a for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance
I think you need to convert the DatetimeIndex to quarterly periods with to_period, then to strings, and finally map them with a dict.
For comparing, you need gt together with sub:
d = {'2017Q4':30, '2018Q1':40, '2018Q2':50, '2018Q3':60, '2018Q4':70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, 0)].sub(strike, 0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
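As a quick sanity check (purely illustrative), you can inspect the quarterly strike Series before applying it; because it is aligned to df's DatetimeIndex, gt(strike, 0) and sub(strike, 0) broadcast it row-wise across the A and B columns:
print(strike.head())                       # one strike value per timestamp
print(strike.value_counts().sort_index())  # how many rows fall under each strike level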
Mapping the year of your dataframe's index through a dictionary can be a starting point.
import random

a = dict()
a[2017] = 30
a[2018] = 40
ranint = random.choices([30, 35, 40, 45], k=21936)
# given your index used in the example
df = pd.DataFrame({'values': ranint}, index=index)
df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df['returns'] = df['values'] - df['strike']
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
Then you can extract the returns that are greater than 0:
df[df['returns'] > 0]
