Loop for interaction between two series - python

I have two series. One holds the trading volume of many stocks across many exchanges (a lot of the stocks trade on all of the exchanges). The other holds the standard deviation of volume for each stock (each company, irrespective of the exchanges it is traded on). I have been trying to write a loop that divides the volume of each stock (in the first series) by that stock's combined standard deviation (in the second series). I wrote the following loop:
#for standard deviation of volume of each stock across all exchanges.(it is working properly)
stdev_volume = Main_df_retvol.groupby(['pair_name'], sort=False)['volume'].std()
#loop to divide the volume by the standard deviation of volume of respective stock.(loop not working)
df_vol_std = []
for i in range(len(stdev_volume)):
    if stdev_volume[i]['pair_name'] == Main_df_retvol['pair_name']:
        df_vol_std = Main_df_retvol['vol'].divide(other = stdev_volume['Volume'])
print(df_vol_std)
Any help would be really appreciated.

Let's break it down...
Getting an index to select each row in stdev
for i in range(len(stdev_volume)):
Comparing a scalar value from a row/col in stdev to a full column from main (which will raise an exception)
if stdev_volume[i]['pair_name'] == Main_df_retvol['pair_name']:
And taking a variable you had initialized as a list and overwriting it with a full column-by-column division (regardless of the intended row; it will also only keep the value from the last pass through the loop anyway)
df_vol_std = Main_df_retvol['vol'].divide(other = stdev_volume['Volume'])
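So, instead of that loop, I would suggest something like the following (the Series is renamed first, because join would otherwise complain that its name collides with the existing volume column):
main.join(stdev.rename('stdev'), on='pair_name')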
Or go even further to when you build stdev and add it as a column on main:
main = main.assign(stdev=main.groupby('pair_name').volume.transform('std'))
main = main.assign(volbystd=main.volume.div(main.stdev))
If you provide a sample of your data we can test if this works
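In the absence of sample data, here is a minimal sketch of the transform approach on made-up values. The column names pair_name and volume come from the question; the exchange column and all numbers are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: two pairs, each traded on three exchanges.
main = pd.DataFrame({
    "pair_name": ["BTC/USD", "BTC/USD", "BTC/USD", "ETH/USD", "ETH/USD", "ETH/USD"],
    "exchange": ["A", "B", "C", "A", "B", "C"],
    "volume": [10.0, 20.0, 30.0, 1.0, 3.0, 5.0],
})

# transform('std') returns one value per original row, so the result
# lines up with main and can be stored directly as a new column.
main["stdev"] = main.groupby("pair_name")["volume"].transform("std")

# Row-wise division of each volume by the standard deviation of its pair.
main["volbystd"] = main["volume"] / main["stdev"]
print(main)
```

Because the per-pair standard deviation is repeated on every row of that pair, no loop or explicit matching on pair_name is needed.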

Related

How to structure my code in order to backtest a trading strategy with Pandas

My goal is to simulate the past growth of a stock portfolio based on historical stock prices. I wrote code that works (at least I think so). However, I am pretty sure that the basic structure of the code is not very clever and probably makes things more complicated than they actually are. Maybe someone can help me and tell me the best procedure for solving a problem like mine.
I started with a dataframe containing historical stock prices for a number (here: 2) of stocks:
import pandas as pd
import numpy as np
price_data = pd.DataFrame({'Stock_A': [5,6,10],
                           'Stock_B': [5,7,2]})
Then I defined a starting capital (here: 1000 €). Furthermore, I decide how much of my money I want to invest in Stock_A (here: 50%) and Stock_B (here: also 50%).
capital = 1000
weighting = {'Stock_A': 0.5, 'Stock_B': 0.5}
assets = list(weighting)  # assumed: not defined in the post, taken to be the keys of weighting
Now I can calculate how many shares of Stock_A and Stock_B I can buy at the start:
quantities = {key: weighting[key]*capital/price_data.get(key,0)[0] for key in weighting}
As time goes by, the weights of the portfolio components will of course change, as the prices of Stock A and B move in opposite directions. So at some point the portfolio will mainly consist of Stock A, while the proportion of Stock B (value-wise) gets pretty small. To correct for this, I want to restore the initial 50:50 weighting as soon as the portfolio weights deviate too much from the initial weighting (so-called rebalancing). I defined a function to decide whether rebalancing is needed or not.
def need_to_rebalance(row):
    rebalance = False
    for asset in assets:
        if not 0.4 < row[asset] * quantities[asset] / portfolio_value < 0.6:
            rebalance = True
            break
    return rebalance
If we perform a rebalancing, the following function returns the updated number of shares for Stock A and Stock B:
def rebalance(row):
    for asset in assets:
        quantities[asset] = weighting[asset]*portfolio_value/row[asset]
    return quantities
Finally, I defined a third function that I use to loop over the dataframe containing the stock prices in order to calculate the value of the portfolio based on the number of shares we currently own. It looks like this:
def run_backtest(row):
    global portfolio_value, quantities
    portfolio_value = sum(np.array(row[assets]) * np.array(list(quantities.values())))
    if need_to_rebalance(row):
        quantities = rebalance(row)
    for asset in assets:
        historical_quantities[asset].append(quantities[asset])
    return portfolio_value
Then I put it all to work using .apply:
historical_quantities = {}
for asset in assets:
    historical_quantities[asset] = []
output = price_data.copy()
output['portfolio_value'] = price_data.apply(run_backtest, axis = 1)
output.join(pd.DataFrame(historical_quantities), rsuffix='_weight')
The result looks reasonable to me and it is basically what I wanted to achieve. However, I was wondering whether there is a more efficient way to solve the problem. Somehow, doing the calculation line by line and storing all the values in the variable 'historical_quantities' just to add it to the dataframe at the end doesn't look very clever to me. Furthermore, I have to use a lot of global variables. Storing a lot of values from inside the functions as global variables makes the code pretty messy (in particular if the calculations concerning rebalancing get more complex, for example when including tax effects). Has someone read this far and is maybe willing to help me?
All the best
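One possible way to avoid the global variables is to keep all state local to a single function and collect one record per price row. The sketch below is only an illustration of that structure, not a definitive design: the function signature, the band parameter (0.1 reproduces the 0.4 to 0.6 corridor for a 50% target) and the output column names are assumptions that do not appear in the post.

```python
import pandas as pd

def run_backtest(prices, weighting, capital, band=0.1):
    """Simulate a rebalanced portfolio; all state stays inside this function."""
    assets = list(weighting)
    first = prices.iloc[0]
    # Initial number of shares bought with the starting capital.
    quantities = {a: weighting[a] * capital / first[a] for a in assets}
    records = []
    for _, row in prices.iterrows():
        value = sum(row[a] * quantities[a] for a in assets)
        # Rebalance when any weight drifts at least `band` away from its target.
        drifted = any(
            abs(row[a] * quantities[a] / value - weighting[a]) >= band
            for a in assets
        )
        if drifted:
            quantities = {a: weighting[a] * value / row[a] for a in assets}
        record = {"portfolio_value": value}
        record.update({f"{a}_quantity": quantities[a] for a in assets})
        records.append(record)
    return prices.join(pd.DataFrame(records, index=prices.index))

price_data = pd.DataFrame({"Stock_A": [5, 6, 10], "Stock_B": [5, 7, 2]})
result = run_backtest(price_data, {"Stock_A": 0.5, "Stock_B": 0.5}, capital=1000)
print(result)
```

The row-by-row loop remains, because each day's holdings depend on whether a rebalancing happened earlier, but more complex rules (tax effects, for example) could then be added inside the function without touching module-level state.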

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in pandas (and NumPy), unlike general-purpose Python, you should avoid for loops, as there are many vectorized options for conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, results broadcast to match the shape of the input array.
Currently, your code assigns a value to a filtered copy of the data frame (chained indexing), which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment instead:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely for assigning new data frame columns. Note that div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
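As a quick check of the transform approach, here is a small sketch on invented data. The column names SKU, Family, Month and Qty follow the question's code; the rows themselves are assumptions chosen so the FISH family sells 100,000 in month 1.

```python
import pandas as pd

# Hypothetical sample data, one row per SKU and month.
testingAgain = pd.DataFrame({
    "Month": [1, 1, 1, 2],
    "SKU": [1234, 5678, 9999, 1234],
    "Family": ["FISH", "FISH", "MEAT", "FISH"],
    "Qty": [10000.0, 90000.0, 500.0, 2000.0],
})

# Sales of each SKU within its family and month, repeated on every matching row.
sku_sales = testingAgain.groupby(["SKU", "Family", "Month"])["Qty"].transform("sum")
# Sales of the whole family in that month, again aligned with the original rows.
family_sales = testingAgain.groupby(["Family", "Month"])["Qty"].transform("sum")

testingAgain["ProporcionVenta"] = sku_sales.div(family_sales)
print(testingAgain)
```

SKU 1234 in month 1 ends up with 10,000 / 100,000, matching the worked example in the question.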

Getting values from csv file in python, large dataset

I have a csv file with stock values for 500 companies over 5 years (2013-2017). The columns are: date, open, high, low, close, volume and name. I would like to compare these companies to see which 20 of them performed best. I was thinking about just using the mean, but since the stock prices on the first date collected (Jan 2013) differ (some start at 30 USD, others at 130 USD), it's hard to tell which ones did best over these 5 years. I would therefore like to use each company's value on its first date as the zero point. Basically, I want to subtract the first date's close value from all the other data points collected.
My problem is that, firstly, I have a hard time getting the first date's close value. I want to write something like "data.loc(data['close']).iloc(0)", but since it's a dataframe I can't find the value of a row, nor iterate through the dataframe.
Secondly, I'm not sure how to differentiate between the companies. I want to apply the zero-point procedure to each of these 500 companies, so somehow I need to know when to start over.
The code I have now is
def main():
    data = pd.read_csv('./all_stocks_5yr.csv', usecols = ['date', 'close', 'Name'])
    comp_name = sorted(set(data.Name))
    number_of = len(comp_name)
    comp_mean = []
    for i in comp_name:
        frames = data.loc[data['Name'] == i]
        comp_mean.append([i, frames['close'].mean()])
    print(comp_mean)
But this only gives me the mean, without using the zero point.
Another idea I had was to just compare the closing price of the first value (January 1, 2013) with the price of the last value (December 31, 2017) to see how much the stock has increased or decreased. What I'm not sure about here is how to get the close values for these dates for every one of the 500 companies.
Do you have any recommendations for any of the methods?
Thank you in advance
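Both ideas described above (a per-company zero point, and first versus last close) can be expressed with groupby. The following is a minimal sketch on invented data rather than the real file: the column names date, close and Name follow the usecols list in the question, while the two tickers, the prices and the new column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for all_stocks_5yr.csv.
data = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"] * 2),
    "close": [30.0, 33.0, 36.0, 130.0, 117.0, 143.0],
    "Name": ["AAA"] * 3 + ["BBB"] * 3,
})

# Sort so that "first" really means the earliest date of each company.
data = data.sort_values(["Name", "date"])

# First close of each company, repeated on every row of that company.
first_close = data.groupby("Name")["close"].transform("first")
data["change_from_start"] = data["close"] - first_close
data["pct_from_start"] = data["close"] / first_close - 1

# First versus last close per company, then the 20 best performers.
summary = data.groupby("Name")["close"].agg(["first", "last"])
summary["total_return"] = summary["last"] / summary["first"] - 1
top = summary.nlargest(20, "total_return")
print(top)
```

A relative change (pct_from_start) is usually easier to compare across companies than an absolute difference, since a stock that starts at 130 USD moves in larger absolute steps than one that starts at 30 USD.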

Performing multiple calculations on a Python Pandas group from CSV data

I have daily CSVs that are automatically created for work; they average about 1000 rows and have exactly 630 columns. I've been trying to use pandas to create a summary report that I can write to a new .txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns(-2000 to 300000) are profit/loss data based on time(milliseconds). The file is usually between 700-1000 lines and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (ie. total_tickets) and applying it to the next calculation (ie. 1s loss)
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby, with more examples in the documentation here.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find the groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to variables, e.g. total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
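Putting these pieces together, a summary table for the first few metrics might look like the sketch below. The column names provider, filled and fill_size come from the question; the sample rows and the output column names are assumptions.

```python
import pandas as pd

# Hypothetical sample: filled is 1 for a fill and 0 for a reject.
df = pd.DataFrame({
    "provider": ["X", "X", "Y", "Y", "Y"],
    "filled": [1, 0, 1, 1, 0],
    "fill_size": [100, 0, 50, 25, 0],
})

grouped = df.groupby("provider")
tickets_by_provider = grouped["filled"].sum()
total_tickets = df["filled"].sum()

# All Series share the provider index, so they line up in one table.
summary = pd.DataFrame({
    "total_tickets": tickets_by_provider,
    "share_pct": tickets_by_provider / total_tickets * 100,
    # filled is 0/1, so its mean equals filled / (filled + rejected)
    "fill_rate": grouped["filled"].mean(),
    "size": grouped["fill_size"].sum(),
})
print(summary)
```

Writing the report is then a matter of calling something like summary.to_csv(...) or summary.to_string() on the result.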
Update:
For one second loss (and for the other losses), you need two things:
The number of times for each provider df['1000'] < 0
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
The same approach carries over to the 10s Loss question.
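Here is a minimal sketch of both loss metrics on invented data. It assumes only three time columns ('1000', '5000', '10000') between 1 s and 10 s, and it introduces a helper column min_10s that is not part of the original file. The columns are selected explicitly before apply so that the grouping column is not passed to the lambda.

```python
import pandas as pd

# Hypothetical sample with a handful of profit/loss columns.
df = pd.DataFrame({
    "provider": ["X", "X", "Y", "Y"],
    "filled": [1, 1, 1, 0],
    "1000": [-2.0, 1.0, 3.0, -1.0],
    "5000": [-1.0, 2.0, -4.0, 0.5],
    "10000": [0.5, 3.0, 1.0, 2.0],
})

# Row-wise minimum over the 1 s to 10 s range (helper column, assumed name).
cols_10s = ["1000", "5000", "10000"]
df["min_10s"] = df[cols_10s].min(axis=1)

grouped = df.groupby("provider")[["filled", "1000", "min_10s"]]

# Filled tickets per provider that are negative at 1 s, and at any point up to 10 s.
loss_1s_count = grouped.apply(lambda g: g.loc[g["1000"] < 0, "filled"].sum())
loss_10s_count = grouped.apply(lambda g: g.loc[g["min_10s"] < 0, "filled"].sum())

records_per_group = df.groupby("provider")["1000"].count()

# Division aligns on the provider index.
loss_1s_rate = loss_1s_count / records_per_group
loss_10s_rate = loss_10s_count / records_per_group
print(loss_1s_rate, loss_10s_rate, sep="\n")
```

If the denominator should be filled tickets rather than all records, records_per_group can be swapped for the per-provider sum of filled.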
The last question, about the average over a range of columns, relies on how pandas applies functions along an axis. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
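A small runnable version of these two lines, on invented numbers, shows the effect of axis=1. The sketch uses mean() rather than sum() as the per-provider aggregation, so the result is an average of per-column averages; the column list and values are assumptions.

```python
import pandas as pd

# Hypothetical sample with three time columns.
df = pd.DataFrame({
    "provider": ["X", "X", "Y"],
    "1000": [1.0, 3.0, 10.0],
    "5000": [2.0, 4.0, 20.0],
    "10000": [3.0, 5.0, 30.0],
})
cols = ["1000", "5000", "10000"]

# One row per provider: the mean of each selected column.
one_rec_per_provider = df.groupby("provider")[cols].mean()

# axis=1 averages across the columns of each row, giving one number per provider.
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
print(provider_means_over_cols)
```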

Python Pandas optimization algorithm: compare datetime and retrieve datas

This post is quite long, and I will be very grateful to everybody who reads it to the end. :)
I am running into slow execution of my Python code and would like to know if there is a better way of doing what I want.
Briefly, my problem is this. I have measurements from many solar panels. Each panel is measured every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the measurement times, keep only the values that were measured in the same minute on every panel, and then retrieve them. My software also includes a GUI, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts: the first creates a vector of True or False for each panel for each minute, and the second compares those vectors and keeps only the common measurements.
All the data is contained in the pandas df energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod : contains the corresponding measures (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement was taken, it adds True, otherwise False.
self.ListCompare2=pd.DataFrame()
for n in self.NameList:#loop over all my solar panels
    m=self.energiesDatas[self.energiesDatas['Name']==n]#all datas
    #table_date contains all the possible date from the 1st measure, with interval of 1 min.
    table_list=[1 for i in range(len(table_date))]
    pointerDate=0 #pointer to the current value of time
    #all the measures of a given day are transform into a str of hour-minutes
    DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
    #some test
    changeDate=0
    count=0
    #store the current pointed date
    m_date=m['Date'].iloc[pointerDate]
    #for all possible time
    for curr_date in table_date:
        #if considered date is bigger, move pointer to next day
        while curr_date.date()>m_date:
            pointerDate+=1
            changeDate=1
            m_date=m['Date'].iloc[pointerDate]
        #if the day is changed, recalculate the measures of this new day
        if changeDate:
            DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
            changeDate=0
        #check if a measure has been done at the considered time
        table_list[count]=curr_date.strftime('%H-%M') in DateString
        count+=1
    #add to a dataframe
    self.ListCompare2[n]=table_list
l2=self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check whether they were measured at the same time and keep only the values measured in the same minute.
ListToKeep=self.ListCompare2[ListOfName[0]]#take list of True or False done before
for i in ListOfName[1:]:#for each other panels, check if True too
    ListToKeep=ListToKeep&self.ListCompare2[i]
for i in ListOfName:#for each module, recover values
    tmp=self.energiesDatas[self.energiesDatas['Name']==i]
    count=0
    #loop over value we want to keep (also energy produced and the interval of time)
    for j,k,l,m,n in zip(tmp['list_time'],tmp['Date'],tmp['list_energy_prod'],tmp['list_energy_rec'],tmp['list_interval']):
        #calculation of the index
        delta_day=(k-self.dt.date()).days*(18*60)
        #if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count]=[ l[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        tmp['list_energy_rec'].iloc[count]=[ m[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        tmp['list_interval'].iloc[count]=[ n[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        count+=1
    self.store_compare=self.store_compare.append(tmp)
This second part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer from chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem. In particular, the lists stored in the fields of a DataFrame make loops or apply almost unavoidable. Could you, in principle, restructure the data? (For example, one df per solar panel with columns date, time, energy.)
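To illustrate what such a restructuring could look like, here is a sketch that flattens the list columns into one row per measurement and then keeps only the minutes in which every selected panel was measured. It is one possible variant, not the exact layout suggested above. It assumes that list_time holds full timestamps, uses only the list_energy_prod column, and needs pandas 1.3 or later for multi-column explode; the sample values and the names long, wide and common are invented.

```python
import pandas as pd

# Hypothetical sample: panel P2 misses the 08:03 measurement.
energiesDatas = pd.DataFrame({
    "Name": ["P1", "P2"],
    "list_time": [
        pd.to_datetime(["2020-06-01 08:00", "2020-06-01 08:03", "2020-06-01 08:06"]).tolist(),
        pd.to_datetime(["2020-06-01 08:00", "2020-06-01 08:06"]).tolist(),
    ],
    "list_energy_prod": [[1.0, 1.1, 1.2], [2.0, 2.2]],
})

# Long format: one row per (panel, measurement).
long = (
    energiesDatas
    .explode(["list_time", "list_energy_prod"])
    .rename(columns={"list_time": "time", "list_energy_prod": "energy_prod"})
    .reset_index(drop=True)
)
long["minute"] = pd.to_datetime(long["time"]).dt.floor("min")
long["energy_prod"] = long["energy_prod"].astype(float)

# Wide format: one row per minute, one column per panel, NaN where a panel has no value.
wide = long.pivot_table(index="minute", columns="Name",
                        values="energy_prod", aggfunc="first")

# Keep only the minutes measured by all panels chosen for comparison.
ListOfName = ["P1", "P2"]
common = wide[ListOfName].dropna()
print(common)
```

The long and wide tables only need to be built once; when the user changes the panels to compare, just the final column selection and dropna have to be repeated.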
