Here is the simplified sample dataset:
Price Signal
0 1.5
1 2.0 Buy
2 2.1
3 2.2
4 1.7 Sell
Here is the code to generate the above sample dataset for ease of reference:
import pandas as pd

price = [1.5, 2, 2.1, 2.2, 1.7]
signal = ['', 'Buy', '', '', 'Sell']
df = pd.DataFrame(zip(price, signal), columns=['Price', 'Signal'])
Here is the task:
Assuming initial cash = 100 and stock position = 0, simulate the cash and stock position at each step based on the following code using .iterrows():
cash = 100
num_of_shares = 0
for index, row in df.iterrows():
    if row['Signal'] == 'Sell':
        if num_of_shares > 0:
            cash = num_of_shares * row['Price']
            num_of_shares = 0
    elif row['Signal'] == 'Buy':
        if cash > 0:
            num_of_shares = cash / row['Price']
            cash = 0
    df.loc[index, 'Position'] = num_of_shares
    df.loc[index, 'Cash'] = cash
Here is the result:
Price Signal Position Cash
0 1.5 0.0 100.0
1 2.0 Buy 50.0 0.0
2 2.1 50.0 0.0
3 2.2 50.0 0.0
4 1.7 Sell 0.0 85.0
Here is the question: Is there any way to achieve the result faster than using .iterrows()?
You can use the .to_dict() function in pandas to convert the DataFrame to a dictionary. Iterating over a dictionary is much faster than iterrows(); in my tests it was roughly 77x faster than iterating the same data with iterrows().
For example:
Iterate the DataFrame with iterrows():
for key, value in df.iterrows():
    # key is your index
    # value is your row
Iterate the DataFrame after converting it to a dictionary:
dict1 = df.to_dict(orient='index')
for key in dict1:
    value = dict1[key]
    # key is your index
    # value is your row
To compare the DataFrame and the dictionary: suppose my DataFrame has 3400 rows.
Using pandas iterrows() it takes about 2.45 s to iterate over the full DataFrame;
with the dictionary of the same length it takes about 0.13 s.
So converting the pandas DataFrame to a dictionary is a good way to optimize this kind of code.
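Applied to the simulation above, a minimal sketch (assuming the same df with Price and Signal columns) could iterate the dictionary and collect the results in plain lists, then assign them back in one step:

import pandas as pd

price = [1.5, 2, 2.1, 2.2, 1.7]
signal = ['', 'Buy', '', '', 'Sell']
df = pd.DataFrame(zip(price, signal), columns=['Price', 'Signal'])

cash = 100
num_of_shares = 0
positions, cash_history = [], []
# Iterate over a plain dict of rows instead of DataFrame rows
for index, row in df.to_dict(orient='index').items():
    if row['Signal'] == 'Sell' and num_of_shares > 0:
        cash = num_of_shares * row['Price']
        num_of_shares = 0
    elif row['Signal'] == 'Buy' and cash > 0:
        num_of_shares = cash / row['Price']
        cash = 0
    positions.append(num_of_shares)
    cash_history.append(cash)
# Assign the collected results back in one step instead of df.loc per row
df['Position'] = positions
df['Cash'] = cash_history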
Related
I have a pandas DataFrame with values like below, though in reality I am working with a lot more columns and historical data:
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a DataFrame with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where for example AUDUSD is calculated as the product of the AUD column and the USD column.
I tried below
for col in df:
    for cols in df:
        cf[col+cols] = df[col] * df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, and duplicate values like AUDUSD and USDAUD. I think if I could somehow set "cols = col+1 till the end of df" in the second for loop I should be able to resolve the issue, but I don't know how to do that.
Result i am looking for is a table with below columns and their values
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
Your first approach of using an inner/outer loop was intuitive, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiate the second DataFrame
cf = pd.DataFrame()
# Loop over the column positions as integers
for i in range(len(df.columns)):
    # Start the inner index at i + 1 so you aren't looking at the same column twice,
    # and limit the range to the number of columns
    for j in range(i + 1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be the product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console Log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
I am creating a variable 'spike' as an indicator variable that is 1 for the dates corresponding to the n smallest values of the existing Cost column and 0 otherwise. The code illustrated below is part of a larger for loop.
I can only get results using the idxmin() function. I would like help in getting the index for the n smallest values.
import pandas as pd
import numpy as np
df3 = pd.DataFrame({'Dept': ['A', 'A', 'B', 'B'],
                    'Benefit': [2000, 25, 55, 400],
                    'Cost': [1000, 500, 1500, 2000]})
# Let's create an index using Timestamps
index_ = [pd.Timestamp('01-06-2018'), pd.Timestamp('04-06-2018'),
pd.Timestamp('07-06-2018'), pd.Timestamp('10-06-2018')]
df3.index = index_
print(df3)
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)
If you sort, then you can get the top-3 with standard Python / numpy array slicing.
low_cost = df3.sort_values('Cost')[:3]
low_cost
# Dept Benefit Cost
# 2018-04-06 A 25 500
# 2018-01-06 A 2000 1000
# 2018-07-06 B 55 1500
To get the spike column, for efficiency I would recommend a join.
spikes = low_cost.assign(spike=1)[['spike']]
spikes
# spike
# 2018-04-06 1
# 2018-01-06 1
# 2018-07-06 1
df3.join(spikes, how='left').fillna(0)
# Dept Benefit Cost spike
# 2018-01-06 A 2000 1000 1.0
# 2018-04-06 A 25 500 1.0
# 2018-07-06 B 55 1500 1.0
# 2018-10-06 B 400 2000 0.0
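If you want to keep the asker's np.where(df3.index.isin(lookup), 1, 0) line instead, a minimal sketch (assuming lookup should hold the dates of the n smallest Cost values, here n = 3) would be:

n = 3
# Dates of the n smallest Cost values, usable as the 'lookup' in the question
lookup = df3.sort_values('Cost').index[:n]
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)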
I am trying (unsuccessfully) to create separate columns for embedded dictionary keys. Dict data looks like this:
[{'averagePrice': 32.95,
'currentDayProfitLoss': 67.2,
'currentDayProfitLossPercentage': 0.02,
'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
'longQuantity': 120.0,
'marketValue': 4021.2,
'settledLongQuantity': 120.0,
'settledShortQuantity': 0.0,
'shortQuantity': 0.0}]
The 'instrument' key is what I am trying to flatten into columns (i.e. assetType, cusip, symbol). Here is the code I last tried, and still no individual columns:
data = accounts_data_single
my_dict = data
headers = list(my_dict['securitiesAccount']['positions'])
dict1 = my_dict['securitiesAccount']['positions']
mypositions = pd.DataFrame(dict1)
pd.concat([mypositions.drop(['instrument'], axis=1), mypositions['instrument'].apply(pd.Series)], axis=1)
mypositions.to_csv('Amer_temp.csv')
Any suggestions are greatly appreciated
I am trying to get the nested keys/fieldnames all in columns and then all the stock positions in rows. The above code works great except the nested 'instrument' keys are all in one column.
averagePrice currentDayProfitLoss ... assetType cusip symbol
22.5 500 ... Equity 013245 IIVI
450 250 ... Equity 321354 AAPL
etc
Here's a way to do this. Let's say d is your dict.
Step 1: Convert the dict to dataframe
d1 = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)
Step 2: Convert the instrument column into dataframe
d2 = d1['instrument'].apply(pd.Series)
Step 3: Join the outputs of Step 1 and Step 2
df = pd.concat([d1.drop('instrument', axis=1), d2], axis=1)
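As a side note, if your positions come as a list of dicts like the sample above, a minimal sketch using pd.json_normalize can flatten the nested 'instrument' keys in one call (the positions list below is a shortened stand-in for the question's data):

import pandas as pd

# Shortened sample of one position record from the question
positions = [{
    'averagePrice': 32.95,
    'currentDayProfitLoss': 67.2,
    'instrument': {'assetType': 'EQUITY', 'cusip': '902104108', 'symbol': 'IIVI'},
    'longQuantity': 120.0,
}]
flat = pd.json_normalize(positions)
# Nested keys become dotted column names such as 'instrument.symbol';
# strip the prefix to get plain assetType / cusip / symbol columns
flat.columns = [c.split('.')[-1] for c in flat.columns]
print(flat)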
Are you trying to do this:
pd.DataFrame(d).assign(**pd.DataFrame([x['instrument'] for x in d])).drop(columns='instrument')
output:
averagePrice currentDayProfitLoss currentDayProfitLossPercentage longQuantity marketValue settledLongQuantity settledShortQuantity shortQuantity assetType cusip symbol
0 32.95 67.2 0.02 120.0 4021.2 120.0 0.0 0.0 EQUITY 902104108 IIVI
1 31.95 63.2 0.01 100.0 3021.2 100.0 0.0 0.0 EQUITY 802104108 AAPL
I'm trying to update column entries by counting the frequency of row entries in different columns. Here is a sample of my data. The actual data consists of 10k samples, each of length 220 (220 seconds).
d = {'ID': ['a12', 'a12', 'a12', 'a12', 'a12', 'a12', 'a12', 'a12',
            'v55', 'v55', 'v55', 'v55', 'v55', 'v55', 'v55', 'v55'],
     'Exp_A': [0.012, 0.154, 0.257, 0.665, 1.072, 1.514, 1.871, 2.144,
               0.467, 0.812, 1.59, 2.151, 2.68, 3.013, 3.514, 4.015],
     'freq': ['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04',
              '00:00:05', '00:00:06', '00:00:07', '00:00:00', '00:00:01',
              '00:00:02', '00:00:03', '00:00:04', '00:00:05', '00:00:06', '00:00:07'],
     'A_Bullseye': [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
     'A_Bull_Total': [0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 0, 1, 1, 1, 1, 2],
     'A_Shot': [0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0]}
df = pd.DataFrame(data=d)
In each second, only a Bullseye or a Shot can be registered.
Count1: the number of df.A_Shot == 1 before the first df.A_Bullseye == 1 for each ID; this is 3 for ID == 'a12' and 2 for ID == 'v55'.
Count2: the number of df.A_Shot == 1 from the end of Count1 to the second df.A_Bullseye == 1; this is 1 for df[df.ID == 'a12'] and 2 for df[df.ID == 'v55'].
Here i in Count(i) runs up to df.groupby(by='ID')['A_Bull_Total'].max(), which is 2.
So, if I can compute the average count for each i, then I will be able to adjust the values of df.Exp_A using the average of the above counts.
mask_A_Shot = df.A_Shot == 1
mask_A_Bullseye = df.A_Bullseye == 0
mask = mask_A_Shot & mask_A_Bullseye
df[mask.groupby(df['ID'])].mean()
Ideally I would like something like: for each i (bullseye), how many shots were needed and how many seconds it took.
Create a grouping key of Bullseye within each ID using .cumsum, and then you can find how many shots were taken, and how much time elapsed, between bullseyes.
import pandas as pd
df['freq'] = pd.to_timedelta(df.freq, unit='s')
df['Bullseye'] = df.groupby('ID').A_Bullseye.cumsum()+1
# Chop off any shots after the final bullseye
m = df.Bullseye <= df.groupby('ID').A_Bullseye.transform(lambda x: x.cumsum().max())
df[m].groupby(['ID', 'Bullseye']).agg({'A_Shot': 'sum',
'freq': lambda x: x.max()-x.min()})
Output:
A_Shot freq
ID Bullseye
a12 1 3 00:00:03
2 1 00:00:01
v55 1 2 00:00:01
2 2 00:00:03
Edit:
Given your comment, here is how I would proceed. We're going to .shift the Bullseye column so instead of incrementing the counter at the Bullseye, we increment the counter the row after the bullseye. We'll modify A_Shot so bullseyes are also considered to be a shot.
df['freq'] = pd.to_timedelta(df.freq, unit='s')
df['Bullseye'] = df.groupby('ID').A_Bullseye.apply(lambda x: x.shift().cumsum().fillna(0)+1)
# Also consider Bullseye's as a shot:
df.loc[df.A_Bullseye == 1, 'A_Shot'] = 1
# Chop off any shots after the final bullseye
m = df.Bullseye <= df.groupby('ID').A_Bullseye.transform(lambda x: x.cumsum().max())
df1 = (df[m].groupby(['ID', 'Bullseye'])
.agg({'A_Shot': 'sum',
'freq': lambda x: (x.max()-x.min()).total_seconds()}))
Output: df1
A_Shot freq
ID Bullseye
a12 1.0 4 4.0
2.0 2 1.0
v55 1.0 3 2.0
2.0 3 3.0
And now since freq is an integer number of seconds, you can do divisions easily:
df1.A_Shot / df1.freq
#ID Bullseye
#a12 1.0 1.0
# 2.0 2.0
#v55 1.0 1.5
# 2.0 1.0
#dtype: float64
Suppose df.bun (df is a pandas DataFrame) has a MultiIndex (date and name), with the variable being category values written as strings:
date name values
20170331 A122630 stock-a
A123320 stock-a
A152500 stock-b
A167860 bond
A196030 stock-a
A196220 stock-a
A204420 stock-a
A204450 curncy-US
A204480 raw-material
A219900 stock-a
How can I get the total counts within each date and their percentages, to make a table like below for each date?
date variable counts Percentage
20170331 stock 7 70%
bond 1 10%
raw-material 1 10%
curncy 1 10%
I have tried print(df.groupby('bun').count()) as an attempt at this, but it falls short.
For reference, before getting df.bun I used the following code to import a nested dictionary into a pandas DataFrame:
import numpy as np
import pandas as pd

result = pd.DataFrame()
origDict = np.load("Hannah Lee.npy")
for item in range(len(origDict)):
    newdict = {(k1, k2): v2 for k1, v1 in origDict[item].items() for k2, v2 in origDict[item][k1].items()}
    df = pd.DataFrame([newdict[i] for i in sorted(newdict)],
                      index=pd.MultiIndex.from_tuples([i for i in sorted(newdict.keys())]))
print(df.bun)
I believe you need SeriesGroupBy.value_counts:
g = df.groupby('date')['values']
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print (df)
counts percentage
date values
20170331 stock-a 6 60.0
bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-b 1 10.0
Another solution: use size for the counts, then divide by a new Series created with transform and sum:
df2 = df.reset_index().groupby(['date', 'values']).size().to_frame('count')
df2['percentage'] = df2['count'].div(df2.groupby('date')['count'].transform('sum')).mul(100)
print (df2)
count percentage
date values
20170331 bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-a 6 60.0
stock-b 1 10.0
The difference between the solutions is that the first sorts by counts within each group, while the second sorts the MultiIndex.
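If you want the percentage column rendered as strings like 70%, as in the requested table, a minimal sketch (shown here on the df2 from the second solution; the same idea applies to the first result) could be:

# Assumes df2 from the second solution above; render the numeric
# percentage as a string such as '60%'
df2['percentage'] = df2['percentage'].round().astype(int).astype(str) + '%'
print(df2)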