Comparing values in Python data frame efficiently - python

I trade cryptocurrencies daily and would like to find which ones are the most desirable for trading.
I have a CSV file for every cryptocurrency with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
1. How many quotes will end with a profit:
   - for Buy positions: if any of the following Sells > the current Buy.
   - for Sell positions: if any of the following Buys < the current Sell.
2. How much time it would take for a theoretical position to become profitable.
3. What the profit potential is.
I'm using the following code:
import datetime as dt
import pandas as pd

# converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)
def ole(oledt):
    return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))

# variables initialization (the two time accumulators start as a zero timedelta)
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")

# comparing to max
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if df_slice["Sell"].max() - row["Buy"] > 0:
        max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
        max_profit_buy_counter += 1
    for index1, row1 in df_slice.iterrows():
        if row["Buy"] < row1["Sell"]:
            buy_time += ole(row1["Date"]) - ole(row["Date"])
            profit_buy_counter += 1
            break
    else:
        # for-else: runs only when no later Sell beat this Buy
        no_profit_buy_counter += 1

# comparing to sell
for index, row in df.iterrows():
    a = index + 1
    df_slice = df[a:]
    if row["Sell"] - df_slice["Buy"].min() > 0:
        max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
        max_profit_sell_counter += 1
    for index2, row2 in df_slice.iterrows():
        if row["Sell"] > row2["Buy"]:
            sell_time += ole(row2["Date"]) - ole(row["Date"])
            profit_sell_counter += 1
            break
    else:
        # for-else: runs only when no later Buy undercut this Sell
        no_profit_sell_counter += 1

num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows
if max_profit_buy_counter == 0:
    avg_max_profit_buy = "There are no profitable buy positions"
else:
    avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
    avg_max_profit_sell = "There are no profitable sell positions"
else:
    avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines, but for a larger file (276K lines) it takes a long time (more than 10 hours).
What can I do to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
Note: the dates in the CSV are in OLE format, so I need to convert them to datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:

First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
After comparing your function and mine, there is a slight difference: your a is offset by one from the index, so your slice excludes the current row while my columns include it. Removing that offset, you'll see that my method produces the same results as yours, only in vastly less time:
for index, row in df.iterrows():
    a = index
    df_slice = df[a:]
    assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
    print("All assertions passed!")
Note that this check still takes the very long time your function requires. The one-row offset can be fixed with shift, but I don't want to run your function long enough to work out which way to shift.
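As a rough sketch of how the one-row offset could be handled, the reversed cumulative extremes can be shifted by one position so that each row only sees the rows after it. The sample values and the variable names (future_max_sell, future_min_buy, and so on) are made up for illustration, and this covers only the profit counts and profit potential, not the time-to-profit part of the question.

```python
import pandas as pd

# Hypothetical sample quotes; the column names follow the question's CSV.
df = pd.DataFrame({
    "Sell": [1925.16, 1925.12, 1926.16, 1950.96, 1190.90, 1926.12],
    "Buy":  [1929.83, 1929.79, 1930.83, 1987.56, 1911.56, 1930.79],
})

# Reverse, accumulate, reverse back: the extreme from each row to the end.
# shift(-1) then drops the current row, leaving only the rows that follow.
future_max_sell = df["Sell"][::-1].cummax()[::-1].shift(-1)
future_min_buy = df["Buy"][::-1].cummin()[::-1].shift(-1)

buy_profit = future_max_sell - df["Buy"]    # NaN on the last row (nothing follows it)
sell_profit = df["Sell"] - future_min_buy   # NaN on the last row

# Comparisons with NaN are False, so the last row is never counted.
profitable_buys = int((buy_profit > 0).sum())
profitable_sells = int((sell_profit > 0).sum())

# Average best-case profit over the profitable positions only.
avg_max_profit_buy = buy_profit[buy_profit > 0].mean()
avg_max_profit_sell = sell_profit[sell_profit > 0].mean()

print(profitable_buys, profitable_sells)
```

Each column is computed in a single pass over the data, so the cost grows linearly with the number of rows instead of quadratically.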

Related

how to align similar values in two arrays in python

I am trying to align two videos using their utc timestamps.
for example:
video 1 timestamps = 1234.4321, 1234.4731, 1234.5432, 1234.5638, ...
video 2 timestamps = 1234.4843, 1234.5001, 1234.5632, 1234.5992, ...
I would like to align them so that the closest timestamps within a 0.0150 s window are paired, without pairing two values from one array with the same value in the other array.
example output:
video 1 timestamps = 1234.4321, 1234.4731, _________, 1234.5432, 1234.5638, _________, ...
video 2 timestamps = _________, 1234.4843, 1234.5001, _________, 1234.5632, 1234.5992, ...
Can someone help?
EDIT
There was a little confusion with the timestamps. The issue isn't that they simply need to be shifted once every two values. Hopefully this updated example will clear it up. Both examples are correct. A single solution should be able to solve both.
Example 2:
timestamp3 = 1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438
timestamp4 = 1590595835.70971, 1590595835.73674, 1590595835.7695, 1590595835.80338, 1590595835.83634
output:
timestamp3 = 1590595835.6775, 1590595835.70479, 1590595835.73812, 1590595835.77163, 1590595835.80438, _______________, ...
timestamp4 = _______________, 1590595835.70971, 1590595835.73674, 1590595835.7695, 1590595835.80338, 1590595835.83634, ...
Something like this:
timestamp3 = [1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438]
timestamp4 = [1590595834.70971, 1590595834.73674, 1590595834.7695, 1590595834.80338, 1590595834.83634]
len3 = len(timestamp3)
len4 = len(timestamp4)
ins = '_____________'
diff = 0.015
ii = jj = 0
while True:
    # ii and jj advance together, so both always point at the same position
    if timestamp3[ii] < timestamp4[jj] - diff:
        timestamp4.insert(jj, ins)
        len4 += 1
    elif timestamp4[jj] < timestamp3[ii] - diff:
        timestamp3.insert(ii, ins)
        len3 += 1
    ii += 1
    jj += 1
    if ii == len3 or jj == len4:
        # pad the shorter list so both end up the same length
        if len3 > len4:
            timestamp4.extend([ins]*(len3-len4))
        elif len4 > len3:
            timestamp3.extend([ins]*(len4-len3))
        break
print(timestamp3)
print(timestamp4)
Gives:
[1590595834.6775, 1590595834.70479, 1590595834.73812, 1590595834.77163, 1590595834.80438, '_____________']
['_____________', 1590595834.70971, 1590595834.73674, 1590595834.7695, 1590595834.80338, 1590595834.83634]
I think this is what you mean:
timestamps1 = [1234.4321, 1234.4731, 1234.5432, 1234.5638]
timestamps2 = [1234.4843, 1234.5001, 1234.5632, 1234.5992]
index = len(timestamps1)
while index > 0:
    timestamps1.insert(index,'_______')
    index -= 2
    timestamps2.insert(index,'_______')
print(timestamps1)
print(timestamps2)
Output:
[1234.4321, 1234.4731, '_______', 1234.5432, 1234.5638, '_______']
['_______', 1234.4843, 1234.5001, '_______', 1234.5632, 1234.5992]
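For comparison, here is a minimal two-pointer sketch that builds new lists instead of inserting into the originals. It assumes both inputs are sorted in ascending order; the function name align and the use of None as the gap marker are illustrative choices. It pairs greedily, taking the first candidate inside the window, which is not guaranteed to be the closest match when several candidates fall inside the same window.

```python
def align(a, b, window=0.015):
    """Pair up two sorted timestamp lists, using None where there is no partner."""
    out_a, out_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= window:
            # Close enough: emit the two values as a pair.
            out_a.append(a[i])
            out_b.append(b[j])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is too early to match b[j]: emit it alone.
            out_a.append(a[i])
            out_b.append(None)
            i += 1
        else:
            # b[j] is too early to match a[i]: emit it alone.
            out_a.append(None)
            out_b.append(b[j])
            j += 1
    # Whatever is left in either list has no partner.
    for value in a[i:]:
        out_a.append(value)
        out_b.append(None)
    for value in b[j:]:
        out_a.append(None)
        out_b.append(value)
    return out_a, out_b


video1 = [1234.4321, 1234.4731, 1234.5432, 1234.5638]
video2 = [1234.4843, 1234.5001, 1234.5632, 1234.5992]
aligned1, aligned2 = align(video1, video2)
print(aligned1)
print(aligned2)
```

On the first example from the question this leaves gaps in the same positions as the requested output.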

Trying to get a specific temperature range while also keeping in range of other two variables

min_desired = int(input("Min. Desired Temp.: "))
max_desired = int(input("Max. Desired Temp.: "))
def desired(min_desired,max_desired):
    holder = []
    count = 0
    total = 0
    with open('C:/Users/amaya/OneDrive/Desktop/Weather_final.txt','r') as weather_contents:
        weather = weather_contents.readlines()
        for lines in weather:
            #Use map to convert values on each line to float and to list
            column = list(map(float, lines.strip().split()))
            holder.append(column)
        print(holder)
    for x in holder:
        print(x)
        if x >= min_desired and x <= max_desired:
            if humidity < 70 and humidity > 40:
                if wind < 12:
                    count +=1
                    total += x
    avg = (total/ count)
    print(count)
    print (avg)
print(desired(min_desired, max_desired))
I'm aware that 'humidity' and 'wind' are undefined and that what I've tried might be completely wrong. I'm stumped on how to get the first column, which would be 'Temp' that needs to be in a specific range.
ex. min temp = 60
max temp = 85
while taking into consideration 2 pre-set conditions
humidity must be between 70 and 40 & wind must be lower than 12
Thanks in advance for all the help!!
I would suggest using pandas. Not only will this make parsing your textfile much easier, but pandas dataframes have methods to help you select data based on any criteria you want. Using pandas, your code can be made much simpler:
import pandas as pd
weather = pd.read_csv('weather.txt', sep=" ", names=['Temperature', 'Humidity', 'Wind'])
To select data where wind < 12 and 40 < humidity < 70:
subset = weather.loc[(weather['Wind']<12) & (weather['Humidity']>40) & (weather['Humidity']<70)]
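The selection above leaves out the temperature range itself. A small sketch of the full filter, including the count and average the question asks for, might look like this. The function name average_desired_temp and the sample rows are invented for the example, and the column names assume the names= list passed to read_csv above.

```python
import pandas as pd

def average_desired_temp(weather, min_desired, max_desired):
    """Return (count, mean temperature) for rows that meet all three conditions."""
    mask = (
        weather["Temperature"].between(min_desired, max_desired)  # inclusive bounds
        & (weather["Humidity"] > 40)
        & (weather["Humidity"] < 70)
        & (weather["Wind"] < 12)
    )
    subset = weather.loc[mask]
    if subset.empty:
        return 0, None  # nothing matched, so there is no average
    return len(subset), subset["Temperature"].mean()


# Made-up rows standing in for the weather file.
weather = pd.DataFrame(
    [[65, 50, 5], [90, 50, 5], [70, 80, 5], [75, 60, 20], [85, 45, 11]],
    columns=["Temperature", "Humidity", "Wind"],
)
count, avg = average_desired_temp(weather, 60, 85)
print(count, avg)
```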
Normally I would use pandas because it would need simpler code and it has many other useful functions.
But here I will show how it could be done without pandas - but I don't have your data to test it.
I would use one for-loop and then I could directly convert column (or rather row) to variables
temp, humidity, wind = column
# --- functions ---
def desired(min_desired,max_desired):
    #data = []
    count = 0
    total = 0
    with open('C:/Users/amaya/OneDrive/Desktop/Weather_final.txt','r') as weather_contents:
        for line in weather_contents:
            row = list(map(float, line.strip().split()))
            #data.append(row)
            temp, humidity, wind = row
            if min_desired <= temp <= max_desired and 40 < humidity < 70 and wind < 12:
                count += 1
                total += temp
    print('count:', count)
    if count != 0:
        print('avg:', total/count) # don't divide by zero
# --- main ---
min_desired = int(input("Min. Desired Temp.: "))
max_desired = int(input("Max. Desired Temp.: "))
# without `print()` if you use `print()` inside function
desired(min_desired, max_desired)

How do I binary search a pandas dataframe for a combination of column values?

Sorry if this is a simple question that the pandas documentation explains, but I've tried searching for how to do this and haven't had any luck.
I have a pandas datafame with several columns, and I want to be able to search for a particular row using binary search since my dataset is big and I'll be doing a lot of searches.
My data looks like this:
Name           Course   Week  Grade
-------------  -------  ----  -----
Homer Simpson  MATH001  1     97
Homer Simpson  MATH001  3     85
Homer Simpson  CSCI100  1     89
John McGuirk   MATH001  2     78
John McGuirk   CSCI100  1     100
John McGuirk   CSCI100  2     96
I want to be able to search my data quickly for a specific combination of name, course, and week. Each distinct combination of name, course, and week will have either zero or one row in the dataset. If there is a missing value for the combination of name, course, and week that I'm searching for, I want my search to return 0.
For instance, I would like to search for the value (John McGuirk, CSCI100, 1)
Is there a built in way to do this, or do I have to write my own binary search?
Update:
I tried doing this using the built-in way that was suggested by one of the commenters below, and I also tried doing it with a custom binary search that's written for my specific data, and another custom binary search that uses recursion to handle different columns than my specific example.
The data frame for these tests contains 10,000 rows. I put the timings below. Both binary searches performed better than using [...] to get rows. I'm far from a Python expert, so I'm not sure how well optimized my code is.
# Load data
from pandas import DataFrame, read_csv
import math
import pandas as pd
import time
file = 'grades.xlsx'
df = pd.read_excel(file)

# This was suggested by one of the commenters below
def get_grade(name, course, week):
    mask = (df.name.values == name) & (df.course.values == course) & (df.week.values == week)
    row = df[mask]
    if row.empty == False:
        return row.grade.values[0]
    else:
        return 0

# Binary search that is specific to my particular data
def get_grade_binary_search(name, course, week):
    lower = 0
    upper = len(df.index) - 1
    while lower <= upper:
        mid = math.floor((lower + upper) / 2)
        row_name = df.iat[mid, 0]
        if name < row_name:
            upper = mid - 1
        elif name > row_name:
            lower = mid + 1
        else:
            row_course = df.iat[mid, 1]
            if course < row_course:
                upper = mid - 1
            elif course > row_course:
                lower = mid + 1
            else:
                row_week = df.iat[mid, 2]
                if week < row_week:
                    upper = mid - 1
                elif week > row_week:
                    lower = mid + 1
                else:
                    return df.iat[mid, 3]
    return 0

# General purpose binary search
def get_grade_binary_search_recursive(search_value):
    lower = 0
    upper = len(df.index) - 1
    while lower <= upper:
        mid = math.floor((lower + upper) / 2)
        comparison = compare(search_value, 0, mid)
        if comparison < 0:
            upper = mid - 1
        elif comparison > 0:
            lower = mid + 1
        else:
            return df.iat[mid, len(search_value)]
    # not found: return 0, matching the other two functions
    return 0

# Utility method: compares the search key with row df_value_index, column by column
def compare(search_value, search_column_index, df_value_index):
    if search_column_index >= len(search_value):
        return 0
    if search_value[search_column_index] < df.iat[df_value_index, search_column_index]:
        return -1
    elif search_value[search_column_index] > df.iat[df_value_index, search_column_index]:
        return 1
    else:
        return compare(search_value, search_column_index + 1, df_value_index)
Here are the timings. I also printed the sum of the returned values from each search to verify that the same rows are getting returned.
# Non binary search
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade(name, course, week)
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)
elapsed time: 26.130020141601562
sum of grades: 498724

# Binary search specific to this data
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search(name, course, week)
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)
elapsed time: 4.4506165981292725
sum of grades: 498724

# Binary search with recursion
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search_recursive([name, course, week])
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum_of_grades: ', sum_of_grades)
elapsed time: 7.559535264968872
sum_of_grades: 498724
Pandas has searchsorted.
From the Notes:
Binary search is used to find the required insertion points.
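searchsorted operates on a single sorted Series, so it does not directly take a three-column key. One possible workaround, sketched here with the standard-library bisect module rather than pandas itself, is to sort the frame once, keep the key columns as a list of tuples, and binary-search that list. The capitalised column names and the sample rows are taken from the table in the question; the helper names keys and grades are made up for the example.

```python
import bisect
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "Name": ["Homer Simpson"] * 3 + ["John McGuirk"] * 3,
    "Course": ["MATH001", "MATH001", "CSCI100", "MATH001", "CSCI100", "CSCI100"],
    "Week": [1, 3, 1, 2, 1, 2],
    "Grade": [97, 85, 89, 78, 100, 96],
})

# Binary search needs sorted keys, so sort once up front.
ordered = df.sort_values(["Name", "Course", "Week"])
keys = list(zip(ordered["Name"].tolist(),
                ordered["Course"].tolist(),
                ordered["Week"].tolist()))
grades = ordered["Grade"].tolist()

def get_grade(name, course, week):
    """Binary-search the sorted key tuples; return 0 when the key is absent."""
    key = (name, course, week)
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return grades[pos]
    return 0

print(get_grade("John McGuirk", "CSCI100", 1))
print(get_grade("Homer Simpson", "CSCI100", 2))
```

The sort is paid once, and each lookup afterwards is a plain-Python binary search that avoids per-call DataFrame indexing.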

Python code not working as intended

I started learning Python < 2 weeks ago.
I'm trying to make a function to compute a 7 day moving average for data. Something wasn't going right so I tried it without the function.
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
        if missing == 7:
            moving_average = np.append(moving_average, np.nan)
            break
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    if j == (i+6):
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
If I run this and look at the value of sum_7, it's just a single value in the numpy array which made all the moving_average values wrong. But if I remove the first for loop with the variable i and manually set i = 0 or any number in the range of the data set and run the exact same code from the inner for loop, sum_7 comes out as a length 7 numpy array. Originally, I just did sum += temp[j] but the same problem occurred, the total sum ended up as just the single value.
I've been staring at this trying to fix it for 3 hours and I'm clueless what's wrong. Originally I wrote the function in R so all I had to do was convert to python language and I don't know why sum_7 is coming up as a single value when there are two for loops. I tried to manually add an index variable to act as i to use it in the range(i, i+7) but got some weird error instead. I also don't know why that is.
https://gyazo.com/d900d1d7917074f336567b971c8a5cee
https://gyazo.com/132733df8bbdaf2847944d1be02e57d2
Hey, you can use the rolling() and mean() functions from pandas.
Link to the documentation :
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html
df['moving_avg'] = df['your_column'].rolling(7).mean()
This would give you some NaN values also, but that is a part of rolling mean because you don't have all past 7 data points for first 6 values.
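If the goal is to reproduce the loop in the question, which averages whatever non-missing values each 7-day window contains, a possible variant passes min_periods=1 and then drops the first six partial windows. This is a sketch on made-up data; temp here is a hypothetical Series standing in for the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two missing days.
temp = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan, 10.0])

# min_periods=1: a window yields a mean as long as it holds at least one
# non-missing value; a window of only missing values yields NaN.
rolled = temp.rolling(7, min_periods=1).mean()

# Each result is labelled by the last day of its window, so positions 0..5
# are partial windows; drop them to keep only full 7-day windows.
moving_average = rolled.iloc[6:].to_numpy()
print(moving_average)
```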
Seems like you misindented the important line:
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
        if missing == 7:
            moving_average = np.append(moving_average, np.nan)
            break
    # The following condition should be indented one more level
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    if j == (i+6):
        # this ^ condition does not do what you meant
        # you should use a flag instead
        avg_7 = sum(sum_7)/total
        moving_average = np.append(moving_average, avg_7)
Instead of a flag you can use a for-else construct, but this is not readable. Here's the relevant documentation.
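For reference, a for-else version of the per-window logic could look roughly like the sketch below. The else branch attached to the for loop runs only when the loop finishes without hitting break. The function name window_mean is made up for this illustration.

```python
import pandas as pd

def window_mean(window):
    """Mean of the non-missing values; NaN if the window is empty or all missing."""
    if len(window) == 0:
        return float("nan")
    valid = []
    missing = 0
    for value in window:
        if pd.isnull(value):
            missing += 1
            if missing == len(window):
                break  # every value in the window was missing
        else:
            valid.append(value)
    else:
        # Reached only when the loop ended without break,
        # so at least one valid value was collected.
        return sum(valid) / len(valid)
    return float("nan")


print(window_mean([1.0, float("nan"), 3.0]))
print(window_mean([float("nan"), float("nan")]))
```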
Shorter way to do this:
moving_average = np.array([])
for i in range(len(temp)-6):
    ngram_7 = [t for t in temp[i:i+7] if not pd.isnull(t)]
    average = (sum(ngram_7) / len(ngram_7)) if ngram_7 else np.nan
    moving_average = np.append(moving_average, average)
This could be refactored further:
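def average(ngram):
    # use the window that was passed in, not the global temp
    valid = [t for t in ngram if not pd.isnull(t)]
    if not valid:
        return np.nan
    return sum(valid) / len(valid)

def ngrams(seq, n):
    # + 1 so the last full window is included, matching range(len(temp)-6) above
    for i in range(len(seq) - n + 1):
        yield seq[i:i+n]

moving_average = [average(k) for k in ngrams(temp, 7)]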

unable to loop through numpy arrays

I am really confused and can't seem to find an answer for my code below. I keep getting the following error:
File "C:\Users\antoniozeus\Desktop\backtester2.py", line 117, in backTest
if prices >= smas:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Now, as you will see my code below, I am trying to compare two numpy arrays, step by step, to try and generate a signal once my condition is met. This is based on stock Apple data.
Going from one point at a time so starting at index[0] then [1], if my prices is greater than or equal to smas (moving average), then a signal is produced. Here is the code:
def backTest():
    #Trade Rules
    #Buy when prices are greater than our moving average
    #Sell when prices drop below or moving average
    portfolio = 50000
    tradeComm = 7.95
    stance = 'none'
    buyPrice = 0
    sellPrice = 0
    previousPrice = 0
    totalProfit = 0
    numberOfTrades = 0
    startPrice = 0
    startTime = 0
    endTime = 0
    totalInvestedTime = 0
    overallStartTime = 0
    overallEndTime = 0
    unixConvertToWeeks = 7*24*60*60
    unixConvertToDays = 24*60*60
    date, closep, highp, lowp, openp, volume = np.genfromtxt('AAPL2.txt', delimiter=',', unpack=True,
                                                              converters={ 0: mdates.strpdate2num('%Y%m%d')})
    ## FIRST SMA
    window = 10
    weights = np.repeat(1.0, window)/window
    '''valid makes sure that we only calculate from valid data, no MA on points 0:21'''
    smas = np.convolve(closep, weights, 'valid')
    prices = closep[9:]
    for price in prices:
        if stance == 'none':
            if prices >= smas:
                print "buy triggered"
                buyPrice = closep
                print "bought stock for", buyPrice
                stance = "holding"
                startTime = date
                print 'Enter Date:', startTime
                if numberOfTrades == 0:
                    startPrice = buyPrice
                    overallStartTime = date
                numberOfTrades += 1
        elif stance == 'holding':
            if prices < smas:
                print 'sell triggered'
                sellPrice = closep
                print 'finished trade, sold for:',sellPrice
                stance = 'none'
                tradeProfit = sellPrice - buyPrice
                totalProfit += tradeProfit
                print totalProfit
                print 'Exit Date:', endTime
                endTime = date
                timeInvested = endTime - startTime
                totalInvestedTime += timeInvested
                overallEndTime = endTime
                numberOfTrades += 1
        #this is our reset
        previousPrice = closep
You have numpy arrays: smas is the output of np.convolve, which is an array, and I believe that prices is also an array. With numpy, arr > other_arr returns an ndarray, which doesn't have a well-defined truth value (hence the error).
You probably want to compare price with a single element from smas, although I'm not sure which one (or what np.convolve is going to return here; it may only have a single element)...
I think you mean
if price >= smas
You have
if prices >= smas
which compares the whole list at once.
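Note that comparing a single price against the whole smas array still produces an array, so the same error would appear. One reading of the intent is to pair each price with the moving-average value for the same day, for example with zip. The sketch below uses Python 3, a made-up price series, and a window of 3 instead of 10 to keep the data short; the signals list is an illustrative stand-in for the print statements in the question.

```python
import numpy as np

# Hypothetical closing prices; the real code reads them from AAPL2.txt.
closep = np.array([10.0, 11.0, 12.0, 11.0, 9.0, 8.0, 10.0, 13.0])

window = 3
weights = np.repeat(1.0, window) / window
smas = np.convolve(closep, weights, "valid")  # one value per full window
prices = closep[window - 1:]                  # same length as smas

signals = []
stance = "none"
# zip pairs each price with the moving average ending on that same day,
# so every comparison is scalar against scalar.
for price, sma in zip(prices, smas):
    if stance == "none" and price >= sma:
        signals.append(("buy", float(price)))
        stance = "holding"
    elif stance == "holding" and price < sma:
        signals.append(("sell", float(price)))
        stance = "none"

print(signals)
```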
