How do I create a dataframe using one variable being live streamed? - python

I am streaming live price data using the IB API, and I want to put it in a dataframe for analysis. My data consists of a price being live streamed with no timestamp.
I think I need to create new rows using row numbers that are automatically added, and have the prices inserted in the price column.
I have tried defining the dataframe and telling the price where to go as follows:
def tick_df(self, reqId, contract):
    # this stores the price dataframe by creating an empty dataframe and
    # setting the index to the time column
    self.bardata[reqId] = pd.DataFrame(columns=['index', 'price'])
    self.reqMktData(reqId, contract, "", False, False, [])
    self.bardata[reqId].index = [x for x in range(1, len(self.bardata[reqId].values) + 1)]
    return self.bardata[reqId]

def tickPrice(self, reqId, tickType, price, attrib):  # this function prints the price
    if tickType == 2 and reqId == 102:
        self.bardata[reqId].loc[self.bardata[reqId].index] = price
I have been using a methodology similar to the one here (https://github.com/PythonForForex/Interactive-brokers-python-api-guide/blob/master/GOOG_five_percent.py). However, as I am only streaming a price, I cannot use a timestamp for creating new rows.

I don't know if this is what you need. In a loop I generate a random price and append it to a dataframe.
import numpy as np
import pandas as pd

_price = 1.1300  # first price in the series
_std = 0.0005    # volatility (standard deviation)
df = pd.DataFrame(columns=['price'])

for i in range(1000):
    _wn = np.random.normal(loc=0, scale=_std, size=1)  # random white noise
    _price = _price + _wn[0]  # random walk step
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    df = df.append({'price': _price}, ignore_index=True)
df
I work with FOREX time series and I cannot conceive of a time series without time, so, just in case you have the same 'problem', I am including a version with a timestamp:
import numpy as np
import pandas as pd
from datetime import datetime

_price = 1.1300  # first price in the series
_std = 0.0005    # volatility (standard deviation)
df = pd.DataFrame(columns=['price', 'time'])

for i in range(1000):
    _wn = np.random.normal(loc=0, scale=_std, size=1)  # random white noise
    _price = _price + _wn[0]  # random walk step
    _time = datetime.now()
    df = df.append({'price': _price, 'time': _time}, ignore_index=True)
df
Please let me know if this is what you needed.
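On pandas 2.x, where DataFrame.append is gone, the same idea can be written with .loc and an auto-incrementing row number, which is also closer to what the asker's tickPrice callback needs. A minimal sketch, assuming the bardata dict and the tickType/reqId checks from the question (the class name and everything else here is illustrative):

import pandas as pd

class TickStore:
    def __init__(self):
        self.bardata = {}

    def tick_df(self, reqId):
        # one empty frame per request id; rows get integer labels 0, 1, 2, ...
        self.bardata[reqId] = pd.DataFrame(columns=['price'])
        return self.bardata[reqId]

    def tickPrice(self, reqId, tickType, price, attrib=None):
        if tickType == 2 and reqId in self.bardata:
            df = self.bardata[reqId]
            df.loc[len(df)] = price  # len(df) is the next free row number

store = TickStore()
store.tick_df(102)
for p in (1.10, 1.11, 1.12):
    store.tickPrice(102, 2, p)
print(store.bardata[102])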

Related

how to find values above threshold in pandas and store them with date

I have a DF with stock prices and I want to find stock prices for each day that are above a threshold and record the date, percent increase and stock name.
import pandas as pd
import requests
import time
import yfinance as yf

stock_tickers = ['AAPL', 'MSFT', 'LCID', 'HOOD', 'TSLA']
df = yf.download(stock_tickers,
                 start='2020-01-01',
                 end='2021-06-12',
                 progress=True,
                 )
data = df['Adj Close']
data = data.pct_change()
data.dropna(inplace=True)
top = []
for i in range(len(data)):
    if i > .01:
        top.append(data.columns[i])
I tried to do a for loop, but it saves all the ticker names.
What I want to do is find the stocks that increased by 1% on each day and save the name, date and percent increase in a pandas DataFrame.
Any help would be appreciated.
There might be a more efficient way, but I'd use DataFrame.iteritems(). An example is attached below. I kept the duplicated Date index since I was not sure how you'd like to keep the data.
data = df["Adj Close"].pct_change()
threshold = 0.01
df_above_th_list = []
for item in data.iteritems():
stock = item[0]
sr_above_th = item[1][item[1] > threshold]
df_above_th_list.append(pd.DataFrame({"stock": stock, "pct": sr_above_th}))
df_above_th = pd.concat(df_above_th_list)
If you want to process the data by row, you can use DataFrame.iterrows() or DataFrame.itertuples().
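As a vectorized alternative (a sketch, assuming the same df returned by yf.download above), a boolean mask plus stack() yields the (date, ticker, pct) triples directly, with no explicit loop:

data = df['Adj Close'].pct_change()
# keep only the cells above the threshold, then stack the survivors
# into a Series indexed by (date, ticker)
above = data[data > 0.01].stack()
result = above.reset_index()
result.columns = ['date', 'ticker', 'pct_increase']
print(result)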

Error when adding a new column to pandas dataframe using a rolling mean function

I have a script where I download some FX rates from the web and would like to calculate the rolling mean. When running the script, I get an error relating to the rates column that I am trying to calculate the rolling mean for. I would like to produce an extra column with the rolling average displayed. Here is what I have so far; the last 3 lines above the trailing comments are where the error seems to be.
Now I get the following error: KeyError: 'rates'
import pandas as pd
import matplotlib.pyplot as plt

url1 = 'http://www.bankofcanada.ca/'
url2 = 'valet/observations/group/FX_RATES_DAILY/csv?start_date='
start_date = '2017-01-03'  # Earliest start date is 2017-01-03
url = url1 + url2 + start_date  # Complete url to download csv file

# Read in rates for different currencies for a range of dates
rates = pd.read_csv(url, skiprows=39, index_col='date')
rates.index = pd.to_datetime(rates.index)  # ensures the index is a datetime
print("The pandas dataframe with the rates ")
print(rates)

# Get the number of days & number of currencies from the shape of rates -
# returns a tuple in the format (rows, columns)
days, currencies = rates.shape

# Read in the currency codes & strip off the extraneous part. Uses the url
# string, skips the first 10 rows and returns the dataframe columns of
# index 0 and 2. It will read n rows according to the variable currencies,
# returned above from the tuple produced by .shape
codes = pd.read_csv(url, skiprows=10, usecols=[0, 2],
                    nrows=currencies)

# Print out the dataframe read from the web
print("Dataframe with the codes")
print(codes)

# A for loop to go through the codes dataframe. For each ith row and for
# the index 1 column, the loop splits the string on ' to Canadian'
for i in range(currencies):
    codes.iloc[i, 1] = codes.iloc[i, 1].split(' to Canadian')[0]

# Report exchange rates for the most recent date available
date = rates.index[-1]  # most recent date available
print('\nCurrency values on {0}'.format(date))

# Using a for loop and zip, the values in the code and rate objects are
# grouped together and then printed to the screen with a new format
for (code, rate) in zip(codes.iloc[:, 1], rates.loc[date]):
    print("{0:20s} Can$ {1:8.6g}".format(code, rate))

# Assign values into a dataframe/slice the rates dataframe
FXAUDCAD_daily = pd.DataFrame(index=['dates'], columns={'dates', 'rates'})
FXAUDCAD_daily = FXAUDCAD
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily.loc['rates'].rolling_mean()
print(FXAUDCAD_daily)
# Print the values to the screen
# Calculate the rolling average using the rolling average pandas function
# Create a figure object using matplotlib/pandas
# Plot values on the figure object.
Updated code using the feedback:
import pandas as pd
import matplotlib.pyplot as plt
import datetime

url1 = 'http://www.bankofcanada.ca/'
url2 = 'valet/observations/group/FX_RATES_DAILY/csv?start_date='
start_date = '2017-01-03'  # Earliest start date is 2017-01-03
url = url1 + url2 + start_date  # Complete url to download csv file

# Read in rates for different currencies for a range of dates
rates = pd.read_csv(url, skiprows=39, index_col='date')
rates.index = pd.to_datetime(rates.index)  # ensures the index is a datetime
#print("The pandas dataframe with the rates ")
#print(rates)

# Get the number of days & number of currencies from the shape of rates -
# returns a tuple in the format (rows, columns)
days, currencies = rates.shape

# Read in the currency codes & strip off the extraneous part. Uses the url
# string, skips the first 10 rows and returns the dataframe columns of
# index 0 and 2. It will read n rows according to the variable currencies,
# returned above from the tuple produced by .shape
codes = pd.read_csv(url, skiprows=10, usecols=[0, 2],
                    nrows=currencies)

# Print out the dataframe read from the web
#print("Dataframe with the codes")
#print(codes)

# A for loop to go through the codes dataframe. For each ith row and for
# the index 1 column, the loop splits the string on ' to Canadian'
for i in range(currencies):
    codes.iloc[i, 1] = codes.iloc[i, 1].split(' to Canadian')[0]

# Report exchange rates for the most recent date available
date = rates.index[-1]  # most recent date available
#print('\nCurrency values on {0}'.format(date))

# Using a for loop and zip, the values in the code and rate objects are
# grouped together and then printed to the screen with a new format
#for (code, rate) in zip(codes.iloc[:, 1], rates.loc[date]):
#    print("{0:20s} Can$ {1:8.6g}".format(code, rate))

# Create a dataframe with columns of date and rates
# Assign values into a dataframe/slice the rates dataframe
FXAUDCAD_daily = pd.DataFrame(index=['date'], columns={'date', 'rates'})
FXAUDCAD_daily = rates['FXAUDCAD']
print(FXAUDCAD_daily)
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['rates'].rolling(1).mean()
Let's try to fix your code.
First of all, this line seems a bit odd to me, as FXAUDCAD isn't defined.
FXAUDCAD_daily = FXAUDCAD
Then, you might consider rewriting your rolling mean calculation as follows.
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['rates'].rolling(WINDOW_SIZE).mean()
What's your pandas version? pd.rolling_mean() was deprecated in pandas 0.18.0 and removed in later versions.
Update your pandas library with:
pip3 install --upgrade pandas
And then use the rolling() method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html):
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['rates'].rolling(window_size).mean()
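For instance, on a small series with a window of 3, rolling().mean() leaves NaN until the window fills (a quick sketch):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(s.rolling(3).mean())
# 0    NaN
# 1    NaN
# 2    2.0
# 3    3.0
# 4    4.0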
I managed to solve it. When I sliced the original rates dataframe into FXAUDCAD_daily, it already came with the same date index, so I was getting a KeyError because the column at index 1 was named with the currency abbreviation, not the string 'rates'.
But now I have another trivial problem: how do I rename the FXAUDCAD column to just 'rate'? I will post another question for this.
import pandas as pd
import matplotlib.pyplot as plt
import datetime

url1 = 'http://www.bankofcanada.ca/'
url2 = 'valet/observations/group/FX_RATES_DAILY/csv?start_date='
start_date = '2017-01-03'
url = url1 + url2 + start_date

rates = pd.read_csv(url, skiprows=39, index_col='date')
rates.index = pd.to_datetime(rates.index)  # ensures the index is a datetime
print("Print rates to the screen", rates)
# print index
print("Print index to the screen", rates.index)

days, currencies = rates.shape
codes = pd.read_csv(url, skiprows=10, usecols=[0, 2],
                    nrows=currencies)
for i in range(currencies):
    codes.iloc[i, 1] = codes.iloc[i, 1].split(' to Canadian')[0]
#date = rates.index[-1]

# Make a dataframe of just the rates of FXAUDCAD
FXAUDCAD_daily = pd.DataFrame(rates['FXAUDCAD'])
# Print the FXAUDCAD rates to the screen
print(FXAUDCAD_daily)

# Calculate the MA using the rolling function with a window size of 1
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['FXAUDCAD'].rolling(1).mean()
# print out the new dataframe with the calculation
print(FXAUDCAD_daily)

# Rename the FXAUDCAD column to 'rate'; the mapping is {old: new} and
# rename returns a copy unless assigned back
FXAUDCAD_daily = FXAUDCAD_daily.rename(columns={'FXAUDCAD': 'rate'})
# print out the new dataframe with the calculation
print(FXAUDCAD_daily)
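One caveat on the script above: a rolling window of 1 just reproduces the series, so the 'rolling mean' column equals the rate itself. A larger window gives an actual moving average; a sketch, using the renamed 'rate' column:

# a 20-day window is a more typical moving average for daily FX rates
FXAUDCAD_daily['MA20'] = FXAUDCAD_daily['rate'].rolling(20).mean()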

build indicator fractals with pandas

My DataFrame looks like this:
<DATE>,<TIME>,<PRICE>
20200702,110000,207.2400000
20200702,120000,207.4400000
20200702,130000,208.2400000
20200702,140000,208.8200000
20200702,150000,208.0700000
20200702,160000,208.8100000
20200702,170000,209.4300000
20200702,180000,208.8700000
20200702,190000,210.0000000
20200702,200000,209.6900000
20200702,210000,209.8700000
20200702,220000,209.8000000
20200702,230000,209.5900000
20200703,000000,209.6000000
20200703,110000,211.1800000
20200703,120000,209.3900000
20200703,130000,209.6400000
I want to add two more boolean columns here, called 'Up Fractal' and 'Down Fractal'.
This is the stock market indicator Fractals, with period 5.
It means:
The script runs from the first row to the last.
It takes the current row and looks at its PRICE.
It takes the 5 previous rows and the 5 next rows.
If the PRICE of the current row is the maximum of that window, it is called an 'Up Fractal': a True value in column 'Up Fractal'.
If the PRICE of the current row is the minimum of that window, it is called a 'Down Fractal': a True value in column 'Down Fractal'.
On a stock market chart it looks something like this (this is an example from the internet, not from my DataFrame).
It is easy for me to find fractals using standard Python methods, but I need the speed of pandas.
Please help me; I am very new to the pandas library.
from binance.spot import Spot
import pandas as pd
from pandas import DataFrame
import numpy as np

if __name__ == '__main__':
    cl = Spot()
    r = cl.klines("BTCUSDT", "5m", limit="100")
    df = DataFrame(r).iloc[:, :6]
    df.columns = list("tohlcv")

    # number of rows on each side used to calculate a fractal
    n = 10
    df = df.astype({'t': int})
    df = df.astype({'o': float})
    df = df.astype({'h': float})
    df = df.astype({'l': float})
    df = df.astype({'c': float})

    # the first way
    df['uf'] = (df['h'] == df['h'].rolling(n + n + 1, center=True).max())
    df['df'] = (df['l'] == df['l'].rolling(n + n + 1, center=True).min())

    # the second way
    df['upfractal'] = np.where(df['h'] == df['h'].rolling(n + n + 1, center=True).max(), True, False)
    df['downfractal'] = np.where(df['l'] == df['l'].rolling(n + n + 1, center=True).min(), True, False)

    print(df)
    df.to_csv('BTC_USD.csv')
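Applied to the asker's DATE/TIME/PRICE frame with the requested period 5, the same rolling trick looks like this (a sketch; the column names follow the sample above, and 'prices.csv' is an assumed file holding it):

import pandas as pd

df = pd.read_csv('prices.csv')  # assumed file with <DATE>,<TIME>,<PRICE> columns
n = 5                # fractal period: 5 rows on each side
window = 2 * n + 1   # the current row plus n rows before and n after

# a row is an Up/Down Fractal when its price is the max/min of its window
df['Up Fractal'] = df['<PRICE>'] == df['<PRICE>'].rolling(window, center=True).max()
df['Down Fractal'] = df['<PRICE>'] == df['<PRICE>'].rolling(window, center=True).min()
print(df)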

Trying to use Deque to limit DataFrame of incoming data... suggestions?

I've imported deque from collections to limit the size of my dataframe. As new data comes in, the older rows should be progressively dropped.
Big Picture:
I'm creating a dataframe of historical values covering the previous 26 days from time "whatever day it is..."
Confusion:
I think my data arrives each minute in a Series format, so I attempted to restrict its maxlen using a deque and then load the data into a dataframe. However, I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque

def initialize(context):
    context.stocks = (symbol('AAPL'))

def before_trading_start(context, data):
    data = data.history(context.stocks, 'close', 20, '1m').dropna()
    length = 5
    d = deque(maxlen=length)
    data = d.append(data)  # note: deque.append returns None, so data is None from here on
    index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
    columns = ['price']
    df = pd.DataFrame(index=index, columns=columns, data=data)
    print(df)
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values of the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed for values in a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard coded timezone if you are not in France. :-)
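A quick usage sketch (the frames and their datetime index here are made up for illustration):

import pandas as pd

idx_old = pd.date_range(end=pd.Timestamp.now(tz='Europe/Paris'), periods=3, freq='D')
old = pd.DataFrame({'price': [1.0, 2.0, 3.0]}, index=idx_old)
new = pd.DataFrame({'price': [4.0]}, index=[pd.Timestamp.now(tz='Europe/Paris')])

# keeps at most `length` rows, all newer than 26 days ago
print(select_values_of_the_last_twenty_six_days(old, new))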

Spatial temporal query in python with many records

I have a dataframe of 600,000 x/y points with date-time information, along with another field, 'status', holding extra descriptive information.
My objective is, for each record:
sum column 'status' over the records that fall within a certain spatio-temporal buffer
the specific buffer is within t - 8 hours and < 100 meters
Currently I have the data in a pandas dataframe.
I could loop through the rows and, for each record, subset by the dates of interest, then calculate distances and restrict the selection further. However, that is still quite slow with so many records.
THIS TAKES 4.4 hours to run.
I can see that I could create a 3-dimensional kd-tree with x, y, and date as epoch time. However, I am not certain how to restrict the distances properly when combining dates and geographic distances.
Here is some reproducible code for you guys to test on:
Import
import numpy as np
import numpy.random as npr
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
np.random.seed(111)
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long/lat data
        laty = npr.normal(4815862, 5000, size=len(date))
        longx = npr.normal(687993, 5000, size=len(date))
        # status of interest
        status = [0, 1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0, high=len(status))] for i in range(len(date))]
        # user pool
        user = ['sally', 'derik', 'james', 'bob', 'ryan', 'chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0, high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns=['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    # loop through the data indexes
    for i in range(0, len(df)):
        l = []
        # first filter the data by date to have a smaller set to compute distances for;
        # create a mask to query all dates in range for date i
        date_mask = (df['date'] >= df['date'].iloc[i] - before) & (df['date'] <= df['date'].iloc[i] + after)
        # create a mask to query all users who are not user i (themselves)
        user_mask = df['user'] != df['user'].iloc[i]
        # apply masks
        dists_to_check = df[date_mask & user_mask]
        # for point i, create the coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        # create an array of coordinates to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        # for j in the date-queried data
        for j in range(1, len(dists_to_check)):
            # compute the euclidean distance between point a and each point of b
            x = np.linalg.norm(a - np.array((b[0][j], b[1][j])))
            # if the distance is within our range of interest, append the index to a list
            if x <= 100:
                l.append(j)
        try:
            # use the list of desired indexes 'l' to query the final subset of the data
            data = dists_to_check.iloc[l]
            # summarize the column of interest, then append to the output list
            output.append(data['status'].sum())
        except IndexError:
            output.append(0)
            # print("There were no data to add")
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way, or should I be chasing another technique?
<3
Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
Using IPython...
from IPython.parallel import Client

cli = Client()
cli.ids
cli = Client()
dview = cli[:]

with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd

# We also need to add the time deltas and output list into the function as
# local variables, as well as add the IPython.parallel decorator
@dview.parallel(block=True)
def work(df):
    before = timedelta(hours=8)
    after = timedelta(minutes=1)
    output = []
    # ... the loop body is unchanged from the sequential version above ...
Final time: 1:17:54.910206, about 1/4 of the original time.
I would still be very interested in any suggestions for small speed improvements within the body of the function.
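For the kd-tree idea raised in the question, mixing meters and hours in one Euclidean metric is awkward; a simpler variant (a sketch assuming scipy is available; work_kdtree is an illustrative name) is to index only the x/y coordinates and apply the 8-hour window as a post-filter on each spatial neighbourhood:

from scipy.spatial import cKDTree
import numpy as np
import pandas as pd

def work_kdtree(df):
    # spatial index on x/y only; the time window is applied afterwards
    xy = df[['long', 'lat']].to_numpy()
    tree = cKDTree(xy)
    neighbours = tree.query_ball_point(xy, r=100)  # indices within 100 m

    t = df['date'].to_numpy()
    u = df['user'].to_numpy()
    s = df['status'].to_numpy()
    before = np.timedelta64(8, 'h')
    after = np.timedelta64(1, 'm')

    out = np.zeros(len(df))
    for i, idx in enumerate(neighbours):
        idx = np.asarray(idx)
        keep = (
            (t[idx] >= t[i] - before)
            & (t[idx] <= t[i] + after)
            & (u[idx] != u[i])  # exclude the record's own user, as in work()
        )
        out[i] = s[idx[keep]].sum()
    return pd.DataFrame(out)

The single query_ball_point call replaces the per-row distance loop, so the remaining Python loop only touches the (usually small) set of spatial neighbours.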
