My DataFrame looks like this:
<DATE>,<TIME>,<PRICE>
20200702,110000,207.2400000
20200702,120000,207.4400000
20200702,130000,208.2400000
20200702,140000,208.8200000
20200702,150000,208.0700000
20200702,160000,208.8100000
20200702,170000,209.4300000
20200702,180000,208.8700000
20200702,190000,210.0000000
20200702,200000,209.6900000
20200702,210000,209.8700000
20200702,220000,209.8000000
20200702,230000,209.5900000
20200703,000000,209.6000000
20200703,110000,211.1800000
20200703,120000,209.3900000
20200703,130000,209.6400000
I want to add two more boolean columns called 'Up Fractal' and 'Down Fractal'.
This is the stock market indicator Fractals with period 5.
It works like this:
The script runs from the first row to the last.
It takes the current row and looks at PRICE.
It takes the 5 previous rows and the 5 next rows.
If the PRICE of the current row is the maximum of that window, it is an 'Up Fractal': True in the column 'Up Fractal'.
If the PRICE of the current row is the minimum of that window, it is a 'Down Fractal': True in the column 'Down Fractal'.
On a stock market chart it looks something like this (an example from the internet, not from my DataFrame).
It is easy for me to find fractals using standard Python methods, but I need the speed of pandas.
Please help me; I am very new to the pandas library.
from binance.spot import Spot
import pandas as pd
from pandas import DataFrame
import numpy as np
if __name__ == '__main__':
    cl = Spot()
    r = cl.klines("BTCUSDT", "5m", limit="100")
    df = DataFrame(r).iloc[:, :6]
    df.columns = list("tohlcv")
    # number of rows on each side of the current row used to calculate the fractal
    n = 10
    df = df.astype({'t': int, 'o': float, 'h': float, 'l': float, 'c': float})
    # the first way
    df['uf'] = (df['h'] == df['h'].rolling(n + n + 1, center=True).max())
    df['df'] = (df['l'] == df['l'].rolling(n + n + 1, center=True).min())
    # the second way
    df['upfractal'] = np.where(df['h'] == df['h'].rolling(n + n + 1, center=True).max(), True, False)
    df['downfractal'] = np.where(df['l'] == df['l'].rolling(n + n + 1, center=True).min(), True, False)
    print(df)
    df.to_csv('BTC_USD.csv')
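The same centered-rolling idea maps directly onto the <DATE>,<TIME>,<PRICE> data from the question. A minimal sketch, assuming the data has already been read into a DataFrame df with a numeric 'PRICE' column and that period 5 means 5 rows on each side of the current row:

# period 5: 5 rows before + the current row + 5 rows after = window of 11
window = 5 + 5 + 1
rolling_max = df['PRICE'].rolling(window, center=True).max()
rolling_min = df['PRICE'].rolling(window, center=True).min()
# rows near the edges have incomplete windows, so their rolling result is NaN and the comparison gives False
df['Up Fractal'] = df['PRICE'] == rolling_max
df['Down Fractal'] = df['PRICE'] == rolling_min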
First of all, thank you for all the feedback so far; it is much appreciated. I have included the rest of my code from the assignment and added some details to give a better idea of what I am trying to achieve.
I have this Python code that I am trying to modify so that it creates a new column for each existing column with the percentage change, instead of overwriting the existing values. How could one do this effectively?
I should add that this uses 1-minute trading data for a few selected cryptocurrencies, with [ time, low, high, open, close ] as the row values.
When I tried adding a new column like so:
df[col] = df[col+'pctchg'].pct_change() #calculate pct change
I get an error message. Am I missing some obvious syntax issue?
import pandas as pd
from collections import deque
import random
import numpy as np
import time
from sklearn import preprocessing
pd.set_option('display.max_rows', 500) #increase the display size for dataframes
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)
def classify(current, future):
    if float(future) > float(current):  # if the future price is higher than the current, that's a buy, or a 1
        return 1
    else:  # else a sell
        return 0

def preprocess_df(df):
    # df = df.drop("future", 1)  # don't need this anymore.
    for col in df.columns:  # go through all of the columns
        if col != "target":  # do not adjust the target
            df[col] = df[col].pct_change()  # calculate pct change
            df.dropna(inplace=True)  # remove the nas created by pct_change
            df[col] = preprocessing.scale(df[col].values)  # scale the data

main_df = pd.DataFrame()  # begin empty
ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios to consider
for ratio in ratios:  # begin iteration
    ratio = ratio.split('.csv')[0]  # split away the ticker from the file name
    print(ratio)
    dataset = f'crypto_data/{ratio}.csv'  # get the full path to the file
    df = pd.read_csv(dataset, names=['time', 'low', 'high', 'open', 'close', 'volume'])  # read in specific file
    # rename volume and close to include the ticker so we can still tell which close/volume is which:
    df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)
    df.set_index("time", inplace=True)  # set time as index so we can join them on this shared time
    df = df[[f"{ratio}_close", f"{ratio}_volume"]]  # ignore the other columns besides price and volume
    if len(main_df) == 0:  # if the dataframe is empty
        main_df = df  # then it's just the current df
    else:  # otherwise, join this data to the main one
        main_df = main_df.join(df)

preprocess_df(main_df)
print(main_df)
When I run the code as it is, I get the following output:
[screenshot of DataFrame output]
How would I create the same dataframe, but retain my original values and create new columns with the percentage change?
Sorry for posting this as an answer, but I cannot comment.
You can try this:
# create a new list of column names out of the columns you wish to modify
new_cols = [col+'_pct' for col in df.drop(columns=['target']).columns]
# then calculate the pct change on the desired columns and add them to the df
df[new_cols] = df.drop(columns=['target']).pct_change()
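If you also want to drop the NaN row created by pct_change and scale the new columns, as the original loop does, a minimal follow-up sketch (assuming sklearn's preprocessing module is imported as in the question):

# drop the NaN row produced by pct_change, then standardize only the new columns
df.dropna(inplace=True)
df[new_cols] = preprocessing.scale(df[new_cols].values)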
As @Tim mentioned above, you can also try:
def preprocess_df(df):
    for col in df.columns:  # go through all of the columns
        if col != "target":  # don't modify target
            df[col+'new'] = df[col].pct_change()  # <------
    df.dropna(inplace=True)
I ended up using the following code, which achieved the goal I was after:
for col in df.columns:  # go through all of the columns
    if col != "target":  # do not adjust the target
        df[col+'pctchg'] = df[col].pct_change()  # calculate pct change
        df.dropna(inplace=True)  # remove the nas created by pct_change
        df[col+'pctchg'] = preprocessing.scale(df[col+'pctchg'].values)  # scale the data
I am streaming live price data using the IB API, and I want to put it in a dataframe for analysis. My data consists of a price being live streamed with no timestamp.
I think I need to create new rows using row numbers that are automatically added, and have the prices inserted in the price column.
I have tried defining the dataframe and telling the price where to go as follows:
def tick_df(self, reqId, contract):
    # this stores price dataframe by creating an empty dataframe and setting the index to the time column
    self.bardata[reqId] = pd.DataFrame(columns=['index', 'price'])
    self.reqMktData(reqId, contract, "", False, False, [])
    self.bardata[reqId].index = [x for x in range(1, len(self.bardata[reqId].values) + 1)]
    return self.bardata[reqId]

def tickPrice(self, reqId, tickType, price, attrib):  # this function prints the price
    if tickType == 2 and reqId == 102:
        self.bardata[reqId].loc[self.bardata[reqId].index] = price
I have been using a methodology similar to here (https://github.com/PythonForForex/Interactive-brokers-python-api-guide/blob/master/GOOG_five_percent.py). However, as I am only streaming a price, I am unable to use the timestamp for creating new rows.
I don't know if this is what you need. In a loop I generate a random price and append it to a data frame.
import numpy as np
import pandas as pd
_price = 1.1300  # first price in the series
_std = 0.0005    # volatility (standard deviation)
df = pd.DataFrame(columns=['price'])
for i in range(1000):
    _wn = np.random.normal(loc=0, scale=_std, size=1)  # random white noise
    _price = _price + _wn[0]  # random price
    df = df.append({'price': _price}, ignore_index=True)
df
I work with FOREX time series and I cannot conceive of a time series without time, so, just in case you have the same 'problem', I'm including a version with a timestamp:
import numpy as np
import pandas as pd
from datetime import datetime
_price = 1.1300  # first price in the series
_std = 0.0005    # volatility (standard deviation)
df = pd.DataFrame(columns=['price', 'time'])
for i in range(1000):
    _wn = np.random.normal(loc=0, scale=_std, size=1)  # random white noise
    _price = _price + _wn[0]  # random price
    _time = datetime.now()
    df = df.append({'price': _price, 'time': _time}, ignore_index=True)
df
Please let me know if this is what you needed.
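Note that DataFrame.append was deprecated and removed in pandas 2.0, and appending row by row is slow in any case. A minimal sketch of the same loop that collects the rows in a plain list and builds the DataFrame once at the end (same assumed price model as above):

import numpy as np
import pandas as pd
from datetime import datetime

_price = 1.1300  # first price in the series
_std = 0.0005    # volatility (standard deviation)
rows = []
for i in range(1000):
    _wn = np.random.normal(loc=0, scale=_std, size=1)  # random white noise
    _price = _price + _wn[0]  # random price
    rows.append({'price': _price, 'time': datetime.now()})
df = pd.DataFrame(rows)  # build the frame once from the collected rows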
I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as below:
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate', 'Item', 'StartDate', 'EndDate', 'Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)

r = pd.DataFrame(columns=['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate', 'AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True, inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently use for the calculation, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that when generating the r DataFrame, both PeriodStartDate and PeriodEndDate are created as datetime; see the following fragment of your initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
                   'PeriodStartDate': pd.to_datetime('2019-10-25'),
                   'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item (the columns compared on equality) and sorted by the index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine and got a result almost 10 times shorter.
Check it on your own.
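For reference, the apply-based version can be timed directly in IPython/Jupyter with the %timeit magic, for example:

%timeit r.apply(myMean, axis=1)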
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
I'm having trouble creating a new data table that will show annual energy use. Basically, I'd like to multiply the energy figures by different factors to get annual energy use.
The code is below.
#calculate energy amounts
energy_use_by_fuel = pd.DataFrame()
for hhid in energy_data.hhid.unique():
    tempdtf = pd.DataFrame({
        'hhid': hhid,
        'monthly_electricity': energy_data.loc[energy_data.hhid == hhid, 'estimated_kwh_monthly'] * 3,
        'monthly_gas': energy_data.loc[energy_data.hhid == hhid, 'monthly_gas_use_kg'] * 4,
        'monthly_charcoal': energy_data.loc[energy_data.hhid == hhid, 'monthly_charcoal_use_kg'] * 5})
    #join
    tempdtf = energy_use_by_fuel.append(tempdtf, ignore_index=True)
As you can see, I'd like to calculate different energy uses for electricity, gas and charcoal. But when I multiply the data by the numbers, the resulting dataframe energy_use_by_fuel is empty.
The function df.append() returns a new object with the DataFrame appended; it does not modify the original in place, so I think your code is assigning the result to the wrong variable:
#join
energy_use_by_fuel = energy_use_by_fuel.append(tempdtf, ignore_index = True)
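As a side note, the per-hhid loop may not be needed at all: since every row of energy_data already carries its hhid, the multiplications can be done on the whole frame in one go. A minimal sketch, assuming the column names from your code:

energy_use_by_fuel = pd.DataFrame({
    'hhid': energy_data['hhid'],
    'monthly_electricity': energy_data['estimated_kwh_monthly'] * 3,
    'monthly_gas': energy_data['monthly_gas_use_kg'] * 4,
    'monthly_charcoal': energy_data['monthly_charcoal_use_kg'] * 5,
})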
I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this using pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is rather clear.
import pandas as pd
#read csv
df = pd.read_csv('pkmn.csv', header=0)
#we need to apply some transformations to extract the date and hour from the timestamp
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour
#main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas dataframe saved as df this should work:
rats = df.loc[df.Pokemon == "rattata"]          #Gives you the subset of rows relating to Rattata
total = sum(rats.Caught)                        #Gives you the total number caught
diff = rats.time.iloc[-1] - rats.time.iloc[0]   #Time span between the first and last observation
hours = diff.total_seconds() / 3600             #Convert the span to hours (assumes 'time' is a datetime)
average = total / hours                         #Should give you the number caught per hour