I'm trying to store stock data (Open, High, Low, Close, Volume), pulled by pandas_datareader, into 5 distinct lists named accordingly. I am new to Python and am wondering where I am going wrong. I got it to cycle through a one-dimensional list of integer values and assign them to each list, but am unsure of how to handle the additional dimension of the f.head output. I have twice gotten a traceback error indicating index values out of range, but know that I've made a mistake beyond simple index range.
Open, High, Low, Close, Vol = [], [], [], [], []
col_data = [Open, High, Low, Close, Vol]
stock = 'BABA'
# data period
yStart = 2017
mStart = 11
dStart = 14
yEnd = 2018
mEnd = 2
dEnd = 14
import pandas as p
p.core.common.is_list_like = p.api.types.is_list_like
import pandas_datareader.data as pdr
from datetime import datetime
start = datetime(yStart,mStart,dStart)
end = datetime(yEnd,mEnd,dEnd)
f = pdr.DataReader(stock, 'morningstar', start, end)
f.head()
a = 0
b = 0
while a < len(col_data):
    b = 0
    while b < len(f):
        cur = (f.loc[f.index[b], col_data[a]])
        col_data[a].append(cur)
        b += 1
    a += 1
I would like to ultimately be able to print the individual lists ( like print(Open) and retrieve the list of Open prices ). Any advice/additional resources that might help would be appreciated.
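One thing that stands out in the loop above is that col_data[a] is one of the (initially empty) lists, not a column name, so f.loc[..., col_data[a]] is not selecting a column. A minimal sketch of a simpler route, assuming the frame has columns named Open, High, Low, Close and Volume (the small frame below is a made-up stand-in for the DataReader result, so no download is needed):

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by DataReader.
f = pd.DataFrame(
    {
        "Open": [1.0, 2.0],
        "High": [1.5, 2.5],
        "Low": [0.5, 1.5],
        "Close": [1.2, 2.2],
        "Volume": [100, 200],
    },
    index=pd.date_range("2017-11-14", periods=2),
)

# Each column is a Series; tolist() turns it into a plain Python list.
Open = f["Open"].tolist()
High = f["High"].tolist()
Low = f["Low"].tolist()
Close = f["Close"].tolist()
Vol = f["Volume"].tolist()

# The same idea for all columns at once, keyed by column name.
columns = {name: f[name].tolist() for name in ["Open", "High", "Low", "Close", "Volume"]}

print(Open)
```

No explicit row loop is needed here; selecting by column name does the work that the two nested while loops were attempting.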
I have this dataframe
import pandas as pd
import numpy as np
np.random.seed(2022)
# make example data
close = np.sin(range(610)) + 10
high = close + np.random.rand(*close.shape)
open = high - np.random.rand(*close.shape)
low = high - 3
close[2] += 100
dates = pd.date_range(end='2022-06-30', periods=len(close))
# insert into pd.dataframe
df = pd.DataFrame(index=dates, data=np.array([open, high, low, close]).T, columns=['Open', 'High', 'Low', 'Close'])
print(df)
Output
Open High Low Close
2020-10-29 9.557631 10.009359 7.009359 10.000000
2020-10-30 10.794789 11.340529 8.340529 10.841471
2020-10-31 10.631242 11.022681 8.022681 110.909297
2020-11-01 9.639562 10.191094 7.191094 10.141120
2020-11-02 9.835697 9.928605 6.928605 9.243198
... ... ... ... ...
2022-06-26 10.738942 11.167593 8.167593 10.970521
2022-06-27 10.031187 10.868859 7.868859 10.321565
2022-06-28 9.991932 10.271633 7.271633 9.376964
2022-06-29 9.069759 9.684232 6.684232 9.005179
2022-06-30 9.479291 10.300242 7.300242 9.548028
The goal here is to compare a specific value in the dataframe to another value in the dataframe.
Edit:
I now know many different ways to achieve this; however, I have rewritten the question so that the original goal is clearer for future readers.
For example:
Check when the value in the 'Open' column is less than the value in the 'Close' column.
One solution for this is using itertuples. I have written an answer below explaining the solution.
The first step you want to do can be done by df.loc["A", "High"] > df.loc["C", "Low"]. To apply this to all rows you could do something like below:
for i in range(2, len(df)):
    print(df["High"].iloc[i-2] > df["Low"].iloc[i])
I'm sure there are better ways to do it, but this would work.
You can use the shift operation on a column to shift its rows up or down:
`df['High'] > df['Low'].shift(-2)`
To see what is going on, run the commands below:
df = pd.DataFrame(np.random.randn(5,4), list('ABCDE'), ['Open', 'High', 'Low', 'Close'])
df['Low_shiftup'] = df['Low'].shift(-2)
df.head()
df['High'] > df['Low_shiftup']
As I explained in the question I have now found multiple solutions for this problem. One being itertuples.
Here is how to use itertuples to solve the problem.
First, create the dataframe
import pandas as pd
import numpy as np
np.random.seed(2022)
# make example data
close = np.sin(range(610)) + 10
high = close + np.random.rand(*close.shape)
open = high - np.random.rand(*close.shape)
low = high - 3
close[2] += 100
dates = pd.date_range(end='2022-06-30', periods=len(close))
# insert into pd.dataframe
df = pd.DataFrame(index=dates, data=np.array([open, high, low, close]).T, columns=['Open', 'High', 'Low', 'Close'])
print(df)
Now we use itertuples to iterate over the rows of the dataframe
for row in df.itertuples():
    o = row.Open
    for r in df.itertuples():
        c = r.Close
        if o < c:
            print('O is less than C')
        else:
            print('O is greater than C')
This compares the open price of every row against the close price of every row and reports each time whether the open is lower.
This can be expanded to check other conditions within the same loop by adding more variables and more if statements, and by using enumerate to check positioning.
For example:
for idx, row in enumerate(df.itertuples()):
    o = row.Open
    h = row.High
    for i, r in enumerate(df.itertuples()):
        c = r.Close
        l = r.Low
        if (i > idx) & ((h - 2) > l):
            if o < c:
                print('O is less than C')
            else:
                print('O is greater than C')
        else:
            continue
The code above uses enumerate to add a counter to each loop. With the additional if statement, 'o < c' is only checked for rows where the counter of the inner loop (the row supplying 'c') is greater than the counter of the outer loop (the row supplying 'o'), and where 'h - 2' is greater than 'l'.
As you can see, any value in the dataframe can be compared to another using the right if statements.
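For larger frames the nested loops get slow, since every row is compared against every other row in Python. A rough vectorized sketch of the same comparisons is below; it uses a tiny made-up frame rather than the 610-row example, and it only covers the 'Open' versus 'Close' checks, not the extra 'High'/'Low' condition:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame, just to keep the result easy to check by hand.
df = pd.DataFrame(
    {"Open": [1.0, 5.0, 3.0], "Close": [2.0, 4.0, 6.0]},
    index=pd.date_range("2022-06-28", periods=3),
)

# Same-row comparison: one boolean per row.
same_row = df["Open"] < df["Close"]

# All-pairs comparison: pairwise[i, j] is True when Open of row i < Close of row j.
o = df["Open"].to_numpy()
c = df["Close"].to_numpy()
pairwise = o[:, None] < c[None, :]

# Keep only pairs where the Close row comes after the Open row (j > i),
# which mirrors the `i > idx` check in the enumerate version.
later_only = np.triu(pairwise, k=1)

print(same_row)
print(later_only)
```

The pairwise matrix needs memory proportional to the square of the row count, so this is a trade of memory for speed rather than a free win.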
I'm fairly new to Orange.
I'm trying to separate rows of angle (elv) into intervals.
Say I want to separate my 90-degree range into 8 intervals, that is, 90/8 = 11.25 degrees per interval.
Here's the table I'm working with
Here's what I did originally, separating them by their elv value
Here's the result that I want, x rows 16 columns separated by their elv value.
But I want them done dynamically.
I list them out and turn each list into a table with x rows and 2 columns.
This is what I originally did
from Orange.data.table import Table
from Orange.data import Domain, ContinuousVariable, DiscreteVariable
import numpy
import pandas as pd
from pandas import DataFrame
df = pd.DataFrame()
num = 10 #number of intervals that we want to separate our elv into.
interval = 90.00/num #separating them into degree/interval
low = 0
high = interval
table = []
first = []
second = []
for i in range(num):
    between = []
    if i != 0: #not the first run
        low = high
        high = high + interval
    for row in in_data: #Run through the whole table to see if the elv falls in between interval
        if row[0] >= low and row[0] < high:
            between.append(row)
    elv = "elv" + str(i)
    err = "err" + str(i)
    domain = Domain([ContinuousVariable.make(err)],[ContinuousVariable.make(elv)])
    data = Table.from_numpy(domain, numpy.array(between))
    print("table number ", i)
    print(data[:3])
Here's the output
But as you can see, these are separate tables, with a new one assigned on every pass through the loop.
I have to find a way to concatenate these tables along axis = 1.
Even the source code for Orange3 forbids this for some reason.
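Since pandas is already imported, one possible route is to do the binning and the side-by-side layout in pandas and only convert to an Orange table at the very end. This is a sketch under assumptions: the input is taken to be two numeric columns, elv and err, and the small frame below is invented for illustration. It does not use any Orange API.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the input table: two columns, elv and err.
data = pd.DataFrame({
    "elv": [5.0, 12.0, 30.0, 44.9, 45.0, 89.0, 3.0],
    "err": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})

num = 2  # number of intervals; change freely
edges = np.linspace(0.0, 90.0, num + 1)

# Label each row with its interval number; right=False gives [low, high) bins,
# matching the `>= low and < high` test in the loop version.
data["bin"] = pd.cut(data["elv"], bins=edges, right=False, labels=False)

pieces = []
for i, group in data.groupby("bin"):
    block = group[["elv", "err"]].reset_index(drop=True)
    block.columns = [f"elv{int(i)}", f"err{int(i)}"]
    pieces.append(block)

# Side-by-side layout: two columns per interval, shorter blocks padded with NaN.
wide = pd.concat(pieces, axis=1)
print(wide)
```

Because the intervals hold different numbers of rows, the shorter column pairs are padded with NaN, which is something any x-rows-by-16-columns layout has to deal with one way or another.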
I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now I create a column in the first dataframe:
test1['values'] = np.zeros(10)
I want to fill this column so that, next to each date, there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))
for k in range(10):
    a = nearest(test2['time'], test1['time'][k]) ### find nearest timestamp from second dataframe
    b = test2.index[test2['time'] == a].tolist()[0] ### identify the index of this timestamp
    test1.loc[k, 'values'] = b ### assign this value to the cell
This code is very slow on large datasets. How can I make it more efficient?
P.S. timestamps in my real data are sorted and increasing just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days
if offset < 0: # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time'])-offset)).astype(int)
else: # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset+len(test1['time']))
test1['values'] = idx
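Since the real timestamps are sorted but not evenly spaced, another option that may be worth trying is pd.merge_asof with direction="nearest", which does the nearest-key lookup without a Python-level loop. A minimal sketch on the example frames from the question (the helper name right is just for illustration):

```python
import pandas as pd

test1 = pd.DataFrame({"time": pd.date_range(start="1/1/2018", end="1/10/2018")})
test2 = pd.DataFrame({"time": pd.date_range(start="1/5/2018", end="1/20/2018")})

# Carry test2's row labels along as an ordinary column so the merge can return them.
right = test2.reset_index().rename(columns={"index": "values"})

# direction="nearest" picks the closest key on either side;
# both frames must already be sorted by "time".
result = pd.merge_asof(test1, right, on="time", direction="nearest")
print(result)
```

On this example the values column comes out as 0 for the first five dates and then 1, 2, 3, 4, 5, matching the expected output in the question. How much faster it is on the real data would need to be measured.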
I have data in following csv format
Date,State,City,Station Code,Minimum temperature (C),Maximum temperature (C),Rainfall (mm),Evaporation (mm),Sunshine (hours),Direction of maximum wind gust,Speed of maximum wind gust (km/h),9am Temperature (C),9am relative humidity (%),3pm Temperature (C),3pm relative humidity (%)
2017-12-25,VIC,Melbourne,086338,15.1,21.4,0,8.2,10.4,S,44,17.2,57,20.7,54
2017-12-25,VIC,Bendigo,081123,11.3,26.3,0,,,ESE,46,17.2,53,25.5,25
2017-12-25,QLD,Gold Coast,040764,22.3,35.7,0,,,SE,59,29.2,53,27.7,67
2017-12-25,SA,Adelaide,023034,13.9,29.5,0,10.8,12.4,N,43,18.6,42,27.7,17
The output for VIC should be
S : 1
ESE : 1
SE : 0
N : 0
However, I am getting this output:
S : 1
ESE : 1
So I would like to know how a unique function can be used to include the other 2 missing results. Below is the program, which reads a csv file:
import pandas as pd
#read file
df = pd.read_csv('climate_data_Dec2017.csv')
#marker
value = df['Date']
date = value == "2017-12-26"
marker = df[date]
#group data
directionwise_data = marker.groupby('Direction of maximum wind gust')
count = directionwise_data.size()
numbers = count.to_dict()
for key in numbers:
    print(key, ":", numbers[key])
To begin with, I'm not sure what you're trying to get from this.
Your data sample has no "2017-12-26" records, yet your code filters on that date. For the sample, I changed the code to "2017-12-25" just to see what it produces, and it produces exactly what you're expecting. So I guess your full data has no "2017-12-26" records for SE and N, which is why they are not being grouped. I suggest you create a unique set of the four directions you have in your df, then count their occurrences in a slice of your dataframe for the needed date.
Or, if all you want is how many records you have for each direction by date, why not just pivot it like below:
output = df.pivot_table(index='Date', columns = 'Direction of maximum wind gust', aggfunc={'Direction of maximum wind gust':'count'}, fill_value=0)
EDIT:
Ok, so I wrote this quickly. It should get you what you want, but you need to tell it which date you want:
import pandas as pd
#read csv
df = pd.read_csv('climate_data_Dec2017.csv')
#specify date
neededDate = '2017-12-25'
#slice dataframe to keep needed records based on the date
subFrame = df.loc[df['Date'] == neededDate].reset_index(drop=True)
#set count to zero
d1 = 0 #'S'
d2 = 0 #'SE'
d3 = 0 #'N'
d4 = 0 #'ESE'
#loop over slice and count directions
for i, row in subFrame.iterrows():
    direction = subFrame.at[i,'Direction of maximum wind gust']
    if direction == 'S':
        d1 = d1+1
    elif direction == 'SE':
        d2 = d2+1
    elif direction == 'N':
        d3 = d3+1
    elif direction == 'ESE':
        d4 = d4+1
#print directions count
print ('S = ' + str(d1))
print ('SE = ' + str(d2))
print ('N = ' + str(d3))
print ('ESE = ' + str(d4))
S = 1
SE = 1
N = 1
ESE = 1
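The "unique set of directions, then count" idea can also be written without hard-coding the four directions. Below is a minimal sketch that assumes the zero rows should come from every direction seen anywhere in the file. It reads a trimmed inline copy of the sample rows (only three of the columns) so it runs without the csv file.

```python
import io
import pandas as pd

# Trimmed copy of the sample rows from the question, kept inline for illustration.
csv_text = """Date,State,Direction of maximum wind gust
2017-12-25,VIC,S
2017-12-25,VIC,ESE
2017-12-25,QLD,SE
2017-12-25,SA,N
"""
df = pd.read_csv(io.StringIO(csv_text))
col = "Direction of maximum wind gust"

# Every direction that occurs anywhere in the file, in order of first appearance.
all_directions = df[col].dropna().unique()

# Slice to the date and state of interest, then count.
vic = df[(df["Date"] == "2017-12-25") & (df["State"] == "VIC")]

# reindex adds the directions that are absent from the slice, with a count of 0.
counts = vic[col].value_counts().reindex(all_directions, fill_value=0)

for direction, n in counts.items():
    print(direction, ":", n)
```

For the VIC slice this prints S : 1, ESE : 1, SE : 0 and N : 0, which is the output the question asks for.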
I'm running the code below. It retrieves pairs of stocks (laid out as Row 1-Stock 1,2, Row 2-Stock 1,2 and so on, where Stock 1 and 2 are different in each row) from a CSV file. I then take in data from Yahoo associated with these "Pairs" of stocks. I calculate the returns of the stocks and check whether the distance (difference in returns) between a pair of stocks breaches some threshold; if so, I return 1. However, I'm getting the following error, which I am unable to resolve:
PricePort(tickers)
27 for ticker in tickers:
28 #print ticker
---> 29 x = pd.read_csv('http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt',ticker),usecols=[0,6],index_col=0)
30 x.columns=[ticker]
31 final=pd.merge(final,x,left_index=True,right_index=True)
TypeError: expected a character buffer object
The code is as follows:
from datetime import datetime
import pytz
import csv
import pandas as pd
import pandas.io.data as web
import numpy as np
#Retrieves pairs of stocks (laid out as Row 1-Stock 1,2, Row 2-Stock 1,2 and so on, where Stock 1 and 2 are different in each row) from CSV File
def Dataretriever():
    Pairs = []
    f1=open('C:\Users\Pythoncode\Pairs.csv') #Enter the location of the file
    csvdata= csv.reader(f1)
    for row in csvdata: #reading tickers from the csv file
        Pairs.append(row)
    return Pairs
tickers = Dataretriever() #Obtaining the data
#Taking in data from Yahoo associated with these "Pairs" of Stocks
def PricePort(tickers):
    """
    Returns historical adjusted prices of a portfolio of stocks.
    tickers=pairs
    """
    final=pd.read_csv('http://chart.yahoo.com/table.csv?s=^GSPC',usecols=[0,6],index_col=0)
    final.columns=['^GSPC']
    for ticker in tickers:
        #print ticker
        x = pd.read_csv('http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt',ticker),usecols=[0,6],index_col=0)
        x.columns=[ticker]
        final=pd.merge(final,x,left_index=True,right_index=True)
    return final
#Calculating returns of the stocks
def Returns(tickers):
    l = []
    begdate=(2014,1,1)
    enddate=(2014,6,1)
    p = PricePort(tickers)
    ret = (p.close[1:] - p.close[:-1])/p.close[1:]
    l.append(ret)
    return l
#Basically a class to see if the distance (difference in returns) between a
#pair of stocks breaches some threshold
class ThresholdClass():
    #constructor
    def __init__(self, Pairs):
        self.Pairs = Pairs
    #Calculating the distance (difference in returns) between a pair of stocks
    def Distancefunc(self, tickers):
        k = 0
        l = Returns(tickers)
        summation=[[0 for x in range (k)]for x in range (k)] #2d matrix for the squared distance
        for i in range (k):
            for j in range (i+1,k): # it will be a upper triangular matrix
                for p in range (len(self.PricePort(tickers))-1):
                    summation[i][j]= summation[i][j] + (l[i][p] - l[j][p])**2 #calculating distance
        for i in range (k): #setting the lower half of the matrix 1 (if we see 1 in the answer we will set a higher limit but typically the distance squared is less than 1)
            for j in range (i+1):
                sum[i][j]=1
        return sum
    #This function is used in determining the threshold distance
    def MeanofPairs(self, tickers):
        sum = self.Distancefunc(tickers)
        mean = np.mean(sum)
        return mean
    #This function is used in determining the threshold distance
    def StandardDeviation(self, tickers):
        sum = self.Distancefunc(tickers)
        standard_dev = np.std(sum)
        return standard_dev
    def ThresholdandnewsChecker(self, tickers):
        threshold = self.MeanofPairs(tickers) + 2*self.StandardDeviation(tickers)
        if (self.Distancefunc(tickers) > threshold):
            return 1
Threshold_Class = ThresholdClass(tickers)
Threshold_Class.ThresholdandnewsChecker(tickers,1)
The trouble is that Dataretriever() returns a list of rows, and each row is itself a list, not a string. When you iterate over tickers, the name ticker is bound to a list.
The str.replace method expects both arguments to be strings. The following code raises the error because the second argument is a list:
'http://chart.yahoo.com/table.csv?s=ttt'.replace('ttt', ticker)
The subsequent line x.columns = [ticker] will cause similar problems. Here, ticker needs to be a hashable object (like a string or integer), but lists are not hashable.
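A small sketch of one way around this is to loop over the pairs first and then over the ticker strings inside each pair. It is written for Python 3, the ticker symbols are made-up stand-ins for the contents of Pairs.csv, and the URL is copied from the question only to build strings; nothing is fetched.

```python
import csv
import io

# Hypothetical stand-in for Pairs.csv: one pair of tickers per row.
csv_text = "AAPL,MSFT\nKO,PEP\n"
pairs = list(csv.reader(io.StringIO(csv_text)))

URL_TEMPLATE = "http://chart.yahoo.com/table.csv?s=ttt"

# Each row from csv.reader is a list, so passing it to str.replace fails.
try:
    URL_TEMPLATE.replace("ttt", pairs[0])
    failed = False
except TypeError:
    failed = True

# Loop over the pairs, then over the ticker strings inside each pair.
urls = [URL_TEMPLATE.replace("ttt", ticker) for pair in pairs for ticker in pair]

print(failed)
print(urls)
```

With ticker bound to a string, x.columns = [ticker] also works, because strings are hashable.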