Issues with getting Pandas correlation - python

I have this code:
data = pd.read_csv("out.csv")
df=data[['created_at','ticker','close']]
print(df)
print(df.corr())
out.csv looks like this:
created_at,ticker,adj_close,close,high,low,open,volume
2020-06-02 09:30:00-04:00,A,90.33000183105469,90.33000183105469,90.41000366210938,89.94999694824219,90.0,45326.0
2020-06-02 09:31:00-04:00,A,90.2300033569336,90.2300033569336,90.2300033569336,90.22000122070312,90.22000122070312,709.0
2020-06-08 15:56:00-04:00,ZYXI,22.899900436401367,22.899900436401367,22.959999084472656,22.829999923706055,22.959999084472656,5304.0
2020-06-08 15:57:00-04:00,ZYXI,22.920000076293945,22.920000076293945,22.950000762939453,22.889999389648438,22.899999618530273,5317.0
2020-06-08 15:58:00-04:00,ZYXI,22.860000610351562,22.860000610351562,22.93000030517578,22.860000610351562,22.90999984741211,10357.0
I want to see a correlation matrix between tickers using the close price over time, which is why I have included the created_at column. However, when I run print(df.corr()), I only see the result below, and I'm not sure why:
close
close 1.0

Found the answer: https://www.interviewqs.com/blog/py_stock_correlation
import pandas as pd
data = pd.read_csv("out.csv")
dfdata = data[['created_at', 'ticker', 'close']]
# reshape to one row per timestamp and one column per ticker
df_pivot = dfdata.pivot(index='created_at', columns='ticker', values='close')
print("loaded df")
# pairwise Pearson correlation between the ticker columns
corr_df = df_pivot.corr(method='pearson')
# drop the axis names so the matrix prints with plain ticker labels
corr_df.index.name = None
corr_df.columns.name = None
print(corr_df.head(10))
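For reference, here is a minimal self-contained sketch of the same pivot-then-correlate idea. The tickers, timestamps and prices are made up for illustration, and it assumes each (timestamp, ticker) pair appears only once.

```python
import pandas as pd

# Made-up long-format data: one row per (timestamp, ticker) pair.
data = pd.DataFrame({
    "created_at": ["09:30", "09:31", "09:32"] * 3,
    "ticker": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "close": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

# Reshape to wide format: one row per timestamp, one column per ticker.
wide = data.pivot(index="created_at", columns="ticker", values="close")

# Pairwise Pearson correlation between the ticker columns.
corr = wide.corr(method="pearson")
print(corr)
```

If a ticker has no row for some timestamp, that cell is NaN in the wide table, and corr() leaves it out when computing each pair.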

Related

Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date

I have two dataframes, one with earnings date and code for before market/after market and the other with daily OHLC data.
First dataframe df:
     earnDate    anncTod
103  2015-11-18  0900
104  2016-02-24  0900
105  2016-05-18  0900
...  ...         ...
128  2022-03-01  0900
129  2022-05-18  0900
130  2022-08-17  0900
Second dataframe af:
Datetime    Open      High      Low       Close     Volume
2005-01-03  36.3458   36.6770   35.5522   35.6833   3343500
...         ...       ...       ...       ...       ...
2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977
I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on anncTod value, I want to find the close price of the previous day (if =0900) or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close move which will be stored in new columns on df.
I'm not sure how to search matching values and fetch values from that row but a different column. I'm trying to do this with a df.iloc and a for loop.
Here's the full code:
import pandas as pd
import requests
import datetime as dt
ticker = 'TGT'
## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])
## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]
## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])
## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4
for date in df['earnDate']:
    if df.iloc[date,1] == '0900':
        priorday = af.loc[af.index.get_loc(date)-1,0]
        priorclose = af.loc[priorday,4]
        open = af.loc[date,1]
        close = af.loc[date,4]
        df.iloc[date,2] = close/priorclose
        df.iloc[date,3] = open/priorclose
        df.iloc[date,4] = close/open
    else:
        print('afternoon')
I get an error:
if df.iloc[date,1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Converting the date columns to integers creates another error. Is there a better way I should go about doing this?
Ideal output would look like (made up numbers, abbreviated output):
earnDate    anncTod  Total Move  Overnight Move  Intraday Move
2015-11-18  0900     9%          7.2%            1.8%
But would include all the dates given in the first dataframe.
UPDATE
I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is looking up the variable date in the second dataframe, af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is updated and simplified code (all else remains the same):
import pandas as pd
import requests
import datetime as dt
ticker = 'TGT'
## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])
## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]
## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))
## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])
## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4
for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date,'Open']) ##this is line generating error
    else:
        print('afternoon')
I now get KeyError:'2015-11-18'
Using loc to access a row assumes that the label you search for is in the index. Here af still has the default integer index, so you'll need to set its date column as the index before looking dates up in it. For example:
import pandas as pd
df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})
df = df.set_index(df["earnDate"])
for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])
# prints
# 111
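Applied to the two dataframes in the question, the same idea might look like the sketch below: index the price table by date, then look rows up by label. The prices are made up, only the '0900' (before-market) case is covered, and using shift(1) for the prior day's close and reindex for the lookup are my own assumptions about one way to do it, not something taken from the code above.

```python
import pandas as pd

# Made-up daily prices standing in for af; dates are parsed and used as the index.
af = pd.DataFrame({
    "Datetime": ["2015-11-17", "2015-11-18", "2015-11-19"],
    "Open": [100.0, 110.0, 120.0],
    "Close": [105.0, 121.0, 118.0],
})
af["Datetime"] = pd.to_datetime(af["Datetime"])
af = af.set_index("Datetime")

# Previous trading day's close, aligned to each date.
af["PriorClose"] = af["Close"].shift(1)

# Made-up earnings table standing in for df.
df = pd.DataFrame({"earnDate": ["2015-11-18"], "anncTod": ["0900"]})
df["earnDate"] = pd.to_datetime(df["earnDate"])

# Fetch the price row for each earnings date; dates missing from af become NaN.
rows = af.reindex(df["earnDate"])

# Express each move as a ratio, as in the original loop.
df["Overnight"] = (rows["Open"] / rows["PriorClose"]).to_numpy()
df["Intraday"] = (rows["Close"] / rows["Open"]).to_numpy()
df["Total Move"] = (rows["Close"] / rows["PriorClose"]).to_numpy()
print(df)
```

Because the lookup happens for all dates at once, no for loop is needed; the after-market case would need a shift in the other direction.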

Python Yfinance - Can't get SPY history

I'm following a tutorial on using Yfinance in Jupyter Notebook to get prices for SPY (S&P 500) in a dataframe. The code looks simple, but I can't seem to get the desired results.
df_tickers = pd.DataFrame()
spyticker = yf.Ticker("SPY")
print(spyticker)
df_ticker = spyticker.history(period="max", interval="1d", start="1998-12-01", end="2022-01-01" , auto_adjust=True, rounding=True)
df_ticker.head()
The error states: "SPY: No data found for this date range, symbol may be delisted." But when I print spyticker, I get the correct yfinance object:
yfinance.Ticker object <SPY>
I am not sure what your problem is, but if I use the following:
spyticker = yf.Ticker("SPY")
df_ticker = spyticker.history(period="max", interval="1d", start="1998-12-01", end="2022-01-01" , auto_adjust=True, rounding=True)
df_ticker.head()
I get the following:
             Open   High    Low  Close    Volume  Dividends  Stock Splits
Date
1998-12-01  76.02  77.27  75.43  77.00   8950600        0.0             0
1998-12-02  76.74  77.19  75.94  76.78   7495500        0.0             0
1998-12-03  76.76  77.45  75.35  75.51  12145300        0.0             0
1998-12-04  76.35  77.58  76.27  77.49  10339500        0.0             0
1998-12-07  77.29  78.21  77.25  77.86   4290000        0.0             0
My only explanation is that the call to spyticker.history already returns a dataframe, so it isn't necessary to define df_ticker beforehand.

Changing some values in a row of pd.DataFrame leads to SettingWithCopyWarning pandas python

I'm making a stock app, and when I fetch the data and edit it, I see this warning.
The code is simple:
df = yf.download(ticker, period=period, interval='1wk', auto_adjust=True, threads=True)
Here I get a DataFrame like the one below:
                  Open        High         Low       Close     Volume
Date
2020-05-25  205.940002  207.880005  196.699997  207.389999  114231300
2020-06-01  205.899994  220.589996  203.940002  219.550003   85600600
2020-06-08  219.600006  225.000000  213.559998  217.639999   68520500
2020-06-15  214.110001  226.500000  212.750000  220.639999   77023000
2020-06-22  220.919998  231.029999  213.500000  215.710007   78020200
...                ...         ...         ...         ...        ...
2022-05-02   96.410004  102.690002   88.709999   90.050003   95662500
2022-05-09   86.955002   88.639999   78.010002   87.989998  115590500
2022-05-16   87.699997   94.480003   84.730003   86.790001  107750100
2022-05-23   87.059998   87.415001   81.540001   82.470001   29212600
2022-05-25   83.720001   84.070000   81.070000   82.309998   22781455
Then I need to edit the row with date "2022-05-23":
df2 = df.loc[d_str] #d_str is 2022-05-25
df.loc[dtd2_s]['Close'] = df2['Close'] #dtd2_s is 2022-05-23
df.loc[dtd2_s]['High'] = max(df2['High'], df.loc[dtd2_s]['High'])
df.loc[dtd2_s]['Low'] = min(df2['Low'], df.loc[dtd2_s]['High'])
But here I get a SettingWithCopyWarning:
/Users/alex26/miniforge3/envs/rq/lib/python3.8/site-packages/pandas/core/series.py:1056: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
cacher_needs_updating = self._check_is_chained_assignment_possible()
Do you know how to fix it? Thank you.
You may want to try this usage of loc[]:
df2 = df.loc[d_str]  # d_str is 2022-05-25
df.loc[dtd2_s, 'Close'] = df2['Close']  # dtd2_s is 2022-05-23
df.loc[dtd2_s, 'High'] = max(df2['High'], df.loc[dtd2_s, 'High'])
df.loc[dtd2_s, 'Low'] = min(df2['Low'], df.loc[dtd2_s, 'Low'])
Specifically, I have used loc[index_value, column_label], with one set of brackets and a comma, rather than loc[index_value][column_label], which is a call to loc[] followed by a chained call to []; the chained form can give rise to the warning.
Here is the documentation showing the above loc[] usage.
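As a small runnable illustration of the loc[row, column] form, here is a sketch with a made-up two-row frame. The names target and source are hypothetical stand-ins for dtd2_s and d_str, and I have assumed the intent is to fold the later row's High/Low/Close into the earlier one.

```python
import pandas as pd

# Made-up weekly bars; the last row is a partial week to be folded into the one before.
df = pd.DataFrame(
    {"High": [87.4, 84.1], "Low": [81.5, 81.1], "Close": [82.5, 82.3]},
    index=pd.to_datetime(["2022-05-23", "2022-05-25"]),
)

target = pd.Timestamp("2022-05-23")  # row to update (dtd2_s in the question)
source = pd.Timestamp("2022-05-25")  # row to read from (d_str in the question)

# A single .loc call with (row label, column label) writes into df itself,
# so no intermediate copy is involved and no warning is raised.
df.loc[target, "Close"] = df.loc[source, "Close"]
df.loc[target, "High"] = max(df.loc[source, "High"], df.loc[target, "High"])
df.loc[target, "Low"] = min(df.loc[source, "Low"], df.loc[target, "Low"])
print(df.loc[target])
```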

How to get values from a dict into a new column, based on values in column

I have a dictionary that maps each company ticker to its sector, for example 'AAPL': 'Technology'.
I have a CSV file that looks like this:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000
I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector.
Code:
for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])
The code is giving me problems.
For example, the first sector is Healthcare, and I get this error:
ValueError: Healthcare is an unknown string function
Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to look up values in a dict using a column value as the key; .map is the method specifically implemented for this operation.
.map will return NaN if the ticker is not in the dict.
.apply can be used, but .map is preferred:
df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))
.get will return None if the ticker isn't in the dict.
import pandas as pd
# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'dimension': ['ARQ', 'ARQ', 'ARQ'], 'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'], 'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})
# in your case, load the data from the file instead
# df = pd.read_csv('file.csv')
# display(df)
ticker dimension calendardate datekey
0 AAPL ARQ 1999-12-31 2000-03-15
1 AAPL ARQ 2000-03-31 2000-06-12
2 AAPL ARQ 2000-06-30 2000-09-01
# dict of sectors
company_dict = {'AAPL': 'tech'}
# insert the sector column using map, into a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))
# display(df)
ticker sector dimension calendardate datekey
0 AAPL tech ARQ 1999-12-31 2000-03-15
1 AAPL tech ARQ 2000-03-31 2000-06-12
2 AAPL tech ARQ 2000-06-30 2000-09-01
# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
temp = sf1.ticker.map(lambda x: company_dic[str(x)])  # faster than a for loop
sf1['sector'] = temp
You can pass na_action='ignore' if you have NaNs in the ticker column.
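A short sketch of what na_action='ignore' changes, using a made-up dictionary and a ticker column that contains one NaN:

```python
import numpy as np
import pandas as pd

company_dic = {"AAPL": "Technology", "A": "Healthcare"}  # made-up mapping
tickers = pd.Series(["AAPL", np.nan, "A"])

# With na_action='ignore' the lambda is never called for NaN, so NaN passes
# through instead of raising KeyError on company_dic['nan'].
sectors = tickers.map(lambda x: company_dic[str(x)], na_action="ignore")
print(sectors)
```

A ticker that is present but missing from the dictionary would still raise KeyError here; passing the dict itself to map avoids that by returning NaN.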

ValueError: Array conditional must be same shape as self

I am a complete beginner in pandas and I am following a tutorial that is obviously outdated.
I have this simple script, and when I run it I get this error:
ValueError: Array conditional must be same shape as self
# loading the class data from the package pandas_datareader
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
# Adj Close:
# The closing price of the stock that adjusts the price of the stock for corporate actions.
# This price takes into account the stock splits and dividends.
# The adjusted close is the price we will use for this example.
# Indeed, since it takes into account splits and dividends, we will not need to adjust the price manually.
# First day
start_date = '2014-01-01'
# Last day
end_date = '2018-01-01'
# Call the function DataReader from the class data
goog_data = data.DataReader('GOOG', 'yahoo', start_date, end_date)
goog_data_signal = pd.DataFrame(index=goog_data.index)
goog_data_signal['price'] = goog_data['Adj Close']
goog_data_signal['daily_difference'] = goog_data_signal['price'].diff()
goog_data_signal['signal'] = 0.0
# this line produces the error
goog_data_signal['signal'] = pd.DataFrame.where(goog_data_signal['daily_difference'] > 0, 1.0, 0.0)
goog_data_signal['positions'] = goog_data_signal['signal'].diff()
print(goog_data_signal.head())
I am trying to understand the theory, the libraries and the methodology through practicing so bear with me if it is too obvious... :]
where is a method called on a DataFrame or Series, but here you only need to check a condition on a single Series, so I found two ways to solve this problem.
The current where method doesn't support setting a value for the rows where the condition is true (1.0 in your case), but it does support setting a value for the rows where it is false (the other parameter in the docs). So you can call it on the column and set the 1.0's manually later, as follows:
goog_data_signal['signal'] = goog_data_signal['daily_difference'].where(goog_data_signal['daily_difference'] > 0, other=0.0)
# the true rows retain their daily_difference values; set them to 1.0 as needed
Or you can check the condition directly as follows:
goog_data_signal['signal'] = (goog_data_signal['daily_difference'] > 0).astype(int)
The second method produces the output for me:
                 price  daily_difference  signal  positions
Date
2014-01-02  554.481689               NaN       0        NaN
2014-01-03  550.436829         -4.044861       0        0.0
2014-01-06  556.573853          6.137024       1        1.0
2014-01-07  567.303589         10.729736       1        0.0
2014-01-08  568.484192          1.180603       1        0.0
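A third option, which accepts both a true value and a false value in one call, is NumPy's where function. The sketch below uses a few made-up prices in place of the downloaded 'Adj Close' column, and signal_df is a hypothetical name standing in for goog_data_signal.

```python
import numpy as np
import pandas as pd

# Made-up prices standing in for the downloaded 'Adj Close' column.
signal_df = pd.DataFrame({"price": [554.5, 550.4, 556.6, 567.3]})
signal_df["daily_difference"] = signal_df["price"].diff()

# np.where(condition, value_if_true, value_if_false) works element-wise;
# the first row's NaN difference compares as False and so gets 0.0.
signal_df["signal"] = np.where(signal_df["daily_difference"] > 0, 1.0, 0.0)
signal_df["positions"] = signal_df["signal"].diff()
print(signal_df)
```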
