Issues with getting Pandas correlation - python
I have this code:
data = pd.read_csv("out.csv")
df=data[['created_at','ticker','close']]
print(df)
print(df.corr())
out.csv looks like this:
created_at,ticker,adj_close,close,high,low,open,volume
2020-06-02 09:30:00-04:00,A,90.33000183105469,90.33000183105469,90.41000366210938,89.94999694824219,90.0,45326.0
2020-06-02 09:31:00-04:00,A,90.2300033569336,90.2300033569336,90.2300033569336,90.22000122070312,90.22000122070312,709.0
2020-06-08 15:56:00-04:00,ZYXI,22.899900436401367,22.899900436401367,22.959999084472656,22.829999923706055,22.959999084472656,5304.0
2020-06-08 15:57:00-04:00,ZYXI,22.920000076293945,22.920000076293945,22.950000762939453,22.889999389648438,22.899999618530273,5317.0
2020-06-08 15:58:00-04:00,ZYXI,22.860000610351562,22.860000610351562,22.93000030517578,22.860000610351562,22.90999984741211,10357.0
I want to see a correlation matrix between tickers using the close price over time, which is why I have included the created_at column. However, when I run print(df.corr()) I only see the result below, and I'm not sure why:
       close
close    1.0
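For reference, the cause: corr() only operates on numeric columns, so ticker and created_at (both strings here) are excluded and only close remains, giving a 1x1 matrix. A minimal sketch (on pandas 2.x you must pass numeric_only=True, since non-numeric columns are no longer dropped silently):

```python
import pandas as pd

# Tiny stand-in for the out.csv data
df = pd.DataFrame({
    "created_at": ["2020-06-02 09:30:00", "2020-06-02 09:31:00"],
    "ticker": ["A", "A"],
    "close": [90.33, 90.23],
})

# Only numeric columns take part in the correlation, so the matrix is 1x1
corr = df.corr(numeric_only=True)
print(corr)
```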
Found the answer https://www.interviewqs.com/blog/py_stock_correlation
data = pd.read_csv("out.csv")
dfdata=data[['created_at','ticker','close']]
# print(df)
df_pivot = dfdata.pivot(index='created_at', columns='ticker', values='close').reset_index()
print("loaded df")
# print(df_pivot.head())
corr_df = df_pivot.corr(method='pearson')
# clear the leftover axis names from the pivot
# ('del corr_df.index.name' fails on newer pandas versions)
corr_df.index.name = None
corr_df.columns.name = None
print(corr_df.head(10))
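To illustrate why the pivot step matters: after pivoting, each ticker becomes its own numeric column, so corr() produces a ticker-by-ticker matrix. A minimal sketch with hypothetical two-ticker data (timestamps and prices are made up):

```python
import pandas as pd

# Long format, as read from out.csv: one row per (timestamp, ticker)
raw = pd.DataFrame({
    "created_at": ["t1", "t2", "t3"] * 2,
    "ticker": ["A"] * 3 + ["ZYXI"] * 3,
    "close": [90.33, 90.23, 90.50, 22.90, 22.92, 22.86],
})

# Wide format: one column per ticker, one row per timestamp
wide = raw.pivot(index="created_at", columns="ticker", values="close")

# Now corr() compares the tickers' close series against each other
corr = wide.corr(method="pearson")
print(corr)
```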
Related
Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date
I have two dataframes, one with earnings dates and a before-market/after-market code, and the other with daily OHLC data.

First dataframe df:

     earnDate    anncTod
103  2015-11-18  0900
104  2016-02-24  0900
105  2016-05-18  0900
...  ..........  .......
128  2022-03-01  0900
129  2022-05-18  0900
130  2022-08-17  0900

Second dataframe af:

Datetime     Open      High      Low       Close     Volume
2005-01-03   36.3458   36.6770   35.5522   35.6833   3343500
...........  ........  ........  ........  ........  .......
2022-04-22   246.5500  247.2000  241.4300  241.9100  1817977

I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on the anncTod value, I want the close price of the previous day (if it is 0900) or the open and close price of the following day (otherwise). I'll use these numbers to calculate the overnight, intraday and close-to-close moves, which will be stored in new columns on df. I'm not sure how to search for matching values and fetch values from that row but a different column. I'm trying to do this with df.iloc and a for loop.
Here's the full code:

import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4

for date in df['earnDate']:
    if df.iloc[date,1] == '0900':
        priorday = af.loc[af.index.get_loc(date)-1,0]
        priorclose = af.loc[priorday,4]
        open = af.loc[date,1]
        close = af.loc[date,4]
        df.iloc[date,2] = close/priorclose
        df.iloc[date,3] = open/priorclose
        df.iloc[date,4] = close/open
    else:
        print('afternoon')

I get an error:

if df.iloc[date,1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Converting the date columns to integers creates another error. Is there a better way I should go about doing this? Ideal output would look like this (made-up numbers, abbreviated output):

earnDate    anncTod  Total Move  Overnight Move  Intraday Move
2015-11-18  0900     9%          7.2%            1.8%

But it would include all the dates given in the first dataframe.

UPDATE

I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable 'date' in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is the updated and simplified code (all else remains the same):

import pandas as pd
import requests
import datetime as dt

ticker = 'TGT'

## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])

## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]

## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))

## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])

## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date,'Open']) ##this is the line generating the error
    else:
        print('afternoon')

I now get KeyError: '2015-11-18'
To use loc to access a certain row, the label you search for must be in the index. Specifically, that means you'll need to set the date column as the index of the dataframe you are searching. Example:

import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})

df = df.set_index(df["earnDate"])

for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])

# prints
# 111
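As an alternative to the row-by-row loop, the previous-trading-day lookup can be done positionally with Index.get_indexer, which maps each earnings date to its row number in af, so the prior day is just pos - 1. A minimal sketch with made-up prices (the frame and column names mirror the question, the data does not):

```python
import pandas as pd

# Hypothetical daily OHLC frame standing in for af, indexed by date
af = pd.DataFrame(
    {"Open": [100.0, 102.0, 101.0], "Close": [101.0, 103.0, 102.0]},
    index=pd.to_datetime(["2015-11-17", "2015-11-18", "2015-11-19"]),
)

# Hypothetical earnings frame standing in for df
earn = pd.DataFrame({"earnDate": pd.to_datetime(["2015-11-18"]),
                     "anncTod": ["0900"]})

# Find each earnings date's row number in af, then step back one row
# to get the previous trading day's close
pos = af.index.get_indexer(earn["earnDate"])
prior_close = af["Close"].iloc[pos - 1].to_numpy()

# Close-to-close move, computed for all earnings dates at once
earn["Total Move"] = af["Close"].iloc[pos].to_numpy() / prior_close
print(earn)
```

This avoids both the iloc/loc confusion and the KeyError, since get_indexer never requires the dates to be unique loop variables.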
Python Yfinance - Can't get SPY history
I'm following a tutorial on using yfinance in a Jupyter Notebook to get prices for SPY (S&P 500) into a dataframe. The code looks simple, but I can't seem to get the desired results.

df_tickers = pd.DataFrame()
spyticker = yf.Ticker("SPY")
print(spyticker)
df_ticker = spyticker.history(period="max", interval="1d", start="1998-12-01", end="2022-01-01", auto_adjust=True, rounding=True)
df_ticker.head()

The error states: "SPY: No data found for this date range, symbol may be delisted." But when I print spyticker, I get the correct yfinance object: yfinance.Ticker object <SPY>
I am not sure what your problem is, but if I use the following:

spyticker = yf.Ticker("SPY")
df_ticker = spyticker.history(period="max", interval="1d", start="1998-12-01", end="2022-01-01", auto_adjust=True, rounding=True)
df_ticker.head()

I get the following:

            Open   High   Low    Close  Volume    Dividends  Stock Splits
Date
1998-12-01  76.02  77.27  75.43  77.00   8950600  0.0        0
1998-12-02  76.74  77.19  75.94  76.78   7495500  0.0        0
1998-12-03  76.76  77.45  75.35  75.51  12145300  0.0        0
1998-12-04  76.35  77.58  76.27  77.49  10339500  0.0        0
1998-12-07  77.29  78.21  77.25  77.86   4290000  0.0        0

My only explanation is that spyticker.history already returns a dataframe, so it isn't necessary to define the empty df_tickers beforehand.
Changing some values in a row of pd.DataFrame leads to SettingWithCopyWarning pandas python
I'm making a stock app, and when I get the data and edit it, I see this warning. The code is simple:

df = yf.download(ticker, period=period, interval='1wk', auto_adjust=True, threads=True)

Here I get a DataFrame like the one below:

                  Open        High         Low       Close     Volume
Date
2020-05-25  205.940002  207.880005  196.699997  207.389999  114231300
2020-06-01  205.899994  220.589996  203.940002  219.550003   85600600
2020-06-08  219.600006  225.000000  213.559998  217.639999   68520500
2020-06-15  214.110001  226.500000  212.750000  220.639999   77023000
2020-06-22  220.919998  231.029999  213.500000  215.710007   78020200
...                ...         ...         ...         ...        ...
2022-05-02   96.410004  102.690002   88.709999   90.050003   95662500
2022-05-09   86.955002   88.639999   78.010002   87.989998  115590500
2022-05-16   87.699997   94.480003   84.730003   86.790001  107750100
2022-05-23   87.059998   87.415001   81.540001   82.470001   29212600
2022-05-25   83.720001   84.070000   81.070000   82.309998   22781455

Then I need to edit the row with date "2022-05-23":

df2 = df.loc[d_str] #d_str is 2022-05-25
df.loc[dtd2_s]['Close'] = df2['Close'] #dtd2_s is 2022-05-23
df.loc[dtd2_s]['High'] = max(df2['High'], df.loc[dtd2_s]['High'])
df.loc[dtd2_s]['Low'] = min(df2['Low'], df.loc[dtd2_s]['High'])

But here I get a SettingWithCopyWarning:

/Users/alex26/miniforge3/envs/rq/lib/python3.8/site-packages/pandas/core/series.py:1056: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cacher_needs_updating = self._check_is_chained_assignment_possible()

Do you know how to fix it? Thank you.
You may want to try this usage of loc[]:

df2 = df.loc[d_str] #d_str is 2022-05-25
df.loc[dtd2_s, 'Close'] = df2['Close'] #dtd2_s is 2022-05-23
df.loc[dtd2_s, 'High'] = max(df2['High'], df.loc[dtd2_s]['High'])
df.loc[dtd2_s, 'Low'] = min(df2['Low'], df.loc[dtd2_s]['High'])

Specifically, I have used loc[index_value, column_label] with one set of brackets and a comma, rather than loc[index_value][column_label], which is a call to loc[] followed by a chained call to [], and that chaining is what gives rise to the warning. Here is the documentation showing the above loc[] usage.
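A self-contained sketch of the fix, using the two dates from the question (note the question's last line compares Low against High, likely a slip; this sketch uses Low):

```python
import pandas as pd

# Stand-in for the yf.download result, with just the two rows being merged
df = pd.DataFrame(
    {"High": [87.415001, 84.070000],
     "Low": [81.540001, 81.070000],
     "Close": [82.470001, 82.309998]},
    index=pd.to_datetime(["2022-05-23", "2022-05-25"]),
)
src, dst = "2022-05-25", "2022-05-23"

# Single loc[row, column] calls write into df directly, so there is no
# intermediate copy and no SettingWithCopyWarning
df.loc[dst, "Close"] = df.loc[src, "Close"]
df.loc[dst, "High"] = max(df.loc[src, "High"], df.loc[dst, "High"])
df.loc[dst, "Low"] = min(df.loc[src, "Low"], df.loc[dst, "Low"])
print(df)
```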
How to get values from a dict into a new column, based on values in column
I have a dictionary that contains all of the information for company ticker : sector. For example 'AAPL':'Technology'. I have a CSV file that looks like this: ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000 
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000

I would like to match my dictionary against all the tickers in the CSV file and then write the corresponding values to a new column in the CSV called sector.

Code:

for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])

The code is giving me problems.
For example, the first sector is Healthcare, and I get this error: ValueError: Healthcare is an unknown string function. Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to select values from a dict by using a column value as the key, because .map is the method specifically implemented for this operation. .map will return NaN if the ticker is not in the dict.

.apply can also be made to work, but .map should be preferred:

df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))

.get will return None if the ticker isn't in the dict.

import pandas as pd

# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'],
                   'dimension': ['ARQ', 'ARQ', 'ARQ'],
                   'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'],
                   'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})

# in your case, load the data from the file instead
# df = pd.read_csv('file.csv')

# display(df)
  ticker dimension calendardate     datekey
0   AAPL       ARQ   1999-12-31  2000-03-15
1   AAPL       ARQ   2000-03-31  2000-06-12
2   AAPL       ARQ   2000-06-30  2000-09-01

# dict of sectors
company_dict = {'AAPL': 'tech'}

# insert the sector column using map, at a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))

# display(df)
  ticker sector dimension calendardate     datekey
0   AAPL   tech       ARQ   1999-12-31  2000-03-15
1   AAPL   tech       ARQ   2000-03-31  2000-06-12
2   AAPL   tech       ARQ   2000-06-30  2000-09-01

# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
temp = sf1.ticker.map(lambda x: company_dic[str(x)]) # faster than a for loop
sf1['sector'] = temp

You can pass na_action='ignore' if you have NaNs in the ticker column.
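A small sketch of the two .map behaviors mentioned in these answers (dict contents are hypothetical): mapping with a dict turns missing keys into NaN, while na_action='ignore' keeps a callable from ever seeing NaN inputs.

```python
import pandas as pd

company_dic = {"AAPL": "Technology"}
s = pd.Series(["AAPL", "MSFT", None])

# Mapping with a dict: missing keys become NaN rather than raising
by_dict = s.map(company_dic)

# Mapping with a callable plus na_action='ignore': NaN/None inputs skip
# the call, so the lookup never receives a missing value
by_func = s.map(lambda x: company_dic.get(x, "Unknown"), na_action="ignore")

print(by_dict.tolist())
print(by_func.tolist())
```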
ValueError: Array conditional must be same shape as self
I am a super noob in pandas and I am following a tutorial that is obviously outdated. I have this simple script, and when I run it I get this error: ValueError: Array conditional must be same shape as self

# loading the class data from the package pandas_datareader
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt

# Adj Close:
# The closing price of the stock that adjusts the price of the stock for corporate actions.
# This price takes into account the stock splits and dividends.
# The adjusted close is the price we will use for this example.
# Indeed, since it takes into account splits and dividends, we will not need to adjust the price manually.

# First day
start_date = '2014-01-01'
# Last day
end_date = '2018-01-01'

# Call the function DataReader from the class data
goog_data = data.DataReader('GOOG', 'yahoo', start_date, end_date)

goog_data_signal = pd.DataFrame(index=goog_data.index)
goog_data_signal['price'] = goog_data['Adj Close']
goog_data_signal['daily_difference'] = goog_data_signal['price'].diff()
goog_data_signal['signal'] = 0.0

# this line produces the error
goog_data_signal['signal'] = pd.DataFrame.where(goog_data_signal['daily_difference'] > 0, 1.0, 0.0)

goog_data_signal['positions'] = goog_data_signal['signal'].diff()
print(goog_data_signal.head())

I am trying to understand the theory, the libraries and the methodology through practicing, so bear with me if it is too obvious... :]
The where method is always called on a DataFrame or Series instance, but here you only need to check the condition for a single series. I found two ways to solve this problem.

First, where doesn't support setting a value for the rows where the condition is true (1.0 in your case), but it does support setting a value for the false rows (the other parameter in the docs). So you can set the 1.0s manually afterwards:

goog_data_signal['signal'] = goog_data_signal['daily_difference'].where(goog_data_signal['daily_difference'] > 0, other=0.0)
# the true rows retain their values, and you can set them to 1.0 as needed

Or you can check the condition directly:

goog_data_signal['signal'] = (goog_data_signal['daily_difference'] > 0).astype(int)

The second method produces this output for me:

                 price  daily_difference  signal  positions
Date
2014-01-02  554.481689               NaN       0        NaN
2014-01-03  550.436829         -4.044861       0        0.0
2014-01-06  556.573853          6.137024       1        1.0
2014-01-07  567.303589         10.729736       1        0.0
2014-01-08  568.484192          1.180603       1        0.0
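A third option: numpy.where takes exactly the (condition, value_if_true, value_if_false) triple the original pd.DataFrame.where call was aiming for. A sketch using the daily_difference values shown in the output above:

```python
import numpy as np
import pandas as pd

# diff() of a price series always starts with NaN
diff = pd.Series([np.nan, -4.044861, 6.137024, 10.729736, 1.180603])

# NaN > 0 evaluates to False, so the leading NaN becomes 0.0, matching
# the "no up-move" treatment in the answer above
signal = pd.Series(np.where(diff > 0, 1.0, 0.0), index=diff.index)
print(signal.tolist())
```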