solve SettingWithCopyWarning in pandas


Here is the issue I encountered:
/var/folders/v0/57ps6v293zx6jb2g78v0__1h0000gn/T/ipykernel_58173/392784622.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here's my code
AMZN['VWDR'] = AMZN['Volume'] * AMZN['DailyReturn']/ AMZN['Volume'].cumsum()
I also tried the following, but it did not resolve the warning:
AMZN.loc[AMZN.index,'VWDR'] = AMZN.loc[AMZN.index, 'Volume'] * AMZN.loc[AMZN.index, 'DailyReturn']/ AMZN.loc[AMZN.index,'Volume'].cumsum()
Below is the code that builds my table:
import pandas as pd
import yfinance as yf
# now just read the html to get all the S&P500 tickers
dataload=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = dataload[0]
# now get the first column(tickers) from the above data
# convert it into a list
ticker_list = df['Symbol'].values.tolist()
# convert the list into a string, separated by space, and replace . with -
all_tickers = " ".join(ticker_list).replace('.', '-') # this is to ensure that we could find BRK.B and BF.B
# get all the tickers from yfinance
tickers = yf.Tickers(all_tickers)
# set a start and end date to get two-years info
# group by the ticker
hist = tickers.history(start='2020-05-01', end='2022-05-01', group_by='ticker')
# ‘Stack’ the table to get it into row form
Data_stack = pd.DataFrame(hist.stack(level=0).reset_index().rename(columns = {'level_1':'Ticker'}))
# Add a column to the original table containing the daily return per ticker
Data_stack['DailyReturn'] = Data_stack.sort_values(['Ticker', 'Date']).groupby('Ticker')['Close'].pct_change()
Data_stack = Data_stack.set_index('Date') # now set the Date as the index
# get the AMZN data by filtering the original table on Ticker
AMZN = Data_stack[Data_stack.Ticker=='AMZN']
For simplicity, you might just download the AMZN ticker table from yfinance

Copy AMZN when you create it:
AMZN = Data_stack[Data_stack.Ticker=='AMZN'].copy()
# ^^^^^^^
Then the rest of your code won't have a warning.

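What you are seeing is caused by chained assignment: AMZN is a filtered slice of Data_stack, so pandas cannot tell whether the assignment should also modify Data_stack. To resolve it, take an explicit copy when you create AMZN. After that, plain column assignment works and you do not need .loc to select columns.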
AMZN = Data_stack[Data_stack.Ticker=='AMZN'].copy()
AMZN['VWDR'] = AMZN['Volume'] * AMZN['DailyReturn']/ AMZN['Volume'].cumsum()
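
To try the fix without downloading anything, here is a minimal sketch on a small hand-made table. The data values and the lower-case variable names (data_stack, amzn) are made up for illustration; only the column names follow the question.

```python
import pandas as pd

# Hypothetical stand-in for Data_stack: a few rows for two tickers
data_stack = pd.DataFrame(
    {
        "Ticker": ["AMZN", "AMZN", "AMZN", "MSFT"],
        "Volume": [100, 200, 300, 400],
        "DailyReturn": [0.01, -0.02, 0.03, 0.005],
    },
    index=pd.DatetimeIndex(
        ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-03"], name="Date"
    ),
)

# Boolean filtering followed by .copy() gives an independent DataFrame,
# so adding a column to it is unambiguous and raises no warning
amzn = data_stack[data_stack["Ticker"] == "AMZN"].copy()
amzn["VWDR"] = amzn["Volume"] * amzn["DailyReturn"] / amzn["Volume"].cumsum()

print(amzn)
```

The original table is left untouched: the new VWDR column exists only on the copy.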

Related

how pull beta data from yahoo.finance?

Beta values are calculated by Yahoo Finance, and I think I can save time by pulling them rather than calculating them myself from the variance and so on. The beta chart can be seen under the stock chart. I am able to extract the close price and volume for the tickers using the code below:
import yfinance as yf
from datetime import date
from yahoofinancials import YahooFinancials
df = yf.download('AAPL, MSFT',
                 start='2021-08-01',
                 end=date.today(),
                 progress=False)
adjusted_close = df['Adj Close'].reset_index()
volume = df['Volume'].reset_index()
But how can I get beta values the same way I get prices or volumes? I am looking to pull historical beta data with a start and end date.

You can do this in a batch, using concat instead of DataFrame.append (deprecated in pandas 1.4 and removed in 2.0):
# import pandas and yfinance
import pandas as pd
import yfinance as yf
# initialise with a df with the columns
df = pd.DataFrame(columns=['Stock','Beta','Marketcap'])
# here, symbol_sgx is the list of symbols (tickers) you would like to retrieve data of
# for instance, to retrieve information for DBS, UOB, and Singtel, use the following:
symbol_sgx = ['D05.SI', 'U11.SI','Z74.SI']
for stock in symbol_sgx:
    ticker = yf.Ticker(stock)
    info = ticker.info
    # .get() returns None instead of raising KeyError when a key is missing
    beta = info.get('beta')
    marketcap = info.get('marketCap')
    df_temp = pd.DataFrame({'Stock':[stock],'Beta':[beta],'Marketcap':[marketcap]})
    df = pd.concat([df, df_temp], ignore_index=True)
# this line allows you to check that you retrieved the right information
df
info.get() is a better alternative to info[]. Indexing with info['beta'] raises a KeyError when the key is missing, so if one of the tickers is errant (e.g. outdated or delisted) the script stops. This is especially annoying if you have a long list of tickers and you don't know which one is errant. info.get() returns None instead, so the loop keeps running when no information is available. For those entries, you just need to post-process with df.dropna() to remove the NaNs.
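
The difference between the two lookups can be seen without any network access. The sketch below uses plain dictionaries as hypothetical stand-ins for ticker.info (the ticker 'OLD.SI' and all numbers are invented), and it collects the rows in a list so the DataFrame is built once instead of concatenating inside the loop.

```python
import pandas as pd

# Hypothetical stand-ins for ticker.info; 'OLD.SI' mimics a delisted ticker
fake_info = {
    "D05.SI": {"beta": 0.9, "marketCap": 1.0e11},
    "OLD.SI": {},
}

rows = []
for stock, info in fake_info.items():
    # info["beta"] would raise KeyError for 'OLD.SI'; .get() returns None
    rows.append(
        {"Stock": stock, "Beta": info.get("beta"), "Marketcap": info.get("marketCap")}
    )

df = pd.DataFrame(rows)   # missing values show up as NaN
clean = df.dropna()       # keep only tickers with complete information

print(df)
print(clean)
```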

Yahoo Finance exposes a dictionary of company information through ticker.info. It includes a beta value, which you can read directly:
import yfinance as yf
ticker = yf.Ticker('AAPL')
stock_info = ticker.info
stock_info['beta']
# 1.201965
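
Note that the info dictionary holds a single beta figure, not a series with start and end dates. If a historical series is what you need, one option is to compute a rolling beta yourself as the rolling covariance of stock and market returns divided by the rolling variance of market returns. The sketch below is only an illustration under stated assumptions: the returns are synthetic random numbers (the stock is constructed with a true beta of 1.5), and the 60-day window is an arbitrary choice. With real data you would replace them with daily returns of the stock and of a benchmark index.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns (assumption): the stock moves 1.5x the market plus noise
rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-04", periods=250)
market = pd.Series(rng.normal(0.0, 0.01, len(dates)), index=dates)
stock = 1.5 * market + pd.Series(rng.normal(0.0, 0.001, len(dates)), index=dates)

window = 60  # arbitrary look-back length in trading days

# beta = cov(stock, market) / var(market), evaluated over a rolling window
rolling_beta = stock.rolling(window).cov(market) / market.rolling(window).var()

print(rolling_beta.dropna().tail())
```

The first window - 1 entries are NaN because a full window of observations is not yet available.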

Error when adding a new column to pandas dataframe using a rolling mean function

I have a script where I download some fx rates from the web and would like to calculate the rolling mean. When running the script, I obtain an error in relation to the rates column that I am trying to calculate the rolling mean for. I would like to produce an extra column with the rolling average displayed. Here is what I have so far. The last 3 lines above the comments is where the error seems to be.
Now I get the following error "KeyError: 'rates'"
import pandas as pd
import matplotlib.pyplot as plt
url1 = 'http://www.bankofcanada.ca/'
url2 = 'valet/observations/group/FX_RATES_DAILY/csv?start_date='
start_date = '2017-01-03' # Earliest start date is 2017-01-03
url = url1 + url2 + start_date # Complete url to download csv file
# Read in rates for different currencies for a range of dates
rates = pd.read_csv(url, skiprows=39, index_col='date')
rates.index = pd.to_datetime(rates.index) # assures data type to be a datetime
print("The pandas dataframe with the rates ")
print(rates)
# Get number of days & number of currences from shape of rates - returns a tuple in the
#format (rows, columns)
days, currencies = rates.shape
# Read in the currency codes & strip off extraneous part. Uses url string, skips the first
#10 rows and returns to the data frame columns of index 0 and 2. It will read n rows according
# to the variable currencies. This was returned in line 19 from a tuple produced by .shape
codes = pd.read_csv(url, skiprows=10, usecols=[0,2],
                    nrows=currencies)
#Print out the dataframe read from the web
print("Dataframe with the codes")
print(codes)
#A for loop to go through the codes dataframe. For each ith row and for the index 1 column,
# the for loop will split the string with a string 'to Canadian'
for i in range(currencies):
    codes.iloc[i, 1] = codes.iloc[i, 1].split(' to Canadian')[0]
# Report exchange rates for the most recent date available
date = rates.index[-1] # most recent date available
print('\nCurrency values on {0}'.format(date))
#Using a for loop and zip, the values in the code and rate objects are grouped together
# and then printed to the screen with a new format
for (code, rate) in zip(codes.iloc[:, 1], rates.loc[date]):
    print("{0:20s} Can$ {1:8.6g}".format(code, rate))
#Assign values into a dataframe/slice rates dataframe
FXAUDCAD_daily = pd.DataFrame(index=['dates'], columns={'dates', 'rates'})
FXAUDCAD_daily = FXAUDCAD
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily.loc['rates'].rolling_mean()
print(FXAUDCAD_daily)
#Print the values to the screen
#Calculate the rolling average using the rolling average pandas function
#Create a figure object using matplotlib/pandas
#Plot values on figure on the figure object.
Updated code, after applying the feedback:
import pandas as pd
import matplotlib.pyplot as plt
import datetime
url1 = 'http://www.bankofcanada.ca/'
url2 = 'valet/observations/group/FX_RATES_DAILY/csv?start_date='
start_date = '2017-01-03' # Earliest start date is 2017-01-03
url = url1 + url2 + start_date # Complete url to download csv file
# Read in rates for different currencies for a range of dates
rates = pd.read_csv(url, skiprows=39, index_col='date')
rates.index = pd.to_datetime(rates.index) # assures data type to be a datetime
#print("The pandas dataframe with the rates ")
#print(rates)
# Get number of days & number of currencies from shape of rates - returns a tuple in the
# format (rows, columns)
days, currencies = rates.shape
# Read in the currency codes & strip off extraneous part. Uses url string, skips the first
# 10 rows and returns to the data frame columns of index 0 and 2. It will read n rows according
# to the variable currencies. This was returned in line 19 from a tuple produced by .shape
codes = pd.read_csv(url, skiprows=10, usecols=[0,2],
                    nrows=currencies)
#Print out the dataframe read from the web
#print("Dataframe with the codes")
#print(codes)
#A for loop to go through the codes dataframe. For each ith row and for the index 1 column,
# the for loop will split the string with a string 'to Canadian'
for i in range(currencies):
    codes.iloc[i, 1] = codes.iloc[i, 1].split(' to Canadian')[0]
# Report exchange rates for the most recent date available
date = rates.index[-1] # most recent date available
#print('\nCurrency values on {0}'.format(date))
#Using a for loop and zip, the values in the code and rate objects are grouped together
# and then printed to the screen with a new format
#for (code, rate) in zip(codes.iloc[:, 1], rates.loc[date]):
#    print("{0:20s} Can$ {1:8.6g}".format(code, rate))
#Create dataframe with columns of date and rates
#Assign values into a dataframe/slice rates dataframe
FXAUDCAD_daily = pd.DataFrame(index=['date'], columns={'date', 'rates'})
FXAUDCAD_daily = rates['FXAUDCAD']
print(FXAUDCAD_daily)
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['rates'].rolling(1).mean()

Let's try to fix your code.
First of all, this line seems a bit odd to me, as FXAUDCAD isn't defined.
FXAUDCAD_daily = FXAUDCAD
Then, you might consider rewriting your rolling mean calculation as follows.
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['rates'].rolling(WINDOW_SIZE).mean()

What's your pandas version? pd.rolling_mean() was deprecated in pandas 0.18.0 and has been removed in later releases.
Update your pandas library with:
pip3 install --upgrade pandas
And then use the rolling() method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html):
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['rates'].rolling(window_size).mean()
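
As a quick illustration of rolling().mean(), here is a minimal sketch on made-up numbers. The values, the window size of 3, and the frame layout are assumptions chosen so the result is easy to check by hand; only the FXAUDCAD column name and the date index follow the question.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded rates table
rates = pd.DataFrame(
    {"FXAUDCAD": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.date_range("2017-01-03", periods=5, name="date"),
)

window_size = 3  # number of observations averaged at each step

# Select the column by its real name, then average over the rolling window
rates["rolling mean"] = rates["FXAUDCAD"].rolling(window_size).mean()

print(rates)
```

The first window_size - 1 rows are NaN because the window is not yet full. A window of 1 would simply reproduce the original column.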

I managed to solve it. When I sliced the original rates dataframe into FXAUDCAD_daily, it kept the same date index. I was getting a KeyError because the column is named after the currency abbreviation (FXAUDCAD), not the string 'rates'.
But now I have another trivial problem: how do I rename the FXAUDCAD column to just rate? I will post another question for this.
import pandas as pd
import matplotlib.pyplot as plt
import datetime
url1 = 'http://www.bankofcanada.ca/'
url2 = 'valet/observations/group/FX_RATES_DAILY/csv?start_date='
start_date = '2017-01-03'
url = url1 + url2 + start_date
rates = pd.read_csv(url, skiprows=39, index_col='date')
rates.index = pd.to_datetime(rates.index) # assures data type to be a datetime
print("Print rates to the screen",rates)
#print index
print("Print index to the screen", rates.index)
days, currencies = rates.shape
codes = pd.read_csv(url, skiprows=10, usecols=[0,2],
                    nrows=currencies)
for i in range(currencies):
    codes.iloc[i, 1] = codes.iloc[i, 1].split(' to Canadian')[0]
#date = rates.index[-1]
#Make a series of just the rates of FXAUDCAD
FXAUDCAD_daily = pd.DataFrame(rates['FXAUDCAD'])
#Print FXAUDRATES to the screen
print(FXAUDCAD_daily)
#Calculate the MA using the rolling function with a window size of 1
FXAUDCAD_daily['rolling mean'] = FXAUDCAD_daily['FXAUDCAD'].rolling(1).mean()
#print out the new dataframe with calculation
print(FXAUDCAD_daily)
#Rename one of the data frame from FXAUDCAD to Exchange Rate
FXAUDCAD_daily.rename(columns={'rate':'FXAUDCAD'})
#print out the new dataframe with calculation
print(FXAUDCAD_daily)
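
On the open rename question: DataFrame.rename maps old names to new names and returns a new DataFrame, so the mapping has to be written as {'FXAUDCAD': 'rate'} and the result has to be assigned. A minimal sketch on invented numbers (the names fx and renamed are hypothetical):

```python
import pandas as pd

# Hypothetical two-row stand-in for FXAUDCAD_daily
fx = pd.DataFrame({"FXAUDCAD": [0.97, 0.98], "rolling mean": [0.97, 0.98]})

# Keys are the existing column names, values are the new names.
# rename() does not modify fx in place; keep the returned DataFrame.
renamed = fx.rename(columns={"FXAUDCAD": "rate"})

print(renamed.columns.tolist())
```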

Data appears when printed but doesn't show up in dataframe

#! /usr/lib/python3
import yfinance as yf
import pandas as pd
pd.set_option('display.max_rows', None, 'display.max_columns', None)
# Request stock data from yfinance
ticker = yf.Ticker('AAPL')
# Get all option expiration dates in the form of a list
xdates = ticker.options
# Go through the list of expiry dates one by one
for xdate in xdates:
    # Get option chain info for that xdate
    option = ticker.option_chain(xdate)
    # print out this value, get back 15 columns and 63 rows of information
    print(option)
    # Put that same data in dataframe
    df = pd.DataFrame(data = option)
    # Show dataframe
    print(df)
Expected: df will show a DataFrame containing the same information that is shown when running print(option), i.e. 15 columns and 63 rows of data, or at least some part of them
Actual:
df shows only two columns with no information
df.shape results in (2,1)
print(df.columns.tolist()) results in [0]
Since the desired info appears when you print it, I'm confused as to why it's not appearing in the dataframe.

The data of option_chain for a specific expiration date is available as a DataFrame in the calls property of the returned object (and in the puts property for put options). You don't have to create a new DataFrame.
ticker = yf.Ticker('AAPL')
xdates = ticker.options
option = ticker.option_chain(xdates[0])
option.calls # DataFrame
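
If you want calls and puts in one table, you can concatenate the two DataFrames. The sketch below avoids a network call by using a hypothetical namedtuple as a stand-in for the object returned by option_chain. The column names, the values, and the added optionType column are assumptions for illustration.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the object returned by ticker.option_chain(xdate)
Options = namedtuple("Options", ["calls", "puts"])
option = Options(
    calls=pd.DataFrame({"strike": [100, 110], "lastPrice": [5.0, 2.0]}),
    puts=pd.DataFrame({"strike": [100, 110], "lastPrice": [1.0, 3.0]}),
)

# Label each side, then stack them into a single DataFrame
df = pd.concat(
    [option.calls.assign(optionType="call"), option.puts.assign(optionType="put")],
    ignore_index=True,
)

print(df)
```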
GitHub - yfinance

Change dateformat

I have this code where I want to change the date format, but I only manage to change one value and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv ("data_q_3.csv")
result = df.groupby ("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to fit the code so that the datetime changes in the whole dataset?
Thanks!

After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple: all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde.
Below is the finished code.
import pandas as pd
# read unknown data
df = pd.read_csv("data_q_3.csv")
# List of unknown data
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
# you need a for loop to go through the whole column
for row in result.index:
    # get the current stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing the
    # string from index 0 to 10 (the date) and from index 11 to 16 (hours and minutes),
    # and putting a dash in the middle in place of the 'T'
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time
#print result
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
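
A loop-free alternative is to let pandas parse the whole column and format it in one step. This is a minimal sketch, assuming the 'DateTime' column holds ISO 8601 strings like the one in the question; the two-row table is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the grouped result table
result = pd.DataFrame(
    {"DateTime": ["2020-03-18T12:13:09", "2020-03-19T08:05:00"]},
    index=["Country A", "Country B"],
)

# Parse every value to a timestamp, then format the whole column at once
result["DateTime"] = pd.to_datetime(result["DateTime"]).dt.strftime("%Y-%m-%d-%H:%M")

print(result)
```

Parsing first also means malformed values raise an error instead of being silently sliced into something wrong.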

Trying to iterate and join Pandas DFs: AttributeError: 'Series' object has no attribute 'join'

I'm looking to pull the historical data for ~200 securities in a given index. I import the list of securities from a csv file then iterate over them to pull their respective data from the quandl api. That dataframe for each security has 12 columns, so I create a new column with the name of the security and the Adjusted Close value, so I can later identify the series.
I'm receiving an error when I try to join all the new columns into an empty dataframe. I receive an attribute error:
'''
Print output data
'''
grab_constituent_data()
AttributeError: 'Series' object has no attribute 'join'
Below is the code I have used to arrive here thus far.
'''
Import the modules necessary for analysis
'''
import quandl
import pandas as pd
import numpy as np
'''
Set file pathes and API keys
'''
ticker_path = ''
auth_key = ''
'''
Pull a list of tickers in the IGM ETF
'''
def ticker_list():
    df = pd.read_csv('{}IGM Tickers.csv'.format(ticker_path))
    # print(df['Ticker'])
    return df['Ticker']
'''
Pull the historical prices for the securities within Ticker List
'''
def grab_constituent_data():
    tickers = ticker_list()
    main_df = pd.DataFrame()
    for abbv in tickers:
        query = 'EOD/{}'.format(str(abbv))
        df = quandl.get(query, authtoken=auth_key)
        print('Competed the query for {}'.format(query))
        df['{} Adj_Close'.format(str(abbv))] = df['Adj_Close'].copy()
        df = df['{} Adj_Close'.format(str(abbv))]
        print('Completed the column adjustment for {}'.format(str(abbv)))
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df)
    print(main_df.head())

It seems that in your line
df = df['{} Adj_Close'.format(str(abbv))]
you're getting a Series and not a DataFrame. On the first iteration main_df is therefore set to a Series, which has no join method. If you want to convert your Series to a DataFrame, you can use the to_frame() method like:
df = df['{} Adj_Close'.format(str(abbv))].to_frame()
I didn't check whether your code could be simplified, but this should fix your issue.

To change a Series into a pandas DataFrame you can use the following:
df = pd.DataFrame(df)
After running the above code, the Series becomes a DataFrame, and you can proceed with the join you mentioned earlier.
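
Another way to combine the per-ticker columns, shown here only as a sketch, is to collect the renamed Series in a list and concatenate them column-wise once at the end. The tickers 'AAA' and 'BBB' and the prices are invented, and the small hand-made frames stand in for what quandl.get would return.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3)

columns = []
for abbv, start in [("AAA", 1.0), ("BBB", 10.0)]:
    # Hypothetical stand-in for the DataFrame returned per ticker
    df = pd.DataFrame({"Adj_Close": [start, start + 1, start + 2]}, index=idx)
    # Selecting one column yields a Series; rename it so it can be identified later
    columns.append(df["Adj_Close"].rename("{} Adj_Close".format(abbv)))

# Option 1: concatenate all Series side by side in one call
main_df = pd.concat(columns, axis=1)

# Option 2: convert the first Series to a DataFrame, then join the rest
joined = columns[0].to_frame().join(columns[1])

print(main_df.head())
```

Both options align the rows on the shared date index and give the same table here.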
