Python Beautiful Soup get data from 2 URL's - python

Good day, I am a student taking python classes. We are now learning about Beautiful Soup and I am having trouble extracting data from 2 tables as you will see in the code below:
import pandas as pd
import requests
list_of_urls = ['https://tradingeconomics.com/albania/gdp-growth-annual',
'https://trdingeconomics.com/south-africa/gdp-growth-annual']
final_df = pd.DataFrame()
for i in lists_of_urls:
table = pd.read_html(i, match='Related')
for row in table:
if row.loc['Related'] == 'GDP Annual Growth Rate':
final_df.append(row)
else:
pass

You don't need neither requests nor bs4. pd.read_html does the job.
list_of_urls = ['https://tradingeconomics.com/albania/gdp-growth-annual',
'https://tradingeconomics.com/south-africa/gdp-growth-annual']
data = {}
for i in list_of_urls:
country = i.split('/')[3]
df = pd.read_html(i, match='Related')[0]
data[country] = df.loc[df['Related'] == 'GDP Annual Growth Rate']
df = pd.concat(data)
Output:
>>> df
Related Last Previous Unit Reference
albania 1 GDP Annual Growth Rate 6.99 18.38 percent Sep 2021
south-africa 1 GDP Annual Growth Rate 1.70 2.90 percent Dec 2021

Related

groupby and join with pandas dataframe

Here is part of the data of scaffold_table
import pandas as pd
scaffold_table = pd.DataFrame({
'Position':[2000]*5,
'Company':['Amazon', 'Amazon', 'Alphabet', 'Amazon', 'Alphabet'],
'Date':['2020-05-26','2020-05-27','2020-05-27','2020-05-28','2020-05-28'],
'Ticker':['AMZN','AMZN','GOOG','AMZN','GOOG'],
'Open':[2458.,2404.9899,1417.25,2384.330078,1396.859985],
'Volume':[3568200,5056900,1685800,3190200,1692200],
'Daily Return':[-0.006164,-0.004736,0.000579,-0.003854,-0.000783],
'Daily PnL':[-12.327054,-9.472236,1.157283,-7.708126,-1.565741],
'Cumulative PnL/Ticker':[-12.327054,-21.799290,1.157283,-29.507417,-0.408459]})
I would like to create a summary table that returns the overall yield per ticker. The overall yield should be calculated as the total PnL per ticker divided by the last date's position per ticker
# Create a summary table of your average daily PnL, total PnL, and overall yield per ticker
summary_table = pd.DataFrame(scaffold_table.groupby(['Date','Ticker'])['Daily PnL'].mean())
position_ticker = pd.DataFrame(scaffold_table.groupby(['Date','Ticker'])['Position'].sum())
# the total PnL is the sum of PnL per Ticker after two years period
totals = summary_table.droplevel('Date').groupby('Ticker').sum().rename(columns={'Daily PnL':'total PnL'})
summary_table = summary_table.join(totals, on='Ticker')
summary_table = summary_table.join(position_ticker, on = ['Date','Ticker'], how='inner')
summary_table['Yield'] = summary_table.loc['2022-04-29']['total PnL']/summary_table.loc['2022-04-29']['Position']
summary_table
But the yield is showing NaN, could anyone take a look at my codes?
I used ['2022-04-29'] because it is the last date, but I think there are some codes to return the last date without explicitly inputting that.
I solved the problem with the following code
# we want the overall yield per ticker, so total PnL/Position on the last date
summary_table['Yield'] = summary_table['total PnL']/summary_table.loc['2022-04-29']['Position']
This does not specify the date for total PnL since it's the sum by ticker without regard of the date.
I note the comment in your code saying: "Create a summary table of your average daily PnL, total PnL, and overall yield per ticker".
If we start from this, here are a few observations:
the average daily PnL per ticker should just be the mean of Daily PnL for each ticker
the total PnL per ticker is already listed in the Cumulative PnL/Ticker column, so if we use groupby on Ticker and get the value in the Cumulative PnL/Ticker column for the most recent date (namely, for the last() row in the groupby assuming the df is sorted by date), we don't have to calculate it
for the overall yield per ticker (which you have specified "should be calculated as the total PnL per ticker divided by the last date's position per ticker") we can get the relevant Position (namely, for the most recent date per ticker) analogously to how we got the relevant Cumulative PnL/Ticker and use these two values to calculate Yield.
Here is sample code to do this:
import pandas as pd
scaffold_table = pd.DataFrame({
'Position':[2000]*5,
'Company':['Amazon', 'Amazon', 'Alphabet', 'Amazon', 'Alphabet'],
'Date':['2020-05-26','2020-05-27','2020-05-27','2020-05-28','2020-05-28'],
'Ticker':['AMZN','AMZN','GOOG','AMZN','GOOG'],
'Open':[2458.,2404.9899,1417.25,2384.330078,1396.859985],
'Volume':[3568200,5056900,1685800,3190200,1692200],
'Daily Return':[-0.006164,-0.004736,0.000579,-0.003854,-0.000783],
'Daily PnL':[-12.327054,-9.472236,1.157283,-7.708126,-1.565741],
'Cumulative PnL/Ticker':[-12.327054,-21.799290,1.157283,-29.507417,-0.408459]})
print(scaffold_table)
# Create a summary table of your average daily PnL, total PnL, and overall yield per ticker
gb = scaffold_table.groupby(['Ticker'])
summary_table = gb.last()[['Position', 'Cumulative PnL/Ticker']].rename(columns={'Cumulative PnL/Ticker':'Total PnL'})
summary_table['Yield'] = summary_table['Total PnL'] / summary_table['Position']
summary_table['Average Daily PnL'] = gb['Daily PnL'].mean()
summary_table = summary_table[['Average Daily PnL', 'Total PnL', 'Yield']]
print('\nsummary_table:'); print(summary_table)
exit()
Input:
Position Company Date Ticker Open Volume Daily Return Daily PnL Cumulative PnL/Ticker
0 2000 Amazon 2020-05-26 AMZN 2458.000000 3568200 -0.006164 -12.327054 -12.327054
1 2000 Amazon 2020-05-27 AMZN 2404.989900 5056900 -0.004736 -9.472236 -21.799290
2 2000 Alphabet 2020-05-27 GOOG 1417.250000 1685800 0.000579 1.157283 1.157283
3 2000 Amazon 2020-05-28 AMZN 2384.330078 3190200 -0.003854 -7.708126 -29.507417
4 2000 Alphabet 2020-05-28 GOOG 1396.859985 1692200 -0.000783 -1.565741 -0.408459
Output:
Average Daily PnL Total PnL Yield
Ticker
AMZN -9.835805 -29.507417 -0.014754
GOOG -0.204229 -0.408459 -0.000204

Cannot scrape from table in yahoo finance

I am trying to scrape data from yahoo finance, but I am only able to get data from certain tables on the statistics page at this link https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL. I am able to get data from the top table and the left tables, but I can't figure out why the following program won't scrape from the right tables with values like Beta (5Y Monthly), 52 Week Change,Last Split Factor and Last Split Date
stockStatDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
page = requests.get(URL, headers=headers, timeout=5)
soup = BeautifulSoup(page.content, 'html.parser')
# Find all tables on the page
stock_data = soup.find_all('table')
# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
# Scrape all table rows into variable trs
trs = table.find_all('tr')
for tr in trs:
print('tr: ', tr)
print()
# Scrape all table data tags into variable tds
tds = tr.find_all('td')
print('tds: ', tds)
print()
print()
if len(tds) > 0:
# Index 0 of tds will contain the measurement
# Index 1 of tds will contain the value
# Insert measurement and value into stockDict
stockStatDict[tds[0].get_text()] = [tds[1].get_text()]
stock_stat_df = pd.DataFrame(data=stockStatDict)
print(stock_stat_df.head())
print(stock_stat_df.info())
Any idea why this code isn't retrieving those fields and values?
To get correct response from the Yahoo server, set User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
for t in soup.select("table"):
for tr in t.select("tr:has(td)"):
for sup in tr.select("sup"):
sup.extract()
tds = [td.get_text(strip=True) for td in tr.select("td")]
if len(tds) == 2:
print("{:<50} {}".format(*tds))
Prints:
Market Cap (intraday) 2.34T
Enterprise Value 2.36T
Trailing P/E 31.46
Forward P/E 26.16
PEG Ratio (5 yr expected) 1.51
Price/Sales(ttm) 7.18
Price/Book(mrq) 33.76
Enterprise Value/Revenue 7.24
Enterprise Value/EBITDA 23.60
Beta (5Y Monthly) 1.21
52-Week Change 50.22%
S&P500 52-Week Change 38.38%
52 Week High 145.09
52 Week Low 89.14
50-Day Moving Average 129.28
200-Day Moving Average 129.32
Avg Vol (3 month) 82.16M
Avg Vol (10 day) 64.25M
Shares Outstanding 16.69B
Implied Shares Outstanding N/A
Float 16.67B
% Held by Insiders 0.07%
% Held by Institutions 58.54%
Shares Short (Jun 14, 2021) 108.94M
Short Ratio (Jun 14, 2021) 1.52
Short % of Float (Jun 14, 2021) 0.65%
Short % of Shares Outstanding (Jun 14, 2021) 0.65%
Shares Short (prior month May 13, 2021) 94.75M
Forward Annual Dividend Rate 0.88
Forward Annual Dividend Yield 0.64%
Trailing Annual Dividend Rate 0.82
Trailing Annual Dividend Yield 0.60%
5 Year Average Dividend Yield 1.32
Payout Ratio 18.34%
Dividend Date May 12, 2021
Ex-Dividend Date May 06, 2021
Last Split Factor 4:1
Last Split Date Aug 30, 2020
Fiscal Year Ends Sep 25, 2020
Most Recent Quarter(mrq) Mar 26, 2021
Profit Margin 23.45%
Operating Margin(ttm) 27.32%
Return on Assets(ttm) 16.90%
Return on Equity(ttm) 103.40%
Revenue(ttm) 325.41B
Revenue Per Share(ttm) 19.14
Quarterly Revenue Growth(yoy) 53.60%
Gross Profit(ttm) 104.96B
EBITDA 99.82B
Net Income Avi to Common(ttm) 76.31B
Diluted EPS(ttm) 4.45
Quarterly Earnings Growth(yoy) 110.10%
Total Cash(mrq) 69.83B
Total Cash Per Share(mrq) 4.18
Total Debt(mrq) 134.74B
Total Debt/Equity(mrq) 194.78
Current Ratio(mrq) 1.14
Book Value Per Share(mrq) 4.15
Operating Cash Flow(ttm) 99.59B
Levered Free Cash Flow(ttm) 80.12B

How can get the yearly values from weekly data for a specific year in pandas?

import pandas as pd
import datetime as dt
from pandas_datareader import data, wb
#here I take the stocks from quandl
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
stocks = ["WIKI/AAPL", "WIKI/TSLA", "WIKI/IBM", "WIKI/LNKD"]
key = "fmP5pv-nVvhKmjWwF7Fb"
df = data.DataReader(stocks, "quandl", start, end, api_key=key)
# Here I create a dataFrame with the Volume values.
vol = df['Volume']
# Here I aggregate the data of Volume to weekly
vol['week'] = vol.index.week
vol['year'] = vol.index.year
week = vol.groupby(['year','week']).sum()
Now I'd like to find all the volume traded in the year of 2015, maybe deleting week column band using groupby. Something like that
year = vol.groupby(['2015']).sum()
year.head()
Thank you!
I'm not sure exactly what you're trying to do here. Maybe if you detail out what you're looking for, or what the end result should look like, I'd be closer on my answer.
you can group by like this:
vol.drop('week',axis=1).groupby('year').sum()
Symbols WIKIAAPL WIKIIBM WIKILNKD WIKITSLA
year
2015 1.306432e+10 1.306432e+10 1.306432e+10 1.306432e+10
2016 9.685872e+09 9.685872e+09 9.685872e+09 9.685872e+09
2017 6.640747e+09 6.640747e+09 6.640747e+09 6.640747e+09
2018 2.148944e+09 2.148944e+09 2.148944e+09 2.148944e+09
Or just getting totals for 2015
vol[vol.year == 2015].sum()
Symbols
WIKIAAPL 1.306432e+10
WIKIIBM 1.306432e+10
WIKILNKD 1.306432e+10
WIKITSLA 1.306432e+10
week 6.886000e+03
year 5.077800e+05
dtype: float64
Looks like the volume values are the same in your df for all 4 stocks. I'd double check that.

Calculating cummulative returns mid-year to mid-year Pandas

I have a Pandas dataframe of the size (80219 * 5) with the same structure as the image I have uploaded. The data can range from 2002-2016 for each company but if missing values appear the data either starts at a later date or ends at an earlier date as you can see in the image.
What I would like to do is to calculate yearly compounded returns measured from June to June for each company. If there is no data for the specific company for the full 12 months period from June to June the result should be nan. Below is my current code, but I don't know how to calculate the returns from June to June.
After having loaded the file and cleaned it I:
df[['Returns']] = df[['Returns']].apply(pd.to_numeric)
df['Names Date'] = pd.to_datetime(df['Names Date'])
df['Returns'] = df['Returns']+ 1
df = df[['Company Name','Returns','Names Date']]
df['year']=df['Names Date'].dt.year
df['cum_return'] = df.groupby(['Company Name','year']).cumprod()
df = df.groupby(['Company Name','year']).nth(11)
print(tabulate(df, headers='firstrow', tablefmt='psql'))
Which calculates the annual return from 1st of january to 31st of december..
I finally found a way to do it. The easiest way I could find is to calculate a rolling 12 month compounded return for each month and then slice the dataframe for to give me the 12 month returns of the months I want:
def myfunc(arr):
return np.cumprod(arr)[-1]
cum_ret = pd.Series()
grouped = df.groupby('Company Name')
for name, group in grouped:
cum_ret = cum_ret.append(pd.rolling_apply(group['Returns'],12,myfunc))
df['Cum returns'] = cum_ret
df = df.loc[df['Names Date'].dt.month==6]
df['Names Date'] = df['Names Date'].dt.year

Pause URL request Downloads

import urllib.request
import re
import csv
import pandas as pd
from bs4 import BeautifulSoup
columns = []
data = []
f = open('companylist.csv')
csv_f = csv.reader(f)
for row in csv_f:
stocklist = row
print(stocklist)
for s in stocklist:
print('http://finance.yahoo.com/q?s='+s)
optionsUrl = urllib.request.urlopen('http://finance.yahoo.com/q?s='+s).read()
soup = BeautifulSoup(optionsUrl, "html.parser")
stocksymbol = ['Symbol:', s]
optionsTable = [stocksymbol]+[
[x.text for x in y.parent.contents]
for y in soup.findAll('td', attrs={'class': 'yfnc_tabledata1','rtq_table': ''})
]
if not columns:
columns = [o[0] for o in optionsTable] #list(my_df.loc[0])
data.append(o[1] for o in optionsTable)
# create DataFrame from data
df = pd.DataFrame(data, columns=columns)
df.to_csv('test.csv', index=False)
The scripts works fine when I have about 200 to 300 stocks, but my company list has around 6000 symbols.
Is there a way I can download chunks of data, say like 200 stocks at a time, pause for while, and then resume the download again?
The export is one stock at a time; how do I write 200 at a time, and append the next batch to the initial batch (for the CSV)?
As #Merlin has recommended you - take a closer look at pandas_datareader module - you can do a LOT using this tool. Here is a small example:
import csv
import pandas_datareader.data as data
from pandas_datareader.yahoo.quotes import _yahoo_codes
stocklist = ['aapl','goog','fb','amzn','COP']
#http://www.jarloo.com/yahoo_finance/
#https://greenido.wordpress.com/2009/12/22/yahoo-finance-hidden-api/
_yahoo_codes.update({'Market Cap': 'j1'})
_yahoo_codes.update({'Div Yield': 'y'})
_yahoo_codes.update({'Bid': 'b'})
_yahoo_codes.update({'Ask': 'a'})
_yahoo_codes.update({'Prev Close': 'p'})
_yahoo_codes.update({'Open': 'o'})
_yahoo_codes.update({'1 yr Target Price': 't8'})
_yahoo_codes.update({'Earnings/Share': 'e'})
_yahoo_codes.update({"Day’s Range": 'm'})
_yahoo_codes.update({'52-week Range': 'w'})
_yahoo_codes.update({'Volume': 'v'})
_yahoo_codes.update({'Avg Daily Volume': 'a2'})
_yahoo_codes.update({'EPS Est Current Year': 'e7'})
_yahoo_codes.update({'EPS Est Next Quarter': 'e9'})
data.get_quote_yahoo(stocklist).to_csv('test.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)
Output: i've intentionally transposed the result set, because there are too many columns to show them here
In [2]: data.get_quote_yahoo(stocklist).transpose()
Out[2]:
aapl goog fb amzn COP
1 yr Target Price 124.93 924.83 142.87 800.92 51.23
52-week Range 89.47 - 132.97 515.18 - 789.87 72.000 - 121.080 422.6400 - 731.5000 31.0500 - 64.1300
Ask 97.61 718.75 114.58 716.73 44.04
Avg Daily Volume 3.81601e+07 1.75567e+06 2.56467e+07 3.94018e+06 8.94779e+06
Bid 97.6 718.57 114.57 716.65 44.03
Day’s Range 97.10 - 99.12 716.51 - 725.44 113.310 - 115.480 711.1600 - 721.9900 43.8000 - 44.9600
Div Yield 2.31 N/A N/A N/A 4.45
EPS Est Current Year 8.28 33.6 3.55 5.39 -2.26
EPS Est Next Quarter 1.66 8.38 0.87 0.96 -0.48
Earnings/Share 8.98 24.58 1.635 2.426 -4.979
Market Cap 534.65B 493.46B 327.71B 338.17B 54.53B
Open 98.6 716.51 115 713.37 43.96
PE 10.87 29.25 70.074 295.437 N/A
Prev Close 98.83 719.41 116.62 717.91 44.51
Volume 3.07086e+07 868366 2.70182e+07 2.42218e+06 5.20412e+06
change_pct -1.23% -0.09% -1.757% -0.1644% -1.0782%
last 97.61 718.75 114.571 716.73 44.0301
short_ratio 1.18 1.41 0.81 1.29 1.88
time 3:15pm 3:15pm 3:15pm 3:15pm 3:15pm
If you need more fields (codes for Yahoo Finance API) you may want to check the following links:
http://www.jarloo.com/yahoo_finance/
https://greenido.wordpress.com/2009/12/22/yahoo-finance-hidden-api/
Use python_datareader for this.
In [1]: import pandas_datareader.data as web
In [2]: import datetime
In [3]: start = datetime.datetime(2010, 1, 1)
In [4]: end = datetime.datetime(2013, 1, 27)
In [5]: f = web.DataReader("F", 'yahoo', start, end)
In [6]: f.ix['2010-01-04']
Out[6]:
Open 10.170000
High 10.280000
Low 10.050000
Close 10.280000
Volume 60855800.000000
Adj Close 9.151094
Name: 2010-01-04 00:00:00, dtype: float64
To pause after every 200 downloads, you could - also when you use pandas_datareader:
import time
for i, s in enumerate(stocklist):
if i % 200 == 0:
time.sleep(5) # in seconds
To save all data into a single file (IIUC):
stocks = pd.DataFrame() # to collect all results
In every iteration:
stocks = pd.concat([stocks, pd.DataFrame(data, columns=columns))
Finally:
stocks.to_csv(path, index=False)

Categories