How do I scrape pages with dynamically generated URLs using Python?

I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios, but the traditional URL string-building technique doesn't work because the full company name is inserted in the path, and the exact company name isn't known in advance. Only the company symbol, "IBM", is known.
Essentially, the way I scrape is by looping through an array of company symbols and building the URL string before sending it to urllib2.urlopen(url). But in this case, that can't be done.
For example, the URL for CSCO is
http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios
and the URL for AAPL is
http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios
So in order to get the url, I had to search the symbol in the input box on the main page:
http://www.dailyfinance.com/
When I type "CSCO" into the search input on http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios and watch the Firefox web developer network tab, I see that a GET request is sent to
http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com
and that the Referer header actually gives the path that I want to capture:
Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive
Sorry for the long explanation. So the question is how do I extract the url in the Referer? If that is not possible, how should I approach this problem? Is there another way?
I really appreciate your help.

I like this question. And because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize if you really want to use that is up to you. Requests will save you tons of headaches though.
First off, you're probably looking for a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we?
When I land on the base URL, http://www.dailyfinance.com/, I can do a simple check via Firebug or Chrome's inspect tool that when I put in CSCO or AAPL on the search bar and enable the "jump", there's a 301 Moved Permanently status code. What does this mean?
In simple terms, I was transferred somewhere. The URL for this GET request is the following:
http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO
Now, we test if it works with AAPL by using a simple URL manipulation.
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print r.url
The above gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
See how the URL of the response changed? Let's take the URL manipulation one step further by looking for the /financial-ratios page by appending the below to the above code:
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print p.url
When run, this gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]
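The redirect trick above can be wrapped in a small helper. What follows is a minimal Python 3 sketch rather than the code used in this answer: the names jump_url, ratios_url and resolve_ratios_url are hypothetical, and it assumes the site still answers the jump endpoint with a redirect as observed above.

```python
from urllib.parse import urlencode

JUMP_ENDPOINT = "http://www.dailyfinance.com/quote/jump"  # endpoint observed above


def jump_url(ticker):
    """Build the search 'jump' URL for a ticker symbol."""
    query = urlencode({"exchange-input": "", "ticker-input": ticker})
    return JUMP_ENDPOINT + "?" + query


def ratios_url(resolved_url):
    """Append the financial-ratios path to an already resolved quote URL."""
    return resolved_url.rstrip("/") + "/financial-ratios"


def resolve_ratios_url(ticker):
    """Follow the redirect for a ticker and return its financial-ratios URL."""
    import requests  # imported lazily so the pure helpers work without it

    response = requests.get(jump_url(ticker), timeout=10)
    return ratios_url(response.url)  # response.url is the URL after redirects


print(jump_url("CSCO"))
print(ratios_url("http://www.dailyfinance.com/quote/nasdaq/apple/aapl"))
```

Only resolve_ratios_url touches the network, so the URL building can be checked offline before looping over a list of symbols.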
Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:
from bs4 import BeautifulSoup as bsoup
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print row
I then try running this code, only to encounter an error with the following traceback:
File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'
Of note is the line 'NoneType' object.... This means our target div does not exist! Egads, but why am I seeing the following?!
There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a div in the page.
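One way to confirm this kind of diagnosis is to check whether the raw HTML returned by the server contains the target div at all, before blaming the parsing code. The following is a minimal Python 3 sketch that uses only the standard library (html.parser rather than BeautifulSoup); the class name, the function name and the two sample documents are made up for illustration.

```python
from html.parser import HTMLParser


class DivIdCollector(HTMLParser):
    """Collect the id attribute of every <div> in a document."""

    def __init__(self):
        super().__init__()
        self.div_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div_id = dict(attrs).get("id")
            if div_id is not None:
                self.div_ids.add(div_id)


def has_div(html, div_id):
    """Return True if the raw HTML contains a <div> with the given id."""
    collector = DivIdCollector()
    collector.feed(html)
    return div_id in collector.div_ids


# Hypothetical stand-ins for an outer page and the page that holds the data
outer_page = '<html><body><div id="content"><iframe src="x"></iframe></div></body></html>'
data_page = '<html><body><div id="clear"><table><tr><td>Beta</td></tr></table></div></body></html>'

print(has_div(outer_page, "clear"))  # False
print(has_div(data_page, "clear"))   # True
```

If the check returns False for the response body, the element is being filled in from somewhere else, and the network tab is the next place to look.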
I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:
A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get:
WOW! We now have the exact source of our data. The last hurdle, though, is whether it will work when we try to pull the CSCO data using this string (remember, we went from CSCO to AAPL and are now going back to CSCO again, in case you're confused). Let's clean up the string and ditch www.dailyfinance.com completely. Our new URL is as follows:
http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL
Let's try using that in our final scraper!
from bs4 import BeautifulSoup as bsoup
import requests as rq
csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick
r = rq.get(new_url)
soup = bsoup(r.content)
table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print row.get_text()
And our raw results for CSCO's financial ratios data are as follows:
Company
Industry
Valuation Ratios
P/E Ratio (TTM)
15.40
14.80
P/E High - Last 5 Yrs
24.00
28.90
P/E Low - Last 5 Yrs
8.40
12.10
Beta
1.37
1.50
Price to Sales (TTM)
2.51
2.59
Price to Book (MRQ)
2.14
2.17
Price to Tangible Book (MRQ)
4.25
3.83
Price to Cash Flow (TTM)
11.40
11.60
Price to Free Cash Flow (TTM)
28.20
60.20
Dividends
Dividend Yield (%)
3.30
2.50
Dividend Yield - 5 Yr Avg (%)
N.A.
1.20
Dividend 5 Yr Growth Rate (%)
N.A.
144.07
Payout Ratio (TTM)
45.00
32.00
Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70
Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60
Growth Rates (%)
Sales - 5 Yr Growth Rate (%)
5.51
5.12
EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90
EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90
EPS - 5 Yr Growth Rate (%)
8.91
9.04
Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94
Financial Strength
Quick Ratio (MRQ)
2.40
2.70
Current Ratio (MRQ)
2.60
2.90
LT Debt to Equity (MRQ)
0.22
0.20
Total Debt to Equity (MRQ)
0.31
0.25
Interest Coverage (TTM)
18.90
19.10
Profitability Ratios (%)
Gross Margin (TTM)
63.20
62.50
Gross Margin - 5 Yr Avg
66.30
64.00
EBITD Margin (TTM)
26.20
25.00
EBITD - 5 Yr Avg
28.82
0.00
Pre-Tax Margin (TTM)
21.10
20.00
Pre-Tax Margin - 5 Yr Avg
21.60
18.80
Management Effectiveness (%)
Net Profit Margin (TTM)
17.10
17.65
Net Profit Margin - 5 Yr Avg
17.90
15.40
Return on Assets (TTM)
8.30
8.90
Return on Assets - 5 Yr Avg
8.90
8.00
Return on Investment (TTM)
11.90
12.30
Return on Investment - 5 Yr Avg
12.50
10.90
Efficiency
Revenue/Employee (TTM)
637,890.00
556,027.00
Net Income/Employee (TTM)
108,902.00
98,118.00
Receivable Turnover (TTM)
5.70
5.80
Inventory Turnover (TTM)
11.30
9.70
Asset Turnover (TTM)
0.50
0.50
[Finished in 2.0s]
Cleaning up the data is up to you.
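As one possible starting point for that cleanup, here is a minimal Python 3 sketch. It assumes each row's text holds its cells on separate lines, as the printed output above suggests; the function names and the sample list are hypothetical, with values copied from the output.

```python
def to_number(cell):
    """Convert a cell such as '637,890.00' to float; 'N.A.' becomes None."""
    if cell == "N.A.":
        return None
    return float(cell.replace(",", ""))


def parse_rows(row_texts):
    """Turn raw row texts into {label: (company, industry)}.

    Rows that do not have exactly three cells (section headers such as
    'Dividends', or the 'Company / Industry' header) are skipped.
    """
    ratios = {}
    for text in row_texts:
        cells = [line.strip() for line in text.splitlines() if line.strip()]
        if len(cells) == 3:
            label, company, industry = cells
            ratios[label] = (to_number(company), to_number(industry))
    return ratios


sample = [
    "Company\nIndustry",
    "Valuation Ratios",
    "P/E Ratio (TTM)\n15.40\n14.80",
    "Dividend Yield - 5 Yr Avg (%)\nN.A.\n1.20",
    "Revenue/Employee (TTM)\n637,890.00\n556,027.00",
]
print(parse_rows(sample))
```

In the scraper, the input list would come from something like [row.get_text() for row in rows].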
One good lesson to learn from this scrape is that not all data is contained in one page alone. It's pretty nice to see it coming from another static site. If it had been produced via JavaScript or AJAX calls or the like, we would likely have had some difficulties with our approach.
Hopefully you learned something from this. Let us know if this helps and good luck.

Doesn't answer your specific question, but solves your problem.
http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}
Examples:
http://www.dailyfinance.com/quotes/AAPL/NAS
http://www.dailyfinance.com/quotes/IBM/NYSE
http://www.dailyfinance.com/quotes/CSCO/NAS
To get to the financial ratios page you could then employ something like this:
import urllib2

def financial_ratio_url(symbol, stock_exchange):
    # Build the short-form quote URL, e.g. .../quotes/AAPL/NAS
    starturl = 'http://www.dailyfinance.com/quotes/'
    starturl += '/'.join([symbol, stock_exchange])
    # urlopen follows the redirect; geturl() returns the final URL
    res = urllib2.urlopen(starturl)
    return '/'.join([res.geturl(), 'financial-ratios'])
Example:
financial_ratio_url('AAPL', 'NAS')
'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios'
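urllib2 exists only in Python 2. A rough Python 3 equivalent, assuming the same short-form URL scheme still redirects as described in this answer, might look like the sketch below; quotes_url is a hypothetical helper added so that the URL building can be checked without network access.

```python
from urllib.request import urlopen

BASE = "http://www.dailyfinance.com/quotes/"  # short-form URL from this answer


def quotes_url(symbol, stock_exchange):
    """Build the short-form quote URL, e.g. .../quotes/AAPL/NAS."""
    return BASE + "/".join([symbol, stock_exchange])


def financial_ratio_url(symbol, stock_exchange):
    """Follow the redirect and append the financial-ratios path."""
    with urlopen(quotes_url(symbol, stock_exchange)) as res:
        final_url = res.geturl()  # URL after any redirects
    return "/".join([final_url, "financial-ratios"])


# Build the start URLs for several symbols without touching the network
for symbol, exchange in [("AAPL", "NAS"), ("IBM", "NYSE"), ("CSCO", "NAS")]:
    print(quotes_url(symbol, exchange))
```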

Related

How to scrape yahoo finance news headers with BeautifulSoup?

I would like to scrape the news for a currency pair from Yahoo Finance. How do bs4's find() and find_all() work in this case?
I'm trying to extract the data, but nothing is scraped. Why? What's wrong?
I'm using this, but nothing is printed (except the tickers):
html = BeautifulSoup(source_s, "html.parser") # "html")
news_table_s = html.find_all("div",{"class":"Py(14px) Pos(r)"})
news_tables_s[ticker_s] = news_table_s
print("news_tables", news_tables_s)
I would like to extract the headers from a yahoo finance web page.

You have to iterate your ResultSet to get anything out.
for e in html.find_all("div",{"class":"Py(14px) Pos(r)"}):
    print(e.h3.text)
Recommendation: do not select elements by dynamic classes; use more static ids or the HTML structure instead. Here the elements are selected via a CSS selector:
for e in html.select(' div:has(>h3>a)'):
    print(e.h3.text)
Example
from bs4 import BeautifulSoup
import requests
url='https://finance.yahoo.com/quote/EURUSD%3DX?p=EURUSD%3DX'
# Name the parser explicitly to avoid bs4's "no parser specified" warning
html = BeautifulSoup(requests.get(url).text, 'html.parser')
for e in html.select(' div:has(>h3>a)'):
    print(e.h3.text)
Output
EUR/USD steadies, but bears sharpen claws as dollar feasts on Fed bets
EUR/USD Weekly Forecast – Euro Gives Up Early Gains for the Week
EUR/USD Forecast – Euro Plunges Yet Again on Friday
EUR/USD Forecast – Euro Gives Up Early Gains
EUR/USD Forecast – Euro Continues to Test the Same Indicator
Dollar gains as inflation remains sticky; sterling retreats
Siemens Issues Blockchain Based Euro-Denominated Bond on Polygon Blockchain
EUR/USD Forecast – Euro Rallies
FOREX-Dollar slips with inflation in focus; euro, sterling up on jobs data
FOREX-Jobs figures send euro, sterling higher; dollar slips before CPI

How to get all At-The-Money options using yahoo_fin

I am trying to create a list of all At-The-Money (ATM) option contracts using yahoo_fin options module.
Yahoo_fin offers 2 methods for getting all call and put contracts:
from yahoo_fin import options as ops
# ops.get_calls(Ticker, expiration_date=None)
# ops.get_puts(Ticker, expiration_date=None)
# If no expiration_date is passed, the nearest expiration date is used
ops.get_calls("aapl")
ops.get_puts("aapl")
These two methods return the following dataframes, respectively:
I have done some research into possibly using the strike price and comparing it with the underlying stock price. This is probably the most basic way, but the underlying stock may have a price that is not exactly the same as any option's strike price. Another alternative I have read about is to use delta. Can anybody provide insight into how I could find the ATM options using the data provided by yahoo_fin? Is it possible?

For ATM options, the strike price is equal to the underlying asset's current market price.
However, there is no option for every possible market price, as options are organized in grids. You could take the option whose strike price is closest to the underlying's market price. You can implement it as:
from yahoo_fin import options, stock_info
symbol = "AAPL"
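# .iloc[-1] takes the most recent row by position
last_adj_close = stock_info.get_data(symbol)["adjclose"].iloc[-1]
calls = options.get_calls(symbol)
puts = options.get_puts(symbol)
# Order rows by distance between strike and price; keep the closest one
atm_call = calls.iloc[(calls["Strike"] - last_adj_close).abs().argsort()[:1]]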
Output:
Contract Name Last Trade Date Strike Last Price Bid Ask Change % Change Volume Open Interest Implied Volatility
43 AAPL221118C00149000 2022-11-16 3:59PM EST 149.0 1.58 1.5 1.66 -1.12 -41.48% 22594 14120 40.09%
for the AAPL stock:
open high low close adjclose volume ticker
2022-11-14 148.970001 150.279999 147.429993 148.279999 148.279999 73374100 AAPL
2022-11-15 152.220001 153.589996 148.559998 150.039993 150.039993 89868300 AAPL
You can also obtain the two closest options by changing the slice to .argsort()[:2].
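The nearest-strike idea does not depend on pandas. The sketch below shows the same selection on a plain list in Python 3; the function name, the strike grid and the underlying price are hypothetical values chosen for illustration, not data returned by yahoo_fin.

```python
def closest_strikes(strikes, price, n=1):
    """Return the n strikes nearest to the underlying price."""
    return sorted(strikes, key=lambda strike: abs(strike - price))[:n]


# Hypothetical strike grid and underlying price, for illustration only
strikes = [145.0, 147.0, 148.0, 149.0, 150.0, 152.5]
price = 148.28

print(closest_strikes(strikes, price))       # [148.0]
print(closest_strikes(strikes, price, n=2))  # [148.0, 149.0]
```

Sorting by absolute distance is what the argsort expression does on the "Strike" column; the same function can be applied to the put strikes.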

Cannot scrape from table in yahoo finance

I am trying to scrape data from Yahoo Finance, but I am only able to get data from certain tables on the statistics page at this link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL. I am able to get data from the top table and the left tables, but I can't figure out why the following program won't scrape from the right tables, which hold values like Beta (5Y Monthly), 52 Week Change, Last Split Factor and Last Split Date:
stockStatDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
page = requests.get(URL, headers=headers, timeout=5)
soup = BeautifulSoup(page.content, 'html.parser')
# Find all tables on the page
stock_data = soup.find_all('table')
# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        print('tr: ', tr)
        print()
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        print('tds: ', tds)
        print()
        print()
        if len(tds) > 0:
            # Index 0 of tds will contain the measurement
            # Index 1 of tds will contain the value
            # Insert measurement and value into stockDict
            stockStatDict[tds[0].get_text()] = [tds[1].get_text()]
stock_stat_df = pd.DataFrame(data=stockStatDict)
print(stock_stat_df.head())
print(stock_stat_df.info())
Any idea why this code isn't retrieving those fields and values?

To get the correct response from the Yahoo server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
for t in soup.select("table"):
    for tr in t.select("tr:has(td)"):
        for sup in tr.select("sup"):
            sup.extract()
        tds = [td.get_text(strip=True) for td in tr.select("td")]
        if len(tds) == 2:
            print("{:<50} {}".format(*tds))
Prints:
Market Cap (intraday) 2.34T
Enterprise Value 2.36T
Trailing P/E 31.46
Forward P/E 26.16
PEG Ratio (5 yr expected) 1.51
Price/Sales(ttm) 7.18
Price/Book(mrq) 33.76
Enterprise Value/Revenue 7.24
Enterprise Value/EBITDA 23.60
Beta (5Y Monthly) 1.21
52-Week Change 50.22%
S&P500 52-Week Change 38.38%
52 Week High 145.09
52 Week Low 89.14
50-Day Moving Average 129.28
200-Day Moving Average 129.32
Avg Vol (3 month) 82.16M
Avg Vol (10 day) 64.25M
Shares Outstanding 16.69B
Implied Shares Outstanding N/A
Float 16.67B
% Held by Insiders 0.07%
% Held by Institutions 58.54%
Shares Short (Jun 14, 2021) 108.94M
Short Ratio (Jun 14, 2021) 1.52
Short % of Float (Jun 14, 2021) 0.65%
Short % of Shares Outstanding (Jun 14, 2021) 0.65%
Shares Short (prior month May 13, 2021) 94.75M
Forward Annual Dividend Rate 0.88
Forward Annual Dividend Yield 0.64%
Trailing Annual Dividend Rate 0.82
Trailing Annual Dividend Yield 0.60%
5 Year Average Dividend Yield 1.32
Payout Ratio 18.34%
Dividend Date May 12, 2021
Ex-Dividend Date May 06, 2021
Last Split Factor 4:1
Last Split Date Aug 30, 2020
Fiscal Year Ends Sep 25, 2020
Most Recent Quarter(mrq) Mar 26, 2021
Profit Margin 23.45%
Operating Margin(ttm) 27.32%
Return on Assets(ttm) 16.90%
Return on Equity(ttm) 103.40%
Revenue(ttm) 325.41B
Revenue Per Share(ttm) 19.14
Quarterly Revenue Growth(yoy) 53.60%
Gross Profit(ttm) 104.96B
EBITDA 99.82B
Net Income Avi to Common(ttm) 76.31B
Diluted EPS(ttm) 4.45
Quarterly Earnings Growth(yoy) 110.10%
Total Cash(mrq) 69.83B
Total Cash Per Share(mrq) 4.18
Total Debt(mrq) 134.74B
Total Debt/Equity(mrq) 194.78
Current Ratio(mrq) 1.14
Book Value Per Share(mrq) 4.15
Operating Cash Flow(ttm) 99.59B
Levered Free Cash Flow(ttm) 80.12B
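To feed a DataFrame like the one in the question, the label/value pairs can be collected instead of printed. This is a minimal Python 3 sketch: rows_to_dict is a hypothetical helper, and the rows list stands in for the tds lists built in the loop above (the single-cell row is invented to show the filter).

```python
def rows_to_dict(rows):
    """Keep only label/value rows (exactly two cells) and map label -> value."""
    return {cells[0]: cells[1] for cells in rows if len(cells) == 2}


# Stand-in for the cell texts gathered per table row
rows = [
    ["Market Cap (intraday)", "2.34T"],
    ["Beta (5Y Monthly)", "1.21"],
    ["Currency in USD"],  # hypothetical single-cell row; skipped by the filter
    ["Last Split Factor", "4:1"],
]
stats = rows_to_dict(rows)
print(stats["Beta (5Y Monthly)"])  # 1.21
```

The values stay as strings here; converting suffixes such as T, B or % to numbers would be a separate step.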

Trouble getting a particular item from a static webpage

I'm trying to parse only the currencies from a table in a webpage, but I'm getting completely different results from what the site shows. The missing currencies are available in the page source, so they are static. Where am I going wrong? This link is different from the one I used in another post; I mention this for the sake of clarity.
Site address
I've tried:
import requests
from bs4 import BeautifulSoup
URL = "https://www.forexfactory.com/calendar.php?day=today"
res = requests.get(URL,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("tr.calendar_row"):
    currency = item.select_one("td.calendar__currency").get_text(strip=True)
    print(currency)
Output I'm getting (very different from the ones available in that site):
JPY
JPY
EUR
EUR
GBP
GBP
GBP
EUR
EUR
GBP
USD
USD
USD
GBP
JPY
AUD
AUD
CNY
CNY
CNY
CNY
How can I get all the currencies from that site using requests?

The cookies determine some form of validation and thus the results you see. You only need two of the cookies from your other answer. If you omit the second of those shown below ("ffverifytimes"), your window shifts to start at 5:30am (still returning the same number of results), which is the default. Choose any value other than 1 for "ffverifytimes" and you will get this same window. I assume it is an adjustment that makes the home page aware of the locale's time?
If you omit "ffdstonoff", your window shifts to a 2:30am start.
Add the cookie "fftimezoneoffset":"1" and you can shift the window to start at 11:45pm of the day before.
import requests
from bs4 import BeautifulSoup as bs
cookies={
    "ffdstonoff":"1",
    "ffverifytimes":"1"
}
r = requests.get('https://www.forexfactory.com/calendar.php?day=today', cookies = cookies)
soup = bs(r.content, 'lxml')
currencies = [item.text.strip() for item in soup.select('.currency')]
print(currencies)

Parsing Environment Canada Website

I am trying to scrape the weather forecast from "https://weather.gc.ca/city/pages/ab-52_metric_e.html". With the code below I am able to get the table containing the data but I'm stuck. During the day the second row contains Today's forecast and the third row contains tonight's forecast. At the end of the day the second row becomes Tonight's forecast and Today's forecast is dropped. What I want to do is parse through the table to get the forecast for Today, Tonight, and each continuing day even if Today's forecast is missing; something like this:
Today: A mix of sun and cloud. 60 percent chance of showers this afternoon with risk of a thunderstorm. Widespread smoke. High 26. UV index 6 or high.
Tonight: Partly cloudy. Becoming clear this evening. Increasing cloudiness before morning. Widespread smoke. Low 13.
Friday: Mainly cloudy. Widespread smoke. Wind becoming southwest 30 km/h gusting to 50 in the afternoon. High 24.
#using Beautiful Soup 3, Python 2.6
from BeautifulSoup import BeautifulSoup
import urllib
pageFile = urllib.urlopen("https://weather.gc.ca/city/pages/ab-52_metric_e.html")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
data = soup.find("div", {"id": "mainContent"})
forecast = data.find('table',{'class':"table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"})

You could do something like iterate over each line in the table and get the value of the rows. An example would be:
forecast = data.find('table',{'class':"table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"}).find_all("tr")
for tr in forecast[1:]:
    print " ".join(tr.text.split())
With this approach you get the contents of each row (excluding the first one, which is a header).
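The whitespace handling in that loop can be tried on its own. Below is a minimal Python 3 sketch; the function names are hypothetical and the raw row texts are invented samples shaped like tr.text, not actual output from the Environment Canada page.

```python
def squash(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def forecast_lines(row_texts):
    """Skip the header row and normalise the remaining rows."""
    return [squash(text) for text in row_texts[1:]]


# Hypothetical raw row texts, shaped like tr.text from the forecast table
raw_rows = [
    "\n Forecast \n",
    "\n Tonight \n  Partly cloudy.   Low 13.\n",
    "\n Friday \n  Mainly cloudy.\n  High 24.\n",
]
for line in forecast_lines(raw_rows):
    print(line)
```

Because every row is handled independently, the loop produces the same kind of line whether or not a row for Today is present.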
