How to scrape yahoo finance news headers with BeautifulSoup? - python

I would like to scrape news from Yahoo Finance for a currency pair.
How do bs4's find() and find_all() work?
For this example, with this link: https://finance.yahoo.com/quote/EURUSD%3DX?p=EURUSD%3DX
I'm trying to extract the data ... but no data is scraped. Why? What's wrong?
I'm using this, but nothing is printed (except the tickers):
html = BeautifulSoup(source_s, "html.parser")
news_table_s = html.find_all("div", {"class": "Py(14px) Pos(r)"})
news_tables_s[ticker_s] = news_table_s
print("news_tables", news_tables_s)
I would like to extract the news headers from a Yahoo Finance web page.

You have to iterate your ResultSet to get anything out.
for e in html.find_all("div", {"class": "Py(14px) Pos(r)"}):
    print(e.h3.text)
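For reference: find() returns only the first matching Tag (or None if nothing matches), while find_all() returns a list-like ResultSet of every match, which is why you have to loop over it. A quick sketch:
first = html.find("div", {"class": "Py(14px) Pos(r)"})  # first match, or None
matches = html.find_all("div", {"class": "Py(14px) Pos(r)"})  # ResultSet of all matches
if first is not None:
    print(first.h3.text)
print(len(matches))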
Recommendation - do not use dynamic classes to select elements; prefer more static ids or the HTML structure, here selected via a CSS selector:
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Example
from bs4 import BeautifulSoup
import requests
url = 'https://finance.yahoo.com/quote/EURUSD%3DX?p=EURUSD%3DX'
html = BeautifulSoup(requests.get(url).text, "html.parser")
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Output
EUR/USD steadies, but bears sharpen claws as dollar feasts on Fed bets
EUR/USD Weekly Forecast – Euro Gives Up Early Gains for the Week
EUR/USD Forecast – Euro Plunges Yet Again on Friday
EUR/USD Forecast – Euro Gives Up Early Gains
EUR/USD Forecast – Euro Continues to Test the Same Indicator
Dollar gains as inflation remains sticky; sterling retreats
Siemens Issues Blockchain Based Euro-Denominated Bond on Polygon Blockchain
EUR/USD Forecast – Euro Rallies
FOREX-Dollar slips with inflation in focus; euro, sterling up on jobs data
FOREX-Jobs figures send euro, sterling higher; dollar slips before CPI

Related

How to make BeautifulSoup go to the specific webpage I want instead of a random one on the site?

I am trying to learn web scraping using BeautifulSoup by scraping UFC fight data off of the website Tapology. I have entered the URL of a specific fight's webpage, but every time I run the code it seems to jump to a new random fight on the site instead of this fight. Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
html_text = requests.get(url, timeout=5).text
soup = BeautifulSoup(html_text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')
print(fightresult, fightstats)
Honestly I have no idea how it could be switching to other webpages when I have a very specific URL like the one I am using.
I got the same results ("...every time I run the code it seems to jump to a new random fight...") when I tried your code. As some of the comments suggested, it's probably an effort to evade bots. Maybe the right set of headers could resolve it, but I'm not very good at making requests imitate non-automated browsers - so in situations like these, I sometimes use HTMLSession (or cloudscraper, or even ScrapingAnt, and finally selenium if none of the others work).
from requests_html import HTMLSession
from bs4 import BeautifulSoup
url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
req = HTMLSession().get(url)
soup = BeautifulSoup(req.text, 'html.parser')
fightstats = soup.find_all('td')
fightresult = soup.find_all('div', class_='boutResultHolder')
print('\n\n\n'.join('\n'.join(  # strip extra whitespace from the text for readability
    ' '.join(w for w in t.text.split() if w) for t in f if t.text.strip()
) for f in [fightresult, fightstats]))
For me, that prints
Keith Jardine defeats Chuck Liddell via 3 Round Decision #30 Biggest Upset of All Time #97 Greatest Light Heavy MMA Fight of All Time
13-3-1
Pro Record At Fight
20-4-0
Climbed to 14-3
Record After Fight
Fell to 20-5
+290 (Moderate Underdog)
Betting Odds
-395 (Moderate Favorite)
United States
Nationality
United States
Albuquerque, New Mexico
Fighting out of
San Luis Obispo, California
31 years, 10 months, 3 weeks, 1 day
Age at Fight
37 years, 9 months, 5 days
204.5 lbs (92.8 kgs)
Weigh-In Result
205.5 lbs (93.2 kgs)
6'1" (186cm)
Height
6'2" (188cm)
76.0" (193cm)
Reach
76.5" (194cm)
Jackson Wink MMA
Gym
The Pit
Invicta FC 50
11.16.2022, 9:00 PM ET
Bellator 288
11.18.2022, 6:00 PM ET
ONE on Prime Video 4
11.18.2022, 7:00 PM ET
LFA 147
11.18.2022, 6:00 PM ET
ONE Championship 163
11.19.2022, 4:30 AM ET
UFC Fight Night
11.19.2022, 1:00 PM ET
Cage Warriors 147: Unplugg...
11.20.2022, 12:00 PM ET
PFL 10
11.25.2022, 5:30 PM ET
Jiří "Denisa" Procházka
Glover Teixeira
Jan Błachowicz
Magomed Ankalaev
Aleksandar "Rocket" Rakić
Anthony "Lionheart" Smith
Jamahal "Sweet Dreams" Hill
Nikita "The Miner" Krylov
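As an aside, if you want to try plain requests first, sending browser-like headers is the usual starting point. This is only a sketch under the assumption that a User-Agent check is what trips the bot detection; Tapology may well require more (cookies, JavaScript rendering, etc.):
import requests
from bs4 import BeautifulSoup
url = 'https://www.tapology.com/fightcenter/bouts/2093-ufc-76-the-dean-of-mean-keith-jardine-vs-chuck-the-iceman-liddell'
# a typical desktop browser User-Agent; whether this alone is enough here is an assumption
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
r = requests.get(url, headers=headers, timeout=5)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.text if soup.title else r.status_code)  # quick check of what came back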

Scraping all entries of lazyloading page using python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.
I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links.
How to get the rest?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)  # url: the ECB press releases page linked above
element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')
The data is loaded via JavaScript from another URL. You can use this example how to load the releases from different years:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)
Prints:
25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
EDIT: To print links:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        )
Prints:
...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...
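If you want the entries as data rather than printed output, the same loop can collect rows and write a CSV. A minimal sketch building directly on the code above (the output filename is arbitrary):
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
rows = []
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        # same fields as printed above: date, title, absolute link
        rows.append([
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        ])
with open("ecb_press_releases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "title", "link"])
    writer.writerows(rows)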

Trouble getting a particular item from a static webpage

I'm trying to parse only the currencies from a table in a webpage, but I'm getting completely different results from that site. The missing currencies are available in the page source, so they are static. Where am I going wrong? This link is different from the one I used in another post; I thought to mention this for the sake of clarity.
Site address
I've tried:
import requests
from bs4 import BeautifulSoup
URL = "https://www.forexfactory.com/calendar.php?day=today"
res = requests.get(URL,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("tr.calendar_row"):
    currency = item.select_one("td.calendar__currency").get_text(strip=True)
    print(currency)
Output I'm getting (very different from the ones available in that site):
JPY
JPY
EUR
EUR
GBP
GBP
GBP
EUR
EUR
GBP
USD
USD
USD
GBP
JPY
AUD
AUD
CNY
CNY
CNY
CNY
How can I get all the currencies from that site using requests?
The cookies determine some form of validation and thus the results you see. You only need two of them from your other answer. If you omit the second of those shown below, your window shifts to start at 5:30am (still returning the same number of results), which is the default return - choose any value other than 1 for "ffverifytimes" and you will get this same window. I assume it is an adjustment to make the home page time-aware for the locale.
If you omit "ffdstonoff", your window shifts to a 2:30am start.
Add in the cookie "fftimezoneoffset": "1" and you can shift the window to start at 11:45pm of the day before.
import requests
from bs4 import BeautifulSoup as bs
cookies = {
    "ffdstonoff": "1",
    "ffverifytimes": "1"
}
r = requests.get('https://www.forexfactory.com/calendar.php?day=today', cookies = cookies)
soup = bs(r.content, 'lxml')
currencies = [item.text.strip() for item in soup.select('.currency')]
print(currencies)
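If you plan to make several requests, a requests.Session carries these cookies (plus any the server sets) across calls. A small sketch of the same fetch:
import requests
from bs4 import BeautifulSoup as bs
with requests.Session() as s:
    s.cookies.update({"ffdstonoff": "1", "ffverifytimes": "1"})
    r = s.get('https://www.forexfactory.com/calendar.php?day=today')
    soup = bs(r.content, 'lxml')
    print([item.text.strip() for item in soup.select('.currency')])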

Parsing Environment Canada Website

I am trying to scrape the weather forecast from "https://weather.gc.ca/city/pages/ab-52_metric_e.html". With the code below I am able to get the table containing the data but I'm stuck. During the day the second row contains Today's forecast and the third row contains tonight's forecast. At the end of the day the second row becomes Tonight's forecast and Today's forecast is dropped. What I want to do is parse through the table to get the forecast for Today, Tonight, and each continuing day even if Today's forecast is missing; something like this:
Today: A mix of sun and cloud. 60 percent chance of showers this afternoon with risk of a thunderstorm. Widespread smoke. High 26. UV index 6 or high.
Tonight: Partly cloudy. Becoming clear this evening. Increasing cloudiness before morning. Widespread smoke. Low 13.
Friday: Mainly cloudy. Widespread smoke. Wind becoming southwest 30 km/h gusting to 50 in the afternoon. High 24.
#using Beautiful Soup 3, Python 2.6
from BeautifulSoup import BeautifulSoup
import urllib
pageFile = urllib.urlopen("https://weather.gc.ca/city/pages/ab-52_metric_e.html")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
data = soup.find("div", {"id": "mainContent"})
forecast = data.find('table',{'class':"table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"})
You could do something like iterate over each row in the table and get the text of each one. An example would be:
forecast = data.find('table', {'class': "table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"}).findAll("tr")  # findAll works in both BeautifulSoup 3 and bs4
for tr in forecast[1:]:
    print " ".join(tr.text.split())
With this approach you get the contents of each row (excluding the first one, which is a header).
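To get the "Today: ...", "Tonight: ..." format regardless of which period the table starts with, you can split each row into its first cell and the remaining cells. This is a sketch matching the question's Python 2 / BeautifulSoup 3 setup, under the assumption that each forecast row's first cell names the period:
# assumption: the first cell of each row holds the period (Today, Tonight, Friday, ...)
for tr in forecast[1:]:
    cells = tr.findAll(['th', 'td'])
    if len(cells) >= 2:
        period = " ".join(cells[0].text.split())
        detail = " ".join(" ".join(c.text.split()) for c in cells[1:])
        print "%s: %s" % (period, detail)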

How do I scrape pages with dynamically generated URLs using Python?

I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios, but the traditional URL string-building technique doesn't work because the full company name is inserted in the path, and the exact full company name isn't known in advance. Only the company symbol, "IBM", is known.
Essentially, the way I scrape is by looping through an array of company symbol and build the url string before sending it to urllib2.urlopen(url). But in this case, that can't be done.
For example, the CSCO URL is
http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios
and another example URL, for AAPL, is:
http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios
So in order to get the URL, I had to search for the symbol in the input box on the main page:
http://www.dailyfinance.com/
I've noticed that when I type "CSCO" and inspect the search input at http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios in the Firefox web developer network tab, the GET request is sent to
http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com
and that the Referer actually gives the path that I want to capture:
Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive
Sorry for the long explanation. So the question is how do I extract the url in the Referer? If that is not possible, how should I approach this problem? Is there another way?
I really appreciate your help.
I like this question. And because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize if you really want to use that is up to you. Requests will save you tons of headaches though.
First off, you're probably looking for a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we?
When I land on the base URL, http://www.dailyfinance.com/, a simple check via Firebug or Chrome's inspect tool shows that when I put CSCO or AAPL into the search bar and trigger the "jump", there's a 301 Moved Permanently status code. What does this mean?
In simple terms, I was transferred somewhere. The URL for this GET request is the following:
http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO
Now, we test if it works with AAPL by using a simple URL manipulation.
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print r.url
The above gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
See how the URL of the response changed? Let's take the URL manipulation one step further by looking for the /financial-ratios page by appending the below to the above code:
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print p.url
When run, this gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]
Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:
from bs4 import BeautifulSoup as bsoup
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = div.find_all("tr")
for row in rows:
    print row
I then try running this code, only to encounter an error with the following traceback:
File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
    div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'
Of note is the line 'NoneType' object.... This means our target div does not exist! Egads, but why can I see the table when I view the page in my browser?!
There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a div in the page.
I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:
A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get:
WOW! We now have the very exact source of our data. The last hurdle, though, is whether it will work when we try to pull the CSCO data using this string (remember we went CSCO -> AAPL and now back to CSCO again, so you're not confused). Let's clean up the string and ditch the role of www.dailyfinance.com here completely. Our new URL is as follows:
http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL
Let's try using that in our final scraper!
from bs4 import BeautifulSoup as bsoup
import requests as rq
csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick
r = rq.get(new_url)
soup = bsoup(r.content)
table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print row.get_text()
And our raw results for CSCO's financial ratios data is as follows:
Company
Industry
Valuation Ratios
P/E Ratio (TTM)
15.40
14.80
P/E High - Last 5 Yrs
24.00
28.90
P/E Low - Last 5 Yrs
8.40
12.10
Beta
1.37
1.50
Price to Sales (TTM)
2.51
2.59
Price to Book (MRQ)
2.14
2.17
Price to Tangible Book (MRQ)
4.25
3.83
Price to Cash Flow (TTM)
11.40
11.60
Price to Free Cash Flow (TTM)
28.20
60.20
Dividends
Dividend Yield (%)
3.30
2.50
Dividend Yield - 5 Yr Avg (%)
N.A.
1.20
Dividend 5 Yr Growth Rate (%)
N.A.
144.07
Payout Ratio (TTM)
45.00
32.00
Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70
Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60
Growth Rates (%)
Sales - 5 Yr Growth Rate (%)
5.51
5.12
EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90
EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90
EPS - 5 Yr Growth Rate (%)
8.91
9.04
Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94
Financial Strength
Quick Ratio (MRQ)
2.40
2.70
Current Ratio (MRQ)
2.60
2.90
LT Debt to Equity (MRQ)
0.22
0.20
Total Debt to Equity (MRQ)
0.31
0.25
Interest Coverage (TTM)
18.90
19.10
Profitability Ratios (%)
Gross Margin (TTM)
63.20
62.50
Gross Margin - 5 Yr Avg
66.30
64.00
EBITD Margin (TTM)
26.20
25.00
EBITD - 5 Yr Avg
28.82
0.00
Pre-Tax Margin (TTM)
21.10
20.00
Pre-Tax Margin - 5 Yr Avg
21.60
18.80
Management Effectiveness (%)
Net Profit Margin (TTM)
17.10
17.65
Net Profit Margin - 5 Yr Avg
17.90
15.40
Return on Assets (TTM)
8.30
8.90
Return on Assets - 5 Yr Avg
8.90
8.00
Return on Investment (TTM)
11.90
12.30
Return on Investment - 5 Yr Avg
12.50
10.90
Efficiency
Revenue/Employee (TTM)
637,890.00
556,027.00
Net Income/Employee (TTM)
108,902.00
98,118.00
Receivable Turnover (TTM)
5.70
5.80
Inventory Turnover (TTM)
11.30
9.70
Asset Turnover (TTM)
0.50
0.50
[Finished in 2.0s]
Cleaning up the data is up to you.
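As one possible cleanup step (a sketch, not part of the original answer; it assumes each data row carries exactly three td cells: metric name, company value, industry value, as the output above suggests):
ratios = {}
for row in rows:
    cells = [c.get_text(strip=True) for c in row.find_all("td")]
    if len(cells) == 3 and cells[0]:  # skip section headers and blank rows
        ratios[cells[0]] = {"Company": cells[1], "Industry": cells[2]}
print ratios.get("P/E Ratio (TTM)")  # e.g. {'Company': '15.40', 'Industry': '14.80'}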
One good lesson to learn from this scrape is that not all the data is contained in one page alone. It's pretty nice to see it coming from another static site. If it were produced via JavaScript or AJAX calls or the like, we would likely have some difficulties with our approach.
Hopefully you learned something from this. Let us know if this helps and good luck.
Doesn't answer your specific question, but solves your problem.
http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}
Examples:
http://www.dailyfinance.com/quotes/AAPL/NAS
http://www.dailyfinance.com/quotes/IBM/NYSE
http://www.dailyfinance.com/quotes/CSCO/NAS
To get to the financial ratios page you could then employ something like this:
import urllib2
def financial_ratio_url(symbol, stock_exchange):
    starturl = 'http://www.dailyfinance.com/quotes/'
    starturl += '/'.join([symbol, stock_exchange])
    res = urllib2.urlopen(starturl)  # follows the redirect to the canonical quote URL
    return '/'.join([res.geturl(), 'financial-ratios'])
Example:
financial_ratio_url('AAPL', 'NAS')
'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios'
