Trouble getting a particular item from a static webpage


I'm trying to parse only the currencies from a table in a webpage, but the results I get are completely different from what the site shows. The missing currencies are available in the page source, so they are static. Where am I going wrong? Note that this link is different from the one I used in another post; I mention this for the sake of clarity.
Site address
I've tried:
import requests
from bs4 import BeautifulSoup
URL = "https://www.forexfactory.com/calendar.php?day=today"
res = requests.get(URL,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("tr.calendar_row"):
    currency = item.select_one("td.calendar__currency").get_text(strip=True)
    print(currency)
Output I'm getting (very different from what is shown on that site):
JPY
JPY
EUR
EUR
GBP
GBP
GBP
EUR
EUR
GBP
USD
USD
USD
GBP
JPY
AUD
AUD
CNY
CNY
CNY
CNY
How can I get all the currencies from that site using requests?

The cookies determine some form of validation and thus the results you see. You only need two of the cookies from the answer to your other question. If you omit the second of those shown below ("ffverifytimes"), your window shifts to start at 5:30am (still returning the same number of results), which is the default. Setting "ffverifytimes" to any value other than 1 gives this same window. I assume it is an adjustment that makes the home page time-aware for the locale.
If you omit "ffdstonoff", your window shifts to a 2:30am start.
Add the cookie "fftimezoneoffset":"1" and you can shift the window to start at 11:45pm of the day before.
import requests
from bs4 import BeautifulSoup as bs
cookies = {
    "ffdstonoff": "1",
    "ffverifytimes": "1",
}
r = requests.get('https://www.forexfactory.com/calendar.php?day=today', cookies = cookies)
soup = bs(r.content, 'lxml')
currencies = [item.text.strip() for item in soup.select('.currency')]
print(currencies)
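The cookie combinations described above can be kept in one place so that the resulting windows are easy to compare. This is only a sketch: the helper `without` and the `VARIANTS` table are names made up for illustration, and the start times in the labels are the ones reported in this answer, not something the code checks.

```python
# Cookie sets described in the answer; the labels repeat the reported window starts.
BASE = {"ffdstonoff": "1", "ffverifytimes": "1"}


def without(cookies, name):
    """Return a copy of `cookies` with the cookie `name` left out (hypothetical helper)."""
    return {key: value for key, value in cookies.items() if key != name}


VARIANTS = {
    "both cookies": dict(BASE),
    "no ffverifytimes (reported 5:30am start)": without(BASE, "ffverifytimes"),
    "no ffdstonoff (reported 2:30am start)": without(BASE, "ffdstonoff"),
    "plus fftimezoneoffset (reported 11:45pm previous day)": {**BASE, "fftimezoneoffset": "1"},
}

for label, cookies in VARIANTS.items():
    # Each dict can be passed to requests as requests.get(URL, cookies=cookies).
    print(label, "->", sorted(cookies))
```

Looping over the variants with the same request and comparing the first row of each response should show which cookie moves the window.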

Related

How to scrape yahoo finance news headers with BeautifulSoup?

I would like to scrape news from Yahoo Finance for a currency pair.
How does bs4's find() or find_all() work?
for this example:
with this link:
I'm trying to extract the data, but no data is scraped. Why? What's wrong?
I'm using this, but nothing is printed (except the tickers):
html = BeautifulSoup(source_s, "html.parser") # "html")
news_table_s = html.find_all("div",{"class":"Py(14px) Pos(r)"})
news_tables_s[ticker_s] = news_table_s
print("news_tables", news_tables_s)
I would like to extract the headers from a yahoo finance web page.

You have to iterate your ResultSet to get anything out.
for e in html.find_all("div",{"class":"Py(14px) Pos(r)"}):
    print(e.h3.text)
Recommendation: do not select elements by dynamic classes; use more stable ids or the HTML structure instead. Here the elements are selected via a CSS selector:
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Example
from bs4 import BeautifulSoup
import requests
url='https://finance.yahoo.com/quote/EURUSD%3DX?p=EURUSD%3DX'
html = BeautifulSoup(requests.get(url).text)
for e in html.select('div:has(>h3>a)'):
    print(e.h3.text)
Output
EUR/USD steadies, but bears sharpen claws as dollar feasts on Fed bets
EUR/USD Weekly Forecast – Euro Gives Up Early Gains for the Week
EUR/USD Forecast – Euro Plunges Yet Again on Friday
EUR/USD Forecast – Euro Gives Up Early Gains
EUR/USD Forecast – Euro Continues to Test the Same Indicator
Dollar gains as inflation remains sticky; sterling retreats
Siemens Issues Blockchain Based Euro-Denominated Bond on Polygon Blockchain
EUR/USD Forecast – Euro Rallies
FOREX-Dollar slips with inflation in focus; euro, sterling up on jobs data
FOREX-Jobs figures send euro, sterling higher; dollar slips before CPI
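To try both approaches without hitting the live site, here is a small self-contained sketch. The HTML in `SAMPLE` is made up to imitate the structure discussed above; the real page may differ and its class names change over time.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure discussed above.
SAMPLE = """
<div class="Py(14px) Pos(r)"><h3><a href="/a">First headline</a></h3><p>Summary</p></div>
<div class="Py(14px) Pos(r)"><h3><a href="/b">Second headline</a></h3></div>
<div class="other"><p>No headline here</p></div>
"""

html = BeautifulSoup(SAMPLE, "html.parser")

# find_all() returns a ResultSet (a list of tags), so each tag is read in a loop.
by_class = [e.h3.text for e in html.find_all("div", {"class": "Py(14px) Pos(r)"})]

# Structural selector: any div with an h3 child that itself has an a child.
by_structure = [e.h3.text for e in html.select("div:has(> h3 > a)")]

print(by_class)
print(by_structure)
```

Both lists contain the same two headlines here, but only the structural selector keeps working if the generated class names change.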

How to get all At-The-Money options using yahoo_fin

I am trying to create a list of all At-The-Money (ATM) option contracts using yahoo_fin options module.
Yahoo_fin offers 2 methods for getting all call and put contracts:
from yahoo_fin import options as ops
# ops.get_calls(Ticker, expiration_date=None)
# ops.get_puts(Ticker, expiration_date=None)
# If no expiration_date is passed, the nearest expiration date is used
ops.get_calls("aapl")
ops.get_puts("aapl")
These two methods return the following dataframes, respectively:
I have done some research into possibly using the strike price and comparing it with the underlying stock price. This is probably the most basic way, but the underlying stock may have a price that is not exactly the same as an option's strike price. Another alternative I have read about is to use delta. Can anybody provide insight into how I could find the ATM options using the data provided by yahoo_fin? Is it possible?

For ATM options the strike price is equal to the underlying asset’s current market price, as explained here.
However, there is no option for every possible market price, as options are organized in grids. You could pick the option whose strike price is closest to the underlying's market price. You can implement it as:
from yahoo_fin import options, stock_info
symbol = "AAPL"
# .iloc[-1] selects the most recent row by position
last_adj_close = stock_info.get_data(symbol)["adjclose"].iloc[-1]
calls = options.get_calls(symbol)
puts = options.get_puts(symbol)
atm_call = calls.iloc[(calls["Strike"] - last_adj_close).abs().argsort()[:1]]
Output:
          Contract Name        Last Trade Date  Strike  Last Price  Bid   Ask  Change  % Change  Volume  Open Interest  Implied Volatility
43  AAPL221118C00149000  2022-11-16 3:59PM EST   149.0        1.58  1.5  1.66   -1.12   -41.48%   22594          14120              40.09%
for the AAPL stock:
                  open        high         low       close    adjclose    volume  ticker
2022-11-14  148.970001  150.279999  147.429993  148.279999  148.279999  73374100    AAPL
2022-11-15  152.220001  153.589996  148.559998  150.039993  150.039993  89868300    AAPL
You can also obtain the two closest options by changing the slice to .argsort()[:2].
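The same "nearest strike" idea can be checked without pandas or a network call. The sketch below uses a hypothetical helper, `closest_strikes`, and illustrative numbers that are not real quotes.

```python
def closest_strikes(strikes, price, n=1):
    """Return the n strikes nearest to price, nearest first (hypothetical helper)."""
    return sorted(strikes, key=lambda strike: abs(strike - price))[:n]


# Illustrative numbers only, not real quotes.
strikes = [145.0, 147.0, 148.0, 149.0, 150.0, 152.5]
price = 148.8

print(closest_strikes(strikes, price))       # [149.0]
print(closest_strikes(strikes, price, n=2))  # [149.0, 148.0]
```

Sorting by absolute distance is the plain-Python counterpart of the `(calls["Strike"] - last_adj_close).abs().argsort()` expression used above.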

Scraping all entries of lazyloading page using python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.
I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links.
How to get the rest?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)
element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')

The data is loaded via JavaScript from another URL. You can use this example to load the releases from different years:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)
Prints:
25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
EDIT: To print links:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        )
Prints:
...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...
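The year range above is hard-coded, and the links are built by string concatenation. As a possible variation, the sketch below derives the last year from today's date and uses `urljoin` for the links. The names `year_urls` and `absolute` are made up for illustration, and the URL pattern is the one used in this answer.

```python
from datetime import date
from urllib.parse import urljoin

BASE = "https://www.ecb.europa.eu"
INDEX = "/press/pr/date/{}/html/index_include.en.html"


def year_urls(first=1997, last=None):
    """Yield one index URL per year, up to and including `last` (default: this year)."""
    if last is None:
        last = date.today().year
    for year in range(first, last + 1):
        yield urljoin(BASE, INDEX.format(year))


def absolute(href):
    """Turn a site-relative href taken from the page into a full link."""
    return urljoin(BASE, href)


for url in year_urls(1997, 1999):
    print(url)
print(absolute("/press/pr/date/1999/html/pr991215.en.html"))
```

Each URL from `year_urls()` can be fetched and parsed exactly as in the loop above.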

BeautifulSoup Python Extracting Tag Title For Specific Tags With Attribute

I'm working on a scraper using BeautifulSoup to pull concert information for certain artists on Songkick. The URL I'm working with is https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page=1. I've been able to extract all artist, venue, city, and state info; the only thing I'm having trouble with is extracting the dates of the concerts.
Looking at the HTML elements, I see that the dates for shows are listed as li title="Saturday 01 February 2020" values, for example on the children under ul class="event-listings". One method I attempted was extracting the time datetime values that are nested under the titled li elements, but my output included the entire HTML markup for each time element instead of just the datetime. I'm looking to extract either the li titles or the time datetime values. These li's don't have a class either.
Here is some of my code
import requests
from bs4 import BeautifulSoup as bs4
pages=[]
artists=[]
venues=[]
dates=[]
cities=[]
states=[]
pages_to_scrape=1
for i in range(1, pages_to_scrape+1):
    url = 'https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page={}'.format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    for m in soup.findAll('li', title=True):
        date = m.find('time')
        print(date)
Output:
<time datetime="2020-02-01T20:00:00-0800"></time>
<time datetime="2020-02-01T20:00:00-0800"></time>
<time datetime="2020-02-01T19:00:00-0800"></time>
<time datetime="2020-02-01T19:00:00-0800"></time>
<time datetime="2020-02-01T21:00:00-0800"></time>
etc...
Looking for output like this:
2020-02-01
2020-02-01
2020-02-01
etc...
Or, if I'm able to grab the title values of the li's somehow, output like this:
Saturday 01 February 2020
Saturday 01 February 2020
Saturday 01 February 2020
Saturday 01 February 2020
etc...
I'm curious whether I can split the time datetime at the ", but since it's not text I don't think that's possible. Also, I don't want to grab the first li with class="with-date", as that is just the date headline for the page, which is why I'm not simply grabbing all li's.

Try m.find('time')['datetime'] instead of m.find('time')

Here's a way to achieve this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page=1")
soup = BeautifulSoup(page.content, "html.parser")
tags = soup.find_all("time")
# Keep only the date part of each ISO timestamp, e.g. "2020-02-01"
dates = [t["datetime"].split("T")[0] for t in tags]
print(dates)
Notes:
I'm quite sure that crawling Songkick in this way violates their terms and conditions.
You might consider using their API, which works well: https://www.songkick.com/developer
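Once the datetime attribute values are extracted, both output formats asked for in the question can be produced with the standard library. A minimal sketch, assuming the values look like the ones printed in the question; `concert_dates` is a made-up helper name.

```python
from datetime import datetime


def concert_dates(stamps):
    """Convert datetime attribute strings into (ISO date, long date) pairs."""
    parsed = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S%z") for s in stamps]
    return [(d.date().isoformat(), d.strftime("%A %d %B %Y")) for d in parsed]


# Values copied from the output shown in the question.
stamps = ["2020-02-01T20:00:00-0800", "2020-02-01T19:00:00-0800"]

for iso_date, long_date in concert_dates(stamps):
    print(iso_date, "|", long_date)
```

Parsing the timestamp rather than splitting the string also validates it, so a malformed value raises a ValueError instead of passing through silently.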

How do I scrape pages with dynamically generated URLs using Python?

I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios, but the traditional URL string building technique doesn't work because the full company name is inserted in the path, and the exact full company name isn't known in advance. Only the company symbol, "IBM", is known.
Essentially, the way I scrape is by looping through an array of company symbols and building the URL string before sending it to urllib2.urlopen(url). But in this case, that can't be done.
For example, the URL string for CSCO is
http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios
and another example, the URL string for AAPL, is:
http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios
So in order to get the url, I had to search the symbol in the input box on the main page:
http://www.dailyfinance.com/
I've noticed that when I type "CSCO" into the search input at http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios and inspect the request in the Network tab of Firefox's web developer tools, the GET request is sent to
http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com
and the Referer actually gives the path that I want to capture:
Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive
Sorry for the long explanation. So the question is how do I extract the url in the Referer? If that is not possible, how should I approach this problem? Is there another way?
I really appreciate your help.

I like this question. And because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize if you really want to use that is up to you. Requests will save you tons of headaches though.
First off, you're probably looking for a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we?
When I land on the base URL, http://www.dailyfinance.com/, I can do a simple check via Firebug or Chrome's inspect tool that when I put in CSCO or AAPL on the search bar and enable the "jump", there's a 301 Moved Permanently status code. What does this mean?
In simple terms, I was transferred somewhere. The URL for this GET request is the following:
http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO
Now, we test if it works with AAPL by using a simple URL manipulation.
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print(r.url)
The above gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
See how the URL of the response changed? Let's take the URL manipulation one step further by looking for the /financial-ratios page by appending the below to the above code:
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print(p.url)
When run, this gives the following result:
http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]
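Plain concatenation works here because the redirected URL has no trailing slash and no query string. The Referer shown in the question does carry a query string (?source=...), so a slightly more defensive way to add the path segment might look like the sketch below. `append_segment` is a hypothetical helper, not part of Requests.

```python
from urllib.parse import urlsplit, urlunsplit


def append_segment(url, segment):
    """Append a path segment to url, keeping any query string in place."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/" + segment.strip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


base = "http://www.dailyfinance.com/quote/nasdaq/apple/aapl"
print(append_segment(base, "financial-ratios"))
print(append_segment(base + "/", "financial-ratios"))
print(append_segment(base + "?source=x", "financial-ratios"))
```

In the scraper this would replace `r.url + "/financial-ratios"` with `append_segment(r.url, "financial-ratios")`.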
Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:
from bs4 import BeautifulSoup as bsoup
import requests as rq
apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = div.find_all("tr")
for row in rows:
    print(row)
I then try running this code, only to encounter an error with the following traceback:
File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'
Of note is the line 'NoneType' object.... This means our target div does not exist! Egads, but why am I seeing the following?!
There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a div in the page.
I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:
A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get:
WOW! We now have the exact source of our data. The last hurdle, though, is whether it will work when we try to pull the CSCO data using this string (remember we went CSCO -> AAPL and now back to CSCO again, so you're not confused). Let's clean up the string and ditch the role of www.dailyfinance.com here completely. Our new URL is as follows:
http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL
Let's try using that in our final scraper!
from bs4 import BeautifulSoup as bsoup
import requests as rq
csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick
r = rq.get(new_url)
soup = bsoup(r.content)
table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print(row.get_text())
And our raw results for CSCO's financial ratios data are as follows:
Company
Industry
Valuation Ratios
P/E Ratio (TTM)
15.40
14.80
P/E High - Last 5 Yrs
24.00
28.90
P/E Low - Last 5 Yrs
8.40
12.10
Beta
1.37
1.50
Price to Sales (TTM)
2.51
2.59
Price to Book (MRQ)
2.14
2.17
Price to Tangible Book (MRQ)
4.25
3.83
Price to Cash Flow (TTM)
11.40
11.60
Price to Free Cash Flow (TTM)
28.20
60.20
Dividends
Dividend Yield (%)
3.30
2.50
Dividend Yield - 5 Yr Avg (%)
N.A.
1.20
Dividend 5 Yr Growth Rate (%)
N.A.
144.07
Payout Ratio (TTM)
45.00
32.00
Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70
Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60
Growth Rates (%)
Sales - 5 Yr Growth Rate (%)
5.51
5.12
EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90
EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90
EPS - 5 Yr Growth Rate (%)
8.91
9.04
Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94
Financial Strength
Quick Ratio (MRQ)
2.40
2.70
Current Ratio (MRQ)
2.60
2.90
LT Debt to Equity (MRQ)
0.22
0.20
Total Debt to Equity (MRQ)
0.31
0.25
Interest Coverage (TTM)
18.90
19.10
Profitability Ratios (%)
Gross Margin (TTM)
63.20
62.50
Gross Margin - 5 Yr Avg
66.30
64.00
EBITD Margin (TTM)
26.20
25.00
EBITD - 5 Yr Avg
28.82
0.00
Pre-Tax Margin (TTM)
21.10
20.00
Pre-Tax Margin - 5 Yr Avg
21.60
18.80
Management Effectiveness (%)
Net Profit Margin (TTM)
17.10
17.65
Net Profit Margin - 5 Yr Avg
17.90
15.40
Return on Assets (TTM)
8.30
8.90
Return on Assets - 5 Yr Avg
8.90
8.00
Return on Investment (TTM)
11.90
12.30
Return on Investment - 5 Yr Avg
12.50
10.90
Efficiency
Revenue/Employee (TTM)
637,890.00
556,027.00
Net Income/Employee (TTM)
108,902.00
98,118.00
Receivable Turnover (TTM)
5.70
5.80
Inventory Turnover (TTM)
11.30
9.70
Asset Turnover (TTM)
0.50
0.50
[Finished in 2.0s]
Cleaning up the data is up to you.
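As one possible starting point for that cleanup, the sketch below groups the flat lines into (label, company, industry) records and skips section headers such as "Valuation Ratios". It assumes every data row is a label followed by exactly two values, as in the output above; `parse_ratios` and `to_number` are made-up names.

```python
import re

# A value is a number such as 15.40, -7.80 or 637,890.00, or the marker N.A.
VALUE = re.compile(r"^(?:-?[\d,]+(?:\.\d+)?|N\.A\.)$")


def to_number(text):
    """Convert a value string to float, or None for N.A."""
    return None if text == "N.A." else float(text.replace(",", ""))


def parse_ratios(lines):
    """Group flat text lines into (label, company, industry) tuples."""
    lines = [line.strip() for line in lines if line.strip()]
    records = []
    i = 0
    while i < len(lines):
        following = lines[i + 1:i + 3]
        is_label = not VALUE.match(lines[i])
        if is_label and len(following) == 2 and all(VALUE.match(v) for v in following):
            records.append((lines[i], to_number(following[0]), to_number(following[1])))
            i += 3
        else:
            i += 1  # section header or stray line: skip it
    return records


# A few lines copied from the output above.
sample = [
    "Company", "Industry", "Valuation Ratios",
    "P/E Ratio (TTM)", "15.40", "14.80",
    "Dividends",
    "Dividend Yield - 5 Yr Avg (%)", "N.A.", "1.20",
    "Efficiency",
    "Revenue/Employee (TTM)", "637,890.00", "556,027.00",
]
for record in parse_ratios(sample):
    print(record)
```

In the scraper, the input could come from something like `table.get_text().splitlines()`, though that exact call is an assumption about how the text is collected.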
One good lesson to learn from this scrape is that not all data is contained in one page alone. It's pretty nice to see it coming from another static site. If it had been produced via JavaScript or AJAX calls or the like, we would likely have had some difficulties with our approach.
Hopefully you learned something from this. Let us know if this helps and good luck.

Doesn't answer your specific question, but solves your problem.
http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}
Examples:
http://www.dailyfinance.com/quotes/AAPL/NAS
http://www.dailyfinance.com/quotes/IBM/NYSE
http://www.dailyfinance.com/quotes/CSCO/NAS
To get to the financial ratios page you could then employ something like this:
import urllib2
def financial_ratio_url(symbol, stock_exchange):
    starturl = 'http://www.dailyfinance.com/quotes/'
    starturl += '/'.join([symbol, stock_exchange])
    # urlopen follows the redirect; geturl() returns the final URL
    res = urllib2.urlopen(starturl)
    return '/'.join([res.geturl(), 'financial-ratios'])
Example:
financial_ratio_url('AAPL', 'NAS')
'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios'
