I'm a new Python programmer and was trying do scrape some key metrics on Reuters but can't do it properly.
Here's my code:
import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
page = requests.get(url)
if page.status_code == requests.codes.ok:
bs = BeautifulSoup(page.text, 'lxml')
list_all_keys = bs.findAll('tr', class_='data')
key_name = bs.find('th', class_='MarketsTable-label-_JI6s').find('div', class_='TextLabel__text-label___3oCVw TextLabel__gray___1V4fk TextLabel__regular___2X0ym MarketsTable-label-_JI6s')
for key in key_name:
beta = key.find('Beta')
print(beta)
Beta is the metric I want. Gives me "-1" as answer. I want the value in 'span' related to the name.
What should I do?
I have changed a bit your instructions, I tried to simplify the class request and used .text that allows you to extrapolate the text inside.
I have used the assumption that th and td are the only ones in the row like in the url you asked for (to delete the class selection).
Note: I've used a for, you may use a filter
import requests
from bs4 import BeautifulSoup
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
page = requests.get(url)
if page.status_code == requests.codes.ok:
bs = BeautifulSoup(page.text, 'lxml')
list_all_keys = bs.findAll('tr', class_='data')
for key in list_all_keys:
title = key.find("th").text
if title == "Beta":
print(key.find("td").text)
A few ways you can do it. 1) You can find the element with string "Beta" in the html, then grab the next <td> element. 2) Use pandas' read_html() to get the table, then query/pull out what you specifically want from the table. or 3) I prefer this option as it gets you the raw, un-rendered data: just get the json response from the API. All solutions are below:
Solution 1:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
page = requests.get(url)
if page.status_code == requests.codes.ok:
bs = BeautifulSoup(page.text, 'html.parser')
beta = bs('th',text=re.compile(r'Beta'))[0].find_next('td').text
print (beta)
Solution 2:
import pandas as pd
url = 'https://www.reuters.com/companies/AAPL.OQ/key-metrics'
df = pd.read_html(url)[0]
print (df[df[0] == 'Beta'].iloc[0,1])
Solution 3:
import requests
from pandas.io.json import json_normalize
url = 'https://www.reuters.com/companies/api/getFetchCompanyKeyMetrics/AAPL.OQ'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
print (jsonData['market_data']['beta'])
df = json_normalize(jsonData['market_data'])
df.to_csv('file.csv', index=False)
# Or to excel file
#df.to_excel('file.xls', index=False)
Output: in order of the solution provided:
1.14
1.13
1.13326
To get the JSON response:
If you go to Inspect the page, and look at the panel on the right. Go to Network -> XHR. Look through the requests on the left and see if the data you are after in there (might need to reload the page, then you'll need to click around to find it.)
If you find it, go to Headers to get the Request URL that you'll use to get that response.
Once you have the response, to convert to a table and output to excel, see the code for Solution 3.
Related
I've been bouncing around a ton of similar questions, but nothing that seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.
I'm trying to now take the HREF links in the "Result" column from this URL.
here
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup
profiles = []
urls = [
'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]
for url in urls:
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
for profile in soup.find_all('a'):
profile = profile.get('href')
profiles.append(profile)
print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
print(x.get('href'))
Main issue in that case is that you miss to send a user-agent, cause some sites, regardless of whether it is a good idea, use this as base to decide that you are a bot and do not or only specific content.
So minimum is to provide some of that infromation while making your request:
req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look to your selection. Assuming you like to get the team links only you should adjust it, I used css selectors:
for profile in soup.select('table a[href^="/team/"]'):
It also needs concating the baseUrl to the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
Example
from bs4 import BeautifulSoup
import requests
profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']
for url in urls:
req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(req.text, 'html.parser')
for profile in soup.select('table a[href^="/team/"]'):
profile = 'https://stats.ncaa.org'+profile.get('href')
profiles.append(profile)
print(profiles)
I'm new to webscraping. I was trying to make a script that gets data from a balance sheet (here the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I watch at the source code in my browser, I'm able to find the tag and the correct value. Once I write down a script with bs4, I don't get anything.
I'm trying to get informations form the balance sheet: Products, Services, Cost of sales... and the data contained in the table 1. (I'm sorry, but I can't post the image. Anyway is the first table you see scrolling down).
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data,"lxml")
names = soup_data.find_all("td")
for name in names:
print(name)
Thanks for your time.
Try this URL:
Also include the headers to get the data.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text,"lxml")
You will be able to find the data you need.
import requests
a = 'http://tmsearch.uspto.gov/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1&p_search=searchstr&BackReference=&p_L=100&p_plural=no&p_s_PARA1={}&p_tagrepl%7E%3A=PARA1%24MI&expr=PARA1+or+PARA2&p_s_PARA2=&p_tagrepl%7E%3A=PARA2%24ALL&a_default=search&f=toc&state=4809%3Ak1aweo.1.1&a_search=Submit+Query'
a = a.format('coca-cola')
b = requests.get(a)
print(b.text)
print(b.url)
If you copy the printed url and paste it in browser, site will open with no problem, but if you do requests.get, i get some token? errors. Is there anything I can do?
VIA requests.get I url back, but no data if doing manually. It says: <html><head><TITLE>TESS -- Error</TITLE></head><body>
First of all, make sure you follow the website's Terms of Use and usage policies.
This is a little bit more complicated that it may seem. You need to maintain a certain state throughout the [web-scraping session][1]. And, you'll need an HTML parser, like BeautifulSoup along the way:
from urllib.parse import parse_qs, urljoin
import requests
from bs4 import BeautifulSoup
SEARCH_TERM = 'coca-cola'
with requests.Session() as session:
session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
# get the current search state
response = session.get("https://tmsearch.uspto.gov/")
soup = BeautifulSoup(response.content, "html.parser")
link = soup.find("a", text="Basic Word Mark Search (New User)")["href"]
session.get(urljoin(response.url, link))
state = parse_qs(link)['state'][0]
# perform a search
response = session.post("https://tmsearch.uspto.gov/bin/showfield", data={
'f': 'toc',
'state': state,
'p_search': 'search',
'p_s_All': '',
'p_s_ALL': SEARCH_TERM + '[COMB]',
'a_default': 'search',
'a_search': 'Submit'
})
# print search results
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find("font", color="blue").get_text())
table = soup.find("th", text="Serial Number").find_parent("table")
for row in table('tr')[1:]:
print(row('td')[1].get_text())
It prints all the serial number values from the first search results page, for demonstration purposes.
I have been using following code to parse web page in the link https://www.blogforacure.com/members.php. The code is expected to return the links of all the members of the given page.
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://www.blogforacure.com/members.php').read()
soup = BeautifulSoup(r,'lxml')
headers = soup.find_all('h3')
print(len(headers))
for header in headers:
a = header.find('a')
print(a.attrs['href'])
But I get only the first 10 links from the above page. Even while printing the prettify option I see only the first 10 links.
The results are dynamically loaded by making AJAX requests to the https://www.blogforacure.com/site/ajax/scrollergetentries.php endpoint.
Simulate them in your code with requests maintaining a web-scraping session:
from bs4 import BeautifulSoup
import requests
url = "https://www.blogforacure.com/site/ajax/scrollergetentries.php"
with requests.Session() as session:
session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
session.get("https://www.blogforacure.com/members.php")
page = 0
members = []
while True:
# get page
response = session.post(url, data={
"p": str(page),
"id": "#scrollbox1"
})
html = response.json()['html']
# parse html
soup = BeautifulSoup(html, "html.parser")
page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
print(page, page_members)
members.extend(page_members)
page += 1
It prints the current page number and the list of members per page accumulating member names into a members list. Not posting what it prints since it contains names.
Note that I've intentionally left the loop endless, please figure out the exit condition. May be when response.json() throws an error.
It looks like python has trouble finding text when it's marked with display=none, what should I do to overcome this issue?
Here's my code
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.domcop.com/domains/great-expired-domains/')
soup = BeautifulSoup(r.text, 'html.parser')
data = soup.find('div', {'id':'all-domains'})
data.text
the code returns []
I also tried with xpath:
from lxml import etree
data = etree.HTML(r.text)
anchor = data.xpath('//div[#id="all-domains"]/text()')
It returns the same thing...
Yes, the element with id="all-domains" is empty because it is either dynamically set by the javascript executed in the browser. With requests you only get the initial HTML page without the "dynamic" part, so to say. To get all domains, I'd just iterate over the table rows and extract the domain link texts. Working sample:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.domcop.com/domains/great-expired-domains/',
headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"})
soup = BeautifulSoup(r.text, 'html.parser')
for domain in soup.select("tbody#domcop-table-body tr td a.domain-link"):
print(domain.get_text())
Prints:
u2tourfans.com
tvadsview.com
gfanatic.com
blucigs.com
...
twply.com
sweethomeparis.com
vvchart.com