How to scrape data in h4 with beautifulsoup? - python

I am trying to scrape the results data from this website (https://www.ufc.com/matchup/908/7717/post) and I am completely at a loss for why my proposed solution isn't working.
The outer html that I am trying to scrape is <h4 class="e-t5 winner">Jon Jones</h4>. I don't have a lot of experience with web scraping or HTML but all of the relevant information is contained in the h4 tag.
I have been successful in extracting the data from the h2 tag but I am confused as to why the same approach doesn't work for h4. For example, to extract the relevant data from <h2 class="field--name-name name_given red">Jon Jones <span class="field--field-rank rank"></span></h2> the following code works.
from requests import get
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}
raw_html = get('https://www.ufc.com/matchup/908/7717/post', headers=headers)
html = BeautifulSoup(raw_html.content)
# this works
html.find_all('h2', attrs={'class': 'field--name-name name_given red'})[0].get_text().strip()
# this does not work?
html.find_all('h4', attrs={'class': 'e-t5 winner red'})
# this code gets me to the headers but not the actual listed data inside
html.find('div', attrs={'class': 'l-flex--4col-2to4'})
I am mostly confused as to why the above doesn't work and why the text I can see when inspecting the element in my browser doesn't appear in the scraped HTML.

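The winner is added to the page dynamically by JavaScript, so it is not in the HTML that requests downloads. You can find the source of the data in the browser's network tab. Assuming there is always one winner, you can use something like: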
import requests
r = requests.get('https://dvk92099qvr17.cloudfront.net/V1/908/Fnt.json').json()
winner = [fighter['FullName'] for fighter in r['FMLiveFeed']['Fights'][0]['Fighters'] if fighter['Outcome'] == 'Win'][0]
print(winner)
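The list comprehension above raises an IndexError when no fighter has a 'Win' outcome (a draw, for instance). Below is a minimal sketch of a more defensive lookup. It runs against a made-up payload that only mimics the FMLiveFeed / Fights / Fighters / FullName / Outcome keys used above. The helper name find_winner, the sample names, and the 'Loss' and 'Draw' values are assumptions for illustration, not taken from the real feed.

```python
def find_winner(feed, fight_index=0):
    """Return the FullName of the winning fighter, or None if nobody won."""
    fighters = feed["FMLiveFeed"]["Fights"][fight_index]["Fighters"]
    # next() with a default avoids the IndexError that [...][0] raises on an empty list
    return next(
        (f["FullName"] for f in fighters if f.get("Outcome") == "Win"),
        None,
    )


# Hypothetical sample shaped like the JSON used above; values are illustrative only.
sample_feed = {
    "FMLiveFeed": {
        "Fights": [
            {
                "Fighters": [
                    {"FullName": "Fighter A", "Outcome": "Win"},
                    {"FullName": "Fighter B", "Outcome": "Loss"},
                ]
            },
            {
                "Fighters": [
                    {"FullName": "Fighter C", "Outcome": "Draw"},
                    {"FullName": "Fighter D", "Outcome": "Draw"},
                ]
            },
        ]
    }
}

print(find_winner(sample_feed))                 # Fighter A
print(find_winner(sample_feed, fight_index=1))  # None
```

With the real feed you would pass the decoded JSON (the r variable above) instead of sample_feed.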

Related

Python, Scraping BS4

There are a lot of posts about this subject, but I still haven't managed to achieve what I want, so here is my problem:
I am trying to extract stock price from this site:
https://bors.e24.no/#!/instrument/NHY.OSE
and I would like to extract the price, 57,12, from the HTML shown in the browser inspector:
<div class="number LAST" data-reactid=".g.1.2.0">
57,12</div>
Here is the code I tried, which raises "AttributeError: 'NoneType' object has no attribute 'text'".
I also tried removing .text from the PRICE line, and the result is 'Price is: None'.
from bs4 import BeautifulSoup
import requests
url = 'https://bors.e24.no/#!/instrument/NHY.OSE'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
PRICE= soup.find('div', class_= "number LAST").text
print('Price is:',(PRICE))
Try this:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
}
api_url = "https://bors.e24.no/server/components?columns=ITEM, LAST, BID, ASK, CHANGE, CHANGE_PCT, TURNOVER, LONG_NAME&itemSector=NHY.OSE&type=table"
data = requests.get(api_url, headers=headers).json()
print(data["rows"][0]["values"]["LAST"])
Output:
56.92
This happens because
requests.get(url)
does not retrieve everything you see in the browser, including the price you are looking for: the page first loads a skeleton and only then fetches more data with JavaScript. Because of that, trying to select the div with class "number LAST"
PRICE= soup.find('div', class_= "number LAST").text
raises an error: the element does not exist in the downloaded HTML, so find returns None.
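One part of this can be checked without any network access. Everything after the # in the question's URL is a fragment, which the browser keeps to itself and never sends to the server. Here is a small sketch using only the standard library (the variable names are mine):

```python
from urllib.parse import urlsplit, urldefrag

url = "https://bors.e24.no/#!/instrument/NHY.OSE"

parts = urlsplit(url)
print(parts.path)      # /
print(parts.fragment)  # !/instrument/NHY.OSE

# The fragment is handled by the browser only; the server is asked for this URL:
requested_url, _fragment = urldefrag(url)
print(requested_url)   # https://bors.e24.no/
```

So the request only asks for the site root, and the instrument named in the fragment is resolved later by the page's JavaScript.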
There are a couple of ways to fix this problem:
You can try a library like Selenium, which is often recommended for scraping dynamic pages that rely on JavaScript and API calls to load content.
You can open your developer tools and inspect the Network tab, where you might find the request that fetches the price you are trying to scrape.
I believe that in your case, after taking a look at the Network tab myself, the right URL to request could be 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history', which seems to return a dictionary with the price you are looking for.
import requests
url = 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history'
page = requests.get(url)
print(page.json()["rows"][0]["values"]["PRICE"])
If you want to scrape several instruments, you will need to change the URL above dynamically so that it matches each one. I guess that would mean changing "NHY" and "ose" to values that match the other stocks you are looking for.
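A rough sketch of how that URL could be built per ticker with the standard library. The helper name build_trades_url is hypothetical, and it is only an assumption that other instruments follow the same ITEM==s<TICKER> filter pattern; that has been observed here for NHY only.

```python
from urllib.parse import urlencode

BASE_URL = "https://bors.e24.no/server/components"


def build_trades_url(ticker, limit=5):
    """Build the trade-history URL for one ticker (assumed pattern: ITEM==s<TICKER>)."""
    params = {
        "columns": "TIME, PRICE, VOLUME, BUYER, SELLER, ID",
        "filter": f"ITEM==s{ticker}",
        "limit": limit,
        "source": "feed.ose.trades.EQUITIES+PCC",
        "type": "history",
    }
    # safe="," keeps the commas as-is; spaces become "+", while "=" and "+" are percent-encoded
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# For NHY this reproduces the URL found in the Network tab.
print(build_trades_url("NHY"))
```

Each resulting URL can then be passed to requests.get(...) and read with .json() as in the snippet above.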

Is it even possible to webscrape this website [Unpredictive URL]?

This is the website in question (I want to extract the SMR Rating):
https://research.investors.com/stock-quotes/nasdaq-apple-inc-aapl.htm
If I have a list of stock names like AAPL, NVDA, TSM etc. and I want to iterate through them, how can I do it when the URL constantly changes in an unpredictable manner?
Take for example the same website with the ticker NVDA:
https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm
It's not possible to append the ticker name to the URL and be done with it. I searched for a hidden API and I got this:
https://research.investors.com/services/ChartService.svc/GetData
This website gives me access to a json file but it doesn't contain the desired SMR Rating. Apart from that, I couldn't find anything else that would lead to the SMR Rating. Is this simply impossible?
Here's what I've got so far; I can't even get past the HTML reading stage:
from bs4 import BeautifulSoup as bs
import json
import re
import pandas as pd
import requests
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" ,
'referer':'https://www.google.com/'}
URL = "https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm"
page = requests.get(URL, headers = header)
soup = bs(page.content, "html.parser")
print(soup)
As you can see, I can't load the full HTML with Beautiful Soup, as the page assumes that some form of robotic activity is taking place (Error 405). Should I have specified a different header, or is it indeed the case that web scraping isn't allowed on this webpage?

Get next page on Amazon.com using python requests library

I'm trying to scrape through an amazon offer. The code block below prints the titles of all results of the first page.
import requests
from bs4 import BeautifulSoup as BS
offer_url = 'https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU?ref=deals_deals_deals-grid_slot-15_8454_dt_dcell_img_14_f3724fb9#'
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get(offer_url, headers=headers)
html = response.text
soup = BS(html)
links = soup.find_all('a',class_='a-size-base a-link-normal fw-line3-break a-text-normal')
for link in links:
    title = link.text.strip() # remove surrounding spaces and linebreaks etc.
    print(title)
So far, so good. Now, how do I access the second page? Clicking the Next page button (in German it's "Weiter") at the bottom of the page does not add a page argument like ?page=2 to the URL, through which I could access the next page.
The questions I have are: How do I access the content of the next page similarly to how I access the first page? Is there a POST request involved and if so: How do I figure out its params/data? How would I use requests to mimic pressing the Next Page button and get the respective page results?
The offer is scheduled to last until March 21st, 2021. Until then, the link provided in the code should be valid.
Maybe it's just a few lines of code, e.g. a tweak in my request. Thanks in advance! Have a wonderful day!
Edit:
Trying to fetch the second page using the following script only yields the results of the first page.
params = {"page":2}
html = requests.post('https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU', data=params, headers=headers).text

Scraping Schema with Beautiful Soup?

I'm trying to scrape a site that contains the following html code:
<div class="content-sidebar-wrap"><main class="content"><article
class="post-773 post type-post status-publish format-standard has-post-thumbnail category-money entry"
itemscope itemtype="http://schema.org/CreativeWork">
This contains data I'm interested in... I've tried using BeautifulSoup to parse it, but the following returns:
<div class="content-sidebar-wrap"><main class="content"><article
class="entry">
<h1 class="entry-title">Not found, error 404</h1><div class="entry-content
"><p>"The page you are looking for no longer exists. Perhaps you can return
back to the site's "homepage and
see if you can find what you are looking for. Or, you can try finding it
by using the search form below.</p><form
action="http://www.totalsportek.com/" class="search-form"
itemprop="potentialAction" itemscope=""
itemtype="http://schema.org/SearchAction" method="get" role="search">
# I've made small modifications to make it readable
The Beautiful Soup object doesn't contain the HTML I'm after. I'm not too familiar with HTML, but I'm assuming this makes a call to some external service that returns the data..? I've read this has something to do with Schema.
Is there any way I can access this data?
You need to specify the User-Agent header when making a request. Working example that prints the article header and the content as well:
import requests
from bs4 import BeautifulSoup
url = "http://www.totalsportek.com/money/barcelona-player-salaries/"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"})
soup = BeautifulSoup(response.content, "html.parser")
article = soup.select_one(".content article.post.entry.status-publish")
header = article.header.get_text(strip=True)
content = article.select_one(".entry-content").get_text(strip=True)
print(header)
print(content)

Web Scraping - No content displayed

I am trying to fetch the stock price of a company that the user specifies as input. I am using requests to get the source code and BeautifulSoup to scrape it. I am fetching the data from google.com and want only the last stock price (806.93 in the picture). When I run my script, it prints None; none of the data is being fetched. What am I missing?
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
company = raw_input("Enter the company name:")
URL = "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q="+company+"+stock"
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
print code.contents[0]
The source code of the page looks like this :
Looks like that source is from inspecting the element, not the actual source. A couple of suggestions. Use google finance to get rid of some noise - https://www.google.com/finance?q=googl would be the URL. On that page there is a section that looks like this:
<div class=g-unit>
<div id=market-data-div class="id-market-data-div nwp g-floatfix">
<div id=price-panel class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span id="ref_694653_l">806.93</span>
</span>
<div class="id-price-change nwp">
<span class="ch bld"><span class="chg" id="ref_694653_c">+9.68</span>
<span class="chg" id="ref_694653_cp">(1.21%)</span>
</span>
</div>
</div>
You should be able to pull the number out of that.
I went to
https://www.google.com/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q=+google+stock
did a right click and "View Page Source", but did not see the code that you screenshotted.
Then I typed out a section of your code screenshot and created a BeautifulSoup object with it and then ran your find on it:
test_screenshot = BeautifulSoup('<div class="_F0c" data-tmid="/m/07zln7n"><span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span> = $0<span class ="_hgj">USD</span>')
test_screenshot.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
Which will output what you want:
<span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span>
This means that the code you are getting is not the code you expect to get.
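One likely contributor can be seen by taking the question's URL apart: the q=... part sits after the #, so it belongs to the fragment and is never sent to the server. The sketch below uses only the standard library and makes no request. The company value is just an example input, and the /search endpoint in the last step is one possible way to put the search term into the real query string.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

company = "google"  # example input
original_url = (
    "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw"
    "#newwindow=1&safe=off&q=" + company + "+stock"
)

parts = urlsplit(original_url)
sent_params = parse_qs(parts.query)      # what the server actually receives
browser_only = parse_qs(parts.fragment)  # what stays in the browser

print("q" in sent_params)  # False
print(browser_only["q"])   # ['google stock']

# Putting the search term into the real query string instead:
fixed_url = "https://www.google.com/search?" + urlencode({"q": company + " stock"})
print(fixed_url)           # https://www.google.com/search?q=google+stock
```

With requests, the same effect comes from passing the search term through the params argument rather than concatenating it after a #.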
I suggest using the google finance page:
https://www.google.com/finance?q=google (replace 'google' with what you want to search), which will give you what you are looking for:
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find("span",{'class':'pr'})
print code.contents
Will give you
[u'\n', <span id="ref_694653_l">806.93</span>, u'\n'].
In general, scraping Google search results can get really nasty, so try to avoid it if you can.
You might also want to look into Yahoo Finance Python API.
You're looking for this:
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
It might be because there's no user-agent specified in your request headers.
The default requests user-agent is python-requests, so Google can tell that the request comes from a script rather than a "real" user visit and blocks it. You then receive different HTML, with different selectors and elements, or some sort of error page. Setting a browser-like user-agent in the HTTP request headers makes the request look like a regular user visit.
Pass user-agent into request headers:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
response = requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0'
}
params = {
'q': 'alphabet inc class a stock',
'gl': 'us'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
print(current_price)
# 2,816.00
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON to get the data you want, rather than figuring out why certain things don't work as expected and then maintaining the scraper over time.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "alphabet inc class a stock",
"gl": "us",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
current_price = results['answer_box']['price']
print(current_price)
# 2,816.00
P.S - I wrote an in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
