Why doesn't my html parser download the entire html documentation?

Why doesn't my html parser download the entire html documentation? - python

I am using Beautiful Soup to scrape the following page: https://www.nyse.com/quote/XNYS:AAN
I want to the stock value below the name + abbreviation. However, when I run a script, it seems that soup.find() does not work because the entire html file is not being downloaded.
main_url = "https://www.nyse.com/quote/XNYS:AAN"
import requests
result = requests.get(main_url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.find("div", class_ = "d-dquote-symbol").prettify())
I expect to see the <div> that contains the <span> with the correct stock value. However, the print returns "none" because the script cannot find this tag. I know it exists because I used inspect element to find the tag in the first place.

This happens because the page you're scraping is not static.
you can see that it has a "spinner" before displaying the values, or by inspecting the network tab in your browser's debug tools.
requests.get doesn't make any "follow-up" requests so you only get the empty page.
to get the stock value (by HTML scraping...) you should use the request the site itself uses to get the stock value.
NOTE: it is better to look for an official API to get this kind of structured data.

You can use any browser simulator to get the quote. pyppeteer can be a good choice to do the trick. The script will wait for the quote to be available and then parse it.
import asyncio
from pyppeteer import launch
url = "https://www.nyse.com/quote/XNYS:AAN/QUOTE"
async def get_quote(link):
wb = await launch()
page = await wb.newPage()
await page.goto(link)
await page.waitForSelector(".d-dquote-bigContainer [class^='d-dquote-x']")
container = await page.querySelector(".d-dquote-bigContainer [class^='d-dquote-x']")
quote = await page.evaluate('(element) => element.innerText', container)
print(quote)
asyncio.get_event_loop().run_until_complete(get_quote(url))
Output at this moment:
60.09

Related

Error while scraping image with beautifulsoup

The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So i am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was
The code that caused this warning is on line 12 of the file/Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this :
import requests
from bs4 import BeautifulSoup
import urllib.request
import random
url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a",{"class":"photo_link "}):
href = link.get('href')
print(href)
img_name = random.randrange(1,500)
full_name = str(img_name) + ".jpg"
urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?

Actually the website is loaded via JavaScript using XHR request to the following API
So you can reach it directly via API.
Note that you can increase parameter rpp=50 to any number as you want for getting more than 50 result.
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['url'])
also you can access the image url itself in order to write it directly!
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['image_url'][-1])
Note that image_url key hold different img size. so you can choose your preferred one and save it. here I've taken the big one.
Saving directly:
import requests
with requests.Session() as req:
r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
result = []
for item in r['photos']:
print(f"Downloading {item['name']}")
save = req.get(item['image_url'][-1])
name = save.headers.get("Content-Disposition")[9:]
with open(name, 'wb') as f:
f.write(save.content)

Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page due to the fact that it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.

This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
Use a library capable of executing JavaScript and giving you the resulting HTML
Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, then we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation here) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)

So the way that the site works is that it sends you the site with no word in the span box, and edits it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're trying to get the word I'd definitely suggest you use a different method to getting the word, rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in response.
You're using Python 2 but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.

Displaying all search results in Python web scraper

I am quite new to Python and am building a web scraper, which will scrape the following page and links in them: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is the page's default is to display the first 10 search results, however, I want to scrape all 150 search results (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but the URL remains static no matter what display results option is selected. I have also tried to look at the Network section of the Developer Tools on Chrome, but can't seem to figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re
response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")
urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
urls.append(a['href'])
pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]
for pages in pagesToCrawl:
html = requests.get(pages)
soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
tbody = soupObjs.find('div', {"id":"collapse8"}).find('tbody')
offers = tbody.find('td').next_sibling.next_sibling.next_element
seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
print('Firm name:', nameOfFirm)
print('Offers:', offers)
print('Seeking:', seeking)
print('Hireback Rate:', int(offers) / int(seeking))

Replacing your response call with this code seems to work. The reason is that you weren't passing in the cookie properly.
response = requests.get(
'https://www.nalpcanada.com/Page.cfm',
params={'PageID': 33},
cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was that a ValueError was being raised by this line when certain links (like YLaw Group) don't seem to have "offers" and/or "seeking".
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line since you will have to decide what to do in those cases.

Python request.get() after few seconds

I want to get html-text a few seconds after opening url.
Here's the code:
import requests
url = "http://XXXXX…"
html = request.get(url).text

I want to get html-text few seconds after opening url.
Well, the webpage HTML stays the same right after you "get" the url using Requests, so there's no need to wait a few seconds as the HTML will not change.
I assume the reason that you would like to wait is for the page to load all the relevant resources (e.g. CSS/JS) that modifies the HTML?
If it's so, I wouldn't recommend you using the Requests module as you will have to manipulate and load all of the relevant resources by yourself.
I suggest you to have a look at Selenium for Python.
Selenium fully simulates a browser, hence you can wait and it will load all the resources for your webpage.

try using time.sleep(t)
response = request.get(url)
time.sleep(5) # suspend execution for 5 secs
html = response.text

You want to change the last line to:
html = requests.get(url).text

I have found the library requests-html handy for this purpose, though mostly I use Selenium (as already proposed in Danny answer).
from requests_html import HTMLSession, HTMLResponse
session = HTMLSession()
req = cast(HTMLResponse, session.get("http://XXXXX"))
req.html.render(sleep=5, keep_page=True)
Now, the req.html is a HTML object. In order to get the raw text or the html as a string you can use:
text = req.text
or:
text = req.html.html
Then you can parse your text string, e.g. with Beautiful Soup.

basically you can give a sleep to the request as a parameter as bellow:
import requests
import time
url = "http://XXXXX…"
seconds = 5
html = requests.get(url,time.sleep(seconds)).text #for example 5 seconds

Selenium download full html page

I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends http://www.google.com/trends/hottrends#pn=p5
This is my current code. However, I realized the full html is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?
from selenium import webdriver
from bs4 import BeautifulSoup
googleURL = "http://www.google.com/trends/hottrends#pn=p5"
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source
soup = BeautifulSoup(content)
print soup

Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.
For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a string of this format exists for every date, for loaded content) your python might look something like this:
content = browser.page_source
desired_content_is_loaded = false;
while (desired_content_is_loaded == false):
if not "Friday, February 15, 2013" in content:
sel.run_script("control.moreData();")
content = browser.page_source
else:
desired_content_is_loaded = true;
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. What that tells me, is that the those items are loaded dynamically. Meaning, they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling if async JS will complete before or after any other event. It completes when it's ready, and could be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source because it depends how fast async JS happens to be working at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.
browser.get(googleURL)
time.sleep(3)
content = browser.page_source

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why doesn't my html parser download the entire html documentation? - python

Related

Error while scraping image with beautifulsoup

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

Displaying all search results in Python web scraper

Python request.get() after few seconds

Selenium download full html page

Categories

Resources