How to scrape JavaScript page with Python

How to scrape JavaScript page with Python - python

I'm trying to scrape patentsview.org but I'm having an issue. When I try to scrape this page, it doesn't work well. Site using JavaScript to get data from their database. I tried to get the data using requests-html package but I didn't quite understand.
Here's what I tried:
# Import
import re
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
# Set requests
r = session.get('https://datatool.patentsview.org/#search/assignee&asn=1|Samsung')
r.html.render()
# Set BS and print
soup = BeautifulSoup(r.html.html, "lxml")
tags = soup.find_all("div", class_='summary')
print(tags)
This code gives me this result:
# Result
[<div class="summary"></div>]
But I want this:
This is the right div. But I can't see content of div with my code. How can I get the div's content? Hope you understand what I meant.

Use the browser dev tools. (Chrome. F12 - Network - XHR) and see the HTTP GET thst return the data (as JSON) you are looking for.
HTTP GET https://webapi.patentsview.org/api/assignees/query?q={%22_and%22:[{%22_or%22:[{%22_and%22:[{%22_contains%22:{%22assignee_first_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_last_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_organization%22:%22Samsung%22}}]}]}]}&f=[%22assignee_id%22,%22assignee_first_name%22,%22assignee_last_name%22,%22assignee_organization%22,%22assignee_lastknown_country%22,%22assignee_lastknown_state%22,%22assignee_lastknown_city%22,%22assignee_lastknown_location_id%22,%22assignee_total_num_patents%22,%22assignee_first_seen_date%22,%22assignee_last_seen_date%22,%22patent_id%22]&o={%22per_page%22:50,%22matched_subentities_only%22:true,%22sort_by_subentity_counts%22:%22patent_id%22,%22page%22:1}&s=[{%22patent_id%22:%22desc%22},{%22assignee_total_num_patents%22:%22desc%22},{%22assignee_organization%22:%22asc%22},{%22assignee_last_name%22:%22asc%22},{%22assignee_first_name%22:%22asc%22}]

Related

BeautifulSoup is missing content

I'm trying to scrap a number of visitors to my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?

It's there in json format in the tag of the html. Just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup
url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'",'"')
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace('&nbsp', ' ')
jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)

You're not doing anything wrong, the issue is that the website is populating the <span> element using JavaScript, which runs after your request is made.
Unfortunately, the requests library cannot run JavaScript since it is a pure HTTP tool. I would recommend checking out something like Selenium which is more robust and can wait for the JavaScript to load before scraping the HTML.

You can try requests_html module to get dynamic values which are calculated by javascript. I tried with below logic it worked for me on your site.
from bs4 import BeautifulSoup
import time
from requests_html import HTMLSession
url="Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
Your Site calculate Result

In the dev tools under one of the tags, you can see that many of those figures are generated after the page load by the JavaScript function showGym(). In order to allow those figures to generate you could use a browser driver tool like webbot or Selenium which can wait on pages long enough for the javascript to execute populate those fields. It might be possible to have requests do that, but I don't know as I've only used webbot when reaching problems like these as it's very easy to use.

BeautifulSoup can't find HTML element by class

This is the website I'm trying to scrape with Python:
https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000
I want to access the 'ul' element with the class of 'srp-results srp-list clearfix'. This is what I tried with requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
uls = soup.find_all('ul', attrs = {'class': 'srp-results srp-list clearfix'})
And the output is always an empty string.
I also tried scraping the website with Selenium Webdriver and I got the same result.

First I was a little bit confused about your error but after a bit of debugging I figured out that: eBay dynamically generates that ul with JavaScript
So since you can't execute JavaScript with BeautifulSoup you have to use selenium and wait until the JavaScript loads that ul

It is probably because the content you are looking for is rendered by JavaScript After the page loads on a web browser this means that the web browser load that content after running javascript which you cannot get with requests.get request from python.
I would suggest to learn Selenium to Scrape the data you want

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script and my problem is that I am not sure how to get the right parts from the code behind page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
ULR = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom-container-root is an empty variable, despite it is a direct child under ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.

.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup
html = open("small.html").read()
soup = BeautifulSoup(html)
print soup.head.next_element
print soup.head.next_element.next_element

Using Beautiful Soup in Python to check availability of a product online

I am using python 2.7 and version 4.5.1 of Beautiful Soup
I'm at my wits end trying to make this very simple script to work. My goal is to to get the information on the online availability status of the NES console from Best Buy's website by parsing the html for the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.

As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response)
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python

If you try printing soup you'll see it probably returns something like Access Denied. This is because Best Buy requires an allowable User-Agent to be making the GET request. As you do not have a User-Agent specified in the Header, it is not returning anything.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
or you could figure out your user agent generated when you are viewing the webpage in your own browser
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 1048865 # look at the URL of the web page, it is <blablah>//10488665.aspx
# chnage locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'%sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all links from this site and when I do, I get an unexpected result. I figured out that this is happening because of javascript.
under "See All categories" Tab you will find all major product categories. If you hover the mouse over any category it will expand the categories. I want those links from each major categories.
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
l = link.get('href')
print l
But, this gave me a different result than what I expected (I turned off javascript and looked at the page source and output was from this source)
I just want to finds all sub links from each major category. any suggestions will be appreciated.

This is happening just because you are letting BeautifulSoup chose its own best parser , and you might not have installed lxml .
The best option is to use html.parser to parse the url .
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
l = link.get('href')
print l
This worked for me .Make sure to install dependencies .

I thinks you should try another library such as selenium , it provide a web driver for you and this is the advantage of this library ,for my self I couldn't handle javascripts with bs4.

Categories Menu is the url you are looking for. Many websites generate the content dynamically using XHR(XMLHTTPRequest).
In order to examine the components of a website get familiar with Firebug add-on in Firefox or Developer Tools(inbuilt addon) in Chrome. You can check the XHR used in website under the network tab in aforementioned add-ons.

Use a web scraping tool such as scrapy or mechanize
In mechanize, to get all the links in the snapdeal homepage,
br=Browser()
br.open("http://www.snapdeal.com")
for link in browser.links():
print link.name
print link.url

I have been looking into a way to scrape links from webpages that are only rendered in an actual browser but wanted the results to be run using a headless browser.
I was able to achieve this using phantomJS, selenium and beautiful soup
#!/usr/bin/python
import bs4
import requests
from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
browser = driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
links = [a.attrs.get('href') for a in soup.find_all('a')]
for paths in links:
print paths
driver.close()

The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2
url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
l = link.get('href')
print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl
# to open up HTTPS URLs
gcontext = ssl.SSLContext()
# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
l = link.get('href')
print(l)
Other Languages
For other languages, please see this answer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to scrape JavaScript page with Python - python

Related

BeautifulSoup is missing content

BeautifulSoup can't find HTML element by class

Python scraping website with flight tickets

Using Beautiful Soup in Python to check availability of a product online

Scraping all links using Python BeautifulSoup/lxml

Categories

Resources