Web scraping for multiple classes using Python

I am trying to scrape the address from a 10-K filing document in HTML: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm
It has multiple div classes, and I want to scrape the address inside a span.
Expected output:
1600 Amphitheatre parkway
I have tried a few things, like the code below:
from requests_html import HTMLSession
s = HTMLSession()
r = s.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
# requests_html uses .find(), not .find_all(); this returns a list of every <div> on the page
add1 = r.html.find('div')
print(add1)
However, if you inspect the page, it has many nested layers. I am new to HTML and Python. Please help.

You could do it like this, but I'm not sure it's very robust or applicable to many other examples, given how the ids look...
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
soup = BeautifulSoup(page.content, 'html.parser')
content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)
Output:
1600 Amphitheatre Parkway
Edit: I didn't see @baduker's answer and I didn't know there was an API. He is right: use the API.
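For reference, here is a rough sketch of what using the API could look like: EDGAR's submissions endpoint returns company metadata as JSON, including a business address, so no HTML parsing is needed. The field names ("addresses", "business", "street1") are my assumption from memory, so verify them against the actual response.
# Sketch only: pull the business address for CIK 1652044 (Alphabet) from EDGAR's submissions JSON.
# The "addresses"/"business"/"street1" keys are assumed -- inspect the response to confirm.
import requests
url = "https://data.sec.gov/submissions/CIK0001652044.json"
headers = {"User-Agent": "your-name your-email@example.com"}  # SEC asks for a descriptive User-Agent
data = requests.get(url, headers=headers).json()
business = data.get("addresses", {}).get("business", {})
print(business.get("street1"), business.get("city"), business.get("stateOrCountry"))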

Related

BeautifulSoup is missing content

I'm trying to scrape the number of visitors to my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?
It's there in JSON format in one of the script tags of the HTML. You just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup
url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# The figures live in a JavaScript object ("var data = {...};") inside the third <script> tag,
# so cut that object out of the script text and swap single quotes for double quotes
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'",'"')
# Trim the trailing comma before the closing brace so json.loads() accepts it, and drop the &nbsp entities
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace('&nbsp', ' ')
jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)
You're not doing anything wrong, the issue is that the website is populating the <span> element using JavaScript, which runs after your request is made.
Unfortunately, the requests library cannot run JavaScript since it is a pure HTTP tool. I would recommend checking out something like Selenium which is more robust and can wait for the JavaScript to load before scraping the HTML.
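For example, here is a minimal Selenium sketch (this assumes Selenium 4 and a local Chrome install; the span id "count" comes from the question's HTML):
# Sketch: wait for the JavaScript to populate <span id="count">, then read its text
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
url = "https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644"
driver = webdriver.Chrome()
driver.get(url)
# poll until the span's text is non-empty, up to 15 seconds
count_text = WebDriverWait(driver, 15).until(
    lambda d: d.find_element(By.ID, "count").text.strip()
)
print(count_text)
driver.quit()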
You can try the requests_html module to get dynamic values that are calculated by JavaScript. I tried the logic below and it worked for me on your site.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url="Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
This renders the JavaScript on your site and prints the calculated result.
In the dev tools, under one of the script tags, you can see that many of those figures are generated after the page loads by the JavaScript function showGym(). To allow those figures to be generated, you could use a browser automation tool like webbot or Selenium, which can wait on a page long enough for the JavaScript to execute and populate those fields. It might be possible to get requests to do that, but I don't know, as I've only used webbot when running into problems like these because it's very easy to use.

webscraping in python: copying specific part of HTML for each webpage

I am working on a web scraper using requests-html and Beautiful Soup (new to this). For one webpage (https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html) I am trying to scrape a part, which I will replicate for other products. The HTML looks like:
<span class="js-enhanced-ecommerce-data hidden" data-product-title="Illamasqua Expressionist Artistry Palette" data-product-id="12024086" data-product-category="" data-product-is-master-product-id="false" data-product-master-product-id="12024086" data-product-brand="Illamasqua" data-product-price="£39.00" data-product-position="1">
</span>
I want to select data-product-brand="Illamasqua", specifically the "Illamasqua" value. I am not sure how to grab this using requests-html or BeautifulSoup. I tried:
r.html.find("span.data-product-brand", first=True)
But this was unsuccessful. Any help would be appreciated.
Because you tagged beautifulsoup, here's a solution using that package:
from bs4 import BeautifulSoup
import requests
page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")
# there are multiple matches for the class that contains the word 'Illamasqua', which is what I think you want in the end???
# you can loop through and get the brand like this; in this case there are three
for l in soup.find_all(class_="js-enhanced-ecommerce-data hidden"):
    print(l.get('data-product-brand'))
# if it's always going to be the first, you can just do this
soup.find(class_="js-enhanced-ecommerce-data hidden").get('data-product-brand')
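The same lookup can also be written with a CSS selector via select_one; this is just an alternative spelling of the answer above, reusing the same soup object:
# equivalent lookup with a CSS selector (matching on one of the span's classes)
brand = soup.select_one("span.js-enhanced-ecommerce-data")["data-product-brand"]
print(brand)  # Illamasqua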
You can get the element(s) with the specified data attribute directly:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
span = r.html.find('[data-product-brand]', first=True)
print(span)
There are 3 matches, and you need just the first one, I guess.
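If you want just the brand string rather than the whole element, the element's attribute dict should give it to you (reusing the span variable from the snippet above):
# requests_html elements expose their HTML attributes as a dict
print(span.attrs.get('data-product-brand'))  # Illamasqua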

How to scrape JavaScript page with Python

I'm trying to scrape patentsview.org but I'm having an issue. When I try to scrape this page, it doesn't work well. The site uses JavaScript to fetch data from its database. I tried to get the data using the requests-html package, but I didn't quite understand how.
Here's what I tried:
# Imports
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
# Set requests
r = session.get('https://datatool.patentsview.org/#search/assignee&asn=1|Samsung')
r.html.render()
# Set BS and print
soup = BeautifulSoup(r.html.html, "lxml")
tags = soup.find_all("div", class_='summary')
print(tags)
This code gives me this result:
# Result
[<div class="summary"></div>]
But I want the content that is visible in the browser for that div.
This is the right div, but I can't see its content with my code. How can I get the div's content? I hope you understand what I mean.
Use the browser dev tools (in Chrome: F12 - Network - XHR) and look for the HTTP GET that returns the data (as JSON) you are looking for.
HTTP GET https://webapi.patentsview.org/api/assignees/query?q={%22_and%22:[{%22_or%22:[{%22_and%22:[{%22_contains%22:{%22assignee_first_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_last_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_organization%22:%22Samsung%22}}]}]}]}&f=[%22assignee_id%22,%22assignee_first_name%22,%22assignee_last_name%22,%22assignee_organization%22,%22assignee_lastknown_country%22,%22assignee_lastknown_state%22,%22assignee_lastknown_city%22,%22assignee_lastknown_location_id%22,%22assignee_total_num_patents%22,%22assignee_first_seen_date%22,%22assignee_last_seen_date%22,%22patent_id%22]&o={%22per_page%22:50,%22matched_subentities_only%22:true,%22sort_by_subentity_counts%22:%22patent_id%22,%22page%22:1}&s=[{%22patent_id%22:%22desc%22},{%22assignee_total_num_patents%22:%22desc%22},{%22assignee_organization%22:%22asc%22},{%22assignee_last_name%22:%22asc%22},{%22assignee_first_name%22:%22asc%22}]
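As a rough sketch, you can call that endpoint directly with requests and inspect the JSON yourself. The q/f/o parameter names and the assignee_* field names below are taken from the captured URL above; the API may have changed or started requiring a key since this was written, so treat it as a starting point:
# Query the captured PatentsView endpoint directly and look at the JSON it returns
import json
import requests
api_url = "https://webapi.patentsview.org/api/assignees/query"
params = {
    # simplified version of the query string captured in dev tools
    "q": json.dumps({"_contains": {"assignee_organization": "Samsung"}}),
    "f": json.dumps(["assignee_id", "assignee_organization", "assignee_total_num_patents"]),
    "o": json.dumps({"per_page": 50, "page": 1}),
}
resp = requests.get(api_url, params=params)
data = resp.json()
print(list(data.keys()))  # inspect the top-level keys before digging further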

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as the "121" at the bottom of the tree). I have constructed a simple script, and my problem is that I am not sure how to get the right parts from the markup behind the page's "Inspect Element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom_container_root comes back empty, even though bottom-container-root is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup
# assumes the page's HTML has been saved locally as "small.html"
with open("small.html") as f:
    html = f.read()
soup = BeautifulSoup(html, "html.parser")
print(soup.head.next_element)
print(soup.head.next_element.next_element)
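And since the answer mentions .find_next_siblings but only demonstrates .next_element, here is an equally small sketch of it, reusing the soup object from above (the tag name is arbitrary):
# walk the <div> siblings that follow the first <div> in the document
first_div = soup.find("div")
if first_div is not None:
    for sibling in first_div.find_next_siblings("div"):
        print(sibling.name, sibling.get("class"))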

How to use python to press the “load more” in imdb to get more reviews

I am creating a web crawler and I want to scrape the user reviews from IMDb. It's easy to directly get the first 10 reviews and ratings from the original page, for example http://www.imdb.com/title/tt1392170/reviews The problem is that to get all the reviews, I need to press "Load More" so that more reviews are shown, while the URL doesn't change! So I don't know how I can get all the reviews in Python 3. What I use now are requests and bs4.
My code now:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url_link = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
html = urlopen(url_link)
content_bs = BeautifulSoup(html, 'html.parser')
for b in content_bs.find_all('div', class_='text'):
    print(b)
for rate_score in content_bs.find_all('span', class_='rating-other-user-rating'):
    print(rate_score)
You can't press the Load More button without initiating a click event, and BeautifulSoup doesn't have that capability. However, what you can do to get the full content is something like what I've demonstrated below. It will fetch all of the review titles along with the reviews:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
# extract the ajax URL that leads to the page containing everything currently available
main_content = urljoin(url, soup.select(".load-more-data")[0]['data-ajaxurl'])
response = requests.get(main_content)
broth = BeautifulSoup(response.text, "lxml")
for item in broth.select(".review-container"):
    title = item.select(".title")[0].text
    review = item.select(".text")[0].text
    print("Title: {}\n\nReview: {}\n\n".format(title, review))
