BeautifulSoup not Retrieving Accurate HTML - Requests_HTML - python

I am trying to parse a picture off of this page. Specifically, I am trying to parse the image under the div class "gOenxf". When you inspect the webpage, the HTML elements show an "encrypted" image URL, which is what I am trying to retrieve. However, when I parse that same page/class, the image comes back as a "Data URL", which is not very useful to me. I am using requests_html because I need something faster than Selenium, and BeautifulSoup because it is easier than requests_html's built-in ".find" system. Does anyone know why this is happening, or a solution to the problem?
for google_post in google_initiate():
    post_image_url = str(google_post.find(class_='gOenxf'))
    post_image_url = post_image_url[post_image_url.find('src="') + len('src="'):post_image_url.rfind('"')]
    print(post_image_url)
from bs4 import BeautifulSoup
from requests_html import HTMLSession

def google_initiate():
    url = 'https://www.google.com/search?tbm=shop&q=desk'
    session = HTMLSession()
    data = session.get(url)
    google_soup = BeautifulSoup(data.text, features='html.parser')
    google_parsed = google_soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
    google_initiate.google_parse_page = url  # was URL, which is undefined; the variable is lowercase
    session.close()
    return google_parsed
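One likely explanation is that Google serves base64 thumbnail data URLs in the static HTML and only swaps in the "encrypted" URLs with JavaScript after the page loads, so a plain session.get() never sees them. A minimal sketch of one workaround, rendering the page with requests_html before handing it to BeautifulSoup (the sleep value is a guess; the selectors are the same ones used above):

from bs4 import BeautifulSoup
from requests_html import HTMLSession

def google_initiate_rendered():
    url = 'https://www.google.com/search?tbm=shop&q=desk'
    session = HTMLSession()
    data = session.get(url)
    # render() runs the page's JavaScript so the src attributes can be rewritten
    data.html.render(sleep=2)
    google_soup = BeautifulSoup(data.html.html, 'html.parser')
    google_parsed = google_soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
    session.close()
    return google_parsed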

Related

BeautifulSoup is missing content

I'm trying to scrape the number of visitors to my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?
It's there in JSON format in a <script> tag of the HTML. You just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup
url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Grab the third <script> tag and cut out the object literal assigned to "var data"
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'",'"')
# Trim the trailing comma after the last entry so the string parses as valid JSON
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace('&nbsp', ' ')
jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)
You're not doing anything wrong, the issue is that the website is populating the <span> element using JavaScript, which runs after your request is made.
Unfortunately, the requests library cannot run JavaScript since it is a pure HTTP tool. I would recommend checking out something like Selenium which is more robust and can wait for the JavaScript to load before scraping the HTML.
You can try the requests_html module to get dynamic values that are calculated by JavaScript. I tried the logic below and it worked for me on your site.
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
This prints the result calculated by your site.
In the dev tools, under one of the <script> tags, you can see that many of those figures are generated after page load by the JavaScript function showGym(). To allow those figures to be generated, you could use a browser-driver tool like webbot or Selenium, which can wait on a page long enough for the JavaScript to execute and populate those fields. It might be possible to have requests do that, but I don't know; I've only used webbot for problems like these because it's very easy to use.
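A minimal Selenium sketch of that approach, waiting for the count to be populated before reading it (the 15-second timeout is a guess):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
driver = webdriver.Chrome()
driver.get(url)
# Wait until showGym() has written something into the count span.
WebDriverWait(driver, 15).until(lambda d: d.find_element(By.ID, 'count').text.strip() != '')
print(driver.find_element(By.ID, 'count').text)
driver.quit()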

How to scrape JavaScript page with Python

I'm trying to scrape patentsview.org but I'm having an issue. When I try to scrape this page, it doesn't work well. The site uses JavaScript to get data from its database. I tried to get the data using the requests-html package, but I didn't quite manage it.
Here's what I tried:
# Import
import re
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
# Set requests
r = session.get('https://datatool.patentsview.org/#search/assignee&asn=1|Samsung')
r.html.render()
# Set BS and print
soup = BeautifulSoup(r.html.html, "lxml")
tags = soup.find_all("div", class_='summary')
print(tags)
This code gives me this result:
# Result
[<div class="summary"></div>]
But I want this:
This is the right div, but I can't see the content of the div with my code. How can I get the div's content? I hope you understand what I mean.
Use the browser dev tools (in Chrome: F12 - Network - XHR) and find the HTTP GET that returns the data (as JSON) you are looking for.
HTTP GET https://webapi.patentsview.org/api/assignees/query?q={%22_and%22:[{%22_or%22:[{%22_and%22:[{%22_contains%22:{%22assignee_first_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_last_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_organization%22:%22Samsung%22}}]}]}]}&f=[%22assignee_id%22,%22assignee_first_name%22,%22assignee_last_name%22,%22assignee_organization%22,%22assignee_lastknown_country%22,%22assignee_lastknown_state%22,%22assignee_lastknown_city%22,%22assignee_lastknown_location_id%22,%22assignee_total_num_patents%22,%22assignee_first_seen_date%22,%22assignee_last_seen_date%22,%22patent_id%22]&o={%22per_page%22:50,%22matched_subentities_only%22:true,%22sort_by_subentity_counts%22:%22patent_id%22,%22page%22:1}&s=[{%22patent_id%22:%22desc%22},{%22assignee_total_num_patents%22:%22desc%22},{%22assignee_organization%22:%22asc%22},{%22assignee_last_name%22:%22asc%22},{%22assignee_first_name%22:%22asc%22}]
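Since that endpoint already returns JSON, you can call it directly with requests instead of rendering the page. A hedged sketch using a simplified version of the query above (the "assignees" key in the response is an assumption based on that URL):

import json
import requests

api_url = 'https://webapi.patentsview.org/api/assignees/query'
params = {
    'q': json.dumps({'_contains': {'assignee_organization': 'Samsung'}}),
    'f': json.dumps(['assignee_id', 'assignee_organization', 'assignee_total_num_patents']),
    'o': json.dumps({'per_page': 50, 'page': 1}),
}
resp = requests.get(api_url, params=params)
data = resp.json()
# The top-level "assignees" key is assumed from the endpoint name; adjust if the shape differs.
for assignee in data.get('assignees', []):
    print(assignee.get('assignee_organization'), assignee.get('assignee_total_num_patents'))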

How to take multiple image links

def get_links(statu, data, n_img, url, agent):
    if statu == 0:
        print("The website doesn't respond. Please try again later", end=" ")
    else:
        img_links = []
        r = requests.get(url, headers=agent).text
        soup = BeautifulSoup(r, "lxml")
        results = soup.find_all("div", attrs={"class": "view"})
        results = soup.find_all("div", attrs={"class": "interaction-view"})
        results = soup.find_all("div", attrs={"class": "photo-list-photo-interaction"})
        # results = soup.find_all("a", attrs={"class": "overlay"}, limit=n_img)
        print(results)
        for result in results:
            link = result.get("href")
            img_links.append(link)
        return img_links
In order to download multiple images, I am trying to get links from Flickr. To do that, I wrote the code above, and everything was fine until the line results=soup.find_all("div",attrs={"class":"photo-list-photo-interaction"}). Before that line I could get the HTML, but on that line I couldn't.
How can I solve this problem? Thank you!
Instead of scraping with BeautifulSoup, why not use the Flickr API? Alternatively, you could use Flickr's RSS feeds and parse them with the feedparser module.
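A minimal sketch of the RSS route (the public-feed URL format and the media_thumbnail field are assumptions about Flickr's Media RSS feed):

import feedparser

# Flickr's public photo feed, filtered by tag; the URL format is an assumption.
feed = feedparser.parse('https://www.flickr.com/services/feeds/photos_public.gne?tags=desk&format=rss2')
for entry in feed.entries:
    # feedparser exposes Media RSS thumbnails as media_thumbnail, when present.
    thumbs = getattr(entry, 'media_thumbnail', [])
    print(thumbs[0]['url'] if thumbs else entry.link)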
If you still want to use BeautifulSoup:
import requests
from bs4 import BeautifulSoup

def flickr_photos(url):
    img_urls = []
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    photos = soup.find_all('div', {'class': 'view'})
    for photo in photos:
        try:
            # The URL lives in the background-image style, e.g. url(//live.staticflickr.com/...)
            img = photo['style'].split('(//').pop()
            if img.startswith('live'):
                img_urls.append(f'https://{img[:-1]}')
        except KeyError:
            pass
    return img_urls
The reason your code doesn't work is that Flickr keeps the image's URL in the background-image style attribute, not in an href.
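A quick usage sketch (the search URL is just a placeholder; pass whichever Flickr page you were scraping):

# Hypothetical page; substitute the Flickr URL you were working with.
links = flickr_photos('https://www.flickr.com/search/?text=desk')
for link in links:
    print(link)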

Python scraping website with flight tickets

I am trying to extract information about flight ticket prices with a Python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script, and my problem is that I am not sure how to get the right parts out of the code behind the page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
ULR = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container: bottom_container_root comes back empty, even though it is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but it happens to be one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup
html = open("small.html").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.head.next_element)
print(soup.head.next_element.next_element)

Scraping File Paths from GitHub Repo yields 400 Response, but viewing in browser works fine

I’m trying to scrape all the file paths from links like this: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.
The link above contains a data-url attribute in the HTML (table, id='tree-finder-results', class='tree-browser css-truncate'), which is used to make a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd
which displays this dictionary:
{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}
when you view it in a browser like Chrome. However, a GET request yields a <Response [400]>.
Here is the code I used:
import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + ('/{}/{}/find/master'.format(username, repo))
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)
Not sure what is wrong with it. My theory is that the first link creates a cookie that allows the second link to show the file paths of the repo we are targeting. I'm just unsure how to achieve this via code. Would really appreciate some pointers!
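One way to test that cookie theory is to reuse a single requests.Session, so any cookies set by the first page are sent with the second request; this is only a sketch of that idea, and whether it is actually enough to avoid the 400 is untested:

import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'

# One Session carries cookies from the first response into the second request.
session = requests.Session()
html = session.get(ghURL + '/{}/{}/find/master'.format(username, repo))
soup = BeautifulSoup(html.text, 'lxml')
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = session.get(fileLinksURL)
print(filePaths.status_code, filePaths.text[:200])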
Give this a go. The links containing the .py files are generated dynamically, so to catch them you need to use Selenium. I think this is what you expected.
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))
Partial results:
https://github.com/themichaelusa/Trinitum/blob/master
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE
https://github.com/themichaelusa/Trinitum/blob/master/README.md
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py
