BS fails to get section id in Selenium retrieved page - python

I have a problem with the following code:
import re
from lxml import html
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import requests
import sys
import datetime

print('start!')
print(datetime.datetime.now())

list_file = 'list2.csv'
# This should be the regular input list
url_list = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]
# This is an example input instead

binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe')
# Read somewhere this could be a useful variable to supply, but the program still fails
# randomly at times with [WinError 6] Invalid Descriptor, with nothing different from the
# runs where it at least gets the webpage (even when it can't perform further operations).

for page in url_list:
    print(page)
    browser = webdriver.Firefox(firefox_binary=binary)
    # I tried this too to solve the [WinError 6] but it is not working
    browser.get(page)
    print("TEST BEGINS")
    soup = BS(browser.page_source, "lxml")
    soup = soup.find("summaries")
    # This fails here: it finds nothing, while there is a section id named "summaries".
    # soup.find_all("p") works, but I don't want all the p's outside of summaries.
    print(soup)  # It prints "None" indeed.
    print("TEST ENDS")
I am positive the source code includes "summaries". First there is
<li> Summaries</li>
then there is
<section id="summaries" data-ga-label="Summaries" data-section="Summaries">
As suggested here (Webscraping in python: BS, selenium, and None error) by @alexce, I tried
summary = soup.find('section', attrs={'id':'summaries'})
(Edit: the suggestion was _summaries, but I tested summaries too)
but it does not work either.
So my questions are:
why does BS not find the summaries, and why does Selenium keep breaking when I run the script many times in a row (restarting the console works, on the other hand, but this is tedious) or with a list of more than four entries?
Thanks

This:
summary = soup.find('section', attrs={'id':'_summaries'})
searches for a section element whose id attribute is set to _summaries:
<section id="_summaries" />
There is no element with that attribute value in the page.
The one that you want is probably <section id="summaries" data-ga-label="Summaries" data-section="Summaries">, and it can be matched with:
results = soup.find('section', id='summaries')
Also, a side note on why you are using Selenium: the page returns an error if you do not forward cookies, so in order to use requests you need to send them.
My full code:
from __future__ import unicode_literals

import re
import requests
from bs4 import BeautifulSoup as BS


data = requests.get(
    'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3',
    cookies={
        'nlbi_146342': '+fhjaf6NSntlOWmvFHlFeAAAAAAwHqv5tJUsy3kqgNQOt77C',
        'visid_incap_146342': 'tEumui9aQoue4yMuu9tuUcly6VYAAAAAQUIPAAAAAABcQsCGxBC1gj0OdNFoMEx+',
        'incap_ses_189_146342': 'bNY8PNPZJzroIFLs6nefAspy6VYAAAAAYlWrxz2UrYFlrqgcQY9AuQ=='
    }).content

soup = BS(data, 'html.parser')
results = soup.find_all(string=re.compile('summary', re.I))
print(results)
summary_re = re.compile('summary', re.I)
results = soup.find('section', id='summaries')
print(results)

The element is probably not yet on the page. I would wait for the element before parsing the page source with BS:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
soup = BS(driver.page_source,"lxml")
I noticed that you never call driver.quit(); this may be the reason for your breakage issues. So make sure to call it, or try to reuse the same session.
And to make it more stable and performant, I would try to work with the Selenium API as much as possible, since pulling and parsing the full page source is expensive.
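For instance, a minimal sketch that reads the section's text through the Selenium API directly (the id "summaries" is the one from the markup you quoted), so no BeautifulSoup round-trip is needed:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
    wait = WebDriverWait(driver, 10)
    # Wait for the section and read its text straight from the element,
    # instead of pulling page_source and re-parsing it.
    summaries = wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
    print(summaries.text)
finally:
    driver.quit()  # always release the browser, even when something above fails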

Related

BeautifulSoup saying tag has no attributes while looking for sibling or parent tags

I'm attempting to extract the tax bill for the following web page (http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051).
The tax bill is the $8,084.54 value directly following the Taxes & Assessments string.
I need to use some static object to go off of because the code will be working over multiple pages.
The "Taxes & Assessments" string is a constant between all pages and always precedes the full tax bill, while the tax bill changes between pages.
My thought was that I could find the "Taxes & Assessments" string, then traverse the BeautifulSoup tree to find the tax bill. This is my code:
soup = BeautifulSoup(html_content,'html.parser') #Soupify the HTML content
tagTandA = soup.body.find(text = "Taxes & Assessments")
taxBill = tagTandA.find_next_sibling.text
This returns an error of:
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
In fact, any call to parent, next_sibling, find_next_sibling, or anything of the sort returns this "object has no attribute" error.
I have tried looking for other explicit text, just to test that it's not this specific text that is giving me an issue, and the no-attribute error is still thrown.
When running just the following code, it returns "None":
tagTandA = soup.body.find(text = "Taxes & Assessments")
How can I find the "Taxes & Assessments" tag in order to navigate the tree to find and return the Tax Bill?
If I'm not mistaken, you're trying to use a requests & bs4 based solution to scrape a (very) JS-heavy website, with a redirect and some iframes.
I don't think it will work.
Here is one way of getting that information using Selenium (you can improve the hardcoded wait if you want, I just didn't have time to fiddle; there are also some unused imports you can get rid of):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051'
driver.get(url)
t.sleep(15)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@name="body"]')))
total_taxes = wait.until(EC.element_to_be_clickable((By.XPATH, "//font[contains(text(), 'Taxes & Assessments')]/ancestor::td/following-sibling::td")))
print('Tax bill: ', total_taxes.text)
Result in terminal:
Tax bill: $8,084.54
See Selenium documentation for more details.
A very kind person (u/commandlineuser) answered the question in BS code here: https://www.reddit.com/r/learnpython/comments/10hywbs/beautifulsoup_saying_tag_has_no_attributes_while/
Here's said code:
import re
import requests
from bs4 import BeautifulSoup

url = ""
r1 = requests.get(url)
soup1 = BeautifulSoup(r1.content, "html.parser")

base = r1.url[:r1.url.rfind("/") + 1]

href1 = soup1.find("frame").get("src")
r2 = requests.get(base + href1)
soup2 = BeautifulSoup(
    r2.content
      .replace(b"<!--", b"")  # basic attempt at stripping comments
      .replace(b"-->", b""),
    "html.parser"
)

href2 = soup2.find("voicemax").get_text(strip=True)
r3 = requests.get(base + href2)
soup3 = BeautifulSoup(r3.content, "html.parser")

total = (
    soup3.find(text=re.compile("Taxes & Assessments"))
         .find_next()
         .get_text(strip=True)
)
print(total)
find_next_sibling is a method, so call it: find_next_sibling().
There is a similarly named property (next_sibling), so I can see the confusion.
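To illustrate the difference on a tiny made-up snippet (not the tax page itself):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>Taxes & Assessments</td><td>$8,084.54</td>", "html.parser")
cell = soup.find("td")
print(cell.next_sibling.get_text())         # property: the next node in the tree
print(cell.find_next_sibling().get_text())  # method: needs the parentheses
# Without the parentheses, cell.find_next_sibling is just the bound method object,
# so chaining .text onto it raises AttributeError.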

Selenium dynamic scraping code only working when ran multiple times in python

I have been trying to dynamically scrape a news website with Python and return the text of the live headlines. For now, I have decided to just return the divs. I have had occasional success: if I run the code at least three times in quick succession, it returns what I am looking for. However, when run once, it returns a "Loading articles..." text instead of the headlines. I have tried adding a delay (thinking it might be the connection, or the articles still loading in the driver-controlled browser, but that wasn't the case). Any suggestions?
Here's the code:
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen
from selenium import webdriver
import time

url = 'https://newsfilter.io/latest/merger-and-acquisitions'
browser = webdriver.Chrome('C:\\Users\\sam\\Documents\\chromedriver_win32\\chromedriver.exe')
browser.get(url)
sauce = browser.execute_script('return document.documentElement.outerHTML')
browser.quit()
soup = bs.BeautifulSoup(sauce, 'lxml')
for i in soup.find_all('div'):
    print(i.text)
The contents are loaded dynamically, so I would poll the page continuously and scrape the data like this:
# SWAMI KARUPPASWAMI THUNNAI
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen
from selenium import webdriver
import time

url = 'https://newsfilter.io/latest/merger-and-acquisitions'
browser = webdriver.Chrome()
browser.get(url)
unique_div = []
while True:
    soup = bs.BeautifulSoup(browser.page_source, 'html.parser')
    for i in soup.find_all('div'):
        if i.text not in unique_div:
            unique_div.append(i.text)
            print(i.text)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
unique_div will contain only unique elements.
Note: the above program does not know when to stop, so you can compare the length of unique_div before and after each scrape; if the length stays the same, no new elements were found on the page. Something like this:
unique_div = []
while True:
    previous_length = len(unique_div)
    time.sleep(3)
    soup = bs.BeautifulSoup(browser.page_source, 'html.parser')
    for i in soup.find_all('div'):
        if i.text not in unique_div:
            unique_div.append(i.text)
            # print(i.text)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    after_scraped = len(unique_div)
    if previous_length == after_scraped:
        print("Scraped Everything")
        break
I would go for WebDriverWait instead of time.sleep(secs) anyway; see Waits in Selenium.
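A rough sketch of that idea for this page (the CSS selector is a placeholder, not taken from the real page; inspect it and use whatever marks a loaded headline):
import bs4 as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://newsfilter.io/latest/merger-and-acquisitions')
wait = WebDriverWait(browser, 20)
# Block until at least one headline element is present instead of sleeping blindly.
# "a.card-link" is a hypothetical selector used only for illustration.
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.card-link")))
soup = bs.BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()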

how to find proper xpath for selenium?

I'm trying to scrape this page: https://www.bitmex.com/app/trade/XBTUSD
to get the Open Interest data on the left side of the page. I am at this stage:
import bs4
from bs4 import BeautifulSoup
import requests
import re
from selenium import webdriver
import urllib.request
r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
url = "https://www.bitmex.com/app/trade/XBTUSD"
page = urllib.request.urlopen('https://www.bitmex.com/app/trade/XBTUSD')
soup = bs4.BeautifulSoup(r.text, 'xml')
resultat = soup.find_all(text=re.compile("Open Interest"))
driver = webdriver.Firefox(executable_path='C:\\Users\\Samy\\Desktop\\geckodriver\\geckodriver.exe')
results = driver.find_elements_by_xpath("//*[@class='contractStats hoverContainer block']//*[@class='value']/html/body/div[1]/div/span/div[1]/div/div[2]/li/ul/div/div/div[2]/div[4]/span[2]/span/span[1]")
print(len(results))
I get 0 as a result. I tried several different things for the results variable (also driver.find_elements_by_xpath("//span[@class='price']/text()")), but can't seem to find the way. I know the problem is with how I copy the XPath, but can't seem to understand the issue clearly despite reading Why does this xpath fail using lxml in python? and https://stackoverflow.com/a/43095252/7937578
I was using only the copied XPath, but after reading those SO questions I added the [@class=...] part at the beginning; I'm still missing something. Thank you if you know how to help!
If I understood your requirements correctly, the following script should fetch you the required content from that page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.bitmex.com/app/trade/XBTUSD"

with webdriver.Firefox() as driver:
    driver.get(link)
    wait = WebDriverWait(driver, 10)
    items = [item.text for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[@class='lineItem']/span[@class='hoverHidden'][.//*[contains(.,'Open Interest')]]//span[@class='key' or @class='value']")))]
    print(items)
Output at this moment:
['Open Interest', '640,089,423 USD']
I don't know why it fails, but I think the best way to find any element is by its full XPath.
Something that looks like this:
homebutton = driver.find_element_by_xpath("/html/body/header/div/div[1]/a[2]/span")
Give it a try.
A full path is not the best option, and it is also harder to read. An XPath is a filter: try to find some unique attributes of the control you need, or a unique description of its parent. Here, the span you need has the 'value' class and sits inside a span with the 'tooltipWrapper' class; that parent span also has another child with the 'key' class and the text 'Open Interest'. There are thousands of possible locators; I can suggest two:
//span[@class = 'tooltipWrapper' and span[string() = 'Open Interest']]//span[@class = 'value']
//span[@class = 'key' and text() = 'Open Interest']/..//span[@class = 'value']
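For instance, plugging the first locator into an explicit wait (a sketch only, not verified against the live page):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.bitmex.com/app/trade/XBTUSD")
wait = WebDriverWait(driver, 10)
# Filter by attributes and text instead of relying on a brittle absolute path.
open_interest = wait.until(EC.presence_of_element_located((
    By.XPATH,
    "//span[@class = 'tooltipWrapper' and span[string() = 'Open Interest']]//span[@class = 'value']")))
print(open_interest.text)
driver.quit()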

Beautiful Soup 4 findall() not matching elements from the <img> tag

I am trying to use Beautiful Soup 4 to help me download an image from Imgur, although I doubt the Imgur part is relevant. As an example, I'm using the webpage here: https://imgur.com/t/lenovo/mLwnorj
My code is as follows:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
res = requests.get("https://imgur.com/t/lenovo/mLwnorj")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
imageElement = soup.findAll('img', {'class': 'post-image-placeholder'})
print(imageElement)
The HTML code on the Imgur link contains a part that reads as:
<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder" style="max-width: 100%; min-height: 546px;" original-title="">
which I found by picking the first image element on the page using the point and click tool in Inspect Element.
The problem is that I would expect there to be two items in imageElement, one for each image; however, the print function shows []. I have also tried other forms of soup.findAll('img', {'class': 'post-image-placeholder'}), such as soup.findall("img[class='post-image-placeholder']"), but that made no difference.
Furthermore, when I used
imageElement = soup.select("h1[class='post-title']")
just to test, the print function did return a match, which made me wonder if it had something to do with the tag:
[<h1 class="post-title">Cable management increases performance. </h1>]
Thank you for your time and effort
The fundamental problem here seems to be that the actual <img ...> element is not present when the page is first loaded. The best solution to this, in my opinion, would be to take advantage of the selenium webdriver that you already have available to grab the image. Selenium will allow the page to properly render (with JavaScript and all), and then locate whatever elements you care about.
For example:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
# For pretty debugging output
import pprint
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
# Give the page up to 10 seconds of a grace period to finish rendering
# before complaining about images not being found.
browser.implicitly_wait(10)
# Find elements via Selenium's search
selenium_image_elements = browser.find_elements_by_css_selector('img.post-image-placeholder')
pprint.pprint(selenium_image_elements)
# Use page source to attempt to find them with BeautifulSoup 4
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
soup_image_elements = soup.findAll('img', {'class': 'post-image-placeholder'})
pprint.pprint(soup_image_elements)
I cannot say that I have tested this code yet on my side, but the general concept should work.
Update:
I went ahead and tested this on my side, fixed some errors in the code, and got the results I was hoping to see.
If a website inserts objects after page load, you will need to use Selenium instead of requests.
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://imgur.com/t/lenovo/mLwnorj'
browser = webdriver.Firefox()
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
images = soup.find_all('img', {'class': 'post-image-placeholder'})
[print(image['src']) for image in images]
# //i.imgur.com/JfLsH5yr.jpg
# //i.imgur.com/lLcKMBzr.jpg

Print elements using dt class name selenium python

I am trying to write a simple scraper for Sales Navigator in LinkedIn, and this is the link I am trying to scrape. It has search results for specific filter options selected for account results.
The goal I am trying to achieve is to retrieve every company name among the search results. Upon inspecting the link elements carrying the company name (e.g. Facile.it, AGT international), I see the following markup, showing the dt class name:
<dt class="result-lockup__name">
<a id="ember208" href="/sales/company/2429831?_ntb=zreYu57eQo%2BSZiFskdWJqg%3D%3D" class="ember-view"> Facile.it
</a> </dt>
I basically want to retrieve those names and open the URL in the href.
It can be noted that all the company name links have the same dt class, result-lockup__name. The following script is an attempt to collect the list of all company names displayed in the search results along with their elements:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os


def scrape_accounts(url):
    url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    # driver = webdriver.Firefox()
    # driver.implicitly_wait(30)
    driver.get(url)

    search_results = []
    search_results = driver.find_elements_by_class_name("result-lockup__name")
    print(search_results)


if __name__ == "__main__":
    scrape_accounts("lol")
However, the result prints an empty list. I am trying to learn how to scrape different parts of a web page and different elements, so I am not sure whether I got this right. What would be the right way?
I'm afraid I can't get to the page that you're after, but I notice that you're importing Beautiful Soup without using it.
Try:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"


def scrape_accounts(url=url):
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    # driver = webdriver.Firefox()
    # driver.implicitly_wait(30)
    driver.get(url)
    html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
    soup = BeautifulSoup(html, 'html.parser')
    search_results = soup.select('dt.result-lockup__name a')
    for link in search_results:
        print(link.text.strip(), link['href'])
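One small note if you copy this as-is: the function is defined but never called, and the driver is never quit. Mirroring the entry point from your original script (the driver.quit() suggestion is my addition, not part of the answer above):
if __name__ == "__main__":
    scrape_accounts()
    # Consider ending scrape_accounts() with driver.quit() so the PhantomJS
    # process is released once the links have been printed.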
