I can't seem to scrape hrefs or text from web - python

I've tried various methods, including searching via the full XPath and finding the element by class and then searching for the href via the a tag. I've been at it for hours and I just can't seem to get it. Any help would be greatly appreciated.
from selenium import webdriver
import time

namelist = []
driver = webdriver.Chrome()
driver.get("https://waxpeer.com/")
time.sleep(15)
# when I use .text it just doesn't work
search = driver.find_elements_by_xpath('//*[@id="container"]/div/div/a')
print(search)
namelist.append(search)

Hey, you can use the time module to add a delay so the elements have loaded and Selenium can find the associated tags. As you mentioned, with this code you can extract the href links from the container div.
Code:
from selenium import webdriver
import time

path = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://waxpeer.com/")
time.sleep(10)
main_div = driver.find_elements_by_xpath('//*[@id="container"]')
for div in main_div:
    links = div.find_elements_by_tag_name("a")
    for link in links:
        print(link.get_attribute('href'))
Output:
https://waxpeer.com/sport-gloves-vice-field-tested/item/21642893513
...
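A fixed time.sleep() works, but an explicit wait is usually more reliable because it returns as soon as the elements appear. A minimal sketch of the same idea using WebDriverWait, reusing the container XPath from the question (adjust it if the site's markup differs):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://waxpeer.com/")
# Wait up to 15 seconds for the anchors to be present instead of sleeping blindly
links = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@id="container"]/div/div/a'))
)
for link in links:
    print(link.get_attribute('href'), link.text)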

Related

Python Search multiple html files in a variable

I have used the Selenium driver to crawl through many site pages. Every time I get a new page I append its HTML to a variable called "All_APP_Pages", so that variable holds the HTML for many pages. I did not post that code because it's long and not relevant to the issue. Python reports "All_APP_Pages" as being of type bytes.
from lxml import etree

dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
print(link)
Once all pages have been scanned, I need to get the link from this XPath:
"//tr[.//span[contains(.,'Product Data Solutions (ABC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
The XPath listed here works. However, it only works with the Selenium driver if the driver is on the page where the link exists. That is why all pages are in one variable: I don't know which page the link will be on. The print shows this result:
[<Element a at 0x1c39dea1180>]
How do I get this value from link so I can check whether the value is correct?
You need to iterate the list and read the href attribute of each element:
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
hrefs=[l.attrib["href"] for l in link]
print(hrefs)
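Each item in link is an lxml element, so besides the href attribute you can also read its text if you want to sanity-check which link you matched; for example:

# Each item in `link` is an lxml Element; attrib behaves like a dict
for l in link:
    print(l.attrib.get("href"), l.text)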

Beautiful Soup 4 findall() not matching elements from the <img> tag

I am trying to use Beautiful Soup 4 to help me download an image from Imgur, although I doubt the Imgur part is relevant. As an example, I'm using the webpage here: https://imgur.com/t/lenovo/mLwnorj
My code is as follows:
import webbrowser, time, sys, requests, os, bs4  # Not all libraries are used in this code snippet
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
res = requests.get("https://imgur.com/t/lenovo/mLwnorj")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
imageElement = soup.findAll('img', {'class': 'post-image-placeholder'})
print(imageElement)
The HTML code on the Imgur link contains a part that reads as:
<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder" style="max-width: 100%; min-height: 546px;" original-title="">
which I found by picking the first image element on the page using the point and click tool in Inspect Element.
The problem is that I would expect there to be two items in imageElement, one for each image; however, the print function shows []. I have also tried other forms of soup.findAll('img', {'class': 'post-image-placeholder'}), such as soup.findall("img[class='post-image-placeholder']"), but that made no difference.
Furthermore, when I used
imageElement = soup.select("h1[class='post-title']")
just to test, the print function did return a match, which made me wonder if it had something to do with the tag.
[<h1 class="post-title">Cable management increases performance. </h1>]
Thank you for your time and effort
The fundamental problem here seems to be that the actual <img ...> element is not present when the page is first loaded. The best solution to this, in my opinion, would be to take advantage of the selenium webdriver that you already have available to grab the image. Selenium will allow the page to properly render (with JavaScript and all), and then locate whatever elements you care about.
For example:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
# For pretty debugging output
import pprint
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
# Give the page up to 10 seconds of a grace period to finish rendering
# before complaining about images not being found.
browser.implicitly_wait(10)
# Find elements via Selenium's search
selenium_image_elements = browser.find_elements_by_css_selector('img.post-image-placeholder')
pprint.pprint(selenium_image_elements)
# Use page source to attempt to find them with BeautifulSoup 4
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
soup_image_elements = soup.findAll('img', {'class': 'post-image-placeholder'})
pprint.pprint(soup_image_elements)
I cannot say that I have tested this code yet on my side, but the general concept should work.
Update:
I went ahead and tested this on my side, fixed some errors in the code, and got the results I was hoping to see.
If a website inserts objects after page load, you will need to use Selenium instead of requests.
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://imgur.com/t/lenovo/mLwnorj'
browser = webdriver.Firefox()
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
images = soup.find_all('img', {'class': 'post-image-placeholder'})
for image in images:
    print(image['src'])
# //i.imgur.com/JfLsH5yr.jpg
# //i.imgur.com/lLcKMBzr.jpg
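Since the original goal was to download the image, note that these src values are protocol-relative (they start with //). A small follow-up sketch, assuming requests is acceptable for fetching the file itself:

import requests

for image in images:
    url = "https:" + image['src']       # src is protocol-relative ("//i.imgur.com/...")
    filename = url.rsplit("/", 1)[-1]   # e.g. "JfLsH5yr.jpg"
    res = requests.get(url)
    res.raise_for_status()
    with open(filename, "wb") as f:
        f.write(res.content)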

Web scraping using selenium and beautifulsoup.. trouble in parsing and selecting button

I am trying to scrape the website url='https://angel.co/life-sciences'. The site contains more than 8000 entries. From this page I need information such as the company name and link, joined date, and number of followers. Before that, I need to sort the followers column by clicking it, then load more entries by clicking the "more hidden" button. That button can be clicked at most 20 times; after that the page doesn't load any more entries, but by sorting first I can at least capture the top entries by followers. I have implemented the click() event here, but it's showing an error:
Unable to locate element: {"method":"xpath","selector":"//div[@class="column followers sortable sortable"]"} #before edit this was my problem, using wrong class name
So do I need to give more sleep time here? (I tried that, but got the same error.)
After parsing all of the above information, I need to visit each company's individual link and scrape only the content div of that HTML page.
Please suggest a way to do it.
Here is my current code; I have not yet added the HTML parsing part using BeautifulSoup.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
#import urllib2

driver = webdriver.Chrome()
url = 'https://angel.co/life-sciences'
driver.get(url)
sleep(10)
driver.find_element_by_xpath('//div[@class="column followers sortable"]').click()  # edited
sleep(5)
for i in range(2):
    driver.find_element_by_xpath('//div[@class="more hidden"]').click()
    sleep(8)
sleep(8)
element = driver.find_element_by_id("root").get_attribute('innerHTML')
#driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'more hidden')))
'''
results = html.find_elements_by_xpath('//div[@class="name"]')
# wait for the page to load
for result in results:
    startup = result.find_element_by_xpath('.//a')
    link = startup.get_attribute('href')
    print(link)
'''
page_source = driver.page_source
html = BeautifulSoup(element, 'html.parser')
#for link in html.findAll('a', {'class': 'startup-link'}):
#    print(link)
divs = html.find_all("div", class_=" dts27 frw44 _a _jm")
The above code was working and giving me the HTML source before I added the followers click event.
My final goal is to export all five pieces of information (the company name, its link, the joined date, the number of followers, and the company description, which is obtained after visiting the individual links) into a CSV or XLS file.
Help and comments are appreciated.
This is my first Python and Selenium work, so I am a little confused and need guidance.
Thanks :-)
The click method is intended to emulate a mouse click; it's for use on elements that can be clicked, such as buttons, drop-down lists, check boxes, etc. You have applied this method to a div element which is not clickable. Elements like div, span, frame and so on are used to organise HTML and provide for decoration of fonts, etc.
To make this code work you will need to identify the elements in the page that are actually clickable.
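For example, a common pattern is to wait until the target is actually clickable before calling click(). A minimal sketch, reusing the XPath from the question (it may still need adjusting to point at a clickable element):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds until the element can receive a click
wait = WebDriverWait(driver, 10)
button = wait.until(
    EC.element_to_be_clickable((By.XPATH, '//div[@class="column followers sortable"]'))
)
button.click()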
Oops, a typing mistake or some silly mistake on my part: I was using the wrong div class name. It is "column followers sortable", but I was using "column followers sortable selected". :-(
Now the above works pretty well, but can anyone guide me with the BeautifulSoup HTML parsing part?
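A rough sketch of that parsing step; note that "startup-link" is an assumed class name taken from the commented-out selector in the question, not verified against the live page:

import csv
from bs4 import BeautifulSoup

# `element` is the innerHTML grabbed in the question's code
soup = BeautifulSoup(element, 'html.parser')

rows = []
# "startup-link" is an assumption; inspect the live page to confirm it
for a in soup.find_all('a', {'class': 'startup-link'}):
    rows.append((a.get_text(strip=True), a.get('href')))

# Write the collected (name, link) pairs to a CSV file
with open('companies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'link'])
    writer.writerows(rows)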

Generate a list from HTML elements using Python

I am using Selenium and BeautifulSoup to create a few lists from Wikipedia pages. When I look at the page source, the links I want to get the information from are always structured as:
<li>town_name, state</li>
There is a link within the tag that you can click on that will direct you to that town's wiki page. It is always /wiki/town_name,_California
I want to use a for loop in Python to find every item with this structure, but I am unclear on how to write the regular expression. I tried:
my_link = "//wiki//*,California"
and
my_link = "//wiki//*,_California"
But when I tried to run:
br.find_element_by_link_text(my_link)
These returned similar errors:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//wiki//*,_California"}
I also tried:
import selenium, time
import html5lib
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Chrome()
url = "http://somewikipage.org"
br.get(url)
pg_src = br.page_source.encode("utf-8")
soup = BeautifulSoup(pg_src, "html.parser")
lnkLst = []
for lnk in br.find_element_by_partial_link_text(",_California"):
    lnkLst.append(lnk)
and got this:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":",_California"}
Is there any way I can correct this code so I can build a list of my targeted links?
As you mentioned in your question, br.find_element_by_partial_link_text(",_California") didn't work; that's because ,_California is not really the link text, per the HTML you provided.
As per your question, we need to find the <a> tag whose href attribute is "/wiki/town_name,_California". You can use either of the following options:
css_selector:
br.find_element_by_css_selector("a[href='/wiki/town_name,_California']")
xpath:
br.find_element_by_xpath("//a[#href='/wiki/town_name,_California']")
Read up on CSS selectors; they are your friend. I think the following should work:
hrefs = [a['href'] for a in soup.select('li a[href^="/wiki/"]')]
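If you then want only the California towns, you can filter on the href suffix; for example:

# Keep only links whose href ends with ",_California"
california = [href for href in hrefs if href.endswith(",_California")]
print(california)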

Parsing HTML Content using BeautifulSoup & Selenium

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import csv
import requests
import re

driver2 = webdriver.Chrome()
driver2.get("http://www.squawka.com/match-results?ctl=10_s2015")
soup = BeautifulSoup(driver2.page_source, "html.parser")
print(soup)
driver2.quit()
driver2.quit()
I'm trying to get the href of every td with class "match-centre", and I need to use Selenium to navigate through the pages, but I'm struggling to combine the two so I can change the menu options and move through the different pages while feeding the links into my other code.
I've researched and tried ('inner-html') and the page_source currently in the code, but neither gets any of the links I need.
Does anyone have a solution for getting these links and navigating the page? Could there be a way to get the XML of this page to extract all the links?
Not sure why you would need BeautifulSoup (BS) here; Selenium alone is capable of locating elements and navigating through links on a page. For example, to get all the links to the match details pages, you can do as follows:
>>> matches = driver.find_elements_by_xpath("//td[@class='match-centre']/a")
>>> print([match.get_attribute("href") for match in matches])
As for navigating through the pages, you can use the following XPath:
//span[contains(@class,'page-numbers')]/following-sibling::a[1]
The above XPath finds the link to the next page. To navigate through all the pages, you can use a while loop: while the link to the next page is found (see the sketch after this list),
perform a click action on the link,
grab all the hrefs from the current page,
locate the next page link.
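A minimal sketch of that loop, assuming the two XPaths above are correct for the site; a NoSuchElementException ends the loop when there is no next page:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("http://www.squawka.com/match-results?ctl=10_s2015")

all_links = []
while True:
    # Grab all the match-centre hrefs on the current page
    matches = driver.find_elements_by_xpath("//td[@class='match-centre']/a")
    all_links.extend(match.get_attribute("href") for match in matches)
    try:
        # Locate the next-page link; if it is missing, we are on the last page
        next_page = driver.find_element_by_xpath(
            "//span[contains(@class,'page-numbers')]/following-sibling::a[1]")
    except NoSuchElementException:
        break
    next_page.click()

print(all_links)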
