I am scraping public LinkedIn data from specific people.
Here is the code inside my while loop. For context, I used time.sleep() for the first 400 profile URLs and it worked. However, it is not working anymore and makes my Firefox browser crash. I am fairly sure the bug comes from the time.sleep() call, which I tried to replace with implicitly_wait() and WebDriverWait, but none of those attempts worked.
Here is the code inside the while loop, with the time.sleep() calls that worked for around 400 URLs:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
browser = webdriver.Firefox()
browser.get("https://www.linkedin.com/uas/login")
time.sleep(4)
username = browser.find_element_by_id("session_key-login")
password = browser.find_element_by_id("session_password-login")
username.send_keys("yourmail")
password.send_keys("yourpassword")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
time.sleep(4)
browser.get("the profile link I want to scrape")
html = browser.page_source
soup = BeautifulSoup(html,"html.parser")
formation = soup.find_all('div', {'class': "education"})
nom = soup.find_all('span', {'class': "full-name"})
for a in nom:
    for b in formation:
        print(a.text, b.text)
time.sleep(4)
browser.close()
I tried to replace time.sleep() with implicitly_wait(), but it is not working; the browser does not wait at all.
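From what I understand, implicitly_wait() is set once on the driver and only affects find_element* lookups, not browser.get() or page_source, which may be why it never seems to wait for me. For reference, here is how I set it (just a sketch):

from selenium import webdriver

browser = webdriver.Firefox()
# Applies only to find_element* calls on this driver, nothing else.
browser.implicitly_wait(10)
browser.get("https://www.linkedin.com/uas/login")
# This lookup polls for up to 10 seconds before raising NoSuchElementException.
username = browser.find_element_by_id("session_key-login")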
I also tried this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("the profile url I wanna scrap")
delay = 30 # seconds
try:
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'education')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
But it is still not working.
Do you have any idea how to solve this issue?
If I could make the browser wait without using time.sleep() (which makes my browser crash) and without any explicit conditions, that would be amazing!
One other question: if I use Chrome instead of Firefox, do I have a chance of overcoming the problem?
Thanks for your answers,
Raphaël
With WebDriverWait, the browser waits but Firefox crashes again. Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://www.linkedin.com/uas/login")
username = browser.find_element_by_id("session_key-login")
password = browser.find_element_by_id("session_password-login")
username.send_keys("mail")
password.send_keys("password")
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
try:
    element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "content")))
finally:
    browser.get("the linkedin profile to scrape")
html = browser.page_source
soup = BeautifulSoup(html,"html.parser")
formation = soup.find_all('div', {'class': "education"})
nom = soup.find_all('span', {'class': "full-name"})
for a in nom:
    for b in formation:
        print(a.text, b.text)
browser.close()
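For reference, this is the pattern I think I need: waiting for a profile-page element after the second browser.get() rather than before it. A sketch, using the same imports and logged-in browser as above (the class name comes from my soup.find_all call, so it may not match the live page):

browser.get("the linkedin profile to scrape")
try:
    # Wait until the profile name is in the DOM before grabbing the page source.
    WebDriverWait(browser, 15).until(
        EC.presence_of_element_located((By.CLASS_NAME, "full-name")))
except TimeoutException:
    print("Profile page took too long to load")
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")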
I've been at this for hours and haven't made any progress. I'm trying to click on the next button on this page here
Here's my code:
#!/usr/local/bin/python3
import sys
import time
import re
import logging
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as options
from bs4 import BeautifulSoup as bs
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
_USE_VIRTUAL_DISPLAY = False
_FORMAT = '%(asctime)s - %(levelname)s - %(name)s - %(message)s'
# logging.basicConfig(filename=LOG_FILENAME,level=logging.DEBUG)
logging.basicConfig(format=_FORMAT, level=logging.INFO)
_LOGGER = logging.getLogger(sys.argv[0])
_DEFAULT_SLEEP = 0.5
try:
    options = options()
    # options.headless = True
    driver = webdriver.Firefox(options=options, executable_path=r"/usr/local/bin/geckodriver")
    print("Started Browser and Driver")
except:
    _LOGGER.info("Can not run headless mode.")
url = 'https://www.govinfo.gov/app/collection/uscourts/district/alsd/2021/%7B%22pageSize%22%3A%22100%22%2C%22offset%22%3A%220%22%7D'
driver.get(url)
time.sleep(5)
page = driver.page_source
soup = bs(page, "html.parser")
next_page = WebDriverWait(driver,5).until(EC.element_to_be_clickable((By.XPATH,'//*[@id="collapseOne1690"]/div/span[1]/div/ul/li[8]/a')))
if next_page:
    print('*****getting next page*****')
    # driver.execute_script('arguments[0].click()', next_page)
    next_page.click()
    time.sleep(3)
else:
    print('no next page')
driver.quit()
I get a timeout error. I've tried changing the XPath. I've tried ActionChains to scroll into view and none have worked. Any help appreciated.
1. Your XPath does not work because it relies on the dynamic id collapseOne1690, as was mentioned earlier.
It would not be stable even if you used only part of that id.
If you prefer XPaths, I'd suggest this one: //span[@class='custom-paginator']//li[@class='next fw-pagination-btn']/a or just //li[@class='next fw-pagination-btn']/a. You can also use the CSS selector: .next.fw-pagination-btn
2. I removed the logging code because it also has some issues; re-check it.
3. A 5-second explicit wait is too short. Make it at least 10 seconds, better 15. It's just a suggestion.
The smallest reproducible code which clicks the button and uses Firefox is:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as options
from bs4 import BeautifulSoup as bs
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
options = options()
# options.headless = True
driver = webdriver.Firefox(options=options)
print("Started Browser and Driver")
url = 'https://www.govinfo.gov/app/collection/uscourts/district/alsd/2021/%7B%22pageSize%22%3A%22100%22%2C%22offset%22%3A%220%22%7D'
driver.get(url)
page = driver.page_source
soup = bs(page, "html.parser")
print(soup)
next_page = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.XPATH, "//span[@class='custom-paginator']//li[@class='next fw-pagination-btn']/a")))
next_page.click()
# driver.quit()
It appears when I load this page that the div ids are assigned dynamically. The first time I loaded the page, the id was collapseOne5168; the second time it was collapseOne1136.
You might consider using find_element_by_class_name("next fw-pagination-btn") instead?
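One caveat: a class-name locator generally cannot take a compound (space-separated) class like that, so a CSS selector is the safer form. A minimal sketch, assuming the same pagination markup discussed above:

# Same idea as a CSS selector, which copes with both classes on the <li>.
next_link = driver.find_element_by_css_selector("li.next.fw-pagination-btn > a")
next_link.click()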
I'm trying to extract real estate listing info from a site with selenium and beautiful soup, following this tutorial: https://medium.com/@ben.sturm/scraping-house-listing-data-using-selenium-and-beautiful-soup-1cbb94ba9492
The aim is to gather all the href links from the first page, then find the 'next page' button, navigate to the next page, collect all the links on that page, and so on.
I tried to achieve this with a single function that repeats for each page, but I can't figure out why it's not working. I'm new to coding and this seems too trivial a problem for me not to have found an answer yet. I would appreciate any help.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import sys
import numpy as np
import pandas as pd
import regex as re
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = "http://property.shw.co.uk/searchproperties/Level2-0/Level1-0-181-236-167-165/Units/Development-or-House-and-Flat-or-Investment-or-Land-or-Office-or-Other/UnitIds-0/For-Sale"
driver.get(url)
try:
    wait = WebDriverWait(driver, 3)
    wait.until(EC.presence_of_element_located((By.ID, "body1")))
    print("Page is Ready!")
except TimeoutException:
    print("page took too long to load")
def get_house_links(url, driver, pages=3):
    house_links = []
    driver.get(url)
    for i in range(pages):
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        listings = soup.find_all("a", class_="L")
        page_data = [row['href'] for row in listings]
        house_links.append(page_data)
        time.sleep(np.random.lognormal(0, 1))
        next_button = soup.find_all("a", class_="pageingBlock darkBorder")
        next_button_link = ['http://property.shw.co.uk'+row['href'] for row in next_button]
        if i < 3:
            driver.get(next_button_link[0])
    return house_links

get_house_links(url, driver)
class_="pageingBlock darkBorder" match the previous page button as well, so next_button_link[0] send you back to previous page. You need more precise locator
next_button = soup.select('img[src*="propNext"]')
if next_button:
    next_button = next_button[0].find_parent('a')
    next_button_link = 'http://property.shw.co.uk' + next_button['href']
    driver.get(next_button_link)
I'm trying to get the URL of a video, but it never shows up in my output. I tried requests, urllib, and even selenium, but that part of the page just doesn't appear in my result; it's as if it were blocked.
The URL is https://unitplay.net/tt0089222, and here is my code:
from selenium import webdriver
browser=webdriver.Chrome('path/chromedriver.exe')
type(browser)
browser.get('https://unitplay.net/tt0089222')
elem = browser.page_source
print(elem)
browser.quit()
Here is the part it doesn't show and I want to get the src from it:
<div class="jw-media jw-reset"><video class="jw-video jw-reset" x-webkit-airplay="allow" webkit-playsinline="" playsinline="" preload="auto" jw-loaded="data" src="https://unitplay.net//file/others/DA6BB292BA130B6A825B62B96BD929F811EBF7BFEC748F8E2609004F5D96D0F5DD7025F4450289E31279E9F621883D048C869F15520DBE571D8FA35EBCCACD75" __idm_id__="64900097" jw-played=""></video></div>
You can wait for the element to appear using selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome('path/chromedriver.exe')
browser.get('https://unitplay.net/tt0089222')
elem = browser.page_source
try:
    element = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "video"))
    )
    print(element.get_attribute("src"))
finally:
    browser.quit()
This should tell Selenium to wait up to 10 seconds for a video element to appear and then print out its source.
I would like to scrape a web page that Selenium opened after navigating from a different page.
I entered a search term into a website using Selenium and this landed me on a new page. My aim is to create soup out of this new page. But the soup is being created from the previous page, where I entered my search term. Help please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock'))
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company name for your search. After using send_keys, you tried to check for the staleness of an element; I did not understand how that statement was supposed to work. I added a WebDriverWait for an element of the new page.
The following works for me regarding the Selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
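For example, a minimal sketch of wrapping the wait above in a TimeoutException handler (same locator as before):

from selenium.common.exceptions import TimeoutException

try:
    company = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'lblCompany')))
except TimeoutException:
    print("Timed out waiting for the company page to load")
    driver.quit()
    raise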
@Jens Dibbern has given a working solution, but it is not necessary to give the exact company name in the search. When you type a non-exact name, a drop-down pops up.
I have observed that the Enter key does not work until this drop-down is present. You can check this by going to the site, pasting the name, and pressing Enter as fast as possible without waiting: nothing happens.
So you can instead wait for this drop-down to be visible and then send the Enter key. This also works perfectly. Note that it will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down=driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)
I've written some code in Python with Selenium to parse product names from a webpage. A few "load more" buttons appear as the browser scrolls down, and the page only displays its full content once it has been scrolled all the way down and there is no "load more" button left to click. My scraper seems to be working, but I'm not getting all the results: there are around 200 products on the page and I only get about 90 of them. What should I change in my scraper to get them all? Thanks in advance.
The webpage I'm dealing with: Page_Link
This is the script I'm trying with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("put_above_url_here")
wait = WebDriverWait(driver, 10)
page = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,".listing_item")))
for scroll in range(17):
    page.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    try:
        load = driver.find_element_by_css_selector(".lm-btm")
        load.click()
    except Exception:
        pass
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)
driver.quit()
Try the code below to get the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
wait = WebDriverWait(driver, 10)
header = driver.find_element_by_tag_name("header")
driver.execute_script("arguments[0].style.display='none';", header)
while True:
    try:
        page = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".listing_item")))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        page.send_keys(Keys.END)
        load = wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "LOAD MORE")))
        driver.execute_script("arguments[0].scrollIntoView();", load)
        load.click()
        wait.until(EC.staleness_of(load))
    except:
        break
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^=item_]"))):
    name = item.find_element_by_css_selector(".pro-name.el2").text
    print(name)
driver.quit()
You should only use Selenium as a last resort.
A quick look around the webpage shows the API it calls to get your data.
It returns a JSON output with all the details:
Link
You can then just loop over the response and store it in a dataframe easily.
That is very fast and gives fewer errors than Selenium.
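A minimal sketch of that approach (the endpoint URL, query parameters, and JSON field names below are placeholders, since the actual API is behind the link above):

import requests
import pandas as pd

# Placeholder endpoint and parameters: take the real ones from the
# browser's network tab or the link above.
api_url = "https://www.example.com/search/api"
response = requests.get(api_url, params={"q": "hair fall shampoo", "page": 1})
response.raise_for_status()

data = response.json()
# "products" is an assumed key; adjust it to the actual JSON structure.
products = data.get("products", [])
df = pd.DataFrame(products)
print(df.head())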