Scraping URLs with Python and selenium

Scraping URLs with Python and selenium - python

I am trying to get a python selenium script working that should do the following:
Take text file, BookTitle.txt that is a list of Book Titles.
Using Python/Selenium then searches the site, GoodReads.com for that title.
Takes the URL for the result and makes a new .CSV file with column 1=book title and column 2=Site URL
I hope that we can get this working, then please help me with step by step to get it to run.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.options import Options
from pyvirtualdisplay import Display
#from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common import keys
import csv
import time
import json
class Book:
def __init__(self, title, url):
self.title = title
self.url = url
def __iter__(self):
return iter([self.title, self.url])
url = 'https://www.goodreads.com/'
def create_csv_file():
header = ['Title', 'URL']
with open('/home/l/gDrive/AudioBookReviews/WebScraping/GoodReadsBooksNew.csv', 'w+', encoding='utf-8') as csv_file:
wr = csv.writer(csv_file, delimiter=',')
wr.writerow(header)
def read_from_txt_file():
lines = [line.rstrip('\n') for line in open('/home/l/gDrive/AudioBookReviews/WebScraping/BookTitles.txt', encoding='utf-8')]
return lines
def init_selenium():
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
options = Options()
options.add_argument('--headless')
global driver
driver = webdriver.Chrome("/home/l/gDrive/AudioBookReviews/WebScraping/chromedriver", chrome_options=chrome_options)
driver.get(url)
time.sleep(30)
driver.get('https://www.goodreads.com/search?q=')
def search_for_title(title):
search_field = driver.find_element_by_xpath('//*[#id="search_query_main"]')
search_field.clear()
search_field.send_keys(title)
search_button = driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[1]/div[1]/div[2]/form/div[1]/input[3]')
search_button.click()
def scrape_url():
try:
url = driver.find_element_by_css_selector('a.bookTitle').get_attribute('href')
except:
url = "N/A"
return url
def write_into_csv_file(vendor):
with open('/home/l/gDrive/AudioBookReviews/WebScraping/GoodReadsBooksNew.csv', 'a', encoding='utf-8') as csv_file:
wr = csv.writer(csv_file, delimiter=',')
wr.writerow(list(vendor))
create_csv_file()
titles = read_from_txt_file()
init_selenium()
for title in titles:
search_for_title(title)
url = scrape_url()
book = Book(title, url)
write_into_csv_file(book)
Running the above, I get the following errors:
Traceback (most recent call last): File
"/home/l/gDrive/AudioBookReviews/WebScraping/GoodreadsScraper.py",
line 68, in
init_selenium() File "/home/l/gDrive/AudioBookReviews/WebScraping/GoodreadsScraper.py",
line 41, in init_selenium
driver = webdriver.Chrome("/home/l/gDrive/AudioBookReviews/WebScraping/chromedriver",
chrome_options=chrome_options) File
"/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py",
line 81, in init
desired_capabilities=desired_capabilities) File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py",
line 157, in init
self.start_session(capabilities, browser_profile) File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py",
line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters) File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py",
line 321, in execute
self.error_handler.check_response(response) File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py",
line 242, in check_response
raise exception_class(message, screen, stacktrace) selenium.common.exceptions.WebDriverException: Message: unknown error:
Chrome failed to start: exited abnormally (unknown error:
DevToolsActivePort file doesn't exist) (The process started from
chrome location /usr/bin/google-chrome is no longer running, so
ChromeDriver is assuming that Chrome has crashed.) (Driver info:
chromedriver=2.44.609551
(5d576e9a44fe4c5b6a07e568f1ebc753f1214634),platform=Linux
4.15.0-60-generic x86_64)

There are couple of errors I cansee for now:
1) you have to uncomment chrome options and comment firefox' as you're passing the chromedriver later in code
# from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
Btw, that pyvirtualdisplay is an alternative for headless chrome, you don't need it imported.
2) you have instantiated Options two times and you're using only the first one. Change your code to:
def init_selenium():
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
I guess these two are just for start, edit your question when you encounter the next problem you can't solve.

You are using chrome driver, but you comment it out at import.
from selenium.webdriver.chrome.options import Options
In the search function, the process is:
get page -> find search box -> input value -> enter keys -> grab results.
Something like this:
def search_for_title(title):
driver.get('https://www.goodreads.com/search?q=')
search_field = driver.find_element_by_name('q')
search_field.clear()
search_field.send_keys(title)
search_field.send_keys(keys.Keys.RETURN) # you missed this part
url = driver.find_element_by_xpath(
'/html/body/div[2]/div[3]/div[1]/div[2]/div[2]/table/tbody/tr[1]/td[2]/a')
print(url.get_attribute('href'))

Related

Hello. How to open URLs step by step form exel or csv and make an action in this links. And how to save session in chromedriver

So, I have a website and need to log in to this site, open links from the excel or CSV file(these links from the website), and make an action. Could you help me? Please
It works when I open links from driver.get('link')
The code is working :
import pathlib, pickle
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import undetected_chromedriver as chromedriver
import time
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path = str(pathlib.Path(__file__).parent.absolute()) + 'chromedriver.exe')
driver.get('https://..')
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
cookies = pickle.load(open("webauto\cookies.pkl1", "rb"))
for cookie in cookies:
driver.add_cookie(cookie)
driver.get('https://...')
time.sleep(30)
But If I want to open links from csv file in the cycle it doesn't to work.
The code isn't working example:
import pathlib, pickle
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import undetected_chromedriver as chromedriver
import time
from selenium.webdriver.common.by import By
import csv
driver = webdriver.Chrome(executable_path = str(pathlib.Path(__file__).parent.absolute()) + 'chromedriver.exe')
driver.get('https://../')
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
cookies = pickle.load(open("webauto\cookies.pkl1", "rb"))
for cookie in cookies:
driver.add_cookie(cookie)
with open('webauto\Prjct.csv', 'r') as csv_file:
csv_reader = csv.reader(csv_file)
for line in csv_reader:
# print(line)
driver = webdriver.Chrome(chrome_options=options)
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
driver.get(line[0])
time.sleep(30)
But it isn's work. Sorry for my shitcode. I just a want automate my action in differnet site.
Errors:
PS C:\Users\пользователь\Desktop\py\strt> & C:/Python310/python.exe c:/Users/пользователь/Desktop/py/strt/webauto/djinivana.py
c:\Users\пользователь\Desktop\py\strt\webauto\djinivana.py:9: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
driver = webdriver.Chrome(executable_path = str(pathlib.Path(__file__).parent.absolute()) + 'chromedriver.exe')
DevTools listening on ws://127.0.0.1:59150/devtools/browser/78603287-0828-48b9-87d9-2720e64541a7
Traceback (most recent call last):
File "c:\Users\пользователь\Desktop\py\strt\webauto\djinivana.py", line 22, in <module>
for line in csv_reader:
ValueError: I/O operation on closed file.

Selenium / Try to click button in shadow-root?

i try to click the cookie all accept button on this page:
https://www.tiktok.com/#shneorgru
with the following code
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from sys import platform
import os, sys
import xlwings as xw
from datetime import datetime, timedelta
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from fake_useragent import UserAgent
if __name__ == '__main__':
SAVE_INTERVAL = 5
WAIT = 1
print(f"Checking chromedriver...")
os.environ['WDM_LOG_LEVEL'] = '0'
ua = UserAgent()
userAgent = ua.random
options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_experimental_option("prefs", {"profile.default_content_setting_values.notifications": 1})
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument(f'user-agent={userAgent}')
srv=Service(ChromeDriverManager().install())
driver = webdriver.Chrome (service=srv, options=options)
waitWebDriver = WebDriverWait (driver, 10)
link = f"https://www.tiktok.com/#shneorgru"
driver.get (link)
time.sleep(WAIT)
driverElem = driver.find_element(By.XPATH,"//tiktok-cookie-banner")
root1 = driver.execute_script("return arguments[0].shadowRoot", driverElem)
root1.find_element(By.XPATH,"(//button)[2]").click()
time.sleep(WAIT)
I try to select first the tag which is around the shadwo-root.
Then try to execute the script for the shadowRoot.
And then find the element inside the shadowRoot and finally click the button.
But i allways get the following error:
$ python collTikTok.py
Checking chromedriver...
Traceback (most recent call last):
File "C:\Users\Polzi\Documents\DEV\Fiverr\TRY\rosen771\collTikTok.py", line 56, in <module>
root1.find_element(By.XPATH,"(//button)[2]").click()
File "C:\Users\Polzi\Documents\DEV\.venv\selenium\lib\site-packages\selenium\webdriver\remote\shadowroot.py", line 59, in find_element
return self._execute(
File "C:\Users\Polzi\Documents\DEV\.venv\selenium\lib\site-packages\selenium\webdriver\remote\shadowroot.py", line 94, in _execute
return self.session.execute(command, params)
File "C:\Users\Polzi\Documents\DEV\.venv\selenium\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 425, in execute
self.error_handler.check_response(response)
File "C:\Users\Polzi\Documents\DEV\.venv\selenium\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid locator
How can i click this cookie accept button?
Here is the code how it looks like in the shadow-root:

Python selenium “while true:” script can't execute driver.get(“URL”)

My script has a while True: script running 24/7. It always hits print ('False') every sleep(55) seconds, and once the current_time is between specified times, it calls driver.get("URL") and does a few actions on a website.
But sometimes it hits an error where it cannot open driver.get('https://www.instagram.com/accounts/login/?source=auth_switcher') and prints the error below.
Traceback (most recent call last):
File "/home/matt/insta/users/nycforest/lcf.py", line 254, in <module>
lcf_time(input_begin_time,input_end_time,input_begin_time2,input_end_time2)
File "/home/matt/insta/users/nycforest/lcf.py", line 241, in lcf_time
login()
File "/home/matt/insta/users/nycforest/lcf.py", line 40, in login
driver.get('https://www.instagram.com/accounts/login/?source=auth_switcher')
File "/home/matt/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "/home/matt/.local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/matt/.local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout
(Session info: headless chrome=78.0.3904.70)
(Driver info: chromedriver=2.42.591071 (0b695ff80972cc1a65a5cd643186d2ae582cd4ac),platform=Linux 5.4.0-1020-aws x86_64)
I tried adding driver.set_page_load_timeout(20) after driver.get('URL') but it hits the error above
from pyvirtualdisplay import Display
from datetime import datetime, time
from itertools import islice
from random import randint
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
input_begin_time = time(18,40)
input_end_time = time(18,50)
input_begin_time2 = time(21,20)
input_end_time2 = time(21,30)
opts = webdriver.ChromeOptions()
opts.add_argument('--headless')
opts.add_argument('--disable-gpu')
opts.add_argument('--no-sandbox')
opts.add_argument('--disable-dev-shm-usage')
opts.add_argument('--enable-features=NetworkService,NetworkServiceInProcess')
def login():
sleep(2)
driver.get('https://www.instagram.com/accounts/login/?source=auth_switcher')
driver.set_page_load_timeout(20)
sleep(3)
input_username = driver.find_element_by_name('username').send_keys(username)
input_password = driver.find_element_by_name('password').send_keys(password)
button_login = driver.find_element_by_css_selector('#loginForm > div > div:nth-child(3) > button')
button_login.click()
# def other_functions()
driver = webdriver.Chrome(options=opts)
def lcf_time(time_begin1, time_end1, time_begin2, time_end2, curren_time=None):
current_time = datetime.now().time()
if time_begin1 < current_time < time_end1 or time_begin2 < current_time < time_end2:
login()
# other_functions()
driver.close()
else:
print("False")
while True:
lcf_time(input_begin_time,input_end_time,input_begin_time2,input_end_time2)
sleep(55)
The odd thing is: I have a few different scripts running the exact same script with different variables and it some scripts are fine and others consistently hit this 60% of the time. I need a more reliable way to driver.get("URL") 24/7
Is this a compatibility issue? Am I doing something wrong?

Python Selenium checking checkbox: selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element

I have written code to select the checkbox at following website: https://www.theatlantic.com/do-not-sell-my-personal-information/
I have tried following versions:
Version 1:
ele = driver.find_element_by_id('residency')
driver.execute_script("arguments[0].click()",ele)
Version 2: checkBox1 = driver.find_element_by_css_selector("input[id='residency']")
Version 3: driver.find_element_by_xpath("//input[#type='checkbox']")
However, for all of these versions I get following error:
Traceback (most recent call last):
File "website-functions/theatlantic.py", line 43, in <module>
atlantic_DD_formfill(california_resident, email, zipcode)
File "website-functions/theatlantic.py", line 30, in atlantic_DD_formfill
driver.find_element_by_xpath("//input[#type='checkbox']")
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 394, in find_element_by_xpath
return self.find_element(by=By.XPATH, value=xpath)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
'value': value})['value']
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//input[#type='checkbox']"}
(Session info: headless chrome=80.0.3987.87)
Here you can see the full code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import os
import time
def atlantic_DD_formfill(california_resident, email, zipcode):
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(10)
driver.set_window_size(1124, 850) # set browser size
# link to data delete form
print("opening data delete form")
driver.get("https://www.theatlantic.com/do-not-sell-my-personal-information/")
#Select California Resident Field:
#ele = driver.find_element_by_id('residency')
#driver.execute_script("arguments[0].click()",ele)
#checkBox1 = driver.find_element_by_css_selector("input[id='residency']")
#if(NOT(checkBox1.isSelected())):
# checkBox1.click()
driver.find_element_by_xpath("//input[#type='checkbox']")
print("California Resident Field selected")
driver.find_element_by_id("email").send_keys(email)
driver.find_element_by_id("zip-code").send_keys(email)
# KEEP THIS DISABLED BC IT ACTUALLY SUBMITS
# driver.find_element_by_id("SubmitButton2").send_keys(Keys.ENTER)
print("executed")
time.sleep(4)
driver.quit()
return None
california_resident=True
email = "joe#musterman.com"
zipcode=12345
atlantic_DD_formfill(california_resident, email, zipcode)

There is an iframe present on the page, so you need to first switch to that iframe and then click on the element and as an another element is placed above the checkbox element, you need to use java script click method to click on the checkbox.
You can do it like:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import os
import time
def atlantic_DD_formfill(california_resident, email, zipcode):
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(10)
driver.set_window_size(1124, 850) # set browser size
# link to data delete form
print("opening data delete form")
driver.get("https://www.theatlantic.com/do-not-sell-my-personal-information/")
driver.switch_to.frame(driver.find_element_by_tag_name('iframe'))
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//input[#type='checkbox']")))
driver.execute_script("arguments[0].click();", element)

Python Selenium WebDriver unable to find element at all

Selenium webdriver has been unable to find any elements on page using different methods: class_name, id, & xpath.
Here's my code:
from selenium import webdriver
##from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import time
import random
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(executable_path=r'C:\Users\acer\Downloads\chromedriver_win32\chromedriver.exe', chrome_options=chrome_options)
time.sleep(2)
driver.get('https://www.reddit.com/r/AskReddit/comments/fi04fh/what_are_some_spoilers_for_the_next_month_of_2020/')
time.sleep(2)
print(driver.title)
time.sleep(2)
element = driver.find_element_by_id("header")
print("done")
The title prints successfully but it fails on the line of driver.find_element_by_id("header").
In fact, I am trying to find the element whose class is "top-matter" (using find_by_class_name) but since this wasn't working, I tested it for other elements ("header") using respective methods ("xpath", "id") but nothing is working for me.
Can anyone provide some insight into the issue?
EDIT: Here's the error:
Traceback (most recent call last):
File "C:/Python34/reddit_test.py", line 20, in <module>
element = driver.find_element_by_id("header")
File "C:\Python34\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 269, in find_element_by_id
return self.find_element(by=By.ID, value=id_)
File "C:\Python34\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 752, in find_element
'value': value})['value']
File "C:\Python34\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
self.error_handler.check_response(response)
File "C:\Python34\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"id","selector":"header"}
(Session info: headless chrome=80.0.3987.132)
(Driver info: chromedriver=2.41.578737 (49da6702b16031c40d63e5618de03a32ff6c197e),platform=Windows NT 6.1.7601 SP1 x86_64)
Here's proof that the element exists...

in your url there is no header id
to ignore this exception
try this code:
from selenium import webdriver
##from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import time
import random
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(executable_path=r'C:\Users\acer\Downloads\chromedriver_win32\chromedriver.exe', chrome_options=chrome_options)
time.sleep(2)
driver.get('https://www.reddit.com/r/AskReddit/comments/fi04fh/what_are_some_spoilers_for_the_next_month_of_2020/')
time.sleep(2)
print(driver.title)
time.sleep(2)
try:
element = driver.find_element_by_id("header")
except:
print("The Header isd dose not exist!")
exit()
print("done")
In Your Url The Header is dose not exist
You Can See By This Image!

The issue was that I use the old version of reddit whereas the default which Selenium and Hamza were opening is the new version which doesn't contain the elements I was trying to find

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping URLs with Python and selenium - python

Related

Hello. How to open URLs step by step form exel or csv and make an action in this links. And how to save session in chromedriver

Selenium / Try to click button in shadow-root?

Python selenium “while true:” script can't execute driver.get(“URL”)

Python Selenium checking checkbox: selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element

Python Selenium WebDriver unable to find element at all

Categories

Resources