I'm currently working on a YouTube web scraper for comments.
I want to scrape the comments and put them into a dataframe. My code can only print the text; I'm unable to put the text into a dataframe. When I check the output's type, it is <class 'str'>. I'm able to get the text with this code:
try:
    # Extract the elements storing the usernames and comments.
    username_elems = driver.find_elements_by_xpath('//*[@id="author-text"]')
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
except exceptions.NoSuchElementException:
    error = "Error: Double check selector OR "
    error += "element may not yet be on the screen at the time of the find operation"
    print(error)

for com_text in comment_elems:
    print(com_text.text)
If I check the type of the text with this code at the end of my function:
for com_text in comment_elems:
    print(type(com_text.text))
then the result is <class 'str'>, and I am still unable to put it into a dataframe.
When I do try to put this <class 'str'> object in a dataframe, I get the error: TypeError: 'WebElement' object does not support item assignment
This is the code that I use when trying to put the text in a dataframe:
for username, comment in zip(username_elems, comment_elems):
    comment_section['comment'] = comment.text
    data.append(comment_section)
I'm hoping there is a way to convert the <class 'str'> object into a regular string, or that there is another step I can take to extract the text from the object.
Here is my full code:
def gitscrape(url):
    # Note: replace argument with absolute path to the driver executable.
    driver = webdriver.Chrome('chromedriver/windows/chromedriver.exe')

    # Navigates to the URL, maximizes the current window, and
    # then suspends execution for (at least) 5 seconds (this gives time for the page to load).
    driver.get(url)
    driver.maximize_window()
    time.sleep(5)

    # Empty subjects.
    comment_section = []
    comment_data = []

    try:
        # Extract the elements storing the video title and
        # comment section.
        title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
        comment_section = driver.find_element_by_xpath('//*[@id="comments"]')
    except exceptions.NoSuchElementException:
        # Note: YouTube may have changed its HTML layout for videos, so raise an error
        # for sanity's sake in case the elements provided cannot be found anymore.
        error = "Error: Double check selector OR "
        error += "element may not yet be on the screen at the time of the find operation"
        print(error)

    # Scroll the comment section into view, then allow some time
    # for everything to be loaded as necessary.
    driver.execute_script("arguments[0].scrollIntoView();", comment_section)
    time.sleep(7)

    # Scroll all the way down to the bottom in order to get all the
    # elements loaded (since YouTube dynamically loads them).
    last_height = driver.execute_script("return document.documentElement.scrollHeight")

    while True:
        # Scroll down 'til "next load".
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")

        # Wait to load everything thus far.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # One last scroll just in case.
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")

    try:
        # Extract the elements storing the usernames and comments.
        username_elems = driver.find_elements_by_xpath('//*[@id="author-text"]')
        comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    except exceptions.NoSuchElementException:
        error = "Error: Double check selector OR "
        error += "element may not yet be on the screen at the time of the find operation"
        print(error)

    # for com_text in comment_elems:
    #     print(type(com_text.text))
    #     data.append(comment_section)

    for username, comment in zip(username_elems, comment_elems):
        comment_section['comment'] = comment.text
        data.append(comment_section)

    video1_comments = pd.DataFrame(data)
<class 'str'> is used to represent normal strings in Python. In the code below both print statements print out <class 'str'>. You must be facing a different issue.
a = 12345
a = str(a)
print(type(a))
b = "12345"
print(type(b))
Your error occurs in the line comment_section['comment'] = comment.text. You write that you encounter the error when you try to put a string into a dataframe, but neither comment_section nor comment is a dataframe. In your title you write that adding a string to a list throws the error, but comment_section is not a list either (and if it were, the syntax wouldn't make any sense). Code is very sensitive to what you're actually working with, so whether something is a dataframe or a list makes a big difference.
So what type is comment_section actually? If you scroll up through your code, the last assignment to it is comment_section = driver.find_element_by_xpath('//*[@id="comments"]'), so comment_section is neither a dataframe nor a list, but a WebElement! Now the error you got also makes sense: it says TypeError: 'WebElement' object does not support item assignment, and indeed you're trying to assign comment.text to the 'comment' key of the WebElement comment_section, which a WebElement does not support.
You can fix this by not overwriting comment_section and instead using a different name for the per-comment dictionary, as sketched below.
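For illustration, here is a minimal sketch of that fix, assuming the element lookups from the question and the pandas import are already in place; the names data and comment_record are just placeholders, not taken from your code:

# Minimal sketch of the fix described above; 'comment_record' is a
# hypothetical name so the WebElement 'comment_section' is not overwritten.
data = []  # plain list that will hold one dict per comment
for username, comment in zip(username_elems, comment_elems):
    comment_record = {
        'username': username.text,
        'comment': comment.text,
    }
    data.append(comment_record)
video1_comments = pd.DataFrame(data)  # one row per comment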
I am using the Selenium library and trying to iterate through a list of items, looking them up on the web. The loop works when an item is found, but I'm having a hard time handling the case where an item is not found on the web page. In that instance I know the page will show "No Results For" within a span, which I can access with:
browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')[0].text
Now the problem is that this span only appears when the item the loop is searching for is not found. So I tried this logic: if the span doesn't exist, the item was found, so execute the rest of the loop; if the span does exist and equals 'No results for', go on and search for the next item. Here is my code:
data = pd.DataFrame()

for i in lookup_list:
    start_url = f"https://www.amazon.com/s?k=" + i + "&ref=nb_sb_noss_1"
    browser.visit(start_url)

    if browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]') is not None:
        # browser.find_by_xpath("//a[@class='a-size-medium a-color-base']"):
        item = browser.find_by_xpath("//a[@class='a-link-normal']")
        item.click()
        html = browser.html
        soup = bs(html, "html.parser")
        collection_dict = {
            'PART_NUMBER': getmodel(soup),
            'DIMENSIONS': getdim(soup),
            'IMAGE_LINK': getImage(soup)
        }
    elif browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')[0].text != 'No results for':
        continue

    data = data.append(collection_dict, ignore_index=True)
The error I am getting is:
AttributeError: 'ElementList' object has no attribute 'click'
I do understand that I get this error because I can't access the click attribute: the lookup returns a list with multiple items, so I can't click on all of them. But what I'm trying to do is avoid even trying to click when the page shows that the item was not found; I want the script to simply go on to the next item and search for it.
How do I modify this?
Thank you in advance.
Using a try/except with pass is what you want in this situation, like @JammyDodger said. That said, using it typically isn't a good sign, because most of the time you don't want to simply ignore errors. pass will simply ignore the error and continue with the rest of the loop.
try:
    item.click()
except AttributeError:
    pass
In order to skip to the next iteration of the loop, you may want to use the continue keyword.
try:
    item.click()
except AttributeError:
    continue
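To make the continue version concrete, here is a rough sketch of how it could sit inside your loop; it assumes your existing browser, lookup_list, data DataFrame and the getmodel/getdim/getImage helpers, and is an outline rather than tested code:

for i in lookup_list:
    start_url = "https://www.amazon.com/s?k=" + i + "&ref=nb_sb_noss_1"
    browser.visit(start_url)
    try:
        # When there are no results the element list is effectively empty and
        # .click() raises AttributeError, so we skip to the next search term.
        item = browser.find_by_xpath("//a[@class='a-link-normal']")
        item.click()
    except AttributeError:
        continue
    html = browser.html
    soup = bs(html, "html.parser")
    data = data.append({
        'PART_NUMBER': getmodel(soup),
        'DIMENSIONS': getdim(soup),
        'IMAGE_LINK': getImage(soup)
    }, ignore_index=True)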
I have a for loop that iterates over a list of URLs. Each URL is then loaded by the Chrome driver. Some URLs load a page in a 'bad' format and fail the first XPath test; when that happens, I want to go on to the next element in the loop. My cleanup code works, but I can't seem to make it continue with the next element. I have an except block that closes my web browser, but nothing I tried lets me loop back to 'for row in mysql_cats':
for row in mysql_cats:
    print('Here is the url -', row[1])
    cat_url = (row[1])
    driver = webdriver.Chrome()
    driver.get(cat_url)  # Download the URL passed from mysql
    try:
        CategoryName = driver.find_element_by_xpath('//h1[@class="categoryL3"]|//h1[@class="categoryL4"]').text  # finds either the L3 or the L4 category
    except:
        driver.close()
        # This does close the webdriver okay if it can't find the xpath, but I
        # can't get code here to make it go to the next row in mysql_cats.
I hope that you're also closing the driver at the end of this code when no exception occurs.
If you want to jump to the next iteration of the loop when an exception is raised, you may add continue, as suggested in other answers:
try:
    CategoryName = driver.find_element_by_xpath('//h1[@class="categoryL3"]|//h1[@class="categoryL4"]').text  # finds either the L3 or the L4 category
except NoSuchElementException:  # requires: from selenium.common.exceptions import NoSuchElementException
    driver.close()
    continue  # jumps to the next iteration of the for loop
Since I do not know the rest of your code, the following tip may be useless, but a common way to handle these cases is a try/except/finally clause:
for row in mysql_cats:
    print('Here is the url -', row[1])
    cat_url = (row[1])
    driver = webdriver.Chrome()
    driver.get(cat_url)  # Download the URL passed from mysql
    try:
        # my code, with dangerous stuff
    except NoSuchElementException:
        # handling of 'NoSuchElementException'; no need to 'continue'
    except SomeOtherUglyException:
        # handling of 'SomeOtherUglyException'
    finally:
        # Code that is ALWAYS executed, with or without exceptions.
        driver.close()
I'm also assuming that you're creating a new driver each time for a reason. If that is not intentional, you may use something like this:
driver = webdriver.Chrome()

for row in mysql_cats:
    print('Here is the url -', row[1])
    cat_url = (row[1])
    driver.get(cat_url)  # Download the URL passed from mysql
    try:
        # my code, with dangerous stuff
    except NoSuchElementException:
        # handling of 'NoSuchElementException'; no need to 'continue'
    except SomeOtherUglyException:
        # handling of 'SomeOtherUglyException'

driver.close()
In this way you have only one driver that manages all the pages you're trying to open in the for loop.
Have a look at how try/except/finally is really useful when handling connections and drivers.
As a footnote, I'd like you to notice how in the code I always specify which exception I am expecting: catching all exceptions can be dangerous. That said, probably no one will die if you simply use a bare except:
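As a small, generic illustration of that footnote (a sketch not tied to your page), naming the exception keeps unexpected problems visible, whereas a bare except would also swallow things like KeyboardInterrupt:

from selenium.common.exceptions import NoSuchElementException

def get_category(driver):
    # Only the expected "element is missing" case is handled here;
    # anything unexpected (including Ctrl-C) still surfaces.
    try:
        return driver.find_element_by_xpath('//h1[@class="categoryL3"]').text
    except NoSuchElementException:
        return None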
I am trying to scrape webpages using Python and Selenium. I have a URL which takes a single parameter and a list of valid parameters. I navigate to that URL with one parameter at a time and click on a link; a popup window opens with a page.
The popup window automatically opens a print dialogue on page load.
Also, the URL bar is disabled for that popup.
My code:
def packAmazonOrders(self, order_ids):
    order_window_handle = self.driver.current_window_handle
    for each in order_ids:
        self.driver.find_element_by_id('sc-search-field').send_keys(Keys.CONTROL, "a")
        self.driver.find_element_by_id('sc-search-field').send_keys(Keys.DELETE)
        self.driver.find_element_by_id('sc-search-field').send_keys(each)
        self.driver.find_element_by_class_name('sc-search-button').click()
        src = self.driver.page_source.encode('utf-8')
        if 'Unshipped' in src and 'Easy Ship - Schedule pickup' in src:
            is_valid = True
        else:
            is_valid = False
        if is_valid:
            print 'Packing Slip Start - %s' % each
            self.driver.find_element_by_link_text('Print order packing slip').click()
            handles = self.driver.window_handles
            print handles
            try:
                handles.remove(order_window_handle)
            except:
                pass
            self.driver.switch_to_window(handles.pop())
            print handles
            packing_slip_page = ''
            packing_slip_page = self.driver.page_source.encode('utf-8')
            if each in packing_slip_page:
                print 'Packing Slip Window'
            else:
                print 'not found'
            self.driver.close()
            self.driver.switch_to_window(order_window_handle)
Now I have two questions:
How can I download that popup page as a PDF?
For the first parameter everything works fine, but for the other parameters in the list packing_slip_page does not update (which I think is because of the disabled URL bar, though I'm not sure). I tried printing the handles (print handles) for each parameter, but it always prints the same value. So how do I access the correct page source for the other parameters?
I have some experience in Python, but I have never used try/except to catch errors, due to a lack of formal training.
I am working on extracting a few articles from Wikipedia. For this I have an array of titles, a few of which do not have any article or search result. I would like the page-retrieval function to just skip those few names and continue running the script on the rest. Reproducible code follows.
import wikipedia
# This one works.
links = ["CPython"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
#The sequence breaks down if there is no wikipedia page.
links = ["CPython","no page"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
The library it calls into uses a function like this. Normally modifying it would be really bad practice, but since this is just a one-off data extraction, I am willing to change my local copy of the library to get it to work. Edit: I have now included the complete function.
def page(title=None, pageid=None, auto_suggest=True, redirect=True, preload=False):
    '''
    Get a WikipediaPage object for the page with title `title` or the pageid
    `pageid` (mutually exclusive).

    Keyword arguments:

    * title - the title of the page to load
    * pageid - the numeric pageid of the page to load
    * auto_suggest - let Wikipedia find a valid page title for the query
    * redirect - allow redirection without raising RedirectError
    * preload - load content, summary, images, references, and links during initialization
    '''

    if title is not None:
        if auto_suggest:
            results, suggestion = search(title, results=1, suggestion=True)
            try:
                title = suggestion or results[0]
            except IndexError:
                # if there is no suggestion or search results, the page doesn't exist
                raise PageError(title)
        return WikipediaPage(title, redirect=redirect, preload=preload)
    elif pageid is not None:
        return WikipediaPage(pageid=pageid, preload=preload)
    else:
        raise ValueError("Either a title or a pageid must be specified")
What should I do to retrieve only the pages that do not raise the error? Maybe there is a way to filter out all the items in the list that raise this error, or an error of some kind. Returning "NA" or similar would be fine for pages that don't exist. Skipping them without notice would be fine too. Thanks!
The function wikipedia.page will raise a wikipedia.exceptions.PageError if the page doesn't exist. That's the error you want to catch.
import wikipedia

links = ["CPython", "no page"]
test = []
for link in links:
    try:
        # try to load the wikipedia page
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        # if a "PageError" was raised, ignore it and continue to the next link
        continue
You have to wrap the call to wikipedia.page in a try block, so I'm afraid you can't keep the list comprehension.
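Since you mention that returning "NA" would also be fine, here is a small variation of the loop above (just a sketch, using the literal string "NA" as the placeholder) that keeps the results aligned with links:

test = []
for link in links:
    try:
        # keep the page content when the article exists
        test.append(wikipedia.page(link, auto_suggest=False).content)
    except wikipedia.exceptions.PageError:
        # placeholder for titles with no article, so positions still match 'links'
        test.append("NA")
print(test)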
Understood that this is bad practice, but for a one-off quick-and-dirty script you can just catch the error and ignore it.
Edit: Wait, sorry. I've just noticed the list comprehension. I'm not sure this will work without breaking it down:
links = ["CPython", "no page"]
test = []
for link in links:
try:
page = wikipedia.page(link, auto_suggest=False)
test.append(page)
except wikipedia.exceptions.PageError:
pass
test = [testitem.content for testitem in test]
print(test)
pass tells Python to essentially trust you and ignore the error so that it can continue on about its day.
I have built a web scraper. The program enters searchterm into a searchbox and grabs the results. Pandas goes through a spreadsheet line-by-line in a column to retrieve each searchterm.
Sometimes the page doesn't load properly, prompting a refresh.
I need a way for it to repeat the function and try the same searchterm if it fails. Right now, if I return, it would go on to the next line in the spreadsheet.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

df = pd.read_csv("searchterms.csv", delimiter=",")

def scrape(searchterm):
    # Loads url
    searchbox = driver.find_element_by_name("searchbox")
    searchbox.clear()
    searchbox.send_keys(searchterm)
    print "Searching for %s ..." % searchterm
    no_result = True
    while no_result is True:
        try:
            # Find results, grab them
            no_result = False
        except:
            # Refresh page and do the above again for the current searchterm - How?
            driver.refresh()
    return pd.Series([col1, col2])

df[["Column 1", "Column 2"]] = df["searchterm"].apply(scrape)
# Executes the crawl for each line in the csv
The try/except construct also comes with an else clause. The else block is executed if everything goes OK:
def scrape(searchterm):
    # Loads url
    no_result = True
    while no_result:
        # Find results, grab them
        searchbox = driver.find_element_by_name("searchbox")
        searchbox.clear()
        try:  # assumes that an exception is thrown if there are no results
            searchbox.send_keys(searchterm)
            print "Searching for %s ..." % searchterm
        except:
            # Refresh page and do the above again for the current searchterm
            driver.refresh()
        else:  # executed if no exceptions were thrown
            no_result = False
            # .. some post-processing code here
    return pd.Series([col1, col2])
(There is also a finally block that is executed no matter what, which is useful for cleanup tasks that don't depend on the success or failure of the preceding code.)
Also, note that a bare except catches any exception and is almost never a good idea. I'm not familiar with how Selenium handles errors, but when catching exceptions you should specify which exception you expect to handle. That way, if an unexpected exception occurs, your code will abort and you'll know that something bad happened.
That is also why you should try to keep as few lines as possible within the try block.
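For example, here is a rough sketch of the same refresh-and-retry idea with a named exception and a narrow try block; WebDriverException and the "result" class name are assumptions for illustration, not taken from your page:

from selenium.common.exceptions import WebDriverException

def scrape(searchterm):
    while True:
        searchbox = driver.find_element_by_name("searchbox")
        searchbox.clear()
        searchbox.send_keys(searchterm)
        try:
            # Only the fragile lookup lives inside the try block;
            # "result" is a hypothetical class name for the result rows.
            results = driver.find_elements_by_class_name("result")
        except WebDriverException:
            # Page didn't load properly: refresh and retry the same searchterm.
            driver.refresh()
            continue
        if not results:
            driver.refresh()
            continue
        # Reached only when the lookup succeeded and returned something.
        return pd.Series([results[0].text, len(results)])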