looping in beautiful soup / no errors - python

I have been writing a program that would hypothetically find items on a website as soon as they are loaded onto it. As it stands, the script takes as input two keywords used to describe an item and a color used to pick the item's variant. The parsing is spot on for items that are already on the website, but say I run my program before the website loads the items: instead of having to re-run the entire script, I'd like it to just refresh the page and re-parse the data until it finds them. I also included "no errors" in my question because in my example run I entered keywords and a color not matching any item on the website, and instead of getting an error I just got "Process finished with exit code 0". Thank you in advance to anyone who takes the time to help!
Here is my code:

As another user suggested, you're probably better off using Selenium for the entire process rather than using it for only parts of your code and swapping between BSoup and Selenium.
As for reloading the page when certain items are not present: if you already know which items are supposed to be on the page, you can simply search for each item by id with Selenium, and if one or more can't be found, refresh the page with the following line of code:
driver.refresh()
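Put together, the refresh-and-retry idea can be sketched as a small polling loop. This is only a sketch: the function names and the `wanted_ids` parameter are placeholders, not from the original post, and the Selenium wiring is shown in the comments.

```python
import time

def wait_for_items(get_present_ids, refresh, wanted_ids, retries=10, delay=5):
    """Keep refreshing until every wanted id is present or retries run out.

    get_present_ids: callable returning the ids currently on the page
    refresh: callable that reloads the page (e.g. driver.refresh)
    """
    for _ in range(retries):
        missing = set(wanted_ids) - set(get_present_ids())
        if not missing:
            return True  # everything we were looking for is on the page
        refresh()
        time.sleep(delay)
    return False  # gave up after `retries` reloads

# With Selenium it could be wired up roughly like this (ids are hypothetical):
# found = wait_for_items(
#     lambda: [e.get_attribute("id")
#              for e in driver.find_elements_by_css_selector("[id]")],
#     driver.refresh,
#     ["item-red-shirt"])
```

Returning `False` instead of exiting silently also addresses the "no errors" part of the question: the caller can tell the difference between "found" and "gave up".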

Related

Selenium to simulate click without loading link?

I'm working on a project trying to autonomously monitor item prices on an Angular website.
Here's what a link to a particular item would look like:
https://www.<site-name>.com/categories/<sub-category>/products?prodNum=9999999
Using Selenium (in Python) on a page with product listings, I can get some useful information about the items, but what I really want is the prodNum parameter.
The onClick attribute for each item is clickOnItem(item, $index).
I do have some information for the items, including the presumable item and $index values, which are visible within the HTML, but I'm doubtful there is a way of seeing what actually happens inside clickOnItem.
I've tried looking around using dev-tools to find where clickOnItem is defined, but I haven't been successful.
Considering that I don't see any way of getting prodNum without clicking, I'm wondering: is there a way I could simulate a click to see where it would redirect, but without actually loading the link? Loading each link would take far too much time to do for every item.
Note: I want to get the specific prodNum so I can hit the item page directly, without first going through the main listing page.

Extracting links from website with selenium bs4 and python

Okay so.
The heading might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help with making a link-extracting program in Python.
It actually works: it finds all <a> elements on a webpage, takes their href="" values and puts them in an array, then exports the array to a CSV file. Which is what I want.
But I can't get a hold of one thing.
The website is dynamic so I am using the Selenium webdriver to get JavaScript results.
The code for the program is pretty simple. I open a website with webdriver and then get its content. Then I get all links with
results = driver.find_elements_by_tag_name('a')
Then I loop through results with for loop and get href with
result.get_attribute("href")
I store results in an array and then print them out.
But the problem is that I can't get the name of the links.
This leads to Google
Is there any way to get the 'This leads to Google' string?
I need it for every link that is stored in an array.
Thank you for your time
UPDATE: It seems it only gets the text of dynamic links; I just noticed this, and it's really strange. For hard-coded links it returns an empty string, while for a dynamic link it returns its name.
Okay. So. The answer is that instead of using .text you should use get_attribute("textContent"), which works better than get_attribute("innerHTML").
Thanks KunduK for this answer. You saved my day :)
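The whole workflow described above (collect hrefs plus their visible text, then export to CSV) can be sketched as below. The helper names are made up for illustration; the key point from the answer is using get_attribute("textContent") rather than .text, since .text comes back empty for elements Selenium considers hidden.

```python
import csv

def collect_links(elements):
    """Return (href, link text) pairs for each anchor-like element."""
    links = []
    for el in elements:
        href = el.get_attribute("href")
        # textContent works even when .text would return "" for hidden links
        text = (el.get_attribute("textContent") or "").strip()
        if href:
            links.append((href, text))
    return links

def export_csv(links, path):
    """Write the collected pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["href", "text"])
        writer.writerows(links)

# With Selenium, roughly:
# links = collect_links(driver.find_elements_by_tag_name("a"))
# export_csv(links, "links.csv")
```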

Selenium web scraping for some reason only brings back partial data instead of everything. Not sure if there is dynamic data in the background

Python and Selenium beginner here. I'm trying to scrape the titles of the sections of a Udemy class. I've tried using find_elements_by_class_name and others, but for some reason it only brings back partial data.
page I'm scraping: https://www.udemy.com/selenium-webdriver-with-python3/
1) I want to get the title of the sections. They are the bold titles.
2) I want to get the title of the subsections.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.udemy.com/selenium-webdriver-with-python3/'
driver.get(url)
main_titles = driver.find_elements_by_class_name("lecture-title-text")
sub_titles = driver.find_elements_by_class_name("title")
Problem
1) Using main_titles, I got a length of only 10. It only goes from Introduction to Modules; Working With Files and everything after it doesn't come out, even though the class names are exactly the same. Modules / Working With Files is basically the cutoff point, and the elements also look different in the inspector at that point. They all have the same span class tag, but I'm not sure why only part of them are returned:
<span class="lecture-title-text">
Element Inspection between Modules title and WorkingWithFiles title
At this point the webscrape breaks down. Not sure why.
2) Using sub_titles, I got a length of 58 items, but when I print them out I only get the top two:
Introduction
How to reach me anytime and ask questions? *** MUST WATCH ***
After this, it's all blank lines. Not sure why it's only pulling the top two and not the rest when all the tags have
<div class='title'>
Maybe I could try using BeautifulSoup, but currently I'm trying to get better at Selenium. Is there dynamic content throwing off the Selenium scrape, or am I not scraping it properly?
Thank you guys for the input. Sorry for the long post. I wanted to make sure I describe the problem correctly.
The reason you're only getting the first 10 sections is that only the first ten are shown. You might be logged in in your browser, so when you go to check, it shows every section; but for me and your scraper, only the first 10 are visible. You'll need to click the .section-container--more-sections button before looking for the titles.
As for the weird case of the titles not being scraped properly: when an element is hidden, its text attribute will always be empty, which is why it only works for the first section. I'd try using WebElement.get_attribute('textContent') to scrape the text.
OK, I've gone through the suggestions in the comments and solved it. I'm writing it here in case anyone wants to see the solution in the future.
1) Following the suggestions, I made a command to click on '24 more sections' to expand the tab and then scraped it, which worked perfectly!
driver.find_element_by_class_name("js-load-more").click()
titles = driver.find_elements_by_class_name("lecture-title-text")
for each in titles:
    print(each.text)
This pulled all 34 section titles.
2) Using Matt's suggestion, I found the WebElement and used get_attribute('textContent') to pull out the text data. There were a bunch of surrounding spaces, so I used strip() to get the strings only.
sub_titles = driver.find_elements_by_class_name("title")
for each in sub_titles:
    print(each.get_attribute('textContent').strip())
This pulled all 210 subsection titles!

Extracting info from dynamic page element in Python without "clicking" to make visible?

For the life of me I can't think of a better title...
I have a Python WebDriver-based scraper that goes to Google, enters a local search such as chiropractors+new york+ny, which, after clicking on More chiropractors+New York+NY, ends up on a page like this
The goal of the scraper is to grab the phone number and full address (including suite # etc.) of each of the 20 results on such a results page. To do so, each of the 20 entries needs to be clicked on with WebDriver to bring up an overlay over the Google Map:
This is mighty slow. If I didn't have to trigger each of these overlays, I could do everything up to that point with the much faster lxml, by going straight to the final URL of the results page and extracting via XPath. But I appear to be stuck: I can't get data out of the overlay without first clicking the link that brings it up.
Is there a way to get the data out of this page element without having to click the associated links?

Python script on refreshing a web page, count value varies

I'm new to developing Python scripts, but I'm trying to write one that will inform me when a web page has been updated. For each check I use a counter to see how many times the program has run before the site updated in some way.
My doubt is this: when I feed it the URL "stackoverflow.com", my program can run up to 6 times, but when I feed it "stackoverflow.com/questions", the program runs at most once. Both pages seem to update their questions often on refresh. Could someone explain why there is such a big difference in the number of times the program runs?
import urllib2
import time

refreshcnt = 0
url = raw_input("Enter the site to check")
x = int(raw_input("Enter the time duration to refresh"))
url = "http://" + url
response = urllib2.urlopen(url)
html = response.read()
htmlnew = html
while html == htmlnew:
    time.sleep(x)
    try:
        htmlnew = urllib2.urlopen(url).read()
    except IOError:
        print "Can't open site"
        break
    refreshcnt += 1
    print "Refresh Count", refreshcnt
print("The site has updated!")
Just add this little loop to the end of your code and see what's changing:
for i in xrange(min(len(htmlnew), len(html))):
    if htmlnew[i] != html[i]:
        print(htmlnew[i-20:i+20])
        print(html[i-20:i+20])
        break
I tried it quickly, and it appears there is a ServerTime key that is updated every second. For one reason or another, that key seems to be updated every second on the "/questions" page but only every half a minute or so on the homepage.
However, from a couple of other quick checks, this is certainly not the only part of the HTML being updated on the "stackoverflow.com/questions" page. Just comparing the entire HTML against the old one probably won't work in many situations. You'll likely want to search for a specific part of the HTML and then see whether that piece has changed. For example, look for the HTML marking the newest question title on SO and see whether that title differs from before.
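The targeted comparison suggested above can be sketched with a small helper that extracts one fragment and compares only that. Note the regex below assumes a hypothetical `question-hyperlink` class on the title link; it is an illustration, not the page's guaranteed markup, and a real scraper would use a proper HTML parser.

```python
import re

def newest_title(html):
    """Pull the first question title out of the page, or None if absent."""
    m = re.search(r'class="question-hyperlink"[^>]*>([^<]+)<', html)
    return m.group(1) if m else None

def has_changed(old_html, new_html):
    """Compare only the newest-question fragment, not the whole page."""
    old, new = newest_title(old_html), newest_title(new_html)
    return old is not None and old != new
```

Polling `has_changed` instead of `html == htmlnew` ignores churn like the ServerTime key, so the refresh count should reflect actual content updates.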
