I'm using ChromeDriver + Selenium to loop through a website whose pages all share the same URL structure, like so:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome()

for i in range(1, 3):
    # iterate pages and get url
    ureq = "https://somewebsite.com/#page=" + str(i)
    driver.get(ureq.strip())
    # create the soup placeholder
    soup = []
    # main code here
    try:
        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.XPATH, "some element in the DOM")))
        src = driver.page_source
        # Parse page with BeautifulSoup
        soup = bs(src, "lxml")
    except TimeoutException:
        print("Timed out")
        driver.quit()
    # main code
driver.quit()
The problem is that when the loop fires a second time and the URL changes to "#page=2", I can see that the webpage and URL have changed in the webdriver, but the script just hangs: there is no timeout or error message, it simply freezes.
I've also tried placing a print statement before the WebDriverWait call to see where the program hangs, but that doesn't fire either. I think that, for some reason, the second get request is the culprit.
Why is that, or is something else here the issue?
If you can obtain the URL directly from the href attribute of a link element, that URL should generally work when you enter it into the address bar directly. It doesn't always work, though; see the explanation of click events below.
If, on the other hand, you obtain the URL from the address bar after clicking on some element, you may fail to open the destination page by entering that URL into the address bar yourself.
That's because clicking the element may trigger a click event that runs JavaScript in the background to fetch data from the backend.
You can't trigger that background work just by entering the URL into the address bar, so the safe approach is to click on the element instead.
AngularJS apps, in my experience, behave this way most of the time.
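For example, a minimal sketch of paginating by clicking the site's own "next" control instead of rewriting the URL fragment; both XPath expressions here are placeholders you would adapt to the actual page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://somewebsite.com/")

for i in range(1, 3):
    # Wait for the current page's content to render, then scrape it
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, "some element in the DOM")))
    src = driver.page_source
    # ... parse src with BeautifulSoup here ...

    # Click the "next page" element (placeholder XPath) so any JavaScript
    # click handlers fire, instead of loading "#page=2" directly
    next_button = driver.find_element_by_xpath("//a[contains(@class, 'next')]")
    next_button.click()

driver.quit()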
I am trying to get a list of all the movies/series on my personal IMDb watchlist. I am using Selenium to click the "Load more" button so that everything shows up in the HTML. However, when I try to scrape that data, only the first 100 movies show up.
Nothing past 'page3' shows up.
The image below shows the part of the HTML that marks page 3:
After clicking the "Load more" button with Selenium, all the movies are shown in the Chrome window, yet only the first 100 of 138 are printed to my console.
Here is the URL: https://www.imdb.com/user/ur130279232/watchlist
This is my current code:
URL = "https://www.imdb.com/user/ur130279232/watchlist"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,20)
driver.get(URL)
while True:
try:
watchlist = driver.find_element_by_xpath("//div[#class='lister-list mode-detail']")
watchlistHTML = watchlist.get_attribute('innerHTML')
loadMoreButton = driver.find_element_by_xpath("//button[#class='load-more']")
soup = BeautifulSoup(watchlistHTML, 'html.parser')
content = soup.find_all('h3', class_ ='lister-item-header')
#pdb.set_trace()
print('length: ',len(content))
for elem in content:
print(elem.find('a').contents[0])
time.sleep(2)
loadMoreButton.click()
time.sleep(5)
except Exception as e:
print(e)
break
Even though, after clicking the "Load more" button, "lister-list mode-detail" includes everything up to Sound of Music?
The rest of the data is returned by an HTTP GET (triggered when you scroll down and hit "Load more") to:
https://www.imdb.com/title/data?ids=tt0144117,tt0116996,tt0106179,tt0118589,tt11252090,tt13132030,tt6083778,tt0106611,tt0115685,tt1959563,tt8385148,tt0118971,tt0340855,tt8629748,tt13932270,tt11185940,tt5580390,tt4975722,tt2024544,tt1024648,tt1504320,tt1010048,tt0169547,tt0138097,tt0112573,tt0109830,tt0108052,tt0097239,tt0079417,tt0071562,tt0068646,tt0070735,tt0067116,tt0059742,tt0107207,tt0097937&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist
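For instance, a minimal sketch of calling that endpoint directly with requests; the headers and the response format are assumptions you would need to verify:

import requests
from bs4 import BeautifulSoup

# Only the first two title IDs are shown here; use the full ids list from the URL above
data_url = ("https://www.imdb.com/title/data?ids=tt0144117,tt0116996"
            "&tracking_tag=&pageId=ls089853956&pageType=list&subpageType=watchlist")

# A browser-like User-Agent tends to be needed to avoid being blocked
resp = requests.get(data_url, headers={"User-Agent": "Mozilla/5.0"})

# Assumption: the endpoint returns an HTML fragment using the same
# lister-item-header markup as the page; if it returns JSON instead,
# use resp.json() and pull the titles from there.
soup = BeautifulSoup(resp.text, "html.parser")
for header in soup.find_all("h3", class_="lister-item-header"):
    print(header.find("a").contents[0])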
What @balderman mentioned works if you can access that HTTP GET endpoint.
The main thing is that the titles load lazily; the later ones aren't loaded until the earlier ones have loaded. I don't know whether they only load if you're in the right region, but a janky way to get around it is to programmatically scroll through the page and let it load, as in the sketch below.
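A minimal sketch of that approach, scrolling to the bottom in steps until the page height stops growing (the sleep lengths are guesses you may need to tune):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/user/ur130279232/watchlist")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom and give the lazily loaded titles time to appear
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height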
The webpage is: https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0
As you can see, there are 25 items in the selling section of this page; when you click one, it opens a new tab showing that item's details.
Now I want to make a program that gets those 25 item URLs and saves them in a list. My problem is that, as you can see when inspecting the page, the items are div elements where I would expect a (anchor) elements, and I can't find any 'href' attributes related to them.
# using selenium and driver = webdriver.Chrome()
link = driver.find_elements_by_tag_name('a')
link2 = [l.get_attribute('href') for l in link]
I thought I could do it with the code above, but the problem is what I described. Any suggestions?
It looks like you are trying to scrape a page that is powered by React. There are no href attributes because JavaScript is handling all the linking. Your best bet is to use Selenium to click each of the div elements, switch to the newly opened tab, and use something like this code to get the URL of the page it takes you to:
import time

links = driver.find_elements_by_class_name('card-header')
urls = []
for link in links:
    # Clicking the card opens the item's details in a new tab
    link.click()
    driver.switch_to.window(driver.window_handles[1])
    url = driver.current_url
    urls.append(url)
    # Close the new tab and switch back to the main tab
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    time.sleep(1)
Note that the code closes the new tab each time and goes back to the main tab. I added time.sleep() so it doesn't go too fast.
My problem is that I am attempting to scrape the titles of Netflix movies and shows from a website that lists them across 146 pages, so I made a loop to capture data from all the pages. However, the loop seems to malform my URL, and I don't know how to fix it.
I have made sure the webdriver part of the code works: if I type the URL into driver.get directly, it gives me the information I need. When using the loop, however, it pops up multiple Firefox windows and doesn't load any URL in any of them. I also added a time delay to check whether the URL was being changed before it got used, but it still didn't work.
from selenium import webdriver
import time

for i in range(1,3):
    URL = "https://flixable.com/?min-rating=0&min-year=1920&max-year=2019&order=date&page={}"
    newURL = URL.format(i)
    print(newURL)
    time.sleep(10)
    driver = webdriver.Firefox()
    driver.get('newURL')
    titles = driver.find_elements_by_css_selector('#filterContainer > div > div > p > strong > a')
    for post in titles:
        print(post.text)
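Judging from the code itself, the likely culprits are that driver.get('newURL') passes the literal string 'newURL' rather than the newURL variable, and that a fresh Firefox window is opened on every iteration. A minimal corrected sketch of the loop, under that reading:

from selenium import webdriver
import time

# Create the driver once, outside the loop
driver = webdriver.Firefox()
URL = "https://flixable.com/?min-rating=0&min-year=1920&max-year=2019&order=date&page={}"

for i in range(1, 3):
    newURL = URL.format(i)
    print(newURL)
    driver.get(newURL)   # pass the variable, not the string 'newURL'
    time.sleep(10)       # crude wait for the page to render
    titles = driver.find_elements_by_css_selector('#filterContainer > div > div > p > strong > a')
    for post in titles:
        print(post.text)

driver.quit()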
I'm trying to get the number of plays for the top songs of a number of artists on Spotify, using Python and Splinter.
If you fill in the username and password below with yours, you should be able to run the code.
from splinter import Browser
import time
from bs4 import BeautifulSoup
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(10)
So far, so good. If you open up Firefox, you can see Miley Cyrus's artist page, including the number of plays for the top tracks.
If you open up the Firefox Developer Tools Inspector and hover, you can see the name of each song in the .tl-highlight elements and the number of plays in the .tl-listen-count elements. However, I've found it impossible (at least on my machine) to access these elements using Splinter. Moreover, when I try to get the source for the entire page, the elements that I can see by hovering my mouse over them in Firefox don't show up in what is ostensibly the page source.
html = browser.html
soup = BeautifulSoup(html)
output = soup.prettify()
with open('miley_cyrus_artist_page.html', 'w') as output_f:
    output_f.write(output)
browser.quit()
I don't think I know enough about web programming to know what the issue is here: Firefox sees all the DOM elements clearly, but the Splinter instance driving Firefox does not.
The key problem is that there is an iframe containing the artist's page with the list of tracks. You need to switch into its context before searching for elements:
frame = browser.driver.find_element_by_css_selector("iframe[id^=browse-app-spotify]")
browser.driver.switch_to.frame(frame)
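Once the driver's context is inside that frame (using the two lines above), the elements from the question become reachable; a minimal sketch of pulling the track names and play counts, assuming the .tl-highlight and .tl-listen-count selectors from the question still match:

# With the driver switched into the iframe, the track rows are in its DOM
names = browser.driver.find_elements_by_css_selector(".tl-highlight")
plays = browser.driver.find_elements_by_css_selector(".tl-listen-count")
for name, count in zip(names, plays):
    print(name.text, count.text)

# Switch back to the top-level document when done
browser.driver.switch_to.default_content()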
Many thanks to @alecxe; the following code works to pull the information on the artist.
from splinter import Browser
import time
from bs4 import BeautifulSoup
import codecs
browser = Browser()
url = 'http://play.spotify.com'
browser.visit(url)
time.sleep(2)
button = browser.find_by_id('has-account')
button.click()
time.sleep(1)
browser.fill('username', 'your_username')
browser.fill('password', 'your_password')
buttons = browser.find_by_css('button')
visible_buttons = [button for button in buttons if button.visible]
login_button = visible_buttons[-1]
login_button.click()
time.sleep(1)
browser.visit('https://play.spotify.com/artist/5YGY8feqx7naU7z4HrwZM6')
time.sleep(30)
CORRECT_FRAME_INDEX = 6
with browser.get_iframe(CORRECT_FRAME_INDEX) as iframe:
    html = iframe.html
    soup = BeautifulSoup(html)
    output = soup.prettify()
    with codecs.open('test.html', 'w', 'utf-8') as output_f:
        output_f.write(output)

browser.quit()
I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search result pages and scrapes each page for all of the URLs. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop over the URL to capture each search result, but that's not possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
I have looked at a few related questions, but none seem to use Python, or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: while you may try to access the POST request and find a way to reach the next page that way, as I suggested in my comment, if an alternative is acceptable, using Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.