I am running multiple browsers using a multiprocessing Pool. Each process also opens and closes its browser repeatedly; I close the browser and open a new one because the sites I am visiting block the second visit with a captcha otherwise. If each process calls browser.quit(), will it affect the Chrome instances running in the other processes? Some sites are failing for me even though I know the URLs are good.
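To make the layout concrete, here is a minimal sketch of the kind of worker each process runs (the URL list and pool size are placeholders, not my real values):
from multiprocessing import Pool
from selenium import webdriver

def worker(url):
    # each process builds its own driver; my understanding is that quit()
    # only ends the chromedriver/Chrome pair this process started,
    # which is what I want to confirm
    browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs
    with Pool(4) as pool:  # placeholder pool size
        results = pool.map(worker, urls)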
EDIT:
Let me explain further. Selenium visits the site and I am returning the HTML for scraping. The error I receive in the log file:
Failed While scraping
object of type 'NoneType' has no len()
example Selenium code:
import time
from selenium import webdriver

def get_page(url):
    browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
    time.sleep(2)
    browser.get(url)
    # verify the page actually returned results
    if browser.current_url[-19:] == 'noResultsFound=true' or browser.current_url[-13:] == 'error404.html':
        browser.quit()
        return None
    else:
        html = browser.page_source
        browser.quit()
        return html
example scraping code:
from bs4 import BeautifulSoup

def scrape(html):
    soup = BeautifulSoup(html, 'html.parser')
    search_items = soup.find('div', {'class': 'row product-grid results'})
    if search_items is not None:
        search_items = search_items.find_all('div', {'class': 'col-xs-12 col-xs-6 col-sm-4 col-md-3 text-center'})
        for i in range(len(search_items)):
            # scrape each search result
I can visit the URL and verify that the <div>s do exist, but I am failing on the for loop when taking the length. My thought is that either the JavaScript is not fully loading before page_source is returned, or another process calling browser.quit() is affecting the other processes.
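If it is the page not loading fully, would an explicit wait on the results grid instead of the fixed sleep help? Here is a minimal sketch of what I mean; the locator is just the class name from my scraping code, so it is an assumption about the page:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def get_page(url):
    browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
    try:
        browser.get(url)
        if browser.current_url[-19:] == 'noResultsFound=true' or browser.current_url[-13:] == 'error404.html':
            return None
        # wait until the results grid is actually in the DOM instead of sleeping
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'div.row.product-grid.results'))
        )
        return browser.page_source
    except TimeoutException:
        # the grid never appeared; treat it like a bad page
        return None
    finally:
        browser.quit()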
I am trying to scrape the Twitter username of cryptocurrencies from CoinMarketCap (https://coinmarketcap.com/currencies/ethereum/social/). Some of them don't have the Twitter iframe, for example https://coinmarketcap.com/currencies/bitcoin/social/.
The problem is that the iframe takes around 3 seconds to load, and after testing my program many times I found that it does not always appear even after waiting 5 seconds. Occasionally, when I opened the page manually, the widget did not appear on screen at all (though this was rare).
I expected it to work reliably and scrape everything, but it seems prone to error because it depends on loading time and the server's response.
Is there a better, more stable way of doing this? This is my first web scraping project, and this seems like the only approach that could work.
Is there another method I could use while waiting?
I know you can get the source from the iframe and scrape it, but I was not able to find it.
Here is my function:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

def get_crypto_currency_social(slug):
    url = "https://coinmarketcap.com/currencies/" + slug + "/social/"
    browser = webdriver.Chrome('./chromedriver')
    # .add_argument('headless')
    browser.get(url)
    try:
        wait(browser, 5).until(EC.presence_of_element_located((By.ID, "twitter-widget-0")))
    except:
        pass
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    market_cap = soup.find('div', {'class': 'statsValue___2iaoZ'}).text.split('$')[-1]
    coin_name = soup.find('small', {'class': 'nameSymbol___1arQV'}).text
    coin_rank = soup.find('div', {'class': 'namePillPrimary___2-GWA'}).text.split('#')[-1]
    try:
        iframe = browser.find_elements_by_tag_name('iframe')[0]
        browser.switch_to.frame(iframe)
        twitter_username = browser.find_element_by_class_name("customisable-highlight").text
    except NoSuchElementException:
        twitter_username = ""
    except:
        print("Error getting twitter username")
    finally:
        browser.quit()
    return {
        "coin_rank": coin_rank,
        "market_cap": market_cap,
        "coin_name": coin_name,
        "twitter_username": twitter_username
    }
If there is a variable delay between loads, you could make use of the WebDriverWait class from Selenium.
Sample code:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"YOUR IFRAME XPATH")))
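Putting that together with the locators from the question, a minimal sketch could look like this (the iframe id and the customisable-highlight class come from the question's code, and the 10-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome('./chromedriver')
browser.get("https://coinmarketcap.com/currencies/ethereum/social/")
try:
    # wait for the Twitter iframe and switch into it in one step
    WebDriverWait(browser, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.ID, "twitter-widget-0"))
    )
    # then wait for the username element inside the frame
    twitter_username = WebDriverWait(browser, 10).until(
        EC.visibility_of_element_located((By.CLASS_NAME, "customisable-highlight"))
    ).text
except TimeoutException:
    # pages without a Twitter widget (e.g. the bitcoin page) end up here
    twitter_username = ""
finally:
    browser.quit()
print(twitter_username)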
import time
import requests
from bs4 import BeautifulSoup

shoe = input('Shoe name: ')
URL = 'https://stockx.com/search?s=' + shoe
page = requests.get(URL, headers=headers)  # headers dict is defined earlier in my script
soup = BeautifulSoup(page.content, 'html.parser')
time.sleep(2)  # this was to make sure the page had enough time to load so it wouldn't scrape a prematurely loaded website
test = soup.find(class_='BrowseSearchDescription__SearchConfirmation-sc-1mt8qyd-1 dcjzxm')
print(test)  # returns None
print(URL)  # prints the URL (which is the correct URL of the website I'm attempting to scrape)
I understand that I could easily do this with Selenium; however, that is very inefficient because it has to launch a Chrome tab and navigate to the page. I'm trying to keep this efficient, and my original "prototype" did use Selenium, but it was always detected as a bot and my whole script was stopped by captchas. Am I doing something wrong that is causing the code to return None, or is that specific webpage unscrapeable? If you need it, the specific URL is https://stockx.com/search?s=yeezy
I tried your code and here is the result.
Code
import requests
import bs4 as bs

shoe = 'yeezy'
URL = 'https://stockx.com/search?s=' + shoe
page = requests.get(URL)
soup = bs.BeautifulSoup(page.content, 'html.parser')
And when I look at what's inside the soup, here is the result.
Result
..
..
<div id="px-captcha">
</div>
<p> Access to this page has been denied because
we believe you are using automation tools to browse the website.</p>
..
..
So, yes, I guess the developers did not want the website to be scraped.
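If you want to detect this in code, a small sketch based only on the markup shown above would be to check for the px-captcha div before trying to parse results:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://stockx.com/search?s=yeezy')
soup = BeautifulSoup(page.content, 'html.parser')

# the blocked page contains a <div id="px-captcha"> instead of the search results
if soup.find('div', id='px-captcha') is not None:
    print('Request was blocked by the anti-bot captcha page')
else:
    print('Got the real search results page')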
I'm using chromedriver + selenium to try to loop through a website with the same url structure for all their pages like so:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome()

for i in range(1, 3):
    # iterate pages and get url
    ureq = "https://somewebsite.com/#page=" + str(i)
    driver.get(ureq.strip())
    # create
    soup = []
    # main code here
    try:
        # Wait for page to load
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "some element in the DOM")))
        src = driver.page_source
        # Parse page with bs
        soup = bs(src, "lxml")
    except TimeoutException:
        print("Timed out")
        driver.quit()
    # main code

driver.quit()
The problem is that when the loop fires a second time and the url changes to "#page=2", I can see that the page and the URL have changed in the webdriver, but the script just hangs. There is no timeout or error message; the script simply freezes.
I've also tried placing a print statement before the WebDriverWait to see where the program hangs, but that doesn't fire either. I think that, for some reason, the second get request is the culprit.
Why is that, or is something else here the issue?
If you can obtain the URL directly from the href attribute of a link element, that URL should also work when you enter it into the address bar directly. But that is not always the case; see the explanation of click events below.
If, instead, you obtain the URL from the address bar after clicking on some element, you may fail to open the destination page by entering that URL into the address bar.
That is because clicking the element may trigger a click handler that executes JavaScript in the background to fetch data from the backend.
You cannot trigger that background work just by entering the URL in the address bar, so the safe way is to click on the element.
AngularJS apps, in particular, often behave this way.
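Applied to the loop in the question, that means clicking the site's own pagination control instead of calling driver.get on the "#page=N" URL. A rough sketch; the link-text locator is purely a guess about the site's markup and would need adjusting:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome()
driver.get("https://somewebsite.com/#page=1")

soups = []
for i in range(2, 4):
    # scrape the page that is currently loaded
    soups.append(bs(driver.page_source, "lxml"))
    # then click the pagination control for the next page
    # (By.LINK_TEXT is a guess; use whatever locator matches the real control)
    next_link = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.LINK_TEXT, str(i)))
    )
    next_link.click()

driver.quit()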
This is a continuation of my previous question, "Web scraping using selenium and beautifulsoup.. trouble in parsing and selecting button". I was able to solve the previous problem, but I am now stuck on the following.
I got the links from the previous step and stored them in a list.
Now I am trying to visit all the links stored in a list named StartupLink.
The information I need to scrape and store is inside a div class=content tag. For some links, that div contains a hidden_more element with a JavaScript click event, so I am handling that exception. The loop runs fine and visits the links, but after the first two links it gives NA output even though the div content tag is present, and it shows no error, which I can't explain.
The list contains 400 links to visit, all with a similar div content element.
Where am I going wrong here?
from time import sleep
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Description = []
driver = webdriver.Chrome()

for link in StartupLink:
    try:
        driver.get(link)
        sleep(5)
        more = driver.find_element_by_xpath('//a[@class="hidden_more"]')
        element = WebDriverWait(driver, 10).until(EC.visibility_of(more))
        sleep(5)
        element.click()
        sleep(5)
        page = driver.find_element_by_xpath('//div[@class="content"]').text
        sleep(5)
    except Exception as e:  # NoSuchElementException
        driver.start_session()
        sleep(5)
        page = driver.find_element_by_xpath('//div[@class="content"]').text
        sleep(5)
        print(str(e))
    if page == '':
        page = "NA"
        Description.append(page)
    else:
        Description.append(page)
        print(page)
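One thing that may be worth trying is replacing the fixed sleeps with explicit waits, and only clicking the hidden_more link when it actually appears, so that pages without it don't go through the exception path. A sketch along those lines, using the same XPaths as above and the same StartupLink list:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

Description = []
driver = webdriver.Chrome()

for link in StartupLink:
    driver.get(link)
    # expand the hidden section only if the link is present and clickable
    try:
        more = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//a[@class="hidden_more"]'))
        )
        more.click()
    except TimeoutException:
        pass  # this page has no hidden_more link
    try:
        page = WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.XPATH, '//div[@class="content"]'))
        ).text
    except TimeoutException:
        page = ''
    Description.append(page if page else "NA")
    print(page)

driver.quit()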
I'm fairly new to coding and Python, so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search result pages and scrapes each page for all of the URLs. I've got the scraping itself working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just loop over the URL to capture each page of results, but that isn't possible here. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
I have looked at a few related questions, but none of them seem to use Python, or at least not in a way I can understand. Thank you in advance for your help.
For the sake of completeness: while you may try inspecting the POST request and find a way to access the next page directly, as I suggested in my comment, using Selenium makes it quite easy to achieve what you want if that alternative is acceptable.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# using the Firefox web browser
driver = webdriver.Firefox()
# uncomment (and comment out the line above) to use PhantomJS instead
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.
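As the comment in the loop notes, you should catch a specific exception there. Keep in mind that find_element_by_class_name raises NoSuchElementException rather than returning None when the button is missing, so the "if btn_next is None" check will not fire on its own. A small sketch of how the last page could be detected instead, using the same class name as above:
from selenium.common.exceptions import NoSuchElementException

try:
    btn_next = driver.find_element_by_class_name('next')
except NoSuchElementException:
    # no Next button on this page, so it is the last one
    btn_next = None

if btn_next is None:
    print("crawling completed.")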