I am building a program in Python that interacts with an online store. So far I am able to find the desired item and navigate to its page using BeautifulSoup, but I am having issues clicking the "Add to cart" button. Most of the solutions I've found online using robobrowser and similar would work, except that they deal with a form tag that has a method attribute. The form on the site I am dealing with looks like this:
<input class="button" name="commit" type="submit" value="add to cart">
How would I go about "clicking" this button? What libraries would I need? I'm using Python 3, by the way, so I can't use mechanize. Thanks in advance for the help.
You can consider using Selenium in Python.
Please use the code snippet below as a reference:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("url")
button = driver.find_element_by_css_selector("input[class='button']")
button.click()
In case you get multiple matches, you can narrow it down by involving more attributes:
button = driver.find_element_by_css_selector("input[class='button'][name='commit']")
Please refer to this link for more examples on Python Selenium.
http://selenium-python.readthedocs.io/locating-elements.html
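If the button is only rendered after some JavaScript runs, clicking immediately can fail, so an explicit wait is often safer than a plain find-and-click. A minimal sketch under that assumption (the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com/item-page")  # placeholder URL for the item page

# wait up to 10 seconds for the submit button to become clickable, then click it
wait = WebDriverWait(driver, 10)
button = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='commit'][type='submit']"))
)
button.click()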
If you go to the site, you'll notice that there is an age confirmation window which I wanted to bypass with Scrapy, but I couldn't manage it, so I moved on to Selenium WebDriver and am now using
driver.find_element_by_xpath('xpath').click()
to bypass the age confirmation window. Honestly, I don't want to go with Selenium WebDriver because of how slow it is. Is there any way to bypass that window?
I searched a lot on Stack Overflow and Google but didn't find any answer that resolves my problem. If you have any link or idea for resolving it with Scrapy, that'd be appreciated. A single helpful comment will be up-voted!
To expand on Chillie's answer.
The age verification is irrelevant here. The data you are looking for is loaded via an AJAX request:
See the related question "Can scrapy be used to scrape dynamic content from websites that are using AJAX?" to understand how such requests work.
You need to figure out how the https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1&x-algolia-application-id=NS5BWTAI8M&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c URL works and how you can retrieve it in Scrapy.
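As a starting point, Algolia's multi-query endpoint is normally called with a POST whose JSON body lists the index and query parameters. Here is a rough sketch of replaying that request from a Scrapy spider; the index name and params below are guesses you would need to confirm in the browser's Network tab:
import json
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    # URL copied from the browser's Network tab (see above)
    algolia_url = ("https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries"
                   "?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1"
                   "&x-algolia-application-id=NS5BWTAI8M"
                   "&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c")

    def start_requests(self):
        # "products" and the params string are assumptions -- copy the real
        # request body from the Network tab
        body = json.dumps({"requests": [{"indexName": "products",
                                         "params": "query=&hitsPerPage=50"}]})
        yield scrapy.Request(self.algolia_url, method="POST", body=body,
                             callback=self.parse_hits)

    def parse_hits(self, response):
        data = json.loads(response.text)
        # each Algolia "result" carries a list of matching records under "hits"
        for hit in data["results"][0]["hits"]:
            yield hit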
But the age verification "window" is just a div that gets hidden when you press the button, not a real separate window:
<div class="age-check-modal" id="age-check-modal">
You can use the browser's Network tab in developer tools to see that no new info is uploaded or sent when you press the button. So everything is already loaded when you request a page. The "popup" is not even a popup, just an element whose display is changed to none when you click the button.
So Scrapy doesn't really care what's meant to be displayed as long as all the HTML is loaded. If the elements are loaded, they are accessible. Or have you seen some information being unavailable without pressing the button?
You should inspect the HTML more closely to see what each website does; this might make your scraping tasks easier.
Edit: After inspecting the original html you can see the following:
<div class="products-list">
<div class="products-container-block">
<div class="products-container">
<div id="hits" class='row'>
</div>
</div>
</div>
</div>
You can also see a lot of JS script tags.
The browser element inspector, on the other hand, shows the #hits div populated with generated content (including ::before pseudo-elements).
The ::before part gives away that this was manipulated by JS, as you cannot do this with simple CSS. See Granitosaurus' answer for details on this.
What this means is that you need to somehow execute the JS code on those pages. So you either need a solution that works with Scrapy, or you just use Selenium, as many do, and as you already have.
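If you would rather stay in Scrapy, one common option is the scrapy-splash plugin, which renders the JavaScript in a Splash instance before your spider sees the response. A rough sketch, assuming Splash is running locally on port 8050 and scrapy-splash is enabled in settings.py (the URL is a placeholder):
import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered"

    def start_requests(self):
        # wait a couple of seconds so the page's JS can fill the #hits div
        yield SplashRequest("https://example.com/products",
                            callback=self.parse,
                            args={"wait": 2})

    def parse(self, response):
        # the response now contains the JS-rendered HTML
        yield {"hits_html": response.css("#hits").get()}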
I'm using splinter to try and click a button that has no attributes. The html code in developer tools is
<div class="numeric">
<button>0</button>
<button>1</button>
<button>...</button>
<button>9</button>
</div>
which executes some jQuery.
My code looks like this
browser.find_by_css("numeric").find_by_text("0").first.click()
However, I am getting an error that "numeric" is not found. I have also tried "#numeric".
That's not really the right CSS selector for that div's class.
You could try to find it by xpath using this:
browser.find_by_xpath("//div[contains(@class, 'numeric')]")
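Alternatively, a CSS selector also works once it actually targets the class (note the "div.numeric" form). A small sketch, assuming the digit buttons are the only buttons inside that div:
# find every button inside the div with class "numeric"
buttons = browser.find_by_css("div.numeric button")
# click the one whose visible text is "0"
for button in buttons:
    if button.text == "0":
        button.click()
        break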
Unfortunately, I am not able to post code to reproduce this problem, since it involves signing into a site that is not public, but my question is more general than a code problem. Essentially, driver.page_source does not match what shows up in the browser it is driving. This is not an issue with elements not loading fully, because I am testing this while executing code line by line in a Python terminal. I am looking at the page source in the browser after right-clicking and going to "view page source", but if I print driver.page_source or attempt to find_element_by_[...], it shows slightly different code with entire elements missing. Here is the HTML in question:
<nav role="navigation" class="utility-nav__wrapper--right">
<input id="hdn_partyId" value="1965629" type="hidden">
<input id="hdn_firstName" value="CHARLES" type="hidden">
<input id="hdn_sessionId" value="uHxQhlARvzA7N16uh+KJAdNFIcY6D8f9ornqoPQ" type="hidden">
<input id="hdn_cmsAlertRequest" type="hidden" value="Biennial Plus">
<ul class="h-list h-list--middle">
[...]
</ul>
I need all 4 of the input elements; however, the hdn_partyId and hdn_sessionId elements do not appear in Selenium's .page_source, and if I try to get them with .find_element_by_[...] I get a NoSuchElementException.
I even ran a check that finds and lists all input elements, and these 2 do not show up.
Does anyone have any idea why Selenium would not provide the same content as directly looking at the browser it is driving?
EDIT: To clarify... I am driving Chrome with ChromeDriver through Selenium. This is not an issue with the page not fully loading. As I mentioned, I am running this manually, line by line, in a Python terminal, not executing a script. So the browser pops up, loads the page, and logs in; then I manually check the browser's page source and see the element, then I print driver.page_source and it's not there, and if I run session_id = driver.find_element_by_id('hdn_sessionId') I get a NoSuchElementException. There are also no frames at all in the page, nor any additional windows.
A coworker of mine has figured out the issue and a workaround. Essentially, after the page is done loading, it runs a JavaScript routine that cleans up the DOM, so what "view page source" in the browser shows is not the current state. Running print driver.page_source, or any form of driver.find_element_by_[...], pulls from the newest, freshest page data, while the browser's "view page source" only shows what was served when the page first loaded. If you start inspecting the page in Chrome, you will see that the HTML is different from what the browser says the "page source" is. After reverse-engineering the JavaScript, we are able to run partyid = driver.execute_script('return accountdata.$partyId.val();') and get what was originally assigned. I hope this is enough info to help other people who may run into this issue in the future.
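One way to see the difference for yourself is to compare the HTML the server originally sent with what the driver reports after the page's JS has run. A rough sketch of that comparison, reusing the driver's cookies so the logged-in page is fetched; hdn_sessionId is the element from the question:
import requests
from bs4 import BeautifulSoup

# copy the authenticated session cookies out of the Selenium driver
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

# the HTML as originally served (roughly what "view page source" shows)
original_html = session.get(driver.current_url).text
original_input = BeautifulSoup(original_html, "html.parser").find(id="hdn_sessionId")
print("in original HTML:", original_input is not None)

# the live DOM after the page's JS has cleaned it up
print("in live DOM:", "hdn_sessionId" in driver.page_source)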
Try it like this and you will get the source code. The "view-source:" keyword can differ according to your browser; this one is for Chrome:
driver.get("view-source:"+url)
sourcecode=driver.find_element_by_tag_name('body').text
If you locate the 'body' of the page and then use get_attribute('innerHTML'), you can access everything on the page.
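For example, something along these lines returns the live markup of the whole page body:
# grab the body element and read its current, rendered HTML
body_html = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
print(body_html)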
Quite often when using Selenium, waiting does the trick without needing a lot of extra code (i.e. giving the full DOM a few seconds to load). So in the example below, the HTML that was gathered reflects what one would see when one 'inspects' the page, as opposed to using 'view source', which displays the pre-JS DOM:
from time import sleep
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)  # url is the page you want the rendered HTML of
sleep(10)  # give the page's JS a few seconds to finish building the DOM
HTML = driver.page_source
So I'm scraping Omegle, trying to get the number of users online.
This is the HTML code:
<div id="onlinecount">
<strong>
30,000+
</strong>
</div>
Now I would presume that using lxml it would be //div[@id="onlinecount"] to scrape any text within that div. I want to get the number from the strong tag, but when I try to scrape this, I just end up with an empty list.
Here's my relevant code:
print "\n Grabbing users online now from",self.website
site = requests.get(self.website)
tree = html.fromstring(site.text)
users = tree.xpath('//div[@id="onlinecount"]')
Note that the self.website variable is just http://www.omegle.com
Any ideas what I'm doing wrong? Note that I can scrape other parts, just not the number of online users.
I ended up using a different set of code which I learned from a friend.
Here's my full code for anyone interested.
http://pastebin.com/u1kTLZtJ
When you send a GET request to "http://www.omegle.com" using the requests Python module, what I observed is that there is no "onlinecount" in site.text. The reason is that that part gets rendered by JavaScript. You should use a library that is able to execute the JavaScript and give you the final HTML source as rendered in a browser. One such third-party library is Selenium (http://selenium-python.readthedocs.org/). The only downside is that it opens a real web browser.
Below is working code using Selenium:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.omegle.com")
# the count sits inside a strong tag within the #onlinecount div
element = browser.find_element_by_id("onlinecount")
onlinecount = element.find_element_by_tag_name("strong")
print(onlinecount.text)  # e.g. "30,000+"
You can also send a GET request to http://front1.omegle.com/status, which will return the count of online users and other details in JSON form.
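For instance, a small sketch with requests; the "count" key is an assumption, so check the actual JSON payload:
import requests

resp = requests.get("http://front1.omegle.com/status")
data = resp.json()
# "count" is a guess at the key name -- inspect the returned JSON to confirm
print("users online:", data.get("count"))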
I have done a bit of looking at this, and that particular part of the page is not plain markup but JavaScript.
Here is the source (this is what the requests library is returning to your program):
<div id="onlinecount"></div>
<script>
if (IS_MOBILE) {
$('sharebuttons').dispose();
$('onlinecount').dispose();
}
</script>
</div>
As you can see, in lxml's eyes there is nothing but a script in the onlinecount div.
I agree with Praveen.
If you want to avoid launching a visible browser, you could use PhantomJS, which also has a Selenium driver:
http://phantomjs.org/
PhantomJS is a headless WebKit scriptable with a JavaScript API
Instead of Selenium scripts, you could also write PhantomJS JS scripts (but I assume you prefer to stay in a Python environment ;))
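A minimal sketch of the same onlinecount lookup with the PhantomJS driver, assuming the phantomjs binary is on your PATH and a Selenium version that still ships this driver:
from selenium import webdriver

# headless WebKit browser instead of a visible Firefox window
driver = webdriver.PhantomJS()
driver.get("http://www.omegle.com")
count = driver.find_element_by_id("onlinecount").find_element_by_tag_name("strong").text
print("users online:", count)
driver.quit()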
I have a button on a LinkedIn page with this code:
<div class="primary-action-button"><a class="primary-action label" href="/requestList?displayProposal=&destID=39959446&creationType=DC&authToken=Yr4_&authType=OUT_OF_NETWORK&trk=vsrp_people_res_pri_act&trkInfo=VSRPsearchId%3A2998448551382744275729%2CVSRPtargetId%3A39959446%2CVSRPcmpt%3Aprimary">Send InMail</a></div>
Is there any way to click on an element just by its href link? Thanks
Using selenium you could use the following code:
driver.find_element_by_link_text("Send InMail").click()
Note that driver.findElement(By.linkText("Send InMail")).click(); is the Java form. In Python, use find_element_by_link_text:
driver.find_element_by_link_text('Send InMail').click()
Or something like this is sometimes helpful:
driver.find_element_by_partial_link_text('Send').click()
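To match the element by its href specifically, as the question asks, you could also use an XPath that checks the href attribute; the '/requestList' substring here is just taken from the snippet above:
# click the anchor whose href contains the requestList path
driver.find_element_by_xpath("//a[contains(@href, '/requestList')]").click()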