Parsing a Dynamic Web Page using Python - python

I am trying to parse a web page whose HTML source changes when I press an arrow key to open a drop-down list.
I want to parse the contents of that drop down list. How can I do that?
Example of the problem: if you go to this site: http://in.bookmyshow.com/hyderabad and click the arrow on the "Select Movie" combo box, a drop-down list of movies appears. I want to get a list of these movies.
Thanks in advance.

The actual URL with the data used to populate the drop-down box is here:
http://in.bookmyshow.com/getJSData/?file=/data/js/GetEvents_MT.js&cmd=GETEVENTSWEB&et=MT&rc=HYD&=1425299159643&=1425299159643
I'd be a bit careful though and double-check with the site terms of use or if there are any APIs that you could use instead.
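If you do go the direct-URL route, here is a minimal sketch using only the standard library. The endpoint and its parameters were read off the URL above and may change at any time; "rc" looks like a region code and "et" an event type, but those meanings are guesses.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import urlopen

# Endpoint observed in the browser's Network tab; parameters are guesses.
URL = ("http://in.bookmyshow.com/getJSData/?file=/data/js/GetEvents_MT.js"
       "&cmd=GETEVENTSWEB&et=MT&rc=HYD")

params = parse_qs(urlsplit(URL).query)
print(params["rc"])  # ['HYD']

def fetch_events(url=URL):
    """Download the raw payload. Note that it is a JavaScript file, not
    JSON, so it still needs to be picked apart after downloading."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Changing rc would presumably select a different city, but verify that against the site before relying on it.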

You may want to have a look at Selenium. It lets you reproduce exactly the same steps you perform manually, because it drives a real browser (Firefox, Chrome, etc.).
Of course, it's not as fast as using mechanize, urllib, BeautifulSoup, and the like, but it is worth a try.
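Once Selenium (or anything else) has handed you the rendered HTML, pulling the option labels out of the <select> element is ordinary parsing. A small sketch using only the standard library; the markup below is made up for illustration, a real run would use driver.page_source instead:

```python
from html.parser import HTMLParser

class OptionCollector(HTMLParser):
    """Collect the text of every <option> in the document."""
    def __init__(self):
        super().__init__()
        self._in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option and data.strip():
            self.options.append(data.strip())

# Stand-in for driver.page_source after the drop-down has been opened.
html = """<select id="selMovie">
  <option value="1">Movie One</option>
  <option value="2">Movie Two</option>
</select>"""

parser = OptionCollector()
parser.feed(html)
print(parser.options)  # ['Movie One', 'Movie Two']
```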

You will need to dig into the JavaScript to see how that menu gets populated. If it is getting populated via AJAX, then it might be easy to get that content by re-doing a request to the same URL (e.g., do a GET to "http://www.example.com/get_dropdown_entries.php").
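A hedged sketch of replaying such an AJAX request with the standard library; the URL below is the placeholder from above, not a real endpoint, and the real one has to be found in the browser's dev-tools Network tab while opening the menu:

```python
from urllib.request import Request, urlopen

# Placeholder endpoint; substitute the real URL from the Network tab.
ENDPOINT = "http://www.example.com/get_dropdown_entries.php"

def fetch_dropdown(url=ENDPOINT):
    """Re-issue the GET that the page's JavaScript would have made."""
    # Some servers only answer AJAX-style requests, hence the header.
    req = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```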

Related

Selenium to simulate click without loading link?

I'm working on a project trying to autonomously monitor item prices on an Angular website.
Here's what a link to a particular item would look like:
https://www.<site-name>.com/categories/<sub-category>/products?prodNum=9999999
Using Selenium (in Python) on a page with product listings, I can get some useful information about the items, but what I really want is the prodNum parameter.
The onClick attribute for each item is clickOnItem(item, $index).
I do have some information for the items, including the presumable item and $index values, which are visible within the HTML, but I doubt there is a way of seeing what actually happens inside clickOnItem.
I've tried looking around using dev-tools to find where clickOnItem is defined, but I haven't been successful.
Since I don't see any way of getting prodNum without clicking, I'm wondering: is there a way I could simulate a click to see where it would redirect, but without actually loading the link? Loading each link would take far too much time to do for every item.
Note: I want to get the specific prodNum. I want to be able to hit the item page directly without first going through the main listing page.

How do I use lxml to interact with the page and pull up a menu to be scraped?

For reference, this is the page that I will use as an example. It is the one that best demonstrates what I am trying to accomplish. If you look at the page, there is a brands banner at the top of the screen. In the top right-hand corner, there is a "see all" button which pulls up a menu. The data from this menu is not in the HTML; it is generated by the click of that button. Is there any way to have lxml perform the action of clicking that button and pulling up that menu?
I took a look at the network log. There does not appear to be any file or URL in there that would contain the data from that menu. I believe Selenium does have this functionality, but I would prefer to be able to use only lxml.
lxml is a parser, so it cannot click button elements on the page. Unfortunately, using a tool like Selenium is what you need to do to accomplish this.
I know you mentioned looking at the network log. Usually in these cases it is best to try to find the endpoint and issue the request directly, but if you tried and can't find the request then use Selenium.
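If you do fall back to Selenium, the click itself is a one-liner once you know the button's selector, and lxml can still do all of the parsing afterwards. The CSS selector below is a hypothetical placeholder; inspect the page with dev-tools to find the real one:

```python
# Hypothetical selector for the "see all" button; replace with the real one.
SEE_ALL_CSS = "button.see-all"

def open_brand_menu(driver):
    """Click the 'see all' button, then return the now-updated HTML,
    which can be handed to lxml.html.fromstring for the actual parsing."""
    driver.find_element("css selector", SEE_ALL_CSS).click()
    return driver.page_source
```

This keeps Selenium's role to the single click, while lxml remains the parser.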

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
With the following code but it returns an empty array
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic (this is what I got when I used the view(response) command in the scrapy shell): as you can see, the price info doesn't come out.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you will notice that the page looks a bit different. Also, take a look at the page source (usually that's unaltered) to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages that look the same as they do in the browser, I recommend using a headless rendering service like Splash (especially if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
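A sketch of calling that endpoint directly with the standard library. The parameters, especially keyStoreDataversion, were copied from the captured request and are likely to expire or change, so treat them as placeholders:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://www.asos.com/api/product/catalogue/v2/stockprice"

def stockprice_url(product_id, currency="AUD", store="AU"):
    """Build the stock-price URL for one product.
    keyStoreDataversion was copied from a captured request and may go stale."""
    params = {
        "productIds": product_id,
        "currency": currency,
        "keyStoreDataversion": "0ggz8b-4.1",
        "store": store,
    }
    return BASE + "?" + urlencode(params)

def fetch_price(product_id):
    """Fetch and decode the JSON response for one product id."""
    with urlopen(stockprice_url(product_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))

print(stockprice_url(9065343))
```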
You can try to find the JSON inside the HTML (using a regular expression) and parse it:
import json
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]
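As a self-contained illustration of that pattern, here is the same idea run against a made-up page; the real regex has to match Asos's actual markup, which will differ:

```python
import json
import re

# Toy page mimicking the pattern: product data embedded as a JSON string
# inside an inline <script>. The real markup will differ.
html = """<script>window.run = function (view) {};
run(view('{"price": {"current": 41.99}}'));</script>"""

match = re.search(r"view\('([^']+)'\)", html)
data = json.loads(match.group(1))
print(data["price"]["current"])  # 41.99
```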

Can selenium be used to highlight sections of a web page?

Can I have any kind of highlighting using Python 2.7? Say, when my script clicks the submit button, feeds data into a text field, or selects values from a drop-down, I'd like it to highlight that element, so the script runner can see that the script is doing what he/she wants.
EDIT
I am using selenium-webdriver with python to automate some web based work on a third party application.
Thanks
This is something you need to do with javascript, not python.
[NOTE: I'm leaving this answer for historical purposes but readers should note that the original question has changed from concerning itself with Python to concerning itself with Selenium]
Assuming you're talking about a browser based application being served from a Python back-end server (and it's just a guess since there's no information in your post):
If you are constructing a response in your Python back-end, wrap the stuff that you want to highlight in a <span> tag and set a class on the span tag. Then, in your CSS define that class with whatever highlighting properties you want to use.
However, if you want to accomplish this highlighting in an already-loaded browser page without generating new HTML on the back end and returning that to the browser, then Python (on the server) has no knowledge of or ability to affect the web page in browser. You must accomplish this using Javascript or a Javascript library or framework in the browser.
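That said, Selenium can inject that JavaScript for you via execute_script, so the highlighting can still be driven from the Python side. A minimal sketch; the driver is any Selenium WebDriver instance, and the styling is arbitrary:

```python
import time

# JavaScript run in the browser: arguments[0] is the WebElement that
# Selenium passes through to the script.
HIGHLIGHT_JS = "arguments[0].style.outline = '3px solid red';"
CLEAR_JS = "arguments[0].style.outline = '';"

def flash(driver, element, seconds=0.5):
    """Briefly outline an element so whoever is watching the run
    can see which control the script is about to use."""
    driver.execute_script(HIGHLIGHT_JS, element)
    time.sleep(seconds)
    driver.execute_script(CLEAR_JS, element)
```

Call flash(driver, element) just before element.click() or element.send_keys(...), and the element will blink red in the live browser window.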

How to fill a textArea in an online form automatically using Python?

I am wondering how I can fill in an online form automatically. I have researched it, and it turned out that one can use Python (I am more interested in knowing how to do it with Python because it is a scripting language I know), but the documentation about it is not very good. This is what I found:
Fill form values in a web page via a Python script (not testing)
Even the "mechanize" package itself does not have enough documentation:
http://wwwsearch.sourceforge.net/mechanize/
More specifically, I want to fill the TextArea in this page (Addresses):
http://stevemorse.org/jcal/latlonbatch.html?direction=forward
so I don't know what I should look for. Should I look for the "id" of the textArea? It doesn't look like it has an "id" (or I am very naive!). How can I "select_form"?
Python, web gurus, please help.
Thanks
See if my answer to the other question you linked helps:
https://stackoverflow.com/a/5685569/711017
EDIT:
Here is the explicit code for your example. I don't have mechanize installed right now, so I haven't been able to check the code; no online IDEs I checked have it either. But even if it doesn't work, toy around with it and you should eventually get there:
import re
from mechanize import Browser
br = Browser()
br.open("http://stevemorse.org/jcal/latlonbatch.html?direction=forward")
br.select_form(name="display")
br["locations"] = "Hollywood and Vine, Hollywood CA"
response = br.submit()
print response.read()
Explanation: br emulates a browser that opens your url and selects the desired form. It's called display in the website. The textarea to enter the address is called locations, into which I fill in the address, then submit the form. Whatever the server returns is the string response.read(), in which you should find your Lat-Longs somewhere. Install mechanize and check it out.
