How to turn display from none to block in Scrapy?

I'm trying to scrape data from a drop-down menu (here is the link). While inspecting the page to get the XPath, I realized that the menu's display is set to none. Is there any way to scrape data from that drop-down menu ("Fits the following cars") whose display is set to none? If yes, how? If not, why?

The data you want to scrape is populated via an Ajax call, so you need to find the URL of that call. Once you have it, the rest is easy.
Follow the steps below.
Open Chrome
Open the link
Open Developers console
Go to network tab
Now click on "Fits the following cars"
In the Network tab, watch for the request that is made
In your case, it's a POST request that is made on the fly.
Here is the pic of the call
Therefore, you need to find the URL and the request parameters passed with the request.
You can see that the request parameters are as follows:
catentryId: 31426
techDocId: 33503
Now that you have the URL and the data, it's just a matter of a few lines of code.
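Those few lines could look roughly like the sketch below. It only builds the POST request with the two parameters listed above and does not send it. The endpoint URL is a placeholder, since the real one has to be copied from the Network tab, and the helper name build_ajax_request is made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): replace it with the URL shown in the Network tab.
AJAX_URL = "https://example.com/ajax/fits-the-following-cars"


def build_ajax_request(catentry_id, tech_doc_id, url=AJAX_URL):
    """Build, without sending, a POST request carrying the two form parameters."""
    form = {"catentryId": str(catentry_id), "techDocId": str(tech_doc_id)}
    body = urlencode(form).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


request = build_ajax_request(31426, 33503)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Inside a Scrapy spider the same form dictionary can be passed as scrapy.FormRequest(url, formdata=form, callback=...), and the callback then parses whatever the endpoint returns.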

Related

Scrapy: webpage next button uses WebForm_DoPostBackWithOptions()

I am new to scrapy and trying to scrape https://www.sakan.co/result?srv=1&prov=&cty=&maintyp=1&typ=5&minpr=&maxpr=&bdrm=&blk=
The page's "next" link uses an href like the following:
href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$Content$rptPaging$ctl02$lbPaging", "", true, "", "", false, true))"
The data is loaded dynamically. I tried to find the source of the data (an API call, if any) but could not find one. How can I navigate to the next page and scrape the data using Scrapy?

What this JavaScript effectively does is trigger a POST request. You can check the details of the request in your browser's developer tools, in the Network tab. (Press F12 in Firefox, open the tab, and click the link.)
Your Scrapy spider needs to reproduce that same POST request. All the information in the body is available in the page. Just keep in mind that the fields that start with __, like __VIEWSTATE, are instance dependent, so you need to retrieve their values from the page your spider loads; copying and pasting them from the browser will usually fail.
The easiest way to do this is with the FormRequest.from_response() method. However, it's important to check that the method produces the same request body as your browser does; quite often it skips a required field or adds an extra one. (It relies on the page's <form>.)
You can read more on scraping this kind of page in this link from Scrapy FAQ.
One last tip: if your request body is just like the browser's but the request still fails, you might need to reproduce the request headers as well.
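As a rough illustration of retrieving the instance-dependent fields from the loaded page, the sketch below collects every hidden input from the HTML and adds the postback target. It uses only the standard library rather than FormRequest.from_response(). The sample HTML and its field values are invented, and setting __EVENTTARGET to the first argument of WebForm_PostBackOptions is an assumption based on how ASP.NET WebForms postbacks usually work; compare the result with the body your browser sends.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_body(page_html, event_target, event_argument=""):
    """Return a URL-encoded POST body built from the page's hidden fields."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    # Assumption: the server reads the clicked control from these two fields.
    form["__EVENTTARGET"] = event_target
    form["__EVENTARGUMENT"] = event_argument
    return urlencode(form)


# Invented sample page; real values must come from the response your spider receives.
SAMPLE_HTML = """
<form method="post" action="./result">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form>
"""

body = build_postback_body(SAMPLE_HTML, "ctl00$Content$rptPaging$ctl02$lbPaging")
print(body)
```

In a spider, the equivalent is usually FormRequest.from_response(response, formdata={"__EVENTTARGET": ...}), which fills in the hidden fields from the page's <form> for you.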

Scraping the content of a box that contains infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape the links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.

It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:
You can see that it's making a POST request to download the URLs of the articles.
Every value here is self-explanatory except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
You can experiment further to reverse engineer how the website does it and what each parameter is used for, but this is a good start.
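For reference, here is a small sketch of how that GET URL can be assembled from its parameters, so that docId or pullCount can be changed from code. The function name build_headlines_url is made up; the parameter names and values are the ones from the URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.marketwatch.com/newsviewer/mktwheadlines"


def build_headlines_url(doc_id, pull_count=100):
    """Assemble the headlines URL with the timestamp parameter left out."""
    params = {
        "blogs": "true",
        "commentary": "true",
        "docId": str(doc_id),
        "premium": "true",
        "pullCount": str(pull_count),
        "pulse": "true",
        "rtheadlines": "true",
        "topic": "All Topics",
        "topstories": "true",
        "video": "true",
    }
    # quote_via=quote encodes the space in "All Topics" as %20 instead of +.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_headlines_url(1275261016)
print(url)
```

The resulting string can be handed to scrapy.Request(url) or any other HTTP client, and the article links are then extracted from the returned markup.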

How to scrape a value from a page that loads dynamically?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
2. If the value doesn't appear in the page source, what can I do to reach it?

If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from the API, use the result to create the node in the DOM, and attach it to the "Available Jobs" tab. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if it uses a simple REST API, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping with a JavaScript-capable browser driven by a tool like Selenium.
One final thing I want to mention is that regular expressions are a fragile way to parse HTML; you should generally prefer a library like BeautifulSoup.
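To make the difference between the page source and the rendered DOM concrete, here is a minimal sketch that reads the text of a <span> with a real HTML parser instead of a regex. It uses the standard library's html.parser in place of BeautifulSoup only to stay dependency-free. Both HTML snippets, the job-count id, and the number 128 are invented for illustration.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside the <span> that has a given id."""

    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self._inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == self.span_id:
            self._inside = True

    def handle_endtag(self, tag):
        # Good enough for a sketch: nested spans are not handled.
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


def span_text(html, span_id):
    parser = SpanTextParser(span_id)
    parser.feed(html)
    parser.close()
    return parser.text.strip()


# What an HTTP GET returns: the span exists but is empty.
RAW_SOURCE = '<div><span id="job-count"></span> Available Jobs</div>'
# What the browser shows after JavaScript has filled the span in.
RENDERED_DOM = '<div><span id="job-count">128</span> Available Jobs</div>'

print(repr(span_text(RAW_SOURCE, "job-count")))
print(repr(span_text(RENDERED_DOM, "job-count")))
```

No parser can recover the number from the raw source, which is why the value has to come from the underlying request or from a browser that runs the JavaScript.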

1. A value can be loaded dynamically with Ajax. Ajax loads asynchronously, which means the rest of the page does not wait for the Ajax content to be rendered; that's why the elements loaded with Ajax do not appear in the HTML you download.
2. For scraping dynamic content you should use Selenium; here is a tutorial.

For data that loads dynamically, look for an XHR request in the Network tab; if you can make use of the data it returns, then voilà!
You can also use PhantomJS. It's a headless browser, and it captures the HTML of the page including the dynamically loaded content.

Python Scrapy : response object different from source code in browser

I'm working on a project using Scrapy.
All wanted fields but one get scraped perfectly. The content of the missing field simply doesn't show up in the Scrapy response (as checked in the Scrapy shell), while it does show up when I use my browser to visit the page. In the Scrapy response, the expected tags are there, but not the text between the tags.
There's no JavaScript involved, but it is a variable that is provided by the server (it's the current number of visits to that particular page). No iframe involved either.
Already set the user agent (in the settings-file) to match my browser.
Already set the download delay (in the settings-file) to 5.
EDIT (addition):
The page : http://www.fincaraiz.com.co/apartamento-en-venta/bogota/salitre-det-1337688.aspx
XPath to the wanted element : //*[@id="numAdvertVisits"]
What could be the cause of this mystery ?

It's an Ajax/JavaScript-loaded value.
What steps did you take to determine there is no JS involved? I loaded the page without JavaScript, and while that area of the page had the stub content ("Visitas"), the actual data was written there by an Ajax request.
You can still load that data using Scrapy; it just takes an additional request to the URL endpoint normally accessed via on-page Ajax. The server returns the number of visits in XML, via the script at http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits?idAdvert=1337688&idASource=40&idType=1001 (try loading that URL and you'll see the number of visits for the page you provided in the original email).
There is another ajax request that returns "True" for that page, but I'm not sure what the data's actual meaning is. Still, it may be useful:
http://www.fincaraiz.com.co/WebServices/Statistics.asmx/DetailAdvert?idAdvert=1337688&idType=1001&idASource=40&strCookie=13/11/2014:19-05419&idSession=10hx5wsfbqybyxsywezx0n1r&idOrigin=44
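A sketch of that additional request, under stated assumptions: the URL is built from the GetAdvertVisits parameters shown above, and the XML body is assumed to be a single element whose text is the visit count. The sample XML and the number 57 are invented, so check the real response before relying on parse_visits.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ENDPOINT = "http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits"


def build_visits_url(id_advert, id_asource=40, id_type=1001):
    """Build the statistics URL for one advert."""
    query = urlencode(
        {"idAdvert": id_advert, "idASource": id_asource, "idType": id_type}
    )
    return ENDPOINT + "?" + query


def parse_visits(xml_text):
    """Return the visit count, assuming it is the text of the root element."""
    root = ET.fromstring(xml_text)
    return int((root.text or "").strip())


# Invented sample of what the endpoint might return.
SAMPLE_XML = '<int xmlns="http://tempuri.org/">57</int>'

print(build_visits_url(1337688))
print(parse_visits(SAMPLE_XML))
```

In a spider, the item callback would yield scrapy.Request(build_visits_url(advert_id), callback=...) and fill in the missing field from that second response.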

urllib2 not retrieving url with hashes on it

I'm trying to get some data from a webpage, but I found a problem. Whenever I want to go to the next page (i.e. page 2) to keep retrieving the data on it, I keep receiving the data from page 1. Apparently something goes wrong trying to switch to the next page.
The thing is, I haven't had problems with urls like this:
'http://www.webpage.com/index.php?page=' + str(pageno)
I can just start a while statement and I'll just jump to page 2 by adding 1 to "pageno"
My problem comes in when I try to open an url with this format:
'http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=' + str(pageno)
As
urllib2.urlopen('http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4').read()
will retrieve the source code from http://www.webpage.com/search/?show_all=1
There is no other way to retrieve other pages without using the hash, as far as I'm concerned.
I guess it's just urllib2 ignoring the hash, as it is normally used to specify a starting point for a browser.

The fragment of the url after the hash (#) symbol is for client-side handling and isn't actually sent to the webserver. My guess is there is some javascript on the page that requests the correct data from the server using AJAX, and you need to figure out what URL is used for that.
If you use chrome you can watch the Network tab of the developer tools and see what URLs are requested when you click the link to go to page two in your browser.

That's because the hash is not part of the URL that is sent to the server; it's a fragment identifier, normally used to identify elements inside the page. Some websites repurpose the hash fragment as a JavaScript hook for identifying pages, though. You'll either need to execute the JavaScript on the page, or reverse engineer the JavaScript and emulate the true search request that is being made, presumably through Ajax. Firebug's Net tab will be really useful for this.
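The split between what is sent to the server and what stays in the browser can be checked directly. This sketch uses Python 3's urllib.parse (urllib2 is the Python 2 module) on the example URL from the question; it only parses the URL and makes no request.

```python
from urllib.parse import parse_qs, urldefrag

url = "http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4"

# Everything after "#" is the fragment and never reaches the server.
requested_url, fragment = urldefrag(url)
client_side_params = parse_qs(fragment)

print(requested_url)
print(fragment)
print(client_side_params)
```

The values in client_side_params are the ones the page's JavaScript reads, so they are likely to reappear as parameters of the Ajax request you find in the Network tab.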
