Scraping the content of a box that contains infinite scrolling in Python - python

I am new to Python and web crawling. I intend to scrape the links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all requests for the links go to the same URL: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.

It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:
You can see that it's making a POST request to download the URLs of the articles.
Every value here is self-explanatory except maybe docId and timestamp. docId seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
You can experiment further to reverse engineer how the website does it and what each parameter is for, but this is a good start.
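As a minimal sketch of the approach above, the following builds the same GET URL and pulls the links out of whatever HTML comes back. It uses only the standard library. The helper names (build_headlines_url, extract_links) are made up for illustration, and it assumes the response is an HTML fragment whose article links are ordinary <a href="..."> tags, which the answer suggests but does not show.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urlencode

BASE_URL = "http://www.marketwatch.com/newsviewer/mktwheadlines"


def build_headlines_url(pull_count=100, doc_id="1275261016"):
    """Build the GET URL from the answer, without the timestamp parameter."""
    params = {
        "blogs": "true",
        "commentary": "true",
        "docId": doc_id,
        "premium": "true",
        "pullCount": str(pull_count),
        "pulse": "true",
        "rtheadlines": "true",
        "topic": "All Topics",
        "topstories": "true",
        "video": "true",
    }
    # quote_via=quote encodes the space in "All Topics" as %20.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    # Assumption: the article URLs are plain <a href="..."> links.
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Fetching is left out of the functions above so they can be tried offline:
# from urllib.request import urlopen
# html = urlopen(build_headlines_url()).read().decode("utf-8", "replace")
# print(extract_links(html))
```

If the site rejects the default urllib user agent or has changed its endpoint since the answer was written, the fetch step will need adjusting; the URL building and link extraction stay the same.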

Related

Unable to get all the links within a page

I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It has a total of 55 products, of which only 24 are shown once the page is loaded. However, the div contains the list of all 55 products. I am trying to scrape it using Scrapy like this:
def parse(self, response):
    print("in herre")
    # XPath attributes are written with "@" (e.g. @class, @href)
    self.product_url = response.xpath('//div[@class="collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly - the other option would be rendering Javascript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools on Chrome/Firefox, filter down to only the XHR Requests and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML. Here's a screenshot of what's going on there behind the scenes.
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
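A rough sketch of the parsing step described above, using only the standard library rather than scrapy.selector.Selector. It assumes the JSON body has a top-level "products" list whose entries carry the body_html attribute mentioned in the answer; the helper names are hypothetical.

```python
import json
from html.parser import HTMLParser

# URL pattern from the answer; {page} is filled in per request.
PRODUCTS_URL = "https://www.jny.com/collections/bottoms/products.json?limit=250&page={page}"


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.chunks)


def parse_products(body):
    """Return the plain-text description of each product in a JSON body."""
    data = json.loads(body)
    # Assumption: products are listed under a top-level "products" key.
    return [html_to_text(p.get("body_html") or "") for p in data.get("products", [])]
```

In a Scrapy spider, the same function could be called on response.text inside the callback for a request to PRODUCTS_URL.format(page=1).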

How to loop through each page of website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to iteratively update the url or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup doesn't have anything to do with fetching web pages: you get the web page yourself, then feed it to bs4 for processing.
The problem with the page you linked is that its content is generated by JavaScript, so it only renders correctly in a browser (or any other JavaScript VM).
@Fabricator is on the right track: you'll need to watch the developer console and see what Ajax requests the JS is sending to the server. In this case, also take a look at the query string params, which include a param called CurrentPage. That's probably the one you want to focus on.
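Once the Ajax URL has been copied from the developer console, stepping through the pages is mostly a matter of rewriting that one query parameter. A small sketch, assuming the parameter really is called CurrentPage and starts at 1; the example.com URL below is a placeholder, not the site's actual endpoint.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="CurrentPage"):
    """Return url with its page query parameter set to the given number."""
    parts = urlsplit(url)
    # Keep every other parameter (including blank ones) and drop the old page value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url, last_page):
    """URLs for pages 1..last_page of the same search."""
    return [with_page(url, n) for n in range(1, last_page + 1)]


# Placeholder URL; replace it with the request seen in the network tab.
for page_url in page_urls("https://example.com/results?Keywords=&CurrentPage=1", 3):
    print(page_url)
```

Each URL can then be fetched and its body handed to the existing BeautifulSoup code.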

How to scrape a value from a page that loads dynamically?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
2. If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from the API and use them to create the node in the DOM and attach it to the available jobs tab. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping with a JavaScript-capable browser driven by a tool like Selenium.
One final thing I want to mention is that regular expressions are a fragile way to parse HTML. You should generally prefer a library like BeautifulSoup.
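To illustrate both points, here is a small standard-library sketch that reads the text of a <span> with a parser instead of a regex. The id "job-count" and the two HTML snippets are invented for the example: the first mimics the raw page source, the second what the browser shows after JavaScript has filled in the value.

```python
from html.parser import HTMLParser


class SpanText(HTMLParser):
    """Collect the text inside a <span> with a given id (no nested spans)."""

    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == self.span_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def span_text(html, span_id):
    parser = SpanText(span_id)
    parser.feed(html)
    parser.close()
    return parser.text.strip()


# Hypothetical markup: what requests would download vs. what the inspector shows.
raw_source = '<li><span id="job-count"></span> Available Jobs</li>'
rendered = '<li><span id="job-count">128</span> Available Jobs</li>'

print(repr(span_text(raw_source, "job-count")))
print(repr(span_text(rendered, "job-count")))
```

The empty result for the raw source is the situation described in the question: the element exists, but its value only arrives later through a separate request.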
1. A value can be loaded dynamically with Ajax. Ajax loads asynchronously, which means the rest of the page does not wait for the Ajax content to be rendered. That's why elements loaded with Ajax do not appear in the page source you download.
2. For scraping dynamic content you should use Selenium; here is a tutorial.
For data that loads dynamically, you should look for an XHR request in the network tab. If you can make use of that data, then voilà!
You can use PhantomJS. It's a headless browser, and it captures the HTML of the page with the dynamically loaded content.

How to scrape content rendered in popup window with javascript: links using scrapy

I'm trying to use scrapy to get content rendered only after a javascript: link is clicked. As the links don't appear to follow a systematic numbering scheme, I don't know how to
1 - activate a javascript: link to expand a collapsed panel
2 - activate a (now visible) javascript: link to cause the popup to be rendered so that its content (the abstract) can be scraped
The site https://b-com.mci-group.com/EventProgramme/EHA19.aspx contains links to abstracts that will be presented at a conference I plan to attend. The site's export to PDF is buggy, in that it duplicates a lot of data at PDF generation time. Rather than dealing with the bug, I turned to scrapy only to realize that I'm in over my head. I've read:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
and
How to scrape coupon code of coupon site (coupon code comes on clicking button)
But I don't think I'm able to connect the dots. I've also seen mentions of Selenium, but am not sure that I must resort to that.
I have made little progress, and wonder if I can get a push in the right direction, with the following information in hand:
In order to make the POST request that will expand the collapsed panel (item 1 above), I have traced that the on-page JS javascript:ShowCollapsiblePanel(116114,1695,44,191); will result in a POST request to TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml with payload:
{"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}
The parameters for eventSessionID and eventSessionWebSiteSetupViewID are clearly in the javascript:ShowCollapsiblePanel text.
How do I use scrapy to iterate over all of the links of form javascript:ShowCollapsiblePanel? I tried to use SgmlLinkExtractor, but that didn't return any of the javascript:ShowCollapsiblePanel() links - I suspect that they don't meet the criteria for "links".
UPDATE
Making progress, I've found that SgmlLinkExtractor is not the right way to go, and the much simpler:
sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]').re(r'\((\d+),(\d+),(\d+),(\d+)\)')
in scrapy console returns me all of the numeric parameters for each javascript:ShowCollapsiblePanel() (of course, right now they are all in one long string, but I'm just messing around in the console).
The next step will be to take the first javascript:ShowCollapsiblePanel() and generate the POST request and analyze the response to see if the response contains what I see when I click the link in the browser.
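The step from the extracted numbers to the POST payload could be sketched as follows. It assumes, based only on the single example above, that the first argument of ShowCollapsiblePanel is eventSessionID and the fourth is eventSessionWebSiteSetupViewID; the helper name panel_payloads is made up.

```python
import json
import re

# Matches e.g. javascript:ShowCollapsiblePanel(116114,1695,44,191);
PANEL_RE = re.compile(r"ShowCollapsiblePanel\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)")


def panel_payloads(hrefs):
    """Turn javascript:ShowCollapsiblePanel(...) hrefs into POST payload dicts."""
    payloads = []
    for href in hrefs:
        match = PANEL_RE.search(href)
        if match is None:
            continue  # not a ShowCollapsiblePanel link
        # Assumption: 1st argument = session ID, 4th argument = view ID.
        session_id, _, _, view_id = (int(g) for g in match.groups())
        payloads.append({
            "eventSessionID": session_id,
            "eventSessionWebSiteSetupViewID": view_id,
        })
    return payloads


hrefs = ["javascript:ShowCollapsiblePanel(116114,1695,44,191);"]
for payload in panel_payloads(hrefs):
    print(json.dumps(payload, separators=(",", ":")))
```

In a spider, each payload could then be sent as the JSON body of a scrapy.Request with method="POST" and a Content-Type: application/json header to the GetSessionDetailsHtml URL.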
I fought with a similar problem and, after much hair-pulling, got the data set I needed with import.io. It has a visual scraper that can run with JavaScript enabled, which did just what I needed, and it's free. I also saw a Scrapy-based project on GitHub last night that looked just like the import.io scraper. It is called Portia, but I don't know if it will do what you want.
https://codeload.github.com/scrapinghub/portia/zip/master
Good

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am having trouble scraping all the pages of a particular website where the link to the next page is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next page's data into the same page dynamically through Ajax calls. So while scraping, I am only able to get the data of the first page. But I need to scrape all the items present on all pages of that website.
I cannot find a way to get the source code of all the pages of this type of website, where the link to the next page is not available in the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
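The "keep requesting pages until one comes back empty" idea could look roughly like this. The fetch and extract_items callables are placeholders supplied by the caller (for example, a function that downloads a URL and a function that pulls product entries out of the HTML), and max_pages is an added safety limit that is not part of the answer.

```python
def crawl_pages(fetch, extract_items, base_url, max_pages=1000):
    """Request base_url, then base_url?page=2, ?page=3, ... until a page has no items."""
    items = []
    for page in range(1, max_pages + 1):
        # The first page is the plain URL; later pages add ?page=N.
        url = base_url if page == 1 else f"{base_url}?page={page}"
        found = extract_items(fetch(url))
        if not found:
            break  # an empty page means we ran past the last one
        items.extend(found)
    return items
```

Passing the two functions in keeps the loop independent of how pages are downloaded or parsed, so it works the same with urllib, requests, or a canned dictionary of test pages.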
