Scraping Dynamic Content - Page Source Different than Website - python

I am trying to scrape the content from this website:
http://america.aljazeera.com/search.html?q=blizzard
The page source is different than what appears on the page and I think it is because of the dynamic content. Are there any simple ways to scrape this page the way that it actually appears? I am using python.

Related

Scraping a web page through a link.

I am try to scrape a web page that you have to use a specific link to access the website. The issue is that this link takes you to the home page and that each product on the website has a unique url. My question is what would I do to access these product pages in order to scrape and download the PDF?
I am used to just looping thru the URLs directly but have never had to go thru one link to access the other urls. Any help would be great.
I am using Python and bs4.

Web Scraping Javascript Using Python

I am used to using BeautifulSoup to scrape a website, however this website is different. Upon soup.prettify() I get back Javascript code, lots of stuff. I want to scrape this website for the data on the actual website (company name, telephone number etc). Is there a way of scraping these scripts such as Main.js to retrieve the data that is displayed on the website to me?
Clear version:
Code is:
<script src="/docs/Main.js" type="text/javascript" language="javascript"></script>
This holds the text that is on the website. I would like to scrape this text however it is populated using JS not HTML (which I used to use BeautifulSoup for).
You're asking if you can scrape text generated at runtime by Javascript. The answer is sort-of.
You'd need to run some kind of headless browser, like PhantomJS, in order to let the Javascript execute and populate the page. You'd then need to feed the HTML that the headless browser generates to BeautifulSoup in order to parse it.

fetch text from web with Angular JS tags such as ng-view

I'm trying to fetch all the visible text from a website, I'm using python-scrapy for this work. However what i observe scrapy only works with HTML tags such as div,body,head etc. and not with angular js tags such as ng-view, if there is any element within ng-view tags and when I do a right-click on the page and do view source then the content inside the tag doesn't appear and it displays like <ng-view> </ng-view>, So how can I use python to scrap the elements within this ng-view tags.Thanks in advance..
To answer your question
how can I use python to scrap the elements within this ng-view tags
You can't.
The content you want to scrape renders on the client side(browser), what scrapy get's you is just static content from server, your browser than interprets the HTML code and renders the JS code. And JS code than fetches different content from server again and makes some stuff with it.
Can it be done?
Yes!
One of the ways is to use some sort oh headless browser like http://phantomjs.org/ to fetch all the content. Once you have the content you can save it and scrape it as you wish. The thing is that this kind of web scraping is not as easy and straight forward as just scraping regular HTML. There is a reason why Google still doesn't scrape web pages that render their content via JS.

Scrape and Get main content of any web site

I want to scrape web page an get title and main content of any web site. i see this. if you copy any url (for example copy http://en-maktoob.news.yahoo.com/pakistani-army-fuels-anger-securing-swat-taliban-025337458.html to textbox and press enter) from any article this web page get title and article and summarize it. it's work for the most websites. i want to know how is work without using html tag parsing for any website? how get main article of each webpage?

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi i have successfully scraped all the pages of few shopping websites by using Python and Regular Expression.
But now i am in trouble to scrape all the pages of a particular website where next page follow up link is not present in current page like this one here http://www.jabong.com/men/clothing/mens-jeans/
This website is loading the next pages data in same page dynamically by Ajax calls. So while scraping i am only able to scrape the data of First page only. But I need to scrape all the items present in all pages of that website.
I am not getting a way to get the source code of all the pages of these type of websites where next page's follow up link is not available in Current page. Please help me through this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down the page detects when they reach the end of the current set of results, and loads the next set, as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. They key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.

Categories