Scrape and Get main content of any web site

Scrape and Get main content of any web site - python

I want to scrape a web page and get the title and main content of any website. I have seen a web page that does this: if you paste the URL of any article into its text box and press Enter (for example http://en-maktoob.news.yahoo.com/pakistani-army-fuels-anger-securing-swat-taliban-025337458.html), it gets the title and the article text and summarizes them. It works for most websites. How does this work for any website without using HTML tag parsing? How can I get the main article of each web page?

Related

Getting URL of an XHR request in Python?

I'm working on a web scraping project for scraping products off a website. I've been doing projects like this for a few months with pretty good success, but this most recent website is giving me some trouble. This is the website I'm scraping: https://www.phoenixcontact.com/online/portal/us?1dmy&urile=wcm%3apath%3a/usen/web/home. Here's an example of a product page: https://www.phoenixcontact.com/online/portal/us/?uri=pxc-oc-itemdetail:pid=3248125&library=usen&pcck=P-15-11-08-02-05&tab=1&selectedCategory=ALL. I have a program that lets me navigate to each product page and extract a majority of the information using BeautifulSoup.
Where I run into issues is getting the product numbers of all the products under the "Accessories" tab. I tried using Selenium rather than Beautiful Soup to pull up the page and actually click through the Accessories pages. The website throws a 403 error if you try to update the list by clicking the page numbers or arrows, or by changing the number of products displayed. The buttons themselves don't have a real link: their href attribute is "#", which takes you back to the top of the section after the list updates. I have found that the request URL of the XHR request sent when you click one of those page links leads to a page that has the product information. From there I can make slight changes to the site= and itemsPerPage= parts of the URL and scrape the information pretty easily.
I am scraping 30,000 of these product pages and each one has a different request URL for the XHR request, but there's no recognizable relationship between the page URL and the request URL. Any ideas on how to get the XHR request URL from each page?
I'm pretty fluent in Selenium and Beautiful Soup, but I'm unfamiliar with other web scraping packages, so suggestions involving them would warrant a little extra explanation.
EDIT: This shows what happens when I try to use Selenium to navigate through the pages. The product list doesn't change, and the page gives that error. (Screenshot: Selenium Error)
This shows the XHR request that I've found. I just need a way to retrieve that URL to give to Beautiful Soup. (Screenshot: XHR Request)
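The "slight changes to the site= and itemsPerPage= parts of the URL" mentioned in the question can be sketched with the standard library. This covers only that step, not the discovery of the XHR URL itself. The example URL, the token parameter and the helper name with_query are hypothetical; the real request URL is not shown in the question.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_query(url, **overrides):
    """Return url with the given query parameters replaced or appended."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    seen = set()
    updated = []
    for key, value in pairs:
        if key in overrides:
            value = str(overrides[key])
            seen.add(key)
        updated.append((key, value))
    # Append any override that was not already present in the URL.
    for key, value in overrides.items():
        if key not in seen:
            updated.append((key, str(value)))
    # Note: urlencode re-encodes every parameter, so unusual escaping
    # in the original query string may come out in a different form.
    return urlunsplit(parts._replace(query=urlencode(updated)))


# Hypothetical XHR URL, standing in for the one seen in the browser's network tab.
xhr_url = "https://example.com/accessories?site=1&itemsPerPage=10&token=abc"
pages = [with_query(xhr_url, site=n, itemsPerPage=100) for n in range(1, 4)]
for page in pages:
    print(page)
```

Each resulting URL could then be fetched and handed to Beautiful Soup in the usual way.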

Webscraping Multi page issue

Hello, I am trying to scrape the following link with bs4 in Python: "https://eprocure.gov.in/eprocure/app;jsessionid=9AD8A7A17E1B2868527E25799DBE45A2.eprocgep2?page=FrontEndLatestActiveTenders&service=page". For the first page everything seems to be OK, but when I navigate to the next page the URL pattern changes completely. Here is the URL of the next page: "https://eprocure.gov.in/eprocure/app?component=%24TablePages.linkPage&page=FrontEndLatestActiveTenders&service=direct&session=T&sp=AFrontEndLatestActiveTenders%2Ctable&sp=2". Because of this change in pattern I cannot automate the scraping process for every page. When I try to scrape the second page manually, the soup object cannot find any of the tags, even though the browser's network inspector shows those tags for the second page. Can anyone help me scrape all of the pages? Please share your solution.
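A minimal sketch of one way to generate the page URLs, assuming (this is not confirmed anywhere in the question) that the last sp parameter is the page number and that the server needs the session cookie set by the first page, which would explain why a standalone request for page 2 returns none of the expected tags. The names page_url and fetch_pages are hypothetical, and session is expected to be something like a requests.Session().

```python
from urllib.parse import urlencode

BASE = "https://eprocure.gov.in/eprocure/app"


def page_url(page_number):
    """Build the URL for a results page (assumes the last sp is the page number)."""
    query = [
        ("component", "$TablePages.linkPage"),
        ("page", "FrontEndLatestActiveTenders"),
        ("service", "direct"),
        ("session", "T"),
        ("sp", "AFrontEndLatestActiveTenders,table"),
        ("sp", str(page_number)),
    ]
    return BASE + "?" + urlencode(query)


def fetch_pages(session, first_url, last_page):
    """Yield (page_number, html) pairs, reusing one session for every request.

    session is any object with a requests-style get() method,
    for example requests.Session().
    """
    # Load the first page so the server can set its session cookie.
    yield 1, session.get(first_url).text
    for n in range(2, last_page + 1):
        yield n, session.get(page_url(n)).text


print(page_url(2))
```

The HTML of each page can then be passed to BeautifulSoup as for the first page.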

Scraping a web page through a link.

I am trying to scrape a website that can only be accessed through a specific link. The issue is that this link takes you to the home page, and each product on the website has its own unique URL. How can I access these product pages in order to scrape them and download the PDFs?
I am used to just looping through the URLs directly, but I have never had to go through one link to reach the other URLs. Any help would be great.
I am using Python and bs4.

Fetch text from a web page with AngularJS tags such as ng-view

I'm trying to fetch all the visible text from a website, and I'm using python-scrapy for this. However, I have observed that Scrapy only works with HTML tags such as div, body, head, etc., and not with AngularJS tags such as ng-view. If there is any element inside the ng-view tags and I right-click the page and choose "View source", the content inside the tag doesn't appear; it just shows <ng-view> </ng-view>. So how can I use Python to scrape the elements within these ng-view tags? Thanks in advance.

To answer your question
how can I use Python to scrape the elements within these ng-view tags
You can't.
The content you want to scrape is rendered on the client side (in the browser). What Scrapy gets you is just the static content from the server; your browser then interprets the HTML and runs the JS code. The JS code then fetches more content from the server and renders it into the page.
Can it be done?
Yes!
One way is to use some sort of headless browser such as PhantomJS (http://phantomjs.org/) to fetch all the content. Once you have the rendered content you can save it and scrape it as you wish. The catch is that this kind of web scraping is not as easy and straightforward as scraping regular HTML. It is the same reason search engines have historically had trouble indexing pages that render their content via JS.
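The headless-browser approach could look roughly like the sketch below. It swaps in Selenium driving headless Chrome rather than PhantomJS, which is no longer maintained; that choice, the function names and the CSS selector used to detect rendered ng-view content are assumptions, not part of the answer above. The text extraction uses only the standard library, so it can be tried on any saved HTML without a browser.

```python
from html.parser import HTMLParser


def render_page(url, wait_seconds=10):
    """Return the page HTML after JavaScript has run (needs selenium and Chrome)."""
    # Imported here so the rest of the module works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until something has been rendered inside ng-view
        # (an empty list is falsy, so the wait keeps polling).
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, "ng-view *, [ng-view] *")
        )
        return driver.page_source
    finally:
        driver.quit()


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping elements whose content is never displayed."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML document as one space-separated string."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Static stand-in for what render_page() might return for an Angular page.
sample = (
    "<html><head><style>p{}</style></head><body>"
    "<ng-view><div>Hello <b>world</b></div></ng-view>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

With a browser available, visible_text(render_page(url)) would combine the two steps. The rendered HTML could equally be passed to Scrapy selectors or Beautiful Soup instead of the small parser used here.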

Scraping Dynamic Content - Page Source Different than Website

I am trying to scrape the content from this website:
http://america.aljazeera.com/search.html?q=blizzard
The page source is different from what appears on the page, and I think this is because of the dynamic content. Is there a simple way to scrape this page as it actually appears? I am using Python.
