Scraping a web page through a link - Python

I am trying to scrape a website that you have to enter through one specific link. The issue is that this link takes you to the home page, and each product on the website has its own unique URL. My question is: how would I access these product pages in order to scrape them and download the PDF?
I am used to just looping through the URLs directly, but I have never had to go through one link to reach the other URLs. Any help would be great.
I am using Python and bs4.
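A minimal sketch of one way to approach this, assuming the entry link sets session cookies and the product pages are linked from the home page; the entry URL, the "/product/" link filter, and the PDF-link lookup below are all placeholders, not the real site's values:

# Hedged sketch: enter through the landing link with a Session (so any cookies
# it sets are kept), collect product links from the home page, then visit each
# product page and download its PDF. All URLs/selectors here are hypothetical.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

entry_link = "https://example.com/special-entry-link"   # placeholder
session = requests.Session()
home = session.get(entry_link)

soup = BeautifulSoup(home.text, "html.parser")
product_links = [urljoin(home.url, a["href"])
                 for a in soup.find_all("a", href=True)
                 if "/product/" in a["href"]]            # hypothetical filter

for link in product_links:
    page = BeautifulSoup(session.get(link).text, "html.parser")
    pdf = page.find("a", href=lambda h: h and h.endswith(".pdf"))
    if pdf:
        pdf_url = urljoin(link, pdf["href"])
        with open(pdf_url.rsplit("/", 1)[-1], "wb") as f:
            f.write(session.get(pdf_url).content)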

Getting URL of an XHR request in Python?

I'm working on a web scraping project for scraping products off a website. I've been doing projects like this for a few months with pretty good success, but this most recent website is giving me some trouble. This is the website I'm scraping: https://www.phoenixcontact.com/online/portal/us?1dmy&urile=wcm%3apath%3a/usen/web/home. Here's an example of a product page: https://www.phoenixcontact.com/online/portal/us/?uri=pxc-oc-itemdetail:pid=3248125&library=usen&pcck=P-15-11-08-02-05&tab=1&selectedCategory=ALL. I have a program that lets me navigate to each product page and extract a majority of the information using BeautifulSoup.
The place I run into issues is trying to get the product numbers of all the products under the "Accessories" tab. I tried to use Selenium rather than Beautiful Soup to pull up the page and actually click through the Accessories pages. The website throws a 403 error if you try to update the page by clicking on the page numbers or arrows, or by changing the displayed number of products. The buttons themselves don't have an actual link; their href is just "#", which takes you back to the top of the section after it updates the list. I have found that the request URL of the XHR request fired when you click one of those page links takes you to a page that has the product information. From there I can make slight changes to the site= and itemsPerPage= parts of the URL and scrape the information pretty easily.
I am scraping 30,000 of these product pages and each one has a different request URL for the XHR request, but there's no recognizable relationship between the page URL and the request URL. Any ideas on how to get the XHR request URL from each page?
I'm pretty fluent in Selenium and Beautiful soup, but any other web scraping packages are unfamiliar and would warrant a little extra explanation.
EDIT: The "Selenium Error" screenshot shows what happens when I try to use Selenium to navigate through the pages: the product list doesn't change, and it gives that error. The "XHR Request" screenshot shows the XHR request that I've found. I just need a way to retrieve that URL to give to Beautiful Soup.
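One possible way to capture that XHR URL is to record the browser's network traffic while clicking the pagination links. A minimal sketch using the third-party selenium-wire package; the product URL is the one from the question, but the pagination selector and the "itemsPerPage" filter are assumptions:

# Hedged sketch: capture the Accessories-tab XHR URL with selenium-wire, which
# records every request the browser makes. Adjust the selector and the URL
# filter to match the real page.
from seleniumwire import webdriver  # pip install selenium-wire
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.phoenixcontact.com/online/portal/us/"
           "?uri=pxc-oc-itemdetail:pid=3248125&library=usen"
           "&pcck=P-15-11-08-02-05&tab=1&selectedCategory=ALL")

# Click a pagination control in the Accessories section (hypothetical selector).
driver.find_element(By.CSS_SELECTOR, "a.pagination-next").click()

# Every request the browser made is now in driver.requests; keep the XHR that
# returned the accessory list and hand its URL to requests/BeautifulSoup.
xhr_urls = [r.url for r in driver.requests
            if r.response and "itemsPerPage" in r.url]
print(xhr_urls)
driver.quit()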

Pages not processing fully

I am trying to scrape news articles from Yahoo Finance, and to do so I want to use their sitemap page https://finance.yahoo.com/sitemap/
The problem I have is that after following a link such as https://finance.yahoo.com/sitemap/2015_04_02, Scrapy does not process the whole page, only the header, so I cannot access the links to the different articles.
Are there some internal requests that I have to send to the page?
I still get the whole page when I deactivate JavaScript in my browser, and I am using Scrapy 1.6.
Thanks.
Some sites take defensive measures against robots scraping their websites. If they detect that you are non-human, they may not serve the entire page. But more than likely what is happening is that a bunch of client-side rendering runs when you view the page in a web browser, and that rendering is not executed when you request the same page in Scrapy.
Yahoo! Finance has an API. Using that will probably get you more reliable results.
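A quick way to check which case you are in is to fetch the page without a browser and see whether the article links are present in the raw HTML at all. A minimal sketch; the "/news/" filter is an assumption about how article URLs look:

# Hedged sketch: confirm whether the article links exist in the raw HTML
# (what Scrapy sees) or are injected client-side.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://finance.yahoo.com/sitemap/2015_04_02",
                    headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)
         if "/news/" in a["href"]]
print(len(links), "article links found in the raw HTML")
# If this prints 0, the links are rendered by JavaScript and you need either a
# headless browser (Selenium/Splash) or Yahoo!'s API instead of plain Scrapy.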

Scrape data from JavaScript-rendered website

I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
When you try to access the website, it redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit an end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to
get the HTML source of that page directly, with the dynamically loaded content and bypassing the choose-country and select-location popups, so that I can use my Scrapy XPath selectors to extract data.
If you suggest Selenium, PhantomJS, Ghost or something else to deal with the dynamic content, please understand that I want the final HTML source, as rendered in a web browser after all dynamic content has been processed, which will then be fed to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still loads it and the select-delivery-location dialog.
I've also tried Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies set by the web page. I found that it stores 3 cookies, CurrencyCode, ServerId and Site_Config, in my browser. I used the above-mentioned plugin to copy the cookies in JSON format, and I referred to this manual for setting cookies in requests.
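A minimal sketch of that cookie setup; the cookie values below are placeholders, so use the ones you copied from EditThisCookie:

# Hedged sketch: build a requests cookie jar from the CurrencyCode/ServerId/
# Site_Config values copied out of EditThisCookie. The values are placeholders.
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set("CurrencyCode", "AED", domain="www.luluwebstore.com", path="/")
jar.set("ServerId", "1001", domain="www.luluwebstore.com", path="/")
jar.set("Site_Config", "default", domain="www.luluwebstore.com", path="/")

url = "http://www.luluwebstore.com/"   # replace with one of the end URLs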
Now I'm able to skip those location and delivery-address popups. After that I found that the dynamic pages are loaded via a <script type=text/javascript> tag, and that part of the page URL is stored in a variable inside it. I extracted the value using split(). Here is the script part to get the dynamic page URL.
import requests
from lxml import html

# 'url' is the end page and 'jar' holds the cookies copied from the browser.
page_source = requests.get(url, cookies=jar)
tree = html.fromstring(page_source.content)
# The <script> that loads the product pages sits inside div.col3_T02.
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]
# The dynamic page URL is assigned to a variable inside that script.
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()
page_link = ("http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx"
             "?ProductShowcaseInput=" + dynamic_pg_link)
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.

How can I find (and scrape) all web pages on a given domain using Python?

How would I scrape a domain to find all web pages and content?
For example: www.example.com, www.example.com/index.html, www.example.com/about/index.html and so on.
I would like to do this in Python, preferably with Beautiful Soup if possible.
You can't, not in general. Not only can pages be dynamically generated based on backend database data, search queries, or other input that your program supplies to the website, but there is a nearly infinite list of possible pages, and the only way to know which ones exist is to request them and see.
The closest you can get is to crawl the website by following the hyperlinks between pages in the page content itself, as in the sketch below.
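A minimal sketch of that approach, assuming www.example.com stands in for the real domain; it leaves out politeness delays, robots.txt handling, and depth limits:

# Hedged sketch of the "follow hyperlinks" approach: a small same-domain
# crawler using requests + BeautifulSoup.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

start = "http://www.example.com/"
domain = urlparse(start).netloc
to_visit, seen = [start], set()

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        page = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == domain:
            to_visit.append(link)

print("Found %d pages" % len(seen))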
You could also use the Python library newspaper.
Install it with sudo pip3 install newspaper3k.
With it you can scrape all the articles on a particular website.
import newspaper

url = "http://www.example.com"
built_page = newspaper.build(url)   # builds a Source with the site's articles

print("%d articles in %s\n" % (built_page.size(), url))
for article in built_page.articles:
    print(article.url)
From there you can use the Article object API to get all sorts of information from the page including the raw HTML.
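For instance, a short sketch of pulling the raw HTML and a few parsed fields from one of those articles (an article has to be downloaded and parsed first):

# Hedged sketch: using the newspaper Article API on one of the crawled articles.
if built_page.articles:
    article = built_page.articles[0]
    article.download()          # fetch the page
    article.parse()             # extract title, text, authors, etc.
    print(article.title)
    print(article.text[:200])   # first 200 characters of the body text
    raw_html = article.html     # the raw HTML of the page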

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am having trouble scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via Ajax calls, so while scraping I am only able to get the data from the first page. But I need to scrape all the items present on all pages of that website.
I cannot find a way to get the source code of all the pages of these types of websites, where the next page's follow-up link is not available in the current page. Please help me through this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages until you find a page with no results, as in the sketch below.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
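A minimal sketch of that approach; the "div.product" item selector is an assumption, so inspect the real markup to find the right one:

# Hedged sketch of the "keep requesting ?page=N until empty" approach.
import requests
from bs4 import BeautifulSoup

base = "http://www.jabong.com/men/clothing/mens-jeans/"
page = 1
while True:
    resp = requests.get(base, params={"page": page},
                        headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")
    items = soup.select("div.product")       # hypothetical item selector
    if not items:
        break                                 # no results: last page reached
    for item in items:
        print(item.get_text(strip=True))
    page += 1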
