Unable to get all the links within a page - python

I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It has a total of 55 products listed, but only 24 are shown once the page loads. However, the div contains a list of all 55 products. I am trying to scrape that using Scrapy like this:
def parse(self, response):
    print("in here")
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?

I would suggest scraping it through the API directly - the other option would be rendering the JavaScript with something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools on Chrome/Firefox, filter down to only the XHR requests, and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML.
Clicking on those requests gives more detail on how they are made and how they are structured. At the end of the day, for your use case, you would probably want to send a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
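For example, here is a minimal Scrapy sketch of that approach. The products/title/handle/body_html field names assume the standard Shopify-style products.json payload - verify them against the actual response:

import json

import scrapy
from scrapy.selector import Selector

class BottomsSpider(scrapy.Spider):
    name = "bottoms"
    start_urls = [
        "https://www.jny.com/collections/bottoms/products.json?limit=250&page=1"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for product in data.get("products", []):
            # body_html is an HTML snippet, so feed it to a Selector
            description = Selector(text=product.get("body_html", ""))
            yield {
                "title": product.get("title"),
                # Shopify product pages typically live under /products/<handle>
                "url": response.urljoin("/products/" + product.get("handle", "")),
                "description": " ".join(description.xpath("//text()").getall()),
            }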

Related

Where is the information stored in the HTML? (web-scraping)

A beginner here.
I want to extract all the jobs from Barclays (https://search.jobs.barclays/search-jobs).
I got through scraping the first page but am struggling to get to the next page, as the URL doesn't change.
I tried to scrape the URL from the next-page button, but that href brings me back to the homepage.
Does that mean that all the job data is actually stored within the original HTML?
If so, how can I extract it?
Thanks!
So I analyzed the website, and it communicates with the server through an API, which means you can get the data directly from it as JSON.
This is the API link in this specific case(for my computer): https://search.jobs.barclays/search-jobs/results?ActiveFacetID=44699&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=
For you the URL might be different, but the concept is the same: as you can see, there is a CurrentPage=2 parameter inside the URL, which you can use to request any of the pages with requests and then extract what you need from the JSON.
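A minimal sketch of that loop (the query string is trimmed to the parameters that matter here, and the field name holding the postings is a guess - copy the full parameter set from the Network tab and check the real payload if the endpoint is picky):

import requests

BASE = "https://search.jobs.barclays/search-jobs/results"

pages = []
for page in range(1, 6):  # first five pages, as an example
    resp = requests.get(BASE, params={
        "CurrentPage": page,
        "RecordsPerPage": 15,
        "SearchType": 5,
    })
    resp.raise_for_status()
    data = resp.json()
    # "results" is an assumption about where the postings live;
    # inspect the actual JSON and adjust.
    pages.append(data.get("results"))

print(len(pages))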

How to find the Request URL from a webpage in Chrome's Inspect Element > Network > XHR > Response > Headers > General

Sorry if I don't know the correct terminology, I'm new to web scraping, please correct my terminology if you feel like it.
I'm working on a project to scrape the images of all the pieces by an artist, given the URL of the artist's gallery. What I am doing is finding the unique ID of each page in the gallery, which leads me to the webpage hosting the original image. I can already scrape the art page; I just need the IDs of each page from the gallery.
Artist Gallery --> Art Page --> scrape image
The IDs of the pages in the gallery are not available in the page source, since they are being loaded in separately (through JavaScript, I think), so I cannot grab them using:
import requests
import urllib.request

response = requests.get(pageurl)  # pageurl is the gallery URL
print(response.text)
But I have found that by going to Chrome's Inspect Element > Network > XHR > Response > Headers > General, there is a request URL that has all the IDs that I need, and below that is a Query String Parameters section that also has them.
I am using BeautifulSoup, but the problem lies in how to get the data.
I have also used urllib.request.urlopen(pageurl) with similar results. I have also tried Selenium but was still unable to get the IDs; I may not have gone about it correctly - I was able to get to the webpage, but maybe I did not use the right method. For now, this is what I want to try. EDIT: I have since figured it out using Selenium (I just wasn't trying hard enough), but would still like some input on intercepting XHRs.
Link to site if you really want to see it, but you may have to login
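On intercepting XHRs: the simplest form of interception is to replay the request yourself - copy the request URL and the Query String Parameters from the Network tab and issue the same request with requests. A hedged sketch (the URL and parameter names are placeholders, since the real ones depend on the site):

import requests

# Placeholders: substitute the request URL and query string parameters
# copied from the Network tab.
xhr_url = "https://example.com/api/gallery"
params = {"artist_id": "12345", "page": 1}

# Reusing the browser's User-Agent (and cookies, if the site requires a
# login) makes the replayed request look like the original one.
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(xhr_url, params=params, headers=headers)
resp.raise_for_status()
print(resp.text)  # or resp.json(), depending on what the XHR returns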

How to loop through each page of website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to update the URL iteratively so that I can scrape each page. I am new to Python and have looked at a few solutions to similar questions, but have not figured out how to apply them to my particular URL. I think I need to either update the URL iteratively or somehow click the next button, and then loop my existing code over each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup doesn't have anything to do with fetching web pages - you get the webpage yourself, then feed it to bs4 for processing.
The problem with the page you linked is that it's javascript - it only renders correctly in a browser (or any other javascript VM).
@Fabricator is on the right track - you'll need to watch the developer console and see what ajax requests the js is sending to the server. In this case, also take a look at the query string parameters, which include a param called CurrentPage - that's probably the one you want to focus on.
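A hedged sketch of that pagination loop (the /search-jobs/results path and the idea that the JSON response carries an HTML fragment under "results" are assumptions modeled on similar job boards - confirm both in the Network tab before relying on them):

import requests
from bs4 import BeautifulSoup

BASE = "https://jobs.utcaerospacesystems.com/search-jobs/results"

for page in range(1, 4):  # first three pages, as an example
    resp = requests.get(BASE, params={"CurrentPage": page})
    resp.raise_for_status()
    # Assumption: the JSON's "results" field holds the rendered postings
    # as an HTML fragment; adjust to what the endpoint actually returns.
    fragment = resp.json().get("results", "")
    soup = BeautifulSoup(fragment, "html.parser")
    for link in soup.find_all("a", href=True):
        print(page, link["href"])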

Scraping the content of a box with infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape the links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all the requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite-scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector, you can see that it's making a POST request to download the urls of the articles.
Every value is self-explanatory here except maybe for docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it appears to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article urls.
You can mess around further to reverse-engineer how the website does it and what every keyword is for, but this is a good start.
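A minimal sketch of making that request from Python (the parameters are taken from the url above; treating the response as an HTML fragment full of links is an assumption - inspect the actual body first):

import requests
from bs4 import BeautifulSoup

url = "http://www.marketwatch.com/newsviewer/mktwheadlines"
params = {
    "blogs": "true", "commentary": "true", "docId": "1275261016",
    "premium": "true", "pullCount": "100", "pulse": "true",
    "rtheadlines": "true", "topic": "All Topics",
    "topstories": "true", "video": "true",
}

resp = requests.get(url, params=params)
resp.raise_for_status()
# Assumption: the body is an HTML fragment of <li> items with links.
soup = BeautifulSoup(resp.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(len(links))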

Loading Ajax with Python Requests

For a personal project, I'm trying to get a full friends list of a user (myself for now) from Facebook using Requests and BeautifulSoup.
The main friends page, however, displays only 20, and the rest are loaded with Ajax when you scroll down.
The request url looks something like this (method is GET):
https://www.facebook.com/ajax/pagelet/generic.php/AllFriendsAppCollectionPagelet?dpr=1&data={"collection_token":"1244314824:2256358349:2","cursor":"MDpub3Rfc3RydWN0dXJlZDoxMzU2MDIxMTkw","tab_key":"friends","profile_id":1244214828,"overview":false,"ftid":null,"order":null,"sk":"friends","importer_state":null}&__user=1364274824&__a=1&__dyn=aihaFayfyGmagngDxfIJ3G85oWq2WiWF298yeqrWo8popyUW3F6wAxu13y78awHx24UJi28cWGzEgDKuEjKeCxicxabwTz9UcTCxaFEW58nVV8-cxnxm1typ9Voybx24oqyUf9UgC_UrQ4bBv-2jAxEhw&__af=o&__req=5&__be=-1&__pc=EXP1:DEFAULT&__rev=2677430&__srp_t=1474288976
My question is: is it possible to recreate dynamically generated tokens such as __dyn, cursor, collection_token etc. and send them manually in my request? Is there some way to figure out how they are generated, or is it a lost cause?
I know that the current Facebook API does not support viewing a full friends list. I also know that I can do this with Selenium, or some other browser simulator, but that feels way too slow, ideally I want to scrape thousands of friends lists (of users whose friends lists are public) in a reasonable time.
My current code is this:
import requests
from bs4 import BeautifulSoup
with requests.Session() as S:
requests.utils.add_dict_to_cookiejar(S.cookies, {'locale': 'en_US'})
form = {}
form['email'] = 'myusername'
form['pass'] = 'mypassword'
response = S.post('https://www.facebook.com/login.php?login_attempt=1&lwv=110', data=form)
# Im logged in
page = S.get('https://www.facebook.com/yoshidakai/friends?source_ref=pb_friends_tl')
Any help will be appreciated, including other methods to achieve this :)
As of this writing, you can extract this information by parsing the page, and then get the cursor for later pages by parsing the preceding ajax response. However, as Facebook regularly updates its backend, I have had more stable results using Selenium to drive a headless Chrome browser that scrolls through the page, and then parsing the resulting HTML.
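A minimal sketch of that Selenium approach (the scroll-until-stable loop is the standard pattern; the login step is assumed to have been handled already, e.g. via cookies):

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.facebook.com/yoshidakai/friends?source_ref=pb_friends_tl")

# Scroll until the page height stops growing, i.e. no more friends load.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the ajax-loaded content time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()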
