How do I conditionally retry and rescrape the current page in Scrapy? - python

I'm new to Scrapy, and not very experienced with Python. I've got a scraper set up to scrape data from a website, but even though I'm using proxies, if the same proxy is used too many times my request is answered with a page telling me I'm visiting too many pages too quickly (returned with HTTP status code 200).
Since my scraper sees the page's status code as okay, it doesn't find the needed data and moves on to the next page.
I can determine when these pages are shown via HtmlXPathSelector, but how do I signal Scrapy to retry that page?

Scrapy comes with a built-in retry middleware. You could subclass it and override the process_response method to add a check for the page telling you that you're visiting too many pages too quickly, and retry the request when it shows up.
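A minimal sketch of that idea, assuming the throttle page can be recognised by a phrase in its body (the phrase, class name, and retry reason below are placeholders to adapt to your site):

from scrapy.downloadermiddlewares.retry import RetryMiddleware

class ThrottleRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        # Hypothetical marker text -- replace with whatever your throttle page actually contains
        if b'visiting too many pages too quickly' in response.body:
            # Re-queue the request; fall back to the response if retries are exhausted
            return self._retry(request, 'throttled', spider) or response
        return super(ThrottleRetryMiddleware, self).process_response(request, response, spider)

You would then register the subclass in DOWNLOADER_MIDDLEWARES in your project settings (and disable the stock RetryMiddleware there) so it runs in its place.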

Related

Why are the urls and headers of two requests seemingly the same but have different status codes (404 and 200, respectively)?

I am trying to crawl pages like this http://cmispub.cicpa.org.cn/cicpa2_web/07/0000010F849E5F5C9F672D8232D275F4.shtml. Each of these pages contains certain information about an individual person.
There are two ways to get to these pages.
One is to coin their urls, which is what I used in my scrapy code. I had my scrapy post request bodies like ascGuid=&isStock=00&method=indexQuery&offName=&pageNum=&pageSize=&perCode=110001670258&perName=&queryType=2 to http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml.
These posts would return responses where I can use xpath and regex to find strings like '0000010F849E5F5C9F672D8232D275F4' to coin the urls I really wanted:
next_url_part1 = 'http://cmispub.cicpa.org.cn/cicpa2_web/07/'
next_url_part2 = ...  # extracted with some xpath and regex
next_url_part3 = '.shtml'
next_url_list.append(''.join([next_url_part1, next_url_part2, next_url_part3]))
Finally, scrapy sent GET requests to these coined links and downloaded and parsed the information I wanted.
Since the pages I wanted are information about different individuals, I can change the perCode= part in those POST request bodies to coin corresponding urls of different persons.
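Roughly, that two-step flow looks like this sketch (the formdata values are from the POST body above; the regex for pulling out the 32-character hex id and the spider/callback names are just placeholders):

import re
import scrapy

class CicpaSpider(scrapy.Spider):
    name = 'cicpa'

    def start_requests(self):
        # Step 1: POST the query form with a person's code
        yield scrapy.FormRequest(
            'http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml',
            formdata={
                'ascGuid': '', 'isStock': '00', 'method': 'indexQuery',
                'offName': '', 'pageNum': '', 'pageSize': '',
                'perCode': '110001670258', 'perName': '', 'queryType': '2',
            },
            callback=self.parse_query,
        )

    def parse_query(self, response):
        # Step 2: pull the 32-character hex ids out of the response and coin the detail urls
        for code in re.findall(r'[0-9A-F]{32}', response.text):
            yield scrapy.Request(
                'http://cmispub.cicpa.org.cn/cicpa2_web/07/%s.shtml' % code,
                callback=self.parse_person,
            )

    def parse_person(self, response):
        # parse the individual's detail page here
        pass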
But this way sometimes doesn't work out. I have sent GET requests to about 100,000 urls I coined and I got 5 404 responses. To figure out what's going on and get the information I want, I first pasted these failed urls into a browser and, not to my surprise, I still got 404. So I tried the other way on these 404 urls.
The other way is to manually access these pages in a browser like a real person. Since the pages I wanted are information about different individuals, I can type their personal codes into the lower-left input box on this page http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml (it only works properly under IE), and click the orange search button at the lower right (which I think is exactly like scrapy sending POST requests). A table then appears on screen, and by clicking the right-most blue words (which are the person's name), I can finally access these pages.
What confuses me is that after I use the 2nd way on those failed urls and get what I want, these previously 404 urls return 200 when I retry them with the 1st way (to avoid the influence of cookies, I retried them with both the scrapy shell and the browser's InPrivate mode). I then compared the GET request headers of the 200 and 404 responses, and they look the same. I don't understand what's happening here. Could you please help me?
Here are the rest of the failed urls that I haven't tried the 2nd way on, so they still return 404 (if you get 200, maybe someone else has tried the url the 2nd way):
http://cmispub.cicpa.org.cn/cicpa2_web/07/7694866B620EB530144034FC5FE04783.shtml
and the personal code of this person is 110001670258
http://cmispub.cicpa.org.cn/cicpa2_web/07/C003D8B431A5D6D353D8E7E231843868.shtml
and the personal code of this person is 110101301633
http://cmispub.cicpa.org.cn/cicpa2_web/07/B8960E3C85AFCF79BF0823A9D8BCABCC.shtml
and the personal code of this person is 110101480523
http://cmispub.cicpa.org.cn/cicpa2_web/07/8B51A9A73684ADF200A38A5D492A1FEA.shtml
and the personal code of this person is 110101500315

How to loop through each page of website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to iteratively update the url or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup doesn't have anything to do with GETting web pages - you fetch the webpage yourself, then feed it to bs4 for parsing.
The problem with the page you linked is that it's rendered by JavaScript - it only displays correctly in a browser (or some other JavaScript VM).
@Fabricator is on the right track - you'll need to watch the developer console and see what AJAX requests the JS is sending to the server. In this case, also take a look at the query string params, which include a param called CurrentPage - that's probably the one you want to focus on.
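As a rough sketch of that approach (the exact endpoint behaviour and the result selector below are assumptions you would confirm in the developer console):

import requests
from bs4 import BeautifulSoup

base_url = 'https://jobs.utcaerospacesystems.com/search-jobs'

for page in range(1, 6):  # first five pages; in practice, loop until a page has no results
    resp = requests.get(base_url, params={'CurrentPage': page})
    soup = BeautifulSoup(resp.text, 'html.parser')
    # hypothetical selector -- replace with whatever wraps a single job posting
    for posting in soup.select('section#search-results-list li'):
        print(posting.get_text(strip=True))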

Python Scrapy : response object different from source code in browser

I'm working on a project using Scrapy.
All wanted fields but one get scraped perfectly. The content of the missing field simply doesn't show up in the Scrapy response (as checked in the scrapy shell), while it does show up when I use my browser to visit the page. In the scrapy response, the expected tags are there, but not the text between the tags.
There's no JavaScript involved, but it is a variable that is provided by the server (it's the current number of visits to that particular page). No iframe involved either.
Already set the user agent (in the settings-file) to match my browser.
Already set the download delay (in the settings-file) to 5.
EDIT (addition):
The page : http://www.fincaraiz.com.co/apartamento-en-venta/bogota/salitre-det-1337688.aspx
Xpath to the wanted element : //*[@id="numAdvertVisits"]
What could be the cause of this mystery ?
It's an ajax/javascript loaded value.
What steps did you take to determine there is no JS involved? I loaded the page without JavaScript, and while that area of the page had the stub content ("Visitas"), the actual data was written there by an AJAX request.
You can still load that data using scrapy, it'll just take an additional request to the URL endpoint normally accessed via the on-page AJAX. The server returns the number of visits as XML, via the script at http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits?idAdvert=1337688&idASource=40&idType=1001 (try loading that script and you'll see the # of visits for the page you provided in the original question).
There is another ajax request that returns "True" for that page, but I'm not sure what the data's actual meaning is. Still, it may be useful:
http://www.fincaraiz.com.co/WebServices/Statistics.asmx/DetailAdvert?idAdvert=1337688&idType=1001&idASource=40&strCookie=13/11/2014:19-05419&idSession=10hx5wsfbqybyxsywezx0n1r&idOrigin=44
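A small sketch of that extra request in Scrapy, using the GetAdvertVisits URL quoted above (how the XML wraps the number is an assumption -- inspect the response and adjust the XPath):

import scrapy

class VisitsSpider(scrapy.Spider):
    name = 'fincaraiz_visits'
    start_urls = [
        'http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits'
        '?idAdvert=1337688&idASource=40&idType=1001'
    ]

    def parse(self, response):
        # the endpoint returns a small XML document; grab its text content
        yield {'visits': response.xpath('//text()').get()}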

Browser Simulation and Scraping with windmill or selenium, how many http requests?

I want to use windmill or selenium to simulate a browser that visits a website, scrapes the content and after analyzing the content goes on with some action depending of the analysis.
As an example: the browser visits a website where we can find, say, 50 links. While the browser is still running, a python script for example can analyze the found links and decide which link the browser should click.
My big question is how many HTTP requests this takes using windmill or selenium. I mean, can these two programs simulate visiting a website in a browser and scrape the content with just one HTTP request, or do they make additional internal requests to the website to get the links while the browser is still running?
Thanks a lot!
Selenium uses the browser, but the number of HTTP requests is not one. There will be multiple HTTP requests to the server for the JS, CSS and images (if any) referenced in the HTML document.
If you want to scrape the page with a single HTTP request, you need to use scrapers that only fetch what is present in the HTML source. If you are using Python, check out BeautifulSoup.
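A minimal single-request example of that alternative: fetch the raw HTML once and pull the links out with BeautifulSoup, without executing any JavaScript (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com')  # placeholder URL
soup = BeautifulSoup(resp.text, 'html.parser')

# one GET request; every link comes from the static HTML source
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)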

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites by using Python and regular expressions.
But now I am having trouble scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one here http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via Ajax calls. So while scraping I am only able to get the data of the first page. But I need to scrape all the items present in all pages of that website.
I am not finding a way to get the source code of all the pages of this type of website, where the next page's follow-up link is not available in the current page. Please help me through this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down the page detects when they reach the end of the current set of results, and loads the next set, as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
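A bare-bones sketch of the "keep requesting pages until one comes back empty" approach (the item selector is a guess and would need to match Jabong's actual markup):

import requests
from bs4 import BeautifulSoup

page = 1
while True:
    resp = requests.get('http://www.jabong.com/men/clothing/mens-jeans/',
                        params={'page': page})
    soup = BeautifulSoup(resp.text, 'html.parser')
    items = soup.select('div.product')  # hypothetical selector
    if not items:
        break  # an empty page means we have gone past the last set of results
    for item in items:
        print(item.get_text(strip=True))
    page += 1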
