Screen scraping with Python/Scrapy/urllib2 seems to be blocked

To help me learn Python, I decided to screen scrape the football commentaries from the ESPNFC website from the 'live' page (such as here).
It was working up until a day ago, but having finally sorted some things out, I went to test it and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and are there any quick and easy ways around it? I am using Scrapy/XPath and urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list of game_ids. The first of these is given to the script to start scraping, and the break exits the loop before it has a chance to move on to another game_id.
In the second sample I use a single game_id directly.
The first one fails and the second one works, and I have absolutely no idea why. Any ideas?

There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that web site operators are generally within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are (a Scrapy settings sketch covering several of them follows at the end of this answer):
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent errors I see in this Stack Overflow tag are simply that scrapers are being run far too fast, and they are accidentally causing a (minor) denial of service. So, try slowing down your scrapes first, and see if that helps.
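For instance, the delay, cookie, and robots.txt suggestions above all map onto standard Scrapy settings. Here is a minimal settings.py sketch; the exact values are assumptions you should tune for your situation:

DOWNLOAD_DELAY = 5                    # wait a few seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the delay randomly (0.5x to 1.5x of DOWNLOAD_DELAY)
COOKIES_ENABLED = True                # accept cookies during the scraping session
ROBOTSTXT_OBEY = True                 # be a well-behaved scraper
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # never hit the same site with parallel requests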

Related

Efficient way to scrape images from website in Django/Python

First, I guess I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL; the site scrapes the content from that page and returns the images over a certain size, plus the page title tag, so the user can then pick which image they want to use on their profile. A pretty standard scenario, I assume. I have this working by using Selenium (headless Chrome browser) to grab the destination page content and some Python to determine the file size, and then my Django view spits it all out into a template. I then have it coded so that the image the user selects is downloaded and stored locally.
However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope with lots of users all making requests at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I have to download each image to determine its size so I can decide whether it's large enough. One example took 12 seconds from me submitting the URL to displaying the results to the user, whereas the same destination URL put through www.kit.com (they have very similar web scraping functionality) took 3 seconds.
I have not provided any code because the code I have does what it should; I think the approach, however, is incorrect. To summarise, what I want is:
To allow a user to enter a URL and for it to return all images (or just the URLs to those images) from that page over a certain size (width/height), and the page title.
For this to be the most efficient solution, taking into account it would be run concurrently between many users at once.
For it to work in a Django (2.0) / Python (3+) environment.
I am not completely against using the API from a 3rd party service if one exists, but it would be my least preferred option.
Any help/pointers would be much appreciated.
You can use two Python solutions in your case:
1) BeautifulSoup, and there is a good answer here on how to download the images using it. You just have to make it a separate function and pass the site in as the argument. It is also easy to parse only the image links, as you said, depending on the speed you need (obviously downloading the files themselves, especially when there are a lot of them, will be much slower than collecting links). This tool is just for parsing and scraping the content of the page; a minimal sketch of this approach is included after this answer.
2) Scrapy - this is a much more powerful tool and a full framework; with it you can connect your spider to Django models and handle images much more efficiently using its built-in image pipelines. It is much more flexible, with a lot of features for working with scraped data. I am not sure whether you need it in your project, or whether it would be overkill in your case.
Also, my advice is to run the spider in a background task (for example via a task queue such as Celery) and fetch the result via AJAX, because it may take some time to parse the content - don't make the user wait for the response.
P.S. You can even combine these two tools in some cases :)
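A minimal sketch of the BeautifulSoup approach from option 1, assuming the requests, beautifulsoup4 and Pillow packages are installed; MIN_WIDTH, MIN_HEIGHT and get_images_and_title are illustrative names, not from the question:

from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 300, 300  # assumed size threshold

def get_images_and_title(url):
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    large_images = []
    for img in soup.find_all("img", src=True):
        img_url = urljoin(url, img["src"])
        try:
            img_resp = requests.get(img_url, timeout=10)
            width, height = Image.open(BytesIO(img_resp.content)).size
        except Exception:
            continue  # skip images that fail to download or parse
        if width >= MIN_WIDTH and height >= MIN_HEIGHT:
            large_images.append(img_url)
    return title, large_images

Note that this still downloads every image to measure it; to avoid that you would have to rely on width/height attributes in the HTML where present, or stream only the first bytes of each file.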

Website is denying access when requesting content

I am trying to collect the news history for a stock from a website using python.
The news load as you scroll down, so a number of requests are necessary for each stock.
After a few requests the website denies access, even after specifying the User-Agent and even after making it vary with each request. I also tried pausing the execution for a few seconds between requests. Nothing works.
Does anybody know how to go around this?
In case somebody else runs into this issue, the package "fake-useragent" seems to provide a solution.
I simply randomized the user agent before each request.
The website still denies access occasionally, but one can get around this with a simple retry loop (sketched below).
(The same answer was deleted a few days ago because I included a link to the package. I think this answer is more helpful to whoever runs into this issue in the future than the previous answers to similar questions.)
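A rough sketch of what this looks like, assuming the requests and fake-useragent packages; the retry count and error handling are illustrative:

import time

import requests
from fake_useragent import UserAgent

ua = UserAgent()

def fetch(url, retries=5):
    for attempt in range(retries):
        headers = {"User-Agent": ua.random}  # fresh random user agent per request
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            return resp
        time.sleep(2)  # brief pause before retrying when access is denied
    raise RuntimeError("Access still denied after {} attempts".format(retries))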

Scraping site with limited connections

I have a Python script that scrapes some information from a site.
This site has a daily limit of 20 connections.
So, I decided to use the requests module with the "proxies" option specified.
After a couple of hours testing different "proxy list" sites, I found one and have been pulling proxies from http://free-proxy-list.net/.
It seems this site doesn't update its list very often, and after testing my script I've burned through all the proxies and can't access the target site anymore.
All this searching has left me exhausted, and I feel like my script completely sucks.
Is there any way to avoid being detected by the site, or do I just need to find another list of proxies? If there are sites with a daily updated, completely fresh list of proxies, please let me know.
P.S. I often stumble upon sites like https://hide.me where I just enter a link and it gives me full access. Maybe I can just do the same thing in Python? If that's possible, please show me how.
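For reference, rotating through a proxy list with requests, as described above, looks roughly like this; the proxy addresses are placeholders and dead proxies are simply skipped:

import requests

proxy_list = [
    "http://203.0.113.10:8080",  # placeholder proxies; substitute your own list
    "http://203.0.113.11:3128",
]

def fetch_with_proxies(url):
    for proxy in proxy_list:
        proxies = {"http": proxy, "https": proxy}
        try:
            resp = requests.get(url, proxies=proxies, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            continue  # proxy is dead or blocked, try the next one
    raise RuntimeError("All proxies failed")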

Is there an easy and fast way to generate JavaScript?

My problem begins when I try to crawl an app store, let's say Google Play.
For every app there are a lot of comments, and I want to crawl them FAST.
But the comment section on Google Play is generated by JavaScript.
Here is a link for example: https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM. At that link you can see that, in order to load more comments, you need to click on a button several times; after approximately 5-6 clicks the page generates more comments by executing JavaScript.
At first I solved this problem using a web driver (Firefox) to simulate a real person clicking on the button; it generates the comments, and the driver keeps pressing until all comments are generated.
The problems with this are: 1) it takes too much time, and 2) sometimes, after tons of clicks and JS generation, the web browser fails to respond.
What I need is a way to collect all the comments for an application in a better, faster way - maybe there's some kind of technique, or anything else, that would improve my solution.
I'm using a spider I've created in Scrapy.
Any kind of help will be much appreciated.
One of the reasons they load additional comments this way is exactly that they do not want someone to crawl them... The other is so that the initial page loads without them (faster), and a few more are shown only if someone starts reading the comments.
Unless they provide an API where you can pull all the comments at once, I do not see another quick way of pulling them, apart from simulating clicks and scrolls... (the slow way of doing it).
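For reference, the click-simulation approach described in the question looks roughly like this with Selenium; the CSS selector for the button is an assumption and will need adjusting to the real page:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, WebDriverException

driver = webdriver.Firefox()
driver.get("https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM")

while True:
    try:
        # hypothetical selector for the "show more comments" button
        button = driver.find_element(By.CSS_SELECTOR, "button.show-more")
        button.click()
        time.sleep(2)  # give the page time to load the next batch of comments
    except (NoSuchElementException, WebDriverException):
        break  # no more comments to load, or the browser stopped responding

html = driver.page_source  # hand this off to your Scrapy/parsing code
driver.quit()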
Are you respecting robots.txt? Why or why not?

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I viewed the page source, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
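For reference, once you have found the data URI in the Web Console, requesting it directly looks roughly like this; the URL and the JSON field names below are hypothetical placeholders for whatever the console actually shows:

import requests

data_url = "https://example.com/comments-endpoint"  # placeholder: use the URI seen in the Web Console

resp = requests.get(
    data_url,
    headers={"User-Agent": "Mozilla/5.0"},  # some endpoints check the user-agent header
)
data = resp.json()  # such endpoints usually return JSON

for comment in data.get("comments", []):  # hypothetical structure
    print(comment.get("thumbsUp"), comment.get("thumbsDown"))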
