I am trying to collect the news history for a stock from a website using python.
The news load as you scroll down, so a number of requests are necessary for each stock.
After a few requests the website denies access, even after specifying the User-Agent and even after making it vary with each request. I also tried pausing the execution for a few seconds between requests. Nothing works.
Does anybody know how to go around this?
In case somebody else runs into this issue, the package "fake-useragent" seems to provide a solution.
I simply randomized the user agent before each request.
The website continues to deny access, albeit just seldom, and one can bypass this with a simple loop.
(The same answer was deleted a few days ago because I included a link to the package. I think my answer is more helpful to whomever runs into this issue in the future than previous answers to similar questions and all of your stupid comments.)
Related
I'm currently testing a website with python-selenium and it works pretty well so far. I'm using webdriver.Firefox() because it makes the devolepment process much easier if you can see what the testing program actually does. However, the tests are very slow. At one point, the program has to click on 30 items to add them to a list, which takes roughly 40 seconds because the browser is responding so awfully slowly. So after googling how to make selenium faster I've thought about using a headless browser instead, for example webdriver.PhantomJS().
However, the problem is, that the website requires a login including a captcha at the beginning. Right now I enter the captcha manually in the Firefox-Browser. When switching to a headless browser, I cannot do this anymore.
So my idea was to open the website in Firefox, login and solve the captcha manually. Then I somehow continue the session in headless PhatomJS which allows me to run the code quickly. So basically it is about changing the used driver mid-code.
I know that a driver is completely clean when created. So if I create a new driver after logging in in Firefox, I'd be logged out in the other driver. So I guess I'd have to transfer some session-information between the two drivers.
Could this somehow work? If yes, how can I do it? To be honest I do not know a lot about the actual functionality of webhooks, cookies and storing the"logged-in" information in general. So how would you guys handle this problem?
Looking forward to hearing your answers,
Tobias
Note: I already asked a similar question, which got marked as a duplicate of this one. However, the other question discusses how to reconnect to the browser after quitting the script. This is not what I am intending to do. I want to change the used driver mid-script while staying logged in on the website. So I deleted my old question and created this new, more fitting one. I hope it is okay like that.
The real solution to this is to have your development team add a test mode (not available on Production) where the Captcha solution is either provided somewhere in the page code, or the Captcha is bypassed.
Your proposed solution does not sound like it would work, and having a manual step defeats the purpose of automation. Automation that requires manual steps to be taken will be abandoned.
The website "recognizes" the user via Cookies - a special HTTP Header which is being sent with each request so the website knows that the user is authenticated, has these or that permissions, etc.
Fortunately Selenium provides functions allowing cookies manipulation so all you need to do is to store cookies from the Firefox using WebDriver.get_cookies() method and once done add them to PhantomJS via WebDriver.add_cookie() method.
firefoxCookies = firefoxDriver.get_cookies()
for cookie in firefoxCookies:
phantomJSDriver.add_cookie(cookie)
I'm currently trying to design a filter that with it I can block certain URLs and also
block based on keywords that may be at the data that came back from the http response.
Just for clarification I'm working on a Windows 10 x64 machine for this project.
In order to be able to do such thing, I quickly understood that I would need a web proxy.
I checked out about 6 proxies written in python that I found on github.
These are the project I tried to use (some are Python3 some 2):
https://github.com/abhinavsingh/proxy.py/blob/develop/proxy.py
https://github.com/inaz2/proxy2/blob/master/proxy2.py
https://github.com/inaz2/SimpleHTTPProxy - this one is the earlier version of the top one
https://github.com/FeeiCN/WebProxy
Abhinavsingh's Proxy (first in the list):
what I want to happen
I want the proxy to be able to block sites based on the requests and the content came back, I also need the filter to be in a separated file and to be generic
so I can apply it on every site and every request/response.
I'd like understand where is the correct place to put a filter on this proxy
and how to do redirect or just send a block page back when the client tries to access
sites that has specific urls, or when the response is a page that contains some keywords.
what I tried
I enabled the proxy on google chrome's 'Open proxy settings'
and executed the script. It looked pretty promising, and I noticed that I can insert a filterer's function call in line 383 at the _process_request function so that I can return
to it maybe another host to redirect to or just block. It worked partially for me.
The problems
First of all, I couldn't fully redirect/block the sites. Sometimes it worked, Sometimes it
didn't.
Another problem I ran into was that I realized that I can't access the content of site that returned, if it is https.
Also, filter the response was unfortunately not clear for me.
I noticed also that proxy2 (the second in the list) can solve that problem I had
of filtering https page's content, but I couldn't find out how to make this feature work (and also I think it requires linux utilities anyhow).
The process I described above was pretty much the one that I tried to work on every proxy in the list. At some proxies, like proxy2.py I couldn't understand at all what I needed to do.
If anybody managed to make a filter on top of this proxy or any other from this list and can help me understand how to do it, I'll be grateful if you'll comment down below.
Thank you.
I have a python script, which scrapes some information from some site.
This site has a daily limitation for 20 connections
So, I decided to use module requests with specified "proxies".
After a couple of hours testing different "proxy-list" sites I've found one and I've been parsing from site http://free-proxy-list.net/.
Seems, this site doesn't get the list updated often and after testing my script I've wasted all the proxies and I can't access the site anymore.
All these searches make me exhausted and I feel like my script completely sucks.
Is there any way I can avoid "detecting" me by site or I just need to find another list of proxies? If there are some sites with daily updated and all new list of proxies - please, let me know.
P.S. I often have stumbled upon sites like https://hide.me where I just enter the link and it gives me full access. Maybe I can just code this in Python? If it's possible - show me, please, how.
To help me learn Python, I decided to screen scrape the football commentaries from the ESPNFC website from the 'live' page (such as here).
It was working up until a day ago but having finally sorted some things out, I went to test it and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and any easy and quick ways around? I am using Scrapy/Xpath and Urllib2.
Edit//
for game_id in processQueue:
data_text = getInformation(game_id)
clean_events_dict = getEvents(data_text)
break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list with game_ids in. The first one of these is given to the script to start scraping. This is broken out of before it has a chance to move on to another game_id
In the second sample I use a single game id.
The first one fails and the second one works and I have absolutely no idea why. Any ideas?
There's a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that web site operators generally are within their rights to block you; this is why projects that rely on the scraping of a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent errors I see in this Stack Overflow tag are simply that scrapers are being run far too fast, and they are accidentally causing a (minor) denial of service. So, try slowing down your scrapes first, and see if that helps.
I've asked one question about this a month ago, it's here: "post" method to communicate directly with a server.
And I still didn't get the reason why sometimes I get 404 error and sometimes everything works fine, I mean I've tried those codes with several different wordpress blogs. Using firefox or IE, you can post the comment without any problem whatever wordpress blog it is, but using python and "post" method directly communicating with a server I got 404 with several blogs. And I've tried to spoof the headers, adding cookies in the code, but the result remains the same. It's bugging me for quite a while... Anybody knows the reason? Or what code should I add to make the program works just like a browser such as firefox or IE etc ? Hopefully you guys would help me out!
You should use somthing like mechanize.
The blog may have some spam protection against this kind of posting. ( Using programmatic post without accessing/reading the page can be easily detected using javascript protection ).
But if it's the case, I'm surprised you receive a 404...
Anyway, if you wanna simulate a real browser, the best way is to use a real browser remote controlled by python.
Check out WebDriver (http://seleniumhq.org/docs/09_webdriver.html) It has a python implementation and can run HtmlUnit, chrome, IE and Firefox browsers.