I'm trying to use Tor as a generic proxy, but it fails.
Right now I'm trying with Python, but I'm pretty sure it would be the same with any other language. I can connect to other proxies with Python, so I understand how it "should" be done.
I found a list of Tor entry nodes and tried:
h = httplib.HTTPConnection("one entry node", 80)
h.connect()
h.request("GET", "www.google.com")
resp = h.getresponse()
page = resp.read()
Unfortunately that doesn't work; I get redirected to a 404 page.
I'm just not sure what I'm doing wrong. Probably the list of entry nodes cannot be connected to just like that. I've been searching for how to do it properly, but I can't find any documentation on how to program applications that use Tor.
Edit:
Ditch the Tor entry-node list; I don't know why I thought I needed it.
The "entry node" is your own machine, once you've installed the (Windows) Vidalia client and Privoxy (all bundled as one).
httplib.HTTPConnection("one entry node", 80)
becomes
httplib.HTTPConnection("127.0.0.1", 8118)
and voilà, everything is routed through Tor.
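For completeness, a minimal sketch of that working setup, assuming the bundled Privoxy listener is on its default port 8118 (Python 2 httplib, as in the snippets above):
import httplib

h = httplib.HTTPConnection("127.0.0.1", 8118)
h.connect()
# when talking to a proxy, request the absolute URL, scheme included
h.request("GET", "http://www.google.com/")
resp = h.getresponse()
page = resp.read()
h.close()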
First, make sure you are using the correct node address and port. Most proxies listen on ports other than 80. Second, specify the protocol as part of the URL in your request string.
Under normal circumstances, your code should work if it looks something like this:
h = httplib.HTTPConnection("138.45.68.134", 8080)
h.connect()
h.request("GET", "http://www.google.com")
resp = h.getresponse()
page = resp.read()
h.close()
You can also use the socket module directly as an alternative, but that's another story and it's even more involved than the approach above (see the rough sketch below).
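A hedged sketch of that raw-socket route, again assuming the local Privoxy listener on 127.0.0.1:8118 and omitting error handling:
import socket

s = socket.create_connection(("127.0.0.1", 8118))
# speak HTTP/1.0 by hand: absolute URL in the request line, blank line to finish
s.sendall("GET http://www.google.com/ HTTP/1.0\r\nHost: www.google.com\r\n\r\n")
chunks = []
while True:
    data = s.recv(4096)
    if not data:
        break
    chunks.append(data)
s.close()
response = "".join(chunks)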
Hope that helps! :-)
So, I am trying to use the pyodata library in Python to access and download data from an OData service.
I tried accessing the Northwind data and it worked, so I guess the code I used is OK.
import requests
import pyodata
url_t = 'http://services.odata.org/V2/Northwind/Northwind.svc'
# connection set up
northwind = pyodata.Client(url_t, requests.Session())
# This prints out a single value from the table Customers
for customer in northwind.entity_sets.Customers.get_entities().execute():
    print(customer.CustomerID, ",", customer.CompanyName)
    break
# This will print out - ALFKI , Alfreds Futterkiste
I also tried connecting to the OData source in Excel to see if the code above returns the correct data, and it did.
[Screenshot: OData connection in Excel]
Now, using the same code to connect to the data source I actually want to pull from does not work:
# the OData service I need; connecting to it in Excel works, but this code does not
url_1 = 'https://batch.decisionkey.npd.com/odata/dkusers'
session = requests.Session()
session.auth = (user_name, psw)
theservice = pyodata.Client(url_1, session)
The above code returns the error message below (is it something about security?):
[Screenshot: error message]
Connecting to the data in Excel looks like this:
[Screenshot: Excel OData connection settings]
I am thinking it might be a security issue that is blocking me from accessing the data, or it could be something else. Please let me know if anything needs to be clarified. Thanks.
This is my first time asking a question, so please let me know if I did anything wrong here. ^_^
You got HTTP 404 - Not Found.
The service "https://batch.decisionkey.npd.com/odata/dkusers" is not accessible from the outside world for me to try, so something more is going on at the networking level, which is what the second picture of the Excel import hints at.
You can forget about pyodata for the moment; for your problem it is just a wrapper around the HTTP networking layer, the Requests library. You need to find a way to initialize the Requests session so that the request returns HTTP 200 OK instead.
The Northwind example service is plain and simple, so there is no problem during initialization of pyodata.Client.
Refer to the Requests library documentation: https://docs.python-requests.org/en/latest/user/advanced/
# Sample script
import requests

url_1 = 'https://batch.decisionkey.npd.com/odata/dkusers'
session = requests.Session()
session.auth = (user_name, psw)
# ??? Perhaps an SSL certificate needs to be provided?
# ??? Or maybe you are behind some proxy that Excel uses but Python does not; try ping in CMD
response = session.get(url_1)
print(response.text)
The pyodata documentation on initialization may also be useful, although you will not find the reason for the HTTP 404 there: https://pyodata.readthedocs.io/en/latest/usage/initialization.html
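For reference, a hedged sketch of how the session could be configured before handing it to pyodata; the certificate path and proxy address below are placeholders, not known facts about your environment, and user_name/psw are the credentials from your question:
import requests
import pyodata

url_1 = 'https://batch.decisionkey.npd.com/odata/dkusers'
session = requests.Session()
session.auth = (user_name, psw)
# if the server uses an internal CA, point Requests at its certificate bundle (placeholder path)
# session.verify = '/path/to/internal-ca.pem'
# if Excel reaches the service through a corporate proxy, mirror it here (placeholder address)
# session.proxies = {'https': 'http://proxy.example.com:8080'}
response = session.get(url_1)
print(response.status_code)  # aim for 200 here before involving pyodata
if response.status_code == 200:
    theservice = pyodata.Client(url_1, session)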
This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used the code below, which works fine with URLs such as https://www.google.co.in or https://www.pythonprogramming.net.
But when I use the URL http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1, which I need for scraping data, it shows:
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req
proxy = req.ProxyHandler({'http': r'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please guide me on what the issue is here; I am not able to understand it.
Also, while searching for the above error, I read something about absolute URLs. Is that related to this?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getAddrInfo) fails. Hence the error.
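As a quick, hedged sanity check you could ask your own host's resolver for the name directly; if this fails too, it reproduces the getaddrinfo error independently of urllib and your proxy:
import socket

try:
    # what does the local resolver return for the name?
    print(socket.getaddrinfo('www.genecards.org', 80))
except socket.gaierror as err:
    print('resolution failed:', err)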
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind it (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by sites that, among other things - how shall I put it - take a dim view of being scraped.
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet gotten around) to provide its information in an easier format is one thing - stealing that information is quite another.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try pinging url:3128 from your terminal. Does it respond? The problem seems to be related to security on the server side.
I have a website developed in Flask, running on an Apache2 server, that responds on port 80 to two URLs:
Url-1 http://www.example.com
Url-2 http://oer.example.com
I want to detect which of the two URLs the user came in from, adjust what the server does accordingly, and store the result in a config variable:
app.config['SITE'] = 'OER'
or
app.config['SITE'] = 'WWW'
Looking around on the internet, I can find lots of examples using urllib2. The issue is that you need to pass it the URL you want to slice, and I can't find a way to pull that out, since it may change between the two with each request.
I could fork the code and put up two different versions but that's as ugly as a box of frogs.
Thoughts welcome.
Use the Flask request object (from flask import request) and one of the following in your request handler:
hostname = request.environ.get('HTTP_HOST', '')
or, with urlparse:
from urlparse import urlparse
url = urlparse(request.url)
hostname = url.netloc
This will get e.g. oer.example.com or www.example.com. If there is a port number, that will be included too. Keep in mind that this ultimately comes from the client request, so "bad" requests might have it set wrong, although hopefully Apache wouldn't route those to your app.
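Tying that back to the config variable from the question, a minimal sketch might look like this; the hostname test is an assumption, so adjust it to your domains. Since app.config is shared across all requests in a process, a per-request attribute such as flask.g is usually a better place for this kind of value:
from flask import Flask, request

app = Flask(__name__)

@app.before_request
def detect_site():
    # request.host is e.g. 'oer.example.com' or 'www.example.com', possibly with a port
    hostname = request.host.split(':')[0]
    app.config['SITE'] = 'OER' if hostname.startswith('oer.') else 'WWW'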
I am working on a web crawler (using Python).
The situation is, for example, that I am behind server-1 and I use a proxy setting to connect to the outside world. So in Python, using a proxy handler, I can fetch URLs.
Now, the thing is, I am building a crawler, so I cannot use only one IP (otherwise I will be blocked). To solve this, I have a bunch of proxies I want to shuffle through.
My question is: this is a two-level proxy setup - one proxy to get out through server-1, and then, on top of that, the rotating proxies I want to shuffle through. How can I achieve this?
Update: Sounds like you're looking to connect to proxy A and from there initiate HTTP connections via proxies B, C, D which are outside of A. You might look into the proxychains project, which says it can "tunnel any protocol via a user-defined chain of TOR, SOCKS 4/5, and HTTP proxies".
Version 3.1 is available as a package in Ubuntu Lucid. If it doesn't work directly for you, the proxychains source code may provide some insight into how this capability could be implemented for your app.
Original answer:
Check out urllib2.ProxyHandler. Here is an example of how you can use several different proxies to open URLs:
import random
import urllib2

# put the urls for all of your proxies in a list
proxies = ['http://localhost:8080/']
# construct your list of url openers which each use a different proxy
openers = []
for proxy in proxies:
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    openers.append(opener)
# select a url opener randomly, round-robin, or with some other scheme
opener = random.choice(openers)
req = urllib2.Request(url)  # url is the page you want to fetch
res = opener.open(req)
I recommend you take a look at CherryProxy. It lets you send a request to an intermediate server (where CherryProxy is running), which then forwards your HTTP request to a proxy on a second-level machine (e.g. a Squid proxy on another server) for processing. Voilà! A two-level proxy chain.
http://www.decalage.info/python/cherryproxy
I've got mechanize set up and working with Python. I am adding support for using a proxy, but how do I check that I am actually using the proxy?
Here is some code I am using:
ip = 'some proxy ip address'
br.set_proxies({"http://": ip} )
I started to wonder whether it was working at all, because just to do some testing I typed in:
ip = 'asdfasdf'
and it didn't throw an error. So how do I check whether it is really using the proxy IP address I pass in, or the IP address of my own computer? Is there a way to return info about your IP in mechanize?
Maybe like this?
br = mechanize.Browser()
br.set_proxies({"http": '127.0.0.1:80'})
You can turn on debugging for more information:
br.set_debug_http(True)
br.set_debug_redirects(True)
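To actually confirm which address the far end sees, one hedged approach is to fetch an IP-echo service through the proxied browser and compare the result with your own public IP; httpbin.org/ip is just one example of such a service, and the proxy address below is a placeholder:
import mechanize

br = mechanize.Browser()
br.set_proxies({"http": "203.0.113.10:8080"})  # placeholder proxy address
response = br.open("http://httpbin.org/ip")
print(response.read())  # should report the proxy's IP, not yours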
I am not sure how to handle this issue with mechanize, but you could read the following link, which explains how to do it without mechanize (but still in Python):
Proxy Check in python
The simple solution provided at the above-mentioned link could be easily adapted to your needs.
Thus, instead of the line:
print "Connection error! (Check proxy)"
you could replace it with
SucceededYesNo="NO"
and instead of
print "All was fine"
just replace it with
SucceededYesNo="YES"
Now, you have a variable available for further processing.
I am, however, afraid this will not cover the case where the target web page is down, because the same error can arise from two causes (so you would not know whether a NO outcome comes from a non-working proxy server or from a bad web page). Still, it could be a solution: what about checking a known-working web page, e.g. www.google.com, with the above-mentioned code? That way you eliminate one cause, and only the other remains.
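A hedged sketch of that adaptation, keeping the SucceededYesNo idea and probing a known-working page through the proxy (urllib2, as in the linked answer; the proxy address is a placeholder):
import urllib2

def check_proxy(proxy_url, test_url="http://www.google.com"):
    opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy_url}))
    try:
        opener.open(test_url, timeout=10)
        SucceededYesNo = "YES"
    except Exception:
        SucceededYesNo = "NO"
    return SucceededYesNo

print(check_proxy("203.0.113.10:8080"))  # placeholder proxy address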