Using the Python Twisted framework, when I use:
twisted.names.client.getHostByName('some_domain')
I get the domain name resolved to an IP address.
But when I use
from twisted.web.client import Agent
agent = Agent(reactor)
agent.request(b'GET', 'http://some_domain', None)
I get this error: Error received [Failure instance: Traceback (failure with no frames): <class 'ValueError'>: invalid hostname: some_domain]
some_domain only has an A record, no AAAA, if that helps. Also, these are two AWS ECS containers communicating, with some_domain sitting behind an AWS service discovery endpoint.
Using the Python 3.8.6 Docker image and Twisted 20.3.0.
Any ideas what is happening or where to look? Thanks
This unfortunate exception does not mean that there was a problem resolving the name to an address. It means that the name itself was considered invalid, and no attempt was even made to resolve it. Why it is considered invalid is difficult to say without knowing what the real domain name is: some_domain is perfectly valid, but I assume the real domain you're using is something else.
This is not to say your domain is invalid, but you may have a problem in your representation of it, or there may be a bug in Twisted that causes it to be considered invalid. Again, without knowing what it is, it's hard to say more.
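As a rough illustration only (this is not Twisted's actual validation code), a strict RFC 1123-style hostname check looks like the sketch below. Note that such a strict check rejects underscores, which are common in internal service names, so that is one thing worth ruling out with your real domain:

```python
import re

# Rough RFC 1123 hostname check -- an illustration only, NOT Twisted's
# actual validation logic. Each label may contain letters, digits and
# hyphens, must not start or end with a hyphen, and is at most 63 chars.
LABEL = re.compile(r'^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$')

def looks_like_valid_hostname(name):
    labels = name.rstrip('.').split('.')
    return all(LABEL.match(label) for label in labels)

print(looks_like_valid_hostname('example.com'))   # True
print(looks_like_valid_hostname('some_domain'))   # False: underscore
```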
Related
I am trying to make a simple HTTP request to a server inside my company, from a dev server. I figured out that, depending on the origin/destination server, I might or might not be forced to use the fully qualified name of the destination server, like srvdestination.com.company.world instead of just srvdestination.
I am ok with this, but I don't understand how come my DB connection works?
Let's say I have srvorigin. Now, to make an HTTP request, I must use the qualified name srvdestination.com.company.world. For the database connection, however, a connection string with the unqualified name is enough: psycopg.connect(host='srvdestination', ...). I understand that the protocols are different, but how does psycopg2 manage to resolve the real name?
First, it all depends on how the name resolution subsystem of your OS is configured. If you are on Unix (you did not specify), this is governed by /etc/resolv.conf. There you can provide the OS with a search list: if a name does not have "enough" dots (the threshold is configurable), a suffix is appended and resolution is retried.
The library you use to make the HTTP request may not query the OS for name resolution and may instead do its own DNS resolution. In that case, it can only work with the information you give it (although it could also reuse the OS's /etc/resolv.conf and the information in it), hence the need to use the full name.
psycopg2, on the contrary, may use the OS resolution mechanism and hence deal with "short" names just fine.
Both libraries should have documentation on how they handle hostnames... or otherwise you need to study their source code. I believe psycopg2 is a wrapper around the standard libpq library, written in C if I am not mistaken, which therefore almost certainly uses the standard OS resolution process.
I can understand the curiosity around this difference, but my advice anyway is to keep short names for commands you type in the shell and the like (and even there they can be a problem), but always use FQDNs (Fully Qualified Domain Names) in your programs and configuration files. You will avoid a lot of problems.
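The search-list behaviour described above can be sketched in a few lines (the suffix and ndots value here are made-up examples; the real ones come from /etc/resolv.conf):

```python
def candidate_names(name, search=('com.company.world',), ndots=1):
    """Mimic the resolver's search-list expansion: if `name` has fewer
    than `ndots` dots, try it with each configured suffix appended
    before trying it as-is."""
    if name.count('.') >= ndots:
        return [name]
    return [f'{name}.{suffix}' for suffix in search] + [name]

print(candidate_names('srvdestination'))
# ['srvdestination.com.company.world', 'srvdestination']
print(candidate_names('srvdestination.com.company.world'))
# ['srvdestination.com.company.world']
```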
This is the first time I am trying to use Python for web scraping. I have to extract some information from a website. I work at an institution, so I am using a proxy for Internet access.
I have used the code below, which works fine with URLs like https://www.google.co.in or https://www.pythonprogramming.net.
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1 which I need for scraping data, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req

# 'username', 'password' and 'url' are placeholders; note that a proxy
# URL normally separates the credentials from the host with '@':
proxy = req.ProxyHandler({'http': r'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please guide me on what the issue is here; I am not able to understand it.
Also, while searching for the above error, I read something about absolute URLs. Is that related to it?
The problem is that your proxy server and your own host seem to use two different DNS resolvers, or two resolvers updated at different moments in time.
So when you pass www.genecards.org, the proxy does not know that name, and its attempt to get address information (getaddrinfo) fails. Hence the error.
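You can see the same class of getaddrinfo failure from plain Python; resolution errors surface as socket.gaierror (the .invalid TLD below is reserved and guaranteed never to resolve):

```python
import socket

try:
    socket.getaddrinfo('does-not-exist.invalid', 80)
except socket.gaierror as exc:
    # This is the same class of failure that urllib reports as
    # "[Errno 11001] getaddrinfo failed" on Windows.
    print('resolution failed:', exc)
```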
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by sites that, among other things - how shall I put it - take a dim view of being scraped.
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code, which can, and will, break down whenever it's most inconvenient for you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet gotten around) to provide its information in an easier format is one thing; stealing that information is quite another.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Does it respond? The problem seems to be related to security on the server side.
Bear with me. This is my first post...
The Tor project has recently introduced Stem as a loadable python module. I've been playing around with it to see if it's a viable tool. My results have been mixed.
I am trying to enable a configuration for a hidden service from within the controller (which is supposed to act as though the options came directly from the torrc file). It always fails on me. Here's a quick example of what I try:
#!/usr/bin/env python
from stem.control import Controller
controller = Controller.from_port(port = 9051)
controller.authenticate()
controller.set_options({'HIDDENSERVICEDIR':'/tmp/hiddenservice/','HIDDENSERVICEPORT':'1234 127.0.0.1:1234'})
...which returns an error:
InvalidRequest Traceback (most recent call last)
/home/user/my/folder/<ipython-input-5-3921e9b46181> in <module>()
/usr/local/lib/python2.7/dist-packages/stem/control.pyc in set_options(self, params, reset)
1618 raise stem.InvalidRequest(response.code, response.message)
1619 elif response.code in ("513", "553"):
-> 1620 raise stem.InvalidRequest(response.code, response.message)
1621 else:
1622 raise stem.ProtocolError("Returned unexpected status code: %s" % response.code)
InvalidRequest: Unacceptable option value: Failed to configure rendezvous options. See logs
...and the following in /var/log/tor/log:
Aug 1 10:10:05.000 [warn] HiddenServicePort with no preceding HiddenServiceDir directive
Aug 1 10:10:05.000 [warn] Controller gave us config lines that didn't validate: Failed to configure rendezvous options. See logs for details.
I've tried this with Stem's "set_options" as seen above, and in two separate commands with "set_conf". With "set_conf", I can set the HiddenServiceDir, but it still fails the same way when setting the port, making me think I have a fundamental misunderstanding of Tor.
I checked my circuits and it doesn't seem to matter whether I have one with a hidden service rendezvous point; it keeps failing. I'd prefer to keep things pythonic, temporary and clean, and not have a hacked-up bash script that rewrites the torrc before restarting tor. (In a perfect world, I'd rather not write to a hidden service directory at all, but tor hasn't implemented that yet.)
I try to be as cross-platform as possible, but I'm running Linux with Tor 2.3.25...
So who has ideas of why Stem won't let me make a hidden service?
Thanks for pointing this out to me via our bug tracker. Answering this here. :)
The set_options() docs say...
The params can optionally be a list of key/value tuples, though the only reason this type of argument would be useful is for hidden service configuration (those options are order dependent).
The issue here is that Tor's hidden service options behave in a slightly different fashion from all the rest of its config options. Tor expects a 'HiddenServiceDir' followed by the properties associated with that hidden service (it's order dependent). This is because a single tor instance can provide multiple hidden services.
Please change your call from...
controller.set_options({'HIDDENSERVICEDIR':'/tmp/hiddenservice/','HIDDENSERVICEPORT':'1234 127.0.0.1:1234'})
... to be a list of tuples instead...
controller.set_options([('HiddenServiceDir', '/tmp/hiddenservice/'), ('HiddenServicePort', '1234 127.0.0.1:1234')])
Hope this helps! -Damian
I have built an application on Google App Engine, in python27, to connect with another service's API, and in general everything works smoothly. Every now and then I get one of the following two errors:
(<class 'google.appengine.api.remote_socket._remote_socket.error'>, error('An error occured while connecting to the server: ApplicationError: 2 ',), <traceback object at 0x11949c10>)
(<class 'httplib.HTTPException'>, HTTPException('ApplicationError: 5 ',), <traceback object at 0x113a5850>)
The first of these errors (ApplicationError: 2) I interpret to be an error on the part of the servers with which I am communicating; however, I've not been able to find any detail on this, or on whether I am responsible for it / can fix it.
The second of these errors (ApplicationError: 5) I've found some detail on, and it suggests that the server took too long to respond to my application - however, I've set the timeout to 20s and it fails considerably faster than that.
If anyone could offer links or insight into the errors - specifically what causes the error and what can be done to fix it I'd very much appreciate it.
You get to start using the word "idempotent" in casual conversations and curses :)
The only thing you can do is try the call again, and accept the fact that your initial call may have gone through, only to time out on the response - i.e. if the call actually did something (created a customer order, for example), then after the timeout error you might have to check whether the first request succeeded, so you don't end up with multiple copies of the same order.
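A sketch of that retry-with-verification pattern (all function names here are hypothetical placeholders for whatever your remote API offers):

```python
def submit_with_retry(create_order, order_exists, order, retries=3):
    """Retry a call that may time out AFTER the server has acted.
    Before each retry, check whether the previous attempt actually
    succeeded, so we never create duplicate orders."""
    last_error = None
    for _ in range(retries):
        try:
            return create_order(order)
        except TimeoutError as exc:
            last_error = exc
            if order_exists(order):   # the call may have gone through
                return order
    raise last_error
```

The key point is the existence check between attempts: without it, a timed-out-but-successful first call plus a blind retry yields two orders.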
Hope that makes sense. FWIW we work with some unfriendly API's and for us, about 80% of our code is dealing with exactly this sort of !##$%.
Is there any way to specify which DNS server should be used by socket.gethostbyaddr()?
Please correct me if I'm wrong, but isn't this the operating system's responsibility? gethostbyaddr is just a part of libc, and according to the man page:
The gethostbyname(), gethostbyname2() and gethostbyaddr() functions each return a pointer to an object with the following structure describing an internet host referenced by name or by address, respectively. This structure contains either the information obtained from the name server, named(8), or broken-out fields from a line in /etc/hosts. If the local name server is not running these routines do a lookup in /etc/hosts.
So I would say there's no way of simply telling Python (from the code's point of view) to use a particular DNS server, since that's part of the system's configuration.
Take a look at PyDNS.
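The stdlib resolver functions indeed offer no server parameter. If you do reach for a third-party resolver (dnspython is a maintained alternative to PyDNS), a reverse lookup is just a PTR query for a special name derived from the address; the stdlib can at least build that name for you:

```python
import ipaddress

# Reverse lookups are PTR queries for a name under in-addr.arpa; the
# stdlib can construct it even though it cannot direct the query at a
# chosen DNS server.
addr = ipaddress.ip_address('192.0.2.1')
print(addr.reverse_pointer)   # 1.2.0.192.in-addr.arpa

# With dnspython (third-party) you could then send the PTR query to a
# specific server, e.g.:
#   import dns.resolver
#   r = dns.resolver.Resolver(configure=False)
#   r.nameservers = ['1.1.1.1']
#   answer = r.resolve(addr.reverse_pointer, 'PTR')
```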