nba_api timeouts when run from a GitHub Actions workflow - python

I use nba_api requests to get play-by-play data for some games.
The request works in the Python environment on my personal computer:
leaguegamefinder.LeagueGameFinder(player_or_team_abbreviation='T', date_from_nullable=Today, date_to_nullable=Today, league_id_nullable='00', outcome_nullable='W')
But when I run the same code from a GitHub Actions workflow, it gives me the following error:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
I have read that stats.nba.com blocks requests coming from some cloud hosting providers, but I have not found a way around this block (using different proxies or new headers does not seem to solve the problem), and I don't know whether one exists.
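For reference, nba_api's endpoint classes accept proxy, headers and timeout keyword arguments (check the version you have installed), which is where those experiments would plug in; the sketch below only shows where the knobs live, with placeholder proxy and header values, and as noted above these had not beaten the block at the time of writing:
from datetime import datetime
from nba_api.stats.endpoints import leaguegamefinder

Today = datetime.today().strftime('%m/%d/%Y')  # assumed MM/DD/YYYY, as the stats API expects

# Placeholder browser-like headers; adjust as needed.
custom_headers = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://stats.nba.com/',
}

finder = leaguegamefinder.LeagueGameFinder(
    player_or_team_abbreviation='T',
    date_from_nullable=Today,
    date_to_nullable=Today,
    league_id_nullable='00',
    outcome_nullable='W',
    proxy='http://user:pass@my-proxy:3128',  # hypothetical proxy, replace with your own
    headers=custom_headers,
    timeout=120,  # raise the default 30 s read timeout
)
games = finder.get_data_frames()[0]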
Does anybody have an idea about this kind of problem?
Thanks a lot!

Related

Python web scraping : urllib.error.URLError: urlopen error [Errno 11001] getaddrinfo failed

This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used the code below, which works fine with URLs like https://www.google.co.in or https://www.pythonprogramming.net.
But when I use the URL I actually need to scrape, http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req

# Credentials and proxy host are separated by '@' in the proxy URL
proxy = req.ProxyHandler({'http': r'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please help me understand what the issue is here.
Also, while searching for the above error, I read something about absolute URLs. Is that related?
The problem is that your proxy server and your own host seem to use two different DNS resolvers, or two resolvers updated at different points in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getaddrinfo) fails. Hence the error.
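As a quick diagnostic (not a fix), you can check what your own host resolves the name to, and compare it with the same lookup run on the proxy machine if you have access to it; a minimal sketch:
import socket

# Print every address the local resolver returns for the host.
for family, _, _, _, sockaddr in socket.getaddrinfo('www.genecards.org', 443):
    print(family, sockaddr)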
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind it (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by sites that, among other things - how shall I put it - take a dim view of being scraped.
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet come around) to provide its information in an easier format is one thing - stealing that information is quite another.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping the proxy host (url:3128) from your terminal. Does it respond? The problem seems to be related to security restrictions on the server side.

Python SSLError: VERSION_TOO_LOW

I'm having some trouble using urllib to fetch some web content on my Debian server. I use the following code to get the contents of most websites without problems:
import urllib.request as request
url = 'https://www.metal-archives.com/'
req = request.Request(url, headers={'User-Agent': "foobar"})
response = request.urlopen(req)
response.read()
However, if the website is using an older encryption protocol, the urlopen function will throw the following error:
ssl.SSLError: [SSL: VERSION_TOO_LOW] version too low (_ssl.c:748)
I have found a way to work around this problem, which consists of creating an SSL context and passing it as an argument to the urlopen function, so the previous code has to be modified:
...
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
response = request.urlopen(req, context=context)
...
Which will work, provided the protocol specified matches the website I'm trying to access. However, this does not seem like the best solution, since:
If the site owners ever update their cryptography methods, the code will stop working.
The code above will only work for this site, and I would have to create special cases for every website I visit in the entire program, since each one could be using a different version of the protocol. That would lead to pretty messy code.
The first solution I posted (the one without the SSL context) oddly seems to work on my Arch Linux machine, even though both machines have the same versions of everything.
Does anyone know about a generic solution that would work for every TLS version? Am I missing something here?
PS: For completeness, I will add that I'm using Debian 9, Python 3.6.2, OpenSSL 1.1.0f and urllib3 1.22.
In the end, I've opted to wrap the method call inside a try-except, so I can use the older SSL version as a fallback. The final code is this:
import ssl
import urllib.request as request
from urllib.error import URLError

url = 'https://www.metal-archives.com'
req = request.Request(url, headers={"User-Agent": "foobar"})
try:
    response = request.urlopen(req)
except (ssl.SSLError, URLError):
    # Retry with the older TLSv1 protocol to see if that fixes the problem
    context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
    response = request.urlopen(req, context=context)
I have only tested this code on a dozen websites and it seems to work so far, but I'm not sure it will work every time. Also, this solution seems inefficient, since it needs two HTTP requests whenever the first one fails, which can be very slow.
Improvements are still welcome :)
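One possible improvement, sketched under the assumption that the local OpenSSL build still ships the older protocol versions (the Debian vs Arch difference mentioned above), is to build a single negotiating context up front and reuse it for every request, which avoids the second round trip; this is something to try rather than a guaranteed fix:
import ssl
import urllib.request as request

# PROTOCOL_TLS negotiates the highest version both sides support, so modern
# sites still get TLS 1.2+ while older servers can settle for TLS 1.0/1.1.
context = ssl.SSLContext(ssl.PROTOCOL_TLS)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

req = request.Request('https://www.metal-archives.com',
                      headers={'User-Agent': 'foobar'})
response = request.urlopen(req, context=context)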

TooManyRequests Overpass Error

I'm using overpy to query the Overpass API, and the nature of the data is such that I have a lot of queries to execute. I've run into the 429 OverpassTooManyRequests exception and I'm trying to play by the rules. I've tried introducing time.sleep calls to space out the requests, but I have no basis for how long the program should wait before continuing.
I found this link which mentions a "Retry-after" header:
How to avoid HTTP error 429 (Too Many Requests) python
Is there a way to access that header in an overpy response? I've been through the docs and the source code, but nothing stood out that would allow me to access that header so I can pause querying until it's acceptable to do so again.
I'm using Python 3.6 and overpy 0.4.
Maybe this isn't quite the answer you're seeking, but I ran into the same issue and fixed it by simply hosting my own OSM database server using Docker. Just clone the repo and follow the instructions:
https://github.com/mediasuitenz/docker-overpass-api
Per http://overpass-api.de/command_line.html, do check that you do not have a single 'runaway' request that is taking up all the resources.
After verifying that I don't have runaway queries, I have taken Peter's advice and added a catch for the TooManyRequests exception that waits 30s and tries again. This seems to be working as an immediate solution.
I will also raise an issue with the originators of OverPy to suggest an enhancement to allow evaluating the /api/status, as per mmd's advice.
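A minimal sketch of that catch-and-retry approach (the 30 s wait mirrors what worked above rather than a value read from a Retry-After header, which overpy 0.4 does not expose; the exception path is assumed to be overpy.exception.OverpassTooManyRequests):
import time
import overpy

api = overpy.Overpass()

def query_with_backoff(ql, wait=30, max_tries=5):
    # Retry the query whenever the server reports too many requests.
    for attempt in range(max_tries):
        try:
            return api.query(ql)
        except overpy.exception.OverpassTooManyRequests:
            time.sleep(wait)
    raise RuntimeError("Overpass still rate-limiting after %d tries" % max_tries)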

Frequent ChunkedEncodingError with Google App Engine / Requests

I frequently have a ChunkedEncodingError when requesting servers using Requests (Python) and Google App Engine.
I looked at the answer from IncompleteRead using httplib, but the problem is that I don't believe my issue is related to the server being queried: I often get this error with various endpoints I'm using, including Intercom and FullContact.
I would have suspected the issue was related to one service's server if the error were always raised by the same server (for example, FullContact), but that's not the case. I've also encountered this issue with other, unrelated requests.
So I'm suspecting the problem is either my code or Google. But from my code's point of view, I don't know what would be wrong. Here's a snippet:
result = requests.post(
    "https://api.intercom.io/companies",
    json={'some': 'data', 'that': 'are', 'sent': 'ok'},
    headers={'Accept': 'application/json'},
    auth=("app_id", "app_key"),
)
As you can see, the request is quite standard, nothing fancy. It also fails with something as simple as:
r = requests.get(url, params=params, timeout=3)
Does anyone experience these issues with Google App Engine? Is there something I can do to avoid them?
There is a patch that seems to work on GAE.
The issue is located in the iter_content function of Requests, which uses the underlying urllib3 library.
The issue is that Google overrides this library with their own implementation, but with a few changes that yield a ChunkedEncodingError at the Requests level.
I tried this patch, and so far, so good. In detail, you must replace the following lines in your requests/models.py file:
for chunk in self.raw.stream(chunk_size, decode_content=True):
    yield chunk
by:
if isinstance(self.raw._original_response._method, int):
    while True:
        chunk = self.raw.read(chunk_size, decode_content=True)
        if not chunk:
            break
        yield chunk
else:
    for chunk in self.raw.stream(chunk_size, decode_content=True):
        yield chunk
And the problem will stop.
I submitted an issue to talk about it on the Requests repository, and we'll see how this will evolve.
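If editing the vendored requests/models.py is not an option in your deployment, a blunter stopgap (not part of the patch above) is to simply retry the call when the error surfaces; a minimal sketch:
import requests
from requests.exceptions import ChunkedEncodingError

def post_with_retry(url, max_tries=3, **kwargs):
    # Retry a POST a few times if the response body arrives truncated.
    last_exc = None
    for _ in range(max_tries):
        try:
            return requests.post(url, **kwargs)
        except ChunkedEncodingError as exc:
            last_exc = exc
    raise last_exc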

Interfacing to the LinkPoint API with Python - Sending XML Over SSL with Authentication

I'm trying to make a successful connection to the LinkPoint gateway using Python. For those of you unfamiliar with their API: you get a .pem file that you use for authentication.
I'm having trouble using this file and creating a secure connection over SSL.
According to their API documentation (which leaves a lot to be desired, btw) I believe the configuration should look similar to below:
HOST = 'secure.linkpt.net'
API_URL = 'https://secure.linkpt.net/lpc/servlet/lppay'
PORT = 1129
cert_key = 'my_cert_key.pem'
Using this information and a valid XML string how can I create this connection?
I'm pretty new to HTTP connections in Python. I've successfully implemented connections with other APIs using a POST with urllib2. Naturally, my first attempt started with a similar approach hoping I could stumble on to a solution.
Something like:
headers = {'User-Agent': 'Rico',
           'Content-type': 'text/xml; charset="UTF-8"',
           'Content-length': len(self.xml_string),
           }
# POST to First Data (LinkPoint)
req = urllib2.Request(API_URL, self.xml_string, headers)
response = urllib2.urlopen(req)
self.handleResponse(response.read())
I had little hope this would work, as I didn't provide anything about the cert_key or the PORT.
After this attempt I tried a similar approach to one I found in a solution from another Stack Overflow post. Unfortunately I wasn't able to get far with it, as I don't have ca_certs or cert files (that I know of).
I've tried to use Requests but can't find documentation/examples that make sense to me.
I've also tried to use Twisted, and I really hoped I could do something with it, but it feels like trying to open a door with a wrecking ball. It just feels like overkill to me. I just need a simple connection/request/response... this seems overly complicated for that.
My next attempt was going to be PycURL, but I have encountered enough despair during this process that I thought I'd come here to see if someone has good suggestions before diving in.
If you think I should revisit one of these tools, please let me know. I didn't spend a great deal of time with any of them - just enough to get my feet wet. If you could also point me to a good example or detailed documentation, that would be fantastic.
Also, I'd prefer not to use the standard SSL library to build the connection myself - I don't want to reinvent the wheel if I don't have to.
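Since the question mentions trying Requests without finding a usable example, here is a hedged sketch of what a client-certificate POST with Requests typically looks like; the URL/port pairing and whether the gateway accepts the combined .pem passed this way are assumptions, and xml_str stands for the XML payload from the question:
import requests

API_URL = 'https://secure.linkpt.net:1129/lpc/servlet/lppay'  # hypothetical host:port pairing
headers = {
    'User-Agent': 'Rico',
    'Content-Type': 'text/xml; charset="UTF-8"',
}
xml_str = '<order>...</order>'  # placeholder for your LinkPoint XML payload

# `cert` may point at a single .pem containing both key and certificate.
response = requests.post(API_URL, data=xml_str, headers=headers,
                         cert='my_cert_key.pem', timeout=30)
print(response.status_code, response.text)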
The solution I was able to use to get a valid connection was using httplib as follows:
import httplib

HOST = 'staging.linkpt.net'
API_URL = '/lpc/servlet/lppay'
PORT = 1129
CERTFILE = 'my_cert_file.pem'

headers = {'User-Agent': 'Rico',
           'Content-type': 'text/xml; charset="UTF-8"',
           'Content-length': len(xml_str),
           }

conn = httplib.HTTPSConnection(HOST, PORT, cert_file=CERTFILE)
conn.putrequest("POST", API_URL)
# putheader takes a name/value pair, so send the dict entry by entry
for key, value in headers.items():
    conn.putheader(key, value)
conn.endheaders()
conn.send(xml_str)
response = conn.getresponse()
I have not yet been able to generate a valid request. Apparently I interpreted the API documentation incorrectly and keep getting a 'Malformed or unrecognized request' error, but at least I'm making the connection.
I'll update this answer if I'm able to determine more useful information regarding this subject.
UPDATE: A Link Point customer service employee told me I was using old API documentation. I've since tried with the newer version and still cannot connect. I can't even get a response from their server. This is no longer a possible solution to this problem.
UPDATE 2: I was able to solve this problem in another post: SSL Connection Using .pem Certificate With Python
Enjoy!
