Downloading RSS feed in Python - python

I have a scenario where I need to download the XML of an RSS feed from a provided URL.
I am using the following code for this:
import urllib
from xml.dom.minidom import parse

urls = ['http://static.espncricinfo.com/rss/livescores.xml', 'http://ibnlive.in.com/ibnrss/top.xml']
for rssUrl in urls:
    if rssUrl is not None:
        dom = parse(urllib.urlopen(rssUrl))
        tmp = dom.toprettyxml()
When I run this as a standalone application it works fine without any issues.
But when I call this code from a WebSocket application, the execution is inconsistent:
sometimes it works properly and sometimes it doesn't, and the failures appear to be random. Can anyone tell me what the reason behind this might be?
The error shown is:
<urlopen error [Errno 66] unknown>
I have tried using urllib2 instead of urllib, but the problem persists.
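For reference, this is roughly what the urllib2 variant looks like with the error caught so the failing URL is logged (the try/except and timeout are added here just for illustration):
import urllib2
from xml.dom.minidom import parse

urls = ['http://static.espncricinfo.com/rss/livescores.xml', 'http://ibnlive.in.com/ibnrss/top.xml']
for rssUrl in urls:
    try:
        # a timeout keeps one bad connection from hanging the handler indefinitely
        dom = parse(urllib2.urlopen(rssUrl, timeout=10))
        tmp = dom.toprettyxml()
    except urllib2.URLError as e:
        # log which URL failed and why instead of letting the error propagate
        print "failed to fetch %s: %r" % (rssUrl, e.reason)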

Related

OpenAI GPT3 Search API not working locally

I am using the Python client for the GPT-3 search model on my own JSON Lines files. When I run the code in a Google Colab notebook for test purposes, it works fine and returns the search responses. But when I run the code on my local machine (Mac M1) as a web application (running on localhost) using Flask for the web service functionality, it gives the following error:
openai.error.InvalidRequestError: File is still processing. Check back later.
This error occurs even if I implement the exact same example as given in the OpenAI documentation. The link to the search example is given here.
It runs perfectly fine on my local machine and in the Colab notebook if I use the Completion API that is used by the GPT-3 Playground. (code link here)
The code that I have is as follows:
import openai

openai.api_key = API_KEY

file = openai.File.create(file=open(jsonFileName), purpose="search")

response = openai.Engine("davinci").search(
    search_model="davinci",
    query=query,
    max_rerank=5,
    file=file.id
)

for res in response.data:
    print(res.text)
Any idea why this strange behaviour is occurring and how I can solve it? Thanks.
The problem was on this line:
file = openai.File.create(file=open(jsonFileName), purpose="search")
It returns the call with a file ID and a status of uploaded, which makes it seem like the upload and file processing are complete. I then passed that file ID to the search API, but in reality it had not finished processing, so the search API threw the error openai.error.InvalidRequestError: File is still processing. Check back later.
The returned file object looks like this (misleading):
It worked in Google Colab because the openai.File.create call and the search call were in two different cells, which gave it time to finish processing as I executed the cells one by one. If I put all of the same code in one cell, it gave me the same error there.
So I had to introduce a wait of about 4-7 seconds (depending on the size of your data), e.g. time.sleep(5), after the openai.File.create call and before the openai.Engine("davinci").search call, and that solved the issue. :)
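Roughly, the fix looks like this (same names as in the question; the five-second pause is illustrative and may need tuning for larger files):
import time
import openai

openai.api_key = API_KEY

# Upload the file; the returned status can say "uploaded" before server-side processing is done
file = openai.File.create(file=open(jsonFileName), purpose="search")

# Give the backend time to finish processing the file before searching against it
time.sleep(5)

response = openai.Engine("davinci").search(
    search_model="davinci",
    query=query,
    max_rerank=5,
    file=file.id
)

for res in response.data:
    print(res.text)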

Requests/urllib not working in Flask/Apache/mod_wsgi/Windows

I have a Flask app with code that processes data coming from a request to another web app hosted on a different server. It works just fine in development, and the library that processes the request can be called and used perfectly well from Python on our Windows server... However, when the library is called by the web app in production under mod_wsgi it refuses to work; requests made by the server just... time out.
I have tried everything from moving my code into the file where it's used, to switching from requests to urllib... nothing works; as long as they're made from mod_wsgi, all the requests I make time out.
Why is that? Is it some weird Apache configuration thing that I'm unaware of?
I'm posting the library below (sorry, I have to censor it a bit, but I promise it works):
import requests
import re


class CannotAccessServerException(Exception):
    pass


class ServerItemNotFoundException(Exception):
    pass


class Service():
    REQUEST_URL = "http://server-ip/url?query={query}&toexcel=csv"

    @classmethod
    def fetch_info(cls, query):
        # Get approximate matches
        try:
            server_request = requests.get(cls.REQUEST_URL.format(query=query), timeout=30).content
        except:
            raise CannotAccessServerException
        # If you're getting ServerItemNotFoundException or funny values consistently, maybe the server has changed their tables.
        server_regex = re.compile('^([\d\-]+);[\d\-]+;[\d\-]+;[\d\-]+;[\d\-]+;[\-"\w]+;[\w"\-]+;{query};[\w"\-]+;[\w"\-]+;[\w"\-]+;[\w"\-]+;[\w\s:"\-]+;[\w\s"\-]+;[\d\-]+;[\d\-]+;[\d\-]+;([\w\-]+);[\w\s"\-]+;[\w\-]+;[\w\s"\-]+;[\d\-]+;[\d\-]+;[\d\-]+;([\w\-]+);[\d\-]+;[\d\-]+;[\w\-]+;[\w\-]+;[\w\-]+;[\w\-]+;[\w\s"\-]+$'.format(query=query), re.MULTILINE)
        server_exact_match = server_regex.search(server_request.decode())
        if server_exact_match is None:
            raise ServerItemNotFoundException
        result_json = {
            "retrieved1": server_exact_match.group(1),
            "retrieved2": server_exact_match.group(2),
            "retrieved3": server_exact_match.group(3)
        }
        return result_json


if __name__ == '__main__':
    print(Service.fetch_info(99999))
PS: I know it times out because one of the things I tried was capturing the error raised by requests.get and returning its representation.
In case anybody's wondering: after lots of research, trying to run my module as a subprocess, and all kinds of experiments, I had to resort to replicating the entire dataset I needed to query from the remote server into my own database with a weekly crontab task, and then querying that.
So... yeah, I don't have a solution, to be frank, or an explanation of why this happens. But if this is happening to you, your best bet might sadly be replicating the entire dataset on your own server.
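For what it's worth, a rough sketch of that replication job (the export URL, database path, and table layout below are made-up placeholders, and a real job would parse the CSV columns properly rather than store raw lines):
import sqlite3
import requests

EXPORT_URL = "http://server-ip/url?query=&toexcel=csv"  # placeholder export URL
DB_PATH = "local_copy.sqlite3"                          # placeholder local database

def replicate():
    # Download the full CSV export from the remote server (run weekly, e.g. via cron)
    rows = requests.get(EXPORT_URL, timeout=300).text.splitlines()
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS server_items (line TEXT)")
    conn.execute("DELETE FROM server_items")
    conn.executemany("INSERT INTO server_items (line) VALUES (?)",
                     [(line,) for line in rows])
    conn.commit()
    conn.close()

if __name__ == '__main__':
    replicate()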

urllib2.urlopen and requests freeze

Edit: I have found that I made a mistake; the cause of the error wasn't urllib but NLTK, which wasn't able to process a long string that came from this exact page. Sorry for this one.
I do not know why, but this happens no matter whether I use urllib2.urlopen or requests when I come across a specific URL.
import requests

r = requests.get('SomeURL')
html = r.text
print html
Here is its behavior.
1) When I go through a loop of 200 URLs, it freezes every time at exactly the same URL. It stays there for hours if I do not terminate the program.
2) When I try just this example of the code outside of the loop, it works.
3) If I blacklist just this URL, it goes through the loop without problems.
It doesn't actually return any kind of error code, it works fine outside the loop, and a timeout is set but it doesn't do anything; the request still hangs for an indefinite time.
So is there any other way to forcefully stop the HTTP GET request after a certain time, since the timeout doesn't work? Is there any other library besides urllib2 and requests that could do the job and that respects timeout limits?
for i in range(0, mincount):
    code(call the request for urlist[i])
    #end
It always works but freezes only when I request this site. If I had 200 requests to Yahoo, for example, it would work. But when I try to go to this particular URL I cannot.
Edit: It is a standard for loop and there is not much room for error.
I think it's simply a very slow page; on my system, it takes about 9.7s to load.
If you are trying to run it in a short loop, it would indeed seem to freeze.
You could try something like
import requests

links = [
    'SomeURL',
    'http://www.google.com/'
]

for link in links:
    try:
        html = requests.get(link, timeout=2.).content
        print("Successfully loaded {}".format(link))
    except requests.Timeout:
        print("Timed out loading {}".format(link))
which gave me
Timed out loading SomeURL
Successfully loaded http://www.google.com/

urllib2 giving Network is unreachable error even after setting http_proxy in bash

I am trying to test my Google App Engine (Python) app locally. I need to do some URL fetching; I tried, but the following error message is displayed:
"urllib2.URLError: <urlopen error [Errno 101] Network is unreachable>"
So I tried to check whether deployment happens at all. It also resulted in the same error.
And then I tried in Python shell:
>>> import urllib2
>>> a = urllib2.urlopen("http://google.com")
>>> a.code
200
>>> a.readlines
<addinfourl at 155594924 whose fp = <socket._fileobject object at 0x9443d6c>>
Though the response code is 200, if I do a.readlines I don't get the actual HTML. (Isn't a.readlines supposed to output the HTML?)
Before trying the above I had my http_proxy variable set in the environment. I even tried urllib2.install_opener(ProxyConfiguredOpener), and it still doesn't work.
I can't do any urllib2 URL opens, and hence I can't work with a lot of tools, like Google App Engine, which uses urllib2 for deployment. Can anybody tell me what is wrong?
a.code == 200 means that urllib2.urlopen() succeeded while running in the Python shell.
When running on App Engine, urllib2.urlopen() uses google.appengine.api.urlfetch().
If you run it locally it should use your local network configuration. The bug "urlfetch cannot be used behind a proxy" has been fixed.
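For the plain Python shell case, explicitly installing a proxy opener (instead of relying on the http_proxy environment variable) looks roughly like this; the proxy address is a placeholder:
import urllib2

# Placeholder proxy address; substitute your actual proxy host and port
proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

a = urllib2.urlopen("http://google.com")
print a.code          # expect 200
print a.read()[:200]  # the actual HTML, read from the response object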

Python/Django "BadStatusLine" error

I'm getting a weird error that I can't seem to find a solution for.
This error does not occur every time I hit this segment of code, and it doesn't happen on the same iteration through the loop either (it happens inside a loop). If I run it enough times, it sometimes gets through without the error and the program executes successfully. Regardless, I'd still like to figure out why this is happening.
Here is my error, versions, trace, etc: http://dpaste.com/681658/
It seems to happen with the following line in my code:
page = urllib2.urlopen(url)
Where url is... a URL, obviously.
And I do have import urllib2 in my code.
The BadStatusLine exception is raised when you call urllib2.urlopen(url) and the remote server responds with a status line that Python cannot understand.
Assuming that you don't control url, you can't prevent this from happening. All you can do is catch the exception and handle it gracefully.
import urllib2
from httplib import BadStatusLine

try:
    page = urllib2.urlopen(url)
    # do something with page
except BadStatusLine:
    print "could not fetch %s" % url
Explanations from other users are right and good, but in practice you may find this useful:
In my experience this usually happens when you are sending unquoted values in the URL parameters, such as values containing spaces or other characters that need to be quoted or URL-encoded.
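For example (the parameter name and value here are made up for illustration):
import urllib
import urllib2

# URL-encode query values before building the URL, so spaces and special
# characters don't end up raw in the request
params = urllib.urlencode({'q': 'value with spaces & symbols'})
url = 'http://example.com/search?' + params

page = urllib2.urlopen(url)
print page.code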
This doesn't have anything to do with Django; it's an exception thrown by urllib2, which couldn't parse the response after fetching your URL. It may be a network issue, a malformed response… Some servers / applications throw this kind of error randomly. If you don't control what this URL returns, you're left with catching the exception, debugging which URLs are causing problems, and trying to identify a pattern.
