Handling Article Exceptions in Newspaper - python

I have a bit of code that uses newspaper to go take a look at various media outlets and download articles from them. This has been working fine for a long time but has recently started acting up. I can see what the problem is but as I'm new to Python I'm not sure about the best way to address it. Basically (I think) I need to make a modification to keep the occasional malformed web address from crashing the script entirely and instead allow it to dispense with that web address and move on to the others.
The origin of the error is when I attempt to download an article using:
article.download()
Some articles (they change every day obviously) will throw the following error but the script continues to run:
Traceback (most recent call last):
File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
func(*args, **kargs)
File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
html = network.get_html(url, config=self.config)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html return get_html_2XX_only(url, config, response)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only url=url, **get_request_kwargs(timeout, useragent))
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get return request('get', url, params=params, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request return session.request(method=method, url=url, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request resp = self.send(prep, **send_kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send r = adapter.send(request, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send timeout=timeout
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen chunked=chunked)
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request conn.request(method, url, **httplib_request_kw)
File "C:\Anaconda3\lib\http\client.py", line 1107, in request self._send_request(method, url, body, headers)
File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request self.endheaders(body)
File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders self._send_output(message_body)
File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output self.send(msg)
File "C:\Anaconda3\lib\http\client.py", line 877, in send self.connect()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect conn = self._new_conn()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn (self.host, self.port), self.timeout, **extra_kw)
File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
The next bit is supposed to parse each article, run natural language processing on it, and write certain elements to a dataframe, so I then have:
for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        article.nlp()
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title, article.summary, article.keywords, article.url]
        i += 1
(This might be a little sloppy too but that's a problem for another day)
This runs fine until it gets to one of the URLs with that error, at which point it throws an ArticleException and the script crashes:
C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
warnings.warn(str(msg))
ArticleException Traceback (most recent call last) <ipython-input-17-2106485c4bbb> in <module>()
4 for paper in papers:
5 for article in paper.articles:
----> 6 article.parse()
7 print(article.title)
8 article.nlp()
C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
183
184 def parse(self):
--> 185 self.throw_if_not_downloaded_verbose()
186
187 self.doc = self.config.get_parser().fromstring(self.html)
C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
519 if self.download_state == ArticleDownloadState.NOT_STARTED:
520 print('You must `download()` an article first!')
--> 521 raise ArticleException()
522 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
523 print('Article `download()` failed with %s on URL %s' %
ArticleException:
So what's the best way to keep this from terminating my script? Should I address it in the download stage where I'm getting the unicode error or at the parse stage by telling it to overlook those bad addresses? And how would I go about implementing that correction?
Really appreciate any advice.

I had the same issue and although in general using except: pass is not recommended, the following worked for me:
try:
    a.parse()
    file.write(a.title + '\n')
except:
    pass

What I've found is that Navid is correct for this exact problem.
However, .parse() is only one of the functions that can trip you up, so I wrap all the calls inside a try/except structure like this:
word_list = []
for words in google_news.articles:
    try:
        words.download()
        words.parse()
        words.nlp()
    except:
        pass
    word_list.append(words.keywords)

You can try catching the ArticleException. Don't forget to import the newspaper module.
try:
    article.download()
    article.parse()
except newspaper.article.ArticleException:
    # do something
    pass
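As a sketch, folding that into the loop from the question might look like the following (assuming papers, stories, and i are set up as in the question; the continue skips any article whose download failed and moves on to the rest):
import newspaper
from datetime import datetime

for paper in papers:
    for article in paper.articles:
        try:
            article.parse()
            article.nlp()
        except newspaper.article.ArticleException:
            # download() failed earlier (e.g. the idna/UnicodeError on a malformed URL),
            # so skip this article and carry on with the others
            continue
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title,
                          article.summary, article.keywords, article.url]
        i += 1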

Related

PyGithub and Python 3.6

I have the following script that grabs a repository from GitHub using PyGithub:
import logging
import getpass
import os
from github import Github, Repository as Repository, UnknownObjectException

GITHUB_URL = 'https://github.firstrepublic.com/api/v3'

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    logging.debug('validating GH token')
    simpleuser = getpass.getuser().replace('adm_', '')
    os.path.exists(os.path.join(os.path.expanduser('~' + getpass.getuser()) + '/.ssh/github-' + simpleuser + '.token'))
    with open(os.path.join(os.path.expanduser('~' + getpass.getuser()) + '/.ssh/github-' + simpleuser + '.token'), 'r') as token_file:
        github_token = token_file.read()
    logging.debug(f'Token after file processing: {github_token}')
    logging.debug('initializing github')
    g = Github(base_url=GITHUB_URL, login_or_token=github_token)
    logging.debug("attempting to get repository")
    source_repo = g.get_repo('CLOUD/iam')
Works just fine in Python 3.9.1 on my Mac.
In production, we have RHEL7, Python 3.6.8 (can't upgrade it, don't suggest it). This is where it blows up:
(virt) user@lmachine: directory$ python3 test3.py -r ORG/repo_name -d
DEBUG:root:validating GH token
DEBUG:root:Token after file processing: <properly_formed_token>
DEBUG:root:initializing github
DEBUG:root:attempting to get repository
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): <domain>:443
Traceback (most recent call last):
File "test3.py", line 68, in <module>
source_repo = g.get_repo(args.repo)
File "/home/adm_gciesla/virt/lib/python3.6/site-packages/github/MainClass.py", line 348, in get_repo
"GET", "%s%s" % (url_base, full_name_or_id)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 319, in requestJsonAndCheck
verb, url, parameters, headers, input, self.__customConnection(url)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 410, in requestJson
return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 487, in __requestEncode
cnx, verb, url, requestHeaders, encoded_input
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 513, in __requestRaw
response = cnx.getresponse()
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 116, in getresponse
allow_redirects=False,
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
return self.request('GET', url, **kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/user/virt/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/home/user/virt/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib64/python3.6/http/client.py", line 1254, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib64/python3.6/http/client.py", line 1295, in _send_request
self.putheader(hdr, value)
File "/usr/lib64/python3.6/http/client.py", line 1232, in putheader
raise ValueError('Invalid header value %r' % (values[i],))
ValueError: Invalid header value b'token <properly_formed_token>\n'
The script is a stripped-down version of a larger application. I've tried rolling back to earlier versions of PyGithub, since that's really all I have control over in prod; same error regardless. PyGithub's latest release claims Python >=3.6 should work.
I've really run the gamut of debugging. Seems like reading from environment variables can work sometimes, but the script needs to be able to use whatever credentials are available. Passing in the token as an argument is only for running locally.
Hopefully someone out there has seen something similar.
We just figured it out. Apparently, even though there's no newline in the .token file, there is one in the string returned by token_file.read().
Changing github_token = token_file.read() to github_token = token_file.read().strip() fixes the problem.
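As a minimal sketch, the token-reading block then becomes (token_path is a placeholder for the full os.path.join(...) expression from the script above):
# .strip() removes the trailing newline that read() leaves in the token,
# which would otherwise end up inside the Authorization header
with open(token_path, 'r') as token_file:
    github_token = token_file.read().strip()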

problems downloading large files with requests?

I'm trying to download a video file using an API. The equivalent curl command works without problem, and the Python code below works without error for small videos:
with requests.get("http://username:password#url/Download/", data=data, stream=True) as r:
    r.raise_for_status()
    with open("deliverables/video_output34.mp4", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
It fails for large videos (it failed for a video of ~34 MB), while the equivalent curl command works for that one too:
Traceback (most recent call last):
File "/home/nabil/.local/lib/python3.7/site-packages/requests/adapters.py", line 479, in send
r = low_conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/nabil/.local/lib/python3.7/site-packages/requests/adapters.py", line 482, in send
r = low_conn.getresponse()
File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nabil/.local/lib/python3.7/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/nabil/.local/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/nabil/.local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/nabil/.local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/home/nabil/.local/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: Remote end closed connection without response
I've checked links like the following without success
Thanks to SilentGhost on IRC #python, who pointed me to this and suggested I should upgrade requests, which solved it (from 2.22.0 to 2.24.0).
upgrading the package is done like this:
pip install requests --upgrade
Another source that may help someone looking at this question is pycurl; here is a good starting point: https://github.com/rajatkhanduja/PyCurl-Downloader
You can also pass --libcurl to your curl command to get a good indication of how to use pycurl.
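For reference, a minimal pycurl download sketch might look like this (the URL and output path are placeholders, and error handling is omitted):
import pycurl

# stream the response body straight to a file, much like curl -o does
with open("deliverables/video_output.mp4", "wb") as f:
    c = pycurl.Curl()
    c.setopt(c.URL, "http://example.com/Download/")
    c.setopt(c.WRITEDATA, f)          # write the response body to the open file
    c.setopt(c.FOLLOWLOCATION, True)  # follow redirects, like curl -L
    c.perform()
    c.close()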

Trying to test some url addresses is working or not with python request but getting errors

I'm trying to learn to test some internet addresses with Python requests, expecting certain outputs (like 200 or 404), but I get errors that I couldn't figure out. I'm also open to any advice for my purpose.
import os, sys, requests
from multiprocessing import Pool

def url_check(url):
    resp = requests.get(url)
    print(resp.status_code)
with Pool(4) as p:
print(p.map(url_check, [ "https://api.github.com​", "​http://bilgisayar.mu.edu.tr/​", "​https://www.python.org/​", "http://akrepnalan.com/ceng2034​", "https://github.com/caesarsalad/wow​" ]))
Output of the code with errors:
404
404
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "ödev_deneme.py", line 6, in url_check
resp = requests.get(url)
File "/home/efe/.local/lib/python3.6/site-packages/requests/api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "/home/efe/.local/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/efe/.local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/efe/.local/lib/python3.6/site-packages/requests/sessions.py", line 637, in send
adapter = self.get_adapter(url=request.url)
File "/home/efe/.local/lib/python3.6/site-packages/requests/sessions.py", line 728, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for '\u200bhttps://www.python.org/\u200b'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "ödev_deneme.py", line 10, in <module>
print(p.map(url_check, [ "https://api.github.com​", "​http://bilgisayar.mu.edu.tr/​", "​https://www.python.org/​", "http://akrepnalan.com/ceng2034​", "https://github.com/caesarsalad/wow​" ]))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
requests.exceptions.InvalidSchema: No connection adapters were found for '\u200bhttps://www.python.org/\u200b'
My expecting output must be like this:
200
200
200
404
200
There is a 404 on the fourth line because the fourth URL is not working. But in my output there are already 404s on the first two lines, so there is a huge mistake in my code, I guess.
The problem is that some of the urls include invisible ZERO WIDTH SPACE characters ('\u200b').
You can replace them with an empty string:
def url_check(url):
    resp = requests.get(url.replace('\u200b', ''))
    print(resp.status_code)
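Alternatively, as a sketch, you could clean the whole list once before handing it to Pool.map (reusing url_check and Pool from the question; dirty_urls stands in for the original list containing the stray characters):
# strip zero-width spaces and ordinary surrounding whitespace from each URL up front
clean_urls = [u.replace('\u200b', '').strip() for u in dirty_urls]

with Pool(4) as p:
    print(p.map(url_check, clean_urls))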

Best way to use Python and Bit Bucket

I am having problems with Python and Bitbucket: displaying, pulling, pushing, doing anything really.
I am looking at two different libs, atlassian-python-api and stashy, and both seem to have problems. My code is very simple:
from atlassian import Bitbucket
import getpass

username = input("What is your username: ")
password = getpass.getpass(prompt="Enter your password?: ")

bitbucket = Bitbucket(
    url="https://website.com:port/projects/demo_projects/repppos/",
    username=username,
    password=password)

data = bitbucket.project_list()
Both libraries (stashy and atlassian-python-api) give me this error. I heard someone suggest using the REST API, but I have no experience with that:
Traceback (most recent call last):
File "C:/Users/User/PycharmProjects/ProjectName/terrafw_gui/test_no_gui.py", line 12, in <module>
data = bitbucket.project_list()
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\atlassian\bitbucket.py", line 22, in project_list
return (self.get('rest/api/1.0/projects', params=params) or {}).get('values')
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\atlassian\rest_client.py", line 208, in get
trailing=trailing)
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\atlassian\rest_client.py", line 151, in request
files=files
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\requests\sessions.py", line 279, in request
resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, cert=cert, proxies=proxies)
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\requests\sessions.py", line 374, in send
r = adapter.send(request, **kwargs)
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\requests\adapters.py", line 174, in send
timeout=timeout
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 417, in urlopen
conn = self._get_conn(timeout=pool_timeout)
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 232, in _get_conn
return conn or self._new_conn()
File "C:\Users\User\PycharmProjects\ProjectName\venv\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 547, in _new_conn
strict=self.strict)
TypeError: __init__() got an unexpected keyword argument 'strict'
Process finished with exit code 1
I cannot figure out how or why I am getting these error messages without an attempted connection (the error appears immediately, with no time spent waiting for a timeout).
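For what it's worth, the traceback shows atlassian-python-api ultimately calling rest/api/1.0/projects, so a minimal sketch of hitting the Bitbucket Server REST API directly with requests might look like this (host, port, and credentials are placeholders taken from the question):
import getpass
import requests

username = input("What is your username: ")
password = getpass.getpass(prompt="Enter your password?: ")

# list projects via the same REST endpoint the library calls under the hood
resp = requests.get("https://website.com:port/rest/api/1.0/projects",
                    auth=(username, password))
resp.raise_for_status()
print(resp.json().get("values"))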

Unknown error in sixohsix's Twitter API for Python

I have a small script that repeatedly (hourly) fetches tweets from the API, using sixohsix's
Twitter Wrapper for Python. I am successful with handling most, if not all of the errors coming from the Twitter API, i.e. all the 5xx and 4xx stuff.
Nonetheless, I randomly observe the error traceback below (only once in 2-3 days): the program exits and the traceback is displayed in the shell. I have no clue what this could mean, but I think it is not directly related to what my script does, since it has proved itself to run correctly most of the time.
This is where I call a function of the wrapper in my script:
KW = {
    'count': 200,            # number of tweets to fetch (fetch maximum)
    'user_id': tweeter['user_id'],
    'include_rts': 'false',  # do not include native RT's
    'trim_user': 'true',
}
timeline = tw.twitter_request(tw_endpoint,
                              tw_endpoint.statuses.user_timeline, KW)
The function tw.twitter_request(tw_endpoint, tw_endpoint.statuses.user_timeline, KW) basically does return tw_endpoint.statuses_user_timeline(**args), where args translates to KW, and tw_endpoint is an OAuth-authorized endpoint obtained from the sixohsix library's
return twitter.Twitter(domain='api.twitter.com', api_version='1.1',
                       auth=twitter.oauth.OAuth(access_token, access_token_secret,
                                                consumer_key, consumer_secret))
This is the traceback:
Traceback (most recent call last):
File "search_twitter_entities.py", line 166, in <module>
tw_endpoint.statuses.user_timeline, KW)
File "/home/tg/mild/twitter_utils.py", line 171, in twitter_request
return twitter_function(**args)
File "build/bdist.linux-x86_64/egg/twitter/api.py", line 173, in __call__
File "build/bdist.linux-x86_64/egg/twitter/api.py", line 177, in _handle_response
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
The only thing I can gather from that traceback is that the error happens somewhere deep inside another Python library and has something to do with an invalid HTTP status coming from the Twitter API or the wrapper... But as I said, maybe some of you could give me a hint on how to debug/solve this, since it is pretty annoying having to regularly check my script and restart it to continue fetching tweets.
EDIT: To clarify this a little, the first two functions in the traceback are already in a try-except block. For example, the try-except block in "twitter_utils.py" filters out 40x and 50x exceptions, but also looks for general exceptions with a bare except:. So what I don't understand is why the error is not getting caught at that position and instead the program is force-closed and a traceback printed. In short, I am in a situation where I cannot catch an error, just like a parse error in a PHP script. So how would I do this?
Perhaps this will point you in the right direction. This is what's being called when BadStatusLine is called upon:
class BadStatusLine(HTTPException):
    def __init__(self, line):
        if not line:
            line = repr(line)
        self.args = line,
        self.line = line
I'm not too familiar with httplib, but if I had to guess, you're getting an empty response/error line and, well, it can't be parsed. There are comments before the line your program is stopping at:
# Presumably, the server closed the connection before
# sending a valid response.
raise BadStatusLine(line)
If Twitter is closing the connection before sending a response, you could simply try again, meaning wrap the call at "search_twitter_entities.py", line 166 in a try/except a couple of times (ugly):
try:
    timeline = tw.twitter_request(tw_endpoint,
                                  tw_endpoint.statuses.user_timeline, KW)
except:
    try:
        timeline = tw.twitter_request(tw_endpoint,
                                      tw_endpoint.statuses.user_timeline, KW)  # try again
    except:
        pass
Or, assuming you can reassign timeline as None each time, do a while loop:
timeline = None
while timeline is None:
    try:
        timeline = tw.twitter_request(tw_endpoint,
                                      tw_endpoint.statuses.user_timeline, KW)
    except:
        pass
I didn't test any of that, so check for bad code.
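A slightly tidier sketch of the same idea is a bounded retry that catches the specific exception rather than everything (assuming the same tw, tw_endpoint and KW as above; BadStatusLine lives in httplib on Python 2, matching the question's traceback):
from httplib import BadStatusLine  # Python 2, as in the question

timeline = None
for attempt in range(3):  # give up after a few tries
    try:
        timeline = tw.twitter_request(tw_endpoint,
                                      tw_endpoint.statuses.user_timeline, KW)
        break  # got a response, stop retrying
    except BadStatusLine:
        # server closed the connection before sending a valid response; try again
        continue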
