I have the following script that grabs a repository from GitHub using PyGithub:
import logging
import getpass
import os
from github import Github, Repository as Repository, UnknownObjectException
GITHUB_URL = 'https://github.firstrepublic.com/api/v3'
if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    logging.debug('validating GH token')
    simpleuser = getpass.getuser().replace('adm_', '')
    os.path.exists(os.path.join(os.path.expanduser('~' + getpass.getuser()) + '/.ssh/github-' + simpleuser + '.token'))
    with open(os.path.join(os.path.expanduser('~' + getpass.getuser()) + '/.ssh/github-' + simpleuser + '.token'), 'r') as token_file:
        github_token = token_file.read()
    logging.debug(f'Token after file processing: {github_token}')
    logging.debug('initializing github')
    g = Github(base_url=GITHUB_URL, login_or_token=github_token)
    logging.debug("attempting to get repository")
    source_repo = g.get_repo('CLOUD/iam')
Works just fine in Python 3.9.1 on my Mac.
In production, we have RHEL7, Python 3.6.8 (can't upgrade it, don't suggest it). This is where it blows up:
(virt) user#lmachine: directory$ python3 test3.py -r ORG/repo_name -d
DEBUG:root:validating GH token
DEBUG:root:Token after file processing: <properly_formed_token>
DEBUG:root:initializing github
DEBUG:root:attempting to get repository
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): <domain>:443
Traceback (most recent call last):
File "test3.py", line 68, in <module>
source_repo = g.get_repo(args.repo)
File "/home/adm_gciesla/virt/lib/python3.6/site-packages/github/MainClass.py", line 348, in get_repo
"GET", "%s%s" % (url_base, full_name_or_id)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 319, in requestJsonAndCheck
verb, url, parameters, headers, input, self.__customConnection(url)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 410, in requestJson
return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 487, in __requestEncode
cnx, verb, url, requestHeaders, encoded_input
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 513, in __requestRaw
response = cnx.getresponse()
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 116, in getresponse
allow_redirects=False,
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
return self.request('GET', url, **kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/user/virt/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/home/user/virt/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib64/python3.6/http/client.py", line 1254, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib64/python3.6/http/client.py", line 1295, in _send_request
self.putheader(hdr, value)
File "/usr/lib64/python3.6/http/client.py", line 1232, in putheader
raise ValueError('Invalid header value %r' % (values[i],))
ValueError: Invalid header value b'token <properly_formed_token>\n'
The script is a stripped-down version of a larger application. I've tried rolling back to earlier versions of PyGithub (that's really all I have control over in prod), but I get the same error regardless. PyGithub's latest release claims Python >=3.6 should work.
I've really run the gamut of debugging. Reading the token from an environment variable can work sometimes, but the script needs to be able to use whatever credentials are available; passing in the token as an argument is only for running locally.
Hopefully someone out there has seen something similar.
We just figured it out. Apparently, even though there appears to be no newline in the .token file, there is one in the string returned by token_file.read().
Changing github_token = token_file.read() to github_token = token_file.read().strip() fixes the problem.
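For reference, a minimal sketch of the corrected read in context (token_path just names the question's path expression for readability):

token_path = os.path.join(os.path.expanduser('~' + getpass.getuser()), '.ssh', 'github-' + simpleuser + '.token')
with open(token_path, 'r') as token_file:
    # strip() removes the trailing newline/whitespace that read() otherwise preserves
    github_token = token_file.read().strip()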
I'm pretty new to HTTP stuff; I primarily stick to iOS, so please bear with me.
I'm using the httpx Python library to try to send a notification to an iPhone, because I have to make an HTTP/2 POST to do so. Apple's documentation says the request requires ":method" and ":path" headers, but when I try to make the POST with these headers included,
headers = {
    ':method': 'POST',
    ':path': '/3/device/{}'.format(deviceToken),
    ...
}
I get the error
h2.exceptions.ProtocolError: Received duplicate pseudo-header field b':path'
It's pretty apparent there's a problem with having the ":path" header included, but I'm also required to send it, so I'm not sure what I'm doing wrong. Apple's documentation also says to:
Encode the :path and authorization values as literal header fields without indexing.
Encode all other fields as literal header fields with incremental indexing.
I really don't know what that means, how to implement it, or whether it's related. I would think httpx would merge my ":path" header with the default one to eliminate the duplicate, but I'm just spitballing here.
Full Traceback
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpx/_client.py", line 992, in post
return self.request(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpx/_client.py", line 733, in request
return self.send(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpx/_client.py", line 767, in send
response = self._send_handling_auth(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpx/_client.py", line 805, in _send_handling_auth
response = self._send_handling_redirects(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpx/_client.py", line 837, in _send_handling_redirects
response = self._send_single_request(request, timeout)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpx/_client.py", line 861, in _send_single_request
(status_code, headers, stream, ext) = transport.request(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpcore/_sync/connection_pool.py", line 218, in request
response = connection.request(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpcore/_sync/connection.py", line 106, in request
return self.connection.request(method, url, headers, stream, ext)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpcore/_sync/http2.py", line 119, in request
return h2_stream.request(method, url, headers, stream, ext)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpcore/_sync/http2.py", line 292, in request
self.send_headers(method, url, headers, has_body, timeout)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpcore/_sync/http2.py", line 330, in send_headers
self.connection.send_headers(self.stream_id, headers, end_stream, timeout)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/httpcore/_sync/http2.py", line 227, in send_headers
self.h2_state.send_headers(stream_id, headers, end_stream=end_stream)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/h2/connection.py", line 770, in send_headers
frames = stream.send_headers(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/h2/stream.py", line 865, in send_headers
frames = self._build_headers_frames(
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/h2/stream.py", line 1252, in _build_headers_frames
encoded_headers = encoder.encode(headers)
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/hpack/hpack.py", line 249, in encode
for header in headers:
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/h2/utilities.py", line 496, in inner
for header in headers:
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/h2/utilities.py", line 441, in _validate_host_authority_header
for header in headers:
File "/Users/User/.pyenv/versions/3.9.0/lib/python3.9/site-packages/h2/utilities.py", line 338, in _reject_pseudo_header_fields
raise ProtocolError(
h2.exceptions.ProtocolError: Received duplicate pseudo-header field b':method'
Request:
devServer = "https://api.sandbox.push.apple.com:443"
title = "some title"
notification = { "aps": { "alert": title, "sound": "someSound.caf" } }
client = httpx.Client(http2=True)
try:
    r = client.post(devServer, json=notification, headers=headers)
finally:
    client.close()
You just need to append '/3/device/{}'.format(deviceToken) to the devServer URL as the path, and the ":path" pseudo-header will automatically be set to it.
That is,
devServer = 'https://api.sandbox.push.apple.com:443/3/device/{}'.format(deviceToken)
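A minimal sketch of the corrected request, reusing the question's client and notification (the headers dict here is whatever the question already sends, with the ':method' and ':path' entries removed):

devServer = 'https://api.sandbox.push.apple.com:443/3/device/{}'.format(deviceToken)
client = httpx.Client(http2=True)
try:
    # httpx derives the :method, :path, :scheme and :authority pseudo-headers from the request itself
    r = client.post(devServer, json=notification, headers=headers)
finally:
    client.close()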
Explanation:
The ":path", ":method" and ":scheme" pseudo-headers generally shouldn't be added manually in HTTP/2; the client fills them in from the request method and URL.
Reference: Hypertext Transfer Protocol Version 2 (HTTP/2)
I am trying to authenticate with OAuth1 using Requests-OAuthlib and it is failing. I am following the documentation below:
https://requests-oauthlib.readthedocs.io...#oauth-1-0
>>> client_key = 'xxxx'
>>> client_secret = 'xxxx'
>>> callback_uri = 'https://127.0.0.1/callback'
>>> request_token_url='https://rest.immobilienscout24.de/restapi/security/oauth/request_token',
>>> access_token_url='https://rest.immobilienscout24.de/restapi/security/oauth/access_token',
>>> authorize_url='https://rest.immobilienscout24.de/restapi/security/oauth/confirm_access',
>>> oauth_session = OAuth1Session(client_key, client_secret=client_secret, callback_uri=callback_uri)
>>> oauth_session.fetch_request_token(request_token_url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests_oauthlib/oauth1_session.py", line 287, in fetch_request_token
token = self._fetch_token(url, **request_kwargs)
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests_oauthlib/oauth1_session.py", line 365, in _fetch_token
r = self.post(url, **request_kwargs)
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests/sessions.py", line 578, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests/sessions.py", line 516, in request
prep = self.prepare_request(req)
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests/sessions.py", line 459, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests/models.py", line 318, in prepare
self.prepare_auth(auth, url)
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests/models.py", line 549, in prepare_auth
r = auth(self)
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/requests_oauthlib/oauth1_auth.py", line 109, in __call__
unicode(r.url), unicode(r.method), None, r.headers
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/oauthlib/oauth1/rfc5849/__init__.py", line 313, in sign
('oauth_signature', self.get_oauth_signature(request)))
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/oauthlib/oauth1/rfc5849/__init__.py", line 136, in get_oauth_signature
normalized_uri = signature.base_string_uri(uri, headers.get('Host', None))
File "/Users/desktop/Documents/anaconda/anaconda3/envs/py27/lib/python2.7/site-packages/oauthlib/oauth1/rfc5849/signature.py", line 144, in base_string_uri
raise ValueError('uri must include a scheme and netloc')
ValueError: uri must include a scheme and netloc
Any help on how to resolve this?
From my understanding, this error was caused by:
normalized_uri = signature.base_string_uri(uri, headers.get('Host', None))
In this code, uri is None, so it will use Host in headers:
headers.get('Host', None)
However, the Host header contains no scheme; a Host value looks like:
www.google.com
There is no https:// in Host. You may need to report a bug to the library.
There is another library (I'm the author) that shares a similar API with requests-oauthlib; you can check: https://docs.authlib.org/en/latest/client/oauth1.html
from authlib.integrations.requests_client import OAuth1Session
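Continuing from that import, a sketch based on the Authlib documentation, reusing the question's credentials and endpoint (note there is no trailing comma on the URL assignment; in Python a trailing comma would turn it into a tuple):

client_key = 'xxxx'
client_secret = 'xxxx'
request_token_url = 'https://rest.immobilienscout24.de/restapi/security/oauth/request_token'
session = OAuth1Session(client_key, client_secret, redirect_uri='https://127.0.0.1/callback')
request_token = session.fetch_request_token(request_token_url)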
I am having an issue creating a Reddit session via my script for a bot. I have installed praw via pip and have created a praw.ini file in the same directory as my bot script:
[DEFAULT]
# A boolean to indicate whether or not to check for package updates.
check_for_updates=True
# Object to kind mappings
comment_kind=t1
message_kind=t4
redditor_kind=t2
submission_kind=t3
subreddit_kind=t5
# The URL prefix for OAuth-related requests.
oauth_url=https://oauth.reddit.com
# The URL prefix for regular requests.
reddit_url=https://www.reddit.com
# The URL prefix for short URLs.
short_url=https://redd.it
[bot1]
client_id=clientId
client_secret=clientSecret
password=myPassword
username=myUsername
user_agent=My bot description
I have verified the praw.ini file is using the correct client ID/secret. I've also upgraded to Python 2.7.14 to see if that resolves anything, but when I run the following script:
import praw
reddit = praw.Reddit('bot1')
print(reddit.user.me())
I receive the following error:
Traceback (most recent call last):
File "myBot.py", line 21, in <module>
print(reddit.user.me())
File "c:\Python27\lib\site-packages\praw\models\user.py", line 60, in me
user_data = self._reddit.get(API_PATH['me'])
File "c:\Python27\lib\site-packages\praw\reddit.py", line 367, in get
data = self.request('GET', path, params=params)
File "c:\Python27\lib\site-packages\praw\reddit.py", line 472, in request
params=params)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 181, in request
params=params, url=url)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 124, in _request_with_retries
retries, saved_exception, url)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 90, in _do_retry
params=params, url=url, retries=retries - 1)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 124, in _request_with_retries
retries, saved_exception, url)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 90, in _do_retry
params=params, url=url, retries=retries - 1)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 112, in _request_with_retries
data, files, json, method, params, retries, url)
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 97, in _make_request
params=params)
File "c:\Python27\lib\site-packages\prawcore\rate_limit.py", line 32, in call
kwargs['headers'] = set_header_callback()
File "c:\Python27\lib\site-packages\prawcore\sessions.py", line 141, in _set_header_callback
self._authorizer.refresh()
File "c:\Python27\lib\site-packages\prawcore\auth.py", line 328, in refresh
password=self._password)
File "c:\Python27\lib\site-packages\prawcore\auth.py", line 138, in _request_token
response = self._authenticator._post(url, **data)
File "c:\Python27\lib\site-packages\prawcore\auth.py", line 29, in _post
data=sorted(data.items()))
File "c:\Python27\lib\site-packages\prawcore\requestor.py", line 49, in request
raise RequestException(exc, args, kwargs)
prawcore.exceptions.RequestException: error with request ("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",)
Posts on Stack Overflow that are seemingly related indicate that it is a problem with the authentication for my script, but after verifying that I'm using the correct credentials and regenerating the client ID and secret I'm still not getting past this. Does anyone have any ideas?
Seems to have been an issue with my installation of Python. I fixed this by running pip install python-certifi-win32.
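If you want to confirm the certificate fix took effect before rerunning the bot, a plain requests call against the same host is a quick, illustrative check (not part of PRAW itself):

import requests
# before the fix this raised 'certificate verify failed'; afterwards it should return a response
requests.get('https://www.reddit.com')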
When I run this script, IDLE does not continue; usually it will give an error. Other scripts run fine, so I know it's not IDLE. I thought my code was correct, but maybe I've missed something. This is not everything I will be scraping from the site; I just wanted to see this work first, then I can run through everything later.
import csv
import requests
import os
##HOME TEAM
req = requests.get('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
data = req.json()
my_data = []
pk = data['resultSets']
for item in data:
    team = item.get['rowSet']
    for item in team:
        Team_Id = item[0]
        Team_Name = item[1]
    my_data.append([Team_Id, Team_Name])
headers = ["Team_Id", "Team_Name"]
with open("NBA_Home_Team.csv", "a", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(my_data)
f.close()
##os.system("taskkill /f /im pythonw.exe")
Seems like it hangs because the server doesn't respond. It can be verified by killing the process and checking the stack trace:
Traceback (most recent call last):
req = requests.get('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 502, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 612, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1121, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 438, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 394, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize) <-- we're stuck here
KeyboardInterrupt
I tried to open the URL in my browser and it worked fine; I received a response within a second. Then I started to tweak the request in the code to mimic a valid browser. My first idea was using a valid User-Agent, and I immediately received a response with the following code:
data = requests.get(
'http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=',
headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X) AppleWebKit/602.1.38 (KHTML, like Gecko) Version/10.0 Mobile/14A300 Safari/602.1'},
).json()
Perhaps some kind of defense mechanism against bots causes the unresponsiveness when no valid User-Agent is sent.
Other notes on the code snippet:
for item in data:
Use pk instead of data.
team = item.get['rowSet']
Use item['rowSet'] or item.get('rowSet'), but do not mix them: item.get is a method, so [] cannot be applied to it.
my_data.append([Team_Id, Team_Name])
The indentation should be the same as the line above it, so the append runs inside the inner loop, once per row.
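Putting those notes together, a corrected sketch of the loop (assuming each entry in resultSets carries the rowSet field the question expects):

my_data = []
for item in data['resultSets']:           # iterate the result sets, not the raw response dict
    for row in item.get('rowSet', []):    # call .get() with parentheses; don't subscript it
        my_data.append([row[0], row[1]])  # append inside the inner loop, once per row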
I have a bit of code that uses newspaper to go take a look at various media outlets and download articles from them. This has been working fine for a long time but has recently started acting up. I can see what the problem is, but as I'm new to Python I'm not sure about the best way to address it. Basically (I think) I need to make a modification to keep the occasional malformed web address from crashing the script entirely, and instead allow it to dispense with that web address and move on to the others.
The error originates when I attempt to download an article using:
article.download()
Some articles (they change every day, obviously) will throw the following error, but the script continues to run:
Traceback (most recent call last):
File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
func(*args, **kargs)
File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
html = network.get_html(url, config=self.config)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html
return get_html_2XX_only(url, config, response)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only
url=url, **get_request_kwargs(timeout, useragent))
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request
resp = self.send(prep, **send_kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send
r = adapter.send(request, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send
timeout=timeout
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Anaconda3\lib\http\client.py", line 1107, in request
self._send_request(method, url, body, headers)
File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request
self.endheaders(body)
File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders
self._send_output(message_body)
File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output
self.send(msg)
File "C:\Anaconda3\lib\http\client.py", line 877, in send
self.connect()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect
conn = self._new_conn()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
The next bit is supposed to then parse and run natural language processing on each article and write certain elements to a dataframe so I then have:
for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        article.nlp()
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title, article.summary, article.keywords, article.url]
        i += 1
(This might be a little sloppy too but that's a problem for another day)
This runs fine until it gets to one of those URLs with the error and then tosses an article exception and the script crashes:
C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
warnings.warn(str(msg))
ArticleException Traceback (most recent call last) <ipython-input-17-2106485c4bbb> in <module>()
4 for paper in papers:
5 for article in paper.articles:
----> 6 article.parse()
7 print(article.title)
8 article.nlp()
C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
183
184 def parse(self):
--> 185 self.throw_if_not_downloaded_verbose()
186
187 self.doc = self.config.get_parser().fromstring(self.html)
C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
519 if self.download_state == ArticleDownloadState.NOT_STARTED:
520 print('You must `download()` an article first!')
--> 521 raise ArticleException()
522 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
523 print('Article `download()` failed with %s on URL %s' %
ArticleException:
So what's the best way to keep this from terminating my script? Should I address it in the download stage where I'm getting the unicode error or at the parse stage by telling it to overlook those bad addresses? And how would I go about implementing that correction?
Really appreciate any advice.
I had the same issue, and although using a bare except: pass is generally not recommended, the following worked for me:
try:
    a.parse()
    file.write(a.title + '\n')
except:
    pass
What I've found is that Navid is correct for this exact problem.
However, .parse() is only one of the functions that can trip you up. I wrap all the calls inside a try / except structure like this:
word_list = []
for words in google_news.articles:
    try:
        words.download()
        words.parse()
        words.nlp()
    except:
        pass
    word_list.append(words.keywords)
You can try catching the ArticleException. Don't forget to import the newspaper module.
import newspaper

try:
    article.download()
    article.parse()
except newspaper.article.ArticleException:
    # do something, e.g. log the failed URL, then move on
    pass
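Applied to the question's loop, a hedged sketch (papers, stories, and i are the question's own names; the continue on the exception is the illustrative part):

import newspaper

for paper in papers:
    for article in paper.articles:
        try:
            article.parse()
            article.nlp()
        except newspaper.article.ArticleException:
            continue  # skip articles whose download() failed, e.g. the idna 'label too long' URLs
        # ... write to the stories dataframe exactly as before ...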