I want to scrape "www.naver.com", so I tried scraping it through the Open API.
I wrote the following code:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
defaultURL = 'http://openapi.naver.com/search?&'
key = 'key=keyvalue'
target='&target=news'
sort='&sort=sim'
start='&start=1'
display='&display=100'
query='&query='+urllib.parse.quote_plus(str(input("write:")))
fullURL=defaultURL+key+target+sort+start+display+query
print(fullURL)
file=open("C:\\Users\\kimty\\Desktop\\k\\python\\N\\naver_news.txt","w",encoding='utf-8')
f=urllib.request.urlopen(fullURL)
resultXML=f.read()
xmlsoup=BeautifulSoup(resultXML,'html.parser')
items = xmlsoup.find_all('item')  # each <item> element is one news result
for item in items:
    file.write('---------------------------------------\n')
    file.write('title : ' + item.title.get_text(strip=True) + '\n')
    file.write('contents : ' + item.description.get_text(strip=True) + '\n')
    file.write('\n')
file.close()
But the Python shell only shows this:
============= RESTART: C:\Users\kimty\Desktop\kpython\N\N.py =============
write:lee
http://openapi.naver.com/search?&key=keyvalue&target=news&sort=sim&start=1&display=100&query=lee
Traceback (most recent call last):
File "C:\Users\kimty\Desktop\k\python\N\N.py", line 19, in <module>
f=urllib.request.urlopen(fullURL)
File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "C:\Python34\lib\urllib\request.py", line 464, in open
response = self._open(req, data)
File "C:\Python34\lib\urllib\request.py", line 482, in _open
'_open', req)
File "C:\Python34\lib\urllib\request.py", line 442, in _call_chain
result = func(*args)
File "C:\Python34\lib\urllib\request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Python34\lib\urllib\request.py", line 1186, in do_open
r = h.getresponse()
File "C:\Python34\lib\http\client.py", line 1227, in getresponse
response.begin()
File "C:\Python34\lib\http\client.py", line 386, in begin
version, status, reason = self._read_status()
File "C:\Python34\lib\http\client.py", line 356, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
Why is this happening? What is the Python shell telling me?
I am using Windows 8.1 64-bit and Python 3.4.4.
http.client.BadStatusLine is a subclass of http.client.HTTPException. The server sent back a response urllib could not parse (here the status line is empty), which suggests the request itself is being rejected; maybe your API key is wrong. If I try to access the link with my browser it also gives me an error.
This is the exact address you tried to request.
Edit
Some people have fixed this error by explicitly importing the http library.
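That will not fix the underlying problem, but if you want the script to report it instead of dying with a traceback, you can catch the exception yourself. A minimal sketch, assuming fullURL has been built as in the question:
import http.client
import urllib.request

try:
    f = urllib.request.urlopen(fullURL)
except http.client.HTTPException as e:
    # BadStatusLine is a subclass of HTTPException, so it lands here
    print('The server sent a response urllib could not parse:', repr(e))
else:
    resultXML = f.read()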
I am trying to download a dataset from a web API for a work project that requires Python. I am using Python 3.4 and the urllib library to open the request. This does not work:
from urllib import request
r = request.urlopen(SOME_URL)
This gives the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "C:\Anaconda3\lib\urllib\request.py", line 463, in open
response = self._open(req, data)
File "C:\Anaconda3\lib\urllib\request.py", line 481, in _open
'_open', req)
File "C:\Anaconda3\lib\urllib\request.py", line 441, in _call_chain
result = func(*args)
File "C:\Anaconda3\lib\urllib\request.py", line 1210, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Anaconda3\lib\urllib\request.py", line 1184, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time,
or established connection failed because connected host has failed to respond>
But when I use RStudio with the same URL, it works:
dt = read.csv(SOME_URL)
This gives me the exact dataset I want.
For the project we want to keep a unified tech stack (Python only throughout the process). Does anyone have an idea why the URL can be opened in R but not in Python? Is there any special set-up I need to configure for Python?
Thanks
The following should do the job (urllib2 is Python 2 only; in Python 3 its functionality lives in urllib.request):
urllib.request.urlopen(SOME_URL).read()
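One thing worth checking when R on the same machine can reach the URL but Python cannot: whether your traffic has to go through a proxy that Python is not picking up. A minimal sketch, with a hypothetical proxy address:
import urllib.request

# Hypothetical proxy address; replace with whatever your network actually uses.
proxy = urllib.request.ProxyHandler({
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
})
opener = urllib.request.build_opener(proxy)
data = opener.open(SOME_URL).read()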
I am using the common_crawl_index library for Python from here to get some data from S3.
When I run the command cci_lookup com.abc to fetch the data, it raises the error below.
I still get the output, which is the list of URLs for the lookup domain, but I do not know why the error happens.
MainThread:2014-12-19 01:48:16,150:ERROR:utils:224 Caught exception reading instance data
Traceback (most recent call last):
File "/home/deploy/anaconda/lib/python2.7/site-packages/boto/utils.py", line 211, in retry_url
r = opener.open(req)
File "/home/deploy/anaconda/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/home/deploy/anaconda/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/home/deploy/anaconda/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/home/deploy/anaconda/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/home/deploy/anaconda/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
com.abc.www/forum/index.php/groupcp.php:http
com.abc.www/forum/index.php/includes/MLOtvHD:http
com.abc.www/forum/index.php/includes/blog.php:http
com.abc.www/forum/index.php/includes/content.php:http
com.abc.www/forum/index.php/includes/index.php:http
I hope someone can help!
I'm running many Celery tasks (20,000) using gevent for the pool (and monkey patching everything). Each of these tasks hits third-party services such as AdWords to pull data.
I keep having tasks fail because of underlying SSL errors. Below are the stack traces from a few of the exceptions (in no particular order; these are failures from separate tasks). I also get WantWriteError and ZeroReturnError occasionally, but the EOF error seems to come up the most.
These errors happen while using different client libraries such as googleads (the suds library for SOAP communication) as well as requests and elasticsearch, so I'm guessing some of these libraries use urllib3 while others use urllib2, etc.
There has been a lot of information about the EOF issue and forcing TLSv1, but I can't seem to find a resolution that works.
I'm not sure whether I'm running too many requests at once, whether something is blocking, or what; any help would be greatly appreciated, I'm pulling my hair out over this one.
Traceback (most recent call last):
...
File "/srv/reporting/src/reporting/stats/adwords/client.py", line 58, in _awql_report
downloader = self._get_client(client_id).GetReportDownloader(version=self.REPORT_DOWNLOADER_VERSION)
File "/usr/local/lib/python2.7/dist-packages/googleads/adwords.py", line 283, in GetReportDownloader
return ReportDownloader(self, version, server)
File "/usr/local/lib/python2.7/dist-packages/googleads/adwords.py", line 400, in __init__
proxy=proxy_option, cache=self._adwords_client.cache).wsdl.schema
File "/usr/local/lib/python2.7/dist-packages/suds/client.py", line 115, in __init__
self.wsdl = reader.open(url)
File "/usr/local/lib/python2.7/dist-packages/suds/reader.py", line 150, in open
d = self.fn(url, self.options)
File "/usr/local/lib/python2.7/dist-packages/suds/wsdl.py", line 136, in __init__
d = reader.open(url)
File "/usr/local/lib/python2.7/dist-packages/suds/reader.py", line 74, in open
d = self.download(url)
File "/usr/local/lib/python2.7/dist-packages/suds/reader.py", line 92, in download
fp = self.options.transport.open(Request(url))
File "/usr/local/lib/python2.7/dist-packages/suds/transport/https.py", line 62, in open
return HttpTransport.open(self, request)
File "/usr/local/lib/python2.7/dist-packages/suds/transport/http.py", line 67, in open
return self.u2open(u2request)
File "/usr/local/lib/python2.7/dist-packages/suds/transport/http.py", line 132, in u2open
return url.open(u2request, timeout=tm)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1216, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1178, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
Traceback (most recent call last):
...
File "/srv/reporting/src/reporting/stats/analytics/client.py", line 57, in get_access_token
response = requests.post('https://accounts.google.com/o/oauth2/token', data)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 88, in post
return request('post', url, data=data, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 382, in send
raise SSLError(e, request=request)
SSLError: [Errno bad handshake] (-1, 'Unexpected EOF')
Traceback (most recent call last):
...
self.es.index(index=self.INDICE, doc_type=self.ROOT_CLASS.__name__, body=self.export(obj), id=obj.id)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 68, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 213, in index
_make_path(index, doc_type, id), params=params, body=body)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 284, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 44, in perform_request
response = self.session.request(method, url, data=body, timeout=timeout or self.timeout)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 327, in send
timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
body=body, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline()
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 273, in readline
data = self._sock.recv(self._rbufsize)
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 995, in recv
self._raise_ssl_error(self._ssl, result)
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 851, in _raise_ssl_error
raise ZeroReturnError()
ZeroReturnError
So let's break this down by each traceback block. The first ends with:
File "/usr/lib/python2.7/urllib2.py", line 1178, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 8] _ssl.c:504: EOF occurred in violation of protocol>
This is coming from urllib2. The fact that it receives an EOF makes me think the server closed the connection while you were waiting for that "thread" to read from the socket again. You might want to add more time.sleep(0) calls to yield to gevent.
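With monkey patching in place, time.sleep(0) is effectively gevent.sleep(0), which hands control back to the hub so other greenlets can read from their sockets before the server gives up. A minimal sketch of what that could look like inside one of your tasks; download_report and the loop are made up for illustration:
import gevent

def pull_reports(client_ids):
    results = []
    for client_id in client_ids:
        results.append(download_report(client_id))  # hypothetical network-bound call
        gevent.sleep(0)  # yield to the hub between requests
    return results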
The second traceback comes from requests:
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 382, in send
raise SSLError(e, request=request)
SSLError: [Errno bad handshake] (-1, 'Unexpected EOF')
The [Errno bad handshake] makes me think this is a problem establishing the connection, which could be caused by an unexpected EOF. Is that caused by using gevent? I'm uncertain.
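On the forcing-TLSv1 idea mentioned in the question: with requests that is usually done through a custom transport adapter. A minimal sketch; whether it cures this particular handshake failure is another matter:
import ssl
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager

class TLSv1Adapter(HTTPAdapter):
    """Transport adapter that forces connections to negotiate TLSv1."""
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                       block=block, ssl_version=ssl.PROTOCOL_TLSv1)

session = requests.Session()
session.mount('https://', TLSv1Adapter())
# POSTs made through this session (e.g. the OAuth token request) now use TLSv1.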
The final traceback is definitely from requests as well, but it is also coming out of PyOpenSSL and isn't being caught by urllib3 or requests.
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 851, in _raise_ssl_error
raise ZeroReturnError()
ZeroReturnError
I did some searching and found that "according to the pyOpenSSL docs, ZeroReturnError means that the SSL connection has been closed cleanly." This says to me that the server again closed the connection because you took too long to read anything from the socket.
In short, I think you need to explicitly yield more often just to ensure that these socket problems don't arise. That's just a guess though, so take it with a grain of salt.
I'm working my way through The Programming Historian 2, a self-tutorial in coding for historians focusing on HTML and Python. I am attempting to complete the lesson Working with Files and Web Pages but am stuck on the Opening URLs with Python unit. I am running the following program:
# open-webpage.py
import urllib2
url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib2.urlopen(url)
webContent = response.read()
print webContent[0:300]
Every time I run the program, Komodo Edit 7 returns the following error message:
Traceback (most recent call last):
File "open-webpage.py", line 7, in <module>
response = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "C:\Python27\lib\httplib.py", line 1030, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 365, in _read_status
line = self.fp.readline()
File "C:\Python27\lib\socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
I have attempted the program with a number of different URLs, always with the same result. The guys at Komodo think the problem has to do with my university's firewall, because I access the internet through the university's proxy. The tech people here told me to change my default browser from RockMelt (Chromium-based) to IE, because only IE is fully supported. I did so with no change, and they have no other suggestions.
Can anyone suggest an alternate explanation for the error or a way to address the firewall problem? Thanks.
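If the university proxy really is what is cutting the connection, you can tell urllib2 about it explicitly rather than relying on system settings. A minimal sketch, with a hypothetical proxy address (your proxy may also require a username and password):
# open-webpage-via-proxy.py
import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.edu:8080'})  # hypothetical address
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib2.urlopen(url)
webContent = response.read()
print webContent[0:300]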