Scraping www.naver.com in Python using the Open API - python

I want to scrape "www.naver.com", so I tried scraping it through the Open API.
I wrote the following code:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

defaultURL = 'http://openapi.naver.com/search?&'
key = 'key=keyvalue'
target = '&target=news'
sort = '&sort=sim'
start = '&start=1'
display = '&display=100'
query = '&query=' + urllib.parse.quote_plus(str(input("write:")))
fullURL = defaultURL + key + target + sort + start + display + query
print(fullURL)
file = open("C:\\Users\\kimty\\Desktop\\k\\python\\N\\naver_news.txt", "w", encoding='utf-8')
f = urllib.request.urlopen(fullURL)
resultXML = f.read()
xmlsoup = BeautifulSoup(resultXML, 'html.parser')
items = xmlsoup.find_all('item')  # find_all, not find._all
for item in items:  # the loop body must be indented
    file.write('---------------------------------------\n')
    file.write('title : ' + item.title.get_text(strip=True) + '\n')  # title, not tile
    file.write('contents : ' + item.description.get_text(strip=True) + '\n')
    file.write('\n')
file.close()
But the Python shell only shows this:
============= RESTART: C:\Users\kimty\Desktop\k\python\N\N.py =============
write:lee
http://openapi.naver.com/search?&key=keyvalue&target=news&sort=sim&start=1&display=100&query=lee
Traceback (most recent call last):
File "C:\Users\kimty\Desktop\k\python\N\N.py", line 19, in <module>
f=urllib.request.urlopen(fullURL)
File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "C:\Python34\lib\urllib\request.py", line 464, in open
response = self._open(req, data)
File "C:\Python34\lib\urllib\request.py", line 482, in _open
'_open', req)
File "C:\Python34\lib\urllib\request.py", line 442, in _call_chain
result = func(*args)
File "C:\Python34\lib\urllib\request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Python34\lib\urllib\request.py", line 1186, in do_open
r = h.getresponse()
File "C:\Python34\lib\http\client.py", line 1227, in getresponse
response.begin()
File "C:\Python34\lib\http\client.py", line 386, in begin
version, status, reason = self._read_status()
File "C:\Python34\lib\http\client.py", line 356, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
Why is this happening? What is the Python shell telling me?
I am using Windows 8.1 x64 with Python 3.4.4.

http.client.BadStatusLine is a subclass of http.client.HTTPException. The server gave you back something that is not a valid HTTP status line; maybe your API key is wrong. If I try to access the link with my browser, it also gives me an error.
This is the exact address you tried to request.
Edit
Some people have fixed this error by importing the http library.
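If you want to see what the server actually sends back rather than just the traceback, you can wrap the request in a try/except. This is only a diagnostic sketch reusing the fullURL built above; it does not fix the underlying problem, it just shows which kind of failure you are getting:
import http.client
import urllib.error
import urllib.request

try:
    f = urllib.request.urlopen(fullURL)   # fullURL as built above
    print(f.getcode(), f.reason)          # a normal HTTP reply
    print(f.read()[:500])                 # first part of the body
except urllib.error.HTTPError as e:
    # the server answered, but with an HTTP error code (e.g. 401/403 for a bad key)
    print('HTTP error:', e.code, e.reason)
    print(e.read()[:500])
except http.client.BadStatusLine as e:
    # the server's reply did not even look like an HTTP response
    print('Bad status line:', repr(e.line))
except urllib.error.URLError as e:
    # DNS failure, connection refused, etc.
    print('URL error:', e.reason)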

Related

SocksiPy fails when connecting through Tor

I tried the following 3 methods to control Tor:
using TorCtl/urllib2: Python script Exception with Tor
using socks/httplib: http://www.youtube.com/watch?v=KDsmVH7eJCs
using socks/urllib2: Python urllib over TOR?
Each of them fails with the same error (I tried to make it as clear as possible):
Traceback (most recent call last):
File "tor.py", line 26, in <module>
print(urllib2.urlopen("http://www.ifconfig.me/ip").read())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py"
line 127, in urlopen
return _opener.open(url, data, timeout)
line 404, in open
response = self._open(req, data)
line 422, in _open
'_open', req)
line 382, in _call_chain
result = func(*args)
line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
line 1181, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py"
line 973, in request
self._send_request(method, url, body, headers)
line 1007, in _send_request
self.endheaders(body)
line 969, in endheaders
self._send_output(message_body)
line 829, in _send_output
self.send(msg)
line 791, in send
self.connect()
line 772, in connect
self.timeout, self.source_address)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py"
line 562, in create_connection
sock.connect(sa)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/SocksiPy_branch-1.01-py2.7.egg/socks.py"
line 392, in connect
self.__negotiatesocks5(destpair[0],destpair[1])
line 199, in __negotiatesocks5
self.sendall("\x05\x01\x00")
line 165, in sendall
socket.socket.sendall(self, bytes)
... last error repeating a lot of times and then ...
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/SocksiPy_branch-1.01-py2.7.egg/socks.py", line 163, in sendall
if 'encode' in dir(bytes):
RuntimeError: maximum recursion depth exceeded while calling a Python object
Does anyone understand where it comes from?
For me it actually failed with socksipy-branch installed with pip.
However, it worked ok after I downloaded socks.py directly from http://socksipy.sourceforge.net/ to my working directory.
It looks like you are using the SocksiPy library. I had the same problem and got it fixed by installing this library again directly from https://code.google.com/p/socksipy-branch/. The first time I installed it via pip.
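For reference, the usual SocksiPy + urllib2 pattern looks roughly like the sketch below. It assumes Tor's SOCKS proxy is listening on the default 127.0.0.1:9050 and that socks.py is the working copy mentioned above:
import socks
import socket
import urllib2

# route every new socket through Tor's SOCKS5 proxy (assumed at 127.0.0.1:9050)
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# the request below now goes out over Tor
print(urllib2.urlopen("http://www.ifconfig.me/ip").read())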

urllib error: cannot fetch html data. Python, Beagle Bone Black

I was building my project on a Mac and tried to do the same things on a BeagleBone Black (BBB).
However, I couldn't use urllib on the BBB, so I am stuck and cannot go forward. (It works well on my Mac.)
I tried this simple code as an example:
import urllib
conn = urllib.urlopen('http://stackoverflow.com/questions/8479736/using-python-urllib-how-to-avoid-non-html-content')
Then this error occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 207, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 351, in open_http
'got a bad status line', None)
IOError: ('http protocol error', 0, 'got a bad status line', None)
I need to fetch HTML data for my project.
How can I solve this problem? Do you have any ideas?
Thank you.
When I tried urllib2, I got this:
>>> import urllib2
>>> conn = urllib2.urlopen('http://stackoverflow.com/questions/8479736/using-python-urllib-how-to-avoid-non-html-content')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
I also tried this:
curl http://stackoverflow.com/questions/8479736/using-python-urllib-how-to-avoid-non-html-content
curl: (52) Empty reply from server
and this:
wget http://stackoverflow.com/questions/8479736/using-python-urllib-how-to-avoid-non-html-content
Connecting to stackoverflow.com (198.252.206.16:80)
wget: error getting response
but they didn't work either.
At home, I also tried and failed, but with a different error:
conn = urllib2.urlopen('http://stackoverflow.com/questions/8479736/using-python-urllib-how-to-avoid-non-html-content')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
Environment:
BBB: Linux beaglebone 3.8.13 #1 SMP Tue Jun 18 02:11:09 EDT 2013 armv7l GNU/Linux
Python version: 2.7.3
I really want to recommend the requests library to you:
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
http://www.python-requests.org/en/latest/
How to install:
sudo pip install requests
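Applied to the URL from the question, a minimal sketch looks like this. The User-Agent header is optional and the string below is just an example; some servers refuse requests that send none:
import requests

url = 'http://stackoverflow.com/questions/8479736/using-python-urllib-how-to-avoid-non-html-content'
headers = {'User-Agent': 'Mozilla/5.0'}  # browser-like UA, optional

r = requests.get(url, headers=headers, timeout=10)
print(r.status_code)   # should be 200 if the network path is OK
print(r.text[:300])    # first 300 characters of the HTML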

getting http.client.BadStatusLine with urlopen(IP).read()

The data I am trying to read is in XML format. There is a single space before the XML declaration. I cannot edit this part, as it is hard-coded into the data source; I can only read from it. When the URL is entered in IE, the data comes up. When it is entered in Chrome/Firefox, an error is shown, but the data can be viewed from the page source.
Is there a way in Python to either strip this space off or ignore it, as IE seems to do? (I tried adding strip() in many places.)
Or is there a way to fall back to the page source (I think urlopen does this already)?
Here is the line giving the error:
html = urlopen(address).read()
Here is the error:
Traceback (most recent call last):
File "C:\Users\212311674\Desktop\Python Work\M10url.py", line 27, in <module>
html = urlopen(address).read()
File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 473, in open
response = self._open(req, data)
File "C:\Python33\lib\urllib\request.py", line 491, in _open
'_open', req)
File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 1272, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Python33\lib\urllib\request.py", line 1257, in do_open
r = h.getresponse()
File "C:\Python33\lib\http\client.py", line 1131, in getresponse
response.begin()
File "C:\Python33\lib\http\client.py", line 354, in begin
version, status, reason = self._read_status()
File "C:\Python33\lib\http\client.py", line 336, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: <?xml version="1.0"?><controller_history_cnd>
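One possible workaround, only a sketch and not tested against this device: the traceback suggests the data source streams the XML back without a proper HTTP status line at all, so you can read the reply over a raw socket yourself and strip any leading whitespace before parsing it.
import socket
from urllib.parse import urlsplit

def fetch_raw(address, timeout=10):
    # speak minimal HTTP/1.0 ourselves and return whatever bytes come back,
    # without insisting on a valid status line the way http.client does
    parts = urlsplit(address)
    host, port = parts.hostname, parts.port or 80
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    request = 'GET {0} HTTP/1.0\r\nHost: {1}\r\n\r\n'.format(path, host).encode('ascii')
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks)

raw = fetch_raw(address)   # same 'address' variable as in the question
xml_bytes = raw.lstrip()   # drop the stray leading space before <?xml ...>
# note: if the device ever does send proper HTTP headers, you would need to
# split them off (everything up to the first blank line) before parsing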

Python urllib and firewall: "connection was forcibly closed"

I'm working my way through The Programming Historian 2, a self-tutorial in coding for historians focusing on HTML and Python. I am attempting to complete the lesson Working with Files and Web Pages but am stuck on the Opening URLs with Python unit. I am running the following program:
# open-webpage.py
import urllib2
url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib2.urlopen(url)
webContent = response.read()
print webContent[0:300]
Every time I run the program, Komodo Edit 7 returns the following error message:
Traceback (most recent call last):
File "open-webpage.py", line 7, in <module>
response = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "C:\Python27\lib\httplib.py", line 1030, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 365, in _read_status
line = self.fp.readline()
File "C:\Python27\lib\socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
I have attempted the program with a number of different URLs, always with the same result. The people at Komodo think the problem has to do with my university's firewall, because I access the internet through the university's proxy. The tech people here told me to change my default browser from RockMelt (Chromium) to IE, because only IE is fully supported. I did so with no change, and they have no other suggestions.
Can anyone suggest an alternative explanation for the error or a way to address the firewall problem? Thanks.
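If the university proxy really is the culprit, one thing worth trying is telling urllib2 about it explicitly. This is only a sketch; the proxy host and port below are placeholders you would have to replace with your university's actual proxy settings (the same ones your browser uses):
import urllib2

# placeholder proxy address - replace with your university's real proxy host:port
proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.edu:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)   # all later urlopen calls go through the proxy

url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib2.urlopen(url)
webContent = response.read()
print webContent[0:300]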

Unable to connect to secure website using mechanize in Python

I'm trying to open a secure (HTTPS) website using the mechanize library in Python. When I try to access the website, the server closes the connection and the exception BadStatusLine is raised.
I have tried to modify the headers using the addheaders property, but got no response.
import mechanize
br = mechanize.Browser()
print 'opening page ...'
resp = br.open('https://onlineservices.tin.nsdl.com/etaxnew/tdsnontds.jsp') #this one works fine
print 'ok'
print 'opening page 2 ...'
resp = br.open('https://incometaxindiaefiling.gov.in/portal/index.do') #exception raised
print 'ok'
Exception:
Traceback (most recent call last):
  File
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "Z:\pyTax\app_test.py", line 22, in <module>
    resp=br.open('https://incometaxindiaefiling.gov.in/portal/index.do')
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 203, in open
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 230, in _mech_open
  File "build\bdist.win32\egg\mechanize\_opener.py", line 188, in open
  File "build\bdist.win32\egg\mechanize\_http.py", line 316, in http_request
  File "build\bdist.win32\egg\mechanize\_http.py", line 242, in read
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 203, in open
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 230, in _mech_open
  File "build\bdist.win32\egg\mechanize\_opener.py", line 193, in open
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 344, in _open
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 332, in _call_chain
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 1170, in https_open
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 1116, in do_open
  File "D:\Python27\lib\httplib.py", line 1031, in getresponse
    response.begin()
  File "D:\Python27\lib\httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "D:\Python27\lib\httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''
httplib.BadStatusLine is a subclass of HTTPException. It is raised if a server responds with an HTTP status code that we don't understand. That's what's causing your problem. I am not entirely sure about the fix, though, as your code works fine on my computer.
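For what it's worth, the usual mechanize tweaks to try in this situation look like the sketch below: ignore robots.txt and send a browser-like User-Agent. These are only the standard knobs, not a confirmed fix for this particular site (the server may simply be rejecting the SSL handshake):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)    # do not fetch/obey robots.txt
br.set_handle_refresh(False)   # ignore meta-refresh redirects
# pretend to be a regular browser; the UA string is just an example
br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36')]

resp = br.open('https://incometaxindiaefiling.gov.in/portal/index.do')
print resp.code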
