Unable to connect to secure website using mechanize in Python

I'm trying to open a secure (https) website using the mechanize library in Python. When I try to access the website, the server closes the connection and a BadStatusLine exception is raised.
I have tried to modify the headers using the addheaders property, but it made no difference.
import mechanize
br = mechanize.Browser()
print 'opening page ...'
resp = br.open('https://onlineservices.tin.nsdl.com/etaxnew/tdsnontds.jsp') #this one works fine
print 'ok'
print 'opening page 2 ...'
resp = br.open('https://incometaxindiaefiling.gov.in/portal/index.do') #exception raised
print 'ok'
Exception:
Traceback (most recent call last):
  File ...
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "Z:\pyTax\app_test.py", line 22, in <module>
    resp=br.open('https://incometaxindiaefiling.gov.in/portal/index.do')
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 203, in open
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 230, in _mech_open
  File "build\bdist.win32\egg\mechanize\_opener.py", line 188, in open
  File "build\bdist.win32\egg\mechanize\_http.py", line 316, in http_request
  File "build\bdist.win32\egg\mechanize\_http.py", line 242, in read
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 203, in open
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 230, in _mech_open
  File "build\bdist.win32\egg\mechanize\_opener.py", line 193, in open
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 344, in _open
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 332, in _call_chain
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 1170, in https_open
  File "build\bdist.win32\egg\mechanize\_urllib2_fork.py", line 1116, in do_open
  File "D:\Python27\lib\httplib.py", line 1031, in getresponse
    response.begin()
  File "D:\Python27\lib\httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "D:\Python27\lib\httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

httplib.BadStatusLine is a subclass of HTTPException. It is raised when the server replies with an HTTP status line that httplib can't parse; here the line is empty, which means the server closed the connection without sending a response at all. That's what's causing your problem. I'm not entirely sure about the fix, though, as your code works fine on my machine.
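If the site is dropping connections from clients that don't look like a regular browser, one thing worth trying is to disable robots.txt handling, send a browser-like User-Agent, and retry when the empty status line comes back. This is only a sketch of that workaround, not a confirmed fix; the header value and the retry count are arbitrary choices:
import mechanize
import httplib

br = mechanize.Browser()
br.set_handle_robots(False)   # don't fetch or obey robots.txt
br.set_handle_refresh(False)  # avoid meta-refresh loops
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0')]

url = 'https://incometaxindiaefiling.gov.in/portal/index.do'
for attempt in range(3):
    try:
        resp = br.open(url)
        print resp.code
        break
    except httplib.BadStatusLine:
        # the server closed the connection without sending a status line
        print 'empty status line, retrying ...'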

Related

Can a keyboard interrupt delete data in pandas?

I am running a google.py scraping script to get data. The script reads a CSV file and scrapes a page for each line of the file. After the scraping is done, the script saves the result to the same CSV file.
The dataframe is several thousand lines long.
After it started getting captcha results on several lines, I interrupted the scraping with Ctrl+C.
I re-ran the script just after, and the dataframe read from the CSV file was 3929 lines shorter.
This is the output of the Ctrl+C:
^CTraceback (most recent call last):
File "google.py", line 255, in <module>
Scraping().scrape()
File "google.py", line 239, in scrape
self.write_dataframe(df_psys,psy,tel_list, mail_list)
File "google.py", line 143, in write_dataframe
df_psys.to_csv('psychologues.csv',sep=';',index=False)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/generic.py", line 3563, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1180, in to_csv
csv_formatter.save()
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 261, in save
self._save()
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 266, in _save
self._save_body()
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 304, in _save_body
self._save_chunk(start_i, end_i)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
res = df._mgr.to_native_types(**self._number_format)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 473, in to_native_types
return self.apply("to_native_types", **kwargs)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 304, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 634, in to_native_types
result = to_native_types(self.values, na_rep=na_rep, quoting=quoting, **kwargs)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2207, in to_native_types
mask = isna(values)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 143, in isna
return _isna(obj)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 172, in _isna
return _isna_array(obj, inf_as_na=inf_as_na)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 254, in _isna_array
result = _isna_string_dtype(values, inf_as_na=inf_as_na)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 278, in _isna_string_dtype
result = libmissing.isnaobj2d(values, inf_as_na=inf_as_na)
KeyboardInterrupt
It seems the interrupt happened during the to_csv call, so I am wondering whether the missing data comes from that or from a hack / physical intervention on my computer. I have another keyboard interrupt from a previous run of the script, and there is no to_csv in it:
^CTraceback (most recent call last):
File "google.py", line 255, in <module>
Scraping().scrape()
File "google.py", line 238, in scrape
(tel_list, mail_list) = self.google_scraping(psy, counter)
File "google.py", line 166, in google_scraping
if a.text == "Que s'est-il passé ?":
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 77, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 740, in _execute
return self._parent.execute(command, params)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
response = self.command_executor.execute(driver_command, params)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 347, in execute
return self._request(command_info[0], url, body=data)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 369, in _request
response = self._conn.request(method, url, body=body, headers=headers)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
return self.request_encode_url(
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/macbook/.test_requests/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/http/client.py", line 1337, in getresponse
response.begin()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt
I didn't notice the missing lines until recently, and I ran the script 3 times after they appeared. I tried to look through my terminal history with history -E 1 | grep google.py, but it doesn't keep the times I ran the command; only one entry shows up, which is the last run.
So I don't know exactly when the data was deleted (within the last 24 hours for sure). I would like to check other system log files, but if the deletion comes from a pandas bug I won't look further in my logs.
What do you think?
Is there a way I can prevent the Ctrl+C interrupt from causing this?
This is write_dataframe:
def write_dataframe(self, df, psy, tel_list, mail_list):
    # locate the row for this 'psy' and fill in the scraped values
    index = df[df['psy'] == psy].index.values[0]
    print('writing dataframes')
    df.loc[index, 'tel_google'] = tel_list
    df.loc[index, 'mail_google'] = mail_list
    df.to_csv('file.csv', sep=';', index=False)
If I do
try:
    write_dataframe(args)
except KeyboardInterrupt:
    sys.exit()
will it be enough to prevent loss of data on a keyboard interrupt?
Thank you
Reading the comments and the downvotes, the answer is yes: an interrupt in the middle of to_csv can leave the file partially written.
Reading this post How to prevent a block of code from being interrupted by KeyboardInterrupt in Python?,
I implemented the approach that seemed most reliable for blocking the keyboard interrupt:
import signal

s = signal.signal(signal.SIGINT, signal.SIG_IGN)
# code that must not be interrupted goes here
signal.signal(signal.SIGINT, s)
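A slightly more reusable form of the same idea is to block the interrupt only around the critical write, using a small context manager, and restore the handler afterwards. This is just a sketch (the helper name is mine, not part of the original script); the dataframe and file name are the ones from the traceback above:
import signal
from contextlib import contextmanager

@contextmanager
def uninterruptible():
    # ignore Ctrl+C while the block runs, then restore the previous handler
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        yield
    finally:
        signal.signal(signal.SIGINT, previous)

# usage inside the scraping loop
with uninterruptible():
    df_psys.to_csv('psychologues.csv', sep=';', index=False)
Independently of signal handling, writing the CSV to a temporary file first and then swapping it into place with os.replace would mean an interrupt can at worst lose the current run, never truncate the existing file.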

Scraping in Python using an open API

I want to scrape "www.naver.com", so I tried scraping it using the open API.
I wrote the following code:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

defaultURL = 'http://openapi.naver.com/search?&'
key = 'key=keyvalue'
target = '&target=news'
sort = '&sort=sim'
start = '&start=1'
display = '&display=100'
query = '&query=' + urllib.parse.quote_plus(str(input("write:")))
fullURL = defaultURL + key + target + sort + start + display + query
print(fullURL)

file = open("C:\\Users\\kimty\\Desktop\\k\\python\\N\\naver_news.txt", "w", encoding='utf-8')
f = urllib.request.urlopen(fullURL)
resultXML = f.read()
xmlsoup = BeautifulSoup(resultXML, 'html.parser')
items = xmlsoup.find_all('item')   # find_all, not find._all
for item in items:
    file.write('---------------------------------------\n')
    file.write('title :' + item.title.get_text(strip=True) + '\n')   # title, not tile
    file.write('contents : ' + item.description.get_text(strip=True) + '\n')
    file.write('\n')
file.close()
but the Python shell only shows this:
============= RESTART: C:\Users\kimty\Desktop\kpython\N\N.py =============
write:lee
http://openapi.naver.com/search?&key=keyvalue&target=news&sort=sim&start=1&display=100&query=lee
Traceback (most recent call last):
File "C:\Users\kimty\Desktop\k\python\N\N.py", line 19, in <module>
f=urllib.request.urlopen(fullURL)
File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "C:\Python34\lib\urllib\request.py", line 464, in open
response = self._open(req, data)
File "C:\Python34\lib\urllib\request.py", line 482, in _open
'_open', req)
File "C:\Python34\lib\urllib\request.py", line 442, in _call_chain
result = func(*args)
File "C:\Python34\lib\urllib\request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Python34\lib\urllib\request.py", line 1186, in do_open
r = h.getresponse()
File "C:\Python34\lib\http\client.py", line 1227, in getresponse
response.begin()
File "C:\Python34\lib\http\client.py", line 386, in begin
version, status, reason = self._read_status()
File "C:\Python34\lib\http\client.py", line 356, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ''
Why is this happening?
What is the Python shell telling me?
I am using Windows 8.1 x64, Python 3.4.4.
This http.client.BadStatusLine is a subclass of http.client.HTTPException; it means the server's response could not be parsed as HTTP. Maybe your API key is wrong. If I try to access the link in my browser it also gives me an error.
This is the exact address you tried to request.
Edit
Some people have fixed this error by importing the http library.
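As a sketch of how to make the failure easier to diagnose, you can build the request explicitly, try the https scheme, and catch the specific exceptions. The User-Agent value below is arbitrary, and the endpoint and key format are taken from the question unchanged, so the request may still be rejected if the key or URL is wrong:
import urllib.request
import urllib.error
import http.client

req = urllib.request.Request(fullURL.replace('http://', 'https://'),
                             headers={'User-Agent': 'Mozilla/5.0'})
try:
    f = urllib.request.urlopen(req)
    resultXML = f.read()
    f.close()
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code, e.reason)
except http.client.BadStatusLine:
    print('server closed the connection without a valid status line')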

Google App Engine (python) - Connecting to local datastore via python script

I've been able to connect from a local script directly to the Google App Engine (GAE) ndb store on the server, as described in this article.
But when I try to do the same thing against my local dev server, I can't. My dev API server runs at http://localhost:42020, but when I try to connect using the following command:
$ remote_api_shell.py -s http://localhost:42020
I get the following error:
Traceback (most recent call last):
File "/home/myuser/google_appengine/remote_api_shell.py", line 127, in <module>
run_file(__file__, globals())
File "/home/myuser/google_appengine/remote_api_shell.py", line 123, in run_file
execfile(_PATHS.script_file(script_name), globals_)
File "/home/myuser/google_appengine/google/appengine/tools/remote_api_shell.py", line 150, in <module>
main(sys.argv)
File "/home/myuser/google_appengine/google/appengine/tools/remote_api_shell.py", line 146, in main
appengine_rpc.HttpRpcServer)
File "/home/myuser/google_appengine/google/appengine/tools/remote_api_shell.py", line 74, in remote_api_shell
rpc_server_factory=rpc_server_factory)
File "/home/myuser/google_appengine/google/appengine/ext/remote_api/remote_api_stub.py", line 874, in ConfigureRemoteApi
app_id = GetRemoteAppIdFromServer(server, path, rtok)
File "/home/myuser/google_appengine/google/appengine/ext/remote_api/remote_api_stub.py", line 569, in GetRemoteAppIdFromServer
response = server.Send(path, payload=None, **urlargs)
File "/home/myuser/google_appengine/google/appengine/tools/appengine_rpc.py", line 424, in Send
f = self.opener.open(req)
File "/usr/lib/python2.7/urllib2.py", line 401, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 419, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1181, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
Is it not possible to connect locally?
Looks like I was doing it wrong. This answer solved my problem.
In short, I had to connect to the regular web server, passing the host without the leading http://. It asks for my email and password, and once I enter those, it works!
$ remote_api_shell.py -s localhost:8080
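For doing the same thing from a plain script rather than the interactive shell, the usual remote_api pattern looks roughly like the sketch below. This is only a sketch: it assumes the dev server exposes /_ah/remote_api on localhost:8080 and that the local server accepts any email/password pair; adjust the host and credentials to your setup:
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    # the local dev server accepts any credentials
    return ('test@example.com', '')

# app_id=None makes the stub ask the server for its application id
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'localhost:8080')

# from here on, datastore/ndb calls made by this script go to the local dev server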

Python urllib and firewall: "connection was forcibly closed"

I'm working my way through The Programming Historian 2, a self-tutorial in coding for historians focusing on HTML and Python. I am attempting to complete the lesson Working with Files and Web Pages but am stuck on the Opening URLs with Python unit. I am running the following program:
# open-webpage.py
import urllib2
url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib2.urlopen(url)
webContent = response.read()
print webContent[0:300]
Every time I run the program Komodo Edit 7 returns the following error message:
Traceback (most recent call last):
File "open-webpage.py", line 7, in <module>
response = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "C:\Python27\lib\httplib.py", line 1030, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 365, in _read_status
line = self.fp.readline()
File "C:\Python27\lib\socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
I have attempted the program with a number of different URLs, always with the same result. The people at Komodo think the problem has to do with my university's firewall, because I access the internet through the university's proxy. The tech people here told me to change my default browser from RockMelt (Chromium) to IE, because only IE is fully supported. I did so with no change, and they have no other suggestions.
Can anyone suggest an alternative explanation for the error, or a way to address the firewall problem? Thanks.
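One way to test the proxy hypothesis is to point urllib2 at the university proxy explicitly. The proxy host and port below are placeholders; the real values should match the ones in your browser's connection settings:
import urllib2

# placeholder address -- replace with your university proxy's host and port
proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.ac.uk:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib2.urlopen(url)
print response.read()[0:300]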

urllib error with Google App Engine & Python: [Errno 11003] getaddrinfo failed

Thanks for your help in advance!
I want to get the contents of a website, so I use urllib.urlopen(url), with url = 'http://localhost:8080' (a Tomcat page).
If I use the Google App Engine Launcher and run the application, browsing to http://localhost:8082 works fine.
But if I specify the address and port for the application explicitly:
python "D:\Program Files\Google\google_appengine\dev_appserver.py" -p 8082 -a 10.96.72.213 D:\pagedemon\videoareademo
something goes wrong:
Traceback (most recent call last):
File "D:\Program Files\Google\google_appengine\google\appengine\ext\webapp\_webapp25.py", line 701, in __call__
handler.get(*groups)
File "D:\pagedemon\videoareademo\home.py", line 76, in get
wp = urllib.urlopen(url)
File "C:\Python27\lib\urllib.py", line 84, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 205, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 343, in open_http
errcode, errmsg, headers = h.getreply()
File "D:\Program Files\Google\google_appengine\google\appengine\dist\httplib.py", line 334, in getreply
response = self._conn.getresponse()
File "D:\Program Files\Google\google_appengine\google\appengine\dist\httplib.py", line 222, in getresponse
deadline=self.timeout)
File "D:\Program Files\Google\google_appengine\google\appengine\api\urlfetch.py", line 263, in fetch
return rpc.get_result()
File "D:\Program Files\Google\google_appengine\google\appengine\api\apiproxy_stub_map.py", line 592, in get_result
return self.__get_result_hook(self)
File "D:\Program Files\Google\google_appengine\google\appengine\api\urlfetch.py", line 365, in _get_fetch_result
raise DownloadError(str(err))
DownloadError: ApplicationError: 2 [Errno 11003] getaddrinfo failed
The strangest thing is that when I change the url from "http://localhost:8080" to "http://127.0.0.1:8080", it works fine!
I googled a lot, but I didn't find any good solutions. Hoping for some help!
Also, I didn't configure any proxy, and IE works fine.
Your system doesn't necessarily know that localhost should resolve to 127.0.0.1. You might need to put an entry into your hosts file. On Windows, it's located at C:\Windows\System32\drivers\etc\hosts
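A quick way to check whether that is the problem (this snippet is only a diagnostic sketch, not part of the original code) is to ask Python directly whether localhost resolves:
import socket

try:
    # expected result: 127.0.0.1
    print socket.gethostbyname('localhost')
except socket.gaierror as e:
    # if this fails, add the line "127.0.0.1    localhost" to
    # C:\Windows\System32\drivers\etc\hosts (editing it requires admin rights)
    print 'localhost does not resolve:', e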
