Python: saving large web page to file - python

Let me start off by saying, I'm not new to programming but am very new to python.
I've written a program using urllib2 that requests a web page that I would then like to save to a file. The web page is about 300KB, which doesn't strike me as particularly large but seems to be enough to give me trouble, so I'm calling it 'large'.
I'm using a simple call to copy directly from the object returned from urlopen into the file:
file.write(webpage.read())
but it will just sit for minutes, trying to write into the file and I eventually receive the following:
Traceback (most recent call last):
  File "program.py", line 51, in <module>
    main()
  File "program.py", line 43, in main
    f.write(webpage.read())
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
httplib.IncompleteRead: IncompleteRead(6384 bytes read, 1808 more expected)
I don't know why this should give the program so much grief?
EDIT: here is how I'm retrieving the page:
jar = cookielib.CookieJar()
cookie_processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie_processor)
urllib2.install_opener(opener)
requ_login = urllib2.Request(LOGIN_PAGE,
    data = urllib.urlencode( { 'destination' : "", 'username' : USERNAME, 'password' : PASSWORD } ))
requ_page = urllib2.Request(WEBPAGE)
try:
    #login
    urllib2.urlopen(requ_login)
    #get desired page
    portfolio = urllib2.urlopen(requ_page)
except urllib2.URLError as e:
    print e.code, ": ", e.reason

I'd use a handy fileobject copier function provided by shutil module. It worked on my machine :)
>>> import urllib2
>>> import shutil
>>> remote_fo = urllib2.urlopen('http://docs.python.org/library/shutil.html')
>>> with open('bigfile', 'wb') as local_fo:
...     shutil.copyfileobj(remote_fo, local_fo)
...
>>>
UPDATE: You may want to pass the third argument to copyfileobj, which controls the size of the internal buffer used to transfer bytes.
UPDATE2: There's nothing fancy about shutil.copyfileobj. It simply reads a chunk of bytes from the source file object and writes it to the destination file object, repeating until there's nothing more to read. Here's its actual source code, which I grabbed from the Python standard library:
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
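For readers on Python 3, the same idea might look like the sketch below. It is only an illustration: the names download and copy_in_chunks and the 64 KB chunk size are assumptions of this example, not part of the original answer, and urllib.request stands in for urllib2.

```python
import io
import shutil
import urllib.request


def download(url, path, chunk_size=64 * 1024):
    """Stream a URL to a local file without holding the whole body in memory."""
    with urllib.request.urlopen(url) as remote, open(path, "wb") as local:
        # The third argument is the buffer size used for each read/write step.
        shutil.copyfileobj(remote, local, chunk_size)


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy between any two binary file-like objects, chunk by chunk."""
    shutil.copyfileobj(src, dst, chunk_size)
    return dst


# Offline demonstration: an in-memory stream plays the role of the response.
payload = b"x" * 300000
copied = copy_in_chunks(io.BytesIO(payload), io.BytesIO())
```

Because copyfileobj only relies on read() and write(), the in-memory stand-in behaves the same way as a real response object would.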

Related

Import JSON in Python error BadStatusLine

I'm trying to import the JSON data generated by an Impinj R420 reader.
The code I use is:
# import urllib library
import urllib.request
from urllib.request import urlopen
# import json
import json
# store the URL in url as
# parameter for urlopen
url = "http://10.234.92.19:14150"
# store the response of URL
response = urllib.request.urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response())
# print the json response
print(data_json)
When I execute the program it gives the following error:
Traceback (most recent call last):
File "C:\Users\V\PycharmProjects\Stapelcontrole\main.py", line 13, in <module>
response = urllib.request.urlopen(url)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 519, in open
response = self._open(req, data)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1377, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1352, in do_open
r = h.getresponse()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1374, in getresponse
response.begin()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "C:\Users\V\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 300, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: {"epc":"3035307B2831B383E019E8EA","firstSeenTimestamp":"2022-04-11T11:24:23.592434Z","isHeartBeat":false}
Process finished with exit code 1
I know this is an error in the response, where it gets a faulty HTTP status code.
Yet I don't know how to fix the error.
Could you advise me how to fix this?
The {"epc":"3035307B2831B383E019E8EA","firstSeenTimestamp":"2022-04-11T11:24:23.592434Z","isHeartBeat":false} is the answer I expect.
Thanks in advance
Edit:
Even with
with urllib.request.urlopen(url) as f:
    data_json = json.load(f)
I get the same BadStatusLine error.
I can't set up the reader any differently; it can only send a JSON response through the IP address of the device. Is there a way to import the data without the HTTP protocol?
# store the response of URL
response = urllib.request.urlopen(url)
# storing the JSON response
# from url in data
data_json = json.loads(response())
Here you are actually calling response. I do not know what you want to achieve by that, but the examples in the urllib.request docs suggest that the object returned by urllib.request.urlopen should be treated like a local file handle, so please replace the above with
with urllib.request.urlopen(url) as f:
    data_json = json.load(f)
Observe that I used json.load not json.loads
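To make the difference concrete, here is a small sketch using an in-memory stream in place of the HTTP response. The function name parse_both_ways and the sample payload are made up for illustration.

```python
import io
import json


def parse_both_ways(raw_bytes):
    """Parse the same payload via json.load and json.loads."""
    # json.load takes a file-like object and calls its read() method.
    from_file = json.load(io.BytesIO(raw_bytes))
    # json.loads takes the document itself, here decoded to text first.
    from_text = json.loads(raw_bytes.decode("utf-8"))
    return from_file, from_text


from_file, from_text = parse_both_ways(b'{"isHeartBeat": false}')
```

Both calls yield the same dictionary; the only difference is whether you hand over the stream or its contents.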
EDIT: After "Is there a way to import the data without the HTTP Protocol?" I conclude that a more low-level solution is needed. Hopefully socket will allow you to get what you want. Using the echo client example as a starting point, I prepared the following code:
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect(("10.234.92.19",14150))
    s.sendall(b'GET / HTTP/1.1\r\n')
    data = s.recv(1024)
print(data)
If everything works as intended, the first 1024 bytes of the answer should be printed. If so, change 1024 to a value that is always bigger than the number of bytes in the response and use json.loads(data) to get the parsed response.
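If the reader terminates each JSON object with a newline (an assumption; the question does not say how messages are delimited), reading line by line avoids guessing a buffer size. The sketch below uses a hypothetical helper, read_json_event, and demonstrates it with a local socket pair rather than the real device.

```python
import json
import socket


def read_json_event(sock):
    """Read one newline-terminated JSON object from a connected socket."""
    # makefile() wraps the socket in a buffered file object so readline() works.
    with sock.makefile("rb") as stream:
        line = stream.readline()
    if not line:
        raise ConnectionError("peer closed the connection before sending data")
    return json.loads(line)


def demo():
    # socketpair() returns two connected sockets; one plays the reader device.
    device, client = socket.socketpair()
    with device, client:
        device.sendall(b'{"epc": "3035307B2831B383E019E8EA", "isHeartBeat": false}\n')
        return read_json_event(client)


event = demo()
```

Against the real reader, you would pass the socket returned by connect() instead of the socket pair.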

My Python cannot work with URL's, and nobody can figure out why?

All I want to do is scrape some data about earthquakes from a website. In fact, I just want Python to be able to extract data from URLs. For some reason, even the simplest code, which only opens a URL and uses '.readlines()', is met with a wall of errors. It doesn't seem to understand the 'urlopen' command, nor most anything else.
I don't know what to even try, because I can't parse the errors that it's giving me. I was hoping, before I had to do something drastic like re-download python or something, that someone would have an answer for me.
import urllib.request
def urltest():
    url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
    f = urllib.request.urlopen(url)
    allLines = f.readlines()
    f.close()
    line = allLines[0].decode()
    print(line)
This is the code I've used to simply test it. The URL goes to a website which holds a .csv file, which python should easily acquire and read through.
If anyone wants, I can actually post the entire wall of errors that this code returns. There looks to be at least 6 different ones, but this is the final line that it spits back:
urllib.error.URLError: <urlopen error unknown url type: https>
Looking through the urllib.request module, it loads a collection of handlers. We can see this code snippet in urllib/request.py:
if hasattr(http.client, "HTTPSConnection"):
    default_classes.append(HTTPSHandler)
skip = set()
for klass in default_classes:
    for check in handlers:
        if isinstance(check, type):
            if issubclass(check, klass):
                skip.add(klass)
        elif isinstance(check, klass):
            skip.add(klass)
for klass in skip:
    default_classes.remove(klass)
for klass in default_classes:
    opener.add_handler(klass())
So the HTTPS handler class is only loaded if http.client has the attribute HTTPSConnection. If we look in http/client.py, we can see the following code for setting this attribute.
try:
    import ssl
except ImportError:
    pass
else:
    class HTTPSConnection(HTTPConnection):
        "This class allows communication via SSL."
        default_port = HTTPS_PORT
So the HTTPSConnection class is only created if the ssl module can be imported successfully. If your system doesn't have the ssl module, then http.client won't define the HTTPSConnection class, so the attribute is missing and urllib won't load a handler for https.
The code you provided worked on my system, so I added the following code before it to make my system unable to locate the ssl module.
#load then remove the ssl module from the system
import sys
import ssl
del ssl
sys.modules['ssl']=None
import urllib.request
def urltest():
    url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
    f = urllib.request.urlopen(url)
    allLines = f.readlines()
    f.close()
    line = allLines[0].decode()
    print(line)
urltest()
Doing this, I get the same error you were getting:
C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\python.exe C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py
Traceback (most recent call last):
File "C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py", line 19, in <module>
urltest()
File "C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py", line 13, in urltest
f = urllib.request.urlopen(url)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 548, in _open
'unknown_open', req)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1387, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: https>
So I suspect you have installed Python without SSL support configured. You should be able to verify this easily by trying to import ssl from the Python command line. If you get an error like
>>> import ssl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ssl'
then that is the cause of your issue. You would have to either reinstall Python with SSL configured or somehow build the ssl module from source.
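The same check can be done from a script. The following is a rough sketch; the function name https_support_report and the keys of the returned dictionary are invented for this example.

```python
import http.client


def https_support_report():
    """Report whether this interpreter can import ssl and speak HTTPS."""
    try:
        import ssl
        openssl_version = ssl.OPENSSL_VERSION
    except ImportError:
        # This is the situation that leads to "unknown url type: https".
        openssl_version = None
    return {
        "openssl_version": openssl_version,
        # http.client only defines HTTPSConnection when ssl imported cleanly.
        "https_connection_defined": hasattr(http.client, "HTTPSConnection"),
    }


report = https_support_report()
print(report)
```

If openssl_version comes back as None, urllib has no HTTPS handler to offer and any redirect to an https URL will fail as shown above.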
It looks like the problem is a network(dns/proxy/firewall) issue.
https://github.com/pbugnion/gmaps/issues/245
You can use Pandas:
import pandas as pd
data = pd.read_csv('http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv')
print (data)

No connection adapters were found for '%s'" % url

Sorry to ask this! I'm a newbie, so feel free to teach me anything you know.
I'm making a scraping tool for marketing purposes, to scrape contact information from websites. I'm using Python 3.
This is my code:
import requests, bs4, os, codecs, csv
import pandas as pd
import sys
os.path.join('usr', 'bin', 'spam')
openFile = open('C:\\Users\\hdtra\\Desktop\\Test_1.csv',encoding='utf-8-sig')
read_test = csv.reader(openFile)
for link in read_test :
    res = requests.get(link)
    res.raise_for_status
    facebookSpider = bs4.BeautifulSoup(res.text)
    email = facebookSpider.select("._4-u2._3xaf._3-95._4-u8")
    helloFile = open('C:\\Users\\hdtra\\Desktop\\In processing\\information.txt','w')
    helloFile.write(str(email[3].encode('utf-8')) + '\n')
    helloFile.close()
I have no idea why it gives me something like this:
Traceback (most recent call last):
File "C:\Users\hdtra\Desktop\In processing\Facebook_spider.py", line 12, in <module>
res = requests.get(link)
File "C:\Program Files\Python36\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 612, in send
adapter = self.get_adapter(url=request.url)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 703, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['http://www.facebook.com/D2Streetwear/?ref=br_rs']'
I know that get() only accepts a string, but I have no idea how to convert these links into strings. This is my CSV file,
with only one column and 5 rows:
http://www.facebook.com/D2Streetwear/?ref=br_rs
https://www.facebook.com/RealClothes/?ref=br_rs
https://www.facebook.com/Lecamelliaclothing/?ref=br_rs
https://www.facebook.com/TaTclothing-285844471884952/?ref=br_rs
https://www.facebook.com/Dai-Clothing-130675847640538/?ref=br_rs
I tried to put str(link()) but it does not work.
You should understand that csv.reader returns an iterator that iterates over each row to return a list of columns for each one.
csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given
csvfile.[...]
Each row read from the csv file is returned as a list of strings.
Bold emphasis mine. Your CSV appears to contain a single column, so you can access the first column using link[0].
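import csv
import requests

with open('test.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # row is a list of column values; the URL is in the first column
        res = requests.get(row[0])
        ...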
I consider it good practice to always use a with...as context manager when handling file I/O, as it automatically closes your file and results in cleaner code.
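To see the list-per-row behaviour without touching the network or the file system, here is a small sketch that feeds csv.reader an in-memory string. The helper name first_column is hypothetical, and the sample text reuses two URLs from the question.

```python
import csv
import io


def first_column(csv_text):
    """Return the first field of every non-empty row as a plain string."""
    reader = csv.reader(io.StringIO(csv_text))
    # Each row is a list of strings, even when the file has a single column.
    return [row[0] for row in reader if row]


sample = (
    "http://www.facebook.com/D2Streetwear/?ref=br_rs\n"
    "https://www.facebook.com/RealClothes/?ref=br_rs\n"
)
rows = list(csv.reader(io.StringIO(sample)))
urls = first_column(sample)
```

Here rows holds one-element lists, which is what requests.get was being handed in the question, while urls holds the plain strings it expects.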

python script to scan a pdf file using online scanner

I used this script to scan multiple PDF files contained in a folder with the online scanner "https://wepawet.iseclab.org/".
import mechanize
import re
import os
def upload_file(uploaded_file):
    url = "https://wepawet.iseclab.org/"
    br = mechanize.Browser()
    br.set_handle_robots(False) # ignore robots
    br.open(url)
    br.select_form(nr=0)
    f = os.path.join("200",uploaded_file)
    br.form.add_file(open(f) ,'text/plain', f)
    br.form.set_all_readonly(False)
    res = br.submit()
    content = res.read()
    with open("200_clean.html", "a") as f:
        f.write(content)
def main():
    for file in os.listdir("200"):
        upload_file(file)
if __name__ == '__main__':
    main()
But after executing the code I got the following error:
Traceback (most recent call last):
File "test.py", line 56, in <module>
main()
File "test.py", line 50, in main
upload_file(file)
File "test.py", line 40, in upload_file
res = br.submit()
File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 541, in submit
return self.open(self.click(*args, **kwds))
File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error refresh: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
OK
Could anyone help me with this problem?
I think the issue is the mime-type text/plain you set. For PDF, this should be application/pdf. Your code with this change worked for me when I uploaded a sample PDF.
Change the br.form.add_file call to look like this:
br.form.add_file(open(f), 'application/pdf', f)

Python Extracting binary from a POST request using web.py

I am developing an API that allows outside clients to send a binary file, which will then be processed. My web.data() is a string, and the function I am calling requires binary data. How do I get it into the correct format? Maybe I have the incorrect headers? How do I extract the binary data? I am using web.py.
-----------------POST request----------------------------------------------------
import json
import requests
files = {'file':('000038fe4b46c210c37bdde767835007', open('000038fe4b46c210c37bdde767835007', 'rb'))}
headers = {'content-type' : 'application/octet-stream', 'X-Auth-Token':'xxxf'}
r = requests.post('http://XXX:8080/v1/binaries', files = files, headers = header
-----------------------API function------------------------------
def POST(self):
    a = web.ctx.env.get("HTTP_X_AUTH_TOKEN", None)
    creds = authenticator(a)
    postdata = web.data().read()
    analysis = atklite.FileAnalysis(data=postdata)
    metadata = analysis.return_analysis()
------------------------Traceback--------------------------------
File "/usr/lib/pymodules/python2.7/web/application.py", line 242, in process
return self.handle()
File "/usr/lib/pymodules/python2.7/web/application.py", line 233, in handle
return self._delegate(fn, self.fvars, args)
File "/usr/lib/pymodules/python2.7/web/application.py", line 415, in _delegate
return handle_class(cls)
File "/usr/lib/pymodules/python2.7/web/application.py", line 390, in handle_class
return tocall(*args)
File "/home/XXXXXX/ProcessingCode/bfsapi.py", line 75, in POST
postdata = web.data().read()
AttributeError: 'str' object has no attribute 'read'
Thanks
Sorry if the formatting got all messed up in the Post.
Even if it is a binary file, reading the raw POST data gives you an encoded string. You would need to decode it to convert it to binary data. You can write it to a file as follows:
written = open('binary.file', 'wb')
for chunk in rawdata.chunks():
    written.write(chunk)
written.close()
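As a framework-independent illustration, the sketch below assumes the request body is already available as a single byte string, which is what the AttributeError in the traceback suggests web.data() returns. The helper names iter_chunks and save_body are hypothetical, and io.BytesIO is used only to give the bytes a read() method.

```python
import io


def iter_chunks(raw_body, chunk_size=8192):
    """Yield successive chunks of a byte string via a file-like wrapper."""
    stream = io.BytesIO(raw_body)  # gives plain bytes a read() method
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def save_body(raw_body, path):
    """Write the raw request body to disk in binary mode."""
    with open(path, "wb") as out:
        for chunk in iter_chunks(raw_body):
            out.write(chunk)


chunks = list(iter_chunks(b"abcdef", chunk_size=4))
```

A function that expects a file-like object can be given io.BytesIO(raw_body) directly, while one that expects bytes can take the string as it is.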
