urllib alongwith json to save to a variable - python

Please correct my code. I am trying to save the result of this web page in json format to a variable in python.
Error:
Traceback (most recent call last):
File "C:/Users/Varen/Desktop/json_v1.py", line 5, in <module>
json.dump(link, f)
File "C:\Python27\lib\json\__init__.py", line 189, in dump
for chunk in iterable:
File "C:\Python27\lib\json\encoder.py", line 442, in _iterencode
o = _default(o)
File "C:\Python27\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <addinfourl at 53244992 whose fp = <socket._fileobject object at 0x032B4AF0>> is not JSON serializable
Code:
import urllib
import json
link = urllib.urlopen("http://www.saferproducts.gov/RestWebServices/Recall?RecallDateStart=2015-01-01&RecallDateEnd=2015-12-31&format=json")
with open('link.json', 'w') as f:
json.dump(link, f)

You need to read the data from the file like object returned by urlopen():
import urllib
import json
link = urllib.urlopen("http://www.saferproducts.gov/RestWebServices/Recall?RecallDateStart=2015-01-01&RecallDateEnd=2015-12-31&format=json")
with open('link.json', 'w') as f:
json.dump(link.read(), f)
will do the trick.

Related

How to save SHA256 object to a file?

(I'm doing all this in python 3.10.4 using pycryptodome)
I'm trying to do this process:
Get a hash of a file
Save that hash somewhere
Load that hash and perform RSA signing using a private key
I'm having a problem in step 3 where to save the hash, I have to save it as a string which doesn't work in Step 3.
I've tried using pickle but I'm getting
"ctypes objects containing pointers cannot be pickled"
Code generating the hash:
sha256 = SHA256.new()
with open(fileDir, 'rb') as f:
while True:
data = f.read(BUF_SIZE)
if not data:
break
sha256.update(data)
Code to perform the signing:
get_file(fileName + '.hash', directory)
with open(currentDir + '/client_files/downloaded/' + fileName + '.hash', 'r') as f:
hash_data = f.read()
with open(currentDir + '/client_files/private_key.pem', 'rb') as f:
private_key = RSA.importKey(f.read())
print(private_key)
signer = PKCS1_v1_5.new(private_key)
signature = signer.sign(hash_data)
The error I'm getting:
Traceback (most recent call last):
File "c:\Users\User\Documents\Coding\VSCode Projects\practiceGround\sec_cloud_project\client\client.py", line 168, in <module>
main()
File "c:\Users\User\Documents\Coding\VSCode Projects\practiceGround\sec_cloud_project\client\client.py", line 163, in main
sign(fileName, 'worker_test_files')
File "c:\Users\User\Documents\Coding\VSCode Projects\practiceGround\sec_cloud_project\client\client.py", line 120, in sign
signature = signer.sign(hash_data)
File "C:\Users\User\anaconda3\envs\nscc_project\lib\site-packages\Crypto\Signature\pkcs1_15.py", line 77, in sign
em = _EMSA_PKCS1_V1_5_ENCODE(msg_hash, k)
File "C:\Users\User\anaconda3\envs\nscc_project\lib\site-packages\Crypto\Signature\pkcs1_15.py", line 191, in _EMSA_PKCS1_V1_5_ENCODE
digestAlgo = DerSequence([ DerObjectId(msg_hash.oid).encode() ])
AttributeError: 'str' object has no attribute 'oid'
Note that I'm currently saving the original hash as a string to a text file. If I try to use pickle to save the object as a whole I get this error
with open(currentDir + '/worker_files/sha256.pickle', 'wb') as f:
pickle.dump(sha256, f)
Traceback (most recent call last):
File "c:\Users\User\Documents\Coding\VSCode Projects\practiceGround\sec_cloud_project\worker\worker.py", line 188, in <module>
main()
File "c:\Users\User\Documents\Coding\VSCode Projects\practiceGround\sec_cloud_project\worker\worker.py", line 179, in main
hash_file(fileName, 'worker_test_files')
File "c:\Users\User\Documents\Coding\VSCode Projects\practiceGround\sec_cloud_project\worker\worker.py", line 55, in hash_file
pickle.dump(sha256, f)
ValueError: ctypes objects containing pointers cannot be pickled
Thanks to #Topaco. Changing to using Cyptography for both hashing and signing seemed to work.
Hashing with Cryptography, dumping to a file with pickle, then load and sign with Cryptography again.

Decompress a gzip compressed dictionary object using python

I need to decompress this "H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=" which actually is compressed form of {"test1":12, "test2": "test"}. Now in python I'm using gzip library and getting below mentioned response:
>>> import gzip
>>> gzip.decompress("H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=".encode("UTF-8"))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 548, in decompress
return f.read()
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 292, in read
return self._buffer.read(size)
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 479, in read
if not self._read_gzip_header():
File "/data/python-3.8.10/lib/python3.8/gzip.py", line 427, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'H4')
Is there any way to decompress the string in python ?
The string is Base64 encoded. Therefore:-
import gzip
import base64
b = base64.b64decode('H4sIAAAAAAAA/6tWKkktLjFUsjI00lEAs42UrCAMpVoAbyLr+R0AAAA=')
r = gzip.decompress(b)
print(r.decode())

How Do We Convert HTML to PDF using Python, Is there Any code please share it to me?

I have tried the Library called pytotree, But i didnt get any Answer
This is the code:
import pdftotree
file= open('C:/Users/chaitanya.naidu/Downloads/test.pdf', 'rb')
f = pdftotree.parse(file)
I am getting this error
Traceback (most recent call last):
File "<ipython-input-4-4a9a6b72801d>", line 1, in <module>
f = pdftotree.parse(file)
File "C:\Users\chaitanya.naidu\AppData\Local\Continuum\Anaconda3\lib\site-packages\pdftotree\core.py", line 63, in parse
if not extractor.is_scanned():
File "C:\Users\chaitanya.naidu\AppData\Local\Continuum\Anaconda3\lib\site-packages\pdftotree\TreeExtract.py", line 121, in is_scanned
self.parse()
File "C:\Users\chaitanya.naidu\AppData\Local\Continuum\Anaconda3\lib\site-packages\pdftotree\TreeExtract.py", line 91, in parse
for page_num, layout in enumerate(analyze_pages(self.pdf_file)):
File "C:\Users\chaitanya.naidu\AppData\Local\Continuum\Anaconda3\lib\site-packages\pdftotree\utils\pdf\pdf_utils.py", line 117, in analyze_pages
with open(os.path.realpath(file_name), "rb") as fp:
File "C:\Users\chaitanya.naidu\AppData\Local\Continuum\Anaconda3\lib\ntpath.py", line 542, in abspath
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not _io.BufferedReader
You can use pdfkit, example:
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('test.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf')

How to download ms word docx file in python with raw data from http url

if the following url is hit in browser the docx file will be downloaded i want to automate the download with python.
https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE OF NDIDI v. THE UNITED KINGDOM.docx&logEvent=False
i have tried this following
from docx import Document
import requests
import json
from bs4 import BeautifulSoup
dwnurl = 'https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE%20OF%20NDIDI%20v.%20THE%20UNITED%20KINGDOM.docx&logEvent=False'
doc = requests.get(dwnurl)
print(doc.content) #printing the document like b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00!\xfb\x16\x01\x16\x02\x00\x00\xec\x0c\x00\x00\x13\x00\xc4\x01[Content_Types].xml \xa2\xc0\
print(doc.raw) #printing the document like <urllib3.response.HTTPResponse object at 0x063D8BD0>
document = Document(doc.content)
document.save('test.docx')
#on document.save i have facing these issues
Traceback (most recent call last):
File "scraping_hudoc.py", line 40, in <module>
document = Document(doc.content)
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
self._zipf = ZipFile(pkg_file, 'r')
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1108, in __init__
self._RealGetContents()
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1171, in _RealGetContents
endrec = _EndRecData(fp)
File "C:\Users\204387\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 241, in _EndRecData
fpin.seek(0, 2)
AttributeError: 'bytes' object has no attribute 'seek'
i have saved the ms word docx file through this
import requests
def save_link(book_link, book_name):
the_book = requests.get(book_link, stream=True)
with open(book_name, 'wb') as f:
for chunk in the_book.iter_content(1024 * 1024 * 2): # 2 MB chunks
f.write(chunk)
save_link("https://hudoc.echr.coe.int/app/conversion/docx/?library=ECHR&id=001-176931&filename=CASE%20OF%20NDIDI%20v.%20THE%20UNITED%20KINGDOM.docx&logEvent=False","CASE OF NDIDI v. THE UNITED KINGDOM.docx")

Python lib execute error

I made this python lib and it had this function with uses urllib and urllib2 but when i execute the lib's functions from python shell i get this error
>>> from sabermanlib import geturl
>>> geturl("roblox.com","ggg.html")
Traceback (most recent call last):
File "<pyshell#11>", line 1, in <module>
geturl("roblox.com","ggg.html")
File "sabermanlib.py", line 21, in geturl
urllib.urlretrieve(Address,File)
File "C:\Users\Andres\Desktop\ddd\Portable Python 2.7.5.1\App\lib\urllib.py", line 94, in urlretrieve
return _urlopener.retrieve(url, filename, reporthook, data)
File "C:\Users\Andres\Desktop\ddd\Portable Python 2.7.5.1\App\lib\urllib.py", line 240, in retrieve
fp = self.open(url, data)
File "C:\Users\Andres\Desktop\ddd\Portable Python 2.7.5.1\App\lib\urllib.py", line 208, in open
return getattr(self, name)(url)
File "C:\Users\Andres\Desktop\ddd\Portable Python 2.7.5.1\App\lib\urllib.py", line 463, in open_file
return self.open_local_file(url)
File "C:\Users\Andres\Desktop\ddd\Portable Python 2.7.5.1\App\lib\urllib.py", line 477, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the file specified: 'roblox.com'
>>>
and here's the code for the lib i made:
import urllib
import urllib2
def geturl(Address,File):
urllib.urlretrieve(Address,File)
EDIT 2
I cant understand why i get this error in the python shell executing:
geturl(Address,File)
You don't want urllib.urlretrieve. This takes a file-like object. Instead, you want urllib.urlopen:
>>> help(urllib.urlopen)
urlopen(url, data=None, proxies=None)
Create a file-like object for the specified URL to read from.
Additionally, if you want to download and save a document, you'll need a more robust geturl function:
def geturl(Address, FileName):
html_data = urllib.urlopen(Address).read() # Open the URL
with open(FileName, 'wb') as f: # Open the file
f.write(html_data) # Write data from URL to file
geturl(u'http://roblox.com') # URL's must contain the full URI, including http://

Categories