I have a MATLAB application that reads a .bin file and parses through the data. I am trying to convert this script from MATLAB to Python but am seeing discrepancies in the values being read.
The read function utilized in the MATLAB script is:
fname = 'file.bin';
f=fopen(fname);
data = fread(f, 100);
fclose(f);
The Python conversion I attempted is:
fname = 'file.bin'
with open(fname, mode='rb') as f:
    data = list(f.read(100))
I would then print a side-by-side comparison of the read bytes with their index and found discrepancies between the two. I have confirmed that the values read in Python are correct by executing $ hexdump -n 100 -C file.bin and by viewing the file's contents on the application HexEdit.
I would appreciate any insight into the source of discrepancies between the two programs and how I may be able to resolve it.
Note: I am trying to only utilize built-in Python libraries to resolve this issue.
Solution: I was using an incorrect file path/structure between the two languages. Implementing @juanpa.arrivillaga's suggestion cleanly reproduced the MATLAB results.
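For anyone hitting the same thing, a quick sanity check (a hypothetical snippet, not part of the original post) is to confirm exactly which file Python is reading and how large it is before comparing values:

import os

fname = 'file.bin'
# Make sure both programs point at the same file on disk
print(os.path.abspath(fname), os.path.getsize(fname), 'bytes')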
An exact translation of the MATLAB code, using NumPy, would be:
data = np.frombuffer(f.read(100), dtype=np.uint8).astype(np.float64)
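For context, a complete version of that translation (assuming the same fname as in the question) would be:

import numpy as np

fname = 'file.bin'
with open(fname, mode='rb') as f:
    # MATLAB's fread(f, 100) returns the first 100 bytes as doubles; mirror that here
    data = np.frombuffer(f.read(100), dtype=np.uint8).astype(np.float64)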
Iterating over a bytes object in Python yields unsigned integers (0 to 255), just as MATLAB's fread does by default, so you just need to do the following:
fname = 'file.bin'
with open(fname, mode='rb') as f:
    bytes_arr = f.read(100)
    # Conversion for visual comparison purposes
    data = [x for x in bytes_arr]
    print(data)
Also, welcome to Python: bytes is a built-in type, so don't shadow the built-in name bytes with a variable of your own, or you'll run into unexpected problems.
Edit: as pointed out by @juanpa.arrivillaga, you could use the faster:
fname = 'file.bin'
with open(fname, mode='rb') as f:
    bytes_arr = f.read(100)
    # Conversion for visual comparison purposes
    data = list(bytes_arr)
I want to install some Python packages using pip but cannot, as every file downloaded produces the same hash, which then fails comparison in pip's security check.
After playing around, I see that every file I download using curl from files.pythonhosted hashes to the same value. I've tested this with a Python script like so:
curl http://files.pythonhosted.org/packages/1a/80/b06ce333aabba7ab1b6a41ea3c4e46970ceb396e705733480a2d47a7f74b/Django-4.0.3-py3-none-any.whl -o django.whl
import hashlib
hasher = hashlib.sha256()
BLOCKSIZE = 65536
def hash_stuff(file):
    with open(file, 'rb') as afile:
        buf = afile.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(BLOCKSIZE)
    print(hasher.hexdigest())
hash_stuff("pynvim.tar.gz")
hash_stuff("opencv.tar.gz")
hash_stuff("django.whl")
which outputs:
➜ ~ python pythonhash.py
c77ab57a36e39ce205ca2327a3edd10399f4d78a3be91e80d845a1b97c29b7d6
ea75572349ed10da0f3224398737fd08352ae10e6f3c571345feb971e080a276
9e31adaf584633587df90d7be36e2fb287c7344eaa4bb23d619f4bdaa19a67d0
If I modify the order of the hash_stuff calls like so (note the ordering is different):
hash_stuff("django.whl")
hash_stuff("opencv.tar.gz")
hash_stuff("pynvim.tar.gz")
the output does not change!
➜ ~ python pythonhash.py
c77ab57a36e39ce205ca2327a3edd10399f4d78a3be91e80d845a1b97c29b7d6
ea75572349ed10da0f3224398737fd08352ae10e6f3c571345feb971e080a276
9e31adaf584633587df90d7be36e2fb287c7344eaa4bb23d619f4bdaa19a67d0
If I reset the hasher object inside the function, I get the first hash (c77ab57...) three times:
def hash_stuff(file):
    hasher = hashlib.sha256()
    BLOCKSIZE = 65536
    with open(file, 'rb') as afile:
-----
➜ ~ python pythonhash.py
c77ab57a36e39ce205ca2327a3edd10399f4d78a3be91e80d845a1b97c29b7d6
c77ab57a36e39ce205ca2327a3edd10399f4d78a3be91e80d845a1b97c29b7d6
c77ab57a36e39ce205ca2327a3edd10399f4d78a3be91e80d845a1b97c29b7d6
I've written the same test in Ruby and I'm getting the same results:
require 'digest'
puts Digest::SHA256.hexdigest File.read "django.whl"
puts Digest::SHA256.hexdigest File.read "opencv.tar.gz"
puts Digest::SHA256.hexdigest File.read "pynvim.tar.gz"
As a sanity check, I've tested hashing some local files and they produce the same hash consistently, regardless of ordering.
How can the ordering of execution affect the hash?
Erm, files.pythonhosted doesn't even seem to have a proper SSL certificate... can I even trust this host?
What could I possibly be doing wrong?
Turns out my internet provider was blocking the content because files.pythonhosted was being served with a self-signed SSL certificate.
The reason the hash was the same for all files was that I was downloading the same error HTML page every time (doh...). Thanks @jasonharper for the pointer!
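For anyone debugging something similar, a quick check (a hedged sketch, not from the original post) is to peek at the first bytes of each download before hashing it; an HTML error page is easy to spot:

def looks_like_html(path):
    # A real wheel or tarball should not start with an HTML document
    with open(path, 'rb') as f:
        head = f.read(64)
    return head.lstrip().lower().startswith((b'<!doctype html', b'<html'))

for name in ('pynvim.tar.gz', 'opencv.tar.gz', 'django.whl'):
    print(name, 'looks like an HTML error page' if looks_like_html(name) else 'looks like a real archive')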
I'm able to embed the original filename using python-gnupg when encrypting a file using below:
filename = 'test.tim'
with open(filename, 'rb') as fh:
    status = gpg.encrypt_file(fh, recipients='somerecipient', output='test.tim.gpg',
                              sign='somesignature', extra_args=['--set-filename', os.path.basename(filename)])
This can be verified by using gpg from the command line:
$ gpg2 --list-packets test.tim.gpg | grep name
I am however unable to preserve the original filename when decrypting the file:
with open(filename, 'rb') as fh:
    status = gpg.decrypt_file(fh, extra_args=['--use-embedded-filename'])
I am aware of the output parameter (which specifies the filename to save the contents to) in the decrypt_file function, but I want to preserve the original filename (which I won't always know).
It seems the decrypt_file function always passes the --decrypt flag to gpg, which always outputs the contents to stdout (unless used in conjunction with the output parameter), as in:
$ gpg --decrypt --use-embedded-filename test.tim.gpg
Below command will decrypt and save output to original filename:
$ gpg --use-embedded-filename test.tim.gpg
Any ideas?
Tim
The functionality to do what you want doesn't exist in the original python-gnupg.
There's a modified version here by isislovecruft (which is what you get if you pip install gnupg) that adds support for --list-packets with gpg.listpackets, but it still doesn't support --use-embedded-filename.
So my approach, if I were to insist on using python only, would probably be to start with isislovecruft's version and then subclass GPG like this:
import gnupg
import os
GPGBINARY = os.environ.get('GPGBINARY', 'gpg')
hd = os.path.join(os.getcwd(), 'keys')
class myGPG(gnupg.GPG):
    def decrypt_file_original_name(self, file, always_trust=False, passphrase=None, extra_args=None):
        args = ["--use-embedded-filename"]
        output = calculate_the_file_name_using_list_packets()
        self.set_output_without_confirmation(args, output)
        if always_trust:  # pragma: no cover
            args.append("--always-trust")
        if extra_args:
            args.extend(extra_args)
        result = self.result_map['crypt'](self)
        self._handle_io(args, file, result, passphrase, binary=True)
        # logger.debug('decrypt result: %r', result.data)
        return result
gpg = myGPG(gnupghome=hd, gpgbinary=GPGBINARY)
Bear in mind, at this point it is almost certainly much easier to just use subprocess and run the gpg binary directly from the shell, especially as you don't care about the output.
Anyway, I got this far and have run out of time for now, so I leave implementing calculate_the_file_name_using_list_packets up to you, if you choose to go the 'pure Python' route. Hopefully it is a bit easier now that you have gpg.listpackets. Good luck!
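If you do go the subprocess route instead, a minimal sketch is below (an illustration only, not part of the original answer; it assumes gpg is on your PATH and writes the decrypted file into the current directory):

import subprocess

def decrypt_with_original_name(encrypted_path):
    # Running gpg without --decrypt honours --use-embedded-filename and
    # writes the plaintext out under the name stored in the literal packet
    subprocess.run(['gpg', '--batch', '--use-embedded-filename', encrypted_path], check=True)

decrypt_with_original_name('test.tim.gpg')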
I need to read the details for a file in Windows so that I can interrogate the file's 'File version' as displayed in the Details tab of the File Properties window.
I haven't found anything in the standard library that makes this very easy to accomplish but figured if I could find the right windows function, I could probably accomplish it using ctypes.
Does anyone have any example code, or can they point me to a Windows function that would let me read this info? I took a look at GetFileAttributes already, but that wasn't quite right as far as I could tell.
Use the Win32 Version Information API functions from ctypes. The API is a little fiddly to use, but I also wanted this, so I've thrown together a quick script as an example.
usage: version_info.py [-h] [--lang LANG] [--codepage CODEPAGE] path
It can also be used as a module; see the VersionInfo class. Checked with Python 2.7 and 3.6 against a few files.
import array
from ctypes import *

def get_file_info(filename, info):
    """
    Extract information from a file.
    """
    # The *A (ANSI) version APIs expect byte strings on Python 3
    if not isinstance(filename, bytes):
        filename = filename.encode('mbcs')
    if not isinstance(info, bytes):
        info = info.encode('mbcs')
    # Get size needed for buffer (0 if no info)
    size = windll.version.GetFileVersionInfoSizeA(filename, None)
    # If no info in file -> empty string
    if not size:
        return ''
    # Create buffer
    res = create_string_buffer(size)
    # Load file information into buffer res
    windll.version.GetFileVersionInfoA(filename, None, size, res)
    r = c_uint()
    l = c_uint()
    # Look for codepages
    windll.version.VerQueryValueA(res, b'\\VarFileInfo\\Translation',
                                  byref(r), byref(l))
    # If no codepage -> empty string
    if not l.value:
        return ''
    # Take the first codepage (what else?)
    codepages = array.array('H', string_at(r.value, l.value))
    codepage = tuple(codepages[:2].tolist())
    # Extract information for the requested key under that codepage
    windll.version.VerQueryValueA(res, (b'\\StringFileInfo\\%04x%04x\\' % codepage) + info,
                                  byref(r), byref(l))
    return string_at(r.value, l.value)
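A hypothetical usage example (the DLL path is just an illustration; 'FileVersion' corresponds to the 'File version' field on the Details tab):

if __name__ == '__main__':
    # Print the same 'File version' value Explorer shows in File Properties
    print(get_file_info(r'C:\Windows\System32\kernel32.dll', 'FileVersion'))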
In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>
When I open this pdf with Acrobat Pro it turns out it is secured (or "read protected"). From this link, however, I read that there's a multitude of services which can disable this read-protection easily (for example pdfunlock.com). When diving into the source of pdfminer, I see that the error above is generated on these lines:
if check_extractable and not doc.is_extractable:
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
Since there's a multitude of services which can disable this read-protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.
Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!
================================================
Below you will find the code with which I currently extract the text from non-read-protected PDFs.
def getTextFromPDF(rawFile):
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    fileData = StringIO()
    fileData.write(rawFile)
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()
    outfp.close()
    return result
I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts pdfs to be extractable.
The code to use this is pretty straightforward:
import pikepdf
pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')
As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable to True isn't going to help you.
Per this thread:
Does a library exist to remove passwords from PDFs programmatically?
I would recommend removing the read-protection with a command-line tool such as qpdf (easily installable, e.g. on Ubuntu use apt-get install qpdf if you don't have it already):
qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf
Then open the unlocked file with pdfminer and do your stuff.
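If you want to drive that from Python rather than the shell, a small sketch (assuming qpdf is installed and using the placeholder filenames from the command above; getTextFromPDF is the function from the question) could be:

import subprocess

# Run the same qpdf command shown above, then parse the unlocked file as usual
subprocess.check_call(['qpdf', '--password=PASSWORD', '--decrypt', 'SECURED.pdf', 'UNSECURED.pdf'])

with open('UNSECURED.pdf', 'rb') as f:
    text = getTextFromPDF(f.read())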
For a pure-Python solution, you can try using PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf - see:
https://github.com/mstamy2/PyPDF2/issues/53
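For completeness, a rough sketch of that PyPDF2 route (hedged: this uses the older PyPDF2 1.x API referenced in the issue above, and it assumes the file is protected with an empty user password, as "permissions-only" PDFs usually are):

from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader('SECURED.pdf')
if reader.isEncrypted:
    reader.decrypt('')  # only succeeds for encryption types PyPDF2 supports

writer = PdfFileWriter()
for page_number in range(reader.getNumPages()):
    writer.addPage(reader.getPage(page_number))

with open('UNSECURED.pdf', 'wb') as out_file:
    writer.write(out_file)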
I used the code below with pikepdf and was able to overwrite the input file.
import pikepdf
pdf = pikepdf.open('filepath', allow_overwriting_input=True)
pdf.save('filepath')
In my case there was no password, but simply setting check_extractable=False circumvented the PDFTextExtractionNotAllowed exception for a problematic file (that opened fine in other viewers).
Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.
This issue was fixed in 2020 by disabling the check_extractable check by default. It now shows a warning instead of raising an error.
Similar question and answer here.
The 'check_extractable=True' argument is by design.
Some PDFs explicitly disallow extracting text, and PDFMiner follows that directive. You can override it (by giving check_extractable=False), but do so at your own risk.
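Applied to the code from the question, that just means flipping the flag in the get_pages call:

for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=False):
    interpreter.process_page(page)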
If you want to unlock all pdf files in a folder without renaming them, you may use this code:
import glob, os, pikepdf
p = os.getcwd()
for file in glob.glob('*.pdf'):
    file_path = os.path.join(p, file).replace('\\', '/')
    init_pdf = pikepdf.open(file_path)
    new_pdf = pikepdf.new()
    new_pdf.pages.extend(init_pdf.pages)
    new_pdf.save(str(file))
The pikepdf library will not overwrite the existing file by saving it under the same name it was opened from (without allow_overwriting_input=True, as shown above); instead, copy the pages into a newly created empty PDF and save that.
I too faced the same problem of parsing a secured PDF, and it was resolved using the pikepdf library. I tried it in my Jupyter notebook on Windows and it gave errors, but it worked smoothly on Ubuntu.
If you've forgotten the password to your PDF, below is a generic script which tries a LOT of password combinations on the same PDF. It uses pikepdf, but you can update the function check_password to use something else.
Usage example:
I used this when I had forgotten a password on a bank PDF. I knew that my bank always encrypts these kinds of PDFs with the same password structure:
Total length = 8
First 4 characters = uppercase letters.
Last 4 characters = digits.
I call the script as follows:
check_passwords(
    pdf_file_path='/Users/my_name/Downloads/XXXXXXXX.pdf',
    combination=[
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        NUMBER,
        NUMBER,
        NUMBER,
        NUMBER,
    ]
)
Password-checking script:
(Requires Python 3.8+, with the numpy and pikepdf libraries)
from typing import *
from itertools import product
import time, pikepdf, math, numpy as np
from pikepdf import PasswordError

ALPHABET_UPPERCASE: Sequence[str] = tuple('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ALPHABET_LOWERCASE: Sequence[str] = tuple('abcdefghijklmnopqrstuvwxyz')
NUMBER: Sequence[str] = tuple('0123456789')

def as_list(l):
    if isinstance(l, (list, tuple, set, np.ndarray)):
        l = list(l)
    else:
        l = [l]
    return l

def human_readable_numbers(n, decimals: int = 0):
    n = round(n)
    if n < 1000:
        return str(n)
    names = ['', 'thousand', 'million', 'billion', 'trillion', 'quadrillion']
    n = float(n)
    idx = max(0, min(len(names) - 1,
                     int(math.floor(0 if n == 0 else math.log10(abs(n)) / 3))))
    return f'{n/10**(3*idx):.{decimals}f} {names[idx]}'

def check_password(pdf_file_path: str, password: str) -> bool:
    ## You can modify this function to use something other than pikepdf.
    ## It should return True on success and False on password failure.
    try:
        pikepdf.open(pdf_file_path, password=password)
        return True
    except PasswordError:
        return False

def check_passwords(pdf_file_path, combination, log_freq: int = int(1e4)):
    combination = [tuple(as_list(c)) for c in combination]
    print(f'Trying all combinations:')
    for i, c in enumerate(combination):
        print(f"{i}) {c}")
    num_passwords: int = np.prod([len(x) for x in combination])
    passwords = product(*combination)
    success: Union[bool, str] = False
    count: int = 0
    start: float = time.perf_counter()
    for password in passwords:
        password = ''.join(password)
        if check_password(pdf_file_path, password=password):
            success = password
            print(f'SUCCESS with password "{password}"')
            break
        count += 1
        if count % int(log_freq) == 0:
            now = time.perf_counter()
            print(f'Tried {human_readable_numbers(count)} ({100*count/num_passwords:.1f}%) of {human_readable_numbers(num_passwords)} passwords in {(now-start):.3f} seconds ({human_readable_numbers(count/(now-start))} passwords/sec). Latest password tried: "{password}"')
    end: float = time.perf_counter()
    msg: str = f'Tried {count} passwords in {1000*(end-start):.3f}ms ({count/(end-start):.3f} passwords/sec). '
    msg += f"Correct password: {success}" if success is not False else f"All {num_passwords} passwords failed."
    print(msg)
Comments
Obviously, don't use this to break into PDFs which are not your own. I hold no responsibility over how you use this script or any consequences of using it.
A lot of optimizations can be made.
Right now check_password uses pikepdf, which loads the file from disk for every check. This is really slow; ideally it should run against an in-memory copy. I haven't figured out a way to do that, though (see the sketch after these notes for one possible approach).
You can probably speed this up a LOT by calling qpdf directly using C++, which is much better than Python for this kind of stuff.
I would avoid multi-processing here, since we're calling the same qpdf binary (which is normally a system-wide installation), which might become the bottleneck.
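Regarding the in-memory point above, one possible variant (a hedged sketch, not part of the original answer): pikepdf.open also accepts a file-like object, so the encrypted PDF can be read from disk once and each attempt can be given a fresh BytesIO over the same bytes:

import io
import pikepdf
from pikepdf import PasswordError

with open('/Users/my_name/Downloads/XXXXXXXX.pdf', 'rb') as f:
    pdf_bytes = f.read()  # hit the disk only once

def check_password_in_memory(password: str) -> bool:
    # Same contract as check_password: True on success, False on a wrong password
    try:
        pikepdf.open(io.BytesIO(pdf_bytes), password=password)
        return True
    except PasswordError:
        return False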