Merging PDF files with Python3 - python

I am writing a small script that needs to merge many one-page pdf files. I want the script to run with Python3 and to have as few dependencies as possible.
For the PDF merging part, I tried using PyPdf. However, the Python 3 support seems to be buggy; It can't handle inkscape generated PDF files (which I need). I have the current git version of PyPdf installed, and the following test script doesn't work:
import PyPDF2
output_pdf = PyPDF2.PdfFileWriter()
with open("testI.pdf", "rb") as input:
input_pdf = PyPDF2.PdfFileReader(input)
output_pdf.addPage(input_pdf.getPage(0))
with open("test.pdf", "wb") as output:
output_pdf.write(output)
It throws the following stack trace:
Traceback (most recent call last):
File "test.py", line 7, in <module>
output.addPage(input.getPage(0))
File "/usr/lib/python3.3/site-packages/pyPdf/pdf.py", line 420, in getPage
self._flatten()
File "/usr/lib/python3.3/site-packages/pyPdf/pdf.py", line 574, in _flatten
self._flatten(page.getObject(), inherit)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 165, in getObject
return self.pdf.getObject(self).getObject()
File "/usr/lib/python3.3/site-packages/pyPdf/pdf.py", line 616, in getObject
retval = readObject(self.stream, self)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 66, in readObject
return DictionaryObject.readFromStream(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 526, in readFromStream
value = readObject(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 57, in readObject
return ArrayObject.readFromStream(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 152, in readFromStream
obj = readObject(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 86, in readObject
return NumberObject.readFromStream(stream)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 231, in readFromStream
return FloatObject(name.decode("ascii"))
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 207, in __new__
return decimal.Decimal.__new__(cls, str(value), context)
TypeError: optional argument must be a context
The same script, however, works flawlessly with Python 2.7.
What am I doing wrong here? Is it a bug in the library? Can I work around it without touching the PyPDF library?

So I found the answer. The decimal.Decimal module in Python3.3 shows some weird behaviour. This is the corresponding StackOverflow question: Instantiate Decimal class I added some workaround to the PyPDF2 library and submitted a pull request.

Just to make sure you are aware of already existing tools that do exactly this:
PDFtk
PDFjam (my favourite, requires LaTeX though)
Directly with GhostScript:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf file1.pdf file2.pdf

Related

Issues Opening Spyder via Anaconda

I was forced into some software updates on my laptop (ThinkPad) and after the update when I tried to open Spyder (via Anaconda) and it won't open. I don't have any experience in errors like this or fixing this stuff (and I am aware this may be something that is super simple). Please help. This is the application launch error I am getting:
Traceback (most recent call last):
File "C:\Users\cyrra\anaconda3\Scripts\spyder-script.py", line 10, in
sys.exit(main())
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\app\start.py", line 205, in main
mainwindow.main()
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\app\mainwindow.py", line 3651, in main
mainwindow = run_spyder(app, options, args)
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\app\mainwindow.py", line 3526, in run_spyder
main.setup()
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\app\mainwindow.py", line 871, in setup
self.help = Help(self, css_path=css_path)
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\plugins\help\plugin.py", line 68, in __init__
color_scheme = self.get_color_scheme()
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\api\plugins.py", line 347, in get_color_scheme
return super(BasePluginWidget, self)._get_color_scheme()
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\plugins\base.py", line 446, in _get_color_scheme
return get_color_scheme(CONF.get('appearance', 'selected'))
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\config\gui.py", line 114, in get_color_scheme
color_scheme[key] = CONF.get("appearance", "%s/%s" % (name, key))
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\config\manager.py", line 228, in get
return config.get(section=section, option=option, default=default)
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\config\user.py", line 976, in get
return config.get(section=section, option=option, default=default)
File "C:\Users\cyrra\anaconda3\lib\site-packages\spyder\config\user.py", line 513, in get
raise cp.NoOptionError(option, section)
configparser.NoOptionError: No option 'custom-1/background' in section: 'appearance'
Thanks,
Rachel
Did you try to run spyder on the terminal?
It should be prebuild in Anaconda.
Your Spyder config file should contain but is missing 'custom-1/background' in section: 'appearance'. If you have no experience, the easiest thing for you to do is uninstall and install Spyder (via Anaconda). Your other option is to find the config file called in line 513 of C:\Users\cyrra\anaconda3\lib\site-packages\spyder\config\user.py". Then open the config file (if it exists) and correct it. If the file is not found, search for that config file and place it in the expected location.

Parse postgresql -pycparser.plyparser.ParseError before: pgwin32_signal_event

I need to parse an open-source project Postgresql using pycparser.
While parsing its source-code the following error arises:
Traceback (most recent call last):
File "examples\using_cpp_libc.py", line 48, in <module>
getAllFiles(projectName)
File "examples\using_cpp_libc.py", line 29, in getAllFiles
ast = parse_file(dirName+'\\'+fname, use_cpp = True, cpp_path = 'cpp',
cpp_args = [r'-nostdinc',r'-Iutils/fake_libc_include',r'-
Iprojects/postgresql/src/include'])
File "G:\python\pycparser-master\pycparser\__init__.py", line 92, in
parse_file
return parser.parse(text, filename)
File "G:\python\pycparser-master\pycparser\c_parser.py", line 152, in parse
debug=debuglevel)
File "G:\python\pycparser-master\pycparser\ply\yacc.py", line 334, in parse
return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
File "G:\python\pycparser-master\pycparser\ply\yacc.py", line 1204, in
parseopt_notrack
tok = call_errorfunc(self.errorfunc, errtoken, self)
File "G:\python\pycparser-master\pycparser\ply\yacc.py", line 193, in
call_errorfunc
r = errorfunc(token)
File "G:\python\pycparser-master\pycparser\c_parser.py", line 1838, in
p_error
column=self.clex.find_tok_column(p)))
File "G:\python\pycparser-master\pycparser\plyparser.py", line 67, in
_parse_error
raise ParseError("%s: %s" % (coord, msg))
pycparser.plyparser.ParseError:
projects/postgresql/src/include/pg_config_os.h:366:15: before:
pgwin32_signal_event
I am using postgresql-9.6.9, build it using visual studio express 2017 on windows 10 (64-bit)
The blog post you quoted in the comment is the canonical resource. Parsing large C projects is not easy - they have their own quirks - so it takes work. I doubt it's resolvable within the confines of a Stack Overflow question.
You need to start tackling the issues one by one - for example look at the pgwin32_signal_event token in pg_config_os.h - why can't it be parsed? Perhaps its type is unparsable? Was it defined? Could it be added to a "fake" header, etc. Unfortunately, there's no easy way to do this except working through the issues one by one.
Be sure to preprocess the file you're parsing first, dumping the full preprocessed version into a single .c file - this gets all the types into a single file you can work with.

Google App Engine dev_appserver.py: watcher_ignore_re flag "is not JSON serializable"

Why I run the dev_appserver.py with the option watcher_ignore_re, I get an error message that the regex is not JSON serializable.
Is this a bug with the development server? Am I using this command improperly? The command and callstack is printed below.
C:\Users\mes65\Documents\MyProject>"C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\bin\dev_appserver.py" ^
--watcher_ignore_re="(.*\.git|.*\.idea|tmp\.py)" ^
"C:\Users\mes65\Documents\MyProject"
WARNING 2018-06-06 09:28:59,161 appinfo.py:1622] lxml version "2.3" is deprecated, use one of: "3.7.3"
INFO 2018-06-06 09:28:59,187 devappserver2.py:120] Skipping SDK update check.
Traceback (most recent call last):
File "C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\dev_appserver.py", line 96, in <module>
_run_file(__file__, globals())
File "C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\dev_appserver.py", line 90, in _run_file
execfile(_PATHS.script_file(script_name), globals_)
File "C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py", line 454, in <module>
main()
File "C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py", line 442, in main
dev_server.start(options)
File "C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py", line 163, in start
bool(ssl_certificate_paths), options)
File "C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\metrics.py", line 166, in Start
self._cmd_args = json.dumps(vars(cmd_args)) if cmd_args else None
File "C:\Python27\lib\json\__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "C:\Python27\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <_sre.SRE_Pattern object at 0x00000000063C2188> is not JSON serializable
It looks like it is an issue with the google analytics code built into dev_appserver2 (google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py on or around line 316). It wants to send all of your command line options to google analytics. If you remove the analytics client id by adding the command line option --google_analytics_client_id= (note: '=' without any following value) the appserver won't call the google analytics code where it is trying to JSON serialize an SRE object and failing. However, since you are on Windows I find that the --watcher_ignore_re does not work anyway even when you get past this issue.
There is a comment in file_watcher.py
TODO: b/33178251 - Add watcher_ignore_re support for windows.
I also faced with this usability problem on Windows and was really disappointed. I tried to find some workarounds but I hadn't found any appropriate way.
In the end, I decided to make my own implementation of support watcher_ignore_re for Windows. I put required changes in my Github repo
If describe them in several words:
Add _watcher_ignore_re, _skip_files_re properties and its setters
Add import statement from google.appengine.tools.devappserver2 import watcher_common and use it in newly created def _path_ginored
Filter additional_changes before adding them to watcher changed files
For resolving mentioned problem with not serializable regex attribute we should drop them from the serialized dictionary. Fix for this is added as consequent commit and can be checked at metrics.py:185-193.
I hope it helps other guys enjoy developing on GAE on Windows :)

pycparser ParseError

I am trying to creat AST usinfg pyCparser,
The following error printed:
Traceback (most recent call last):
File "C:\Work\RE\Tools\VarsExporter\BuildExportedDb.py", line 1076, in
main()
File "C:\Work\RE\Tools\VarsExporter\BuildExportedDb.py", line 1032, in main
ast = parse_file(i_file)
File "C:\Python27\lib\site-packages\pycparser_init_.py", line 93, in
parse_file
return parser.parse(text, filename)
File "C:\Python27\lib\site-packages\pycparser\c_parser.py", line 152, in
parse
debug=debuglevel)
File "C:\Python27\lib\site-packages\pycparser\ply\yacc.py", line 331, in
parse
return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
File "C:\Python27\lib\site-packages\pycparser\ply\yacc.py", line 1199, in
parseopt_notrack
tok = call_errorfunc(self.errorfunc, errtoken, self)
File "C:\Python27\lib\site-packages\pycparser\ply\yacc.py", line 193, in
call_errorfunc
r = errorfunc(token)
File "C:\Python27\lib\site-packages\pycparser\c_parser.py", line 1761, in
p_error
column=self.clex.find_tok_column(p)))
File "C:\Python27\lib\site-packages\pycparser\plyparser.py", line 67, in _
parse_error
raise ParseError("%s: %s" % (coord, msg))
ParseError: Objectffly\SerDb.i:43:18: before: __loff_t
What causes the above issue? How Can I handle it?
Any suggestions how can I debug it, and find out what's going on?
From pyCparser git FAQ:
C code almost always #includes various header files from the standard C library, like stdio.h. While (with some effort) pycparser can be made to parse the standard headers from any C compiler, it's much simpler to use the provided "fake" standard includes in utils/fake_libc_include. These are standard C header files that contain only the bare necessities to allow valid parsing of the files that use them.
To solve the issue I have successfully used the method described here.

NBT Parser Minecraft mca file not a gzipped file error

I try to read a Minecraft world with Python from the filesystem and the .mca region/anvil files using the NBT 1.4.1 module (Named Binary Tag Reader/Writer), which is supposed to read the NBT format used in Minecraft. It works fine for files such as level.dat, but throws an error for the region files such as r.0.0.mca
Edit: I am referring to the auto generated world files that minecraft stores in the .minecraft/saves/"MyWorld"/ folder. Such as the level.dat (which works), and the mca files stored in the .minecraft/saves/"MyWorld"/region/ folder such as r.0.0.mca which don't work. I uploaded two sample files from one of my worlds.
Code:
from nbt import nbt
level_file = nbt.NBTFile("level.dat", "rb") # works
region_file = nbt.NBTFile("r.0.0.mca", "rb")# does not work
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 508, in __init__
self.parse_file()
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 532, in parse_file
type = TAG_Byte(buffer=self.file)
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 85, in __init__
self._parse_buffer(buffer)
File "/usr/local/lib/python3.5/dist-packages/nbt/nbt.py", line 90, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "/usr/lib/python3.5/gzip.py", line 274, in read
return self._buffer.read(size)
File "/usr/lib/python3.5/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.5/gzip.py", line 461, in read
if not self._read_gzip_header():
File "/usr/lib/python3.5/gzip.py", line 409, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Any suggestions how to get this working?
r.0.0.mca is most definitely not compressed. About 80% of the bytes are zeros.
It turns out that the NBT library only supports .mcr region files which have been replaced by .mca files about 6 years ago. However, mcedit is written in Python and supports those files. Due the changes in the Minecraft save format, the interpretation of the content needs to be adjusted though, but the files can be successfully read.

Categories