Convert PDF to HTML without losing any format - python

I'm developing a Python Flask webapp and I'm trying to convert some user uploaded pdfs to nicely formatted HTML, like the HTML that is being produced when you display a pdf inside an iframe.
I have tried several things so far:
the pdfminer.six library, which produced messy HTML;
grabbing the HTML that pdf.js produces when it renders a PDF, which is apparently hidden in a Shadow DOM with no access to its inner HTML;
finally, I came across pdf2htmlEX (https://github.com/pdf2htmlEX/pdf2htmlEX), which produced exactly what I wanted.
Locally, this solution worked great, but in production (Heroku) I was unable to install it correctly. The project is deprecated and its documentation is limited and poor. The problem has something to do with broken dependencies.
So, how to convert PDFs to HTML effectively without losing any format using Python or any other tool?
Thanks a lot.
If anyone is willing to help me get pdf2htmlEX working on Heroku, leave a comment and I will post more details in a separate post.

This is not going to be trivial. But I'll give some pointers.
You need an app.json in which you define your buildpacks.
https://devcenter.heroku.com/articles/app-json-schema#buildpacks
If this project is available via apt, it's going to be easy: use Heroku's Apt buildpack and define an Aptfile that lists the packages it needs to install.
The buildpack then installs them automatically and you are done.
If it is not available as a package you will need to create your own buildpack.
https://devcenter.heroku.com/articles/buildpack-api
Example used here.
Another solution is to dockerize your project and execute it as a docker container.
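Whichever route gets the binary onto the server, the Flask side can stay small. Here is a minimal sketch of calling pdf2htmlEX from Python, assuming the executable is on PATH and that its command line takes the form `pdf2htmlEX --dest-dir <dir> <input.pdf> <output.html>`. The function names and the timeout are hypothetical choices, not part of the project.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, binary="pdf2htmlEX"):
    """Return the argument list for one conversion (assumed CLI layout)."""
    pdf_path = Path(pdf_path)
    # Assumed usage: pdf2htmlEX [options] <input.pdf> [<output.html>]
    return [binary, "--dest-dir", str(out_dir), str(pdf_path), pdf_path.stem + ".html"]


def convert_pdf(pdf_path, out_dir, binary="pdf2htmlEX", timeout=120):
    """Convert one PDF and return the path of the generated HTML file."""
    if shutil.which(binary) is None:
        # Fail early with a readable message if the buildpack/Docker step is missing.
        raise RuntimeError(f"{binary} is not installed or not on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_command(pdf_path, out_dir, binary)
    # check=True raises CalledProcessError if the converter exits non-zero.
    subprocess.run(cmd, check=True, timeout=timeout, capture_output=True)
    return out_dir / cmd[-1]
```

Passing the arguments as a list (rather than a shell string) keeps user-supplied file names from being interpreted by a shell, which matters for uploaded files.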

Related

How to run Github code (with python) automatically on MyBinder or Google Colab without downloading the Sample code?

MyBinder and Colaboratory let people run our examples from the website directly in their browser, without any download required.
When I work on Binder, loading our data takes a very long time, so I need to run the Python code on the website directly.
I'm not sure whether I totally get the question. If you want to avoid having to download the data from another source, you can add the data to the Git repo that you use to start Binder. It should look something like this: https://github.com/lschmiddey/book_recommender_voila
However, if your dataset is too big to be uploaded to your git repo, you have to get the data onto the provided Binder server somehow. So you usually have to download the data onto your Binder Server so that other users can work on your notebook.
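If the data does have to be fetched at run time, one way to keep the cost down is to download it only once per session. A minimal sketch, assuming the dataset is reachable at a single URL; `ensure_dataset` and the example URL are hypothetical names for illustration.

```python
import urllib.request
from pathlib import Path


def ensure_dataset(url, dest):
    """Download url to dest unless a copy is already there; return the path."""
    dest = Path(dest)
    if dest.exists():
        # Already fetched earlier in this session: skip the slow download.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so a failed transfer never
    # leaves a half-written file at the final location.
    tmp = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, tmp)
    tmp.replace(dest)
    return dest
```

In a notebook's first cell this could be called as `ensure_dataset("https://example.com/data.csv", "data/data.csv")`, so re-running the cell does not repeat the download.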

Hosting Jupyter Notebook Python App on a Webpage

I've made a simple python GUI application using ipywidgets, ipycanvas, and numpy. I made the program on Jupyter notebook as an ipynb file. I would now like to take my application and put it on a webpage. What is the best way to take this Jupyter notebook app and host it on the web?
I've looked a bit into Binder and Django, but I can't seem to find enough resources or documentation on the net to help me learn how to do this.
If you already have it working as a Jupyter notebook (.ipynb file), I'd suggest sticking with that as the core item for now. I'd suggest getting it running via MyBinder.org based on either this example repo or this one, or a combination of the two.
This video is recent and a good reference for many of the steps of setting up a repo with your content.
You essentially make a copy of the Binder templates under your control and then edit them to have your content. You adapt the URLs that trigger launches so that when you share the link, they launch a session via MyBinder.org with your content. Most often the steps can be performed right in the GitHub browser-based interface without you needing to use git or work locally. If you need something fancier, you may have to move to more complex configuration file set-ups, and those may necessitate some use of git and local editing.
If you hit some technical road blocks, post your questions here using the 'questions' category as suggested in this post about 'Debugging your Binder'.
Once the basics of sharing the notebook or appmode version are working with your own content, you may want to check out Voila or some of the other ways you can share a Jupyter notebook-based app discussed here.
Jupyter itself is built on the Tornado web framework.
There are many bindings to other popular web frameworks.
I tried this once, and I found that pyramid-notebook is easy to use.
For a quick build I recommend Binder. This is how you can quickly set up Binder with voila:
Checkout this Git Repo: https://github.com/lschmiddey/book_recommender_voila
In combination with this blogpost: https://lschmiddey.github.io/fastpages_/2020/09/28/Build-binder-app-Part4.html

Having trouble with download_app and coursebuilder

I inherited an appengine coursebuilder project a couple of months ago, and we've been trying to upgrade to a more recent version of coursebuilder. In order to do this, the first step is to download a local version of the course.
Whenever I run appcfg.py download_app -A $projectID -V $versionNumber ./folderToSaveTo
It downloads a different version of the course, one that looks like an old test version with old placeholder text, all links to lessons set to private, etc.
When I look at the versions of the course in the appengine dashboard, there is only a single version, so I'm not sure what it's even downloading.
Alternatively, it was suggested I use the ETL tool provided with coursebuilder to download the files instead, but that had a bunch of other issues associated with it as well. Previously I had asked the question directly on the coursebuilder forum where the ETL tool was initially suggested.
Thanks in advance for any help,
-Tyler Nolan
appcfg's download_app will only look within the default module. You should check if there are any drop-downs in the Developers Console UI which will allow you to look into whether there are other modules.
gcloud preview app modules download on the other hand, does allow you to specify modules.
Hopefully this helps you find the "real version" of your app.
It's also possible that what you download is displaying default data because it's not being viewed in a manner which is properly connected to the database, so it falls back to look like that.

How to deploy codenode

Aloha everyone,
So I was hoping to deploy codenode 'http://codenode.org/', on my website nested within a page. For the life of me I just can't follow the documentation and figure out what I'm supposed to do.
It only ever seems to talk about running things locally from the terminal, how are you supposed to set it up with regards to views, models and templates?
Thanks in advance.
They're simply telling you to install it via pip and virtualenv. This isn't terribly difficult to do on a host that is very Django and Python friendly, such as WebFaction. You can always put the necessary files where they need to go so that they will be added to your Python path via FTP, etc.

Checking repository and updating

I'm making a game using the (very) old Python library, PyGame. But that's not what I'm here to ask about.
How do I write code that checks the repository on a server holding the latest build, compares that build with the installed one, and, if it is newer, prompts the user to download (or decline) the update? The game will be developed in multiple versions, and players should be able to update gradually as we make changes.
Like Minecraft does when an update comes out and it prompts you to update... but in Python.
There are 3 things you need for this:
A server where you will store all the information about the updates and versions.
It can be a web server (for Python, see flask, web.py, django, pylons, etc., or PHP or whatever) which can have a single page.
It will take the current version as input (GET/POST requests) and output the updates available (in a format that can be parsed, JSON preferably, or XML or just plain text).
These can be fetched from a database (see MySQL, postgresql, or any ORM that works with your choice of web server, sqlalchemy)
Or by checking the names of the files available on the server (if the files will be hosted on the same web server) (the names will have a pattern XXX-r24-20121224.tar.gz and you'll check the list of files with glob or something).
A piece of code that will query the server every time you start your game to check with the web server if there are updates. You can use requests or urllib2 for example.
A piece of code that will download and update your actual game.
The web server should give you a link to where the update file is
From there you will have to download it (with requests or urllib2)
Unzip it (using zipfile or tarfile) and replace your actual files with that.
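The server's answer and the client's check described above can be sketched as follows. This is only an illustration under assumptions: builds are named like XXX-r24-20121224.tar.gz, the revision number after `-r` decides which build is newer, and the JSON field names (`update`, `revision`, `url`) are made up for this example. The actual HTTP request (with requests or urllib2) is left out so the logic stays self-contained.

```python
import json
import re

# Assumed naming pattern: <name>-r<revision>-<yyyymmdd>.tar.gz
BUILD_RE = re.compile(r"^(?P<name>.+)-r(?P<rev>\d+)-(?P<date>\d{8})\.tar\.gz$")


def parse_revision(filename):
    """Return the revision number in a build file name, or None if it does not match."""
    match = BUILD_RE.match(filename)
    return int(match.group("rev")) if match else None


def latest_build(filenames):
    """Return (revision, filename) for the highest revision, or None if there is none."""
    builds = [(parse_revision(name), name) for name in filenames]
    builds = [build for build in builds if build[0] is not None]
    return max(builds) if builds else None


def update_response(current_rev, filenames, base_url):
    """Server side: the JSON text the single update page could return."""
    latest = latest_build(filenames)
    if latest is None or latest[0] <= current_rev:
        return json.dumps({"update": False})
    rev, name = latest
    return json.dumps({"update": True, "revision": rev, "url": f"{base_url}/{name}"})


def check_for_update(response_text):
    """Client side: return the download URL if an update is offered, else None."""
    data = json.loads(response_text)
    return data["url"] if data.get("update") else None
```

On the server, `filenames` could come from `glob` over the download directory or from a database query; on the client, `response_text` would be the body of the HTTP response, and a non-None result is the point at which to prompt the player.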
Now it all depends on how your files are laid out:
If you're distributing the source code, what you could do is build it all in a package, and then you just replace the whole package.
The zipfile module and Python itself account for that: zipfile gives you an option to put only the Python files in the zip file, and Python lets you add that zip file to the PYTHONPATH and import directly from it.
If you're compiling it with py2exe or anything, it'll be a different issue: you might be able to only update one zip file, or replace the actual DLLs and stuff, which might be a big mess.
If it's a deb package or similar, you might want to use that to update, and ask the user to do it or something.
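For the source-distribution case, importing straight from a zip file can be tried out in a few lines. This is a small sketch, not a full updater: the archive name, the module name `game_version`, and its `BUILD` constant are all hypothetical, and the archive is created on the spot only so the example is self-contained.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def make_package_zip(zip_path, modules):
    """Write {file_name: source_code} pairs into a zip archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for name, source in modules.items():
            archive.writestr(name, source)


def import_from_zip(zip_path, module_name):
    """Put the archive first on sys.path and import a module from it."""
    sys.path.insert(0, str(zip_path))
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Stand-in for a downloaded update: one module inside one archive.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "game_update.zip"
make_package_zip(archive_path, {"game_version.py": "BUILD = 24\n"})

game_version = import_from_zip(archive_path, "game_version")
print(game_version.BUILD)
```

With this layout, updating the game amounts to replacing one zip file before the next start, which is simpler than overwriting many individual source files.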
I hope this helps, even if it's very abstract. This had to be done :)
Now I'll give my own (biased) opinion: if you already have a website running, add a single page to it for this. Otherwise I'd recommend a free host that allows you to set up a website using Flask. I'd recommend that because it is very easy to get running in no time, and it lets you use the great ORM SQLAlchemy. Also, I wouldn't bother with more than telling the user there is a new version and letting them figure out where to get it, unless you are distributing the game in only one standard way everywhere.
