Custom Import in AWS developer console, beautifulsoup4 - python

This question was asked before but never answered well:
I need beautifulsoup4 to scrape through a website's HTML and get information. I want to use that information in my Alexa skill.
How do I import/use bs4 in my Alexa developer console?
I've already read how to make a deployment package (https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html), but I don't understand how to download/zip bs4.
I am new to Python, AWS, and the Alexa developer console, so I am sorry if this question is very easy to answer.
I tried to create a zipped folder named lambda and upload it under "upload code", but running my skill with import bs4 just errors.
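For what it's worth, the usual pattern is to install beautifulsoup4 into a local folder with pip's --target option, copy your handler file next to it, and zip the contents of that folder (not the folder itself) before uploading. A minimal sketch of what such a handler could look like, assuming a hypothetical lambda_function.py at the zip root and an example URL:

```python
# Packaging (run locally, then zip the *contents* of the folder):
#   pip install beautifulsoup4 -t ./package
#   cp lambda_function.py ./package/
#   cd package && zip -r ../skill.zip .
import urllib.request

from bs4 import BeautifulSoup  # resolves only if bs4 was zipped alongside this file


def lambda_handler(event, context):
    # Example URL; replace with the site your skill actually scrapes.
    html = urllib.request.urlopen("https://example.com").read()
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else "no title found"
    return {"scraped_title": title}
```

The "html.parser" backend is used here because it is pure Python, so it avoids the compiled-dependency problems that lxml can run into on Lambda.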

Related

Executing JavaScript on a page with urllib

I'm trying to create an automation for a site I need to access daily.
I can't install any python libraries at the moment.
I need to set a value in an input field and click a button on this page.
I already managed to get the page content with urllib, but I can't figure out how to control the page.
It feels like if I could execute JavaScript code on the page, my problem would be solved.
Is there a way to execute JavaScript with urllib?
Or any other way to automate a site without external libraries?
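For context, urllib cannot execute the page's JavaScript, but if the button simply submits a form, the same effect can often be reproduced by sending that request yourself. A minimal sketch under that assumption, with a hypothetical URL and field name:

```python
import urllib.parse
import urllib.request

# Hypothetical form endpoint and field name; inspect the real form in the
# browser's dev tools (action URL, input names, method) and adjust.
form_data = urllib.parse.urlencode({"my_input": "the value to set"}).encode()

request = urllib.request.Request(
    "https://example.com/form-action",  # the form's action URL, not the page URL
    data=form_data,                     # providing data makes this a POST
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(response.status, response.read()[:200])
```

If the page builds its requests dynamically in JavaScript, this approach only works after you identify the actual request in the browser's network tab and reproduce it.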

Convert PDF to HTML without losing any format

I'm developing a Python Flask web app and I'm trying to convert user-uploaded PDFs to nicely formatted HTML, like the HTML that is produced when you display a PDF inside an iframe.
I have tried several things so far:
the pdfminer.six library, which produced messy HTML;
trying to grab the HTML produced when rendering a PDF with pdf.js, which is apparently hidden in a shadow DOM with no access to its inner HTML;
finally, pdf2htmlEX (https://github.com/pdf2htmlEX/pdf2htmlEX), which produced exactly what I wanted.
Locally, this solution worked great; however, in production (Heroku) I was unable to install it correctly. The project is deprecated and the documentation is limited and terrible. The problem has something to do with broken dependencies.
So, how can I convert PDFs to HTML effectively, without losing any formatting, using Python or any other tool?
Thanks a lot.
If anyone is willing to help me get pdf2htmlEX to work on Heroku, leave a comment and I will post more details in a different post.
This is not going to be trivial. But I'll give some pointers.
You need an app.json in which you define your buildpacks.
https://devcenter.heroku.com/articles/app-json-schema#buildpacks
If this project is available via apt, it's going to be easy. You just use Heroku's Apt buildpack and define an Aptfile that lists which packages need to be installed.
Heroku then installs them automatically and you are done.
If it is not available as a package, you will need to create your own buildpack:
https://devcenter.heroku.com/articles/buildpack-api
Another solution is to dockerize your project and run it as a Docker container.
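Once pdf2htmlEX is installed on the dyno (via whichever route above), the Flask app would typically shell out to the binary. A minimal sketch, assuming the pdf2htmlEX executable is on PATH and using hypothetical paths:

```python
import subprocess
from pathlib import Path


def convert_pdf_to_html(pdf_path: str, out_dir: str) -> Path:
    """Convert a PDF to HTML by calling the pdf2htmlEX binary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The output file name defaults to the PDF's name with an .html extension.
    subprocess.run(
        ["pdf2htmlEX", "--dest-dir", str(out), pdf_path],
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return out / (Path(pdf_path).stem + ".html")
```

The executable name and available flags can differ depending on how the package was built, so verify them with pdf2htmlEX --help on the target system.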

Use beautifulsoup4 in Alexa developer console

I need beautifulsoup4 to scrape through a website's HTML and get information. I want to use that information in my Alexa skill.
How do I import/use bs4 in my Alexa developer console?
I've already read how to make a deployment package (https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html), but I don't understand how to download/zip bs4.
I am new to Python, AWS, and the Alexa developer console, so I am sorry if this question is very easy to answer.
Kind regards,
Dany
I think you'll find all you need at this documentation URL: https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html

Python AWS Lambda Web scraper

I need to scrape a URL and send the content to my Lambda function. I am trying to achieve this by packaging BeautifulSoup with my Lambda function, but I am getting import errors like: cannot import name 'CharsetMetaAttributeValue', etc. I am not sure whether bs4 can be used in the AWS environment or not.
Any suggestions would be helpful.
I am able to use a compiled lxml package with a Lambda function for a web scraper. The compiled package is available on GitHub at this link: https://github.com/JFox/aws-lambda-lxml/tree/master/3.6.4
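For context, that import error often points to a mismatched or incompletely packaged bs4 rather than Lambda rejecting bs4 as such. With the compiled lxml above dropped into the deployment package alongside bs4, a handler might look roughly like this (example URL; only a sketch):

```python
import urllib.request

from bs4 import BeautifulSoup  # bs4 itself must still be zipped into the package


def _make_soup(html):
    # Prefer the compiled lxml parser; fall back to the pure-Python parser
    # if lxml did not make it into the deployment package.
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:
        return BeautifulSoup(html, "html.parser")


def lambda_handler(event, context):
    # Example URL; replace with the page you actually need to scrape.
    html = urllib.request.urlopen("https://example.com").read()
    soup = _make_soup(html)
    links = [a.get("href") for a in soup.find_all("a")]
    return {"links": links}
```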

Download a file from a website which requires authentication using Python

I am trying to download a CSV file from a website using Python 2.7. I already found some postings about how to retrieve files:
http://www.blog.pythonlibrary.org/2012/06/07/python-101-how-to-download-a-file/
How do I download a file over HTTP using Python?
However, the site that I am trying to access requires authentication: an ID and password. I was wondering if anyone out there might share an example of how to download a file behind an authentication barrier.
You can use the requests module; see its documentation.
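A minimal sketch of that suggestion, assuming the site uses HTTP basic authentication, with a hypothetical URL and credentials (this usage of the requests API is the same on Python 2.7 and 3):

```python
import requests

# Hypothetical values; replace with the real file URL and your credentials.
CSV_URL = "https://example.com/reports/data.csv"
USERNAME = "my_id"
PASSWORD = "my_password"

# Works when the site uses HTTP basic authentication. If the site uses a
# login form instead, POST the credentials with a requests.Session() first
# and then GET the file with the same session.
response = requests.get(CSV_URL, auth=(USERNAME, PASSWORD))
response.raise_for_status()

with open("data.csv", "wb") as f:
    f.write(response.content)
```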
