I need my Python program to upload files (large reports) to services like RapidShare, Megaupload, or easy-share, and grab the URL the site gives me (to then forward to the user).
What's the easiest way? (I think Selenium, but maybe it's overkill.)
What's the fastest? (Can I do it with mechanize?)
How would you do it?
Thanks in advance.
I would attack this with Selenium; even though it's quite heavyweight, I think the ease of it is worth it.
I would do what you need to do (upload a file to the service) by hand while the Firefox plugin Selenium IDE records it. Then just export as Python and you have your code.
Selenium is a bit too slow, but the simplicity of this workflow is well worth it (IMHO).
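For illustration, an IDE recording exported and tidied into the modern WebDriver API might look roughly like this; the URL, field names, and selectors below are placeholders, not the real ones for any particular service:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://example-filehost.com/upload")  # hypothetical upload page

    # <input type="file"> elements accept a local path via send_keys()
    driver.find_element(By.NAME, "file").send_keys("/path/to/report.pdf")
    driver.find_element(By.ID, "upload-button").click()

    # Grab the download link the site shows once the upload finishes
    # (a real script would add an explicit wait here)
    link = driver.find_element(By.CSS_SELECTOR, "a.download-link")
    print(link.get_attribute("href"))
finally:
    driver.quit()
```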
You might check first whether the sites in question have an API meant for this sort of thing. easy-share, for example, does (the others are blocked to me at the moment, so I haven't checked those): http://www.easy-share.com/be/developers.html (and they even have a ready-made Python module available).
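If a service does offer an HTTP API, plain requests is usually enough. A minimal sketch, assuming a hypothetical endpoint that accepts a multipart POST and returns JSON with the download URL (the endpoint path, field name, and response key are all assumptions; check the service's developer docs):

```python
import requests

# Hypothetical endpoint; consult the service's developer documentation
with open("report.pdf", "rb") as f:
    resp = requests.post(
        "https://api.example-filehost.com/upload",
        files={"file": f},
    )
resp.raise_for_status()
print(resp.json()["download_url"])  # assumed response key
```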
What even am I doing
So, as Minecraft Java has been slowly switching over to using Microsoft-based accounts instead of solely Mojang accounts, I have been trying to put together an authentication method for a small launcher project I've been working on.
The First Issue.
I've been following a piece of documentation here, which had instructions on what GET and POST requests to send to which URLs, and how to parse them, etc. It's worked pretty well, except for The First Issue.
It was a dark and stormy night, and the Microsoft authentication URL used JavaScript for redirects, so the Requests library I was using in Python could not follow them. There might be a way to parse the HTML content and find the redirects, but that is way above my head, because I am still new to Python.
So I looked around for a solution that would let me follow the JavaScript redirects, and the best solution (in concept) looked to be using a headless browser. This led me down a long path until I came face to face with The Second Issue.
The Second Issue.
I looked around for a headless browser that I could use, and I found a couple:
Selenium, or
PyQT WebEngine or WebKit
(I know there are lots of others but I chose these and used them for examples)
From here, the issue isn't so much something to fix as it is that I don't know what I'm doing.
I looked into Selenium, and it looked promising, but the fact that I had to download a WebDriver confused me in terms of how I would package that, since this is going to be used for a distributed application.
I then looked into PyQt WebEngine, and it just confused me in all respects, so basically I would need some info on how to use it. I also don't want to have to use PyQt to launch a window, design my UI, or anything else; I'm already planning to use Kivy for the GUI. I just need a headless browser, or some other solution, to follow JavaScript redirects when sending a POST request to a certain URL.
So,
From here, I just want to ask advice on which route I should take, since there seems to be a broad range of options. I've already mentioned what I need, so any advice on what I should use, in terms of headless browsers, libraries, etc., would be welcome.
Also if anyone has any other suggestions for how to authenticate a Microsoft account, please let me know.
I'm almost done
If there is anything I can answer or clarify, just let me know. I would highly appreciate any advice or suggestions.
Thanks,
Pyrotex7
Well, to resolve this: I just went with PyQt in the end, after messing around for a while.
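For anyone landing here later, a minimal sketch of the approach, assuming PyQt5 with the QtWebEngine bindings installed; the auth URL is a placeholder, and watching for code= in the redirect is an assumption based on a typical OAuth flow, not a confirmed detail of the Microsoft endpoints:

```python
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

AUTH_URL = "https://example.com/oauth/authorize"  # placeholder auth URL

app = QApplication(sys.argv)
page = QWebEnginePage()  # a page without a view renders off-screen (headless)

def on_url_changed(url):
    # Fires on every redirect, HTTP or JavaScript-driven; stop once the
    # URL looks like the final redirect carrying the authorization code.
    text = url.toString()
    if "code=" in text:  # assumed success condition
        print("final redirect:", text)
        app.quit()

page.urlChanged.connect(on_url_changed)
page.load(QUrl(AUTH_URL))
app.exec_()
```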
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be AJAX requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people do this with Greasemonkey scripts: let Firefox render everything and run the JavaScript, then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (JavaScript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to manage sessions, etc., as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing and, instead of trying to scrape the page, issue the HTTP requests the JavaScript issues and interpret the results yourself (most likely in JSON format, which is nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
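For example, once the browser's network tools (Firebug, in that era) reveal the endpoint an XHR hits, replaying it can be a few lines; the URL, parameters, and header below are placeholders for whatever the page actually sends:

```python
import requests

# Hypothetical endpoint discovered by watching the page's XHR traffic
resp = requests.get(
    "https://example.com/api/results",
    params={"state": "IA", "race": "D"},             # placeholder query parameters
    headers={"X-Requested-With": "XMLHttpRequest"},  # some endpoints check this
)
resp.raise_for_status()
print(resp.json())  # many such endpoints return JSON, as noted above
```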
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools, since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be way easier than reverse-engineering the JavaScript/AJAX calls. If needed, you can also use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to all the stuff inside the browser out of the box, plus you don't need to care about cookies, SSL, and so on.
I found the IE WebBrowser control has all kinds of quirks and needs workarounds, enough to justify some high-quality software that takes care of all those inconsistencies, layers around the shdocvw.dll API and MSHTML, and provides a framework.
This seems like a pretty common problem. I wonder why no one has developed a programmatic browser. I'm envisioning a Firefox you can call from the command line with a URL as an argument: it would load the page, run all of the initial page-load JS events, and save the resulting file.
I mean, Firefox and other browsers already do this; why can't we simply strip off the UI?
I have been wondering about this, since I could really benefit from a program that performs actions on the websites I use for my job that require the same commands over and over again.
I know some python and I love to learn new things.
I tried looking for it on Google, but I guess I'm not sure how to search for it.
I would love it if you could direct me to a guide or something like that.
Thank you very much!
Selenium interacts with a web browser directly, although you can hide the browser window in code (look up Selenium's --headless mode; there's a minimal sketch below). This is a good choice for filling out a lot of forms or interacting with graphical user interface elements.
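A minimal sketch of headless Selenium with Chrome (assuming a recent Selenium release that can locate the driver itself):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    print(driver.title)
finally:
    driver.quit()
```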
However, if you need to request information from websites, you don't always need to interact with the web browser directly. You can use the package called Requests. This doesn't depend on any web browsers and can run silently in the background.
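For comparison, the equivalent no-browser request is just:

```python
import requests

resp = requests.get("https://example.com/data")  # placeholder URL
resp.raise_for_status()
print(resp.text[:200])  # no browser process involved
```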
I think you can do it with Python and packages like Selenium. You'll also need some HTML knowledge to search the HTML source of the specific webpage.
I found an interesting use case, maybe that helps you:
https://towardsdatascience.com/controlling-the-web-with-python-6fceb22c5f08
I'm trying to use Python to automatically upload, submit, and retrieve files on websites that do sequence processing.
Example: https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi
Does anyone know the best way to do this, whether it be specific modules or tutorials? Would this work with the requests module? Thanks a bunch in advance.
The example looks like an older system, but if at all possible I would suggest automating via an API, given your "retrieval" requirement, before considering Selenium.
However, if you find yourself using Python with Selenium Webdriver, save yourself some setup effort and check out SeleniumBase.
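As a rough sketch of what that could look like with SeleniumBase (the selectors and file path are assumptions; inspect the actual form before relying on them):

```python
from seleniumbase import BaseCase

class SequenceUpload(BaseCase):
    def test_submit_sequence(self):
        self.open("https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi")
        # Selector and path are assumptions; inspect the real form first
        self.choose_file("input[type='file']", "/path/to/sequence.fasta")
        self.click("input[type='submit']")
        # Retrieving results will likely need polling; details depend on the site
```

Run it with pytest; SeleniumBase handles the driver setup for you.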
Also, if there is a budget associated with this project and you want vendor support, UiPath RPA is worth checking out.
I suggest you use Selenium.
You can use it with different web browsers.
As the task is sequence processing, it should be simple.
Regards!
I am trying to develop a Chrome extension in which I have coded all my logic in Python. On a browser action I want to pass parameters, execute that .py file, and return the results to the popup that opens on the browser action. To call a .py file from JavaScript, I know I will need to code an NPAPI plugin, but I am confused about which approach to take. I have come across a few options and am trying to choose the easiest way to do it:
Pyjamas Python Javascript Compiler: a Python-to-JavaScript compiler which works as a language translator, but the last question in the FAQ on their site suggests it will not run on Chrome. (http://pyjs.org/)
FireBreath: a framework that allows easy creation of powerful browser plugins. (http://www.firebreath.org)
pyplugin, a Python NPAPI plugin for XULRunner: it allows you to build cross-platform graphical user interfaces using XUL and Python. (http://pyplugin.com)
Please guide me to the easiest way that will allow me to pass parameters to the .py file, execute it, and receive the returned results.
Thanks
Well, the Pyjamas Python Javascript Compiler will not be complete: not all Python features are available in JavaScript, so it's impossible to convert all Python to JavaScript. This may or may not do what you want, but I don't think it happens "on the fly"; I think you have to write things on the desktop and run them through the "compiler" to get JavaScript out the other side.
FireBreath is the most awesomely amazing thing to ever hit the Internet (I should know, since I wrote it), and it will absolutely allow you to do what you want, but you'll have to know how to tie into Python from C++. That said, you could probably use Boost.Python, which is included in the subset of Boost that comes with FireBreath, though I've never used it, so I don't know. You can do pretty much anything you want with an NPAPI plugin, but you'll want to be really careful about security concerns.
A quick glance makes it look like pyplugin is basically what you'd be writing in FireBreath, but as a raw NPAPI plugin. If it will do what you want, it's probably the easiest way to go. It's designed to be used with XUL, which may be a problem, since Chrome doesn't support XUL. You might also be able to modify it (since it's GPL) to do what you want; of course, if you weren't planning to release your source, that could be a problem.
The quickest way to solve your problem? Well, you'll have to decide; it'll take some more research, but I hope this is enough to at least get you started. Good luck!