I'm a developer for a large GUI app, and we have a web site for bug tracking. Anybody can submit a new bug to the bug tracking site. We can detect certain failures from our desktop app (e.g. an unhandled exception), and in such cases we would like to open the submit-new-bug form in the user's default browser, pre-filling some form fields with whatever information we can gather about the failure. The submit-new-bug form can be retrieved with either the GET or the POST HTTP method, and we can provide default field values to it. So on the HTTP server side everything is pretty much OK.
So far we can successfully open a URL, passing the default values as GET parameters, using the webbrowser module from the Python Standard Library. This method has some limitations, however, such as the maximum URL length allowed by some browsers (especially MS IE). The webbrowser module doesn't seem to have a way to request the URL using POST. OTOH there's the urllib2 module, which provides the kind of control we want, but AFAIK it cannot open the retrieved page in the user's preferred browser.
Is there a way to get the mixed behavior we want (the fine control of urllib2 with the higher-level functionality of webbrowser)?
PS: We have thought about retrieving the URL with urllib2, saving its content to a temp file and opening that file with webbrowser. This is a somewhat nasty solution, and we would have to deal with other issues such as relative URLs. Is there a better solution?
This is not a proper answer, but it also works:
import os
import requests
import webbrowser
url = "https://www.facebook.com/login/device-based/regular/login/?login_attempt=1&lwv=110"
myInput = {'email': 'mymail@gmail.com', 'pass': 'mypaass'}
x = requests.post(url, data=myInput)
# Overwrite ("w") rather than append, so repeated runs don't stack pages.
with open("home.html", "w") as f:
    f.write(x.text)
# Open the file that was just written, wherever the script was run from.
webbrowser.open('file://' + os.path.abspath("home.html"))
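If you do go the temp-file route, the relative-URL problem raised in the question can usually be handled by injecting a <base> element into the saved HTML, so that the browser resolves relative links against the original server. A minimal sketch, assuming Python 3; the names add_base_href and open_html_in_browser are hypothetical helpers, not part of any library:

```python
import re
import tempfile
import webbrowser
from pathlib import Path


def add_base_href(html, base_url):
    """Insert <base href=...> right after the opening <head> tag, if any."""
    tag = '<base href="%s">' % base_url
    # Match the first <head> or <head attr=...> tag, but not <header>.
    patched, count = re.subn(r'<head(?:\s[^>]*)?>',
                             lambda m: m.group(0) + tag,
                             html, count=1, flags=re.IGNORECASE)
    # No <head> found: prepend the tag and let the browser cope.
    return patched if count else tag + html


def open_html_in_browser(html, base_url):
    """Write the patched page to a temp file and hand it to the default browser."""
    # delete=False: the file has to survive until the browser has read it.
    with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                     encoding='utf-8') as f:
        f.write(add_base_href(html, base_url))
    webbrowser.open(Path(f.name).as_uri())
```

This only fixes link resolution; form submissions from a file:// page may still be subject to whatever cross-origin checks the server applies.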
I don't know of any way you can open the result of a POST request in a web browser without saving the result to a file and opening that.
What about taking an alternative approach and temporarily storing the data on the server? The page could then be opened in the browser with a simple id parameter, and the saved, partially filled form would be shown.
You could use tempfile.NamedTemporaryFile():
import tempfile
import webbrowser
import jinja2
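t = jinja2.Template('hello {{ name }}!')  # you could load the template from a file
# Text mode so a str can be written; the file is deleted when f is closed or
# garbage-collected, so keep f alive until the browser has loaded the page.
f = tempfile.NamedTemporaryFile('w', suffix='.html')
f.write(t.render(name='abc'))
f.flush()
webbrowser.open_new_tab('file://' + f.name)  # returns immediately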
A better approach, if the server can be easily modified, is to make a POST request with the partial parameters using urllib2 and then open the URL generated by the server using webbrowser, as suggested by @Acorn.
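A rough sketch of that flow, written for Python 3 (where urllib2 became urllib.request). The endpoint URL and the shape of its JSON reply are assumptions made purely for illustration; the real server would define both:

```python
import json
import urllib.parse
import urllib.request
import webbrowser

# Hypothetical endpoint: accepts form fields via POST, stores them as a draft,
# and answers with JSON such as {"url": "https://bugs.example.com/new?draft=42"}.
DRAFT_ENDPOINT = "https://bugs.example.com/drafts"


def build_draft_request(fields, endpoint=DRAFT_ENDPOINT):
    """Build a POST request carrying the failure details as form data."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, method="POST")


def report_failure(fields):
    """Store the draft on the server, then open the short URL it returns."""
    with urllib.request.urlopen(build_draft_request(fields), timeout=10) as resp:
        draft_url = json.load(resp)["url"]
    webbrowser.open(draft_url)
```

Because the browser only ever sees the short draft URL, the URL length limit stops being a concern no matter how long the stack trace is.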
I want to build an API that accepts a string and returns HTML code.
Here is my scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import requests
import time
url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)
string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E PS/JPIX8U"
button = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
button.send_keys(string) #accept string
button.submit()
time.sleep(5)
soup = bs4.BeautifulSoup(browser.page_source,'html.parser')
html = soup.find('div',class_="main-content") #returns html
print(html)
Can anyone tell me the best way to wrap my code up as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site. Have your site return a 302 Redirect with their URL in the Location field in your header.
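A minimal sketch of such a redirect as a bare WSGI application, assuming Python 3 and no framework; redirect_app and TARGET are illustrative names:

```python
# Where visitors are sent; this is the site named in the question.
TARGET = "https://www.pnrconverter.com/"


def redirect_app(environ, start_response):
    """Answer every request with a 302 whose Location header is TARGET."""
    start_response("302 Found", [("Location", TARGET),
                                 ("Content-Length", "0")])
    return [b""]
```

It can be tried locally with wsgiref.simple_server.make_server('127.0.0.1', 8000, redirect_app).serve_forever(); Flask or a similar framework would express the same thing with its own redirect helper.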
If what you're trying to do is get the response from this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately and not scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
So I'm trying to generate a PDF of a view that I have in a django web application. This view is protected, meaning the user has to be logged in and have specific permission to view the page. I also have some attachments (stored in the database as FileFields) that I would like to append to the end of the PDF.
I've read most of the posts I could find on how to generate PDFs from a webpage using pdfkit or reportlab, but all of them fail for me for some reason or another.
Currently, the closest I've gotten is successfully generating a PDF of the page using pdfkit, but this requires me to remove the restrictions that require the user to be logged in and have page permissions, which really isn't an option long term. I found a couple posts that discuss printing pdfs on protected pages and providing login information, but I couldn't get any of that to work.
I haven't found anything on how to include attachments, and don't really know where to start with that.
I'm more than happy to update this question with more information or snippets of code if need be, but there's quite a few moving parts here and I don't want to flood people with useless information. Let me know if there's any other information I should provide, and thanks in advance for any help.
I got it working! Through a combination of PyPDF2 and pdfkit, I got this to work pretty simply. It works on protected pages because django takes care of getting the complete html as a string, which I just pass to pdfkit. It also supports appending attachments, but I doubt (though I haven't tested) that it works with anything other than pdfs.
from django.template.loader import get_template
from PyPDF2 import PdfFileWriter, PdfFileReader
import pdfkit
def append_pdf(pdf, output):
    # copy every page of pdf into the writer
    for page_num in range(pdf.numPages):
        output.addPage(pdf.getPage(page_num))
def render_to_pdf():
    t = get_template('app/template.html')
    c = {'context_data': context_data}
    html = t.render(c)
    pdfkit.from_string(html, 'path/to/file.pdf')
    output = PdfFileWriter()
    append_pdf(PdfFileReader(open('path/to/file.pdf', "rb")), output)
    attaches = Attachment.objects.all()
    for attach in attaches:
        append_pdf(PdfFileReader(open(attach.file.path, "rb")), output)
    output.write(open('path/to/file_with_attachments.pdf', "wb"))
If you just want to secure it, you could write a custom Authentication Backend that lets your server spoof users. Way over-kill but it would solve your problem and at least you get to learn about custom auth backends! (Note: You should be using HTTPS.)
https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#writing-an-authentication-backend
1. Create an auth backend in app/auth_backends.py.
2. Add the app.auth_backends.SpoofAuthBackend backend, which takes a shared_secret and user_id, to settings.py.
3. Create a URL route like url(r'^spoof-user/(?P<user_id>\d+)/$', 'app.views.spoof_user', name="spoof-user").
4. Add the view spoof_user, which must call django.contrib.auth.authenticate (which invokes the backend from #1 above) and then, with the user returned by authenticate(...), call django.contrib.auth.login(request, user) to attach that user to the request. Finally, this view should return HttpResponseForbidden if the shared secret is wrong, or HttpResponseRedirect to the PDF URL you actually want (after logging in as the spoofed user programmatically via authenticate and login).
You would probably want to create a random secret key each request using something like cache.set('spoof-user-%s' % user_id, RANDOM_STRING, 30) which persists shared secret for 30 seconds to allow time for request. Then perform pdf_response = requests.get("%s?shared_secret=1a2b3c&redirect_uri=/path/to/pdf/" % reverse('spoof-user', kwargs={'user_id': 1234})). Your new view will test the provided shared_secret in auth backend, login user to request and perform redirect to request.GET.get('redirect_uri').
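The short-lived secret could be sketched as below. A plain dict stands in for Django's cache so that the example runs on its own; issue_secret and check_secret are hypothetical helper names, and in the real view the store would be cache.set / cache.get with a timeout as described above:

```python
import hmac
import secrets
import time

# Stand-in for Django's cache: maps user_id -> (secret, expiry timestamp).
_store = {}


def issue_secret(user_id, ttl=30):
    """Create a random secret for user_id that is valid for ttl seconds."""
    secret = secrets.token_urlsafe(32)
    _store[user_id] = (secret, time.monotonic() + ttl)
    return secret


def check_secret(user_id, candidate):
    """Validate candidate; the stored secret is discarded after any attempt."""
    secret, expires = _store.pop(user_id, (None, 0.0))
    if secret is None or time.monotonic() > expires:
        return False
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(secret.encode("utf-8"),
                               candidate.encode("utf-8"))
```

Discarding the secret on the first attempt, successful or not, means a leaked URL cannot be replayed within the 30-second window.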
You can use pdfkit to do that. You can retrieve the page using the url and pdfkit will handle the rest:
pdfkit.from_url('http://website.com/somepage', 'somepage.pdf')
Since the page is protected, you will of course have to send the appropriate session cookies with the request, passing them to pdfkit through options:
options = {
    'cookie': [
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ]
}
pdfkit.from_url('http://website.com/somepage', 'somepage.pdf', options=options)
I'm trying to use scrapy to scrape a site that uses javascript extensively to manipulate the document, cookies, etc (but nothing simple like JSON responses). For some reason I can't determine from the network traffic, the page I need comes up as an error when I scrape but not when viewed in the browser. So what I want to do is use webkit to render the page as it appears in the browser, and then scrape this. The scrapyjs project was made for this purpose.
To access the page I need, I had to have logged in previously, and saved some session cookies. My problem is that I cannot successfully provide the session cookie to webkit when it renders the page. There are two ways I could think to do this:
1. Use scrapy page requests exclusively until I get to the page that needs webkit, and then pass along the requisite cookies.
2. Use webkit within scrapy (via a modified version of scrapyjs) for the entire session, from login until I get to the page I need, and allow it to preserve cookies as needed.
Unfortunately neither approach seems to be working.
Along the lines of approach 1, I tried the following:
In settings.py --
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.middleware.WebkitDownloader': 701,  # to run after CookiesMiddleware
}
I modified scrapyjs to send cookies: scrapyjs/middleware.py--
import gtk
import webkit
import jswebkit
#import gi.repository import Soup # conflicting static and dynamic includes!?
import ctypes
from urlparse import urlparse  # Python 2; needed for urlparse() below
libsoup = ctypes.CDLL('/usr/lib/i386-linux-gnu/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')
def process_request(self, request, spider):
    if 'renderjs' in request.meta:
        cookies = request.headers.getlist('Cookie')
        if len(cookies) > 0:
            cookies = cookies[0].split('; ')
            cookiejar = libsoup.soup_cookie_jar_new()
            libsoup.soup_cookie_jar_set_accept_policy(cookiejar, 0)  # 0 == ALWAYS ACCEPT
            up = urlparse(request.url)
            for c in cookies:
                sp = c.find('=')  # find FIRST = as split position
                cookiename = c[0:sp]; cookieval = c[sp+1:]
                libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename, cookieval, up.hostname, 'None', -1))
            session = libwebkit.webkit_get_default_session()
            libsoup.soup_session_add_feature(session, cookiejar)
        webview = self._get_webview()
        webview.connect('load-finished', self.stop_gtk)
        webview.load_uri(request.url)
        ...
The code for setting the cookiejar is adapted from this response. The problem may be with how imports work; perhaps this is not the right webkit that I'm modifying -- I'm not too familiar with webkit and the python documentation is poor. (I can't use the second answer's approach with from gi.repository import Soup because it mixes static and dynamic libraries. I also can't find any get_default_session() in webkit as imported above).
The second approach fails because sessions aren't preserved across requests, and again I don't know enough about webkit to know how to make it persist in this framework.
Any help appreciated!
Actually, the first approach does work, but with one modification. The path of the cookies needs to be '/' (at least in my application), and not 'None' as in the code above. I.e., the line should be
libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename,cookieval,up.hostname,'/',-1))
Unfortunately this only pushes the question back a bit. Now the cookies are saved properly, but the full page (including the frames) is still not being loaded and rendered with webkit as I had expected, so the DOM is not as complete as what I see in the browser. If I simply request the frame that I want, I get the error page instead of the content that is shown in a real browser. I'd love to see how to use webkit to render the whole page, including frames, or how to achieve the second approach and complete the entire session in webkit.
Without knowing the complete workflow of the application: you need to make sure the cookie jar is set before webkit does any other network activity. See http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-Global-functions.html#webkit-get-default-session. In my experience, this practically means even before instantiating the web view.
Another thing to check is whether your frames are from the same domain. Cookie policies will not allow cookies across different domains.
Lastly, you can probably inject the cookies. See http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-webkitwebview.html#WebKitWebView-navigation-policy-decision-requested or resource-request-starting, and then set the cookies on the actual soup message.
I have a link like this, pointing directly to an MP3 file. When I put it in my browser, it basically asks me if I want to download the file. However, when I do the same thing in Python with the following code:
data = urllib2.urlopen("http://www23.zippyshare.com/d/44123087/497548/Lil%20Wayne%20ft.%20Eminem%20-%20Drop%20The%20World.mp3").read()
I am redirected to another link like this. Therefore, instead of the MP3 data, I am getting the HTML code for
'http://www23.zippyshare.com/v/44123087/file.html'
any ideas ?
thanks
urllib2 handles redirection transparently. You might want to see what the server is actually doing when it presents such a redirect while still allowing you to download. You could subclass the redirect handler, see which header property gives you the URL, and use urlretrieve to download that.
Explicitly enabling cookies might be worth a try as well.
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('yourmp3filelink')
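To see where the server is actually sending you, the redirect handler can be subclassed so that it records each hop. This is a sketch in Python 3, where urllib2 lives on as urllib.request; LoggingRedirectHandler and fetch_with_redirect_log are illustrative names:

```python
import urllib.request


class LoggingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Record every redirect (status code, target URL) before following it."""

    def __init__(self):
        self.seen = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.seen.append((code, newurl))
        # Defer to the default behaviour so the redirect is still followed.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def fetch_with_redirect_log(url):
    """Fetch url and return (body bytes, list of redirects that were followed)."""
    handler = LoggingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=30) as resp:
        return resp.read(), handler.seen
```

The recorded list shows whether the server bounces the request to an HTML page straight away or only after an intermediate hop, which helps decide whether cookies or other headers are what it is looking for.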
Your link redirects to an HTML webpage, most likely because your download request is timing out. That's often how these download websites work: you never get a static link to the download, only a temporarily assigned link.
My guess is that there's no way to get that static link using that website. You'd have to know where that file was actually coming from.
So no, nothing is wrong with your python code; just your sources.
I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way in general to tell if a URL has a pdf/doc etc. file that it's linking to if it's not doing so explicitly (e.g. www.domain.com/file.pdf)? Is there a way to get Python to snag that file?
Edit:
Thanks for replies, several of which suggest downloading the file to see if it's of the correct type. Only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an html file with an href containing that same url.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
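A small sketch of the HEAD approach, assuming Python 3 (the question itself uses Python 2.6, where the request would have to be built differently); media_type and head_media_type are illustrative names:

```python
import urllib.request


def media_type(content_type):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def head_media_type(url):
    """Send a HEAD request and return the media type the server reports."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return media_type(resp.headers.get("Content-Type", ""))
```

A caller would then compare the result against 'application/pdf'. Some servers reject HEAD or answer it differently from GET, so an unexpected result is not proof that the document is not a PDF.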
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it is using a "session cookie" somewhere, and if you haven't got one yet, it keeps trying to redirect.
As has been said there is no way to tell content type from URL. But if you don't mind getting the headers for every URL you can do this:
obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have pdf file, download whole
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't get better than that.
Also you should use mime-types instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what it gives you when you request a certain URL.
Check the MIME type with the info() method of the object returned by urllib.urlopen(). This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper MIME type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
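"Download it and try it" does not have to mean fetching the whole file: PDF files begin with the signature %PDF-, so reading the first kilobyte is usually enough. A sketch assuming Python 3, with looks_like_pdf and url_is_pdf as illustrative names:

```python
import urllib.request


def looks_like_pdf(data):
    """True if the PDF signature appears within the first 1024 bytes."""
    # Some producers put a little junk before the header, so search the
    # start of the data instead of requiring the signature at offset 0.
    return b"%PDF-" in data[:1024]


def url_is_pdf(url):
    """Fetch only the beginning of the resource and sniff its content."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return looks_like_pdf(resp.read(1024))
```

This checks the body rather than trusting the Content-Type header, so it still works when the server mislabels the file.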
You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.
To detect the type of a file served from a URL in Python 3.x, even when the URL has no extension or a fake one, install python-magic using
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
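import magic
from urllib.request import Request, urlopen
url = "http://...url to the file ..."
request = Request(url)
response = urlopen(request)
# The first couple of kilobytes are enough for libmagic; mime=True returns
# a MIME type such as 'application/pdf' instead of a text description.
mime_type = magic.from_buffer(response.read(2048), mime=True)
print(mime_type)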