I was just wondering if it's possible to open a headless browser with the webbrowser module? I'm new to programming, have virtually no experience, and don't even know where to look; I heard this is a good site to start. I wanted to use the webbrowser module because I'm planning to run the program on other computers, and the average person doesn't have special software like ChromeDriver installed; webbrowser also doesn't require a PATH entry to open a browser window. So I wanted to use it. If anyone knows any other alternative modules that can open common browsers without needing a PATH entry, please say so.
Most modules have so-called API documentation. For the webbrowser module, it can be found here: https://docs.python.org/3.6/library/webbrowser.html
If you come across a module for which you cannot find any documentation, try help() in IPython:
import webbrowser
help(webbrowser) # help for module
help(webbrowser.get) # help for function
browser = webbrowser.get()
help(browser) # help for browser object
There one can see that headless operation is not a documented feature of the webbrowser module. Nevertheless, there are other modules that you might want to look into; this list seems to be a good start: https://github.com/dhamaniasad/HeadlessBrowsers
Btw., to respond to Basile Starynkevitch (I don't yet have enough reputation to comment under other posts): a headless browser might process JavaScript and follow HTML redirects. You will not get the same from the software you mentioned.
Wrong terminology: a headless browser should more generally be called an HTTP client. Read much more about HTTP and take time to understand what should happen in HTTP clients and what should happen in HTTP servers. Also be aware of HTML5, JavaScript, AJAX, and other web technologies. They are related in their usage within a usual browser such as Firefox, but conceptually independent.
Of course, your typical browser is an HTTP client, but there are many other HTTP clients (e.g. wget, any program using libcurl, which is a good free-software HTTP client library, or web crawlers).
Some browsers (e.g. links) can be much more crude than your typical one, but all browsers are HTTP clients. They might not even know about JavaScript or CSS (or even show any images); they still deserve to be called "browsers". Some programs (e.g. Selenium) reproduce many functions of typical browsers (even JavaScript and CSS) but don't show anything on a screen. You might call them headless browsers, but they might not even claim to be one.
And Python includes some HTTP client (and also HTTP server) functions.
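For instance, the standard library alone is enough to act as a very basic HTTP client. A minimal sketch (the URL is just a placeholder):

import urllib.request

# Fetch a page with the standard-library HTTP client; no external
# browser or driver is involved at all.
with urllib.request.urlopen("https://example.com/") as response:
    html = response.read().decode("utf-8")
print(html[:200])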
You could find other HTTP server libraries, such as libonion.
Many programs use HTTP (outside of browsing, e.g. as inter-process communication). Be aware of web services.
PS. That is the first time I read about headless browsers, so I don't think this terminology is very common.
I've recently been working on a project in which I need to access an ASP.NET web API in order to get some data. The way I've been gaining access to this API so far is by manually setting the cookies within the code and then using requests to get the information that I need. My task now is to automate this process. I get the cookies by using the Chrome developer tools, in the Network tab. Now, obviously, the cookies change every once in a while, so I've been trying to make something that will automatically update the cookies inside the code.
I should mention that the network on which this is being done is air-gapped and getting Python libraries inside is kind of tedious, so I am trying to avoid that. It is also the reason why providing code examples here is very complicated.
The way the log-in process works in this web app is as follows (data from chrome dev tools):
Upon entering the URL there are a bunch of redirects which seem to do nothing.
A request is made to /login.aspx which returns a "Set-Cookie: sessionId=xyz" header and redirects to /LandingPage.aspx
A request is made to /LandingPage.aspx with said cookie, which returns a "Set-Cookie" header with a bunch of cookies (ASP.NET, etc.). These are the cookies that I need in order for the Python script to access the API.
What's written above is the browser's way of doing things; when I try to imitate this in Python requests, I get the first cookie from /login.aspx, but when it redirects to /LandingPage.aspx, I get a 401 Unauthorized with the following headers:
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
After having done some reading, I understood that these response headers are related to the NTLM and Kerberos protocols (side question: if it responds with both headers, does that mean I need to provide both authentications, or will either one suffice?).
A quick Google search yielded that these responses should be followed by a request carrying the Kerberos/NTLM token (which I have no idea how to acquire) in order to get a 200 response. I find this pretty weird, considering the browser doesn't make any of these requests and the web app just gives it the cookies without seemingly transferring any NTLM or Kerberos data.
I've thought of a few ways to overcome this, and hopefully you could help me figure out whether they would work.
Trying to get the requests-kerberos or requests-ntlm libraries for Python and using those to overcome this problem. I would like your opinion as to whether this would work. I am reluctant to use this method, though, because of what was mentioned above.
Somehow using PowerShell to get these tokens and then somehow using them in Python requests without the above-mentioned libraries. But I have no idea if this would work either.
I would very much appreciate anyone who could further explain the process that's happening here in general, and of course I would greatly appreciate any help with solving this.
Thank you very much!
Trying to get the requests-kerberos or requests-ntlm libraries for Python and using those to overcome this problem. I would like your opinion as to whether this would work. I am reluctant to use this method, though, because of what was mentioned above.
Yes, requests-kerberos would work. HTTP Negotiate means Kerberos almost 100% of the time.
For Linux I'd slightly prefer requests-gssapi, which is based on a more maintained 'gssapi' backend, but at the moment it's limited to Unix-ish systems only, while requests-kerberos has the advantage of supporting Windows through the 'winkerberos' backend. But it doesn't really matter; both will do the job fine.
Don't use NTLM if you can avoid it. Your domain admins will appreciate being able to turn off NTLM domain-wide as soon as they can.
Somehow using PowerShell to get these tokens and then somehow using them in Python requests without the above-mentioned libraries. But I have no idea if this would work either.
Technically it's possible, but doing this via PowerShell (or .NET in general) is going the long way around. You can achieve exactly the same thing using Python's sspi module, which talks directly to the actual Windows SSPI interface that handles Kerberos ticket acquisition (and NTLM, for that matter).
(The gssapi module is the Linux equivalent, and the spnego module is a cross-platform wrapper around both.)
You can see a few examples here; the OP has a .NET example, the answer has Python.
But keep in mind that Kerberos tokens contain not only the service ticket but also a one-time-use authenticator (to prevent replay attacks), so you need to get a fresh token for every HTTP request.
So don't reinvent the wheel and just use requests-kerberos, which will automatically call SSPI to get a token whenever needed.
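A minimal sketch of what that can look like, assuming the requests-kerberos package is available; the host names below are placeholders for the app described in the question:

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# The session stores the cookies (sessionId, the ASP.NET cookies, ...)
# that the server sets once the Negotiate handshake succeeds.
session = requests.Session()
session.auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

resp = session.get("https://intranet.example.com/login.aspx")        # placeholder URL
resp = session.get("https://intranet.example.com/LandingPage.aspx")  # placeholder URL
print(resp.status_code, session.cookies.get_dict())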
It says that in order for requests-kerberos to work, there has to be a TGT already cached on the PC. This program is supposed to run for weeks without being interfered with, and to my understanding these tickets expire after about 10 hours.
That's typical for all Kerberos use, not just requests-kerberos specifically.
If you run the app on Windows, from an interactive session, then Windows will automatically renew Kerberos tickets as needed (it keeps your password cached in LSA memory for that purpose). However, don't run long-term tasks in interactive sessions...
If you run the app on Windows, as a service, then it will use the "machine credentials" aka "computer account" (see details), and again LSA will keep the tickets up-to-date.
If you run the app on Linux, then you can create a keytab that stores the client credentials for the application. (This doesn't need domain admin rights, you only need to know the app account's password.)
On Linux there are at least 4 different ways to use a keytab for long-term jobs: k5start (third-party, but common); KRB5_CLIENT_KTNAME (built into MIT Kerberos, but only in recent versions); gss-proxy (from Red Hat, might already be part of the OS); or a basic cron job that just re-runs kinit to acquire new tickets every 4-6 hours.
I find this pretty weird, considering the browser doesn't make any of these requests and the web app just gives it the cookies without seemingly transferring any NTLM or Kerberos data.
It likely does; you might just be overlooking it.
Note that some SSO systems use JavaScript to dynamically probe for whether the browser has Kerberos authentication properly set up; if the main page really doesn't send a token, then it might be an iframe or an AJAX/XHR request that does.
I was making some test HTTP requests using Python's requests library. When requesting Walmart's Canadian site (www.walmart.ca), I got this:
How do servers like Walmart's detect that my request is being made programmatically? I understand browsers send all sorts of metadata to the server. I was hoping to get a few specific examples of how this is commonly done. I've found a similar question, albeit related to Selenium WebDriver, here, where it claims that there are some vendors that provide this service, but I was hoping to get something a bit more specific.
Appreciate any insights, thanks.
As mentioned in the comments, a real browser sends many different values: headers, cookies, data. It loads from the server not only the HTML but also images, CSS, JS, and fonts. A browser can also run JavaScript, which can gather other information about the browser (version, extensions, data in local storage, etc.) and about the user (e.g. how you move the mouse). And a real human loads/visits pages with random delays and in a rather random order. All of these signals can be used to detect a script. Servers may use very complex systems, even machine learning (artificial intelligence), and compare your behavior against data from a few minutes or hours.
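To illustrate just the header part of this, a sketch only: it addresses a single one of the signals above and will not defeat JavaScript-based or behavioural checks.

import requests

# Browser-like request headers; requests' default User-Agent
# ("python-requests/x.y.z") is an obvious giveaway on its own.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://www.walmart.ca/", headers=headers)
print(response.status_code)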
I made a program that works with Selenium, and it automates posting comments to some blogs' content. I'm not familiar with the requests module of Python (I've been working with it for just a week). The thing I'm wondering is: my program with Selenium is a bit slow at page loading, and it loads everything from ads to images/videos. If I'd made my program with the requests module, would it use less data and be a bit faster compared to Selenium?
I searched this issue on some forum sites; generally they say the requests module is a bit faster, but not all of them. Also, I couldn't find any info comparing these modules in terms of data usage.
Please don't give me a thumbs down right away. I need this answer with details.
Selenium is used for web automation: clicking on web elements and sending keys to input boxes.
To speed up Selenium, use headless mode, so the browser does not render a visible window and the work is faster; see Selenium's documentation to learn more about headless mode.
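A minimal sketch of running Chrome headless with Selenium; it assumes a matching chromedriver is already installed, and the URL is a placeholder:

from selenium import webdriver

# Run Chrome without opening a visible window.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/some-blog-post")  # placeholder URL
print(driver.title)
driver.quit()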
Requests, on the other hand, is used for HTTP methods like GET, POST, etc. Learn more about requests here.
If the blogging site has a public API, then you can use the requests module.
If you are new to APIs, I recommend watching this YouTube video:
https://youtu.be/GZvSYJDk-us
For example, to create issues on GitHub you can use the GitHub API.
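A rough sketch of that with requests; the token, user, and repository names are placeholders:

import requests

token = "<personal-access-token>"  # placeholder; needs the "repo" scope
url = "https://api.github.com/repos/<user>/<repo>/issues"  # placeholder repo
headers = {
    "Authorization": f"token {token}",
    "Accept": "application/vnd.github+json",
}
payload = {"title": "Example issue", "body": "Created via the GitHub REST API."}

response = requests.post(url, headers=headers, json=payload)
print(response.status_code)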
But to comment on a blogging site which has no public API, you need to use Selenium.
Requests sends and receives data directly from the server that hosts a particular service, so it is fast.
Selenium, on the other hand, interacts with the web browser.
When you are using requests, you can do an action directly, without having to perform a bunch of clicks or send keys.
Selenium allows you to control a browser and execute actions on a webpage.
The requests library is for making HTTP requests.
So, if you know how to write your program for posting comments using just the HTTP API, then I'd go with requests; Selenium would be overhead in this case.
If you are proficient with HTTP requests and verbs (you know how to make a POST request to a server with the requests library), then choose requests. If you want to test your script, use Selenium or BeautifulSoup.
I'm coding an automation tool for a specific web site and having some problems. The web site needs to be accessed within a browser (e.g. pushing buttons) to get JSON-format responses.
(I'm familiar with Python but not with network traffic and such things. Sorry for my poor explanation; English is not my first language.)
I have to listen to the JSON-format responses from the web site. AFAIK, a local proxy (127.0.0.1) is needed to fetch the traffic. I've found some code (http://luugiathuy.com/2011/03/simple-web-proxy-python/) that fetches data from port 80 (HTTP); however, the code needs to change my PC's network settings. Is there any way to get the traffic data without changing the whole PC's proxy setting? It slows my PC down, and I want to run this code "independently". I've tried to emulate an independent web browser to handle this, but I had a hard time figuring out how to set up an independent local proxy.
Following question #1: the website needs some button actions triggered by the mouse. As I mentioned above (it should operate independently), it must not interfere with the actual mouse. Is there any library I can use for this purpose? I've tried to create a "virtual mouse" to achieve this goal but sadly failed.
I have more detailed questions but have shortened them down to the most crucial ones.
I'm attempting to automate tests of Adobe Analytics (aka Omniture) instrumentation of a web app by implementing test scripts with the Selenium Python package.
If correctly instrumented, HTTP requests are made from the browser with certain expected query parameters. Is there a Python package that would allow me to capture those outgoing HTTP requests? Right now, we do it manually with the Chrome dev tools in the Network -> Images section.
This application is also available as a native app across nearly twenty other platforms (including Smart TVs and game consoles), and I'll need to perform similar tests across those. Although, unfortunately, I won't be able to automate the script, I'd still like to capture and store the HTTP calls. I'm currently using HTTPScoop to do this manually.
I'm most comfortable with Python, but if there's a simple way of doing this in another language, I'm all ears.
I was recently working on a similar task so I can share my experience and what I've learnt on the way (rather than give you the solution).
First you need to run a proxy on your machine (e.g. http://bmp.lightbody.net/). Then I needed to run a few commands manually (https://github.com/lightbody/browsermob-proxy#rest-api). Once the proxy was running, I wrote a small script following the example here: https://github.com/lightbody/browsermob-proxy#using-with-selenium. Finally, you simply loop over the HAR entries captured by the proxy and check whether an analytics request is present (you can check the URL params if needed).
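Condensed, the flow from that README example looks roughly like this; paths and URLs are placeholders, and the Firefox-profile API shown is the older Selenium style used in the linked example:

from browsermobproxy import Server
from selenium import webdriver

# Start the BrowserMob proxy and create a proxy instance for the browser.
server = Server("/path/to/browsermob-proxy")  # placeholder path to the proxy binary
server.start()
proxy = server.create_proxy()

# Point Firefox at the proxy so all traffic is captured.
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

# Record a HAR, load the page, then inspect the captured requests.
proxy.new_har("analytics-check")
driver.get("https://www.example.com/")  # placeholder URL

for entry in proxy.har["log"]["entries"]:
    url = entry["request"]["url"]
    if "/b/ss/" in url:  # Adobe Analytics beacons typically use this path
        print("Analytics call:", url)

driver.quit()
server.stop()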
I have this ready in the form of a unit test for FF and Chrome (for a given URL). To be able to run this test on different devices/OS/platforms, one would probably need to run the code through Selenium Remote WebDriver (https://code.google.com/p/selenium/wiki/RemoteWebDriver) using a service like https://www.browserstack.com/ in the cloud. I contacted them; they don't have any documentation ready but suggested I refer to online resources. That's where I am now.
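The remote variant would mainly swap the local driver for webdriver.Remote; a rough sketch, with placeholder credentials and capabilities and the older desired_capabilities style:

from selenium import webdriver

# Placeholder BrowserStack credentials and capabilities.
hub_url = "https://<USERNAME>:<ACCESS_KEY>@hub-cloud.browserstack.com/wd/hub"
capabilities = {"browserName": "Chrome", "os": "Windows", "os_version": "10"}

driver = webdriver.Remote(command_executor=hub_url,
                          desired_capabilities=capabilities)
driver.get("https://www.example.com/")  # placeholder URL
print(driver.title)
driver.quit()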
Hope it helps