Intercept HTTP traffic with Selenium WebDriver in Python

I am building a web application testing tool with Selenium, using the Chrome WebDriver in Python 3.5. So far the application works properly, but the marketing team tells me it is skewing our web analytics metrics: as I crawl the pages, the tool sends requests to our web analytics platform.
What would be the best approach to capture the web analytics tag being triggered (the request itself) without actually sending the request?
Is using a proxy to intercept the request and block it from being sent a possible solution?
Edit:
The analytics system is Google Analytics, and the call looks like the following:
https://www.google-analytics.com/r/collect?
After the ? come the parameters sent to Google Analytics, but every time this URL is called, a page view is registered.

You could also use the Chrome DevTools Protocol for request interception; it doesn't use a proxy.
You can filter which requests or responses get paused by URL or type.
Have a look at https://stackoverflow.com/a/75067388/20443541.
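As a minimal sketch (assuming Selenium 4 with a Chromium-based driver, and using the google-analytics.com/collect pattern from the question), you can block the analytics hit at the network layer with the CDP command Network.setBlockedURLs and still see the attempted request in Chrome's performance log:

import json
from selenium import webdriver

options = webdriver.ChromeOptions()
# Enable performance logging so the CDP Network events (including the
# blocked analytics calls) can be read back afterwards.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

# Block any Google Analytics hit at the network layer; no proxy involved.
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd(
    "Network.setBlockedURLs",
    {"urls": ["*google-analytics.com/*collect*"]},
)

driver.get("https://example.com")  # placeholder for the page under test

# Walk the performance log and report the analytics requests that were
# attempted (and blocked) during the page load.
for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message["method"] == "Network.requestWillBeSent":
        url = message["params"]["request"]["url"]
        if "google-analytics.com" in url:
            print("Analytics hit captured but not sent:", url)

driver.quit()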

Related

Scraping a website that is locked behind Discord OAuth (trying to automate logging in with OAuth using Python requests)

I'm trying to automate a login on a popular website. This website uses Discord OAuth.
I have gotten to the stage where I have captured the requests being made to Discord (which contain the site's callback URL).
However, the issue I am facing is that Discord's Authorize button doesn't return the OAuth code via requests. Instead, when the button is clicked, an obfuscated JS file redirects the user to the OAuth callback URL with a generated code.
Unfortunately I do not know of a way to get this code, since it cannot be monitored in the network tab.
Is there a way I can get around this? For example, triggering the JS file (simulating that I clicked the Authorize button in some way or another)?
I know I could use Selenium, but Selenium isn't great for performance, and websites constantly change their UI. API endpoints are a much better way of doing it.
I'm using the Python httpx module.
An example login URL for this is:
https://discord.com/oauth2/authorize?client_id=896549597550358548&redirect_uri=https://www.monkeebot.xyz/oauth/discord&response_type=code&scope=identify%20guilds
When you click Authorize, it sends you via a callback URL to the site in question. The goal is to automate logging in via this link using Python requests only.
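A rough, purely hypothetical sketch of the pattern being asked about: the exact request Discord's obfuscated JS sends when Authorize is clicked is not documented here, so the POST below is only a placeholder. What it does show is the httpx side: disable redirect-following so the callback URL carrying the generated code can be read from the response, then pull the code out of its query string.

import httpx
from urllib.parse import urlparse, parse_qs

# Placeholder request: in reality the cookies/headers of an authenticated
# Discord session, plus whatever payload the obfuscated JS sends, would be
# required here.
with httpx.Client(follow_redirects=False) as client:
    resp = client.post(
        "https://discord.com/oauth2/authorize",
        params={
            "client_id": "896549597550358548",
            "redirect_uri": "https://www.monkeebot.xyz/oauth/discord",
            "response_type": "code",
            "scope": "identify guilds",
        },
    )

    # If the server answers with the redirect to the callback URL, the
    # generated code is in its query string.
    callback = resp.headers.get("location", "")
    code = parse_qs(urlparse(callback).query).get("code", [None])[0]
    print("OAuth code:", code)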

Selenium Python get data from HTTP request

I am running automation with Selenium and Python on the Opera web driver. When I enter the specific page that I need, a request is sent to the server; it is authenticated with anti-content, which blocks me from making the request myself, so the only solution is to get the JSON returned after the request is sent. I checked selenium-wire, but I think it doesn't fit my needs. I thought there might be another way to do this; any suggestions?
You can try Titanium Web Proxy. It is a proxy server, can be installed via a NuGet package, and can be used together with Selenium. In its C# API you can read a captured response body like this:
string body = await e.GetResponseBodyAsString();
Reference:
https://github.com/justcoding121/Titanium-Web-Proxy/issues/176
https://www.automatetheplanet.com/webdriver-capture-modify-http-traffic/#tab-con-9
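If you prefer to stay in Python and selenium-wire turns out to be workable after all, a rough sketch of the same idea (capture the response that comes back when the page makes its API call) could look like the following; the page URL and the substring used to pick out the API request are placeholders:

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://example.com/specific-page")  # placeholder page URL

# selenium-wire records every request the browser made; filter for the
# API call whose JSON response you need.
for request in driver.requests:
    if request.response and "/api/" in request.url:  # adjust the filter to the real endpoint
        print(request.url, request.response.status_code)
        print(request.response.body[:200])  # raw bytes; may be gzip/br encoded

driver.quit()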
Hello, there are some pages that are built specifically to make automating the request impossible.
The detection logic runs in JavaScript, and there are companies that provide this kind of bot detection and block access for bots.
So I am sorry I cannot solve your problem; I tried to do the same as you and there is no way around it.

Comparing the requests module vs Selenium in Python

I made a program with Selenium that automates posting comments on some blogs' content. I'm not familiar with Python's requests module (I've been working with it for just a week). What I'm wondering is: my Selenium program is a bit slow at page loading, and it loads everything from ads to images and videos. If I had written my program with the requests module, would it use less data and be a bit faster than the Selenium version?
I searched this issue on some forum sites; they generally say the requests module is a bit faster, but not all of them agree. I also couldn't find any information comparing the data usage of these modules.
Please don't just give me a thumbs down; I need a detailed answer.
Selenium is used for web automation: clicking web elements and sending keys to input boxes.
To speed up Selenium, use headless mode so the browser does not render the page visually, which makes runs faster; see Selenium's documentation to learn more about headless mode.
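For example, a minimal headless Chrome setup (the exact flag has varied between Chrome versions; "--headless=new" is the current form):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without a visible browser window
driver = webdriver.Chrome(options=options)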
requests, on the other hand, is used for making HTTP requests such as GET and POST. Learn more about requests from here.
If the blogging site has a public API, then you can use the requests module.
If you are new to APIs, I recommend watching this YouTube video:
https://youtu.be/GZvSYJDk-us
For example, to create issues on GitHub you can use the GitHub API.
But to comment on a blogging site that has no public API, you need to use Selenium.
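As a small sketch of that GitHub example (the owner, repository, and token below are placeholders):

import requests

resp = requests.post(
    "https://api.github.com/repos/OWNER/REPO/issues",  # placeholder repository
    headers={
        "Authorization": "Bearer YOUR_TOKEN",  # placeholder personal access token
        "Accept": "application/vnd.github+json",
    },
    json={"title": "Example issue", "body": "Opened via the GitHub REST API"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["html_url"])  # link to the newly created issue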
requests sends and receives data directly from the server that hosts a particular service, so it is fast.
Selenium, on the other hand, interacts with the web browser.
When you use requests, you can perform an action directly, without having to perform a bunch of clicks or send keys.
Selenium allows you to control a browser and execute actions on a webpage.
requests library is for making HTTP requests.
So, if you know how to write your program for posting comments using just the HTTP API, then I'd go with requests; Selenium would be overhead in this case.
If you are proficient with HTTP requests and verbs (you know how to make a POST request to a server with the requests library), then choose requests. If you want to test your script, use Selenium or BeautifulSoup.

Simulate active session on a website with python

I'm looking for a way to simulate an active session on a website using Python. What I mean by that is that I want to create software which makes the website think an actual user with an actual browser has the website open. I've found urllib3 and its request.urlopen method, but it seems that this only reads the content at the URL and closes the connection. Thanks for any suggestions.
You can try simulating the browser's requests to get the cookies necessary for authentication. Google Chrome DevTools and the requests Python library will do the job.
Some websites handle sessions another way, but I believe the majority use cookies set through POST requests.
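A minimal sketch of that idea with requests.Session (the login URL and form field names are placeholders you would copy from the browser's network tab):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # look like a regular browser

# Log in once; the cookies returned by the server are kept on the session.
session.post(
    "https://example.com/login",  # placeholder login endpoint
    data={"username": "me", "password": "secret"},  # placeholder form fields
    timeout=10,
)

# Subsequent requests reuse those cookies, so the site sees an ongoing,
# authenticated session rather than isolated one-off requests.
resp = session.get("https://example.com/dashboard", timeout=10)
print(resp.status_code)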

Python scraping data that can only be accessed through Google OAuth login

I want to scrape some data from a website which uses Google OAuth for the authentication. Some data can only be accessed if I perform a login.
Basically, when you open the website (mamikos.com) and click login, there is no normal login form; it gives you the option to log in with Facebook or Google. It then redirects you to the Google login page. After logging in with a Google account, you are redirected to the website's homepage, and all the data is easily accessible with a simple click.
I am basically a noob; I only know some basic coding and Googling. I have looked everywhere, but it seems like I am looking in the wrong place. I tried to write code with Selenium to automate the clicks, pass the username/password, and perform the login, but apparently Selenium is not the right tool for this, as it opens a browser and does everything visibly.
Is it possible to do this login and authentication process in the background? I have over a hundred thousand URLs of pages from which I need data. Using Selenium would crash my computer and take a long time to finish.
Can someone here please show me, or at least point me to, the right tools/library/method? Or is it even possible?
Thanks
Please note that this answer is currently a work in progress - I'm working on (almost) the exact same problem, (different site, and I'll be using go), but I can provide a rough workaround for getting started, and when my solution matures I will update this.
Reiteration of problem statement
What you are asking for is a way for your scraper (third-party client) to authenticate with a website (resource server) via Google OAuth (authorization server), to access resources that your specific account (resource owner) has permission to see.
This sounds like three-legged OAuth.
"OAuth2 Simplified" is a nicely written article by Aaron Parecki giving a broad overview of the roles of client, resource owner, resource server, and authorization server in the three-legged OAuth process.
Another requirement (from what I can infer) is that the client you're implementing/authenticating with is not trusted by the authorization server or the resource server.
This is significant, as it precludes certain OAuth flows from being usable, and may mean that various OAuth client libraries/packages are not viable, as they may not implement flows for untrusted clients.
Workaround (Rough pass)
You identified selenium as a potential workaround for achieving authentication.
You accurately identified that selenium is not a great solution for large-scale scraping, as it is very heavyweight, relatively slow, & uses a lot of resources.
That being said, you only need to use Selenium once in this process: to automate the OAuth flow and obtain an access token for the website.
Once you get a token, you can discard your Selenium instance and use your favourite high-performance scraping library to carry out the rest of your tasks. From there, you can attach the token to your requests and receive access.
This blog post describes the approach broadly, using a JS Selenium API (under "Use an automated UI test to get the access token via Authorization Code Grant").
I will provide more specifics once I implement them.
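In the meantime, a rough Python sketch of that hand-off (here transferring session cookies rather than a bearer token, since that is often what such sites actually use; the URLs are placeholders): log in once through Selenium, copy the authenticated cookies into a requests.Session, then drop the browser and scrape over plain HTTP.

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://mamikos.com")
# ... complete the Google OAuth login here, manually or with automated clicks ...
input("Finish logging in in the browser window, then press Enter...")

# Copy the authenticated cookies out of the browser into a lightweight session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
driver.quit()

# From here on, the hundred thousand URLs can be fetched with plain HTTP requests.
resp = session.get("https://mamikos.com/some-protected-page", timeout=10)  # placeholder URL
print(resp.status_code)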
I understand it can be tough to scrape data from websites that sit behind login pages. You will need to learn to replicate the requests being sent to the server using Python's requests library. It can be daunting in the beginning, but you can learn it step by step here.
