I'm looking for a way to simulate an active session on a website using Python. What I mean is that I want to create software which makes the website think that an actual user with an actual browser has the website open. I've found urllib and its request.urlopen method, but it seems that this only reads the content served at the URL and then closes the connection. Thanks for any suggestions.
You can try simulating the browser's requests to get the cookies needed for authentication. Google Chrome Dev Tools and the requests Python library will do the job.
Some websites handle sessions another way, but I believe the majority use cookies set through POST requests.
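A minimal sketch of that idea, assuming the site uses an ordinary cookie-based login; the URL and form field names below are placeholders you would copy from the login request shown in the Network tab of Chrome Dev Tools:

```python
import requests

# A persistent Session stores cookies between requests, which is what makes
# the site treat later requests as coming from the same logged-in user.
session = requests.Session()

# Placeholder values -- copy the real URL and form field names from the
# POST you see in Chrome Dev Tools when you log in manually.
login_url = "https://example.com/login"
payload = {"username": "my_user", "password": "my_password"}

# Send the same POST the browser would send; any Set-Cookie headers in the
# response are stored on the session automatically.
resp = session.post(login_url, data=payload)
resp.raise_for_status()

# Further requests through the same session carry those cookies, so the
# site sees them as part of an active, authenticated session.
profile = session.get("https://example.com/profile")
print(profile.status_code)
```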
I am running automation with Selenium and Python on the Opera web driver. When I enter the specific page that I need, a request is sent to the server; it is authenticated with an anti-content token, which blocks me from sending the request myself, so the only solution is to capture the JSON returned after the browser sends the request. I have checked selenium-wire, but I think it doesn't fit my needs. Is there another way to do this? Any suggestions?
You can try Titanium Web Proxy. It is a proxy server that can be installed via a NuGet package and used together with Selenium.
// Inside the proxy's response event handler, read the captured response body:
string body = await e.GetResponseBodyAsString();
References:
https://github.com/justcoding121/Titanium-Web-Proxy/issues/176
https://www.automatetheplanet.com/webdriver-capture-modify-http-traffic/#tab-con-9
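The snippet above is C#; if you would rather stay in Python, a comparable (swapped-in) approach is to run mitmproxy in front of the Selenium-driven browser and read the response bodies in a small addon script. This is only a sketch; the port and URL filter are assumptions:

```python
# capture_json.py -- run with:  mitmdump -s capture_json.py
# (mitmproxy is a separate install: pip install mitmproxy)
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Keep only the API call you care about; the "/api/" fragment is a placeholder.
    if "/api/" in flow.request.pretty_url:
        print(flow.request.pretty_url, flow.response.get_text()[:200])
```

Point the browser at the proxy (for Chrome, the --proxy-server=http://127.0.0.1:8080 argument) so every response passes through the addon.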
Hello, some pages are built specifically to make automating requests impossible.
That detection runs in JavaScript, and there are companies that provide it and shut off access for bots.
So I am sorry I cannot solve your problem; I tried to do the same as you and there was no way.
I made a program with Selenium that automates posting comments to some blogs' content. I'm not familiar with Python's requests module (I have been working with it for just a week). What I'm wondering is: my Selenium program is a bit slow at page loading, and it loads everything from ads to images/videos. If I made my program with the requests module instead, would it save data and be a bit faster than the Selenium version?
I searched this issue on some forum sites; generally they say the requests module is a bit faster, but not always. I also couldn't find any information comparing the two modules in terms of data usage.
Please don't just give me a thumbs down. I need a detailed answer.
Selenium is used for web automation: clicking web elements and sending keys to input boxes.
To speed up Selenium, use headless mode, so the browser does not render the visual components and the work goes faster; go to Selenium's documentation to learn more about headless mode.
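A short sketch of headless mode with Chrome (assuming Selenium 4; the exact flag differs a little between Chrome versions):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no visible window; plain "--headless" on older Chrome
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder page
print(driver.title)
driver.quit()
```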
requests, on the other hand, is used for HTTP methods like GET, POST, etc. Learn more about requests from here.
If the blogging site has a public API, then you can use the requests module.
If you are new to APIs, I recommend watching this YouTube video:
https://youtu.be/GZvSYJDk-us
For example, to create issues on GitHub you can use the GitHub API.
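As a rough illustration (the owner, repo, and token below are placeholders, and you need a personal access token with the right scope):

```python
import requests

owner, repo = "octocat", "hello-world"      # placeholder repository
token = "ghp_your_personal_access_token"    # placeholder token

# Creating an issue is a single authenticated POST to the GitHub REST API.
resp = requests.post(
    f"https://api.github.com/repos/{owner}/{repo}/issues",
    headers={
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    },
    json={"title": "Found a bug", "body": "Steps to reproduce: ..."},
)
print(resp.status_code, resp.json().get("html_url"))
```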
But to comment on a blogging site which has no public API, you need to use Selenium.
requests sends and receives data directly from the server hosting a particular service, so it is fast.
Selenium, by contrast, interacts with the web browser.
When you use requests, you can perform an action directly, without having to simulate a bunch of clicks or key presses.
Selenium allows you to control a browser and execute actions on a webpage.
The requests library is for making HTTP requests.
So, if you know how to write your program for posting comments using just the HTTP API, I'd go with requests; Selenium would be an overhead in this case.
If you are proficient with HTTP requests and verbs (you know how to make a POST request to a server with the requests library), then choose requests. If you want to test your script or work with the rendered page, use Selenium or BeautifulSoup.
I want to scrape some data from a website which uses Google OAuth for authentication. Some data can only be accessed if I perform a login.
Basically, when you open the website (mamikos.com) and click login, there is no normal login form; it gives you the option to log in with Facebook or Google. It then redirects you to the Google login page. After logging in with a Google account, you are redirected back to the website's homepage and all the data is easily accessible with a simple click.
I am basically a noob; I only know some basic coding and Googling. I have looked everywhere, but it seems like I am looking in the wrong places. I tried to write code with Selenium to automate the clicks, pass the username/password, and perform the login, but apparently Selenium is not the right tool for this, as it opens a browser and does everything visibly.
Is it possible to do this login and authentication process in the background? I have over a hundred thousand URLs of pages I need data from. Using Selenium would crash my computer and take a long time to finish.
Can someone here please show me, or at least point me to, the right tools/libraries/methods? Or is this even possible?
Thanks
Please note that this answer is currently a work in progress - I'm working on (almost) the exact same problem (a different site, and I'll be using Go), but I can provide a rough workaround to get you started, and when my solution matures I will update this.
Reiteration of problem statement
What you are asking for is a way for your scraper (third party client) to authenticate with a website (Resource Server) via google oauth (Authorization Server), to access resources that your specific account (Resource owner) has permission to see.
This sounds like three-legged OAuth.
"OAuth2 Simplified" is a nicely written article by Aaron Parecki giving a broad overview of the roles of client, resource owner, resource server, and authorization server in the three legged oauth process.
Another requirement (from what I can infer) is that the client you're implementing/authenticating with is not trusted by the Authorization Server or the Resource Server.
This is significant, as it does preclude certain oauth flows from being usable, and may mean that various oauth client libraries/packages are not viable, as they may not implement flows for untrusted clients.
Workaround (Rough pass)
You identified selenium as a potential workaround for achieving authentication.
You accurately identified that selenium is not a great solution for large-scale scraping, as it is very heavyweight, relatively slow, & uses a lot of resources.
This being said, you only need to use Selenium once in this process: to automate the OAuth flow and obtain an access token for the website.
Once you get a token, you can discard your Selenium instance and use your favourite high-performance scraping library to carry out the rest of your tasks. From there, you can attach the token to your requests and receive access.
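A rough sketch of that hand-off, assuming the site tracks the authenticated session with cookies (the URLs are placeholders, and a real run still has to get through Google's login pages, 2FA, and any bot checks):

```python
import requests
from selenium import webdriver

# 1) Use Selenium once to complete the Google OAuth login in a real browser.
driver = webdriver.Chrome()
driver.get("https://mamikos.com")
# ... click "login with Google", enter credentials, wait for the redirect ...
input("Finish logging in in the browser window, then press Enter here...")

# 2) Copy the authenticated cookies out of the browser into a requests.Session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
driver.quit()

# 3) Scrape the protected pages at requests speed, reusing the same cookies.
resp = session.get("https://mamikos.com/some/protected/page")  # placeholder URL
print(resp.status_code)
```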
This blog post describes this approach broadly, using a JS selenium API (Under "Use a automated UI Test to get the access token via Authorization Code Grant" )
I will provide more specifics once I implement them.
I understand it can be tough to scrape data from websites that are behind log-in pages. You will need to learn how to replicate the request that is being sent to the server using Python's requests library. It can be daunting at the beginning, but you can learn it step by step here.
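For instance, once you have found the request in the Network tab, replaying it usually comes down to copying its URL, headers, and cookies into requests (everything below is a placeholder you would lift from Dev Tools):

```python
import requests

# Values copied from the recorded request in the Network tab (placeholders here).
url = "https://example.com/api/listings"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/",
}
cookies = {"session_id": "value-from-devtools"}

resp = requests.get(url, headers=headers, cookies=cookies)
print(resp.status_code)
print(resp.text[:200])
```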
Generally, whenever a page is loaded, it sends out several requests, which can be recorded in the Network tab of Chrome developer tools.
My prime motive is to log all the network requests made whenever a page is loaded, using a Python script. A sample screenshot is attached to illustrate which requests I am trying to collect.
[Screenshot: network request logs from the Chrome Dev Tools Network tab]
I am trying to achieve the same using the urllib library in Python; however, I am not exactly sure of the usage.
Looking forward to your responses. Thanks in advance.
You can't do this with the urllib family of libraries. To capture AJAX requests you need something that has JavaScript support... like a browser.
So your best option in this case is to use Selenium: write a script that uses the Selenium WebDriver to drive whatever browser you're using, and then capture/log the AJAX requests being sent out.
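One way to do that capture with Selenium and Chrome is to enable performance logging, which exposes the DevTools network events; a sketch assuming Selenium 4 and ChromeDriver (the log format is Chrome-specific and can vary between versions):

```python
import json
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Ask ChromeDriver to record DevTools "performance" events, which include a
# Network.requestWillBeSent entry for every request the page fires.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder page

for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message.get("method") == "Network.requestWillBeSent":
        print(message["params"]["request"]["url"])

driver.quit()
```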
I am currently scraping some websites (legitimately), using urllib2 in Python with SocksiPy for Tor. I have changed the header fields to be identical to those of the Tor Browser Bundle (TBB). However, some websites still seem to detect me as a spider/scraper. How are they able to do so?
Most likely these sites set cookies and use JavaScript, which urllib does not execute.
Use a real browser (TBB) with WebDriver.