I've never done this before, but I'm trying to build a program that scrapes a Google Classroom site specific to the user who is logged in. Even though I'm logged in in my main browser, Google denies the request and instead returns an authentication error (in other words, it asks me to log in). How can I be logged in inside the program so that Google accepts my request and lets me scrape my Classroom pages?
I tried this solution, but without luck: Logging into Google using Python
It was published a while ago, and Google may have changed the requirements for this kind of programmatic authentication.
What I want is to reach the sections that are only available to me when I'm logged in, e.g. the content of my classroom, and grab some text from them. Is that even possible?
It would be expensive to try to implement a login mechanism yourself, especially with all of Google's 2FA requirements today.
What is quicker, and what usually works in browser automation today, is to log in manually once and then start the automated browser with its user data directory pointed at that profile. This is how it's usually done; you re-log in manually from time to time, only when needed. More info on how to set up a user data directory here.
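For example, a minimal sketch with Selenium and Chrome, assuming Selenium 4 and a profile path you substitute with your own manually logged-in profile:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Point at a Chrome profile where you have already logged in to Google by hand.
options.add_argument("--user-data-dir=/path/to/your/chrome/user-data")
options.add_argument("--profile-directory=Default")

driver = webdriver.Chrome(options=options)
driver.get("https://classroom.google.com/")  # the profile's session cookies are reused
print(driver.title)
driver.quit()
```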
This gets you up and running pretty fast.
Related
I am having trouble authenticating against a web service that uses OAuth provided by Google.
Basically, I want to login with my google account to a web page to do some scraping on it.
As the web service is not mine, I don't have the app's secret_key, only the clientID, redirect_URL and scope, which I could recover by inspecting the parameters of the requests made while logged in.
Once authenticated, the web page only requires a cookie named SID (Session ID, I would guess) to respond as if to an authenticated user. There is no Bearer token, just the SID cookie.
Is it possible to automate this type of authentication? I've read many related topics, but they all need the secret_key, which I don't have because I'm not the owner of the app.
(Cannot comment due to rep)
Yes, what you're asking is possible. You could theoretically follow and replicate all the requests needed to authenticate yourself, obtain the SID, and perform the scraping. However, this would be a very difficult way to do some basic web scraping; it's like programming a full-blown scientific calculator to compute 5 + 5. You're going to run into all sorts of security checks and be asked for phone/authenticator-app/email verification when attempting to log in to your account with Python requests, and then you'd need to keep track of those security cookies and keep them updated. It's a real mess and would be extremely difficult for anyone.
I think the better method would be to authenticate manually, grab the SID cookie, and hard-code it into your scraper within the Cookie HTTP header.
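A minimal sketch of that idea with the requests library (the URL and cookie value are placeholders you'd replace with your own):

```python
import requests

# Paste the SID value copied from your browser's developer tools after logging in.
session = requests.Session()
session.cookies.set("SID", "paste-your-sid-value-here")

# Placeholder URL for whatever authenticated page you want to scrape.
resp = session.get("https://example.com/protected-page")
resp.raise_for_status()
print(resp.text[:500])
```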
I understand this raises the concern of what to do when the SID cookie expires. Since you haven't said which site it is, I can only say it's hard to imagine a site that makes you re-authenticate with Google often rather than having its own internal SID/JWT refresh mechanism to keep you logged in.
My recommendations would be:
Check the expiration of the SID cookie; if it's viable to manually copy and paste it after authenticating yourself, do that.
If the SIDs expire quickly, check whether there's an API request anywhere that gives you a new SID (without going through the OAuth flow again). In your Network panel, look for a set-cookie response header setting a new SID. You might need to track and update these inside your program (see the sketch after this list), but it'll be much easier than writing a program to log in to Google.
If there's no way to refresh the SID, the cookies expire often, you need to do long-term web scraping, and sitting there fetching a new cookie manually every 30 minutes isn't workable, I'd recommend looking into doing this with Puppeteer/Chromium, as it'll be much easier than doing it via Python HTTP requests.
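As a rough sketch of the second recommendation (the endpoint is hypothetical, and whether the site actually refreshes the SID this way is an assumption):

```python
import requests

session = requests.Session()
session.cookies.set("SID", "initial-sid-value")  # copied manually once

# Hypothetical request that the site answers with a fresh SID via Set-Cookie.
resp = session.get("https://example.com/api/keepalive")
# requests applies any Set-Cookie headers to the session's cookie jar automatically,
# so if the server issued a new SID it is already stored here.
print("SID currently in use:", session.cookies.get("SID"))
```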
I want to scrape some data from a website which uses Google OAuth for the authentication. Some data can only be accessed if I perform a login.
Basically, when you open the website (mamikos.com) and click login, there is no normal login form; it only gives you the options to log in with Facebook or Google. It then redirects you to the Google login page. After logging in with a Google account, you are redirected back to the website's homepage, and all the data is easily accessible with a simple click.
I am basically a noob; I only know some basic coding and googling. I have looked everywhere, but it seems like I am looking in the wrong place. I tried to write code with Selenium to automate the clicks, pass the username/password, and perform the login, but apparently Selenium is not the right tool for this, as it opens a browser and does everything through it.
Is it possible to do this login and authentication process in the background? I have over a hundred thousand URLs of pages I need data from. Using Selenium would crash my computer and take a long time to finish.
Can someone here please show me, or at least point me toward, the right tools/libraries/methods? Or is it even possible?
Thanks
Please note that this answer is currently a work in progress. I'm working on (almost) the exact same problem (different site, and I'll be using Go), but I can provide a rough workaround for getting started, and when my solution matures I will update this.
Reiteration of problem statement
What you are asking for is a way for your scraper (third-party client) to authenticate with a website (resource server) via Google OAuth (authorization server) in order to access resources that your specific account (resource owner) has permission to see.
This sounds like three-legged OAuth.
"OAuth2 Simplified" is a nicely written article by Aaron Parecki giving a broad overview of the roles of client, resource owner, resource server, and authorization server in the three legged oauth process.
Another requirement (from what I can infer) is that the client you're implementing/authenticating with is not trusted by the authorization server or the resource server.
This is significant, as it precludes certain OAuth flows from being usable, and may mean that various OAuth client libraries/packages are not viable, as they may not implement flows for untrusted clients.
Workaround (Rough pass)
You identified Selenium as a potential workaround for achieving authentication.
You accurately identified that Selenium is not a great solution for large-scale scraping, as it is very heavyweight, relatively slow, and uses a lot of resources.
That being said, you only need to use Selenium once in this process: to automate the OAuth flow and obtain an access token for the website.
Once you have a token, you can discard your Selenium instance and use your favourite high-performance scraping library to carry out the rest of your tasks. From there, you can attach the token to your requests and receive access.
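A rough sketch of that hand-off in Python (the login steps, target URLs, and cookie handling are assumptions; you'd fill in the actual Google flow or do it by hand):

```python
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://mamikos.com/")
# ... automate the "login with Google" clicks and credentials here, or do it manually ...
input("Complete the login in the browser window, then press Enter...")

# Copy the authenticated cookies into a requests session for fast scraping.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
driver.quit()

resp = session.get("https://mamikos.com/some-protected-page")  # placeholder URL
print(resp.status_code)
```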
This blog post describes this approach broadly, using a JS Selenium API (under "Use a automated UI Test to get the access token via Authorization Code Grant").
I will provide more specifics once I implement them.
I understand it can be tough to scrape data from websites that sit behind login pages. You will need to learn how to replicate the requests the browser sends to the server using Python's requests library. It can be daunting at the beginning, but you can learn it step by step here.
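As a minimal sketch of that idea (the URL, header, and cookie values are placeholders you'd copy from your browser's Network tab):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0",                     # mimic your browser
    "Cookie": "sessionid=paste-value-from-browser",  # copied from the Network tab
}
resp = requests.get("https://example.com/protected", headers=headers)
print(resp.status_code)
```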
I am currently learning how to use Django. I have a standalone Python script that I want to communicate with my Django app, but I have no clue how to go about doing this. My Django app has a login function and a database with usernames and passwords. I want my Python script to talk to my app, verify a person's username and password, and also get some account info like the person's name. How do I go about doing this? I am very new to web apps and not really sure where to begin.
Some clarifications: my standalone Python program exists so the user can access some information about their account. I am not trying to use the script for login functionality; my Django app already handles this. I am just trying to find a way to verify that they have said account.
For example: if you have a flashcards web app and you want the user to have a local program on their computer to access their flashcards, they need to log in and download the cards from the web app. So wouldn't the standalone program need to communicate with the app somehow to verify the login and access the cards on that account? That's what I am trying to accomplish.
If I understand you correctly, you're looking to have an external program communicate with your server. To do this, the server needs to expose an API (Application Programming Interface) that communicates with the external program. That interface will receive a message and return a response.
The request will need to have two things:
identifying information for the user - usually a secret key - so that other people can't access the user's data.
a query of some sort indicating what kind of information to return.
The server will get the request, validate the user's secret key, process the query, and return the result.
It's pretty easy to do in Django. Set up a URL like /api/cards and a view, and have the view process the request and return the response. These days, these back-and-forth messages are often encoded as JSON, an easy way to encapsulate and send data. Google around with the terms django, api, and json and you'll find a lot of what you need.
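For instance, a minimal sketch of such an endpoint (the token lookup helper, the card fields, and the URL layout are assumptions for illustration, not a definitive implementation):

```python
# views.py -- hypothetical /api/cards endpoint
from django.http import JsonResponse, HttpResponseForbidden
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def cards_api(request):
    # 1. Validate the user's secret key (here, a token sent in a header).
    token = request.headers.get("Authorization", "")
    user = lookup_user_by_token(token)  # hypothetical helper you'd implement
    if user is None:
        return HttpResponseForbidden("Invalid token")

    # 2. Process the query and return the result as JSON.
    cards = [{"front": c.front, "back": c.back} for c in user.card_set.all()]
    return JsonResponse({"cards": cards})

# urls.py
# from django.urls import path
# from . import views
# urlpatterns = [path("api/cards", views.cards_api)]
```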
I'm in the process of building an app for Facebook using Python and Django. I'm investigating different solutions for integrating with the Facebook authentication API.
So far I've found the two viable solutions:
django-social-auth
python-sdk
I've already tried the first one and it seems to work nicely. I've just read about the second one, and it seems to use the Facebook JavaScript SDK.
My question is: are those two libraries doing authentication differently? Do I understand correctly that the first one uses OAuth directly to communicate with Facebook and get an authentication token from there, whereas the second one just displays some JavaScript-enriched intermediate pages that request the authentication token from within the web browser?
In general: are there different ways of going about Facebook authentication (JavaScript SDK vs. something else)? Why is the JavaScript SDK the recommended approach? And is the "something else" approach incapable of producing cookies and therefore less efficient in some way?
When you use a backend implementation (Python, PHP, Perl, etc.), you generally have to use URL redirects (the Graph API) to interact with Facebook and the user. Personally, I don't think this makes for a good user experience.
Using the JavaScript SDK, you can do everything inline, which means the user never has to leave your page to grant permissions, post to their wall, send requests, etc. You can still use the backend libraries to do other things, and you would need to if you are doing any "offline" activity or subscribing to real-time events.
In the end, you get the same authorization rights. Both make similar calls to Facebook to obtain a valid, authorized session, so either one, or both, works.
Is it possible for a web application created by the same owner as a Facebook application to access that Facebook application without going through an explicit session-opening exercise?
Most of the work is done on the server side, and I need to access the Facebook application directly from the backend server. I do not want the user to go through the Facebook Connect experience each time the website loads, as the data to be displayed does not require access to their Facebook profile/data.
Let me know if it's possible.
Although it's not language-specific, I would be grateful if help is provided with Python in mind. Thanks.
Opening a window for Facebook auth is the way Facebook set up their authentication for Facebook Connect.
I don't think they offer another way of authenticating users, and I doubt you'd be able to work around/circumvent this method without breaking their terms of use.
Sorry I don't have better news for you :/