I am creating a project in Python that scrapes websites and sends links to customers. I currently have functioning versions that simply relay information, but it would be much better if I could make things easier for my users by sending POST requests to the server myself. For example, I currently send my users a link to a certain product. I want to know whether, instead, I can perform the POST requests in my program and then send them a link that is part of the same session in which that POST has already been done (apologies for the poor use of terms). Basically, can I complete actions for them and hand the result over through a link?
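As a sketch of the "perform the POST in my program" part, assuming the requests library; the endpoint and form fields below are hypothetical placeholders, not from any real site:

```python
import requests

# A Session keeps cookies across requests, so the POST below and any
# later requests made through it belong to the same server-side session.
session = requests.Session()

# Hypothetical endpoint and fields -- stand-ins for whatever action
# you want to pre-complete for the customer.
resp = session.post(
    "https://shop.example.com/cart/add",
    data={"product_id": "12345", "quantity": 1},
)
print(resp.status_code)
print(session.cookies.get_dict())  # session cookies created by that POST
```

Whether that session can then be handed to a customer through a plain link is exactly the open part of the question: most sites keep the session in cookies bound to the client that made the POST, so it does not automatically carry over via a URL.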
I was making some test HTTP requests using Python's requests library. When requesting Walmart's Canadian site (www.walmart.ca), my request was blocked.
How do servers like Walmart's detect that my request is being made programmatically? I understand browsers send all sorts of metadata to the server, so I was hoping to get a few specific examples of how this detection is commonly done. I've found a similar question, albeit related to Selenium WebDriver, here, where it is claimed that some vendors provide this detection as a service, but I was hoping for something a bit more specific.
Appreciate any insights, thanks.
As mentioned in the comments, a real browser sends many different values: headers, cookies, data. It reads from the server not only the HTML but also images, CSS, JS, and fonts. A browser can also run JavaScript, which can gather further information about the browser (version, extensions, data in local storage, etc.) and about the user (e.g. how you move the mouse). A real human also loads/visits pages with random delays and in a rather random order. All of these elements can be used to detect a script. Servers may use very complex systems, even machine learning, and compare your behavior against data gathered over minutes or hours.
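To make just one of those signals concrete, the request headers: a minimal sketch using the requests library follows. The header values are typical browser examples, not anything Walmart specifically checks, and matching headers alone is usually not enough, since cookies, JavaScript checks, and TLS fingerprints also matter.

```python
import requests

# By default, requests announces itself with a User-Agent like
# "python-requests/2.x", which alone is enough for many sites to flag
# the request as a script. A browser sends a much richer header set:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-CA,en;q=0.9",
    "Connection": "keep-alive",
}

response = requests.get("https://www.walmart.ca/", headers=headers)
print(response.status_code)
```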
I am writing a Telegram bot that notifies me when a web page changes. I provide a URL, which is periodically fetched via requests.get and hashed. If the hash is different from the previous one, I'm notified.
I am willing to open it up to the rest of the community, but then I need to guard against malicious usage and abuse. So far I have guarded against users providing links to gigantic files (roughly along the lines of the sketch below), but, given my minimal knowledge of the subject, I suspect there is more to it than that.
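For context, a rough sketch of the kind of size guard described above, assuming the requests library; the 1 MB cap and timeout are arbitrary example values:

```python
import hashlib
import requests

MAX_BYTES = 1_000_000   # arbitrary cap used here as an example
TIMEOUT = 10            # seconds

def fetch_hash(url):
    """Download at most MAX_BYTES of the page and return a SHA-256 hash,
    or None if the response is too large."""
    with requests.get(url, stream=True, timeout=TIMEOUT) as resp:
        resp.raise_for_status()
        digest = hashlib.sha256()
        total = 0
        for chunk in resp.iter_content(chunk_size=8192):
            total += len(chunk)
            if total > MAX_BYTES:
                return None   # refuse suspiciously large responses
            digest.update(chunk)
    return digest.hexdigest()
```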
My questions are then:
What kinds of attacks am I exposing myself to?
How do I defend?
I am continuously retrieving temperature data from a sensor. Now I want to display it on a webpage hosted by a Node.js web server. I struggle to understand how this data gets sent to my HTML page, because there are many ways of doing it and none of them is clear to me. In this context I have read terms like REST, AJAX, POST and GET.
Can someone make clear to me which would be the easiest choice in this case?
All those terms are connected with one another:
REST is a software architectural style used for creating web services that allow a requesting system (e.g. your browser) to access and/or manipulate data on the server.
GET and POST are two HTTP methods that define what you want to do with the data on the server (retrieve it, change it, add something, ...).
Ajax is used on the client side to retrieve data from RESTful services.
In your case, you would create a GET endpoint in Node.js (with e.g. Express) and then connect to this endpoint via Ajax to retrieve the data and display it on your website.
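Since the rest of this thread is Python-centric, here is an analogous minimal sketch of such a GET endpoint in Flask rather than Express; the route name and the way the temperature is stored are made up for illustration, but an Express version has the same shape (one route that returns the latest reading as JSON for the Ajax call to pick up):

```python
from flask import Flask, jsonify

app = Flask(__name__)

latest_temperature = None  # updated elsewhere by the sensor-reading loop


@app.route("/temperature", methods=["GET"])
def get_temperature():
    # The client-side Ajax call fetches this JSON and updates the page.
    return jsonify({"temperature": latest_temperature})


if __name__ == "__main__":
    app.run(port=5000)
```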
I want to scrape some data from a website which uses Google OAuth for the authentication. Some data can only be accessed if I perform a login.
Basically, when you open the website (mamikos.com) and click login, there is no normal login form; it only gives you the option to log in with Facebook or Google. It then redirects you to the Google login page. After logging in with a Google account, you are redirected back to the website's homepage, and all the data is easily accessible with a simple click.
I am basically a noob and only know some basic coding and googling. I have looked everywhere, but it seems like I am looking in the wrong places. I tried to write code with Selenium to automate the clicks, pass the username/password, and perform the login, but apparently Selenium is not the right tool for this, as it opens a browser and does everything there.
Is it possible to do this login and authentication process in the background? I have over a hundred thousand URLs of pages from which I need data. Using Selenium would crash my computer and take a long time to finish.
Can someone here please show me, or at least point me to, the right tools/library/method? Or is it even possible at all?
Thanks
Please note that this answer is currently a work in progress. I'm working on (almost) the exact same problem (different site, and I'll be using Go), but I can provide a rough workaround for getting started, and when my solution matures I will update this.
Reiteration of problem statement
What you are asking for is a way for your scraper (third-party client) to authenticate with a website (Resource Server) via Google OAuth (Authorization Server), to access resources that your specific account (Resource Owner) has permission to see.
This sounds like three-legged OAuth.
"OAuth2 Simplified" is a nicely written article by Aaron Parecki giving a broad overview of the roles of client, resource owner, resource server, and authorization server in the three legged oauth process.
Another requirement (from what I can infer) is that the client you're implmenting/authenticating with is not trusted by the Authorization Server or the Resource Server.
This is significant, as it does preclude certain oauth flows from being usable, and may mean that various oauth client libraries/packages are not viable, as they may not implement flows for untrusted clients.
Workaround (Rough pass)
You identified selenium as a potential workaround for achieving authentication.
You accurately identified that selenium is not a great solution for large-scale scraping, as it is very heavyweight, relatively slow, & uses a lot of resources.
That being said, you only need to use Selenium once in this process: to automate the OAuth flow and obtain an access token for the website.
Once you have a token, you can discard your Selenium instance and use your favourite high-performance scraping library to carry out the rest of your tasks. From there, you can attach the token to your requests and receive access.
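As a rough illustration of that hand-off, here is a sketch of copying the cookies from a Selenium session into a requests.Session. The URLs are stand-ins, the login step is left manual for brevity, and some sites hand out an Authorization header token rather than cookies, in which case you would capture that instead:

```python
import requests
from selenium import webdriver

# 1. Use Selenium once to complete the Google OAuth login
#    (automating the clicks/credentials is omitted here).
driver = webdriver.Chrome()
driver.get("https://mamikos.com/")          # stand-in for the login flow
input("Log in through the opened browser, then press Enter...")

# 2. Copy the authenticated cookies into a lightweight requests.Session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"],
                        domain=cookie.get("domain"))

driver.quit()  # Selenium is no longer needed

# 3. Scrape with plain HTTP requests from here on.
resp = session.get("https://mamikos.com/some/protected/page")  # stand-in URL
print(resp.status_code)
```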
This blog post describes this approach broadly, using a JS Selenium API (under "Use a automated UI Test to get the access token via Authorization Code Grant").
I will provide more specifics once I implement them.
I understand it can be tough to scrape data from websites that sit behind login pages. You will need to learn how to replicate the requests that the browser sends to the server using Python's requests library. It can be daunting in the beginning, but you can learn it step by step here.
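As a very rough sketch of what "replicating the request" means in practice: copy what the browser sends (visible in the network tab of the developer tools) into a requests call. The URL, form fields and header values below are placeholders for illustration only:

```python
import requests

session = requests.Session()

# Placeholders: copy the real URL, form fields and headers from the
# browser's network tab while performing the login manually.
login_url = "https://example.com/login"
payload = {"username": "me@example.com", "password": "secret"}
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"}

resp = session.post(login_url, data=payload, headers=headers)
resp.raise_for_status()

# The session now carries the login cookies for subsequent requests.
page = session.get("https://example.com/protected/data")
print(page.status_code)
```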
I'm wondering how to achieve that nice feature I see on many websites today: when having conversations on social networks like Facebook or LinkedIn, you can always answer an online message or status (which is not an email) by replying to the email notification you receive. How do they achieve that?
As far as I can tell, I'd see two options:
Configure a mail server to fetch the emails and pass the information to a Python (in my case) script that handles the data and saves a database record, which can then simply be displayed on the website
Have a Python script running in the background, checking the mail server for incoming emails every few seconds (via POP3 or something similar; a rough sketch of this is shown below)
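For the second option, a minimal sketch of what such a polling script might look like using Python's standard imaplib (server, credentials, and polling interval are placeholders; a POP3 version with poplib would be similar):

```python
import email
import imaplib
import time

IMAP_HOST = "imap.example.com"   # placeholder
USER = "replies@example.com"     # placeholder
PASSWORD = "secret"              # placeholder


def check_inbox():
    with imaplib.IMAP4_SSL(IMAP_HOST) as imap:
        imap.login(USER, PASSWORD)
        imap.select("INBOX")
        _, data = imap.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = imap.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            # Here you would parse the reply, match it to the original
            # notification (e.g. via a token in the Reply-To address),
            # and save a record to the database.
            print(msg["From"], msg["Subject"])


while True:
    check_inbox()
    time.sleep(30)  # poll every 30 seconds
```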
Is there any other option? Has somebody already implemented this? What are the main pitfalls to look at? I'm asking this because I'd like to implement something similar on a web application I'm currently working on.
Thanks!
J
EDIT: I found this link which partially answers my question already: Django/Python: email reply is updated to site
I, too, am working on something similar and am finding https://github.com/pinax/django-notification useful. Check it out; you'll get an idea of how to implement what you want.