I've been reading about BeautifulSoup, HTTP headers, authentication, cookies, and something about mechanize.
I'm trying to scrape my favorite art websites with Python, like DeviantArt, which I found a scraper for. Right now I'm trying to log in, but the basic authentication code examples I try don't work.
So my question is: how do I find out what type of authentication a site uses, so that I know I'm trying to log in the correct way? That includes things like valid user-agents when sites try to block bots.
Bear with my ignorance, as I'm new to HTTP, Python, and scraping.
It's very unlikely that any of the sites you are interested in use basic auth. You will need a library like mechanize that manages cookies and you will need to submit the login information to the site's login page.
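As a rough sketch of what that looks like with mechanize (the login URL and the form field names below are placeholders; you would need to inspect the real login form to find them):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt; check the site's terms of use first

# Open the login page; mechanize keeps the session cookies for you
br.open('https://www.example.com/login')

# Select the login form and fill it in. The form index and the field names
# 'username'/'password' are placeholders -- inspect the page's HTML for the real ones.
br.select_form(nr=0)
br['username'] = 'my_user'
br['password'] = 'my_pass'
br.submit()

# Later requests reuse the same cookie jar, so you are now "logged in"
html = br.open('https://www.example.com/some/protected/page').read()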
I want to scrape data from a website which has an initial log-on (where I have working credentials). It is not possible to inspect the code for this, as it is a log-on that pops up before visiting the site. I tried searching around but did not find any answer; perhaps I do not know what to search for.
This is what you get when going to the site:
[Screenshot: a browser "Log on" prompt]
Any help is appreciated :-)
The solution is to use the public REST API for the site.
If the web site does not provide a REST API for interacting with it, you should not be surprised that your attempt at simulating a human is difficult. Web scraping is generally only possible for pages that do not require authentication, or that use the standard HTTP 401 status response to tell the client it should prompt the user for the correct credentials. If the site is using a different mechanism, most likely based on AJAX, then the solution is going to be specific to that web site or to other sites using the same mechanism, which means that no one can answer your question, since you did not tell us which web site you are interacting with.
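That said, a quick way to check whether a site uses one of the standard HTTP auth schemes is to request a protected page and look at the status code and the WWW-Authenticate header. A sketch using requests (the URL is a placeholder):

import requests

r = requests.get('https://www.example.com/protected/page')  # placeholder URL
print(r.status_code)

# A 401 plus a WWW-Authenticate header means a standard HTTP auth scheme
# (Basic, Digest, ...) is in play; the header names the scheme.
print(r.headers.get('WWW-Authenticate'))

# A 200 with a login form in the HTML, or a redirect to a login page,
# usually means a cookie/session-based login instead.
print(r.history)  # any redirects that were followed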
Based on your screenshot this looks like it is just using Basic Auth.
Using the library "requests":
import requests

session = requests.Session()
# For Basic Auth, pass the credentials with HTTPBasicAuth (or simply a (user, pass) tuple)
r = session.get(url, auth=requests.auth.HTTPBasicAuth('user', 'pass'))
Should get you there.
I couldn't get Tom's answer to work, but I found a workaround:

from selenium import webdriver

# Point this at your local chromedriver binary
driver = webdriver.Chrome('path to chromedriver')
# Embedding the credentials in the URL answers the Basic Auth prompt
driver.get('https://user:password@webaddress.com/')

This worked :)
I want to read the HTML contents of a site on Google's Play Store developer backend from Python.
The URL is
https://play.google.com/apps/publish/?dev_acc=1234567890#AppListPlace
The site is of course only accessible if you're logged in.
I naively tried:
import requests
from requests.auth import HTTPBasicAuth

response = requests.get(url, auth=HTTPBasicAuth('username@gmail.com', 'mypassword'))

which yielded only the default 'you need to be logged in to view this page' HTML content.
Any way to do this?
Trying to read the HTML contents of the page is not the way to go.
Basic HTTP authentication is not something you will see very often these days. It's the kind which pops up a browser alert message asking you for your username and password. Google, like most other websites, uses their own more sophisticated system. That system is not designed to be accessed by anyone but humans. Not to mention that storing your Google account password in your source code is a terrible idea.
Instead, you should look into the Google Play Developer API, which is designed to be accessed by machines, and uses OAuth2 authentication.
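If you go that route, access is normally through a service account rather than your personal password. A rough sketch, assuming the google-api-python-client and google-auth packages (the key file path and package name are placeholders):

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Service-account key created in the Play Console / Google Cloud Console (placeholder path)
credentials = service_account.Credentials.from_service_account_file(
    'service-account.json',
    scopes=['https://www.googleapis.com/auth/androidpublisher'],
)

service = build('androidpublisher', 'v3', credentials=credentials)

# Example call: list the in-app products of one of your apps (placeholder package name)
result = service.inappproducts().list(packageName='com.example.myapp').execute()
print(result)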
I am trying to crawl a website for the first time. I am using urllib2 in Python.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type some text (a captcha) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How do I make urllib2 get the desired page without having to stop at the security check? Please help.
You probably want to use the Foursquare API instead.
You have to use the Foursquare API. I guess there is no other way; APIs are designed for such purposes.
Crawlers that depend solely on the HTML format of the page will fail in the future when the HTML changes.
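As a rough sketch of what an API call looks like (this assumes you have already obtained an OAuth access token through Foursquare's login flow; the token below is a placeholder):

import json
import requests

ACCESS_TOKEN = 'YOUR_OAUTH_TOKEN'  # placeholder; obtained through Foursquare's OAuth flow

# Fetch the authenticated user's own profile; 'v' is the API version date
r = requests.get(
    'https://api.foursquare.com/v2/users/self',
    params={'oauth_token': ACCESS_TOKEN, 'v': '20140101'},
)
data = r.json()
print(json.dumps(data['response']['user'], indent=2))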
I've looked at a lot of questions and libraries and didn't find exactly what I wanted. Here's the thing: I'm developing an application in Python for a user to get all sorts of things from their social network accounts, and I'm having trouble with Facebook. I would like, if possible, a step-by-step tutorial on the code and libraries to use to get a user's information, from posts to photo information (using the user's login information, and how to handle it, because I've had a lot of problems with authentication).
Thank you
I strongly encourage you to use Facebook's own APIs.
First of all, check out the documentation on Facebook's Graph API https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, DO read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp).
Once you grasp the concepts, start using this API. For Python, there are several alternatives:
facebook/python-sdk https://github.com/facebook/python-sdk
pyFaceGraph https://github.com/iplatform/pyFaceGraph/
It is also semitrivial to write a simple HTTP client that uses the Graph API.
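For example, a bare-bones Graph API call using nothing but the requests library might look like this (a sketch; the access token is a placeholder obtained through Facebook's OAuth login flow, and the fields you can read depend on the permissions that token carries):

import requests

ACCESS_TOKEN = 'USER_ACCESS_TOKEN'  # placeholder, obtained via Facebook's OAuth login flow

# Ask the Graph API for the authenticated user's basic profile and posts
r = requests.get(
    'https://graph.facebook.com/me',
    params={'access_token': ACCESS_TOKEN, 'fields': 'id,name,posts'},
)
r.raise_for_status()
profile = r.json()
print(profile.get('name'))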
I would suggest you check out the Python libraries, try out the examples in their documentation, and see if they work and do the stuff you need.
Only as a last resort would I write a scraper and try to extract data with screen scraping (it is much more painful and breaks more easily).
I have not used this with Facebook, but in the past when I had to scrape a site that required login I used Mechanize to handle the login and scraping and Beautiful Soup to parse the resulting HTML.
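For what it's worth, the Beautiful Soup half of that combination is short. A sketch (the URL, the login handling, and the CSS class are made up for illustration):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)
# ...log in first with br.select_form() / br.submit() against the site's login form...
html = br.open('https://www.example.com/members/photos').read()

# Hand the logged-in page's HTML to Beautiful Soup for parsing
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', class_='photo-link'):  # made-up class name
    print(link.get('href'))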
I'm trying to scrape a page on YouTube with Python which has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
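A minimal sketch of that urllib-plus-ElementTree approach, against the GData-style XML feed the YouTube Data API exposed at the time (that feed has since been retired, so treat this as illustrative only):

import urllib
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# Search feed from the legacy (GData) YouTube Data API -- returned Atom XML
url = 'https://gdata.youtube.com/feeds/api/videos?q=python+scraping&max-results=5'
xml_data = urllib.urlopen(url).read()

root = ET.fromstring(xml_data)
for entry in root.findall(ATOM + 'entry'):
    print(entry.find(ATOM + 'title').text)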
Main problem is, you're violating the TOS (terms of service) for the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it: technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug in Firefox, then turn on the Net panel in Firebug and click the desired link on YouTube. Now watch what happens and which pages are requested, and find the ones responsible for the AJAX part of the page. You can then use urllib or Mechanize to fetch those links. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be checking user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsibly" scraping! :)
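A sketch of the "fetch the link yourself" step with urllib2 (the AJAX URL and the header values are stand-ins for whatever Firebug actually shows you):

import json
import urllib2

# The AJAX URL copied out of Firebug's Net panel -- a stand-in, not a real endpoint
ajax_url = 'https://www.youtube.com/some/ajax/endpoint?video_id=abc123'

request = urllib2.Request(ajax_url, headers={
    'User-Agent': 'Mozilla/5.0',                        # mimic the browser
    'Referer': 'https://www.youtube.com/watch?v=abc123',
    'X-Requested-With': 'XMLHttpRequest',               # many AJAX endpoints expect this
})
body = urllib2.urlopen(request).read()

# Many AJAX responses are JSON; fall back to the raw body if parsing fails
try:
    print(json.loads(body))
except ValueError:
    print(body[:500])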
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access javascript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.