Logging on to a site to web scrape in Python

I want to scrape data from a website which has an initial log on (where I have working credentials). It is not possible to inspect the code for this, as it is a log on prompt that pops up before visiting the site. I tried searching around, but did not find any answer - perhaps I do not know what to search for.
This is what you get when going to the site:
[Screenshot: "Log on" prompt]
Any help is appreciated :-)

The solution is to use the public REST API for the site.
If the web site does not provide a REST API for interacting with it, you should not be surprised that your attempt at simulating a human is difficult. Web scraping is generally only practical for pages that either do not require authentication or use the standard HTTP 401 status response to tell the client that it should prompt the user for the correct credentials. If the site is using a different mechanism, most likely based on AJAX, then the solution is going to be specific to that web site (or to other sites using the same mechanism), which means that no one can answer your question, since you did not tell us which web site you are interacting with.

Based on your screenshot this looks like it is just using Basic Auth.
Using the "requests" library:
import requests
from requests.auth import HTTPBasicAuth

session = requests.Session()
# url is the address of the protected page; the prompt in the screenshot is the
# browser's Basic Auth dialog, so plain HTTPBasicAuth credentials should work
r = session.get(url, auth=HTTPBasicAuth('user', 'pass'))
Should get you there.

I couldn't get Tom's answer to work, but I found a workaround:
from selenium import webdriver
driver = webdriver.Chrome('path to chromedriver')
driver.get('https://user:password@webaddress.com/')
This worked :)

Related

How to scrape infinitely scrolling websites with login using Python requests (or similar)

I would like to scrape a website that does not have an API and is an "infinite scroller". I have been using Selenium for this, but now I need to scrape a lot more pages and do it all at once. The problem is that Selenium is very resource-intensive, since I am running a full (headless) Chrome browser in each instance, and it is also not stable at all (probably because of the limited resources, but still). I know that there is a way to look for the AJAX requests that the site uses and access them with the requests library, but I have two issues:
I can't seem to find the desired request
The ones that I try to use with the requests library require the user to be logged in, and I have no idea how to do that (maybe pass cookies and whatnot, I am not a web developer).
Let me take Twitter as an example, since it is exactly the same as what I am describing (except that it has an API). You have to log in, and then the feed is loaded infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please provide a working example.
Thank you.
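One common way to do what the question describes, sketched below on the assumption that the feed is backed by a JSON endpoint: log in once with a requests.Session (which keeps the cookies), then page through the AJAX endpoint directly instead of scrolling a browser. All URLs, parameters and field names here are hypothetical placeholders; the real ones come from watching the XHR requests in the browser's developer tools.
import requests

session = requests.Session()

# Step 1: log in once; the Session object stores the cookies it receives.
session.post(
    "https://example.com/login",
    data={"username": "my_user", "password": "my_pass"},
)

# Step 2: call the AJAX endpoint that the infinite scroller uses, paging with
# whatever cursor/offset parameter the site expects.
cursor = None
for _ in range(10):  # fetch ten "scrolls" worth of items
    params = {"cursor": cursor} if cursor else {}
    resp = session.get("https://example.com/api/feed", params=params)
    payload = resp.json()
    for item in payload["items"]:
        print(item["text"])
    cursor = payload.get("next_cursor")
    if not cursor:
        break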

Python Requests status_code 401 UNAUTHORIZED

I'm trying to web scrape from WITHIN a secure network. Security is already tight and I have a username and password, but if you open the site I'm trying to reach with my program, you wouldn't be prompted to log in (because you're inside the network). I'm having trouble with the authentication here...
import requests
url = "http://theinternalsiteimtryingtoaccess.com"
r = requests.get(url, auth=('myusername', 'mypass'))
print(r.status_code)
>>>401
I've tried HTTPBasicAuth, but that didn't work either. Are there any ways with requests to get around this?
Just another note: the 'urlopen' command will open this site without any authentication being required... Please help! Thanks!
EDIT: After finding this question (How to scrape URL data from intranet site using python?), I tried the following:
import requests
from requests_ntlm import HttpNtlmAuth
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",auth=HttpNtlmAuth('NEED DOMAIN HERE\\usr','pass'))
print(r.status_code)
>>>401 #still >:/
RESOLVED: Make sure that if you're having this problem and you're trying to access an internal site, you specify your particular domain in the code. I was trying to log in, but the computer didn't know which domain to log me into. You can find the domain you're on by going to Control Panel > System; the domain should be listed there. Thank you!
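With the domain filled in, the working call looks roughly like this (the domain, username, password and URL below are placeholders):
import requests
from requests_ntlm import HttpNtlmAuth

# 'MYDOMAIN' stands for the domain shown under Control Panel > System
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",
                 auth=HttpNtlmAuth('MYDOMAIN\\myusername', 'mypass'))
print(r.status_code)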
It's extremely unlikely we can give you the exact solution to your problem, but I would guess the intranet uses some sort of corporate proxy. I would think your requests need to be directed through that proxy rather than sent as if they were hitting an external public site.
For more information on this check out the official docs.
http://docs.python-requests.org/en/master/user/advanced/#proxies
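If that is the case, a minimal sketch with requests would be the following (the proxy address is a placeholder; the real one comes from your IT department or your browser's proxy settings):
import requests

# Placeholder proxy host and port; replace with your organisation's actual proxy.
proxies = {
    "http": "http://proxy.mycompany.example:8080",
    "https": "http://proxy.mycompany.example:8080",
}

r = requests.get("http://theinternalsiteimtryingtoaccess.com", proxies=proxies)
print(r.status_code)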

How do I make Python urllib2 cleverly avoid the security check while trying to log into a site?

I am trying to crawl a website for the first time, using Python's urllib2.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type some text (a captcha) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How do I make urllib2 get the desired page without having to stop at the security check? Please help.
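For context, the urllib2 Basic Authentication setup being described would look roughly like this (the URLs and credentials are placeholders, not Foursquare's actual endpoints):
import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://foursquare.com/", "my_user", "my_pass")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
# fetch the protected page through the opener that carries the credentials
html = opener.open("https://foursquare.com/some/protected/page").read()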
You probably want to use the Foursquare API instead.
You have to use the Foursquare API. I guess there is no other way; APIs are designed for such purposes.
Crawlers that depend solely on the HTML format of a page will fail in the future when the HTML changes.

Download from Megaupload with login - Python

It's my first question here.
Today, I've made a little application using wxPython: a simple Megaupload downloader, but it doesn't yet support premium accounts.
Now I would like to know how to download from MU with a login (free or premium user).
I'm very new to Python, so please don't be too specific and "professional".
I used to download files with urlretrieve, but is there a way to pass "arguments" or something to be able to log in as a premium user?
Thank you. :D
EDIT:
News: new help needed xD
After trying with pycurl, httplib2 and mechanize, I've done the login with urllib2 and cookiejar (the returned HTML shows my username).
But when I start downloading a file, the server apparently doesn't keep my login; in fact the downloaded file seems corrupted (I changed the wait time from 45 to 25 seconds).
How can I download a file from Megaupload while keeping the login I've already done? Thanks for your patience :D
Questions like this are usually frowned upon; they are very broad, and there is already an abundance of answers if you just search on Google.
You can use urllib, or mechanize, or any library you can make an HTTP POST request with.
Megaupload's login form looks to have these values:
login:1
redir:1
username:
password:
Just post those values to http://megaupload.com/?c=login
All you should have to do is set your username and password to the correct values!
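A minimal sketch of that POST using urllib2 and a cookie jar (the credentials are placeholders; the form fields are the ones listed above):
import urllib
import urllib2
import cookielib

# The cookie jar keeps the session cookie that the site sets on login, so later
# download requests made through the same opener stay logged in.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

form_data = urllib.urlencode({
    "login": "1",
    "redir": "1",
    "username": "your_username",
    "password": "your_password",
})

response = opener.open("http://megaupload.com/?c=login", form_data)
print(response.read())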
For logging in using Python, follow these steps.
Find the list of parameters to be sent in the POST request and the URL where the request has to be made by viewing the source of the login form. You may use a browser with an "Inspect Element" feature to find it easily. [Parameter name examples: userid, password.] Just check the tags' name attributes.
Most sites set a cookie on logging in, and that cookie has to be sent along with subsequent requests. To handle this, download httplib2 (http://code.google.com/p/httplib2/) and read the wiki page at that link; it shows how to log in, with examples.
Now you can make subsequent requests for files, passing the login cookie along via httplib2.
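The pattern from that httplib2 wiki looks roughly like this (the URLs and form fields are placeholders; note that the cookie received at login is forwarded explicitly on the next request):
import urllib
import httplib2

http = httplib2.Http()

# Placeholder login URL and form fields; take the real ones from the login form's source.
body = {"username": "my_user", "password": "my_pass"}
headers = {"Content-type": "application/x-www-form-urlencoded"}
response, content = http.request("http://example.com/login", "POST",
                                 headers=headers, body=urllib.urlencode(body))

# Send the cookie from the login response along with the follow-up request.
headers = {"Cookie": response["set-cookie"]}
response, content = http.request("http://example.com/files/123", "GET",
                                 headers=headers)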
I do a lot of web stuff with Python; I prefer using pycurl, which you can get here.
It is very simple to post data and log in with curl; I've used it across many languages such as PHP, Python, and C++. Hope this helps.
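A small sketch of a login POST with pycurl (the URL and credentials are placeholders):
import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "http://example.com/login")                  # placeholder login URL
c.setopt(c.POSTFIELDS, "username=my_user&password=my_pass")  # placeholder credentials
c.setopt(c.COOKIEFILE, "")     # enable the in-memory cookie engine for the session
c.setopt(c.WRITEDATA, buffer)  # collect the response body
c.perform()
c.close()
print(buffer.getvalue())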
You can use urllib; this is a good example.

Scraping AJAX - Using Python

I'm trying to scrape a page on YouTube with Python, and it has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at The Youtube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
The main problem is, you're violating the TOS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it: technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then turn on the Net panel in Firebug and click on the desired link on YouTube. Now see what happens and which pages are requested, and find the ones that are responsible for the AJAX part of the page. Now you can use urllib or mechanize to fetch that link. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that would suggest that the requested page might be looking at user login credentials, session info or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
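For example, once the Net panel shows the XHR URL, fetching it with urllib2 might look like this (the URL and header values are placeholders; which headers actually matter depends on the site):
import urllib2

# Placeholder endpoint copied from the Net panel; the headers mimic the browser request.
request = urllib2.Request(
    "https://www.youtube.com/some_ajax_endpoint?param=value",
    headers={
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.youtube.com/watch?v=VIDEO_ID",
    },
)
content = urllib2.urlopen(request).read()
print(content)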
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access javascript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
