I'm interested in getting the number of friends each of my friends on Facebook has. Apparently the official Facebook API does not allow getting the friends of friends, so I need to get around this (somewhat sensible) limitation somehow. I tried the following:
import urllib, urllib2, cookielib

username = 'me@example.com'
password = 'mypassword'

# Cookie-aware opener so the session cookies survive across requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# POST the login form, claiming to be a desktop Firefox
login_data = urllib.urlencode({'email': username, 'pass': password})
request = urllib2.Request('https://login.facebook.com/login.php')
request.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.12) Gecko/20101027 Fedora/3.6.12-1.fc14 Firefox/3.6.12')
opener.open(request, login_data)

# Fetch the home page, reusing the cookies set by the login
resp = opener.open('http://facebook.com')
print resp.read()
but I only end up with a captcha page. Any idea how FB is detecting that the request is not from a "normal" browser? I could add an extra step and solve the captcha, but that would add unnecessary complexity to the program, so I would rather avoid it. When I use a web browser with the same User-Agent string, I don't get a captcha.
Alternatively, does anyone have any saner ideas on how to accomplish my goal, i.e. get a list of friends of friends?
Have you tried tracing and comparing HTTP transactions with Fiddler2 or Wireshark? Fiddler can even trace https, as long as your client code can be made to work with bogus certs.
I tried many ways to scrape Facebook, and the only one that worked for me was:
Install Selenium: the Firefox plugin, the server, and the Python client library.
Then, with the Firefox plugin, you can record the actions you take to log in and export them as a Python script; use that as the base for your work and it will work (a sketch of what such a script boils down to is below). Basically, I added to this script a request to my web server to fetch a list of things to inspect on FB, and at the end of the script I send the results back to my server.
I could NOT find a way to do it directly from my server with a browser simulator like mechanize or the like! I guess it needs to be done from a client browser.
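For illustration, here is a minimal sketch of what such a recorded script boils down to, written against the current Selenium Python bindings. It assumes Firefox with geckodriver on the PATH; the element ids ('email', 'pass') and button name ('login') match Facebook's login form at the time of writing and may change:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    # Log in the way a real user would
    driver.get('https://www.facebook.com/login')
    driver.find_element(By.ID, 'email').send_keys('me@example.com')
    driver.find_element(By.ID, 'pass').send_keys('mypassword')
    driver.find_element(By.NAME, 'login').click()
    # From here you drive a real, cookie-carrying browser session:
    # navigate to the pages you need and read driver.page_source
finally:
    driver.quit()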
I am trying to write a script that tells me the number of unread emails, but I'm getting an AttributeError.
My code:
import requests
from bs4 import BeautifulSoup

class email:
    def unread():
        url = 'https://mail.google.com/mail/u/0/#inbox'
        headers_A = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l; rv:78.0) Gecko/20100101 Firefox/78.0'}
        site = requests.get(url, headers=headers_A)
        info = BeautifulSoup(site, 'html.parser')
        unread = info.find('div', attrs={'class', 'bsU'}).text
        return unread

email = email()
unread = email.unread()
print(unread)
The error:
AttributeError: module 'http' has no attribute 'client'
Thanks!
Solving this problem with a web scraper is an obvious idea, because you would normally check your email through a browser.
Yet a web scraper like this takes more than a single HTTP GET request to the inbox URL. You would need to go through Google's online authentication, which is quite complicated as far as I know, and then manage the session and cookies to stay logged in while you scrape.
There is actually a much easier way to solve this problem. Web scraping uses the HTTP protocol, but email uses another protocol that is much simpler and totally independent of HTTP: IMAP.
This means you don't need bs4 or requests at all; you can just connect to Google's IMAP server.
Here is a tutorial that explains how to write a simple Python IMAP client.
You will also need to allow less secure apps on your Google account for this to work. You can do that here.
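As a minimal sketch of that IMAP approach using the standard library's imaplib (the credentials are placeholders, and it assumes IMAP access is enabled for the account):

import imaplib

# Connect over SSL and authenticate (placeholder credentials)
conn = imaplib.IMAP4_SSL('imap.gmail.com')
conn.login('me@example.com', 'mypassword')

# Open the inbox read-only and ask the server for the ids of unread messages
conn.select('INBOX', readonly=True)
status, data = conn.search(None, 'UNSEEN')

print(len(data[0].split()))  # number of unread emails
conn.logout()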
I know this sounds weird, but I have no choice; I searched Google and found nothing, so...
I'm following a video tutorial (https://www.youtube.com/watch?v=JEW50aEVi4k) on building a web browser in Python, and I was wondering whether cookies can be saved. Is it possible?
If yes, could you give some suggestions?
Cookies are not a problem - you can use mechanize (https://pypi.python.org/pypi/mechanize/) which saves and sends the cookies automatically.
import mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
response = browser.open('http://www.youtube.com')
# Headers are handled automatically. You can access them:
headers = browser.request.header_items()
print headers
# [('Host', 'www.youtube.com'), ('Cookie', 'YSC=cNcoiHG71bY; VISITOR_INFO1_LIVE=uLHsDODGalg; PREF=f1=50000000'), ('User-agent', 'Python-urllib/2.7')]
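If by "saved" you mean persisted between runs, a minimal sketch using cookielib's LWPCookieJar together with mechanize (the filename 'cookies.txt' is just an example) could look like this:

import cookielib
import mechanize

cj = cookielib.LWPCookieJar('cookies.txt')  # example filename
browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.open('http://www.youtube.com')
cj.save(ignore_discard=True, ignore_expires=True)  # write cookies to disk
# On the next run, cj.load('cookies.txt', ignore_discard=True,
# ignore_expires=True) restores the saved session cookies.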
It is very hard to write a browser with JavaScript support. If you need JavaScript, then I suggest you use Selenium with PhantomJS, which acts just like a real browser.
I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I did implement a solution that solves the problem, but in my opinion it is inefficient. I need your help to improve it!
This is how I go over it now:
import requests
import cookielib
cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
#first request to get the cookies
requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
# second request reusing cookies served first time
r = requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
html_text = r.text
Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the wrong page, but as compensation it sets the cookies. The second request reuses these cookies and I get the right page.
The question is: is it possible to use just one request and still get the right, cookie-enabled version of the page?
I tried sending a HEAD request the first time instead of GET to minimize traffic, but in that case no cookies are served. Googling didn't give me the answer either.
So it would be interesting to understand how to do this efficiently. Any ideas?!
You need to make the request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version that recognizes your script as having cookies, then it all depends on the server, and you could try:
hardcoding the cookies before making the first request,
requesting the smallest possible page (with the smallest possible response that still contains the cookies) to obtain the first cookie,
trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies, but you would need to find it for this specific site).
I think the winner here might be to use requests's session framework, which takes care of the cookies for you.
That would look something like this:
import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

# A Session keeps cookies (and default headers) across requests
s = requests.Session()
s.headers.update(user_agent)
r = s.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', timeout=2)
html_text = r.text
Try that and see if it works.
I am trying to login to website using urllib2 and cookiejar. It saves the session id, but when I try to open another link, which requires authentication it says that I am not logged in. What am I doing wrong?
Here's the code, which fails for me:
import urllib
import urllib2
import cookielib
cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
# Gives a response saying that I logged in successfully
response = opener.open("http://site.com/login", "username=testuser&password=" + md5encode("testpassword"))
# Gives response saying that I am not logged in
response1 = opener.open("http://site.com/check")
Your implementation seems fine... and should work.
It should be sending the correct cookies, so I see this as a case where the site is actually not logging you in.
How do you know it isn't sending the cookies, or that the cookies you are receiving are not the ones that authenticate you? Use response.info() to see the headers of the responses and check which cookies you are actually receiving.
The site may not be logging you in because:
it has a check on the User-Agent, which you are not setting; some sites only accept the major browsers in order to disallow bot access (see the sketch below),
it might be looking for some special hidden form field that you are not sending.
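For the first point, a minimal sketch reusing the opener from the question (the User-Agent string is just an example):

# Make the opener identify itself as a regular browser
# instead of the default Python-urllib
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0')]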
One piece of advice:
from urllib import urlencode
# Use urlencode to encode your data
data = urlencode(dict(username='testuser', password=md5encode("testpassword")))
response = opener.open("http://site.com/login", data)
Moreover, one thing is strange here: you are md5-encoding your password before sending it over. This hashing is generally done by the server before comparing against the database. Client-side hashing is possible only if site.com implements md5 in JavaScript, and that is a very rare case; maybe 0.01% of websites do it.
Check that: it might be the problem. You could be providing the hashed form and not the actual password, so the server would then be calculating an md5 of your md5 hash.
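For reference, the md5encode helper used above is not shown in the question; a plausible sketch with the standard library's hashlib (the name and behavior are assumptions) would be:

import hashlib

def md5encode(text):
    # Hex digest of the md5 hash (in Python 3, hash text.encode() instead)
    return hashlib.md5(text).hexdigest()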
I had a similar problem with my own test server: it worked fine with a browser, but not with the urllib2.build_opener solution.
The problem seems to be in urllib2. As these answers suggest, it's easy to use the more powerful mechanize library instead of urllib2:
import cookielib
import mechanize

cookieJar = cookielib.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cookieJar)
opener = mechanize.build_opener(*browser.handlers)
And the opener will work as expected!
I am looking for a way to view the request (not response) headers, specifically what browser mechanize claims to be. Also, how would I go about manipulating them, e.g. setting another browser?
Example:
import mechanize
browser = mechanize.Browser()
# Now I want to make a request to eg example.com with custom headers using browser
The purpose is of course to test a website and see whether or not it shows different pages depending on the reported browser.
It has to be the mechanize browser as the rest of the code depends on it (but is left out as it's irrelevant.)
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
You've got an answer on how to change the headers, but if you want to see the exact headers that are being used, try using a proxy that displays the traffic, e.g. Fiddler2 on Windows, or see this question for some Linux alternatives.
You can modify the Referer too:
br.addheaders = [('Referer', 'http://google.com')]