For a personal project, I'm trying to get a full friends list of a user (myself for now) from Facebook using Requests and BeautifulSoup.
The main friends page, however, displays only 20; the rest are loaded with Ajax as you scroll down.
The request url looks something like this (method is GET):
https://www.facebook.com/ajax/pagelet/generic.php/AllFriendsAppCollectionPagelet?dpr=1&data={"collection_token":"1244314824:2256358349:2","cursor":"MDpub3Rfc3RydWN0dXJlZDoxMzU2MDIxMTkw","tab_key":"friends","profile_id":1244214828,"overview":false,"ftid":null,"order":null,"sk":"friends","importer_state":null}&__user=1364274824&__a=1&__dyn=aihaFayfyGmagngDxfIJ3G85oWq2WiWF298yeqrWo8popyUW3F6wAxu13y78awHx24UJi28cWGzEgDKuEjKeCxicxabwTz9UcTCxaFEW58nVV8-cxnxm1typ9Voybx24oqyUf9UgC_UrQ4bBv-2jAxEhw&__af=o&__req=5&__be=-1&__pc=EXP1:DEFAULT&__rev=2677430&__srp_t=1474288976
My question is: is it possible to recreate the dynamically generated tokens such as __dyn, cursor, collection_token, etc., and send them manually in my request? Is there some way to figure out how they are generated, or is it a lost cause?
I know that the current Facebook API does not support viewing a full friends list. I also know that I can do this with Selenium, or some other browser simulator, but that feels way too slow, ideally I want to scrape thousands of friends lists (of users whose friends lists are public) in a reasonable time.
My current code is this:
import requests
from bs4 import BeautifulSoup
with requests.Session() as S:
    # Set the locale cookie so the pages come back in English.
    requests.utils.add_dict_to_cookiejar(S.cookies, {'locale': 'en_US'})
    form = {}
    form['email'] = 'myusername'
    form['pass'] = 'mypassword'
    response = S.post('https://www.facebook.com/login.php?login_attempt=1&lwv=110', data=form)
    # Logged in at this point; fetch the friends page.
    page = S.get('https://www.facebook.com/yoshidakai/friends?source_ref=pb_friends_tl')
Any help will be appreciated, including other methods to achieve this :)
As of this writing, you can extract this information by parsing the page and then getting the next cursor for later pages by parsing the preceding Ajax response. However, as Facebook regularly updates its backend, I have had more stable results using Selenium to drive a headless Chrome browser that scrolls through the page, and then parsing the resulting HTML.
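For illustration, here is a minimal sketch of that headless-browser approach. It assumes you are already logged in (for example via a saved Chrome profile or by filling the login form first), and the CSS selector for friend entries and the number of scroll iterations are placeholders you would adjust to the real page structure:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Assumes the session is already authenticated (e.g. via a saved profile
# or by filling the login form before this point).
driver.get('https://www.facebook.com/yoshidakai/friends?source_ref=pb_friends_tl')

# Scroll a fixed number of times, pausing so the Ajax-loaded friends can render.
for _ in range(30):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, 'html.parser')
# Hypothetical selector -- inspect the rendered HTML to find the real one.
friend_links = [a.get('href') for a in soup.select('div[data-testid="friend_list_item"] a')]
driver.quit()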
Related
I'm just trying to log into VRV and get the list of shows from the Crunchyroll page so I can open the site later, but when I try to get back the parsed website after logging in, there's a lot of info missing, like titles and images, and it's incomplete. This is the code I have so far. Obviously my email and password aren't literally 'email' and 'password'; I just changed them to post it here.
import requests
import pyperclip as p
def enterVrv():
    s = requests.Session()
    dataL = {'email': 'email', 'password': 'password'}
    s.post('https://static.vrv.co/vrvweb/build/common.6fb25c4cff650ac4e6ae.js', data=dataL)
    crunchy = s.get('https://vrv.co/crunchyroll/browse')
    p.copy(str(crunchy.content))
    exit(0)
I've tried posting to the normal 'https://vrv.co' site, to the 'https://vrv.co.signin' link, and to the link you currently see in the code, which I got from the Network pane in the developer tools. After running the code I would take the copied HTML and replace the current one in a web browser to see if it rendered correctly, but it always comes back incomplete.
It looks like your problem is that you're trying to get data from a web page that is loaded dynamically. Indeed, if you navigate to https://vrv.co/crunchyroll/browse in your browser you'll likely notice a delay between the page loading and the Crunchyroll titles being displayed.
It also looks like VRV does not expose an API for you to access this data programmatically.
To get around this you could try accessing the page via a web automation tool such as selenium and scraping the data that way. As for just making a basic request to the site though, you're probably out of luck.
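If you do go the Selenium route, a minimal sketch might look like the following. The CSS selector used for the show tiles is an assumption; inspect the rendered page to find the real one, and you would still need to handle the VRV login step first:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://vrv.co/crunchyroll/browse')

# Wait up to 20 seconds for at least one show tile to appear
# ('.browse-card' is a guess; check the real markup).
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.browse-card'))
)

titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.browse-card')]
print(titles)
driver.quit()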
I'm doing a personal project where I am trying to scrape HTML tables from a financial data website using Python. I am able to successfully use the requests package in Python to access public websites and extract information (using BeautifulSoup4 afterwards for processing). The code I am using is shown below:
import requests

# access the website through the EZproxy URL
url = 'https://financial-data-url.ezproxy1.library.uniname.edu.com/path/to/financial/data'
headers = example_header  # placeholder for the headers sent with the request
page = requests.get(url, headers=headers)
However, accessing the website normally requires logging in to my University's library database through an EZproxy server (as shown in the example URL). When I attempt to request the URL of the financial data webpage after getting access through the library database, it returns what seems to be the University library EZproxy webpage. This is where I need to click "login" before being directed to the financial data webpage.
Is there some credential provision that I may be missing in the request function, or potentially a different way of passing the proxy server to the URL so that the request does not end up on the proxy server login page?
I found that the fastest and most effective workaround for this problem is to use the Selenium browser-automation package (https://selenium-python.readthedocs.io/).
Selenium makes it very easy to replicate a login, as well as navigation within the browser, just as a person would. In my opinion, its simplicity may far outweigh the benefits of calling the web page directly, depending on the use case: it is not efficient when speed is the primary goal, but if that is not a major constraint it works quite well.
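As a rough sketch of that approach, something like the following could work. The form field names and the proxied URL are placeholders; inspect your library's EZproxy login form to find the real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Request the proxied URL; EZproxy should redirect to its login page.
driver.get('https://financial-data-url.ezproxy1.library.uniname.edu.com/path/to/financial/data')

# Fill in the EZproxy login form (placeholder field names).
driver.find_element(By.NAME, 'user').send_keys('my_university_id')
driver.find_element(By.NAME, 'pass').send_keys('my_password')
driver.find_element(By.NAME, 'pass').submit()

# After authentication the proxied financial-data page should load;
# hand the HTML to BeautifulSoup as before.
html = driver.page_source
driver.quit()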
I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It lists a total of 55 products, but only 24 are shown once the page is loaded. However, the div contains the list of all 55 products. I am trying to scrape that using Scrapy like this:
def parse(self, response):
    print("in herre")
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly - the other option would be rendering Javascript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools on Chrome/Firefox, filter down to only the XHR requests and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML.
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
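A minimal sketch of that idea, assuming the endpoint returns a Shopify-style {"products": [...]} payload (verify the actual response shape in the Network panel):

import requests
from scrapy.selector import Selector

url = 'https://www.jny.com/collections/bottoms/products.json?limit=250&page=1'
data = requests.get(url).json()

for product in data.get('products', []):
    body_html = product.get('body_html') or ''
    # body_html holds the product description as HTML; strip the tags with a Selector.
    text = ' '.join(Selector(text=body_html).xpath('//text()').getall()) if body_html else ''
    print(product.get('title'), text)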
I am trying to scrape the reports from this site. I hit the home page, enter a report date and hit submit; the page is Ajax-enabled and I cannot work out how to get the report table. Any help will be really appreciated.
https://www.theice.com/marketdata/reports/176
I tried sending GET and POST requests using the requests module, but failed with either a session timeout or 'Report not Available'.
EDIT:
Steps Taken so far:
URL = "theice.com/marketdata/reports/datawarehouse/..."
with requests.Session() as sess:
    f = sess.get(URL, params={'selectionForm': ''})  # Got 'selectionForm' by analyzing GET requests to URL
    data = {'criteria.ReportDate': --, ** few more params i got from hitting submit}
    f = sess.post(URL, data=data)
    f.text  # Session timeout / No Reports Found
Since you've already identified that the data you're looking to scrape is hidden behind some AJAX calls, you're already on your way to solving this problem.
At the moment, you're using python-requests for HTTP, but that is pretty much all it does: it does not execute JavaScript or otherwise run code embedded in the page. For that, you'll need a browser-automation tool such as Selenium (Mechanize can handle forms and cookies, but not JavaScript) to load the website, let the JavaScript run, and then scrape the data you're looking for.
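For illustration, a Selenium-based sketch might look like this. Every locator below (the date field name, the submit button, and the results table) is a placeholder; inspect the actual report page to find the real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.theice.com/marketdata/reports/176')

# Enter the report date and submit the form (placeholder locators).
driver.find_element(By.NAME, 'criteria.reportDate').send_keys('09/19/2016')
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()

# Wait for the report table to be rendered by the page's JavaScript.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table'))
)

for row in driver.find_elements(By.CSS_SELECTOR, 'table tr'):
    print([cell.text for cell in row.find_elements(By.TAG_NAME, 'td')])
driver.quit()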
I am signing into my account at www.goodreads.com to scrape the list of books from my profile.
However, when I go to the goodreads page, even if I am logged in, my scraper gets only the home page. It cannot log in to my account. How do I redirect it to my account?
Edit:
from bs4 import BeautifulSoup
import urllib2
response=urllib2.urlopen('http://www.goodreads.com')
soup = BeautifulSoup(response.read())
[x.extract() for x in soup.find_all('script')]
print(soup.get_text())
If I run this code, I only get as far as the home page; I cannot log in to my profile, even though I am already logged in in the browser.
What do I do to log in from a scraper?
Actually, when you visit the site your browser keeps session cookies that carry information about your account (not exactly, but something like that), so every time you go to the main page you are already logged in. Your code, however, does not use those sessions, so it has to do everything from scratch:
1) go to the main page, 2) log in, 3) gather your data (see the sketch below).
This related question also shows how to log in to your account.
I hope it helps.
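As a rough sketch of those three steps with requests, something like the following could work. The sign-in URL and form field names are assumptions, and the login form may also require a CSRF token taken from a hidden input; inspect the actual form to confirm:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # 1) Go to the login page so the session picks up cookies and any CSRF token.
    login_page = session.get('https://www.goodreads.com/user/sign_in')
    soup = BeautifulSoup(login_page.text, 'html.parser')
    token_input = soup.find('input', {'name': 'authenticity_token'})  # may or may not exist

    # 2) Log in with the same session (placeholder field names).
    payload = {
        'user[email]': 'you@example.com',
        'user[password]': 'your_password',
    }
    if token_input:
        payload['authenticity_token'] = token_input['value']
    session.post('https://www.goodreads.com/user/sign_in', data=payload)

    # 3) Gather your data while the session cookies keep you logged in.
    profile = session.get('https://www.goodreads.com/review/list/your_user_id')
    print(BeautifulSoup(profile.text, 'html.parser').get_text()[:500])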
Goodreads has an API that you might want to use instead of trying to log in and scrape the site's HTML. It's formatted in XML, so you can still use BeautifulSoup - just make sure you have lxml installed and use it as the parser. You'll need to register for a developer key, and also register your application, but then you're good to go.
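For example, once you have a developer key you could fetch a user's shelf as XML and parse it with BeautifulSoup's lxml XML parser. The endpoint and parameters below are illustrative and should be checked against the API documentation:

import requests
from bs4 import BeautifulSoup

params = {
    'v': 2,
    'id': 'your_user_id',         # numeric Goodreads user id
    'key': 'your_developer_key',  # from your registered application
    'shelf': 'read',
}
response = requests.get('https://www.goodreads.com/review/list.xml', params=params)

# Use the lxml XML parser so BeautifulSoup handles the XML correctly.
soup = BeautifulSoup(response.content, 'xml')
for title in soup.find_all('title'):
    print(title.get_text())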
You can use the urllib2 or requests library to log in and then scrape the response. In my experience, using requests is a lot easier.
Here's a good explanation on logging in using both urllib2 and requests:
How to use Python to login to a webpage and retrieve cookies for later usage?