Scrape Facebook likes with Python

I'm trying to scrape likes data from public Facebook pages using Python. My scraper uses the post number to collect the likes data. However, some posts have more than 6000 likes, and I can only ever retrieve 6000 of them; I've been told this is due to a Facebook restriction that doesn't allow scraping more than 6000 per day. How can I resume scraping the likes for a post from the point where the scraper stopped?

I suspect Facebook has rate-limited scraping from your address once it passed 6000 requests. You could try Scrapy, a framework for scraping web pages; it supports per-request proxies, which works like an IP pool and can be used to get around this kind of limit.
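A minimal sketch of that idea, assuming you have a pool of working proxies (the addresses and page name below are placeholders); Scrapy's built-in HttpProxyMiddleware honors the proxy key in request.meta:

import random
import scrapy

# Placeholder proxies -- replace with servers you actually control.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

class LikesSpider(scrapy.Spider):
    name = "likes"
    start_urls = ["https://www.facebook.com/SomePublicPage"]  # hypothetical page

    def start_requests(self):
        for url in self.start_urls:
            # Rotate through the proxy pool so requests come from
            # different addresses instead of one rate-limited IP.
            yield scrapy.Request(url, meta={"proxy": random.choice(PROXIES)})

    def parse(self, response):
        self.logger.info("Fetched %s (%d bytes)", response.url, len(response.text))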

In the tags I see facebook-graph-api, which has its own limitations. Why not use requests + lxml instead? It would be much simpler, and since you only want to scrape public pages, you don't even have to log in, so it should be straightforward.
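As a rough sketch of that approach (the URL and XPath are placeholders; Facebook's actual markup changes frequently, so you would need to inspect the live page to find the real selector):

import requests
from lxml import html

# Hypothetical public page; no login required for public content.
url = "https://www.facebook.com/SomePublicPage"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()

tree = html.fromstring(response.text)
# Placeholder XPath -- inspect the page to find where like counts live.
for count in tree.xpath("//span[contains(@class, 'like-count')]/text()"):
    print(count)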

Related

Automated web scraper for specific words

Let's say I want to make a website that automatically scrapes specific websites in order to find, e.g., the bike model that my customer has typed in.
Customer: wants to find one specific bike model that is really hard to get.
Customer: finds the website www.EXAMPLE.com, which will notify him when there is an auction on, e.g., eBay or Amazon.
Customer: creates a free account and makes a post.
Website: runs an automated scraper and keeps looking for this bike on eBay and Amazon.
Website: as soon as the scraper succeeds and finds the bike, the website sends a notification to the customer.
Is that possible to make in Python? And would I be able to build such a website with little knowledge, after learning a bit of Python?
Yes, it's possible. You can achieve that by using a package such as Requests for the scraping and Flask to build the website; it does, however, require a bit of knowledge. A rough sketch follows.
Feel free to post a question after diving into the two links.
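A minimal sketch of the idea, assuming a hypothetical search URL and simple keyword matching; real eBay/Amazon monitoring would need their markup or official APIs, and the notification is stubbed out with a print:

import requests
from flask import Flask, request

app = Flask(__name__)
watches = []  # in-memory store; a real site would use a database

@app.route("/watch", methods=["POST"])
def add_watch():
    # The customer posts the model name they want to be notified about.
    watches.append(request.form["model"])
    return "Watch created\n"

def check_watches():
    # Run this periodically, e.g. from a cron job or a scheduler.
    for model in watches:
        # Hypothetical search page; the real site's markup would need inspecting.
        page = requests.get("https://www.example.com/search", params={"q": model})
        if model.lower() in page.text.lower():
            print(f"Found a listing for {model} -- send the notification here.")

if __name__ == "__main__":
    app.run()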

Python web scraping, but blocked

I'm trying to do web scraping with the BeautifulSoup and requests libraries, but I got blocked by the website.
Instead of copy/pasting from the website, I wanted to do it automatically, so I tried it with Python.
All I did was:
import requests
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='list-xxx')
I was trying to understand the HTML, and when I went back to the website, I was blocked.
How come? I did not send a thousand requests.
Does this mean we can't do web scraping?
Thanks
This can happen for many reasons. It's possible that you have moved to a country the website does not serve, or that you have violated their terms by sending too many requests.
In such cases you can take the behavior you describe as an indication that the owners of the website either do not want you to scrape their information, or that they interpreted the frequency of your requests as an attempt at a DDoS (Distributed Denial of Service) attack.
If they do not want to allow scraping, it's advisable not to do it. If they don't have a problem with scraping, it's a good idea to contact them and ask about their policy (if it isn't public already) so you can comply with it and scrape in a way that doesn't offend them.
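One common courtesy, sketched below under the assumption that the site permits scraping at all: identify your client honestly and space the requests out so they don't look like an attack. The URLs and contact address are placeholders.

import time
import requests

headers = {
    # Identify yourself; a contact address lets the site owner reach you.
    "User-Agent": "my-research-bot/0.1 (contact: me@example.com)",
}

urls = ["https://www.example.com/page1", "https://www.example.com/page2"]
for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(5)  # pause between requests instead of hammering the server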

Any way to get an Instagram business profile's contact number using an unofficial Instagram API or scraping?

There are lots of APIs that allow scraping a business profile's email, but I can't find anything that can scrape a business profile's contact number. I think the main reason is that the contact option is available only in the Instagram app.
I am not aware of any "unofficial APIs", so I'll talk about scraping.
Instagram treats all accounts the same; there is no easy way to filter business accounts (via the URL, for example).
What you can try is to first scrape the follower/following list of some big player (a popular account) in the domain you want accounts for, then loop through each of them
via
https://www.instagram.com/[Account user-id scraped]/
and for each one, check whether it's a business account (look for some HTML marker that appears only on business accounts, or anything similar). A sketch follows.
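A rough sketch of that loop, assuming you already have the scraped usernames; the marker string is a pure placeholder, since Instagram's markup changes and you would have to inspect a known business profile to find the real indicator:

import time
import requests

usernames = ["scraped_user_1", "scraped_user_2"]  # from the follower list

for user in usernames:
    page = requests.get(f"https://www.instagram.com/{user}/",
                        headers={"User-Agent": "Mozilla/5.0"})
    # Placeholder marker -- replace with the HTML that only business
    # accounts contain, found by inspecting a known business profile.
    if "business_contact_method" in page.text:
        print(f"{user} looks like a business account")
    time.sleep(2)  # be gentle; Instagram blocks aggressive clients quickly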

Pages not processing fully

I am trying to scrape news articles from Yahoo Finance, and to do so I want to use their sitemap page, https://finance.yahoo.com/sitemap/
The problem I have is that after following a link such as https://finance.yahoo.com/sitemap/2015_04_02, Scrapy does not process the whole page, only the header, so I cannot access the links to the different articles.
Are there some internal requests that I have to send to the page?
I still get the whole page when I deactivate JavaScript in my browser, and I use Scrapy 1.6.
Thanks.
Some sites take defensive measures against robots scraping their websites. If they detect that you are non-human, they may not serve the entire page. More likely, though, a lot of client-side rendering happens when you view the page in a web browser, and that JavaScript is not executed when you request the same page in Scrapy.
Yahoo! Finance has an API. Using that will probably get you more reliable results.
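A quick way to check whether client-side rendering is the culprit: fetch the raw HTML the way Scrapy sees it and count the links, sketched here with plain requests. If far fewer links show up than you see in the browser, the rest are injected by JavaScript after the page loads.

import requests
from bs4 import BeautifulSoup  # Scrapy receives this same raw HTML

raw = requests.get("https://finance.yahoo.com/sitemap/2015_04_02",
                   headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(raw, "html.parser")

# Count the anchors actually present in the server's response.
links = soup.find_all("a")
print(f"{len(links)} links present in the raw HTML")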

My script fetches only a few items out of many

I've written a script to get all the reviews, reviewer names, and ratings from Yelp using their API. My script below can fetch three reviews with reviewer names and ratings from that API. However, I can see 44 such reviews on the landing page where I found the API endpoint using Chrome dev tools. How can I get all of them?
link to the landing page
This is my try:
import requests

res = requests.get("https://eatstreet.com/api/v2/restaurants/40225?yelp_site=")
name = res.json()['name']
# Loop over the reviews included in the JSON payload.
for review in res.json()['yelpReviews']:
    reviewtext = review['message']
    revname = review['reviewerName']
    rating = review['rating']
    print(f'{name}\n{reviewtext}\n{revname}\n{rating}\n')
As I said earlier, my script above can fetch three of the reviews, whereas there are 44 of them. How can I grab them all?
Screenshot of those reviews (where to find them on that landing page).
Yelp's own API doesn't allow querying more than 3 reviews; for whatever reason they limit the number of reviews you can get (the same way Google limits their API to displaying only 5 reviews). If you are scraping, scrape the Yelp page directly: the site you are hitting uses the API to display 3 reviews (the maximum) with a link back to that location's Yelp page, where all the reviews are shown. There is sadly no native way to extract all the reviews from Yelp.
The API URL you pulled from Chrome's Developer Tools inspector (https://eatstreet.com/api/v2/restaurants/40225?yelp_site=) calls Fusion (Yelp's API) to pull the yelpReviews array in the JSON. It is limited to 3 by default; even if you registered your own Fusion app, you wouldn't be able to pull more than 3 reviews. That's a hard cap set by Yelp, and you can confirm it with the check sketched below.
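A minimal check of that cap, using only the endpoint and JSON fields already shown in the question:

import requests

res = requests.get("https://eatstreet.com/api/v2/restaurants/40225?yelp_site=")
reviews = res.json()['yelpReviews']
# The Fusion-backed payload tops out at 3 reviews, regardless of how
# many the restaurant actually has on Yelp.
print(f"reviews returned: {len(reviews)}")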
You could search for some makeshift scripts out there, though; I've seen many people attempt to create libraries for pulling review data where the APIs are limited. A good example is one I wrote here: https://github.com/ilanpatao/Yelp-Reviews-API
Best,
Ilan
