Web-crawler for facebook in python


I am trying to write a web crawler in Python that prints the number of Facebook recommendations for a page. For example, this Sky News article (http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine)
has about 60 Facebook recommendations. I want my crawler to print that number.
I tried the following, but it doesn't print anything:
import requests
from bs4 import BeautifulSoup
def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    # if you want to gather information from that page
    for item_name in soup.findAll('span', {'class': 'pluginCountTextDisconnected'}):
        try:
            print(item_name.string)
        except:
            print("error")
get_single_item_data("http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine")

The Facebook recommend button loads in an iframe. You can follow the iframe's src attribute to that page and then read the text of span.pluginCountTextDisconnected:
import requests
from bs4 import BeautifulSoup
url = 'http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine'
r = requests.get(url) # get the page through requests
soup = BeautifulSoup(r.text) # create a BeautifulSoup object from the page's HTML
url = soup('iframe')[0]['src'] # search for the iframe element and get its src attribute
r = requests.get('http://' + url[2:]) # get the next page from requests with the iframe URL
soup = BeautifulSoup(r.text) # create another BeautifulSoup object
print(soup.find('span', class_='pluginCountTextDisconnected').string) # get the directed information
The second requests.get is written this way because the src attribute returns a protocol-relative URL: //www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046&send=false&layout=button_count&width=120&show_faces=false&action=recommend&colorscheme=light&font=arial&height=21. I dropped the leading // and prepended http://.
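As an alternative to slicing the string by hand, the standard library's urljoin resolves a protocol-relative URL against the page it came from. This is only a sketch: the iframe_src value below is the src from the answer cut down to its first query parameter for readability.

```python
from urllib.parse import urljoin

# URL of the page that embeds the iframe (taken from the answer above).
page_url = "http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine"

# Protocol-relative src value; shortened to one query parameter for this sketch.
iframe_src = "//www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046"

# urljoin keeps the host from iframe_src and borrows the scheme from page_url.
iframe_url = urljoin(page_url, iframe_src)
print(iframe_url)
```

Because the scheme is taken from the embedding page, the same line keeps working if the page is later served over https.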
BeautifulSoup documentation
Requests documentation

Facebook recommendation counts are loaded dynamically by JavaScript, so they won't be available to your HTML parser. You will need to use the Graph API and FQL to get your answer directly from Facebook.
Here is a web console where you can explore queries once you have generated an access token.

Related

Webscraping / Beautifulsoup / sometimes None-return?

I am trying to scrape some information from web pages. On one page it works fine, but on another it does not: I only get None as the return value.
This code / web page works fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.findAll("div", attrs={"class": "company"})
print (name_box)
But with this code / web page I only get None as the return value:
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print (name_box)
Why is that?
(At first I thought the class name on the second page, "companyName__99a4824b", changes dynamically because of the number in it, but that is not the case: when I refresh the page the class name stays the same.)

The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page.
requests fetches only the HTML the server sends initially, and BeautifulSoup parses exactly that HTML, which does not contain an element with the companyName__99a4824b class.
Only after the page has fully loaded in a browser does the HTML include the desired tag.
If you want to scrape that data, you'll need something like Selenium, which you can instruct to wait until the desired element of the page is ready.
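The None result can be reproduced without touching the network. The sketch below parses a made-up page skeleton (not Bloomberg's real markup) in which the heading has not been inserted yet, and shows the check that avoids an AttributeError further down the line.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML a server sends before any JavaScript runs:
# the heading the scraper wants is simply not in the markup yet.
initial_html = """
<html>
  <head><title>Example quote page</title></head>
  <body>
    <div id="root"></div>
    <script src="app.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(initial_html, "html.parser")

# find() returns None when nothing matches, so check before using the result.
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
if name_box is None:
    result = "element not present in the initial HTML"
else:
    result = name_box.get_text(strip=True)

print(result)
```

If the same check reports a missing element on the real response, the content is most likely added later by a script, and a browser-driving tool is needed.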

The website blocks scrapers; check the title of the page you get back:
print(soup.find("title"))
To bypass this you must use a real browser which can run JavaScript.
A tool called Selenium can do that for you.

Beautifulsoup extracting urls from a given website menu

Hello everyone. I'm new to BeautifulSoup and I'm trying to write a function that extracts the second-level URLs from a given website.
For example, given the URL https://edition.cnn.com/, my function should return:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
First I tried this code to retrieve all links starting with the URL string:
from bs4 import BeautifulSoup as bs4
import requests
import lxml
import re
def getLinks(url):
    response = requests.get(url)
    data = response.text
    soup = bs4(data, 'lxml')
    links = []
    for link in soup.find_all('a', href=re.compile(str(url))):
        links.append(link.get('href'))
    return links
But the actual output contains all the links, including links to articles, which is not what I'm looking for. Is there a method I can use to get what I want, with regular expressions or otherwise?

The links are inside a <nav> tag, so the CSS selector nav a[href] will select only the links inside <nav>:
import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('nav a[href]'):
    # skip hrefs with more than one slash (deeper paths, absolute URLs) and in-page anchors
    if a['href'].count('/') > 1 or '#' in a['href']:
        continue
    print(url + a['href'])
Prints:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk
...and so on.
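The same selector and filter can be tried offline on a small piece of markup. The HTML below is invented for illustration and is not CNN's real menu; it includes a nested path, an in-page anchor, an <a> without href, and an article link outside <nav> to show what gets excluded.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://edition.cnn.com"

# Hypothetical markup: a small menu plus an article link outside <nav>.
html = """
<nav>
  <a href="/world">World</a>
  <a href="/politics">Politics</a>
  <a href="/world/africa">Africa</a>
  <a href="#top">Back to top</a>
  <a>No href here</a>
</nav>
<main>
  <a href="/2020/01/01/some-article/index.html">An article</a>
</main>
"""

soup = BeautifulSoup(html, "html.parser")

menu_links = []
for a in soup.select("nav a[href]"):
    href = a["href"]
    # Keep only single-segment relative paths such as "/world".
    if href.count("/") > 1 or "#" in href:
        continue
    menu_links.append(urljoin(base_url, href))

print(menu_links)
```

Note that the slash-count filter assumes the menu uses relative hrefs; a menu built from absolute URLs would need a different test.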

Problem with scraping data from website with BeautifulSoup

I am trying to take a movie rating from the website Letterboxd. I have used code like this on other websites and it has worked, but it is not getting the info I want off of this website.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://letterboxd.com/film/avengers-endgame/")
soup = BeautifulSoup(page.content, 'html.parser')
final = soup.find("section", attrs={"class": "section ratings-histogram-chart"})
print(final)
This prints nothing, but there is a tag in the website for this class and the info I want is under it.

The reason is that the website loads most of its content asynchronously, so you'll have to look at the HTTP requests the page sends to the server after the layout has loaded. You can find them in the "Network" tab of the browser's developer tools (F12 key).
For instance, one of the endpoints used to load the rating is this one:
https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/

You can get the weighted average from another tag
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/film/avengers-endgame/')
soup = bs(r.content, 'lxml')
print(soup.select_one('[name="twitter:data2"]')['content'])
The text of every histogram entry:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/')
soup = bs(r.content, 'lxml')
ratings = [item['title'].replace('\xa0',' ') for item in soup.select('.tooltip')]
print(ratings)

Links from BeautifulSoup without href or <a>

I am trying to create a bot that scrapes all the image links from a site and stores them somewhere else so I can download the images afterwards.
from selenium import webdriver
import time
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.artstation.com/artwork?sorting=trending'
page = requests.get(url)
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
soup = bs(driver.page_source, 'html.parser')
gallery = soup.find_all(class_="image-src")
data = gallery[0]
for x in range(len(gallery)):
    print("TAG:", sep="\n")
    print(gallery[x], sep="\n")
if page.status_code == 200:
    print("Request OK")
This returns all the tags I wanted, but I can't find a way to strip the HTML or copy only the links to a new list. Here is an example of a tag I get:
<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>
So, how do I get only the links from the gallery list?
Afterwards I want to take these links and change the /smaller_square/ directory to /large/, which is the one that holds the high-resolution image.

The page loads its data through AJAX, and the network inspector shows where the call is made. This snippet obtains all the image links found on page 1, sorted by trending:
import requests
import json
url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
page = requests.get(url)
json_data = json.loads(page.text)
for data in json_data['data']:
    print(data['cover']['medium_image_url'])
Prints:
https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480
https://cdna.artstation.com/p/assets/covers/images/012/279/572/medium/ham-sung-choul-braveking-140823-1-3-s3-mini.jpg?1533959982
https://cdnb.artstation.com/p/assets/covers/images/012/275/963/medium/michael-vicente-orb-gem-thumb.jpg?1533933774
https://cdnb.artstation.com/p/assets/images/images/012/275/635/medium/michael-kutsche-piglet-by-michael-kutsche.jpg?1533932387
https://cdna.artstation.com/p/assets/images/images/012/273/384/medium/ben-zhang-unnamed.jpg?1533923353
https://cdnb.artstation.com/p/assets/covers/images/012/273/083/medium/michael-vicente-orb-guardian-thumb.jpg?1533922229
... and so on.
If you print the variable json_data, you will see other information the page sends (like icon image url, total_count, data about the author etc.)

You can access a tag's attributes the same way you access dictionary keys.
Example:
from bs4 import BeautifulSoup
s = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''
soup = BeautifulSoup(s, "html.parser")
print(soup.find("div", class_="image-src")["image-src"])
#or
print(soup.find("div", class_="image-src").attrs['image-src'])
Output:
https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301
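Extending this to the whole gallery, the sketch below collects the image-src attribute of every matching tag into a list and then swaps the directory name as the question describes. The second URL is made up for the example, and whether a /large/ version exists for every image is the asker's assumption, not something verified here.

```python
from bs4 import BeautifulSoup

# Sample tags in the shape shown in the question; the second URL is made up.
html = """
<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301"></div>
<div class="image-src" image-src="https://example.com/p/assets/smaller_square/sample.jpg?1"></div>
<div class="image-src"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None for a missing attribute, so tags without one are skipped.
thumb_urls = [
    tag.get("image-src")
    for tag in soup.find_all(class_="image-src")
    if tag.get("image-src")
]

# Swap the directory name, as the question describes (the /large/ path is the asker's claim).
large_urls = [url.replace("/smaller_square/", "/large/") for url in thumb_urls]

print(large_urls)
```

With Selenium, the same list comprehension can be applied to a soup built from driver.page_source instead of the sample string.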

Returning webSITE links Python Beautifulsoup

I am using Python 3.5 with BeautifulSoup (bs4) and urllib. The code below returns all the links for ONE page.
How do I loop this so it runs across all pages of the website, using the links found on each page to decide which pages to scrape next? I don't know in advance how many hops I need.
I have tried looping it, of course, but it never stops because pages contain links to pages I have already scanned. I have tried keeping a set of the links I have scanned and checking "if not in set", but again it just runs forever.
import bs4
import re
import urllib.request
website = 'http://elderscrolls.wikia.com/wiki/Skyrim'
req = urllib.request.Request(website)
with urllib.request.urlopen(req) as response:
    the_page = response.read()  # store web page html
dSite = bs4.BeautifulSoup(the_page, "html.parser")
links = []
for link in dSite.find_all('a'):  # grab all links on page
    links.append(link.get('href'))
siteOnly = re.split('/', website)
validLinks = set()
for item in links:
    if re.search('^/' + siteOnly[3] + '/', str(item)):  # filter links to local website
        newLink = 'http://' + str(siteOnly[2]) + str(item)
        validLinks.add(newLink)
print(validLinks)

import bs4, requests
from urllib.parse import urljoin
base_url = 'http://elderscrolls.wikia.com/wiki/Skyrim'
response = requests.get(base_url)
soup = bs4.BeautifulSoup(response.text, 'lxml')
local_a_tags = soup.select('a[href^="/wiki/"]')
links = [a['href'] for a in local_a_tags]
full_links = [urljoin(base_url, link) for link in links]
print(full_links)
Output:
http://elderscrolls.wikia.com/wiki/The_Elder_Scrolls_Wiki
http://elderscrolls.wikia.com/wiki/Portal:Online
http://elderscrolls.wikia.com/wiki/Quests_(Online)
http://elderscrolls.wikia.com/wiki/Main_Quest_(Online)
http://elderscrolls.wikia.com/wiki/Aldmeri_Dominion_Quests
http://elderscrolls.wikia.com/wiki/Daggerfall_Covenant_Quests
http://elderscrolls.wikia.com/wiki/Ebonheart_Pact_Quests
http://elderscrolls.wikia.com/wiki/Category:Online:_Side_Quests
http://elderscrolls.wikia.com/wiki/Factions_(Online)
http://elderscrolls.wikia.com/wiki/Aldmeri_Dominion_(Online)
http://elderscrolls.wikia.com/wiki/Daggerfall_Covenant
http://elderscrolls.wikia.com/wiki/Ebonheart_Pact
http://elderscrolls.wikia.com/wiki/Classes_(Online)
http://elderscrolls.wikia.com/wiki/Dragonknight
http://elderscrolls.wikia.com/wiki/Sorcerer_(Online)
http://elderscrolls.wikia.com/wiki/Nightblade_(Online)
http://elderscrolls.wikia.com/wiki/Templar
http://elderscrolls.wikia.com/wiki/Races_(Online)
http://elderscrolls.wikia.com/wiki/Altmer_(Online)
http://elderscrolls.wikia.com/wiki/Argonian_(Online)
http://elderscrolls.wikia.com/wiki/Bosmer_(Online)
http://elderscrolls.wikia.com/wiki/Breton_(Online)
http://elderscrolls.wikia.com/wiki/Dunmer_(Online)
http://elderscrolls.wikia.com/wiki/Imperial_(Online)
http://elderscrolls.wikia.com/wiki/Khajiit_(Online)
http://elderscrolls.wikia.com/wiki/Nord_(Online)
http://elderscrolls.wikia.com/wiki/Orsimer_(Online)
http://elderscrolls.wikia.com/wiki/Redguard_(Online)
http://elderscrolls.wikia.com/wiki/Locations_(Online)
http://elderscrolls.wikia.com/wiki/Regions_(Online)
http://elderscrolls.wikia.com/wiki/Category:Online:_Realms
http://elderscrolls.wikia.com/wiki/Category:Online:_Cities
http://elderscrolls.wikia.com/wiki/Category:Online:_Dungeons
http://elderscrolls.wikia.com/wiki/Category:Online:_Dark_Anchors
http://elderscrolls.wikia.com/wiki/Wayshrines_(Online)
http://elderscrolls.wikia.com/wiki/Category:Online:_Unmarked_Locations
http://elderscrolls.wikia.com/wiki/Combat_(Online)
http://elderscrolls.wikia.com/wiki/Skills_(Online)
http://elderscrolls.wikia.com/wiki/Ultimate_Skills
http://elderscrolls.wikia.com/wiki/Synergy
http://elderscrolls.wikia.com/wiki/Finesse
http://elderscrolls.wikia.com/wiki/Add-ons
First, use requests instead of urllib.
Then, use a BeautifulSoup CSS selector to filter the hrefs by their prefix; the documentation covers selectors in more detail.
Finally, use urljoin to convert relative URLs to absolute URLs:
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
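The answer above collects the links of one page; the question also asks how to follow them without looping forever. One common approach, sketched here under assumptions, is a breadth-first crawl that records a URL as seen at the moment it is queued, strips #fragments so the same page is not counted twice, and stops at a page limit. The crawl function, its get_links parameter, and the FAKE_SITE dictionary are illustrative names, not part of the original code; FAKE_SITE stands in for real HTTP requests so the sketch runs offline.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl that visits each URL at most once.

    get_links(url) must return the href values found on that page.
    """
    seen = {start_url}          # every URL ever queued
    queue = deque([start_url])  # URLs waiting to be fetched
    visited = []                # URLs in the order they were fetched

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" so that
            # /wiki/A and /wiki/A#History count as the same page.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" with a cycle, standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.com/wiki/A": ["/wiki/B", "/wiki/C", "/wiki/A#History"],
    "http://example.com/wiki/B": ["/wiki/A", "/wiki/C"],
    "http://example.com/wiki/C": ["/wiki/A"],
}

pages = crawl("http://example.com/wiki/A", lambda url: FAKE_SITE.get(url, []))
print(pages)
```

In real use, get_links would wrap the requests and soup.select('a[href^="/wiki/"]') code from the answer. On a large wiki the crawl can still take a long time, which is what the max_pages cap is for.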
