So I used Beautiful Soup in Python to parse a page that displays all my Facebook friends. Here's my code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.facebook.com/xxx.xxx/friends?pnref=lhc")
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all("a"):
    print(link.get('href'))
The thing is, it displays a lot of links, but none of them are links to my friends' profiles, which are displayed normally on the webpage.
On doing Inspect Element I found this:
<div class="hidden_elem"><code id="u_0_2m"><!--
The code continues, and the links to their profiles are commented out inside an li tag within the div tag.
Two questions mainly:
(1.) What does this mean, and why can't Beautiful Soup read them?
(2.) Is there a way to read them?
I really don't plan to achieve anything by this, just curious.
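To the second question: Beautiful Soup does parse the commented-out markup, but it stores it as a Comment node rather than as tags, so find_all("a") never descends into it. One workaround is to extract each comment's text and re-parse it. A minimal sketch (the markup below is a made-up stand-in for the real page, not actual Facebook output):

```python
from bs4 import BeautifulSoup, Comment

# Stand-in markup mimicking what Inspect Element showed:
# the real links sit inside an HTML comment in a hidden div.
html = """<div class="hidden_elem"><code id="u_0_2m"><!--
<ul><li><a href="/friend.one">Friend One</a></li>
<li><a href="/friend.two">Friend Two</a></li></ul>
--></code></div>"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
# Comments are Comment nodes, not tags, so find_all("a") never sees
# them. Re-parse each comment's text as HTML instead.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, "html.parser")
    for link in inner.find_all("a"):
        hrefs.append(link.get("href"))

print(hrefs)  # ['/friend.one', '/friend.two']
```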
I'm a complete beginner who has only built basic Python projects. Right now I'm building a scraper in Python with bs4 to help me read success stories off of a website. These success stories are all in a table, so I thought I would find an html tag that said table and would encompass the entire table.
However, it is all just <div> and <span class=...> tags, and when I use soup.find("div") or ("span") it returns only the single word "div" or "span". This is what I have so far, and I know it isn't right or set up correctly, but I'm too inexperienced to know why yet.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
req = Request('https://www.calix.com/about-calix/success-stories.html', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "lxml")
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
print('div')
I have watched several tutorials on how to use bs4 and I have successfully scraped basic websites, but all I can do for this one is get ALL of the html, not the chunks I need (just the success stories).
You are printing the literal string 'div'. Note that find() does not modify soup; it returns the matching element, which you need to save into a variable.
You should have a look at the bs4 documentation.
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
Here you're calling soup.find() but you're not saving the results into a variable, so the results are lost.
print('div')
And here you're printing the literal string div. I don't think that's what you intended.
Try something like this:
div = soup.find("div", {"id": "..."})
print(div)
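A fuller sketch of the same idea, run against stand-in markup (the id below is the one from the question; the story markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented stand-in markup: a container div with the long id from the
# question, holding a couple of made-up "success story" entries.
html = """
<div id="content-calix-en-site-prod-home-about-calix-success-stories-jcr-content">
  <span class="story">Story one</span>
  <span class="story">Story two</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Save the result of find() -- it returns the matching Tag (or None).
container = soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})

# Then search *within* that tag for the pieces you want.
stories = [span.get_text() for span in container.find_all("span")]
print(stories)  # ['Story one', 'Story two']
```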
I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find('div', attrs={'class': 'seq gbff'})
for each in div.children:
    print(each)
soup.find_all('span', attrs={'class': 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
With DevTools in Chrome/Firefox I found this url, and it contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this url in the HTML, because different pages will use different arguments in the url. Or you have to compare a few urls and find the schema, so you can generate the url manually.
EDIT: if in the url you change retmode=html to retmode=xml, then you get the data as XML. If you use retmode=text, then you get it as plain text without HTML tags. retmode=json doesn't work.
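If the numeric id can be recovered from the page (here it appears in the span ids, e.g. gi_344258949_1), the viewer url can be assembled with the same arguments. A sketch, assuming the parameter set found above stays stable (NCBI may change it at any time):

```python
from urllib.parse import urlencode

def sviewer_url(seq_id, retmode="text"):
    # Parameters copied from the url found via DevTools; this is an
    # observed schema, not a documented API contract.
    params = {
        "id": seq_id,
        "db": "protein",
        "report": "fasta",
        "retmode": retmode,  # "html", "xml" or "text"
    }
    return "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?" + urlencode(params)

print(sviewer_url(344258949))
```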
I need to extract some articles from the Biography website.
So from this page http://www.biography.com/people I need all the sublinks.
for example:
/people/ryan-seacrest-21095899
/people/edgar-allan-poe-9443160
but I have two problems:
1- When I try to find all <a> tags, I can't find the hrefs that I need.
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.biography.com/people"
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
divs = soup.findAll('a')
for div in divs:
    print(div)
2- There is a "see more" button, so how can I get the links for all the people on the website, not just those that appear on the first page?
The site you show uses Angular, and part of the content is generated with JS. BeautifulSoup does not execute JS. You need to use http://selenium-python.readthedocs.io/ or a similar instrument. Alternatively, you can find the AJAX GET (or maybe POST) request that fetches the data you need, and get the data through it.
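However the rendered HTML is obtained (Selenium's page_source, or the AJAX response), filtering out the profile sublinks is then simple, since they all start with /people/. A sketch against invented stand-in markup:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the rendered page: only hrefs that start
# with /people/ are profile sublinks.
html = """
<a href="/people/ryan-seacrest-21095899">Ryan Seacrest</a>
<a href="/people/edgar-allan-poe-9443160">Edgar Allan Poe</a>
<a href="/newsletter">Newsletter</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute.
people = [a['href'] for a in soup.find_all('a', href=True)
          if a['href'].startswith('/people/')]
print(people)  # ['/people/ryan-seacrest-21095899', '/people/edgar-allan-poe-9443160']
```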
I'm using Python 3.5 and bs4
The following code will not retrieve all the tables from the specified website. The page has 14 tables but the return value of the code is 2. I have no idea what's going on. I manually inspected the HTML and can't find a reason as to why it's not working. There doesn't seem to be anything special about each table.
import bs4
import requests
link = "http://www.pro-football-reference.com/players/B/BradTo00.htm"
htmlPage = requests.get(link)
soup = bs4.BeautifulSoup(htmlPage.content, 'html.parser')
all_tables = soup.findAll('table')
print(len(all_tables))
What's going on?
EDIT: I should clarify. If I inspect the soup variable, it contains all of the tables that I expected to see. Why am I not able to extract those tables from soup with the findAll method?
This page is rendered by JavaScript; if you disable JavaScript in your browser, you will notice that the page only has two tables.
I recommend using Selenium for this situation.
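One other pattern worth checking, given that the question says the markup is visible when inspecting soup yet invisible to findAll: some sites ship extra tables inside HTML comments, which find_all() skips because comments are not tags. A sketch with invented markup showing both the symptom and a re-parsing workaround:

```python
from bs4 import BeautifulSoup, Comment

# Invented markup: one normal table plus one wrapped in an HTML comment.
html = """
<table id="visible"></table>
<!-- <table id="hidden"></table> -->
"""

soup = BeautifulSoup(html, "html.parser")
# The commented table shows up when printing soup, but find_all skips it.
print(len(soup.find_all("table")))  # 1

# Re-parse each comment's text to recover tables hidden inside comments.
tables = soup.find_all("table")
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    tables += BeautifulSoup(comment, "html.parser").find_all("table")
print(len(tables))  # 2
```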
I am using Python 2 and Beautiful Soup to parse HTML retrieved using the requests module:
import requests
from bs4 import BeautifulSoup
site = requests.get("http://www.stackoverflow.com/")
HTML = site.text
links = BeautifulSoup(HTML).find_all('a')
Which returns a list containing tags which look like <a href="...">Navigate</a>.
The content of the attribute href for each anchor tag can be in several forms, for example it could be a javascript call on the page, it could be a relative address to a page with the same domain(/next/one/file.php), or it could be a specific web address (http://www.stackoverflow.com/).
Using BeautifulSoup is it possible to return the web addresses of both the relative and specific addresses to one list, excluding all javascript calls and such, leaving only navigable links?
From the BS docs:
One common task is extracting all the URLs found within a page’s <a> tags:
for link in soup.find_all('a'):
print(link.get('href'))
You can filter out the href="javascript:whatever()" cases like this:
hrefs = []
for link in soup.find_all('a'):
    if link.has_attr('href') and not link['href'].lower().startswith('javascript:'):
        hrefs.append(link['href'])
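To put the surviving relative and absolute addresses into one uniform list, urljoin can resolve each href against the page's url: it leaves absolute urls alone and resolves relative ones against the base. A short sketch (the sample hrefs are invented; in Python 2 the import is from urlparse instead of urllib.parse):

```python
from urllib.parse import urljoin

base = "http://www.stackoverflow.com/"
# Invented sample hrefs: one root-relative, one absolute, one relative.
hrefs = ["/next/one/file.php", "http://example.com/page", "questions"]

# urljoin leaves absolute urls untouched and resolves relative ones
# against the base url, giving one uniform list of navigable links.
full = [urljoin(base, h) for h in hrefs]
print(full)
```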