Can I scrape a website to identify headings and text under it? - python

I want to create a web scraper so that it identifies the headings and text related to it on the web page. Can anyone help in how can that be done?
Demo Image
For example, here in the image attached, "Prerequisites" is the heading and the text below is "corresponding text".

You should use python and BeautifulSoup, a library made for web scraping.
For a given url you extract the actual content of the page using request the following way :
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
Once you have the object soup you can find all headings the following way :
headings = list()
for i in range(1, 7):
# <h1> to <h6>
headings.extend(soup.findAll(f'h{i}'))
headings now contains all the headings from h1 to h6. Now to extract the text you just proceed as follows :
text_content = soup.text

Related

Scrape tab href value from a webpage by python Beautiful Soup

I have code that extracts links from the main page and navigates through each page in the list of links, the new link has a tab page that is represented as follows in the source:
<Li Class=" tab-contacts" Id="contacts"><A Href="?id=448&tab=contacts"><Span Class="text">Contacts</Span>
I want to extract the href value and navigate to that page to get some information, here is my code so far:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get(link_to_the_website)
data = r.content
soup = BeautifulSoup(data, "html.parser")
links = []
for i in soup.find_all('div',{'class':'leftInfoWrap'}):
link = i.find('a',href=True)
if link is None:
continue
links.append(link.get('href'))
for link in links:
soup = BeautifulSoup(link,"lxml")
tabs = soup.select('Li',{'class':' tab-contacts'})
print(tabs)
However I am getting an empty list with 'print(tabs)' command. I did verify the link variable and it is being populated. Thanks in advance
Looks like you are trying to mix find syntax with select.
I would use the parent id as an anchor then navigate to the child with css selectors and child combinator.
partial_link = soup.select_one('#contacts > a')['href']
You need to append the appropriate prefix.

How to find all Elements of a specific Type with the new Requests-HTML library

I wanna find all specific fields in a HTML, in Beautiful soup everything is working with this code:
soup = BeautifulSoup(html_text, 'html.parser')
urls_previous = soup.find_all('h2', {'class': 'b_algo'})
but how can I make the same search with the requests library or can requests only find a single element in a HTML document, I couldn't find how to do it in the docs or examples ?
https://html.python-requests.org/
Example:
<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>
How can I find all Elements of a specific type with the requests library ?
with requests-html
from requests_html import HTML
doc = """<li class="b_algo"><h2>Vereinigte Staaten – Wikipedia</h2>https://de.wikipedia.org/wiki/Vereinigte_Staaten</div><p>U.S., I wanna have THIS text here</p></li>"""
#load html from string
html = HTML(html=doc)
x = html.find('h2')
print(x)

Extract data from span tag within a div tag

I have the html code below and I'm trying to extract the 3:40 as text to use in my python script. How would I go about grabbing that info?
example element
I would use the BeautifulSoup library. Here is how I would grap this info knowning that you already have the HTML file :
from bs4 import BeautifulSoup
with open(html_path) as html_file:
html_page = BeautifulSoup(html_file, 'html.parser')
div = html_page.find('div', class_='playbackTimeline__duration')
span = div.find('span', {'aria-hidden': 'true'})
text = span.get_text()
I'm not sure if it works, but it gives you an idea on how to do this kind of stuff. Check for "web scraping" if you want more information about that. :)

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class='ff_line' id='gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find_all('div', attrs={'class', 'seq gbff'})
for each in div.children:
print(each)
soup.find_all('span', aatrs={'class', 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load data
With DevTools in Chrome/Firefox I found this url and there are all <span>
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now hard part. You have to find this url in HTML because different pages will use different arguments in url. Or you have to compare few urls and find schema so you could generate this url manually.
EDIT: if in url you change retmode=html to retmode=xml then you get it as XML. If you use retmode=text then you get it as text without HTML tags. retmode=json doesn't works.

How to find specific video html tag using beautiful soup?

Does anyone know how to use beautifulsoup in python.
I have this search engine with a list of different urls.
I want to get only the html tag containing a video embed url. and get the link.
example
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
what should I do next . to get the html tag of video, or object or the exact link of the video..
I need it to put it on my iframe. i will integrate the python to my php. so getting the link of the video and outputting it using the python then i will echo it on my iframe.
You need to get the html of the page not just the url
use the built-in lib urllib like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
also with the site you are using i noticed that to get the embed link just change details in the link to embed so it looks like this:
https://archive.org/embed/20070519_detroit2
so if you want to do it to multiple urls without having to parse just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
to get the embed link for the other links you provided in your edit you need to look through the html of the page you are parsing, look until you fint the link then get the tag its in then the attribute
for
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
the iframe becuase the link is in the iframe tag and the .get("src") because the link is the src attribute
You can try the next one because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an html page. Retrieve the page first:
import urllib2
data = urllib2.ulropen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
this is a one liner to get all the downloadable MP4 file in that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag['href'])]
print links
Here are the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links and you and put them together with the domain and you get absolute path.

Categories