Extracting text from HTML straight into a variable - Python

I'm trying to extract a line of text from an HTML file straight into a variable, but I have found no solution despite hours of searching. Beautiful Soup looks helpful; how would I be able to simply pick out a desired string as an input and then extract it from the HTML source right into a variable?
I've been trying to use requests and Beautiful Soup to scrape the entire page, but it seems there is no function to do this directly.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract(url):
    # fetch the page and parse it with the built-in parser
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # pass the tag name and attributes, not raw markup, to find_all()
    return [item.text for item in soup.find_all("div", align="justify")]
<HTML>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
<TABLE WIDTH="75%" ALIGN="center">
<TR>
<TD>
<DIV ALIGN="center"><H1>STARTING . . . </H1></DIV>
<DIV ALIGN="justify"><P>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML.
<BR>
<P>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</P>
When I run it, I would like it to return the string:
<P>There are lots of ways to create web pages
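For reference, a minimal sketch of how the corrected function above could be applied, assuming the sample page is saved locally as webpage1.html (a hypothetical filename):

from bs4 import BeautifulSoup

# a minimal sketch: parse the saved sample page and pull the first <p>
# inside the justified <div> (webpage1.html is a hypothetical filename)
with open("webpage1.html") as f:
    soup = BeautifulSoup(f, "html.parser")

div = soup.find("div", align="justify")
print(div.find("p").text)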

Related

Python web scraping: websites from Google search results

A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) from a Google search page. I just want to extract the key info, e.g. text in <h1>, <h2>, <b> or <li> HTML tags, but I don't want to extract entire paragraphs in <p>.
I know how to gather a list of website URLs from that Google search, and I know how to scrape an individual website after looking at its HTML. I use Requests and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while some may use <b>, or something else.
All I can think of is to come up with a list of possible "emphasis-type" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
It would seem that you must first learn how to write loops and functions. Every website is completely different, and scraping a website to extract useful information is daunting. I'm a newbie myself, but if I had to extract info from headers like you, this is what I would do (this is just concept code, but I hope you'll find it useful):
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # concept code: find the header <h1>, then nested <h1> tags matching the pattern
    return bs.find('h1', {'class': 'header'}).find_all(
        'h1', header=re.compile('^(/web/)((?!:).)*$'))
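As for the tag-list idea in the question, find_all() does accept a list of tag names, so a single call can cover every emphasis tag at once. A rough sketch, where EMPHASIS_TAGS is an assumed, adjustable list rather than anything definitive:

import requests
from bs4 import BeautifulSoup

# an assumed, adjustable list of "emphasis-type" tags, not an exhaustive one
EMPHASIS_TAGS = ['h1', 'h2', 'h3', 'b', 'strong', 'em', 'li']

def extract_key_info(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    # find_all() accepts a list of names, so one call covers all tag types
    return [tag.get_text(strip=True) for tag in soup.find_all(EMPHASIS_TAGS)]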

Can't scrape nested HTML using BeautifulSoup

I am interested in scraping "0.449" from the following source code from http://hdsc.nws.noaa.gov/hdsc/pfds/pfds_map_cont.html?Lat=33.146425&Lon=-87.5805543.
<td class="tblInner" id="0-0">
<div style="font-size:110%">
<b>0.449</b>
</div>
"(0.364-0.545)"
</td>
Using BeautifulSoup, I currently have written:
storm=soup.find("td",{"class":"tblInner","id":"0-0"})
which results in:
<td class="tblInner" id="0-0">-</td>
I am unsure why everything nested within the td is not showing up. When I search the contents of the td, my result is simply "-". How can I scrape the value I want from this code?
You are likely scraping a website that uses JavaScript to update the DOM after the initial load.
You have a couple of choices:
Find out where the JavaScript code that fills the HTML page gets its data from, and call that instead. The data most likely comes from an API that you can call directly with curl. That's the best method 99% of the time.
Use a headless browser (zombie.js, ...) to retrieve the HTML after the JavaScript changes it. Convenient and fast, but there are few tools in Python to do this (google "python headless browser").
Use Selenium or Splinter to remote-control a real browser (Chrome, Firefox, ...). It's convenient and works in Python, but slow as hell; a minimal sketch follows below.
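For the Selenium route, a minimal sketch (assuming a working Firefox/geckodriver setup) might look like:

from selenium import webdriver
from bs4 import BeautifulSoup

# let a real browser execute the JavaScript, then parse the resulting HTML
driver = webdriver.Firefox()
driver.get('http://hdsc.nws.noaa.gov/hdsc/pfds/pfds_map_cont.html?Lat=33.146425&Lon=-87.5805543')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.find('td', {'class': 'tblInner', 'id': '0-0'}))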
Edit:
I did not see that you had posted the URL you wanted to scrape.
In your particular case, the data you want comes from an AJAX call to this URL:
http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds
You now only need to understand what each parameter does, and parse the output of that instead of writing an HTML scraper.
Please excuse the lack of error checking and modularity, but this should get you what you need, based on @Eloims's observation:
import re
import ast
import requests

url = 'http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds'
r = requests.get(url)
response = r.text
# pull the "quantiles = [...]" JavaScript assignment out of the response
coord_list_text = re.search(r'quantiles = (.*);', response)
# ast.literal_eval parses the list literal without eval's security risk
coord_list = ast.literal_eval(coord_list_text.group(1))
print(coord_list[0][0])

Is there a "easy" way to extract the dom structure of html element without using a python libary?

As far as I know, BeautifulSoup and lxml are able to extract the DOM structure of an HTML element. But I would like to do it myself, because I need a high-performance crawler without library limitations. So:
Is there an "easy" way to extract the DOM structure of an HTML element without using a Python library?
I ask this because I want to find an HTML element by only searching the frontend of a website, and then, once I know which element I want, I need the DOM path of that element.
For example, the DOM path of the Stack Overflow logo on this page is:
html > body.ask-page.new-topbar > div.container > div#header > div#logo > a
HTML is nominally a context-free grammar, but there is no guarantee that a given HTML response will be valid XML-wise (e.g. a clean tag hierarchy with everything having matching closing tags). The document structure is partly guessed by browsers and partly created using specific recovery rules when the tags are all messed up and not in a hierarchy.
If you really want to write your own HTML parsing library, and your use case is not limited to one very specific kind of text you want to match (so a crude regex will not help), then consider the following HTML snippets and try to figure out the DOM structure for each:
Let's start off with <p> guessing:
<p>blah blah
<p>blah blah
<p>blah blah
<p>blah blah <img src="a.jpg"> <!-- where is this image? -->
How about malformed closing tag order?
<img src="a.jpg"> <b>this is a cool image </b>
What about nesting wrong content types together?
<p>blah blah <div class="button"><img src="derp.png"></div></p>
In this example the <p> is closed before the <div> starts, because <p> does not accept flow content in it.
However, libraries like BeautifulSoup are already equipped to parse all these terrible contraptions and more, as the short demonstration below shows.
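To see those recovery rules in action, here is a quick sketch (assuming the html5lib parser is installed, since it mimics browser error recovery most closely):

from bs4 import BeautifulSoup

# feed one of the broken snippets above to a real parser and inspect the tree
broken = '<p>blah blah <div class="button"><img src="derp.png"></div></p>'
soup = BeautifulSoup(broken, 'html5lib')
print(soup.body.prettify())
# the <p> is closed before the <div> starts, exactly as described above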

How to split an HTML page based on the presence of <p>, <div> or <br> tags

I am trying to split scraped web pages into distinct parts based on the position of <p>, <br> or <div> tags. So the first <p> tag would contain all the data/tags from <html> up to the <p> tag in question. I have looked at something like etree from the lxml project, but it seems tangential.
The difference I see from "normal" HTML parsing is the number of tags selected. I want to select multiple tags and their data and save them separately, while "normal" HTML parsing tools offer the ability to select only isolated tags (using XPath, etc.) and play with them. (I am also pretty new to web programming.)
I have thought of a way where I would save file offsets and then proceed to cut and slice the input file to achieve my goal, but it seems hackish at best.
How can I go about achieving my stated goal? Please help.
Thanks.
Use BeautifulSoup. It's a great Python tool for parsing HTML.
Below is an example that shows how easy it is to parse HTML - it prints the tag name (p) and the contents of all the <p> tags, then finds the element with an id of "header".
This is just a snippet - BeautifulSoup provides many ways to filter HTML documents.
from bs4 import BeautifulSoup

# parse the file with the built-in parser
with open("yourfile.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# print the tag name and the contents of every <p> tag
for tag in soup.find_all('p'):
    print(tag.name, tag.text)

# find the element with id="header"
soup.find(id="header")

Removing broken tags and poorly formatted HTML from some text

I have a huge database of scraped forum posts that I am inserting into a website. However, a lot of people try to use HTML in their forum posts and often do it wrong. Because of this, there are always stray <strike> <b> </strike> </div> </b> tags in the posts, which end up messing up the page format when I add, say, 15 forum posts.
For now I have just been appending all possible end tags to each post so that it might close any open tag. Is there a better way to do this, short of parsing through the text and trying to manually remove each open tag? For loooooong forum posts this is an expensive transaction for a web app.
Have a look at HTML Tidy.
There is also a Python wrapper lib: µTidylib.
Alternatively, there is HTML Purifier.
Beautiful Soup does a decent job at HTML cleanup; a short sketch follows below.
Look at lxml as well.
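A minimal sketch of the Beautiful Soup approach: round-tripping a post through the parser drops orphaned closing tags and closes stray open ones.

from bs4 import BeautifulSoup

# hypothetical post text with an orphaned </strike> and an unclosed <b>
post = 'some text </strike> with <b>broken markup'
cleaned = str(BeautifulSoup(post, 'html.parser'))
print(cleaned)  # the stray </strike> is dropped and the <b> gets closed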
