Beautiful Soup Nested Tag Search - python

I am trying to write a Python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example, <p class="hello"> inside a <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, even though they are there. Is there a simple method or another way to do it?

I'm guessing that what you are trying to do is first look in a specific div tag, then search all p tags inside it and count them, or do whatever you want. For example:
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
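If it helps, the same nested lookup can also be written as a CSS selector via select(); here is a minimal sketch using placeholder class names (not taken from the question):
import bs4

# minimal sketch: select() takes a CSS selector, so the nesting fits in one expression
# ('some_class' and 'hello' are placeholder names, as in the snippet above)
content = '<div class="some_class"><p class="hello">one</p><p class="hello">two</p></div>'
soup = bs4.BeautifulSoup(content, 'html.parser')
hello_paragraphs = soup.select('div.some_class p.hello')
print(len(hello_paragraphs))   # count of matching nested <p> tags
for ptag in hello_paragraphs:
    print(ptag.text)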

Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
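For a more compact form, a list-comprehension version of the same idea could look like this (the tag names 'xyz' and 'abc' are placeholders, as above):
# flatten the nested find_all calls into one list comprehension;
# `soup` is the BeautifulSoup object from the snippet above
data = [tag
        for nested_soup in soup.find_all('xyz')
        for tag in nested_soup.find_all('abc')]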

UPDATE: I noticed that .text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs,
we find that there is a method called get_text(). Use it like this:
from bs4 import BeautifulSoup

fd = open('index.html', 'r')
website = fd.read()
fd.close()
soup = BeautifulSoup(website)
contents = soup.get_text(separator=" ")
print "number of words %d" % len(contents.split(" "))
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

fd = open('index.html', 'r')
website = fd.read()
soup = BeautifulSoup(website)

tags = soup.find_all(True)  # find everything
print "there are %d" % len(tags)

count = 0
matcher = re.compile(r"(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # Split using tokens such as \s and \n
    temp = filter(None, temp)       # remove empty elements in the list
    count += len(temp)
print "number of words in the document %d" % count
fd.close()
Please note that it may not be accurate, perhaps because of errors in formatting, false positives (it counts any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.

You can find all <p> tags using regular expressions (the re module).
Note that r.text is a string which contains the whole HTML of the site (r.content is the raw bytes).
For example:
import re
import requests

r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p>.*?</p>', r.text)
This should get you all the <p> tags irrespective of whether they are nested or not. And if you want the <a> tags specifically inside the <p> tags, you can pass that whole <p> tag as a string in the second argument instead of r.text.
Alternatively, if you just want the text, you can try this:
from readability import Document  # pip install readability-lxml
import requests

r = requests.get(url, headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the HTML from the site, and you can then proceed with the parsing.
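Since the original question was about counting words, one rough follow-up sketch (my addition, not part of the answer) is to strip the simplified HTML with BeautifulSoup and split on whitespace:
from bs4 import BeautifulSoup

# `simplified_html` is the readability output from the snippet above
text = BeautifulSoup(simplified_html, 'html.parser').get_text(separator=' ')
print(len(text.split()))  # crude whitespace-based word count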

Related

a checker of links in python

I have to create a function in Python which searches for links in an HTML page and then, for every link, checks whether the link is broken or not.
First I created a function which finds every link (broken or not) with no problem and stores them in an array called "g".
Second, I thought I would create an array "a" which contains only the broken links, by checking every item/element in array "g".
I don't know why, but checking link by link doesn't work inside the function. However, if I copy a link and pass it as a parameter to the function in the shell, it works.
import urllib.request
import re
import urllib.error

def check(url):
    try:
        f = urllib.request.urlopen(url)
    except urllib.error.HTTPError:
        return False
    return True

def trouveurdurls(url):
    f = urllib.request.urlopen(url)
    b = (f.read().decode('utf-8'))
    d = re.findall('<a href=".*</a>', b)
    liste = []
    f.close()
    for el in d:
        g = re.findall('"https?://.*\.*"', el)
        liste.append(g)
    a = []
    for el in liste:
        chaine = ''.join(map(str, el))
        url = chaine.replace('\'', '')
        print(url)
        #check(url)

b = trouveurdurls("http://127.0.0.1/")
I'd be very surprised if the line
d=re.findall('<a href=".*</a>',b)
worked correctly. The asterisk does a greedy match, meaning that it finds the longest possible match. If you have more than a single link in your document, this regex will gobble up everything from the start of the first link to the end of the last link.
You should probably use the non-greedy version of the asterisk, e.g. *?.
Also, you're only interested in the href attribute, so
d = re.findall('<a href=".*?"', b)
is probably what you want. This will make the following code easier; you won't need to try and look for urls starting with http:// because you'll already have them in the d array. Your code as it is would also find urls that weren't part of an a's href attribute but just part of the normal text, or part of an image reference, etc.
Note that even the regex <a href=".*?" won't consistently produce matches, because HTML is very lax with its syntax, so you might also encounter strings like <A HREF='.....'>, <a href = "....., <a id="something" href="......"> and so on.
You'll have to either improve the regex to match all such cases, or turn to BeautifulSoup like Evan's answer or a full-fledged HTML parser that takes care of it for you.
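As one illustration only (my own pattern, certainly not bulletproof), a more tolerant regex could ignore case, accept either quote style, and allow spaces around the equals sign:
import re

# matches <a href="...">, <A HREF='...'>, <a id="x" href = "...">, etc.,
# capturing whatever sits between the quotes
href_pattern = re.compile(r'''<a\s[^>]*href\s*=\s*["']([^"']*)["']''', re.IGNORECASE)

links = href_pattern.findall('<A HREF="http://example.com"> and <a id="x" href = \'/page\'>')
print(links)  # ['http://example.com', '/page']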
Here's a version that uses requests. It scrapes the page for URLs, then checks each of the URLs for its response code (200 means it worked, etc.).
import requests
from bs4 import BeautifulSoup
page = 'http://the.web.site'
r = requests.get(page)
data = r.text
soup = BeautifulSoup(data, 'lxml')
urls = []
for a in soup.find_all('a'):
    urls.append(a.get('href'))
url_responses = {}
for url in urls:
    temp_r = requests.get(url)
    url_responses[url] = temp_r.status_code
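As a possible follow-up (my own sketch, not part of the answer above), you could then report only the links whose status code looks like a failure:
# anything in the 4xx/5xx range is treated as broken here;
# `url_responses` is the dict built in the snippet above
broken_links = {url: code for url, code in url_responses.items() if code >= 400}
for url, code in broken_links.items():
    print(url, code)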

how to extract specific csv from web page html containing multiple csv file links

I need to extract a csv file from an HTML page (see below), and once I get it I can do stuff with it. Below is code from a previous assignment to extract that particular line of HTML. The url is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'.
That is test code, so it breaks temporarily when it finds that line.
The part of the line with my csv is the href csv/datasets/co2.csv (unicode, I think, as the type).
How do I open the co2.csv?
Sorry about any formatting issues with the question; the code has been sliced and diced by the editor.
import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url, k):
    c = 0
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    #. Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y = (tag.get('href', None))
        #print ((y))
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c = c + 1
        if c is k:
            return y
    print(type(y))

for w in range(29):
    print(scrapper(url, w))
You're re-downloading and re-parsing the full HTML page on every one of the roughly 30 iterations of your loop, just to get the next csv file and see whether it is the one you want. That is very inefficient, and not very polite to the server. Just read the HTML page once, and use the loop over the tags you already had to check whether the tag is the one you want! If so, do something with it, and stop looping to avoid needless further processing, because you said you only needed one particular file.
The other issue related to your question is that in the html file the csv hrefs are relative urls. So you have to join them on the base url of the document they're in. urlparse.urljoin() does just that.
Not directly related to the question, but you should also try to clean up your code:
- fix your indentation (the comment on line 9)
- choose better variable names; y/c/k/w are meaningless.
Resulting in something like:
import urllib
import urlparse
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scraper(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href = (tag.get('href', None))
        if href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file....
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break  # we only needed this one file, so we end the loop.

scraper(url)
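If you then want to actually read the downloaded csv, here is a rough standalone sketch (Python 2, like the answer's code); the full csv URL shown is simply what urlparse.urljoin() produces for the href mentioned in the question:
import csv
import urllib
import StringIO

# fetch the joined csv URL directly and count its rows with the csv module
csv_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv'
contents = urllib.urlopen(csv_url).read()
rows = list(csv.reader(StringIO.StringIO(contents)))
print "co2.csv has %d rows (including the header)" % len(rows)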

Opening webpage and returning a dict of all the links and their text

I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'<a href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
import urllib.request

def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a string of HTML, but with a b' at the beginning, indicating it is a bytes object rather than a decoded string. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))
url_dict()
Using regex to get both the link text and the link itself into a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
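For reference only, a rough sketch of that regex approach (the pattern is my own and will miss plenty of real-world markup):
import re
import urllib.request

# capture (href, link text) pairs in one pass; dict() then maps href -> text
A_TAG = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def url_dict_regex(my_url):
    with urllib.request.urlopen(my_url) as f:
        s = f.read().decode('utf-8')
    return dict(A_TAG.findall(s))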
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup
def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # href=True skips anchor tags that have no href attribute
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)

Python BS4 crawler indexerror

I am trying to create a simple crawler that pulls meta data from websites and saves the information into a csv. So far I am stuck; I have followed some guides but am now stuck with the error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2, 16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]   # The link to the original article

    articlePage = urlopen(findPatLink[i]).read()  # Grab all of the content from original article

    divBegin = articlePage.find('<div>')  # Locate the div provided
    article = articlePage[divBegin:(divBegin + 1000)]  # Copy the first 1000 characters after the div

    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for i in paragList:
        print i
    print '\n'

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is
File "C:\Users......", line 24, in (module)
print findPatTitle[i] # the title
IndexError:list of index out of range
Thank you.
It seems that you are not using all the power that bs4 can give you.
You are getting this error because the length of findPatTitle is just one, since an HTML document usually has only one title element.
A simple way to grab the title of a HTML, is using bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
You will probably get the same error if you try to iterate over your findPatLink in the current way, since it has length 6. It is not clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
And finally, since you don't want some urls, you can slice link_href_list in the way that you want. An improved version of the last expression which excludes the first and the second result could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
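To connect this back to the original loop, a rough sketch (my own addition; it assumes the sliced hrefs are absolute URLs, otherwise they would need to be joined against the base URL first):
from bs4 import BeautifulSoup
from urllib import urlopen

# visit each remaining link and print the title of the linked page;
# `link_href_list` is the sliced list built just above
for href in link_href_list:
    linked_soup = BeautifulSoup(urlopen(href).read())
    if linked_soup.title:          # some pages may have no <title>
        print linked_soup.title.text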

Write a python script that goes through the links on a page recursively

I'm doing a project for my school in which I would like to compare scam mails. I found this website: http://www.419scam.org/emails/
Now what I would like to do is save every scam in a separate document so that later on I can analyse them.
Here is my code so far:
import BeautifulSoup, urllib2
address='http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()
f = open('test.txt', 'wb')
f.write(html)
f.close()
This saves the whole HTML file in text format; now I would like to strip the file and save the content of the HTML links to the scams:
01
02
03
etc.
If I get that, I would still need to go a step further and open and save yet another href. Any idea how I can do it all in one Python script?
Thank you!
You picked the right tool in BeautifulSoup. Technically you could do it all in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, all of which are separate requests, and that will take a while.
This page is going to help you a lot, but here's just a little code snippet to get you started. This gets all of the <a> tags that point to index pages for the e-mails, extracts their href links, and appends a bit to the front of the URL so they can be accessed directly.
from bs4 import BeautifulSoup
import re
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
tags = soup.find_all(href=re.compile("20......../index\.htm"))
links = []
for t in tags:
    links.append("http://www.419scam.org/emails/" + t['href'])
're' is Python's regular expressions module. In the fifth line, I told BeautifulSoup to find all the tags in the soup whose href attribute matches that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on that page; I noticed that the index page links all had that pattern in their URLs.
Having all the proper 'a' tags, I then looped through them, extracting the string from the href attribute by doing t['href'] and appending the rest of the URL to the front of the string, to get raw string URLs.
Reading through that documentation, you should get an idea of how to expand these techniques to grab the individual e-mails.
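As a rough next step (my own sketch, matching the Python 2 imports above; the 'scams' directory and the filename scheme are assumptions about the site's .../YYYY-MM/DD/index.htm layout), you could fetch each collected index page and save its HTML locally:
import os
import urllib2

if not os.path.isdir('scams'):
    os.makedirs('scams')

# `links` is the list of absolute index-page URLs built above
for link in links:
    html = urllib2.urlopen(link).read()
    # e.g. '.../emails/2012-01/01/index.htm' -> '2012-01-01.html'
    parts = link.split('/')
    name = parts[-3] + '-' + parts[-2] + '.html'
    with open(os.path.join('scams', name), 'wb') as out:
        out.write(html)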
You might also find value in requests and lxml.html. Requests is another way to make http requests and lxml is an alternative for parsing xml and html content.
There are many ways to search the html document but you might want to start with cssselect.
import requests
from lxml.html import fromstring
url = 'http://www.419scam.org/emails/'
doc = fromstring(requests.get(url).content)
atags = doc.cssselect('a')
# using .get('href', '') syntax because not all a tags will have an href
hrefs = (a.attrib.get('href', '') for a in atags)
Or as suggested in the comments using .iterlinks(). Note that you will still need to filter if you only want 'a' tags. Either way the .make_links_absolute() call is probably going to be helpful. It is your homework though, so play around with it.
doc.make_links_absolute(base_url=url)
hrefs = (l[2] for l in doc.iterlinks() if l[0].tag == 'a')
Next up for you... how to loop through and open all of the individual spam links.
To get all links on the page you could use BeautifulSoup. Take a look at this page; it can help. It actually tells you how to do exactly what you need.
To save all pages, you could do the same as what you do in your current code, but within a loop that would iterate over all links you'll have extracted and stored, say, in a list.
Here's a solution using lxml + XPath and urllib2:
#!/usr/bin/env python2 -u
# -*- coding: utf8 -*-
import cookielib, urllib2
from lxml import etree

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
page = opener.open("http://www.419scam.org/emails/")
page.addheaders = [('User-agent', 'Mozilla/5.0')]
reddit = etree.HTML(page.read())

# XPath expression: we get all links under body/p[2] containing *.htm
for node in reddit.xpath('/html/body/p[2]/a[contains(@href,".htm")]'):
    for i in node.items():
        url = 'http://www.419scam.org/emails/' + i[1]
        page = opener.open(url)
        page.addheaders = [('User-agent', 'Mozilla/5.0')]
        lst = url.split('/')
        try:
            if lst[6]:  # else it's a "month" link
                filename = '/tmp/' + url.split('/')[4] + '-' + url.split('/')[5]
                f = open(filename, 'w')
                f.write(page.read())
                f.close()
        except:
            pass
# vim:ts=4:sw=4
You could use the HTMLParser module and specify the type of object you are searching for.
from HTMLParser import HTMLParser
import urllib2

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    print attr[1]

address = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()
f = open('test.txt', 'wb')
f.write(html)
f.close()

parser = MyHTMLParser()
parser.feed(html)
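If you would rather collect the hrefs than print them, a small variation on the parser above (my own sketch) stores them on the instance:
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    """Same idea as MyHTMLParser above, but keeps the hrefs in a list."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed(html)   # `html` is the page source read above
print collector.links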
