Python: extracting text from any website

So far my code successfully extracts text from these two websites:
http://www.tutorialspoint.com/cplusplus/index.htm
http://www.cplusplus.com/doc/tutorial/program_structure/
But I can't figure out what I'm doing wrong: it fails to extract text from other websites and gives me an error when I use links such as:
http://www.cmpe.boun.edu.tr/~akin/cmpe223/chap2.htm
http://www.i-programmer.info/babbages-bag/477-trees.html
http://www.w3schools.com/html/html_elements.asp
Error:
Traceback (most recent call last):
  File "C:\Users\DELL\Desktop\python\s\fyp\data extraction.py", line 20, in <module>
    text = soup.select('.C_doc')[0].get_text()
IndexError: list index out of range
My code:
import urllib
from bs4 import BeautifulSoup

url = "http://www.i-programmer.info/babbages-bag/477-trees.html"  #unsuccessful
#url = "http://www.tutorialspoint.com/cplusplus/index.htm"  #works
#url = "http://www.cplusplus.com/doc/tutorial/program_structure/"  #works

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style", "a", "<div id=\"bottom\" >"]):
    script.extract()  # rip it out

# get text
#text = soup.select('.C_doc')[0].get_text()
#text = soup.select('.content')[0].get_text()
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
else:
    text = soup.select('.C_doc')[0].get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print text

fo = open('foo.txt', 'w')
fo.seek(0, 2)
fo.writelines(text)
fo.close()
#writing done :)

Try using
Text = soup.findAll(text=True)
UPDATE
This is a basic text stripper you can start from.
import urllib
from bs4 import BeautifulSoup

url = "http://www.i-programmer.info/babbages-bag/477-trees.html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

for script in soup(["script", "style", "a", "<div id=\"bottom\" >"]):
    script.extract()

text = soup.findAll(text=True)
for p in text:
    print p
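Note that findAll(text=True) returns every text node in the tree, including whitespace-only strings between tags. A minimal filtering pass on top of the stripper above (a sketch, not part of the original answer) could look like this:

import urllib
from bs4 import BeautifulSoup

url = "http://www.i-programmer.info/babbages-bag/477-trees.html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# remove non-content elements before collecting text nodes
for script in soup(["script", "style"]):
    script.extract()

# keep only text nodes that contain something besides whitespace
visible = [t.strip() for t in soup.findAll(text=True) if t.strip()]
print '\n'.join(visible)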

You are assuming that every website you scrape has an element with class content or C_doc.
What if the website you scrape has neither? Here is the fix:
text = ''
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
elif soup.select('.C_doc'):
    text = soup.select('.C_doc')[0].get_text()

if text:
    pass  # put the rest of the code here
else:
    print 'text does not exist.'
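A more general variant of the same fix walks a list of candidate selectors and takes the first one that matches. This is only a sketch that reuses the soup object from above; the selector list is illustrative, not from the original answer:

# candidate content selectors, tried in order (illustrative list)
CANDIDATE_SELECTORS = ['.content', '.C_doc', 'article', 'body']

text = ''
for selector in CANDIDATE_SELECTORS:
    matches = soup.select(selector)
    if matches:
        text = matches[0].get_text()
        break

if not text:
    print 'no matching content element found'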

Related

BeautifulSoup html parser taking time to parse html file

I'm trying to get the results from an HTML file using BeautifulSoup:
with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    page = f.read()
soup = BeautifulSoup(page, 'html.parser')
for tag in soup.find_all('details'):
    print tag
The problem is basically that iqyylog.html contains more than 2500 nodes, so parsing takes a long time. Is there any other way to parse an HTML file with this much data? When I use the lxml parser, it picks up only the first 25 nodes.
Try this.
from simplified_scrapy import SimplifiedDoc, utils

html = utils.getFileContent(r'test.html')
doc = SimplifiedDoc(html)
details = doc.selects('details')
for detail in details:
    print(detail.tag)
If you still have problems, try the following.
import io
from simplified_scrapy import SimplifiedDoc, utils

def getDetails(fileName):
    details = []
    tag = 'details'
    with io.open(fileName, "r", encoding='utf-8') as file:
        # Suppose the start and end tags are not on the same line, as shown below
        # <details>
        #   some words
        # </details>
        line = file.readline()  # Read data line by line
        stanza = None  # Store a details node
        while line != '':
            if line.strip() == '':
                line = file.readline()
                continue
            if stanza and line.find('</' + tag + '>') >= 0:
                doc = SimplifiedDoc(stanza + '</' + tag + '>')  # Instantiate a doc
                details.append(doc.select(tag))
                stanza = None
            elif stanza:
                stanza = stanza + line
            else:
                if line.find('<' + tag) >= 0:
                    stanza = line
            line = file.readline()
    return details

details = getDetails('test.html')
for detail in details:
    print(detail.tag)
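If you want to stay with BeautifulSoup, a SoupStrainer can also cut parse time by building only the tags you care about instead of the whole tree. A sketch, assuming the same details tags and the html.parser backend:

from bs4 import BeautifulSoup, SoupStrainer

with open('test.html', 'r', encoding='utf-8') as f:
    page = f.read()

# parse only <details> elements instead of the full document
only_details = SoupStrainer('details')
soup = BeautifulSoup(page, 'html.parser', parse_only=only_details)

for tag in soup.find_all('details'):
    print(tag)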

parsing new lines with beautifulsoup

When parsing an HTML doc with BeautifulSoup, sometimes new lines are produced by HTML code, e.g.
<div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">
so when I extract text I miss a new line:
page = open(fname)
try:
    soup = BeautifulSoup(page, 'html.parser')
except:
    sys.exit("cannot parse %s" % fname)
soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
for script in soup(["script", "style"]):
    script.extract()  # rip it out
if not soup.body:
    return
text = soup.body.get_text(separator = ' ')
lines = (clean_str(line) for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
Is there an adjustment I can add that would break the text into lines correctly?
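One common adjustment is to convert <br> tags into explicit newline characters before extracting text, and to use a newline separator in get_text so block elements also break lines. A minimal sketch, using a fragment like the one above:

from bs4 import BeautifulSoup

html = '<div><font><br></font></div><div><font>next line</font></div>'
soup = BeautifulSoup(html, 'html.parser')

# turn <br> tags into literal newlines before text extraction
for br in soup.find_all('br'):
    br.replace_with('\n')

# a newline separator also breaks between block elements
text = soup.get_text(separator='\n')
print(text)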

Select multiple elements with BeautifulSoup and manage them individually

I am using BeautifulSoup to parse a webpage of poetry. The poetry is separated into h3 for poem title, and .line for each line of the poem. I can get both elements and add them to a list. But I want to manipulate the h3 to be uppercase and indicate a line break, then insert it into the lines list.
linesArr = []
for lines in full_text:
    booktitles = lines.select('h3')
    for booktitle in booktitles:
        linesArr.append(booktitle.text.upper())
        linesArr.append('')
    for line in lines.select('h3, .line'):
        linesArr.append(line.text)
This code appends all book titles to the beginning of the list, then continues getting the h3 and .line items. I have tried inserting code like this:
linesArr = []
for lines in full_text:
    for line in lines.select('h3, .line'):
        if line.find('h3'):
            linesArr.append(line.text.upper())
            linesArr.append('')
        else:
            linesArr.append(line.text)
I'm not sure exactly what you are trying to do, but this way you can get an array with the title in uppercase followed by all the lines:
#!/usr/bin/python3
# coding: utf8
from bs4 import BeautifulSoup
import requests

page = requests.get("https://quod.lib.umich.edu/c/cme/CT/1:1?rgn=div2;view=fulltext")
soup = BeautifulSoup(page.text, 'html.parser')

title = soup.find('h3')
full_lines = soup.find_all('div', {'class': 'line'})

linesArr = []
linesArr.append(title.get_text().upper())
for line in full_lines:
    linesArr.append(line.get_text())

# Print full array with the title and text
print(linesArr)

# Print text here with line break
for linea in linesArr:
    print(linea + '\n')
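If the page interleaves several h3 titles with their lines, the original loop from the question is close to working; the check just needs to test the element's own tag name instead of searching inside it with find. A sketch, assuming full_text is the list of parsed sections from the question:

linesArr = []
for section in full_text:
    for el in section.select('h3, .line'):
        if el.name == 'h3':  # the element itself is a title, not a poem line
            linesArr.append(el.text.upper())
            linesArr.append('')
        else:
            linesArr.append(el.text)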

Scrape a MediaWiki website (specific html tags) using Python

I would like to scrape this specific MediaWiki website with specific tags. Here is my current code.
import urllib.request
from bs4 import BeautifulSoup

url = "https://wiki.sa-mp.com/wiki/Strfind"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
If you look at the page, there are the description, the parameters, the return values, and the example usage. That's what I would like to scrape. Thank you!
There may be a more efficient way to do this, but the following uses CSS selectors to grab that information:
from bs4 import BeautifulSoup
import requests

url = "https://wiki.sa-mp.com/wiki/Strfind"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

description = soup.select_one('.description').text
initial_parameters = soup.select('.parameters,.param')
final_parameters = [parameter.text for parameter in initial_parameters]
returnValues = soup.select_one('#bodyContent > .param + p + div').text
exampleUsage = soup.select_one('.pawn').text

results = [description, final_parameters, returnValues, exampleUsage]
print(results)
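Note that select_one returns None when a selector matches nothing, so the .text accesses above raise AttributeError on pages that lack one of these sections. A defensive variant (a sketch; text_or_default is an illustrative helper, and soup is reused from above):

def text_or_default(soup, selector, default=''):
    # return the text of the first match for selector, or default if none
    el = soup.select_one(selector)
    return el.text if el else default

description = text_or_default(soup, '.description')
example_usage = text_or_default(soup, '.pawn')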

Scraping multiple URLs using beautiful soup

I have a dataframe with one of the columns containing over 4000 different URLs for articles. I have implemented the following code to extract all the text from the URLs; it seems to work for maybe one or two URLs but not for all of them.
for i in df.url:
    http = urllib3.PoolManager()
    response = http.request('GET', i)
    soup = bsoup(response.data, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
break
In the first for loop, you are assigning every parsed URL's content to the same variable, soup. At the end of the loop, this variable contains the parsed content of only the last URL, not of all the URLs as you expected. That's why you are seeing only one output.
You can put all your code in a single loop:
for url in df.url:
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = bsoup(response.data, 'html.parser')

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out

    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(url)
    print(text)
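With 4000+ URLs you will almost certainly hit some bad responses, so it helps to reuse one PoolManager, catch request errors, and collect the text per URL instead of printing it. A sketch; the texts dict and the error handling are illustrative additions, and df.url is assumed from the question:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()  # one connection pool reused across requests
texts = {}  # map each URL to its extracted text

for url in df.url:
    try:
        response = http.request('GET', url, timeout=10.0)
    except urllib3.exceptions.HTTPError as exc:
        print('skipping %s: %s' % (url, exc))
        continue

    soup = BeautifulSoup(response.data, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()

    lines = (line.strip() for line in soup.get_text().splitlines())
    texts[url] = '\n'.join(line for line in lines if line)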
