Scraping multiple URLs using Beautiful Soup - Python

I have a dataframe with one of the columns containing over 4000 different URLs for articles. I have implemented the following code to extract all the text from the URLs, but it seems to work for only one or two URLs, not all of them.
for i in df.url:
    http = urllib3.PoolManager()
    response = http.request('GET', i)
    soup = bsoup(response.data, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(text)
    break

In the first for loop you keep reassigning the parsed page to the same variable, soup, so after the loop it holds only one page's content rather than all of them as you expected; the break at the end also stops the loop after the first URL. That's why you are seeing only one output.
You can put all your code in a single loop:
for url in df.url:
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = bsoup(response.data, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(url)
    print(text)
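A possible refinement (my sketch, not part of the original answer): reuse a single PoolManager across all 4000+ requests and collect the results instead of printing them. The error handling and the new text column are assumptions here.
import urllib3
from bs4 import BeautifulSoup as bsoup

http = urllib3.PoolManager()  # one connection pool reused across all requests
texts = []
for url in df.url:
    try:
        response = http.request('GET', url)
    except urllib3.exceptions.HTTPError:
        texts.append(None)  # assumption: skip unreachable URLs rather than abort
        continue
    soup = bsoup(response.data, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    lines = (line.strip() for line in soup.get_text().splitlines())
    texts.append('\n'.join(line for line in lines if line))
df['text'] = texts  # one extracted text (or None) per row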

Related

Python Requests to parse HTML to get CSV

So I am trying to do a POST request to a website that displays a CSV. The CSV is not downloadable; it is only rendered on the page, so it can be copied and pasted.
I am trying to get the HTML from the POST request, pull out the CSV, and export it to a CSV file to then run a function on. I have managed to get it into CSV form as a string, but there don't appear to be any new lines, i.e.
>>> print(text1)
"Heading 1","Heading 2""Item 1","Item 2"
not
"Heading 1","Heading 2"
"Item 1","Item 2"
Is this format OK?
If not, how do I get it into an OK format?
Secondly, how can I write this string into a CSV file? If I try to convert text1 into bytes, I get _csv.Error: iterable expected, not int; if I don't, I get TypeError: a bytes-like object is required, not 'str'.
My code so far:
with requests.Session() as s:
    response = s.post(url, headers=headers, data=data)
    html = response.content
    soup = BeautifulSoup(html, features="html.parser")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    text1 = text.replace(text[:56], '')
    print(text1)
I think this will work for you. It finds the element containing the CSV data (could be a body, a div, a p, etc.) and only extracts text from there, so you don't need to worry about scripts or classes getting into your data:
import csv
from bs4 import BeautifulSoup

# emulate your html format
html_string = '''
<body>
<div class="csv">"Category","Position","Name","Time","Team","avg_power","20min","Male?"<br>"A","1","James ","00:21:31.45","5743","331","5.3","1"<br>"A","2","Da","00:21:31.51","4435","377","5.0","1"<br>"A","3","Timmy ","00:21:31.52","3964","371","4.8","1"<br>"A","4","Timothy ","00:21:31.83","5229","401","5.7","1"<br>"A","5","Stefan ","00:21:31.86","2991","338","","1"<br>"A","6","Josh ","00:21:31.92","","403","5.1","1"<br></div>
</body>
'''
soup = BeautifulSoup(html_string, 'html.parser')
# turn each <br> into a real newline so splitlines() works below
for br in soup.find_all('br'):
    br.replace_with('\n')
rows = [[i.replace('"', '').strip()  # clean the lines
         for i in item.split(',')]   # split each item on the comma
        # get all the lines inside the div
        # this will get the first item matching the filter
        for item in soup.find('div', class_='csv').text.splitlines()]

# csv writing function
def write_csv(path, data):
    with open(path, 'w', newline='') as file:  # newline='' avoids blank rows on Windows
        writer = csv.writer(file)
        writer.writerows(data)

print(rows)
write_csv('./data.csv', rows)
Output (data.csv):
Category,Position,Name,Time,Team,avg_power,20min,Male?
A,1,James,00:21:31.45,5743,331,5.3,1
A,2,Da,00:21:31.51,4435,377,5.0,1
A,3,Timmy,00:21:31.52,3964,371,4.8,1
A,4,Timothy,00:21:31.83,5229,401,5.7,1
A,5,Stefan,00:21:31.86,2991,338,,1
A,6,Josh,00:21:31.92,,403,5.1,1
soup.find()/find_all() can isolate an HTML element for you to scrape from, so you don't have to worry about parsing other elements.
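For the second part of the question (writing the string out to a file), one route is to let the csv module re-parse the cleaned text. A minimal sketch, assuming text1 already holds one quoted row per line as in the desired output shown above:
import csv
import io

text1 = '"Heading 1","Heading 2"\n"Item 1","Item 2"'  # stand-in for the cleaned string

rows = list(csv.reader(io.StringIO(text1)))   # csv.reader handles the quoting
with open('out.csv', 'w', newline='') as f:   # newline='' avoids blank lines on Windows
    csv.writer(f).writerows(rows)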

parsing new lines with beautifulsoup

When parsing an HTML doc with BeautifulSoup, new lines are sometimes produced by HTML markup, e.g.
<div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">
so when I extract the text I miss a new line:
page = open(fname)
try:
    soup = BeautifulSoup(page, 'html.parser')
except:
    sys.exit("cannot parse %s" % fname)
soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
for script in soup(["script", "style"]):
    script.extract()  # rip it out
if not soup.body:
    return
text = soup.body.get_text(separator=' ')
lines = (clean_str(line) for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
Is there an adjustment I can make that would break the text into lines correctly?
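No answer is recorded here, but the usual technique (a sketch, not from the original thread) is to turn br tags into real newlines before extracting text, or to let get_text insert a separator itself:
from bs4 import BeautifulSoup

html = '<div>first line<br></div><div><b>second line</b></div>'
soup = BeautifulSoup(html, 'html.parser')

# option 1: replace each <br> with a literal newline before extraction
for br in soup.find_all('br'):
    br.replace_with('\n')

# option 2: have get_text put a newline between text fragments
print(soup.get_text(separator='\n'))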

Scrape a MediaWiki website (specific html tags) using Python

I would like to scrape this specific MediaWiki website with specific tags. Here is my current code.
import urllib.request
from bs4 import BeautifulSoup

url = "https://wiki.sa-mp.com/wiki/Strfind"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
If you look at the URL, there are the description, parameters, return values, and example usage. That's what I would like to scrape. Thank you!
There may be a more efficient way to do this, but the following uses CSS selectors to grab that information:
from bs4 import BeautifulSoup
import requests

url = "https://wiki.sa-mp.com/wiki/Strfind"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
description = soup.select_one('.description').text
initial_parameters = soup.select('.parameters,.param')
final_parameters = [parameter.text for parameter in initial_parameters]
returnValues = soup.select_one('#bodyContent > .param + p + div').text
exampleUsage = soup.select_one('.pawn').text
results = [description, final_parameters, returnValues, exampleUsage]
print(results)
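One caveat worth adding: select_one returns None when nothing matches, so running this against other wiki pages can raise AttributeError on .text. A small defensive variant (the helper name is mine, not from the answer):
def select_text(soup, selector, default=''):
    """Return the text of the first match for selector, or default if none."""
    node = soup.select_one(selector)
    return node.text if node else default

description = select_text(soup, '.description')
exampleUsage = select_text(soup, '.pawn')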

How to filter out unimportant text from html to text conversion?

I am currently scraping a couple of websites, loading their HTML contents and stripping out all the unnecessary parts (tags etc.).
It often turns out that I still get a lot of unneeded words that describe the bottom of the page or other subpages of the website.
def text_extractor(url):
    try:
        html = requests.get(url).text
        soup = BeautifulSoup(html, 'lxml')
        # note: decomposing "p" tags also throws away paragraph text
        for script in soup(["script", "style", "p"]):
            script.decompose()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text = word_tokenize(text)
        url = str(url).replace('/', '').replace('http:', '').replace('https:', '')
        name = 'textextractor_' + url + '.txt'
        f = open(name, 'a')
        f.write(str(text))
        f.close()
        return text
    except requests.exceptions.ConnectionError:
        pass
I am working with this function, which does a good job, but most of the generated tokens are irrelevant. Is there a smart way to get only the text that is relevant for the page, describing for example an event or something similar?
These are examples of unimportant tokens:
'JournalismBig', 'HollywoodNational', 'SecurityTechVideoSportsThe', 'WiresBreitbart', 'LondonBreitbart', 'JerusalemBreitbart', 'TexasBreitbart', 'CaliforniaPeopleSTORE', 'HomeSubscribe', 'advertisement'
Thanks a lot!
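No answer is recorded here, but one common approach (a sketch under assumptions, not a guaranteed fix) is to extract text only from the page's main content container, since navigation, header, and footer links usually live outside it. The candidate tag list below is an assumption to tune per site:
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/article'  # placeholder URL
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

# prefer semantic containers, falling back to the whole body
main = soup.find('article') or soup.find('main') or soup.body
for tag in main(['script', 'style', 'nav', 'header', 'footer', 'aside']):
    tag.decompose()

text = ' '.join(main.stripped_strings)  # stripped_strings skips whitespace-only nodes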

python: extracting text from any website

So far I have done my work and it successfully gets text from these two websites:
http://www.tutorialspoint.com/cplusplus/index.htm
http://www.cplusplus.com/doc/tutorial/program_structure/
But I don't know where I am going wrong; it does not get text from other websites, and it gives me an error when I use other links such as:
http://www.cmpe.boun.edu.tr/~akin/cmpe223/chap2.htm
http://www.i-programmer.info/babbages-bag/477-trees.html
http://www.w3schools.com/html/html_elements.asp
Error:
Traceback (most recent call last):
  File "C:\Users\DELL\Desktop\python\s\fyp\data extraction.py", line 20, in
    text = soup.select('.C_doc')[0].get_text()
IndexError: list index out of range
My code:
import urllib
from bs4 import BeautifulSoup
url = "http://www.i-programmer.info/babbages-bag/477-trees.html" #unsuccessfull
#url = "http://www.tutorialspoint.com/cplusplus/index.htm" #doing successfully
#url = "http://www.cplusplus.com/doc/tutorial/program_structure/" #doing successfully
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style","a","<div id=\"bottom\" >"]):
script.extract() # rip it out
# get text
#text = soup.select('.C_doc')[0].get_text()
#text = soup.select('.content')[0].get_text()
if soup.select('.content'):
text = soup.select('.content')[0].get_text()
else:
text = soup.select('.C_doc')[0].get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print text
fo = open('foo.txt', 'w')
fo.seek(0, 2)
line = fo.writelines( text )
fo.close()
#writing done :)
Try using
Text = soup.findAll(text=True)
UPDATE
This is a basic text stripper you can start from.
import urllib
from bs4 import BeautifulSoup

url = "http://www.i-programmer.info/babbages-bag/477-trees.html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for script in soup(["script", "style", "a", "<div id=\"bottom\" >"]):
    script.extract()
text = soup.findAll(text=True)
for p in text:
    print p
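On current BeautifulSoup versions, the stripped_strings generator is a tidier equivalent of findAll(text=True), since it skips whitespace-only strings. A quick Python 3 sketch:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p> hello <b> world </b></p>', 'html.parser')
for s in soup.stripped_strings:  # each string with surrounding whitespace trimmed
    print(s)  # prints "hello" then "world"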
You are assuming that every website you scrape has an element with class name content or C_doc.
What if the website you scrape has neither class name?
Here is the fix:
text = ''
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
elif soup.select('.C_doc'):
    text = soup.select('.C_doc')[0].get_text()
if text:
    # put the rest of the code here
    pass
else:
    print 'text does not exist.'
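If you need this to work across many sites, the if/elif chain generalizes into a loop over candidate selectors. A Python 3 sketch, where the selector list is an assumption you would tune per site:
def first_matching_text(soup, selectors=('.content', '.C_doc', 'article', 'body')):
    """Return text from the first selector that matches, or None."""
    for selector in selectors:
        matches = soup.select(selector)
        if matches:
            return matches[0].get_text()
    return None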
