How to web scrape from a txt file - python

Let's say I use an online tool like HTML Source Code Viewer,
then I input a link and it generates the HTML source code.
Then I select only the <li> tags that I want, something like this:
<li class='item'><a class='list-link' href='https://foo1.com'><img src='https://foo1.com/imgfoo1.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo2.com'><img src='https://foo1.com/imgfoo2.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo3.com'><img src='https://foo1.com/imgfoo3.jpg' /></a></li>
so yeah, sometimes it's one long line, and then I put them into a text file named urlcontainer.txt.
So, how should I scrape that?
Because I get an error when I run the code below in Python from the terminal:
import requests
import numpy as np
from bs4 import BeautifulSoup as soup
page_html = np.genfromtxt('urlcontainer.txt',dtype='str')
page_soup = soup(page_html, "html.parser") #I got the error on this line
And this is the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 225, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/usr/lib/python2.7/dist-packages/bs4/builder/_htmlparser.py", line 157, in prepare_markup
exclude_encodings=exclude_encodings)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 352, in __init__
markup, override_encodings, is_html, exclude_encodings)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 228, in __init__
self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 280, in strip_byte_order_mark
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The thing is, when I type page_html in the terminal, this is the value:
array(['<li', "class='item'><a", "class='list-link'",
"href='https://foo1.com'><img",
"src='https://foo1.com/imgfoo1.jpg'", '/></a></li><li',
"class='item'><a", "class='list-link'",
"href='https://foo2.com'><img",
"src='https://foo1.com/imgfoo2.jpg'", '/></a></li><li',
"class='item'><a", "class='list-link'",
"href='https://foo3.com'><img",
"src='https://foo1.com/imgfoo3.jpg'", '/></a></li>'],
dtype='|S34')

Just read the file as you normally would. No need to use NumPy.
from bs4 import BeautifulSoup

with open("urlcontainer.txt") as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
Then, carry on with your parsing activities.
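For example, a minimal sketch of pulling the link and image URLs out of those <li> tags (the tag and class names are taken from the sample in the question):
for item in soup.find_all("li", class_="item"):
    link = item.find("a", class_="list-link")
    print(link["href"])
    print(item.find("img")["src"])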

Python - BeautifulSoup error while scraping

UPDATE: Using lxml instead of html.parser helped solve the problem, as Freddier suggested in the answer below!
I am trying to webscrape some information off of this website: https://www.ticketmonster.co.kr/deal/952393926.
I get an error when I run soup(thispage, 'html.parser'), but this error only happens for this specific page. Does anyone know why this is happening?
The code I have so far is very simple:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.ticketmonster.co.kr/deal/952393926'
openU = urlopen(url)
thispage = openU.read()
openU.close()
pageS = soup(thispage, 'html.parser')
The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site- packages\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
parser.feed(markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
% rawdata[declstartpos:declstartpos+20])
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
"subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
Please help!
Try using
pageS = soup(thispage, 'lxml')
instead of
pageS = soup(thispage, 'html.parser')
It looks like it may be a problem with character encoding when using "html.parser".
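A minimal sketch of the fix in context (note that lxml is a third-party package, so it may need to be installed first with pip install lxml):
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.ticketmonster.co.kr/deal/952393926'
thispage = urlopen(url).read()
# lxml tolerates the markup that trips up html.parser on this page
pageS = soup(thispage, 'lxml')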

BeautifulSoup (bs4), html5lib, HTMLParseError: malformed start tag, at line 1, column 11

I need to copy the source code from a website into an HTML file stored locally, as parsing from the URL directly does not capture all of the page elements. I am hoping to extract locational elements within a table in the source code to be used for geocoding. My program goes through several pages of search results, writing the source code from each to an HTML file stored locally. The address elements are only about a third of the material on each page, so it would be nice to get rid of the additional elements to reduce the file size.
To do this, I would like the program to open a blank HTML doc for writing, write the current page's source code to it, close the doc, reopen it for parsing (in 'r' mode now), open a new doc for writing, and use Beautiful Soup to capture all of the geocoding data from the first doc and write it to the new document. The program will then close the first doc and reopen it in 'w' mode again.
This will be done in a loop so the first doc will always get overwritten with the current page's source code while the second doc will stay open and keep having just the geocoding data written to it until there are no more pages.
Everything with looping and navigating and writing the source code to file is working fine, but I can't get the parsing part figured out. I tried experimenting in an interactive environment with this code:
from bs4 import BeautifulSoup
import html5lib
data = open(r"C:\GIS DataBase\web_resutls_raw_new_test.html",'r').read()
document = html5lib.parse(data)
soup = BeautifulSoup(str(document))
And I get the following error:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Python27\lib\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\bs4\builder\_htmlparser.py", line 219, in feed
raise e
HTMLParseError: malformed start tag, at line 1, column 11
So I tried the following fix:
soup = HTMLParser.handle_starttag(BeautifulSoup(str(document)))
And alas:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Python27\lib\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\bs4\builder\_htmlparser.py", line 219, in feed
raise e
HTMLParseError: malformed start tag, at line 1, column 11
I also tried with lxml and etree, and nothing seems to work. I cannot get the elements I need by parsing from the URL directly; I need to parse from the HTML file.
Pass data directly to BeautifulSoup:
soup = BeautifulSoup(data, 'html.parser')
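Alternatively, since html5lib is already installed, a sketch under that assumption: hand html5lib to BeautifulSoup as the parser, which avoids the str(document) round-trip entirely.
from bs4 import BeautifulSoup

with open(r"C:\GIS DataBase\web_resutls_raw_new_test.html", 'r') as f:
    data = f.read()

# parse once with html5lib instead of parsing twice
soup = BeautifulSoup(data, 'html5lib')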

Finding specific text using BeautifulSoup

I'm trying to grab all the winner categories from this page:
http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013
I've written this in Sublime:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013"
page = urllib2.urlopen(url)
soup_package = BeautifulSoup(page)
page.close()
#find everything in the div class="bestOfItem". This works.
all_categories = soup_package.findAll("div",class_="bestOfItem")
# print(all_categories)
#this part breaks it:
soup = BeautifulSoup(all_categories)
winner = soup.a.string
print(winner)
When I run this in terminal, I get the following error:
Traceback (most recent call last):
File "winners.py", line 12, in <module>
soup = BeautifulSoup(all_categories)
File "build/bdist.macosx-10.9-intel/egg/bs4/__init__.py", line 193, in __init__
File "build/bdist.macosx-10.9-intel/egg/bs4/builder/_lxml.py", line 99, in prepare_markup
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 249, in encodings
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 304, in find_declared_encoding
TypeError: expected string or buffer
Anyone know what's happening there?
You are trying to create a new BeautifulSoup object from a list of elements.
soup = BeautifulSoup(all_categories)
There is absolutely no need to do this here; just loop over each match instead:
for match in all_categories:
    winner = match.a.string
    print(winner)
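One caveat: .string returns None when the <a> tag contains nested tags, so if some winners print as None, get_text() is the safer call, e.g.:
for match in all_categories:
    print(match.a.get_text(strip=True))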

How to recover a document from improperly closed tags in Python?

Here is my problem.
I have a sample text like this:
text="""<!--translated from:
The Dutch Royal Library
"""
Now I tried to strip the tags from this text, but I always get this error using this code:
from lxml import html

t = html.fromstring(text)
ctext = t.text_content()
and my error is
Traceback (most recent call last):
File "test.py", line 31, in <module>
t = html.fromstring(text)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
I traced the bug and found that removing the unclosed comment tag fixes it.
I already tried using BeautifulSoup
and here is my code
soup = BeautifulSoup(text)
print soup.prettify()
but it's no use, so can anyone help me?
Try removing the unclosed tag:
soup = BeautifulSoup(text[4:])
print soup.prettify()
Then BeautifulSoup will be able to find the content. You can find more information about this library on its documentation page.
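A minimal sketch of the whole round trip (text[4:] simply drops the leading "<!--" so the parser no longer sees an unterminated comment):
from bs4 import BeautifulSoup

text = """<!--translated from:
The Dutch Royal Library
"""

# slice off the 4-character "<!--" before parsing
soup = BeautifulSoup(text[4:], "html.parser")
print soup.get_text()
# translated from:
# The Dutch Royal Library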

Information lost when using BeautifulSoup to parse an HTML page

I'm writing a web spider to get some information from a website. When I parse this page, http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html#ACCOM_OVERVIEW,
I find that some information is lost. I print the HTML doc using soup.prettify(), and it is not the same as the doc I get using urllib2.urlopen(); something is lost. The code is as follows:
import urllib2
from bs4 import BeautifulSoup

# assumed setup, reconstructed from the question
sourceUrl = 'http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html'
website = 'http://www.tripadvisor.com'  # base prepended to relative hrefs
hotelUrls = []

htmlDoc = urllib2.urlopen(sourceUrl).read()
soup = BeautifulSoup(htmlDoc)
subHotelUrlTags = soup.findAll(name='a', attrs={'class': 'property_title'})
print len(subHotelUrlTags)
#if len(subHotelUrlTags) != 30:
#    print soup.prettify()
for hotelUrlTag in subHotelUrlTags:
    hotelUrls.append(website + hotelUrlTag['href'])
I tried using HTMLParser to do the same thing, and it prints out the following error:
Traceback (most recent call last):
File "./spider_new.py", line 47, in <module>
hotelUrls = getHotelUrls()
File "./spider_new.py", line 40, in getHotelUrls
hotelParser.close()
File "/usr/lib/python2.6/HTMLParser.py", line 112, in close
self.goahead(1)
File "/usr/lib/python2.6/HTMLParser.py", line 164, in goahead
self.error("EOF in middle of construct")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 3286, column 1
Download and install lxml.
It can parse such "faulty" webpages. (The HTML is probably broken in some weird way, and Python's HTML parser isn't great at understanding that sort of thing, even with bs4's help.)
Also, you don't need to change your code if you install lxml; BeautifulSoup will automatically pick up lxml and use it to parse the HTML instead.
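If you prefer to make the parser choice explicit rather than rely on auto-detection, it can also be named directly once lxml is installed (pip install lxml):
soup = BeautifulSoup(htmlDoc, 'lxml')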
