Hi, I'm trying to parse HTML from this website.
However, it takes forever for the soup to load the whole HTML (about 17 seconds to print to the terminal). I realize this is probably down to the website itself (other directories seem to load instantly), but here is my code just in case:
import urllib2
from bs4 import BeautifulSoup
url1 = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=&section=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'
soup = BeautifulSoup(urllib2.urlopen(url1), 'lxml')
print soup
So my question: is there any other parser that could get this job done faster, or is there something I can use alongside bs4?
P.S. I also tried Selenium.
I don't know what the problem is on your end, but this series of statements executed in the blink of an eye on my old computer. You could try this:
>>> from bs4 import BeautifulSoup
>>> from urllib.request import urlopen
>>> URL = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=&section=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'
>>> HTML = urlopen(URL)
>>> soup = BeautifulSoup(HTML)
C:\Python34\lib\site-packages\bs4\__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
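One quick way to see whether the 17 seconds go into the download or into the parse is to time the two steps separately; here is a minimal sketch (Python 3, using the URL from the question):
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=&section=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'

t0 = time.time()
html = urlopen(url).read()          # network: waiting on the server's response
t1 = time.time()
soup = BeautifulSoup(html, 'lxml')  # parsing: BeautifulSoup + lxml
t2 = time.time()
print('download: %.2fs  parse: %.2fs' % (t1 - t0, t2 - t1))
If the first number dominates, no parser change will help; if the second does, a different parser or a SoupStrainer might.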
When I attempt to parse a locally stored copy of a webpage, BeautifulSoup returns gibberish. I don't understand why, as I've never faced this problem when using the requests and bs4 modules together for scraping tasks.
Here's my code:
import requests
from bs4 import BeautifulSoup as BS
import os
url_2 = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(url_2)
f = open('re_2.html')
soup = BS(url_2, "lxml")
f.close()
print soup
This code returns the following:
<html><body><p>/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/</p></body></html>
I wasn't able to find a similar problem online, so I've posted it here. Any help would be much appreciated.
You are passing the path (which you named url_2) to BeautifulSoup, so it treats that string as the markup itself and returns it, neatly wrapped in some minimal HTML. That is expected behaviour.
Try constructing the soup from the file's contents instead. See how it works here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
soup = BS(f)
should do...
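For completeness, a corrected version of the script from the question might look roughly like this (same path and filename, with the open file handle passed to BeautifulSoup instead of the path string):
from bs4 import BeautifulSoup as BS
import os

path = r'/Users/davidferreira/Documents/coding_2/ak_screen_scraping/bmra/'
os.chdir(path)

# Hand BeautifulSoup the open file object (or f.read()), not the path string.
with open('re_2.html') as f:
    soup = BS(f, "lxml")

print(soup.prettify())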
I would like to retrieve the data given in an SDMX file (like https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried to use BeautifulSoup, but it seems it does not see the tags. Here is the code:
import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")
which gives me back an empty list.
Is BS4 the wrong tool, or (more likely) what am I doing wrong?
Thanks in advance
soup.findAll("bbk:series") would return the result.
In fact, even if you use lxml as the parser here, BeautifulSoup still parses the document as HTML. Since HTML tags are case-insensitive, BeautifulSoup lowercases all tag names, which is why soup.findAll("bbk:series") works. See "Other parser problems" in the official docs.
If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, since lxml is the only XML parser BeautifulSoup has. You can then use ts_series = soup.findAll("Series"), because BeautifulSoup strips the bbk namespace prefix from the tag names.
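Putting both options next to each other against the code from the question (I'm going by the tag names mentioned above, not any further inspection of the Bundesbank file):
import urllib2
from bs4 import BeautifulSoup

url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
source = urllib2.urlopen(url).read()

# Option 1: keep the HTML parser, but search with the lowercased tag name.
html_soup = BeautifulSoup(source, 'lxml')
series_from_html = html_soup.findAll("bbk:series")

# Option 2: parse as XML; the bbk: namespace prefix is stripped from the tag names.
xml_soup = BeautifulSoup(source, 'xml')
series_from_xml = xml_soup.findAll("Series")

print("%d vs %d Series elements" % (len(series_from_html), len(series_from_xml)))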
I am trying to get all the URLs on a website using Python. At the moment I am just copying the website's HTML into the Python program and then extracting the URLs from that string. Is there a way to do this straight from the web without having to copy the entire HTML by hand?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library.
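For example, passing query parameters and basic auth with requests looks roughly like this (the URL and credentials are placeholders, not a real endpoint):
import requests

# Hypothetical endpoint and credentials, purely for illustration.
response = requests.get(
    'http://example.com/search',
    params={'q': 'python'},         # appended to the URL as ?q=python
    auth=('username', 'password'),  # HTTP basic authentication
    timeout=10,
)
html = response.text
print(response.status_code)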
The most straightforward option would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for Python 2:
import urllib
f = urllib.urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have already done the hard part, which is analyzing the HTML document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can install bs4 with the following command at the command line: pip install beautifulsoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and feed it to BeautifulSoup, which parses the DOM for you, then pull out all the URLs, i.e. the <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))
for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...
I'm using Python 3 and BeautifulSoup 4.
When I run the code below, it just gives me the URL "www.google.com" back, not the page's markup.
I can't figure out what is wrong.
from bs4 import BeautifulSoup
import urllib
html = "www.google.com"
soup = BeautifulSoup(html)
print (soup.prettify())
You need to use urllib2 (or urllib.request on Python 3) or a similar library to actually fetch the HTML, rather than handing BeautifulSoup the URL string itself. On Python 2:
import urllib2
html = urllib2.urlopen("http://www.google.com")  # urlopen needs the full URL, scheme included
soup = BeautifulSoup(html)
print(soup.prettify())
EDIT: Just as a side note on urllib vs. urllib2: in Python 2 urlopen exists in both modules, but in Python 3 they were merged and urlopen now lives in urllib.request. Given that you have tagged Python 3, urllib.request.urlopen would probably be your best option.
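Since the question is tagged Python 3, here is roughly the same thing with urllib.request (again, urlopen needs the full URL including the scheme):
from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("http://www.google.com")
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())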
I've been trying to scrape the names in a table from a website using a BeautifulSoup script, but the program returns nothing, or "[]". I would appreciate it if anyone could help me point out what I'm doing wrong.
Here is what I'm trying to run:
from bs4 import BeautifulSoup
import urllib2
url="http://www.trackinfo.com/entries-race.jsp?raceid=GBM$20140228E02"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=soup.findAll('a',{'href':'href="dog.jsp?runnername=[^.]*'})
for eachname in names:
print eachname.string
And here is one of the elements that I'm trying to get:
<a href="dog.jsp?runnername=PG+BAD+GRANDPA">
PG BAD GRANDPA
</a>
See the documentation for BeautifulSoup, which says that if you want to give a regular expression in a search, you need to pass in a compiled regular expression.
Taking your variables, this is what you want:
import re
names = soup.find_all("a", {"href": re.compile("dog")})
A different approach, this one using requests instead of urllib2. It's a matter of preference, really. The main point is that you should clean up your code, especially the indentation on the last line.
from bs4 import BeautifulSoup as bs
import requests
import re
url = "http://www.trackinfo.com/entries-race.jsp?raceid=GBM$20140228E02"
r = requests.get(url).content
soup = bs(r)
soup.prettify()
names = soup.find_all("a", href=re.compile("dog"))
for name in names:
    print name.get_text().strip()
Let us know if this helps.