I'm using Python BS4/lxml to parse an XML-formatted RSS feed (specifically https://itch.io/games/on-sale.xml). I'm finding that somewhere between Requests receiving the page data and BS4 reading it from text, the link element is being altered. Specifically, res.text contains ...</saleends><link>https://foo.itch.io/bar</link><description>... but reading it into BS4/lxml and printing the result gives ...</saleends><link/>https://foo.itch.io/bar<description>..., which BS4 is unable to parse correctly. My code is available here, line 237.
I can provide a stripped-down version of the project without the login and logging pieces for easy testing.
Edit with simplified code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml')
print(soup.item.link)
Expected behavior: Prints "https://itch.io/s/12345/foobar" (whatever the most recent link in the RSS is)
Actual behavior: Prints "<link/>"
lxml is lxml's HTML parser, while lxml-xml and xml are lxml's XML parser. (Refer to this answer, which points to this documentation.)
So, instead of using the lxml parser, you should use the lxml-xml or xml parser.
import requests
from bs4 import BeautifulSoup
res = requests.get("https://itch.io/feed/sales.xml")
soup = BeautifulSoup(res.text, 'lxml-xml')
print(soup.item.link.text)
Output:
https://itch.io/s/38593/halloween-event-sale
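If you want more than the first item, here is a minimal sketch (assuming the feed keeps the standard RSS layout of <item> elements with <title> and <link> children, as the snippet in the question suggests):
import requests
from bs4 import BeautifulSoup

res = requests.get("https://itch.io/feed/sales.xml")
# 'lxml-xml' forces lxml's XML parser so <link> keeps its text content.
soup = BeautifulSoup(res.text, 'lxml-xml')

# Each <item> is one sale entry; title and link are standard RSS children.
for item in soup.find_all("item"):
    print(item.title.text, item.link.text)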
I'm new to working with XML and BeautifulSoup and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy) but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get the text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the documentation here.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
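If each study contributes exactly one NCTId and one OfficialTitle (an assumption about this API's output rather than something guaranteed by the snippet above), you can zip the two result lists into pairs:
# Pair each NCTId with its OfficialTitle, assuming the two lists are parallel.
studies = [
    (nct.text.strip(), title.text.strip())
    for nct, title in zip(m1_nctid, m1_officialtitle)
]
for nct_id, official_title in studies:
    print(nct_id, "-", official_title)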
Hi, I'm trying to parse HTML from this website.
However, it takes forever for the soup to load the whole HTML (about 17 seconds to print to the terminal). I do realize this is only because of the website itself (as other directories seem to load instantly), but here is my code just in case:
import urllib2
from bs4 import BeautifulSoup
url1 = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=&section=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'
soup = BeautifulSoup(urllib2.urlopen(url1), 'lxml')
print soup
So my question is: is there any other parser that could get this job done faster, or can I use something along with bs4?
P.S. I also tried Selenium.
I don't know what the problem is for you, but this series of statements executed in the blink of an eye on my old computer. You could try doing this:
>>> from bs4 import BeautifulSoup
>>> from urllib.request import urlopen
>>> URL = 'http://www.ukpets.co.uk/ukp/?sf=1716769780&rtn=temp87_224_76_126_at_1456&display_profile=&section=Commercial&sub=Search_&rws=&method=search&tb=comdir1_8&class=comdir1_8&search_form=on&rf=coname&st=Food'
>>> HTML = urlopen(URL)
>>> soup = BeautifulSoup(HTML)
C:\Python34\lib\site-packages\bs4\__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
markup_type=markup_type))
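Following the warning's own advice, you can name the parser explicitly in the same session (a small follow-up, reusing the URL and imports from above):
>>> # Naming the parser pins the behaviour across systems and silences the warning.
>>> soup = BeautifulSoup(urlopen(URL), "lxml")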
I would like to retrieve data given in an SDMX file (like https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried to use BeautifulSoup, but it seems it does not see the tags. The code follows:
import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")
which gives me an empty object.
Is BS4 the wrong tool, or (more likely) what am I doing wrong?
Thanks in advance
soup.findAll("bbk:series") would return the result.
In fact, in this case, even if you use lxml as the parser, BeautifulSoup still parses it as HTML. Since HTML tags are case-insensitive, BeautifulSoup downcases all the tags, which is why soup.findAll("bbk:series") works. See Other parser problems in the official docs.
If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. It also uses lxml, since lxml is the only XML parser BeautifulSoup has. Now you can use ts_series = soup.findAll("Series") to get the result, as BeautifulSoup will strip the namespace prefix bbk.
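Putting that together, a minimal sketch of the XML route (same urllib2 setup as the question, and assuming the Bundesbank endpoint still serves the same SDMX document):
import urllib2
from bs4 import BeautifulSoup

url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
xml_source = urllib2.urlopen(url).read()

# 'xml' selects lxml's XML parser; tag casing is preserved and the bbk
# namespace prefix is stripped, so "Series" matches.
soup = BeautifulSoup(xml_source, "xml")
ts_series = soup.findAll("Series")
print(len(ts_series))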
I am trying to get all the URLs on a website using Python. At the moment I am just copying the website's HTML into my Python program and then using code to extract all the URLs. Is there a way I could do this straight from the web without having to copy the entire HTML?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters, I suggest having a look at the requests library.
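For example, the same fetch with requests (a minimal sketch, fetching the URL used in the snippets above):
import requests

# requests decodes the response body for you; .text is the HTML as a string.
response = requests.get('http://python.org/')
html = response.text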
The most straightforward would probably be urllib.urlopen if you're using Python 2, or urllib.request.urlopen if you're using Python 3 (you have to do import urllib or import urllib.request first, of course). That way you get a file-like object from which you can read (i.e. f.read()) the HTML document.
Example for python 2:
import urllib
f = urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part, which is analyzing the HTML document for links.
You might want to use the bs4 (BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can install bs4 with the following command at the command line: pip install beautifulsoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
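If you are on Python 3, roughly the same thing looks like this (a sketch using urllib.request and urllib.parse in place of urllib2 and urlparse):
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.google.com"
content = urllib.request.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")

# urljoin resolves relative hrefs against the page URL.
for link in soup.find_all('a', href=True):
    print(urljoin(url, link['href']))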
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and feed it to BeautifulSoup, which does all the work of building the DOM and lets you get all URLs, i.e. <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, "lxml", parse_only=SoupStrainer('a'))
for a_element in bs:
    if a_element.has_attr('href'):
        print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...
My offline code works fine, but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication, then lxml to parse (it gives a good result with the specific pages we need to scrape), then BeautifulSoup.
#! /usr/bin/python
import urllib.request
import urllib.error
from io import StringIO
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly
With that working, I tried to feed it a page via urllib:
# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes
Trying to deal with the error message, I tried:
# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file
I didn't know what to do about the OSError, so I sought another method. I found suggestions to use lxml.html instead of lxml.etree, so the next attempt is:
# attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer
This clearly gives a structure of some sort, but how do I pass it to BeautifulSoup? My question is twofold: how can I pass a page from urllib to lxml.etree (as in attempt 1, closest to my working code)? Or, how can I pass an lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes but don't know what to do about them.
Python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to Python. Thanks to the internet for code fragments and examples.
BeautifulSoup can use the lxml parser directly; there's no need to go to these lengths:
BeautifulSoup(doc, 'lxml')
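For instance, a minimal sketch of the urllib route (assuming req is the prepared request object from the question; BeautifulSoup is happy to take the raw bytes and handle decoding itself):
page = urllib.request.urlopen(req)
doc = page.read()  # bytes are fine; BeautifulSoup handles the decoding
soup = BeautifulSoup(doc, 'lxml')
print(soup.title)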