I'm trying to learn Python. My only experience is AppleScript, and it's not so easy to learn... so far, anyway.
I'm trying to parse an XML weather site, and so far I have the data I need, but I can't figure out how to get it into a list to process it further. Can anyone help?
from BeautifulSoup import BeautifulSoup
import xml.etree.cElementTree as ET
from xml.etree.cElementTree import parse
import urllib2
url = "http://www.weatheroffice.gc.ca/rss/city/ab-52_e.xml"
response = urllib2.urlopen(url)
local_file = open(r"\Temp\weather.xml", "w")
local_file.write(response.read())
local_file.close()
invalid_tags = ['b', 'br']
tree = parse(r"\Temp\weather.xml")
stuff = tree.findall("channel/item/description")
item = stuff[1]
parsewx = BeautifulSoup(stuff[1].text)
for tag in invalid_tags:
    for match in parsewx.findAll(tag):
        match.replaceWithChildren()
print parsewx
Since XML is structured data, BeautifulSoup returns a tree of Tags.
The documentation has extensive information on how to search and navigate in that tree.
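For example, to collect the visible text of the parsed description into a list, one option (assuming you just want the text pieces rather than the tags) is:
# Gather every text node from the parsed description, dropping blanks,
# so the weather fields end up as items in a plain Python list
lines = [text.strip() for text in parsewx.findAll(text=True) if text.strip()]
print lines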
I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using ClinicalTrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get the text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the documentation here.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
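To turn those result sets into plain lists of strings, take the text of each matched tag, for example:
nct_ids = [tag.get_text(strip=True) for tag in m1_nctid]
official_titles = [tag.get_text(strip=True) for tag in m1_officialtitle]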
I'm not a coder, but I need to implement a simple HTML parser.
After some research I was able to put together this example:
from lxml import html
import requests
page = requests.get('https://URL.COM')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print 'Buyers: ', buyers
print 'Prices: ', prices
How can I use tree.xpath to parse all words ending with ".com.br" and starting with "://"?
As @nosklo pointed out here, you are looking for href attributes and the associated links. A parse tree will be organized by the HTML elements themselves, and you find text by searching those elements specifically. For URLs, this would look like so (using the lxml library in Python 3.6):
from lxml import etree
from io import StringIO
import requests
# Set explicit HTMLParser
parser = etree.HTMLParser()
page = requests.get('https://URL.COM')
# Decode the page content from bytes to string
html = page.content.decode("utf-8")
# Create your etree with a StringIO object which functions similarly
# to a fileHandler
tree = etree.parse(StringIO(html), parser=parser)
# Call this function and pass in your tree
def get_links(tree):
    # This will get the anchor tags <a href...>
    refs = tree.xpath("//a")
    # Get the url from the ref
    links = [link.get('href', '') for link in refs]
    # Return a list that only ends with .com.br
    return [l for l in links if l.endswith('.com.br')]
# Example call
links = get_links(tree)
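If you also need the "://" condition from the question, you can filter the returned list one step further:
# Keep only absolute URLs, i.e. those containing a scheme separator
absolute_links = [l for l in links if '://' in l]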
For example, to read an RSS feed, this doesn't work because of the silly {http://purl.org ...} namespaces that get inserted before 'item':
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import urllib, urllib.request
url = "http://some/rss/feed"
response = urllib.request.urlopen(url)
xml_text = response.read().decode('utf-8')
xml_root = ET.fromstring(xml_text)
for e in xml_root.findall('item'):
    print("I found an item!")
Now that findall() has been rendered useless because of the {} prefixes, here's another solution, but this is ugly:
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import urllib, urllib.request
url = "http://some/rss/feed"
response = urllib.request.urlopen(url)
xml_text = response.read().decode('utf-8')
xml_root = ET.fromstring(xml_text)
for e in xml_root:
    if e.tag.endswith('}item'):
        print("I found an item!")
Can I get ElementTree to just trash all the prefixes?
You need to handle namespaces as clearly explained at:
Parsing XML with namespace in Python via 'ElementTree'
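For example, if the items really do live in that purl.org namespace, a minimal sketch with an explicit namespace mapping looks like this (the exact namespace URI is an assumption taken from your error message, so check your feed's root element):
ns = {'rss': 'http://purl.org/rss/1.0/'}  # assumed namespace URI
for e in xml_root.findall('rss:item', ns):
    print("I found an item!")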
But what if, instead, you used a specialized library for reading RSS feeds, like feedparser:
>>> import feedparser
>>> url = "http://some/rss/feed"
>>> feed = feedparser.parse(url)
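feedparser resolves the namespaces for you, so the items come back as plain entries:
>>> for entry in feed.entries:
...     print(entry.title)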
Though I would personally use a Scrapy XMLFeedSpider. As a bonus, you get all of the other features of the Scrapy web-scraping framework.
I've been trying to get information from a site, and recently found out that it is stored in childNodes[0].data.
I'm pretty new to python and never tried scripting against websites.
Somebody told me I could make a tmp.xml file and extract the information from there, but since it only gets the source code (which I think is of no use to me), I don't get any results.
Current code:
import urllib2
from xml.dom.minidom import parse

response = urllib2.urlopen(get_link)
html = response.read()
with open("tmp.xml", "w") as f:
    f.write(html)
dom = parse("tmp.xml")
name = dom.getElementsByTagName("name[0].firstChild.nodeValue")
I've also tried using 'dom = parse(html)' without better result.
getElementsByTagName() takes an element name, not an expression. It is highly unlikely that the page you are loading contains any <name[0].firstChild.nodeValue> elements.
If you are loading HTML, use an HTML parser instead, like BeautifulSoup. For XML, using the ElementTree API is a lot easier than using the (archaic and very verbose) DOM API.
Neither approach requires that you first save the source to disk; both APIs can parse directly from the response object returned by urllib2.
# HTML
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen(get_link)
soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))
print soup.find('title').text
or
# XML
import urllib2
from xml.etree import ElementTree as ET
response = urllib2.urlopen(get_link)
tree = ET.parse(response)
print tree.find('elementname').text
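And if there are several name elements and you want all of them in a list, findall() does that directly (the 'name' tag here is taken from your original code):
names = [el.text for el in tree.findall('name')]
print names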
I would like to make a table of chosen physical properties of elements (for example atomization enthalpy, vaporization enthalpy, heat of vaporization, and boiling point), which are accessible on this page.
It is a huge pain to do it by hand, and I didn't find any other machine-processing-friendly source of such data on the internet.
I was trying to learn how to do it in Python (because I want to use this data for my other code written in Python / NumPy / Pandas).
I was able to download the webpage HTML code with urllib2, and I was trying to learn how to use an HTML/XML parser like ElementTree or minidom. However, I have no experience with web programming and HTML/XML processing.
Using lxml's XPath support, you can parse the data easily. Here's an example parsing the atomization enthalpy:
import lxml.html
import urllib2
html = urllib2.urlopen("http://environmentalchemistry.com/yogi/periodic/W.html").read()
doc = lxml.html.document_fromstring(html)
result = doc.xpath("/html/body/div[2]/div[2]/div[1]/div[1]/ul[7]/li[8]")
You could dynamically generate the xpath string for the different elements and use a dict to parse the required fields.
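A minimal sketch of that idea; the xpath positions below are placeholders and will need adjusting per property and per element page:
import lxml.html
import urllib2
# Map each property to the xpath of the list item holding it.
# These positions are placeholders; verify them for every property.
xpaths = {
    "atomization_enthalpy": "/html/body/div[2]/div[2]/div[1]/div[1]/ul[7]/li[8]",
    "boiling_point": "/html/body/div[2]/div[2]/div[1]/div[1]/ul[7]/li[9]",
}
html = urllib2.urlopen("http://environmentalchemistry.com/yogi/periodic/W.html").read()
doc = lxml.html.document_fromstring(html)
properties = {}
for name, xpath in xpaths.items():
    nodes = doc.xpath(xpath)
    # text_content() flattens any nested markup into plain text
    properties[name] = nodes[0].text_content() if nodes else None
print properties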
Thank you, raphonic
It was necessary to modify your code slightly to get it to work, but thanks for the kickstart. This code is working:
import lxml.html
import lxml.etree
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://environmentalchemistry.com/yogi/periodic/W.html')
html = infile.read()
doc = lxml.html.document_fromstring(html)
result = doc.xpath("/html/body/div[2]/div[1]/div[1]/div[1]/ul[7]/li[8]")
print lxml.etree.tostring(result[0])
But it is probably not the best one.
Anyway, because the structure of the page for different elements is not exactly the same, I would probably just use simple string.find() and a regular expression, like this:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://environmentalchemistry.com/yogi/periodic/W.html')
page = infile.read()
i = page.find("Heat of Vaporization")
substr = page[i:i+50]
print substr
import re
non_decimal = re.compile(r'[^\d.]+')
print non_decimal.sub('', substr)