I am trying to print only non-null values, but I am not sure why the null values are also coming up in the output:
Input:
from lxml import html
import requests
import linecache

i = 1
read_url = linecache.getline('stocks_url', 1)
while read_url != '':
    page = requests.get(read_url)
    tree = html.fromstring(page.text)
    percentage = tree.xpath('//span[@class="grnb_20"]/text()')
    if percentage != None:
        print percentage
    i = i + 1
    read_url = linecache.getline('stocks_url', i)
Output:
$ python test_null.py
['76%']
['76%']
['80%']
['92%']
['77%']
['71%']
[]
['50%']
[]
['100%']
['67%']
You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty.
Use a boolean test:
percentage = tree.xpath('//span[@class="grnb_20"]/text()')
if percentage:
    print percentage[0]
Empty lists (and None) test as false in a boolean context. I opted to print out the first element from the XPath result; you appear to only ever have one.
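A quick interpreter demonstration of that truthiness rule:

>>> bool([])
False
>>> bool(None)
False
>>> bool(['76%'])
True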
Note that linecache is primarily aimed at caching Python source files; it is used to present tracebacks when an error occurs, and when you use inspect.getsource(). It isn't really meant to be used to read a file. You can just use open() and loop over the file without ever having to keep incrementing a counter:
with open('stocks_url') as urlfile:
    for url in urlfile:
        page = requests.get(url)
        tree = html.fromstring(page.content)
        percentage = tree.xpath('//span[@class="grnb_20"]/text()')
        if percentage:
            print percentage[0]
Change this in your code and it should work:
if percentage != []:
I'm trying to figure out in lxml and python how to replace an element with a string.
In my experimentation, I have the following code:
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
xref = topicroot2.xpath('//*/xref')
xref_attribute = xref[0].attrib['browsertext']
print xref_attribute
The result is: 'something here'
This is the browser text attribute I'm looking for in this small sample. But what I can't seem to figure out is how to replace the entire element with the attribute text I've captured here.
(I do recognize that in my sample I could have multiple xrefs and will need to construct a loop to go through them properly.)
What's the best way to go about doing this?
And for those wondering, I'm having to do this because the link actually goes to a file that doesn't exist because of our different build systems.
Thanks in advance!
Try this (Python 3):
from lxml import etree as et
docstring = '<p>The value is permitted only when that includes <xref linkend=\"my linkend\" browsertext=\"something here\" filename=\"A_link.fm\"/>, otherwise the value is reserved.</p>'
# Get the root element.
topicroot = et.XML(docstring)
topicroot2 = et.ElementTree(topicroot)
# Get the text of the root element. This is a list of strings!
topicroot2_text = topicroot2.xpath("text()")
# Get the xref element.
xref = topicroot2.xpath('//*/xref')[0]
xref_attribute = xref.attrib['browsertext']
# Save a reference to the p element, remove the xref from it.
parent = xref.getparent()
parent.remove(xref)
# Set the text of the p element by combining the list of strings with the
# extracted attribute value.
new_text = [topicroot2_text[0], xref_attribute, topicroot2_text[1]]
parent.text = "".join(new_text)
print(et.tostring(topicroot2))
Output:
b'<p>The value is permitted only when that includes something here, otherwise the value is reserved.</p>'
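Since the question notes there may be multiple xrefs, here is a rough generalization of the same idea (my own sketch, not part of the answer above): splice each xref's browsertext into the surrounding text via lxml's tail attribute, so any number of xrefs in a paragraph are handled.

from lxml import etree as et

def replace_xrefs(root):
    # Replace every xref element with its browsertext attribute value,
    # preserving the text that follows each one (its .tail).
    # Materialize the iterator first, since we remove elements as we go.
    for xref in list(root.iter('xref')):
        parent = xref.getparent()
        text = xref.get('browsertext', '') + (xref.tail or '')
        prev = xref.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or '') + text
        else:
            parent.text = (parent.text or '') + text
        parent.remove(xref)

topicroot = et.XML(docstring)
replace_xrefs(topicroot)
print(et.tostring(topicroot))

This prints the same output as above, but works no matter how many xref elements the paragraph contains.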
I am trying to use JSON to search through the googlemapapi. So, I give the location "Plymouth" - in googlemapapi it shows a result set of 6, but when I try to parse it in JSON, I get a length of only 2. I tried with multiple cities too, but all I am getting is a result set of 2.
What is wrong below?
import urllib.request as UR
import urllib.parse as URP
import json

url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
count = 0
js1 = json.loads(data.decode('utf-8'))
print("Length: ", len(js1))
for result in js1:
    location = js1["results"][count]["formatted_address"]
    lat = js1["results"][count]["geometry"]["location"]["lat"]
    lng = js1["results"][count]["geometry"]["location"]["lng"]
    count = count + 1
    print('lat', lat, 'lng', lng)
    print(location)
Simply replace for result in js1: with for result in js1['results']:
By the way, as posted in a comment in the question, no need to use a counter. You can rewrite your for loop as:
for result in js1['results']:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    print(location)
If you look at the json that comes in, you'll see that it's a single dict with two items ("results" and "status"). Add print('result:', result) to the top of your for loop and it will print result: status and result: results, because all you are iterating over are the keys of that outer dict. That's a general debugging trick in python... if you aren't getting the stuff you want, put in a print statement to see what you got.
The results (not surprisingly) are in a list under js1["results"]. In your for loop, you ignore the variable you are iterating and go back to the original js1 for its data. This is unnecessary and in your case, it hid the error. Had you tried to reference the cities off of result you would have gotten an error, and it may have been easier to see that result was "status", not the array you were after.
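To make that concrete, iterating over a dict yields its keys, not its values (a minimal illustration with a stand-in dict):

>>> js1 = {"results": ["..."], "status": "OK"}
>>> for result in js1:
...     print(result)
...
results
status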
Now a few tweaks fix the problem:
import urllib.request as UR
import urllib.parse as URP
import json

url = "http://maps.googleapis.com/maps/api/geocode/json?address=Plymouth&sensor=false"
uh = UR.urlopen(url)
data = uh.read()
js1 = json.loads(data.decode('utf-8'))
print("Length: ", len(js1))
for result in js1["results"]:
    location = result["formatted_address"]
    lat = result["geometry"]["location"]["lat"]
    lng = result["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    print(location)
I am currently trying to parse an xml file online and obtain the data I need from this file. My code is displayed below:
import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextXml.php?sid=KBFI&num=360')
page_content = page.read()
with open('KBFI.xml', 'w') as fid:
    fid.write(page_content)
data = []
xml = parse('KBFI.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in xml.getElementsByTagName('ob'):
        # Convert time string to time_struct, ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        for variable in xml.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # Unindent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found add as 0
            data.append([ob_time.tm_mday,
                         ob_time.tm_hour,
                         ob_time.tm_min,
                         0])
print data
Unfortunately I cannot post an image of the xml file, but a version of it will be saved into your current directory if my script is run.
I would like the code to simply check for the existence of the 'variable' PCP1H and print the 'value' if it exists (only one entry per 'ob'). If it doesn't exist or provides a value of 'T', I would like it to print '0' for that particular hour. Currently the output (the script I provided can be run to see it) contains completely incorrect values, and there are six entries per hour instead of one. What is wrong with my code?
The main issue in your code: in each for loop, you are getting the elements using -
xml.getElementsByTagName('ob')
This actually starts the search from the xml element, which in your case is the root element. The same goes for xml.getElementsByTagName('variable'): it starts the search at the root element, so each time you are getting all the elements with tag variable. This is why you are getting 6 entries per hour instead of one (since there are 6 of them in the complete xml).
You should instead get using -
ob.getElementsByTagName('variable')
And the ob element using -
station.getElementsByTagName('ob')
So that we only check inside the particular element we are currently iterating over (not the complete xml document).
Also, another side issue: you are doing -
elif variable.getAttribute('value') >= 0:
If I am not wrong, getAttribute() returns a string, so this check would always be true, irrespective of what the actual value is. In the xml, I see that value has strings as well as numbers, so I'm not really sure what you want that condition to be (though this is not the main issue; the main issue is the one described above).
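A quick illustration of why the comparison is always true: under Python 2's comparison rules a number compares as less than any string, regardless of the string's contents:

>>> 'T' >= 0
True
>>> '0.00' >= 0
True

(In Python 3 the same comparison raises a TypeError instead.)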
Example code changes -
import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextXml.php?sid=KBFI&num=360')
page_content = page.read()
with open('KBFI.xml', 'w') as fid:
    fid.write(page_content)
data = []
xml = parse('KBFI.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in station.getElementsByTagName('ob'):
        # Convert time string to time_struct, ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        for variable in ob.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # Unindent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found add as 0
            data.append([ob_time.tm_mday,
                         ob_time.tm_hour,
                         ob_time.tm_min,
                         0])
print data
I've got a diff file and I want to handle adds/deletions/modifications to update an SQL database.
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
With a Python script, I first detect two consecutive lines with a regular expression, to handle modifications like B:
re.compile(r"^-(.+?)\|(.*?)\|(.+?)\n\+(.+?)\|(.*?)\|(.+?)(?:\n|\Z)", re.MULTILINE)
I delete all the matching lines, then rescan my file and handle all the remaining lines as additions/deletions.
My problem is with lines like D & E. For the moment I treat them as two deletions followed by two additions, which triggers CASCADE DELETE consequences in my SQL database, when I should be treating them as modifications.
How can I handle such modifications D & E?
The diff file is generated beforehand by a bash script; I could generate it differently if needed.
Try this:
>>> a = '''
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
'''
>>> diff = {}
>>> for row in a.splitlines():
...     if not row:
...         continue
...     s = row.split('|')
...     name = s[0][1:]
...     data = s[1:]
...     if row.startswith('+'):
...         change = diff.get(name, {'rows': []})
...         change['rows'].append(row)
...         change['status'] = 'modified' if 'status' in change else 'added'
...     else:
...         change = diff.get(name, {'rows': []})
...         change['rows'].append(row)
...         change['status'] = 'modified' if 'status' in change else 'removed'
...     diff[name] = change
>>> def print_by_status(status=None):
...     for item, value in diff.items():
...         if status is None or status == value['status']:
...             print '\nStatus: %s\n%s' % (value['status'], '\n'.join(value['rows']))
>>> print_by_status(status='added')
Status: added
+NameA|InfoA1|InfoA2
>>> print_by_status(status='modified')
Status: modified
-NameD|InfoD1|InfoD2
+NameD|InfoD1|InfoD3
Status: modified
-NameE|InfoE1|InfoE2
+NameE|InfoE3|InfoE2
Status: modified
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
In this case you will have a dictionary with all the collected data: the diff status and the rows for each name. You can then do whatever you want with the dict.
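Once that dict is built, here is a hypothetical sketch of turning it into SQL, given an open DB-API cursor. The stocks(name, info1, info2) table and the sqlite3-style ? placeholders are my assumptions, not from the question, and it assumes the '+' row comes after the matching '-' row, as in the sample:

def apply_change(cursor, name, change):
    # Map each diff status to a single SQL statement, so a modification
    # becomes an UPDATE instead of DELETE + INSERT, avoiding the
    # CASCADE DELETE problem described in the question.
    rows = [r[1:].split('|') for r in change['rows']]
    if change['status'] == 'added':
        cursor.execute("INSERT INTO stocks VALUES (?, ?, ?)", rows[0])
    elif change['status'] == 'removed':
        cursor.execute("DELETE FROM stocks WHERE name = ?", (name,))
    else:  # 'modified'
        new = rows[-1]  # the '+' row carries the new values
        cursor.execute("UPDATE stocks SET info1 = ?, info2 = ? WHERE name = ?",
                       (new[1], new[2], name))

for name, change in diff.items():
    apply_change(cursor, name, change)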
I'm parsing through a decent sized xml file, and I ran into a problem. For some reason I cannot extract data even though I have done the exact same thing on different xml files before.
Here's a snippet of my code: (rest of the program, I've tested and they work fine)
EDIT: changed to include a testing try/except block
def parseXML():
    file = open(str(options.drugxml), 'r')
    data = file.read()
    file.close()
    dom = parseString(data)
    druglist = dom.getElementsByTagName('drug')
    count = 0
    with codecs.open(str(options.csvdata), 'w', 'utf-8') as csvout, open('DrugTargetRel.csv', 'w') as dtout:
        for entry in druglist:
            count = count + 1
            try:
                drugtype = entry.attributes['type'].value
                print count
            except:
                print count
                print entry
            drugidObj = entry.getElementsByTagName('drugbank-id')[0]
            drugid = drugidObj.childNodes[0].nodeValue
            drugnameObj = entry.getElementsByTagName('name')[0]
            drugname = drugnameObj.childNodes[0].nodeValue
            targetlist = entry.getElementsByTagName('target')
            for target in targetlist:
                targetid = target.attributes['partner'].value
                dtout.write((','.join((drugid, targetid))) + '\n')
            csvout.write((','.join((drugid, drugname, drugtype))) + '\n')
In case you're wondering what the XML file's schema roughly looks like, here's a rough god-awful sketch of the levels:
<drugs>
    <drug type='something' ...>
        <drugbank-id>
        <name>
        ...
        <targets>
            <target partner='something'>
Those that I typed in here, I need to extract from the XML file and stick in csv files (as the code above shows). The code has worked for different xml files before; I'm not sure why it's not working on this one. I've gotten a KeyError on 'type', and I've also gotten indexing errors on the line that extracts drugid, even though EVERY drug has a drugid. What am I screwing up here?
EDIT: the stuff I'm extracting are guaranteed to be in each drug.
For anyone who cares, here's the link to the XML file I'm parsing:
http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip
EDIT: After implementing a try & except block (see above) here's what I found out:
In the schema, there are sections called "drug interactions" that also have a subfield called drug. So like this:
<drugs>
    <drug type='something' ...>
        <drugbank-id>
        <name>
        ...
        <targets>
            <target partner='something'>
        <drug-interactions>
            <drug>
I think that my line druglist = dom.getElementsByTagName('drug') is unintentionally picking those up as well -- I don't know how I could fix this... any suggestions?
Basically, when parsing an XML file you can't rely on the fact that you know the structure. It's good practice to verify the structure in code.
So every time you access elements or attributes, check first whether they are there. In your code that means the following:
Make sure there's an attribute 'type' on a drug element:
drugtype = entry.attributes['type'].value if entry.attributes.has_key('type') else 'defaulttype'
Make sure getElementsByTagName doesn't return an empty list before accessing its elements:
drugbank_id = entry.getElementsByTagName('drugbank-id')
drugidObj = drugbank_id[0] if drugbank_id else None
Also, before accessing child nodes, make sure there are any:
if drugidObj.hasChildNodes():
    drugid = drugidObj.childNodes[0].nodeValue
Or use for loop to loop through them.
And when you call getElementsByTagName on the drugs element, it returns all matching descendants, including the nested ones. To get only the drug elements that are direct children of the drugs element, you have to use the childNodes attribute.
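A short sketch of that last point (my illustration, assuming dom is the parsed document from the question):

# Keep only <drug> elements that are direct children of the root <drugs>
# element, skipping the nested ones inside <drug-interactions>.
druglist = [node for node in dom.documentElement.childNodes
            if node.nodeType == node.ELEMENT_NODE and node.tagName == 'drug']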
I had a feeling that maybe there was something weird happening due to running out of memory or something, so I rewrote the parser using an iterator over each drug and tried it out and got the program to complete without raising an exception.
Basically what I'm doing here is, instead of loading the entire XML file into memory, I scan the XML file for the beginning and end of each <drug> ... </drug> block. Then I parse each block with minidom.
The code might be a little fragile as I assume that each <drug> and </drug> pair are on their own lines. Hopefully it helps more than it harms though.
#!python
import codecs
from xml.dom import minidom

class DrugBank(object):
    def __init__(self, filename):
        self.fp = open(filename, 'r')

    def __iter__(self):
        return self

    def next(self):
        state = 0
        while True:
            line = self.fp.readline()
            if state == 0:
                if line.strip().startswith('<drug '):
                    lines = [line]
                    state = 1
                    continue
                if line.strip() == '</drugs>':
                    self.fp.close()
                    raise StopIteration()
            if state == 1:
                lines.append(line)
                if line.strip() == '</drug>':
                    return minidom.parseString("".join(lines))
with codecs.open('csvout.csv', 'w', 'utf-8') as csvout, open('dtout.csv', 'w') as dtout:
    db = DrugBank('drugbank.xml')
    for dom in db:
        entry = dom.firstChild
        drugtype = entry.attributes['type'].value
        drugidObj = entry.getElementsByTagName('drugbank-id')[0]
        drugid = drugidObj.childNodes[0].nodeValue
        drugnameObj = entry.getElementsByTagName('name')[0]
        drugname = drugnameObj.childNodes[0].nodeValue
        targetlist = entry.getElementsByTagName('target')
        for target in targetlist:
            targetid = target.attributes['partner'].value
            dtout.write((','.join((drugid, targetid))) + '\n')
        csvout.write((','.join((drugid, drugname, drugtype))) + '\n')
An interesting read that might help you out further is here:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
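For comparison, here is a rough sketch of the streaming approach that article describes, using xml.etree.ElementTree.iterparse. This is my own sketch, not tested against the file; it assumes the document carries no XML namespace, and it tracks nesting depth so the <drug> elements nested inside <drug-interactions> are not mistaken for top-level entries:

import xml.etree.ElementTree as ET

depth = 0
for event, elem in ET.iterparse('drugbank.xml', events=('start', 'end')):
    if elem.tag == 'drug':
        if event == 'start':
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                # A complete top-level <drug> entry is now in memory.
                print elem.get('type')
                elem.clear()  # free the children to keep memory use flat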