XML parsing: Python reads file incorrectly

I am currently trying to parse an XML file online and obtain the data I need from it. My code is displayed below:
import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextXml.php?sid=KBFI&num=360')
page_content = page.read()
with open('KBFI.xml', 'w') as fid:
    fid.write(page_content)

data = []
xml = parse('KBFI.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in xml.getElementsByTagName('ob'):
        # Convert time string to time_struct ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        for variable in xml.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # UnIndent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found add as 0
            data.append([ob_time.tm_mday,
                         ob_time.tm_hour,
                         ob_time.tm_min,
                         0])
print data
Unfortunately I cannot post an image of the xml file, but a version of it will be saved into your current directory if my script is run.
I would like the code to simply check for the existence of the 'variable' PCP1H and print its 'value' if it exists (only one entry per 'ob'). If it doesn't exist, or provides a value of 'T', I would like it to print '0' for that particular hour. Currently the output (the script I provided can be run to see it) contains completely incorrect values, and there are six entries per hour instead of one. What is wrong with my code?

The main issue in your code: in each for loop you are getting the elements using -
xml.getElementsByTagName('ob')
This starts the search from the xml element, which in your case is the root element. The same applies to xml.getElementsByTagName('variable'): it also searches from the root, so every time you get all of the elements with the tag variable in the whole document. That is why you are getting six entries per hour instead of one (there are six of them in the complete xml).
You should instead fetch them using -
ob.getElementsByTagName('variable')
And the ob elements using -
station.getElementsByTagName('ob')
So that we only check inside the particular element we are currently iterating over (not the complete xml document).
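In skeleton form, the nesting then looks like this (this is just the loop structure of the full example further down) -
for station in xml.getElementsByTagName('station'):
    for ob in station.getElementsByTagName('ob'):              # only obs under this station
        for variable in ob.getElementsByTagName('variable'):   # only variables under this ob
            pass  # handle one variable of one ob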
Also, a separate side issue: you are doing -
elif variable.getAttribute('value') >= 0:
If I am not wrong, getAttribute() returns a string, and in Python 2 a string always compares as greater than an integer, so this check is true regardless of the actual value. In the xml I see that value holds strings as well as numbers, so I am not really sure what you want that condition to be (though this is not the main issue; the main issue is the one described above).
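If you do want a numeric comparison there, one possible sketch (my own suggestion, treating 'T' and anything else non-numeric as zero) is to convert the attribute before comparing -
raw = variable.getAttribute('value')
try:
    value = float(raw)
except ValueError:
    value = 0.0  # e.g. 'T' for a trace amount
if value >= 0:
    data.append((ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, value))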
Example code changes -
import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextXml.php?sid=KBFI&num=360')
page_content = page.read()
with open('KBFI.xml', 'w') as fid:
    fid.write(page_content.decode())

data = []
xml = parse('KBFI.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in station.getElementsByTagName('ob'):
        # Convert time string to time_struct ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        for variable in ob.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # UnIndent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found add as 0
            data.append([ob_time.tm_mday,
                         ob_time.tm_hour,
                         ob_time.tm_min,
                         0])
print data

Related

Improving speed while iterating over ~400k XML files

This is more of a theoretical question to understand objects, garbage collection and performance of Python better.
Let's say I have a ton of XML files and want to iterate over each one, get all the tags, store them in a dict, increase counters for each tag, etc. When I do this, the first, let's say, 15k iterations process really quickly, but afterwards the script slows down significantly, while memory usage, CPU load etc. are fine. Why is that? Do I create hidden objects on each iteration that are not cleaned up, and can I do something to improve it? I tried to use regex instead of ElementTree, but it wasn't worth the effort since I only want to extract first-level tags and it would make things more complex.
Unfortunately I cannot give a reproducible example without providing the XML files; however, this is my code:
import os
import datetime
import xml.etree.ElementTree as ElementTree

start_time = datetime.datetime.now()

original_implemented_tags = os.path.abspath("/path/to/file")

required_tags = {}
optional_tags = {}
new_tags = {}

# read original tags
for _ in open(original_implemented_tags, "r"):
    if "@XmlElement(name =" in _:
        _xml_attr = _.split('"')[1]
        if "required = true" in _:
            required_tags[_xml_attr] = 1  # set this to 1 so I can use if dict.get(_xml_attr) (0 returns False)
        else:
            optional_tags[_xml_attr] = 1

# read all XML files from nested folder containing XML dumps and other files
clinical_trial_root_dir = os.path.abspath("/path/to/dump/folder")
xml_files = []
for root, dirs, files in os.walk(clinical_trial_root_dir):
    xml_files.extend([os.path.join(root, _) for _ in files if os.path.splitext(_)[-1] == '.xml'])

# function for parsing a file and extracting unique tags
def read_via_etree(file):
    _root = ElementTree.parse(file).getroot()
    _main_tags = list(set([_.tag for _ in _root.findall("./")]))  # some tags occur twice
    for _attr in _main_tags:
        # if tag doesn't exist in original document, increase counts in new_tags
        if _attr not in required_tags.keys() and _attr not in optional_tags.keys():
            if _attr not in new_tags.keys():
                new_tags[_attr] = 1
            else:
                new_tags[_attr] += 1
        # otherwise, increase counts in either one of required_tags or optional_tags
        if required_tags.get(_attr):
            required_tags[_attr] += 1
        if optional_tags.get(_attr):
            optional_tags[_attr] += 1

# actual parsing with indicator
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        print(f"Analyzed {idx} files")
    read_via_etree(xml)

# undoing the initial 1
for k in required_tags.keys():
    required_tags[k] -= 1
for k in optional_tags.keys():
    optional_tags[k] -= 1

print(f"Done parsing {len(xml_files)} documents in {datetime.datetime.now() - start_time}")
Example of one XML file:
<parent_element>
    <tag_i_need>
        <tag_i_dont_need>Some text i dont need</tag_i_dont_need>
    </tag_i_need>
    <another_tag_i_need>Some text i also dont need</another_tag_i_need>
</parent_element>
After the helpful comments I added a timestamp to my loop indicating how much time has passed since the last 1k documents, and flushed sys.stdout:
import sys

loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}")
        sys.stdout.flush()
        loop_timer = datetime.datetime.now()
    read_via_etree(xml)
I think it makes sense now, since the XML files vary in size and the standard output stream is buffered. Thanks to Albert Winestein.
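A small sketch like this (an addition, reusing xml_files, datetime and ElementTree from the code above; timed_parse is a made-up helper name) would confirm that per-file size, not accumulated state, drives the timing - it parses each file in isolation and prints the slowest ones:
def timed_parse(path):
    start = datetime.datetime.now()
    ElementTree.parse(path)  # parse only, discard the result
    return (datetime.datetime.now() - start).total_seconds()

timings = sorted(((timed_parse(p), p) for p in xml_files), reverse=True)
for seconds, path in timings[:10]:  # ten slowest files
    print(f"{seconds:.3f}s  {path}")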

Python - Reading a CSV, won't print the contents of the last column

I'm pretty new to Python, and put together a script to parse a csv and ultimately output its data into a repeated html table.
I got most of it working, but there's one weird problem I haven't been able to fix. My script will find the index of the last column, but won't print out the data in that column. If I add another column to the end, even an empty one, it'll print out the data in the formerly-last column - so it's not a problem with the contents of that column.
Abridged (but still grumpy) version of the code:
import os
os.chdir('C:\\Python34\\andrea')
import csv

csvOpen = open('my.csv')
exampleReader = csv.reader(csvOpen)
tableHeader = next(exampleReader)

if 'phone' in tableHeader:
    phoneIndex = tableHeader.index('phone')
else:
    phoneIndex = -1

for row in exampleReader:
    row[-1] =''
    print(phoneIndex)
    print(row[phoneIndex])

csvOpen.close()
my.csv
stuff,phone
1,3235556177
1,3235556170
Output
1
1
Same script, small change to the CSV file:
my.csv
stuff,phone,more
1,3235556177,
1,3235556170,
Output
1
3235556177
1
3235556170
I'm using Python 3.4.3 via Idle 3.4.3
I've had the same problem with CSVs generated directly by mysql, ones that I've opened in Excel first then re-saved as CSVs, and ones I've edited in Notepad++ and re-saved as CSVs.
I tried adding several different modes to the open function (r, rU, b, etc.) and either it made no difference or gave me an error (for example, it didn't like 'b').
My workaround is just to add an extra column to the end, but since this is a frequently used script, it'd be much better if it just worked right.
Thank you in advance for your help.
row[-1] =''
The CSV reader returns to you a list representing the row from the file. On this line you set the last value in the list to an empty string. Then you print it afterwards. Delete this line if you don't want the last column to be set to an empty string.
If you know it is the last column, you can count the columns and use that value minus 1. Likewise, you can use your string-comparison method if you know the header will always be "phone". If you go with the string comparison, I recommend converting the value from the csv to lower case so that you don't have to worry about capitalization.
In my code below I created functions that show how to use either method.
import os
import csv

os.chdir('C:\\temp')
csvOpen = open('my.csv')
exampleReader = csv.reader(csvOpen)
tableHeader = next(exampleReader)

phoneColIndex = None  # init to a value that can imply state
lastColIndex = None   # init to a value that can imply state

def getPhoneIndex(header):
    for i, col in enumerate(header):  # use this syntax to get index of item
        if col.lower() == 'phone':
            return i
    return -1  # send back invalid index

def findLastColIndex(header):
    return len(header) - 1

# Methods to check for phone col: 1. by string comparison
# and 2. by assuming it's the last col.
if len(tableHeader) > 1:  # if only one column or less, why go any further?
    phoneColIndex = getPhoneIndex(tableHeader)
    lastColIndex = findLastColIndex(tableHeader)
    for row in exampleReader:
        print(row[phoneColIndex])
        print('----------')
        print(row[lastColIndex])
        print('----------')

csvOpen.close()
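As a side note, if the goal is just to read columns by name, csv.DictReader avoids the index bookkeeping entirely. A minimal sketch (assuming the header really has a 'phone' column):
import csv

with open('my.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        print(row['phone'])  # look the column up by name instead of by index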

Script skips second for loop when reading a file

I am trying to read a log file and compare certain values against preset thresholds. My code manages to log the raw data with the first for loop in my function.
I have added print statements to try and figure out what was going on and I've managed to deduce that my second for loop never "happens".
This is my code:
def smartTest(log, passed_file):
    # Threshold values based on averages, subject to change if need be
    RRER = 5
    SER = 5
    OU = 5
    UDMA = 5
    MZER = 5
    datafile = passed_file

    # Log the raw data
    log.write('=== LOGGING RAW DATA FROM SMART TEST===\r\n')
    for line in datafile:
        log.write(line)
    log.write('=== END OF RAW DATA===\r\n')

    print 'Checking SMART parameters...',
    log.write('=== VERIFYING SMART PARAMETERS ===\r\n')
    for line in datafile:
        if 'Raw_Read_Error_Rate' in line:
            line = line.split()
            if int(line[9]) < RRER and datafile == 'diskOne.txt':
                log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK ONE OK!\r\n" % int(line[9]))
            elif int(line[9]) < RRER and datafile == 'diskTwo.txt':
                log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK TWO OK!\r\n" % int(line[9]))
            else:
                print 'FAILED'
                log.write("WARNING: Raw_Read_Error_Rate SMART parameter is: %s. Value over threshold!\r\n" % int(line[9]))
                rcode = mbox(u'Attention!', u'One or more hardrives may need replacement.', 0x30)
This is how I am calling this function:
dataOne = diskOne()
smartTest(log, dataOne)
print 'Disk One Done'
diskOne() looks like this:
def diskOne():
    if os.path.exists(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl"):
        os.chdir(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl")
        os.system("Smartctl -a /dev/csmi0,0 > C:\Dejero\Installation-Scripts\diskOne.txt")
        # Store file in variable
        os.chdir(r"C:\Dejero\Installation-Scripts")
        datafile = open('diskOne.txt', 'rb')
        return datafile
    else:
        log.write('Smart utility not found.\r\n')
I have tried googling similar issues to mine and have found none. I tried moving my first for loop into diskOne() but the same issue occurs. There is no syntax error and I am just not able to see the issue at this point.
It is not skipping your second loop; there is simply nothing left to read. After the first loop has read the file, the file offset sits at the end of the file, so you need to move it back to the start. This can be done easily by adding the line
datafile.seek(0)
before the second loop.
Ref: Documentation
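In context, the fix looks something like this (a minimal sketch using the names from the question):
# first pass: log the raw data
for line in datafile:
    log.write(line)

datafile.seek(0)  # rewind so the second pass sees the lines again

# second pass: check the SMART parameters
for line in datafile:
    pass  # threshold checks go here

# alternatively, read the file once and reuse the list:
# lines = datafile.readlines()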

Print only not null values

I am trying to print only non-null values, but I am not sure why even the null values are coming up in the output:
Input:
from lxml import html
import requests
import linecache

i = 1
read_url = linecache.getline('stocks_url', 1)
while read_url != '':
    page = requests.get(read_url)
    tree = html.fromstring(page.text)
    percentage = tree.xpath('//span[@class="grnb_20"]/text()')
    if percentage != None:
        print percentage
    i = i + 1
    read_url = linecache.getline('stocks_url', i)
Output:
$ python test_null.py
['76%']
['76%']
['80%']
['92%']
['77%']
['71%']
[]
['50%']
[]
['100%']
['67%']
You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty.
Use a boolean test:
percentage = tree.xpath('//span[@class="grnb_20"]/text()')
if percentage:
    print percentage[0]
Empty lists (and None) test as false in a boolean context. I opted to print out the first element from the XPath result; you appear to only ever have one.
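For instance, this is plain Python behaviour:
print bool([])       # False - empty list
print bool(None)     # False
print bool(['76%'])  # True  - non-empty list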
Note that linecache is primarily aimed at caching Python source files; it is used to present tracebacks when an error occurs, and when you use inspect.getsource(). It isn't really meant to be used to read a file. You can just use open() and loop over the file without ever having to keep incrementing a counter:
with open('stocks_url') as urlfile:
    for url in urlfile:
        page = requests.get(url.strip())  # strip the trailing newline from the line
        tree = html.fromstring(page.content)
        percentage = tree.xpath('//span[@class="grnb_20"]/text()')
        if percentage:
            print percentage[0]
Change this in your code and it should work:
if percentage != []:

utilizing "isalpha" or "startswith" and/or troubleshooting "list index out of range error"

I'm a newbie in all facets (SO, python, beautifulsoup, etc), so bear with me please.
I am trying to create a variety of maps with different types of data following a tutorial found at flowingdata.com (how to make a US county thematic map using free tools).
I can duplicate the tutorial without error so no version issues I can speak of (I'm using Python 2.7.5 and BeautifulSoup 4.3.1 on Mac OS 10.8). I would like to use (more detailed) state-county maps and colorize them with different data. I have the maps (svg) and data (csv) in appropriate files. Here is the script I am currently running:
import csv
from BeautifulSoup import BeautifulSoup

totpop = {}
reader = csv.reader(open('datafile.csv', 'rU'), delimiter=",")
for row in reader:
    try:
        id = row[0]
        pop = float(row[1].strip())
        totpop[id] = pop
    except:
        pass

svg = open('mapfile.svg', 'r').read()
soup = BeautifulSoup(svg, selfClosingTags=['defs', 'sodipodi:namedview', 'path'])
paths = soup.findAll('path')

colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]
path_style = 'fill-rule:nonzero; stroke: #ffffff; stroke-width: 5; stroke-opacity: 1; fill: '

# Colorize based on data
for p in paths:
    try:
        pop = totpop[p['id']]
    except:
        continue
    if pop > 750000:
        color_class = 6
    elif pop > 500000:
        color_class = 5
    elif pop > 250000:
        color_class = 4
    elif pop > 125000:
        color_class = 3
    elif pop > 75000:
        color_class = 2
    elif pop > 25000:
        color_class = 1
    else:
        color_class = 0
    color = colors[color_class]
    p['style'] = path_style + color

print soup.prettify()
And I'm getting the following error:
File "scriptname.py", line 54, in
color = colors[color_class]
IndexError: list index out of range
("line 54" may not match because I removed some comment lines in the sample code)
Regarding the svg file, it has both paths and groups of paths (the groups of paths are counties comprised of multiple paths). Single path counties have the county name as the "id." Multi-path counties have the county name as the group "id" however the nested paths have numeric ids. I want the style to be applied to either the path or group that matches the county name in the data file (I'm fully aware the sample code does not deal with groups right now). To test, I ran the script on a sample svg that had only paths (no groups) and it worked brilliantly...so I know something is right. I think the issue is with the groups and/or the paths (within the groups) with numeric ids.
How do I get around the error? I tried to remove the groups and change all the multi-path ids to the same thing...that didn't work either. Do the numeric ids cause problems if they're not explicitly ignored?
I'm wondering if I can run a script that either singles out the paths and/or groups that have names (no numbers/digits) using some sort of "isalpha" tool or "startswith" (any letter) to avoid the index error.
I hope that provides enough information.
Here is a link to one of the svg maps (I have stripped the clippath and state_outline from my working file), and here is a link to the corresponding datafile.
If you test the files, you may have viewbox issues but I have sorted that out separately.
Thanks for any help!
From the looks of it, you're probably assuming that in the following array:
colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]
that the elements here are indexed as 1, 2, 3, 4, 5, 6. The index is actually going to begin with 0 and not 1. So "#F1EEF6" is actually element 0 and the last element ("#980043") is number 5 in your array. In your if pop statements, you'll need to make this adjustment.
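One possible adjustment (a sketch using the names from the question) is to shift each class down by one, so the largest bucket uses index 5, the last valid position in the six-colour list -
if pop > 750000:
    color_class = 5
elif pop > 500000:
    color_class = 4
elif pop > 250000:
    color_class = 3
elif pop > 125000:
    color_class = 2
elif pop > 75000:
    color_class = 1
elif pop > 25000:
    color_class = 0
# the final else branch is handled below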
Also, you'll need to change the else statement to set your color_class to something you can use to determine whether you should attempt to grab a valid color or not. I was thinking something along the lines of:
else:
    color_class = None

if color_class is not None:
    color = colors[color_class]
    p['style'] = path_style + color
I'm not familiar with the Python syntax so there may be an error in there but hopefully you get the idea I'm trying to show here.
