Python, XML index error

Hello, I am having trouble with an XML file I am using. On a short XML file the program works fine, but once the file reaches a certain size (I am thinking about 1 MB) it gives me an "IndexError: list index out of range".
Here is the code I have written so far.
from xml.dom import minidom
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
def xml_data():
    f = open('C:\opidea_2.xml', 'r')
    data = f.read()
    f.close()
    dom = minidom.parseString(data)
    ic = (dom.getElementsByTagName('logentry'))
    dom = None
    content = ''
    for num in ic:
        name = num.getElementsByTagName('author')[0].firstChild.nodeValue
        if name:
            content += "***Changes by:" + str(name) + "*** " + '\n\n Date: '
        else:
            content += "***Changes are made Anonymously *** " + '\n\n Date: '
    print content
if __name__ == "__main__":
    xml_data ()
Here is part of the xml if it helps.
<log>
  <logentry
     revision="33185">
    <author>glv</author>
    <date>2012-08-06T21:01:52.494219Z</date>
    <paths>
      <path
         kind="file"
         action="M">/branches/Patch_4_2_0_Branch/text.xml</path>
      <path
         kind="dir"
         action="M">/branches/Patch_4_2_0_Branch</path>
    </paths>
    <msg>PATCH_BRANCH:N/A
BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:N/A
Adding the SVN log size requirement to the branch
</msg>
  </logentry>
</log>
The actual XML file is much bigger, but this is the general format. The code works when the file is this small, but once it gets bigger I get problems.
Here is the traceback:
Traceback (most recent call last):
File "C:\python\src\SVN_Email_copy.py", line 141, in <module>
xml_data ()
File "C:\python\src\SVN_Email_copy.py", line 50, in xml_data
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
IndexError: list index out of range

Based on the code provided, your error is in this line:
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
#      ^ XML node (num)
#          ^ method call (getElementsByTagName), returns a list
#                                        ^ list indexing ([0])
#                                            ^ attribute access (.firstChild, then .nodeValue)
That's the only place in the code shown where you index into a list. It implies that at least one <logentry> in your larger XML file has no <author> tag. You'll have to correct the data, or add some error handling or data validation.
The annotations on the line above spell out what happens. You're doing a lot in a single line by chaining each call onto the return value of the previous one. num is defined, so that part is fine. Then you call a method, which returns a list. You try to retrieve the first item from that list, and when the list is empty the lookup raises the exception, so you never reach the attribute access for firstChild, and therefore never get a nodeValue.
Error checking may look something like this:
authors = num.getElementsByTagName('author')
if len(authors) > 0:
    name = authors[0].firstChild.nodeValue
There are many other ways to achieve the same thing.
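As one of those other ways, here is a minimal sketch (written for Python 3, while the post uses Python 2) that extends the check to the whole loop. The helper names `author_of` and `summarize` and the `SAMPLE_LOG` string are made up for illustration; the sample assumes the larger file contains entries with a missing or empty <author> element, which the post does not confirm.

```python
from xml.dom import minidom

# Hypothetical sample: one normal entry, one without <author>, one with an empty <author>.
SAMPLE_LOG = """<log>
  <logentry revision="1"><author>glv</author></logentry>
  <logentry revision="2"></logentry>
  <logentry revision="3"><author></author></logentry>
</log>"""


def author_of(entry):
    """Return the author's name, or None when the tag is missing or empty."""
    authors = entry.getElementsByTagName("author")
    if not authors:              # no <author> element at all
        return None
    first_child = authors[0].firstChild
    if first_child is None:      # <author></author> has no text node
        return None
    return first_child.nodeValue


def summarize(xml_text):
    """Build one summary line per <logentry>, falling back to 'anonymously'."""
    dom = minidom.parseString(xml_text)
    lines = []
    for entry in dom.getElementsByTagName("logentry"):
        name = author_of(entry)
        if name:
            lines.append("***Changes by: " + name + "***")
        else:
            lines.append("***Changes are made anonymously***")
    return lines
```

With this shape the `else` branch in the original loop becomes reachable: an entry without an author yields the "anonymously" line instead of raising IndexError.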

Related

Python: Using SSML with SAPI (comtypes)

TL;DR: I'm trying to pass an XML object (using ET) to a Comtypes (SAPI) object in python 3.7.2 on Windows 10. It's failing due to invalid chars (see error below). Unicode characters are read correctly from the file, can be printed (but do not display correctly on the console). It seems like the XML is being passed as ASCII or that I'm missing a flag? (https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? (I haven't figured that part out yet..)
Long form description
I'm using Python 3.7.2 on Windows 10 and trying to create an XML (SSML: https://www.w3.org/TR/speech-synthesis/) file to use with Microsoft's speech API. The voice struggles with certain words, and when I looked at the SSML format I saw that it supports a phoneme tag, which allows you to specify how to pronounce a given word. Microsoft implements parts of the standard (https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element), so I found a UTF-8 encoded library containing IPA pronunciations. When I try to call the SAPI with parts of the code replaced, I get the following error:
Traceback (most recent call last):
File "pdf_to_speech.py", line 132, in <module>
audioConverter(text = "Hello world extended test",outputFile = output_file)
File "pdf_to_speech.py", line 88, in __call__
self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))
I've been trying to debug, but when I print the pronunciations of the words the characters are boxes. However if I copy and paste them from my console, they look fine (see below).
həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst
Best Guess
I'm unsure whether the problem is caused by one of the changes I made along the way:
1) I changed Python versions to be able to print Unicode
2) I fixed problems with reading the file
3) I had incorrect manipulations of the string
I'm pretty sure the problem is that I'm not passing the text as Unicode to the comtypes object. The ideas I'm looking into are
1) Is there a flag missing?
2) Is it being converted to ASCII when it's being passed to comtypes (C types error)?
3) Is the XML being passed incorrectly, or am I missing a step?
Sneak peek at the code
This is the class that reads the IPA dictionary and then generates the XML file. Look at _load_phonemes and _pronounce.
import io
import sys
import xml.etree.ElementTree as ET

class SSML_Generator:
    def __init__(self,pause,phonemeFile):
        self.pause = pause
        if isinstance(phonemeFile,str):
            print("Loading dictionary")
            self.phonemeDict = self._load_phonemes(phonemeFile)
            print(len(self.phonemeDict))
        else:
            self.phonemeDict = {}
    def _load_phonemes(self, phonemeFile):
        phonemeDict = {}
        with io.open(phonemeFile, 'r',encoding='utf-8') as f:
            for line in f:
                tok = line.split()
                #print(len(tok))
                phonemeDict[tok[0].lower()] = tok[1].lower()
        return phonemeDict
    def __call__(self,text):
        SSML_document = self._header()
        for utterance in text:
            parent_tag = self._pronounce(utterance,SSML_document)
            #parent_tag.tail = self._pause(parent_tag)
            SSML_document.append(parent_tag)
        ET.dump(SSML_document)
        return SSML_document
    def _pause(self,parent_tag):
        return ET.fromstring("<break time=\"150ms\" />") # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})
    def _header(self):
        return ET.Element("speak",{"version":"1.0", "xmlns":"http://www.w3.org/2001/10/synthesis", "xml:lang":"en-US"})
    # TODO: Add rate https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
    def _rate(self):
        pass
    # TODO: Add pitch
    def _pitch(self):
        pass
    def _pronounce(self,word,parent_tag):
        if word in self.phonemeDict:
            sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
            return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")#ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})#<phoneme alphabet="string" ph="string"></phoneme>
        else:
            return parent_tag
    # Nice to have: Transform acronyms into their pronunciation (See say as tag)
I've also added how the code writes to the comtype object (SAPI) in case the error is there.
def __call__(self,text,outputFile):
    # https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
    self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
    self.engine.AudioOutputStream = self.stream
    text = self._text_processing(text)
    text = self.SSML_generator(text)
    text = ET.tostring(text,encoding='utf8', method='xml').decode('utf-8')
    self.engine.speak(text)
    self.stream.Close()
Thanks in advance for your help!
Try using single quotes around the ph attribute value.
Like this:
my_text = '<speak><phoneme alphabet="x-sampa" ph=\'v"e.de.ni.e\'>ведение</phoneme></speak>'
Also remember to use \ to escape the single quotes inside the Python string.
UPD
This error can also mean that your ph value cannot be parsed. You can check the docs here: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup
This example will work:
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-Jessa24kRUS">
    <s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
  </voice>
</speak>
but this doesn't:
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-Jessa24kRUS">
    <s>His name is Mike <phoneme alphabet="ups" ph="JHU AUA"> Zhou </phoneme></s>
  </voice>
</speak>
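On the quoting advice above: rather than assembling the tag by string concatenation, the element can be built with ElementTree, which escapes attribute values when it serializes them. The sketch below is illustrative only. The helper name `phoneme_element` is made up, the pronunciations are the ones quoted in the posts, and whether SAPI then accepts the ph value is a separate question that this does not settle.

```python
import xml.etree.ElementTree as ET


def phoneme_element(word, pronunciation, alphabet="ipa"):
    """Build a <phoneme> element; ElementTree escapes the attribute on output."""
    element = ET.Element("phoneme", {"alphabet": alphabet, "ph": pronunciation})
    element.text = word
    return element


speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
speak.append(phoneme_element("hello", "həˈloʊ"))
# A value containing a double quote, as in the x-sampa example above.
speak.append(phoneme_element("ведение", 'v"e.de.ni.e', alphabet="x-sampa"))

# encoding="unicode" returns a str directly, with no XML declaration
# and with non-ASCII characters left intact.
ssml = ET.tostring(speak, encoding="unicode")
```

Using `encoding="unicode"` also avoids the encode-then-decode round trip in the question's `ET.tostring(text, encoding='utf8', ...).decode('utf-8')` call, which additionally prepends an XML declaration to the string.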

Generating instances of class in loop gives TypeError: 'list' object is not callable

I've looked through a lot of replies regarding this error; however, none were helpful for my particular case, and since I'm new to Python, I have difficulty applying the hints to my problem.
I have a class in a file Aheat.py that reads
class Aheat():
    name = ""
    time = 0
    place = 0
    def __init__(self,name,time,place):
        self.name = name
        self.time = time
        self.place = place
And a file main.py where I want to read an HTML file, extract information, and create a list of objects of my class to work with them later on.
The (hopefully) essential part of my main.py reads
import urllib2
import re
from Aheat import Aheat
s = read something from url
ssplit = re.split('<p', s) # now every entry of ssplit contains an event
                           # and description and all the runners
HeatList = []
for part in ssplit:
    newHeat = Aheat("foo",1,1) # of course this is just an example
    HeatList.append(newHeat)
But this gives me the following error:
Traceback (most recent call last):
File "/home/username/Workspace/ECLIPSE/running/main.py", line 22, in <module>
newHeat = Aheat("foo",1,1)
TypeError: 'list' object is not callable
which is thrown when performing the second iteration.
If I take the creation of the object out of the loop, i.e.
newHeat = Aheat("foo",1,1)
for part in ssplit:
    HeatList.append(newHeat)
my code executes without a problem, but this is not what I want. I'm also not sure whether I can initialize a specific number of instances a priori, since the number of objects is determined in the loop.
I'm using Eclipse and Python 2.7.
Regex is going to bite you.
Splitting on '<p' also matches <pre>, <progress>, <param>, <p>, and any user-created directives on a page.
Follow the links in your comments to read up on why we shouldn't parse HTML with regex.
Thanks, @MarkR (btw, I was only supplementing your comment and I was agreeing with you).
Why not put the list in your class or, better yet, extend list functionality with your class?
class AHeat(list):
    def append(self,name,time,place):
        return super(AHeat,self).append([name,time,place])
# main
heatList= AHeat()
heatList.append("foo",1,2)
heatList.append("bar",3,4)
print(heatList[0])
print(heatList[1])
> ['foo', 1, 2]
> ['bar', 3, 4]
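The posted snippet does not show what triggers the TypeError. One common cause of "'list' object is not callable" on the second pass of a loop is that the name used to create instances gets rebound to a list during the first pass. The sketch below (Python 3) reproduces that situation on purpose; the rebinding line is a guess for illustration, not something taken from the question.

```python
class Aheat:
    def __init__(self, name, time, place):
        self.name = name
        self.time = time
        self.place = place


def demo_rebinding():
    """Reproduce the error by rebinding the callable name inside the loop."""
    factory = Aheat
    heats = []
    try:
        for _ in range(2):
            heats.append(factory("foo", 1, 1))
            factory = heats  # hypothetical slip: the name now refers to a list
    except TypeError as exc:
        return str(exc)
    return None


def build_heats(parts):
    """Create one fresh Aheat per part without touching the class name."""
    return [Aheat("foo", 1, 1) for _ in parts]
```

If something like this is the cause, searching main.py for any assignment to `Aheat` (including a loop variable or an import with the same name) should find it.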
Also

Python 3 email extracting search engine

Q. Write a search engine that will take a file (like an html source page) and extract all of the email addresses. It will then print them out in an ordered list. The file may contain a lot of messy text (i.e. asda#home is not valid.. and there can be a lot of #'s in the file in roles other than emails!)
For testing purposes, this is the text file I have been using:
askdalsd
asd
sad
asd
asd
asd
ad
asd
asda
da
moi1990#gmail.com
masda#sadas
223#home.ca
125512#12451.cpm
domain#name.com
asda
sda
as
da
ketchup#ketchup##%##.com
onez!es#gomail.com
asdasda#####email.com
asda#asdasdaad.ca
moee#gmail.com
And this is what I have so far:
import os
import re
import sys
def grab_email(file):
    email_pattern = re.compile(r'\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b',re.IGNORECASE)
    found = set()
    if os.path.isfile(file):
        for line in open(file, 'r'):
            found.update(email_pattern.findall(line))
    for email_address in found:
        print (email_address)
    if __name__ == '__main__':
        grab_email(sys.argv[1])
grab_email('email_addresses.txt')
Now the problem I am having is that after a certain point, the program crashes. This is the output:
125512#12451.cpm
es#gomail.com
asda#asdasdaad.ca
223#home.ca
moee#gmail.com
moi1990#gmail.com
domain#name.com
Traceback (most recent call last):
File "D:/Sheridan/Part Time/TELE26529 Linux Architecture w. Network Scripting/Python Assignment 3.5/question1.py", line 17, in <module>
grab_email('email_addresses.txt')
File "D:/Sheridan/Part Time/TELE26529 Linux Architecture w. Network Scripting/Python Assignment 3.5/question1.py", line 14, in grab_email
grab_email(sys.argv[1])
IndexError: list index out of range
What am I doing wrong here and how do I fix this? How can I more effectively handle these exceptions?
The problem is this part:
if __name__ == '__main__':
    grab_email(sys.argv[1])
Your program crashes because this block sits inside the grab_email function, so it runs every time the function is called. Since you are running the file directly, the if statement evaluates to True. Then, since you passed no command-line arguments, sys.argv[1] refers to a list element that does not exist, which causes the IndexError you get.
To fix it, just dedent! It should look like:
import os
import re
import sys
def grab_email(file):
    email_pattern = re.compile(r'\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b',re.IGNORECASE)
    found = set()
    if os.path.isfile(file):
        for line in open(file, 'r'):
            found.update(email_pattern.findall(line))
    for email_address in found:
        print (email_address)
if __name__ == '__main__':
    grab_email(sys.argv[1])
This will now run correctly from the command line (assuming you pass the file name as an argument). I have also removed the extraneous function call.
Of course, if you just want this to run in the interpreter, take out the if statement and reinstate the function call I removed. You could also do this:
if __name__ == '__main__':
    if len(sys.argv)>1:
        grab_email(sys.argv[1])
    else:
        grab_email('email_addresses.txt')
This isn't great, per se, but it handles that particular error (while introducing another potential one: the default file may not exist).
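The same fallback can also be expressed with argparse, which makes the file name an optional positional argument. This is only a sketch: the function name `parse_path` is made up, and the default file name is the one used in the question.

```python
import argparse

DEFAULT_PATH = "email_addresses.txt"  # file name taken from the question


def parse_path(argv):
    """Return the file to scan: the first argument if given, else the default."""
    parser = argparse.ArgumentParser(
        description="Extract e-mail addresses from a file."
    )
    # nargs="?" makes the positional argument optional.
    parser.add_argument("path", nargs="?", default=DEFAULT_PATH)
    return parser.parse_args(argv).path
```

Inside the `if __name__ == '__main__':` block, the call would then be `grab_email(parse_path(sys.argv[1:]))`, so an empty argument list no longer leads to an IndexError.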

How do I search for text in a page using regular expressions in Python?

I'm trying to create a simple module for phenny, a simple IRC bot framework in Python. The module is supposed to go to http://www.isup.me/websitetheuserrequested to check whether a website is up or down. I assumed I could use regex for the module, seeing as other built-in modules use it too, so I tried creating this simple script, although I don't think I did it right.
import re, urllib
import web
isupuri = 'http://www.isup.me/%s'
check = re.compile(r'(?ims)<span class="body">.*?</span>')
def isup(phenny, input):
    global isupuri
    global cleanup
    bytes = web.get(isupuri)
    quote = check.findall(bytes)
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
    phenny.say(result)
isup.commands = ['isup']
isup.priority = 'low'
isup.example = '.isup google.com'
It imports the required web packages (I think), and defines the string and the text to look for within the page. I really don't know what I did in those four lines, I kinda just ripped the code off another phenny module.
Here is an example of a quotes module that grabs a random quote from some webpage, I kinda tried to use that as a base: http://pastebin.com/vs5ypHZy
Does anyone know what I am doing wrong? If something needs to be clarified, I can tell you; I don't think I explained this well enough.
Here is the error I get:
Traceback (most recent call last):
File "C:\phenny\bot.py", line 189, in call
try: func(phenny, input)
File "C:\phenny\modules\isup.py", line 18, in isup
result = re.sub(r'<[^>]*?>', '', str(quote[0]))
IndexError: list index out of range
Try this (from http://docs.python.org/release/2.6.7/library/httplib.html#examples):
import httplib
conn = httplib.HTTPConnection("www.python.org")
conn.request("HEAD","/index.html")
res = conn.getresponse()
if res.status >= 200 and res.status < 300:
    print "up"
else:
    print "down"
You will also need to add code to follow redirects before checking the response status.
Edit
Alternative that does not need to handle redirects but uses exceptions for logic:
import urllib2
request = urllib2.Request('http://google.com')
request.get_method = lambda : 'HEAD'
try:
    response = urllib2.urlopen(request)
    print "up"
    print response.code
except urllib2.URLError, e:
    # failure
    print "down"
    print e
You should do your own tests and choose the best one.
The error means your regexp wasn't found anywhere on the page (the list quote has no element 0).
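A minimal sketch (Python 3) of guarding against that empty result is shown below. The pattern is the one from the question; the helper name `first_body_text` and the sample HTML strings in the usage note are made up. Note also that the posted code passes isupuri to web.get without filling in the %s placeholder, which may be part of why the pattern is never found.

```python
import re

# Pattern copied from the question.
BODY_SPAN = re.compile(r'(?ims)<span class="body">.*?</span>')
TAG = re.compile(r"<[^>]*?>")


def first_body_text(html):
    """Return the text of the first matching span, or None if there is none."""
    match = BODY_SPAN.search(html)
    if match is None:
        return None
    # Strip the tags the same way the question's re.sub call does.
    return TAG.sub("", match.group(0)).strip()
```

The caller can then check for None before calling phenny.say, for example replying with a fallback message when the page layout is not what the pattern expects.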

How can I get specific elements from XML data?

I have some code to retrieve XML data:
import cStringIO
import pycurl
from xml.etree import ElementTree
_API_KEY = 'my api key'
_ima = '/the/path/to/a/image'
sock = cStringIO.StringIO()
upl = pycurl.Curl()
values = [
("key", _API_KEY),
("image", (upl.FORM_FILE, _ima))]
upl.setopt(upl.URL, "http://api.imgur.com/2/upload.xml")
upl.setopt(upl.HTTPPOST, values)
upl.setopt(upl.WRITEFUNCTION, sock.write)
upl.perform()
upl.close()
xmldata = sock.getvalue()
#print xmldata
sock.close()
The resulting data looks like:
<?xml version="1.0" encoding="utf-8"?>
<upload><image><name></name><title></title><caption></caption><hash>dxPGi</hash><deletehash>kj2XOt4DC13juUW</deletehash><datetime>2011-06-10 02:59:26</datetime><type>image/png</type><animated>false</animated><width>1024</width><height>768</height><size>172863</size><views>0</views><bandwidth>0</bandwidth></image><links><original>http://i.stack.imgur.com/dxPGi.png</original><imgur_page>http://imgur.com/dxPGi</imgur_page><delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page><small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square><large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail></links></upload>
Now, following this answer, I'm trying to get some specific values from the data.
This is my attempt:
tree = ElementTree.fromstring(xmldata)
url = tree.findtext('original')
webpage = tree.findtext('imgur_page')
delpage = tree.findtext('delete_page')
print 'Url: ' + str(url)
print 'Pagina: ' + str(webpage)
print 'Link de borrado: ' + str(delpage)
I get an AttributeError if I try to add the .text access:
Traceback (most recent call last):
File "<pyshell#28>", line 27, in <module>
url = tree.find('original').text
AttributeError: 'NoneType' object has no attribute 'text'
I couldn't find anything in Python's help for ElementTree about this attribute. How can I get only the text, not the object?
I found some info about getting a text string here; but when I try it I get a TypeError:
Traceback (most recent call last):
File "<pyshell#32>", line 34, in <module>
print 'Url: ' + url
TypeError: cannot concatenate 'str' and 'NoneType' objects
If I try to print 'Url: ' + str(url) instead, there is no error, but the result shows as None.
How can I get the url, webpage and delete_page data from this XML?
Your find() call is trying to find an immediate child of the top of the tree with a tag named original, not a tag at any lower level than that. Use:
url = tree.find('.//original').text
to get the first element named original anywhere in the tree. The pattern matching rules for ElementTree's find() method are laid out in a table on this page: http://effbot.org/zone/element-xpath.htm
For // matching it says:
Selects all subelements, on all levels beneath the current element (search the entire subtree). For example, “.//egg” selects all “egg” elements in the entire tree.
Edit: here is some test code for you. It uses the XML sample string you posted; I just ran it through XML Tidy in TextMate to make it legible:
from xml.etree import ElementTree
xmldata = '''<?xml version="1.0" encoding="utf-8"?>
<upload>
    <image>
        <name/>
        <title/>
        <caption/>
        <hash>dxPGi</hash>
        <deletehash>kj2XOt4DC13juUW</deletehash>
        <datetime>2011-06-10 02:59:26</datetime>
        <type>image/png</type>
        <animated>false</animated>
        <width>1024</width>
        <height>768</height>
        <size>172863</size>
        <views>0</views>
        <bandwidth>0</bandwidth>
    </image>
    <links>
        <original>http://i.stack.imgur.com/dxPGi.png</original>
        <imgur_page>http://imgur.com/dxPGi</imgur_page>
        <delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>
        <small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square>
        <large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail>
    </links>
</upload>'''
tree = ElementTree.fromstring(xmldata)
print tree.find('.//original').text
On my machine (OS X running python 2.6.1) that produces:
Ian-Cs-MacBook-Pro:tmp ian$ python test.py
http://i.stack.imgur.com/dxPGi.png
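For all three values from the question, a sketch in Python 3 could look like the following. It uses a shortened copy of the posted XML and shows both the `.//` form and an explicit `links/...` path; `findtext` returns None (or a supplied default) when nothing matches, which avoids the AttributeError from calling `.text` on None.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question.
XML_DATA = (
    "<upload><image><hash>dxPGi</hash></image><links>"
    "<original>http://i.stack.imgur.com/dxPGi.png</original>"
    "<imgur_page>http://imgur.com/dxPGi</imgur_page>"
    "<delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>"
    "</links></upload>"
)

tree = ET.fromstring(XML_DATA)

# Looks only at direct children of <upload>, so nothing is found.
direct = tree.findtext("original")

# Searches the whole subtree.
url = tree.findtext(".//original")

# Spells out the path from the root instead.
webpage = tree.findtext("links/imgur_page")
delpage = tree.findtext("links/delete_page", default="")

print("Url: " + str(url))
print("Pagina: " + str(webpage))
print("Link de borrado: " + str(delpage))
```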
