TL;DR: I'm trying to pass an XML object (built with ElementTree, ET) to a comtypes (SAPI) object in Python 3.7.2 on Windows 10. It fails due to invalid characters (see the error below). The Unicode characters are read correctly from the file and can be printed, but they do not display correctly on the console. It seems that either the XML is being passed as ASCII or I'm missing a flag (https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? (I haven't figured that part out yet.)
Long form description
I'm using Python 3.7.2 on Windows 10 and trying to create an XML (SSML: https://www.w3.org/TR/speech-synthesis/) file to use with Microsoft's speech API. The voice struggles with certain words, and when I looked at the SSML format I found that it supports a phoneme tag, which lets you specify how to pronounce a given word. Microsoft implements parts of the standard (https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element), so I found a UTF-8 encoded library containing IPA pronunciations. When I try to call SAPI with parts of the text replaced by phoneme tags, I get the following error:
Traceback (most recent call last):
File "pdf_to_speech.py", line 132, in <module>
audioConverter(text = "Hello world extended test",outputFile = output_file)
File "pdf_to_speech.py", line 88, in __call__
self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))
I've been trying to debug, but when I print the pronunciations of the words the characters are boxes. However if I copy and paste them from my console, they look fine (see below).
həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst
Best Guess
I'm unsure whether the problem is caused by
1) I've changed Python versions to be able to print Unicode
2) I fixed problems with reading the file
3) I had incorrect manipulations of the string
I'm pretty sure the problem is that I'm not passing it as Unicode to the comtypes object. The ideas I'm looking into are
1) Is there a flag missing?
2) Is it being converted to ASCII when it's being passed to comtypes (a C types error)?
3) Is the XML being passed incorrectly, or am I missing a step?
Sneak peek at the code
This is the class that reads the IPA dictionary and then generates the XML file. Look at _load_phonemes and _pronounce.
class SSML_Generator:
    def __init__(self,pause,phonemeFile):
        self.pause = pause
        if isinstance(phonemeFile,str):
            print("Loading dictionary")
            self.phonemeDict = self._load_phonemes(phonemeFile)
            print(len(self.phonemeDict))
        else:
            self.phonemeDict = {}
    def _load_phonemes(self, phonemeFile):
        phonemeDict = {}
        with io.open(phonemeFile, 'r',encoding='utf-8') as f:
            for line in f:
                tok = line.split()
                #print(len(tok))
                phonemeDict[tok[0].lower()] = tok[1].lower()
        return phonemeDict
    def __call__(self,text):
        SSML_document = self._header()
        for utterance in text:
            parent_tag = self._pronounce(utterance,SSML_document)
            #parent_tag.tail = self._pause(parent_tag)
            SSML_document.append(parent_tag)
        ET.dump(SSML_document)
        return SSML_document
    def _pause(self,parent_tag):
        return ET.fromstring("<break time=\"150ms\" />") # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})
    def _header(self):
        return ET.Element("speak",{"version":"1.0", "xmlns":"http://www.w3.org/2001/10/synthesis", "xml:lang":"en-US"})
    # TODO: Add rate https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
    def _rate(self):
        pass
    # TODO: Add pitch
    def _pitch(self):
        pass
    def _pronounce(self,word,parent_tag):
        if word in self.phonemeDict:
            sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
            return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")#ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})#<phoneme alphabet="string" ph="string"></phoneme>
        else:
            return parent_tag
    # Nice to have: Transform acronyms into their pronunciation (See say as tag)
I've also added how the code writes to the comtype object (SAPI) in case the error is there.
def __call__(self,text,outputFile):
    # https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
    self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
    self.engine.AudioOutputStream = self.stream
    text = self._text_processing(text)
    text = self.SSML_generator(text)
    text = ET.tostring(text,encoding='utf8', method='xml').decode('utf-8')
    self.engine.speak(text)
    self.stream.Close()
Thanks in advance for your help!
Try using single quotes around the value of the ph attribute.
Like this:
my_text = '<speak><phoneme alphabet="x-sampa" ph=\'v"e.de.ni.e\'>ведение</phoneme></speak>'
Also remember to use \ to escape the single quotes inside the Python string literal.
UPD
This error could also mean that your ph value cannot be parsed. You can check the docs here: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup
this example will work
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-Jessa24kRUS">
<s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
</voice>
</speak>
but this doesn't
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-Jessa24kRUS">
<s>His name is Mike <phoneme alphabet="ups" ph="JHU AUA"> Zhou </phoneme></s>
</voice>
</speak>
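For the quoting part, here is a minimal sketch of the ET.SubElement route that the question already has commented out. It lets ElementTree escape the attribute value instead of concatenating strings by hand, and it serializes with encoding="unicode" so the result is a plain str without an XML declaration. The function name build_ssml and the sample dictionary are made up for illustration, and whether SAPI then accepts a given IPA string is not something this sketch can confirm.

```python
import xml.etree.ElementTree as ET

def build_ssml(words, phoneme_dict):
    """Return an SSML string; words found in phoneme_dict get a <phoneme> tag."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    last = None  # most recently added <phoneme> element, if any
    for word in words:
        if word in phoneme_dict:
            # ElementTree escapes quotes and other special characters in "ph".
            last = ET.SubElement(speak, "phoneme",
                                 {"alphabet": "ipa", "ph": phoneme_dict[word]})
            last.text = word
            last.tail = " "
        elif last is None:
            # Plain text before the first child element lives in speak.text.
            speak.text = (speak.text or "") + word + " "
        else:
            # Plain text after a child element lives in that element's tail.
            last.tail += word + " "
    # encoding="unicode" returns str (not bytes) and omits the XML declaration.
    return ET.tostring(speak, encoding="unicode")

# Hypothetical sample data, not taken from the IPA library in the question.
ssml = build_ssml(["say", "hello"], {"hello": "həˈloʊ"})
print(ssml)
```

The returned string can then be handed to whatever speak call you use; the IPA characters stay as real Unicode characters in the str rather than being encoded to bytes and decoded again.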
Related
Introduction
I have a problem with a Python program written for Python 3.4.2. I should say up front that it's not my program.
When I connect to the server over SSH and compile it, it works just fine.
Server and PC specification
:
...and from my PC:
I have a different Python version, but I can't compile it on 3.4.2, because the typing module, which I need, is not available for that version. I don't know whether the GCC version could cause this problem, but I've tried different versions.
I downloaded it and tried to compile it myself. I run it in exactly the same way.
The real problem
Traceback (most recent call last):
File "gads.py", line 28, in <module>
lists = list_working.ListWorking(files_data)
File "/home/grzesiek/googleads/lib/list_working.py", line 43, in __init__
self._acc = self._split_str_list(list_data['accepted']['content'])
File "/home/grzesiek/googleads/lib/common.py", line 69, in _split_str_list
splited = re.split(separator, content)
File "/usr/local/lib/python3.5/re.py", line 203, in split
return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or bytes-like object
So far I know that ListWorking(files_data) is passed some files as dictionaries, and at the end, when the regex is applied, it throws an error. But I can't change these dictionaries to strings or lists, because then it compiles but erases the data that I provide to ListWorking().
Here is the fragment of code which I've tried to change:
def __init__(self, list_data: dict) -> None:
    self._acc = self._split_str_list(list_data['accepted']['content'])
    self._acc = self._del_dup(self._acc)
    self._ign = self._split_str_list(list_data['ignored']['content'])
    self._ign = self._del_dup(self._ign)
    self._pro = self._split_str_list(list_data['protected']['content'])
    self._pro = self._del_dup(self._pro)
    self._fign = self._split_str_list(list_data['full_ignored']['content'])
    self._fign = self._del_dup(self._fign)
    self._key = self._split_str_list(list_data['keywords']['content'])
    self._key = self._del_dup(self._key)
    self._unk = self._split_str_list(list_data['unknown']['content'])
    self._unk = self._del_dup(self._unk)
    self._sw = self._split_str_list(list_data['stopwords']['content'])
And where the last error occurs:
def _split_str_list(content: str, separator: str = '\n') -> list:
    """Split string to list"""
    splited = re.split(separator, content)
    splited = list(x.strip() for x in splited)
    splited = list(filter(None, splited))
    return splited
Also, on Python 3.4.2 it gets as far as import typing and throws an error, because there is no typing library in that version of Python.
So how is it possible that it works fine on the Linux server but not on my PC?
Well, the answer was much simpler than I thought it would be...
I just had to install the correct version of enca. The code's author didn't document what might be missing, so it was very hard to find: the whole project has about 5000 lines of code, and enca was used by only one function.
It had nothing to do with Linux or GCC.
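The traceback above only says that re.split received something that is not a string. A small sketch of how the splitting helper could fail earlier with a clearer message is below. It assumes, without the thread confirming it, that the 'content' value ended up as a non-string (for example None) when the encoding detection step did not work; the standalone function name split_str_list is a simplified stand-in for the method in the question.

```python
import re

def split_str_list(content, separator="\n"):
    """Split a string into stripped, non-empty items.

    Raises a descriptive TypeError instead of letting re.split fail
    with "expected string or bytes-like object".
    """
    if not isinstance(content, str):
        raise TypeError("expected str content, got " + type(content).__name__)
    return [item.strip() for item in re.split(separator, content) if item.strip()]

print(split_str_list(" a \n\n b "))

try:
    split_str_list(None)  # hypothetical bad input, e.g. a file that was never decoded
except TypeError as err:
    message = str(err)
    print(message)
```

A check like this would not have fixed the missing dependency, but it would have pointed at the data being loaded rather than at the regex.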
I'm trying to run the following code in Python 3.7. I keep getting an invalid syntax error and I'm not sure why. Can someone spot what I'm doing wrong? The indentation seems to be fine and my prints are in the correct brackets, I believe, but I'm totally lost on the "if" and "else" statements.
class pdfPositionHandling:
    def parse_obj(self, lt_objs):
        # loop over the object list
        for obj in lt_objs:
            if isinstance(obj, pdfminer.layout.LTTextLine):
                print ("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_'))
            # if it's a textbox, also recurse
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                self.parse_obj(obj._objs)
            # if it's a container, recurse
            elif isinstance(obj, pdfminer.layout.LTFigure):
                self.parse_obj(obj._objs)
    def parsepdf(self, filename, startpage, endpage):
        # Open a PDF file.
        fp = open(filename, 'rb')
        # Create a PDF parser object associated with the file object.
        parser = PDFParser(fp)
        # Create a PDF document object that stores the document structure.
        # Password for initialization as 2nd parameter
        document = PDFDocument(parser)
        # Check if the document allows text extraction. If not, abort.
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object.
        device = PDFDevice(rsrcmgr)
        # BEGIN LAYOUT ANALYSIS
        # Set parameters for analysis.
        laparams = LAParams()
        # Create a PDF page aggregator object.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        i = 0
        # loop over all pages in the document
        for page in PDFPage.create_pages(document):
            if i >= startpage and i <= endpage:
                # read the page into a layout object
                interpreter.process_page(page)
                layout = device.get_result()
                # extract text from this object
                self.parse_obj(layout._objs)
            i += 1
I get the following error:
File "C:/Users/951298/Documents/Python Scripts/PDF Scraping/untitled1.py", line 12
if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
^
SyntaxError: invalid syntax
Not sure why it's pointing at the colon at the end?
In line 9 you should have typed 3 parentheses at the end, but you only had 2 of them. Add another parenthesis and it will work fine.
You forgot to place the closing bracket on your print statement. This causes an error on the next line because the interpreter ignores newlines when reading the code inside brackets. In fact, the only reason it threw an error on line 12 is that if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): is not a valid argument to pass to print.
Therefore, the following code would throw an error on line 11.
bar = "a"
baz = "a"
def foo(msg, bar="\n"):
    print(msg, end=bar)
if bar == baz:
    foo("bar is equal to baz",
    bar = baz
else: #Throws error here
    foo("bar is not equal to baz")
#Not the best example, I know, sorry.
Odd, is it not? Be sure to take a look at the line(s) above the line that throws an error. They give you both context and, potentially, the erroneous code. You especially need to watch for these kinds of errors in programming languages that use newlines as statement terminators.
In line 9 you should have 3 closing parentheses, but I also happened to notice that you have two if statements and one elif statement but no else; they should all be if statements. Hope I helped!
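If you want to see the effect described in these answers without touching your own file, the built-in compile() can be used to check where the interpreter reports the error. This is only a small illustration with a made-up source string. On the Python 3.7 used in the question the reported line is expected to be the one after the unclosed parenthesis, while newer releases may point at the line with the unclosed parenthesis itself, so the sketch prints the line number rather than assuming it.

```python
# Made-up three-line program: line 1 is missing one closing parenthesis.
source = (
    'print("a", ("b")\n'
    'if True:\n'
    '    print("c")\n'
)

reported_line = None
try:
    compile(source, "<example>", "exec")
except SyntaxError as err:
    # err.lineno is the line the interpreter blames, which is not
    # necessarily the line where the mistake was made.
    reported_line = err.lineno
    print("SyntaxError reported at line", err.lineno, "-", err.msg)
```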
I've looked through a lot of replies regarding this error, but none were helpful for my particular case, and since I'm new to Python I have difficulty applying the hints to my problem.
I have a class in a file Aheat.py that reads
class Aheat():
    name = ""
    time = 0
    place = 0
    def __init__(self,name,time,place):
        self.name = name
        self.time = time
        self.place = place
And a file main.py where I want to read a html file, extract information, and create a list of objects of my class to work with them later on.
The (hopefully) essential part of my main.py reads
import urllib2
import re
from Aheat import Aheat
s = read something from url
ssplit = re.split('<p', s) # now every entry of ssplit contains an event
# and description and all the runners
HeatList = []
for part in ssplit:
    newHeat = Aheat("foo",1,1) # of course this is just an example
    HeatList.append(newHeat)
But this gives me the following error:
Traceback (most recent call last):
File "/home/username/Workspace/ECLIPSE/running/main.py", line 22, in <module>
newHeat = Aheat("foo",1,1)
TypeError: 'list' object is not callable
which is thrown when performing the second iteration.
If I take the creation of the object out of the loop, i.e.
newHeat = Aheat("foo",1,1)
for part in ssplit:
    HeatList.append(newHeat)
My code executes without a problem, but this is not what I want. I'm also not sure whether I can initialize a specific number of instances a priori, since the number of objects is determined in the loop.
I'm using Eclipse and Python 2.7.
Regex is going to bite you.
<p == <pre> || <progress> || <param> || <p> || (any user-created directives on a page)
Follow the links in your comments to read up on why we shouldn't parse HTML with regex.
Thanks, #MarkR (btw, I was only supplementing your comment and agreeing with you).
Why not put the list in your class or, better yet, extend list functionality with your class?
class AHeat(list):
    def append(self,name,time,place):
        return super(AHeat,self).append([name,time,place])
# main
heatList= AHeat()
heatList.append("foo",1,2)
heatList.append("bar",3,4)
print(heatList[0])
print(heatList[1])
> ['foo', 1, 2]
> ['bar', 3, 4]
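As for the TypeError itself: an error that appears only on the second iteration usually means that something in the loop body rebinds the class name to a list. The posted excerpt does not show such a line, so the assignment below is a hypothetical stand-in for whatever the elided code does. The sketch just reproduces the message under that assumption.

```python
class Aheat:
    def __init__(self, name, time, place):
        self.name = name
        self.time = time
        self.place = place

heat_list = []
error_message = None
for part in ["first", "second"]:
    try:
        new_heat = Aheat("foo", 1, 1)  # fails once Aheat is no longer the class
    except TypeError as err:
        error_message = str(err)
        break
    heat_list.append(new_heat)
    # Hypothetical mistake: reusing the class name for a list of results.
    Aheat = [part]

print(len(heat_list), error_message)
```

Searching the real main.py for any assignment to the name Aheat (including a for loop variable or an import with the same name) is a quick way to check whether this is what happens there.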
Also
Hello, I am having trouble with an XML file I am using. On a short XML file the program works fine, but for some reason once the file reaches a certain size (I am thinking 1 MB)
it gives me an "IndexError: list index out of range".
Here is the code I have written so far.
from xml.dom import minidom
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
def xml_data():
    f = open('C:\opidea_2.xml', 'r')
    data = f.read()
    f.close()
    dom = minidom.parseString(data)
    ic = (dom.getElementsByTagName('logentry'))
    dom = None
    content = ''
    for num in ic:
        name = num.getElementsByTagName('author')[0].firstChild.nodeValue
        if name:
            content += "***Changes by:" + str(name) + "*** " + '\n\n Date: '
        else:
            content += "***Changes are made Anonymously *** " + '\n\n Date: '
    print content
if __name__ == "__main__":
    xml_data ()
Here is part of the xml if it helps.
<log>
<logentry
revision="33185">
<author>glv</author>
<date>2012-08-06T21:01:52.494219Z</date>
<paths>
<path
kind="file"
action="M">/branches/Patch_4_2_0_Branch/text.xml</path>
<path
kind="dir"
action="M">/branches/Patch_4_2_0_Branch</path>
</paths>
<msg>PATCH_BRANCH:N/A
BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:N/A
Adding the SVN log size requirement to the branch
</msg>
</logentry>
</log>
The actual XML file is much bigger, but this is the general format. It actually works when the file is this small, but once it gets bigger I get problems.
Here is the traceback:
Traceback (most recent call last):
File "C:\python\src\SVN_Email_copy.py", line 141, in <module>
xml_data ()
File "C:\python\src\SVN_Email_copy.py", line 50, in xml_data
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
IndexError: list index out of range
Based on the code provided your error is going to be in this line:
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
#xml node-^
#function call -------------------------^
#list indexing ----------------------------^
#attribute access -------------------------------------^
That's the only place in the demonstrated code that you're indexing into a list. That would imply that in your larger XML Sample you're missing an <author> tag. You'll have to correct that, or add in some level of error handling / data validation.
Please see the code elaboration for more explanation. You're doing a ton of things in a single line by taking advantage of the return behaviors of successive commands. So, the num is defined, that's fine. Then you call a function (method). It returns a list. You attempt to retrieve from that list and it throws an exception, so you never make it to the Attribute Access to get to firstChild, which definitely means you get no nodeValue.
Error checking may look something like this:
authors = num.getElementsByTagName('author')
if len(authors) > 0:
    name = authors[0].firstChild.nodeValue
Though there are many, many ways you could achieve that.
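One of those ways, sketched as a self-contained Python 3 example: it handles both a logentry with no <author> tag at all and an <author> tag that is present but empty (in which case firstChild is None). The sample XML and the function name author_names are made up for illustration; they only mimic the general shape of the log in the question.

```python
from xml.dom import minidom

# Made-up sample: one normal entry, one without <author>, one with an empty <author>.
SAMPLE = """<log>
<logentry revision="1"><author>glv</author></logentry>
<logentry revision="2"></logentry>
<logentry revision="3"><author></author></logentry>
</log>"""

def author_names(xml_text):
    """Return one author name per logentry, or None where it is missing."""
    dom = minidom.parseString(xml_text)
    names = []
    for entry in dom.getElementsByTagName("logentry"):
        authors = entry.getElementsByTagName("author")
        # An empty NodeList is falsy; an empty element has no firstChild.
        if authors and authors[0].firstChild is not None:
            names.append(authors[0].firstChild.nodeValue)
        else:
            names.append(None)
    return names

print(author_names(SAMPLE))
```

With the None placeholders in place, the existing if name: / else: branch in the question can produce the "Changes are made Anonymously" text instead of crashing.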
I'm trying to create a simple module for phenny, a simple IRC bot framework in Python. The module is supposed to go to http://www.isup.me/websitetheuserrequested to check whether a website is up or down. I assumed I could use regex for the module, seeing as other built-in modules use it too, so I tried creating this simple script, although I don't think I did it right.
import re, urllib
import web
isupuri = 'http://www.isup.me/%s'
check = re.compile(r'(?ims)<span class="body">.*?</span>')
def isup(phenny, input):
    global isupuri
    global cleanup
    bytes = web.get(isupuri)
    quote = check.findall(bytes)
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
    phenny.say(result)
isup.commands = ['isup']
isup.priority = 'low'
isup.example = '.isup google.com'
It imports the required web packages (I think), and defines the string and the text to look for within the page. I really don't know what I did in those four lines, I kinda just ripped the code off another phenny module.
Here is an example of a quotes module that grabs a random quote from some webpage, I kinda tried to use that as a base: http://pastebin.com/vs5ypHZy
Does anyone know what I am doing wrong? If something needs clarifying, I can tell you; I don't think I explained this well enough.
Here is the error I get:
Traceback (most recent call last):
File "C:\phenny\bot.py", line 189, in call
try: func(phenny, input)
File "C:\phenny\modules\isup.py", line 18, in isup
result = re.sub(r'<[^>]*?>', '', str(quote[0]))
IndexError: list index out of range
try this (from http://docs.python.org/release/2.6.7/library/httplib.html#examples):
import httplib
conn = httplib.HTTPConnection("www.python.org")
conn.request("HEAD","/index.html")
res = conn.getresponse()
if res.status >= 200 and res.status < 300:
    print "up"
else:
    print "down"
You will also need to add code to follow redirects before checking the response status.
edit
Alternative that does not need to handle redirects but uses exceptions for logic:
import urllib2
request = urllib2.Request('http://google.com')
request.get_method = lambda : 'HEAD'
try:
    response = urllib2.urlopen(request)
    print "up"
    print response.code
except urllib2.URLError, e:
    # failure
    print "down"
    print e
You should do your own tests and choose the best one.
The error means your regexp wasn't found anywhere on the page (the list quote has no element 0).
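A minimal Python 3 sketch of guarding against that empty list is below. It reuses the pattern from the question but works on an HTML string passed in directly, so it makes no assumptions about phenny's web module or about what the isup.me page actually contains; the function name extract_status and the sample HTML are made up.

```python
import re

check = re.compile(r'(?ims)<span class="body">.*?</span>')

def extract_status(html):
    """Return the text of the first matching span, or None if nothing matches."""
    matches = check.findall(html)
    if not matches:
        # Nothing matched, so indexing matches[0] would raise IndexError.
        return None
    # Strip the tags from the matched fragment, as the question's code does.
    return re.sub(r'<[^>]*?>', '', matches[0])

print(extract_status('<p><span class="body">It is up</span></p>'))
print(extract_status('<html><body>no such span</body></html>'))
```

In the bot, a None result could be turned into a message such as "could not read the status page" instead of letting the exception reach the caller.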