My colleagues and I are working on code that produces SQL merge strings for users of a Python library we're building, to be run in the Azure Databricks environment. These functions provide the SQL string through a custom exception we've written called DebugMode. The issue we've encountered, and that I can't find a satisfactory answer to, is: why does the <=> sequence get removed when the DebugMode string is printed? This can be replicated with the simpler example below, where I've tossed various items into the Exception string to see what would get printed and what wouldn't.
raise Exception('this is a string with the dreaded spaceship <=> < > <= >= `<=>` - = + / <random> \<rand2>')
This snippet results in the following:
What I don't understand is why the <=> sequence is missing from the Exception printout at the top but present when you expand the Exception. Is there a way to get the first string to include <=>?
I've also included the custom DebugMode class we're using.
class DebugMode(Exception):
    '''
    Exception raised in the event of debug mode being enabled on any of the merge functions. It is intended to halt
    the merge and provide the SQL merge string for manual review.

    Attributes:
        sql_string (str): The SQL merge string produced by the merge function.

    Methods:
        None
    '''

    def __init__(self, sql_string, message='Debug mode was enabled, the SQL operation has halted to allow manual review of the SQL string below.'):
        self.sql_string = sql_string
        self.message = message
        super().__init__(self.message)  # overwrite the Exception base class's message

    def __str__(self):
        return f'{self.message}\n{self.sql_string}'
Just providing a follow-up for anyone who runs into the same thing. This was a Databricks-notebook-specific issue. As @user2357112 had indicated, the problem was that Databricks parses the output as HTML and has reserved the <some keyword> notation for specific purposes (you can see some of these keywords and how they're used here: https://docs.databricks.com/error-messages/index.html).
As @tdelaney noted, this isn't an issue in Jupyter notebooks or the Python shell.
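If you need the full string to survive the notebook's rendering, a minimal sketch of one possible workaround is below. It assumes (untested here) that HTML-escaping the angle brackets before raising keeps Databricks' HTML handling from swallowing them; escape_for_notebook is a hypothetical helper, not part of our library.

import html

def escape_for_notebook(sql_string):
    # Hypothetical helper: turn < and > into &lt; and &gt; so the notebook
    # front-end has no <keyword>-style tags to strip from the output.
    return html.escape(sql_string)

sql = 'MERGE INTO t USING s ON t.key <=> s.key ...'
raise DebugMode(escape_for_notebook(sql))

Depending on how the front-end renders the text, you may see the literal entities (&lt;=&gt;) instead of <=>, but at least the operator is no longer silently dropped.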
I had a peewee query (against a MySQL 8.0 server) working a few months ago, and now it gives me the following error:
peewee.OperationalError: (3995, "Character set 'utf8mb4_unicode_ci' cannot be used in conjunction with 'binary' in call to regexp_like.")
The line of code producing the error is:
words = (Word
         .select(Word.word, Word.points)
         .where(Word.word.regexp('^[aeiou]+$'))
         .order_by(fn.CHAR_LENGTH(Word.word).desc(), Word.word))
a) I'm 99% sure it was working a few weeks ago, b) I can't see anything I might have changed, and c) I'm pretty sure the resolution will be simple, but I can't put my finger on it.
versions are peewee==3.15.4 and Python==3.10.9
The .regexp() method translates into a REGEXP BINARY operation. Per the MySQL documentation:
Prior to MySQL 8.0.22, it was possible to use binary string arguments with these functions, but they yielded inconsistent results. In MySQL 8.0.22 and later, use of a binary string with any of the MySQL regular expression functions is rejected with ER_CHARACTER_SET_MISMATCH.
So you probably ought to switch that to .iregexp(), which does not use this construction:
Word.word.iregexp('^[aeiou]+$')
If you need case-sensitivity, use fn.REGEXP_LIKE, which supports a flag for setting case-sensitivity: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-like
fn.REGEXP_LIKE(Word.word, '^[aeiou]+$', 'c')
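Putting it together, here is a sketch of the original query with each alternative swapped in (same Word model and peewee imports as above):

from peewee import fn

# Case-insensitive: .iregexp() avoids the REGEXP BINARY construction
# that MySQL 8.0.22+ rejects.
words = (Word
         .select(Word.word, Word.points)
         .where(Word.word.iregexp('^[aeiou]+$'))
         .order_by(fn.CHAR_LENGTH(Word.word).desc(), Word.word))

# Case-sensitive: call REGEXP_LIKE directly with the 'c' flag.
words_cs = (Word
            .select(Word.word, Word.points)
            .where(fn.REGEXP_LIKE(Word.word, '^[aeiou]+$', 'c'))
            .order_by(fn.CHAR_LENGTH(Word.word).desc(), Word.word))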
I am new to the bioservices Python package. I am trying to use it to retrieve PMIDs for two citations, given the specified information, and this is the code I have tried:
from bioservices import EUtils
s = EUtils()
print(s.ECitMatch("pubmed",retmode="xml", bdata="proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|%0Dscience|1987|235|182|palmenberg+ac|Art2|"))
But it raises an error:
"TypeError: ECitMatch() got multiple values for argument 'bdata'".
Could anyone help me solve this problem?
I think the issue is that you have an unnamed argument ("pubmed"). If you look at the source code, you can see that the first argument should be bdata; if you provide the arguments the way you do, it is unclear whether bdata is "pubmed" or the named argument bdata, hence the error you obtain.
You can reproduce it with this minimal example:
def dummy(a, b):
    return a, b

dummy(10, a=3)
will return
TypeError: dummy() got multiple values for argument 'a'
If you remove "pubmed", the error disappears; however, the output is still incomplete:
from bioservices import EUtils
s = EUtils()
print(s.ECitMatch("proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|%0Dscience|1987|235|182|palmenberg+ac|Art2|"))
returns
'proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|2014248\n'
so only the first publication is taken into account. You can get the results for both by using the correct carriage return character \r:
print(s.ECitMatch(bdata="proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|\rscience|1987|235|182|palmenberg+ac|Art2|"))
will return
proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|2014248
science|1987|235|182|palmenberg+ac|Art2|3026048
I think you have to specify neither retmode nor the database (pubmed); if you look at the source code I linked above, you can see:
query = "ecitmatch.cgi?db=pubmed&retmode=xml"
so it seems it always uses pubmed and xml.
Two issues here: one syntactic, and one a bug.
The correct syntax is:
from bioservices import EUtils
s = EUtils()
query = "proc+natl+acad+sci+u+s+a|1991|88|3248|mann+bj|Art1|%0Dscience|1987|235|182|palmenberg+ac|Art2|"
print(s.ECitMatch(query))
Indeed, the underlying service behind ECitMatch has only one database (pubmed) and one format (xml); hence, those two parameters are not exposed: they are hard-coded. Therefore, only one argument is required: your query.
As for the second issue, as pointed out above and reported on the bioservices issues page, your query would return only one publication. This was an issue with the special character %0D (in place of a carriage return) not being interpreted correctly by the URL request. This carriage-return character (whether \n, \r or %0d) is now taken into account in the latest version on GitHub, or from PyPI if you use version 1.7.5.
Thanks to willigot for filing the issue on the bioservices page and bringing it to my attention.
Disclaimer: I'm the main author of bioservices.
I am extracting data from an Oracle 11g database using Python and writing it to an Excel file. During extraction, I'm using a Python list of tuples (each tuple represents a row in the dataset) and the openpyxl module to write the data into Excel. It works fine for some datasets, but for others it throws the exception:
openpyxl.utils.exceptions.IllegalCharacterError
This is the solution I've already tried:
Openpyxl.utils.exceptions.IllegalcharacterError
Here is my Code:
for i in range(0, len(list)):
    for j in range(0, len(header)):
        worksheet_ntn.cell(row=i + 2, column=j + 1).value = list[i][j]
Here is the error message:
raise IllegalCharacterError
openpyxl.utils.exceptions.IllegalCharacterError
I got this error because of some hex characters in some of my strings:
'Suport\x1f_01'
The encode/decode solutions mess with accented words too, so I resolved this with repr(), which gives a safe representation with quotation marks:
value = repr(value)
Then I remove the first and last characters (the added quotation marks):
value = repr(value)[1:-1]
Now you can safely insert value into your cell.
The exception tells you everything you need to know: you must replace the characters that cause the exception. This can be done using re.sub() but, seeing as only you can decide what you want to replace them with — spaces, empty strings, etc. — only you can do this.
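As a concrete sketch of that advice: recent openpyxl versions ship a compiled pattern, ILLEGAL_CHARACTERS_RE, matching exactly the control characters they reject, so you can substitute them away before assigning cell values. Replacing them with the empty string, and the name rows (standing in for the asker's list of tuples), are my choices here, not requirements:

from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

def clean(value):
    # Strip the control characters openpyxl refuses to write; non-strings
    # (numbers, dates, None) pass through untouched.
    if isinstance(value, str):
        return ILLEGAL_CHARACTERS_RE.sub('', value)
    return value

for i in range(0, len(rows)):
    for j in range(0, len(header)):
        worksheet_ntn.cell(row=i + 2, column=j + 1).value = clean(rows[i][j])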
I'm working on a GUI editor for a proprietary config format. Basically, the editor will parse the config file, display the object properties so that users can edit them from the GUI, and then write the objects back to the file.
I've got the parse - edit - write part done, except for:
The parsed data structure only includes object property information, so comments and whitespace are lost on write
If there is any syntax error, the rest of the file is skipped
How would you address these issues? What is the usual approach to this problem? I'm using Python and the Parsec module (https://pythonhosted.org/parsec/documentation.html), however, any help and general direction is appreciated.
I've also tried Pylens (https://pythonhosted.org/pylens/), which is really close to what I need, except it cannot skip syntax errors.
You asked about typical approaches to this problem. Here are two projects which tackle similar challenges to the one you describe:
sketch-n-sketch: "Direct manipulation" interface for vector images, where you can either edit the image-describing source language, or edit the image it represents directly and see those changes reflected in the source code. Check out the video presentation, it's super cool.
Boomerang: Using lenses to "focus" on the abstract meaning of some concrete syntax, alter that abstract model, and then reflect those changes in the original source.
Both projects have yielded several papers describing the approaches their authors took. As far as I can tell, the lens approach is popular, where parsing and printing become the get and put functions of a lens which takes some source code and focuses on the abstract concept that code describes.
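A toy illustration of that idea (the names are mine, not from either project): a lens pairs a parser (get) with a printer (put) over the same concrete syntax. Note that a real lens's put also receives the original source so it can preserve formatting; that is elided here.

from collections import namedtuple

# A lens pairs a parser (get) with a printer (put).
Lens = namedtuple('Lens', ['get', 'put'])

# Toy example: the concrete syntax is 'key = value', the abstract model a tuple.
kv_lens = Lens(
    get=lambda source: tuple(part.strip() for part in source.split('=', 1)),
    put=lambda model: '{} = {}'.format(*model),
)

key, value = kv_lens.get('timeout =  30')  # ('timeout', '30')
print(kv_lens.put((key, '60')))            # timeout = 60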
Eventually I ran out of research time and had to settle for rather manual skipping: each time the parser fails, we advance the cursor one character and retry. Anything skipped by this process, whether whitespace, comment, or syntax error, is dumped into a Text structure. The code is quite reusable, except that you have to incorporate it into every place where results repeat and the original parser may fail.
Here's the code, in case it helps anyone. It is written for Parsy.
from parsy import Parser, Result


class Text(object):
    '''Structure to contain all the parts that the parser does not understand.
    A better name would be Whitespace.
    '''
    def __init__(self, text=''):
        self.text = text

    def __repr__(self):
        return "Text(text='{}')".format(self.text)

    def __eq__(self, other):
        return self.text.strip() == getattr(other, 'text', '').strip()


def many_skip_error(parser, skip=lambda t, i: i + 1, until=None):
    '''Repeat the original `parser`, aggregating results into `values`
    and errors into `Text`.
    '''
    @Parser
    def _parser(stream, index):
        values, result = [], None
        while index < len(stream):
            result = parser(stream, index)
            # Original parser succeeded
            if result.status:
                values.append(result.value)
                index = result.index
            # Check for end condition, effectively `manyTill` in Parsec
            elif until is not None and until(stream, index).status:
                break
            # Aggregate skipped text into the last `Text` value, or create a new one
            else:
                if len(values) > 0 and isinstance(values[-1], Text):
                    values[-1].text += stream[index]
                else:
                    values.append(Text(stream[index]))
                index = skip(stream, index)
        return Result.success(index, values).aggregate(result)
    return _parser

# Example usage
skip_error_parser = many_skip_error(original_parser)
On another note, I guess the real issue here is that I'm using a parser combinator library instead of a proper two-stage parsing process. In traditional parsing, the tokenizer handles retaining/skipping whitespace, comments, and syntax errors, making them all effectively whitespace and invisible to the parser.
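For illustration, a minimal sketch of that two-stage idea (the token patterns and the tokenize function are hypothetical, not from any particular library): the tokenizer consumes everything, folding unrecognized characters into TEXT tokens, so the later parsing stage never sees them but a writer can still reproduce them.

import re

# Hypothetical token patterns; a real config format would have more.
TOKEN_RE = re.compile(
    r'(?P<NAME>[A-Za-z_]\w*)|(?P<EQUALS>=)|(?P<WS>\s+)|(?P<COMMENT>#[^\n]*)')

def tokenize(source):
    tokens, index = [], 0
    while index < len(source):
        match = TOKEN_RE.match(source, index)
        if match:
            tokens.append((match.lastgroup, match.group()))
            index = match.end()
        else:
            # Unrecognized character: keep it as TEXT so it survives a rewrite,
            # while the parser stage can treat it like whitespace.
            tokens.append(('TEXT', source[index]))
            index += 1
    return tokens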
I can get the results from a one_shot query, but I can't get the full content of the _raw field.
import splunklib.client as client
import splunklib.results as results

def splunk_oneshot(search_string, **CARGS):
    # Run a oneshot search and display the results using the results reader
    service = client.connect(**CARGS)
    oneshotsearch_results = service.jobs.oneshot(search_string)

    # Get the results and display them using the ResultsReader
    reader = results.ResultsReader(oneshotsearch_results)
    for item in reader:
        for key in item.keys():
            print(key, len(item[key]), item[key])
This gives me the following for _raw:
('_raw', 120, '2013-05-03 22:17:18,497+0000 [SWF Activity attNsgActivitiesTaskList1 19] INFO c.s.r.h.s.a.n.AttNsgSmsRequestAdapter - ')
So this content is truncated at 120 characters. I need the entire value of the search result because I need to run some string comparisons on it. I have not found any documentation on the ResultsReader fields or their size restrictions.
My best guess is that this is caused by the insertion of special tags into the event's raw data to highlight matched search terms in the Splunk UI front-end. In all likelihood, your search string specifies a literal term that appears in the raw data right at the point of truncation. This is not an appropriate default behavior for the SDK's result-fetching method, and there is currently a bug open to fix it (internal reference DVPL-1519).
Fortunately, avoiding this problem is fairly trivial: one simply needs to pass segmentation='none' as an argument to the results-fetching call, as shown for jobs.oneshot() here:
(...)
oneshotsearch_results = service.jobs.oneshot(search_string, segmentation='none')
(...)
Do note that the 'segmentation' argument for the service.jobs() method is only available on Splunk 5.0 and onwards.
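Putting the fix into the original function, a minimal sketch (same imports and CARGS as in the question, and untested here without a Splunk 5.0+ instance):

def splunk_oneshot(search_string, **CARGS):
    # segmentation='none' keeps Splunk from inserting the highlighting tags
    # that truncate _raw in the SDK results.
    service = client.connect(**CARGS)
    oneshotsearch_results = service.jobs.oneshot(search_string, segmentation='none')
    reader = results.ResultsReader(oneshotsearch_results)
    for item in reader:
        for key in item.keys():
            print(key, len(item[key]), item[key])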